30-Oct

As discussed in the previous post, I have applied Cohen’s d to the given data using Python.

The code snippet is as follows:
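A minimal sketch of what the calculation could look like, assuming the dataset is loaded into a pandas DataFrame with ‘age’ and ‘race’ columns and ‘B’/‘W’ race codes (the file name and column names are assumptions):

```python
import numpy as np
import pandas as pd

# Assumption: the dataset has 'age' and 'race' columns, with 'B'/'W' race codes.
df = pd.read_csv("fatal-police-shootings-data.csv")

black_ages = df.loc[df["race"] == "B", "age"].dropna()
white_ages = df.loc[df["race"] == "W", "age"].dropna()

# Pooled standard deviation of the two groups
n1, n2 = len(black_ages), len(white_ages)
s1, s2 = black_ages.std(ddof=1), white_ages.std(ddof=1)
pooled_std = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# Cohen's d: standardized difference between the group means
cohens_d = (white_ages.mean() - black_ages.mean()) / pooled_std
print("Cohen's d:", round(cohens_d, 2))
```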

The result which I got is 0.58. A Cohen’s d value of 0.58 tells us that there’s a noticeable age difference between black and white individuals in the dataset. Think of it like this: if I compare the average ages of these two groups, I’d find that they’re somewhat different, and this difference is meaningful to some extent.

Imagine you’re looking at a group of people, and you’re comparing the ages of black and white individuals. A Cohen’s d of 0.58 suggests that, on average, there’s a moderate difference in their ages. This difference isn’t huge, but it’s certainly there, and it’s worth paying attention to, especially if you’re doing research or making decisions where age plays a role.

27-Oct

I am going to explain Cohen’s d in this post.

Cohen’s d is a statistical measure that helps us understand and quantify the size of the difference between two groups or populations. It’s particularly useful when we want to compare the means (averages) of two groups, such as comparing the heights of men and women, test scores of two classes, or the effectiveness of two different treatments.

To put it in simpler terms, Cohen’s d provides a way to answer questions like, “Is the difference between these two groups significant, or is it just due to random variability?” In other words, it helps us determine if a difference we observe is meaningful or if it could have occurred by chance.

Cohen’s d is calculated by taking the difference between the means of the two groups and dividing it by a measure of the variability or spread of data within the groups. This measure of spread is often the standard deviation. The result is a number that represents the effect size, which tells us how “big” or “small” the difference is.
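In formula form, with the pooled standard deviation as the measure of spread:

\[
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}
\]

Here \(\bar{x}_1, \bar{x}_2\) are the two group means, \(s_1, s_2\) their standard deviations, and \(n_1, n_2\) the group sizes. By Cohen’s conventional benchmarks, values around 0.2 are read as small effects, around 0.5 as medium, and around 0.8 as large.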

A larger Cohen’s d indicates a more substantial difference between the groups, while a smaller Cohen’s d suggests a smaller difference. By looking at the magnitude of Cohen’s d, researchers can assess the practical significance of the observed difference, helping them make informed decisions or draw meaningful conclusions from their data.

I will apply this Cohen’s d method to the given project in my next post.

25-Oct

Today I am interested in finding the age distribution for people killed by police and, if possible, finding some differences between black and white people. I am trying to implement this in Python.

Code snippet is as follows:
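A minimal sketch of the calculation part, assuming the dataset has ‘age’ and ‘race’ columns with ‘B’ and ‘W’ race codes (the file name and column names are assumptions):

```python
import pandas as pd

# Assumption: the dataset has 'age' and 'race' columns, with 'B'/'W' race codes.
df = pd.read_csv("fatal-police-shootings-data.csv")

ages = df["age"].dropna()
black_ages = df.loc[df["race"] == "B", "age"].dropna()
white_ages = df.loc[df["race"] == "W", "age"].dropna()

def age_stats(series):
    # Basic descriptive statistics for an age series
    return {
        "Minimum": series.min(),
        "Maximum": series.max(),
        "Mean": series.mean(),
        "Median": series.median(),
        "Standard Deviation": series.std(),
    }

all_stats = age_stats(ages)
black_stats = age_stats(black_ages)
white_stats = age_stats(white_ages)
```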

The above code only has the calculation part. I have also included the print part to display the results, and the result is as follows:

Statistics for All Ages:
Minimum: 2.0
Maximum: 92.0
Mean: 37.20922789705294
Median: 35.0
Standard Deviation: 12.97948974722669

Statistics for Black Race Ages:
Minimum: 13.0
Maximum: 88.0
Mean: 32.92811594202899
Median: 31.0
Standard Deviation: 11.38864900700022

Statistics for White Race Ages:
Minimum: 6.0
Maximum: 91.0
Mean: 40.12546239210851
Median: 38.0
Standard Deviation: 13.162144214944696

For All Ages:

The mean age is slightly higher than the median, which suggests that the ages are a bit skewed to the right. This means that there are some older individuals in the dataset pulling the average age higher. The kurtosis value, which measures the shape of the age distribution, is close to 3. This indicates that the age distribution is relatively normal, with no significant peakiness or extreme values.

For Black People:

For black people, the mean age is a bit higher than the median age, indicating a slight right skew in the age distribution. It means that there are some older individuals among black people in the dataset. The kurtosis value is almost 4, which is somewhat higher than 3 (typical for a normal distribution). This suggests that there is some peakiness around the mean age or a fat tail, meaning there may be a group of black individuals with ages closer to the mean but a few with significantly higher ages.

For White People:

For white people, the mean age is also a bit higher than the median age, indicating a slight right skew in the age distribution, similar to the overall dataset. The skewness value is around 0.5, suggesting a milder skew compared to black people. The kurtosis is almost 3, indicating little peakiness around the mean and no extensive fat tail, meaning the age distribution for white people is closer to a normal distribution with fewer extreme values.
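The skewness and kurtosis values mentioned above are not produced by the basic statistics shown earlier; a sketch of how they could be computed with scipy, assuming the same age series as before and using Pearson’s kurtosis so that a normal distribution comes out close to 3:

```python
from scipy.stats import skew, kurtosis

# Assumption: ages, black_ages and white_ages are the same pandas Series as above.
for label, series in [("All", ages), ("Black", black_ages), ("White", white_ages)]:
    # fisher=False gives Pearson's kurtosis, where a normal distribution is ~3
    print(label, "skewness:", skew(series), "kurtosis:", kurtosis(series, fisher=False))
```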

I will talk about Cohen’s d in the next post.

23-Oct

As I said in my previous post, I am going to explain the pros and cons of DBSCAN in this post.

Pros of DBSCAN:

DBSCAN has several advantages that make it a valuable clustering algorithm. One of its key strengths is its ability to discover clusters of arbitrary shapes within your data. Unlike some other methods, it doesn’t assume that clusters are always spherical or have a specific geometry. This flexibility is highly valuable in real-world data analysis because clusters can often have irregular shapes.

Another important advantage of DBSCAN is its ability to automatically determine the number of clusters without requiring you to specify it in advance. This eliminates the need for subjective decisions about the number of clusters, making the algorithm more data-driven.

DBSCAN is also robust when it comes to handling noise and outliers in your data. It explicitly identifies and labels points that don’t belong to any cluster as “noise” or “outliers.” This feature is especially useful in applications where the presence of outliers can significantly impact the quality of clustering results.

Cons of DBSCAN:

While DBSCAN has many advantages, it also has some limitations. One of the challenges with DBSCAN is that it depends on two key parameters: ‘eps’ (the maximum distance between two points for them to count as neighbors) and ‘min_samples’ (the minimum number of neighboring points required for a point to be a core point). Selecting appropriate values for these parameters can be challenging, and the clustering results can be sensitive to their choices. If these parameters are poorly chosen, DBSCAN may produce suboptimal clustering results.
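As a small illustration of this sensitivity, a sketch on synthetic data (arbitrary parameter values) showing how the number of clusters and noise points changes with ‘eps’:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D data: two blobs plus a few scattered points (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.3, size=(100, 2)),
    rng.uniform(low=-2, high=7, size=(10, 2)),
])

# The same data can yield quite different clusterings depending on eps / min_samples
for eps in (0.2, 0.5, 2.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```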

DBSCAN may struggle when dealing with datasets of varying densities. If your data has clusters with significantly different densities, you might need to adjust the ‘eps’ and ‘min_samples’  parameters for each cluster, which can be cumbersome.

Additionally, DBSCAN may not perform well when clusters have very different sizes and densities. With a single global setting of ‘eps’ and ‘min_samples’, smaller or sparser clusters can be absorbed by larger ones or labeled as noise, which can result in uneven cluster sizes.

In summary, while DBSCAN offers many advantages, such as its ability to discover arbitrary-shaped clusters and handle noise, it also requires careful parameter tuning and may not work optimally with datasets of varying densities or cluster sizes.

20-Oct

As I have used the DBSCAN method, I delved deeper into the DBSCAN clustering method.

DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, is like a detective for finding groups of data points in a map or dataset. Imagine you have a map with many dots on it, and you want to know if there are any meaningful clusters of points. DBSCAN helps with this.

DBSCAN doesn’t just look at how far apart points are; it also considers how many points are close to each other. It’s like saying, “Let’s see if a bunch of dots are huddled together.” If they are, DBSCAN says, “Aha! That’s a cluster.”

In these clusters, there are some special points called “core points.” Core points are like the central hubs of a cluster, with lots of other points around them. They’re like the heart of the cluster.

DBSCAN also looks at how points are connected. If you can travel from one point to another by hopping through a chain of nearby points, DBSCAN says those points are part of the same cluster. It’s like connecting the dots on the map to find a cluster shape.

Not every point is in a cluster, though. Some points are lonely and don’t have many friends nearby. DBSCAN calls them “noisy” or “outliers.” These points don’t belong to any cluster; they’re just on their own.

One of the neat things about DBSCAN is that you don’t have to tell it how many clusters you expect to find. It figures that out by itself based on the data. So, it’s like having an automatic cluster-finding detective for your data analysis.
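To make the ideas of core points and noise concrete, here is a small sketch using scikit-learn’s DBSCAN on synthetic two-moon data; points labeled -1 are the noise points:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters: a case where DBSCAN's flexibility with shapes helps
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Points labelled -1 are the "noisy"/outlier points; the rest belong to a cluster
print("Cluster labels found:", set(db.labels_))
print("Number of core points:", len(db.core_sample_indices_))
print("Number of noise points:", int(np.sum(db.labels_ == -1)))
```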

I will discuss the pros and cons of DBSCAN in my next update.

18-Oct

In the previous code for clustering, not all data points are assigned to clusters. Some points are left out and marked as noise points because they do not fall under any cluster. In clustering algorithms, noise points (or outliers) are data points that do not belong to any cluster. These are points that the algorithm determines are too far away from, or too dissimilar to, any cluster to be assigned to one. So I tweaked the code in some parts.

I have post-processed the noise points by assigning them to the nearest cluster: I computed the distance of each noise point to the centroid of each cluster and assigned the noise point to the cluster with the nearest centroid.
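A sketch of this post-processing step, assuming coords holds the latitude/longitude pairs and labels is the array returned by DBSCAN (with -1 marking noise):

```python
import numpy as np

# Assumption: coords is an (n, 2) array of latitude/longitude pairs and
# labels is the array returned by DBSCAN, with -1 marking noise points.
def reassign_noise_to_nearest_cluster(coords, labels):
    labels = labels.copy()
    cluster_ids = [c for c in np.unique(labels) if c != -1]

    # Centroid (mean lat/long) of each cluster
    centroids = {c: coords[labels == c].mean(axis=0) for c in cluster_ids}

    for i in np.where(labels == -1)[0]:
        # Distance from this noise point to every cluster centroid
        dists = {c: np.linalg.norm(coords[i] - centroid) for c, centroid in centroids.items()}
        labels[i] = min(dists, key=dists.get)  # nearest centroid wins

    return labels
```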

To visualize the clusters on a map, I used the “folium” library in Python. When I run the code, it will show an interactive map with each cluster represented by a different color.
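A sketch of that folium step, under the same assumptions about coords and the post-processed labels (the color list here is arbitrary):

```python
import folium

# Assumption: coords is the (n, 2) array of lat/long pairs and labels the final cluster labels.
colors = ["red", "blue", "green", "purple", "orange", "darkred", "cadetblue"]

m = folium.Map(location=coords.mean(axis=0).tolist(), zoom_start=6)
for (lat, lon), label in zip(coords, labels):
    folium.CircleMarker(
        location=[lat, lon],
        radius=3,
        color=colors[int(label) % len(colors)],  # one color per cluster
        fill=True,
    ).add_to(m)

m.save("clusters_map.html")  # open in a browser to view the interactive map
```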

16-Oct

I have tried to implement DBSCAN clustering on the latitude/longitude pairs given in the dataset in Python. First, I imported the data and took only the latitude and longitude values as a set. I removed the missing values, as these hinder the DBSCAN clustering.

I have done the clustering for California, so I added a condition that only the rows containing CA under state should be included when performing the clustering.

I then performed the DBSCAN clustering with the data after cleaning it.

In the DBSCAN call, I have set it to take a minimum of 40 points to form a cluster, within a radius of 50. The result which I got is:
[248, 582, 42, 54]
Number of clusters: 4
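Putting the steps above together, a sketch of what this could look like (the file name, the column names, and the units of the 50-unit radius are assumptions):

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# Assumption: the CSV has 'state', 'latitude' and 'longitude' columns.
df = pd.read_csv("fatal-police-shootings-data.csv")

# Keep only California rows, take the lat/long pair, and drop missing coordinates
ca = df[df["state"] == "CA"]
coords = ca[["latitude", "longitude"]].dropna().to_numpy()

# Parameters as described above: at least 40 points per cluster, radius ("eps") of 50.
# The appropriate eps depends on the units/metric used for the coordinates.
db = DBSCAN(eps=50, min_samples=40).fit(coords)
labels = db.labels_

cluster_sizes = [int((labels == c).sum()) for c in sorted(set(labels)) if c != -1]
print(cluster_sizes)
print("Number of clusters:", len(cluster_sizes))
```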

13-Oct

I have delved deeper into the dataset and come across a new term: “CLUSTERING”. As it was new to me, I began to increase my knowledge about it.

Clustering is a branch of unsupervised machine learning that focuses on identifying patterns within data by grouping similar data points together. In contrast to supervised learning, where the model is provided with labeled data, clustering deals with unlabeled data and searches for naturally occurring groups in it. The purpose is to make it easier to recognize patterns in the dataset by ensuring that data points belonging to a particular group are more similar to each other than to those belonging to other groups.

Various algorithms can achieve clustering, each with its strengths and methodologies. K-Means, for instance, partitions data into distinct clusters by minimizing the variance within each cluster. Hierarchical clustering builds a tree of nested clusters, either by successively merging small clusters into larger ones or vice versa. DBSCAN, a density-based approach, creates clusters based on the density of data points, allowing for more irregularly shaped clusters.

According to the professor’s PDF, I can see that he has used DBSCAN clustering in Mathematica. I am looking into videos about clustering, its types, why it is necessary, and how it can be done on various platforms. I am trying to do the same in Python. I will update my findings about it in the next update.


11-Oct

          Upon researching the website containing the provided dataset, I discovered that, on average, police officers in the United States fatally shoot more than 1,000 individuals every year, as reported by The Washington Post’s ongoing analysis. In the aftermath of the 2014 killing of Michael Brown, an unarmed Black man, by police in Ferguson, Mo., The Washington Post conducted an investigation and found that the data on fatal police shootings reported to the FBI was significantly undercounted, missing more than half of the incidents. This gap has further widened in recent years, with only one-third of departments’ fatal shootings appearing in the FBI database by 2021. One of the primary reasons for this discrepancy is that local police departments are not obligated to report these incidents to the federal government. Additionally, an updated FBI reporting system and confusion among local law enforcement about reporting responsibilities exacerbate the problem.

          As part of its investigation, The Post began in 2015 to log every person shot and killed by an on-duty police officer in the United States. Since then, reporters have recorded thousands of deaths. In 2022, The Post updated its database to standardize and publish the names of the police agencies involved in each shooting to better measure accountability at the department level.

Information gathered from the dataset I obtained:

          Between the years 2015 and 2022, a total of 8002 individuals were killed by the police. Within this figure, there are 454 cases where the identity of the individuals remains unknown. Delving deeper into the nature of these incidents, a staggering 7664 were shot dead, with 460 of these unarmed at the time of their death. A smaller fraction, 338 individuals, met their demise through a combination of being shot and tasered.

          When observing the gender dynamics of this data, 31 individuals did not have their gender specified, while 358 of the deceased were female and the majority, 7613, were male. The racial and ethnic background of the victims is varied: 121 were of Asian descent, 1766 were Black, 1166 Hispanic, 105 Native American, 19 classified as Other, 3300 were White, and for 1517 individuals, their racial or ethnic identity remains unknown.

          A significant aspect to consider in these occurrences is the mental health of the individuals. Of the total, 6331 did not exhibit signs of mental illness. In contrast, 1671 did show signs of mental health challenges. The circumstances of these confrontations also offer insight: 1289 were trying to flee using a car, 1022 were attempting to escape on foot, while 4430 were not in the act of fleeing when the fatal incident occurred.
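Counts like these can be obtained with pandas value_counts; a sketch, where the column names (‘manner_of_death’, ‘gender’, ‘race’, ‘signs_of_mental_illness’, ‘flee’) are assumptions based on the published dataset:

```python
import pandas as pd

# Assumption: column names follow the published dataset.
df = pd.read_csv("fatal-police-shootings-data.csv")

print("Total records:", len(df))
# dropna=False keeps the "unknown" categories in the counts
print(df["manner_of_death"].value_counts(dropna=False))
print(df["gender"].value_counts(dropna=False))
print(df["race"].value_counts(dropna=False))
print(df["signs_of_mental_illness"].value_counts(dropna=False))
print(df["flee"].value_counts(dropna=False))
```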

9-Oct

After submitting my Project-1, I thought I would work on k-fold cross-validation on the dataset from Project-1. The k-fold cross-validation technique is a method utilized to evaluate the effectiveness of machine learning models. In this approach, the dataset is divided randomly into k equal-sized sub-datasets. Out of these sub-datasets, one is set aside as the validation set to test the model, while the remaining k-1 sub-datasets are employed as training data. This process is repeated k times, with each sub-dataset being used as the validation set exactly once. The results from each fold can then be averaged to generate a single estimate of the model’s performance.
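A sketch of how this can be run with scikit-learn, using synthetic placeholder data in place of the Project-1 features and target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data: in practice X and y would come from the Project-1 dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=2.0, size=200)

model = LinearRegression()

for k in (5, 8, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    # Each fold serves once as the validation set; the mean R-squared summarizes performance
    print(f"k={k}: mean R-squared = {scores.mean():.3f}")
```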


We can see that the R-squared value from the k-fold validation is around 26.5 (presumably a percentage, i.e., about 0.265), while the value that we got in our project is 0.52, which is higher than the R-squared we got from the k-fold validation. I tested k=10 to get the 26.5; I also tested k=5 and k=8, and the R-squared values I got were 27.16 and 23.6 respectively.