2 Oct

So far I have learned about simple linear regression, multiple linear regression, p-values, t-tests, cross-validation, and k-fold testing. Now I am learning how to use these techniques in the project.

The data provided consists of 2018 entries for diabetes, obesity, and inactivity for each state in the country, with a unique FIPS number identifying each entry. There are 3142 entries for diabetes, 363 for obesity, and 1370 for inactivity, of which 354 are common to all three sets. We copied all the common entries into a single spreadsheet.

To use simple linear regression, we need one dependent and one independent variable. This gives us two cases: keeping diabetes as the dependent variable with obesity as the independent variable in one case, and with inactivity as the independent variable in the other. For multiple linear regression, we keep diabetes as the dependent variable and use both inactivity and obesity as independent variables.
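As a rough sketch of the first case (diabetes on obesity), the closed-form least-squares estimates can be computed directly. The numbers below are made up for illustration, not values from our spreadsheet; the multiple-regression case would add inactivity as a second predictor.

```python
# Made-up illustrative values, NOT the real project data.
obesity = [20.0, 25.0, 30.0, 35.0, 40.0]   # independent variable x
diabetes = [5.0, 7.0, 9.0, 11.0, 13.0]     # dependent variable y

n = len(obesity)
x_mean = sum(obesity) / n
y_mean = sum(diabetes) / n

# Least-squares estimates for the line y = m*x + c
m = sum((x - x_mean) * (y - y_mean) for x, y in zip(obesity, diabetes)) \
    / sum((x - x_mean) ** 2 for x in obesity)
c = y_mean - m * x_mean
print(m, c)  # → 0.4 -3.0
```

With real data the residuals would not vanish, but the same two formulas give the best-fit slope and intercept.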

In the next update, I will add the plots and the models for simple linear regression and multiple linear regression.

27-Sept

In our morning session, we studied how to evaluate a model that looks at factors related to obesity, inactivity, and diabetes. We used 5-fold cross-validation on the obesity, inactivity and diabetes data.

We had a dataset with 354 observations of these factors. To test our model, we divided this dataset into five roughly equal parts: four parts with 71 data points each and the remaining one with 70. We used four of these parts to train our model and the remaining part to see how well it works. We did this five times, each time using a different part as the test set.
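The fold sizes we used follow directly from dividing 354 by 5; a couple of lines reproduce them:

```python
n = 354   # rows in our combined spreadsheet
k = 5     # number of folds

base, extra = divmod(n, k)   # 70 points per fold, with 4 folds getting one extra
fold_sizes = [base + 1 if i < extra else base for i in range(k)]
print(fold_sizes)  # → [71, 71, 71, 71, 70]
```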

We also looked at how well our model fits the entire dataset. To do this, we trained the model on all the data and calculated a measure of how well it predicts the real values.

From our results, we noticed that as we made our model more complex, it did a better job of fitting the data.

When we test our model, it’s important to divide the data into these five groups carefully, especially when there might be duplicate information. We did this carefully to make sure our test is fair. We found that sometimes our model didn’t work as well on the test data, especially towards the end of the testing process.

So, in simple terms, we studied a model to understand factors related to health. We tested it on different parts of our data to see how well it works and found that it works better when we make it more complex. We also made sure our testing was fair, considering possible duplicate information in the data.

Sept 25

I watched the video on resampling, covering training data, testing data, cross-validation, and the validation-set approach. I also learned about bootstrapping.

Cross-validation, the validation-set approach, and bootstrapping are all resampling methods. Resampling is a process that allows us to repeatedly draw samples from an existing set of observations and create new data sets from them. According to statistician Jim Frost, “Bootstrapping is a statistical procedure that resamples a single data set to create many simulated samples. This process allows for the calculation of standard errors, confidence intervals, and hypothesis testing”. In the validation-set approach, we divide the data into a training set and a testing set: we estimate the error rate by holding out a subset of the known data as the testing set and calling the remainder the training set. Once a model is built on the training set, we apply it to the testing set to calculate the error rate.
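A small sketch of the bootstrap idea, with a made-up sample rather than our project data: we resample the same data with replacement many times and use the spread of the resampled means as a standard error.

```python
import random

random.seed(0)                      # fixed seed so the illustration is reproducible
data = [4.1, 5.3, 6.2, 5.8, 4.9, 6.7, 5.1, 5.5]   # made-up sample

n_boot = 1000
boot_means = []
for _ in range(n_boot):
    resample = random.choices(data, k=len(data))   # draw with replacement
    boot_means.append(sum(resample) / len(resample))

# Standard deviation of the bootstrap means = bootstrap standard error of the mean
grand = sum(boot_means) / n_boot
se = (sum((m - grand) ** 2 for m in boot_means) / (n_boot - 1)) ** 0.5
print(round(se, 3))
```

The same resampling loop could collect confidence intervals by taking percentiles of `boot_means` instead.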

Cross-Validation

The term “train-test split” first came to my attention while studying the concept of cross-validation and its various forms. Let’s look at the train-test split first.

Train-Test Split:

  1. A train-test split separates a dataset into a training set and a testing set.
  2. A statistical model is constructed and trained using the training set, and its performance is assessed using the testing set.
  3. To make sure the model learns patterns effectively, the training set typically makes up a larger portion of the data, frequently between 70 and 80 percent.
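A rough sketch of such a split on a 354-row dataset like ours, using an 80/20 ratio (the seed and the ratio are arbitrary choices for illustration):

```python
import random

random.seed(1)                       # arbitrary seed for a reproducible shuffle
indices = list(range(354))           # one index per row, as in our 354-row sheet
random.shuffle(indices)              # shuffle before splitting so the split is random

split = int(0.8 * len(indices))      # 80% train, 20% test
train_idx, test_idx = indices[:split], indices[split:]
print(len(train_idx), len(test_idx))  # → 283 71
```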

Cross-Validation:

Cross-validation is a statistical technique used to assess the performance of a predictive model. It involves dividing the dataset into multiple subsets, called folds, and systematically rotating them between the training and testing roles.

  •  Leave-One-Out Cross-Validation (LOOCV):

With LOOCV, each data point is treated as its own fold. For ‘n’ observations we do ‘n’ training and testing iterations: each time, the model is trained using ‘n-1’ data points, with the one omitted data point serving as the testing set.

LOOCV also has drawbacks. It can be computationally expensive, particularly for sizable datasets: training and testing a model N times, where N is the number of data points, takes a lot of time and resources. LOOCV may also produce overly optimistic estimates of a model’s performance when the dataset is small. With few data points, the training and test sets are frequently highly correlated, which can exaggerate performance measures.
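A minimal sketch of the LOOCV loop on a tiny made-up sample, with a deliberately trivial “model” that just predicts the mean of the training points:

```python
data = [2.0, 4.0, 6.0, 8.0, 10.0]   # tiny made-up sample

errors = []
for i in range(len(data)):
    test_point = data[i]
    train = data[:i] + data[i + 1:]          # leave one observation out
    prediction = sum(train) / len(train)     # "model": predict the training mean
    errors.append((test_point - prediction) ** 2)

# Average the n squared errors into a single LOOCV score
loocv_mse = sum(errors) / len(errors)
print(loocv_mse)  # → 12.5
```

A real regression model would replace the training-mean line, but the leave-one-out loop is the same.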

 

  • K-Fold Cross-Validation (KFCV):

K-fold CV involves splitting the dataset into ‘K’ equal-sized folds, where ‘K’ is a user-defined number. The model is trained ‘K’ times: in each round, ‘K-1’ folds serve as the training set and the remaining fold serves as the test set. We then average the K results to produce a single score, which lowers the variance.

K-fold CV also has certain drawbacks. With imbalanced datasets (where one class is considerably more common than others), randomly partitioning the data into folds can leave some folds with insufficient representation of the minority classes, which can bias the evaluation. The performance metrics obtained in K-fold CV can also change depending on how the data happens to be divided into folds. For example, with a binary data set, a test fold might contain only 1s or only 0s, which distorts the accuracy result.
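A minimal sketch of the K-fold procedure itself, on a toy ten-point sample with the same trivial “predict the training mean” model (the deterministic interleaved partition is just for illustration; in practice the folds would be shuffled):

```python
data = list(range(1, 11))                # made-up numeric sample, 10 points
k = 5
folds = [data[i::k] for i in range(k)]   # simple deterministic partition into 5 folds

mse_per_fold = []
for i in range(k):
    test = folds[i]                                      # one fold held out
    train = [x for j, f in enumerate(folds) if j != i for x in f]
    pred = sum(train) / len(train)                       # "model": training mean
    mse_per_fold.append(sum((t - pred) ** 2 for t in test) / len(test))

cv_mse = sum(mse_per_fold) / k           # average the K scores into one result
print(cv_mse)  # → 9.375
```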

 

  • Stratified Cross-Validation:

In Stratified Cross-Validation, we divide the dataset into ‘K’ equally sized folds, just like in traditional K-fold CV. However, it ensures that each fold has a similar class distribution to the overall dataset. This is crucial for maintaining the representation of minority classes, preventing bias in model evaluation. Stratified CV is particularly useful when dealing with imbalanced datasets, as it provides a more accurate estimate of a model’s performance by preserving the relative proportions of classes in each fold.
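One simple way to build stratified folds (a hand-rolled sketch, not a library routine) is to group the indices by class and deal each class out round-robin, so every fold keeps roughly the overall class ratio:

```python
# Made-up imbalanced binary labels: 8 zeros, 4 ones (a 2:1 ratio)
labels = [0] * 8 + [1] * 4
k = 2

# Group indices by class, then deal each class round-robin across folds
folds = [[] for _ in range(k)]
for cls in set(labels):
    members = [i for i, y in enumerate(labels) if y == cls]
    for pos, idx in enumerate(members):
        folds[pos % k].append(idx)

for f in folds:
    counts = [sum(labels[i] == c for i in f) for c in (0, 1)]
    print(counts)  # → [4, 2] in each fold: the 2:1 class ratio is preserved
```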


  • Time-Series Cross-Validation:

This technique is used to check whether models that forecast the future are accurate. Imagine you’re trying to predict tomorrow’s weather or future stock prices. To make sure your prediction works well, you do a few things.

First, you take all the data you have, like past weather or stock prices, and split it into two parts: one for learning and one for testing. It’s like studying from one set of books and then taking a test from another set of books to see if you really learned. Now, to see if your prediction can work over time, you don’t use all the data at once. Instead, you use a small piece of it, like a window that moves forward one step at a time. You use what you’ve learned from the past to predict the next step in the future, and you keep doing this as the window slides ahead. This way, you can check if your predictions are good not just for one time but for many times in the future.
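The sliding-window idea above can be sketched as expanding-window splits on a made-up time-ordered series: each split trains on everything up to some point and tests on the very next step.

```python
series = list(range(10))             # made-up time-ordered observations

# Expanding-window splits: always train on the past, test on the next step
splits = []
min_train = 5                        # need some history before the first test
for t in range(min_train, len(series)):
    train, test = series[:t], [series[t]]
    splits.append((train, test))

for train, test in splits:
    assert max(train) < test[0]      # the training window never peeks at the future
print(len(splits))  # → 5
```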


Crab Molt Model

Today a problem was discussed in class about crab molts.

The data consists of pairs of pre-molt and post-molt measurements (pre-molt is the size of the crab’s shell before molting, and post-molt is the size after molting). A linear model is used to predict pre-molt size from post-molt size, and its r-squared value is around 0.98. We also have a descriptive analysis containing the median, mean, standard deviation, variance, skewness, and kurtosis of the pre-molt and post-molt data, along with a smooth histogram and quantile plot of each.

When we compare the histograms of pre-molt and post-molt, we can see that their shapes are similar, with a difference of 14.6858 in the means. This raises the question of whether the disparity in means is statistically significant. To answer this, we use a standard statistical technique: the t-test.

 

“ What is a t-test?

As we came across a new term, let us take a look at the basics of it. A t-test is used to determine if there is a statistically significant difference between the means of two groups. It is commonly used when comparing the means of two samples to assess whether any observed differences are likely due to a real effect or simply the result of random variation. There are three types of t-tests. They are:

  •  Independent samples t-test: This test is used when comparing the means of two independent groups or samples.
  • Paired samples t-test: This test is used when comparing the means of two related or paired groups at different times, such as before and after.
  • One-sample t-test: This test is used when comparing the mean of a single sample to a known mean. ”

 

Let’s revisit the crab molt problem. I think we can apply a paired samples t-test, since we are comparing the means of the same crabs’ shells at two different times, before and after molting. When we apply the t-test, the estimated p-value is about 0.03, which is roughly the probability of getting 5 heads in a row from successive tosses of a fair coin (0.5⁵ ≈ 0.031). According to a widely accepted standard in statistical analysis across various fields, this is considered statistically significant because p is less than 0.05.

Therefore, we can reject the null hypothesis that there is no real difference between the pre-molt and post-molt means.
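The paired t statistic itself is just the mean of the per-crab differences over its standard error. A sketch on hypothetical paired measurements (NOT the real crab data):

```python
import math

# Hypothetical before/after sizes for the same five crabs (made-up numbers)
pre  = [112.0, 118.5, 121.0, 109.5, 115.0]
post = [127.0, 133.0, 136.5, 124.0, 130.5]

diffs = [b - a for a, b in zip(pre, post)]         # per-crab change in size
n = len(diffs)
mean_d = sum(diffs) / n
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))

# Paired t statistic: mean difference over its standard error
t = mean_d / (sd_d / math.sqrt(n))
print(round(t, 3))
```

The p-value then comes from the t distribution with n-1 degrees of freedom, which in practice a statistics library would look up.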

 

Quadratic model

I have explored the quadratic model and also learned about overfitting.

Quadratic Model:

A quadratic model is a type of regression model that describes the relationship between a dependent variable and one or more independent variables using a quadratic equation. The general form of a quadratic model is:

y = Ax² + Bx + C + ε

In this equation:

  • y represents the dependent variable.
  • x represents the independent variable.
  • A, B and C are coefficients that need to be estimated from the data.
  • x² represents the squared term of the independent variable.
  • ε represents the error term, which accounts for the variability in the dependent variable that cannot be explained by the model.

A quadratic model allows for a curved relationship between the independent and dependent variables. When the coefficient ‘A’ is positive, it indicates an upward-facing curve, while a negative ‘A’ corresponds to a downward-facing curve.

Quadratic models are used when a linear model does not adequately capture the underlying relationship in the data. They are commonly employed in various fields, including physics, economics, engineering, and social sciences, to model complex relationships between variables.
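As a toy illustration of the three coefficients, three points determine a quadratic exactly. The points below are made up, sampled from y = 2x² − 3x + 1, and the coefficients are recovered with Newton’s divided differences:

```python
# Three made-up points sampled from y = 2x^2 - 3x + 1
pts = [(0.0, 1.0), (1.0, 0.0), (2.0, 3.0)]
(x0, y0), (x1, y1), (x2, y2) = pts

# Newton's divided differences, then expand into A, B, C of y = A*x^2 + B*x + C
d1 = (y1 - y0) / (x1 - x0)
d2 = ((y2 - y1) / (x2 - x1) - d1) / (x2 - x0)
A = d2
B = d1 - d2 * (x0 + x1)
C = y0 - d1 * x0 + d2 * x0 * x1
print(A, B, C)  # → 2.0 -3.0 1.0
```

With noisy data we would instead fit A, B, C by least squares, exactly as in linear regression but with x² added as an extra column.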

Overfitting:

Overfitting is a phenomenon where a model learns the training data too well, capturing noise or random fluctuations in the data rather than the true underlying patterns. There is a relationship between overfitting and quadratic models, and it is important to understand this relationship:

Quadratic models are relatively flexible because they can capture nonlinear relationships between variables, including curved patterns. This flexibility can be both an advantage and a disadvantage. Quadratic models, due to their flexibility, have a higher capacity to fit complex data patterns. While this is beneficial when the data truly follows a quadratic relationship, it also makes them prone to overfitting when applied to data that doesn’t exhibit a quadratic pattern. When a quadratic model is applied to data that is essentially linear or follows a simpler relationship, the model may attempt to fit a quadratic curve to noise or random variations in the data. This leads to overfitting, as the model captures patterns that do not generalize well to unseen data.

To prevent overfitting in quadratic models, it’s essential to use appropriate regularization techniques, such as ridge regression or Lasso regression, which constrain the model’s coefficients and reduce its tendency to fit noise. Additionally, using cross-validation can help assess whether a quadratic model is overfitting by evaluating its performance on validation data.

In summary, while quadratic models can capture complex relationships, they must be applied judiciously and with caution to avoid overfitting, particularly when simpler linear models might provide a more accurate representation of the underlying data. Regularization and model evaluation techniques play a crucial role in managing overfitting in quadratic models.

What is p-value?

I have explored the p-value, which the professor explained, and while delving into this topic I also learned about the null hypothesis.

NULL HYPOTHESIS:

The null hypothesis, often abbreviated as H0, is a fundamental concept used in hypothesis testing. It represents a statement or assumption that there is no significant difference, effect, or relationship between variables or groups in a population. The null hypothesis is typically formulated as the default or initial hypothesis to be tested against.

The null hypothesis serves as a baseline or point of reference for statistical inference. It is essential for drawing conclusions from data and making decisions based on evidence. However, it’s important to recognize that failing to reject the null hypothesis does not prove that it is true; it simply means that the data you collected did not provide sufficient evidence to suggest otherwise. Statistical hypothesis testing is a way to quantify and formalize this decision-making process in statistics.

p-Value:

In statistics, the p-value, short for probability value, is a measure that helps assess the evidence against a null hypothesis. It quantifies the likelihood of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample data if the null hypothesis were true. In simpler terms, it tells you whether the results you’ve obtained from your sample data are statistically significant or simply due to random chance.

Here’s how I am thinking the concept of p-value works:

1. Formulating a null hypothesis (H0): This is a statement that there is no effect, no difference, or no association in the population we are studying. It’s often the hypothesis you want to test against.

2. Collecting and analyzing data: You collect data from your sample and perform the necessary statistical analysis to calculate a test statistic. The choice of test statistic depends on the type of analysis you’re conducting.

3. Calculating the p-value: The p-value is calculated based on the test statistic and the probability distribution associated with the test. The p-value represents the probability of observing a test statistic as extreme as, or more extreme than, the one you calculated, assuming the null hypothesis is true.

4. Comparing the p-value to a significance level (α): The significance level α is predetermined before conducting the test and represents the threshold for statistical significance. Common choices for α include 0.05 (5%) and 0.01 (1%).

– If p-value ≤ α: You reject the null hypothesis. This suggests that the observed results are statistically significant, and there’s evidence to support your alternative hypothesis.
– If p-value > α: You fail to reject the null hypothesis. This means that the observed results are not statistically significant, and you do not have sufficient evidence to support your alternative hypothesis.
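The decision rule can be sketched in a few lines, using the chance of 5 heads in 5 fair-coin tosses as a toy p-value:

```python
# Toy p-value: probability of 5 heads in 5 tosses of a fair coin
p_value = 0.5 ** 5        # = 0.03125
alpha = 0.05              # conventional significance level

# The decision rule: compare the p-value to the significance level
if p_value <= alpha:
    decision = "reject H0"
else:
    decision = "fail to reject H0"
print(p_value, decision)  # → 0.03125 reject H0
```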

It’s important to note that a small p-value (typically ≤ 0.05) does not prove that the null hypothesis is false; it only suggests that your data provide evidence against it. The p-value does not provide information about the effect size or the practical significance of the results.

Interpreting p-values requires caution, and they should be considered in the context of the specific research question, study design, and the potential for other sources of bias or error in the data collection process. Additionally, p-values are just one component of statistical hypothesis testing, and it’s essential to consider effect sizes, confidence intervals, and other relevant statistical measures in the interpretation of your results.

9/11 – Simple Linear Regression -1

LINEAR REGRESSION:

  • Linear regression is used to predict one variable from other variables.
  • The variable we want to predict is called the dependent variable, and the variables used to predict it are called independent variables.
  • Linear regression assumes a linear relationship between the dependent and independent variables and finds the best-fitting line that describes this relationship.
  • If linear regression is used to predict one dependent variable using one independent variable, it is called SIMPLE LINEAR REGRESSION.
  • If linear regression is used to predict one dependent variable using two or more independent variables, it is called MULTIPLE LINEAR REGRESSION.

SIMPLE LINEAR REGRESSION:

  • As simple linear regression needs only one dependent and one independent variable to find the best-fit line, we can define the model by the formula y = m*x + c, where y is the dependent variable, x is the independent variable, m is the slope, and c is the intercept (the value of y at x = 0).
  • To find the best values of m and c for the best-fit line, we need to minimize the error between the predicted and actual values. For this we use the Residual Sum of Squares (RSS), where residuals are the differences between the observed values of the dependent variable and the predicted values, m*x + c.
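A sketch of the RSS idea on made-up points that lie exactly on y = 2x + 1: the true line leaves zero residual, while any other line scores worse.

```python
# Made-up points lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def rss(m, c):
    """Residual sum of squares for the candidate line y_hat = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

print(rss(2.0, 1.0))   # → 0.0 : the true line leaves no residual
print(rss(1.5, 1.0))   # → 7.5 : a worse line has a larger RSS
```

Minimizing this quantity over m and c is exactly what the least-squares formulas do.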

THINGS I OBSERVED FROM DIABETES DATA SET:

  • The given data is about the percentage of obesity, inactivity and diabetes in the states of USA in the year 2018.
  • There are 3142 entries in Diabetes, 363 entries in obesity and 1370 entries in inactivity.
  • If we take only obesity to build a model to predict diabetes, then I think we should use simple linear regression.
  • If we take both obesity and inactivity to build a model to predict diabetes, then I think we should use multiple linear regression.