A correlation measures the relationship between two variables — for example, is a person’s IQ related to their income?
For a Pearson correlation, we need two variables. Typically, both variables should be continuous, roughly normally distributed, and unbounded, like height or age. If a variable is categorical, like profession, or if many scores pile up at a boundary, like a lot of 0s or 100s on a test, a Pearson correlation is not appropriate.
The test statistic for a Pearson correlation is r, which ranges from -1 to +1. The r score tells you two things about the relationship between the two variables: its strength and its direction. The larger the absolute value of the r score, the stronger the relationship, and an r of 0 means there is no linear relationship at all. If the value is positive, then the two variables are directly related: as one goes up, the other goes up. If the value is negative, then they are inversely related: as one goes up, the other goes down, and vice versa.
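To make the r score concrete, here is a minimal sketch that computes it from its textbook definition (the co-variation of the two variables divided by the product of their spreads). The function name `pearson_r` and the tiny datasets are made up for illustration; they are not part of MagicStat.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]  # deviations from the mean of x
    dy = [v - mean_y for v in y]  # deviations from the mean of y
    covariation = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return covariation / spread  # fails if either variable is constant

# Made-up illustration data.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # goes up as hours go up
falling = [10, 8, 6, 4, 2]   # goes down as hours go up

print(pearson_r(hours, rising))   # direct relationship: positive r
print(pearson_r(hours, falling))  # inverse relationship: negative r
```

Because both made-up relationships are perfectly linear, the two r scores sit at the extremes of the range. Real data will almost always land somewhere in between.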
It’s important to remember that although a Pearson correlation can identify a relationship between two variables, it cannot (by itself) determine whether there is a causal relationship, let alone which variable is causing the other. Some relationships are clearly the product of a third variable. For example, ice cream sales are positively correlated with drownings. Now, does buying ice cream cause people to drown? Of course not. In reality, a third variable (temperature) is responsible for the relationship between ice cream and drowning: as it gets hotter, people are more likely to eat ice cream and more likely to go swimming.
The r score is also associated with a p value, which tests for statistical significance. The p value is the probability of obtaining a correlation at least as strong as the one in our dataset by chance, if the null hypothesis (no relationship) were true. So, the lower the p value, the stronger the evidence against the null hypothesis. Typically, our alpha level, the threshold for statistical significance, is set at .05. That is, if our p value is below .05, then we reject the null hypothesis.
The p value for a Pearson correlation is governed by two things: the strength of the relationship, and the degrees of freedom. The stronger the relationship (either negative or positive), the lower the p value. The degrees of freedom for a Pearson correlation are N minus 2, so for a given r score, the larger your sample size, the more degrees of freedom, and the lower your p value.
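Here is a rough sketch of how strength and degrees of freedom combine into a p value. It uses the standard conversion of r to a t statistic, t = r * sqrt(df / (1 - r^2)), and then finds the two-sided tail area of Student's t distribution by numerical integration so that only the Python standard library is needed. The function names are hypothetical, and this is not necessarily how MagicStat computes its p values.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def two_sided_p(r, n, steps=2000):
    """Two-sided p value for a Pearson r computed from n pairs (|r| < 1)."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule (steps must be even) for the area between 0 and |t|.
    h = t / steps
    area = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    # Whatever lies outside [-|t|, |t|] is the two-sided p value.
    return max(0.0, 1 - 2 * area)

# Same strength, growing sample size: the p value falls.
for n in (20, 50, 200):
    print(f"r = 0.30, N = {n:>3}: p = {two_sided_p(0.30, n):.4f}")

# Same sample size, growing strength: the p value falls as well.
for r in (0.1, 0.3, 0.5):
    print(f"r = {r:.2f}, N =  50: p = {two_sided_p(r, 50):.4f}")
```

Running the two loops side by side shows both levers at work: a modest r that is not significant in a small sample can become significant in a larger one.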
So now that we know what a correlation is, let’s look at an example. Let’s say that we want to know whether a person’s IQ is related to their income. We have the following dataset.
Our hypothesis is that smarter people are more skilled and in higher demand, and therefore make more money. However, the relationship between IQ and income isn’t perfect, is it? There’s a lot more that goes into a person’s income than just their IQ: what field they work in, how much experience they have, even where they live. So, it won’t be a perfect relationship between IQ and income, and it probably won’t even be a particularly strong relationship. So, we’ll hypothesize a moderate, positive relationship between IQ and income. In general, we want to have hypotheses that are backed by theory. That way we can avoid “fishing expeditions” which throw variables together randomly. Performing a test without a hypothesis grounded in theory increases the likelihood that any relationship you might find is just due to chance. In the last column, we also have the foot size of each individual. Obviously, we would not hypothesize any relationship between foot size and either IQ or income.
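To see why fishing expeditions are risky, consider how quickly the number of tested pairs grows. The sketch below counts the pairs among a set of variables and estimates how many would come out "significant" at the .05 level even if none of the variables were related. It treats the tests as independent, which is a simplifying assumption, and the function name is made up for illustration.

```python
from math import comb

ALPHA = 0.05  # the significance threshold used in this article

def fishing_risk(n_variables, alpha=ALPHA):
    """Return (pairs, expected false positives, chance of at least one)."""
    pairs = comb(n_variables, 2)             # every distinct pair gets tested
    expected = alpha * pairs                 # false positives we expect on average
    at_least_one = 1 - (1 - alpha) ** pairs  # assumes independent tests
    return pairs, expected, at_least_one

for k in (3, 10, 20):
    pairs, expected, at_least_one = fishing_risk(k)
    print(f"{k:>2} variables: {pairs:>3} pairs, "
          f"~{expected:.1f} false positives expected, "
          f"P(at least one) = {at_least_one:.2f}")
```

With only three variables, as in our dataset, the risk is modest. Throw twenty unrelated variables together and a spurious "finding" becomes almost guaranteed.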
So now that we have our hypothesis, let’s see how to perform a correlation on MagicStat (version 1.1.3).
1-) Select a data file
Select your own dataset by clicking the “Choose a data file” button. If you would like to use a sample data file, click “Sample datasets” on the toolbar, save it to your hard drive, then click “Choose a data file” and navigate to where you saved it.
2-) Explore the dataset
After you select your dataset, click the “Explore” button. On the right side of the window is information-at-a-glance about your dataset, including variable information, bar graphs, and histograms.
4-) Choose the “Pearson Correlation” model
Click “Select a model to analyze your data”, and select “Pearson Correlation” on the dropdown.
5-) Choose variables
Click the “Select variables” button, and pick which variables you want to include in the model. Here, we’re selecting “IQ”, “Income” and “Foot_Size”.
6-) Analyze the dataset
Finally, click the “Analyze” button.
Now it is time to interpret the results we obtained in the previous steps.
In a Pearson correlation, the degrees of freedom are purely a function of sample size: N minus 2. Here the degrees of freedom are 52, which corresponds to a sample of 54 people.
Next is our correlation table. We have a moderate correlation between IQ and Income at .41, as we hypothesized, and no correlation between Foot_Size and IQ or Income.
Below that is the p value for each relationship. The moderate correlation between IQ and Income has a p value of 0.002, which means that if there were no relationship between IQ and Income, we’d expect to see a correlation at least this strong about two times out of a thousand: not very likely! The p values for Foot_Size-IQ and Foot_Size-Income are close to 1, which means the data give us no evidence of a relationship between them.
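As a rough check on these numbers, the r score can be converted into a t statistic with the standard formula below, using the reported r of .41 and the 52 degrees of freedom. This is a worked illustration, not necessarily the exact computation MagicStat performs.

```latex
t = r\sqrt{\frac{N-2}{1-r^{2}}}
  = 0.41\sqrt{\frac{52}{1-0.41^{2}}}
  \approx 3.24
```

A t of about 3.24 on 52 degrees of freedom corresponds to a two-tailed p value of roughly 0.002, or about two in a thousand.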
After the correlation table, MagicStat gives us some graphs. First is a correlation heatmap, to show where the strongest relationships are. We can see the moderate relationship between IQ and Income in purple, and the lack of relationship with Foot_Size in blue.
Then, we can select a scatterplot to visualize the relationship and check for outliers. If we look at the IQ-Income scatterplot, there do not seem to be any obvious outliers. With this scatterplot and a theoretical link between IQ and Income, we can feel confident in the relationship we found in our dataset.
Written by the MagicStat Team