Cholesterol is a very important substance in our body: it is needed for digesting food, producing hormones, and generating vitamin D. This blog post analyzes a cholesterol dataset retrieved from MASH at The University of Sheffield. The dataset contains a mixture of between-subjects (type of margarine) and within-subjects (length of intervention) factors. This leaves us room to make many comparisons, but we will begin with the most straightforward one: whether participation in these interventions leads to a change in cholesterol.

Begin your analysis by going to the MagicStat website (version 1.1.3), uploading the dataset `Cholesterol.csv` and pressing `Explore`.

Paired-Samples t-Tests

Although two different types of margarine were used, we will begin by asking a simple question: did cholesterol differ between the beginning of the experiment and the end of the 8-week program? The appropriate analysis for this is the paired-samples t-test.

1. After loading your data click `select a model to analyze your data` and choose `Paired Samples t-test`.

2. The next step is to choose variables for groups 1 and 2.

• `select a variable for group 1` and pick `Before`
• `select a variable for group 2` and pick `After8Weeks`

In this comparison, we are ignoring which margarine the participant was assigned to use and simply asking whether 8 weeks of using margarine in our experiment leads to a difference in cholesterol score at the end. This doesn’t tell us everything we want to know, but it gives us an idea of whether our intervention is affecting the outcome we are measuring.

3. Press `Analyze` and you should get the following table of statistical results.

4. Summary of Stats: This table contains the statistics for the two groups of interest: `Group 1` (`Before`) and `Group 2` (`After8Weeks`). Here we are given the respective group means, `6.41` and `5.78`, as well as the standard deviations (`sd`) and standard errors of the mean (`sem`). These values are informative in describing our data, but alone they cannot tell us whether the observed difference between the groups is statistically significant. For this question we move on to the next table.

5. Group 1 and Group 2 Stats: The first three numbers of this table, `df`, `t`, and `p`, are relevant to our question of statistical significance. `p` is the probability of observing a group difference at least as large as the one we found if the null hypothesis (no true difference) were correct. In other words, how likely is it that the `Before` and `After8Weeks` groups would show this observed `mean diff` by chance alone?
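To make the quantities in these two tables less abstract, here is a minimal Python sketch of how the summary statistics and the `df`, `t`, and `p` values of a paired-samples t-test can be computed by hand. The readings below are hypothetical numbers invented for illustration, not the real `Cholesterol.csv` data, and the helper names (`describe`, `two_sided_p`, `paired_t_test`) are assumptions of this sketch. It is not a description of how MagicStat computes its results internally.

```python
import math
import statistics

# Hypothetical cholesterol readings for six participants (NOT the real dataset).
before = [6.4, 6.8, 5.9, 7.1, 6.2, 6.0]
after_8_weeks = [5.8, 6.1, 5.6, 6.3, 5.9, 5.5]


def describe(values):
    """Mean, sample standard deviation (sd), and standard error of the mean (sem)."""
    sd = statistics.stdev(values)
    return {"mean": statistics.mean(values), "sd": sd, "sem": sd / math.sqrt(len(values))}


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=10_000):
    """Two-sided p-value: integrate the t density from 0 to |t| with Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1 - 2 * area)


def paired_t_test(x, y):
    """Paired-samples t-test computed on the per-participant differences x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sem_diff = statistics.stdev(diffs) / math.sqrt(n)
    t = mean_diff / sem_diff
    df = n - 1
    return {"mean_diff": mean_diff, "df": df, "t": t, "p": two_sided_p(t, df)}


print(describe(before))
print(describe(after_8_weeks))
result = paired_t_test(before, after_8_weeks)
print(result)
```

The key design point is that a paired test works on the differences within each participant rather than on the two group means separately, which is why `df` is the number of participants minus one.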

As a general rule of thumb, `p` values below 0.05 are said to indicate statistically significant group differences. While the word “significant” carries certain connotations in everyday English, this is not the same as statistical significance. Statistical significance can only tell you that a result is unlikely to be caused by chance. It does not mean a result is big, impactful, or useful as an intervention; it only means the difference is likely not a random one. To ask about the relevance of a difference we need a different measure.

Cohen’s d

Statistical reliability/significance is an important tool to keep us from chasing after results or interventions which only appear to have an effect but are actually due to chance. Knowing our effect is non-random is a good start but in many practical use-cases it is even more important to have a tool that lets us measure the size of the effect we’re having.

The most straightforward idea would be to compare the raw difference between the two group means. Unfortunately, this kind of raw measure changes with something as simple as the unit of measurement. Imagine two groups of experimenters observing the exact same set of subjects, but one measures them in inches and the other in centimeters. Although the differences they observe between the control and experimental groups are identical, they would report different numbers for their effect size. This is a silly example because you can convert between inches and centimeters easily, but imagine a more complex situation.

Imagine a business is trying to decide which of two training programs to hold for their employees.

• Program A: decreases mean employee stress level from 65 to 60
• Program B: increases mean employee satisfaction from 65 to 70

Both are a change of 5 points, but there is no clear way to relate stress and job satisfaction. What do you do?

To make a more informed decision, we can look at each program’s effect size using the Cohen’s d statistic. Cohen’s d is so useful because it scales the raw mean difference relative to how much the underlying data already varies. Instead of focusing on the 5-point difference, we ask how much variation is naturally in the data (`sd`, the standard deviation) and express the raw difference relative to that natural variation.

More concretely, if the standard deviation of job stress is 15 and the standard deviation of job satisfaction is 5, we can compute Cohen’s d for both programs.

```
change = 5                  # both programs change mean scores by 5 points

stress = change / 15        # Cohen's d of 0.33
satisfaction = change / 5   # Cohen's d of 1.0
```

These effect size statistics tell a very different tale than the raw group differences. The stress reduction program has a small-to-moderate effect size, `0.33`, compared to the job satisfaction program’s large effect size of `1.0`. If you were torn between the two programs, these effect size numbers would be good reason to prefer the job satisfaction program: we can expect more people to be helped, and to a greater degree, than with the stress reduction program.
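The same scaling also resolves the inches-versus-centimeters problem from earlier. The sketch below uses hypothetical height measurements invented for illustration and a hypothetical `cohens_d` helper that divides the mean difference by the pooled standard deviation of two independent groups (one common definition; other variants exist). It shows that the raw difference changes with the unit while Cohen’s d does not.

```python
import statistics


def cohens_d(group_a, group_b):
    """Mean difference divided by the pooled standard deviation (independent groups)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical measurements in inches (invented for this example).
control_in = [64.0, 66.0, 68.0, 70.0, 72.0]
treated_in = [66.0, 68.0, 70.0, 72.0, 74.0]

# The same subjects measured in centimeters.
control_cm = [x * 2.54 for x in control_in]
treated_cm = [x * 2.54 for x in treated_in]

# The raw mean difference depends on the unit of measurement.
raw_diff_in = statistics.mean(treated_in) - statistics.mean(control_in)
raw_diff_cm = statistics.mean(treated_cm) - statistics.mean(control_cm)

# Cohen's d is the same in both units because the sd is rescaled too.
d_in = cohens_d(treated_in, control_in)
d_cm = cohens_d(treated_cm, control_cm)

print(raw_diff_in, raw_diff_cm)
print(round(d_in, 3), round(d_cm, 3))
```

Converting units multiplies both the mean difference and the standard deviation by the same factor, so the factor cancels in the ratio.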

Of course, no statistic alone can blindly guide decision making because there is always a question of values outside of the realm of statistics. Maybe stress in the office has been the cause of a lot of recent troubles or has been continually mentioned by employees as in particular need of improvement. Maybe job satisfaction is already so high that you think there would be diminishing returns from improving it further.

Statistics are powerful tools, but we must remain thinking and skeptical agents. We empower ourselves when we understand the meaning of our results, and we enslave ourselves when we forget to heed their limits.

Returning to our t-test results, our comparison of the `Before` and `After8Weeks` groups produces a Cohen’s d of `0.55`. This value is considered a moderate effect size. If you picked a score at random from each of the two groups, you’d expect the `After8Weeks` cholesterol level to be lower about 65% of the time.
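The 65% figure can be reproduced with a short sketch. It assumes both groups are normally distributed with equal variance and that the two scores are drawn independently, in which case the probability that one draw beats the other is the standard normal CDF evaluated at d divided by the square root of 2. The function name `probability_of_superiority` is a label chosen for this example, and the post does not state which method MagicStat itself uses.

```python
import math
from statistics import NormalDist


def probability_of_superiority(d):
    """P(a random score from one group beats a random score from the other).

    Assumes two normal distributions with equal variance whose means
    differ by d standard deviations, and independent draws.
    """
    return NormalDist().cdf(d / math.sqrt(2))


prob = probability_of_superiority(0.55)
print(f"{prob:.2f}")  # prints 0.65
```

A d of 0 gives 0.5 (a coin flip), and larger values of d push the probability toward 1.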

As general guidance, Cohen’s conventional benchmarks treat a d of about 0.2 as a small effect, 0.5 as a medium effect, and 0.8 as a large effect.
