## Analyzing the Titanic Dataset with MagicStat – Part 1

The Titanic has become one of the most famous ships in history. Tragically, the “unsinkable” ship took about three years to build but sank less than five days after its departure. Its passengers ranged from some of the wealthiest people in the world to the middle class to emigrants.

Have you ever wondered whether there is a significant relationship between class and survival or gender and survival? Which other factors, such as passenger age, number of siblings or spouses on board, number of parents or children on board, and ticket cost, best predict or explain whether someone survived the Titanic? In this blog, we will try to answer these questions.

This blog is a step-by-step guide to analyzing the Titanic dataset with two different types of analyses in MagicStat: the Chi-Square Test for Independence and Logistic Regression. The Chi-Square Test for Independence is used to evaluate the relationship between two categorical variables. Logistic regression is used to determine the degree to which a set of independent variables predicts a categorical outcome. As part of the Logistic Regression instructions, we will also show you how to run a Pearson Correlation in MagicStat as an assumption check for multicollinearity.

Within the blog, you will find instructions for uploading your data to MagicStat and exploring it. Then you will see a rationale and examples for the two analyses discussed in the blog. Next, you will see step by step instructions for conducting each analysis. The blog also provides step by step instructions for interpreting the results of each analysis. Finally, you will see APA formatted examples demonstrating how to write up the results from each of the analyses.

**Chi-Square Test of Independence**

**1. Uploading and exploring your data.**

Begin the analysis by uploading the Titanic.sav file and pressing explore. The .sav extension indicates that this is an SPSS data file.

**2. Exploring your data**

Once the data is uploaded and you click explore, you should notice on the right side of the screen several data elements. First, you should notice that there are 1309 observations within the data set (See screen shot below).

You should also notice that there are 11 categorical variables, 4 numeric variables, and a table summarizing the numeric variables within the dataset (See screen shots below).

**3. Selecting the right model – Examining relationships between categorical variables.**

Many times, researchers and students are interested in the relationship between two categorical variables. Examples of categorical variables would be home ownership and education. For the most part, people either own or rent their home. As such, there would be two categories for the variable of home ownership – own and rent. Additionally, almost everyone has some level of education (e.g., GED, High School, Some College, College Degree). In this example, the researcher may be interested in determining if there is a relationship between home ownership and education. In other words, is a person more likely to rent or own depending on their level of education?

There are many types of analyses that can be used to determine relationships between variables (e.g., correlation and regression). However, when a researcher is interested in examining the relationship between two categorical variables, the best type of analysis for this situation would be the Chi-Square Test for Independence.

There are two types of Chi-Square analyses. One type is referred to as the Chi-Square for Goodness of Fit. Researchers use the Chi-Square for Goodness of Fit to compare the proportion within a sample (e.g., data you collected) with the proportion within a population (e.g., data that is known). For example, is the proportion of college students who smoke at your university similar to the proportion of college students who smoke in the USA? In this case, we would use the Chi-Square for Goodness of Fit to determine if the proportion at your university differs from the proportion in the USA. If, instead, you are interested in examining the relationship between two categorical variables from the same data set, then the Chi-Square Test for Independence is the analysis to use.

**Model summary – Chi-Square Test for Independence**

- What you need – You will need two categorical variables from the same data set.
- Example research question – is there a relationship between gender and political affiliation?
- Assumptions – Lowest expected frequency in any cell should be 5 or more.
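
For reference, the expected frequency named in the assumption and the Chi-Square statistic itself are usually defined as shown below. This is the standard textbook form of Pearson's Chi-Square statistic, not something taken from MagicStat's documentation, and the symbols are notation introduced here for illustration.

```latex
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
```

Here `O_ij` is the observed count in row `i` and column `j` of the crosstabulation, `E_ij` is the matching expected count, `R_i` and `C_j` are the row and column totals, and `N` is the total number of observations.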

**4. Running the Chi-Square Test for Independence.**

The Titanic Dataset offers several categorical variables that are well suited for running a Chi-Square Test for Independence. The first question we will consider will be: *is there a relationship between class and survival?* To get started, we will need to select a model to analyze our data. Click the “select a model to analyze your data” button and choose the Chi-Square Test for Independence (see screen shot below).

Once you have selected the Chi-Square test for Independence, you should see some changes on your screen. There should be an option to select categorical variables for your analysis. See the example below.

In order to run a Chi-Square Test for Independence, we will need to select the two categorical variables. In this case, the two variables are class (`pclass`) and survival (`survived`).

- To enter these variables into the analysis, click – Select a categorical variable – and then select pclass.
- Repeat this step by clicking – Select a categorical variable – and then select survived.
- In the event you do not see these variables, hold Shift and refresh your browser to clear your cache, then click – Select a categorical variable – again to select the appropriate variable.
- The screen shot provided below demonstrates what the screen should look like once the correct variables are selected. If you are seeing the same thing on your screen go ahead and click analyze.

**5. Your results of Chi-Square Test for Independence**

You should see the following output on your screen after clicking the “Analyze” button.

- The first output is a crosstabulation of the two categorical variables of class and survival. By scanning left to right, you can see that `123` people from 1st class died and `200` from 1st class survived.
- You will also see the output for the Chi-Square results of `pclass` and `survived`.
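
If you are curious how a crosstabulation like this one is put together, the sketch below tallies a handful of records by hand. It is only an illustration: the records, the `"yes"`/`"no"` labels, and the `crosstab` helper are hypothetical and are not taken from Titanic.sav or from MagicStat.

```python
from collections import Counter

# Hypothetical passenger records as (pclass, survived) pairs, invented for
# illustration only; they are NOT rows from Titanic.sav.
records = [
    (1, "yes"), (1, "no"), (1, "yes"),
    (2, "no"), (2, "yes"),
    (3, "no"), (3, "no"), (3, "yes"),
]

def crosstab(pairs):
    """Count how many records fall into each (row, column) combination."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

table = crosstab(records)
for pclass, row in table.items():
    print(pclass, row)
```

Each printed row corresponds to one class, and each entry in it is the count for one outcome, which is the same layout you scan from left to right in the MagicStat output.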

**6. Interpreting your crosstabulation**

- The first step for interpreting your results is evaluating the crosstabulation for class and survival. We want to ensure we did not violate any of our assumptions so that we can interpret the statistical results of the analysis with confidence. Scan through the table below and ensure that none of the cell counts are less than `5`. For example, the count for the number of people in first class that died equals `123`. There are not any cells with a count smaller than `5`. This is good news; we did not violate any of our assumptions for this analysis and can interpret the results with confidence. Additionally, did you notice any patterns in the data? An examination of the crosstabulation indicated that the number of deaths increased by class – 1st class (*n* = 123), 2nd class (*n* = 158), and 3rd class (*n* = 528). However, this pattern alone is not enough to conclude that the differences are statistically significant. We now need to interpret the Chi-Square results.

**7. Interpreting the Chi-Square results of pclass and survived**

- The second step for interpreting your results is evaluating the Chi-Square results of `pclass` and `survived`. There are several data points within these results. The Chi-Square stat is the actual statistic that you will report in writing up the analysis. The df is the degrees of freedom within the analysis. In a Chi-Square Test for Independence, the df reflects how many cell counts in the table are free to vary once the row and column totals are fixed. The formula for df is `df = (r-1)(c-1)`, with r being the number of rows and c being the number of columns (here, three classes and two outcomes give a df of 2). There is one data point within the table below that is relevant to our question of statistical significance: the *p* value (`0.000`). The *p* value is the probability of seeing differences in survival between classes at least as large as the ones in our data if class and survival were actually unrelated. In other words, how likely is it that the differences in survival rates between classes are due to chance? In order to be considered significant, the *p* value needs to be less than `0.05` – in this case our *p* value is displayed as `0.000` (that is, smaller than `0.001`), which is less than `0.05`.

- As a result, we can conclude that there are significant differences in the survival rates between groups based on class amongst passengers on the Titanic since our *p* value is less than `0.05`.

- As a general rule of thumb, *p* values of `< 0.05` are said to indicate statistically significant group differences. While the word “significant” carries certain connotations in everyday English, statistical significance can only tell you that a result is unlikely to be caused by chance. It means only that a difference is likely not random; it does not tell you how large or important the difference is.
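
To make the df formula and the *p* value more concrete, here is a minimal sketch in Python. It is not how MagicStat computes its output; it simply uses two well-known closed forms of the Chi-Square distribution, and the function names are our own. The statistic `127.86` is the Chi-Square stat reported for `pclass` and `survived` in this blog.

```python
import math

def degrees_of_freedom(n_rows, n_cols):
    """df = (r - 1)(c - 1) for a crosstabulation with r rows and c columns."""
    return (n_rows - 1) * (n_cols - 1)

def chi_square_p_value(stat, df):
    """Upper-tail probability of a Chi-Square statistic.

    Only df = 1 and df = 2 are handled, because those two cases have simple
    closed forms that need nothing beyond the standard library.
    """
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("this sketch only supports df = 1 or df = 2")

# Three classes by two outcomes (died, survived).
df = degrees_of_freedom(3, 2)
p = chi_square_p_value(127.86, df)  # statistic reported in this blog
print(df, p < 0.001)
```

The resulting *p* value is far smaller than `0.001`, which is why a results table that shows three decimal places displays it as `0.000`.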

**8. Writing up your results**

In the event you need to write up the results of a Chi-Square Test for Independence we provided an example to guide your efforts.

- There are several data points you will need for the write up.
- These data points can be found in the Chi-Square Results of pclass and survived.
- You will need the values associated with the Chi-Square stat, the degrees of freedom (df), and the *p* value.

The write up – We are interested in exploring the relationship between class and survival. As such, the main research question was: *is there a relationship between class and survival?*

The results of the analysis indicated that the number of deaths varied between the classes considered within the analysis – 1st class (*n* = 123), 2nd class (*n* = 158), and 3rd class (*n* = 528). Additionally, the results of the Chi-Square Test for Independence indicated that there are significant differences in death rates by class: χ²(2) = 127.86, *p* < 0.001. These results mean that the differences in death rates by class are unlikely to be due to chance.

In the above write up, the symbol χ² stands for Chi-Square. The (2) represents the degrees of freedom within the analysis. The value of `127.86` is the actual Chi-Square statistic. The *p* < 0.001 indicates that our *p* value is less than `0.001`.
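
If you write up many analyses, a small helper can keep the reporting format consistent. The sketch below is one possible way to do it; `format_chi_square` is a hypothetical helper of our own, and it follows the notation used in this blog rather than any official style tool.

```python
def format_chi_square(stat, df, p):
    """Build a results string in the style used in this write up."""
    # Very small p values are reported as a bound rather than an exact number.
    p_text = "p < 0.001" if p < 0.001 else f"p = {p:.3f}"
    return f"χ²({df}) = {stat:.2f}, {p_text}"

# The p value is entered as it is displayed in the results table (0.000).
print(format_chi_square(127.86, 2, 0.000))
```

The helper reports a bound whenever the *p* value is below `0.001`, which matches how a displayed value of `0.000` is written up above.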

Let’s run another Chi-Square Test for Independence to explore the relationship between gender and survival. Follow along with the instructions and examples provided below.

**1. Running another analysis with Chi-Square Test for Independence**

The next question we will consider will be: *is there a relationship between gender and survival?* To get started, we will need to select a model to analyze our data. Click the button and select Chi-Square Test for Independence (see screen shot below).

Once you have selected the Chi-Square test for Independence, you should see some changes on your screen. There should be an option to select categorical variables for your analysis. See the example below.

In order to run the Chi-Square Test for Independence, we need to select the two categorical variables. In this case, the two variables are gender (`gender`) and survival (`survived`). To enter these variables, click – Select a categorical variable – and select `gender` first and `survived` second. The screen shot provided below demonstrates what the screen should look like once the correct variables are selected. If you are seeing the same thing on your screen, go ahead and click analyze.

**2. Results – Chi-Square Test for Independence**

You should see the following output on your screen after clicking the “Analyze” button.

- The first output is a crosstabulation of the two categorical variables of gender and survival. By scanning left to right, you can see that `127` women died and `339` women survived.
- You will also see the output for the Chi-Square results of `gender` and `survived`.

**3. Interpreting your crosstabulation**

- The first step for interpreting your results is evaluating the crosstabulation for gender and survival. We want to ensure we did not violate any of our assumptions so that we can interpret the statistical results of the analysis with confidence. Scan through the table below and ensure that none of the cell counts are less than `5`. For example, the count for the number of Females that died equals `127`. There are not any cells with a count smaller than `5`. This is good news; we did not violate any of our assumptions for this analysis and can interpret the results with confidence. Additionally, did you notice any patterns in the data? An examination of the crosstabulation indicated that the number of deaths is higher among Males (*n* = 682) compared to Females (*n* = 127). However, this difference alone is not enough to conclude that it is statistically significant. We now need to interpret the Chi-Square results.
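
Strictly speaking, the assumption is about expected frequencies rather than the observed counts in the table, so it can be worth checking those as well. The sketch below does this for the gender crosstabulation. One cell is an assumption: the number of Males who survived is not quoted in this blog, so it is derived from the 1309 total observations, assuming no passenger is missing a gender or survival value. The `expected_counts` helper is our own, not part of MagicStat.

```python
# Observed counts from the gender crosstabulation in this blog.
# The male-survived count is not quoted in the text; it is derived here as
# 1309 total observations minus the three quoted cells (an assumption).
TOTAL = 1309
female_died, female_survived, male_died = 127, 339, 682
male_survived = TOTAL - female_died - female_survived - male_died

observed = [
    [female_died, female_survived],  # Females
    [male_died, male_survived],      # Males
]

def expected_counts(table):
    """Expected count for each cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
smallest = min(min(row) for row in expected)
print(male_survived, round(smallest, 1), smallest >= 5)
```

Under that assumption, the smallest expected frequency is about 178, far above the minimum of 5, so the conclusion that no assumption was violated still holds.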

**4. Interpreting the Chi-Square results of gender and survived**

- The second step for interpreting your results is evaluating the Chi-Square results of `gender` and `survived`. There are several data points within these results. The Chi-Square stat is the actual statistic that you will report in writing up the analysis. The df is the degrees of freedom within the analysis. In a Chi-Square Test for Independence, the df reflects how many cell counts in the table are free to vary once the row and column totals are fixed. The formula for df is `df = (r-1)(c-1)`, with r being the number of rows and c being the number of columns (here, two genders and two outcomes give a df of 1). There is one data point within the table below that is relevant to our question of statistical significance: the *p* value (`0.000`). The *p* value is the probability of seeing differences in survival between the genders at least as large as the ones in our data if gender and survival were actually unrelated. In other words, how likely is it that the differences in survival rates between the genders are due to chance? In order to be considered significant, the *p* value needs to be less than `0.05` – in this case our *p* value is displayed as `0.000` (that is, smaller than `0.001`), which is less than `0.05`.

- As a result, we can conclude that there are significant differences in the survival rates between groups based on gender amongst passengers on the Titanic since our *p* value is less than `0.05`.
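
For readers who want to see where the Chi-Square stat comes from, the following sketch recomputes it from the gender crosstabulation using the usual Pearson formula with no continuity correction. It is an illustration under stated assumptions, not MagicStat's implementation: the helper name is ours, and the count of 161 Males who survived is not quoted in this blog but derived from the 1309 total observations.

```python
# Gender crosstabulation: rows are Females and Males, columns are died and
# survived. The value 161 (Males who survived) is an assumption derived as
# 1309 - 127 - 339 - 682; the other three counts are quoted in this blog.
observed = [
    [127, 339],
    [682, 161],
]

def pearson_chi_square(table):
    """Return (statistic, df) for a crosstabulation, with no continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if gender and survival were unrelated.
            exp = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

stat, df = pearson_chi_square(observed)
print(round(stat, 2), df)
```

With these counts, the statistic rounds to `365.89` with a df of 1, which agrees with the values reported in the write up for gender and survival.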

**5. Writing up your results**

The write up of the results for this Chi-Square Test for Independence is provided below. There are several data points you will need for the write up. These data points can be found in the Chi-Square Results of gender and survived. You will need the values associated with the Chi-Square stat, the degrees of freedom (df), and the *p* value.

The researcher was interested in exploring the relationship between gender and survival. As such, the main research question was: *is there a relationship between gender and survival?*

The results of the analysis indicated that the number of deaths varied between the genders considered within the analysis – Males (*n* = 682) compared to Females (*n* = 127). Additionally, the results of the Chi-Square Test for Independence indicated that there are significant differences in death rates by gender: χ²(1) = 365.89, *p* < 0.001. These results mean that the differences in death rates by gender are unlikely to be due to chance.

In the above write up, the symbol χ² stands for Chi-Square. The (1) represents the degrees of freedom within the analysis. The value of `365.89` is the actual Chi-Square statistic. The *p* < 0.001 indicates that our *p* value is less than `0.001`.

Go to the next page: Analyzing the Titanic Dataset with MagicStat – Part 2