
Researchers are often interested in predicting categorical variables or specific outcomes. An example of a categorical variable is whether students pass an exam, which has two categories: pass or fail. In situations like this, researchers may want to identify the factors that predict or explain which category a case falls into. Using our exam example, researchers may want to determine the factors that most influence whether a student passes or fails. Time spent studying, previous exam scores, and time spent working with a tutor are all examples of variables that might influence whether or not a student passes an exam.

Logistic Regression

Regression is the main type of analysis used to determine the degree to which a set of variables (i.e., independent variables) explains or predicts scores on an outcome variable (i.e., the dependent variable). Two commonly used types of regression are standard multiple regression and logistic regression. Standard multiple regression is used when a researcher has multiple scale based independent variables and wants to test the degree to which they explain or predict scores on one scale based dependent variable. Logistic regression differs in one respect: the type of dependent variable. The researcher still has multiple scale based independent variables, but the dependent variable is categorical with only two categories. If your dependent variable has more than two categories, a multinomial logistic regression would be needed. Going back to our example, logistic regression would be the appropriate statistical test for a research question about passing or failing an exam. If you are interested in learning more about logistic regression, follow along with the example provided below.

  • Model summary – Logistic Regression
    • What you need – You will need several scale based independent variables and one categorical dependent variable (i.e., only two categories).
      • Categorical predictors can also be used in logistic regression. For simplicity’s sake, we will only use scale based variables.
    • Example research question – What factors most predict or explain whether or not someone survived the Titanic?
    • Assumptions – There are no assumptions regarding the distribution of scores for the predictor variables within logistic regression. However, logistic regression is sensitive to predictor variables that are highly intercorrelated, a condition referred to as multicollinearity. 
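
To make the difference from standard multiple regression concrete, here is a minimal Python sketch of how a logistic regression model turns a weighted sum of predictors into a probability between 0 and 1. The function name, intercept, and coefficients are hypothetical values chosen for the exam example. They are not estimates from any real data, and MagicStat does not require you to write any code.

```python
import math

def predicted_probability(intercept, coefficients, values):
    """Return P(outcome = 1) for one case under a logistic regression model."""
    # Linear predictor: the same weighted sum used in standard multiple regression.
    linear = intercept + sum(b * x for b, x in zip(coefficients, values))
    # The logistic function squeezes the weighted sum into the 0-1 range.
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical model: hours studied and previous exam score predicting pass/fail.
p = predicted_probability(-6.0, [0.5, 0.05], [4.0, 80.0])
print(round(p, 3))  # prints 0.5, because the weighted sum is exactly 0 here
```

A probability above 0.5 would lead the model to predict "pass" and a probability below 0.5 would lead it to predict "fail".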

1. Assumption Test – Correlation Analysis

To determine whether multicollinearity is a problem, we run a correlation analysis to see how strongly the independent variables are correlated with each other. Let’s do this now. The Titanic data set should already be uploaded to MagicStat, so we will select the Pearson Correlation model.

Once we have selected the Pearson Correlation model, we will need to address a few issues.

  • Handling Missing Data? – When running a Pearson Correlation model, we need to tell MagicStat how to handle missing data. Listwise deletion removes a case from the entire analysis if it is missing data on any of the selected variables. Pairwise deletion keeps a case for every pair of variables on which it has data and drops it only from the pairs where a value is missing. Since we are analyzing the correlations among multiple variables and want to maximize power, we will select pairwise deletion.
  • Selecting variables – Magicstat.co automatically selects scale based variables within a dataset when running a Pearson Correlation. In this case, there are four variables. Select all of these variables for the analysis.
    • Age – Age of passenger.
    • Sibsp – Number of siblings or spouses on board.
    • Parch – Number of parents or children on board.
    • Fare – Cost of ticket. 
  • Analyze – Click analyze once you have selected your variables.
  • Output
    • Degrees of Freedom – In this case, we have 1307. For a Pearson correlation, the degrees of freedom equal the number of cases in the analysis minus 2, so this value is consistent with the 1309 cases within the data set. We need to pay attention to the gap between these numbers when conducting analyses, as a large drop can signal poor data quality or other issues in the data.
    • Pearson Correlation (r) – The Pearson Correlation table takes a minute to learn how to read. The values mirror each other on either side of the diagonal, where each variable is perfectly correlated with itself. Notice the red and blue triangles below. The numbers in each triangle mirror each other. What we are interested in is the degree to which the variables are correlated with each other. Correlations above 0.70 in absolute value signal a problem with multicollinearity. In this case, no correlation exceeds 0.70, so we can assume multicollinearity is not a problem and proceed with the analysis.
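
For readers who want to see what this check does behind the scenes, the sketch below computes Pearson correlations with pairwise deletion and flags any pair of variables whose correlation exceeds 0.70 in absolute value. It is an illustration only: the function names are hypothetical, the toy values are made up rather than taken from the Titanic data, and MagicStat's own implementation may differ.

```python
import math
from itertools import combinations

def pearson_pairwise(xs, ys):
    """Pearson r using only the cases where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    # Assumes neither variable is constant (otherwise this divides by zero).
    return cov / math.sqrt(var_x * var_y), n

def flag_multicollinearity(columns, threshold=0.70):
    """Return the variable pairs whose |r| exceeds the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r, n = pearson_pairwise(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 2), n))
    return flagged

# Made-up toy values (NOT the Titanic data); None marks a missing value.
toy = {
    "age":   [22, 38, None, 35, 54, 2],
    "sibsp": [1, 1, 0, 1, 0, 3],
    "parch": [0, 0, 0, 0, 0, 1],
    "fare":  [7.25, 71.28, 7.92, 53.10, None, 21.08],
}
print(flag_multicollinearity(toy))
```

Each flagged entry lists the two variables, their correlation, and the number of cases that pair was based on. Under pairwise deletion that number can differ from pair to pair.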

2. Running the analysis – Logistic Regression

The Titanic Dataset offers several independent variables and one important dependent variable that are well suited to a Logistic Regression. Our research question will be: what factors predict the likelihood that someone survived the Titanic? To get started, we need to select a model to analyze our data. Click the button labeled “select a model to analyze your data” (see screenshot below) and select the Logistic Regression.

Once you have selected the Logistic Regression, you should see some changes on your screen. There should be an option to select categorical independent and dependent variables for your analysis. See the example below.

Independent Variables – Let’s select the independent variables for our analysis. We select the following variables: age (how old the passenger was), sibsp (number of siblings or spouses on board), parch (number of parents or children on board), and fare (the cost of the individual’s ticket). See the example below. Now that we have selected our independent variables, let’s select our dependent variable.

Dependent Variable – We want to select the following variable: survived. See the example below. Now that we have selected our dependent variable, let’s review our variables.

Reviewing the Model – Let’s review the variables we selected for the model. The independent variables seem correct. We selected age, sibsp, parch, and fare. Notice that there is an option to select categorical for each variable. MagicStat allows researchers to include categorical variables in their logistic regressions. None of our variables are categorical. Ensure none of these boxes are checked. The dependent variable is correct – survived. Now that we have selected our variables, let’s run the analysis.

3. Results – Logistic Regression

You should see the following output on your screen after clicking the “Analyze” button.

  • The first output is an indicator of how many missing cases we have in the data. Logistic regression automatically uses listwise deletion, meaning that a case missing data for any of the variables included in the analysis will be excluded from the analysis.
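
Listwise deletion can be sketched in a few lines of Python. The function name and the toy rows below are hypothetical (the values are made up, not the Titanic data). The point is only to show that one missing value in any model variable removes the whole case.

```python
def listwise_delete(rows, variables):
    """Keep only the rows that have a value for every variable in the model."""
    return [row for row in rows if all(row.get(v) is not None for v in variables)]

# Made-up toy rows (NOT the Titanic data); None marks a missing value.
rows = [
    {"age": 22, "sibsp": 1, "parch": 0, "fare": 7.25, "survived": 0},
    {"age": None, "sibsp": 0, "parch": 0, "fare": 8.05, "survived": 0},
    {"age": 38, "sibsp": 1, "parch": 0, "fare": None, "survived": 1},
    {"age": 26, "sibsp": 0, "parch": 0, "fare": 7.92, "survived": 1},
]
model_vars = ["age", "sibsp", "parch", "fare", "survived"]
complete = listwise_delete(rows, model_vars)
print(f"{len(rows) - len(complete)} of {len(rows)} cases excluded")  # 2 of 4 cases excluded
```

Compare this with the pairwise deletion used for the correlations: here the second and third rows are dropped entirely, even though each is missing only one value.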

4. Interpreting – Overview of logit regression results

  • There are several pieces of this output that need to be interpreted. First, we want to ensure that our dependent variable is correct, and in this case it is: survived. We also want to ensure we selected the correct model, and we did, as the model is logit. Next, we need to interpret the model. When running regressions, researchers build their hypotheses around the idea that a certain set of variables, sometimes referred to as a model, is likely to predict or influence changes in the dependent variable. For this reason, the first step in interpreting the results of a regression is to interpret the overall significance of the model. The first output, “Overview of logit regression results”, contains the values we need to determine whether the model is significant: the LLR p-value and the Pseudo R-squ. We want our model to be significant, and in this case it is: the p value is reported as 0.000 (i.e., p < 0.001), which is less than 0.05. However, we also need to consider the amount of variance explained by the model, Pseudo R-squ = 0.071. While there are no hard and fast rules for interpreting the Pseudo R-squ, our model explained only about 7% of the variance in the categories of our dependent variable. In regression, a model can be significant without explaining much variance, or change, in the dependent variable. In cases where the model is significant but does not explain much variance, there are usually independent variables in the model that are not very strong predictors of the dependent variable. Now we must consider the other pieces of output.
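
The two overview values can be related to the model's log-likelihood. The sketch below assumes the Pseudo R-squ is McFadden's pseudo R-squared and that the LLR p-value comes from a likelihood-ratio test against an intercept-only model, as in common statistics libraries whose output labels look the same. Neither assumption is confirmed for MagicStat. The log-likelihood values are hypothetical and were chosen only for illustration, and the function names are our own.

```python
import math

def mcfadden_pseudo_r2(ll_model, ll_null):
    """McFadden's pseudo R-squared: 1 - (model log-likelihood / null log-likelihood)."""
    return 1.0 - ll_model / ll_null

def llr_p_value(ll_model, ll_null, n_predictors):
    """p-value of the likelihood-ratio test against the intercept-only model."""
    if n_predictors <= 0 or n_predictors % 2 != 0:
        raise ValueError("this sketch only handles an even, positive number of predictors")
    # Likelihood-ratio statistic; approximately chi-square with n_predictors df.
    statistic = 2.0 * (ll_model - ll_null)
    half = statistic / 2.0
    # Closed-form chi-square survival function, valid only for even df.
    terms = sum(half ** i / math.factorial(i) for i in range(n_predictors // 2))
    return math.exp(-half) * terms

# Hypothetical log-likelihoods chosen for illustration (NOT taken from MagicStat).
ll_null, ll_model = -700.0, -650.0
print(f"Pseudo R-squ = {mcfadden_pseudo_r2(ll_model, ll_null):.3f}")  # 0.071
print(f"LLR p-value  = {llr_p_value(ll_model, ll_null, 4):.3f}")      # 0.000
```

The second line shows why a p value can appear as 0.000: the value is not actually zero, it is simply smaller than the three decimals displayed, which is why it is usually written as p < 0.001.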

5. Interpreting – Details of logit regression results

There are several pieces of information within this table; however, one piece is especially important for understanding which variables are significantly contributing to variance, or differences, in the dependent variable. The column P > |z| contains the significance values for each of the independent variables entered into our model. We want to see significant p values, p < 0.05. A quick scan of the table indicates that each of our variables is significantly contributing to variance, or differences, in the dependent variable. However, one variable has a larger p value than the others. Which one? Parch, or parents or children on board, has the largest p value (p = 0.038). Now that we know each of our independent variables is significantly contributing to the model, we need to understand the degree to which each of these variables influences differences in our dependent variable, whether or not someone survived the Titanic.

Another piece of information within this table that we need to consider is the direction of the coefficient. The column coeff contains the directional prediction for each independent variable. Negative coefficient values indicate that increases in the independent variable predict decreases in the likelihood of a specific outcome, in this case survival. Positive coefficient values indicate that increases in the independent variable predict increases in the likelihood of a specific outcome, in this case survival. In this case, increases in age and in the number of siblings or spouses on board decrease the likelihood that an individual survived the Titanic.
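
The sign rule can be sketched in Python using the coefficients reported in Table 1 of this tutorial. The `describe` helper is a hypothetical name of our own. Because the coefficients are rounded to two decimals, the percentages it produces are approximate.

```python
import math

# Coefficients as reported in Table 1 of this tutorial (rounded to two decimals).
coefficients = {"age": -0.02, "fare": 0.01, "parch": 0.18, "sibsp": -0.30}

def describe(coef):
    """Return the direction of the effect and the percent change in odds per unit."""
    if coef < 0:
        direction = "decreases"
    elif coef > 0:
        direction = "increases"
    else:
        direction = "does not change"
    # exp(coef) is the odds ratio; subtracting 1 gives the relative change in odds.
    pct_change = (math.exp(coef) - 1.0) * 100.0
    return direction, round(pct_change, 1)

for name, coef in coefficients.items():
    direction, pct = describe(coef)
    print(f"{name}: a one-unit increase {direction} the odds of survival ({pct:+.1f}%)")
```

Note that the percentages describe changes in the odds of survival, not in the probability of survival.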

6. Interpreting – Odds ratios and 95% Confidence Intervals

The final table in the output combines information from the previous tables. The Odds ratios and 95% Confidence Intervals table conveniently provides the two pieces of information about our independent variables that we need to interpret our output: the confidence intervals and the odds ratios. The first thing to check is the confidence interval: we want to see confidence intervals that do not contain the value of 1. For example, the confidence interval for age (0.97 – 0.99) does not contain 1. This means the odds ratio should be considered statistically significant (p < 0.05), and we can be reasonably confident in the direction of the effect. In the event that the confidence interval contains the value of 1, we cannot conclude the odds ratio is statistically significant, because an odds ratio of 1 means the variable does not shift the odds toward either outcome of the dependent variable.
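
As a rough illustration of this check, the sketch below recomputes each odds ratio as the exponential of its coefficient and tests whether the printed confidence interval excludes 1, using the values shown in Table 1. The helper names are hypothetical. The bounds are rounded to two decimals, so an interval sitting very close to 1 should be checked against the unrounded output before drawing conclusions.

```python
import math

# (coefficient, CI lower bound, CI upper bound) as printed in Table 1.
table_1 = {
    "age":   (-0.02, 0.97, 0.99),
    "fare":  (0.01, 1.01, 1.01),
    "parch": (0.18, 1.01, 1.20),
    "sibsp": (-0.30, 0.63, 0.74),
}

def odds_ratio(coef):
    """An odds ratio is the exponential of the logit coefficient."""
    return math.exp(coef)

def interval_excludes_one(lower, upper):
    """True when the confidence interval does not contain 1."""
    return not (lower <= 1.0 <= upper)

for name, (coef, lower, upper) in table_1.items():
    print(
        f"{name}: odds ratio {odds_ratio(coef):.2f}, "
        f"interval {lower:.2f} to {upper:.2f}, "
        f"excludes 1: {interval_excludes_one(lower, upper)}"
    )
```

An odds ratio below 1 corresponds to a negative coefficient and an odds ratio above 1 to a positive coefficient, so this table and the coefficient table should always agree on direction.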

7. Writing up your results

In the event you need to write up the results of a logistic regression, we have provided an example to guide your efforts.

  • There are several data points you will need for the write up.
  • These data points can be found in the first table, Overview of logit regression results.
  • You will need the values associated with the p value and the Pseudo R-squ.
  • You also need data points found in the Details of logit regression results.
  • You will need the values associated with the p value and the coefficients.

The write up – We were interested in determining whether specific variables predicted changes in the likelihood that an individual survived the Titanic disaster. The independent variables included in the analysis were: age, number of siblings or spouses on board, number of parents or children on board, and the cost of the individual’s ticket. The dependent variable was whether or not someone survived. We ran a logistic regression to test this research question. The results of the analysis indicated that the model was significant (p < 0.05); however, the model only explained 7% of the variance in the likelihood that an individual died or survived the Titanic disaster.

A closer examination of the results indicated that all of the variables included in the model significantly contributed to the model (see Table 1). As age increased, the likelihood of surviving the Titanic decreased. Additionally, as the number of siblings or spouses on board increased, the likelihood of surviving the Titanic decreased. Lastly, while the cost of the ticket and the number of parents or children on board tested as significant, the confidence intervals indicated that we could not confidently conclude that these variables significantly contributed in either direction to the prediction of whether or not an individual survived the Titanic disaster.

Table 1 – Logistic Regression Results
         Coefficient   P      95% CI Lower   95% CI Upper   Odds Ratio
Age      -0.02         0.00   0.97           0.99           0.98
Fare      0.01         0.00   1.01           1.01           1.01
Parch     0.18         0.04   1.01           1.20           1.20
Sibsp    -0.30         0.00   0.63           0.74           0.74
