6  Binary Logistic Regression

6.1 Introduction to Logistic Regression

There are several forms of logistic regression:

  1. Multinomial logistic regression (a.k.a nominal logistic regression) is based on a nominal y variable with more than two categories. Nominal variables are categorical variables (a.k.a qualitative variables) that are different in name only (i.e. there is no inherent order in the categories). For example, eye color represented with category names such as blue, brown, green or represented with category names such as 0, 1, and 2. See Hua, Choi, and Shi (2021) chapter 11 for an example.

  2. Binary logistic regression (the topic of this chapter) is based on a binary y variable. Binary variables are categorical variables (a.k.a qualitative variables) that are nominal variables represented with only two categories. For example, the answer to a survey question represented with category names such as yes and no or represented with category names such as 0 and 1. See Hua, Choi, and Shi (2021) chapter 10 for an example.

  3. Ordinal logistic regression is based on an ordinal y variable. Ordinal variables are categorical variables (a.k.a qualitative variables) that are different in name and order. For example, quality rating represented with category names such as good, better, best, or represented with category names such as 0, 1, and 2. See Hua, Choi, and Shi (2021) chapter 12 for an example.

Logistic regression models do not estimate the y variable like linear regression models. Instead, logistic regression models estimate the probability of the y variable being a specific category. For example, in binary logistic regression with category names 0 and 1, the model estimates the probability that y is in the category 1.

6.1.1 Logistic regression equation

The final form of the logistic regression equation (Equation 6.1) differs from the linear regression equation on the left side of the equation. The right side of the equation continues to have x variables and corresponding coefficients (a.k.a parameters) represented by β0, β1, β2, β3, …, βn. However, β0 is not the y-intercept.

\[logit(p) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + ... + \beta_{n}x_{n} \tag{6.1}\]

6.1.1.1 Left side of the equation

We begin with the concepts of odds (Equation 6.2) and probability (Equation 6.3). For example, if the probability of an event occurring is 80%, then the odds of the event occurring are 4:1 because \(\frac{80}{(1-80)} = 4\). If the odds of an event occurring are 4:1 then the probability of the event occurring is 80% because \(\frac{4}{(1+4)} = 0.80\).

\[odds = \frac{p}{1-p} \tag{6.2}\]

\[p = \frac{odds}{1+odds} \tag{6.3}\]

Logistic regression modeling relies on natural logarithm, which means the base is \(e\) (a.k.a Euler’s number). The value of \(e\), much like \(π\), is a mathematical constant with special properties (Aghaeiboorkheili and Kawagle (2022)). The value of \(e\) to 15 decimal places is 2.718281828459045. Equation 6.4 and Equation 6.5 illustrate the meaning of a logarithm, specifically a natural logarithm because the base is \(e\).

\[log_{e}(a) = b \tag{6.4}\] \[e^b = a \tag{6.5}\]

The left side of the logistic regression equation can be represented in one of the equivalent forms in Equation 6.6 where \(odds\) are the odds of \(y = 1\) and \(p\) is the probability of \(y = 1\). The final form, \(logit(p)\) is the most common form used to represent the left side of the logistic regression equation. The term logit refers to a logistic transformation of the odds.

\[log_{e}(odds) = log_{e}(\frac{p}{1-p}) = logit(p) \tag{6.6}\]

Thus, the logistic regression equation produces the logit of \(p\), which is the natural log of the odds of \(y = 1\). This value ranges from 0 to 1 and is interpreted as the probability that \(y = 1\). This probability is then rounded to create the categories \(y = 1\) or \(y = 0\). The rounding threshold depends on the application. The common practice of using 0.50 as the rounding threshold works for many situations.

6.1.1.2 Right side of the equation

The coefficients on the right side of the logistic regression equation are estimated using the maximum-likelihood method. The equations for maximum-likelihood estimation (MLE) are beyond the scope of this chapter. See STAT415 (n.d.) for details.

The coefficients (β1, …, βn ) are interpreted as follows (UCLA (n.d.a)).

  1. For continuous \(x\) variables, a one-unit change in the \(x\) variable causes the \(logit(p)\) (i.e. natural log of the odds of \(y = 1\)) to change by the value of the coefficient of that \(x\) variable.

  2. For binary categorical \(x\) variables, a value of \(x = 1\) causes the \(logit(p)\) (i.e. natural log of the odds of \(y = 1\)) to change by the value of the coefficient of that \(x\) variable because \(1 * β = β\). In contrast, a value of \(x = 0\) causes no change in the \(logit(p)\) because \(0 * β = 0\).

6.1.2 Logistic regression concept

The \(logit(p)\) (i.e. natural log of the odds of \(y = 1\)) is always bounded between 0 and 1 and forms a sigmoidal shape (i.e. an s-curve, blue line in Figure 6.1). Values of the \(logit(p)\) fall on this line and are interpreted as \(y = 1\) or \(y = 0\) based on a rounding threshold. The rounding threshold depends on the application. The common practice of using 0.50 as the rounding threshold works for many situations. The 0.50 rounding threshold (dashed line in Figure 6.1) would convert the orange dot into \(y = 1\) and the red dot into \(y = 0\) (Figure 6.1).

Figure 6.1: Logistic Regression s-curve plot

6.2 Binary Logistic Regression Assumptions

As with assumptions of linear regression model building (Section 5.2), explanations of the assumptions of logistic regression model building are all over the board. Quite surprisingly, none of the resources used in Section 5.2 (Freund and Wilson (2003), Groebner, Shannon, and Fry (2014), Jaggia et al. (2023), Shmueli et al. (2023), and Statistics by Jim) contain information about the assumptions of logistic regression. The assumptions of logistic regression belong to statisticians, not data analysts. I am not a statistician. I have gleaned the following assumptions from reconciling the explanations in Foley (2022, chap. 2, section 1), Hua, Choi, and Shi (2021, chap. 10, section 2), James et al. (2022, chap. 4, section 17), Leung (2021) with GitHub page, and Schreiber-Gregory and Bader (2018),

I am convinced that at least part of the problem with the disparity in descriptions is the use of the word assumptions. What do we mean by assumptions? We mean requirements of the model that should be checked, and if you do not check these requirements then you are assuming they are true.

A logistic regression model that violates one or more of these assumptions may still be useful for estimating \(y\) values.

Logistic regression model building requirements that should be checked or assumed.

  1. Categorical \(y\) variable: The \(y\) variable must meet the form of logistic regression being performed (Section 6.1): Multinomial; binary; or ordinal. Logistic regression is not suited for calculating a continuous \(y\) variable.

  2. Linearity with \(logit(p)\): There is a linear relationship between each continuous \(x\) variable as specified and the \(logit(p)\) (i.e. natural log of the odds of \(y = 1\)). The as specified refers to the form of the \(x\) variables (e.g. no transformation, raised to a power, inverse, log, etc.). This can be checked with scatter plots and the Box-Tidwell test (See Section 6.3.1 and Foley (2022) Assumptions).

  3. Low multicolinearity: Multicolilnearity refers to a linear relationship between any of the x variables {x1, x2, x3, …, xn}. We want little to no multicolinearity among our x variables because linear relationships between the x variables adversely impacts the accuracy of the coefficient calculations. Variance Inflation Factor (VIF) can be used to test for multicolinearity among the x variables. Pairwise scatter plots can be used to visually evaluate multicolinearity among the x variables.

  4. Independent observations: The observations (i.e. rows of data) are not correlated with one another. Observations that are correlated will cause repeating patterns in the errors.

The following are not requirements that should be checked, but are important considerations.

  1. Linearity in coefficients: The equation is linear in the coefficients {β1, β2, β3, …, βn}, which means no coefficient is raised to a power (other than 1) and there is no multiplication or division of a coefficient by another coefficient. This means the form of Equation 6.1 is the only possible form of coefficients for a logistic regression equation. The functions used in R ensure this linearity of coefficients. This does not prevent the x variables {x1, x2, x3, …, xn} from being mathematically altered (e.g. raised to a power, inverse, log, etc.) to increase the estimation ability of the model.

6.3 Maximum Likelihood Estimation

The glm() function is used to run MLE logisitic regression. The first parameter of the glm() function is the variables to be used in the equation. Equation 6.7 illustrates the basic form of this parameter, which is used when there is no interaction in x variables. The ‘+’ sybmol in Equation 6.7 can be read as and.

\[y \sim {x}_{1} + {x}_{2} + {x}_{3} + ... + {x}_{n} \tag{6.7}\]

The second parameter of the glm() function is the family or type of model. We use binomial for binary logistic regression.

The example below includes variables named \(y\), \(x1\), and \(x2\). You can use your imagination to think about \(y\) as some yes or no variable which we are attempting to estimate based on two continuous \(x\) variables.

# Read in the fake data
fake_data <- read_csv("fake_logistic2.csv")

# Create the linear regression model 
my_model <- glm(y ~ x1 + x2, family = "binomial", data = fake_data)

The results of glm() should be stored in an object, my_model in this example, and can be summarized with the summary() function.

# Output the summary report of the binary logistic regression model
summary(my_model)

Call:
glm(formula = y ~ x1 + x2, family = "binomial", data = fake_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4965  -0.7282  -0.2097   0.6544   1.7820  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -11.8702     3.8872  -3.054 0.002261 ** 
x1            1.2952     0.3852   3.362 0.000773 ***
x2            0.9664     0.3711   2.604 0.009205 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.029  on 49  degrees of freedom
Residual deviance: 42.226  on 47  degrees of freedom
AIC: 48.226

Number of Fisher Scoring iterations: 5

The summary output provides \(β_{0}\) (Figure 6.2, green box) and coefficients for the \(x\) variables (Figure 6.2, yellow box), which allows us to write the equation in Equation 6.8.

Figure 6.2: Logistic Regression Coefficients

\[logit(p) = -11.87 + 1.30 * x_{1} + 0.97 * x_{2} \tag{6.8}\]

6.3.1 Checking linearity with \(logit(p)\)

You may recall that one of the requirements that should be checked (Section 6.2) is the linearity of each \(x\) variable with the \(logit(p)\). This can be checked visually (Figure 6.3) and by using the Box-Tidwell test. See Foley (2022) Assumptions for details on running the Box-Tidwell test.

library(tidyr)   # for pivot_longer function
library(ggpubr)  # for stat_cor function to add Pearson correlation to the plot

# Prepare the data set for plotting the estimated y values against each x variable

# Copy the original data frame. I am always reluctant to change my original data frame
#   even though I know that I can simply read in the CSV file again to restore my 
#   original data frame. 
print_data <- fake_data 

# Add the fitted values from the model
print_data$fitted <- my_model$fitted.values

# Pivot the x variables to columns
# Label the new column "x-variables"
# The value of teach x variable will be moved into a column named "value"
print_data <- print_data |> 
  pivot_longer(c(x1, x2), names_to = "x_variables") 

# View the prepared data frame
head(print_data)
# A tibble: 6 × 4
      y fitted x_variables value
  <dbl>  <dbl> <chr>       <dbl>
1     1 0.820  x1           4.59
2     1 0.820  x2           7.7 
3     0 0.0304 x1           1.29
4     0 0.0304 x2           6.97
5     1 0.502  x1           2.68
6     1 0.502  x2           8.7 
# Create a plot for each x variable
ggplot(data = print_data,
       mapping = aes(x = value, y = fitted)) +  # value is the data frame variable containing the values of each x variable 
  geom_point() +
  facet_wrap(facets = vars(x_variables), scales = "free_x") +  # Create side-by-side plots based on x variables
  geom_smooth(method = "lm", formula = "y ~ x", se = FALSE) +  # Add the linear model line
  stat_cor(method = "pearson", label.y = 1) +   # Add the Pearson correlation and p-value      
  labs(
    title = "Residuals vs Estimated Values",
    x = "Value for each x variable",
    y = "logit(p)"
  )

Figure 6.3: Checking linearity of each x variable with the logit(p)

The visual approach in this section is adapted from Foley (2022) Assumptions.

I have added the Pearson correlation (Figure 6.3) for measuring the linear relationship between \(logit(p)\) and each \(x\) variable. The Pearson correlation for \(x1\) is strong (\(R = 0.82\)) and statistically significant (\(p < 0.05\)). The Pearson correlation for \(x2\) is moderate (\(R = 0.47\)) and statistically significant (\(p < 0.05\)). See Frost (n.d.) for details on interpreting Pearson correlation.

6.4 Model Performance Metrics

Deviance Measures

The Null Deviance and Residual Deviance can be used to calculate chi-square (Equation 6.9) and the associated p-value {Equation 6.10} to evaluate the quality of the model.

\[\chi^2 = null \ deviance - residual \ deviance \tag{6.9}\]

\[p{-}value_{\chi^2} = df_{null \ deviance} - df_{residual \ deviance} \tag{6.10}\]

Figure 6.4: Deviance Measures

Here are the calculations for our example. These values can be calculated from results in the model and used to calculate the p-value of chi-square as shown in the code example.

\(\chi^2 = 68.029 - 42.226 = 25.803\)

\(p{-}value_{\chi^2} = 49 - 47 = 2\)

# calculate chi-square as null deviance - residual deviance
chi2 <- my_model$null.deviance - my_model$deviance

# Calculate chi-square degrees of freedom as df{null deviance} - df{residual deviance}
pv_chi2 <- my_model$df.null - my_model$df.residual

# Calculate the p-value for chi-square
pchisq(q=chi2, df=pv_chi2, lower.tail=FALSE)
[1] 2.494247e-06

The p-value shown above is less than the chosen significant level of 0.05. Thus, the model is acceptable to use for estimating the \(y\) variable.

Akaiki Information Criterion (AIC)

AIC is used to compare different models developed on the same data set. AIC is based on log-likelihood and includes a penalty for increasing the number of \(x\) variables. A lower AIC indicates a better model for that data set. AIC is not an absolute measure and cannot be used to evaluate a single model. It can only be used to compare two or more models developed on the same data set.

Pseudo R-Square

Pseudo R-square is not shown in the summary() output. However, Pseudo R-square is a model evaluation metric for logistic regression. There are many possible formulas for Pseudo R-square (UCLA (n.d.b)). Pseudo R-square is not like R-square for linear regression! (1) The range of Pseudo R-squared is not limited to 0-1, (2) Pseudo R-squared does not represent the amount of variation in the dependent variable that is explained by the independent variables. Pseudo R-square is used to compare different models developed on the same data set. A higher Pseudo R-square, calculated with the identical formula, indicates a better model for that data set. Pseudo R-square is not an absolute measure and cannot be used to evaluate a single model. It can only be used to compare two or more models developed on the same data set.

Confusion Matrix

The confusion matrix is not shown in the summary() output. It must be generated using a function. The confusion matrix produces the best performance metrics for logistic regression. Specifically, the Accuracy measurement. A confusion matrix compares the count of false negatives, true negatives, false positives, and true positives. These counts are calculated from comparing \(\hat{y}\) (estimated \(y\)) with the actual \(y\) for each observation.

True positive (\(TP\)) is when \(\hat{y} = y = 1\)

False positive (\(FP\)) is when \(\hat{y} = 1\) but \(y = 0\)

True negative (\(TN\)) is when \(\hat{y} = y = 0\)

False negative (\(FN\)) is when \(\hat{y} = 0\) but \(y = 1\)

The confusion matrix is displayed as a 2 x 2 matrix where the intersection of rows and columns hold the count of the true and false values described above.

Figure 6.5: Confusion Matrix

library(caret)    # for the confusionMatrix() function

# Calculate the estimate for each observation based on rounding of the logit(p) for each observation
# The round() function uses the default threshold of 0.50. This default can be changed.
fake_data$estimate <- round(predict(my_model, type = "response"))

# Create the confusion matrix and associated metrics
# The `estimate` and `y` must be type factor and must have the same number of levels (i.e. categories) 
confusionMatrix(as.factor(fake_data$estimate),as.factor(fake_data$y), positive = "1", c("Estimated", "Actual"))
Confusion Matrix and Statistics

         Actual
Estimated  0  1
        0 25  5
        1  4 16
                                          
               Accuracy : 0.82            
                 95% CI : (0.6856, 0.9142)
    No Information Rate : 0.58            
    P-Value [Acc > NIR] : 0.0002834       
                                          
                  Kappa : 0.6281          
                                          
 Mcnemar's Test P-Value : 1.0000000       
                                          
            Sensitivity : 0.7619          
            Specificity : 0.8621          
         Pos Pred Value : 0.8000          
         Neg Pred Value : 0.8333          
             Prevalence : 0.4200          
         Detection Rate : 0.3200          
   Detection Prevalence : 0.4000          
      Balanced Accuracy : 0.8120          
                                          
       'Positive' Class : 1               
                                          

Accuracy is the percentage of \(y\) (Actual) correctly estimated by \(\hat{y}\) (Estimated) and is calculated using the following formula.

\[Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]

Sensitivity (a.k.a Recall) is the percentage of \(y\) (Actual) positives (\(y = 1\)) correctly estimated by \(\hat{y}\) (Estimated) and is calculated using the following formula.

\[Sensitivity = \frac{TP}{TP + FN}\]

Specificity is the percentage of \(y\) (Actual) negatives (\(y = 0\)) correctly estimated by \(\hat{y}\) (Estimated) and is calculated using the following formula.

\[Specificity = \frac{TN}{TN + FP}\]

Other metrics in the output of confusionMatrix() are less useful and can be explored via searching online resources.