



STATISTICAL RESOURCE 

Year : 2021 | Volume : 4 | Issue : 3 | Page : 551-554

Logistic regression: A simple primer
Ankita Pal
Mahamana Pandit Madan Mohan Malaviya Cancer Center, and Homi Bhabha Cancer Hospital, Tata Memorial Center, Varanasi, Uttar Pradesh, India
Date of Submission  17-Jul-2021
Date of Decision  31-Aug-2021
Date of Acceptance  18-Sep-2021
Date of Web Publication  08-Oct-2021
Correspondence Address: Ankita Pal Mahamana Pandit Madan Mohan Malaviya Cancer Centre, Varanasi, Uttar Pradesh India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/crst.crst_164_21
Logistic regression is used to obtain the odds ratio in the presence of more than one explanatory variable. This procedure is quite similar to multiple linear regression, with the only exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage of performing logistic regression is to avoid the effects of confounders by analyzing the association of all the variables together. In this article, we explain how to perform a logistic regression using practical examples. After defining the technique, the assumptions that need to be checked are explained, along with the process of checking them using the R software.
Keywords: Diagnostics, logistic regression, odds ratio, R, regression analysis
How to cite this article: Pal A. Logistic regression: A simple primer. Cancer Res Stat Treat 2021;4:551-4
Introduction   
Most of our understanding of biological effects and their determinants is gained through statistical analysis. Clinical studies that evaluate the relative contributions of various factors to a binary outcome, such as death or disease, are the most common way of gaining this understanding. In this article, we aim to provide a brief and simplified outline of how to perform a logistic regression, sufficient to permit clinicians who are unfamiliar with regression methodology to understand and interpret the results.^{[1]}
Logistic Regression   
Multivariate logistic regression is the statistical technique used when we wish to estimate the probability of a dichotomous outcome, such as the presence or absence of disease or of death. The probability of the outcome is referred to as the dependent variable, and the various factors that influence it are the independent variables, sometimes termed risk factors.
The probability of an outcome is expressed as a proportion or a percentage. For instance, suppose there were 600 patients with cancer, of whom 30 died. The proportion of deaths is 30/600, or 0.05 or 5%. In general, the results of logistic regression are presented in terms of the odds rather than the probability of the outcome. There is a direct relationship between probabilities and odds: the odds of an outcome are the probability of the outcome occurring divided by the probability of the outcome not occurring. In this example, the odds of death are obtained by dividing 0.05, the proportion of deaths, by 0.95, the proportion of survivors, which gives 1:19. The probability of death can be recovered from the odds simply by dividing the odds by 1 plus the odds, or (1/19)/(1 + 1/19) = 0.05.^{[2]}
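The arithmetic above can be reproduced with a few lines of R. This is only an illustrative sketch: the helper names prob_to_odds() and odds_to_prob() are hypothetical and are not part of any package.

```r
# Hypothetical helpers for converting between probability and odds.
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

deaths   <- 30    # numbers taken from the example in the text
patients <- 600

p_death    <- deaths / patients         # proportion of deaths
odds_death <- prob_to_odds(p_death)     # odds of death (1:19)
p_back     <- odds_to_prob(odds_death)  # probability recovered from the odds

print(c(probability = p_death, odds = odds_death, recovered = p_back))
```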
Logistic regression uses the past experience of a group of patients to estimate the odds of an outcome by mathematically modeling or simulating that experience and describing it by means of a regression equation. Symbolically, a logistic regression equation is given as,
$$\log\left(\frac{p}{1-p}\right) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2}$$
Where,
 x_{1} and x_{2} are the two predictor variables
 Y is a binary (Bernoulli) response variable, with p = P(Y = 1)
 $\log\left(\frac{p}{1-p}\right)$ is the log-odds
 β_{i} are the parameters of the model (i = 0, 1, 2).
A key feature in modeling a clinical experience is the selection of the independent variables that influence the outcome. The method for calculating the regression coefficients takes into consideration all the possible combinations of the independent variables. It then chooses the coefficients that maximize the probability that, for any individual with a specific combination of independent variables, the predicted chance of the outcome is as close as possible to the observed outcomes of all the individuals who share that combination.^{[2]}
The general form of the logistic regression equation is similar to that of multiple linear regression; however, the logarithm of the odds of the outcome, termed the logit or log-odds, is used as the dependent variable. The regression coefficients are likewise expressed as natural logarithms (of odds ratios).
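Because the coefficients are on the natural-logarithm scale, exponentiating them gives odds ratios. The sketch below illustrates this on simulated data; the variables x1 and x2 and the "true" coefficients are assumptions made purely for illustration and do not come from any real study.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)                 # simulated continuous predictor (assumption)
x2 <- rbinom(n, 1, 0.5)        # simulated binary predictor (assumption)

# Assumed data-generating model on the log-odds scale.
log_odds <- -1 + 0.8 * x1 + 0.5 * x2
y <- rbinom(n, 1, plogis(log_odds))   # plogis() is the inverse logit

fit <- glm(y ~ x1 + x2, family = binomial())

log_odds_coefs <- coef(fit)           # coefficients on the log-odds scale
odds_ratios    <- exp(log_odds_coefs) # exponentiate to obtain odds ratios

# Wald confidence intervals, also moved to the odds-ratio scale.
or_ci <- exp(confint.default(fit))

print(cbind(odds_ratio = odds_ratios, or_ci))
```

An odds ratio above 1 means the odds of the outcome increase as the predictor increases; a value below 1 means they decrease.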
Logistic Regression Diagnostics   
Many assumptions need to be checked before performing a Logistic Regression analysis. The assumptions are listed below along with a guide on how to check them with the help of R.
Dependent variable
The first assumption concerns the type of dependent variable: binary logistic regression requires the dependent variable to be binary, and ordinal logistic regression requires it to be ordinal.
Independent observations
In order to perform a logistic regression, the observations need to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.
Large sample size
In general, logistic regression requires a large sample size. A general guideline is that one needs a minimum of 10 cases with the least frequent outcome for each independent variable in the model. For example, if there are 5 independent variables and the expected probability of the least frequent outcome is 0.10, then a minimum sample size of 500 (10 × 5/0.10) is required.^{[3]}
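The guideline can be written as a small R function. This is a minimal sketch: the function name min_sample_size() and its argument names are hypothetical, and the default of 10 cases per variable simply restates the rule of thumb quoted above.

```r
# Hypothetical helper implementing the rule of thumb from the text:
# n >= cases_per_variable * number_of_predictors / P(least frequent outcome)
min_sample_size <- function(n_predictors, p_least_frequent,
                            cases_per_variable = 10) {
  stopifnot(n_predictors >= 1,
            p_least_frequent > 0, p_least_frequent <= 0.5)
  n <- cases_per_variable * n_predictors / p_least_frequent
  # round() guards against floating-point noise before rounding up
  ceiling(round(n, 6))
}

n_required <- min_sample_size(n_predictors = 5, p_least_frequent = 0.10)
print(n_required)   # 500, matching the worked example in the text
```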
Linearity assumption
The relationship between each continuous predictor variable and the logit of the outcome should be linear. This can be checked by visually inspecting the scatter plot of each predictor against the logit values.
If the scatter plot shows nonlinearity, other modeling approaches are needed, such as including second- or third-power terms, fractional polynomials, or spline functions.
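A visual check of this kind might look like the following base-R sketch. The data are simulated, and the variable names age and bp are hypothetical stand-ins for continuous clinical predictors.

```r
set.seed(42)
n   <- 300
age <- rnorm(n, mean = 55, sd = 10)    # simulated predictor (assumption)
bp  <- rnorm(n, mean = 130, sd = 15)   # simulated predictor (assumption)
y   <- rbinom(n, 1, plogis(-8 + 0.08 * age + 0.03 * bp))
dat <- data.frame(y, age, bp)

fit <- glm(y ~ age + bp, data = dat, family = binomial())

# Logit of the predicted probability for every observation.
prob  <- predict(fit, type = "response")
logit <- log(prob / (1 - prob))

# One scatter plot per continuous predictor, with a smoothed trend line.
# Plots are drawn only in an interactive session.
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  for (v in c("age", "bp")) {
    plot(dat[[v]], logit, xlab = v, ylab = "Logit of predicted probability")
    lines(lowess(dat[[v]], logit), col = "red")
  }
  par(op)
}
```

A roughly straight smoothed line suggests that the linearity assumption is reasonable for that predictor; a clearly curved line suggests trying one of the alternatives listed above.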
Influential values
Influential values are extreme individual data points that can alter the quality of the logistic regression model. The most extreme values in the data can be examined by visualizing the Cook's distance values, for example by labeling the 3 largest values.
A point to be noted is that not all outliers are influential observations. To check whether the data contain potentially influential observations, the standardized residuals can be inspected. Data points with an absolute standardized residual above 3 represent possible outliers and may deserve closer attention.
The standardized residuals (.std.resid) and the Cook's distance (.cooksd) can be computed using the R function augment() [broom package].
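The sketch below computes these diagnostics on simulated data. To stay free of package dependencies it uses the base-R functions rstandard() and cooks.distance() instead of augment(); treating these as equivalent to the .std.resid and .cooksd columns is an assumption, and the data and variable names are invented for illustration.

```r
set.seed(7)
n  <- 200
x1 <- rnorm(n)   # simulated predictors (assumption)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 + x1 - 0.7 * x2))

fit <- glm(y ~ x1 + x2, family = binomial())

# Per-observation diagnostics from base R.
diagnostics <- data.frame(
  index     = seq_len(n),
  std_resid = rstandard(fit),       # standardized residuals
  cooksd    = cooks.distance(fit)   # Cook's distance
)

# The 3 observations with the largest Cook's distance.
top3 <- head(diagnostics[order(-diagnostics$cooksd), ], 3)
print(top3)

# Possible outliers: absolute standardized residual above 3.
possible_outliers <- diagnostics[abs(diagnostics$std_resid) > 3, ]
print(nrow(possible_outliers))
```

Calling plot(fit, which = 4) in an interactive session draws the Cook's distance plot for the same model.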
When outliers are present in a continuous predictor, the potential solutions include:
 Removing the concerned records
 Transforming the data into a log scale
 Using nonparametric methods
Multicollinearity
Multicollinearity corresponds to a situation in which the data contain highly correlated predictor variables. Multicollinearity is an important issue in regression analysis and should be fixed by removing the concerned variables. It can be assessed using the R function vif() [car package], which computes the variance inflation factors (VIF).
As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
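To show what a variance inflation factor measures, the sketch below computes it by hand on simulated predictors: each predictor is regressed on the others, and the VIF is 1/(1 - R^2). This is a simplified stand-in and not necessarily the exact calculation that vif() performs for a fitted glm object; the variable names and the cut-off of 5 are assumptions taken from the rule of thumb above.

```r
set.seed(3)
n  <- 400
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # deliberately almost a copy of x1
x3 <- rnorm(n)                  # unrelated to the other two
predictors <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on all the others, then 1 / (1 - R^2).
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v),
                   data = predictors))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))

# Predictors flagged by the "VIF above 5" rule of thumb.
flagged <- names(vif_manual)[vif_manual > 5]
print(flagged)
```

Here x1 and x2 are flagged because they carry almost the same information, so one of them would be a candidate for removal.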
Logistic Regression in R   
The general mathematical equation for logistic regression that is used in R software is,
$$y = \frac{1}{1 + e^{-(a + bx)}}$$
where,
 y is the response variable
 x is the predictor variable
 a and b are the coefficients which are numeric constants.
The function used to create the regression model is the glm() function.
Syntax
The basic syntax for the glm() function from the stats package in logistic regression is,
glm(formula, family, data)
The description of the parameters mentioned in the above function is,
formula: An object of class “formula” (or one that can be coerced to that class): a symbolic description of the model to be fitted.
family: A description of the error distribution and link function to be used in the model. For glm this can be a character string naming a family function or the result of a call to a family function.
data: An optional data frame, list or environment containing the variables in the model.
Real-life example
Consider a real-life dataset, the Cleveland Heart Disease dataset, which contains information about patients who do or do not have heart disease. The dataset contains many medical indicators: it has 76 attributes, which were used to capture the medical history of patients of Hungarian and Swiss origin. The dataset is available online at: https://archive.ics.uci.edu/ml/datasets/heart+Disease.
The aim is to predict if a person has heart disease or not based on attributes such as blood pressure, heart rate, and others. Here, the dependent/response variable is target (whether the patient has heart disease or not) which is a binary variable, as it only takes the values 0 (= No) or 1 (= Yes). All the other variables are independent/predictor variables that will be used for predicting the response variable.
Therefore, a logistic regression model is built in R with a call to the glm() function.
To understand the call, it can be broken down into parts and explained.
glm() fits the generalized linear model we will be using.
target ~ . means that we want to model the target using (~) every available feature (.).
family = binomial() is used because we are predicting a binary outcome. Running the code prints the model summary, which lists each variable together with its P value.
From the output, it is clearly observed from the P values (denoted as Pr(>|z|)) that many variables are not significant. Hence, the variables that were found to be non-significant are removed one by one, starting with the least significant, and the glm function is applied each time to check for the best model. The best logistic regression model thus obtained is then used for predicting the response variable.
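The workflow described above can be sketched as follows. The data frame is a simulated stand-in and NOT the real heart disease dataset; the column names, the 0.05 significance threshold, and the 0.5 classification cut-off are all assumptions made for illustration.

```r
set.seed(2021)
n <- 300
heart <- data.frame(                  # simulated stand-in data (assumption)
  age      = rnorm(n, 54, 9),
  trestbps = rnorm(n, 131, 17),       # hypothetical blood pressure column
  thalach  = rnorm(n, 150, 23),       # hypothetical heart rate column
  noise    = rnorm(n)                 # predictor unrelated to the outcome
)
heart$target <- rbinom(n, 1,
                       plogis(3 + 0.04 * heart$age - 0.04 * heart$thalach))

# Full model: target modeled on every other column.
full <- glm(target ~ ., data = heart, family = binomial())
print(summary(full)$coefficients)     # includes the Pr(>|z|) column

# Backward elimination: drop the least significant predictor and refit,
# until every remaining predictor has P < 0.05 (threshold is an assumption).
# Row names equal term names here because all predictors are numeric.
model <- full
repeat {
  p <- summary(model)$coefficients[-1, "Pr(>|z|)", drop = FALSE]
  if (nrow(p) == 0 || max(p) < 0.05) break
  worst <- rownames(p)[which.max(p)]
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}
print(formula(model))

# Use the final model to predict the response.
pred_prob  <- predict(model, newdata = heart, type = "response")
pred_class <- as.integer(pred_prob > 0.5)   # 0.5 cut-off is an assumption
print(table(observed = heart$target, predicted = pred_class))
```

Base R also provides step(), which automates a similar search using the Akaike information criterion instead of P values.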
Conclusions   
No one knows better than a doctor how multiple factors can combine to produce patient outcomes. Logistic regression analysis is a powerful tool for assessing the relative importance of factors that determine outcome. It is increasingly used in clinical medicine to develop diagnostic algorithms and evaluate prognosis. Yet, this tool is both imperfect and subject to misuse. An article by Shahian et al.^{[4]} describes the deficiencies of the method as currently employed in the production of “report cards.” A basic understanding of logistic regression analysis is the first step to appreciating both the usefulness and the limitations of the technique.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References   
1.  Glantz SA, Slinker BK. Primer of Applied Regression and Analysis of Variance. 3^{rd} ed. New York: McGraw-Hill; 1990.
2.  Anderson RP, Jin R, Grunkemeier GL. Understanding logistic regression analysis in clinical reports: An introduction. Ann Thorac Surg 2003;75:753-7.
3.  Sperandei S. Understanding logistic regression analysis. Biochem Med (Zagreb) 2014;24:12-8.
4.  Shahian DM, Normand SL, Torchiana DF, Lewis SM, Pastore JO, Kuntz RE, et al. Cardiac surgery report cards: Comprehensive review and statistical critique. Ann Thorac Surg 2001;72:2155-68.
