



STATISTICAL RESOURCE 

Year : 2018 | Volume : 1 | Issue : 1 | Page : 41-45

A stepwise guide to performing survival analysis
Santam Chakraborty
Tata Medical Centre, Kolkata, West Bengal, India
Date of Web Publication: 12-Dec-2018
Correspondence Address: Dr. Santam Chakraborty, Tata Medical Centre, 14, MAR (EW), New Town, Rajarhat, Kolkata - 700 160, West Bengal, India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/CRST.CRST_5_18
Survival analysis refers to statistical techniques designed to circumvent the issues arising from incomplete information regarding the time until a desired event or endpoint occurs. The reasons for this incompleteness may be manifold, for example, loss to follow-up, dropouts from the study, lack of sufficient research budget, and a short follow-up period. Survival analysis is one of the most common but complicated analyses performed in trials. The current article provides a stepwise guide to understanding survival functions and performing such an analysis.
Keywords: Kaplan–Meier curve, survival analysis, time to event analysis
How to cite this article: Chakraborty S. A stepwise guide to performing survival analysis. Cancer Res Stat Treat 2018;1:41-5
Survival Analysis Definition   
Survival analysis refers to statistical techniques designed to circumvent the issues arising from incomplete information regarding the time until a desired event or endpoint occurs. Synonyms include reliability analysis, duration analysis, and event history analysis. While duration and the proportion surviving are the most common outcomes of interest estimated using these methods, the analytical techniques can be applied to other endpoints whenever we are interested in the time taken for an event to occur, for example, the duration of a hotel stay or recidivism after incarceration.
Why Do Survival Analysis?   
Continuous variables such as height, weight, and blood pressure can be measured objectively, and hence measures of central tendency (e.g., mean, median, and mode) and dispersion (e.g., range and standard deviation) are easy to understand and calculate in these situations. While it may be tempting to treat time to an event as a continuous variable, it is important to remember that in most cases, we cannot wait long enough for the entire population to experience the specific event of interest. Of course, if the entire population were to experience the event, then time could be treated as a continuous variable, and routine measures of central tendency could be used to describe it. For example, if the event of interest were death in fruit flies kept in a jar (and unable to reproduce), irrespective of the cause of death, the time to death could be recorded for every individual fruit fly within a practical time span of 60 days (as the lifespan is approximately 30 days). Doing the same in a population of humans, on the other hand, would certainly not be practical.
Understanding Censoring   
The key issue that these analytical techniques attempt to circumvent is that not all members of the population experience the event of interest within the period of observation. The reasons for this may be manifold (e.g., loss to follow-up, dropouts from the study, lack of sufficient research budget, and a short follow-up period), but all of these are situations in which the events could have been observed had the observation time and resources been extended. The way these techniques deal with this situation is referred to as censoring. Censoring is of several types, but the most common is right censoring, where the event may occur after the period of observation has ended. For example, suppose we wished to estimate the time taken to finish a marathon but had only 2-3 h to record the timings before going home. Some runners would have finished the marathon, some would be just finishing, and some would still be struggling at various distances or would have given up; the runners who did not complete the marathon within 3 h would be censored. Note that in a marathon, everyone usually starts at the same time, whereas survival analytic techniques also handle the staggered entries typical of clinical trials and studies (i.e., patients do not all enter the study at the same time point) [Figure 1].  Figure 1: Graphical depiction of the concept of event and censoring. Each line represents a single patient, and the blue-shaded area is the time during which the study runs. Patients whose lines terminate in a circle experienced the event, while the others have not. As shown, a few may experience the event at the end of the observation period, but others may not. In addition, entries into the study are staggered. Some patients do not complete the observation period but do not experience the event either (e.g., lost to follow-up). In this case, all patients except the two who experience the event within the study duration will be censored
One of the key concepts to remember is that most survival techniques are designed to work with “noninformative” censoring.^{[1]} This seemingly innocuous term means that censoring should not be determined by the event under consideration, i.e., the reasons that patients drop out of a study should be unrelated to the study question. If patients in a study were to drop out only when they experienced a recurrence, then censoring would be informative, and an accurate estimate of the recurrence-free survival could not be obtained. Noninformative censoring is necessary because traditional survival analytic methods, such as the Kaplan–Meier method, assume that the population remaining after censoring is representative of the entire population.
Preparing the Dataset   
Before starting out with survival analysis, there are certain basic data points that should be recorded in your dataset:
 The date/time of start of observation: This is usually the time at which you start observing your patients. This time point should be available for all patients and should be chosen such that there is minimal ambiguity. For randomized studies, the date of randomization is usually chosen as the start date, while in retrospective studies, a good choice is the date of registration or diagnosis. The key is to have a consistent way of recording this date
 Date/time of event: This is the time at which the patient experienced the event in question. The best practice is to record this as a date. The actual duration of survival is calculated from the date of start of observation to the date of the event. Patients who do not experience an event should have the date of last follow-up noted in this column
 Event indicator: An indicator column which indicates whether the event of interest has happened. The best practice is to record the event as Yes or 1 and the absence of event as 0 or No.
Note that if you are interested in multiple events, it is best to record these events separately in your dataset. It is also important to maintain consistency in encoding events in such a scenario. [Figure 2] depicts an example of how to arrange your datasheet prior to the analysis.  Figure 2: Example datasheet showing the data setup for estimation of survival. In this case, we have twin objectives in mind: to calculate the overall survival of the population and to calculate the recurrence-free survival. The cells highlighted in red indicate that the event has occurred. Note that raw dates have been entered. This is useful because errors in the calculation of the duration of survival are eliminated. Further, the duration can be expressed in multiple ways (e.g., days, weeks, and months), and data consistency checks can be made (e.g., a negative duration of survival indicates a data-entry error in one of the dates). LFU = Lost to follow-up
Note that in the example, patients who were lost to follow-up have not been coded as dead even if they had a recurrence. Instead, they have been censored at the last date of follow-up known for the patient. This is because, while such a patient may have subsequently died, we do not have concrete information to that effect. If, on the other hand, we had known that the patient had died (e.g., through a phone call or a death certificate), then the patient would have been recorded as dead and the date of death noted appropriately. Survival analytic methods are useful because they incorporate information on censoring and as such do not require subjective “data juggling” to account for what is not known about the patient's actual outcome. The corollary is that the figures obtained are not actual figures but estimates, and are therefore associated with a degree of uncertainty. How to quantify this uncertainty is discussed in the subsequent sections. Note that for patients who have not died, the date of last known follow-up is recorded. In many situations in developing nations, where follow-up is not perfect, phone calls, letters, and e-mails are important tools for obtaining survival information. In this setting, while objective events like death can be recorded accurately, endpoints like recurrence should not be recorded via these modes of communication if exact details are not available.
Prior to the actual analysis, the duration of survival/recurrence-free survival needs to be calculated. As shown in [Figure 3], the duration was calculated using the available spreadsheet functions. These functions differ depending on the specific software used; for Statistical Package for the Social Sciences (SPSS), a good resource can be found online: https://www.spss-tutorials.com/spss-date-variables-tutorial/  Figure 3: Duration of survival calculated using a spreadsheet function (in months)
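The same date arithmetic can be sketched in code. The following is a minimal illustration, not the exact formula used by any spreadsheet or by SPSS: the 30.44-day average month is an assumption, and the dates are invented. It also demonstrates the consistency check mentioned above, where an inverted date pair signals a data-entry error.

```python
from datetime import date

def duration_months(start: date, end: date) -> float:
    """Duration between two dates in months, assuming an average
    month length of 30.44 days. Raises an error if the event date
    precedes the start date, which flags a data-entry mistake."""
    days = (end - start).days
    if days < 0:
        raise ValueError("event date precedes start date: check data entry")
    return days / 30.44

# Hypothetical patient: registered 01-Mar-2015, last follow-up 01-Mar-2016
print(round(duration_months(date(2015, 3, 1), date(2016, 3, 1)), 1))  # 12.0
```

Whatever convention is chosen (calendar months, 30-day months, or an average month), it should be applied uniformly across the dataset.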
Understanding Hazard   
One of the oft-repeated terms in survival analysis is hazard. Semantically, the term “hazard” refers to an agent that may cause harm, while “risk” is the probability of the hazardous event occurring. In survival analysis, the term hazard is often used to mean risk: it refers to the probability that an individual who is under observation at a time t has an event at that time.^{[2]} Plotting this instantaneous hazard against time gives the hazard function, and the rate at which this instantaneous hazard changes with respect to time is called the hazard rate. The ratio of two hazard rates is referred to as the hazard ratio.
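In discrete terms, the hazard at a time t is simply the number of events at t divided by the number of individuals still at risk at t, and a hazard ratio is the ratio of two such quantities. A minimal sketch (the counts are made up for illustration):

```python
def discrete_hazard(events: int, at_risk: int) -> float:
    """Discrete hazard at time t: the proportion of individuals
    still under observation (at risk) at t who have the event at t."""
    return events / at_risk

# Hypothetical counts: 5 events among 100 at risk in group A,
# versus 2 events among 80 at risk in group B
hr = discrete_hazard(5, 100) / discrete_hazard(2, 80)
print(hr)  # hazard ratio = 0.05 / 0.025 = 2.0
```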
Conducting the Analysis   
Prior to conducting the analysis, it is important to understand what estimate you desire and which analytical techniques you will need. Survival analytic methods can be broadly grouped into the following three categories:
 Nonparametric: These techniques do not make any assumptions about the distribution of the survival times. As such, these are very versatile and can be used for any survival distributions. However, these techniques are not ideal if you are attempting to generate predictive models. In the current manuscript, we will be dealing with an example of this form of analysis
 Semiparametric: The most popular technique in this category is the Cox proportional hazards model, which is considered semiparametric because it makes no assumption about the baseline hazard. It does, however, assume that the factors influencing survival are linearly related to the log hazard. This is useful when you desire a predictive model without wanting to “guess” the actual distribution of the hazard^{[3]}
 Parametric: These models assume that the hazard function follows a specific distribution. This is particularly useful when you want to calculate the hazard of the event for a specific person in your population and therefore is a technique used for advanced survival modeling.
Basic Survival Analysis: Kaplan–Meier Method   
We will start with the simplest and practically the most (ab)used method for survival analysis, the Kaplan–Meier product-limit estimator.^{[4]} This technique, first described in a collaborative article in 1958, has become immensely popular owing to its simplicity and ease of interpretation. The complete details of the calculation will not be presented here owing to space limitations, but a good description has been provided by Rich et al.^{[5]}
To analyze survival using this method, the basic requirements are the time and event descriptors. This page gives stepwise details on how to perform the analysis in SPSS: https://statistics.laerd.com/spss-tutorials/kaplan-meier-using-spss-statistics.php. The output of the analysis comprises a survival curve and a survival table. Note that for the calculation of overall survival, the event of interest is death, while for the calculation of recurrence-free survival, the event of interest is the development of recurrence.
The Kaplan–Meier curve as depicted in [Figure 4] has the survival probability on the Y-axis and the time duration on the X-axis. The curve itself comprises a series of steps and tick marks: the steps indicate events, while the tick marks indicate censoring. We can obtain the cumulative survival probability at any point of time by reading the value on the Y-axis at the corresponding time point on the X-axis.  Figure 4: Standard Kaplan–Meier plot demonstrating the salient features of a survival curve. Tick marks on the curve indicate where a patient has been censored, while steps represent events
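The product-limit calculation itself is short enough to sketch in code. The following is a bare-bones illustration of the Kaplan–Meier estimate, not a replacement for SPSS or R output; the toy data are invented. At each distinct event time, the running survival probability is multiplied by the fraction of at-risk patients who do not have the event at that time.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate.
    times: observed duration for each subject.
    events: 1 = event occurred, 0 = censored.
    Returns a list of (time, survival probability) at each event time."""
    data = sorted(zip(times, events))   # order observations by time
    n = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i                 # subjects still under observation at t
        d = 0                           # events occurring at exactly t
        while i < n and data[i][0] == t:
            d += data[i][1]
            i += 1
        if d > 0:                       # the curve only steps down at event times
            surv *= 1 - d / at_risk
            curve.append((t, surv))
    return curve

# Toy data: events at t = 1, 3, 4; censorings at t = 2 and 5.
# Censored subjects leave the risk set without producing a step.
print(kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0]))
```

Note how the censoring at t = 2 does not change the survival estimate directly, but it does shrink the risk set, so the step at t = 3 is larger than it would otherwise be. This is exactly how censored patients contribute partial information.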
The survival table obtained from this analysis provides the mean and median estimates of survival. Reporting of mean survival times is generally uninformative, as means are affected by extreme values. Median times are usually reported along with the 95% confidence intervals of the estimate. Note that the median time may not be “reached” if fewer than half of the patients have experienced an event. As a matter of practice, it is usual to report the proportion of patients who survive at the median follow-up duration. The stock Kaplan–Meier curves obtained in SPSS often fail to highlight important aspects of the survival distribution. [Figure 5] depicts an enhanced version of the Kaplan–Meier curve generated using R software (Vienna, Austria). The important additions are as follows:  Figure 5: An enhanced Kaplan–Meier plot, generated using R. Panel A shows the estimated time-to-event outcome; the X-axis shows the time, while the Y-axis shows the probability of this outcome. Panel B shows the number at risk; the axis labels below the line denote the time, while the values above the line denote the numbers at risk. Panel C shows the censoring pattern; the axis labels below the line denote the time from enrollment into the study, while the vertical lines denote the times at which patients were censored. The thickness of a vertical line denotes the number of patients censored
 95% confidence intervals of the estimated cumulative survival at any point of time depicted by translucent bands
 Number at risk, displayed below the X-axis, which denotes the number of patients at risk for the event at any point of time
 Censoring plot: This shows the distribution of censoring and can be useful in special situations to understand if techniques to deal with informative or unequal censoring are required.
It should be noted that the 95% confidence intervals widen appreciably toward the end of the curve, when the number of patients at risk is low. Therefore, estimates of survival should be reported only for time points at which there is an adequate number of patients at risk (typically more than 10-15 patients).
The survival table provided in the output [Figure 6] also allows us to estimate the cumulative survival probability at a given time point. It is a good practice to report the 95% confidence intervals of this estimate which can also be calculated from the same table using the standard error.  Figure 6: The survival table obtained from a standard Kaplan–Meier analysis in SPSS. The cumulative proportion surviving at time 1.700 is 0.75 (75%) with a standard error of 0.089. If the appropriate option is checked, then the 95% confidence intervals can be calculated by using Std. Error. Std. Error = Standard error. Image courtesy: Dr. Vijay M Patil, Associate Professor, Medical Oncology, Tata Memorial Hospital, Mumbai, Maharashtra, India. Used with permission
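The final step of that calculation can be sketched directly. Using the values quoted in the caption of [Figure 6] (cumulative survival 0.75, standard error 0.089), a plain 95% confidence interval is S ± 1.96 × SE, clipped to the valid probability range. This sketch applies only that last step; SPSS derives the standard error itself (via Greenwood's formula) from the survival table.

```python
def km_confidence_interval(surv: float, se: float, z: float = 1.96):
    """Plain (linear) 95% confidence interval around a Kaplan-Meier
    estimate, clipped to the valid probability range [0, 1]."""
    return max(0.0, surv - z * se), min(1.0, surv + z * se)

# Values taken from the Figure 6 caption: S = 0.75, Std. Error = 0.089
lo, hi = km_confidence_interval(0.75, 0.089)
print(round(lo, 3), round(hi, 3))  # 0.576 0.924
```

So the estimate would be reported as 75% (95% CI: 58%-92%). The clipping matters near the ends of the curve, where a symmetric interval can otherwise stray outside [0, 1]; statistical packages often use a log or log-log transformation to avoid this.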
Comparing the Survival of Two (Or More) Groups   
While obtaining survival estimates for a single population is often adequate, we are usually interested in comparing the survival times of two populations. The Kaplan–Meier method also permits this form of analysis [Figure 7]. The key to this analysis is to have a separate column indicating the groups you want to compare; for example, the group column can be gender. The grouping variable must be categorical: it is acceptable to have Stages I, II, III, and IV, but not height in centimeters, because the method estimates the cumulative survival probability for each group separately. In SPSS, the compare groups option must be checked to obtain this output. While the plot provides a visual indication, a formal statistical comparison is also possible. The most common technique for this comparison is the log-rank test, which has been used to derive the P value shown in the plot.^{[6]} The test computes a Chi-square value for the two groups at each event time and sums the results; the summed Chi-square value for each group is then compared using a standard Chi-square test.  Figure 7: A Kaplan–Meier curve comparing the survival of two groups. N. risk: the number-at-risk table is inside the plot in this graph, displayed just above the X-axis
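The observed-versus-expected bookkeeping behind the log-rank test can be illustrated in code. This is the simple (O − E)²/E approximation of the statistic, not the variance-based form most packages report; the toy data are invented, and the resulting value would be compared against a Chi-square distribution with 1 degree of freedom.

```python
def logrank_chi2(times1, events1, times2, events2):
    """Approximate log-rank statistic for two groups.
    events: 1 = event occurred, 0 = censored."""
    # Tag each observation with its group (0 or 1), then sort by time.
    data = sorted([(t, e, 0) for t, e in zip(times1, events1)] +
                  [(t, e, 1) for t, e in zip(times2, events2)])
    n = [len(times1), len(times2)]   # numbers currently at risk per group
    obs = [0.0, 0.0]                 # observed events per group
    exp = [0.0, 0.0]                 # expected events per group
    i, total = 0, len(data)
    while i < total:
        t = data[i][0]
        d = [0, 0]                   # events at exactly t, per group
        m = [0, 0]                   # all removals (events + censorings) at t
        while i < total and data[i][0] == t:
            _, e, g = data[i]
            d[g] += e
            m[g] += 1
            i += 1
        if d[0] + d[1] > 0:
            for g in (0, 1):
                obs[g] += d[g]
                # Expected events: group's share of the risk set times
                # the total number of events at this time.
                exp[g] += n[g] * (d[0] + d[1]) / (n[0] + n[1])
        n[0] -= m[0]
        n[1] -= m[1]
    return sum((obs[g] - exp[g]) ** 2 / exp[g] for g in (0, 1))

# Two groups with identical survival experience give a statistic of 0
print(logrank_chi2([2, 4, 6], [1, 1, 0], [2, 4, 6], [1, 1, 0]))  # 0.0
```

The larger the gap between observed and expected events in each group, the larger the statistic, and the smaller the resulting P value.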
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References   
1.  Rebolj Kodre A, Pohar Perme M. Informative censoring in relative survival. Stat Med 2013;32:4791-802. 
2.  Clark TG, Bradburn MJ, Love SB, Altman DG. Survival analysis part I: Basic concepts and first analyses. Br J Cancer 2003;89:232-8. 
3.  Carroll R, Lin X. Nonparametric and semiparametric regression methods. In: Chapman & Hall/CRC Handbooks of Modern Statistical Methods: Longitudinal Data Analysis. London, UK; 2008. p. 191-7. 
4.  Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958;53:457-81. 
5.  Rich JT, Neely JG, Paniello RC, Voelker CC, Nussenbaum B, Wang EW. A practical guide to understanding Kaplan-Meier curves. Otolaryngol Head Neck Surg 2010;143:331-6. 
6.  Bland JM, Altman DG. The logrank test. BMJ 2004;328:1073. 
