



STATISTICAL RESOURCE 

Year : 2019  Volume : 2  Issue : 2  Page : 163-168

Basics of Statistics-1
HS Darling
Department of Medical Oncology and Hemato-oncology, Narayana Superspeciality Hospital, Gurugram, Haryana, India
Date of Web Publication: 20-Dec-2019
Correspondence Address: H S Darling, Superspeciality Hospital, Gurugram - 122 002, Haryana, India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/CRST.CRST_87_19
How to cite this article: Darling H S. Basics of Statistics-1. Cancer Res Stat Treat 2019;2:163-8
Initial Data Handling   
Learning oncology is not as difficult as keeping abreast of its latest developments. Doing so requires an understanding of statistical methods and a realistic interpretation of study results, so that research benefits can be sensibly translated from bench to bedside. Evidence-based medicine demands that clinical practice be backed by robust preclinical and clinical data. This necessitates the conduct of a high volume of rational clinical studies and their precise statistical analysis. We attempt to demystify statistics through this section.^{[1],[2],[3],[4]} Through this series, we will try to unravel the intricacies of statistical methods from a beginner's point of view.
Data collection and collation is the key step. In any given study, the factors related to the events under evaluation are termed variables, and these are represented as numbers. Various methods are used to derive meaningful results from outcomes recorded as numerical values. For this, the numbers need to be arranged in a particular fashion, which determines which statistical tests can be applied.
Types of Variables   
Variables are broadly of 2 types:
 Categorical or qualitative variables place each individual into exactly one category. These are further divided into nominal and ordinal variables.
 Nominal variable: There are qualitatively distinct categories with no order or sequence of data, for example, human blood group has 8 distinct categories, where each individual can fit into only one category.
 Ordinal variable: Categories are distinct and there is a certain order, for example: smoking history: none, light, moderate, and heavy. Again here, each individual is allotted one category.
 Numerical or quantitative variables are represented by a numerical value. These can be discrete or continuous.
 Discrete variable: This is generally an integer value, for example, number of episodes of seizures per day in an epilepsy patient in a given week or number of emergency admissions per month in a hospital in a given calendar year. This value is countable.
 Continuous variable: These numbers can take any value, for example, height or weight of school boys in a particular class. The value can include fractions or decimals. This value is measurable.
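As a minimal sketch of how the four variable types above might be encoded in software, the following Python fragment models nominal categories as an unordered enumeration, ordinal categories as an ordered one, discrete values as integers, and continuous values as floating-point numbers. The class and field names (`BloodGroup`, `SmokingHistory`, `PatientRecord`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class BloodGroup(Enum):
    """Nominal: eight distinct categories with no inherent order."""
    A_POS = "A+"
    A_NEG = "A-"
    B_POS = "B+"
    B_NEG = "B-"
    AB_POS = "AB+"
    AB_NEG = "AB-"
    O_POS = "O+"
    O_NEG = "O-"


class SmokingHistory(IntEnum):
    """Ordinal: distinct categories with a meaningful order."""
    NONE = 0
    LIGHT = 1
    MODERATE = 2
    HEAVY = 3


@dataclass
class PatientRecord:  # hypothetical record layout, for illustration only
    blood_group: BloodGroup   # categorical, nominal
    smoking: SmokingHistory   # categorical, ordinal
    seizures_per_day: int     # numerical, discrete (countable)
    weight_kg: float          # numerical, continuous (measurable)


record = PatientRecord(BloodGroup.O_POS, SmokingHistory.LIGHT, 2, 61.5)

# Ordinal categories can be ranked; nominal ones can only be tested for equality.
print(record.smoking < SmokingHistory.HEAVY)
print(record.blood_group == BloodGroup.O_POS)
```

Using a plain `Enum` for the nominal variable means that an accidental ordering comparison between two blood groups raises an error instead of silently returning a meaningless result.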
Outliers   
Outliers are occasional values that lie far outside the range of the majority of the data. These should be taken with a pinch of salt and cross-verified, as they may just be typing or measurement errors.
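The article does not prescribe a rule for spotting outliers. One common convention, used here purely as an assumption, flags any value lying more than 1.5 times the interquartile range beyond the first or third quartile. The sketch below applies that rule to the six readings from Question 3 at the end of this article; the function name `flag_outliers` is hypothetical.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    The 1.5 * IQR fence is a widely used convention, not a rule from the article.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


readings = [100.9, 100.1, 10.3, 99.7, 101.1, 100.5]
suspects = flag_outliers(readings)
print(suspects)  # candidates for re-verification, not for automatic deletion
```

A flagged value is only a prompt to go back to the source record; whether it is an error or a genuine extreme observation cannot be decided by the rule itself.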
Methods of Data Presentation   
Numerical data from a research study are needed to present the original study results, to support an argument, or to compare or analyze historical data. The complexity and bulk of the data make them tedious, monotonous, and time-consuming for a reader to interpret and remember. Hence, the data are displayed graphically and diagrammatically for compactness, easier understanding, and presentation.
There are three ways to present numerical information in an article or presentation:
 To include it in the main body of the text
 Tabular presentation
 In the form of graphs or charts
Conventionally, when there are only two values to compare, they are included in the main body of the text. For example, if the pass percentage in an examination is 76% and the failure percentage is 24%, this is simple to comprehend. Any data with three or more values are better represented in table or graph form. For example, in an office, the percentages of individuals commuting by car, bike, or public transport are 18%, 33%, and 49%, respectively. This information is better represented in table form [Figure 1].
 Figure 1: Percentage-wise commuting methods at an office depicted serially as table, pie chart, and bar chart
How to choose between table and graph
Most data are initially collected and stored in tabular form before they are analyzed. At the time of publication or presentation, it is decided whether they are better depicted in table or graphic form. A table is more appropriate for unrelated variables, for small amounts of data, or in situations where the individual values are more important than the overall trend; the latter is better depicted by a graph or chart. Minor differences in values are better depicted in a table. For example, the values 14.3, 14.22, 14.43, 14.37, and 14.10 are better depicted in a table, whereas the values 4, 7, and 3 are better represented in a graph [Figure 2]. Table form is also important when one or two values are very low or very high (outliers) but are equally important. In graph form they may appear unimportant; for example, when discussing the age-wise incidence of non-Hodgkin lymphoma, the numbers of patients in the age groups of 0–20 years and >80 years will be very small but, nevertheless, important. Graphs are important as they can transform complex data into a simplified, catchy visual image. A large number of values can be depicted more easily and effectively in graphs.
 Figure 2: A table is more suitable than a graph for variables where individual values are more important than the overall trend
Types of graphs
 Bar charts: These are the most commonly used graphs to depict data regarding multiple unrelated variables. The X-axis represents discrete entities or variables; hence, it has no scale. Bar charts are most suitable for discrete variables [Figure 1]
 Segmented column chart: This very commonly used chart is similar to a bar chart except that a third parameter can be added. One axis shows the discrete values of a variable (e.g., gender) and the other shows the percentage [Figure 3]. Each bar is divided into stacked segments that give a percentage-wise comparison of another parameter across its categories (e.g., blood groups)
 Histograms: This is a type of bar chart where the X-axis represents a continuous variable rather than discrete categories, for example, the birth weight record of newborns in a hospital. The data can be grouped into intervals to simplify the graph, e.g., the age-wise distribution of a cancer type in a population, which can be stratified into 0–10 years, 11–20 years, 21–30 years, and so on [Figure 4]. Histograms are more suitable for continuous variables
 Pie chart: This is a visual representation of distribution of total data between the categories. A pie chart is suitable for depicting percentages of a few related variables, e.g., percentages of Hindus, Muslims, Sikhs, Buddhists, and other religions in India. As described earlier, it is a better visual depiction of trend or relative proportions rather than actual figures [Figure 1]
 Line graph: This is a linear graphical presentation of a change in a variable over time or along a dimension, e.g., the partial pressure of oxygen with increasing depth in the sea or the change in the incidence of cervical cancer over the last 5 decades [Figure 5]
 Dot plot: This is one of the simplest plots, where univariate data are plotted as small filled circles, each representing one participant. It is used for a small to moderate-sized dataset (say <50 values). It can be used for continuous as well as discrete data, for example, the number of people with various blood groups in a population sample of 100 people [Figure 6]
 Scatter plot: This is a graphical representation of pairs of quantitative measurements for an individual or an object, e.g., a graph of the relationship between price and sale of ten brands of a drug in a month [Figure 7]
 Stem and leaf plot: This resembles a histogram rotated by 90 degrees. Stems and leaves are separated by a vertical line. The stem holds the leading digit(s), arranged in order vertically, and each leaf holds the next digit of one observation, arranged in order horizontally alongside its stem. Each stem thus represents a group of observations and each leaf an individual observation within that group, e.g., 26, 20, 28, 13, 10, 18, 13, 15, 48, 16, 15, 5, 18, 16, 28 [Figure 8]
 Box and Whisker plot: The box is a rectangle representing the interquartile range (explained under Measures of Spread of the Data). In a horizontal plot, the left end of the box is the 25^{th} percentile and the right end is the 75^{th} percentile. A vertical line within the box marks the median [Figure 9].
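The stem and leaf construction described in the list above can be sketched in a few lines of Python. This is only an illustrative text rendering under the assumption that the tens digit is the stem and the units digit is the leaf; it is not a reproduction of Figure 8, and the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a text stem and leaf plot (stem = tens, leaf = units)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    rows = []
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows


observations = [26, 20, 28, 13, 10, 18, 13, 15, 48, 16, 15, 5, 18, 16, 28]
rows = stem_and_leaf(observations)
for row in rows:
    print(row)
```

Keeping the empty stem for the thirties makes the isolated value 48 stand out, which is the same visual cue a histogram would give.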
 Figure 3: A segmented (stacked) column chart of gender-wise percentage distribution of human blood groups in a study
 Figure 4: Histogram showing age-wise distribution of a cancer type in a population group
 Figure 5: The rising incidence of a cancer type from year 2011 to 2018 depicted as a line plot
 Figure 6: A dot plot of number of children of each of 22 employees in an office. X-axis shows number of children and dots represent number of employees in each category
 Figure 9: Box and Whisker plot of dataset of heart rate (in beats per minute) of 20 young participants at rest (60, 73, 54, 94, 110, 84, 55, 74, 78, 82, 94, 89, 68, 70, 92, 86, 102, 82, 74, 79)
Measures of Central Tendency   
 Any data collected need to be summarized in the most precise and concise way to communicate the results effectively. One way to do this is by tables or graphs. The second way is to condense the data into measures that describe their important characteristics. The most suitable measure is chosen after looking at the individual numerical values. The idea is to have the best central representative value and a measure of the spread of the data on both sides of it
 Arithmetic mean: This is obtained by adding all the individual values and dividing the sum by the total number of values. For the dataset in [Figure 9], the arithmetic mean is 80 (1600/20)
 Median: On arranging the data in ascending order, the central value is the median, with an equal number of values above and below it. If the number of values is even, the median is the arithmetic mean of the central 2 values. The median equals the mean if the data are symmetrical; for example, the median of the [Figure 9] dataset is 80.5, close to the mean of 80, indicating that the data are nearly symmetrical. The median is less than the mean if the data are skewed to the right (a long tail of large values pulls the mean upward) and more than the mean if the data are skewed to the left (a long tail of small values pulls the mean downward)
 Mode: This is the most frequently occurring value in the dataset. This measure is rarely used in practice. In [Figure 9], the values 74, 82, and 94 each appear twice, so the dataset has three modes
 Geometric mean: The arithmetic mean is an inappropriate representation of skewed data. To make the spread of the data more symmetrical, we can take the logarithms of the original values. The antilog of the arithmetic mean of these transformed data is called the geometric mean, which is closer to the median of the original data. For the [Figure 9] data, the geometric mean is 78.67, which is not of much significance, as the data are nearly symmetrical
 Weighted mean: When certain values in the data have more or less significance than others, a weight is attached to each value while calculating the mean. For example, a student obtains 80% marks in internal assessment, 90% in sports, 50% in final theory, and 60% in viva. The arithmetic mean is 70%, whereas the weighted mean is 61% [Table 1].
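The measures of central tendency listed above can be checked against the heart rate dataset of Figure 9. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Resting heart rates (beats per minute) of the 20 participants in Figure 9
heart_rates = [60, 73, 54, 94, 110, 84, 55, 74, 78, 82,
               94, 89, 68, 70, 92, 86, 102, 82, 74, 79]

mean = statistics.mean(heart_rates)        # sum of values / number of values
median = statistics.median(heart_rates)    # mean of the two central values (n is even)
modes = sorted(statistics.multimode(heart_rates))  # every value tied for most frequent
geometric_mean = statistics.geometric_mean(heart_rates)  # antilog of the mean of logs

print(f"arithmetic mean = {mean}")
print(f"median          = {median}")
print(f"modes           = {modes}")
print(f"geometric mean  = {geometric_mean:.2f}")
```

`multimode` is used rather than `mode` because this dataset has several equally frequent values, and reporting only one of them would be misleading.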
 Table 1: Table depicting the difference between arithmetic mean and weighted mean
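The weighted mean calculation can be sketched as follows. Table 1 is not reproduced in this text, so the weights below are an assumption: they are hypothetical values chosen only because they reproduce the 61% quoted for the student example, and they should not be read as the weights actually used in the table. The function name `weighted_mean` is likewise illustrative.

```python
def weighted_mean(values, weights):
    """Sum of (value * weight) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Marks (%) in internal assessment, sports, final theory, and viva
marks = [80, 90, 50, 60]

# ASSUMPTION: hypothetical weights (1:1:4:4), not taken from Table 1
assumed_weights = [1, 1, 4, 4]

plain = sum(marks) / len(marks)
weighted = weighted_mean(marks, assumed_weights)
print(f"arithmetic mean = {plain}%")
print(f"weighted mean   = {weighted}%")
```

With these assumed weights, the two low but heavily weighted scores dominate the result, which is why the weighted mean falls below the arithmetic mean.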
Measures of Spread of the Data   
 Range: The difference between the smallest and the largest value in the dataset. In the dataset of [Figure 9], the range is 56 (110 - 54)
 Percentile: Consider arranging the data in ascending order and mapping them onto a scale of 1–100. The value that has 10% of observations below it is the 10^{th} percentile, and the value with 10% of observations above it is the 90^{th} percentile. Divisions into 10s are called deciles and divisions into 25s are called quartiles. The interquartile range is the difference between the 1^{st} quartile (72.25 in [Figure 9]) and the 3^{rd} quartile (89.75 in [Figure 9]), here 17.5, and is often depicted in the Box and Whisker plot. The interdecile range contains the observations between the 10^{th} and 90^{th} percentiles
 Variance: This is a measure of the difference of each observation from the arithmetic mean. It is calculated as the average of the squared differences; squaring accounts for both negative and positive differences. The value is expressed in the square of the unit of the original data. In [Figure 9], the variance is 206.6 (the sum of squared differences, 4132, divided by the 20 observations)
 Standard deviation (SD): This is the square root of the variance and is expressed in the same unit as the original data. It reflects the typical distance of individual observations from the arithmetic mean. SD divided by the arithmetic mean and expressed as a percentage is called the “coefficient of variation.” SD is most meaningful for normally distributed (i.e., symmetrical) data. One SD denotes “a range of 1 SD above and below the mean” (±1 SD). In a normal distribution, ±1 SD includes 68.2% of observations, ±2 SD includes 95.4%, and ±3 SD includes 99.7%. In [Figure 9], the SD is 14.37, which is the square root of 206.6, the variance in this case.
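The measures of spread quoted above for the Figure 9 dataset can be reproduced with a short Python sketch. Two choices here are assumptions made to match the published figures: the quartiles use the "inclusive" interpolation method, and the variance divides by the number of observations (population variance) rather than by one less than that number.

```python
import statistics

# Resting heart rates (beats per minute) of the 20 participants in Figure 9
heart_rates = [60, 73, 54, 94, 110, 84, 55, 74, 78, 82,
               94, 89, 68, 70, 92, 86, 102, 82, 74, 79]

data_range = max(heart_rates) - min(heart_rates)

# Quartiles by linear interpolation between sorted observations
q1, q2, q3 = statistics.quantiles(heart_rates, n=4, method="inclusive")
iqr = q3 - q1

variance = statistics.pvariance(heart_rates)  # divides by n
sd = statistics.pstdev(heart_rates)           # square root of the variance above
cv = 100 * sd / statistics.mean(heart_rates)  # coefficient of variation, in percent

print(f"range = {data_range}")
print(f"Q1 = {q1}, Q3 = {q3}, interquartile range = {iqr}")
print(f"variance = {variance:.1f}, SD = {sd:.2f}, CV = {cv:.1f}%")
```

Had the sample formula (`statistics.variance`, dividing by n - 1) been used instead, the variance would come out somewhat larger, at about 217.5, so it is worth stating which convention a reported figure follows.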
Data Distribution   
 Frequency distribution: This is a method of tabulating how often each possible value or category occurs in a dataset, in the form of a table or a graph. Replacing each observed frequency by a relative frequency (percentage of the total) allows us to compare one dataset with another similar dataset [Figure 10].
 Normal distribution (the Gaussian distribution) is a bell-shaped, symmetrical probability distribution curve in which values near the mean occur more frequently than values toward either end [Figure 11].
 Data distribution graphs: Displaying the frequency distribution can give us unimodal (single peak), multimodal (multiple peaks), uniform (straight line), symmetrical (normal distribution), or asymmetrical (skewed distribution) graphs. [Figure 4] is an example of an asymmetrical, negatively skewed histogram with a left-sided tail
 Probability is a term relating to the chance of occurrence of an event. Its value always lies between 0 and 1. Zero means that the event cannot occur and one means that the event must occur
 Probability distribution: Instead of actually observed values, it shows the theoretical probabilities of all possible observations. Similar to the frequency distribution, a probability distribution is described by terms such as mean, variance, and standard deviation. Depending on the type of variable, it can be continuous or discrete. Various probability distribution curves are possible according to the data studied. For continuous data, examples are the normal and t distributions (symmetrical) and the Chi-squared, F, and lognormal distributions (asymmetrical). In all such graphs, the total probability is represented by the area under the curve, which is one (100%), and the probability of an observation lying between two values is calculated as the “area under the curve” between those two values. For discrete data, binomial and Poisson distributions are used. Here, the sum of all probabilities is “one.” Categorizing the data into such distribution patterns helps in choosing the appropriate statistical tests to derive the most relevant interpretations.
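The idea that probability equals area under the curve can be illustrated with Python's standard `statistics.NormalDist`. The sketch below computes the area within 1, 2, and 3 SD of the mean of a normal distribution. It then, purely as an assumption for illustration, treats resting heart rate as normally distributed with the mean (80) and SD (14.37) of the Figure 9 dataset to estimate the probability of a value between 70 and 90. The helper name `fraction_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def fraction_within(k_sd, dist=standard_normal):
    """Area under the curve between (mean - k*SD) and (mean + k*SD)."""
    low = dist.mean - k_sd * dist.stdev
    high = dist.mean + k_sd * dist.stdev
    return dist.cdf(high) - dist.cdf(low)


for k in (1, 2, 3):
    print(f"within ±{k} SD: {100 * fraction_within(k):.1f}%")

# ASSUMPTION: heart rate modeled as normal with the Figure 9 mean and SD
heart_rate_model = NormalDist(mu=80, sigma=14.37)
p_70_to_90 = heart_rate_model.cdf(90) - heart_rate_model.cdf(70)
print(f"P(70 <= heart rate <= 90) = {p_70_to_90:.3f}")
```

The computed coverage for ±1 SD is 68.27%, which rounds to 68.3%; the 68.2% quoted earlier in this article is the same quantity truncated rather than rounded.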
 Figure 10: An example of frequency distribution table for dataset in Figure 9
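A frequency distribution table with relative frequencies can be built as sketched below. The class width of 10 beats per minute is an assumption made for illustration and may differ from the grouping actually used in Figure 10; the function name `frequency_table` is hypothetical.

```python
from collections import Counter


def frequency_table(values, width=10):
    """Return rows of (lower bound, upper bound, count, percent of total)."""
    counts = Counter((value // width) * width for value in values)
    total = len(values)
    rows = []
    # Walk every class between the lowest and highest, including empty ones.
    for lower in range(min(counts), max(counts) + width, width):
        count = counts.get(lower, 0)
        rows.append((lower, lower + width - 1, count, 100 * count / total))
    return rows


# Resting heart rates (beats per minute) of the 20 participants in Figure 9
heart_rates = [60, 73, 54, 94, 110, 84, 55, 74, 78, 82,
               94, 89, 68, 70, 92, 86, 102, 82, 74, 79]

table = frequency_table(heart_rates)
for lower, upper, count, percent in table:
    print(f"{lower:>3}-{upper:<3}  {count:>2}  {percent:5.1f}%")
```

The percent column is the relative frequency described above; because it is independent of sample size, it can be compared directly with the same column from another dataset.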
 Figure 11: A symmetrical frequency distribution of dataset from Figure 9
Let's test what we learnt:
Q 1. Which one of these is not a categorical variable?
 (a) Ethnicity
 (b) Blood group
 (c) Severity of anemia (mild/moderate/severe)
 (d) Number of blood transfusions per month in a thalassemia patient.
Q 2. Choose the discrete variable
 (a) Quarterly HbA1c values for a diabetic patient
 (b) Random serum prostate-specific antigen levels of a population group
 (c) Number of fever episodes per month in a patient
 (d) Weekly serum creatinine of a patient on maintenance hemodialysis.
Q 3. The most logical decision on the “outlier” in these data should be
100.9, 100.1, 10.3, 99.7, 101.1, 100.5
 (a) This is the most important value
 (b) This needs to be reverified
 (c) Tabular presentation should be preferred
 (d) This value should be ignored.
Q 4. The best form of graphical representation of the pulse and respiratory rate of a group of patients will be
 (a) Histogram
 (b) Bar chart
 (c) Scatter plot
 (d) Line chart.
Q 5. In this dataset
40, 20, 40, 50, 30, 20, 20, 30, 50, 10, 20, 20, 40
 (a) Median is the same as the mode
 (b) The data are symmetrical
 (c) Arithmetic mean is the same as the mode
 (d) Median is less than the arithmetic mean.
Q 6. In a box and whisker plot, the line representing the 50^{th} percentile
 (a) Is the median
 (b) Should be in the center of the box
 (c) Is the arithmetic mean
 (d) Is the weighted mean.
Q 7. Choose the incorrect option.
In symmetrically distributed data
 (a) Interquartile range contains fewer observations than 1 SD
 (b) Interdecile range will contain fewer observations than 2 SD
 (c) Median is in the center of the interdecile range
 (d) Observations in the first five percentiles are not included in 2 SD.
Q 8. Which of the following is applicable only to symmetrical data distribution?
 (a) Variance
 (b) Median
 (c) Range
 (d) Percentile.
Q 9. In a Chi-squared distribution, the best measure of the probability of an occurrence is
 (a) Standard deviation
 (b) Variance
 (c) Area under the curve
 (d) All the above.
Q 10. In Gaussian distribution, area under the curve is
 (a) 100%
 (b) 99.7%
 (c) 95.4%
 (d) Cannot be predicted.
Answers: 1 (d), 2 (c), 3 (b), 4 (c), 5 (b), 6 (a), 7 (d), 8 (a), 9 (c), 10 (a)
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References   
1.  Chakraborty S. A stepwise guide to performing survival analysis. Cancer Res Stat Treat 2018;1:41-5.
2.  Bhattacharjee A, Vishwakarma GK, Banerjee S. A Bayesian approach for dynamic treatment regimes in the presence of competing risk analysis. Cancer Res Stat Treat 2018;1:51-7.
3.  Dessai S, Simha V, Patil V. Stepwise cox regression analysis in SPSS. Cancer Res Stat Treat 2018;1:167-70.
4.  Dessai S, Patil V. Testing and interpreting assumptions of COX regression analysis. Cancer Res Stat Treat 2019;2:108-11.
