|Year : 2019 | Volume
| Issue : 2 | Page : 163-168
Basics of Statistics-1
Department of Medical Oncology and Hemato-oncology, Narayana Superspeciality Hospital, Gurugram, Haryana, India
|Date of Web Publication||20-Dec-2019|
H S Darling
Superspeciality Hospital, Gurugram - 122 002, Haryana
Source of Support: None, Conflict of Interest: None
|How to cite this article:|
Darling H S. Basics of Statistics-1. Cancer Res Stat Treat 2019;2:163-8
| Initial Data Handling|| |
Learning oncology is not as difficult as keeping abreast of its latest developments. This brings with it the need to learn statistical methods and to interpret study results realistically, so that research benefits can be sensibly translated from bench to bedside. Evidence-based medicine demands that clinical practice be backed by robust preclinical and clinical data. This necessitates the conduct of high-volume, rational clinical studies and their precise statistical analysis. Through this series, we will try to demystify statistics and unravel the intricacies of statistical methods from a beginner's point of view.
Data collection and collation is the key step. In any given study, the factors related to the events under evaluation are described as variables, and these are represented as numbers. Various methods are used to derive meaningful results from outcomes recorded as numerical values. For this, the numbers need to be arranged in a particular fashion, which determines which statistical tests can be applied further.
| Types of Variables|| |
Variables are broadly of two types:
- Categorical or qualitative variables place each individual into exactly one category. They are further divided into nominal and ordinal variables.
  - Nominal variable: The categories are qualitatively distinct, with no order or sequence, for example, human blood group has 8 distinct categories, and each individual can fit into only one of them.
  - Ordinal variable: The categories are distinct and follow a certain order, for example, smoking history: none, light, moderate, and heavy. Here again, each individual is allotted one category.
- Numerical or quantitative variables are represented by a numerical value. These can be discrete or continuous.
  - Discrete variable: This is generally an integer value, for example, the number of seizure episodes per day in an epilepsy patient in a given week, or the number of emergency admissions per month in a hospital in a given calendar year. This value is countable.
  - Continuous variable: These numbers can take any value, for example, the height or weight of school boys in a particular class. The value can include fractions or decimals. This value is measurable.
| Outliers|| |
Outliers are occasional values that lie well outside the range of the majority of the data. They should be treated with caution and cross-verified, as they may simply be typing or measurement errors.
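As an illustration of how a suspicious value might be flagged for re-verification, the sketch below applies a commonly used screening rule, the 1.5 × interquartile range fence. This rule is not described in this article and is used here only as an assumption; the function name `flag_outliers` is hypothetical. The readings are those given in Question 3 at the end of this article.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return the values lying outside Q1 - k*IQR and Q3 + k*IQR.

    The 1.5 x IQR fence is an assumed screening rule, not the article's method.
    """
    # method="inclusive" interpolates quartiles within the observed data
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


readings = [100.9, 100.1, 10.3, 99.7, 101.1, 100.5]
suspects = flag_outliers(readings)
print(suspects)
```

A flagged value is only a candidate for cross-verification against the source records; the rule by itself does not establish that the value is an error.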
| Methods of Data Presentation|| |
Numerical data from a research study are needed to present the original study results, to support an argument, or to compare or analyze historical data. The complexity and bulk of data make them tedious, monotonous and time-consuming for a reader to interpret and remember. Hence, the data are displayed graphically and diagrammatically for compactness, easier understanding, and presentation.
There are three ways to present numerical information in an article or presentation:
- To include it in the main body of the text
- Tabular presentation
- In the form of graphs or charts
Conventionally, when there are only two values to compare, they are included in the main body of the text. For example, if the pass percentage in an examination is 76% and failure percentage is 24%, this is simple to comprehend. Any data with three or more values are better represented in table or graph form. For example, in an office, the percentage of individuals commuting by car, bike or public transport is 18%, 33%, and 49%. This information is better represented in table form [Figure 1].
|Figure 1: Percentagewise commuting methods at an office depicted serially as table, pie chart, and bar chart|
How to choose between table and graph
Most data are initially collected and stored in tabular form before they are analyzed. At the time of publication or presentation, it is decided whether they will be better depicted in table form or graphic form. A table is more appropriate for unrelated variables, for small amounts of data, or in situations where the individual values are more important than the overall trend, the latter being better depicted by a graph or chart. Minor differences in values are better depicted in a table. For example, values 14.3, 14.22, 14.43, 14.37, and 14.10 are better depicted in a table, whereas values 4, 7 and 3 are better represented in a graph [Figure 2]. Table form is also important when one or two values are very low or very high (outliers) but are equally important. In graph form, they may appear unimportant, for example, when discussing the age-wise incidence of non-Hodgkin lymphoma, the numbers of patients in the age-groups of 0–20 years and >80 years will be very small but, nevertheless, important. Graphs are important as they can transform complex data into a simple, catchy visual image. A large number of values can be depicted more easily and effectively in a graph.
|Figure 2: Table is more suitable than graph in the variables where individual values are more important than overall trend|
Types of graphs
- Bar charts: These are the most commonly used graphs to depict data regarding multiple unrelated variables. The X-axis represents discrete entities or variables, hence there is no scale. Bar charts are most suitable for discrete variables [Figure 1]
- Segmented column chart: This very commonly used chart is similar to a bar chart except that a third parameter can be added. One axis shows the discrete values of a variable (e.g., gender) and the other axis shows the percentage. Each bar is divided into stacked segments that give a percentage-wise comparison of another parameter across its categories (e.g., blood groups) [Figure 3]
- Histograms: This is a type of bar chart where the X-axis represents a continuous variable rather than discrete categories, for example, the birth weight record of newborns in a hospital. The data can be grouped into intervals to simplify the graph, e.g., the age-wise distribution of a cancer type in a population, which can be stratified into 0–10 years, 11–20 years, 21–30 years and so on [Figure 4]. Histograms are more suitable for continuous variables
- Pie chart: This is a visual representation of distribution of total data between the categories. A pie chart is suitable for depicting percentages of a few related variables, e.g., percentages of Hindus, Muslims, Sikhs, Buddhists, and other religions in India. As described earlier, it is a better visual depiction of trend or relative proportions rather than actual figures [Figure 1]
- Line graph: This is a linear graphical presentation of a change in a variable over time or along a dimension, e.g., partial oxygen pressure along the increasing depth in the sea or change in the incidence of cervical cancer over last 5 decades [Figure 5]
- Dot plot: This is one of the simplest plots, where univariate data are plotted in the form of small filled circles representing each participant individually. It is used for a small- to moderate-sized dataset (say <50 values). It can be used for continuous as well as discrete data, for example, the number of children of each of 22 employees in an office [Figure 6]
- Scatter plot: This is a graphical representation of pairs of quantitative measurements for an individual or an object, e.g., a graph of the relationship between price and sale of ten brands of a drug in a month [Figure 7]
- Stem and leaf plot: This is a histogram rotated by 90 degrees. Stems and leaves are divided by a vertical line. The stem represents the first few digits, arranged in order, and the leaf represents the next one or two digits, arranged in the same order horizontally. The stem depicts the groups of observations and the leaves display the individual observations within each group. For example, the values 26, 20, 28, 13, 10, 18, 13, 15, 48, 16, 15, 5, 18, 16, 28 can be displayed this way [Figure 8]
- Box and Whisker plot: This is a rectangle representing the interquartile range (explained in next section). In a horizontal plot, the left end of the box is 25th percentile and the right end represents 75th percentile. A vertical line in between is the marker of median [Figure 9].
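The stem and leaf construction described in the list above can be sketched in a few lines. This is a minimal sketch that assumes non-negative whole numbers, uses the tens digit as the stem and the units digit as the leaf, and prints text rather than drawing the figure; the function name `stem_and_leaf` is hypothetical. The values are the ones listed in the stem and leaf example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return text rows of 'stem | leaves' for non-negative integers."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)  # stem = tens, leaf = units
    rows = []
    # Include empty stems so that gaps in the data remain visible
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows


data = [26, 20, 28, 13, 10, 18, 13, 15, 48, 16, 15, 5, 18, 16, 28]
plot = stem_and_leaf(data)
print("\n".join(plot))
```

Read sideways, the rows have the same shape as a histogram with a class width of 10, while every original value can still be recovered from its stem and leaf.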
|Figure 3: A segmented (stacked) column chart of genderwise percentage distribution of human blood groups in a study|
|Figure 4: Histogram showing age-wise distribution of a cancer type in a population group|
|Figure 5: The rising incidence of a cancer type from year 2011 to 2018 depicted as a line plot|
|Figure 6: A dot plot of number of children of each of 22 employees in an office. X-axis shows number of children and dots represent number of employees in each category|
|Figure 9: Box and Whisker plot of dataset of heart rate (in beats per minute) of 20 young participants at rest (60, 73, 54, 94, 110, 84, 55, 74, 78, 82, 94, 89, 68, 70, 92, 86, 102, 82, 74, 79)|
| Measures of Central Tendency|| |
- Any data collected need to be summarized in the most precise and concise way to communicate the results effectively. One way to do this is by tables or graphs. The second way is to condense the data by providing measures that describe their important characteristics. The most suitable measure is chosen after looking at the individual numerical values. The idea is to have the best central representative value and a measure of the spread of the data on both sides of it
- Arithmetic mean: This is obtained by adding all the individual values and then dividing the sum total by the total number of values. Taking the dataset in [Figure 9], arithmetic mean is 80
- Median: On arranging the data in ascending order, the central value is the median, with an equal number of values above and below it. If the number of values is even, then the median is the arithmetic mean of the central 2 values. The median is equal or close to the mean if the data are symmetrical, for example, the median is 80.5 in the [Figure 9] dataset against a mean of 80, indicating the symmetrical nature of the data. The median is less than the mean if the data are skewed to the right (a tail of a few large values pulls the mean upward) and more than the mean if the data are skewed to the left (a tail of a few small values pulls the mean downward)
- Mode: This is the most frequently occurring value in the dataset. This measure is rarely used in practice. In [Figure 9], the values 82, 94, 74, each appeared 2 times
- Geometric mean: The arithmetic mean is an inappropriate representation of skewed data. To make the spread of the data more symmetrical, we can take the logarithmic values of the original data. The antilog of the arithmetic mean of these transformed data is called the geometric mean, which is closer to the “median” of the original data. In the [Figure 9] data, the geometric mean is 78.67, which is not of much significance, as the data are symmetrical
- Weighted mean: When certain values in the data have different (more or less) significance than the others, weight is attached to the individual values while calculating the arithmetic mean. For example, a student obtains 80% marks in internal assessment, 90% marks in sports, 50% in final theory and 60% in viva. Arithmetic mean will be 70%, whereas weighted mean will be 61% [Table 1].
|Table 1: Table depicting the difference between arithmetic mean and weighted mean|
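The measures of central tendency described above can be checked on the [Figure 9] heart rate dataset with Python's standard `statistics` module. This is a minimal sketch. The weights used for the student example are hypothetical, chosen only for illustration, because the weights from [Table 1] are not reproduced in the text.

```python
import statistics

# Resting heart rates (beats per minute) from the Figure 9 caption
heart_rates = [60, 73, 54, 94, 110, 84, 55, 74, 78, 82,
               94, 89, 68, 70, 92, 86, 102, 82, 74, 79]

mean = statistics.mean(heart_rates)
median = statistics.median(heart_rates)
modes = sorted(statistics.multimode(heart_rates))  # all most-frequent values
geo_mean = statistics.geometric_mean(heart_rates)  # antilog of the mean of logs

# Student example: internal assessment, sports, final theory, viva
marks = [80, 90, 50, 60]
weights = [15, 10, 50, 25]  # HYPOTHETICAL weights, not taken from Table 1
plain_mean = sum(marks) / len(marks)
weighted_mean = sum(m * w for m, w in zip(marks, weights)) / sum(weights)

print(mean, median, modes, round(geo_mean, 2))
print(plain_mean, weighted_mean)
```

With these assumed weights, the heavily weighted final theory mark pulls the weighted mean below the arithmetic mean, which is the effect the student example describes.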
| Measures of Spread of the Data|| |
- Range: The difference in the magnitude of the smallest and the largest value in the dataset. In dataset of [Figure 9], the range is 56
- Percentile: Consider arranging the data in ascending order and mapping them onto a scale of 1–100. The value of the original data that has 10% of observations below it is the 10th percentile, and the value with 10% of observations above it is the 90th percentile. Divisions into tenths are called deciles and divisions into quarters are called quartiles. The interquartile range is the difference between the 1st quartile (72.25 in [Figure 9]) and the 3rd quartile (89.75 in [Figure 9]), i.e., 17.5, and is often depicted in the Box and Whisker plot. The interdecile range contains the observations between the 10th and 90th percentiles
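- Variance: This is a measure of how far each observation lies from the arithmetic mean. It is calculated as the mean of the squared differences between each observation and the arithmetic mean; squaring ensures that negative and positive differences do not cancel each other out. The value is expressed in the square of the unit of the original data. In [Figure 9], the variance is 206.6 (the sum of squared differences, 4132, divided by the 20 observations)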
- Standard deviation (SD): This is the square root of the variance and is expressed in the same unit as the original data. It indicates the typical distance of individual observations from the arithmetic mean. SD divided by the arithmetic mean and expressed as a percentage is called the “coefficient of variation.” SD is most meaningful for normally distributed (i.e., symmetrical) data. One SD refers to “a range of 1 SD above and below the mean” (±1 SD). In a normal distribution, ±1 SD includes about 68.3% of observations, ±2 SD about 95.4% and ±3 SD about 99.7%. In [Figure 9], SD is 14.37, which is the square root of 206.6, the variance in this case.
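The measures of spread quoted above for the [Figure 9] dataset can be reproduced with Python's standard `statistics` module. This is a sketch under two assumptions: the quartiles use linear interpolation within the data (`method="inclusive"`), and the variance divides by the number of observations (`pvariance`). Both choices match the figures quoted in this section; other conventions give slightly different numbers.

```python
import statistics

# Resting heart rates (beats per minute) from the Figure 9 caption
heart_rates = [60, 73, 54, 94, 110, 84, 55, 74, 78, 82,
               94, 89, 68, 70, 92, 86, 102, 82, 74, 79]

data_range = max(heart_rates) - min(heart_rates)

# Quartiles by linear interpolation between neighbouring sorted values
q1, q2, q3 = statistics.quantiles(heart_rates, n=4, method="inclusive")
iqr = q3 - q1

variance = statistics.pvariance(heart_rates)  # divides by n
sd = statistics.pstdev(heart_rates)           # square root of the variance
cv_percent = 100 * sd / statistics.mean(heart_rates)

print(data_range, q1, q3, iqr)
print(round(variance, 1), round(sd, 2), round(cv_percent, 1))
```

Many software packages report the sample variance by default, which divides by n − 1 (`statistics.variance` in Python) and would give about 217.5 for these data, so it is worth stating which convention was used.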
| Data Distribution|| |
- Frequency distribution: This is a method of categorizing all categorical or discrete values from a dataset into each possible observation, in the form of a table or a graph. Replacing each observed frequency by a relative frequency (percentage of total events) allows us to compare one dataset with another similar dataset [Figure 10].
- Normal distribution (the Gaussian distribution) is a bell-shaped, symmetrical probability distribution curve in which values near the mean occur more frequently than values towards either end [Figure 11].
- Data distribution graphs: Displaying the frequency distribution can give us unimodal (single peak), multimodal (multiple peaks), uniform (straight line), symmetrical (normal distribution) or asymmetrical (skewed distribution) graphs. [Figure 4] is an example of asymmetrical, negatively skewed histogram with left sided tail
- Probability is a term relating to the chance of occurrence of an event. Its value always lies between 0 and 1. Zero means that the event cannot occur and one means that the event must occur
- Probability distribution: Instead of actually observed values, it shows the theoretical probabilities of all possible observations. Similar to the frequency distribution, a probability distribution is described by terms like mean, variance and standard deviation. Depending on the type of variable, it can be continuous or discrete. Various probability distribution curves are possible according to the data studied. For continuous data, examples are the normal and t distributions (symmetrical) and the Chi-squared, F and log-normal distributions (asymmetrical). In all such graphs, the total probability is represented by the area under the whole curve, which is one (100%), and the probability of an observation lying between two theoretical values is calculated as the “area under the curve” between those two values. For discrete data, binomial and Poisson distributions are used. Here, the sum of all probabilities is “one.” Categorizing the data into such distribution patterns helps in applying the appropriate statistical tests to derive the most relevant interpretations.
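The idea of probability as “area under the curve” can be made concrete with a short sketch. It uses `statistics.NormalDist` from the Python standard library for a standard normal curve (mean 0, SD 1) and, for the discrete case, a binomial distribution whose parameters (10 trials, success probability 0.3) are hypothetical values chosen only for illustration.

```python
from math import comb
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_between(dist, low, high):
    """Probability that an observation lies between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# Area under the normal curve within 1, 2 and 3 SD of the mean
coverage = {k: area_between(standard_normal, -k, k) for k in (1, 2, 3)}
for k, prob in coverage.items():
    print(f"within ±{k} SD: {prob:.1%}")

# Discrete case: binomial probabilities of 0..n successes (hypothetical n, p)
n, p = 10, 0.3
binomial_pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
total_probability = sum(binomial_pmf)
print(round(total_probability, 6))
```

The three areas correspond to the familiar coverage of ±1, ±2 and ±3 SD, and the binomial probabilities add up to one, as stated for discrete distributions.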
|Figure 10: An example of frequency distribution table for dataset in Figure 9|
|Figure 11: A symmetrical frequency distribution of dataset from Figure 9|
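A frequency distribution with relative frequencies, as described in this section, can be built for the [Figure 9] heart rate dataset as sketched below. The class width of 10 beats per minute is an assumption made for illustration; the grouping actually used in [Figure 10] is not stated in the text. The function name `frequency_table` is hypothetical.

```python
from collections import Counter

# Resting heart rates (beats per minute) from the Figure 9 caption
heart_rates = [60, 73, 54, 94, 110, 84, 55, 74, 78, 82,
               94, 89, 68, 70, 92, 86, 102, 82, 74, 79]

CLASS_WIDTH = 10  # ASSUMED class width; Figure 10 may group differently


def frequency_table(values, width):
    """Return rows of (class start, class end, frequency, relative frequency %)."""
    counts = Counter((v // width) * width for v in values)
    total = len(values)
    return [(start, start + width - 1, counts[start], 100 * counts[start] / total)
            for start in sorted(counts)]


table = frequency_table(heart_rates, CLASS_WIDTH)
for low, high, count, percent in table:
    print(f"{low}-{high}: {count} ({percent:.0f}%)")
```

Because the last column is a percentage of the total, it can be compared directly with the same column from another dataset of a different size.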
Let's test what we learnt:
Q 1. Which one of these is not a categorical variable?
- Blood group
- Severity of anemia (mild/moderate/severe)
- Number of blood transfusions per month in a thalassemia patient.
Q 2. Choose the discrete variable
- Quarterly HbA1c values for a diabetic patient
- Random serum prostate-specific antigen levels of a population group
- Number of fever episodes per month in a patient
- Weekly serum creatinine of a patient on maintenance hemodialysis.
Q 3. The most logical decision on “outlier” in this data should be
100.9, 100.1, 10.3, 99.7, 101.1, 100.5
- This is the most important value
- This needs to be re-verified
- Tabular presentation should be preferred
- This value should be ignored.
Q 4. The best form of graphical representation of the Pulse and Respiratory rate of a group of patients will be
- Bar chart
- Scatter plot
- Line Chart.
Q 5. In this dataset
40, 20, 40, 50, 30, 20, 20, 30, 50, 10, 20, 20, 40
- Median is same as the mode
- The data is symmetrical
- Arithmetic mean is same as the mode
- Median is less than the arithmetic mean.
Q 6. In a box and whisker plot, the line representing the 50th percentile
- Is the median
- Should be in the center of the box
- Is the arithmetic mean
- Is the weighted mean.
Q 7. Choose the incorrect option.
In a symmetrically distributed data
- Interquartile range contains less observations than 1 SD
- Interdecile range will contain less observations than 2 SD
- Median is in the center of interdecile range
- Observations in the first five percentiles are not included in 2 SD.
Q 8. Which of the following is applicable only to symmetrical data distribution?
Q 9. In a Chi-squared distribution, the best measure of the probability of an occurrence is
- Standard deviation
- Area under the curve
- All the above.
Q10. In Gaussian distribution, area under the curve is
- Cannot be predicted.
Answers: 1 (d), 2 (c), 3 (b), 4 (c), 5 (b), 6 (a), 7 (d), 8 (a), 9 (c), 10 (a)
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
| References|| |
Chakraborty S. A step-wise guide to performing survival analysis. Cancer Res Stat Treat 2018;1:41-5.
Bhattacharjee A, Vishwakarma GK, Banerjee S. A Bayesian approach for dynamic treatment regimes in the presence of competing risk analysis. Cancer Res Stat Treat 2018;1:51-7.
Dessai S, Simha V, Patil V. Stepwise cox regression analysis in SPSS. Cancer Res Stat Treat 2018;1:167-70.
Dessai S, Patil V. Testing and interpreting assumptions of COX regression analysis. Cancer Res Stat Treat 2019;2:108-11.