More ways to summarize data

2021-06-08 · 8 min read · Data Science

Locations

Often it is not possible or feasible to display plots to express the features of data. In such cases we need other ways to summarize it. If you recall from our last article, histograms mostly help us understand how many observations fall in which bin, the minimum, the maximum, the most frequently occurring values, or how many observations lie below or above a given value. For example, do most students weigh below 60 kg? Without plotting, some of this information can be conveyed by non-graphic means. These are measures of location and shape.

Central Location

If you have the weights of 1000 students and you want to convey all of them as one number, a central location is the way to go. In other words, a central location is a single value used to represent a whole sample or population (for example, the weights of 1000 students).

The most widely used central location is the average, or mean. To calculate the mean, all the values are added up and the sum is divided by the number of values.

$$ \mu =\frac{\sum_{i = 1}^{n}{x_i}}{n} $$

where $\mu$ is the mean or average, $x_i$ are the values and $n$ is the number of values.

For example, if there are 10 numbers 1,2,3,2,5,4,3,6,5 and 6, their average equals

$$ \frac{(1+2+3+2+5+4+3+6+5+6)}{10} = 3.7 $$

In R, there is a function called mean to calculate this. Code to calculate the mean of the above mentioned numbers is shown below.

mean(c(1,2,3,2,5,4,3,6,5,6))

## [1] 3.7

If we add another number, or a few numbers, that are not very far from the existing numbers in the series, the average stays close to what it was. Below, I have added three more numbers: 2, 4 and 7.

mean(c(1,2,3,2,5,4,3,6,5,6,2,4,7))

## [1] 3.846154

The new average or mean is not very far away from the average calculated earlier.

However, things may change if we add a number that is somewhat far away from the existing numbers in the series.

mean(c(1,2,3,2,5,4,3,6,5,6,12))

## [1] 4.454545

In this case we added just one number (12) and the average shot up from 3.7 to 4.45. So the presence of a single very high or very low value can make the average a poor indicator of central location. This happens quite often, which is why the average is not considered very robust, and one must take care when using it.

Another measure of central location, which is more robust than the average, is the median. It is the number in the middle once the numbers are ordered.

If there is an odd number of observations, it equals

$$ median(x)=x_{(n+1)/2} $$

where the subscript refers to the position in the ordered data. So if there are 5 numbers 1,2,3,2 and 5, we first order them to get 1,2,2,3,5. There are 5 numbers, so n=5, n+1=6 and (n+1)/2=3. The median is therefore the third number, which is 2.

If there is an even number of observations, it equals

$$ median(x)=\frac{x_{n/2}+x_{(n/2)+1}}{2} $$

So if we have 6 numbers 1,2,3,2,5 and 4, we first order them to get 1,2,2,3,4,5. There are 6 numbers, so n=6, n/2=3 and (n/2)+1=4. The 3rd and 4th numbers are 2 and 3, so the median is (2+3)/2=2.5.
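The two cases can be folded into one small function. This is only a sketch that follows the formulas literally; the name `median.manual` is made up for this illustration, and R's built-in function is what you would normally use.

```r
# Hypothetical helper that mirrors the odd/even formulas above
median.manual <- function(x) {
  x.sorted <- sort(x)            # order the values first
  n <- length(x.sorted)
  if (n %% 2 == 1) {
    x.sorted[(n + 1) / 2]        # odd n: the single middle value
  } else {
    (x.sorted[n / 2] + x.sorted[n / 2 + 1]) / 2   # even n: mean of the two middle values
  }
}

median.manual(c(1,2,3,2,5))      # 2
median.manual(c(1,2,3,2,5,4))    # 2.5
```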

In R, there is a function called median to calculate these.

median(c(1,2,3,2,5,4))

## [1] 2.5
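To see why the median is called more robust, it may help to put both measures side by side. The sketch below reuses the ten numbers from the mean example and adds one extreme value; the value 120 is chosen here purely for illustration.

```r
x <- c(1,2,3,2,5,4,3,6,5,6)
x.out <- c(x, 120)               # same numbers plus one extreme value (illustrative)

c(mean = mean(x), median = median(x))          # mean 3.7, median 3.5
c(mean = mean(x.out), median = median(x.out))  # mean about 14.27, median 4
```

The mean is pulled far away from the bulk of the data, while the median only moves to the next ordered value.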

What if the variable is not numeric? How do we measure central location then? We use the mode, which is the value that occurs the highest number of times. It is valid for numeric variables as well: we can count the number of times each number occurs, or we can create bins (for example, 1-5, 6-10 and so on) and count the occurrences in each bin.

If we have the numbers 1,2,3,2,5 and 4, then 2 occurs the highest number of times, so 2 is the mode. If we have the values a,b,c,a,c,a,b and a, the mode is a because it occurs the highest number of times. There can be more than one mode in a sample.

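There is no inbuilt function to calculate the statistical mode in R (the function named mode returns the storage type of an object instead). However, you can write your own.

mode.cal <- function(x) {
   unique.x <- unique(x)   # distinct values, in order of first appearance
   # count how often each distinct value occurs and return the most frequent one
   # (if several values tie, only the first of them is returned)
   unique.x[which.max(tabulate(match(x, unique.x)))]
}

mode.cal(c(1,2,3,2,5,4))

## [1] 2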
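Because `which.max` stops at the first maximum, the function above reports only one mode even when several values tie. A possible variant that returns every mode is sketched below; `mode.all` is a hypothetical name, and since it reads the values from the names of a frequency table, it returns them as character strings.

```r
# Hypothetical variant: return all values that share the highest count
mode.all <- function(x) {
  counts <- table(x)                                # frequency of each distinct value
  names(counts)[as.vector(counts) == max(counts)]   # keep every value with the top count
}

mode.all(c("a","b","c","a","c","a","b","a"))   # "a"
mode.all(c(1,2,2,3,3))                         # "2" "3"
```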

There are two other measures of central location often used.

Geometric mean is used to understand the central location of growth rates. For example, if the price of a share has increased by 1%, 1.5% and 4% in three consecutive months, the average monthly growth is given by the geometric mean of the growth factors 1.01, 1.015 and 1.04.

$$ GM=\sqrt[n]{x_1\times x_2\times x_3 \times \dots \times x_n} $$

The other is the weighted mean, which is used when different observations carry different weights. For example, suppose someone charges 100 per hour for a 10-hour job, 80 per hour for a 6-hour job and 20 per hour for a 2-hour job. Then the average cost per hour is given by

$$ \frac {(100 \times 10) + (80 \times 6) + (20 \times 2)}{10+6+2} \approx 84.44 $$

The formula is

$$ \text{Weighted } \bar{x} = \frac {\sum{f_i \times x_i}}{\sum f_i} $$

where $x_i$ are the values and $f_i$ are their weights.

Geometric mean is calculated in R like this:

# growth of 1% means the new value is 1.01 times the old one (i.e. 1 + 1%)
x <- c(1.01,1.015,1.04) # stock price increased by 1%, 1.5% and 4%

exp(mean(log(x)))

## [1] 1.021583
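The log-based expression is just another way of writing the n-th root of the product. The sketch below checks that both forms agree and converts the result back into a percentage; the variable names `gm.direct` and `gm.log` are introduced here for illustration only.

```r
x <- c(1.01, 1.015, 1.04)

gm.direct <- prod(x)^(1 / length(x))   # n-th root of the product, as in the formula
gm.log <- exp(mean(log(x)))            # log form used above

all.equal(gm.direct, gm.log)           # TRUE
(gm.direct - 1) * 100                  # average monthly growth in percent, about 2.16
mean(x)                                # arithmetic mean of the factors, slightly higher
```

The log form is often preferred for long series because multiplying many factors directly can overflow or underflow.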

Weighted mean is calculated in R using the weighted.mean function. It takes two vectors as input: the first holds the values and the second holds the weights. Here cost is treated as a cost per hour and hour as the number of hours.

cost <- c(550,420,800)
hour <- c(8,6,2)

weighted.mean(cost, hour)

## [1] 532.5
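The same number can be reproduced directly from the weighted mean formula, which is a quick way to check what the function is doing. The sketch below reuses the same two vectors; `manual` is just an illustrative name.

```r
cost <- c(550, 420, 800)
hour <- c(8, 6, 2)

manual <- sum(hour * cost) / sum(hour)   # sum(f_i * x_i) / sum(f_i)
manual                                   # 532.5
all.equal(manual, weighted.mean(cost, hour))   # TRUE
```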

Non-Central Locations

Central locations provide a representation of the observations. However, to understand the properties of the distribution you often need more than a central number. For instance, what is the value below which the lowest 25% of the observations lie? We use quartiles and percentiles in such cases. Quartiles are the values that divide ordered data into four equal parts (based on the number of observations).

25% of the observations lie between the lowest value and the first quartile (Q1), 25% more lie between Q1 and Q2, and so on. Q2 marks the observation in the middle, or the 50th percentile. Hence, it is the median.

[Image: https://i.imgur.com/gfXdYQu.png]

Quartiles can be calculated in R using the function quantile. It takes the vector of values as the first parameter. The second parameter, probs, specifies the percentile as a fraction. For Q1, you enter 0.25 (25%). You can also provide a vector of several percentiles, for example c(0.25,0.5,0.75).

quantile(c(1,2,3,4,5,6), c(0.25,0.5,0.75))

##  25%  50%  75% 
## 2.25 3.50 4.75

You can find the value at any percentile (0 to 100) this way. The 25th percentile is Q1, the 50th is Q2 and so on.
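As a small sketch of this, the snippet below asks for the 10th and 90th percentiles of the same six numbers and confirms that the 50th percentile matches the median. Note that by default quantile interpolates between neighbouring observations, which is why results such as 2.25 need not be values that actually occur in the data.

```r
x <- c(1,2,3,4,5,6)

quantile(x, c(0.10, 0.90))              # 10th and 90th percentiles: 1.5 and 5.5
unname(quantile(x, 0.5)) == median(x)   # TRUE: Q2 is the median
```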

Spread

Spread or dispersion indicates the extent to which the observations are scattered. It influences our confidence in a central location measure. If the average of a set of observations is 20 and all the observations lie between 19 and 21, we can be very confident that the average represents the set well. However, if the observations range from -20 to 100, that confidence becomes much weaker.

Range gives the minimum and maximum values (both inclusive) between which all the observations lie. We can use the range function in R to calculate this.

range(c(1,2,3,4,3,5,3,6))

## [1] 1 6

Variance indicates the (squared) distance of the observations from the mean. It is roughly the average of the squared differences between the observations and the mean; for a sample, the sum is divided by n-1 rather than n.

$$ s^2 = \frac {\sum (x_i - \bar {x} )^2}{n-1} $$

where $s^2$ is the variance, $\bar x$ is the mean and $n$ is the number of observations.
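To see the formula at work, the sketch below computes the variance step by step for the numbers used in the range example and compares the result with R's built-in var and sd functions. The names `dev.sq` and `var.manual` are introduced here only for illustration.

```r
x <- c(1,2,3,4,3,5,3,6)

dev.sq <- (x - mean(x))^2                     # squared distance of each value from the mean
var.manual <- sum(dev.sq) / (length(x) - 1)   # divide by n - 1, as in the formula

all.equal(var.manual, var(x))                 # TRUE
all.equal(sqrt(var.manual), sd(x))            # TRUE: the square root is the standard deviation
```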

The square is used to avoid negatives, so that deviations above and below the mean do not cancel out. When we take the square root of the variance, we get the standard deviation. In R, variance is calculated using the var function and standard deviation using the sd function.

Now suppose that you have two samples and you want to understand which one has the higher variance. One sample is in m and the other is in cm. To make the example clear, let us take equal values: we have a sample in cm, and we create another sample in m by converting the same values from cm to m.

in.cm <- c(100,200,300,400)
in.m <- c(1,2,3,4)
c(sd.cm=sd(in.cm), sd.m=sd(in.m))

##      sd.cm       sd.m 
## 129.099445   1.290994

Interesting! Same values, but different standard deviations because of the difference in units. With a small number of samples or convertible units this can still be managed. But with units that cannot be converted into each other (for example kg and m) it becomes an issue. To compare the spread of such samples, we use the coefficient of variation. It is calculated by dividing the standard deviation by the mean, and is often expressed as a percentage.

Let us calculate it for the above example.

c(cov.cm=sd(in.cm)/mean(in.cm), cov.m=sd(in.m)/mean(in.m))

##    cov.cm     cov.m 
## 0.5163978 0.5163978

Both have exactly the same coefficient of variation, about 0.516 or 51.6%!
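The ratio of standard deviation to mean can be wrapped in a small helper so that samples in unrelated units can be compared directly. In the sketch below, `cv` is a hypothetical function name and the weight and height values are made up purely for illustration.

```r
# Hypothetical helper: standard deviation relative to the mean, in percent
cv <- function(x) sd(x) / mean(x) * 100

weight.kg <- c(52, 58, 61, 67, 72)            # made-up weights in kg
height.m <- c(1.55, 1.62, 1.68, 1.74, 1.80)   # made-up heights in m

c(cv.weight = cv(weight.kg), cv.height = cv(height.m))
```

Because the units cancel out in the ratio, the two results can be compared even though kg and m cannot be converted into each other.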

Skewness

Skewness indicates the shape of a unimodal¹ distribution of a numerical variable. The three types are described in the image below.

[Image: https://s3.amazonaws.com/libapps/accounts/73082/images/Skeweness.jpg]