Understand The Data using Simple Measures

In our daily lives, we collect and deal with many kinds of data. Data exists in almost every business nowadays, and it is one of the business's assets. In the health industry, for example, you may need to know the number of appointments, the number of doctors per specialty, staff performance, bed occupancy, the mortality rate, the busiest day, revenues, etc.

In this article, we will define the types of data we deal with every day and the measures of center and spread, which are powerful tools for understanding data in a simple way.

There are dozens of ready-made tools and applications that calculate these measures easily, but I believe it is important to know how to calculate them, how to use them, and why.

What are the types of data?

Any data can be placed in one of two categories: quantitative or categorical.

Quantitative Data

Quantitative data has numeric meaning and is used in calculations, for example, number of employees, number of sales, etc. Quantitative data is either continuous or discrete.

Continuous data is measurable, can take infinitely many values, and can be broken into smaller units, for example, age, weight, or height.

Discrete data is countable, usually finite, and cannot be broken into smaller units, for example, number of customers or number of employees.

Categorical Data

Categorical data describes a quality or characteristic of something without numerical or quantitative values, for example, movie ratings, education levels, nationalities, etc. Similarly, categorical data is either ordinal or nominal.

Ordinal data has a meaningful order, for example, movie ratings, education levels, age levels, etc.

Nominal data cannot be ordered in any meaningful way, so the values are just labels, for example, countries, gender, colors, etc.

The table below summarizes examples of the different types of data.

Quantitative Data

Continuous: Age – Weight – Height – Blood pressure – Speed

Discrete: Number of customers – Number of employees – Days in a week – Days in a year

Categorical Data

Ordinal: Movie ratings – Education levels – Age levels – Height levels – Agreement levels – Satisfaction levels

Nominal: Countries – Gender – Colors – Blood groups – Industry group

Table: Types of data examples

Measures of Data Center

Measures of center are used mainly with quantitative data because they require numeric values, as explained above. There are three measures of center: the mean, the mode, and the median. Each of them helps us understand our data.

To illustrate them, let us assume that we have the number of sales per month, as shown in the graph below.

These are the values by month:

[Jan = 5, Feb = 6, Mar = 10, Apr = 11, May = 1, Jun = 2, Jul = 5, Aug = 10, Sep = 5]

The number of values, which is the dataset size, is 9.

[Figure: Sales per month – line chart]

Mean

The mean is the average value of the data. Also known as the “arithmetic mean,” it is calculated by adding all the values in a dataset and dividing the sum by the number of values.

Example:

Using the same example, we may want to know the average sales across all months. We can calculate it from the values we plotted:

[5, 6, 10, 11, 1, 2, 5, 10, 5]

Number of values = 9

Mean = (5 + 6 + 10 + 11 + 1 + 2 + 5 + 10 + 5) / 9 = 55 / 9 ≈ 6.11
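As a quick check, the same calculation can be sketched in Python. The variable name `sales` is only an illustrative label for the monthly values above; `mean` comes from Python's standard `statistics` module.

```python
from statistics import mean

# Monthly sales from the example (Jan to Sep)
sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]

total = sum(sales)           # sum of all values
count = len(sales)           # number of values (dataset size)
manual_mean = total / count  # sum divided by the number of values

print(total, count)            # 55 9
print(round(manual_mean, 2))   # 6.11
print(round(mean(sales), 2))   # 6.11, same result from the standard library
```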

Mode

The mode is the value that appears most often in a set of numbers. It is used less often than the mean and the median in statistical calculations.

Example:

Similarly, using the same example, we may want to know the mode of the sales across all months. The mode is useful mainly when the data contains abnormal values. For example, if one month performed unusually well, say October with sales of 100, the mean may no longer be the best representation of the data.

[5, 6, 10, 11, 1, 2, 5, 10, 5]

If we sort the data, it looks like this: [1, 2, 5, 5, 5, 6, 10, 10, 11]

We can see that the most repeated value is 5, which appears 3 times in the dataset.

Mode = 5

You do not have to sort the data to find the mode, but sorting makes it easier to spot. If all values appear the same number of times, there is no mode. If two or more values share the highest number of appearances, the dataset has more than one mode.
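A minimal sketch of counting appearances in Python, using `Counter` and `multimode` from the standard library. The list name `sales` and the second, tied list are illustrative only.

```python
from collections import Counter
from statistics import multimode

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]

# Count how many times each value appears
counts = Counter(sales)
print(counts[5])   # 3

# multimode returns every value that shares the highest count
modes = multimode(sales)
print(modes)       # [5]

# Two values tied for the highest count give two modes
tied_modes = multimode([1, 1, 2, 2, 3])
print(tied_modes)  # [1, 2]
```

Note that when every value appears equally often, `multimode` returns all of the values, whereas this article treats that case as having no mode.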

Median

The median is the middle number in a sorted dataset. The median also helps us see how the data is distributed: if the median and the mean are equal, the distribution is roughly symmetric (the normal distribution is one example); if they are not equal, the data is probably left-skewed or right-skewed.

Skew occurs when very high or very low values pull the mean away from the median: very high values pull the mean above the median (right-skewed), and very low values pull it below the median (left-skewed). See the image below.

[Figure: Distribution shapes]
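To see how a single extreme value pulls the mean away from the median, here is a small Python sketch. It reuses the monthly sales and the hypothetical October value of 100 mentioned earlier; the variable names are illustrative.

```python
from statistics import mean, median

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]
sales_with_october = sales + [100]  # add the unusually high month

mean_before, median_before = mean(sales), median(sales)
mean_after, median_after = mean(sales_with_october), median(sales_with_october)

print(round(mean_before, 2), median_before)  # 6.11 5
print(mean_after, median_after)              # 15.5 5.5

# The mean jumps far above the median, which suggests a right-skewed dataset
print(mean_after > median_after)             # True
```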

Example:

Using the same example values.

[5, 6, 10, 11, 1, 2, 5, 10, 5]

Here, you need to sort the data first. The sorted list looks like this:

[1, 2, 5, 5, 5, 6, 10, 10, 11]

Median = 5

The middle number is 5, which is easy to find because the number of values is odd. If the number of values is even, we simply take the average of the two middle values. For example, if we add 100 for October to the end of the sorted list, it becomes:

[1, 2, 5, 5, 5, 6, 10, 10, 11, 100]

The middle values are [5, 6]

Median = (5 + 6) / 2 = 5.5
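The odd and even cases can be sketched in one small Python function. The name `middle_value` is hypothetical; the result is compared against `statistics.median` from the standard library.

```python
from statistics import median

def middle_value(values):
    """Median by hand: sort, then take the middle value (or average the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd size: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even size: average of the two

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]

print(middle_value(sales))          # 5
print(middle_value(sales + [100]))  # 5.5
print(median(sales + [100]))        # 5.5, same result from the standard library
```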

Measures of Data Spread

The measures of spread are also used with quantitative data. They are valuable for knowing the range of our data and how it is distributed.

Range

The range of a dataset is the difference between its largest and smallest values. It shows how widely the data is spread.

Example

Using the same example of our sorted data

[1, 2, 5, 5, 5, 6, 10, 10, 11]

The minimum value is 1 and the maximum value is 11

Range = 11 – 1 = 10
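A short Python sketch of the range, using the built-in `min` and `max` functions on the same values (the variable names are illustrative):

```python
sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]

smallest = min(sales)
largest = max(sales)
data_range = largest - smallest  # largest value minus smallest value

print(smallest, largest, data_range)  # 1 11 10
```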

Interquartile Range

The interquartile range (IQR), also called the midspread or middle 50%, is a measure of variability based on dividing a dataset into quartiles. The quartiles divide a rank-ordered dataset into four equal parts. The values that separate the parts are called the first, second, and third quartiles, denoted Q1, Q2 (the median), and Q3.

IQR = Q3 – Q1

It is not that complex, but how do we calculate it?

Example

Using the same data, which has to be sorted:

[1, 2, 5, 5, 5, 6, 10, 10, 11]

First, we split our dataset into three parts based on the center.

Part 1: [1, 2, 5, 5]

Part 2: [5]

Part 3: [6, 10, 10, 11]

We are interested in Part 1 and Part 3, so we calculate the median of each part as an individual dataset. Because each part has an even number of values, we use the even-size method to calculate its median.

(Q1) Part 1 median = (2 + 5) / 2 = 3.5

(Q3) Part 3 median = (10 + 10) / 2 = 10

Therefore, we calculate the interquartile range (IQR) like this:

IQR = Q3 – Q1 = 10 – 3.5 = 6.5
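The median-of-halves steps above can be sketched in Python as follows. The function name `quartiles_by_halves` is hypothetical, and this is only one of several quartile conventions; other tools may interpolate differently and give slightly different values on other datasets.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method described in the text."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values before the middle position
    upper = ordered[half + n % 2:]   # skips the middle value when n is odd
    return median(lower), median(ordered), median(upper)

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]

q1, q2, q3 = quartiles_by_halves(sales)
iqr = q3 - q1

print(q1, q2, q3)  # 3.5 5 10.0
print(iqr)         # 6.5
```

The same function also works when the dataset has an even number of values, in which case no middle value is skipped and the data is simply split into two halves.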

This value is the interquartile range: the spread of the middle 50% of the data, which lies between these quartiles. To plot it, we use a boxplot, the most common graph for showing the range and the interquartile range.

[Figure: Boxplot]

As we can see, the boxplot shows the values we calculated in the example.

What if the dataset has an even number of values? The calculation is the same, but we split the dataset into two parts instead of three, because the median falls between two numbers in the original set. If we use the modified dataset:

[1, 2, 5, 5, 5, 6, 10, 10, 11, 100]

The parts look like this:

Part 1: [1, 2, 5, 5, 5]

Part 2: [6, 10, 10, 11, 100]

The plot looks like this:

[Figure: Boxplot 2]

Then Q1 = 5

Q3 = 10

IQR = 5

We will also notice that the value 100 now stands out: the boxplot flags it as an outlier, an abnormal value in the dataset.
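The article does not say which rule the boxplot uses to flag 100. A common convention, assumed here, is to treat any value more than 1.5 × IQR below Q1 or above Q3 as an outlier. A minimal Python sketch under that assumption, with illustrative variable names:

```python
data = [1, 2, 5, 5, 5, 6, 10, 10, 11, 100]

# Quartiles from the example above
q1, q3 = 5, 10
iqr = q3 - q1

# Assumption: the common 1.5 * IQR fences used by many boxplot tools
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(lower_fence, upper_fence)  # -2.5 17.5

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # [100]
```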

Variance

Variance is a measurement of the spread between numbers in a dataset. Like the standard deviation, which we will discuss later, it measures the variability of our data and how far it spreads from the center value, which is the mean.

It is calculated as the average of the squared differences from the mean.

Example

Using the same standard values

[5, 6, 10, 11, 1, 2, 5, 10, 5]

We calculated the mean as 6.11; let us round it to 6 for simplicity. To calculate the variance, we subtract the mean from each value in the dataset, square each difference, and sum the squared differences. Finally, we divide the sum by the dataset’s size to get the variance.

[Figure: Variance]
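The steps described above can be worked through in Python. This sketch uses the rounded mean of 6, as the text does, and then compares the result with `statistics.pvariance`, which uses the unrounded mean. The variable names are illustrative.

```python
from statistics import pvariance

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]
rounded_mean = 6  # 6.11 rounded, as in the text

# Subtract the mean from each value and square the difference
squared_diffs = [(x - rounded_mean) ** 2 for x in sales]
print(squared_diffs)       # [1, 0, 16, 25, 25, 16, 1, 16, 1]
print(sum(squared_diffs))  # 101

# Divide the sum by the dataset size
variance_rounded = sum(squared_diffs) / len(sales)
print(round(variance_rounded, 2))  # 11.22

# Population variance with the unrounded mean, for comparison
variance_exact = pvariance(sales)
print(round(variance_exact, 2))    # 11.21
```

Dividing by the dataset size gives the population variance. The sample variance (`statistics.variance`) divides by the size minus one instead, so it returns a slightly larger value.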

Standard Deviation

The standard deviation is a statistic that measures the dispersion of the data relative to its mean. It is calculated as the square root of the variance.

Example

[Figure: Standard deviation]
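A short Python sketch of the square-root step, first with the variance based on the rounded mean of 6 and then with `statistics.pstdev`, which uses the unrounded mean. The variable names are illustrative.

```python
import math
from statistics import pstdev

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]

# Variance based on the rounded mean of 6, dividing by the dataset size
variance_rounded = sum((x - 6) ** 2 for x in sales) / len(sales)

std_rounded = math.sqrt(variance_rounded)
print(round(std_rounded, 2))  # 3.35

# Population standard deviation with the unrounded mean
std_exact = pstdev(sales)
print(round(std_exact, 2))    # 3.35
```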

The importance of the Variance and the Standard Deviation

To understand their importance, let us compare the datasets below:

1st dataset: [-2, -1, 0, 1, 2]

2nd dataset: [-20, -10, 0, 10, 20]

3rd dataset: [20, 20, 20, 20, 20]

If we use the same calculations, we will see that the first two datasets have an identical mean and median of 0. If business users rely only on the mean or the median to represent the data, this can be misleading, because neither value describes the whole dataset.

However, if we calculate the standard deviation for the same two datasets, the results are 1.414 and 14.14, which shows that the values in the 2nd dataset vary much more widely.

Similarly, if we calculate the standard deviation for the 3rd dataset, it equals 0, which tells us that all the values are the same. For some businesses this may not be a good performance indicator, or it may point to a fraud case if it is unusual for the data to be this uniform.

A small standard deviation tells us that the data values are close to the mean; a large standard deviation tells us that they are far from the mean.
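The comparison can be reproduced with a few lines of Python. The dictionary name `datasets` and its keys are illustrative; `pstdev` is the population standard deviation, which divides by the dataset size.

```python
from statistics import mean, median, pstdev

datasets = {
    "1st": [-2, -1, 0, 1, 2],
    "2nd": [-20, -10, 0, 10, 20],
    "3rd": [20, 20, 20, 20, 20],
}

# Collect (mean, median, standard deviation) for each dataset
summary = {
    name: (mean(values), median(values), round(pstdev(values), 3))
    for name, values in datasets.items()
}

for name, (avg, mid, std) in summary.items():
    print(name, "mean:", avg, "median:", mid, "std:", std)
```

The first two datasets share a mean and median of 0 but have standard deviations of about 1.414 and 14.142, while the third has a standard deviation of 0.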

Summary

In conclusion, this article discussed simple measures for understanding data. These measures belong to descriptive statistics, which is used to describe data in terms of its center, its spread, and its shape. There are different types of data analytics; you can learn more about them from this article.

Finally, if we would like a good, quick presentation of the data, I believe these measures (mean, median, standard deviation, min, and max) are frequently used and enough to do the job. There is also the five-number summary (min, Q1, Q2 (median), Q3, max), which gives an overview of the data as well.
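As a rough sketch, the five-number summary for the sales example can be produced with Python's standard library. `statistics.quantiles` interpolates between neighbouring values; for this dataset its default "exclusive" method happens to give the same quartiles as the median-of-halves approach, but the two conventions can differ on other data.

```python
from statistics import quantiles

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]

# n=4 asks for the three cut points that split the data into quartiles
q1, q2, q3 = quantiles(sales, n=4, method="exclusive")

five_number_summary = (min(sales), q1, q2, q3, max(sales))
print(five_number_summary)  # (1, 3.5, 5.0, 10.0, 11)
```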

Cite this article as: Mohamed Sami, (July 13, 2018). "Understand The Data using Simple Measures," in Mohamed Sami - Personal blog. Retrieved November 21, 2018, from https://melsatar.blog/2018/07/13/understand-the-data-using-simple-measures/.