In our daily lives, we collect and deal with different kinds of data. Data exists in almost every business today, and it is one of the business's assets. In the health industry, for example, you may need to know the number of appointments, the number of doctors per specialty, staff performance, bed occupancy, the mortality rate, the busiest day, revenue, and so on.
In this article, we will define the types of data we deal with every day and the measures of center and spread, which are powerful tools for understanding data in a simple way.
There are dozens of ready-made tools and applications that calculate these measures easily, but I believe it is important to know how to calculate them, how to use them, and why.
What are the types of data?
Any data can be placed in one of two categories: quantitative or categorical.
Quantitative Data
Quantitative data has numeric meaning and is used in calculations, for example, number of employees, number of sales, etc. Quantitative data is either continuous or discrete.
Continuous data is measurable and can take any value within a range, so it can always be broken into smaller units, for example, age, weight, or height.
Discrete data is countable and cannot be broken into smaller units, for example, number of customers or number of employees.
Categorical Data
Categorical data describes a quality or characteristic of something without a numerical value, for example, movie ratings, education levels, nationalities, etc. Similarly, categorical data is either ordinal or nominal.
Ordinal data has a meaningful order, for example, movie ratings, education levels, age levels, etc.
Nominal data cannot be ordered in any meaningful way, so the values are just labels, for example, countries, gender, colors, etc.
The table below summarizes examples of each type of data.
Quantitative Data | |
Continuous | Discrete |
Age – weight – height – Blood Pressure – Speed | Number of customers – Number of employees – Days in a week – Days in a year |
Categorical Data | |
Ordinal | Nominal |
Movies ratings – Education levels – Age levels – Height levels – Agreement levels – Satisfactions Levels | Countries – Gender – Colors – Blood groups – Industry Group |
Types of data examples
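To make the ordinal versus nominal distinction concrete, here is a minimal Python sketch. The education levels, staff list, and blood group values are made up for illustration and are not taken from any real dataset.

```python
from collections import Counter

# Ordinal: the order of the levels carries meaning, so sorting by rank makes sense.
# These levels are illustrative assumptions.
EDUCATION_ORDER = ["High school", "Bachelor", "Master", "PhD"]
rank = {level: position for position, level in enumerate(EDUCATION_ORDER)}

staff_education = ["Master", "High school", "PhD", "Bachelor"]
sorted_education = sorted(staff_education, key=rank.get)

# Nominal: the values are only labels, so counting is meaningful but ordering is not.
blood_groups = ["A", "O", "B", "O", "AB", "O"]
counts = Counter(blood_groups)

print(sorted_education)
print(counts.most_common(1))
```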
Measures of Data Center
The measures of center are mainly used with quantitative data, because they work on numeric values, as explained above. There are three measures of center: the mean, the mode, and the median. Each one helps us understand our data.
To see how they work, let us assume that we have the number of sales per month, as in the graph below.
We have these values by month:
[Jan = 5, Feb = 6, Mar = 10, Apr = 11, May = 1, Jun = 2, Jul = 5, Aug = 10, Sep = 5]
The number of values, which is the dataset size, is 9.
Mean
The mean is the average value of the data. It is also known as the "arithmetic mean". It is calculated by adding all the data values in a set and dividing the sum by the number of values in the dataset.
Example:
Using the same example, we may want to know the average sales across all months. We can calculate it from the values we plotted:
[5, 6, 10, 11, 1, 2, 5, 10, 5]
Number of values = 9
Mean = (5 + 6 + 10 + 11 + 1 + 2 + 5 + 10 + 5) / 9 = 55 / 9 = 6.11
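The same calculation can be sketched in a few lines of Python. The variable names are only illustrative; the values are the monthly sales from the example.

```python
sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]  # monthly sales, Jan to Sep

# Add all the values first, then divide the total by the number of values.
total = sum(sales)
mean = total / len(sales)

print(total)           # 55
print(round(mean, 2))  # 6.11
```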
Mode
The mode is the value that appears most often in a set of numbers. It is not commonly used in statistical calculations.
Example:
Using the same example, we may want to know the mode of sales across all months. The mode is useful mainly when the data contains abnormal cases. For example, if October had unusually high sales of $100, the mean may no longer be the best representation of the data.
[5, 6, 10, 11, 1, 2, 5, 10, 5]
If we sort the data, it looks like this: [1, 2, 5, 5, 5, 6, 10, 10, 11]
The most repeated value is 5, which appears 3 times in the dataset.
Mode = 5
You do not have to sort the data to find the mode, but sorting makes it easier to spot. If every value appears the same number of times, there is no mode. If two or more values tie for the highest number of appearances, the dataset has more than one mode.
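As a rough sketch, Python's standard library can find the mode, including the multi-mode case. The second list is a made-up example of a tie.

```python
from statistics import multimode  # available in Python 3.8+

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]

# multimode returns every value that ties for the highest count.
modes = multimode(sales)
print(modes)  # [5]

# Hypothetical data where two values tie: both are modes.
tied_modes = multimode([1, 1, 2, 2, 3])
print(tied_modes)  # [1, 2]

# Note: when every value appears equally often, multimode returns all of them;
# reading that result as "no mode" is the convention used in this article.
```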
Median
The median is the middle number in a sorted dataset. It also helps us see how the data is distributed. If the median and the mean are equal or close, the distribution is roughly symmetric (the normal distribution is one example). If they differ noticeably, the data is left-skewed or right-skewed.
Skew occurs when the mean is pulled above or below the median by very high or very low values, as in the image below.
Example:
Using the same example values:
[5, 6, 10, 11, 1, 2, 5, 10, 5]
Here we need to sort the data first. The sorted list looks like this:
[1, 2, 5, 5, 5, 6, 10, 10, 11]
Median = 5
The middle number is 5, which is easy to find because the number of values is odd. If the number of values is even, we take the average of the two middle values. For example, suppose we add 100 for October at the end of the sorted list:
[1, 2, 5, 5, 5, 6, 10, 10, 11, 100]
The two middle values are [5, 6]
Median = (5 + 6) / 2 = 5.5
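Here is a short Python sketch of both cases. It also compares how the mean and the median react to the October value of 100, which is the effect behind skew. The variable names are illustrative.

```python
from statistics import median

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]
sales_with_october = sales + [100]  # add the unusually high October value

# median() sorts internally and averages the two middle values when the count is even.
median_odd = median(sales)
median_even = median(sales_with_october)
print(median_odd, median_even)  # 5 5.5

# The single high value pulls the mean up sharply, while the median barely moves.
mean_before = sum(sales) / len(sales)
mean_after = sum(sales_with_october) / len(sales_with_october)
print(round(mean_before, 2), mean_after)  # 6.11 15.5
```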
Measures of Data Spread
The measures of spread are also used with quantitative data. They tell us the range of our data and how it is distributed.
Range
The range of a dataset is the difference between its largest and smallest values. It shows how widely the data spans.
Example
Using our sorted data:
[1, 2, 5, 5, 5, 6, 10, 10, 11]
The minimum value is 1 and the maximum value is 11.
Range = 11 – 1 = 10
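In Python, the range is simply the maximum minus the minimum. A minimal sketch, with illustrative names:

```python
sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]

# The data does not need to be sorted for min() and max().
smallest = min(sales)
largest = max(sales)
data_range = largest - smallest

print(smallest, largest, data_range)  # 1 11 10
```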
Interquartile Range
The interquartile range (IQR), also called the midspread or middle 50%, is a measure of variability based on dividing a dataset into quartiles. The quartiles divide a rank-ordered dataset into four equal parts. The values that separate the parts are called the first, second, and third quartiles, denoted by Q1, Q2 (the median), and Q3.
IQR = Q3 – Q1
It is not that complex, but how do we calculate it?
Example
We use the same data, which must be sorted:
[1, 2, 5, 5, 5, 6, 10, 10, 11]
First, we split the dataset into three parts around the median.
Part 1 | Part 2 | Part 3 |
[1, 2, 5, 5] | [5] | [6, 10, 10, 11] |
We are interested in part 1 and part 3. We calculate the median of each part as if it were its own dataset. Because each part has an even number of values, we use the method for even-sized datasets and average the two middle values.
(Q1) Part 1 median = (2 + 5) / 2 = 3.5
(Q3) Part 3 median = (10 + 10) / 2 = 10
Therefore, we calculate the interquartile range (IQR) like this:
IQR = Q3 – Q1 = 10 – 3.5 = 6.5
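The split-and-take-medians method can be sketched in Python as below. The function name `quartiles_by_halves` is hypothetical. Other tools may use different interpolation conventions for quartiles, so their results can differ slightly from this method on other datasets.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3, IQR) using the median of each half of the sorted data.

    For an odd number of values, the overall median is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, q3 - q1

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]
q1, q3, iqr = quartiles_by_halves(sales)
print(q1, q3, iqr)  # 3.5 10.0 6.5
```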
This value is the interquartile range: the spread of the middle 50% of the data, which lies between Q1 and Q3. To plot it, we use a boxplot, the best-known chart for showing both the range and the interquartile range.
As we can see, the plot shows the values we discussed in the example.
What if the dataset has an even number of values? The calculation is the same, but we split the dataset into two parts instead of three, because the median falls between two numbers in the original set. Using the modified dataset:
[1, 2, 5, 5, 5, 6, 10, 10, 11, 100]
The parts will look like this:
Part 1 | Part 2 |
[1, 2, 5, 5, 5] | [6, 10, 10, 11, 100] |
Furthermore, the plot will look like this
Then Q1 = 5
Q3 = 10
IQR = 5
Notice that the value 100 now stands out: the boxplot flags it as an outlier, an abnormal value in the dataset.
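The article does not say which rule the boxplot uses to flag outliers. A common convention, assumed here, treats any value more than 1.5 × IQR below Q1 or above Q3 as an outlier. A minimal sketch using the Q1 and Q3 values from the example:

```python
data = [1, 2, 5, 5, 5, 6, 10, 10, 11, 100]
q1, q3 = 5, 10  # quartiles calculated in the example above
iqr = q3 - q1

# Assumed convention: fences sit 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence)  # -2.5 17.5
print(outliers)                  # [100]
```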
Variance
Variance is a measurement of the spread between numbers in a dataset. Like the standard deviation, which we will discuss later, it measures the variability of our data and how far it spreads from the center value, the mean.
It is calculated as the average of the squared differences from the mean.
Example
Using the same example values:
[5, 6, 10, 11, 1, 2, 5, 10, 5]
We calculated the mean as 6.11; let us round it to 6 for simplicity. To calculate the variance, we subtract the mean from each value in the dataset, square each difference, and sum the squared differences. Finally, we divide that sum by the dataset's size to get the variance. With the mean rounded to 6, the squared differences are [1, 0, 16, 25, 25, 16, 1, 16, 1], which sum to 101, so the variance is 101 / 9 ≈ 11.22.
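The steps can be sketched in Python as follows. This sketch uses the unrounded mean, so the result differs slightly from a hand calculation that rounds the mean to 6. It divides by the dataset size (the population variance), as the article describes; the sample variance, which divides by the size minus one, is a different quantity.

```python
from statistics import pvariance

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]

mean = sum(sales) / len(sales)

# Subtract the mean from each value, square the difference, then average the squares.
squared_diffs = [(x - mean) ** 2 for x in sales]
variance = sum(squared_diffs) / len(sales)

print(round(variance, 2))          # 11.21
print(round(pvariance(sales), 2))  # 11.21, same result from the standard library
```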
Standard Deviation
The standard deviation is a statistic that measures the dispersion of the data relative to its mean. It is calculated as the square root of the variance.
Example
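As a minimal sketch with the same monthly sales values, the standard deviation is the square root of the variance. It is computed here as the population standard deviation, which divides by the dataset size.

```python
from math import sqrt
from statistics import pstdev, pvariance

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]

# Square root of the population variance.
std_dev = sqrt(pvariance(sales))

print(round(std_dev, 2))        # 3.35
print(round(pstdev(sales), 2))  # 3.35, the standard library computes it directly
```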
The importance of the Variance and the Standard Deviation
To understand why they matter, let us compare the datasets below:
1st dataset | 2nd dataset | 3rd dataset |
[-2,-1,0,1,2] | [-20,-10,0,10,20] | [20,20,20,20,20] |
Using the same calculations, we see that the first two datasets have identical mean and median values, both equal to 0. If business users rely only on the mean or the median to represent the data, the picture can be misleading, because these two very different datasets look the same.
If we calculate the standard deviation for the same datasets, the results are 1.414 and 14.14, which shows how much more the values vary in the 2nd dataset.
Similarly, if we calculate the standard deviation for the 3rd dataset, it equals 0, which tells us that all the values are the same. This can be a poor performance indicator for some businesses, or it can point to possible fraud if such uniformity is unusual for the data.
A small standard deviation tells us that the data values are close to the mean; a large standard deviation tells us that they are far from the mean.
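The comparison can be reproduced with a short Python sketch. The dictionary keys are illustrative labels, and the standard deviation is the population version, which divides by the dataset size.

```python
from statistics import mean, median, pstdev

datasets = {
    "1st": [-2, -1, 0, 1, 2],
    "2nd": [-20, -10, 0, 10, 20],
    "3rd": [20, 20, 20, 20, 20],
}

# For each dataset, collect (mean, median, population standard deviation).
results = {
    name: (mean(values), median(values), pstdev(values))
    for name, values in datasets.items()
}

for name, (avg, mid, spread) in results.items():
    print(f"{name}: mean={avg}, median={mid}, std={spread:.3f}")
```

The first two datasets share the same mean and median, and only the standard deviation tells them apart.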
Summary
In conclusion, this article discussed simple measures for understanding data. These measures belong to descriptive statistics, which is used to describe data in terms of its center, its spread, and its shape. There are different types of data analytics; you can learn more about them from this article.
Finally, if we want a good picture of the data quickly, I believe these measures (mean, median, standard deviation, min, and max) are frequently used and enough to do the job. There is also the five-number summary of the data (min, Q1, Q2 (median), Q3, max), which gives a quick overview as well.
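As a rough sketch, the five-number summary for the monthly sales values can be assembled with Python's standard library. `statistics.quantiles` uses its own default interpolation method, which happens to match the quartiles calculated by hand for this dataset but may differ on others.

```python
from statistics import quantiles  # available in Python 3.8+

sales = [5, 6, 10, 11, 1, 2, 5, 10, 5]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(sales, n=4)
summary = (min(sales), q1, q2, q3, max(sales))

print(summary)  # (1, 3.5, 5.0, 10.0, 11)
```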