

Basic Statistical Descriptions of Data




Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.

For data preprocessing tasks, we want to learn about the central tendency of the data, that is, where the middle or center of the data distribution lies.

  • Measures of central tendency include Mean, Median, and Mode.

Mean

The most common and effective numeric measure of the “center” of a set of data is the (arithmetic) mean. Let x_1, x_2, ..., x_N be a set of N values or observations for some numeric attribute X, such as salary.

The mean of this set of values is

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

Example: Mean. Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. Applying the definition of the mean, we have

$$\bar{x} = \frac{30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110}{12} = \frac{696}{12} = 58$$

Thus, the mean salary is $58,000.
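The calculation above can be checked with a short Python sketch. The variable names are illustrative choices, not part of the original text; the data is the salary list from the example.

```python
from statistics import mean

# Salary values (in thousands of dollars) from the example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Arithmetic mean straight from the definition: sum of the values divided by N.
manual_mean = sum(salaries) / len(salaries)

# The same quantity via the standard library.
library_mean = mean(salaries)

print(manual_mean)  # 58.0
```

Both approaches give 58 (thousand dollars), matching the hand calculation.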

Sometimes, each value x_i in a set may be associated with a weight w_i for i = 1, ..., N. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we can compute

$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N}$$

This is called the weighted arithmetic mean or the weighted average.
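As a minimal sketch of the weighted mean, the example below uses occurrence frequencies as weights: each distinct salary value is weighted by how often it appears in the salary list. The helper name `weighted_mean` and the choice of frequencies as weights are assumptions made for illustration.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Distinct salary values (in thousands of dollars) and how often each occurs.
distinct_salaries = [30, 36, 47, 50, 52, 56, 60, 63, 70, 110]
frequencies = [1, 1, 1, 1, 2, 1, 1, 1, 2, 1]

print(weighted_mean(distinct_salaries, frequencies))  # 58.0
```

With frequencies as weights, the weighted mean equals the ordinary mean of the full list, which is a quick sanity check on the formula.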


Median

Another measure of the center of data is the median, the middle value of an ordered set of data values. Suppose that a given data set of N values is sorted in numerical order.

  • If N is odd, the median is the middle value of the ordered set;
  • If N is even, the median is the average of the middle two values.

Example: Median. Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

There is an even number of observations (i.e., 12); therefore, the median is not unique. It can be any value between the two middlemost values of 52 and 56 (that is, the sixth and seventh values in the list). By convention, we assign the average of the two middlemost values as the median; that is,

$$\frac{52 + 56}{2} = \frac{108}{2} = 54$$

Thus, the median is $54,000.
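The odd/even rule can be sketched in Python as follows. The function name `median_by_definition` is a hypothetical helper; the standard library's `statistics.median` follows the same averaging convention for numeric data.

```python
from statistics import median


def median_by_definition(values):
    """Middle value if N is odd, average of the two middle values if N is even."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Salary values (in thousands of dollars) from the example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(median_by_definition(salaries))  # 54.0
```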

In probability and statistics, the median generally applies to numeric data; however, we may extend the concept to ordinal data.

Suppose that a given data set of N values for an attribute X is sorted in increasing order.

  • If N is odd, then the median is the middle value of the ordered set.
  • If N is even, then the median may not be unique.

In this case, the median is the two middlemost values and any value in between.

Mode

Another measure of central tendency is the mode. The mode for a set of data is the value that occurs most frequently in the set.

It is possible for the greatest frequency to correspond to several different values, which results in more than one mode.

  • Data sets with one, two, or three modes are called unimodal, bimodal, and trimodal, respectively.
  • At the other extreme, if each data value occurs only once, then there is no mode.

Example: Mode. Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

The data set above is bimodal: the two modes are 52 and 70, each of which occurs twice.
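A small Python sketch can find all modes of a data set by counting occurrences. The helper name `modes` is illustrative, and returning an empty list when every value occurs only once is an assumption that mirrors the "no mode" convention described in this section.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] if every value occurs only once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Each value occurs exactly once: treated as having no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


# Salary values (in thousands of dollars) from the example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(modes(salaries))  # [52, 70]
```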

Midrange

The midrange can also be used to assess the central tendency of a numeric data set. It is the average of the largest and smallest values in the set.

Example: Midrange. Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.

The midrange of the data is

$$\frac{30{,}000 + 110{,}000}{2} = 70{,}000$$

Thus, the midrange is $70,000.
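The midrange needs only the smallest and largest values, as this minimal Python sketch shows. The function name `midrange` is an illustrative choice.

```python
def midrange(values):
    """Average of the smallest and largest values in the data set."""
    if not values:
        raise ValueError("midrange of an empty data set is undefined")
    return (min(values) + max(values)) / 2


# Salary values (in thousands of dollars) from the example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(midrange(salaries))  # 70.0
```

Because it depends only on the two extreme values, the midrange is sensitive to outliers such as the 110 in this list.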

Central Tendency Measures for different attributes:

Central Tendency Measures for Numerical Attributes: Mean, Median, Mode

Central Tendency Measures for Categorical Attributes:

  • Central Tendency Measures for Nominal Attributes: Mode
  • Central Tendency Measures for Ordinal Attributes: Mode, Median

Example:

What are the central tendency measures (mean, median, mode) of the following attributes?

Solution:

attr1 = {2, 4, 4, 6, 8, 24}
mean = (2 + 4 + 4 + 6 + 8 + 24) / 6 = 8 (average of all values)
median = (4 + 6) / 2 = 5 (average of the two middle values)
mode = 4 (most frequent value)


attr2 = {2, 4, 7, 10, 12}
mean = (2 + 4 + 7 + 10 + 12) / 5 = 7 (average of all values)
median = 7 (middle value)
mode = none (every value occurs exactly once)


attr3 = {xs, s, s, s, m, m, l}
mean = undefined (the mean is not meaningful for categorical attributes)
median = s (middle value of the ordered list)
mode = s (most frequent value)
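For the ordinal attribute, the median and mode can be sketched in Python without any arithmetic on the labels. The ordering xs < s < m < l and the names `SIZE_ORDER` and `ordinal_median` are assumptions made for this illustration. For an even number of values the function returns both middle labels, since ordinal labels cannot be averaged.

```python
from collections import Counter

# Assumed ordering of the ordinal labels, from smallest to largest.
SIZE_ORDER = ["xs", "s", "m", "l"]


def ordinal_median(values, order):
    """Return the (lower, upper) middle labels; they coincide when N is odd."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    rank = {label: position for position, label in enumerate(order)}
    ordered = sorted(values, key=lambda label: rank[label])
    n = len(ordered)
    return ordered[(n - 1) // 2], ordered[n // 2]


attr3 = ["xs", "s", "s", "s", "m", "m", "l"]

# The mode only needs counts, so it works for nominal and ordinal data alike.
attr3_mode = Counter(attr3).most_common(1)[0][0]

print(ordinal_median(attr3, SIZE_ORDER))  # ('s', 's')
print(attr3_mode)                         # s
```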

