|
Home >
Exploring Excel
>
An Excel Tutorial:An Introduction to Excel's
Normal Distribution Functions
Excel provides several spreadsheet functions for working with normal
distributions or "bell-shaped curves." Here's a brief introduction
for people who are statistically challenged.
by Charley Kyd
July, 2006
(Update:
To learn how to create figures like this in Excel, see
How to Create Normal Curves In Excel, With Shaded Areas.)
Recently, a reader asked me how to generate a random number from a
Normal distribution. This question set me to thinking about doing
statistics with Excel.
Many of us were introduced to statistics in school and then forgot
what little we learned...often within seconds of graduation. Also, when we took
statistics, many of us weren't taught how to work with statistics using Excel. This is unfortunate, because in business it's often useful to have
some grasp of that topic.
For all these reasons, I thought it would be worthwhile to briefly
explore normal -- or "bell-shaped" -- curves in Excel. This is a
commonly used area of statistics, and one for which Excel provides
several useful functions.
One interesting thing about the normal curve is that it occurs frequently in many different settings:
- The height of each gender in a population is normally distributed.
- The measure of LDL cholesterol is normally distributed in adults.
- The width of stripes on a zebra is said to be normally distributed.
- Most measurement errors are assumed to be normally
distributed.
- Many Six-Sigma calculations assume normal distribution.
- Etc.
As a final example, here's a surprising occurrence of the normal
curve: Take any population, whether it's normally distributed or not.
Randomly select at least 30 members from that population, measure them
for some characteristic, and then find the average of those measures.
That average is one data point. Return the samples, select another
random sample of the same number, and find the average of their
measures. Do the same again and again. The Central Limit Theorem says
that those averages tend to have a normal distribution.
Normal distributions are all around us. Therefore, as painlessly as
possible, let's take a closer look at how we work with them using Excel.
Brief Definitions
We need
to get some brief definitions out of the way so that
we can start to describe data using Excel functions.
From cholesterol to zebra stripes, the normal probability distribution
describes the proportion of a population having a specific range of
values for an attribute. Most members have amounts that are near the average; some
have amounts that are farther away from the average; and some have
amounts extremely distant from the average.
For example, a population could be all the stripes on all the zebras
in the world. The normal curve would show the proportion of stripes that
have various widths.
The standard deviation of a sample is a measure of the spread of the sample
from the mean. In a normal distribution, about 68% of a sample is within one standard
deviation of the mean. About 95% is within two standard deviations. And
about 99.7% is within three standard deviations. The numbers in this
figure mark standard deviations from the mean.
The z value is the distance between a value and the mean in
terms of standard deviations. In the figure above, each number is a z
value.
Calculating or Estimating the Standard Deviation
Several of the following functions require a value for the standard
deviation. There are at least two ways to find that value.
First, if you have a sample of the data, you can estimate the
standard deviation from the sample using this formula:
=STDEV(range_of_values)
Second, if you're working with rough estimates, you must take a different
approach, because you don't have actual data to support your estimates.
In this case, first calculate the Range. This is the smallest
likely value subtracted from the largest likely value. Let's assume that
all possible values will be within that range about 95% of the time.
Remember that about 95% of a sample is within two standard deviations
on each side of the mean. (This is a total of four standard deviations,
of course.) Therefore, if we divide the range by four we should have the
approximate standard deviation.
Merely dividing the range by four might seem to be a slipshod approach. But
consider the way this calculation often is used.
Suppose you're forecasting sales for next year. You think sales will
be about 1,000, but the number could be as high as 1,200 and as low as 800.
With that information, you can put a normal curve around your estimated
sales and begin to generate a variety of forecasts for profits and cash
flow.
To emphasize, these numbers are only your best estimates. Therefore,
using an estimated standard deviation doesn't seem quite as sloppy as it
otherwise might.
Based on these estimates, your mean sales will be about 1,000 and
your standard deviation will be about (1200 - 800) / 4 = 100. With this
information, you can use the following functions to perform many of the
calculations you will need in your analysis.
NORMDIST(x, mean, standard_dev, cumulative)
NORMDIST gives the probability that a number falls at or below a
given value of a normal distribution.
- x -- The value you want to test.
- mean -- The average value of the distribution.
- standard_dev -- The standard deviation of the distribution.
- cumulative -- If FALSE or zero, returns the probability that x
will occur; if TRUE or non-zero, returns the probability that the
value will be less than or equal to x.
Example: The distribution of heights of American women aged 18 to 24
is approximately normally distributed with a mean of 65.5 inches (166.37
cm) and a standard deviation of 2.5 inches (6.35 cm). What percentage of
these women is taller than 5' 8", that is, 68 inches (172.72 cm)?
The percentage of women less than or equal to 68 inches is:
=NORMDIST(68, 65.5, 2.5, TRUE) = 84.13%
Therefore, the percentage of women taller than 68 inches is 1 -
84.13%, or approximately 15.87%. This value is represented by the shaded
area in the chart above.
NORMSDIST(z)
NORMSDIST
translates the number of standard deviations (z) into cumulative
probabilities.
To illustrate:
=NORMSDIST(-1) = 15.87%
=NORMSDIST(+1) = 84.13%
Therefore, the probability of a value being within one standard
deviation of the mean is the difference between these values, or 68.27%.
This range is represented by the shaded area of the chart.
NORMINV(probability, mean, standard_dev)
NORMINV
is the inverse of the NORMDIST function. It calculates the x variable
given a probability.
To illustrate, consider the heights of the American women used in the
illustration of the NORMDIST function above. How tall would a woman need
to be if she wanted to be among the tallest 75% of American women?
Using NORMINV, she would learn that she needs to be at least 63.81
inches tall, as shown by this formula:
=NORMINV(0.25, 65.5, 2.5) = 63.81 inches
The figure shows the area represented by the 25% of the American women
who are shorter than this height.
NORMSINV(probability)
NORMSINV
is the inverse of NORMSDIST function. Given the probability that a
variable is within a certain distance of the mean, it finds the z value.
To illustrate, suppose you care about the half of the sample that its
closest to the mean. That is, you want the z values that mark the
boundary that is 25% less than the mean and 25% more than the mean.
The following two formulas provide those boundaries of -.674 and
+.674, as illustrated by the figure.
=NORMSINV(0.25)
=NORMSINV(0.75)
STANDARDIZE(x, mean, standard_dev)
STANDARDIZE returns the z value for a specified value, mean, and
standard deviation.
To illustrate, in the NORMINV example above, we found that a woman
would need to be at least 63.81 inches tall to avoid the bottom 25% of
the population, by height. The STANDARDIZE function tells us that the z
value for 63.81 inches is:
=STANDARDIZE(63.81, 65.5, 2.5) = -0.6745
We can check this number by using the NORMSDIST function:
=NORMSDIST(-0.6745) = 25%
That is, a z value of -.6745 has a probability of 25%.
How to Calculate a Random Number from a Normal Distribution
Remember that the NORMINV function returns a value given a
probability:
NORMINV(probability, mean, standard_dev)
Also, remember that RAND() function returns a random number between 0
and 1. That is, RAND() generates random probabilities. Therefore, it
seems logical that you could use the NORMINV function to calculate a random number
from a normal distribution, using this formula:
=NORMINV(RAND(), mean, standard_dev)
However, Jerry W. Lewis -- a former Excel MVP and a professional
statistician -- offers a stern comment about this approach. "Using
NORMINV(RAND(),...) to generate Normal variates is totally unacceptable
prior to Excel XP, and is only marginal in XP. This is because of
inadequacies in NORMINV and in the tails of NORMDIST itself."
"NORMINV prior to Excel XP produced a very un-normal fraction of
values around 6 million standard deviations from the mean," Jerry wrote.
"This is due to inaccuracies in the implementation of NORMDIST and
NORMINV. Excel XP brought those values into a less obviously wrong
location, but otherwise did little to improve the situation. NORMINV in
Excel 2003 is a decent implementation."
Instead, Jerry recommends the Box-Muller method described here:
http://mathworld.wolfram.com/Box-MullerTransformation.html
This method, he wrote, is limited only by the inadequacies of the RAND()
function prior to Excel 2003, which had unacceptable autocorrelation.
The Box-Muller approach suggests that Excel users should use this
formula to calculate a random number from a normal distribution:
=SQRT(-2*LN(RAND()))*SIN(2*PI()*RAND())
The Box-Muller method is mathematically exact, Jerry writes, if
implemented with a perfect uniform random number generator and infinite
precision.
A Note About The Charts
I created all of the figures for this article in Excel. If you would
like to know how I created them, check back in several weeks. I'll
explain how you can create similar charts, and I'll point you to a web
site that provides more about this topic.
References
I must have at least 15 statistics books gathering dust on bookshelves in the
basement. Even so, these two books offered clear explanations that you
might find useful:
Statistical Analysis with Excel for Dummies,
by Joseph Schmuller, PhD
Excel Data Analysis for Dummies,
by Stephen L. Nelson, MBA, CPA
(Email Comments)
|