Benford's Law Explained
Also called the "first-digit law", Benford's Law says that for many real-life sources of data, the first digit will be "1" about 30% of the time and that small first-digits will occur more frequently than large first-digits.
The law is named after Dr. Frank Benford, a physicist working for the General Electric Company. In 1938 he noticed that the pages of logarithms for low numbers were dirtier and more worn than pages for high numbers, and that pages for numbers starting with 1 were the most heavily used of all.
Intrigued, he analyzed thousands of sets of numbers. These included lists of the areas of rivers, street addresses in a phone book, baseball statistics, population statistics, and numbers used in magazine articles. In all cases, the smaller that the first digit of the number was, the more likely was that first digit to appear.
Benford derived a formula that states the expected frequency of each beginning digit. In Excel, we would express this formula as...
...where FirstDigit can be a whole number 1 through 9.
This table shows the results of the formula. To illustrate, the probability of 17.6% in the table for a first digit of 2 was calculated using this formula:
Testing Benford's Law With Stock Data
I tested Benford's Law using two sets of stock data from the Wall Street Journal. This page gave me data for the New York Stock Exchange and this page gave me data for the Nasdaq Capital Market, formerly called the Nasdaq SmallCap Market. In both cases, the data was for July 5, 2007.
To test Benford, I found the first-digit totals for a variety of data for all stocks in the two tables, then used charts to compare how frequently each digit appeared with the frequency that the Benford formula predicts.
To get started, I used links provided in the two WSJ pages to download a worksheet containing each dataset. I moved the two worksheets into one workbook; saved the workbook under a new name; named one worksheet NYSE and the other Nasdaq; deleted the columns I didn't care about in each sheet; and formatted the table in each worksheet. To illustrate, here's my formatted table for the NYSE worksheet:
An Aside: Assigning the Range Names
So that you can understand the formulas that follow, I need to interrupt this description to explain how I handled my range names.
To make my formulas easier to use and understand, I decided to assign the column titles as range names for the columns of data. And because I saved the NYSE and Nasdaq tables in two sheets of the same workbook, I needed to assign names that were local to each worksheet.
Most range names are workbook-global. That is, if you assign the name Close to a range in one worksheet, you can refer to the name in any worksheet in that workbook. To illustrate, if I assigned the name Close to a column of data, I could use that name in any formula in the workbook, like this:
But in this workbook, I wanted names that were local to each worksheet. With worksheets named NYSE and Nasdaq, I wanted to be able to specify which range I wanted, using references like these:
The time-consuming way to assign local names is with the New Name dialog. To illustrate, I could select the range C3:C2104 in the NYSE worksheet above, choose Formulas, Defined Names, Define Name, Define Name...; enter Close in the Names edit box; choose NYSE as the scope, and then choose OK. Then I would need to repeat that process for each of the other columns I wanted to name.
The quick way to assign these names is to use the Create Names From Selection dialog. This dialog assigns a name globally if a global version hasn't already been defined; otherwise, it assigns the name locally. Therefore, in cases like this, I always set up a block of Reserved Names that I define globally, but plan never to use; then I assign the real names locally. To make sure that I don't accidentally use the global versions, I enter =NA() in the named cells. That way, my formula returns the #N/A error value if I use the global version by mistake.
To set up the global versions of the names, I first added a new worksheet to the workbook and named it Control. Then I copied the row of titles shown in the table above to a column in the Control worksheet.
To do so, I copied the range A2:H2 above, selected cell A3 in the Control sheet, shown here; chose Home, Clipboard, Paste, Paste Special; chose Values and Transpose in the Paste Special dialog; then I chose OK.
I entered the formula =NA() in cell B3 and copied it down the column as shown here. I added some minor formatting, then assigned the text in column A as names in column B. To do so, I selected the range A3:B10; chose Formulas, Defined Names, Create From Selection; made sure that only Left Column was selected in the Create Names From Selection dialog; then chose OK.
In the NYSE worksheet, illustrated above, I selected the range A2:H2104, chose Formulas, Defined Names, Create From Selection,; made sure that only Top Row was selected in the Create Names From Selection dialog; then chose OK. I used a similar process to name the table of data in the Nasdaq worksheet.
Testing Benford's Law, Resumed
After I named the ranges of data for the two exchanges, I summarized it in two tables. Here's the table with the NYSE data:
Column E contains the numbers 1 through 9.
The formulas in columns A and B count the number of cells that have numbers beginning with each of the nine digits. The first two formulas return the closing price and the volume, respectively:
In cell A4, the LEFT function returns the left-most digit as text and the VALUE function converts the text to a number. Then the formula compares that left-most digit to the number in cell E4. Because SUMPRODUCT works with arrays, the formula performs this calculation on each cell in the Close range, and stores TRUE or FALSE in a temporary list in memory. Adding zero to those values returns a list of 1 and 0 values, respectively. Finally, the SUMPRODUCT sums those ones and zeros, returning a count of all numbers in the NYSE's Close range that begin with the value 1.
The first time I entered these formulas, Excel returned a #VALUE! error. As I looked through the data I saw that it used the text "..." to show missing values. Rather than creating a complex formula to deal with this problem, I used Excel's Find and Replace dialog to replace all such text with 0 (zero). Later, I saw that the PE column used "dd" to indicate a loss in the prior four quarters; so I also replaced "dd" with 0.
The formula in cell B4 performs the same calculation for the NYSE's Volume range.
Columns C and D contain dollar and percentage changes in stock price. Because this data consists of decimal fractions that can be less than zero, we must use a slightly different formula.
The ABS function finds the absolute value of each number, removing any minus signs. And multiplying each number by 100 moves the first digit to the left of the decimal point. The rest of these formulas is like those for columns A and B.
I copied the range A4:D4 down the column as shown. Then in the range A13:D13 I used SUM functions to return the total of each column. (These columns show slightly different values because data is missing from some of the columns.)
Now that I had the summary data, I could write the formulas for the chart.
This formula is Excel's version of Benford's formula, as I described earlier. Copy it down the column as shown above.
This formula finds the percentage of values in the Close column that begin with the digit "1". When you copy this formula to the range G4:J12 it will adapt to all nine digits for each of the four categories of NYSE data.
Finally, I looked for a way to score how closely the actual distribution of first digits matches the Benford pattern. I settled on this formula, which I copied to the range H13 through J13:
The SUMXMY2 function subtracts each value in one column from the corresponding value in the other column, squares each difference, and then sums the results. The function's name is derived from: SUM of X Multiplied by Y, each quantity squared.
Dividing the result by 9 returns an average value for each of the nine squared deviations. And finding the square root of this average value returns an average deviation between the two columns of nine numbers.
I call this final score the Average Deviation, or AD. The closer the AD is to zero, the more closely our actual data matches Benford's number pattern.
Honestly, statisticians probably would object to this calculation. But it's simple to use and makes sense to non-statisticians: It merely represents the average amount that our data deviates from the Benford pattern.
To give you an idea of how a chart of your data might correspond with a specific AD score, the figure below illustrates six data series, in green, and the Benford numbers in orange. You can see how the green line strays farther from the orange line as the AD grows from .9% to 12.4%.
Near the end of this article I'll explain how you might use the AD for business planning.
Charting Benford Data
After I created the NYSE table, I made another one for Nasdaq data. Then I created this display of the charts of both data sets:
What do these charts tell us?
In both charts, Benford's calculation is plotted as a thick orange line. But in both charts, the line is barely visible because plots of the actual data match Benford's calculation so precisely...with one exception, the green line showing the closing price.
Notice that the NYSE has fewer stocks than predicted with closing prices that begin with "1" and more stocks with closing prices that begin with "3", "4", and "5".
Further, compare the Average Deviation for the the Close column in cell G13 of the table above with the other ADs shown in that figure. The AD for the Close column is 4.1%. This value is notably higher than the other three ADs.
I suspect this pattern for closing prices arises because the average NYSE stock is expensive, averaging more than $96 per share. As a consequence, companies often split their stock when its price gets above $100. This makes "round lots" of the stock (100 shares) more affordable for small investors.
We see an opposite pattern in the small-cap Nasdaq stocks, which have an average price of less than $10 per share. In those firms, the chart shows that many Nasdaq companies apparently prefer their stock prices to begin with "1". After all, a $1.49 stock seems more important than a stock costing $.49, and a $12.95 stock seems more important than a stock priced at $3.95.
The NYSE pattern got me wondering whether stock splitting made financial sense for those companies and their investors.
The plot of the PE ratio in this chart seems to show that NYSE investors
valued stocks with prices beginning with "4" significantly above other
stocks. Also, in general, it appears that the larger the first digit, the more
expensive a stock seemed to investors, and the less they valued it.
Where Benford's Law Works Best
Benford's first-digit number pattern is typically found in real-life data independent of human or categorical influence. Here are some more specific requirements:
The data should vary by at least an order of magnitude. The more variation, the better.
No built-in maximum, minimum, or often-repeated numbers.
The Benford pattern wouldn't be found in a measure of the height of adults or in the ages of the members of a local senior center.
A record of purchases would be distorted by the frequent purchase of one item for one price. Similarly, in a list of company expense reports, if company policy states that expenses of $25 or more must include a copy of the receipt, expect many expenses to be listed at $24.95. This would distort the first-digit pattern, of course.
Large sample size.
You want enough numbers to give the Benford number pattern a chance to show itself. Unfortunately, I haven't found a source that offers well-reasoned advice about how small a sample size can be for first-digit analysis.
This chart illustrates a non-mathematical way to address this issue. It shows the average deviation for sample sizes of 10 through 610 numbers for both the closing prices and volumes for the NYSE data.
As you can see, the AD started at about 10% for a sample size of 10 for both series and dropped quickly to about 5% for a sample size of 60. From there, the green line for the closing price quickly approached its true value of about 4.1%. And the brown line for the volume continued downward, coming close to its actual value of 0.5%.
Based on this limited test, it looks that about 50 numbers will give you some idea about the degree to which the Benford pattern applies to your data. But 500 numbers should give you an excellent idea.
This assumes, of course, that your data is in no particular order. If it's sorted by the column you're testing, you would need to test the entire list, no matter how long it is.
Results of standard transactions or calculations.
Numbers generated in the normal course of business tend to follow Benford's Law.
In contrast, lists of check numbers or purchase order numbers would not work. These are numeric labels and not true values.
And particularly, numbers made up out of thin air would not follow Benford's pattern. This is why Benford's Law is useful for auditors searching for fraud, and why it might also be useful for improving business plans.
Using Benford's Law to Improve Business Planning
Back when I was a CFO for small companies, I quickly learned to distrust the budgets and forecasts that certain managers gave me. These people would open my budget forms and quickly enter their guesstimates. I variously called these either No-Thought Plans or Five-Minute Plans.
My challenge was to know which plans were the result of real thought and which were the result of no-thought guesses. I first assumed that any number ending with several zeros was a no-thought guess. But managers quickly learned to add randomly chosen numbers to their guesses, making their plans seem more precise.
Now, years later, I realize that Benford's Law might have helped me to find the no-thought plans. To see how this might work, consider two sales managers.
One manager spends five minutes with his sales forecast. He takes a wild guess about how each product, region, or sales person will perform; he reduces his guesses by a fudge factor; and then he quickly enters his "forecast" by month.
Because this data comes mostly from the manager's imagination, you wouldn't expect it to follow Benford's Law.
The other manager is much more precise. She carefully considers past sales performance by region and month. She estimates the business she expects to do with each of her largest customers. She looks at price quotes outstanding and the likelihood that each quote will produce a purchase order. She considers the likely returns from planned marketing projects. She gets detailed estimates from her sales staff. And she might even use some form of statistical time-series forecasting.
Because this forecast is more closely linked to real-world data, you would expect it to follow the Benford distribution more closely than the other manager's guesstimate.
The same logic could apply to budgeting expenses.
Therefore, here's how you probably could improve your business plans:
(1) Calculate the Benford distribution for each plan that your managers submit, and then calculate an Average Deviation score for each plan.
(2) Sort the results so that the largest AD is at the top of the list.
(3) Work your way down the list, quizzing managers about the methods they used to create their plans. And request revised plans as necessary.
(4) Continue down the list until you either find managers who truly have thought about their plans, or until you run out of time...whichever comes first. If you do run out of time, you'll know that you probably have improved the worst of the plans.
Alternatively, if you know you'll run out of time, you could isolate the plans that have the highest ADs and sort them by the total amount of the plans. That way, you can work down a list of the largest of the worst.