Benford's Law: How Tax Frauds are caught...

If you list all the countries in the world and their populations, 27% of the numbers will start with the digit 1. Only 3% of them will start with the digit 9. Something very similar holds if you look at the heights of the 60 tallest structures in the world — whether you measure in meters or in feet. 

The first Digit Phenomenon
This phenomenon, called Benford's Law, helps auditors detect fraud in things like taxes and elections, but it also connects in striking ways to modern physics and mathematics (e.g., power laws in statistical distributions, as well as ergodic theory).

Benford's Law often strikes people as unintuitive because it seems that every digit should have an equal opportunity to start country populations or heights of skyscrapers, like this:


Normal Human Perception
(The delightful figures are from http://www.thecleverest.com/benf...)

This egalitarian intuition about leading digits turns out to be misleading. The situation where every digit is equally likely to start numbers is actually the anomalous one. 

==

Simon Newcomb
The fact that the non-uniform pattern is the common one was named for physicist Frank Benford, who, in 1938, showed that it holds in a wide variety of real lists of numbers (river lengths, molecular weights, street addresses, etc.). But the fact was first discovered in 1881 by Simon Newcomb. He noticed it while thumbing through logarithm books -- tables used at that time by scientists to do arithmetic with large numbers. Newcomb became intrigued by the fact that the pages listing numbers starting with 1 were far more worn than the other pages. This would not happen if every digit occurred equally often as a first digit in the numbers scientists worked with.

(Newcomb was a remarkable polymath. Despite having little formal education, he made an early, quite accurate measurement of the speed of light and was the first to enunciate the fundamental equation of exchange in economics.)

==

The reason Benford's Law is useful in fraud detection is that most fraudsters, in the process of making up numbers, do not pay attention to the pattern of first digits that shows up in organic data sets [6]. The leading digits in large spreadsheets of legitimate financial numbers (light green in the figure below) tend to be very close to Benford's Law (blue), while ones filled in by guessing randomly look way off (orange), and fraudulent numbers (red) tend to look even more bizarre. When tax sleuths notice these tell-tale "unnatural" patterns in data sets, they call people in for a human audit.
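A rough sketch of the kind of first-digit screen described above, using only the Python standard library. Everything here is an illustrative assumption rather than what any tax authority actually runs: the function names are made up, the two data sets are synthetic stand-ins, and the chi-square statistic is just one common way to measure distance from the Benford percentages (given by the standard formula log10(1 + 1/d)).

```python
import math
import random
from collections import Counter

# Benford's Law: probability that the leading digit is d (standard closed form).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """Return the first nonzero digit in the decimal representation of x."""
    for ch in repr(abs(x)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("zero has no leading digit")

def chi_square_vs_benford(values):
    """Chi-square statistic of observed leading-digit counts against Benford."""
    counts = Counter(leading_digit(v) for v in values)
    n = len(values)
    return sum((counts[d] - n * p) ** 2 / (n * p) for d, p in BENFORD.items())

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic stand-in for "organic" amounts: spread evenly (in log terms)
# over five orders of magnitude.
organic = [10 ** rng.uniform(0, 5) for _ in range(2000)]

# Synthetic stand-in for invented amounts: every leading digit equally likely.
made_up = [rng.randint(100, 999) for _ in range(2000)]

print("organic:", round(chi_square_vs_benford(organic), 1))
print("made up:", round(chi_square_vs_benford(made_up), 1))
```

With nine digit classes the statistic has 8 degrees of freedom, so values far above roughly 15.5 (the usual 5% cutoff) are the sort of thing that would flag a data set for a closer look. A large value is only a hint, not proof of fraud.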

Tax Data and Benford's Law
What are the fraudsters missing? To get a sense of why the uniform distribution isn't so natural, we can reason as follows.

  • First, observe that if you multiply a number by 2, then very often the first digit of the result will be 1. This is certainly the case if the original number started with 5, 6, 7, 8 or 9, which covers five of the nine possible leading digits. So if you begin with the intuitively appealing uniform distribution of leading digits (every leading digit being equally likely) and then multiply all the numbers by 2, the distribution of leading digits will no longer be uniform: there will now be a lot of leading 1's.

    (To describe this phenomenon, I say that multiplication by 2 privileges 1 as a leading digit.)

    This simple observation already tells you that the uniform distribution of leading digits is not really very stable. It doesn't like to persist. It is easy to upset by the innocuous operation of multiplying everything by 2, which is difficult to avoid in the wild!
  • Second, it turns out that many naturally occurring tables of numbers can be thought of as arising from taking some original list and multiplying each entry by a random number of twos.

In view of this, it is natural that we see lower digits overrepresented, and higher digits under-represented, in many naturally occurring data sets.
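The first observation can be checked by brute force. The sketch below takes every 3-digit number (so each leading digit shows up exactly 100 times), doubles them all, and counts leading digits again. The choice of 3-digit numbers is just a convenient assumption for the demonstration.

```python
from collections import Counter

def leading_digit(n):
    """First digit of a positive integer."""
    return int(str(n)[0])

# Every 3-digit number: each leading digit 1..9 appears exactly 100 times.
numbers = range(100, 1000)

before = Counter(leading_digit(n) for n in numbers)
after = Counter(leading_digit(2 * n) for n in numbers)

for d in range(1, 10):
    print(d, before[d], after[d])
```

After doubling, 500 of the 900 numbers (about 56%) start with 1, and each of the digits 2 through 9 is left with only 50. One multiplication by 2 is enough to wreck the uniform pattern.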

To explore the explanation in more depth, let's focus on the example of country populations. These tend to grow over time. Think of growing as starting from a random size and being multiplied by 2 a (random) number of times, different for each country (depending on growth rate). Since multiplication by 2 privileges the digit 1 as a leading digit, it's not surprising that a lot of the final numbers start with ones, and far fewer start with nines.

(By the way, it's not just multiplication by 2 that privileges 1 as a leading digit. Multiplication by most numbers privileges lower initial digits, in a sense that is made precise below. So does division by most numbers.) 

Maybe the way to think about it is this. To get a list of numbers not to satisfy Benford's Law you need to build it that way (say, by writing down a list of 6-digit numbers and rolling a 10-sided die to pick all the digits). And then you need to make sure no creature comes along after you are done and multiplies all of them by something a bit unpredictable. But actually, it's very hard to exclude such a creature, because sometimes it is nature (as with population growth) and sometimes it is another source of unpredictable proportional change. And those idiosyncratic multiplications (or divisions) typically privilege lower initial digits.
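Here is a small simulation of that thought experiment. It builds a list of 6-digit numbers with every leading digit equally likely, then lets a "creature" multiply each one by a string of unpredictable factors. The factor range (0.5 to 2) and the number of steps (20) are arbitrary assumptions chosen for illustration, not anything measured from real data.

```python
import random
from collections import Counter

def leading_digit(x):
    """Return the first nonzero digit in the decimal representation of x."""
    for ch in repr(abs(x)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("zero has no leading digit")

def digit_shares(values):
    """Fraction of values starting with each digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

rng = random.Random(1)  # fixed seed so the run is repeatable

# Built to be uniform: 6-digit numbers, every leading digit equally likely.
built = [rng.randint(100000, 999999) for _ in range(20000)]

def perturb(x, steps=20):
    """Multiply x by `steps` independent random factors between 0.5 and 2."""
    for _ in range(steps):
        x *= rng.uniform(0.5, 2.0)
    return x

perturbed = [perturb(x) for x in built]

before = digit_shares(built)
after = digit_shares(perturbed)
for d in range(1, 10):
    print(f"{d}: {before[d]:.3f} -> {after[d]:.3f}")
```

The "before" column sits near 0.111 for every digit, while the "after" column tilts heavily toward the low digits, with 1 leading roughly 30% of the time.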

==

This explains the qualitative phenomenon that 1 appears as a leading digit more often than 9 does. But what explains the quantitative Benford's Law distribution? 

That is, why do we expect to see that about 30% of numbers start with 1, while 10% of numbers have a leading 4, and only 5% of numbers start with 9? Where do those percentages come from?

We saw above that the uniform distribution of leading digits — an 11% probability for each potential leading digit — is not stable when you multiply all the numbers by 2. If every leading digit starts out being equally represented, that stops being true after you multiply by 2.

It turns out there is a distribution of leading digits that does not get upset after multiplying by 2 in this way — it remains stable.  That special distribution is precisely the Benford's Law distribution in the first figure in this answer. And that's not just true for multiplication by 2 — the distribution is stable when you multiply by any number between 1 and 10.  The Benford's Law distribution is the only one that has this property, and once you know that, it is easy to work out what it has to be.
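The stable distribution has a standard closed form: the probability of leading digit d is log10(1 + 1/d), which gives about 30.1% for 1, 9.7% for 4 and 4.6% for 9. The sketch below prints those percentages and then checks the stability claim numerically. It draws numbers whose leading digits follow Benford's Law, multiplies them by a few constants, and measures how far the leading-digit shares move. The sample size and the particular multipliers are assumptions picked for the demo.

```python
import math
import random
from collections import Counter

# Benford's Law: probability that the leading digit is d.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """Return the first nonzero digit in the decimal representation of x."""
    for ch in repr(abs(x)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("zero has no leading digit")

def digit_shares(values):
    """Fraction of values starting with each digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

rng = random.Random(2)  # fixed seed so the run is repeatable

# Numbers in [1, 10) whose base-10 logarithm is uniform on [0, 1):
# their leading digits follow Benford's Law.
sample = [10 ** rng.random() for _ in range(100000)]

def max_gap_after_scaling(c):
    """Largest gap between Benford and the observed shares after multiplying by c."""
    shares = digit_shares([c * x for x in sample])
    return max(abs(shares[d] - BENFORD[d]) for d in BENFORD)

for d, p in BENFORD.items():
    print(d, f"{p:.1%}")

for c in (2, 3.7, 7):
    print(c, round(max_gap_after_scaling(c), 4))
```

For each multiplier the largest gap stays at the level of sampling noise, which is what stability under multiplication looks like in practice.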

For more detail on this and many other mathematical facts about Benford's Law, see the beautiful blog post by Terry Tao at http://terrytao.wordpress.com/20..., as well as a presentation (slides only) by Michelle Manes at http://www.math.hawaii.edu/home/....

==
For more info, check out:
[1] The Effective Use of Benford's Law to Detect Fraud in Accounting Data, http://dbentrance.com/blog/?p=112
[2] http://www.math.hawaii.edu/home/...
[3] http://www.rexswain.com/benford....
