My original analysis was way off. However, in thinking about the problem over the weekend and doing some preliminary number-crunching in Excel, I came to the conclusion that the odds were not as remote as I had computed. I would have pursued the analysis further, but frankly, I was unsure how to proceed and put it aside. I got lazy and decided to google to see if the problem had already been solved, and sure enough, it was - by a fellow named Brian Hayes.
My revised assumptions and methods agreed with Brian's, but I was unsure how to express the calculation elegantly as a formula. I even considered simulation, that is, running a large number of trials, but never got to use that approach. Brian thought of that ALSO. At least I was on the right track. I feel real good about THAT.
The key thing to keep in mind, DUers: Solving a complex problem involves TRIAL and ERROR. Insight after insight. Fine-tuning the analysis until one gets a sufficiently robust solution, which, after all, will always be an approximation, but a good one. What we are NOT looking for is proof; what we ARE looking for is a rational basis for rejecting coincidence.
Basically, here is how the problem was solved by Hayes:
Step 1: Get the data for 30 county-wide elections in Comal County, Texas.
Step 2: Determine the range of Republican vote totals: about 3100 possible values.
Step 3: Calculate the probability that at least THREE of the 30 county-wide elections will have the IDENTICAL NUMBER OF REPUBLICAN VOTES WITHIN THE 3100 RANGE, ASSUMING ALL POSSIBLE TRIPLICATES, ETC. ARE EQUALLY LIKELY (A UNIFORM DISTRIBUTION). In this case the number was 18,181, but it could just as well be 18,182, etc. Of course, 18,181 reversed is still 18,181 (a palindrome), but that is NOT going to be an issue here.
The bottom-line result: The odds against 3 or more elections resulting in the SAME number of votes are just 2500 to 1, a long shot, but not THAT long.
Thus, we cannot assume, based on these results alone, that the voting machines were rigged.
On the other hand, if the odds were 1 million to 1, one would be stretching it to assume the occurrence was a coincidence.
................................................................
A coincidence problem.
Brian Hayes
A friend who worries about voting fraud sent me a note about a recent election in Texas, where three winning Republicans all turned up with the same number of votes: 18181. "What's the probability of that happening by chance?" he asked.
Always a good question, of course. I asked him for his own estimate of the odds. He proposed that the probability is (1/18181)^3, which puts the event well beyond the one-in-a-trillion threshold. I disagree with this estimate, but before we can zero in on a better one, we need more facts. First of all, the election did happen. When I Googled for the pleasantly palindromic numeral "18181" the other day, the search engine reported 25,200 references on the Web. For example, there was a document titled "Projected Cash Flow for 2000 for Mrs. Nettie Worth, Rodeo Ranch, Wildparty, Kansas," which mentioned 18181 several times in connection with beef calves. (Nettie Worth? Now what's the probability of that?) Poking around a bit more, I learned that 18181 is the German postal code for the resort village of Graal Müritz in western Pomerania, and that the address of the Yorba Linda Library in California is 18181 Imperial Highway.
But among these distractions I also found numerous pointers to news stories about the November 5, 2002, election in Comal County, Texas, which is just northeast of San Antonio (county seat, New Braunfels). Eventually I came to the web site operated by the county itself. The results of the 2002 general election are posted at this URL: Here is a slightly condensed table of the vote totals. I have included only county-wide contests, and I've excluded a constitutional amendment where the votes for and against were not identified by party. The three "suspicious" 18181 totals are highlighted in the table below. The total number of ballots cast was 24362.
Race                             Republican  Democrat  Other  Total
 1. U.S. Senator                      18156      5696    350  24202
 2. U.S. Rep. District 21             19066      4627    371  24064
 3. Governor of Texas                 18558      5047    550  24155
 4. Lieutenant Governor               16504      7186    477  24167
 5. Attorney General                  17935      5498    576  24009
 6. Comptroller                       19601      3962    534  24097
 7. Commissioner of Land Office       17328      5129   1144  23601
 8. Commissioner of Agriculture       18259      4635    925  23819
 9. Railroad Commissioner             17166      5675    784  23625
10. Chief Justice Supreme Court       18051      5011    530  23592
11. Justice Supreme Court 1           17456      5387    653  23496
12. Justice Supreme Court 2           17860      5181    391  23432
13. Justice Supreme Court 3           17894      5392      *  23286
14. Justice Supreme Court 4           17175      6166      *  23341
15. Judge Criminal Appeals 1          17778      4762    821  23361
16. Judge Criminal Appeals 2          18045      5221      *  23266
17. Judge Criminal Appeals 3          18301      4604    416  23321
18. Board of Education Dist. 5        17089      5683    513  23285
19. State Senator District 25         18181      4988    723  23892
20. State Rep. District 73            18181      5303      *  23484
21. Chief Justice 3rd District        19261         *      *  19261
22. Judge 207th District              19342         *      *  19342
23. Judge 274th District                  *         *  19348  19348
24. Criminal District Attorney            *         *  19315  19315
25. County Judge                      18181      5547      *  23728
26. Judge County Court at Law         19345         *      *  19345
27. District Clerk                    19311         *      *  19311
28. County Clerk                      19554         *      *  19554
29. County Treasurer                  19306         *      *  19306
30. County Surveyor                   19229         *      *  19229
Following my friend's line of reasoning, one might well argue that the odds against this particular outcome are even more extreme than he suggested. In principle, the Republican candidates in races 19, 20 and 25 could each have received any number of votes between 0 and 24362. Thus the relevant probability is not (1/18181)^3 but (1/24363)^3, which works out to 6.9 x 10^-14. This is the probability of seeing any specific triplet of vote totals in those three races, on the assumption that the totals are independent random variables distributed uniformly over the entire interval of possible outcomes. That last assumption is rather dubious, and I'll return to it below.
More important, however, the probability that three specific candidates receive a specific number of votes is not what we really want to calculate. Would it be any less remarkable if three different winners had all scored 18181? Or, if three candidates all received the same number of votes, but the number was something other than 18181? What we have here is a "birthday problem," analogous to the classic exercise of calculating the probability that some pair of people in a group share the same birthday. The textbook approach to the birthday problem is to work backwards: First compute the probability that all the birthdays are different, then subtract this result from 1 to get the probability of at least one match. This method is easy and lucid.
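The "work backwards" recipe for the classic birthday problem can be sketched in a few lines of Python. This is only an illustration of the textbook method described above; the function name `prob_shared_birthday` and the 365-day year are assumptions made for the example, not anything from the original note.

```python
def prob_shared_birthday(people, days=365):
    """Probability that at least two of `people` share a birthday.

    Works backwards: first the chance that all birthdays differ,
    then one minus that. Assumes birthdays are uniform and independent.
    """
    p_all_different = 1.0
    for k in range(people):
        # The k-th person must avoid the k birthdays already taken.
        p_all_different *= (days - k) / days
    return 1 - p_all_different


if __name__ == "__main__":
    for n in (10, 23, 30):
        print(n, round(prob_shared_birthday(n), 4))
```

With 23 people the result just passes one half, which is the familiar textbook answer.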
Unfortunately, it's not immediately clear how to extend it to the case of three birthdays in common. For the Comal County election coincidence, I reluctantly resorted to frontward reasoning. For the moment, let's go along with the fiction that each Republican candidate had an equal chance of receiving any number of votes between 0 and 24362. Then the number of possible election outcomes (considering the Republican votes only) is 24363^30. This is the denominator of the probability. For the numerator, we need to count how many of those cases include at least one trio of identical tallies. We've already seen one way this could happen, namely with the candidates in races 19, 20 and 25 having 18181 votes each. Holding these results fixed, there are no constraints at all on the other 27 races, and so there are 24363^27 ways of achieving this outcome. But in fact we don't insist that the vote in the three coincidence races be 18181; it could be any number in the allowed range, so that we need to multiply the numerator by another factor of 24363. Thus the probability that races 19, 20 and 25 will all have the same total is 24363^28 / 24363^30.
Finally, we note that there's nothing special about the specific races 19, 20 and 25; we want the probability that any three totals are equal. How many ways can we choose three races from among the 30? In 30-choose-3 ways, of course. This number is 4060, and so the probability that at least three Republicans in Comal County would have the same vote totals on election night is:
4060 x 24363^28 / 24363^30 = 6.84 x 10^-6
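The counting argument above can be checked with exact integer arithmetic. The sketch below is just a restatement of that argument in Python, under the same uniform-distribution fiction; the function name `triple_estimate` is made up for the illustration.

```python
from fractions import Fraction
from math import comb


def triple_estimate(races, values):
    """Count-based estimate that some three races share a vote total.

    Numerator: choose the three races, pick their common total,
    and leave the remaining races - 3 totals unconstrained.
    Denominator: every possible list of `races` totals.
    """
    numerator = comb(races, 3) * values * values ** (races - 3)
    denominator = values ** races
    return Fraction(numerator, denominator)


if __name__ == "__main__":
    print(comb(30, 3))                         # number of ways to pick three races
    print(float(triple_estimate(30, 24363)))   # estimate for totals 0..24362
```

The fraction reduces to 30-choose-3 divided by the square of the number of possible totals, so the 28th and 30th powers never need to be evaluated by hand.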
That is roughly 1 chance in 146,000, far less extreme than one in a million. At this point the most doubtful part of the analysis is the assumption that the votes are uniformly distributed across the range from 0 to 24362. The true distribution is unknowable, but surely we can make a better estimate than that. All the actual votes for Republican candidates lie in the interval from 16504 to 19601, a range that encompasses 3098 possibilities. Suppose we round this up to 3100 and assume -- or pretend -- that each of the 3100 totals is equally likely. Then the revised probability estimate becomes:
4060 x 3100^28 / 3100^30 = 4.22 x 10^-4
In other words, the odds against such a three-way coincidence are somewhere near 2500 to 1. Note that included within this estimate are cases with more than just a trio of identical votes, such as four totals that are all the same, or a "full house" result of three-of-a-kind plus a pair. But those further coincidences are unlikely enough that they don't make much difference. The probability of exactly one triplet (and all other vote totals distinct) is 3.67 x 10^-4.
As for my friend who worries about election tampering -- did this line of argument put his mind at ease? He replied by asking if I had properly accounted for the ingenuity of those who fix elections. If they are able to determine the outcome, could they not also arrange to make it look statistically acceptable? The question deserves to be taken seriously. According to my analysis, the narrower the range of vote totals, the less suspicious is the appearance of an identical triplet. So should we rest easy about such coincidences if all the winning totals lie within a range of, say, 100 votes? Suppose you have been appointed Rigger of Elections in Comal County. Because of technical limitations, you cannot specify the exact number of votes that each candidate will receive, but you can set the mean and the variance of the normal distribution from which the vote totals will be selected at random. Your job is to ensure that all of your party's candidates win, without arousing the suspicions of the public. What is the optimal strategy?
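The dependence on the width of the range is easy to tabulate. The sketch below reuses the count-based estimate from the analysis (30-choose-3 divided by the square of the number of possible totals) for a few widths; the name `first_order_estimate` and the extra width of 1000 are assumptions added for illustration, and nothing here addresses the rigging puzzle itself. For a very narrow range the figure is better read as an expected number of triplets than as a probability.

```python
from math import comb


def first_order_estimate(races, values):
    """Count-based estimate of a three-way tie among `races` totals,
    each assumed uniform over `values` equally likely outcomes."""
    return comb(races, 3) / values ** 2


if __name__ == "__main__":
    # 24363 and 3100 come from the analysis; 100 is the "say, 100 votes" case.
    for width in (24363, 3100, 1000, 100):
        p = first_order_estimate(30, width)
        print(f"{width:>6} possible totals: estimate {p:.3g}, about 1 in {1 / p:,.0f}")
```

Because the estimate scales as one over the square of the width, shrinking the range by a factor of ten makes a triplet about a hundred times less surprising.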
I need to end this note with a confession. The analysis given above was not my first attempt to calculate the probability of a three-way coincidence. I had tried several other approaches, and each time got a different answer. So what makes me think the answer given here is the right one? Simple. I kept trying until I got a result in agreement with a computer simulation. I'm not proud of this method of doing mathematics. I would much prefer to be one of those people with an unerring Gaussian instinct for the right way to solve a problem. But it won't do to pretend. So what excuse can I make for myself? Do I have more faith in the computer and its pseudorandom number generator than I have in mathematics? I would prefer to put it this way: I have more faith in the laws of probability than in my own ability to reason accurately with them.
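The note does not say how the simulation was written, so the following is only one guess at what such a check might look like: draw 30 totals uniformly from 3100 possibilities, see whether any total turns up three or more times, and repeat. The function name, the trial count and the seed are all assumptions chosen for the sketch.

```python
import random
from collections import Counter


def simulate_triple_rate(races=30, values=3100, trials=100_000, seed=1):
    """Fraction of simulated elections in which some vote total
    occurs three or more times among `races` uniform random totals."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    hits = 0
    for _ in range(trials):
        totals = rng.choices(range(values), k=races)
        if max(Counter(totals).values()) >= 3:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(simulate_triple_rate())
```

With only a few dozen hits expected in 100,000 trials, the result is noisy; more trials would be needed before comparing it closely with the 4.22 x 10^-4 figure.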
Editor's note: The astute reader will notice that Brian added the probabilities for each triplet to get the probability that at least one triple occurs. This is not quite correct since these events are not mutually exclusive. So what he is actually computing is the expected number of triples. In the classical birthday problem, if you ask for the number of people that will make the expected number of birthday coincidences greater than 1/2, the answer is 20, which is less than the number 23 required to make the probability greater than 1/2 that two people have the same birthday. See Chance News 6.13 for an occasion when this difference was important. When the number of birthdays is large, these two numbers become very close, so we can expect Brian's calculation not to be affected very much by this.
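The 20-versus-23 comparison can be reproduced directly. The sketch below is an illustration only, with made-up function names: it finds the smallest group for which the expected number of matching pairs exceeds 1/2, and the smallest group for which the probability of at least one matching pair exceeds 1/2, assuming 365 equally likely birthdays.

```python
from math import comb, perm

DAYS = 365


def expected_pairs(n):
    """Expected number of pairs sharing a birthday among n people."""
    return comb(n, 2) / DAYS


def prob_some_pair(n):
    """Probability that at least one pair among n people shares a birthday."""
    return 1 - perm(DAYS, n) / DAYS ** n


def first_n_exceeding_half(f):
    """Smallest n for which f(n) is greater than 1/2."""
    n = 1
    while f(n) <= 0.5:
        n += 1
    return n


if __name__ == "__main__":
    print(first_n_exceeding_half(expected_pairs))  # prints 20
    print(first_n_exceeding_half(prob_some_pair))  # prints 23
```

The expected count always sits at or above the probability, which is why it crosses 1/2 sooner.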
The correct calculation for a match of three or more is not, in principle, difficult. As Brian said, the probability of at least one pair with the same birthday is computed by counting the number of outcomes with no pair and subtracting the resulting probability from 1. For a match of 3 we also have to subtract the probability that at least one pair have the same birthday but no three people do. This is not so easy, and that is probably why Brian had a problem with what he calls the backward method. You can find a nice discussion of the result of this computation at the Mathcad library. If we use the formula given here to compute the probability of a match of 3 or more with 24363 possible birthdays, we get 6.83 x 10^-6 as compared to Brian's 6.84 x 10^-6. So much for our quibble!
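One way to carry out the backward calculation exactly is to count the outcomes in which no total occurs more than twice, that is, outcomes made up only of singles and pairs. The sketch below is a plain combinatorial count written for this illustration; it is not the formula from the library mentioned above, and the function name is hypothetical.

```python
from fractions import Fraction
from math import factorial, perm


def prob_triple_or_more(draws, values):
    """Exact probability that some value occurs three or more times
    among `draws` independent uniform picks from `values` outcomes."""
    no_triple = 0
    for pairs in range(draws // 2 + 1):
        singles = draws - 2 * pairs
        # Ways to split the draws into `pairs` unordered pairs plus singles.
        splits = factorial(draws) // (
            2 ** pairs * factorial(pairs) * factorial(singles)
        )
        # Give each pair and each single its own distinct value.
        no_triple += splits * perm(values, pairs + singles)
    return 1 - Fraction(no_triple, values ** draws)


if __name__ == "__main__":
    exact = prob_triple_or_more(30, 24363)
    print(float(exact))              # exact probability of a triple or more
    print(4060 / 24363 ** 2)         # the sum-of-triplets figure, for comparison
```

For 30 races the exact value comes out slightly below the sum-of-triplets figure, as it must, since the sum counts overlapping coincidences more than once.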
--------------------------------------------------------------------------------
Bill Montante asked for comments on his definition of the word "chance." You can send them directly to him at [email protected], but please also send them to us at [email protected], since I guess we should try to decide what it means also.
Defining Chance
Bill Montante