Monday, June 12, 2006

Leading up to this past weekend, I had been planning to write about how the men's French Open tennis final would match two players who each came in with a phenomenal streak. That indeed happened and I will still write about it, but something else happened over the weekend in college baseball that I think tops the tennis match.

The University of South Carolina hit a mind-boggling five consecutive home runs against the University of Georgia, en route to a 15-6 win and a 1-0 lead in the teams' two-out-of-three super-regional series (the final qualifying round before the College World Series).

A simple way to estimate the probability of five homers in five at bats is to start with the Gamecocks' baseline probability of hitting a home run in any single at bat. This Southeastern Conference (SEC) baseball statistics page (updated through June 6, as I'm looking at it) tells us that, out of 2,215 at bats this season, South Carolina had hit 82 homers (.037).

Alternatively, we could increase the denominator by adding in plate appearances that are not counted as official at bats. The main source of such extra appearances is walks, however, and one could argue that many walks represent instances where the pitcher does not want to give the hitter the opportunity to swing the bat (explicitly, when there's an intentional walk, but also when a team "pitches around" a hitter). Also, using only official at bats as the denominator keeps the home run ratio a little higher, which makes my upcoming calculation a little more conservative (i.e., it helps to avoid overstating the rarity of the occurrence).

We then simply raise the Gamecocks' probability of a home run on a single at bat (.037) to the fifth power (representing the five homers), which yields roughly .00000007 (7 X 10 to the minus eighth power, or about 7 in 100 million). This type of calculation is analogous to determining that the probability of rolling double sixes on two dice is 1/36, by raising the probability of a six on a single die (1/6) to the second power.
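For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation just described. The season totals are the ones quoted above; the variable names are mine, and the whole thing rests on the assumption that each at bat is independent of the others.

```python
# Season totals quoted above (SEC statistics page, through June 6).
home_runs = 82
at_bats = 2215

# Baseline chance of a home run in any single official at bat.
p_single = home_runs / at_bats

# ASSUMPTION: at bats are independent, so the single-at-bat
# probabilities can simply be multiplied together.
streak_length = 5
p_streak = p_single ** streak_length

# For comparison, the dice analogy: double sixes on two dice.
p_double_six = (1 / 6) ** 2

print(f"single at bat:        {p_single:.3f}")
print(f"five homers in a row: {p_streak:.1e}")
print(f"double sixes:         {p_double_six:.4f} (1 in {1 / p_double_six:.0f})")
```

Using the unrounded ratio rather than .037 changes the answer only slightly; either way the result is close to 7 in 100 million.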

In the dice example, it is assumed that the outcomes of the roll of two dice are independent (i.e., the number that comes up on one die does not affect the number that comes up on the other). One may question whether the independence assumption holds up in this home run-hitting scenario. Many of you are probably thinking that the same Georgia pitcher was throwing to these batters and just kept "grooving" the ball to the hitters, based on loss of speed and/or movement on the pitches. That may be true to some extent, but it must be noted that after the first three homers of the streak, Georgia changed pitchers and the new guy gave up two more homers!

Another consideration is that I was drawn to analyze the South Carolina streak by its spectacular nature. If we were to ask instead, in all the countless college baseball games played over a period of years, how likely is it that we would find such a streak at some point, the streak would not seem so unlikely.
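To get a rough feel for how much the number of opportunities matters, here is a small Python sketch. It is only an illustration under loud assumptions: the opportunity counts are hypothetical round numbers (not actual counts of college at bats), every opportunity is treated as independent, and every team is assumed to hit homers at South Carolina's rate. The function name prob_at_least_one is my own.

```python
import math


def prob_at_least_one(p_event, n_opportunities):
    """Chance of at least one occurrence in n independent opportunities.

    Computes 1 - (1 - p)^n, using log1p/expm1 so that very small
    probabilities are not lost to rounding.
    """
    return -math.expm1(n_opportunities * math.log1p(-p_event))


# Probability of five straight homers from the calculation above.
p_streak = (82 / 2215) ** 5

# HYPOTHETICAL numbers of chances to start a five-homer streak;
# these are illustrative round figures, not real data.
for n in (100_000, 1_000_000, 10_000_000, 100_000_000):
    chance = prob_at_least_one(p_streak, n)
    print(f"{n:>11,} opportunities: {chance:.3f}")
```

The point is only qualitative: an event that is wildly improbable on any one occasion becomes much less surprising once enough occasions are counted.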

Here is a passage from the textbook I use in teaching statistics (King & Minium, 2003, Statistical Reasoning in Psychology and Education, p. 205):

Let us consider again the case of Evelyn Adams... who won the New Jersey Lottery twice in a 4-month time span in 1986. The probability of Ms. Adams doing this was 1 in 17 trillion... If there were 4,123,000 lottery tickets sold for each lottery, and Ms. Adams had purchased 1 ticket for each, the probability of her winning both was (1 / 4,123,000) (1 / 4,123,000), the same as for any other specific person who purchased 1 ticket in each lottery.

But the probability of someone, somewhere winning two lotteries in 4 months is a different matter altogether. Professors Diaconis and Mosteller (1989) calculated the chance of this happening to be only 1 in 30.
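The 1-in-17-trillion figure in the quoted passage can be checked with the same multiplication rule. This is just a sketch of that one step, using the ticket count given in the passage; it does not attempt to reproduce the Diaconis and Mosteller 1-in-30 calculation.

```python
# Tickets sold per lottery, as given in the quoted passage.
tickets_per_lottery = 4_123_000

# One specific person with one ticket in each of two independent lotteries:
# (1 / 4,123,000) * (1 / 4,123,000), i.e. 1 in 4,123,000 squared.
odds_against = tickets_per_lottery ** 2

print(f"1 in {odds_against:,}")
print(f"about 1 in {odds_against / 1e12:.0f} trillion")
```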


The citation for the original Diaconis and Mosteller article is:

Diaconis, P., & Mosteller, F. (1989). Methods for studying coincidences. Journal of the American Statistical Association, 84, 853-861.

In fact, as the above-linked article about the South Carolina homer barrage notes, the five "dingers" merely tied the NCAA record (set in 1998), rather than breaking it.

What about the tennis match that I started this write-up with? I've gone on too long for a detailed statistical analysis, so I'll just note that Rafael Nadal came into the French Open final having won 59 straight matches on clay (the surface in the French). His opponent, Roger Federer, had won 27 consecutive matches in major (Grand Slam) tournaments, capturing Wimbledon, the U.S. Open, and the Australian Open before advancing to the final in Paris (none of the three tournaments Federer won is played on clay). Nadal beat Federer, and I'll leave you to read about it here.
