Monday, July 02, 2012

D-Backs' Hill Hits for the Cycle -- Twice Within 10 Games!

Three nights ago, on Friday, June 29, Aaron Hill of the Arizona Diamondbacks hit for the cycle, the second time he had done so in 10 games. The previous occasion was on June 18 (see the D-Backs' game-by-game log).

According to a Wikipedia page on the subject, "Cycles are uncommon in Major League Baseball (MLB), and have occurred 294 times... The cycle is roughly as common as a no-hitter (272 occurrences in MLB history); it has been called 'one of the rarest' and 'most difficult feats' in baseball."

Given the rarity of hitting for the cycle, one might think it extremely rare for any player to have done it twice (or more) in a career, never mind twice in one season or twice in the same month. Remarkably enough, multiple-cycle seasons for the same player have occurred three times previously, although you have to go back 81 years for the last instance before 2012 (Babe Herman in 1931). Even more remarkably, the first two players to hit for the cycle twice in the same season (John Reilly in 1883 and Tip O'Neill in 1887) apparently accomplished the feat in quicker succession than Hill did. Reilly and O'Neill are each listed as having hit for the cycle 7 days apart; unless either of them faced a heavy supply of doubleheaders, each player's two cycles almost certainly occurred within fewer than 10 games.

So how likely (or should I say, unlikely) was Hill's recent pair of cycles within 10 games? An excellent starting point is Jeff Sackmann's 2010 article, "The Odds of a Cycle." The foundation of Sackmann's tutorial is the relatively simple "multiplication rule" for joint probability (e.g., obtaining the probability of double-sixes on dice by multiplying together the probability of a six on each die, 1/6 x 1/6 = 1/36).

Also helpful was a brief analysis specifically focused on Hill's recent pair of cycles, by Justin Hunter, which provided some of the necessary statistical inputs for Hill. I used Hill's empirical frequencies of recording different types of hits, provided by Hunter, rather than what Sackmann used (something known as CHONE projections). A key insight of Hunter's was to identify Hill's historically low rate of producing triples (which was actually around 0.4% instead of the 0.3% figure Hunter gave).

Sackmann starts out with the simple case of a "natural cycle," that is, hitting a single, double, triple, and home run in that specific order, in four total plate appearances (PA). Filling in Hill's probabilities of the different types of hit, shown in the following table, we then multiply the numbers in the first four columns by each other, yielding the answer in the fifth column.

p (single)   p (double)   p (triple)   p (homer)   p (Nat Cycle | 4 PA)
.162         .053         .004         .027        .000000927
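The multiplication described above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are mine, and the four plate appearances are treated as independent, as in Sackmann's approach.

```python
import math

# Hill's per-plate-appearance probabilities for each hit type (from the table).
p_single, p_double, p_triple, p_homer = 0.162, 0.053, 0.004, 0.027

# Multiplication rule: single, then double, then triple, then homer,
# in exactly four plate appearances.
p_natural_4pa = math.prod([p_single, p_double, p_triple, p_homer])

print(f"{p_natural_4pa:.9f}")  # 0.000000927
```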

As Sackmann further notes, having the benefit of 5 or 6 plate appearances in a game instead of 4 improves a player's odds of hitting for a natural (or any) cycle.

With 5 plate appearances, there are 5 ways a natural cycle can happen, as listed below (1, 2, 3, and H represent, respectively, a single, double, triple, and home run; x = non-fitting outcome, such as an out, whose probability we apparently don't care about):

x123H, 1x23H, 12x3H, 123xH, 123Hx


For 5 PA games, we thus multiply our original probability (.000000927) by 5, yielding .00000464.

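With 6 PA, there are 15 ways a natural cycle can happen (those with semi-advanced mathematical training may be familiar with the n-choose-k principle, in this case, 6 choose 4). These are the 15 ways:

xx123H, x1x23H, x12x3H, x123xH, x123Hx, 1xx23H, 1x2x3H, 1x23xH, 1x23Hx, 12xx3H, 12x3xH, 12x3Hx, 123xxH, 123xHx, 123Hxx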


For 6 PA, we would multiply our original probability (.000000927) by 15, yielding .0000139.
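The counts of 5 and 15 can be verified by brute force. The sketch below (the function name is my own, purely illustrative) picks 4 of the available plate appearances for the hits, keeps them in natural order, and marks the rest as x. Like the text, it simply multiplies the 4-PA probability by the number of layouts and ignores the probability of the x outcomes.

```python
from itertools import combinations

def natural_cycle_layouts(n_pa):
    """Every way to place 1, 2, 3, H (in that order) among n_pa plate appearances."""
    layouts = []
    for slots in combinations(range(n_pa), 4):  # slots come out in increasing order
        row = ["x"] * n_pa
        for slot, hit in zip(slots, "123H"):
            row[slot] = hit
        layouts.append("".join(row))
    return layouts

p_natural_4pa = 0.000000927  # 4-PA natural-cycle probability from the text

for n_pa in (5, 6):
    layouts = natural_cycle_layouts(n_pa)
    # Number of layouts, then that count times the 4-PA probability.
    print(n_pa, len(layouts), f"{len(layouts) * p_natural_4pa:.9f}")
    print(" ".join(layouts))
```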

To get one overall probability of a natural cycle for Hill, we next average the probabilities given 4, 5, and 6 plate appearances, weighted by the frequency of occurrence of the different numbers of PA. Sackmann provided relative frequencies of how often players (in general) get different numbers of PA in a game -- not figures specific to any particular player, but overall averages. Because it would be too time-consuming to look at box scores for Hill's nearly 1,000 games played, we'll use the overall relative frequencies, too.

PA   Proportion of time (avg. player)   p (Nat Cycle) with that # of PA   Column 2 x Column 3
3    .101                               ---                               ---
4    .591                               .000000927                        .000000548
5    .274                               .00000464                         .00000127
6    .034                               .0000139                          .000000473

Summing the numbers in the last column (and dividing by 1.00 for the aggregated PA proportions) yields .00000229, the weighted average for the probability of Aaron Hill attaining a natural cycle (based on his batting statistics, weighted by PA patterns for an average player).
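For anyone who wants to reproduce the weighted average, here is a minimal sketch using the table's figures. The dictionary names are mine; the 3-PA probability is entered as zero because a cycle needs at least four hits.

```python
# Share of games with each number of plate appearances (average player, per Sackmann).
pa_share = {3: 0.101, 4: 0.591, 5: 0.274, 6: 0.034}

# Hill's natural-cycle probability given each number of plate appearances.
p_natural_given_pa = {3: 0.0, 4: 0.000000927, 5: 0.00000464, 6: 0.0000139}

# Weight each probability by how often that PA count occurs, then normalize.
weighted_sum = sum(pa_share[n] * p_natural_given_pa[n] for n in pa_share)
p_natural = weighted_sum / sum(pa_share.values())

print(f"{p_natural:.8f}")  # 0.00000229
```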

In reality, however, we're interested not just in natural cycles, but all possible kinds of cycles. As Sackmann says, “There are 24 permutations of the sequence 'single, double, triple, home run.' Thus, a garden-variety cycle is 24 times more likely than a natural cycle.” For scenarios beginning with a single, there are 6 possible sequences (only the first of which is a natural cycle):

123H, 12H3, 132H, 13H2, 1H23, 1H32


There are also 6 sequences beginning with a double, 6 beginning with a triple, and 6 beginning with a homer, thus yielding the 24 permutations. Multiplying our previous probability by 24 yields the following probability of Hill hitting for the cycle in a game (regardless of how the single, double, triple, and homer are sequenced).

24 x .00000229 = .000055 (roughly once every 18,182 games)
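The count of 24 orderings and the resulting per-game probability can be confirmed as follows. This is just an illustrative check; the variable names are my own.

```python
from itertools import permutations

# All orderings of single (1), double (2), triple (3), and homer (H).
orders = ["".join(p) for p in permutations("123H")]
starting_with_single = [o for o in orders if o.startswith("1")]

p_natural = 0.00000229          # weighted natural-cycle probability from the text
p_cycle = len(orders) * p_natural

print(len(orders), len(starting_with_single))  # 24 6
print(f"{p_cycle:.6f}")                        # 0.000055
print(round(1 / 0.000055))                     # 18182
```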

Before we get to the final steps, let's reflect on what we've discovered thus far. We have a player, Aaron Hill, who would be expected to hit for the cycle once every 18,182 games. Yet, he has done it twice in only 10 games!

We now bring in an online binomial probability calculator, which can help us answer questions of the form: How likely is it that, with an underlying probability of .000055 of an event (the cycle) occurring, 2 or more occurrences will be observed in 10 trials? This probability is .000000136, or approximately once in every 7 million (7,352,941) 10-game sets.
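The same figure can be obtained without an online calculator. The sketch below is my own helper, not the calculator's method; it sums the binomial probabilities for 2 through 10 cycles in 10 games, assuming the games are independent with the same per-game probability.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summing the upper tail directly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_cycle = 0.000055  # Hill's estimated per-game probability of a cycle
p_two_plus = prob_at_least(2, 10, p_cycle)

print(f"{p_two_plus:.3g}")  # 1.36e-07
```

Summing the upper tail directly, rather than subtracting the probabilities of 0 and 1 cycles from 1, avoids losing precision when the answer is this small.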

Ten-game sets are pretty numerous, however. When a player appears in his first career game, he establishes a 10-game set (i.e., Games 1-10); with his next game, he establishes another (Games 2-11), and so on. Sackmann notes that, “When 30 teams play a 162-game season, that's 43,740 player-games …”

In earlier times, there were fewer teams than 30 and fewer games on a team's schedule than 162. For simplicity, however, let's use Sackmann's figure of 43,740 player-games per year, each of which inaugurates a new 10-game stretch, with rare exception (e.g., when a player has 9 or fewer games left in his career).

Dividing 7,352,941 (the expected number of 10-game stretches needed for Hill, or a player of similar abilities, to hit for the cycle 2 or more times within one stretch) by 43,740 (a rough estimate of the number of player-specific 10-game stretches launched in a single season) yields 168.1. We would thus expect a double-cycle by the same player within 10 games about once every 168 years.
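The final division is simple enough to check directly. The variable names below are mine, and the inputs are the rounded figures quoted in the text.

```python
# Expected number of 10-game stretches per double-cycle (1 / .000000136, from the text).
stretches_per_double_cycle = 7_352_941

# Sackmann's figure for player-games per season; each one starts a new 10-game stretch.
player_games_per_season = 43_740

years_per_double_cycle = stretches_per_double_cycle / player_games_per_season

print(round(years_per_double_cycle, 1))  # 168.1
```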

Major League Baseball (along with its organizational forerunners) has been around for 143 years, so it is not all that surprising to see two cycles by the same player occur within 10 games at least once. In fact, it apparently has happened 3 times, but only once after 1887.

If you've stuck with me this far, you're probably really interested in the subject matter! I would recommend the following article to gain additional perspective:

Stefanski, Leonard A. (2008). The North Carolina lottery coincidence. The American Statistician. 62, 130-134.

Stefanski reflects upon the different interests of statisticians and laypersons (including the media) in understanding rare occurrences involving lotteries, sports, and perhaps other phenomena. Whereas laypersons seem to be interested in what Stefanski calls a "narrow" perspective (e.g., Aaron Hill's achievement), statisticians embrace a "wide" perspective, seeking to contextualize an occurrence in the larger set of opportunities for an event to occur. Writes Stefanski:

Statisticians should point out when seemingly rare events are not really that rare. But in doing so we should not lose sight of the fact that for some human interest stories, a probability calculation from the “narrow perspective” is appropriate. My hunch is that we sometimes do lose sight of the human-interest angle because we are geared toward the “wide” perspective (p. 131).

If you notice any faulty assumptions or calculations, please let me know in the Comments section or by e-mailing me via my faculty homepage (see links section to the right).
