When is a hot (or cold) start actually safe to call a trend?

Chicago White Sox rookie Jose Abreu is dominating the league (or is he just having a hot month?)

Nam Y. Huh/AP

As we approach the end of April, we start to enter a nebulous period in baseball stats. Everyone knows that a baseball player can have a hot (or cold) week by random chance. Sometimes that happens at the beginning of April and it’s funny to go back and re-watch ourselves over-react to some early season mirage. Remember when Grady Sizemore was going to save the Red Sox?

But now at the end of the first month, Martin Perez has been a revelation to the Rangers pitching staff. The Angels’ Albert Pujols, after an awful 2013, has come back to life and looks a lot more like … well, Albert Pujols. Chicago White Sox rookie Jose Abreu is already in double digits in home runs, and Jason Vargas and his 1.54 ERA look like a very smart signing for the Royals. And that’s just the American League.

It’s become more common over the past decade to be cautious about over-interpreting small sample sizes. In fact, the phrase “small sample size” pops up a lot in the baseball media. It leaves many to wonder, just when does a sample size stop being small? We know that baseball players can and do change over time. For those who have shown big changes from their past patterns in the early going, when can we start believing it? Should I get that full back tattoo of Abreu’s jersey?

There’s no pinpoint moment when a sample changes from “unreliable” to “reliable.” Some fans might be tempted to point to articles such as this one or this one or this one that have been written on the subject, with authoritative-looking charts on when that exact magical point of reliability is. I’m familiar with those articles, mostly because I wrote them. That methodology is a nice place to start, because it gives us some way to begin to answer the question.

But before we begin…

Warning! Gory mathematical details ahead!

How big a sample do we need before we can trust that a hitter’s performance is not an illusion? (For the sake of convenience, we’ll stick to hitters.) A hundred plate appearances? Two hundred? The answer is that it depends on the stat. But for the moment, let’s take strikeout rate, which is simply the number of times a hitter has struck out divided by the number of plate appearances he has made.

Let’s ask a more basic question: What do we mean when we say that a sample is reliable? I’d argue for this definition: if we gave all batters some equal number of plate appearances, and then could somehow magically allow all of them to re-take those plate appearances in roughly the same circumstances (the same suite of opponents, the same weather, etc.), each batter’s strikeout rate would come out roughly the same the second time around. Maybe not exactly, but close enough to where the numbers wouldn’t be all over the place.

We can’t magically turn back time, but we can do something similar. Suppose that I wanted to see how reliable strikeout rate was given a sample of 100 plate appearances. Here’s how I’d do it. I’d go back to 2013 (or further, if you care to) and take the first 200 plate appearances for each hitter. But I wouldn’t look at them as “the first 100” and “the second 100.” Instead, I’d number them and sort them into odd-numbered and even-numbered plate appearances. That gives us two groups of 100, and those groups are going to have a lot in common. There will be plate appearances in each group from the same games, against the same opponents, and there’s a decent chance that some of the same pitchers show up in both groups. It won’t be a perfect match, but in baseball it’s the closest that we’re going to get.

Now that we have our two groups for all players (at least the ones who had 200 plate appearances), we can calculate their strikeout rates in the odd-numbered group and the even-numbered group and put the two sets of results next to each other, comparing a player to himself. We can tell how well all of these pairs of numbers track each other by running a Pearson correlation. In statistics, this is known as split-half reliability. If you have a stats final coming up, feel free to steal that. If you’ve never taken stats, a Pearson correlation can either be positive (as one number goes up, the other goes up) or negative (as one goes up, the other goes down). They can range in strength from zero (knowing one number tells you absolutely nothing about the other number) to one (if you know one number, you can perfectly predict the other). We are hoping to see something closer to a strength of one, because it means that the two groups are close to each other, and that the stat is reliable.
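
For readers who want to try the odd/even trick at home, here is a minimal Python sketch. To be clear about what is assumed: the hitters are simulated rather than taken from real 2013 data, the function names (pearson, split_half_rates, simulate_hitter) are made up for this example, and the idea that true strikeout rates are spread evenly between 10 and 30 percent is a convenience, not a finding.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def split_half_rates(outcomes):
    """outcomes: 0/1 flags in chronological order (1 = strikeout).
    Returns (rate in odd-numbered PA, rate in even-numbered PA)."""
    odd = outcomes[0::2]   # 1st, 3rd, 5th, ... plate appearances
    even = outcomes[1::2]  # 2nd, 4th, 6th, ... plate appearances
    return sum(odd) / len(odd), sum(even) / len(even)

def simulate_hitter(rng, true_k_rate, n_pa=200):
    """Hypothetical hitter: each PA is an independent coin flip (an assumption)."""
    return [1 if rng.random() < true_k_rate else 0 for _ in range(n_pa)]

rng = random.Random(2013)  # fixed seed so the run is repeatable
hitters = [simulate_hitter(rng, rng.uniform(0.10, 0.30)) for _ in range(300)]

halves = [split_half_rates(h) for h in hitters]
odd_rates = [odd for odd, _ in halves]
even_rates = [even for _, even in halves]

r = pearson(odd_rates, even_rates)
print(f"split-half correlation with 100 PA per half: {r:.2f}")
```

With real data, the list of simulated hitters would be replaced by each player's actual sequence of plate appearance outcomes; the split and the correlation stay the same.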

In general, more plate appearances make for a more reliable sample. We know this intuitively. We don’t believe that a hitter who has a 2-for-5 night is really ready to challenge Ted Williams. But, if he’s 100 for 250 so far, we might figure that he has a chance. If he’s 200 for 500 maybe that is the Splendid Splinter come back to life. Statistically, this shows up as well. As your sample size increases, the strength of that correlation goes up.
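
One rough way to put numbers on that intuition is the standard error of a proportion. The sketch below assumes every at-bat is an independent coin flip with the same chance of a hit, which is a simplification that real hitters do not obey, and the function name batting_average_se is invented for this example.

```python
from math import sqrt

def batting_average_se(hits, at_bats):
    """Return (batting average, binomial standard error of that average)."""
    avg = hits / at_bats
    return avg, sqrt(avg * (1 - avg) / at_bats)

# The three .400 hitters from the paragraph above.
for hits, at_bats in [(2, 5), (100, 250), (200, 500)]:
    avg, se = batting_average_se(hits, at_bats)
    print(f"{hits}-for-{at_bats}: average {avg:.3f}, give or take about {se:.3f}")
```

All three hitters are batting .400, but the give-or-take shrinks from roughly 219 points of batting average after 5 at-bats to about 31 points after 250 and about 22 points after 500.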

Deciding when it’s “reliable enough” is tricky. In reality, it’s just a matter of the sample getting slightly more reliable each time, and your decision of whether to call it a no-longer-small sample size or to get that Abreu tattoo is a matter of how much certainty you like. But, because people like lines drawn, I’d recommend that the best place to draw it is when that correlation reaches a strength of 0.7. A correlation of 0.7, squared, comes to 0.49, so it’s roughly at that point where we cross the threshold of the similarities in the scores being half due to the player himself and half due to things beyond the player’s control.

If you do this math for strikeout rate, you find that about 60 plate appearances is enough to produce a reliable measure of a player’s true talent. For walk rate, it’s about 120 plate appearances. For something like on-base percentage, it actually takes about 460 plate appearances. For variables like how often a hitter swings or how often he makes contact, the number can be as low as 40 PA. (For a full list, I recommend this.) A note on method: there’s a somewhat better, though more mathematically complex, way of doing this called the Kuder-Richardson method. It produces the same sort of results, and it is the one that I am using here. If you want a tutorial, Google it.
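
For the curious, here is a minimal sketch of one member of the Kuder-Richardson family, the textbook KR-20 formula, applied to strikeouts. It is not necessarily the exact computation behind the numbers above. The data are simulated, the function name kr20 and the 10-to-30 percent spread of true strikeout rates are assumptions for illustration, and each plate appearance is treated as one yes/no "item."

```python
import random
from statistics import pvariance

def kr20(matrix):
    """KR-20 reliability for a hitters-by-PA grid of 0/1 outcomes.
    matrix[i][j] is 1 if hitter i struck out in his j-th PA, else 0."""
    n = len(matrix)         # number of hitters
    k = len(matrix[0])      # plate appearances per hitter
    totals = [sum(row) for row in matrix]
    total_var = pvariance(totals)  # variance of strikeout totals across hitters
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in matrix) / n  # share of hitters striking out in PA j
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / total_var)

rng = random.Random(60)  # fixed seed so the run is repeatable
grid = []
for _ in range(400):                     # 400 hypothetical hitters
    true_rate = rng.uniform(0.10, 0.30)  # assumed spread of true talent
    grid.append([1 if rng.random() < true_rate else 0 for _ in range(60)])

reliability = kr20(grid)
print(f"KR-20 reliability at 60 PA: {reliability:.2f}")
```

The appeal over the odd/even split is that KR-20 uses every plate appearance at once instead of depending on one particular way of cutting the sample in half.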

So, for a regular starter, 120 plate appearances is roughly one month of work. Hey wait a minute … we’re coming up on a month into the season. Does that mean that we can now be fairly certain that guys who have been walking a lot more (or less) and striking out a lot less (or more) have turned the corner? Happy happy, joy joy! Not so fast, Stimpy.

When we talk about a month being enough for a reliable sample, we’re saying that a player’s strikeout and walk numbers are a reasonable estimate of his talent level during that month. The month that is now past. Remember how we wished that we could magically rewind time and see how things would unfold given a second go-around? Life doesn’t actually work like that. The set of circumstances that led to the great (or awful or mediocre) performance, whatever they were, isn’t going to repeat itself. How will our player respond to a new set of circumstances?

A cautionary tale to illustrate how silly it can be to assume that one month will accurately mirror a season: Below are performances from six different calendar months (all from 2013). All of them involved between 111 and 119 plate appearances. I’ll present you with the walk and strikeout rates that were achieved in those months and see if you can guess which player each of the months belongs to.

 

Player      Strikeout Rate    Walk Rate
Player A    24%               10%
Player B    15%               3%
Player C    13%               13%
Player D    21%               15%
Player E    18%               11%
Player F    18%               7%

Player A either walks or strikes out in more than a third of his plate appearances. Player B puts the ball in play a lot more but doesn’t walk much. Player C has excellent plate discipline and manages to walk as often as he strikes out. Player D is another walk and strikeout machine, but is much more balanced than Player A. Players E and F look a bit alike.

OK, who are the mystery players? They are (in order) Anthony Rizzo of the Cubs in April 2013, Rizzo in May 2013, Rizzo in June 2013, Rizzo in … you get the idea. All of those samples are big enough to be considered “reliable.” Will the real Rizzo please stand up?

Rizzo’s an outlier (it’s why I picked him), but he’s an example of what can happen. It’s tempting to believe that a baseball player is the same in April as he is in August, because we think about baseball in terms of seasons. We don’t often recap players by what they did in June, but rather what they did in 2013, and that view from 30,000 feet has a way of making hills and valleys look the same. Maybe in May, Rizzo decided he wasn’t going to take as many pitches and was going to be more aggressive. Maybe by August he had fallen back into his old ways. It’s hard to know. Baseball players grow and change like everyone else, because they are human. It’s hard for humans to change a behavior and to sustain that change (how’s that New Year’s Resolution going?), so we shouldn’t be surprised if a good month, or even a good couple of months, doesn’t stay that way.

What we can know and what we can’t

So, should you believe that gaudy April stat line? If you want to buy into a player being the next big breakout star (or you want to say “it won’t last”) you have to understand why the line changed to begin with. Is our player swinging more? Is that a good thing for him? Sometimes that’s hard to do, even for people who do this for a living. If it’s something that the player is doing differently and it’s going well, what is the likelihood that he’ll be able to keep it up? Yes, small sample sizes can deceive, but even once we get past that stage, predicting what’s going to happen from May until September takes more than just looking at the rates in April and adding five extra months. It’s an assumption to believe that he’ll keep it up. Maybe it’s not a bad assumption, but it’s no guarantee. In a culture that likes guarantees, that’s a hard thing to say, but it’s the way that things really are.