Readers of this site are probably aware that you can’t read too much into statistics this early in the season because players have only given us a small sample of data to work with. The concept is that there isn’t enough data in a small sample to make an informed conclusion about the data that is available. This is technically true, but one of my biggest pet peeves in baseball analysis is that analysts seldom, if ever, define the population of data they are sampling and therefore do not actually prove that the sample is small. This drives me up the wall because frequently even good analysts suggest that a sample is small, when, in fact, it is not. Good baseball analysts also tend to dismiss trends drawn from small samples too quickly. The purpose of this post is to correct a few of these assumptions, and hopefully offer something helpful to the debate.
The number one mistake baseball analysts make when discussing a small sample of data is failing to understand what defines a small sample. Instead, the term is thrown around to describe any data that the analyst feels was collected over a period of time that does not accurately reflect the player or team’s true potential performance. You may, for example, see someone suggest that through early May it is unwise to read too much into a player’s performance because the amount of data provided constitutes a small sample. This seems true because we know, for example, that Mark Teixeira’s numbers through May of last year did not reflect his season totals, but in Tex’s case early May actually was a large sample, by definition. Don’t take my word for it. According to Kaplan’s MBA Fundamentals Statistics, a small sample is “defined as a sample size of 30 or fewer items” (page 98). That means that by virtually any measure (games, at-bats, plate appearances) Tex had provided us a large sample of data to work with through the first five weeks of the season. That sample just didn’t reflect the performance we’d wanted to see.
The number 30 is not an arbitrary cutoff point for separating small and large samples. It is rooted in an important distinction in statistical inference, the use of statistics to draw conclusions about populations of data (the entire collection of data possibly available) using only samples (subsets of those entire populations). At samples of 31 data points or more, the t-distribution, which is the distribution used in statistics to draw conclusions when only a small sample of data is available, approximates the normal distribution. At that size, the two distributions are approximately the same (Kaplan, page 104). At 30 observations or fewer, the t-distribution is still bell-shaped, like a normal distribution, except that it has fatter tails. This means that an estimate of the mean drawn from such a sample is more likely to be influenced by outlier values than would be the case under a normal distribution.
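The fat-tails point can be checked numerically. What follows is a minimal sketch using only the Python standard library; the t-distribution density is written out from its textbook formula, and the degrees-of-freedom values (5, 10, 30, 100) and the comparison point of 3 are arbitrary choices for illustration, not figures from Kaplan. For a sample mean, the degrees of freedom are one less than the sample size.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t-distribution with `df` degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def tail_ratio(x: float, df: int) -> float:
    """How many times denser the t-distribution is than the standard normal at x."""
    return t_pdf(x, df) / NormalDist().pdf(x)


# Compare the two curves far from the center (x = 3 is an illustrative choice).
for df in (5, 10, 30, 100):
    print(f"df={df:>3}: t density / normal density at x=3 is {tail_ratio(3.0, df):.2f}")
```

The ratio stays above 1 at every size, which is the fat tail, but it shrinks steadily as the degrees of freedom grow. By about 30 the two curves are close near the center and differ mainly in the far tails.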
This latter point is why it is correct to caveat, but not entirely reject, conclusions drawn from small samples. It’s not that those conclusions are wrong; it’s just that extreme outlier values in the sample can distort the mean. Robinson Cano provided us a perfect example on Saturday. Through Friday he was batting .276/.300/.483. He had a great game Saturday and entered Sunday’s game batting .324/.343/.618. Cano provided outlier performances before and during Saturday’s game. Those performances knocked his slash stats all over the place.
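The arithmetic behind a swing like Cano’s is easy to sketch. The post does not give his at-bat totals, so the counts below are assumptions: 8-for-29 before Saturday and a 3-for-5 game are simply figures that reproduce the quoted .276 and .324 batting averages, and the 160-for-580 late-season line is entirely hypothetical.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Hits divided by at-bats."""
    return hits / at_bats


# ASSUMED counts: chosen only because they match the quoted averages.
early_hits, early_at_bats = 8, 29   # .276 through Friday
game_hits, game_at_bats = 3, 5      # one big game on Saturday

before = batting_average(early_hits, early_at_bats)
after = batting_average(early_hits + game_hits, early_at_bats + game_at_bats)

# HYPOTHETICAL late-season line with the same .276 average, plus the same game.
late_hits, late_at_bats = 160, 580
late_before = batting_average(late_hits, late_at_bats)
late_after = batting_average(late_hits + game_hits, late_at_bats + game_at_bats)

print(f"Early season: {before:.3f} -> {after:.3f}")
print(f"Late season:  {late_before:.3f} -> {late_after:.3f}")
```

Under these assumptions the same 3-for-5 game moves the early-season average by nearly fifty points and the late-season average by about three. That is the sense in which an outlier knocks a small sample around.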
If you accept by now that samples of 31 or more are actually large, then baseball players provide us with large samples quickly. For batters, a large sample can be said to be 31 plate appearances, a mark most Yankee position players have already reached. For pitchers, a large sample can be said to occur after 31 innings. This will take a little while longer, but CC Sabathia is almost there.
This, then, raises a different question. If large samples have more than 30 observations, and can be used to draw conclusions with few caveats, why then do the first 31 plate appearances of a player’s season, or the first 31 games of a team’s season, often fail to predict the rest of the player’s or team’s season? There are several answers to this. The first has to do with the standard error of the sample mean. As more observations are collected, the standard error goes down, which improves the accuracy of conclusions. Another perfectly valid explanation may be that the true population of baseball data is not in fact normally distributed, which throws almost all our assumptions about mean tendencies out the window (and is a far more complex topic).
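The first of these answers, that more observations improve accuracy, can be put in numbers. This is a rough sketch under a simplifying assumption that is not in the post: each plate appearance is treated as an independent trial with a fixed true success rate, here an assumed .300, so the uncertainty in the observed rate follows the binomial standard error formula.

```python
import math


def standard_error(true_rate: float, n: int) -> float:
    """Binomial standard error of an observed rate after n independent trials."""
    return math.sqrt(true_rate * (1 - true_rate) / n)


TRUE_RATE = 0.300  # ASSUMED true rate, for illustration only

for n in (31, 100, 300, 600):
    se = standard_error(TRUE_RATE, n)
    print(f"n={n:>3}: observed rate is typically within +/- {se:.3f} of {TRUE_RATE:.3f}")
```

At 31 trials the observed rate typically lands about 80 points either side of the true rate; at 600 trials the spread is under 20 points. A sample of 31 may clear the textbook threshold, but its estimate is still far noisier than a full season’s.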
There is, however, a third explanation that is often overlooked. Just because two samples are large does not mean that they were drawn from the same population of data. They may, in fact, be samples of two different populations that need to be separated. Two examples can illustrate this point. First, imagine you want to estimate the height of people living in New York City. You measure the heights of individuals as they leave a bus and average them. Over a sample of 40 individuals (a large sample) you get an average height of 4’7″, with only three individuals coming in at over 5′. This happened because I declined to mention that the bus was a school bus. The only adults on board were two teachers and the driver. The rest were school children. This is an example of omitted variable bias. We missed a critical fact about our sample, one that would have changed our analysis. In this case, school children are not representative of New York’s adult population. They need to be separated from the adults.
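The school bus can be reproduced in a few lines. The individual heights below are assumptions; the example only specifies 40 riders, three of them adults, and a pooled average of 4’7″. Giving every child 54 inches and every adult 68 inches is one simple way to land on that average.

```python
from statistics import mean

# ASSUMED heights in inches: 37 children and 3 adults, as in the bus example.
children = [54] * 37
adults = [68] * 3
riders = children + adults


def feet_inches(inches: float) -> str:
    """Format a height in inches as feet and inches, rounded to the nearest inch."""
    whole = round(inches)
    return f"{whole // 12}'{whole % 12}\""


pooled = mean(riders)
print(f"All {len(riders)} riders pooled: {feet_inches(pooled)}")
print(f"Children only ({len(children)}): {feet_inches(mean(children))}")
print(f"Adults only ({len(adults)}):    {feet_inches(mean(adults))}")
```

The pooled average describes neither group well. Once the riders are split by the variable we had omitted, each group gets a sensible mean of its own.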
Returning to baseball for the second example, we all know that in 2010 Tex hit like garbage in April, a bit better in May and June, was on fire in July and August, and then went ice cold in September. Tex’s performance was so different in each of these periods that, while they are all parts of the same season, they are probably samples of different populations of data that are independent of each other (April Tex, Tex swinging the bat well, injured Tex, etc.). Over the course of a season, things happen to players that have the same impact as the school bus in the first example. As players get hurt or make adjustments that change their performance, one sample ends and another begins. As analysts it is our job to recognize when these changes have occurred, and separate the samples.
In conclusion, small samples get beaten up a lot in baseball analysis because they are misunderstood. While there are justifiable caveats regarding conclusions drawn from small samples, those caveats are not as damning as we often think. Furthermore, small samples are much smaller than we often realize. Once we are observing more than 30 points of data for whatever we are analyzing, we are officially working with a statistically valid, large sample. Baseball performance varies wildly during the course of the season not because it takes time for a player or a team to provide a large sample, but because analysts often fail to separate independent samples whose means appear similar at first blush but are in fact different.