by Keith Law
I’ve never seen an attempt to quantify command that made any sense to me, although it could be possible as the quality of pitch-tracking data improves. You can’t just glean it from strikeout rate, walk rate, from some manipulation of those two, from hit rate or BABIP allowed, or any of those things. You just have to know it when you see it.
Grading out the individual pitches has always been highly subjective, of course, except for the radar gun, which is an imperfect proxy for the effectiveness of a fastball. In recent years, some teams have asked their scouts to start to record new information, like how many swings and misses an amateur pitcher (especially a high schooler) gets over the course of a game, perhaps broken down by pitch type. Scouts might also be asked to grade or otherwise evaluate how tight the rotation is on a curveball or slider, or how far a pitcher extends over his front side at release—that is, where his arm is relative to his landing leg when he lets go of a pitch.
MLB’s new data stream, called Statcast, takes some of the subjective elements of traditional scouting and puts objective numbers on them, which is changing the nature of scouting at all levels, from the amateur ranks to the majors. We no longer have to guess how hard a hitter makes contact, or whether a pitcher is truly skilled at limiting hard contact; how tight that curveball rotation is, or how much that pitcher’s four-seamer really moves. This doesn’t eliminate the role of the scout, but it changes it, removing some of the guesswork from the job and, if a front office really integrates the two sides of its operations, freeing up scouts to do the things that can’t be measured with radar or cameras, from evaluating a player’s swing or pitching mechanics to getting to know that player’s character.
Some teams have indeed chosen to diminish the role of the scout within their departments, asking scouts who watch amateur players to gather as much video as possible of the player during and prior to games, getting biographical information, and administering tests like psychological and vision exams to players. Many teams have used statistical analysis to supply scouts with lists of players to see who might not otherwise have been on their radar—good performers at smaller colleges, or colleges outside of the NCAA’s Division 1 level—or to ask pro scouts to focus more on specific players while doing their general coverage of entire minor-league teams.
An opportunity still exists, however, for teams that can best integrate this new information with their scouting resources to answer questions that advanced data can’t answer in a vacuum. We know pitcher Charlie Brown has a high spin rate on his curveball, so why is he giving up so much hard contact? Does Brown throw the pitch too infrequently? Are hitters picking the curveball up early enough to lay off it and wait for a fastball? Does it finish out of the zone so often—as was true with Trevor Bauer and Carson Fulmer in college—that better hitters let it go by for a ball? Rather than asking scouts to provide subjective opinions on variables we can objectively measure, we can ask scouts to fill in the gaps between the skill measurements and the results on the field. Scouting is and has always been about projecting what a player will do when he reaches the majors or comes to the scout’s team. Simply acquiring every pitcher with a high-spin-rate curveball will bring you some successes, but at high cost. Understanding which high-spin-rate guys are likely to have success against big-league hitters—the ones best able to recognize or hit a curveball—is the way to narrow your target list in a meaningful way.
This also puts the onus on scouts to at least become conversant in the new language of baseball analysis. I’m not suggesting we need scouts to run SQL queries or learn R (although it would probably look good on the résumé), but scouts are going to be asked to identify more specific attributes in players, and they’re going to find themselves held more accountable going forward for their scouting reports. I don’t see teams asking scouts to apply the metrics themselves, but when they write up players’ abilities, teams will be better able to evaluate the evaluations using improved data.
The baseball industry had seen virtually no change in how players were evaluated since Branch Rickey’s days with the Cardinals, but since 2000 we’ve seen the introduction and proliferation of statistical analysis within the sport and among fans and media, which has in turn changed the way teams scout and acquire players, and all of which has led up to the second revolution, one that is currently in progress. This ongoing upheaval was triggered by the introduction of MLB’s Statcast product, one that is big enough to merit its own chapter.
17
The Next Big Thing Is Here, the Revolution’s Near:
MLB Statcast
We’re not telling people things they don’t know about baseball; we’re going where they are and giving them more information to help them understand it better.
—Cory Schwartz, Vice President of Stats, MLB Advanced Media
In 2015 MLB introduced into baseball a new source of data called Statcast, instantly changing the way that we think about stats and the game. The new data stream that teams receive from MLB Advanced Media (MLBAM) via Statcast is absolutely massive: 1.5 billion rows of data, each of which has about 70 fields, for all MLB games from Opening Day 2016 through mid-September, so something on the order of 100 billion items just for a single season, which means that for the first time teams have had to think about the mere challenge of storing a terabyte of data before they could even begin anything resembling analysis of it.
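The arithmetic behind those figures is easy to check. Here is a back-of-the-envelope sketch in Python; the row and field counts come from the paragraph above, while treating a terabyte as a decimal 10^12 bytes is my assumption, and the bytes-per-item figure is only what those numbers imply, not anything MLBAM has published.

```python
# Back-of-the-envelope sizing using the figures quoted in the text.
ROWS = 1_500_000_000   # about 1.5 billion rows per season (from the text)
FIELDS_PER_ROW = 70    # about 70 fields per row (from the text)
TERABYTE = 10**12      # decimal terabyte in bytes (assumption)

# Total individual data items in one season.
items = ROWS * FIELDS_PER_ROW

# Average storage per item if the whole season fits in one terabyte.
implied_bytes_per_item = TERABYTE / items

print(f"{items:,} items")                                      # 105,000,000,000 items
print(f"{implied_bytes_per_item:.1f} bytes per item implied")  # 9.5 bytes per item implied
```

So "on the order of 100 billion" is about right, and a terabyte works out to roughly ten bytes per item.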
Statcast combines two separate systems for gathering physical information from the field of play, one radar-based and one optical-based, which MLBAM merges into a single stream of data that provides an unprecedented amount of detail on each pitch or play. That includes stuff you’d expect, like the velocity of the pitch or the result of the play, but also includes a lot of information that was previously either unrecorded or totally unavailable. Such information would include the speed of the ball as it left the bat, now known as “exit velocity,” or the spin rate on the pitch in revolutions per minute (RPM).
Information that was unavailable prior to Statcast is even more extensive and opens up new worlds for statistical analysis of player performance. The Statcast record for a single play includes the position of everyone and everything on the field: every fielder, every baserunner, every coach and umpire, and of course the position and trajectory of the ball from when it leaves the bat until when it’s picked up by a fielder, and again if it’s thrown from one fielder to another. This at least gives teams with the analytical capacity the ability to model plays, create 3-D renderings, and, perhaps most important, get realistic measures of defensive performance, especially range, because for the first time we know where the fielders actually started on each play, rather than only knowing where they were when they reached the ball.
Although Statcast itself began in earnest on Opening Day of 2015, ushering in a new Year One of data for MLB teams and their analysts, the modern baseball data revolution really began in 2006 with pitch tracking and the tool known as Pitch f/x. MLB started offering fans its Gameday product online in 2001, but pitches were recorded manually and the data was not reliable enough for real analysis. (Prior to Pitch f/x, MLB stringers were instructed to mark any pitch called as a strike as being in the strike zone, even if it wasn’t even in the same ZIP code, and the same for pitches called balls. Thus, a pitcher could split the plate in two with a belt-high pitch, but if the home plate umpire, Mr. Magoo, called it a ball, the stringer charged with entering the data for MLB would have to enter it into the system as out of the strike zone.) MLBAM recognized that the consumer product would be vastly improved by accurate pitch data—velocity, movement, and pitch type, as well as a computerized strike zone that didn’t necessarily agree with the umpires’ ball/strike calls.
Although Pitch f/x started life as a consumer product, its introduction meant teams suddenly had real data to work with, and so did independent analysts all over the world, since MLB chose to make the Pitch f/x data stream public. Fans watching at home, following games on MLB.com, or using the league’s best-of-breed At Bat app were treated to pitch velocities, types, and locations, data for every pitch that also came through to clubs in “flat files” (meaning text files, unformatted, with fields separated by commas) those teams could use for a whole new field of analysis. Where previously teams had nothing to work with but event data (in this at bat, the batter saw this many pitches, this many strikes and balls, and the at bat ended when he hit the ball to this fielder), now they had detailed data on every pitch. Pitch f/x allowed analysts to start to tease out insights like who really had the most effective slider in baseball, or to what pitch a certain hitter was most vulnerable, and whether that changed by the pitcher’s handedness, or whether the hitter was behind in the count.
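To give a rough sense of what working with one of those flat files looks like, here is a minimal Python sketch. The column names, the pitchers, and every row are made up for illustration and are not the real Pitch f/x layout; the swinging-strike rate (swinging strikes divided by pitches thrown) is just one simple stand-in for "most effective slider."

```python
import csv
import io
from collections import defaultdict

# Hypothetical flat file: column names and rows are invented for illustration.
FLAT_FILE = """pitcher,pitch_type,velocity_mph,result
Brown,SL,84.1,swinging_strike
Brown,SL,83.7,ball
Brown,FF,93.2,foul
Van Pelt,SL,86.0,swinging_strike
Van Pelt,SL,85.4,swinging_strike
Van Pelt,SL,85.9,in_play
"""

def swinging_strike_rate(text, pitch_type):
    """Per pitcher: swinging strikes divided by pitches thrown of one type."""
    thrown = defaultdict(int)
    whiffs = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        if row["pitch_type"] != pitch_type:
            continue  # ignore every other pitch type
        thrown[row["pitcher"]] += 1
        if row["result"] == "swinging_strike":
            whiffs[row["pitcher"]] += 1
    return {name: whiffs[name] / thrown[name] for name in thrown}

rates = swinging_strike_rate(FLAT_FILE, "SL")
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.3f}")
```

With event data alone, none of this was possible, because the file never said what kind of pitch was thrown.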
Where Pitch f/x was a few hundred thousand rows of data per season, however, Statcast is orders of magnitude larger, which is the result of a huge technological investment by Major League Baseball in hardware and software built to capture this information. Statcast collects enough data for a 3-D model of every pitch, using a radar-based system from a Danish company called TrackMan and an optical system from the Long Island company ChyronHego. As soon as the pitch leaves the pitcher’s hand, the radar system is looking for the pitch, sending a notification to MLBAM’s scoring system and the stringer on-site with the pitch location and scoring information once the pitch is complete. The radar system then tracks the ball (and only the ball) from bat to fielder if it’s put into play, since the radar system offers higher fidelity than the optical system.
The optical system follows . . . well, everything else, especially the movements of all of the people on the field, from recording their starting positions to the routes and speeds of all of the people in motion, including fielders, baserunners, and the batter as he becomes a baserunner. The optical system also serves as a backup to the radar system should the latter lose track of the ball at some point, since the optical system is agnostic toward the objects it’s monitoring. The system thus tracks all kinds of observable, measurable player actions, such as running speeds, throwing velocity, and fielders’ routes to the ball, as well as the aforementioned aspects to the pitch’s path to the plate, including velocity, spin rate, and vertical break.
MLBAM rolled out its first Statcast data in 2015, but year one, according to everyone involved, included a lot of time and effort spent cleaning the data and learning how to make the data more reliable in something closer to real time. Because these systems are tracking players at 30 frames per second and are recording measurements that aren’t perfectly smooth—your running speed varies depending on whether your feet are in the air or one is on the ground—there can be measurement errors in such data, as well as clerical errors such as tagging a play with the wrong player ID codes.
Many of these issues were of a kind no one really expected, because we’d never encountered them before in the era where things like catcher throws to second base were timed by hand, by scouts using stopwatches. Such throws, which are called “pop times” and generally run between 1.85 seconds and 2.10 seconds for major-league catchers—you’ll get higher pop times the further from the majors you get—assume that the infielder receiving the throw is standing at second base. MLBAM found that they were getting absurdly low pop times from catchers not generally known for their throwing prowess because the system didn’t distinguish between normal throws and those cut off short of the bag, one example of how the operators had to train the system so it would capture the right data before team analysts could start using them to develop new metrics.
The sheer size of the Statcast data stream and the inevitability of errors within it have created entirely new jobs within the industry that were unthinkable in 2006, when I left the Blue Jays to join ESPN. While I worked for Toronto as the team’s lone statistical analyst, the bulk of my time spent on analytical work was spent gathering data: writing code to scrape college stats off Web pages or to import the flat files MLB would post every morning that included minor-league data, including split data and game logs for players, and then formatting the data so I could import them into a simple desktop tool like Access (for queries) or Excel (so I could sort, print, or share lists with colleagues). There just weren’t enough data to merit an investment in a real relational database management system until the Pitch f/x data started to arrive in 2006, after which many teams started to build systems around packages like SQL Server, MySQL, and Oracle.
I’ve joked with many people as I’ve written and researched this book that I am no longer qualified for the type of job I once held with the Blue Jays, and the inception of Statcast data has made that more true than ever. Teams that have built or are building architectures capable of handling this new torrent of information are hiring people with graduate degrees in computer science specialties such as machine learning or signal processing. The sheer quantity of data has created entirely new needs within baseball operations departments, from data cleaning to building systems capable of storing and querying medium to big databases that are several orders of magnitude larger than what previous systems were able to handle—quite literally going from a few gigabytes of data to a terabyte of data each year from Statcast, a quantity that threatens to continue to increase.
Sig Mejdal, the Houston Astros’ director of decision sciences, told me that “before, while you might have been happy with a person fascinated with baseball who has some good quantitative skills (can work Excel well) and seems socially mature, now you need an advanced degree or advanced college-level backgrounds. And as the size of analytical teams grows, you don’t want one more person just like you; you want a person with skills that none of you have, and that’s often master’s and Ph.D. level skills.”
Multiple other executives who oversee analytics departments said they’re looking for similar skills and backgrounds; it’s no longer sufficient to know a little code and love the sport; now teams expect candidates to have technical experience before they’re hired, including some specialized skills like machine learning, the more accurate term for “artificial intelligence,” which MLBAM uses to train its system to tag pitch types based on pitch velocity, break, and the known repertoire of the current pitcher.
MLBAM has also hired multiple experts to help root out systemic errors in the data and to help with more esoteric but critical topics like deciding which data to include when, for example, calculating an outfielder’s arm strength to show to fans in a tweet or during a game broadcast. If an outfielder makes a hundred throws back to the infield, some of those will be casual tosses because there is no threat of a runner advancing, while others will be throws made at full strength to try to either catch a runner or prevent one from moving up a base. How can we determine which throws to discard from the sample when calculating the average velocity of his throws so we can compare it to other outfielders’ throwing velocities? Throwing out data makes anyone who’s worked with statistics a little queasy, because you’re introducing a new arena of potential bias (“selection bias,” which means you’re skewing the data by what you choose to include or omit), but the hope of analysts working with Statcast data is that the sheer volume of samples will minimize any bias from data-cleaning efforts.
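One way to picture the problem is a sketch like the following, in Python. The rule here, keeping only the hardest quarter of an outfielder's throws, is a cutoff I picked purely for illustration and is not a description of how MLBAM actually does it; the velocities are invented as well.

```python
from statistics import mean

def max_effort_average(velocities_mph, keep_fraction=0.25):
    """Average only the hardest throws, treating them as full-effort.

    keep_fraction is a hypothetical cutoff chosen for illustration.
    """
    if not velocities_mph:
        raise ValueError("no throws recorded")
    ranked = sorted(velocities_mph, reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))  # always keep at least one
    return mean(ranked[:keep])

# Invented sample: mostly casual tosses, plus three throws at full strength.
throws = [62.0, 65.5, 70.1, 58.3, 91.2, 88.7, 64.0, 90.1]

all_throws_avg = mean(throws)                # dragged down by the casual tosses
full_effort_avg = max_effort_average(throws)  # average of the two hardest throws

print(f"all throws: {all_throws_avg:.1f} mph")
print(f"hardest quarter: {full_effort_avg:.1f} mph")
```

The gap between the two averages is the whole issue: the second number is closer to what a scout means by arm strength, but it depends entirely on where you draw the line, which is exactly where selection bias can creep in.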
The Astros even took their search for people with technical skills one step further in the 2015–16 off-season, posting a job opening for a “development coach,” an actual coach who’d wear a uniform, but for whom SQL skills were a plus. SQL stands for Structured Query Language and is the most common programming language used to search for data within relational databases. If you have a database of player statistics and want to know how many players age twenty-nine or younger hit at least 20 home runs this season, you’d write a SELECT statement with a WHERE condition that contains the age and home runs criteria. It is as lightweight as programming languages go, but it is still programming, something most working adults in the United States, let alone baseball coaches, never learned at any point in school. Whether this catches on around the industry remains to be seen, but I’ll bet it becomes more common for teams to at least favor coaches who have a basic level of understanding of how databases work, because if you can even craft a SELECT/WHERE statement, then you automatically know how to phrase the questions you want to ask your statistical analysts in a way that they can turn the questions into SQL queries and get you answers you can use.
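For the curious, the age-and-home-runs question above might look something like this. It is a minimal sketch that uses Python's built-in sqlite3 module so it can run anywhere; the table name, the column names, and the four players are all hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical table of season batting lines.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batting (player TEXT, age INTEGER, home_runs INTEGER)")
conn.executemany(
    "INSERT INTO batting VALUES (?, ?, ?)",
    [
        ("Player A", 24, 31),  # young, 20+ home runs: counted
        ("Player B", 29, 20),  # exactly at both cutoffs: counted
        ("Player C", 33, 41),  # too old: not counted
        ("Player D", 26, 12),  # too few home runs: not counted
    ],
)

# The SELECT statement with a WHERE condition described in the text.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM batting WHERE age <= 29 AND home_runs >= 20"
).fetchone()

print(count)  # 2
conn.close()
```

The WHERE clause is the coach's question restated almost word for word, which is the point: knowing that much makes it far easier to ask an analyst for exactly what you want.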
With this new complexity, however, comes a world of new opportunities for coaches, executives, and even players to get answers to questions that previously were either unanswerable or couldn’t be answered with enough precision to help. These answers are already helping to change the way the game is played on the field, and going forward will change the way teams construct their rosters, utilize pitchers, and draft, sign, and develop players.
The lowest-hanging fruit available for people working with Statcast data, whether it’s the folks at MLBAM preparing information for social media or game broadcasts or team analysts working on player evaluations, is verifying scouting observations that had previously been left to imperfect human measurements. How fast is Billy Hamilton? (Answer: Faster than anyone else in the majors, but not as fast as Usain Bolt.) Whose curveball has the highest spin rate? (Answer: The Angels’ Garrett Richards.) What hitter has the highest exit velocity or the optimal launch angle for power? What is the optimal launch angle, or range of angles, for power anyway? (It turns out that hitting the ball with some loft is critical, but too much loft means more flyouts and fewer home runs.)