My Good Faith Critique Of DRA

Note: This isn’t a Tigers post. If you’re here for the Tigers, feel free to ignore. Also, I’m publishing this here rather than at FanGraphs because 1) I don’t want the general public to get the idea that FanGraphs as an institution is throwing shade at DRA and 2) I don’t want the perception that anything I’m saying is done in the service of driving traffic or subscriptions to or from either site. 

Evaluating pitchers is very hard, but that hasn’t stopped people from trying. Wins and losses. ERA. WHIP. FIP. These are all statistics that at one point or another have been at the forefront of the quest for The Best Single Metric. A wise person might suggest that searching for one metric to rule them all is a silly quest, but even if we all decided to properly use every tool in the toolbox, there would still have to be a best metric among the useful ones.

Two years ago, the Baseball Prospectus stats team took a swing at building the next generation of pitching metrics, led by their top-line creation, Deserved Run Average (DRA). Many in our little corner of the world treated this as a near second coming because it was the first high-level attempt to get beyond the FIP generation of metrics and some of the smartest people in the public analysis sphere had thrown their intellectual heft behind the effort.

DRA promised to incorporate a lot of information that hadn’t found its way into FIP while also taking a more complex approach to modeling the pitcher-value process. I agree that those are worthwhile goals.

I think FIP is a very useful metric, not just because it does a pretty good job of representing pitcher value but because it is extremely straightforward. I am not saying that simplicity makes FIP a good metric, but rather that its clarity does. FIP has flaws, but its flaws are in perfect view. I know exactly what FIP is doing and exactly what FIP is not doing. And this is precisely where DRA has so far failed to win me as a full convert.

I want to be clear that I am not saying DRA is less rigorous than FIP or that it has been designed poorly or in bad faith. My issue with DRA is not that I think there is something wrong with it, it’s that I don’t really know what to make of it. My argument is not that FIP is a better representation of pitcher value than DRA, it’s that I am less certain about the quality of DRA than I am the quality of FIP.

Imagine FIP and DRA are diamonds. I can hold FIP in my hand and examine it under a magnifying glass. DRA is on a table twenty feet away. I can see the exact quality of the FIP diamond, but I can only tell that DRA is a diamond. Smart people who cut the DRA diamond are telling me they think the DRA diamond is better, but I have not been able to see them side by side.

In my own analysis and in my own writing, I have utilized DRA but I still lean heavily on the FIP-family of metrics for this reason. If I’m writing about a player and want to communicate something, I prefer FIP to DRA because I can talk clearly about what FIP says. If I want to use DRA I can only say that based on the complex method it deploys, the pitcher is this good/bad/other.

Now many strong advocates of DRA will tell you that its complexity is good. Pitching, after all, is very complicated so it follows that any statistic that measures pitching holistically should also be complicated. That’s a very convincing point, but as I noted earlier my problem is not complexity, it’s clarity. I love complicated things. I’ve taken graduate-level courses in statistics and modeling. I am in no way turned off by DRA in concept. At no point in this piece am I saying DRA should be less complex.

However, there are two clear issues with DRA that prevent me from using it as my primary point of reference. The first is that the BP team has not outlined a justification for its modeling strategy. If you read through their explanations (see here, here, here, and here) what you find is a list of flaws that exist with other pitching metrics. “FIP doesn’t have X, X matters, so we put X in our model. We know pitchers control their BABIP to some degree, so we put that in the model.”

This creates a couple of issues. The first issue is that I can’t see what components are doing the lifting (for example, this page needs to be way more granular). Does a player have a good DRA because their opponents are very tough or because their defense is terrible? DRA jams a lot of information into a single output and that makes it quite difficult to use in any sort of interesting way. FIP only has five inputs (strikeouts, walks, hit batters, home runs, and innings pitched) and even that can feel overly aggregated. DRA has even more inputs that have been run through even more aggregation. That might provide DRA with a more accurate output but it blurs a lot of lines. This pitcher might be good, but I have no idea why he’s good.

More important, however, is the fact that the BP team has not thoroughly explained why their modeling choices (structure, not inputs) are the proper modeling choices. DRA is a complex model, and while complexity is good, complexity also means that there were hundreds of choices made along the way that could have been made differently and produced different outcomes. In other words, DRA is built on a lot of choices made by people about how to incorporate something, and those choices have not been explained and defended. As I noted earlier, the choices may be correct, but I have no way of evaluating them if the BP team does not explain how they arrived at them.

Here’s an excerpt from the Gory Math DRA post:

What is the best way to model this relationship? That required a lot of testing. A LOT of testing. We tried linear models. We tried local regression. We tried tree-based methods. We bagged the trees. We tried gradient boosting. We tried support vector machines. We even tried neural networks. None of them were providing better results than existing estimators. And then we tried the one method that turned out to be perfect: MARS.

MARS stands for Multivariate Adaptive Regression Splines, and was introduced by Dr. Jerome Friedman of Stanford in 1991. You don’t hear much about MARS anymore: it has been supplanted in the everyday modeling lexicon by trendier machine-learning methods, including many of those we mentioned above. But MARS, in addition to being useful for data dumpster-diving, also has another big advantage: interactions.

MARS uses what are known as regression splines to better fit data. Instead of drawing a straight line between two points, MARS creates hinges that allow the line to bend, resulting in “knots” that accommodate different trends. The power of these knots is enhanced when MARS looks at how variables interact with each other. These interactions are, in our opinion, one of the under-appreciated facts in baseball statistics.

As discussed above, pitchers who are pitching particularly well or poorly have a cascading effect on other aspects of the game, including base-stealing. Moreover, there is a survival bias in baseball, as with most sports: pitchers who pitch more innings tend to be more talented, which means they end up being starters instead of relievers or spot fill-ins. The power of MARS is it not only allows us to connect data with hinged lines rather than straight ones, but that it allows those hinges to be built around the most significant interactions between the variables being considered, and only at the levels those interactions have value. MARS also uses a stepwise variable selection process to choose only the number of terms sufficient to account for the most variance.
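To make the “hinge” and “interaction” language in that excerpt a little more concrete, here is a minimal sketch in Python of the basic building block MARS uses. This is not BP’s model or code. The function names, knots, and coefficients are all hypothetical, chosen only to show how a hinge bends a line at a knot and how multiplying two hinges produces an interaction that is active only in part of the data.

```python
def hinge(x, knot, direction=1):
    """MARS-style hinge function.

    direction=+1 gives max(0, x - knot); direction=-1 gives max(0, knot - x).
    The output is zero on one side of the knot and linear on the other.
    """
    return max(0.0, direction * (x - knot))


def piecewise_fit(x, intercept, terms):
    """Evaluate intercept + sum(coef * hinge(x, knot, direction)).

    `terms` is a list of (coef, knot, direction) tuples. In a real MARS fit
    these would be chosen by the algorithm; here they are made up.
    """
    return intercept + sum(c * hinge(x, k, d) for c, k, d in terms)


def interaction(x1, knot1, x2, knot2):
    """Product of two hinges: nonzero only when BOTH inputs exceed their knots."""
    return hinge(x1, knot1) * hinge(x2, knot2)


if __name__ == "__main__":
    # Hypothetical model: slope of 2.0 above the knot at 3, slope of -1.0 below it.
    terms = [(2.0, 3, 1), (-1.0, 3, -1)]
    for x in (1, 3, 5):
        print(x, piecewise_fit(x, 1.0, terms))
```

The point of the sketch is only that each term is simple on its own. What an outsider cannot see in DRA is which knots and interactions were selected and how much each one contributes.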

Most people won’t be able to make heads or tails of that excerpt, and it’s incumbent upon BP to make it more accessible. But even granting a pardon for that, as someone knowledgeable in these issues, I don’t know if this strategy is a good one or a bad one. They made all sorts of choices based on various tests, and I am simply asked to accept that they chose the right one and that there isn’t a better option out there.

Now you might say that it isn’t their job to teach me how to literally write R code and test my own model so that I can probe the ether for things I think might be imperfect within DRA. Of course they shouldn’t be asked to test literally every possible model specification when building DRA, but you have to give me more information about why you chose to build it like this as opposed to some of the other approaches you tried or could have tried.

On the other hand, with something like FIP, all of the decisions are on display. You might think the decisions are wrong, but you can see the decisions and make that judgement. There are five inputs with a set of clear weights. That’s all FIP is, and while that limits FIP in terms of accuracy, FIP is extremely clear. I can’t make that judgement with DRA. A stronger and clearer defense of the specifications needs to be made.
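To show what “all of the decisions are on display” looks like in practice, here is a minimal sketch in Python of the commonly published FIP formula, (13×HR + 3×(BB + HBP) − 2×K) / IP plus a constant. The class and function names are my own, the stat line is invented, and the default constant of 3.10 is a placeholder assumption: the real constant is recalculated for each league-season to put FIP on the ERA scale.

```python
from dataclasses import dataclass


@dataclass
class PitchingLine:
    home_runs: int
    walks: int
    hit_batters: int
    strikeouts: int
    innings: float  # innings pitched as a decimal, e.g. 200.0


def fip_components(line, constant=3.10):
    """Return each term's contribution to FIP, already divided by innings.

    `constant` is a placeholder; the true value varies by league and season.
    """
    ip = line.innings
    return {
        "home_runs": 13 * line.home_runs / ip,
        "walks_hbp": 3 * (line.walks + line.hit_batters) / ip,
        "strikeouts": -2 * line.strikeouts / ip,
        "constant": constant,
    }


def fip(line, constant=3.10):
    """FIP is just the sum of the visible components."""
    return sum(fip_components(line, constant).values())


if __name__ == "__main__":
    # Invented stat line, for illustration only.
    line = PitchingLine(home_runs=20, walks=50, hit_batters=5,
                        strikeouts=200, innings=200.0)
    for name, value in fip_components(line).items():
        print(f"{name:>10}: {value:+.3f}")
    print(f"{'FIP':>10}: {fip(line):.3f}")
```

Every weight is right there to argue with, and the per-component breakdown tells you immediately whether a pitcher’s number is driven by home runs, walks, or strikeouts. That is the kind of view I cannot currently get for DRA.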

And this leads me to my second key issue with DRA that prevents me from using it in a more serious way. DRA is two years old and has already had three major iterations that worked differently in meaningful ways. I have no problem with updating your metrics based on new data or new research, and I don’t think there is an inherent problem with any of the specific changes they have announced. The problem is that DRA-2015, DRA-2016, and DRA-2017 have different views of the same seasons and I have a strong suspicion that DRA-2018 will lead to more of these cases.

The rapidity with which DRA has been revised indicates the BP team’s willingness to explore improvements (which is great!) but it also suggests to me that they haven’t figured out the right way to model the underlying data generating process.

When they announce a revision, they are stating that the previous version failed to capture something they found essential. It would be one thing if these changes were exclusively based on new data, but they are also based on changes to the modeling. And if the results are that sensitive to tweaks in method, I am suspicious of the entire system. That doesn’t mean that FIP is necessarily better than any particular version of DRA, simply that I know that in a few months DRA is going to change and a pitcher I thought was decent might actually be kind of bad even though we didn’t learn anything new about the pitcher himself.

Put another way, are the things BP learned about DRA between 2015 and 2017 things they couldn’t have learned by exploring more specifications before the initial rollout? I am not saying they should hold the release back until it’s perfected because public input makes things better, but simply that the first few years are more akin to a beta test. I’m not ready to fully adopt the metric until it settles in a little more. That’s not me dismissing DRA or its potential value to the world of baseball analytics. I really like DRA from a conceptual perspective, but my perception is that the nuts and bolts are subject to change quite frequently, so I have yet to dive in without a life preserver.

I want to reiterate that none of this is a critique of any individual decision and it is decidedly not an argument that FIP is a better representation of pitching value than DRA. That is a separate argument that can be had on separate terms. But I do think that DRA is not as useful as FIP at this point in time. I am hesitant to use a metric whose workings I can’t see. I don’t know if the modeling strategy is correct and I am pretty sure that in a few months a chunk of pitchers will have totally different DRAs.

I also want to be clear that none of this is intended as shade or inter-nerd sniping. I have great respect for the BP stats team and have shared these critiques with them. This is not a takedown, it’s a list of demands.

I think DRA is aiming in the right direction, I just haven’t been given enough information to figure out if it’s really an improvement over its predecessors. Building a metric like DRA makes all the sense in the world and some great people are in command, but it will remain a complementary metric for me until it is unpacked in a way that allows me to trace its design.

So here is what I would propose:

  1. Create an expanded version of the DRA run value page that includes every individual component so that people can see how the different factors are operating. It takes two seconds to figure out why FIP likes someone or doesn’t. Doing so with DRA is next to impossible.
  2. Go back to the drawing board on the public facing explanation and give clearer explanations of how DRA works and why it works that way. DRA is complex, but you can explain complex things in a clear manner if you break it down into less complex pieces and work with outsiders to ensure they follow the explanation at each step.
  3. If DRA is meant to be a living, breathing statistic that gets updated annually, then be willing to accept ongoing skepticism about the execution of the statistic. If you are rejiggering it frequently, then the audience is going to wonder if the current version is the right one. If you want to avoid that, you have to change the name each time you change the stat. I get that this is annoying, but it’s part of the job.