2016-09-12


Can machine learning improve the use of data and evidence for understanding economics and public policy? Susan Athey of Stanford University talks with EconTalk host Russ Roberts about how machine learning can be used in conjunction with traditional econometric techniques to measure the impact of, say, the minimum wage or the effectiveness of a new drug. The last part of the conversation looks at the experimental techniques being used by firms like Google and Amazon.

Time: 1:01:34


Readings and Links related to this podcast episode

Related Readings

This week's guest:

Susan Athey's Home page

Susan Athey on Twitter.

This week's focus:

"The State of Applied Econometrics: Causality and Policy Evaluation," by Susan Athey and Guido Imbens. Arxiv.org, July 2016. PDF file.

Additional ideas and people mentioned in this podcast episode:

Cross-state, time series, counterfactuals, Difference in Differences approaches:

James Heckman on Facts, Evidence, and the State of Econometrics. EconTalk. January 2016. Discussion of Card/Krueger study.

Noah Smith on Whether Economics is a Science. EconTalk. December 2015. Discussions of minimum wage evidence including the Card/Krueger study.

Difference in differences. Wikipedia.

Synthetic control method. Wikipedia. A refinement of difference in differences developed by Alberto Abadie.

High-dimensional statistics. Wikipedia.

Pedro Domingos on Machine Learning. EconTalk. May 2016. Discussion of A/B tests and more.

"Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program," by Alberto Abadie, Alexis Diamond, and Jens Hainmueller. Journal of the American Statistical Association, 2010. PDF file.

A few more readings and background resources:

Leamer, data mining critiques:

"Let's Take The Con Out of Econometrics," by Edward E. Leamer. UCLA Working Paper. 1983. PDF file.

Leamer on the State of Econometrics. EconTalk. May 2010.

Leamer on Macroeconomic Patterns and Stories. EconTalk. May 2009.

Forecasting and Econometric Models, by Saul H. Hymans. Concise Encyclopedia of Economics.

International Trade Agreements, by Douglas Irwin. Concise Encyclopedia of Economics.

Simon Kuznets. Biography. Concise Encyclopedia of Economics. Detailed data collection proponent.

A few more EconTalk podcast episodes:

Joshua Angrist on Econometrics and Causation. EconTalk. December 2014.

Nassim Nicholas Taleb on the Precautionary Principle and Genetically Modified Organisms. EconTalk. January 2015.

Highlights


0:33

Intro. [Recording date: June 18, 2016.]

Russ Roberts: So, we're going to start with a recent paper you wrote, co-authored, on the state of applied econometrics. It's an overview paper and it gets at a lot of issues that are current in the field. I want to start with an example you use in there, a pretty important policy issue, which is the minimum wage. It's in the news these days; people are talking about increasing the minimum wage; we've talked about it in passing on the program many times. What's the challenge of measuring the impact of the minimum wage on employment, say? Why is that hard? Can't we just go out and count how many jobs changed when we changed the minimum wage in the past?

Susan Athey: Sure. There are sort of two problems when we think about measuring the impact of a policy like the minimum wage. The first is just basic correlation versus causality. So, if you think about looking at a point in time, we could say, some states have a high minimum wage and other states have a lower minimum wage. And so, the naive thing to do would be to say, 'Hey, look, this state has a high minimum wage,' or 'This city has a high minimum wage, and employment is doing just fine. So, gosh, if we increase the minimum wage everywhere else, they would do just fine, too.' And the big fallacy there is that the places that have chosen to raise the minimum wage are not the same as the places that have not.

Russ Roberts: Necessarily. They might be, but probably are not.

Susan Athey: Highly unlikely. And if you look at the cities that have recently raised the minimum wage, these often tend to be cities that are booming, that have, say, a high influx of tech workers that are making very high wages; the rents have gone up in those cities and so it's made it very difficult for low-wage workers to even find anyplace to live in those cities. And that's often been the political motivation for the increase in the minimum wage. And so, a minimum wage that works in San Francisco or Seattle is not necessarily going to work in the middle of Iowa. And there's a number of underlying factors that come into play here. One is that a city that has lots of tech workers may have their--their customers at fast food restaurants may not be very price-sensitive. So it might be very easy for those restaurants to increase their prices and basically pass on the cost increase to their customers, and the customers may not really mind. So it may not really have a big impact on the bottom line of the establishment. Another factor is that if it's a place where workers are in scarce supply, maybe partly because it's hard to find a place to live, then the establishments may see a tradeoff when they raise wages: if they have higher wages it may help them keep a more stable work force, which helps kind of compensate for the cost of raising the minimum wage. So, all of these forces can lead to very different responses than in a place where those forces aren't operating. And so, if you just try to measure--relate the magnitude of the minimum wage: 'Hey, it's $15 here; it's $10 there; it's $8 there,' and here's the differences in employment and try to connect the dots between those points, you are going to get a very misleading answer.

Russ Roberts: Just to bring up an example we've talked about many times on the program with a similar challenge: if you look at the wages of people who have gone to college or done graduate work, they're a lot higher than for people who haven't. And people say, 'Well, that's the return to education.' The question is, if you increase the amount of education in the United States, would you necessarily get the returns that the people who are already educated are getting? And the answer is probably not. The people who would go on to college might not be, and probably aren't, exactly the same as the people who are going now; and therefore the full effect would be more challenging to measure.

Susan Athey: Exactly. And there's also equilibrium effects, because if you send more people to college, you'll decrease the scarcity of college graduates, which can also lower the equilibrium wages. So, these things are very hard to measure. So, something that's typically a better starting point is looking over time and trying to measure the impact of changes in the minimum wage over time. But those also are confounded by various forces. So, it's not that San Francisco or Seattle have static labor markets. Those markets have been heating up and heating up and heating up over time. And so there are still going to be time trends in the markets. And so if you try to look, say, one year before and one year after, the counterfactual--what would have happened if you hadn't changed the minimum wage--is not static. There would have been a time trend in the absence of the change in policy. And so one of the techniques that economists try to use in dealing with that is something that is called 'difference in differences.' So, essentially you can think about trying to create a control group for the time trend of the cities or states that made a change. So, if you think about the problem you are trying to solve, suppose the city tries to implement a higher minimum wage. You want to know what the causal effect of that minimum wage increase was. And in order to do that, you need to know what would have happened if you hadn't changed the minimum wage. And we call that a counterfactual. Counterfactually, if the policy hadn't changed, what would have happened? And so you can think conceptually about what kind of data would have helped inform what would have happened in Seattle in the absence of a policy change. And so, one way that people traditionally looked at this is they might have said, well, Seattle looks a lot like San Francisco, and San Francisco didn't have a policy change at exactly the same time; and so we can try to use the time trend in San Francisco to try to predict what the time trend in Seattle would have been.
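To make the difference-in-differences idea concrete, here is a minimal sketch in Python. The data file, column names, and DataFrame layout are all hypothetical assumptions; the point is only that the estimate comes from the interaction of a "treated" indicator with a "post-change" indicator.

```python
# A minimal difference-in-differences sketch (illustrative only: the file name
# and column names below are hypothetical, not from the episode).
import pandas as pd
import statsmodels.formula.api as smf

# df: one row per city per month, with columns
#   employment - outcome of interest
#   treated    - 1 for cities that raised the minimum wage, 0 otherwise
#   post       - 1 for periods after the policy change, 0 before
df = pd.read_csv("city_employment_panel.csv")

# The coefficient on treated:post is the difference-in-differences estimate:
# the before/after change in treated cities minus the change in control cities.
model = smf.ols("employment ~ treated + post + treated:post", data=df).fit()
print(model.summary().tables[1])
```

The key identifying assumption is the one discussed above: that the control cities' time trend is a good stand-in for what the treated city would have done without the policy change.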

Russ Roberts: And they both start with the letter 'S', so therefore it could be reliable. Or it might not be. So that one of the challenges of course is that you are making a particular assumption there.

Susan Athey: Exactly. And so the famous Card/Krueger study of the minimum wage compared, you know, Pennsylvania to New Jersey. And that was kind of a cutting edge analysis at the time. Since then, we've tried to advance our understanding of how to make those types of analyses more robust. So, one thing you can do is to really develop systematic ways to evaluate the validity of your hypothesis. So, just as a simple thing, if you were doing Seattle and San Francisco, you might track them back over 5 or 6 years and look at their monthly employment patterns and see if they are moving up and moving down in the same patterns, so that you basically validate your assumption: There was no change, say, a year ago; and did it look like there was a change? So, for example, if Seattle was on a steeper slope than San Francisco, hypothetically, then if you did a fake analysis a year ago when there wasn't really a change, that fake analysis might also make it look like there was an impact on employment. We call that a placebo test. It's a way to validate your analysis and pick up whether or not the assumptions underlying your analysis are valid. So, a more complex thing developed by Alberto Abadie is a method called Synthetic Control Groups. Instead of relying on just one city, I might want to look at a whole bunch of cities. I might look at other tech cities like Austin. I might look at San Jose and San Francisco, and Cambridge; find other areas that might look similar in a number of dimensions to Seattle. And create what we call a synthetic control group. And that's going to be an average of different cities that together form a good control. And you actually look for weights for those cities, so that you match many different elements: you match the characteristics of those cities across a number of dimensions and you make sure that the time trend of those all together look similar. And that's going to give you a much more credible and believable result.
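Athey's description of Abadie's synthetic control method--choose weights on a set of comparison cities so that their weighted average tracks the treated city before the policy change--can be sketched roughly as follows. The data here are simulated and the optimization is deliberately bare-bones, so treat it as an illustration of the weighting idea rather than the full method.

```python
# A rough synthetic-control sketch on simulated data: find nonnegative weights,
# summing to one, so the weighted control average matches the treated city's
# pre-treatment employment path.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T_pre, n_controls = 24, 5                      # 24 pre-treatment months, 5 control cities
treated_pre = 95 + 0.3 * np.arange(T_pre) + rng.normal(0, 1, T_pre)
controls_pre = np.column_stack(
    [treated_pre + rng.normal(0, 1, T_pre) for _ in range(n_controls)]
)

def pre_period_error(w):
    # squared pre-treatment prediction error of the weighted control average
    return np.sum((treated_pre - controls_pre @ w) ** 2)

w0 = np.full(n_controls, 1.0 / n_controls)
res = minimize(pre_period_error, w0,
               bounds=[(0, 1)] * n_controls,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}])
print("synthetic-control weights:", np.round(res.x, 3))
# After the policy change, the same weights applied to the controls' outcomes
# give the counterfactual path against which the treated city is compared.
```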

Russ Roberts: So, the standard technique in these kinds of settings in economics is what we would call control variables: in the data itself, for that particular city, I'm going to, say, control for educational level, family structure, overall economic growth in the city, etc.; and then try to tease out from that the independent effect of the change in the legislation as a discrete event--a 1-0 change--in policy. Say, the minimum wage. Of course, I also care about the magnitude of the increase, and so on.

9:51

Russ Roberts: So, that standard methodology in the past is multivariate regression, which has the general problem that you don't know whether you've controlled for the right factors. You don't know what you've left out. You have to hope that those things are not correlated with the thing you are trying to measure--for example, the political outlook of the city. There are all kinds of things that would be very challenging--perhaps harder-to-measure growth in a particular sector such as restaurants. You'd try to find as much data as you could. But in these synthetic, placebo cases, they, I assume, have the same challenge, right? You've got an imperfect understanding of the full range of factors.

Susan Athey: That's right. So, if I think about trying to put together a synthetic control group that's sort of similar to the city that I'm trying to study, there are many, many factors that could be predictive of the outcome that I'm interested in. And so, that brings us into something called high dimensional statistics, or colloquially, machine learning. And so these are techniques that are very effective when you might have more variables, more covariates, more control variables than you have, even, observations. And so if you think about trying to control for trends in the political climate of the city, that's something where there's not even one number you could focus on. You could look at various types of political surveys. There might be 300 different, noisy measures of political activity. I mean, you could go to Twitter, and you could see, you know, in this city how many people are posting, you know, political statements of various types.

Russ Roberts: You could look at Google and see what people are searching on; you might get some crude measures, like, did they change the mayor or city council--some kind of change like that would be some indication, right?

Susan Athey: Exactly. So just trying to capture sort of political sentiments in a city would be highly multidimensional in terms of raw data. And so, I think the way that economists might have posed such a problem in the past is that we would try to come up with a theoretical construct and then come up with some sort of proxy for that. And maybe one, or two, or three variables; and sort of argue that those are effective. We might try to, you know, demonstrate, kind of cleverly, that they related to some other things of interest in the past. But it would be very much like an art form, how you would do that. And very, very subjective. And so, machine learning has this brilliant kind of characteristic: that it's going to be data-driven model selection. So, you are basically going to be able to specify a very large list of possible covariates; and then use the data to determine which ones are important.
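Here is one way to picture the data-driven model selection Athey describes, using the Lasso (a standard regularized regression) on simulated data where only a handful of many candidate covariates actually matter. The simulation is an illustrative assumption, not anything from the episode.

```python
# Data-driven model selection with the Lasso: many candidate covariates,
# the penalty keeps only the informative ones. All data are simulated.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 200, 300                              # fewer observations than covariates
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]       # only 5 covariates truly matter
y = X @ beta + rng.normal(size=n)

lasso = LassoCV(cv=5).fit(X, y)              # penalty chosen by cross-validation
print("covariates kept:", np.flatnonzero(lasso.coef_))
```

This is the "regularization" she mentions later: the algorithm boils a very large list of covariates down to a parsimonious set, which is exactly why the selected variables should not be read causally.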

Russ Roberts: Covariates being things that move together.

Susan Athey: Things that you want to control for. And we've talked about a few different examples of ways to measure causal effects. There's, you know, difference in differences; there's things called regression discontinuity; there's things where you try to control for as much as you possibly can in a cross section, in a single point in time. But all of those methods kind of share a common feature: that it's important to control for a lot of stuff. And you want to do it very flexibly.

13:20

Russ Roberts: So, when I was younger and did actual empirical work--and I think--I know I'm older than you are. And in my youth we actually had these things called 'computer cards' that we would have to stack in a reader when we wanted to do statistical analysis. And so it was a very, very different game then. We invented, people invented these wonderful software packages that made things a lot easier like SAS (Statistical Analysis System) and SPSS (Statistical Package for the Social Sciences). And one of the things we used to sneer at, in graduate school, was the people who would just sort of--there was a setting or, I can't remember what you'd call it, where you'd just sort of go fishing. You could explore all the different specifications. So, you don't know whether this variable should be entered as a log [logarithm--Econlib Ed.] or a squared term. You don't know whether you want to have nonlinearities, for example. And so you just let the data tell you. And we used to look down on that, because we'd say, 'Well, that's just that set of data. It's a fishing expedition. And there's no theoretical reason for it.' Is machine learning just a different version of that?

Susan Athey: In some ways, yes. And in some ways, no. So, I think if you applied machine learning off the shelf without thinking about it from an economic perspective, you would run back into all the problems that--

Russ Roberts: It's dangerous--

Susan Athey: we used to worry about. So, let me just back up a little bit. I mean, so, the problem--let's first talk about the problem you were worried about. So, you can call it 'data mining in a bad way.' See, now data mining is a good thing.

Russ Roberts: It's cool.

Susan Athey: It's cool. But we used to use 'data mining' in a pejorative way. It was sort of a synonym for, as you say, fishing in the data. It was a-theoretical. But the real concern with it was that it would give you invalid results.

Russ Roberts: Non-robust results, for example. And then you'd publish the one that, say, confirmed your hypothesis; and then you'd say that was the best specification. But we don't know how many you searched for.

Susan Athey: Yeah. So, 'non-robust' is a very gentle description. So, I think 'wrong' is a more accurate one. So, part of what you are doing when you report the results of a statistical analysis is that you want to report the probability that that finding could have occurred by chance. So we use something called 'standard errors,' which are used to construct what's called the 'p-value'. And that's basically telling you, if you resampled from a population, how often you would expect to get a result this strong. So, suppose I was trying to measure the effect of the minimum wage: if you drew a bunch of cities and you drew their outcomes, there's some noise there. And so the question is: Would you sometimes get a positive effect? And so how large is your effect relative to what you would just get by chance, under the null that there is actually no effect of the minimum wage? And the problem with 'data mining,' or looking at lots of different models, is that if you look hard enough you'll find the result you are looking for. Suppose you are looking for the result that the minimum wage decreased employment. If you look at enough ways to control for things--you know, if you try a hundred ways--there's a good chance one of them will come up with your negative result.

Russ Roberts: Even 5.

Susan Athey: And if you--exactly. And if you were to just report that one, it's misleading. And if you report it with a p-value that says, 'Oh, this is highly unlikely to have happened,' but you actually tried a hundred to get this, that's an incorrect p-value. And we call that p-hacking. And it's actually led to sort of a crisis in science. And we've found in some disciplines, particularly some branches of psychology, large fractions of their studies can't be replicated, because of this problem. It's a very, very serious problem.
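A small simulation makes the point about trying a hundred specifications concrete: even when the true effect is exactly zero, roughly one in twenty independent looks will clear the conventional 5% significance bar by chance. The setup below is made up purely for illustration.

```python
# Multiple testing under a true null: many "specifications," no real effect,
# yet some come out "significant."
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_specifications, n_obs = 100, 50
significant = 0

for _ in range(n_specifications):
    treated = rng.normal(0, 1, n_obs)   # identical distributions:
    control = rng.normal(0, 1, n_obs)   # the true effect is zero
    _, p_value = stats.ttest_ind(treated, control)
    if p_value < 0.05:
        significant += 1

print(f"'significant' results out of {n_specifications}: {significant}")
# Reporting only those few, with their nominal p-values, is the p-hacking
# problem described above.
```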

Russ Roberts: In economics I associate it historically with Ed Leamer, although calling it historical--he's alive; he's been a guest on EconTalk a few times. And I think his paper, "Let's Take the Con Out of Econometrics," was basically saying that if you are on a fishing expedition--not even as grand or absurd as the one we talked about where you tried a zillion things--if you are running lots of different variations, the classical measures of hypothesis testing and statistical significance literally don't hold. So there's a fundamental dishonesty there that's not necessarily fraud, because it sort of became common practice in our field--that you could try different things and see what worked; and eventually, though, you might come to convince yourself that the thing that worked was the one that was the weirdest or cleverest or confirmed your bias.

Susan Athey: Yeah, confirmed what you were expecting. Exactly. And so I think that this is a very big problem. I have a recent paper about using machine learning to try to develop measures of robustness. So, how do you think about this problem in a new world where you have, perhaps, more variables to control for than you have observations? So, let's take like a simpler example of measuring the effect of a drug. Suppose I had a thousand patients, but I had also a thousand characteristics of the patients.

Russ Roberts: That might interact with whether the drug works.

Susan Athey: Exactly. Suppose that 15 of these 1000 patients have really great outcomes. Now I know a thousand things about these people. There must be something that those 15 people have in common. Okay?

Russ Roberts: Perhaps.

Susan Athey: Well, if you have a thousand characteristics of them, it's almost certain that you can find some things that they share in common. So, maybe they are mostly men--not all men, but mostly men. Maybe they are between 60 and 65. Maybe they mostly had a certain pre-existing condition. We have so many characteristics of them, if I just search I can find some characteristics they have in common. Then I can say: My new data-driven hypothesis is that men between 60 and 65 who had the pre-existing condition have a really good treatment effect of the drug. Now, there's going to be some other people like that, too; but because this group has such good outcomes, if you just average in a few more average people you are still going to find really good outcomes for this group of people. And so that is the kind of bad example of data mining. It's like, just spuriously, these people happen to have good outcomes, and I find something in the data that they have in common and then try to prove that the drug worked for these people.

Russ Roberts: An ex post story to tell about why this makes sense--and it turns out that if it were a different set of characteristics I could tell a different ex post story.

Susan Athey: Right. Exactly. So, you might have liked to do that as a drug company or as a researcher if you had two characteristics of the people. But it just wouldn't have worked very well, because if it was random that these 15 people did well, it would be unlikely that they'd have these two characteristics in common. But if you have 1000 characteristics then you can almost certainly find something that they have in common, and so you can pretty much always find a positive effect for some subgroup. And so that type of problem is really a huge issue.
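The patient example can also be simulated directly: with 1000 characteristics and a drug that does nothing, a search over subgroups still turns up one with an impressive-looking outcome gap. Everything below is simulated for illustration.

```python
# Spurious subgroups: 1000 patients, 1000 binary characteristics, zero true
# drug effect. Searching for the best-looking subgroup still "finds" one.
import numpy as np

rng = np.random.default_rng(3)
n_patients, n_features = 1000, 1000
X = rng.integers(0, 2, size=(n_patients, n_features))  # patient characteristics
outcome = rng.normal(0, 1, n_patients)                  # no drug effect at all

best_gap, best_feature = 0.0, None
for j in range(n_features):
    in_group = X[:, j] == 1
    if in_group.sum() < 30 or (~in_group).sum() < 30:
        continue                                        # skip tiny subgroups
    gap = outcome[in_group].mean() - outcome[~in_group].mean()
    if abs(gap) > abs(best_gap):
        best_gap, best_feature = gap, j

print(f"best-looking subgroup: characteristic {best_feature}, gap {best_gap:.2f}")
# The gap is pure noise, but an ex post story could always be told about it.
```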

Russ Roberts: Because it doesn't tell you whether, in a different population of 1000 people with 1000 characteristics, the things you identified are going to work for them.

Susan Athey: Exactly.

Russ Roberts: You had no theory.

Susan Athey: Yeah. And so I started from the hypothesis that, of 1000 people, 15 people are going to have good outcomes. And so even if there was no effect of the drug, you would "prove" that this subgroup had great treatment effects.

21:30

Russ Roberts: And going the other way, if 15 people had tragic side effects, you would not necessarily conclude that the drug was dangerous. Or bad for the rest of the population.

Susan Athey: Exactly. And so the FDA (Food and Drug Administration) requires that you put in a pre-analysis plan when you do a drug trial to exactly avoid this type of problem. But the problem with that, the inefficiency there, is that most drugs do have heterogeneous effects. And they do work better for some people than others. And it's kind of crazy to have to specify in advance all the different ways that that could happen. And so you are actually throwing away good information by not letting the data tell you afterwards what the treatment effects were for different groups. So, machine learning can come in here to solve this problem in a positive way, if you are careful. And so, there's a few ways to be careful. The first point is just to distinguish between the causal variable that you are interested in and everything else that is just sort of characteristics of the individuals. So, in the example of the minimum wage you are interested in the minimum wage. That variable needs to be sort of carefully modeled, and it's not something you maybe put in and maybe don't put in. You are studying the effect of the minimum wage: that's the causal treatment. So you treat that separately. But then all the characteristics of the city are sort of control variables.

Russ Roberts: And the population.

Susan Athey: And the population. And so, you want to give different treatment to those. And in classic econometrics you just kind of put everything in a regression, and the econometrics were the same--like, you'd put in a dummy variable for whether there was a higher minimum wage, and you would put in all these other covariates; but you'd use the same statistical analysis for both. And so I think the starting point for doing causal inference, which is mostly what we're interested in, in economics, is that you treat these things differently. So, I'm going to use machine learning methods, or sort of data mining techniques, to understand the effects of covariates, like, you know, all the characteristics of the patients; but I'm going to tell my model to do something different about the treatment effect. That's the variable I'm concerned about. That's the thing I'm trying to measure the effect of. And so, the first point is that, if you are using kind of machine learning to figure out which patient characteristics go into the model, or which characteristics of the city go into the model, you can't really give a causal interpretation to a particular political variable, or a particular characteristic of the individual, because there might have been lots of different characteristics that are all highly correlated with one another. And what the machine learning methods are going to do is they are going to try to find a very parsimonious way to capture all the information from this very, very large set of covariates. So, the machine learning methods do something called regularization, which is a fancy word for picking some variables and not others. Remember, you might have more variables than you have observations, so you have to boil it down somehow. So, they'll find--they might find one variable that's really a proxy for lots of other variables. So, if you're going to be doing data mining on these variables, you can't give them a causal interpretation. But the whole point is that you really only should be giving causal interpretations to the policy variables you were interested in to start with--like, the drug. Like the minimum wage. So, in some sense you shouldn't have been trying to do that anyways. So the first step is if you are going to do kind of data mining, you don't give causal interpretations to all the control variables you did data mining over. The second point, though, is that even if you are not giving them causal interpretations, you can still find spurious stuff. If you tell a computer to go out and search--
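One concrete procedure in the spirit of what Athey describes--let machine learning handle the many control variables while the treatment variable is modeled separately--is post-double-selection with the Lasso: select controls that predict the outcome, select controls that predict the treatment, then run an ordinary regression of the outcome on the treatment plus the union of selected controls. This is a hedged sketch of that general approach on simulated data, not necessarily the exact method she has in mind.

```python
# Post-double-selection sketch: machine learning picks the controls, but the
# treatment variable gets its own, separately modeled role. Simulated data.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
n, p = 500, 200
X = rng.normal(size=(n, p))                        # many candidate controls
treatment = X[:, 0] + rng.normal(size=n)           # treatment depends on a control
y = 1.0 * treatment + 2.0 * X[:, 0] + rng.normal(size=n)   # true effect = 1.0

keep_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)          # predict outcome
keep_d = np.flatnonzero(LassoCV(cv=5).fit(X, treatment).coef_)  # predict treatment
controls = X[:, np.union1d(keep_y, keep_d)]

design = sm.add_constant(np.column_stack([treatment, controls]))
fit = sm.OLS(y, design).fit()
print("estimated treatment effect:", round(fit.params[1], 2))
# The selected controls get no causal reading; only the treatment coefficient does.
```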

Russ Roberts: Look for patterns--

Susan Athey: Go out and search 10,000 variables--

Russ Roberts: Especially over time--

Susan Athey: just like the example that I gave of the patients, it will find stuff that isn't there. And the more variables you have, the more likely it is to find stuff that isn't there. So one of the ways that I've advocated modifying machine learning methods for economics, to sort of protect yourself against that kind of over-fitting, is a very simple method--it's so simple that it seems almost silly, but it's powerful--which is to do sample splitting. So you use one data set to figure out what the right model is; and another data set to actually estimate your effects. And if you come from a small data world, like, throwing away half your data sounds--

Russ Roberts: Horrifying.

Susan Athey: Like a crime. Horrifying. Like, how am I going to get my statistical significance if I throw away half my data? But you have to realize the huge gain that you are getting, because you are going to get a much, much better functional form, a much, much better goodness of fit by going out and doing the data mining; but the price of that is that you will have overfit. And so you need to find a clean data set to actually get a correct standard error.
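The sample-splitting idea she describes can be sketched in a few lines: one half of the data is used only to choose the specification, and the untouched half is used to estimate the effect and report standard errors as if that specification had been fixed in advance. The data and variable names below are simulated assumptions.

```python
# Sample splitting: select the model on half A, estimate on half B.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n, p = 1000, 100
X = rng.normal(size=(n, p))
treatment = rng.integers(0, 2, n)                       # randomized 0/1 treatment
y = 0.5 * treatment + X[:, 0] - X[:, 1] + rng.normal(size=n)

X_a, X_b, d_a, d_b, y_a, y_b = train_test_split(
    X, treatment, y, test_size=0.5, random_state=0)

# Half A: model selection only (which controls matter for the outcome).
selected = np.flatnonzero(LassoCV(cv=5).fit(X_a, y_a).coef_)

# Half B: estimate the treatment effect with the selected controls, reporting
# standard errors as if the specification had been chosen in advance.
design = sm.add_constant(np.column_stack([d_b, X_b[:, selected]]))
fit = sm.OLS(y_b, design).fit()
print("effect:", round(fit.params[1], 2), " std err:", round(fit.bse[1], 3))
```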

26:45

Russ Roberts: So, let me make sure I understand this. It strikes me as an improvement but not as exciting as it might appear. So, the way I understand what you are saying is, let's say you've got a population of, say, low education workers or low skill workers and you are trying to find out the impact of the minimum wage on them. So, what you do is only look at half of them. You build your model, and you do your data mining. And then you, then, try to see how the curve that you fit or relationship that you found in the data in the first half of the sample, if it also holds in the second. Is that the right way to say it?

Susan Athey: So, I would use the model--so, what I would be doing with the model is trying to control for time trends. And so I would have selected a model, say, and selected variables for regression, in a simple case. And then I would then use those selected variables and I would estimate my analysis in the second stage as if, just as if I were doing it in the old fashioned way. So, the old fashioned way is sort of the gods of economics told you that these three variables were important, and so you write your paper as if those are the only three you ever considered, and you knew from the start that these were the three variables. And then you report your statistical analysis as if you didn't search lots of different specifications. Because if you listen to the gods of economics, you didn't need to search lots of different specifications. And so--

Russ Roberts: That's in the kitchen. Nobody saw the stuff you threw out.

Susan Athey: Nobody saw you do it.

Russ Roberts: Nobody saw the dishes that failed.

Susan Athey: Exactly. And so you just report the output. Now, what I'm proposing is: Let's be more systematic. Instead of asking your research assistant to run 100 regressions, let the machine run 10,000. Then pick the best one. But you only use what you learned on a separate set of data.

Russ Roberts: So, again, it seems to me that you are trying to hold its feet to the fire. You went through these 10,000 different models, or 100,000. You found 3, or 1, whatever it is, that seemed plausible or interesting. And now you are going to test it on this other set of data?

Susan Athey: No, you're going to just use it, as if the gods of economics told you that was the right one to start with.

Russ Roberts: Why would that be an improvement?

Susan Athey: Because it's a better model. It would be a better fit to the data. And you don't, by the way, in the first part you don't actually look for what's plausible. You actually define an objective function for the machine to optimize, and the machine will optimize it for you.

Russ Roberts: And what would that be, for example? So, in the case of traditional statistical analysis, in economics, I'm trying to maximize the fit; I'm trying to minimize the distance between my model and the data set, the observations, right?

Susan Athey: Exactly. So, it would be some sort of goodness of fit. So, you would tell it to find the model that fit the data the best--with part of the sample.

Russ Roberts: And so then I've got--traditionally I'd have what we'd call marginal effects: the impact of this variable on the outcome holding everything else constant. Am I going to get a similar--

Susan Athey: Sure. Well, there's a number of different--

Russ Roberts: The equivalent of a coefficient, in this first stage?

Susan Athey: Sure. So, let's go back. In the simplest example, suppose you were trying to do a predictive model. So, all I wanted to do is predict employment.

Russ Roberts: Yup. How many jobs are going to be lost or gained.

Susan Athey: And so if you think about a difference in differences setting, the setting where somebody changed the minimum wage, what you are trying to do is predict what would have happened without the minimum wage change. So, you can think of that component as the predictive component. So, I'm going to take a bunch of data--this would be data from, say, cities that didn't have a minimum wage change, or also data from the city that did but prior to the change. I would take sort of the untreated data and I would estimate a predictive model. And then the objective I give to my machine learning algorithm is just how well you predict.

Russ Roberts: Do the best job fitting the relationship between any of the variables we have to, say, employment in the city for workers 18-24 who didn't finish college.

Susan Athey: Exactly. But a key distinction is that that is a predictive model; and the marginal effect of any component of that model should not be given a causal interpretation. So, just because political sentiment was correlated with employment, it doesn't mean that political sentiment caused employment.

Russ Roberts: Agreed.

Susan Athey: If you want to draw a conclusion like that, you're going to need to design a whole model and justify the assumptions required to establish a causal effect. So, a key distinction is that these predictive models shouldn't be given causal interpretations. But you tell the machine: 'Find me the best predictive model.'
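A sketch of the predictive component she describes: fit a flexible model on the untreated data (other cities, plus the treated city before the change), predict the treated city's counterfactual path after the change, and read the effect as actual minus predicted. The file name, column names, and the choice of a gradient-boosting model are all hypothetical.

```python
# Predictive counterfactual for a difference-in-differences-style setting.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("city_panel.csv")               # hypothetical monthly city panel
features = ["lag_employment", "avg_income", "population", "month"]

untreated = df[df["treated_period"] == 0]        # city-months without the policy
model = GradientBoostingRegressor().fit(untreated[features],
                                        untreated["employment"])

after = df[(df["city"] == "Seattle") & (df["treated_period"] == 1)]
counterfactual = model.predict(after[features])
effect = after["employment"].values - counterfactual
print("estimated monthly effects:", effect.round(1))
# The fitted model is purely predictive: its individual feature effects get no
# causal reading; only the treated-minus-predicted gap is interpreted.
```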

31:48

Russ Roberts: Here again, when you say 'predictions,' it's really just fitting. Because I'm never going to be able to evaluate whether the relationships between the variables that I've been using or that the machine told me to use, whether they accurately predict going forward.

Susan Athey: Right. It's only going to fit in the data set you have. And you are going to assume that all those relationships are stable. So what can go wrong is there could be some things that are true in your current data set, but spuriously true.

Russ Roberts: Right. They won't hold in the future.

Susan Athey: They won't hold in the future. Just from random variation in terms of how your sample ended up, it just happened that people who were high in some variables happened to be high in other variables: those relationships were spurious and they are not going to hold up in the future. The machine learning tries to control for that, but it's limited by the data that it has. And so if there's a pattern that's true across many individuals in a particular sample, it's going to find that and use that. And so, the models are sort of uninterpretable. They are prone to pick up things that may not make sense in terms of making predictions over a longer term, holding up when the environment changes, holding up when, say, there was a bad-weather winter that affected employment. And so, the machine learning model might pick up things that are correlated with bad weather; and the next year the weather isn't bad.

Russ Roberts: And those other variables are still there.

Susan Athey: And those other variables are still there, but the relationship between them and employment no longer holds. So, there's a big push right now in some subsets of the machine learning community to try to think about, like, what's the benefit of having an interpretable model? What's the benefit of having models that work when you change the environment? They have words like 'domain adaptation,' 'robust,' 'reliable,' 'interpretable.' And what's interesting is that in econometrics we've always focused on interpretable, reliable models and we always have wanted models that are going to hold up in a lot of different settings. And so, that push in econometrics has caused us to sacrifice predictive power. And we never really fully articulated the sacrifice we were making, but it was sort of implicit that we were trying to build models of science, trying to build models of the way the world actually works--not trying to just fit the data as well as possible in the sample. As a result, we're not that good at fitting data. The machine learning techniques are much better at fitting any particular data set. Where the econometric approaches can improve is where you want your model to hold up in a variety of different circumstances and where you are very worried about picking up spurious relationships or being confounded by underlying factors that might change. And so I think where I'd like to go with my research, and where I think the community collectively wants to go, both in econometrics and machine learning, is to use the best of machine learning--to sort of automate the parts that we're not good at hand-selecting or interpreting, allowing us to use wider data sets and lots of different kinds of variables--but figuring out how to constrain the machine to give us more reliable answers. And finally, to distinguish very clearly the parts of the model that could possibly have a causal interpretation from the parts that cannot. And so, there, all the old tools and insights that we have from econometrics about distinguishing between correlation and causality--none of that goes away. Machine learning doesn't solve any of those problems. Because most of our intuitions and theorems about correlation versus causation already kind of assumed that you have an infinite amount of data; that you could use the data to its fullest. Even though in practice we weren't doing that, when we would say something like, in effect, 'it's not identified'--we'd say 'you cannot figure out the effect of the minimum wage without some further assumptions,' without some assumption like that the cities that didn't get the change have a similar time trend to the ones that did--those are kind of conceptual, fundamental ideas. And using a different estimation technique doesn't change the fundamentals of identifying causal effects.

36:17

Russ Roberts: So, let's step back and let me ask a more general question about the state of applied econometrics. The world looks to us, to economists, for answers all the time. I may have told this story on the program before--I apologize to my listeners, but Susan hasn't heard it--but a reporter once asked me: How many jobs did NAFTA (North American Free Trade Agreement) create? Or how many jobs did NAFTA destroy? And I said, 'I have no idea.' And he said, 'So what do you mean, you have no idea?' And I said, 'I have a theoretical framework for thinking about this, that trade is going to destroy certain kinds of jobs: A factory that moves to Mexico, you can count those jobs; but what we don't see are the jobs that are created because people ideally have lower prices for many of the goods they face, and therefore they have some money left over and might expand employment elsewhere. And I don't see that, so I can't count those jobs. So I don't really have a good measure of that.' And he said, 'But you are a professional economist.' And I said, 'I know--I guess. Whatever that means.' And he said, 'Well, you're ducking my question.' I said, 'I'm not ducking your question: I'm answering it. You don't like the answer, which is: I'd love to know, but I don't have a way to answer that with any reliability.' And so, similarly, when we look at the minimum wage: when I was younger, everyone knew the minimum wage destroyed lots of jobs; it was a bad idea. In the 1990s, with Card and Krueger's work, and others, people started to think, 'Well, maybe it's not so bad,' or 'Maybe it has no effect whatsoever.' And there's been a little bit of a pushback against that. But now it's a much more open question. Recently, I'd say the most important question in a while has been: Did the Stimulus create jobs, and how many? People on both sides of the ideological fence argue both positions: some say it created a ton, and others say it didn't; some even would say it cost us jobs. So, how good do you think we are at the public policy questions that the public and politicians want us to answer? And have we gotten better? And do you think it's going to improve, if the answer is Yes? Or do you think these are fundamentally artistic questions, as you suggested they might be earlier in our conversation?

Susan Athey: I think we're certainly getting a lot better. And having more micro data is especially helpful. So, all of these questions are really counterfactual questions. We know what actually happened in the economy: that's pretty easy to measure. What we don't know is what would have happened if we hadn't passed NAFTA. Now, something like NAFTA is a particularly hard one, because, as you said, the effects can be quite indirect.

Russ Roberts: It's a small part of our economy, contrary to the political noise that was made around it.

Susan Athey: But also the benefits are maybe quite diffuse.

Russ Roberts: Correct.

Susan Athey: And so, one side can be easy to measure, like a factory shutting down; but the benefits from, say, lots of consumers having cheaper goods and thus having more disposable income to spend on other things--the benefits to all the poor people who are not having to pay high prices for products--those benefits are much harder to measure because they are more diffuse. So that's a particularly hard one. For something like the minimum wage, you know, having more micro-level data is super helpful. So, in the past you only had state-level aggregates. Now you have more data sources that might have, you know, city-level aggregates. Plus more, different data sources to really understand at the very micro level what's happening inside the city to different industries, to different neighborhoods. And so on. And so, better measurement, I think, can really, really help you. If you think about macro predictions, you might ask a question: 'Gee, well, if we had a major recession, what's going to happen in a particular industry?' Well, in the past, you might have just had to look back and say, 'Gee, what happened to that industry when the GDP (Gross Domestic Product) of the United States went down?' Now, if you have more micro data you could look back at the county level, or even the city level, and see the specific economic performance of that specific city, and see how that related to purchases in that particular industry. And then you have much more variation to work with. You don't just have, like, 2 recessions in the past 10 years--like, 2 data points. Because each locality would have experienced its own mini-recession. And they might have experienced more recessions, in fact. And so you could really have much more variation in the data: many more mini-experiments. Which allows you to build a better model and make much more accurate predictions. So, broadly, modeling things at the micro level rather than at the macro level can be incredibly powerful. But that's only going to work well if the phenomena are micro-level phenomena. So, if you think about the economics: say, the average income in a county is a good predictor of restaurant consumption in that county. And you can pretty much kind of look at that locally. But NAFTA, or something like that, might affect, you know, people's movements between cities. It could have other types of macroeconomic impact, so each city wouldn't be an independent observation. And so for those kinds of macro-level questions, yeah, more micro data can help, but it's not giving you, you know, 3000 experiments where you used to have 1.

41:49

Russ Roberts: So, it strikes me that in economics we are always trying to get to the gold standard of a controlled experiment. And since we often don't have controlled experiments, we try to use the statistical techniques we've been talking about to try to parse out differences--given the reality that we don't have a controlled experiment--and to control for factors that actually change. And we're inevitably going to be imperfect, because of the fact that we don't observe every variable, and inevitably there are things that are correlated that we don't understand. Yet, I feel like, in today's world, at the super-micro level, and in particular on the Internet with certain companies' access to data about us, they can do something much closer to a controlled experiment--sort of, what's called an A/B test--than we can do as policy makers, as economists. Do you think that's true? Are we learning? Are companies who are trying different treatments, essentially, on their customers getting more reliable results from those, effectively, than they would have in the past by hiring an economist to just do analysis on their customers--with, say, an advertising campaign, which was always very difficult?

Susan Athey: So, you brought up this idea of the A/B test, which is just really incredibly powerful. And I think when people think about Silicon Valley they imagine Steve Jobs in a garage, or the invention of the iPhone or the iPad, and they think that's what Silicon Valley is--

Russ Roberts: It's wild--

Susan Athey: all about.

Russ Roberts: Some leap of genius.

Susan Athey: Yeah. But most of the innovation that happens in the big tech companies is incremental innovation. And the A/B test is probably the most impactful business process innovation that has occurred in a very long time. [More to come, 43:47]
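For comparison with the observational problems discussed earlier, a bare-bones A/B test analysis looks like this: users are randomly split between two variants and the conversion rates are compared with a two-proportion z-test. The counts are made up for illustration.

```python
# Minimal A/B test analysis with made-up numbers.
from statsmodels.stats.proportion import proportions_ztest

conversions = [620, 680]      # conversions under variants A and B
exposures = [10000, 10000]    # users randomly assigned to A and B

z_stat, p_value = proportions_ztest(conversions, exposures)
lift = conversions[1] / exposures[1] - conversions[0] / exposures[0]
print(f"lift: {lift:.3%}, z = {z_stat:.2f}, p = {p_value:.3f}")
# Because assignment is randomized, the difference can be read causally,
# unlike the observational comparisons discussed earlier in the episode.
```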
