As part of our AI For Growth executive education series, we interview top executives at leading global companies who have successfully applied AI to grow their enterprises. Today, we sit down with Max Sklar, Head of Machine Learning Attribution at Foursquare.
User attribution, especially between offline and online worlds, is a persistent challenge for marketers. Using novel machine learning techniques on top of Foursquare’s incredible data trove of consumers’ physical behavior, Max enables enterprise customers to identify when online campaigns have driven offline revenue and convert real-world foot traffic into long-term digital customers and social media fans.
In this interview, you’ll learn:
- Current challenges and machine learning solutions to attribution in digital marketing
- Why becoming more comfortable with probabilistic thinking is a good quality for business and technology leaders to have
- Why logistic regression is still a great idea
Today, I have the pleasure of being joined by Max Sklar from Foursquare. Max, tell our audience about yourself and how you first got into AI.
Max Sklar: Hi Mariya, thanks for having me on the program. To start with my story, my journey to Foursquare, to AI and machine learning, was really the intersection of two interests that I had. I worked as a software engineer in New York for a few years, and then I went to grad school and discovered machine learning and natural language processing when I was there.
I was really fascinated on a theoretical level by the idea of these really interesting problems, because we’re not solving something where we know how to program the computer directly, we’re [programming] the machine [that] is trying to figure out the model on its own, and that was really fascinating to me. I wanted to discover ways in which these techniques can be used in the real world.
At the same time, I had this interest in local search maps going way back to when I was an undergrad. I had a website called Sticky Map back in 2005-2006, and it was just when the Google Maps API started coming out. I built a map Wikipedia type situation, where people would post little markers on the map and add little messages at different locations.
I thought it was really cool that everyone was using that, so when I discovered Foursquare I saw it all come together and I was able to work on a product like this while also applying machine learning [and] NLP to various parts of the business and of the product.
I started out working on Foursquare’s local recommendation engines to figure out where the best places in the city are, and I worked on venue ratings, trying to figure out which places are good, which places are not so good. Then I worked on a product called Marsbot, which is Foursquare’s chatbot. Now I’m applying machine learning to one of our B2B business applications, attribution, which we’ll talk about in a bit, to try to build that out as smart as possible.
In the last few months, I started a podcast called the Local Maximum, and I talk to people in AI and machine learning every week. Sometimes we talk about more general topics in tech and entrepreneurship. It’s been a lot of fun. I’ve been able to talk to a lot of interesting people and network with a lot of interesting people.
MY: I love the title of that particular podcast, and we’ve even talked about my Applied AI book on there, so we’ll be including a link in the article that accompanies this video series.
MS: I know for a fact at least one person bought your book based on the podcast interview.
MY: That’s wonderful, I appreciate that! Let’s go back to this problem that you were talking about, which is B2B attribution.
Attribution is the bane of marketers. You spoke earlier about how machine learning can be used to solve problems that we don’t know how to explicitly code ourselves, so I know you’ve been thinking about this problem for a really long time.
Can you share with our audience what you are defining as attribution, and what it actually means for Foursquare and your customers?
MS: Attribution is a causal model. We’re trying to figure out whether being exposed to an ad actually causes someone to change their behavior. The behavior that Foursquare knows about, that we specialize in, is whether you actually visited a store.
Foursquare doesn’t look at whether you clicked on stuff, it doesn’t look at what websites you visited, or even what purchases you made— although we have some of that data in there—our specialty is “can I get people to go into my stores?” We’re trying to figure out whether an ad causes someone to visit stores vs. is just correlated to going to stores. It’s the whole causation versus correlation thing, which is very difficult to tease out.
MY: Why is attribution so difficult? You mentioned correlation versus causality. Causal models are quite difficult to build, period, not just in machine learning models.
What are some other reasons that attribution is so difficult from a technical standpoint?
MS: There are a lot of factors that go into human behavior, and it’s really complicated. Any model that we build is a huge oversimplification of reality.
I’ll use a chain [store], Starbucks, as an example. Starbucks isn’t a client of ours, but it’s one that everybody understands. I’m targeting people who are around the same age range that go to Starbucks. I’m targeting people who have a little extra income and go to Starbucks in the morning, so they’re already going to a large extent. In what way did my ad cause them to go, and in what way was I just targeting people who are already going to go to Starbucks?
There’s always another step you can take, so maybe I’ll correct by what city they’re in, or maybe I’ll try to correct by age and gender, or maybe I’ll correct by whether they go to coffee shops in general, and this is just an infinite process. You can go on.
Also, [you could correct by] how those features interact with each other, and they could interact with each other in an almost infinite number of ways. It’s constant trial and error in figuring out what features matter, what attributes matter, and what we want to correct for the most.
Another problem that’s related is that often times if you’re not careful, you’re just measuring and optimizing for targeting. I want to target people who are going to Starbucks anyway. That’s obviously not good from a business perspective.
The way I think about it is someone’s in the store and they’re about to swipe the credit card, then somebody runs over to them, whispers in their ear, “Buy a cup of coffee,” and then they swipe. Then that person cheers, “Hey, I get credit for that! Pay me!”
MY: That’s what Google and Facebook claim, right? “Oh yeah, we drove that guy to click on your ad, so pay both of us.” And you’re like, “No.”
MS: The third problem is that the data, in general, is very noisy. Everywhere in the chain of data, there’s noise that shuffles things around a little bit. When you have a lot of noise in the data, it’s always very difficult to separate out the signal from the noise.
For example, when we get the information [on] who saw the ads, there are errors in that. There’s even fraud in that, like maybe some groups say so-and-so saw the ad when they didn’t. You try to figure out which groups of users are actually the same person or the same family. Lots of errors in that.
On every level of the technology, every level in the chain, it introduces errors and it’s a lot of work to take all of them into account, and you’re always finding more sources of error. People have said that it’s also an adversarial problem: if somebody gets really good at attribution, then there’s somebody else waiting in the wings to take advantage of some feature you didn’t correct for, to try to optimize for that when they’re not helping the businesses in the end.
MY: Attribution has always been a problem. What have been historic ways that people have attempted to either correct these errors at all these different levels of the technology, or attempt to mitigate them in some way, or even just use some other more novel technique? What historically has been done to solve this problem?
MS: One of the best ways to do it is to do a controlled experiment, where you show a certain group of people the ad, and then you have another group of people where you don’t show them an alternative ad but instead you just let the bidding go to someone else.
What happens there is you can actually compare the two groups. It’s pretty fair. The sources of error will be the same on both sides so long as it’s random, and you can measure most effectively whether your ad is actually working or not.
But we don’t do that, and the problem with that is, first of all, you have to be on a particular platform that allows that, and often times we’re measuring attribution after the fact, so we can’t even do that.
Also, it’s very costly for these companies. They want their ad to go out to as many people as possible. When they say, “Okay, but you’re gonna hold back a certain group and they’re not going to see it,” then it’s like, “We can’t afford to do that, we have to maximize our ad budget.” There’s always that tension.
I would say if you can do a controlled experiment, go for it, but if you can’t, then there are these other techniques that we’re working on.
The technique that we had most recently replaced was a very good technique—I’ll talk about why we replaced it in a second—and that’s called one-to-one matching. We don’t have a controlled experiment, but we do know who saw the ad, so we construct a control group that is demographically identical to the exposed group. We started with five attributes: age, gender, your recency statistics (you’re put into buckets as to how long ago you visited in the past), your DMA (essentially, your city), and the language that you speak.
All of those signals have some error. We don’t know everyone’s gender and age exactly.
MY: Is the error mostly just missing values, or you also sometimes get incorrect values?
MS: Both. The missing values for age and gender, we try to impute. We have a whole other ML algorithm for that. The way I think about it is “everybody who sees my ad” gets a buddy, and the buddy is demographically identical to them.
MY: Like a digital twin!
MS: Yeah. How much would you have visited had you not seen my ad? We look at how much your buddy visited.
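Editor's sketch: the buddy idea can be illustrated with a few lines of Python. This is a minimal illustration under stated assumptions, not Foursquare's implementation. The field names (`age_bucket`, `recency_bucket`, `dma`, and so on), the `visits` counter, and the rule that each buddy is used only once are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict

# Hypothetical matching key: the five attributes named in the interview.
KEY_FIELDS = ("age_bucket", "gender", "recency_bucket", "dma", "language")

def match_key(user):
    return tuple(user[f] for f in KEY_FIELDS)

def build_control_group(exposed, unexposed, seed=0):
    """Give each exposed user one unexposed 'buddy' with an identical key.

    Returns (pairs, unmatched): pairs is a list of (exposed_user, buddy),
    unmatched lists the exposed users for whom no buddy was left.
    """
    rng = random.Random(seed)
    pool = defaultdict(list)
    for user in unexposed:
        pool[match_key(user)].append(user)
    for candidates in pool.values():
        rng.shuffle(candidates)  # pick buddies at random within a cell

    pairs, unmatched = [], []
    for user in exposed:
        candidates = pool.get(match_key(user))
        if candidates:
            pairs.append((user, candidates.pop()))  # each buddy is used once
        else:
            unmatched.append(user)
    return pairs, unmatched

def lift(pairs):
    """Actual visits of the exposed group over visits of their buddies."""
    actual = sum(e["visits"] for e, _ in pairs)
    expected = sum(b["visits"] for _, b in pairs)
    return actual / expected
```

Exposed users who end up in `unmatched` are exactly the people for whom no identical buddy exists, and every unexposed user who is never picked is ignored by the estimate.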
It’s pretty good. It corrects for those five things, but we noticed several limitations, which is why we replaced it.
One limitation is that we could have missing data. We have imputed data, so for a lot of people, we don’t know their gender. We just know there’s a 70% chance this person’s a female, 30% chance they’re male. Can we use that better than just randomly choosing 70/30 or trying to pick the largest one?
A second problem is that we can’t add more features to it, so not everyone’s gonna have a match. You might not have a buddy that’s exactly the same as you. [If] I want to also match on whether you have an iPhone or an Android, I want to match an Android user to an Android user and iPhone users to an iPhone user, but then that cuts your potential number of matches in half. The more you do that, you get a version of the curse of dimensionality, and the number of matches goes down [until] you can’t do it anymore.
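Editor's sketch: the arithmetic behind that shrinking pool is easy to check. The level counts and the pool size below are made-up numbers for illustration, and the calculation assumes users are spread evenly across combinations, which real data will not be.

```python
from math import prod

def cells(levels_per_feature):
    """Number of distinct attribute combinations a buddy must match exactly."""
    return prod(levels_per_feature)

def avg_candidates_per_cell(pool_size, levels_per_feature):
    """Average pool members per combination, if users were spread evenly."""
    return pool_size / cells(levels_per_feature)

# Hypothetical level counts: 6 age buckets, 2 genders, 3 recency buckets,
# 200 DMAs, 5 languages.
base = [6, 2, 3, 200, 5]
pool = 1_000_000  # illustrative pool size, not a Foursquare figure

for extra in range(0, 11, 2):
    # Add `extra` yes/no features such as iPhone vs. Android.
    levels = base + [2] * extra
    print(extra, cells(levels), round(avg_candidates_per_cell(pool, levels), 2))
```

Each added yes/no feature doubles the number of cells and halves the average number of candidate buddies per cell.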
The third problem is that we have this huge panel of location data that we use to do attribution, I think it’s 13 million users in the US, and we’re going to be throwing most of it away for this one-to-one model where we have a small number of people who saw the ad and then a small control group of buddies, and the rest of it is discarded.
Then, what we have is an actual visit rate for the exposed group, an expected visit rate for the control group, and then you have a fraction of actual visits over expected visits, and that fraction here represents the lift.
The issue is each side of the fraction is a count, and we imagine [count data] is generated by an actual rate of visitations—that’s like a Poisson distribution—and that generates some uncertainty. The numerator is an uncertain value, the denominator is an uncertain value, and then when you divide them, you get an answer with an even larger error bound. If we can have a machine learning control group and reduce the error bounds on the denominator, we can reduce the error bounds on the whole thing.
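Editor's sketch: a rough way to see why tightening the denominator helps. The function below uses a standard delta-method approximation on the log of the ratio and assumes the two sides are independent. The `expected_rel_sd` parameter is a hypothetical knob standing in for "a denominator estimated more precisely than a raw count"; the interview does not say how Foursquare computes its bounds.

```python
from math import exp, sqrt

def lift_interval(actual, expected, expected_rel_sd=None, z=1.96):
    """Approximate 95% interval for lift = actual / expected.

    Treats `actual` as a Poisson count, so its relative variance is about
    1 / actual. By default `expected` is treated the same way (a matched
    control count). Pass `expected_rel_sd` to model a denominator whose
    relative standard deviation is known to be smaller.
    """
    rel_var_num = 1.0 / actual
    rel_var_den = 1.0 / expected if expected_rel_sd is None else expected_rel_sd ** 2
    sd_log = sqrt(rel_var_num + rel_var_den)  # sd of log(lift)
    point = actual / expected
    return point * exp(-z * sd_log), point, point * exp(z * sd_log)

# Same point estimate, two different levels of denominator uncertainty.
print(lift_interval(1100, 1000))                        # count-based control
print(lift_interval(1100, 1000, expected_rel_sd=0.01))  # tighter denominator
```

The point estimate is identical in both calls, but the second interval is narrower because only the numerator's noise remains substantial.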
Essentially, what the denominator (the control group) is doing, that’s just your expected visits given that you didn’t see the ad, and we realized this is just a machine learning problem. We can insert our data into a machine learning model, and for a given person, for all the features about them, what is their propensity on any given day to visit the chain?
The good thing about that is, first of all, it reduces the error bounds in the denominator like I said, but it could [also] take into account any feature that we’d like, so now we have 500 features that we throw in there, and we’re adding more.
The third benefit is that it could take into account the uncertainty of the features we have. If we don’t know exactly what gender you are, but we have some probability distribution coming out of another model, we could take that probability distribution as an input and you could actually use that data rather than just trying to impute it.
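Editor's sketch: the "machine learning control group" can be illustrated with a tiny propensity model. Everything here is an assumption made for the example: the data is synthetic, the two features (a recent-visit flag and a probability of being female taken from some other model) are hypothetical, and the hand-rolled gradient descent stands in for whatever training code is actually used.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """Probability that this user visits the chain on a given day."""
    return sigmoid(weights[0] + sum(w * xj for w, xj in zip(weights[1:], x)))

def fit_logistic(rows, labels, lr=1.0, epochs=600):
    """Plain batch gradient descent; weights[0] is the bias term."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def expected_visits(weights, rows):
    """Model-based denominator: visits we would expect without the ad."""
    return sum(predict(weights, x) for x in rows)

rng = random.Random(0)

def simulate(n, uplift):
    """Synthetic user-days; `uplift` multiplies the true visit probability."""
    rows, visits = [], []
    for _ in range(n):
        recent = 1.0 if rng.random() < 0.3 else 0.0
        p_female = rng.random()  # soft output of a hypothetical gender model
        p_visit = uplift * sigmoid(-2.0 + 1.5 * recent + 0.5 * p_female)
        rows.append([recent, p_female])
        visits.append(1 if rng.random() < p_visit else 0)
    return rows, visits

control_rows, control_visits = simulate(1000, uplift=1.0)  # did not see the ad
exposed_rows, exposed_visits = simulate(1000, uplift=1.2)  # saw the ad

weights = fit_logistic(control_rows, control_visits)
lift = sum(exposed_visits) / expected_visits(weights, exposed_rows)
print(f"estimated lift: {lift:.2f}")
```

Note that `p_female` enters the model as a number between 0 and 1 rather than a hard label, which is one simple way to use an uncertain attribute without forcing a guess.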
I saw those three benefits, and we saw that this is a technique that’s being used in a few other places, [so] we decided to go for it. We just launched a couple of months ago.
MY: Since your launch and your addition of these 500 new features, have you observed that certain features that you weren’t tracking in your original 5 for your digital twin matching system are actually very impactful or have a lot of predictive value?
MS: Yes. One that has a lot of predictive value that wasn’t fully captured in our original system was recency, whether you’ve been there in the past. The one-to-one matching only had three buckets: have you been there in the last 30 days, have you been there in the last 60 days, or have you not been there in the last 60 days? Everyone was put in those three buckets, and your control group buddy got matched in one of those three buckets.
Now we can have way more buckets and we could also look at frequency rather than recency, and it turns out those features are very predictive, which makes sense. How often have you gone in the past or not is a big one, and we’re still trying to look through the rest of the data to determine which features are important.
One of the interesting things is we have a hundred of these running a day. Every chain or every group of venues has a slightly different set of features. I’ve never seen this before, where we can’t find hyperparameters exactly, because my hyperparameters on one model might not be the same for the other 99. Trying to analyze all 100 at once is actually a very interesting problem which we haven’t dug into yet.
Another interesting one is what apps people have on their phone. That can be predictive on where you go.
MY: I think you can only get that on Android, or can you get that information on iPhone as well?
MS: For us, we have the Pilgrim SDK. It is embedded in some other apps in the whole app ecosystem. It’s only in apps where it really helps the consumer experience for that app. For example, it’s in Snipsnap, which is a coupon company. It’s in TouchTunes, etc. We have a bunch of apps feeding information into our panel, and it’s those apps. I don’t know exactly how many there are, maybe ten that we can track.
MY: What are you learning about what other apps people have on their phones?
MS: It’s not just “what is the source of the data?” If it’s someone who is using a couponing app, then they might behave one way. If it’s someone who is using a music app [like] a jukebox app, they might behave another way. Or if it’s someone who uses a travel app, they might behave a third way. It makes sense that those people would tend to go to slightly different places.
MY: Makes sense! You mentioned you have to build a slightly different model for every single restaurant and venue because their features are going to differ, their user distribution is going to differ…
Have you noticed some interesting patterns in terms of what features matter more for certain types of venues?
MS: Oh, that’s interesting. That’s one thing that we have been taking on a case-by-case basis. Right now that’s a further area of study for us, but one thing that has been interesting to me is that age and gender are such a big focus in the industry, like everyone wants to correct for age and gender, but there are a lot of cases where it doesn’t seem to matter all that much unless that chain is particularly targeted at a gender or an age.
Many of them are just more general. We don’t have Starbucks, but if we did, I would imagine that age and gender wouldn’t be that much of a factor. You could think maybe certain ages go to Starbucks more than others, but the model would probably find other factors that outweigh that.
MY: Indeed, I think marketing has been trying to find more behavioral data. Rather than looking at the customer demographics, looking at their psychographics and their jobs to be done.
You already mentioned one of these, which is that the apps you use are a type of behavioral data. Are there other behavioral features, other than the apps, that you’ve noticed have been useful?
MS: Another one would be what categories of places you’ve been to before. Do you tend to go to gyms? Have you been to a hardware store? Have you been to a coffee shop in the last 30 days?
That’s been really interesting, and there are a few interesting things that we’re gonna want to add in the near term, like what other chains have you been to before and do you visit expensive restaurants or not? I think we could throw those in and they would provide really interesting insights.
MY: I would love to talk about your future development plans later, but for now, I wanted to get your advice.
Obviously this is a problem that almost every marketer has. For marketers that do have this offline component, who need to track whether their ads are really driving foot traffic, what is your advice for them to get set up with a better machine learning model for attribution?
MS: Before we talk about machine learning, I would try to determine whether you can run these real A/B test experiments or not, [determine] whether that’s in your budget. Most people can’t do it all the time, but if you can, it’s a good thing to just get a sense of what’s really working and what’s not.
Whatever service that you use, try to learn a little bit about the methodology that they’re using and ask some good questions [like] whether they’re using one-to-one [matching] or whether they’re using a machine learning model.
MY: What kind of machine learning model? For example, if you’re trying to suss out how sophisticated are these people, what are the machine learning models or the machine learning algorithms that you think are actually best suited for this specific type of attribution?
MS: Often what we use is a logistic regression on just all our features. That’s a very common one for marketing data, because a lot of the time, you have all of these features that I mentioned. They’re floating out there [and] not really connected with each other. It’s not image recognition, it’s just a lot of different features. Some of them can cause a little bump in visits, some of them cause a little decline in visits, and you want to add them all up and maybe have some cross terms to determine whether features are interacting in strange ways.
We have a logistic regression where we’re constantly adding more and more features and more cut points to it, but some other groups might have some different approaches, so I would try to look at what the approaches are and try to understand why they chose that approach.
We experimented with other models. We experimented with KNN, we experimented with random forests or boosted decision trees, and we found that they were all about equally accurate. So it was like, well, we can get more data, we could do logistic regression more efficiently, and we can get a lot of insights from it that we can’t get from the other ones, so let’s just start with that.
Another thing to look at that marketers don’t ask about often enough is the idea of uncertainty and confidence bounds.
We don’t actually output a lift on the back end, what we output is our uncertainty, [or] our probability distribution over lift. It’s a graph of a probability distribution function.
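Editor's sketch: one simple way to produce a distribution over lift rather than a single number. This assumes both visit totals are Poisson counts with a flat prior on each rate, so each rate has a Gamma posterior, and it samples the ratio by Monte Carlo. The counts are invented, and the interview does not describe how Foursquare actually derives its distribution.

```python
import random

def lift_samples(actual, expected, n=20000, seed=0):
    """Draw samples from an approximate posterior over lift.

    Each Poisson rate gets a Gamma(count + 1, scale=1) posterior, which
    corresponds to a flat prior on the rate (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return [
        rng.gammavariate(actual + 1, 1.0) / rng.gammavariate(expected + 1, 1.0)
        for _ in range(n)
    ]

def summarize(samples, lo=0.025, hi=0.975):
    """Return (lower bound, median, upper bound) from the sampled distribution."""
    s = sorted(samples)
    return s[int(lo * len(s))], s[len(s) // 2], s[int(hi * len(s)) - 1]

def prob_lift_above(samples, threshold=1.0):
    """Share of the distribution that sits above a given lift threshold."""
    return sum(x > threshold for x in samples) / len(samples)

samples = lift_samples(actual=1100, expected=1000)
print(summarize(samples))
print(prob_lift_above(samples, 1.0))
```

A histogram of `samples` is the kind of graph being described, and `prob_lift_above` answers questions such as "how likely is it that the ad helped at all?" more directly than a single significance flag.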
I’ve never been asked by anyone to see the probability distribution function. I’d love to be asked!
MY: I was going to ask! Most marketers don’t even know what to do with the confidence interval. They don’t know how to interpret that.
MS: No, they just want a p-value for statistical significance, which I have a lot of criticism of, and they say, “It’s significant? Good, we’ll go with it.”
That’s not really a good way to go. A lot of these p-values can be hacked. In other words, just run simulations again and again until you get the answer that you want, which is not good.
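Editor's sketch: a small simulation of why "re-run until significant" is a problem. In every simulated analysis the ad has no effect at all, so any significant result is a false positive. The sample size, visit rate, and choice of a two-proportion z-test are assumptions for the example.

```python
import math
import random

def p_value_two_proportions(x1, n1, x2, n2):
    """Two-sided z-test p-value for a difference between two proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def one_null_analysis(rng, n=500, rate=0.1):
    """One analysis where the ad truly has no effect on visits."""
    exposed = sum(rng.random() < rate for _ in range(n))
    control = sum(rng.random() < rate for _ in range(n))
    return p_value_two_proportions(exposed, n, control, n)

def false_positive_rate(max_tries, trials=400, seed=0):
    """Share of trials reaching p < 0.05 when re-running up to max_tries times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(one_null_analysis(rng) < 0.05 for _ in range(max_tries)):
            hits += 1
    return hits / trials

print("one attempt:", false_positive_rate(1))
print("up to ten attempts:", false_positive_rate(10))
```

With a single attempt the false positive rate should sit near the nominal 5%, while allowing up to ten attempts pushes it far higher, roughly in line with 1 - 0.95^10.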
People are afraid of uncertainty bounds and PDFs, probability distribution functions, because they’re like, “Well, it’s uncertain, what am I going to do with that?”
But getting the uncertainty gives you a good idea of what this information is telling you. It gives you a good idea of “What can I expect if I run this experiment again, around where we ended up?”
I think people should think more about [the possibility that] I’m not given an answer but more of an uncertainty, one that’s still less uncertain than when I started, and ask, “How can I use that?” I think if people think about that, then they can get more use out of it.
MY: That does require a mentality shift for sure. Thinking probabilistically is definitely a skill set that business and technology leaders should have, but it’s often de-emphasized.
I wanted to go back to something that you had mentioned earlier about machine learning models. Something that surprises people is that using the latest, fanciest deep neural network is not actually always going to produce the most accurate answers, and frankly for scenarios where the data is very noisy, techniques like logistic regression are not only comparably accurate but also so much simpler to build and a lot easier and more computationally efficient.
MS: Right. In the case of the noisy data, what it allows us to do is—and I’m open to expanding it into something a little more complicated in the future…maybe in the future we can run it through a few different ones and see what happens. We have noisy data when we have irregularities in the data, which we do a lot because problems come up when there are features that we haven’t corrected for yet.
Like I said, even though we’ve added a lot more features, sometimes we’re still not picking up the actual lift, sometimes we’re picking up targeting or something else. Being able to see the weights of those features allows us to dive in and start to get an idea of what’s going on. There are a lot of optimizations that we can make as well, which have helped us out a lot.
We use a numerical library in Scala called breeze to train these models relatively quickly. We can do them pretty fast.
MY: Absolutely. Iteration speed is key for production ML, so I’m really glad that you brought that up.
What are some things that you’re still not quite able to do yet with the machine learning attribution models you’ve built, and what are your plans for the next six to twelve months in terms of new areas of innovation and research?
MS: First of all, we want to add more features, because the more features that we have…what we found is there aren’t any silver bullet features that solve all our problems. The more features we add, the less we can say, “Okay, maybe it’s not correcting for x, maybe it’s not correcting for y,” because we have them in the model.
An interesting one that I want to put in relatively soon is, for these people we have an idea of where their stops are clustered, which gives us an idea of where their home location is and where their work location is. Where’s the nearest instance of the chain from their home and work? Is there a Starbucks within a hundred feet of where you work? That could be a really good feature.
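Editor's sketch: a proximity feature like this could be computed with a great-circle distance. The function names, the feature names, and the 30.48 meter threshold (one hundred feet) are illustrative assumptions, and inferring home and work from clustered stops is taken as already done.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))

def nearest_store_features(home, work, stores, near_m=30.48):
    """Distance from home and work to the closest store, plus a proximity flag.

    `near_m` defaults to 30.48 meters, i.e. one hundred feet.
    """
    d_home = min(haversine_m(home, s) for s in stores)
    d_work = min(haversine_m(work, s) for s in stores)
    return {
        "home_dist_m": d_home,
        "work_dist_m": d_work,
        "store_near_work": d_work <= near_m,
    }
```

For example, `nearest_store_features((40.0, -74.0), (40.7, -74.0), [(40.7, -74.0001)])` flags the store as near work, since the two points are only a few meters apart. The distances themselves, perhaps bucketed, may be more informative to a model than a single yes/no flag.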
MY: We are creatures of habit. We go for things that are convenient and proximate.
MS: Yeah. A lot of information, we can glean from their past behavior. Like I said, what types of places have they visited in the past.
Another one is trying to estimate someone’s income. I would love to know whether someone is a parent or not, because…
MY: Kids are expensive! It’s very simple.
MS: Yeah. All those things affect user behavior. In addition to adding features, there’s also making better use of the features we have.
I mentioned cross terms, and this is something that logistic regression isn’t good at, but something that involves decision trees or is more nonlinear is very good at, which is trying to figure out “I’m in New York and it’s also a Sunday”: is the answer to that just the New York increase in visits versus the Sunday increase in visits, or do those two features have some interaction with each other?
A good example would be in New York and it’s a particular date, because then you can get the weather on that date, [and] weather on its own might be a good feature as well.
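Editor's sketch: what a cross term means in log-odds terms. A logistic regression only captures an interaction if the cross term is added as an explicit feature, whereas tree-based models can pick such interactions up on their own. The four visit probabilities below are invented for illustration.

```python
from math import exp, log

def logit(p):
    return log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + exp(-z))

# Hypothetical daily visit probabilities for the four combinations,
# keyed by (in_new_york, is_sunday).
rates = {
    (0, 0): 0.020,  # not New York, not Sunday
    (1, 0): 0.030,  # New York, not Sunday
    (0, 1): 0.025,  # not New York, Sunday
    (1, 1): 0.060,  # New York and Sunday
}

base = logit(rates[(0, 0)])
ny_effect = logit(rates[(1, 0)]) - base
sunday_effect = logit(rates[(0, 1)]) - base

# What a model with no cross term would predict for New York on a Sunday.
additive_prediction = sigmoid(base + ny_effect + sunday_effect)

# The cross term is whatever log-odds is left over after the two main effects.
interaction = logit(rates[(1, 1)]) - (base + ny_effect + sunday_effect)

print(f"additive prediction: {additive_prediction:.4f}")
print(f"observed rate:       {rates[(1, 1)]:.4f}")
print(f"interaction term:    {interaction:.3f}")
```

In this made-up example the additive model under-predicts the New York Sunday rate, and the positive interaction term is exactly the extra weight a cross feature would need to carry.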
Another area that we’re working on is trying to get more insights from what we have as well. We try to say okay, among different groups of users, what was the campaign effect on the different groups of users? What was the campaign effect on people from 18 to 24, for example?
There is a communication issue with that, because everything about uncertainty still applies. Among the issues that we’re seeing, we might see some city out there that had 200% lift. I don’t believe that ad could be that effective, so then we start thinking [about] what’s going on here in these specific pockets.
Negative lift happens sometimes. That occurs when there’s some targeting bias. It could be that the people who weren’t targeted by this particular ad were targeted by other better ads. It could be that they’re tending to target people who already visit a whole lot, and for some reason, we’re overestimating that. Or it could be that it was a really bad ad and it pushed people away.
MY: That’s true. Not all PR is good PR.
MS: Yeah. We really want to dive into the insights, and we have gotten some insights where we’ve been able to tell people to target this group, and it looks like that group converted really well on your ad. But we’ve also found some irregularities that we know are probably not true, [so] we have to do a lot more research to try to suss out what the problem is. I think it’s just an ongoing battle.
MY: Machine learning never ends! If you think you have your model deployed, no, you get to keep working.
MS: I think there’s a difference: if I were to build a classifier on what’s a picture of a cat and what’s not, there’s a better stopping point where it’s like, “This is pretty good.” I can get a cat pretty well.
With marketing, and it’s the same with predicting stock prices or things like that, there’s always more areas…
MY: It’s a zero-sum game: there’s a finite amount of attention, and every marketer’s trying to get more sophisticated. The people competing with you are out of luck because they’re not going to be able to compete with your machine learning capabilities.
MS: Finally, I want to say there are some statistical techniques that are on our roadmap as well. We’re looking into propensity weighting, an interesting idea where [we look at] what’s the lift for a panel versus the lift in general? Can we compare that to the population as a whole?
Then another one is using Shapley values on several different campaigns. If you’ve been hit by two ads or more, who gets credit for that?
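Editor's sketch: how Shapley values would split credit between two ads. The function computes exact Shapley values by averaging each ad's marginal contribution over every ordering. The incremental-visit figures are invented, and the interview does not say how Foursquare plans to define the value of each combination of ads.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen = frozenset()
        for p in order:
            totals[p] += value(seen | {p}) - value(seen)
            seen = seen | {p}
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical incremental visits per 1,000 people for each combination of ads seen.
incremental_visits = {
    frozenset(): 0.0,
    frozenset({"ad_a"}): 10.0,
    frozenset({"ad_b"}): 6.0,
    frozenset({"ad_a", "ad_b"}): 12.0,
}

credit = shapley_values(["ad_a", "ad_b"], incremental_visits.__getitem__)
print(credit)
```

With these numbers `ad_a` is credited with 8 visits and `ad_b` with 4, which together account for the 12 incremental visits seen when both ads are shown. Enumerating every ordering is only practical for a handful of campaigns; larger sets would need sampling.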
Those are two other things on our roadmap for the next six to twelve months. I don’t know if we’ll get to everything I mentioned in the next six to twelve months, but I guarantee that we’ll get to some of them.
MY: It’s good to be aspirational. It’s good to have a lot of plans and try to hit as many goals as possible.
Thank you so much, Max. This was such a wonderful interview, full of details and practical advice. I’m sure that our listeners will have derived a lot of value from it. Thank you so much for coming on the AI for Growth series.
MS: Thanks a lot for having me, and good luck everyone!