Ep 39: 3rd party data and you (w/ Auren Hoffman)
Should you be using it in your work? Probably not! But we have big incentives as a society to make more datasets open to research.
Auren Hoffman currently serves as the CEO and Chief Historian at SafeGraph, a data-as-a-service company he founded, which provides primarily location data.
In this conversation with Tristan and Julia, Auren shares how truly few companies are making use of 3rd-party datasets today, how opening up more datasets to public research could help us solve big problems, and a fun fact about Abraham Lincoln's (!) work in the industry.
For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com.
The Analytics Engineering Podcast is sponsored by dbt Labs.
Key points from Auren in this episode:
Who are some of the biggest buyers of third-party data?
And by the way, that's the right thing. So there's a data maturity curve. And you shouldn't be going out and buying external data until you could work with your own data.
So if you're a retailer, you have a lot of great data that's coming in. You have data about your employees and what's happening and who's worked out. You have data about your ingredients. You have data about your pricing. You have data about when customers are coming. If you have all that other amazing data to mine, you should get fairly far along that curve of mining it first before you bring in external data. Once you've gotten pretty far along, then it makes sense to start bringing in external data.
Most companies are not that far on the curve yet, so they're getting there. You could see a path where companies are moving further and further along, but very few companies are there yet. So if you think of almost any industry, it's still a very small percentage of organizations in that industry that can buy external data. Even if you think about hedge funds, there are about a hundred hedge funds that buy external data in any meaningful way today, and there are 11,000 total hedge funds. So we're talking about under 1% of hedge funds that can really buy that today. Now there are probably another 500 that will be buying it in the future, but they're not there yet.
And hedge funds don't even have any internal data at all, so they don't even have a good excuse. When you think about retailers, you're really talking about 20 or 30 major retailers that can buy external data today. Most of the other ones aren't that far along. If you think about real estate, it's basically zero that can really use that.
So in most industries, it's still only a small percentage that can buy data today.
Are we supposed to be buying datasets right now and we just don't even know it?
Potentially, a lot of people buy a little bit of data here and there as Julia mentioned; you buy data about like leads or something like that, right? Yeah. You might buy it.
Yeah, exactly. And so it really depends. Some people will buy data to augment their sales or marketing efforts and that's a pretty common practice, but most companies' budget for that is very small.
And probably should be relatively small as well. Sometimes it's baked into what they're already doing. Obviously, if you're doing advertising on Facebook, the data is baked into it, you don't really think of it as buying data that's out there. But for most companies, it really depends on what you're building, and then if you're building a solution, some external data could really inform that.
So obviously if you're in the real estate world, some external data could help. But let's take a step back. There are four nouns of data: you have data about people; you have data about places, which is what SafeGraph does; you have data about companies, like you mentioned, firmographic data or stock ticker data or something like that; and then you have data about products. 99% of data is in one of those four nouns, and then you can cross those with each other. You can also cross it with time and you can cross it with price. So if you think of a stock ticker, the noun would be company, and then it would be crossed with price and with time.
And you can go back a hundred years and you could back-test your AT&T ticker. Now, maybe the tick a hundred years ago is every 24 hours. And the tick today is every 24 milliseconds or something. But the data is very similar and that data is fairly accurate data to use.
Who's protecting our data and how is that evolving over the years?
I think in your question you're really thinking about the first noun of data, which is data about people, which is a piece of the data business, but it's only a piece and generally, most of that is for marketing purposes where people will try to sell that data for marketing purposes.
So they might say, okay, we have some data about Julia. Julia is a woman. She's in tech and therefore, if you're marketing to Julia, you might want to market her differently than you market to my 80-year-old grandfather or something like that. And so they might use that for different types of things.
The data about people is regulated. And then you can move into areas like medical data, where there are very clear rules about what you can do with that type of data, how you can do it, and how you can move it. But even in marketing, which is much less regulated, you still have quite a few regulations about what you can do and how you can do it.
And there are a lot of rules about it, both in the US and overseas. Even in California, they have CCPA, and there are all these other types of things. Some of the more interesting data is not data about people, but some of these other types of things. And even there the data might be proprietary.
A lot of times your company might be using a product and that product is collecting data about your company, and then you still want to have some sort of assurance about how they use your data. So you can imagine if you have all your financial data in QuickBooks, you probably want to know that they're not going to just sell off your revenues to some investor or something like that.
But you might be okay if they just aggregate it and they say startups with this number of people have these types of expenses; that you might be okay with. So there are always these things that you have to understand, and all these data rights are often changing, so understanding that is important.
Most companies don't sell data, they're usually in the sales process, and are happy to not negotiate for rights to the data.
You started SafeGraph explicitly with the intent to curate a dataset and monetize it. How does one go about doing that?
There are lots of ways to get data.
What is Associated Press? It's just a data company, right? It's spitting out facts about things that happened. And so you could have people go gather it, you can crawl the Internet to get data. You can ask people to contribute it in some sort of data co-op. So there are lots of different strategies to go get data. You can even create synthetic data on top of real data, which could be really interesting. And that might be a privacy-safe way, if you had access to medical data, but you still wanted researchers to be able to take a look at it, you can create synthetic data on top of it.
So there are lots of different ways of making data products. In our case, we get most of our data from crawling. So we do a lot of Internet crawling to try to find a lot of interesting data, whether it's understanding the local McDonald's, understanding the city of San Francisco, or the city of Berlin, or getting data from job boards.
There are lots of different places where they'll have interesting data about physical places. And we want to make sure that we get that data and then aggregate it. The most important thing about a data company is that the data is true.
And people always forget that. It sounds obvious, but it's literally the number one thing. If you're selling facts, you want your facts to be true. Now, if you're selling billions and billions of facts as we are in SafeGraph, you'll never be a hundred percent true. You'll never be close.
But that should be your objective. Your objective should be, when you're wrong, to fix it quickly, to have a really good QA system, etc., because people want to rely on that data. They want to build models on that data. And if you're building models and you start multiplying 0.9s against each other a bunch of times, you get a small number really fast. So you want to have as high accuracy as possible from the get-go.
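To make the "multiplying 0.9s" point concrete, here is a minimal sketch in Python. It assumes each fact a model depends on is correct independently with the same probability, which is a simplification and not something claimed in the conversation; the function name `compounded_accuracy` is hypothetical. Under that assumption, ten facts that are each 90% accurate are all correct only about 35% of the time.

```python
def compounded_accuracy(per_fact_accuracy: float, facts_used: int) -> float:
    """Chance that every one of `facts_used` facts is correct.

    Assumption (for illustration only): errors are independent and every
    fact has the same accuracy, so the probabilities simply multiply.
    """
    if not 0.0 <= per_fact_accuracy <= 1.0:
        raise ValueError("per_fact_accuracy must be between 0 and 1")
    if facts_used < 0:
        raise ValueError("facts_used must be non-negative")
    return per_fact_accuracy ** facts_used


if __name__ == "__main__":
    # Compare a 90%-accurate dataset with more accurate ones as a model
    # chains together more and more facts.
    for accuracy in (0.9, 0.99, 0.999):
        for facts in (1, 5, 10, 20):
            combined = compounded_accuracy(accuracy, facts)
            print(f"accuracy={accuracy:<6} facts={facts:<3} all correct={combined:.3f}")
```

The gap between 0.9 and 0.99 looks small for a single fact, but it widens quickly as more facts are combined, which is the reason for wanting high accuracy from the start.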
As most transactions are still happening directly between a buyer and a seller, who's verifying the data quality?
It is a big problem right now. It's very hard to do these data evaluations, and data evaluations significantly slow down the sales process.
Usually, in a sales process, there's some sort of data evaluation that's happening where you're sending them some sort of data and they're looking at it. That can take a long time because how do you do that? The tools for that are hard. And that could be the biggest chunk of the sales process.
And then the burden is on this company who's buying data and may not really know how to evaluate it in the right way if they go do that. So brands are important. But those brands take a long time to build, especially for smaller companies, which is why a lot of people still buy from these legacy old-school data companies like Dun & Bradstreet. Dun & Bradstreet has been around for so long: Abraham Lincoln worked for Dun & Bradstreet, he actually worked there in the 1850s.
You have these companies that have been around for a really long time, and part of the reason is that the brand is important. And if you think of these other companies that have been around for a while, Experian, et cetera, they have a brand that you have some level of trust in.
Because if you're a startup, if you're a newer company, like SafeGraph, you have to build that over time. And of course, the best way to build it is to have an excellent product. That makes building it much easier. But even if you have an excellent product, it just takes long periods of time to go do that.
And if you think of G2 for software, it's a great tool if you want to evaluate software, get in touch with people, see some reviews on things, et cetera. G2 doesn't have a data piece of it yet. Hopefully, they will. Or, their company might do that in the future. But even if they do, I still think it's much harder to review data than it is software. Because with software, if I'm selling a CRM to a dentist, there's a key buyer, which is the office manager of the dentist or something, and you understand that use case. And usually, the software is some sort of vertical, it's like I'm selling to marketers or whatever. Whereas data can often be a lot more horizontal, so it may have lots of different use cases. So it might work excellently in one use case but not as well in another use case.
Tell us more about data co-ops
Data co-ops are super powerful, and they get to winner-take-most really fast because it's a marketplace. They can kind of happen top-down or bottom-up.
So you can go top-down. You can go to the top companies in the space and you can say, hey, why don't you all share data about fraud or something? And if we all share data, we all put it into this one bucket, then we can reduce fraud as an industry quite a bit. And so there are lots of those types of data co-ops.
Maybe one of the more famous ones is Verisk. And they have all this insurance data that gets put into a co-op and then they can help understand transactions that look similar or fraudulent. And that's something where you do need an external company because you're not allowed to share data between competitors.
It could be anti-competitive pretty quickly. So you do need some sort of external company. In Verisk's case, it was actually owned by the insurance companies for many years, so it was almost like a nonprofit that was owned by them. So you see those all the time. You see them a lot in the financial services industry, and you see them in many other industries that are out there.
And then you have the bottoms-up data co-ops you mentioned, like these payroll type of things where, okay, I have a senior engineer, what should I pay the senior engineer? Of course, like in these bottoms-up things, one of the hard things is just normalizing all that data. Okay, I have a senior engineer, and you have a senior engineer. Are they the same type of person or a very different type of person?
So these things are really hard to start but there might be some sort of ways to do that.
So here's an example of a business idea for your listeners. You can get all these small businesses to open up their QuickBooks, and then every time they are thinking about taking on a customer, you could tell them the average days outstanding before that customer pays. And let's say you're a small design agency and you know you can't take on too many customers and you've got Ford and GM and you're going to charge them both a hundred thousand dollars for your services.
And Ford pays within 15 days and GM pays within 90 days on average. Okay? Well, you're going to go with Ford. And so that could be a really nice little thing where if everybody aggregates that information, it can work really well. And there are lots of little data co-ops, even in other products.
So if you think of a spam filter, like Gmail's, it's a data co-op. Every time you click on spam, it informs everybody. It makes everyone better. So there are tons of data co-ops embedded into products already to make all of our lives better.
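The days-outstanding co-op idea above can be sketched in a few lines of Python. Everything here is hypothetical: the pooled records are invented sample data (chosen so the averages match the 15-day and 90-day figures from the example), the names `POOLED_INVOICES` and `average_days_to_pay` are made up, and the `min_contributors` threshold is an added assumption, in the spirit of only sharing aggregates, so that no single contributor's numbers are exposed.

```python
from collections import defaultdict
from statistics import mean

# Invented sample data: (contributing_business, customer, days_until_paid).
POOLED_INVOICES = [
    ("agency_a", "Ford", 12),
    ("agency_b", "Ford", 18),
    ("agency_c", "Ford", 15),
    ("agency_a", "GM", 85),
    ("agency_b", "GM", 95),
    ("agency_c", "GM", 90),
    ("agency_c", "Tiny Co", 30),  # only one contributor reports this customer
]


def average_days_to_pay(invoices, min_contributors=2):
    """Average days-to-pay per customer across all contributing businesses.

    Customers reported by fewer than `min_contributors` distinct businesses
    are left out, so an average never reveals one contributor's own data.
    """
    days_by_customer = defaultdict(list)
    contributors_by_customer = defaultdict(set)
    for contributor, customer, days in invoices:
        days_by_customer[customer].append(days)
        contributors_by_customer[customer].add(contributor)
    return {
        customer: mean(days)
        for customer, days in days_by_customer.items()
        if len(contributors_by_customer[customer]) >= min_contributors
    }


averages = average_days_to_pay(POOLED_INVOICES)
# The customer with the lowest average is the one that pays fastest.
fastest_payer = min(averages, key=averages.get)
print(averages)
print("Pays fastest on average:", fastest_payer)
```

A real co-op would also have to normalize customer names across contributors, which is the same hard problem raised earlier about comparing one company's senior engineer with another's.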