Discover how FairGen’s Samuel Cohen uses synthetic data to enhance market research. Learn about the challenges, opportunities, and future of synthetic sampling.
In this episode of the Greenbook Podcast, Lenny Murphy sits down with Samuel Cohen, CEO and co-founder of FairGen, to explore the transformative potential of synthetic data in market research. Samuel shares his journey from pursuing a PhD in AI and synthetic data at UCL to founding FairGen and revolutionizing data quality for hard-to-reach audiences.
The conversation delves into the critical distinction between augmentation and replacement in synthetic data, how FairGen’s approach addresses data scarcity and enhances insights, and the opportunities and risks of this emerging technology. Whether you're curious about the future of synthetic sampling or the implications for consumer insights, this episode is packed with forward-thinking ideas.
You can reach out to Samuel on LinkedIn.
Many thanks to Samuel Cohen for joining the show. Thanks also to our production team and our editor at Big Bad Audio.
Lenny: Hello everybody. It’s Lenny Murphy with another edition of the Greenbook Podcast. Thank you for taking time out of your day to spend it with me and my guest. And today, this is one of those exciting times where I’ve never spoken to the guest before, so those are super fun for me. So, Samuel Cohen, CEO and co-founder of Fairgen. Sam, welcome.
Samuel: Thank you very much. Hi, Lenny.
Lenny: It’s great to meet you. I have heard a lot about Fairgen, and I know our topic is going to be pretty interesting for everybody, as we talk about the wonderful world of synthetic sample. But we’ll get to that in a minute. First, why don’t you tell us a little bit about Fairgen, and then your origin story, and then we’ll go from there.
Samuel: Great. So, about Fairgen. It all started three years ago. I realized something was going on in market research where data quality is dropping, and someone had to think about how to resolve that. And on my personal history, I started by going very deep into mathematics, then jumped into AI about eight years ago now, and seven years ago into synthetic data, which at the time was something super niche. Like, no one thought this would ever scale. I was trying to generate images of, you know, simple things. It was just really hard, took a lot of time, was really hard to scale. You know, I decided I wanted to try to scale these things up. I did a PhD on that topic at UCL in London, but spent most of my time at Meta AI Research Labs during my PhD. And before that, I had actually done a master's at Oxford, also mostly on that topic. After the PhD, I realized I wanted to start a venture, and that this data quality problem was going to get big, and that was what I wanted to solve. So, we co-founded Fairgen. We raised a first round very quickly—that was in June 2022, a 2.5 million round—and we started growing the team, hiring AI researchers, but also people who know about market research and data in that space, and we started cracking the problem. What we have now is a solution for quantitative research that lets you boost hard-to-reach niches, and hence improve the insights on, you know, small sample groups, to make sure that when you get these insights out, they really show what's happening in the real world without suffering from the small sample sizes, but also from the quality of data being very low, and hence the insights being even worse for small niches. So, that's sort of Fairgen. And just, like, you know, high level: right now we have a SaaS platform. It's being used by some of the best companies in the world for market research, and we're also integrating for 2025 into most of the big groups you know of. So, it's really taking off, and I'm really excited to talk about it now.
Lenny: Very cool. Now, our audience can't see you, but being an old fogey myself, I mean, you're young, [laugh] from my perspective, and I know from your bio that you got your PhD at 25, correct, so you—
Samuel: Correct.
Lenny: All right. So, you’re a prodigy, as well as obviously a visionary in thinking about where the future was going. So, I just think that’s cool. And where’s the accent from?
Samuel: French.
Lenny: French.
Samuel: It's a French accent. So, I grew up in Paris. And actually, I have moved a lot. So, I was in Paris until 14 years old, then moved to Singapore, did all of my high school there, and then went to study at Imperial in London for uni, then Oxford, and UCL. And we're now based in Tel Aviv, Israel, which is a pretty good tech hub. So, for everything startup, like, it allows you to scale and grow faster than in some other places.
Lenny: Okay. All right, very cool. I've connected the dots, so let's dive in. So, I'm going to parrot back to you what I heard, which is that Fairgen is primarily focused on what I would think of as augmentation versus replacement. Although there's certainly been, within the industry—and there's a lot of confusion about this. So, let me tell you my take, and since you are a PhD on this topic, you can correct me. So, when I think of synthetic data, one, I think we've been doing this for a long time, right? To an extent, there's always been access to information to fill in gaps, although not nearly as sophisticated as AI allows us to have it today. And then we hear the term 'synthetic sample.' And primarily when I think about synthetic sample, I think there are two classes. One is trained off of real data—and I think it's probably closer to what you're doing—so let's say off of panel profile information, or, you know, a large repository of real consumer information that the AI, the models, can duplicate; it's kind of an agent-type approach to what that real person would say, or what that population would say. That's one group. The other—and I think this is what's confused the issue—is just kind of general AI, the LLMs that are just sucking in all types of stuff, and experiments with how that more generalized approach can duplicate some level of human responses. And the evidence that I have seen seems to indicate that, sure, for early-stage exploration, the general LLM AI stuff, eh, it can be directional, but it's definitely not something you would make a decision off of. For our purposes in the insights space, the approach that is based off training on real data, whether it is augmenting and/or potentially replacing, that actually has real meat to it, and it's pretty darn accurate. So, how did I do? From a layman standpoint, please fill in the gaps, correct me, give our audience a little more context.
Samuel: You did great. That's basically how I also think about it, in terms of these two families. And I've been, like, pushing, you know, very hard on thinking about how to define it, right, and how to get people to think about these different definitions and not just say, "I'm doing synthetic data." It's too restrictive, right? So, I want to run you through the way I think about it, and you'll see that you're going to recover your definitions, right? It's going to be a bit technical, but, like, let's go through it together. So, think about the world, right? There's a planet, and there are billions of people you can ask questions to, right? So, I have a survey of 20 questions, 30 questions, and I'm going to specify a population—like, Americans, but it could be more niche; could be doctors, for example—and I'm going to send this survey. And I'm going to get back some answers. And then I have a table with, you know, rows being, basically, the people who answered these questions, and columns being the questions themselves, right? That's the real world, like, sampling, you know; that's what we've been doing for a hundred-plus years now. Now, let's go into the AI world. Think about synthetic sampling as a way to simulate the world, right, based on some parameters. So, it's as if we're creating a replica of the world, right? And so, this model is a replica, and then you can ask this simulator, okay, give me more people, or give me more people from that group, right? So, a synthetic generation model is simply a simulator. Now, you can recover these two definitions that you gave—or these two families—from that perspective. The second family, where we're talking about LLMs trying to basically simulate humans based on having learned on a lot of data: what you're simply doing is you create your simulator, right, by training a model on billions of data points, right, massive amounts of past surveys, but also, like, web data, just the whole internet. And it's trying to learn to mimic the real world and how people think, and how people have emotions, how people respond to questions, and so on. And it's basically trying to simulate all that. And then you can ask this model, "Okay, you are an 18 to 24-year-old male living in region X, okay? Answer these 30 questions." And that's it. And then you get answers. And this simulator replicates the process of, if you were to go back to the real world and ask more people to answer questions, you would get some data, right? However, this simulator is not very accurate, because it's pre-trained on a lot of data, and some of that data may be outdated, right? And that's why we have trackers in this industry, because we need to keep up with trends, right? And it also has a lot of crappy data from, you know, untrusted sources. And so basically, this simulator is sometimes very far from the ground truth, and there is no way to control the distance between this simulator that we created and what the real people from the world would have said, right? I hope that makes sense. Then we have the other family that you mentioned, which is basically ours, right, where you're trying to craft a high-quality simulator of the world, this time based only on relevant data, relevant data being the survey that would have had to be run in the real world. So, we say, okay, we ran a survey. We gathered 1000 respondents, right, who answered this survey.
I have my data table with these rows and columns, right, with all the answers from these people, and I train my simulator only on this data, right? And the fact that we learn only on this data means that it cannot go too far from where it should go, and that it can also focus its power on learning some statistical patterns that will allow it to extrapolate. What do I mean by extrapolate? I mean, create new rows of responses from new individuals that live in the same sort of world, with the same parameters as the real people, but they're new, and they give coherent responses that bring new value. New value being: reducing the margin of error, right? So, that's basically the approach that we're taking. It gives us control, a lot of control, because we can make this simulator accurate, grounded in the survey's world, right? And we can also have statistical guarantees, because it just relies on statistics that we have control over, compared to LLMs where we don't have any control. So, that's basically how you recover your two families. So, you were really close to my beliefs, and I hope that gives a little, like, sort of different perspective on things.
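[Editor's note: to make the "simulator trained only on the survey table" idea concrete, here is a minimal Python sketch. It is not Fairgen's actual model—just a toy per-segment frequency model with made-up column names and values—and it ignores the cross-question correlations a real generative model would learn.]

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy survey table: rows are respondents, columns are questions.
# All column names and values here are hypothetical.
survey = pd.DataFrame({
    "segment": rng.choice(["gen_z", "millennial", "gen_x"], size=1000, p=[0.1, 0.5, 0.4]),
    "q1_brand_awareness": rng.integers(1, 6, size=1000),   # 1-5 scale
    "q2_purchase_intent": rng.integers(1, 6, size=1000),   # 1-5 scale
})

def fit_simulator(df, segment_col="segment"):
    """Learn per-segment answer frequencies for each question (a naive 'simulator')."""
    model = {}
    for seg, group in df.groupby(segment_col):
        model[seg] = {
            col: group[col].value_counts(normalize=True)
            for col in df.columns if col != segment_col
        }
    return model

def sample_synthetic(model, segment, n, rng=rng):
    """Draw n synthetic respondents for one niche from the learned frequencies."""
    return pd.DataFrame({
        col: rng.choice(freqs.index.to_numpy(), size=n, p=freqs.to_numpy())
        for col, freqs in model[segment].items()
    }).assign(segment=segment)

sim = fit_simulator(survey)                               # trained only on the survey data
boosted_gen_z = sample_synthetic(sim, "gen_z", n=200)     # augment the small niche
print(boosted_gen_z.head())
```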
Lenny: Oh, it does. That's really helpful. And again, for our audience, we just heard from a PhD in synthetic data, so this is the final word, right? Or at least the [laugh] trusted source on how we think about this. Now, let's get a little more into the weeds from a business standpoint because, you know, this is the topic; it's been the topic for the last few years. So, there's not a day that goes by that I don't have some level of conversation with somebody about this. So again, help me keep my thinking straight. From a business standpoint within the world of research, what I hear is kind of what we've been hearing for a long time. So, back in the era of big data, I would always say, look, you know, there will come a point where we will have all of the data on the who, what, when, where, and how. That will exist somewhere. The mechanism is how to synthesize that to be able to predict what somebody would do without ever asking a question. And you know, I didn't know that AI was the unlock to do that. But here we are. We have that. Now, what's missing is the why. So, I think about it from that framing: look, if we're doing tracking, testing, those types of things, then why the hell wouldn't we utilize some variation of either the augmented approach like you're using, or even substituting with purely synthetic sample? But if we need to find out something new, you know, if we're trying to test a new, novel breakthrough product, or determine what's really driving a change, a fundamental change, in behavior—so, let's say during Covid would have been a good example, right? Things changed, and we pretty quickly realized that all of our norms and benchmarks were not relevant anymore. We could not rely upon those as really defining how people were making decisions. There were new factors, so therefore new research was required to do that. Is that how you see, in a very broad way, the applications of the technology? Look, for things that are pretty predictable and pretty static, where we don't see much change, absolutely your approach and the technology can fill in the gaps, or even, you know, potentially take away the gaps entirely, but it's not going to be the right approach when you're trying to discover something new one way—
Samuel: Yeah.
Lenny: Or the other. Is that—
Samuel: Yeah. So, I think it's a pretty interesting sort of way of thinking about this. First things first, we don't try to replace real-world sampling, because you need to start from something in order to know where you are. So, what you're saying about Covid, right? It changed the entire world, people's perceptions on things, usages and attitudes, right, and habits. So, you cannot basically rely on a model that hasn't seen this new wave of data about what's happening currently in the world, right? So, for example, LLM-based approaches trained on massive amounts of data and then used, like, sort of zero-shot in a new period of time would completely fail here, right, for sure. Now, the approach I'm voting for is human plus AI, right? That's why we create augmentation. Like, you have a field, you augment it. So, think about trackers. I want to frame what you're saying from the perspective of trackers. You have a tracker, a very general tracker, on, for example, multiple brands and engagement, right? Something that a large CPG company would run, right? And you get to a wave which is just post the start of Covid, where everything changed. You do quarterly trackers, and now you're in the quarter where Covid started, right? What I'm saying with our augmentation technology is, you need a lot less data from this current wave to be able to re-represent what people are thinking than you would without this technology, right? So, if you do 2000 people per wave, right, you would probably need a lot less to understand granularly what people are thinking. So, this is exactly what we're doing. We have delivered insights on niches of trackers—brand and product trackers—to massive CPG companies, and they boost at the level of brands, or even sometimes products with super-low penetration, like 1% of the population. So, if you do a 2000-person wave and you have 20 users for that specific product, you can't say anything there, because you need enough data. And things change, too. The delta between these 20 people of this wave and the 20 people of the previous wave is massive, and it's really hard to say something. However, if you train your model on all the waves, including the current one, right, you're going to be able to give a lot more accurate insights on these niches, because the model has learned both from the current wave and from the past waves, and so you can go a lot deeper. So really, I don't believe in replacement. I only believe in augmentation. And one last thought that I have: I'm not saying that LLM-based models are not useful, right—and you mentioned this—for concept ideation, right, or early stages of testing. It's amazing. It allows you to, you know, get new ideas and understand people, and craft things for people, like, at early stages of design, you know, cheaply and fast. So, for these kinds of use cases, it's magical. It's incredible. However, you know, you can't, like, deliver high-accuracy and sensitive insights based on these things. Some CMOs—actually, most CMOs in large CPG brands—are compensated based on the numbers that we market researchers deliver on [unintelligible 00:16:48] branding engagements. So, this is sensitive. You need high accuracy and you need reliability. And you know, LLMs don't do well for that. They do well for other things.
Lenny: I can’t tell you how gratified I feel to have you saying, “Yeah, you’re not an idiot, Lenny.”
Samuel: [laugh]. You’re not an idiot.
Lenny: So thank you. But we're still new in this, though, although the rate of adoption is faster than anything that I have ever experienced with any technology in my life. And yeah, I just think in terms of systems and processes within our industry, and obviously there are massive implications for how things change here. So, my guess would be, using your tracking example, that brands are going, "Well, yeah. Why am I paying for 2000 completes? I don't need that. I need 200, and we'll fill in the gaps with Fairgen to ensure that we have a good—you know, that it's accurate overall." But then I would think, we're not going to catch outliers in this.
Samuel: Correct.
Lenny: And occasionally there are outliers that would occur within your tracker. So, now I'm going to reallocate my budget to a new approach that allows me to, you know, look for those outliers in other ways. So, it changes the structure and flow of research fundamentally from that standpoint, right? I'm shifting—it's kind of like when we moved into automation and agile in the industry. It had an impact on just how people were prioritizing budgets and, you know, how they were thinking about the process overall. Is that accurate? Is that what you're seeing or hearing from your clients—that they're saying, "Oh, this is great. I'm no longer going to spend a million bucks a year on this tracker. Instead, it's going to be"—you know, whatever, I assume significantly less—"but we're going to now shift that budget into cooler, more strategic stuff that may generate aha moments, versus okay, we know we're on the right track," which is what trackers do.
Samuel: So, you're touching on probably one of the most interesting and important points that we're addressing right now from the go-to-market perspective, and I'll tell you the way I think about it. You cannot use this to cut budgets on important projects where you need accuracy, because you should see this as a way to go deeper and be more granular. But you don't use this to degrade the accuracy of some things in order to improve some other things. So, I'm going to make it clear. When you have 1000 people per wave, or 2000 people per wave, right, let's say now we say, okay, we're going to do 500 people per wave, right, and we're going to boost it with Fairgen. So, Fairgen is going to be good at improving the accuracy at the level of niches, because that's what we know how to do. What we don't do is improve the accuracy of the results at the global level, on the global main field, at the gen pop level, for example. Because you cannot turn 500 people into 1000 people. What you can do is turn the 30 people that are part of those 500 into a 90-person group. That you can do. You can boost niches. You can't boost global results, because it's statistically impossible, right? So, if you were to cut budget in that way, right, you would degrade the accuracy at the global level while improving the accuracy at the granular level, which is a problem, right? And agencies recommend against that. So, what we're proposing is, you should see this as a new opportunity to go deeper and generate more value for brands through agencies, right? And that's also how agencies and, actually, brands are thinking about this. We haven't had [00:20:38] a single large CPG or other brand ask to reduce budgets for the sake of this solution. Rather, they're using a crazy amount of boosts per study to be able to go really deep and understand things that they want to understand but that they can't with their current methodology. One last thing I want to point out: I spend a lot of time with execs in the market research industry, like, top firms, thinking about where the market research industry is going. And obviously Greenbook is also thinking about this, as are [Omar 00:21:12] and other groups. So, right now, agencies need to create more value for customers and brands in order for the volume of things not to shrink, right, because of the competitive sort of spectrum of the market research industry. So, no one wants to degrade the quality of studies, because it's competitive, and if the quality degrades, you're going to risk losing a bit. So, no one wants to degrade. No one wants to put more budget toward something that they can get cheaper with another competitor. Right now, brands are looking for more value, to be able to go deeper for their marketing, for their product design, all of these things where it's sensitive to have granular insights, but right now they just can't, because it would be too expensive with the methodology that they have in their hands. So, this is a new door opener. It's easy, with a low barrier to entry, it's very easy to use, and it allows agencies to provide more value to their customers. So, that's really the way I think about it. It's the way agencies that work with us think about it, and luckily, it's also the way brands working with them think about it, because they're using this.
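[Editor's note: to put rough numbers on the "boost niches, not the gen pop read" point, here is a back-of-envelope sketch using the standard margin-of-error formula for a proportion. The sample sizes are illustrative, not Fairgen's guarantees.]

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion estimated from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# Niche level: taking an effective 30-person group to ~90 meaningfully tightens the estimate.
print(f"niche n=30 : +/- {margin_of_error(30):.1%}")    # ~ +/- 17.9%
print(f"niche n=90 : +/- {margin_of_error(90):.1%}")    # ~ +/- 10.3%

# Global level: a 500-person field is still 500 real people; synthetic rows cannot
# recover the precision of a real 1000-person sample on the gen pop read.
print(f"gen pop n=500  : +/- {margin_of_error(500):.1%}")
print(f"gen pop n=1000 : +/- {margin_of_error(1000):.1%}")
```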
Lenny: Very cool. Thank you. I had not thought about that particular angle, obviously, and as soon as you say it, it makes perfect sense. So, I was going to ask whether you had done any experiments in public policy or polling, which I'm guessing you have, but they were probably not particularly effective at a macro level. So—you know, we're recording this post the US election—my guess is that you probably couldn't have predicted the overall outcome, although you would have been pretty good at predicting how specific populations—
Samuel: Correct. You're very right. So, we've done some things. Obviously, you know, political opinion is never the priority of companies starting in our industry, because it's a smaller market, but it's really good PR and marketing, which is why we decided to go with it. So, we've worked with one of our partners, Ifop, our original design partner. Actually, it was the Gallup of Europe. They started, like, I think, 80 years ago. It's the oldest market research company in Europe, and they believed in this way before everyone else. And during the latest European elections, which were very busy, as you've probably seen, we explored predicting what niches are going to vote for. So, niches may be teachers, doctors, young people, religious groups. So, we boosted these things, and we are the first group that was ever able to publish results on elections based on synthetic data. So, publishing is not just putting it on your website, right? In pretty much every country, you need the polling commission, which is typically a government-run sort of institution, to look at the methodology, look at the results, validate qualitatively that the results are good, and all of these things. So, we went through this scrutiny, and we got the validation to be able to publish these results, which was very cool for us as more of a qualitative validation that all of this makes sense and works. And we also had the quantitative validation, in the sense that another large market research group in France then boosted in the real world some of the groups we were predicting things for, and our results were really close, which means that, you know, we've done something that made sense and refined the results in a pretty accurate way. I can envision that when we get to the midterms and to the next US elections, this will be massively used to predict [unintelligible 00:25:08] at a state level, where you have less data, obviously. And because US elections work according to states, right, and [unintelligible 00:25:17] in that case, you'll probably be able to do better than just using the real data, because if you're good at every single state, obviously you just put the results together with the [unintelligible 00:25:28], and then you get probably a more accurate picture. So, I'm super excited to get to these midterms and to see what we can do there. I can't tell you the name yet, but we're piloting with the largest [unintelligible 00:25:42] opinion pollster in North America, and I hope that we'll be able to provide some results there for the midterms.
Lenny: Okay, that's very cool. Now, let's continue on that trend for a minute, because it brings up an issue that I've been trying to bring attention to for a little while. You know, we live in a very fragmented media ecosystem. My take was that the results of the election here in the US—and also in other places—and what we think of as a miss, were simply a reflection of the fact that we weren't accessing specific populations very well as an industry. The obvious one would be young men here in the US. A big, big component of that is that they're not engaged in our sample ecosystem, so it was difficult to get any data on them; therefore, that was enough margin of error to explain why the election outcomes were different than what many people expected. So, in your approach around augmentation, how do you deal with the potential scarcity of data on a population? Let's use that example here in the US: young men not answering surveys. There's not a significant body of research for us to build the models off of, so what do we do about that when you know there's just not a good foundation?
Samuel: Let's start from the use case, right? Let's think together about, first, why we didn't do well at predicting the last elections. Like, you know, polling has been really inaccurate in these elections, right? We've seen that. You know, the gap between Trump and Kamala, as we've seen, was way bigger than what we pollsters had predicted. So, there are two reasons why this gap was big. One is bias; the second is variance, or scarcity. Let's start with bias, because it's more intricate, right? So, bias is that there is a tendency to say and report less that you're voting for, you know, the right, or sometimes the extreme right, than the left or extreme left. It's just a phenomenon that we all know about. In this case, some people call it the 'Shy Trump Effect,' for example, right, the 'Shy Trump Voter Effect.' So, this is very hard to correct for, and that's basically why we're still not doing great there. It's very hard because you don't know that someone is a Shy Trump Voter when he responds to a survey.
Lenny: Sure.
Samuel: It’s the same—
Lenny: ...do disconnect effectively, right? Yes.
Samuel: Exactly. And so, we need to think about what tools we have to better estimate this, right, in order to do better, right? That's, like, a direction that the whole field is going toward. That's bias. That's not what we're dealing with for now. Let's talk about variance, which has a massive impact, because when you do political polling, you try to correct the results. For what I just said about biases, there are some methods that try to correct for things by, for example, using national estimates of various types of things, like, did you vote in the last election, and so on, right? But when you do this weighting, this method of weighting, and there is a lot of scarcity for some groups, the weighting can push results in the wrong direction, right? Because it's not a magical solution, right? When you 3x a group by saying, "Okay, we have 30 Hispanics instead of 90. I'm going to give a 3x weight to these guys," you don't have 90 people. So, if what these 30 people say is far from the truth because there's too much variance, right, you're propagating a lot of error at the global level.
Lenny: Right. Garbage in, garbage out, right? I mean—
Samuel: Garbage in, garbage out. So, how can we help? Well, in this specific use case, if we augment this group from 30 to 90, we reduce the margin of error of this group to that of a 90-person group, which means that the propagated error at the global level will be smaller, like, for all of these states, right? And so, that's a way to make sure that for these few niches where things are very inaccurate, we can basically help correct things at that level. And so, we can help with the whole variance-slash-scarcity part of the issue. For the bias side of things, it's more about the methodologies that political pollsters are following, and I'm sure people will do better in a few years. But we're focused on the scarcity and variance problem, where it's a statistical problem, it's a grounded problem, and it's a problem where the better your model is, the better you're going to do with prediction.
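[Editor's note: a simplified sketch of that weighting-versus-augmentation point. It ignores design effects and treats a group's contribution to the national estimate as its target weight times its standard error; the numbers—a 30-person group that should be 9% of the sample—are hypothetical.]

```python
import math

def se_proportion(n, p=0.5):
    """Standard error of a proportion estimated from n real respondents."""
    return math.sqrt(p * (1 - p) / n)

# A group that should be 9% of the sample, but only 30 people showed up.
target_share, n_observed = 0.09, 30

# Weighting the 30 respondents up does not add information: the error the group
# contributes to the global estimate still scales with the error of the real 30 people.
contrib_weighted = target_share * se_proportion(n_observed)

# If augmentation shrinks that group's error to roughly that of a 90-person group
# (the "effective sample size" claim), the propagated error shrinks with it.
contrib_boosted = target_share * se_proportion(90)

print(f"error contributed, 30 real people weighted up : {contrib_weighted:.4f}")
print(f"error contributed, boosted to ~90 effective   : {contrib_boosted:.4f}")
```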
Lenny: Okay. All right, that’s, uh—we could take that—keep talking about that particular thing. I won’t bore [crosstalk 00:30:24]—
Samuel: [laugh].
Lenny: It's just a fascinating problem, and it has been. But it's also indicative—and this may be—I'd love to get your take on this as well, and you mentioned it in the beginning—you know, sample quality, so garbage in, garbage out. So, I expect that you have a pretty good view of quality sample sources, or the quality levels of various sample sources. So, let's leave it at that [laugh].
Samuel: Yeah.
Lenny: But yet, systemically, across the industry, you know, there is still this—it's a massive struggle, and we keep trying to find other ways to deal with it. So, I'm going to ask a double-barreled question here, or at least tee up the next thing. What is your take on, and how are you dealing with, the potential contamination factor of training your models off of flawed data? And then, second, are you looking at other data sources, let's say behavioral data, to help mitigate against that say/do disconnect, to increase the accuracy of the models?
Samuel: So, you know, we could spend hours talking about this, but the first thing is, to better understand this problem, we actually built a tool called Fair Check that basically implements a lot of the checks that, you know, industry people do—sometimes, most of the time, manually—to find weird patterns: for example, people that just say things randomly, or, like, straight-liners that just say the same thing many times, or duplicated people, right, that have, like, a different IP but say the same thing. We have detectors for these things and put them into a tool. A lot of our customers are also using this for data quality, but it was more about understanding what's currently happening with data quality and what are the different things that people are doing. I actually started my career in cyber. I was doing AI in cyber, trying to detect malware on Android, and what people told me there—I was working at Check Point, one of the biggest cyber companies in the world—what my boss told me at the time is, there's always going to be someone better on the other side. Like, someone will always be able to hack you, and so you just need to try to get your defenses as high as possible, right? But someone will always be able to pass through your defense. So, that's the problem with data quality right now in surveys, and it's very hard for us to defend against everything. So, data is often not great. So, a few things. First, when you train a model on data, if there are a few outliers, which are sometimes malicious, right, the model sees that this respondent is, you know, very weird, and it's not necessarily going to learn to replicate that person, because the model smooths things out in a way. So, the results it generates will be less affected by this outlier than if you were to just look at the data itself. That's the first thing. The second thing is, we have this tool that I mentioned that allows you to remove some of these worst rows that you see. The third thing is, in terms of our guarantees, when I say we 3x your data at the niche level, I'm saying we 3x your effective sample size. What I mean by that is, if you have 50 Gen Zs, but actually there are 15 malicious or random people that are just, like, straight-liners, we don't go from 50 to 150. We go from 35 to 100 and something, right? That's how it works, right? We boost the effective sample size. Like, what your data is actually worth, we'll 3x it. We can't do magic. If you gave me 50 Gen Zs, and actually the 50 Gen Zs are just, like, straight-liners, I'm not going to be able to generate 150 Gen Zs, right? So, we're very clear with our customers about that: we boost the effective sample size, we don't boost the sample size itself. So, if the data is crap, we will give you back crap data. That's basically—
Lenny: Yeah, yep.
Samuel: —the way we think of it.
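[Editor's note: a toy illustration of that "effective sample size" point, assuming a crude straight-liner check (zero variance across a respondent's answers). The data and the 3x boost factor are hypothetical; Fair Check's real detectors are presumably more sophisticated.]

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical niche: 50 Gen Z respondents answering ten 1-5 rating questions,
# 15 of whom straight-line (give the same answer to every question).
genuine = pd.DataFrame(rng.integers(1, 6, size=(35, 10)))
liners = pd.DataFrame(np.repeat(rng.integers(1, 6, size=(15, 1)), 10, axis=1))
niche = pd.concat([genuine, liners], ignore_index=True)

# Crude straight-liner check: no variation at all across a respondent's answers.
is_straight_liner = niche.nunique(axis=1) == 1
effective_n = int((~is_straight_liner).sum())

boost_factor = 3  # the boost applies to the effective sample size, not the raw count
print(f"raw sample size       : {len(niche)}")                    # 50
print(f"effective sample size : {effective_n}")                   # ~35
print(f"boosted effective n   : {effective_n * boost_factor}")    # ~105, not 150
```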
Lenny: Okay. Now, what about the—so that all makes perfect sense, and—
Samuel: And the second question. Yeah, yeah, yeah, the second question. So, we don't use any external data, any behavioral data, or anything. The model is really, like, a very modern, highly scalable, and industrializable version of the imputation that people have been using in the industry, right, taken to a whole other level. And we cannot and do not use external behavioral data, because we are very worried that it would lead to some biases that are undesirable, because it's very hard to gauge and balance between, like, the real data we care about, like the survey data, and, like, things that could come from outside, like from data that you would have trained [unintelligible 00:35:05] themselves saying some stuff. So, that's basically the state of the art right now. I'm not saying that in a year, or two years, three years, we won't see a mix of things, but for now, this is where we are, and we're pretty satisfied with the results we're able to provide, so we're not focused on this for now.
Lenny: Okay. Yeah, totally get that. I will say, I expect to see that synthesis. You know, there are other companies that I know of that, you know, leverage behavioral data via app usage, you know, those types of things, the digital exhaust—and it does show highly predictive capability, especially, to use that example, for the say/do disconnect. So, I'm sure that you guys will keep experimenting with it. I think we'll get to a world where it's quality first, period, and then what real first-party data sources exist that basically allow us to predict, you know, human behavior? Maybe we all just have digital avatars, and that's what we're synthesizing at some point. But anyway, interesting times. I want to be conscious of your time as well as our listeners' because you and I, I think, could go on for a long time. This is a fascinating conversation, and it's only going to continue to grow in importance. So, for our audience, you know, we know the data shows—well, here's a data point. Let's see if this jibes with what you're seeing. You know, there's the GRIT Report, and then Qualtrics just released their annual market research report. And in the GRIT Report, when we look at adoption, it's still relatively low, right? It's kind of sub-30%, and in some cases, from an application standpoint, even lower, right? We're getting a lot more on the analysis side, we're getting some on sample, et cetera, et cetera. The Qualtrics report asked users whether they expected to be utilizing something to the effect of synthetic data over the course of the next three years, and 71% of respondents said yes. So, we still have a long way to go from current usage to this projected usage, but we're going there, right? Is that your take? Do you think, "Yeah, this train has left the station. Now everybody's going to have to adapt"?
Samuel: Well, here is the way I think about it: these new tools allow you to do your job faster, better, and cheaper, right, so why would we not use them? You've seen the effect ChatGPT had on all of us, right? So, these kinds of technologies do the three things I just said, and from the synthetic sampling perspective, which is our focus, it allows you to get higher-quality insights a lot faster, because these niches take a very long time to reach, and a lot cheaper, because they're very expensive. And also, like, you know, it allows you to do things that you would never even have considered before, because every time, you know, a brand goes to an agency and tells the agency, "I want to understand this group better," the agency will say, "Today, with what we have, it's impossible. There's no way." It's not even a budget question. Like, [unintelligible 00:38:20] you send a survey to your customer base, and you have a 4% response rate; it's very hard for you to get more people to respond here, and the incentives are really hard to get right, so you're stuck. You can't go deep, right? So, this new set of tools gives you a way to go deeper, again, faster and cheaper. So, this is here to stay. We're all improving these technologies really fast. Every month it gets better, and I'm really excited to see where we are, indeed, in three years, and I have a tendency to think that it will be higher than 71%, but I'm biased [laugh].
Lenny: [laugh]. Right. Spoken like a CEO.
Samuel: [laugh]. Yes.
Lenny: Are investors listening?
Samuel: [laugh].
Lenny: So Sam, what did I not ask that you wanted to talk about, that you wanted to make sure, I want to get this out there?
Samuel: I think one thing is the risks. Like, we didn't talk too much about the risks here. Well, I think we're both very optimistic people, so I guess that's probably why we haven't talked about the risks, but I always like to talk about it, because actually, talking about the risks reassures customers, from our perspective. They want to know what the risks are, and if you're trying not to speak about risks, they're like, "Okay, there is something fishy here," right? So, the thing is, this new set of tools and methodologies, right, needs to have a strong set of guidelines that scope out what you can do and what you can't do, right? It needs to be clear to users of these technologies, you know, what's the red line that you can't cross with these technologies, right? So, the risk, basically, is misuse of these technologies if you don't know these guidelines, and the dos and don'ts. So, I just want to give you a very simple example. If I were to just put my platform in the hands of a customer and tell them, "Okay, you can boost niches of any size," they might boost a 50% niche. And then the result is not a 3x boost, as they expect, in terms of accuracy and margin of error. Then either they come back screaming at me, and I'll explain what they didn't know, or they'll use these results as is, and get some insights that aren't accurate, because it's not a proper 3x boost, right? So, this is a risk. This is the biggest risk: misuse. So, we spend a lot of time building documentation, spending time with our users, our customers, the technical people in the companies we work with, to make sure that they're aligned on what the technology we provide can actually do. So, that's really the thing I would be most careful about. And I really think that all actors like Fairgen developing synthetic data need to spend a lot of time evangelizing on one hand, but also talking about the risks and best practices, so we don't make mistakes using these technologies, and so we only take the best out of them.
Lenny: Great point. It reminds me of when social media first emerged and folks were saying, "Oh, it's going to replace research, qualitative research." Well no, it's not.
Samuel: It’s not.
Lenny: Or it better not because there are these risks and those conversations needed to happen, so I really appreciate you bringing this up. This is cool. I know that we will talk again because you’re helping to drive this entire new shift forward. So, thank you for that. Where can people find you?
Samuel: On our website, fairgen.ai. All our contact details are there. On LinkedIn, you can also find me. I'm very responsive there. And, yeah, these are the best ways to catch me.
Lenny: Okay. Well, Sam, this has been a great conversation. Really appreciate it. Any final thoughts that you want to get out to the audience?
Samuel: I'm very optimistic about the future. And there's just one last thing I want to say: these technologies are easy to try, so I just recommend trying. Before forming an opinion on these technologies, try, try, try. We'd be happy to guide you through everything, and, yeah, happy to meet some of you in the future. Thank you very much for that. It was great.
Lenny: No, thank you for being here. And thank you to our audience. Again, Sam and I may not have otherwise found an opportunity to talk, so on a purely selfish level, I love doing the podcast, love having an audience, because you give me a reason to talk to cool people like you, Sam, so thank you. I want to give a big shout out to our producer, Brigette. She keeps the wheels turning all the time. So, thank you, Brigette. Our editor, Big Bad Audio, and our sponsor—which I should say, Sam, I think Fairgen is coming in as a sponsor of the podcast, so thank you for doing that as well. We appreciate it. That helps keep the lights on. And that's it for this edition of the Greenbook Podcast. Everybody take care, and we'll talk again real soon. Bye-bye.