February 2, 2026

Testing Synthetic Data Against Academic Benchmarks: A Replication Study

Qualtrics examines how synthetic data performs against academic benchmarks, addressing trust and validation gaps in AI-driven research.

The synthetic data conversation in market research has reached an interesting turning point. Vendors are making claims about accuracy. Researchers are asking harder questions about validation. And somewhere in the middle sits a gap: most AI models powering synthetic research haven't been tested against established academic benchmarks in ways the industry can examine.

Last October, we decided to test that gap directly. We reproduced the Paxton and Yang (2024) [1] study that revealed quality issues with ChatGPT and Gemini on basic survey research tasks: the same study that showed general-use large language models (LLMs) struggling with fundamental Likert scale questions. We ran our fine-tuned model through the exact same protocol and committed to sharing what we learned.

This is what the data showed.

The Study That Raised Questions

When Joe Paxton and Yongwei Yang presented their findings at Quant UX Con 2024 [1], they documented something specific: general-use LLMs weren't matching human response patterns on attitudinal questions. The gaps weren't subtle: high correlations that didn't exist in the human data, and limited variability that made synthetic responses look uniform.

For anyone working with synthetic data, this study offered a useful benchmark. The methodology was straightforward: 11 questions about Google Search usage, demographics, and attitudes. Basic survey research that happens thousands of times daily in market research departments. A good test case for any synthetic model claiming to handle attitudinal measurement.

We saw an opportunity to contribute data to the broader conversation about where fine-tuned models fit relative to general-use LLMs.

Four Models, One Protocol

We replicated Paxton and Yang's survey with the same procedures and US general population targeting. Then we collected data from four sources:

  1. Qualtrics Edge Audiences synthetic model, our fine-tuned LLM trained specifically for survey research. This model is built on millions of aggregated and anonymized human responses to hundreds of thousands of questions, with training runs that can take upwards of two weeks.
  2. ChatGPT-5 (OpenAI), accessed via API with the same prompting structure Paxton and Yang used.
  3. Gemini (Google), also via API with matched prompts.
  4. A human panel targeting the same demographic parameters. Real people answering the same questions, serving as our benchmark.

We requested 600 synthetic responses and received 584 usable completions. The other 16 didn't complete properly: incomplete surveys, responses that didn't follow the survey logic, or data that contradicted itself. We scrubbed them before analysis. That's a 2.7% failure rate, and it's worth noting because understanding data quality requires seeing what gets excluded.

The human sample required similar quality control. We collected 620 respondents and removed 102 who didn't meet standard project management quality criteria. Screen-out rate: 16.5%. Quality control applies to human data as well.

Then came the critical step: k-nearest neighbor matching [2] on age, gender, and Google Search use frequency. This created true apples-to-apples comparisons across all four sources, with 518 matched respondents per data source. Without this matching, demographic differences could skew comparisons in ways that have nothing to do with model quality.
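
For readers who want to see what this kind of matching looks like in practice, here is a minimal Python sketch. The study itself used the R MatchIt package cited below; the column names, the greedy 1:1 assignment, and the assumption that covariates are already numerically encoded are illustrative choices, not the actual pipeline.

```python
# Minimal sketch of 1-nearest-neighbor matching on age, gender, and usage
# frequency. Assumes covariates are already numerically encoded; column
# names are hypothetical.
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def match_to_human(human: pd.DataFrame, synthetic: pd.DataFrame,
                   covariates=("age", "gender", "search_frequency")) -> pd.DataFrame:
    """Return one synthetic respondent per human respondent, matched 1:1
    without replacement on the listed covariates."""
    cols = list(covariates)
    scaler = StandardScaler().fit(human[cols])
    h = scaler.transform(human[cols])
    s = scaler.transform(synthetic[cols])

    # Rank every synthetic respondent by distance to each human respondent.
    nn = NearestNeighbors(n_neighbors=len(synthetic)).fit(s)
    _, candidates = nn.kneighbors(h)

    used, matched_idx = set(), []
    for row in candidates:  # greedy 1:1 assignment, nearest unused candidate
        pick = next(i for i in row if i not in used)
        used.add(pick)
        matched_idx.append(pick)
    return synthetic.iloc[matched_idx].reset_index(drop=True)
```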

What the Numbers Showed

The results revealed a clear pattern. Our synthetic model deviated from human responses by an average of 0.07 standard deviations (Cohen's d). That translates to roughly 0.07 scale points of difference, on average, across the eight attitudinal questions.

GPT deviated by 0.87 standard deviations. Gemini hit 0.88. Both general-use LLMs showed consistent patterns: over-estimation of top-two-box percentages, inflated correlations between variables, and response distributions that looked cleaner than human data typically does.

The 12x accuracy figure comes from comparing the Cohen's d values: 0.07 versus 0.87-0.88. That is roughly an order of magnitude difference in how closely synthetic responses matched human patterns.
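
As a concrete illustration of how such a comparison can be computed, here is a small Python sketch of an average Cohen's d across Likert items. The function and column names are assumptions for illustration, and the pooled standard deviation is one common choice of denominator, not necessarily the study's exact estimator.

```python
# Back-of-the-envelope check of the accuracy comparison: Cohen's d per
# attitudinal item, averaged, which is what the 0.07 vs. 0.87-0.88 contrast
# (roughly 12x) refers to above. Column names are illustrative.
import numpy as np
import pandas as pd

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Absolute standardized mean difference with a simple pooled SD."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd

def mean_deviation(human: pd.DataFrame, synth: pd.DataFrame, items: list[str]) -> float:
    """Average Cohen's d across the listed Likert items."""
    return float(np.mean([cohens_d(human[q].to_numpy(), synth[q].to_numpy())
                          for q in items]))

# e.g. mean_deviation(human_df, gpt_df, likert_items) versus
# mean_deviation(human_df, edge_df, likert_items) would reproduce the
# roughly 12x ratio if the averages are 0.87 and 0.07.
```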

But the more interesting finding isn't just about averages. When we plotted the top-two-box selection rates, our synthetic data's margin-of-error bars matched the human data's. The spread looked right. The messiness looked right. Real humans don't produce perfectly consistent response patterns; they have variance, outliers, and patterns that don't always make logical sense. Synthetic data claiming to represent human responses should reflect that reality.

[Figure: Proportion of Responses]
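
To make the margin-of-error comparison concrete, here is a minimal sketch assuming a 5-point Likert item and a normal-approximation 95% interval; the scale coding is an assumption for illustration, not the study's actual instrument.

```python
# Top-two-box proportion with a normal-approximation 95% margin of error.
# Overlapping intervals across sources (human vs. synthetic) are what the
# "margin-of-error bars matched" observation above refers to.
import numpy as np

def top_two_box(responses: np.ndarray, scale_max: int = 5) -> tuple[float, float]:
    """Return (proportion, 95% margin of error) for top-two-box selection."""
    hits = responses >= scale_max - 1   # e.g. 4 or 5 on a 5-point scale
    p = hits.mean()
    moe = 1.96 * np.sqrt(p * (1 - p) / len(responses))
    return float(p), float(moe)
```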

The correlation patterns reinforced this observation. In the human data, Google Search usage frequency correlated with six of the eight attitudinal variables at modest levels. Our synthetic data preserved those relationships. GPT and Gemini showed inflated correlations: the kind of too-perfect patterns that look convincing in a chart but don't match how humans actually think about products.
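
The correlation check can be expressed in a few lines; the variable names below are placeholders for illustration, not the study's actual item labels.

```python
# How strongly usage frequency correlates with each attitudinal item, per
# data source. Comparing the human result with each synthetic source shows
# whether modest human-level relationships are preserved or inflated.
import pandas as pd

def usage_correlations(df: pd.DataFrame, usage_col: str,
                       attitude_cols: list[str]) -> pd.Series:
    """Pearson correlations between usage frequency and each attitudinal item."""
    return df[attitude_cols].corrwith(df[usage_col])
```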

Understanding the Boundaries

Now for the practical part: where these findings matter and where they don't.

The Qualtrics synthetic model performs well on the types of questions in this study: attitudinal Likert scales, opinion statements, and psychographic measures. We trained it on millions of responses to survey questions, so recognizing patterns in conversational survey flow is what it learned to do.

Ask it about past behavior that requires personal memory, like "How was your experience at that restaurant last night?", and the data won't be nearly as useful. The model simply can't know about your individual experiences, and if you asked different sets of people, you would get similarly varied answers. Questions that require personal memory or situational detail fall outside what synthetic data can deliver. That said, our model still significantly outperforms general-use LLMs on these types of questions because it has been trained to understand survey response patterns.

Emerging topics outside our training corpus and extreme niche populations both highlight areas for continued development. When historical survey data on a topic or population is sparse, the model has fewer patterns to learn from. Addressing these limitations is part of our ongoing validation work, testing performance on new topic areas, expanding to additional populations, and incorporating different research methodologies into future training.

This particular study played to our model's strengths. We didn't test behavioral recall or highly specialized topics. One replication validates one set of use cases, not all of them. The broader conversation about synthetic data quality needs multiple validation studies across different question types, from multiple vendors, tested against various benchmarks.

When to Use What

Based on what we learned, here's the practical breakdown.

Synthetic makes sense for concept exploration and hypothesis generation before committing resources. It works for attitudinal and opinion measurement using established scales. It's valuable for rapid testing before human field investment. And it helps identify interesting segments that warrant deeper investigation.

Human data remains essential for past behavior requiring memory, sensitive topics requiring genuine empathy, legal or compliance research where authenticity isn't optional, and final validation before major business decisions.

The hybrid approach often delivers the best results. When a business question arrives, you can test viability immediately with synthetic responses rather than waiting for budget, sample procurement, and field timelines. Run a micro-study this week, refine concepts based on results, test variations next week. These weekly iterations with hundreds or thousands of synthetic responses build a progressively stronger business case. The final confirmation uses human respondents (n=300-500) to validate what emerged from iterative synthetic exploration, often alongside a matched synthetic sample to ensure strategic alignment across target demographics. You move from a few isolated studies per year to continuous learning cycles.

If you're evaluating any synthetic data provider, demand specifics:

  1. Public validation studies against established benchmarks.
  2. Disclosed failure and hallucination rates.
  3. Clear training data sources (panels versus scraped web content versus proprietary).
  4. Known limitations and poor-fit use cases.
  5. Regular model updates and refinement schedules.

And insist on seeing distribution data, not just means. A synthetic model that produces the right average but wrong spread can lead to incorrect conclusions about human variability.
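
One lightweight way to audit spread rather than just means is sketched below: compare standard deviations and run a two-sample Kolmogorov-Smirnov test item by item. This is a generic check under assumed inputs, not the validation procedure used in the study.

```python
# Compare a human and a synthetic distribution for one survey item:
# spread (standard deviation) plus a two-sample KS test on the full
# distributions. Smaller KS statistics mean more similar distributions.
import numpy as np
from scipy.stats import ks_2samp

def distribution_report(human: np.ndarray, synth: np.ndarray) -> dict:
    ks = ks_2samp(human, synth)
    return {
        "human_sd": float(np.std(human, ddof=1)),
        "synth_sd": float(np.std(synth, ddof=1)),
        "ks_statistic": float(ks.statistic),
        "ks_pvalue": float(ks.pvalue),
    }
```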

What Happens Next

This replication represents one data point in our ongoing validation work. We're continuing to test across different question types, use cases, and methodologies. The goal is building a clearer picture of where fine-tuned synthetic models add value and where they don't.

But the broader need extends beyond any single vendor's testing. The industry would benefit from standardized validation approaches that allow meaningful comparisons across different synthetic methodologies. Academic researchers studying synthetic data quality systematically, not just as vendor case studies. Professional organizations developing frameworks similar to what exists for other research methodologies.

The synthetic data conversation is moving from "does it work?" to "when does it work, for what, and how do we know?" That's progress. Answering those questions requires data from multiple sources, tested multiple ways, with results that researchers can examine directly.

The future of synthetic data depends on what the research community collectively understands about its capabilities and limitations. Right now, that understanding benefits from more public testing, more shared data, and more willingness to document what doesn't work alongside what does.

References

[1] Paxton, J., & Yang, Y. (2024). Do LLMs simulate human attitudes about technology products? Paper presented at Quant UX Con 2024, June. https://drive.google.com/file/d/16F_JZv4eHNiDMJT6BT7F6m97C2rBX8-7/view

[2] Ho, D., Imai, K., King, G., & Stuart, E. A. (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software, 42(8), 1–28. https://doi.org/10.18637/jss.v042.i08 (https://cran.r-project.org/web/packages/MatchIt/MatchIt.pdf)


Derrick McLean, PhD

Product Scientist, Edge COE at Qualtrics

Disclaimer

The views, opinions, data, and methodologies expressed above are those of the contributor(s) and do not necessarily reflect or represent the official policies, positions, or beliefs of Greenbook.

