Synthetic Data – Introduction, Benchmarking Synthetic Data Quality: Metrics and Model Performance

Synthetic data is transforming market research with scalable, private, and cost-effective solutions, but fidelity and utility must be rigorously benchmarked.

by Abhishek Bhatia

Associate Director, Primary Research at Acuity Knowledge Partners

Introduction

Synthetic data in market research refers to artificially generated data that mimics real-world data. This data is created using algorithms and statistical models to simulate the characteristics and patterns found in actual data. The primary purpose of synthetic data is to provide a viable alternative to real data, especially when real data is scarce, sensitive or difficult to obtain. The following are some key points about synthetic data in market research:

1. Privacy and security: Synthetic data can help protect privacy and comply with data-protection regulations, as it does not contain any real personal information. This is particularly important when dealing with sensitive customer data.

2. Cost-effective: Generating synthetic data can be more cost-effective than collecting real-world data, especially in large quantities. It can also save time and resources.

3. Testing and validation: Synthetic data can be used to test and validate market research models and tools before deploying them in real-world scenarios. This helps in refining methodologies and ensuring accuracy.

4. Scalability: Researchers can generate large volumes of synthetic data to simulate various market conditions and scenarios, which might be impractical with real data due to limitations in availability and cost.

5. Bias reduction: Synthetic data can be designed to reduce biases that might be present in real-world data, providing a more balanced and comprehensive view for analysis.

6. Innovation: It helps researchers experiment with new ideas and approaches without the constraints of real-world data limitations, fostering innovation in market research techniques.

Overall, synthetic data is a powerful tool in market research, enabling researchers to overcome challenges related to data availability, privacy and cost while enhancing the robustness and reliability of their analyses.

The growth of synthetic data has been significant in recent years, driven by a number of factors and advancements in technology, some of which are listed below:

1. Advancements in AI and machine learning

Generative models: Techniques such as generative adversarial networks (GANs), variational autoencoders (VAEs) and other generative models have improved the ability to create realistic synthetic data.
Deep learning: Deep-learning algorithms have enhanced the quality and complexity of synthetic data, making it more useful for a number of applications.

2. Privacy concerns and regulations

Data protection laws: Regulations such as GDPR, CCPA and HIPAA have increased the need for privacy-preserving data solutions. Synthetic data helps organizations comply with these regulations by reducing the risk of exposing sensitive information.
Privacy-preserving techniques: Synthetic data-generation methods often incorporate privacy-preserving techniques, making them attractive for sectors handling sensitive data.

3. Demand for large datasets

Big data: The rise of big data analytics requires large volumes of data for training and testing models. Synthetic data provides a scalable solution to meet these demands.
Data augmentation: In machine learning, synthetic data is used for data augmentation to improve model performance and robustness.

4. Cost and accessibility

Cost-effective: Generating synthetic data can be more cost-effective than collecting real-world data, especially in large quantities.
Accessibility: Synthetic data can be generated on-demand, making it accessible for various research and development purposes.

5. Innovation in data-generation techniques

Algorithmic improvements: Continuous improvements in algorithms and techniques for generating synthetic data have enhanced its quality and applicability.
Tool development: The development of specialized tools and platforms for synthetic data generation has made it easier for organizations to adopt and implement synthetic data solutions.

6. Applications across sectors

Healthcare: Synthetic data is used for medical research, training AI models and simulating clinical trials without compromising patient privacy.
Finance: Financial institutions use synthetic data for risk modeling, fraud detection and compliance testing.
Retail and marketing: Synthetic data helps in customer behavior analysis, market segmentation and personalized marketing strategies.
Autonomous systems: Synthetic data is crucial for training autonomous vehicles and robotics, where real-world data collection can be challenging and expensive.

7. Research and development

Academic research: Increased interest in synthetic data within academic research has led to more studies, publications and advancements in the field.
Collaborations: Collaborations between academia, industry and government agencies have fostered innovation and growth in synthetic data technologies.

8. Ethical considerations

Bias mitigation: Synthetic data can be designed to reduce biases present in real-world data, promoting fairness and inclusivity in AI models.
Ethical AI: The use of synthetic data aligns with ethical AI principles by ensuring data privacy and reducing the risk of harm.

Overall, the growth of synthetic data is driven by technological advancements, regulatory pressures, cost considerations and increasing demand for large, high-quality datasets across sectors. As these factors continue to evolve, the adoption and development of synthetic data are expected to expand further.

Synthetic data has become a critical asset in artificial intelligence and machine learning applications, providing alternatives to real-world datasets that are privacy-preserving, scalable and legally compliant. Because access to real-world datasets is often constrained by privacy laws and regulations, high data-collection costs and embedded biases, synthetic data, generated through probabilistic models, deep generative networks or rule-based simulations, presents a viable alternative.

The growth of synthetic data is reflected in significant investments and developments: Accenture’s investment in Aaru (a startup leveraging synthetic data to create AI-driven predictive models); NVIDIA’s Omniverse, which uses synthetic data to train autonomous vehicle models with virtual road scenarios; Meta’s use of synthetic data to improve computer vision models; synthetic data initiatives at AWS and Google Cloud; and the European Commission’s AI Act, which promotes privacy-preserving AI and encourages the adoption of synthetic data in regulated sectors such as healthcare and finance.

However, ensuring the quality of synthetic data is crucial for maintaining model performance and compliance with regulatory requirements. To ensure quality, a number of benchmarking methodologies have emerged, from classical statistical tests to adversarial attack resilience assessments.

Synthetic Data in Market Research and Intelligence

Synthetic data is being used increasingly in market research and intelligence due to its ability to address a number of challenges associated with real-world data. “By 2027, synthetic data will replace at least 20% of real consumer data used for predictive analysis in market research industry”, according to the Forrester Research Report 2024. The market research sector is on the verge of a transformation driven by synthetic data, as traditional data-collection methods such as surveys, observational studies and interviews often suffer from response biases, high costs and time constraints.

One of the most profound transformations is in the turnaround time for delivering insights: research tasks such as survey design, data cleaning and fieldwork, which used to take weeks, can now be completed within hours using AI-driven synthetic modeling. According to a study by Harvard Business Review, generative AI (GenAI) is accelerating insight delivery by producing survey-ready responses that depict human behavior, helping companies test product concepts, communications and ad creative in real time.

Synthetic data is also being used increasingly to mimic hard-to-reach audiences such as niche populations, sensitive demographics and low-incidence segments. GenAI tools now enable researchers to simulate large-scale responses from niche groups, expanding the scope of insight without requiring significant financial outlay.

Synthetic data is also transforming the way tracker studies and longitudinal research are conducted by simulating responses across time intervals and reducing reliance on repeated fieldwork while maintaining trend continuity. This could prove beneficial, especially to sectors that require responsive and robust tracking, such as FMCG.

One of the most important uses of synthetic data in the market research sector is scenario testing: rather than simply reporting on what has happened, researchers can now model what might happen. This predictive capability enables brands to frame and test marketing or product strategies, pricing models or user journeys before committing resources, allowing low-risk experimentation.

However, amid all the excitement, industry leaders are emphasizing the need for rigorous oversight and data governance. Synthetic data must be audited for realism, representativeness and potential biases; thus, it is important to benchmark its quality and performance.

Let’s now look at the technical foundations of synthetic data and how quality benchmarking can be done efficiently.

Technical Foundations for Synthetic Data Generation

1. Generative models for market research

a. Generative adversarial networks (GANs): When trained on real consumer data, GANs are widely used to generate synthetic consumer data such as demographic profiles, purchase patterns and online behaviour patterns.

b. Variational autoencoders (VAEs): These are effective for generating synthetic data with complex distributions, such as multi-modal consumer preferences, and are generally used for simulating diverse market segments.

c. Agent-based modeling (ABM): ABM simulates individual agents, such as consumers, and their interactions with a market environment; thus, it is valuable for modelling dynamic market behaviours, such as the impact of pricing changes.
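To make the agent-based idea concrete, the following is a minimal sketch of a pricing simulation in plain Python. It is not a production model: the `Consumer` class, the `simulate_market` function, the willingness-to-pay distribution and the `peer_boost` social-proof effect are all hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Consumer:
    # Hypothetical attribute: the highest price this agent would accept.
    willingness_to_pay: float
    bought: bool = False


def simulate_market(price, n_agents=1000, n_rounds=3, peer_boost=5.0, seed=42):
    """Return how many simulated consumers buy at `price`.

    All distribution parameters are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    agents = [Consumer(rng.gauss(50.0, 15.0)) for _ in range(n_agents)]
    for _ in range(n_rounds):
        # Interaction step: the adoption share at the start of the round acts
        # as social proof and raises what the remaining agents will pay.
        adoption = sum(a.bought for a in agents) / n_agents
        for agent in agents:
            if agent.bought:
                continue
            if agent.willingness_to_pay + peer_boost * adoption >= price:
                agent.bought = True
    return sum(a.bought for a in agents)


if __name__ == "__main__":
    # Compare simulated demand at three candidate price points.
    for price in (40.0, 50.0, 60.0):
        print(f"price={price:.0f} buyers={simulate_market(price)}")
```

Because the same seed produces the same population at every price point, differences in the buyer count reflect only the price change, which is the kind of controlled comparison that is hard to run with real respondents.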

2. Privacy-preserving techniques

a. Differential privacy: This limits how much any single individual’s record can influence the output by adding calibrated noise during data generation or model training, making it provably difficult to re-identify individuals. It is beneficial for financial institutions, healthcare and market researchers, where datasets often contain sensitive consumer information.

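b. K-anonymity and L-diversity: These techniques group similar records so that each individual is indistinguishable from at least k-1 others on identifying attributes (k-anonymity), while keeping sensitive values diverse within each group (l-diversity), so that records cannot be traced back to specific individuals.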

c. Federated learning: This technique enables multiple parties to jointly train a model, including a synthetic data generator, on their own local data without sharing raw records, maintaining data utility while enhancing privacy.
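The "controlled noise" behind differential privacy can be illustrated with the Laplace mechanism applied to a single counting query. This is a sketch of one building block only, not a full differentially private data generator. The function names and the sample ages are hypothetical, and `epsilon` is the privacy budget: smaller values mean more noise and stronger privacy.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Noisy count of the values that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the Laplace scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(7)
    ages = [23, 31, 38, 45, 52, 29, 34]  # hypothetical respondent ages
    for eps in (0.1, 1.0, 10.0):
        noisy = private_count(ages, lambda a: a < 35, eps, rng)
        print(f"epsilon={eps}: noisy count of under-35s = {noisy:.2f}")
```

A generator that samples synthetic respondents from noisy counts like these, rather than from the exact counts, inherits the privacy guarantee at the cost of some fidelity.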

Benchmarking Synthetic Data Quality in Market Research

Benchmarking synthetic data quality in market research is essential to ensure that the generated data is reliable, accurate and useful for analysis. The following are some key aspects and methods to consider when benchmarking synthetic data quality:

1. Fidelity metrics: These measure how well synthetic data replicates real-world data in terms of statistical properties. Key metrics include the following:

Statistical distance measures: Metrics such as the Kolmogorov-Smirnov (KS) test, Wasserstein distance and Jensen-Shannon divergence are used to quantify the similarities between synthetic and real data distributions.
Correlation preservation: To ensure accurate market segmentation and targeting, synthetic data must preserve correlations between the variables.
Marginal and conditional distributions: Synthetic data should accurately replicate both the marginal distribution of each variable and the conditional distributions between variables.
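The fidelity measures named above can be sketched in plain Python to show what each one actually computes. These are simplified, unoptimized versions for one-dimensional samples and single categorical columns, and the function names are illustrative. In practice, SciPy provides `ks_2samp` and `wasserstein_distance` in `scipy.stats` and `jensenshannon` in `scipy.spatial.distance` (which returns the square root of the divergence).

```python
from bisect import bisect_right
from collections import Counter
from math import log2, sqrt


def ks_statistic(real, synth):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    r, s = sorted(real), sorted(synth)
    return max(
        abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
        for x in r + s
    )


def wasserstein_1d(real, synth):
    """Area between the two empirical CDFs (1-D Wasserstein-1 distance)."""
    r, s = sorted(real), sorted(synth)
    grid = sorted(r + s)
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        gap = abs(bisect_right(r, left) / len(r) - bisect_right(s, left) / len(s))
        total += gap * (right - left)
    return total


def js_divergence(real, synth):
    """Jensen-Shannon divergence (base 2, range 0 to 1) of category frequencies."""
    p_counts, q_counts = Counter(real), Counter(synth)
    jsd = 0.0
    for category in set(p_counts) | set(q_counts):
        p = p_counts[category] / len(real)
        q = q_counts[category] / len(synth)
        m = 0.5 * (p + q)
        if p > 0:
            jsd += 0.5 * p * log2(p / m)
        if q > 0:
            jsd += 0.5 * q * log2(q / m)
    return jsd


def pearson(x, y):
    """Pearson correlation; compare real vs synthetic to check preservation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical monthly spend values and preferred channels.
    real_spend, synth_spend = [20, 35, 35, 50, 80], [22, 30, 41, 55, 70]
    print("KS:", ks_statistic(real_spend, synth_spend))
    print("Wasserstein:", wasserstein_1d(real_spend, synth_spend))
    print("JS:", js_divergence(["web", "app", "web"], ["web", "app", "app"]))
```

For all three distance measures, values closer to zero indicate that the synthetic column is closer to the real one. Correlation preservation can be checked by computing `pearson` for the same pair of variables in each dataset and comparing the two results.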

2. Utility metrics: These evaluate the effectiveness of synthetic data in training models for market research applications. Key metrics include the following:

Model performance: Metrics such as accuracy, recall, precision and F1 score are used to compare the performance of models trained on synthetic data with that of models trained on real data.
Generalization: This assesses how models trained on synthetic data perform on real-world test data.
Feature importance: This ensures that synthetic data preserves the importance of key features in predictive models.
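A common way to operationalize these utility checks is to train a model on synthetic data, test it on held-out real data, and compare the result with a model trained on real data. The sketch below does this with a deliberately simple nearest-centroid classifier in plain Python. The `make_data` helper, its "buyer" labels and its Gaussian features are hypothetical stand-ins for a real dataset and a real generator's output.

```python
import random


def make_data(n, seed):
    """Hypothetical two-feature dataset: label 1 = buyer, 0 = non-buyer."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 2.0 if label else -2.0
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return rows


def fit_centroids(rows):
    """Train step: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Assign the class whose centroid is closest in squared distance."""
    return min(
        centroids,
        key=lambda label: sum((f - c) ** 2 for f, c in zip(features, centroids[label])),
    )


def accuracy(centroids, rows):
    return sum(predict(centroids, f) == y for f, y in rows) / len(rows)


def precision_recall_f1(centroids, rows, positive=1):
    tp = fp = fn = 0
    for features, label in rows:
        predicted = predict(centroids, features)
        if predicted == positive and label == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif label == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    real_train = make_data(500, seed=1)
    real_test = make_data(500, seed=2)
    synthetic = make_data(500, seed=3)  # stand-in for a generator's output
    trained_on_real = accuracy(fit_centroids(real_train), real_test)
    trained_on_synth = accuracy(fit_centroids(synthetic), real_test)
    print("utility gap:", trained_on_real - trained_on_synth)
    print("P/R/F1 (synthetic-trained):",
          precision_recall_f1(fit_centroids(synthetic), real_test))
```

A small gap between the two accuracy figures suggests the synthetic data carries most of the signal a model needs. A large gap suggests the generator has lost relationships that matter for prediction.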

3. Privacy metrics: These assess the robustness of synthetic data against re-identification attacks. Key metrics include the following:

Re-identification risk: This measures how likely it is that individuals can be identified in synthetic datasets using techniques such as linkage attacks.
Membership inference attacks (MIAs): These assess whether an attacker can determine if a specific data point was used to train a machine-learning model and generate synthetic data, potentially revealing sensitive information about the training data.
Differential privacy guarantees: These offer a mathematically rigorous framework to quantify and bound the privacy loss for any individual, expressed as a privacy budget (commonly denoted epsilon), which can support compliance with regulatory requirements.
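As a rough illustration of a linkage-style check, the sketch below counts how many synthetic rows match exactly one real record on a set of quasi-identifiers, and reports the smallest group size in a dataset (the k in k-anonymity). It is a simplified proxy, not a full re-identification or membership inference attack, and the function names, column names and sample rows are hypothetical.

```python
from collections import Counter


def _key(row, quasi_identifiers):
    return tuple(row[q] for q in quasi_identifiers)


def unique_link_rate(real_rows, synthetic_rows, quasi_identifiers):
    """Share of synthetic rows whose quasi-identifiers match exactly one real record."""
    real_counts = Counter(_key(r, quasi_identifiers) for r in real_rows)
    risky = sum(
        1 for s in synthetic_rows if real_counts[_key(s, quasi_identifiers)] == 1
    )
    return risky / len(synthetic_rows)


def smallest_group_size(rows, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(_key(r, quasi_identifiers) for r in rows)
    return min(counts.values())


if __name__ == "__main__":
    real = [
        {"age_band": "25-34", "region": "North", "income": "mid"},
        {"age_band": "25-34", "region": "North", "income": "high"},
        {"age_band": "35-44", "region": "South", "income": "low"},
    ]
    synthetic = [
        {"age_band": "35-44", "region": "South", "income": "mid"},
        {"age_band": "25-34", "region": "North", "income": "mid"},
    ]
    qi = ["age_band", "region"]
    print("unique link rate:", unique_link_rate(real, synthetic, qi))
    print("smallest group in real data:", smallest_group_size(real, qi))
```

A high unique link rate means many synthetic rows point to a single real person's attribute combination, which is a signal to coarsen the quasi-identifiers or revisit the generator before release.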

Market Research – Industry Applications

Synthetic data is being adopted increasingly across sectors for market research due to its ability to provide high-quality, privacy-preserving and scalable data solutions. The following are some key industry applications of synthetic data in market research:

- Price and demand forecasting: Synthetic data is being used increasingly in simulating economic fluctuations and consumer reactions to pricing strategies. Agent-based modeling is particularly effective for simulating dynamic market behavior, with utility validated through demand forecasting accuracy.
- Predictive analytics: Healthcare providers use synthetic data to train predictive models for disease progression, patient readmission and treatment efficacy.
- Risk modeling: Financial institutions use synthetic data to model various risk scenarios, including credit risk, market risk and operational risk.
- Customer behavior analysis: The market research sector is increasingly using synthetic data to simulate consumer behavior. Synthetic data can simulate customer interactions, purchase patterns and preferences, helping retailers understand and predict consumer behavior.
- Market segmentation: Retailers use synthetic data to segment their customer base, enabling targeted marketing and personalized recommendations.
- Customer experience: Synthetic data helps analyze customer service interactions and improve customer experience through better service delivery.
- Market analysis: Automotive companies, energy providers, real estate companies and retail and e-commerce giants use synthetic data to analyze market trends, customer preferences and competitive dynamics.

Trends in Adopting Synthetic Data

The future of synthetic data in market research is promising and transformative. As technology advances and demand for high-quality, privacy-compliant data grows, synthetic data is set to play a crucial role in shaping the landscape of market research. However, we need to be cautious about the following:

Need for standardized benchmarks: The market research sector needs standardized benchmarks to evaluate synthetic data quality. SDMetrics and SynthCity are some of the initiatives in this direction, but there is still a long way to go.
Real-time data generation: With demand for real-time data increasing, developing adaptive synthetic data-generation models that respond dynamically to new market trends will be crucial for researchers.
Techniques for bias mitigation: Even when designed to reduce biases, synthetic data, if designed poorly, can introduce new biases. Thus, new techniques such as fairness-aware generative models need to be developed to address this challenge.
Combining with federated learning: Integrating synthetic data with federated learning is a promising approach to enhancing privacy, security and efficiency in market research and other data-driven fields. As the technology matures, this integration is likely to become common practice for addressing data privacy, security and scalability challenges.

Conclusion

Synthetic data is revolutionizing the market research sector by providing scalable, privacy-preserving and cost-effective alternatives to traditional data-collection methods. However, its adoption depends on rigorous benchmarking to ensure fidelity, utility and privacy. As the sector continues to evolve, advancements in generative models, privacy-preserving techniques and evaluation frameworks will be critical to unlocking the full potential of synthetic data in market research.

By addressing technical challenges and leveraging expert insights, market researchers can harness synthetic data to generate actionable insights, optimise strategies and stay ahead in a rapidly changing market landscape.


Disclaimer

The views, opinions, data, and methodologies expressed above are those of the contributor(s) and do not necessarily reflect or represent the official policies, positions, or beliefs of Greenbook.
