Explore the use of synthetic data in addressing data scarcity while ensuring validation to prevent biases and inaccuracies. Balance innovation with ethics.
The insights profession has long been anchored in three foundational principles: rigor, objectivity, and transparency. These principles have guided our industry for decades, ensuring that market research is trustworthy, actionable, and, most importantly, ethically gathered. However, as new technologies emerge, there is a tendency to adopt them enthusiastically without considering their potential long-term impacts.
I recently debated this topic with Simon Chadwick (Managing Partner at Cambiar Consulting), Mike Stevens (Founder & Leading Consultant at Insights Platform), and Finn Raben (Founder at Amplifi Consulting). We examined the latest of these trends: the growing use of, and dependence on, synthetic data, a tool with immense potential but also significant risk if not handled wisely. This is a crucial development for our profession, and in this article I reflect further on it.
Synthetic data is not a new concept; it has been around in various forms for years. Essentially, synthetic data is artificially generated data that mimics the statistical properties of previously gathered real-world data.
Synthetic data mimics the characteristics and patterns of real-world data without containing any personally identifiable information (PII) or other sensitive information. It is created through mathematical models and algorithms that simulate the statistical properties and relationships found in actual datasets.
Synthetic data has enabled researchers to overcome data scarcity issues, maintain privacy compliance, and explore diverse scenarios without relying solely on limited or sensitive real-world data sources. This has allowed for more robust analysis, experimentation, and model training while safeguarding privacy and confidentiality (thank you to ChatGPT for that definition of synthetic data!). However, despite all its advantages, synthetic data still carries inherent risks, particularly when it’s used as a direct substitute for real-world data.
One historical example of innovation adoption worth considering is the introduction of polyester, a synthetic fiber that revolutionized clothing in the mid-20th century. Polyester was initially hailed as a miracle material that did not require ironing, but its unintended environmental consequences, notably the proliferation of microplastics, only became apparent decades later. Similarly, synthetic data could have unforeseen impacts if not properly scrutinized. For instance, if synthetic data is not carefully validated, it could introduce biases or inaccuracies into research findings, leading to misguided decisions down the road.
The allure of synthetic data lies in its ability to address long-standing challenges in market research, such as reaching hard-to-interview populations or filling data gaps. However, the industry must resist the temptation to embrace this technology without the necessary checks and balances. Just as the environmental impact of synthetic fibers has prompted companies like Patagonia and Samsung Electronics to seek solutions, the market research industry must proactively establish safeguards for synthetic data.
To ensure the responsible use of synthetic data, it is essential to apply the same rigor, triangulation, and validation that we expect from traditional research methods. This involves continuously updating synthetic models with real-world data to maintain accuracy and relevance, as well as clearly communicating the use of synthetic data to stakeholders to avoid ethical and transparency concerns.
Synthetic data can be invaluable in specific, controlled environments. For example, in medical research, synthetic personas can be created from existing data to study patient behavior without compromising privacy, as in studies conducted with cancer patients.
However, broader applications, such as using synthetic data to fill simple survey quotas, can be problematic. In these cases, the synthetic responses may not align with real-world data, leading to inaccurate insights.
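As a rough illustration of what checking that alignment might look like, the sketch below compares how answers to a single question are distributed in a real sample and in a synthetic one, using total variation distance. The sample data, the function names, and the 0.10 tolerance are all assumptions made up for this example, not figures from any study; a real validation would cover many questions and subgroups.

```python
from collections import Counter


def answer_shares(responses):
    """Return each answer's share of all responses, as a dict."""
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    total = len(responses)
    return {answer: n / total for answer, n in counts.items()}


def total_variation_distance(real, synthetic):
    """Half the sum of absolute share differences.

    0.0 means the two answer distributions are identical;
    1.0 means they share no answers at all.
    """
    real_shares = answer_shares(real)
    synth_shares = answer_shares(synthetic)
    answers = set(real_shares) | set(synth_shares)
    return 0.5 * sum(
        abs(real_shares.get(a, 0.0) - synth_shares.get(a, 0.0)) for a in answers
    )


# Made-up answers to one brand-perception question (illustration only).
real_sample = ["positive"] * 50 + ["neutral"] * 30 + ["negative"] * 20
synthetic_sample = ["positive"] * 70 + ["neutral"] * 20 + ["negative"] * 10

distance = total_variation_distance(real_sample, synthetic_sample)

THRESHOLD = 0.10  # hypothetical tolerance; each study would need to set its own
needs_review = distance > THRESHOLD

print(f"distance={distance:.2f}, needs_review={needs_review}")
```

In this invented example the synthetic sample overstates positive sentiment, so the check flags it for review. A single summary number like this is only a first screen; it says nothing about why the distributions differ.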
One study, presented by Annelies Verhaeghe at ESOMAR’s AI conference in 2023, highlighted the discrepancies between synthetic data and real respondent data in a study on Land Rover perceptions. The study found that the synthetic model required constant updates with real data to remain accurate. This case study underscores the importance of rigorously validating synthetic data against real-world benchmarks to ensure its reliability and accuracy.
The use of synthetic data raises several key concerns:
Bias and Lack of Representativeness: Synthetic data is generated based on models that may inadvertently reflect biases present in the training data. This can result in skewed insights that do not accurately represent the target population, particularly if the algorithms are not sophisticated enough to simulate a wide range of demographics and behaviors.
Quality and Reliability Issues: The quality of synthetic data depends heavily on the algorithms used to generate it. Poorly designed models can produce low-quality data, leading to unreliable insights and misguided decisions.
Ethical and Transparency Concerns: The use of synthetic data must be transparently communicated to clients and stakeholders. Failing to do so can lead to trust issues and ethical dilemmas, particularly if the data is used to manipulate outcomes or if its limitations are not disclosed.
Regulatory and Compliance Risks: The creation and use of synthetic data must comply with data protection laws and ethical standards. This is especially critical in regulated industries, where non-compliance can lead to legal challenges and penalties.
Impact on Decision-Making: Decisions based on synthetic data that does not accurately reflect real consumer behavior can lead to poor business outcomes. Over-reliance on synthetic data can also diminish the importance of human judgment and qualitative insights, which are crucial for understanding complex human behaviors.
As synthetic data becomes more prevalent, there is a risk that it could contaminate data lakes—large repositories where data from various sources is stored and analyzed. If synthetic data is not properly identified and managed, it could lead to a situation where future analyses are based entirely on artificial data, with no real-world inputs. This could have serious implications for the validity of research findings and the decisions based on them.
To prevent this, the industry must implement mechanisms to identify and tag synthetic data within datasets, ensuring that it can be tracked and managed appropriately. This will help avoid a scenario similar to the Great Pacific Garbage Patch, where microplastics have accumulated in the ocean, largely invisible but with significant environmental consequences.
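A minimal sketch of what such tagging could look like follows, assuming each record carries an explicit `source` label set at the moment it enters the dataset. The `Response` class, the `source` field, and the helper functions are hypothetical names chosen for this example, not an industry standard.

```python
from dataclasses import dataclass

ALLOWED_SOURCES = {"real", "synthetic"}


@dataclass(frozen=True)
class Response:
    respondent_id: str
    answer: str
    source: str  # provenance tag: "real" or "synthetic"


def tag_response(respondent_id, answer, source):
    """Create a record, refusing any that lacks a recognized provenance tag."""
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unknown source {source!r}; expected one of {sorted(ALLOWED_SOURCES)}")
    return Response(respondent_id, answer, source)


def synthetic_share(responses):
    """Fraction of records tagged as synthetic (0.0 for an empty dataset)."""
    if not responses:
        return 0.0
    return sum(r.source == "synthetic" for r in responses) / len(responses)


def real_only(responses):
    """Keep only records tagged as real, e.g. for benchmarking."""
    return [r for r in responses if r.source == "real"]


# Made-up records (illustration only).
dataset = [
    tag_response("r1", "positive", "real"),
    tag_response("r2", "neutral", "real"),
    tag_response("s1", "positive", "synthetic"),
    tag_response("r3", "negative", "real"),
]

share = synthetic_share(dataset)
print(f"synthetic share: {share:.0%}, real records: {len(real_only(dataset))}")
```

The design choice here is to reject untagged records at the point of entry rather than trying to detect synthetic data afterwards, since provenance is far easier to record than to reconstruct.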
The potential of synthetic data to revolutionize market research is undeniable. However, as with any powerful tool, it must be used with care. The industry must strike a balance between embracing innovation and upholding the principles of rigor, objectivity, and transparency. By doing so, we can harness the benefits of synthetic data while mitigating its risks, ensuring that our insights remain accurate, reliable, and ethically sound.
As we navigate this new frontier, it is crucial to remain vigilant, continually assessing the impact of synthetic data on our research practices and outcomes. Only by doing so can we avoid repeating the mistakes of the past and ensure that synthetic data serves as a valuable asset rather than a liability in the evolving landscape of market research.
Disclaimer
The views, opinions, data, and methodologies expressed above are those of the contributor(s) and do not necessarily reflect or represent the official policies, positions, or beliefs of Greenbook.