About us
Learn how GA4GH helps expand responsible genomic data use to benefit human health.
Our Strategic Road Map defines strategies, standards, and policy frameworks to support responsible global use of genomic and related health data.
Discover how a meeting of 50 leaders in genomics and medicine led to an alliance uniting more than 5,000 individuals and organisations to benefit human health.
GA4GH Inc. is a not-for-profit organisation that supports the global GA4GH community.
The GA4GH Council, consisting of the Executive Committee, Strategic Leadership Committee, and Product Steering Committee, guides our collaborative, globe-spanning alliance.
The Funders Forum brings together organisations that offer both financial support and strategic guidance.
The EDI Advisory Group responds to issues raised in the GA4GH community, finding equitable, inclusive ways to build products that benefit diverse groups.
Distributed across a number of Host Institutions, our staff team supports the mission and operations of GA4GH.
Curious who we are? Meet the people and organisations across six continents who make up GA4GH.
More than 500 organisations connected to genomics — in healthcare, research, patient advocacy, industry, and beyond — have signed onto the mission and vision of GA4GH as Organisational Members.
These core Organisational Members are genomic data initiatives that have committed resources to guide GA4GH work and pilot our products.
This subset of Organisational Members whose networks or infrastructure align with GA4GH priorities has made a long-term commitment to engaging with our community.
Local and national organisations assign experts to spend at least 30% of their time building GA4GH products.
Anyone working in genomics and related fields is invited to participate in our inclusive community by creating and using new products.
Wondering what GA4GH does? Learn how we find and overcome challenges to expanding responsible genomic data use for the benefit of human health.
Study Groups define needs. Participants survey the landscape of the genomics and health community and determine whether GA4GH can help.
Work Streams create products. Community members join together to develop technical standards, policy frameworks, and policy tools that overcome hurdles to international genomic data use.
GIF solves problems. Organisations in the forum pilot GA4GH products in real-world situations. Along the way, they troubleshoot products, suggest updates, and flag additional needs.
NIF finds challenges and opportunities in genomics at a global scale. National programmes meet to share best practices, avoid incompatibilities, and help translate genomics into benefits for human health.
Communities of Interest find challenges and opportunities in areas such as rare disease, cancer, and infectious disease. Participants pinpoint real-world problems that would benefit from broad data use.
The Technical Alignment Subcommittee (TASC) supports harmonisation, interoperability, and technical alignment across GA4GH products.
Find out what’s happening with up-to-the-minute meeting schedules for the GA4GH community.
See all our products — always free and open-source. Do you work on cloud genomics, data discovery, user access, data security, or regulatory policy and ethics? Need to represent genomic, phenotypic, or clinical data? We’ve got a solution for you.
All GA4GH standards, frameworks, and tools follow the Product Development and Approval Process before being officially adopted.
Learn how other organisations have implemented GA4GH products to solve real-world problems.
Help us transform the future of genomic data use! See how GA4GH can benefit you — whether you’re using our products, writing our standards, subscribing to a newsletter, or more.
Help create new global standards and frameworks for responsible genomic data use.
Align your organisation with the GA4GH mission and vision.
Want to advance both your career and responsible genomic data sharing at the same time? See our open leadership opportunities.
Join our international team and help us advance genomic data use for the benefit of human health.
Share your thoughts on all GA4GH products currently open for public comment.
Solve real problems by aligning your organisation with the world’s genomics standards. We offer software developers both customisable and out-of-the-box solutions to help you get started.
Learn more about upcoming GA4GH events. See reports and recordings from our past events.
Speak directly to the global genomics and health community while supporting GA4GH strategy.
Be the first to hear about the latest GA4GH products, upcoming meetings, new initiatives, and more.
Questions? We would love to hear from you.
Read news, stories, and insights from the forefront of genomic and clinical data use.
Attend an upcoming GA4GH event, or view meeting reports from past events.
See new projects, updates, and calls for support from the Work Streams.
Read academic papers coauthored by GA4GH contributors.
Listen to our podcast OmicsXchange, featuring discussions from leaders in the world of genomics, health, and data sharing.
Check out our videos, then subscribe to our YouTube channel for more content.
View the latest GA4GH updates, Genomics and Health News, Implementation Notes, GDPR Briefs, and more.
Discover all things GA4GH: explore our news, events, videos, podcasts, announcements, publications, and newsletters.
11 Apr 2024
There is growing enthusiasm for synthetic data as a means of enabling meaningful data analysis while preserving privacy. However, the technical literature identifies threats to privacy for some forms of synthetic data when generated using personal data. In this brief, we discuss the key question of whether, and when, synthetic data may constitute personal data, set out the approach that data controllers should adopt in answering such questions, and explain why general legal uncertainty is likely to remain for some time.
Synthetic data can be thought of as ‘artificial data that closely mimic the properties and relationships of real data’. While there is considerable interest in synthetic data as a form of privacy enhancing technology (PET), it has other important potential uses, such as filling gaps in real-world data, correcting bias in datasets, and enabling cohort planning, code development, or other technical applications. For example, synthetic genomic data has been generated by the Common Infrastructure for National Cohorts in Europe, Canada, and Africa Project (CINECA) to showcase and demonstrate federated research and clinical applications.
Synthetic data is also an umbrella term. It can be generated in multiple ways and for varying purposes, and will consequently take different forms with different privacy implications (see the table below).
Generation Method: Synthetic data generation methods can be grouped into three main categories: statistical, noise-based, and machine learning methods. Some require manual manipulation of the data, others add noise to disguise personal data, and others learn the complex relationships in the real data and map those correlations onto entirely ‘made up’ but biologically realistic data values. The output synthetic data can therefore be fully or only partially synthetic. All approaches may require manual review and oversight to assess accidental and coincidental matches.

Privacy Preservation Techniques: As with other data, privacy preservation techniques may be applied to synthetic data. The characteristics of the data (e.g., distribution and size) and their intended use will determine the most appropriate technique. For example, differential privacy (which adds noise to distort the true data values) is better suited to large datasets, but this technique reduces accuracy (i.e., ‘fidelity’) and therefore its utility in some contexts.

Context and Use Limitations: Technical and organisational restrictions on access to synthetic data, and on the tools available when processing them, may be applied to further limit privacy risks. As with other forms of data, care should be taken when repurposing synthetic data or changing their context, to ensure the data remain fit for purpose and sufficiently safeguarded.
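To make the generation-method categories above concrete, here is a minimal, illustrative Python sketch of the statistical and noise-based approaches. It is not a GA4GH product or any specific tool; the dataset and function names are hypothetical.

```python
import math
import random
import statistics

# Hypothetical "real" attribute values (e.g. a clinical measurement);
# illustrative only, not real patient data.
real = [4.1, 3.8, 5.0, 4.6, 4.2, 3.9, 4.8, 4.4]

def synthesize_numeric(values, n, seed=0):
    """Statistical generation: fit a normal distribution to the real
    attribute and sample entirely new, 'made up' values from it."""
    rng = random.Random(seed)
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

def laplace_noised_mean(values, epsilon, sensitivity, seed=0):
    """Noise-based generation: release a statistic with Laplace noise
    scaled to sensitivity / epsilon, as in differential privacy.
    A smaller epsilon means more noise: more privacy, less fidelity."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    u = rng.random() - 0.5  # inverse-transform sample of Laplace(0, scale)
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return statistics.mean(values) + noise

synthetic = synthesize_numeric(real, n=5)
noisy_mean = laplace_noised_mean(real, epsilon=1.0, sensitivity=1.0)
```

The sketch also illustrates the trade-off noted in the table: the synthetic sample contains no record from the source data but preserves its distribution, while the noised statistic trades fidelity for privacy as the noise scale grows.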
The key question is whether synthetic data fall within the material scope of data protection law as ‘personal data’ (Article 4(1) GDPR). Due to the wide range of synthetic data generation methods, outputs and uses there can be no one-size-fits-all response but there are some indications of the approach and relevant factors that courts and regulatory bodies will adopt in determining an answer.
In our recent review of emerging statements and guidance from authorities in Europe we found that data protection authorities are currently approaching synthetic data in a similar, “orthodox” way. This views synthetic data as a novel privacy enhancing technology and begins with the presumption that if the source data are ‘personal data’, then the output data (synthetic data) remain personal data, unless it can be demonstrated with a high degree of confidence that threats of re-identification are minimal and well safeguarded.
In the health and genomic context, this means that the use of patients’ personal data to generate synthetic data should trigger an assessment by the data controller of whether the output synthetic data could (in combination with any other available sources) be used to identify a living individual. At this stage, it is not clear whether coincidental matching (i.e., where a living individual’s profile was not part of the training data, but a generated synthetic profile happens to match them, or to match a real person who comes to exist in the future) should also be considered. Such challenges are currently underexplored.
As regular readers of these briefs will be aware, determining whether data are identifying or anonymous requires a multifaceted contextual risk assessment, placing risk on a spectrum of identifiability that asks data controllers to assess ‘the means reasonably likely to be used to identify an individual’ (Recital 26, GDPR). Further detail on such assessments has been provided in a previous brief.
In addition to understanding how the data were generated, the privacy preservation techniques applied, the current and (possible) future processing environment, and the purposes for processing, data controllers may also need to consider the technical and organisational safeguards in place to determine whether effective anonymisation has been reached (see table above). A reassessment will likely be triggered whenever any of these factors change throughout the dataset’s lifecycle.
However, while this regulatory approach ensures that data protection and privacy are upheld as far as possible, it is important to note that it is not without cost or challenge:
Unfortunately, the potential for confusion about the status of synthetic data in data protection law has been exacerbated by the drafting of the EU Artificial Intelligence Act (AIA), which seemingly groups synthetic data with anonymous or ‘other non-personal data’ (Article 59(1)(b)), despite data protection authorities around Europe adopting a more nuanced, case-by-case approach. However, the EU AI Act applies without prejudice to existing EU law, including the GDPR (Recital 9), which means that the regulatory status and definition of synthetic data within data protection law remains unresolved.

Moreover, questions have been raised about how well innovations such as synthetic data, its generation methods, and its outputs fit the existing data protection framework. For example, the GDPR and data protection guidance do not currently address the issue of coincidental matching with existing or future real persons (which could arise with synthetic data). Determining whether this risk is relevant to data protection will involve consideration of the appropriate role of data protection law (as opposed to other parts of the legal framework), and of the forms of ‘harm’ it seeks to guard against. To apply data protection law to all forms of synthetic data without such reflection threatens to overstretch the concept of ‘personal data’ and warp the function of data protection law.
However, until regulators, courts, and policymakers address these issues head on, synthetic data developers and users should continue to follow data protection impact assessment and anonymisation best practices when assessing identifiability and other data protection risks.
—
Elizabeth Redrup Hill is a Senior Policy Analyst (Law and Regulation) in the Humanities Team at the PHG Foundation, a think tank with a special focus on genomics and personalised medicine that is a part of the University of Cambridge.
Colin Mitchell is Head of Humanities at the PHG Foundation.
See all previous briefs.
Please note that GDPR Briefs neither constitute nor should be relied upon as legal advice. Briefs represent a consensus position among Forum Members regarding the current understanding of the GDPR and its implications for genomic and health-related research. As such, they are no substitute for legal advice from a licensed practitioner in your jurisdiction.