27 Mar 2025
Refget Sequence Collections, a newly approved GA4GH product, creates standardised identifiers for collections of reference sequences to facilitate data validation and reproducible research.
By Jaclyn Estrin, GA4GH Science Writer
The Global Alliance for Genomics and Health (GA4GH) is pleased to announce the product approval of refget Sequence Collections.
Approved by the Product Steering Committee, refget Sequence Collections joins the ranks of nearly forty other products developed by members of GA4GH Work Streams to enable broad, responsible sharing of genomic and related health data. GA4GH products are designed to harness the power of global genomic data to foster improvements in human health outcomes.
When researchers embark on a new genomic analysis, they often seek to compare their sequence data to a reference sequence. Reference sequences are commonly used to interpret biological data such as an individual’s genetic makeup.
Comparison against a reference genome is essential to help researchers determine where genomic variations exist to better understand how these differences contribute to genetic diseases.
However, since the beginning of genomic research, institutions have used different naming conventions for individual reference sequences, whether a single chromosome or a sequence of DNA, mRNA, or protein. As a result, identifying and locating the appropriate reference sequence becomes a time-consuming, manual process of data integration and comparison.
GA4GH’s Large-Scale Genomics (LSG) Work Stream developed the refget Sequences API in 2018 to respond to this challenge. Refget Sequences uses an algorithm to assign a unique identifier — like a fingerprint — to a single sequence. An accompanying API allows researchers to do the reverse discovery as well, finding the original sequence from the identifier.
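To make the "fingerprint" idea concrete, here is a minimal Python sketch of a content-derived identifier for a single sequence. It assumes a truncated SHA-512 digest encoded as base64url (a scheme commonly written as sha512t24u in GA4GH specifications). The helper name `sequence_identifier`, the `SQ.` prefix, and the uppercase normalisation are illustrative assumptions, and this is not the refget reference implementation.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, then base64url-encoded (32 characters)."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_identifier(sequence: str) -> str:
    """Hypothetical helper: derive an identifier from the sequence content alone."""
    # Assumption: sequences are trimmed and uppercased before hashing.
    normalised = sequence.strip().upper()
    return "SQ." + sha512t24u(normalised.encode("ascii"))


if __name__ == "__main__":
    print(sequence_identifier("ACGT"))
    print(sequence_identifier("acgt"))  # same content after normalisation, same identifier
    print(sequence_identifier("ACGA"))  # one base differs, so the identifier differs
```

Because the identifier depends only on the sequence content, two institutions that hold the same sequence under different names would compute the same value.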
The newest approved GA4GH product, refget Sequence Collections, is an expansion of the refget Sequences API. The product development was led by Nathan Sheffield (University of Virginia), Timothee Cezard (EMBL’s European Bioinformatics Institute), Andy Yates (EMBL’s European Bioinformatics Institute), Sveinung Gundersen (ELIXIR Norway; University of Oslo), Shakuntala Baichoo (Peter Munk Cardiac Centre – Artificial Intelligence), and Rob Davies (Wellcome Sanger Institute), with support from LSG Work Stream Manager Reggan Thomas (EMBL’s European Bioinformatics Institute) and Work Stream Co-Leads Oliver Hofmann (University of Melbourne) and Geraldine Van der Auwera (Seqera).
While the refget Sequences API provides a single name for a single reference sequence (for instance, a single chromosome), refget Sequence Collections assigns a name for a collection of reference sequences (for instance, a set of chromosomes, which makes up a genome or assembly).
“Instead of relying on an ambiguous identifier, like hg38 or GRCh38, which could be used to refer to genomes with subtle but important differences, the sequence collections standard provides a deterministic, content-derived identifier that would reflect those differences,” explained Sheffield. “Furthermore, these identifiers can be determined by anyone, so the naming does not rely on a centralised authority.” This allows the standard to be used for a broader array of use cases than current approaches.
Refget Sequence Collections uses a standard algorithm to assign a unique identifier to a collection of reference sequences. The identifier can refer to anything represented as a collection of sequences, including genomes, transcriptomes, or proteomes, and such collections can then be retrieved and compared.
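The idea of a deterministic, content-derived identifier for a whole collection can be sketched in Python as follows. This is a simplified illustration, not the approved algorithm: it assumes each attribute (names, lengths, sequence digests) is serialised and digested on its own, and that the resulting digests are then digested together. The compact `json.dumps` call stands in for a formal JSON canonicalisation scheme, the attributes chosen here may differ from those the specification uses, and all function names are hypothetical.

```python
import base64
import hashlib
import json
from typing import Mapping


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, then base64url-encoded."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    # Simplified stand-in for formal JSON canonicalisation: sorted keys, no whitespace.
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def collection_identifier(sequences: Mapping[str, str]) -> str:
    """Hypothetical helper: `sequences` maps a name to its sequence; order is significant."""
    names = list(sequences.keys())
    lengths = [len(seq) for seq in sequences.values()]
    seq_digests = [
        "SQ." + sha512t24u(seq.upper().encode("ascii")) for seq in sequences.values()
    ]
    # Digest each attribute array separately, then digest the object of digests.
    attribute_digests = {
        "names": sha512t24u(canonical_json(names)),
        "lengths": sha512t24u(canonical_json(lengths)),
        "sequences": sha512t24u(canonical_json(seq_digests)),
    }
    return sha512t24u(canonical_json(attribute_digests))


if __name__ == "__main__":
    # Two toy "assemblies" that share a label in practice but differ by one base in chr2.
    build_a = {"chr1": "ACGTACGT", "chr2": "GGCCTTAA"}
    build_b = {"chr1": "ACGTACGT", "chr2": "GGCCTTAG"}
    print(collection_identifier(build_a))
    print(collection_identifier(build_b))
```

Anyone holding the same sequences can recompute the same identifier without consulting a central registry, while a subtle difference between two builds yields a different identifier.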
“The process of mapping data to coordinate systems derived from genome sequences is central to many workflows in fields such as epigenomics, transcriptomics, and biodiversity genomics,” said Gundersen. “Refget Sequence Collections also aims to address long-standing issues related to assessing the compatibility of data files with genome browsers and other coordinate-based analysis tools.”
The product enables automated genomic analysis, replacing the cumbersome process of manually detecting and validating compatible reference sequences.
“The refget Sequence Collections standard helps researchers easily confirm the exact assembly they are working with and helps to pinpoint and resolve discrepancies that could impact not only research but also clinical decisions down the line,” said Yates.
Think of a sequence as a building block with a specific height, length, width, and colour. An architect chooses a collection of these building blocks to form a structure, just as a researcher chooses a collection of sequences to construct a genomic analysis. Another architect chooses another set of building blocks to form their own structure. Each architect might build a different structure, or a similar structure with the blocks placed in a different order. Refget Sequence Collections helps researchers determine whether they have been using the same set of building blocks, that is, the same collection of sequences, in their genomic analyses.
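The building-block comparison can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: it uses a plain SHA-512 hex digest rather than the identifier format of the standard, and `compare_collections` is a hypothetical helper, not part of any GA4GH API.

```python
import hashlib
from typing import Dict, List


def block_digest(sequence: str) -> str:
    """Fingerprint one 'building block' (plain SHA-512 hex, for illustration only)."""
    return hashlib.sha512(sequence.upper().encode("ascii")).hexdigest()


def compare_collections(a: List[str], b: List[str]) -> Dict[str, bool]:
    """Report whether two collections use the same blocks, and in the same order."""
    digests_a = [block_digest(seq) for seq in a]
    digests_b = [block_digest(seq) for seq in b]
    return {
        # Same blocks regardless of placement (sorted to ignore order).
        "same_blocks": sorted(digests_a) == sorted(digests_b),
        # Same blocks in exactly the same positions.
        "same_order": digests_a == digests_b,
    }


if __name__ == "__main__":
    first = ["ACGT", "GGCC"]
    reordered = ["GGCC", "ACGT"]
    different = ["ACGT", "GGCA"]
    print(compare_collections(first, reordered))  # same_blocks True, same_order False
    print(compare_collections(first, different))  # same_blocks False, same_order False
```

Comparing digests rather than raw sequences keeps the check cheap even when each block is an entire chromosome.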
Yates continued, “Ensuring absolute clarity about which reference sequences are being used can lead to more consistent, reproducible research and in a healthcare setting this translates into more accurate diagnostics and better patient outcomes.”
The next phase for the refget team is to incorporate refget Sequence Collections into file format specifications, including CRAM, BAM, SAM, and VCF. The team will also look to advance the standard further by developing algorithms that assign identifiers to human reference pangenomes, which are collections of all the DNA sequences, or all of the genomes, in a group of individuals.
The product team is also aiming to align some of their concepts with the GA4GH Variation Representation Specification (VRS) to define variation with respect to a reference sequence. Through refget Sequence Collections, researchers can create assets, including genes or variations, that explicitly link to the reference genome with which they are associated. Cezard said, “Eventually all of these standards become part of the infrastructure that will be leveraged to make better clinical calls.”