refget Sequence Collections is an approved GA4GH product

27 Mar 2025

Refget Sequence Collections, a newly approved GA4GH product, creates standardised identifiers for collections of reference sequences to facilitate data validation and reproducible research.

By Jaclyn Estrin, GA4GH Science Writer

The Global Alliance for Genomics and Health (GA4GH) is pleased to announce the product approval of refget Sequence Collections. 

Approved by the Product Steering Committee, refget Sequence Collections joins the ranks of nearly forty other products developed by members of GA4GH Work Streams to enable broad, responsible sharing of genomic and related health data. GA4GH products are designed to harness the power of global genomic data to foster improvements in human health outcomes.

When researchers embark on a new genomic analysis, they often seek to compare their sequence data to a reference sequence. Reference sequences are commonly used to interpret biological data such as an individual’s genetic makeup.

Comparison against a reference genome helps researchers determine where genomic variations exist and better understand how these differences contribute to genetic diseases.

However, since the beginning of genomic research, institutions have used different naming conventions for individual reference sequences, whether a single chromosome or another sequence of DNA, mRNA, or protein. As a result, identifying and locating appropriate reference sequences becomes a time-consuming, manual process of data integration and comparison.

GA4GH’s Large-Scale Genomics (LSG) Work Stream developed the refget Sequences API in 2018 to respond to this challenge. Refget Sequences uses an algorithm to assign a unique identifier, like a fingerprint, to a single sequence. The accompanying API also supports the reverse lookup, retrieving the original sequence from its identifier.
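To make the fingerprint idea concrete, here is a minimal Python sketch of a content-derived sequence identifier. It assumes the digest scheme commonly referred to as sha512t24u (SHA-512, truncated to 24 bytes, base64url-encoded), uppercase normalisation, and an `SQ.` prefix. The function names are illustrative rather than taken from any official library, and the refget specification remains the authority on the exact rules.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to the first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(data).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_identifier(sequence: str) -> str:
    """Illustrative helper: derive an identifier from sequence content alone."""
    # Assumption: sequences are normalised to uppercase before hashing.
    normalised = sequence.upper().encode("ascii")
    return "SQ." + sha512t24u(normalised)


if __name__ == "__main__":
    # The same content always yields the same identifier, whoever computes it.
    print(sequence_identifier("ACGTACGT"))
    print(sequence_identifier("acgtacgt") == sequence_identifier("ACGTACGT"))
```

Because the identifier depends only on the sequence content, two institutions that label the same chromosome differently would still arrive at the same fingerprint.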

The newest approved GA4GH product, refget Sequence Collections, is an expansion of the refget Sequences API. The product development was led by Nathan Sheffield (University of Virginia), Timothee Cezard (EMBL’s European Bioinformatics Institute), Andy Yates (EMBL’s European Bioinformatics Institute), Sveinung Gundersen (ELIXIR Norway; University of Oslo), Shakuntala Baichoo (Peter Munk Cardiac Centre – Artificial Intelligence), and Rob Davies (Wellcome Sanger Institute), with support from LSG Work Stream Manager Reggan Thomas (EMBL’s European Bioinformatics Institute) and Work Stream Co-Leads Oliver Hofmann (University of Melbourne) and Geraldine Van der Auwera (Seqera).

While the refget Sequences API provides a single name for a single reference sequence (for instance, a single chromosome), refget Sequence Collections assigns a name to a collection of reference sequences (for instance, a set of chromosomes, which make up a genome or assembly).

“Instead of relying on an ambiguous identifier, like hg38 or GRCh38, which could be used to refer to genomes with subtle but important differences, the sequence collections standard provides a deterministic, content-derived identifier that would reflect those differences,” explained Sheffield. “Furthermore, these identifiers can be determined by anyone, so the naming does not rely on a centralised authority.” Because identifiers are computed from content rather than issued centrally, the standard can serve a broader array of use cases than current approaches.

Refget Sequence Collections uses a standard algorithm to assign a unique identifier to a collection of reference sequences. The identifier can represent anything expressed as a collection of sequences, including genomes, transcriptomes, or proteomes, and these collections can then be retrieved and compared.
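A simplified sketch of how such a collection identifier could be derived is shown below. It assumes a layered approach: each attribute array is serialised to canonical JSON and digested, and the resulting small object of digests is digested again. The choice of `names` and `sequences` as the contributing attributes, the use of `json.dumps` with sorted keys as a stand-in for full JSON canonicalisation, and the function names are all assumptions made for illustration; the specification defines the actual schema and rules.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    # Stand-in for a full canonicalisation scheme: sorted keys, no whitespace.
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def collection_identifier(names: list[str], sequences: list[str]) -> str:
    """Illustrative helper: one identifier for an ordered set of named sequences."""
    # Step 1: fingerprint each sequence on its own (assumed uppercase, SQ. prefix).
    seq_digests = ["SQ." + sha512t24u(s.upper().encode("ascii")) for s in sequences]
    # Step 2: digest each attribute array separately.
    attribute_digests = {
        "names": sha512t24u(canonical_json(names)),
        "sequences": sha512t24u(canonical_json(seq_digests)),
    }
    # Step 3: digest the object of attribute digests to get the final identifier.
    return sha512t24u(canonical_json(attribute_digests))


if __name__ == "__main__":
    # Toy two-sequence "genome"; real collections hold whole chromosomes.
    print(collection_identifier(["chr1", "chr2"], ["ACGT", "GGCC"]))
```

Under these assumptions, renaming a sequence or reordering the collection changes the final identifier, which is how subtle differences between otherwise similar assemblies would become visible.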

“The process of mapping data to coordinate systems derived from genome sequences is central to many workflows in fields such as epigenomics, transcriptomics, and biodiversity genomics,” said Gundersen. “Refget Sequence Collections also aims to address long-standing issues related to assessing the compatibility of data files with genome browsers and other coordinate-based analysis tools.”

The product enables automated genomic analysis in place of the cumbersome process of manually detecting and validating compatible reference sequences.

“The refget Sequence Collections standard helps researchers easily confirm the exact assembly they are working with and helps to pinpoint and resolve discrepancies that could impact not only research but also clinical decisions down the line,” said Yates. 

Think of a sequence as a building block with a specific height, length, width, and colour. An architect chooses a collection of these building blocks to form a structure, just as a researcher chooses a collection of sequences to construct a genomic analysis. Another architect chooses another set of building blocks to form their own structure. The two might build different structures, or similar structures with the blocks placed in a different order. Refget Sequence Collections helps researchers identify whether they have been using the same set of building blocks, or collection of sequences, in their genomic analyses.
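The building-block comparison can be sketched in a few lines of Python. The example below is a hypothetical illustration, not the comparison output defined by the standard: it assumes each collection is given as an ordered mapping from sequence name to sequence identifier, and the identifiers shown are made-up placeholders.

```python
def compare_collections(a: dict[str, str], b: dict[str, str]) -> dict[str, bool]:
    """Hypothetical helper: report how two collections relate to each other.

    Each argument maps a sequence name to its content-derived identifier,
    in the order the sequences appear in the collection.
    """
    a_ids, b_ids = list(a.values()), list(b.values())
    return {
        # Same building blocks, regardless of order or naming.
        "same_sequences": sorted(a_ids) == sorted(b_ids),
        # Same building blocks placed in the same order.
        "same_order": a_ids == b_ids,
        # Same labels on the blocks, in the same order.
        "same_names": list(a) == list(b),
    }


if __name__ == "__main__":
    # Placeholder identifiers; real ones would be computed from sequence content.
    lab_one = {"chr1": "SQ.aaa", "chr2": "SQ.bbb"}
    lab_two = {"2": "SQ.bbb", "1": "SQ.aaa"}
    print(compare_collections(lab_one, lab_two))
```

In this toy case the two labs use the same blocks under different names and in a different order, so only `same_sequences` comes back true.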

Yates continued, “Ensuring absolute clarity about which reference sequences are being used can lead to more consistent, reproducible research and in a healthcare setting this translates into more accurate diagnostics and better patient outcomes.”

The next phase for the refget team is to incorporate refget Sequence Collections into file format specifications, including CRAM, BAM, SAM, and VCF. The team will also look to advance the standard further by developing algorithms that assign identifiers to human reference pangenomes, which are collections of all the DNA sequences, or all of the genomes, in a group of individuals.

The product team is also aiming to align some of their concepts with the GA4GH Variation Representation Specification (VRS) to define variation with respect to a reference sequence. Through refget Sequence Collections, researchers can create assets, including genes or variations, that explicitly link to the reference genome with which they are associated. Cezard said, “Eventually all of these standards become part of the infrastructure that will be leveraged to make better clinical calls.”
