refget Sequence Collections

Solves naming chaos for genomes by generating unique identifiers for collections of sequences

When working with large quantities of DNA and protein sequences, genomic researchers need to identify and organise them into groups, typically containing all the chromosomes of an individual. Naming a group lets researchers keep track of which sequences they used in their research. Each sequence in a group has its own attributes, such as a name, a length, and the sequence itself. Different researchers may give the same name to groups that contain different sequences, which makes it hard to tell apart groups built by different researchers. Refget Sequence Collections creates a unique identifier, like a fingerprint, for the entire group of sequences, establishing an unambiguous way to identify the collection. This makes it easier to share, compare, and understand biological data, and ensures that everyone is talking about the same group of sequences.

A companion tool to refget Sequence Collections, refget Sequences provides a single name for a single reference sequence.

Benefits

  • Removes ambiguity in identifying and referencing a reference genome
  • Enables researchers to compare, share, and analyse collections of sequence data

Target users

Data generators, data custodians, and developers

A diagram depicting how refget Sequence Collections works
Product Leads
  • Nathan Sheffield
  • Timothe Cezard
  • Andy Yates
  • Sveinung Gundersen

Community resources

Refget Sequence Collections (SeqCol) aims to standardise identifiers for collections of sequences. The primary use case for SeqCol is a reference genome, but it can identify genomes, transcriptomes, or proteomes: anything that can be represented as a collection of sequences. We provide an algorithm for encoding sequence collection identifiers, an API describing lookup and comparison of sequence collections, and recommended ancillary attributes to decorate sequence collections. SeqCol extends the methods of refget Sequences to collections of sequences, meaning identifiers are computed from the collection's own attributes, such as sequence names and content. As in refget Sequences, SeqCol digests are defined by a hash algorithm rather than assigned by an accession authority.
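
To illustrate what a content-derived identifier means in practice, here is a minimal Python sketch. It is not the published algorithm. It assumes the GA4GH truncated digest (SHA-512, first 24 bytes, base64url-encoded), uses Python's sorted, compact JSON as a stand-in for formal JSON canonicalisation, and hashes every attribute it is given. The function names, the "SQ." prefix on sequence digests, and the example data are illustrative assumptions. The specification itself defines which attributes contribute to the identifier and how they are serialised.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded (32 characters)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    # Assumption: sorted keys and compact separators approximate the
    # canonical serialisation that the specification requires.
    return json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def collection_digest(collection: dict) -> str:
    # Step 1: digest each attribute array on its own.
    attribute_digests = {
        name: sha512t24u(canonical_json(values))
        for name, values in collection.items()
    }
    # Step 2: digest the object that maps attribute names to those digests.
    return sha512t24u(canonical_json(attribute_digests))


# Hypothetical two-sequence collection; sequences are stored as digests.
example = {
    "names": ["chr1", "chr2"],
    "lengths": [4, 3],
    "sequences": ["SQ." + sha512t24u(b"ACGT"), "SQ." + sha512t24u(b"GGA")],
}
renamed = dict(example, names=["1", "2"])

print(collection_digest(example))
print(collection_digest(example) == collection_digest(renamed))  # False
```

Because the identifier is computed from the content, two groups that share a label but differ in names or sequences receive different identifiers, and anyone can recompute the identifier without consulting a central registry.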


News, events, and more

Catch up with all news and articles associated with refget Sequence Collections.

27 Mar 2025
refget Sequence Collections is an approved GA4GH product
27 Mar 2025
Variation Representation Specification (VRS) v2.0 is an approved GA4GH product