Common reasons for using the Nucleotide database include:
Nucleotide collects and organizes multiple types of nucleotide sequences from sources within NCBI and beyond and provides an interface for searching, visualization, and analysis.
Reference sequences (RefSeqs) are a substantial and important subset of sequences in the Nucleotide database. Their importance lies in the fact that they are authoritative representations derived from researcher-submitted sequences.
As a whole, the pool of researcher-submitted sequences represented by INSDC repositories like GenBank is redundant, with multiple sequences covering the same genomic region or transcript. In addition, individual researcher-submitted sequences may be incomplete or have sequencing errors. In RefSeqs, redundancies are removed, incompleteness is resolved, and errors are corrected.
The representativeness and quality of RefSeqs make them stable, trusted reference points for research.
RefSeqs that appear in Nucleotide or in the literature have distinctive accession numbers. Understanding how these numbers are structured can help you quickly identify both the molecule type being described and some information about how the RefSeq was derived.
These numbers consist of a two-letter prefix followed by an underscore, a set of six or nine digits, a period, and a version number. In the case of the accession number NM_183124.4, "NM" indicates the molecule type (i.e., protein-coding transcript, or mRNA) and staff-curated processing; "183124" is a six-digit identifier; and the final "4" is the version number.
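The structure described above can be sketched as a small parser. This is a minimal illustration, not an NCBI tool: the pattern follows only the description in this guide (two-letter prefix, underscore, six or nine digits, version), and the names `REFSEQ_PATTERN`, `RefSeqAccession`, and `parse_refseq_accession` are hypothetical. Some real accessions may use other layouts and would be rejected by this pattern.

```python
import re
from typing import NamedTuple

# Assumed pattern, taken from the description in this guide:
# two uppercase letters, an underscore, six or nine digits,
# a period, and a numeric version.
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d{6}|\d{9})\.(\d+)$")


class RefSeqAccession(NamedTuple):
    prefix: str      # e.g. "NM"
    identifier: str  # kept as text so leading zeros are preserved
    version: int     # e.g. 4


def parse_refseq_accession(accession: str) -> RefSeqAccession:
    """Split an accession such as NM_183124.4 into its three parts."""
    match = REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        raise ValueError(f"Not in the expected RefSeq format: {accession!r}")
    prefix, identifier, version = match.groups()
    return RefSeqAccession(prefix, identifier, int(version))


if __name__ == "__main__":
    print(parse_refseq_accession("NM_183124.4"))
```

The identifier is stored as a string rather than an integer because nine-digit identifiers can begin with zeros.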
The following table summarizes RefSeq accession numbers.
Accession Prefix | Molecule Type | Description |
---|---|---|
AC_ | DNA | Complete genomic molecule, usually alternate assembly |
NC_ | DNA | Complete genomic molecule, usually reference assembly |
NG_ | DNA | Incomplete genomic region |
NT_ | DNA | Contig or scaffold, clone-based or whole genome shotgun sequence data |
NW_ | DNA | Contig or scaffold, primarily whole genome shotgun sequence data |
NZ_ | DNA | Complete genomes and unfinished whole genome shotgun sequence data. Used predominantly for prokaryotic genomes |
NM_ | mRNA | Protein-coding transcripts (usually curated) |
NR_ | RNA | Non-protein-coding transcripts, including lncRNAs, structural RNAs, transcribed pseudogenes, and transcripts with unlikely protein-coding potential from protein-coding genes |
XM_ | mRNA | Computationally predicted model protein-coding transcripts |
XR_ | RNA | Computationally predicted model non-protein-coding transcripts |
Note: The text above is adapted from two sources: (1) Pruitt K, Brown G, Tatusova T, et al. The Reference Sequence (RefSeq) Database. 2002 Oct 9 [Updated 2012 Apr 6]. In: McEntyre J, Ostell J, editors. The NCBI Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2002-. Chapter 18. Available from: https://www.ncbi.nlm.nih.gov/books/NBK21091/ and (2) O'Leary NA, Wright MW, Brister JR, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733-D745. doi:10.1093/nar/gkv1189
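The prefix table can also be expressed as a lookup, which may be convenient when sorting a list of accessions by molecule type. The sketch below transcribes the table from this guide (two descriptions are shortened); the names `REFSEQ_PREFIXES`, `describe_accession`, and `is_predicted_model` are hypothetical and not part of any NCBI library.

```python
# Prefix -> (molecule type, description), transcribed from the table above.
REFSEQ_PREFIXES = {
    "AC": ("DNA", "Complete genomic molecule, usually alternate assembly"),
    "NC": ("DNA", "Complete genomic molecule, usually reference assembly"),
    "NG": ("DNA", "Incomplete genomic region"),
    "NT": ("DNA", "Contig or scaffold, clone-based or whole genome shotgun sequence data"),
    "NW": ("DNA", "Contig or scaffold, primarily whole genome shotgun sequence data"),
    "NZ": ("DNA", "Complete genomes and unfinished whole genome shotgun sequence data"),
    "NM": ("mRNA", "Protein-coding transcripts (usually curated)"),
    "NR": ("RNA", "Non-protein-coding transcripts"),
    "XM": ("mRNA", "Computationally predicted model protein-coding transcripts"),
    "XR": ("RNA", "Computationally predicted model non-protein-coding transcripts"),
}


def describe_accession(accession: str) -> str:
    """Return a one-line summary based on the accession's prefix."""
    prefix = accession.strip().upper().split("_", 1)[0]
    entry = REFSEQ_PREFIXES.get(prefix)
    if entry is None:
        return f"{prefix}: prefix not in the table"
    molecule_type, description = entry
    return f"{prefix}_ ({molecule_type}): {description}"


def is_predicted_model(accession: str) -> bool:
    """True for the computationally predicted prefixes (XM_ and XR_)."""
    return accession.strip().upper().startswith(("XM_", "XR_"))


if __name__ == "__main__":
    print(describe_accession("NM_183124.4"))
```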
An easy way to find RefSeqs in Nucleotide is to use the "Source databases" limit that appears on the left of the page after launching a search. Clicking the "RefSeq" link restricts the results to sequences from the RefSeq database. In the case below, the 203 results of the original search are reduced to 95 by applying the RefSeq limit.
For another way to find RefSeqs in Nucleotide, see the "A Search Example in Five Steps" section of this guide for a description of how to add a RefSeq filter to your Nucleotide search.
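For readers who script their searches, the same restriction can be sketched as a query string. This example rests on assumptions that should be checked against current NCBI documentation: that `srcdb_refseq[PROP]` is accepted as a RefSeq filter term, and that the E-utilities `esearch.fcgi` endpoint with `db=nuccore` reaches the Nucleotide database. The search term "insulin" is only a placeholder, and the function names are hypothetical. The code builds a URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed E-utilities search endpoint; verify against NCBI documentation.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed filter term that limits Nucleotide results to RefSeq records.
REFSEQ_FILTER = "srcdb_refseq[PROP]"


def with_refseq_filter(term: str) -> str:
    """Wrap a search term and AND it with the RefSeq filter."""
    return f"({term}) AND {REFSEQ_FILTER}"


def build_search_url(term: str) -> str:
    """Build (but do not send) a Nucleotide search URL limited to RefSeqs."""
    params = {
        "db": "nuccore",  # assumed database name for Nucleotide
        "term": with_refseq_filter(term),
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # "insulin" is a placeholder search term for illustration only.
    print(build_search_url("insulin"))
```

The filtered term produced by `with_refseq_filter` can also be pasted directly into the Nucleotide search box.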