Welch Medical Library Guides: Finding Sequence Using NCBI Nucleotide: Overview of the Nucleotide Database

Reasons for Using Nucleotide

Common reasons for using the Nucleotide database include:

Finding nucleotide sequences individually or in batch
Highlighting sequence features (exons, coding sequences, mRNA, etc.)
Finding sub-sequences or patterns in sequences
Using BLAST to find similarities across sequences
Designing PCR primers
Downloading sequences for further analysis

Nucleotide's Scope

Nucleotide collects and organizes multiple types of nucleotide sequences from sources within NCBI and beyond and provides an interface for searching, visualization, and analysis.

Types of Sequences in Nucleotide

Genomic DNA and RNA
mRNA
cRNA
ncRNA
rRNA
tRNA
Transcribed RNA

Sources of Nucleotide Content

Primary Sequence

International Nucleotide Sequence Database Collaboration (INSDC)
A cooperative initiative for maintaining archives of researcher-submitted nucleotide sequences. Its partner databases include the DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA), and NCBI's GenBank. GenBank records are the primary sequence content of NCBI's Nucleotide database.
Protein Data Bank (PDB)
An archive of researcher-submitted 3D structures of proteins, nucleic acids, and complex assemblies. Nucleic acid sequences associated with PDB records are searchable in NCBI's Nucleotide database.

Annotated Sequence

Reference Sequence Database (RefSeq)
Contains curated, non-redundant sequences for genomic DNA, transcripts (RNA), and proteins, providing stable reference points for further study. RefSeq records are derived from sequences submitted to the International Nucleotide Sequence Database Collaboration (INSDC). RefSeq sequences are discoverable by searching either the stand-alone RefSeq database or the Nucleotide database.
Third Party Annotation (TPA) Database
Contains sequences built from the existing primary sequence data in GenBank (part of the International Nucleotide Sequence Database Collaboration). The sequences and corresponding annotations are experimentally supported and have been published in a peer-reviewed scientific journal. TPA records are retrieved through the Nucleotide database.

Reference Sequences

The Value of Reference Sequences

Reference sequences (RefSeqs) are a substantial and important subset of sequences in the Nucleotide database. Their importance lies in the fact that they are authoritative representations derived from researcher-submitted sequences.

As a whole, the pool of researcher-submitted sequences held in INSDC repositories like GenBank is redundant, with multiple sequences covering the same genomic region or transcript. In addition, individual researcher-submitted sequences may be incomplete or contain sequencing errors. In RefSeqs, redundancies are removed, incompleteness is resolved, and errors are corrected.

The representativeness and quality of RefSeqs make them stable, trusted reference points for research.

RefSeq Accession Numbers

RefSeqs that appear in Nucleotide or in the literature have distinctive accession numbers. Understanding how these numbers are structured can help you quickly identify both the molecule type being described and some information about how the RefSeq was derived.

These numbers consist of a two-letter prefix followed by an underscore, a set of six or nine digits, and, after a period, a version number. In the accession number NM_183124.4, "NM" indicates the molecule type (a protein-coding transcript, or mRNA) and staff-curated processing; "183124" is a six-digit identifier; and the final "4" is the version number.
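The structure described above can be sketched in a few lines of Python. This is only an illustration: the regular expression is inferred from the description in this guide and is not an official NCBI grammar, the function name is hypothetical, and some real accessions (for example, certain NZ_ records) may not fit this simplified pattern.

```python
import re

# Pattern inferred from the guide's description: two uppercase letters, an
# underscore, six or nine digits, then a period and a version number.
# Illustrative assumption only, not an official NCBI specification.
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d{6}|\d{9})\.(\d+)$")


def parse_refseq_accession(accession):
    """Split an accession such as 'NM_183124.4' into its three parts."""
    match = REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        raise ValueError(f"Not in the expected RefSeq format: {accession!r}")
    prefix, identifier, version = match.groups()
    return {"prefix": prefix, "identifier": identifier, "version": int(version)}


print(parse_refseq_accession("NM_183124.4"))
# {'prefix': 'NM', 'identifier': '183124', 'version': 4}
```

Keeping the version as a separate field makes it easy to tell whether two accessions refer to the same record at different versions.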

The following table summarizes RefSeq accession numbers.

Accession Prefix	Molecule Type	Description
AC_	DNA	Complete genomic molecule, usually alternate assembly
NC_	DNA	Complete genomic molecule, usually reference assembly
NG_	DNA	Incomplete genomic region
NT_	DNA	Contig or scaffold, clone-based or whole genome shotgun sequence data
NW_	DNA	Contig or scaffold, primarily whole genome shotgun sequence data
NZ_	DNA	Complete genomes and unfinished whole genome shotgun sequence data. Used predominantly for prokaryotic genomes
NM_	mRNA	Protein-coding transcripts (usually curated)
NR_	RNA	Non-protein-coding transcripts, including lncRNAs, structural RNAs, transcribed pseudogenes, and transcripts with unlikely protein-coding potential from protein-coding genes
XM_	mRNA	Computationally predicted model protein-coding transcripts
XR_	RNA	Computationally predicted model non-protein-coding transcripts

Note: The text above is adapted from two sources: (1) Pruitt K, Brown G, Tatusova T, et al. The Reference Sequence (RefSeq) Database. 2002 Oct 9 [Updated 2012 Apr 6]. In: McEntyre J, Ostell J, editors. The NCBI Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2002-. Chapter 18. Available from: https://www.ncbi.nlm.nih.gov/books/NBK21091/ and (2) O'Leary NA, Wright MW, Brister JR, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733-D745. doi:10.1093/nar/gkv1189
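As a convenience, the prefix table above can be turned into a small lookup. The following is a minimal sketch: the dictionary is transcribed from the table in this guide, and the function name `describe_prefix` is hypothetical. Prefixes not listed in the table simply return `None`.

```python
# Prefix -> (molecule type, description), transcribed from the table in this guide.
REFSEQ_PREFIXES = {
    "AC": ("DNA", "Complete genomic molecule, usually alternate assembly"),
    "NC": ("DNA", "Complete genomic molecule, usually reference assembly"),
    "NG": ("DNA", "Incomplete genomic region"),
    "NT": ("DNA", "Contig or scaffold, clone-based or whole genome shotgun sequence data"),
    "NW": ("DNA", "Contig or scaffold, primarily whole genome shotgun sequence data"),
    "NZ": ("DNA", "Complete genomes and unfinished whole genome shotgun sequence data"),
    "NM": ("mRNA", "Protein-coding transcripts (usually curated)"),
    "NR": ("RNA", "Non-protein-coding transcripts"),
    "XM": ("mRNA", "Computationally predicted model protein-coding transcripts"),
    "XR": ("RNA", "Computationally predicted model non-protein-coding transcripts"),
}


def describe_prefix(accession):
    """Return (molecule type, description) for an accession, or None if unlisted."""
    # Everything before the first underscore is treated as the prefix.
    prefix = accession.split("_", 1)[0].upper()
    return REFSEQ_PREFIXES.get(prefix)


molecule_type, description = describe_prefix("NM_183124.4")
print(molecule_type)  # mRNA
print(description)    # Protein-coding transcripts (usually curated)
```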

Searching for Reference Sequences in Nucleotide

One easy way to find RefSeqs in Nucleotide is to use the "Source databases" limit that appears on the left of the results page after you run a search. Clicking the "RefSeq" link restricts the results to sequences from the RefSeq database. In the guide's example search, applying the RefSeq limit reduces the original 203 results to 95.

For another way to find RefSeqs in Nucleotide, see the "A Search Example in Five Steps" section of this guide for a description of how to add a RefSeq filter to your Nucleotide search.
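This guide describes the web interface, but the same kind of RefSeq restriction can also be expressed programmatically. The sketch below only builds a search URL and makes no network request. It assumes NCBI's E-utilities `esearch` endpoint, the Entrez database name `nuccore` for Nucleotide, and the `srcdb_refseq[PROP]` filter term. The function name and the query `BRCA1` are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_refseq_search_url(query, retmax=20):
    """Build an esearch URL that limits a Nucleotide query to RefSeq records."""
    # Parenthesize the user's query so the RefSeq filter applies to all of it.
    term = f"({query}) AND srcdb_refseq[PROP]"
    params = {"db": "nuccore", "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# "BRCA1" is a placeholder query used only for illustration.
print(build_refseq_search_url("BRCA1"))
```

Building the URL separately from fetching it makes the filter easy to inspect before any request is sent. The same term, `(BRCA1) AND srcdb_refseq[PROP]`, can also be pasted into the Nucleotide search box.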