Finding Sequences Using NCBI Nucleotide

This guide introduces the NCBI Nucleotide database and provides a workflow for using it to find sequences.

Reasons for Using Nucleotide

Common reasons for using the Nucleotide database include:

  • Finding nucleotide sequences individually or in batch
  • Highlighting sequence features (exons, coding sequences, mRNA, etc.)
  • Finding sub-sequences or patterns in sequences
  • Using BLAST to find similarities across sequences
  • Designing PCR primers
  • Downloading sequences for further analysis

Nucleotide's Scope

Nucleotide collects and organizes multiple types of nucleotide sequences from sources within NCBI and beyond and provides an interface for searching, visualization, and analysis. 

Types of Sequences in Nucleotide
  • Genomic DNA and RNA
  • mRNA
  • cRNA
  • ncRNA
  • rRNA
  • tRNA
  • Transcribed RNA

Sources of Nucleotide Content
  • Primary Sequence
  • Annotated Sequence

Reference Sequences

The Value of Reference Sequences

Reference sequences (RefSeqs) are a substantial and important subset of sequences in the Nucleotide database. Their importance lies in the fact that they are authoritative representations derived from researcher-submitted sequences.

As a whole, the pool of researcher-submitted sequences held in INSDC repositories like GenBank is redundant, with multiple sequences covering the same genomic region or transcript. In addition, individual researcher-submitted sequences may be incomplete or contain sequencing errors. In RefSeqs, redundancies are removed, incompleteness is resolved, and errors are corrected.

The representativeness and quality of RefSeqs make them stable, trusted reference points for research.

RefSeq Accession Numbers

RefSeqs that appear in Nucleotide or in the literature have distinctive accession numbers. Understanding how these numbers are structured can help you quickly identify both the molecule type being described and some information about how the RefSeq was derived.

These numbers consist of a two-letter prefix followed by an underscore, a set of six or nine digits, a period, and a version number. In the accession number NM_183124.4, for example, "NM" indicates the molecule type (a protein-coding transcript, or mRNA) and staff-curated processing; "183124" is the six-digit identifier; and the final "4" is the version number.
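
The structure described above can be sketched as a small parser. This is a minimal illustration, not an NCBI tool: the pattern is taken only from the description in this guide (two letters, underscore, six or nine digits, version), the names `ACCESSION_RE`, `RefSeqAccession`, and `parse_refseq_accession` are hypothetical, and accessions that follow other formats are simply rejected.

```python
import re
from typing import NamedTuple

# Pattern assumed from the description in this guide: two uppercase letters,
# an underscore, six or nine digits, a period, and a version number.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d{6}|\d{9})\.(\d+)$")


class RefSeqAccession(NamedTuple):
    prefix: str      # e.g. "NM" (molecule type and processing)
    identifier: str  # kept as text so leading zeros are preserved
    version: int     # the number after the period


def parse_refseq_accession(text: str) -> RefSeqAccession:
    """Split a versioned RefSeq-style accession into its three parts."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a versioned RefSeq-style accession: {text!r}")
    prefix, identifier, version = match.groups()
    return RefSeqAccession(prefix, identifier, int(version))


print(parse_refseq_accession("NM_183124.4"))
# RefSeqAccession(prefix='NM', identifier='183124', version=4)
```

The identifier is stored as a string rather than an integer because nine-digit identifiers often begin with zeros, which an integer would drop.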

The following table summarizes RefSeq accession numbers.

Accession Prefix | Molecule Type | Description
AC_ | DNA | Complete genomic molecule, usually alternate assembly
NC_ | DNA | Complete genomic molecule, usually reference assembly
NG_ | DNA | Incomplete genomic region
NT_ | DNA | Contig or scaffold, clone-based or whole genome shotgun sequence data
NW_ | DNA | Contig or scaffold, primarily whole genome shotgun sequence data
NZ_ | DNA | Complete genomes and unfinished whole genome shotgun sequence data. Used predominantly for prokaryotic genomes
NM_ | mRNA | Protein-coding transcripts (usually curated)
NR_ | RNA | Non-protein-coding transcripts, including lncRNAs, structural RNAs, transcribed pseudogenes, and transcripts with unlikely protein-coding potential from protein-coding genes
XM_ | mRNA | Computationally predicted model protein-coding transcripts
XR_ | RNA | Computationally predicted model non-protein-coding transcripts

Note: The text above is adapted from two sources: (1) Pruitt K, Brown G, Tatusova T, et al. The Reference Sequence (RefSeq) Database. 2002 Oct 9 [Updated 2012 Apr 6]. In: McEntyre J, Ostell J, editors. The NCBI Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2002-. Chapter 18. Available from: https://www.ncbi.nlm.nih.gov/books/NBK21091/ and (2) O'Leary NA, Wright MW, Brister JR, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733-D745. doi:10.1093/nar/gkv1189
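
For readers who handle accession lists in scripts, the prefix summary in this section can be expressed as a lookup table. The following is a rough sketch only: the dictionary is transcribed from the table in this guide with shortened descriptions, and the names `REFSEQ_PREFIXES`, `describe_accession`, and `is_predicted_model` are hypothetical helpers, not part of any NCBI library.

```python
# Prefix -> (molecule type, short description), transcribed from the table
# in this guide. Descriptions are abbreviated for readability.
REFSEQ_PREFIXES = {
    "AC": ("DNA", "Complete genomic molecule, usually alternate assembly"),
    "NC": ("DNA", "Complete genomic molecule, usually reference assembly"),
    "NG": ("DNA", "Incomplete genomic region"),
    "NT": ("DNA", "Contig or scaffold, clone-based or WGS"),
    "NW": ("DNA", "Contig or scaffold, primarily WGS"),
    "NZ": ("DNA", "Complete genomes and unfinished WGS data"),
    "NM": ("mRNA", "Protein-coding transcripts (usually curated)"),
    "NR": ("RNA", "Non-protein-coding transcripts"),
    "XM": ("mRNA", "Predicted model protein-coding transcripts"),
    "XR": ("RNA", "Predicted model non-protein-coding transcripts"),
}


def _prefix_of(accession: str) -> str:
    """Return the text before the first underscore, upper-cased."""
    return accession.strip().split("_", 1)[0].upper()


def describe_accession(accession: str) -> str:
    """Return 'molecule type: description' for a RefSeq-style accession."""
    prefix = _prefix_of(accession)
    if prefix not in REFSEQ_PREFIXES:
        raise KeyError(f"prefix {prefix!r} is not in the table from this guide")
    molecule, description = REFSEQ_PREFIXES[prefix]
    return f"{molecule}: {description}"


def is_predicted_model(accession: str) -> bool:
    """True for the computationally predicted prefixes (XM_ and XR_)."""
    return _prefix_of(accession) in {"XM", "XR"}


print(describe_accession("NM_183124.4"))
print(is_predicted_model("NM_183124.4"))
```

A check like `is_predicted_model` can be useful when you want to separate curated records from computationally predicted ones before further analysis.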

Searching for Reference Sequences in Nucleotide

One easy way to find RefSeqs in Nucleotide is to use the "Source databases" limit that appears on the left of the page after you run a search. Clicking the "RefSeq" link restricts the results to sequences from the RefSeq database. In the example shown below, applying the RefSeq limit reduces the 203 results of the original search to 95.

For another way to find RefSeqs in Nucleotide, see the "A Search Example in Five Steps" section of this guide for a description of how to add a RefSeq filter to your Nucleotide search.


RefSeq limit on the Nucleotide results page
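
The same restriction can also be applied to scripted searches. The sketch below only builds a search URL for NCBI's E-utilities `esearch` endpoint and does not send any request. It assumes that appending `refseq[filter]` to the query term restricts results in the same way as the RefSeq limit on the web page; the function name `build_refseq_search_url` and the example query are hypothetical and are not taken from this guide.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_refseq_search_url(query: str, retmax: int = 20) -> str:
    """Build an esearch URL for Nucleotide limited to RefSeq records.

    Assumption: 'refseq[filter]' mirrors the RefSeq limit on the results page.
    """
    term = f"({query}) AND refseq[filter]"
    params = {
        "db": "nuccore",      # E-utilities name for the Nucleotide database
        "term": term,
        "retmax": retmax,     # maximum number of record IDs to return
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example query chosen for illustration only.
print(build_refseq_search_url("BRCA1[Gene Name] AND Homo sapiens[Organism]"))
```

Keeping the URL construction separate from the network call makes it easy to inspect the exact query before running it and to paste the same term into the Nucleotide search box for comparison.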