Finding Sequences Using NCBI Nucleotide

This guide introduces the NCBI Nucleotide database and provides a workflow for using it to find sequences.

Step 1: Identify Core Concepts

Before running a search in Nucleotide, think carefully about your information need, since how you define it affects how effective your search will be. It can help to write the information need down, both to keep it front of mind and to use it when building your search strategy.

For example, by writing down 

"I need to retrieve the mRNA RefSeq sequences for the defensin gene cluster in the mouse,"

you can clearly see a number of main ideas or concepts (mRNA, RefSeq, the defensin gene cluster, and the mouse) that you should include in your strategy for searching Nucleotide. Again, it may be helpful to write these concepts down, both to help you optimize your current search and to record what you've done for future reference.

The concepts for this information need can be written as

  • Concept 1: mRNA
  • Concept 2: RefSeq
  • Concept 3: Defensin genes
  • Concept 4: Mouse

The next step in the searching process is to generate search terms corresponding to these concepts. 

Step 2: Generate Search Terms

mRNA and RefSeq

The concepts for mRNA and RefSeq can be thought of as known items. mRNA is a known molecule type with specific kinds of characteristics (exons, coding sequence, etc.). RefSeq is both a known source database and a known type of sequence in Nucleotide with certain features (non-redundant, curated, etc.).

These known items are commonly used in searching and are therefore represented in Nucleotide as easily selectable filters. These filters can be selected from the search results page, or they can be applied to terms using a field tag (see "Step 3: Use Field Tags" below).

Defensin Genes

When generating search terms for gene names in Nucleotide, a good practice is to use approved symbols. Approved symbols represent the most up-to-date ways of expressing gene names, so they help ensure that your search will be both comprehensive and efficient. Approved symbols are also shorter and less complex (e.g., Defa4) than corresponding approved gene names (e.g., defensin, alpha, 4).

Approved gene symbols can be found by searching internationally-recognized resources for organism-focused genetic, genomic, phenotypic, and other biological data. The following table includes a sample of these resources.

Organism      Approved Gene Symbol Resource
Drosophila    FlyBase
Human         HUGO Gene Nomenclature Committee (HGNC)
Mouse         Mouse Genome Informatics (MGI)
Rat           Rat Genome Database (RGD)
Yeast         Saccharomyces Genome Database (SGD)
Zebrafish     Zebrafish Information Network (ZFIN)

Since the organism of interest in this search is the mouse, Mouse Genome Informatics (MGI) is an excellent source for approved gene symbols. A search of MGI using the word "defensin" yields over 100 results. These results take two forms, starting with either "Defa" (Defa1, Defa2, Defa3, etc.) or "Defb" (Defb1, Defb2, Defb3, etc.). To easily capture all of the variants of these two roots, truncation can be used to form the terms Defa* and Defb*.
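
The effect of truncation can be illustrated with a short Python sketch. It uses shell-style wildcard matching from the standard library only as a stand-in for the idea; it is not how Nucleotide itself processes truncated terms. The symbol list is a small hypothetical sample (the six symbols named above plus "Actb" as a non-matching control), not the full MGI result set.

```python
from fnmatch import fnmatchcase

# Hypothetical sample: six symbols named in the text, plus one control
# symbol ("Actb") that should not match either truncated term.
symbols = ["Defa1", "Defa2", "Defa3", "Defb1", "Defb2", "Defb3", "Actb"]

# The two truncated terms formed from the "Defa" and "Defb" roots.
patterns = ["Defa*", "Defb*"]


def matches_any(symbol, patterns):
    """Return True if the symbol matches at least one truncated term."""
    return any(fnmatchcase(symbol, pattern) for pattern in patterns)


matched = [s for s in symbols if matches_any(s, patterns)]
print(matched)
```

Two short patterns stand in for every numbered variant, which is why truncation keeps the search string compact even when a gene family has many members.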

Note: For more about truncation in this guide, see the "Tool 3: Search Term Formatting" section of the "Six Core Tools for Searching" box.

Mouse

Like mRNA and RefSeq, mouse sequences can be retrieved using a filter on the search results page or they can be retrieved by applying the "Organism" field tag (see "Step 3: Use Field Tags" below).

Step 3: Use Field Tags

Field tags can now be applied to the search terms generated for each concept as a way to build an efficient and focused search. See the table below for the search terms, their corresponding field tags, and the corresponding search string for each concept.

Concept         Search Terms      Field Tags   Concept Search String
mRNA            mRNA              [Filter]     mRNA[Filter]
RefSeq          RefSeq            [Filter]     RefSeq[Filter]
Defensin Genes  Defa* OR Defb*    [Gene Name]  Defa*[Gene Name] OR Defb*[Gene Name]
Mouse           "Mus musculus"    [Organism]   "Mus musculus"[Organism]

Note: The organism field tag works with both common and scientific names. In this case, we use the scientific name to focus on a particular species. We could have also used Mouse[Organism], but this corresponds to the broader Mus genus taxonomy found in NCBI's Taxonomy database.

Note: For more about field tags in this guide, see the "Tool 2: Field Tags" section of the "Six Core Tools for Searching" box.
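
If you build searches like this often, the concept search strings in the table above can be assembled programmatically. The following is a minimal Python sketch; the `concepts` dictionary and the `concept_string` helper are names made up for this illustration and are not part of any NCBI tool.

```python
# Each concept maps to (search terms, field tag), mirroring the table above.
concepts = {
    "mRNA": (["mRNA"], "Filter"),
    "RefSeq": (["RefSeq"], "Filter"),
    "Defensin Genes": (["Defa*", "Defb*"], "Gene Name"),
    "Mouse": (['"Mus musculus"'], "Organism"),
}


def concept_string(terms, field):
    """Attach the field tag to every term and join alternatives with OR."""
    return " OR ".join(f"{term}[{field}]" for term in terms)


search_strings = {
    name: concept_string(terms, field)
    for name, (terms, field) in concepts.items()
}

for name, string in search_strings.items():
    print(f"{name}: {string}")
```

Note that the field tag is attached to each term individually, so both Defa* and Defb* are restricted to the Gene Name field.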

Step 4: Use the Advanced Search Builder

Now that you have the proper search strings for each of your concepts, you can enter them in Nucleotide's Advanced Search Builder.

A good practice is to run the search for each of your concepts one at a time by using the "Add to history" feature. As illustrated in the screenshot below, this enables you to check for errors: an unexpectedly low number of results may indicate a misspelling or other error. As expected, we retrieve large numbers of results for our mRNA, RefSeq, and mouse concepts (searches #1, #2, and #4) and far fewer for our defensin genes concept (search #3).

By using the "Add" feature, we can combine the individual search strings (searches #1-4) to determine where they intersect using the AND Boolean operator. This is our final result set (search #5). Clicking on the "Items found" number (95) will show us the details of these results and allow us to work with them.  


Nucleotide's Advanced Search Builder Example Search


Note: When pasting your search strings into the search builder, you can leave the default on "All Fields," since you are using field tags on your search terms. Also note that when using the "Add" feature, the Boolean AND is the default for combining search concepts.

Note: For more about the Advanced Search Builder in this guide, see the "Tool 5: Nucleotide's Advanced Search Builder" section of the "Six Core Tools for Searching" box.
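
The same AND combination can also be written out as a single search string. The Python sketch below does this and then builds a request URL for NCBI's E-utilities ESearch service, which is an alternative to the Advanced Search Builder and is not covered in this guide. Treat the E-utilities portion as an assumption to check against NCBI's documentation; the sketch only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# The four concept search strings from Step 3.
concept_strings = [
    "mRNA[Filter]",
    "RefSeq[Filter]",
    "Defa*[Gene Name] OR Defb*[Gene Name]",
    '"Mus musculus"[Organism]',
]


def combine_with_and(strings):
    """Wrap each concept in parentheses so its internal OR stays grouped."""
    return " AND ".join(f"({s})" for s in strings)


query = combine_with_and(concept_strings)

# Assumption: the E-utilities ESearch endpoint, with db=nuccore selecting
# the Nucleotide database and usehistory=y storing results on the server.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
url = ESEARCH + "?" + urlencode(
    {"db": "nuccore", "term": query, "usehistory": "y"}
)

print(query)
print(url)
```

The parentheses matter for the defensin concept: without them, the OR between Defa* and Defb* would no longer be grouped as one concept when combined with the other terms.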

Step 5: Work with Search Results

After generating a set of results in Nucleotide, you can work with individual records or with the result set as a whole.

Working with Individual Records

Opening an individual record from your set of search results allows you to analyze the particular sequence associated with that record. Options for sequence analysis include running BLAST to find similar sequences, picking primers for sequence amplification, and focusing on specific sequence features, such as exons. Sequence analysis options are highlighted in the following screenshot.


Individual record in Nucleotide with sequence analysis options highlighted


Working with Result Sets as a Whole

In addition to working with individual search records, you can work with all of your search results at once by exporting them as a single file. In the screenshot shown below, the sequences of all 95 search results are being downloaded as a file in FASTA format by using Nucleotide's "Send to" feature.


Nucleotide results page showing "Send to" feature for file download 
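
Once the FASTA file is downloaded, it can be read with a few lines of code. The following Python sketch parses FASTA-formatted text into header and sequence pairs. The `parse_fasta` function is a hypothetical helper written for this illustration, and the sample records are made-up placeholders, not real Nucleotide sequences.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap across rows
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder data: invented identifiers and bases for illustration only.
sample = """>example_1 hypothetical record one
ATGAAGACC
CTGGTC
>example_2 hypothetical record two
ATGAGG
"""

records = parse_fasta(sample)
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

For a real download, the text would come from reading the saved file instead of the `sample` string. A quick check is to confirm that the number of parsed records matches the number of search results you exported.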

Welch Medical Library Support

If you have questions about searching Nucleotide, other NCBI databases, EMBL-EBI databases, or web-based bioinformatics databases from other sources, please contact:

Rob Wright, Basic Science Informationist, rwrigh32@jhmi.edu.