The search example outlined in the companion section of this guide (see "A Search Example in Five Steps") demonstrates an approach for building targeted, multi-component searches in Nucleotide.
This approach involves first representing your research question as a series of major concepts. These concepts are translated into formatted search terms, which are then combined into a search query using Nucleotide's Advanced Search Builder.
Elements of the Advanced Search Builder include (1) a search fields list, (2) the "show index list" feature, and (3) built-in Boolean operators. These elements can improve your search efficiency in Nucleotide by helping you incorporate multiple focusing concepts at the same time.
Conceiving of your research question as a set of related main ideas or concepts is a helpful way to begin the searching process. By expressing the essence of your question in your search strategy, you ensure that your search results will all have these minimum essential components.
For example, consider the following research question:
"What are the known sequences of the VP2 gene from Chinese isolates of canine parvovirus type 2?"
It is clear that the desired results are sequences. However, this fact need not be reflected in the search strategy, because it is addressed by our choice of database: by searching Nucleotide, we are assured that our results will be sequences. The other desired attributes of our results are represented by the following concepts in our search strategy.
Note: This research question is taken from Hao X, He Y, Wang C, et al. The increasing prevalence of CPV-2c in domestic dogs in China. PeerJ. 2020;8:e9869. Published 2020 Sep 29. doi:10.7717/peerj.9869
The generation of search concepts may to some degree be informed by a knowledge of the structure of Nucleotide records and the searchable elements of that structure. Nucleotide records have over 30 types of elements or fields, most of which are directly searchable. Fields in Nucleotide records include those for accession numbers, sequence features, sequence source, and associated journal literature.
Field tags, represented as field names in brackets, can be used in a Nucleotide query to search these fields. Two commonly used field tags, [Organism] and [Gene Name], directly relate to the example search concepts above. Search terms related to search concepts 1 and 3 that are formatted with field tags would be written as "canine parvovirus type 2"[Organism] and "VP2"[Gene Name] respectively.
Field tag searching helps to avoid retrieving off-target results. For example, using the [Organism] field tag with the search term "mouse" avoids retrieving human sequences whose records mention similarity to mouse sequences, as well as bacterial sequences from the mouse gut metagenome.
The NCBI Help Manual provides a list of Nucleotide field tags and their definitions.
Note: GenBank, one of the sources of Nucleotide data, provides a sample GenBank record with annotations and searching tips.
Applying field tags is one way of formatting search terms. Such formatting provides Nucleotide with specific instructions on how it should handle the search terms you are using. In the case of field tags, the instructions are to search for terms only in the fields specified (e.g. "VP2"[Gene Name] only searches for VP2 in the gene name part of the record).
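As an illustration of the formatting described above, the following minimal Python sketch builds field-tagged search terms as plain strings. The helper name tag_term is hypothetical and is not part of any NCBI tool; the output is simply text that could be pasted into a Nucleotide search box.

```python
def tag_term(term: str, field: str) -> str:
    """Wrap a search term in quotation marks and append a bracketed field tag."""
    return f'"{term}"[{field}]'


# The two example terms from this guide.
organism = tag_term("canine parvovirus type 2", "Organism")
gene = tag_term("VP2", "Gene Name")

print(organism)
print(gene)
```

Building terms with a small helper like this keeps the quotation marks and brackets consistent when a search involves many terms.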
Additional search term formatting can broaden your search by searching for term variants. Placing an asterisk (*) at the end of a term will find all terms that begin with the characters before the asterisk. This process is called truncation, because the variants are generated from the shortened root of the term. Truncation can both save time and incorporate relevant terms you might not otherwise have thought of.
For example, if you are interested in finding the sequences of all of the members of the human golgin gene family whose gene names start with GOLGA and GOLGB, you could use the formatted search term GOLG*[Gene Name]. Otherwise, you would need to include each individual gene name in a long search string as follows: "GOLGA2"[Gene Name] OR "GOLGA4"[Gene Name] OR "GOLGB1"[Gene Name] OR "GOLGA3"[Gene Name] OR "GOLGA5"[Gene Name]...
Note: You should use truncation with caution. Doing a quick check of search results will let you know whether the truncation you've performed is inadvertently adding off-target terms to your search.
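To see why a quick check is worthwhile, the sketch below imitates truncation against a short list of gene names. The function expand_truncation and the list sample_index are hypothetical and for illustration only; the list is not an actual Nucleotide index, and the case-insensitive prefix match is an assumption about how truncation behaves.

```python
def expand_truncation(pattern: str, index_terms: list[str]) -> list[str]:
    """Return the index terms that a truncated pattern such as 'GOLG*' would match."""
    if not pattern.endswith("*"):
        raise ValueError("expected a truncated term ending in '*'")
    root = pattern[:-1].lower()  # assumption: matching ignores case
    return [t for t in index_terms if t.lower().startswith(root)]


# Hypothetical excerpt of a gene-name index, for illustration only.
sample_index = ["GOLGA2", "GOLGA4", "GOLGB1", "GOLM1", "GOLPH3", "GORAB"]

narrow = expand_truncation("GOLG*", sample_index)
broad = expand_truncation("GOL*", sample_index)

print(narrow)  # ['GOLGA2', 'GOLGA4', 'GOLGB1']
print(broad)   # ['GOLGA2', 'GOLGA4', 'GOLGB1', 'GOLM1', 'GOLPH3']
```

In this toy list, shortening the root by one letter pulls in GOLM1 and GOLPH3, which fall outside the GOLGA and GOLGB names the search is after. That is the kind of off-target addition the note above warns about.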
It's generally a good idea when using databases from NCBI to use quotation marks around search terms to ensure that they are searched as written. This is particularly true for phrases.
Note: You can't use quotation marks with truncation in Nucleotide.
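The two rules above (quote terms, but never quote a truncated term) can be combined in one small formatter. This is a minimal sketch; format_term is a hypothetical helper name, and it only assembles text for pasting into a search box.

```python
def format_term(term: str, field: str) -> str:
    """Format a term with a field tag, quoting it unless it is truncated."""
    if term.endswith("*"):
        # Truncated terms are left unquoted, per the note above.
        return f"{term}[{field}]"
    return f'"{term}"[{field}]'


print(format_term("GOLG*", "Gene Name"))
print(format_term("GORAB", "Gene Name"))
```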
The OR Boolean operator is typically used to connect the synonyms in a search concept. Sticking with the golgin example, the following is a search string for the human golgin gene family concept: GOLG*[Gene Name] OR "GORAB"[Gene Name] OR "LOC102724117"[Gene Name].
The AND operator is typically used to connect two or more concepts to find where they intersect. Suppose that the aim of the search is to find mRNA and ncRNA golgin sequences. In this case, a second concept related to sequence type could be combined with the concept related to the human golgin gene family using AND as follows: (GOLG*[Gene Name] OR "GORAB"[Gene Name] OR "LOC102724117"[Gene Name]) AND ("mRNA"[Filter] OR "ncRNA"[Filter]).
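The pattern above (OR within a concept, AND between concepts) can be sketched in a few lines of Python. The helper names any_of and all_of are hypothetical; the code only assembles the query text and does not contact NCBI.

```python
def any_of(terms: list[str]) -> str:
    """Join the synonyms of one concept with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(concepts: list[str]) -> str:
    """Join already-parenthesized concepts with AND."""
    return " AND ".join(concepts)


# Concept 1: the human golgin gene family (terms already formatted).
golgin = any_of(['GOLG*[Gene Name]', '"GORAB"[Gene Name]', '"LOC102724117"[Gene Name]'])
# Concept 2: the sequence types of interest.
seq_type = any_of(['"mRNA"[Filter]', '"ncRNA"[Filter]'])

query = all_of([golgin, seq_type])
print(query)
```

The parentheses matter: they keep each group of synonyms together so that AND combines whole concepts rather than individual terms.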
Nucleotide's Advanced Search Builder makes it easy to build and document complex queries through the following features:
When unsure of the spelling of a search term, you can type part of it into an advanced search box, select the appropriate field tag, and then click the "Show index list" link for an alphabetical list of options. For example, by typing "mus mus," choosing "Organism" from the search field drop-down, and clicking on "Show index list," you can select the correct spelling of the mouse species "mus musculus" (see below).
You can also use the index list feature to check for available entry types for a search field. For example, the index list for the "Filter" field shows many options for sequence type (e.g. mRNA, ncRNA, snoRNA, snRNA) and data source (e.g. DDBJ, PDB, RefSeq).