
Content Permissions for Text & Data Mining (TDM)

This guide explains the content permissions for text & data mining (TDM) at JHU.

Text & Data Mining (TDM) Data Sources

Many materials are freely available for TDM projects, but most of the library's database licenses do not allow text and/or data mining. Some licensed resources do permit TDM; availability varies by publisher, and publishers may impose their own restrictions. If you do not see the resource you are looking for on this TDM LibGuide, please contact the Scholarly Communications Librarian, Nancy Shin.

Freely Available Content vs. Library Licensed Content

Here is a list of freely available content for your text and data mining projects. In addition to the specific resources listed below, see this list of Open Access disciplinary repositories if you are looking for general scholarly literature. As a rule of thumb, double-check the terms of access and use even for freely available resources, as these terms can change over time.

Data Source: MIT TDM LibGuide and Washington University TDM LibGuide 

 Publisher 

Available Content 

Online Access Info 

 arXiv 

An open access archive/repository of non-peer-reviewed scholarly literature in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics

arXiv API 

 CaseLaw Access Project

All U.S. federal and state case law 

CaseLaw access policy 

 Congress.gov 

Structured info on legislators, bills, bill summaries, amendments, committee reports, appointee nominations, international treaties, and more.  

Congress.gov API documentation 

 CrossRef 

Metadata records with CrossRef DOIs 

CrossRef API documentation 

 HathiTrust Digital Library

Metadata, page images, and OCR for 17+ million digitized items in HathiTrust Digital Library 

HathiTrust data availability and APIs 

 Internet Archive

Wayback Machine, Open Library (books), and Internet Archive metadata 

Internet Archive developer’s portal  

 Library of Congress

Covers America’s historical newspapers 

Chronicling America API 

 Library of Congress

“LC for Robots” provides machine-readable access to the Library of Congress' digital collections, including images, laws and regulations, and bibliographic information 

LC for Robots Info 

 NCBI Developer Portal

The NCBI developer portal contains a variety of resources, such as APIs, software libraries, and datasets, that can be downloaded and accessed for computational use.

See their data use policy and usage guidelines 

NCBI Developer Portal 

 National Library of Medicine

Several text mining tools for accessing various NLM databases and biomedical literature. 

NLM APIs 

 ORCID API 

Query and search the ORCID researcher identifier system to obtain researcher profile data

API credentials are available either by becoming an ORCID member or by using the Public API

 PLoS (Public Library of Science)

Access to article corpus and article metadata 

PLoS text and data mining info 

PLoS API 

 Project Gutenberg

Over 60,000 books, usually out of copyright. No API available. Scraping available from mirror sites only 

Project Gutenberg website terms of use 

 PubMed Central

Open access full-text scholarly articles that have been published in biomedical and life sciences journals 

However, even when an article is in PubMed Central, whether it can be used for text mining depends on its copyright and/or Creative Commons license status.

This page explains what is in PMC and provides access to datasets of open access material in PMC, broken down by license type:

https://www.ncbi.nlm.nih.gov/pmc/tools/textmining/

PubMed Central text mining tools 

PubMed Central developer portal 

 World Digital Library

Primary source materials from many cultures and countries, representing over 100 different languages 

Multiple WDL access options 
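
Several of the sources listed above are reached through HTTP APIs. As an illustration only, here is a minimal Python sketch of how a request URL might be assembled for two of them, arXiv and CrossRef. The endpoint addresses and parameter names (search_query, start, max_results, query, rows, mailto) reflect our understanding of each provider's public documentation and are assumptions to verify there before use. The helper function names are hypothetical. The sketch only builds URLs and sends no requests; check each provider's terms of use and rate limits before retrieving content.

```python
from urllib.parse import urlencode

# Assumed endpoints; confirm against each provider's current API documentation.
ARXIV_API = "http://export.arxiv.org/api/query"
CROSSREF_API = "https://api.crossref.org/works"


def arxiv_query_url(search_query, start=0, max_results=10):
    """Build (but do not send) an arXiv API query URL (hypothetical helper)."""
    params = {
        "search_query": search_query,  # e.g. "all:electron"
        "start": start,                # offset of the first result
        "max_results": max_results,    # page size; keep it small
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def crossref_query_url(query, rows=20, mailto=None):
    """Build (but do not send) a CrossRef works query URL (hypothetical helper)."""
    params = {"query": query, "rows": rows}
    if mailto:
        # Assumption: a contact email identifies the requester to CrossRef.
        params["mailto"] = mailto
    return f"{CROSSREF_API}?{urlencode(params)}"


if __name__ == "__main__":
    print(arxiv_query_url("all:electron", max_results=5))
    print(crossref_query_url("text mining", rows=5, mailto="you@example.org"))
```

Keeping URL construction separate from retrieval makes it easy to review exactly what a project will request before any content is downloaded.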

At this time, most of the resources available through the library do not allow text and/or data mining. However, for some resources the publisher permits text and data mining and may require a license or acceptance of terms and conditions separate from the library’s license. Individual publishers typically restrict use, so please read any terms and conditions carefully before proceeding.

Currently, a few providers allow limited use for non-commercial research purposes only, and only for certain kinds of content. Please contact the Scholarly Communications Librarian, Nancy Shin, before beginning a TDM project so we can advise you of your options and clarify which content is covered for TDM activity under the existing licenses.

 Publisher 

 JHU License Conditions through December 2024

 Publisher Text Mining Policies 

 AMA 

Researchers must submit a text mining request per AMA’s policy. AMA permits text mining of its licensed journal content by authorized users of an institutional license. 

https://jamanetwork.com/pages/about-tdm 

 Elsevier 

Researchers must submit a text mining request per Elsevier’s policy, and must contact the library to understand the institutional license terms that apply to the project if the project is approved by Elsevier. 

https://www.elsevier.com/about/policies-and-standards/text-and-data-mining 

 

 Wiley 

Researchers must submit a text mining request per Wiley’s policy. 

https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining 

 

If you do not see the resource you are looking for in this LibGuide, TDM may not be permissible.