Many materials are freely available for TDM projects, but most of the library's database licenses do not allow text and/or data mining. Some text and data mining resources are available; they vary by publisher and may carry publisher-imposed restrictions. If you do not see the resource you are looking for on this TDM LibGuide, please contact the Scholarly Communications Librarian, Nancy Shin.
Here is a list of freely available content for your text and data mining projects. In addition to the specific resources listed below, see this list of Open Access disciplinary repositories if you are looking for general scholarly literature. As a rule of thumb, double-check the terms of access and use even for freely available resources, as these terms can change over time.
Data Source: MIT TDM LibGuide and Washington University TDM LibGuide
Publisher | Available Content | Online Access Info |
arXiv | An open access archive/repository of non-peer-reviewed scholarly literature in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics | |
CaseLaw | All U.S. federal and state case law | |
Congress.gov | Structured info on legislators, bills, bill summaries, amendments, committee reports, appointee nominations, international treaties, and more. | |
CrossRef | Metadata records with CrossRef DOIs | |
HathiTrust | Metadata, page images, and OCR for 17+ million digitized items in HathiTrust Digital Library | |
Internet Archive | Wayback Machine, Open Library (books), and Internet Archive metadata | |
Library of Congress | Covers America’s historical newspapers | |
Library of Congress | “LC for Robots” provides machine-readable access to the Library of Congress' digital collections, including images, laws and regulations, and bibliographic information | |
NCBI | The NCBI developer portal contains a variety of resources, such as APIs, software libraries, and datasets, that can be downloaded and accessed for computational use. See their data use policy and usage guidelines | |
National Library of Medicine | Several text mining tools for accessing various NLM databases and biomedical literature. | |
ORCID API | Query and search the ORCID researcher identifier system and obtain researcher profile data | Receive API credentials by becoming an ORCID member vs. using the Public API |
PLoS (Public Library of Science) | Access to article corpus and article metadata | PLoS text and data mining info |
Project Gutenberg | Over 60,000 books, usually out of copyright. No API available. Scraping available from mirror sites only | |
PubMed Central | Open access full-text scholarly articles that have been published in biomedical and life sciences journals. Even when an article is in PubMed Central, however, whether it can be used for text mining depends on its copyright and/or Creative Commons license status. This page explains what is in PMC and provides access to datasets of open access material in PMC, broken down by type of license: https://www.ncbi.nlm.nih.gov/pmc/tools/textmining/ | PubMed Central text mining tools; PubMed Central developer portal |
World Digital Library | Primary source materials from many cultures and countries, representing over 100 different languages | |
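As an illustration of how one of the API-based sources above might be used, here is a minimal Python sketch for CrossRef metadata. It is not part of the original guide. It assumes the public CrossRef REST API at `https://api.crossref.org/works` with its `query`, `rows`, and `mailto` parameters, and the function names (`build_query_url`, `license_urls`, `fetch_works`) and the sample DOIs are hypothetical, chosen only for this example. Because reuse depends on each item's license, the sketch also pulls out whatever license URLs a record declares; a declared license is a starting point for checking terms, not a substitute for reading them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the public CrossRef REST API.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(query, rows=5, mailto=None):
    """Build a works-search URL; 'mailto' identifies the requester."""
    params = {"query": query, "rows": rows}
    if mailto:
        params["mailto"] = mailto
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def license_urls(response):
    """Map each DOI in a works response to the license URLs it declares."""
    items = response.get("message", {}).get("items", [])
    return {
        item["DOI"]: [lic["URL"] for lic in item.get("license", []) if "URL" in lic]
        for item in items
        if "DOI" in item
    }


def fetch_works(query, rows=5, mailto=None):
    """Fetch and decode one page of results (needs network access)."""
    with urlopen(build_query_url(query, rows, mailto), timeout=30) as resp:
        return json.load(resp)


# Offline illustration: a hand-written record shaped like an API response.
# The DOIs are made up for this example.
sample = {
    "message": {
        "items": [
            {
                "DOI": "10.1234/example",
                "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}],
            },
            {"DOI": "10.1234/other"},  # no license declared
        ]
    }
}

print(build_query_url("text mining", rows=2))
print(license_urls(sample))
```

Records that come back with an empty license list, like the second sample item, are the ones to check by hand against the publisher's terms before including them in a corpus.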
At this time, most of the resources available through the library do not allow text and/or data mining. However, some publishers permit text and data mining, and may require a license or acceptance of terms and conditions separate from the library’s license. Individual publishers typically restrict use, so please read any terms and conditions carefully before proceeding.
Currently, a few providers allow limited use, for non-commercial research purposes only, and restrict that use to certain kinds of content. Please contact the Scholarly Communications Librarian, Nancy Shin, before beginning a TDM project so that we can advise you on your options and clarify which content the existing licenses cover for TDM activity.
Publisher | JHU License Conditions through December 2024 | Publisher Text Mining Policies |
AMA | Researchers must submit a text mining request per AMA’s policy. AMA permits text mining of its licensed journal content by authorized users of an institutional license. | https://jamanetwork.com/pages/about-tdm |
Elsevier | Researchers must submit a text mining request per Elsevier’s policy, and must contact the library to understand the institutional license terms that apply to the project if the project is approved by Elsevier. | https://www.elsevier.com/about/policies-and-standards/text-and-data-mining |
Wiley | Researchers must submit a text mining request per Wiley’s policy. | https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining |
If you do not see the resource you are looking for in this LibGuide, TDM may not be permissible.