Text data mining (TDM) is the process of extracting meaning from unstructured text data. Examples of this type of data include documents, websites, and social media, as well as semi-structured text formats like JSON, XML, and HTML. Natural Language Processing (NLP) and Machine Learning (ML) techniques can be employed to explore text and better understand hard-to-see relationships in the data.
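As a rough illustration of what exploring text can look like in practice, the following minimal sketch counts word frequencies across a tiny in-memory corpus using only the Python standard library. The sample documents and the function names `tokenize` and `term_frequencies` are hypothetical and chosen for illustration; real TDM projects typically use richer NLP tooling and a much larger corpus.

```python
import re
from collections import Counter

# Hypothetical sample documents for illustration only; a real corpus would
# be assembled from sources you are permitted to mine.
corpus = [
    "Text mining finds patterns in text.",
    "Mining text data reveals hidden patterns.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(documents):
    """Count how often each token appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


frequencies = term_frequencies(corpus)

# Show the three most frequent terms in the corpus.
print(frequencies.most_common(3))
```

Simple counts like these are often a first step before more advanced analysis, since they show which terms dominate a corpus.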
When conducting TDM, it’s essential to understand the difference between compiling and working with a corpus to do research, which is considered fair use, and distributing or publishing the copyright-protected content, which might challenge or exceed fair use. As an example of the latter, a researcher may wish to publish or distribute the copyright-protected content in order to test applied algorithms. This exceeds fair use. Courts review whether the portions used are large enough to potentially harm the present or future market for the copyright owner's work.
If the materials to which you want to apply TDM are in the public domain (copyright does not apply), or are factual in nature (e.g., statistics, citations, etc.) rather than creative content protected by copyright, fair use applies and the content can be used without requesting permission.
When undertaking TDM, there’s more to consider than copyright. For example, if you’re doing TDM on a corpus of materials from a library-licensed database or an online resource, the Johns Hopkins license agreement may contain terms of use that affect a perceived right to fair use for TDM.
Additionally, tools used to “scrape” content search results could breach Johns Hopkins' license agreement where TDM is not permitted, which could result in resource access being shut down for the institution.
Publisher TDM policies and permissions may require use of publisher-approved APIs, stipulate secure storage of and access to content during the research, and/or require that a content retention and disposition plan be followed.
When doing TDM on library-licensed databases, it’s important to carefully consider any license agreements or website terms of use before you decide how to move forward. For library-licensed databases, the institutional license matters more than the generic terms of use you find for an online database when deciding how to proceed with a TDM project.