Each data source has its own procedure for extracting data. Some use third-party programs, some offer a convenient data package you can download in one click, and some require going through an Application Programming Interface (API). Detailed steps for each source are listed below. For assistance, please contact AskDSP@udel.libanswers.com.
Data is delivered in a variety of formats.
The Library subscribes to certain collections from Adam Matthew, a vendor specializing in primary source materials from a variety of regions and time periods. For text and data mining purposes, the data must be requested from Adam Matthew. Contact a librarian at AskDSP@udel.libanswers.com to get started.
HathiTrust is a partnership of academic and research institutions, offering a collection of millions of titles digitized from libraries around the world. Many materials in HathiTrust are only accessible to member institutions (the University of Delaware is a member). See available datasets, or use the HathiTrust Research Center's (HTRC) Analytics platform to build a custom dataset.
JSTOR is a digital library of more than 2,000 journals and more than 25,000 books in the humanities, social sciences, and sciences. JSTOR supports text data access through platforms like Constellate. Users must create a free JSTOR account to download datasets.
Project Gutenberg is a volunteer-driven digital library that offers over 56,000 free eBooks for public use. Works are available in many languages, but most books are in English. Please note that some works are still under copyright. See this page on automated access to the collection for directions on retrieving data.
DPLA provides digital access to many collections of “America’s libraries, archives, museums, and other cultural heritage institutions.” Materials include books, photos, audio and video recordings, and other media. See the Developers page of DPLA for information on bulk data downloads and API access.
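As a rough illustration of the API route, the sketch below builds a search request URL for DPLA items. It assumes the `https://api.dp.la/v2/items` endpoint and the `q`, `page_size`, and `api_key` parameters as described in DPLA's developer documentation at the time of writing; the helper name `build_dpla_url` and the placeholder key are hypothetical. Confirm the details on the Developers page before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL for DPLA item searches; confirm on the Developers page.
DPLA_ITEMS_URL = "https://api.dp.la/v2/items"


def build_dpla_url(query, api_key, page_size=10):
    """Return a search URL for DPLA items (hypothetical helper)."""
    params = {"q": query, "page_size": page_size, "api_key": api_key}
    # urlencode escapes spaces and punctuation in the query string.
    return f"{DPLA_ITEMS_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder; a real key must be requested from DPLA.
    print(build_dpla_url("civil war", "YOUR_API_KEY", page_size=5))
```

The function only constructs the URL. Sending the request (for example with `urllib.request.urlopen`) and paging through results are left out to keep the sketch short.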
The World Digital Library, sponsored in part by the Library of Congress, archives digitized historical materials, both texts and images, from across the globe. Metadata is available as a bulk download; full text requires permission from the Library of Congress. Data is delivered in CSV, JSON, or XML format.
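Because the same metadata may arrive as CSV, JSON, or XML, it can help to normalize all three into one structure before analysis. The sketch below is a minimal example using only the Python standard library. The function name `parse_records`, the field names in the usage note, and the flat XML layout are all assumptions made for illustration; the actual World Digital Library files may be organized differently.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def parse_records(text, fmt):
    """Parse CSV, JSON, or XML text into a list of dicts (illustrative only)."""
    if fmt == "csv":
        # The first row is treated as the header.
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    if fmt == "json":
        data = json.loads(text)
        # Wrap a single object so the return type is always a list.
        return data if isinstance(data, list) else [data]
    if fmt == "xml":
        # Assumes a flat layout: one child element per record,
        # one grandchild element per field.
        root = ET.fromstring(text)
        return [{field.tag: field.text for field in record} for record in root]
    raise ValueError(f"Unsupported format: {fmt}")
```

For example, `parse_records("title,year\nAtlas,1820\n", "csv")` yields one record with the keys `title` and `year`. Note that CSV and XML values come back as strings, while JSON preserves numeric types.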
The Biodiversity Heritage Library is an online collection of scientific texts focused on natural history, biology, botany, and other natural sciences. It contains both scholarly journal articles and books. Access the BHL API to retrieve data. Data is delivered in JSON or XML format.
Women Writers Online is the digital library of the Women Writers Project at Northeastern University. The library contains texts of early women's writing in English, from 1526 to 1850. Review the information on their text database, and email the team at firstname.lastname@example.org with a brief description of your research plans.
Chronicling America is the website portal of the National Digital Newspaper Program, and contains digitized American newspapers from 1789 to 1963. The API and bulk data download page has information on retrieving metadata and full text.
Europeana is a digital library focused on European materials, including an extensive digitized newspaper collection. Europeana offers multiple APIs depending on the researcher's needs.
The New York Times keeps archives of the newspaper’s past issues dating back to 1851. Recent articles require a NYT subscription to access. Members of the UD community have access to a subscription through the library. See the Newspapers guide for more information. Access one of NYT's APIs to gather text data.
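A minimal sketch of what a request to one of these APIs might look like is shown below. It assumes the Article Search endpoint `https://api.nytimes.com/svc/search/v2/articlesearch.json` with the `q`, `begin_date`, `end_date`, and `api-key` parameters, as documented on the NYT developer site at the time of writing. The helper name `build_nyt_search_url` and the placeholder key are hypothetical, and the endpoint details should be checked against the current documentation.

```python
from urllib.parse import urlencode

# Assumed Article Search endpoint; verify on the NYT developer site.
NYT_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_nyt_search_url(query, api_key, begin_date=None, end_date=None):
    """Return an Article Search URL (hypothetical helper).

    Dates are strings in YYYYMMDD form, e.g. "19190101".
    """
    params = {"q": query}
    if begin_date:
        params["begin_date"] = begin_date
    if end_date:
        params["end_date"] = end_date
    params["api-key"] = api_key
    return f"{NYT_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder for a key issued by the NYT developer portal.
    print(build_nyt_search_url("suffrage", "YOUR_API_KEY", "19190101", "19201231"))
```

The sketch stops at building the URL; fetching results, handling rate limits, and paging are deliberately omitted.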
As of 2023, changes to the Twitter and Reddit API Terms of Service have made research using data from these platforms considerably more difficult. Twitter / X now charges a fee for access to enough data for text mining to be useful, and changes to the Reddit API have both reduced traffic to the site and limited the ability to use the API to download data.
The platforms below are worth exploring, but their functionality may be limited due to these recent changes.
Social Media Archive at ICPSR
Datasets from social media platforms like Facebook, Reddit, Twitter / X, and more. These datasets may be more complete than those that require retrieval from a social media platform.
Social Media Macroscope
An "all-in-one" analytics environment for social media data retrieval, pre-processing, and analysis.
If you are interested in conducting text mining research using social media data, please reach out to the Digital Scholarship team at the Morris Library (AskDSP@udel.libanswers.com) for help. You can also search for the name of your desired platform and "developer" or "API" to find information on that specific platform.
Case.law is a project aiming to make case law more publicly accessible. Over six million court documents have been digitized from the Harvard Law Library's collections, covering cases from 1658 to 2018.
Genius, formerly Rap Genius, is a reliable web source of song lyrics from all genres. They also publish news, interviews with artists, and other content related to popular music. They maintain an API to access site data.
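As an illustration only, the sketch below prepares an authenticated search request for the Genius API without sending it. It assumes the `https://api.genius.com/search` endpoint, a `q` query parameter, and a bearer token passed in the `Authorization` header, as described in the Genius API documentation at the time of writing. The helper name `build_genius_search_request` and the token value are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed search endpoint; verify against the Genius API documentation.
GENIUS_SEARCH_URL = "https://api.genius.com/search"


def build_genius_search_request(query, access_token):
    """Return a prepared (unsent) search request (hypothetical helper)."""
    url = f"{GENIUS_SEARCH_URL}?{urlencode({'q': query})}"
    # The access token is sent as a bearer token in the Authorization header.
    return Request(url, headers={"Authorization": f"Bearer {access_token}"})


if __name__ == "__main__":
    # "YOUR_ACCESS_TOKEN" is a placeholder for a token created in a Genius account.
    request = build_genius_search_request("Kendrick Lamar", "YOUR_ACCESS_TOKEN")
    print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send the request; that step is left out here so the example runs without network access or a real token.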