
English and American Literature

How do I get data from these sources?

Each data source has its own procedure for extracting data. Some rely on third-party programs, some offer a data package you can download in one click, and some require going through an Application Programming Interface (API). Detailed steps for each source are listed below. Please contact AskDSP@udel.libanswers.com for assistance.

Data is delivered in a variety of formats.

Books and Primary Sources

Adam Matthew Collections

This spreadsheet lists primary source collections that the Library has access to. For text and data mining purposes, the data must be requested from Adam Matthew. Contact a librarian at AskDSP@udel.libanswers.com to get started. 

HathiTrust Digital Library

  • HathiTrust (pronounced hah-tee) is a partnership of academic and research institutions, offering a collection of millions of titles digitized from libraries around the world. Many materials in HathiTrust are only accessible to member institutions (University of Delaware is a member). The Data API lets researchers access content in the digital library.  

JSTOR

  • JSTOR is a digital library of more than 2,000 journals and more than 25,000 books in the humanities, social sciences, and sciences. JSTOR supports text data access through platforms like Constellate. Users must create a free JSTOR account to download datasets. 

Project Gutenberg

  • Project Gutenberg is a volunteer-driven digital library that offers over 56,000 free eBooks for public use. It offers works in many languages, but most books are in English. Please note that some works are still under copyright. See this page on automated access to the collection for directions on retrieving data.

Digital Public Library of America

  • DPLA provides digital access to many collections of “America’s libraries, archives, museums, and other cultural heritage institutions.” Materials include books, photos, audio and video recordings, and other media. Request an API key to gain access to the DPLA API. Data delivered in JSON-LD format.
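
As a rough illustration of what a keyed DPLA request might look like, the sketch below builds a search URL and pulls titles out of a response that has already been parsed from JSON. The endpoint URL, the `q`, `page_size`, and `api_key` parameters, and the `docs` / `sourceResource` / `title` field names are assumptions based on common descriptions of the DPLA API; confirm them against DPLA's own documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed items endpoint; verify against the DPLA API documentation.
DPLA_ITEMS_ENDPOINT = "https://api.dp.la/v2/items"


def build_dpla_url(query, api_key, page_size=10):
    """Return a search URL for the (assumed) DPLA items endpoint."""
    params = {"q": query, "page_size": page_size, "api_key": api_key}
    return f"{DPLA_ITEMS_ENDPOINT}?{urlencode(params)}"


def extract_titles(response):
    """Collect titles from a parsed response dictionary.

    Assumes each record sits in response["docs"] and keeps its title under
    "sourceResource"; a title may be a single string or a list of strings.
    """
    titles = []
    for doc in response.get("docs", []):
        title = doc.get("sourceResource", {}).get("title")
        if isinstance(title, list):
            titles.extend(title)
        elif title:
            titles.append(title)
    return titles
```

To fetch live data, the URL can be passed to any HTTP client (for example `urllib.request.urlopen`) and the body decoded with `json.load` before calling `extract_titles`.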

World Digital Library

  • The World Digital Library, sponsored in part by the Library of Congress, archives digitized historical materials, both texts and images, from across the globe. Metadata is available as a bulk download; full text requires permission from the Library of Congress. Data delivered in CSV, JSON, or XML format.

Biodiversity Heritage Library

  • The Biodiversity Heritage Library is an online collection of scientific texts focused on natural history, biology, botany, and other natural sciences. It contains both scholarly journal articles and books. Access the BHL API to retrieve data. Data delivered in JSON or XML format.

Women Writers Online

  • Women Writers Online is the digital library of the Women Writers Project at Northeastern University. The library contains texts of early women's writing in English, from 1526 to 1850. Review the information on their text database, and email the team at wwp@neu.edu with a brief description of your research plans.

Newspapers

Chronicling America

  • Chronicling America is the website of the National Digital Newspaper Program and contains digitized American newspapers from 1789 to 1963. The API and bulk data download page has information on retrieving metadata and full text.
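
The sketch below shows one way a full-text page search against Chronicling America could be set up. The search URL, the `andtext`, `date1`, `date2`, `dateFilterType`, and `format` parameters, and the `items` / `ocr_eng` response fields are assumptions drawn from the interface as it has commonly been documented; the Library of Congress has been revising this service, so check the API page for the current form.

```python
from urllib.parse import urlencode

# Assumed search endpoint; verify on the API and bulk data download page.
SEARCH_BASE = "https://chroniclingamerica.loc.gov/search/pages/results/"

# Coverage range stated in this guide.
FIRST_YEAR, LAST_YEAR = 1789, 1963


def build_search_url(text, start_year, end_year, page=1):
    """Return a JSON search URL for pages containing `text` in a year range."""
    if not FIRST_YEAR <= start_year <= end_year <= LAST_YEAR:
        raise ValueError(f"years must fall within {FIRST_YEAR}-{LAST_YEAR}")
    params = {
        "andtext": text,
        "date1": start_year,
        "date2": end_year,
        "dateFilterType": "yearRange",
        "format": "json",
        "page": page,
    }
    return f"{SEARCH_BASE}?{urlencode(params)}"


def extract_page_text(response):
    """Return (title, date, OCR text) tuples from a parsed response.

    Assumes results are listed under "items" with "title", "date",
    and "ocr_eng" keys; missing keys come back as None.
    """
    return [
        (item.get("title"), item.get("date"), item.get("ocr_eng"))
        for item in response.get("items", [])
    ]
```

For large projects, the bulk downloads are usually a better fit than paging through search results one request at a time.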

Europeana 

  • Europeana is a digital library focused on European materials, including an extensive digitized newspaper collection. Europeana offers multiple APIs depending on the researcher's needs.

New York Times Archive

  • The New York Times keeps archives of the newspaper’s past issues dating back to 1851. Recent articles require a NYT subscription to access. Members of the UD community have access to a subscription through the library. See the Newspapers guide for more information. Access one of NYT's APIs to gather text data. 
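
As one example of the NYT APIs, the Archive API is generally described as returning all article metadata for a given month. The sketch below builds such a request and reads headlines from a parsed response. The URL pattern, the `api-key` parameter, and the `response` / `docs` / `headline` / `main` / `pub_date` fields are assumptions to verify in the NYT developer documentation; an API key from a developer account is required.

```python
from urllib.parse import urlencode

# Assumed base path for the Archive API; verify in the NYT developer docs.
ARCHIVE_BASE = "https://api.nytimes.com/svc/archive/v1"

# Earliest year of the archive, as stated in this guide.
EARLIEST_YEAR = 1851


def build_archive_url(year, month, api_key):
    """Return the (assumed) Archive API URL for one month of articles."""
    if year < EARLIEST_YEAR:
        raise ValueError(f"the archive starts in {EARLIEST_YEAR}")
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    query = urlencode({"api-key": api_key})
    return f"{ARCHIVE_BASE}/{year}/{month}.json?{query}"


def extract_headlines(payload):
    """Return (publication date, headline) pairs from a parsed response."""
    docs = payload.get("response", {}).get("docs", [])
    return [
        (doc.get("pub_date"), doc.get("headline", {}).get("main"))
        for doc in docs
    ]
```

Note that this kind of request returns metadata such as headlines and dates rather than the full article text.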

Social Media and More

Documenting the Now 

  • Documenting the Now collects tweet data (tweet IDs) and publishes it as open access data sets. They also maintain a tool called Hydrator that turns the tweet IDs into full tweets.

TAGS (Twitter Archiving Google Sheet)

  • TAGS is a complex Google Sheets template to retrieve Twitter data. This platform supports basic network analysis visualization.

Case.law

  • Case.law is a project aiming to make case law more publicly accessible. Over six million court documents have been digitized from the Harvard Law Library's collections, covering cases from 1658 to 2018.

Genius

  • Genius, formerly Rap Genius, is a reliable web source of song lyrics from all genres. They also publish news, interviews with artists, and other content related to popular music. They maintain an API to access site data.
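
A minimal sketch of a Genius API search follows. It prepares a request without sending it and reads song titles from a parsed response. The search endpoint, the bearer-token header, and the `response` / `hits` / `result` / `full_title` fields are assumptions to check against the Genius API documentation; the access token comes from registering an API client with Genius.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed search endpoint; verify in the Genius API documentation.
GENIUS_SEARCH = "https://api.genius.com/search"


def build_search_request(term, access_token):
    """Prepare (but do not send) an authenticated search request."""
    url = f"{GENIUS_SEARCH}?{urlencode({'q': term})}"
    return Request(url, headers={"Authorization": f"Bearer {access_token}"})


def extract_song_titles(payload):
    """Return full song titles from a parsed search response.

    Assumes matches are listed under response -> hits, each with a
    "result" object that carries a "full_title" string.
    """
    hits = payload.get("response", {}).get("hits", [])
    return [
        hit["result"]["full_title"]
        for hit in hits
        if "full_title" in hit.get("result", {})
    ]
```

To run the search, pass the prepared request to `urllib.request.urlopen` and decode the body with `json.load` before calling `extract_song_titles`.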