Text Mining Methods and Tools

Learn more about text mining (aka computational text analysis) and how you can apply it to your research.

What is text mining?

Text mining, also known as computational text analysis, is a method where a researcher uses computational tools to analyze a large set of texts (a text corpus). Text mining can be used to discover patterns or deviations in a set of texts, examine relationships between documents or ideas, analyze sentiment, or track changes in texts over time.

To see text mining in action, check out America's Public Bible or Mining the Dispatch.

Process of a Text Mining Project

Figure: the basic steps of a text mining project (get data, prepare and process the data, present your results).

Completing a text mining project can be broken down into three overarching steps. This is just an overview; the steps themselves are broken down further on the Processing Text page. 

  1. Get data (text and metadata)
    • Use existing text corpora from sources like Project Gutenberg, the Internet Archive, etc.
    • Use APIs (application programming interfaces) to retrieve data from websites, such as Twitter or online forums.
    • Use your own sources. For example, physical books can be scanned and run through OCR (Optical Character Recognition) software to create machine-readable text.
  2. Process the data (prepare and analyze)
    • Preparation
      • Includes data cleaning and fixing errors in the data. Be careful with this step! You must know your data well to clean it properly.
      • Includes creating derived data: information computed from the raw data to support the analysis. This might be a word frequency list or text with parts of speech tagged.
    • Analysis
      • Includes running text mining programs (or algorithms) on a text corpus. 
        • Stop words are terms that are omitted from analysis. Many tools include their own stop word list, but it may need to be modified or extended depending on the corpus being examined. For example, many stop word lists automatically omit common pronouns (I, we, they, she, etc.). If a URL or a word not useful for analysis occurs frequently, adding it to the stop word list will clarify the results.
  3. Present results, data, and/or code
    • Present results through traditional modes like articles or books, or use a digital medium like a website.
    • Many areas of digital scholarship prioritize sharing data and code as well, to allow for replication and to share processes.
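
To make the processing step more concrete, here is a minimal Python sketch of one kind of derived data mentioned above: a word frequency list built with a stop word list that can be extended for a particular corpus. It is an illustration only, not the method of any specific tool. The names DEFAULT_STOP_WORDS, tokenize, and word_frequencies are hypothetical, the stop word list is deliberately tiny, and the two-sentence corpus is made up for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop word list for illustration.
# Real tools ship much longer lists.
DEFAULT_STOP_WORDS = {
    "i", "we", "they", "she", "he", "it",
    "the", "a", "an", "and", "of", "to", "in", "is",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(documents, extra_stop_words=()):
    """Count words across all documents, skipping stop words.

    extra_stop_words lets you add corpus-specific terms to the default list.
    """
    stop_words = DEFAULT_STOP_WORDS | {w.lower() for w in extra_stop_words}
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in stop_words)
    return counts


# Made-up sample corpus: "Chapter" is a heading word that is not useful for analysis.
corpus = [
    "Chapter 1. The whale surfaced and we watched the whale.",
    "Chapter 2. They say the whale is white.",
]

# With the default list only, the heading word "chapter" shows up in the counts.
print(word_frequencies(corpus).most_common(3))

# Adding "chapter" to the stop word list removes it from the results.
print(word_frequencies(corpus, extra_stop_words=["chapter"]).most_common(3))
```

In this sketch the pronouns "we" and "they" never appear in the counts because they are on the default list, while "chapter" disappears only once it is added as a corpus-specific stop word. Which words belong on the list depends on your corpus and research question, which is why knowing your data well matters.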