The most important first step when embarking on a text mining project is getting the text into a machine-readable format, meaning that it can be read (but not necessarily understood) by a computer. An important thing to keep in mind is that computers rely on rules and input to complete processes. They can't understand meaning or nuance in text, which is why many text mining projects use a combination of the "distant" perspective text mining provides and the "close" reading more common in traditional textual studies.
Creating machine-readable text will vary depending on what text you're studying. Optical Character Recognition (OCR) is a technology that creates machine-readable text from analog materials (like print books and physical newspapers). The quality of OCR output can vary widely, depending on factors such as the contrast of the scan and the clarity of the letters on the page. Newspapers, for example, are notorious for producing subpar OCR because they are printed on cheap, thin paper that degrades over time. OCR quality can significantly influence the viability of your text mining project -- errors will become apparent as you run various text mining algorithms.
Poor-quality OCR can be remedied in some cases by cleaning up the text. Often, this will mean manually correcting errors or editing the scanned page in photo editing software to help improve the OCR output. For instance, low contrast between the background and the text can potentially be fixed by lightening the scanned image in Photoshop.
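Systematic OCR errors can also be corrected programmatically. Below is a minimal Python sketch of this idea; the substitution pairs are hypothetical examples of common character confusions, not a standard correction list, and a real project would build its own list from its corpus.

```python
import re

# A few illustrative OCR confusions (hypothetical examples, not a
# standard list); real projects build these rules from their own corpus.
OCR_FIXES = {
    r"\btbe\b": "the",     # 'h' misread as 'b'
    r"\bcorne\b": "come",  # 'm' split into 'r' + 'n'
    r"\bIife\b": "life",   # lowercase 'l' misread as capital 'I'
}

def clean_ocr(text: str) -> str:
    """Apply simple substitution rules to raw OCR output."""
    for pattern, replacement in OCR_FIXES.items():
        text = re.sub(pattern, replacement, text)
    return text

raw = "tbe women of tbe state will corne to vote"
print(clean_ocr(raw))  # -> the women of the state will come to vote
```

Rules like these only catch predictable, repeated errors; unpredictable garbling still calls for manual correction or a better scan.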
Other pre-processing steps might include part-of-speech tagging, creating a stop words list (a list of words that are omitted from analysis), or tokenization (splitting free text into discrete units, or tokens, such as individual words).
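Tokenization and stop word removal can be sketched in a few lines of Python using only the standard library. The stop word list below is a tiny made-up sample for illustration; real projects usually start from a standard list (such as NLTK's) and tailor it to their corpus.

```python
import re

# A small illustrative stop word list (a made-up sample, not a standard one).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "was"}

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

text = "The passage of the amendment was a victory for the women of Tennessee."
tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
print(tokens)
# -> ['passage', 'amendment', 'victory', 'for', 'women', 'tennessee']
```

Note how the choice of stop words shapes the result: "for" survives here only because it isn't in the sample list.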
Text mining methods can generally be split into two categories: supervised and unsupervised machine learning. Supervised learning relies on labeled input from the researcher, which might look like tagging specific words as specific parts of speech. Unsupervised machine learning, on the other hand, lets the computer attempt to make connections and infer meaning on its own. Topic modeling, a version of a larger method called clustering, is unsupervised (discussed more below).
Your chosen method will depend on your dataset and your research question. Below are brief explanations of a few common text mining methods. Not every possible method is discussed, and some more complex methods build off the foundations of these.
We often see word frequency analysis in our daily lives! If you've ever seen a word cloud, showing words that appear more often in a larger font, that was created with word frequency analysis. Below is a word cloud generated from Voyant using a dataset of newspapers about the passage of the 19th amendment, which gave women the right to vote. Predictably, we can see that the words "women" and "amendment" appear most often. Based on this visualization, we might want to research "colby" and "tennessee" to see what the significance of those two words is in relation to the 19th amendment.
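Under the hood, word frequency analysis is simple counting. Here is a minimal Python sketch using a made-up toy corpus standing in for the newspaper dataset:

```python
from collections import Counter

# Toy corpus (made-up sentences standing in for the suffrage newspapers).
articles = [
    "women win the vote as the amendment passes",
    "tennessee ratifies the amendment and women celebrate",
    "the amendment gives women the right to vote",
]

counts = Counter()
for article in articles:
    counts.update(article.split())

print(counts.most_common(3))
# -> [('the', 5), ('women', 3), ('amendment', 3)]
```

In practice, a stop words list would remove filler words like "the" before counting, so that content words such as "women" and "amendment" dominate the word cloud.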
Collocation identifies words that appear within a certain distance of each other. Using this method, you could conduct basic sentiment analysis (a method which reports whether certain words or phrases are positive, negative, or neutral), see patterns in how the meaning of words has changed over time or across different bodies of work, and more.
The example below uses the TermsBerry tool from Voyant to visualize collocated words in the works of Jane Austen. Here we can see that the prefix "mrs" is often accompanied by "mr" (shown by the darker colored circle on the word "mr"), which tracks with our understanding of how people were addressed during the Regency era. This is a simple example, but you may find that collocated words in your dataset surprise you and change your understanding of the text.
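A basic collocation count amounts to tallying the words that fall within a fixed window around a target word. The Python sketch below illustrates this with a made-up token stream (echoing the Austen example; the sentence is invented, not a quotation):

```python
from collections import Counter

def collocates(tokens: list[str], target: str, window: int = 2) -> Counter:
    """Count words appearing within `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), i + window + 1
            # Count the neighbors on either side, excluding the target itself.
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

# Made-up token stream; real input would come from a tokenizer.
tokens = "mrs bennet told mr bennet that mrs long had visited".split()
print(collocates(tokens, "mrs", window=2))
```

Widening the window captures looser associations; narrowing it to 1 approximates fixed phrases like "mrs bennet".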
As mentioned above, a common type of clustering is a method called topic modeling. In topic modeling, the computer assumes that certain words are related because they appear together frequently, and arranges those words into topics. The computer doesn't actually know what the topics are -- the researcher must create meaning by seeing which words are clustered together.
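Real topic modeling typically uses statistical algorithms such as Latent Dirichlet Allocation, but the underlying intuition (words that frequently appear together may belong to the same topic) can be sketched with simple co-occurrence counting. The toy documents below are made up, and this is only an illustration of the intuition, not an actual topic modeling algorithm:

```python
from collections import Counter
from itertools import combinations

# Made-up toy "documents", each reduced to a handful of words.
docs = [
    ["father", "son", "man", "young"],
    ["ship", "sea", "captain", "storm"],
    ["man", "father", "young", "time"],
    ["sea", "storm", "ship", "wave"],
]

# Count how often each pair of words appears in the same document.
pairs = Counter()
for doc in docs:
    pairs.update(combinations(sorted(set(doc)), 2))

# The most frequent pairs hint at two clusters of related words:
# one about family, one about the sea.
for pair, n in pairs.most_common(4):
    print(pair, n)
```

Even here, the computer only reports which words travel together; deciding that one cluster is "about family" is the researcher's interpretive work.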
Clustering can be done at the document or word level. In document-level clustering, for example, each document is assigned a single topic during the text mining process rather than a mix of topics, allowing the researcher to see which one topic each document seems to embody.
The example below is word-level topic modeling; words have been arranged into potential topics, based on what appears frequently together. These topics were generated from a dataset of public domain fictional novels using the ConText tool.
Topic1 (remember that the computer can't name the categories because that would require assigning meaning) includes the words "man," "time," "young," and "father," among others. From this example, we might interpret Topic1 as referring to familial relationships between men.
Something interesting to note is that Topic4 seems to consist almost entirely of names from Victor Hugo's novel "Les Miserables," which we know is very long. Keep in mind how various characteristics of your dataset, including its size and the quality of its machine-readable text, can affect your outputs.