This is a guest post by Mae Capozzi (@Mcapoz), an English major and Italian minor at Skidmore College.
At the Re:Humanities conference at Haverford College on 3 and 4 April 2014, I presented my research entitled “A Postcolonial ‘Distant Reading’ of Eighteenth and Nineteenth Century Anglophone Literature.” This project used topic modeling in MALLET to analyze a corpus of Victorian novels, which allowed me to draw conclusions about the permeability of the boundaries between Britain and India at this time. I hypothesized that as Britain became more involved in India, I would see a clear spike in conversation about India in the Anglophone literature of the period. While this is a fairly obvious and reductive thesis (it is not a new idea to suggest that cultural diffusion occurs as civilizations come into contact), my goal was to prove this concept using the digital humanities.
MALLET is a Java-based program created by Andrew McCallum at UMass. I, and others, argue that MALLET can be a highly useful tool for humanists. The program is run from the command line, which results in a steep learning curve for those totally unfamiliar with coding. Even so, this type of research produces fruitful results once understood. I learned to use this program in a fairly short amount of time with the aid of “Getting Started with Topic Modeling and MALLET,” a lesson by Shawn Graham, Scott Weingart, and Ian Milligan on a wonderful website called “The Programming Historian.” The lesson outlines how to use MALLET and serves as a tutorial for creating your first topic model. After completing the tutorial a few times, I began to understand the way the program works.
What is Topic Modeling?
Topic modeling works by taking words from a corpus and placing them in different bins (topics). MALLET can read a corpus and create a number of different topics that can then be analyzed and graphed over time. The humanist can use these topics, as I have, to draw conclusions about the culture from which the texts are taken.
Why is Topic Modeling Useful?
Topic modeling is particularly useful when completing a distant reading of a large body of texts. For example, I hope to do a topic model of 10,000 texts this summer using a dataset from HathiTrust. MALLET can read more texts at once than one person could read in five years, meaning it can be a highly useful tool for humanists doing large-scale work. Franco Moretti, in his essay “Conjectures on World Literature,” coins the term “distant reading”:
Distant reading: where distance, let me repeat it, is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes—or genres and systems. And if, between the very small and the very large, the text itself disappears, well, it is one of those cases when one can justifiably say, Less is more.
Moretti places “distant reading” in direct opposition to the more traditional “close reading” and suggests that both types of analysis are important. He notes that distant reading presents the opportunity to draw conclusions on “units” that are “much smaller or much larger than the text.” The conclusions drawn from a “distant reading” of a culture can then be applied to a “close reading” of a text. He asserts that “distant reading” is a valid type of analysis even though the “text itself [may disappear]” because what is gained is just as necessary for a comprehensive understanding of a period or a culture.
This type of analysis also seems to draw from New Historicism and Cultural Criticism as outlined by Stephen J. Greenblatt. In “Resonance and Wonder,” Greenblatt explains his concept of “new historicism”: “I used the term ‘new historicism’ to describe an interest in the kinds of issues I have been raising—in the embeddedness of cultural objects in the contingencies of history….” The concept of distant reading, in my opinion, is the perfect way to put Greenblatt’s New Historicism into practice: it allows for a fuller comprehension of cultural tropes and ideologies that can then be applied to individual texts.
An Application of Topic Modeling
Goals of this Project
My goal in this project was to examine the permeability of the boundaries between Britain and India. Traditionally, it has been understood that Britain impacted India; this is most clearly evidenced by the fact that the two primary languages in India are Hindi and English. More recently, scholars have begun to examine India’s impact on Britain, and this is what I have sought to do with my project. I also wanted to examine the way the digital humanities could be used to study texts from a postcolonial standpoint. Because postcolonialism is a non-traditional perspective, I wanted to push myself a step further and study through a postcolonial lens using technology rather than doing a traditional close reading.
To understand my particular project, one must have at least a basic grasp of the history of British interference in India. (See slide 4).
To complete this project, I selected 136 texts by hand from Project Gutenberg and put them in order by date. This corpus is admittedly small, and I would like to eventually expand it to 10,000 texts. I then created a file MALLET could read and ran a topic model on the file. (See slide 7). The most exciting topic in this model was topic 16 because it directly referenced India.
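For readers unfamiliar with the command line, the two steps just described (creating a file MALLET can read, then running a topic model on it) might look something like the sketch below. It only builds the two command lines and prints them, using the import-dir and train-topics commands covered in the Programming Historian lesson. The folder names, output file names, and the figure of 20 topics are assumptions made for illustration, not the settings used for this project.

```python
from pathlib import Path


def mallet_commands(corpus_dir, work_dir, num_topics, mallet_bin="bin/mallet"):
    """Build (but do not run) the two MALLET command lines for one topic model."""
    work = Path(work_dir)
    corpus_file = work / "corpus.mallet"  # hypothetical file name

    # Step 1: convert a folder of plain-text files into MALLET's own format.
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", str(corpus_dir),
        "--output", str(corpus_file),
        "--keep-sequence",
        "--remove-stopwords",
    ]

    # Step 2: train the model, writing the key words for each topic and
    # the topic composition of each document to (hypothetical) text files.
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", str(corpus_file),
        "--num-topics", str(num_topics),
        "--output-topic-keys", str(work / f"keys_{num_topics}.txt"),
        "--output-doc-topics", str(work / f"composition_{num_topics}.txt"),
    ]
    return import_cmd, train_cmd


# "novels" and "output" are placeholder folder names; 20 topics is an assumption.
import_cmd, train_cmd = mallet_commands("novels", "output", num_topics=20)
print(" ".join(import_cmd))
print(" ".join(train_cmd))
```

Putting the number of topics into the output file names makes it easier to keep several models of the same corpus side by side.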
I did a few other topic models of this same corpus but with a different number of topics each time. (See slide 10).
What each of these topics has in common is a reference to India, which suggests that India was a major part of the discourse during the Victorian period. Model 7 turned out particularly well, even including the word “India.” I did note that certain topics came out better than others—for example, Model 7 is much more obviously in reference to India than Model 6.
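Finding the India-related topic in each model can be done by eye, but a small script can also scan the topic keys. The sketch below assumes each line of a topic-keys file holds a topic number, a weight, and a list of key words, separated by tabs, which is the layout described in the Programming Historian lesson. The sample lines are invented for illustration and are not output from the models discussed here.

```python
def topics_mentioning(keys_text, term):
    """Return (topic_id, keywords) for every topic whose key words include term."""
    matches = []
    for line in keys_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumed layout: topic id <TAB> weight <TAB> space-separated key words.
        topic_id, _weight, words = line.split("\t", 2)
        keywords = words.split()
        if term.lower() in (word.lower() for word in keywords):
            matches.append((int(topic_id), keywords))
    return matches


# Invented sample lines, NOT results from the corpus described in this post.
sample_keys = (
    "0\t0.25\tlady lord house marriage london\n"
    "1\t0.25\tindia native sahib regiment english\n"
    "2\t0.25\tship sea captain deck wind\n"
)

for topic_id, keywords in topics_mentioning(sample_keys, "india"):
    print(topic_id, " ".join(keywords))
```

An exact-word match like this would miss a topic such as Model 6, where the reference to India is less direct, so reading the key words by hand remains necessary.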
The next step was to try to graph these topics over time. In slide 11, I graphed Model 4. There is a clear spike around 1880, which makes sense given that the Raj began in 1858, after the rebellion of 1857. I hypothesize that India was a fairly taboo topic for a few years, until 1880, the height of British colonialism. Around this time, India became one of Britain’s most important colonial possessions and was a necessary part of the British economy. I am also unsurprised by the spike around 1900, given the importance of India to Britain around this time. Another possible reason I am seeing these spikes in the latter part of the 19th century is the inclusion of Rudyard Kipling in the corpus, which may have skewed the data somewhat.
I also decided to include a graph of Model 6 because it was a bit unclear. (See slide 12). There is a slight spike around 1840, and another around 1860. The spike around 1860 makes sense in the context of the Indian Army Mutiny of 1857 and the start of the Raj in 1858. I again see spikes around 1880 and 1900, as well as right before World War I, which makes sense because of the political unrest growing in India at this time, as well as India’s important but often overlooked role in the war.
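One way to prepare a topic for graphing over time is to average its weight across all the texts published in each decade. The sketch below is only an illustration of that idea, not the method used to produce the slides: it assumes that each text has a known publication year and a single weight for the topic of interest, and the sample values are invented placeholders.

```python
from collections import defaultdict


def mean_weight_by_decade(records):
    """Average one topic's weight per decade.

    records: iterable of (publication_year, topic_weight) pairs, one per text.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, weight in records:
        decade = (year // 10) * 10  # e.g. 1855 falls in the 1850s
        totals[decade] += weight
        counts[decade] += 1
    return {decade: totals[decade] / counts[decade] for decade in sorted(totals)}


# Invented placeholder values, NOT data from the corpus described in this post.
sample = [(1850, 0.25), (1855, 0.75), (1862, 0.5)]
for decade, mean in mean_weight_by_decade(sample).items():
    print(decade, mean)
```

Averaging rather than summing keeps a decade from looking more prominent simply because the corpus contains more texts from it, although several texts by one author in the same decade can still pull that decade's average up.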
Ultimately, I found this project very useful as an exploratory endeavor. I would like to expand this study to eventually include 10,000 texts because I feel that I can do a much more effective distant reading of Victorian literature if I feed MALLET as much information as I possibly can. A word of caution: while distant reading can be a useful tool for the humanist, it cannot replace a close reading of the texts. In my particular project, it makes sense to make a topic model because of the size of the corpus: it would obviously be impossible for anyone to read 10,000 texts as an undergraduate! Even so, I would like to apply this research to a close reading of Victorian literature in order to complete what I call an “advised close reading” of a text. My use of the digital humanities has enriched my understanding of the permeability of the boundaries between Britain and India as well as the discourses circulating during the Victorian period. This type of scholarship will result in a much more well-rounded and extensive study of these periods, and will hopefully eliminate some of the bias inherent in a close reading of a text. While I could choose to examine only the topics that referred to India, I also have information on many other genres and topics circulating in Victorian England at this time and could examine any of this data as comfortably as I have examined the topics concerning India.
About the Author: Mae Capozzi (@Mcapoz) is an English major and Italian minor at Skidmore College. She is interested in postcolonial studies and, more specifically, the colonial relationship between Britain and India. She finds the digital humanities captivating and would like to continue exploring this “frontier.” She would like to continue her studies in graduate school after Skidmore. Her other interests include speaking Italian, playing jazz piano, and listening to music. Mae would like to thank Professor Scott Enderle for research help and Professor Catherine Golden for editing her post.
Graham, S., S. Weingart, and I. Milligan. “Getting Started with Topic Modeling and MALLET.” The Programming Historian 2 (2012). http://programminghistorian.org/lessons/topic-modeling-and-mallet. Web.
Greenblatt, Stephen. “Resonance and Wonder.” Exhibiting Cultures: The Poetics and Politics of Museum Display (1991): 42-56.
Moretti, Franco. “Conjectures on World Literature.” New Left Review (2000): 54-68.