For readers of this blog who are not my DH201 classmates, please don’t read this post. It won’t make sense to you, and you’ll be annoyed at having wasted your time.
For this assignment on mapping digital humanities concepts using topic modeling, we were assigned volumes one through six of the Digital Humanities Quarterly, a “digital journal covering all aspects of digital media in the humanities.” This description covers a broad range of possible topics, all the more so since the journal itself “defines both ‘the humanities’ and ‘the digital’ quite broadly.” It should come as no surprise, then, that the themes of the journal are not obvious, and that deciding on them requires a good deal of trial and error and subjective judgement.
The topics and parameters I eventually settled on are displayed in Table 1.
Table 1: Digital Humanities Quarterly Topics

|Parameter|Value|
|---|---|
|Number of topics|5|
|Number of iterations|400|
|Number of topic words printed|10|
|Topic proportion threshold|0.05|

|Topic|Topic words|Label|
|---|---|---|
|1|digital technology reading based narrative space works form http knowledge|Digital reading|
|2|text work information data literature history literary scholarly web visual|Literature on the web|
|3|research rend time humanities university project www sense system writing|Humanities research|
|4|org target textual lt science italic analysis design scholarship ref|Textual scholarship|
|5|game humanities title texts media world computer http player games|Game|
Before arriving at these topics, I tried many different combinations of parameters, including varying the number of topics from 5 topics to 50, changing the number of topic words per topic in increments of 5, changing the number of iterations from the default of 200 and the topic proportion thresholds from 0.05. I also toggled between feeding the topic modeling tool the six volumes of the journal as six separate files and as a single file (no hard returns, no duplicate spaces). The results of all the different runs I made on this corpus of texts are recorded in an Excel spreadsheet, which can be accessed here.
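As a rough sketch of how this kind of trial and error could be organized, the snippet below does two things: it collapses several texts into a single line with no hard returns or duplicate spaces, and it lists every combination of settings so each run can be logged next to its results. The function names, the `input_mode` labels, and the specific values in the grid are assumptions made for illustration; they are not the exact settings recorded in the spreadsheet, and nothing here calls the topic modeling tool itself.

```python
import itertools
import re


def collapse_to_single_line(texts):
    """Join several texts into one string with no hard returns or duplicate spaces."""
    joined = " ".join(texts)
    # \s+ matches runs of spaces, tabs, and newlines; each run becomes one space.
    return re.sub(r"\s+", " ", joined).strip()


def parameter_grid(topic_counts, words_per_topic, iteration_counts, thresholds, input_modes):
    """Return one dict per combination of settings, ready to log alongside results."""
    keys = ("num_topics", "topic_words", "iterations", "threshold", "input_mode")
    combos = itertools.product(
        topic_counts, words_per_topic, iteration_counts, thresholds, input_modes
    )
    return [dict(zip(keys, combo)) for combo in combos]


# Illustrative values only (assumed, not the settings actually tried).
runs = parameter_grid(
    topic_counts=[5, 10, 20, 50],
    words_per_topic=[5, 10, 15],
    iteration_counts=[200, 400],
    thresholds=[0.05, 0.1],
    input_modes=["six_files", "single_file"],
)
print(len(runs), "planned runs; first:", runs[0])

# Hypothetical stand-ins for the text of two volumes.
print(collapse_to_single_line(["Volume  one\ntext.", "Volume two\r\n\r\ntext."]))
```

Even this small grid produces 96 combinations, which gives a sense of how quickly the number of possible runs grows and why a written log of each run is worth keeping.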
I eventually settled on the parameters displayed in Table 1 (both the number of topics and the organization of the input) because they produced the results that made the most intuitive sense to me. I chose to feed the entire six volumes as a single document because I believe we are interested in finding the topics that the Quarterly covers in its entirety, not the topics that each volume covers. In other words, I have no reason to think that each of the six volumes was dedicated to a different theme and so should be treated as an isolated, complete text. I restricted the number of potential topics to a small number, 5 to be exact, because of cognitive limitations. While all complex texts cover a great many themes, I think the human mind is such that each of us can think about no more than three to five things at a time. For example, if somebody were to tell me that the Digital Humanities Quarterly covers topics 1, 2, 3 . . . N, by the time N exceeds 5 items or so, I will have forgotten what the first few items were. Machines have great capacity for taking in and churning out data. I, unfortunately, do not.
By using the topic modeling tool, I got some sense of what scholars in the digital humanities consider research worth doing and publishing. In order to glean patterns, i.e. changes over time, in the articles that the journal publishes, I tried feeding the topic modeling tool one volume at a time to see which topics were covered by vol. 1, vol. 2 . . . etc. I then compared these six sets of results to see if any patterns emerged. Again, the results are recorded in the Excel spreadsheet. No obvious pattern jumps out at me, but perhaps more imaginative souls would find something meaningful beyond the differences that naturally emerge from different texts, e.g. something actually tied to the passage of time or the maturing of a publication.
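One way to make the volume-by-volume comparison a little less impressionistic is to measure how much the topic words of two volumes overlap. The sketch below uses Jaccard similarity (shared words divided by all distinct words) on each pair of volumes. It assumes that all topic words for a volume are pooled into one list, which is only one of several reasonable choices, and the word lists shown are made-up placeholders, not results from the spreadsheet.

```python
from itertools import combinations


def jaccard(words_a, words_b):
    """Share of distinct words that two word lists have in common (0.0 to 1.0)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(top_words_by_volume):
    """Jaccard similarity for every pair of volumes, keyed by (volume, volume)."""
    return {
        (v1, v2): jaccard(top_words_by_volume[v1], top_words_by_volume[v2])
        for v1, v2 in combinations(sorted(top_words_by_volume), 2)
    }


# Placeholder word lists for illustration only; NOT actual output of the tool.
sample = {
    "vol1": ["text", "digital", "archive", "edition"],
    "vol2": ["text", "digital", "game", "player"],
    "vol3": ["game", "player", "media", "world"],
}

for pair, score in pairwise_overlap(sample).items():
    print(pair, round(score, 2))
```

A steady rise or fall in overlap between consecutive volumes might hint at a drift in subject matter, although with only six volumes any such reading would remain tentative.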
For me, this process of mapping digital humanities concepts using the topic modeling tool evoked feelings of both excitement and apprehension. I was excited because this was my first introduction to this tool, and to this type of digital humanities scholarship more broadly. While I already had some idea of what text mining can do (for example, using Voyant Tools to get word frequency counts for a document), topic modeling, both the idea and its implementation via Mallet and the topic modeling tool, was completely new to me. The tool is also incredibly easy to use; the barrier to learning how to operate it is close to zero (download the software, get some texts, click, click, and one has results). This ease contributes to my excitement but also to my apprehension, since the “results” came much faster than my understanding of what these results mean. I was never able to shake this sense of uncertainty, of feeling like I was groping around in the dark, having no clear guidance about how to adjust the parameters I was feeding the tool, and taking (educated?) stabs at guessing topics.
I am not sure that this process yielded useful insights beyond what I would have gotten from reading the texts. Given that there were only six volumes in this particular corpus, reading each article is not so time-consuming as to be infeasible. Indeed, even if I had not read the articles word for word but had simply read their titles and abstracts, I think I would have as firm (or as fuzzy) an idea of what this journal is about as I got by running the texts through the topic modeling tool and then guessing what topic each line in the tool's results file signifies.
Because of the above, and because my grasp of what the topic modeling tool is actually doing is still quite poor, I am not inclined to repeat this topic mapping process. However, if I were to study this methodology further and use the tool again, I would use it on a text where I already have some understanding of what the text covers. For instance, I would like to see what the topic modeling process reveals to be the topics of Title 17 of the United States Code, more commonly known as the federal copyright statute.