Ted Underwood’s “Topic Modeling Made Just Simple Enough” has this to say about the algorithm behind topic modeling, one of the most frequently used techniques in the digital humanities: “humanists often have to take topic modeling on faith”. Despite the fact that there exist “several good posts out there that introduce the principle of the thing . . . it’s a long step up from those posts to [understanding the algorithm] mathematically”.
What does it mean to take a tool that one uses on faith? I can’t answer for other digital humanists, including those who are experienced practitioners rather than brand-new students in the field. I can say that, having spent one class period listening to three very experienced digital humanities scholars explain the thing, and being just a few days shy of having to use the tool for an assignment, I can’t answer the most basic question about Mallet, a tool that implements the topic modeling algorithm: What does it do?
I don’t know. Mallet is a black box that spits out words and arranges them in lines, where somehow the words in one line are supposed to signify some common theme and differentiate themselves from the words in another line. How does it do that?
I don’t know. The answer I got in class for this question was, “Do you know math?” Collectively, we do not. Thus followed spontaneous verbal explanations, unaided by blackboards or dummy examples, that left me no more enlightened than before. Now, after having done some reading on my own, I still cannot claim to have reached any enlightenment. Why is probability involved? What are the priors that the algorithm is operating on? What counts as a co-occurrence? What are the metrics for judging how well the parameters that one has fed into the black box worked? How does one know what parameters to choose? I don’t know. I don’t know. I don’t know. I don’t know.
But here I am, jumping head-first into using a tool whose workings I haven’t the roughest idea of. To put this in relative terms, I don’t claim to understand the Google algorithm (big trade secret & all). If called upon to explain it, I could only do so in the most general of terms, conceptually highlighting how Google does more than just count the occurrences of the search terms on a web page, and getting very many details very, very wrong. Yet I feel that I can explain the Google algorithm (Google!) better than I can Mallet.
Scary, isn’t it?