Nowadays, most of us have access to large volumes of information, particularly in the form of text. This wealth of written data is useful when looking for specific information. However, when the goal is to summarise what unlabelled text is actually about, the size of a dataset necessitates the use of automated solutions for categorisation.
This task falls within Natural Language Processing (NLP), the field of computer science concerned with computational methods for processing and interpreting large collections of natural language text. Given a sufficient amount of labelled data, text classification is a relatively easy problem to solve. When no labelled data is available, however, unsupervised methods for discovering semantic structures in text are the only option. This blog post gives a taste of topic modelling and uses a simple example to illustrate how to interpret the results obtained with one of the popular topic modelling methods.
One of the interpretations of topic in linguistics defines it as the focus of a discussion. For example, when you hear a conversation about someone celebrating a victory with their shoe on a podium, you might be able to piece together that the discussion is about motorsport. For humans this is a relatively simple task, given some level of general knowledge and experience. For a computer, however, this task is less simple, given the absence of the ontological knowledge required to infer the relationship between the meaning of words and the underlying concepts.
NLP methods identify topics as semantic structures in a text, discovered primarily using statistical information gathered from a given dataset. One of the commonly used topic modelling approaches is Latent Dirichlet Allocation (LDA), on which this blog post focuses. In LDA, a topic is defined as a computationally derived probability distribution over words, in which words that tend to occur together receive high probability. These words are sourced from the input documents and aggregated using statistical and probabilistic methods. For example, the words “card”, “credit”, “pay”, “check”, “month”, “update”, “paid” and “confirm” may be identified by a topic model as constituting a topic, and can be grouped under a label such as payments/credit card details. Unfortunately, if no labelled data is available for label assignment, the results of topic modelling require further interpretation by humans.
Getting a solid grasp on the theoretical foundations of the LDA method requires careful reading and reflection on the original paper. However, a working understanding of the concept can be gained with the aid of illustrative examples. This introduction provides a good starting point. This blog post provides an alternative example for understanding the main idea of topic modelling with LDA.
Before going any further, let’s recap: a document is represented as a blend of topics, and a topic is a blend of relevant words. Imagine there’s a hungry (both literally and for knowledge) individual called Bob, who is seeking to crack the secret of restaurants from the Michelin Red Guide, specifically trying to understand what types of cuisine make them successful. It is cost-prohibitive to dine in every restaurant, so Bob resorts to other means. He knows that the LDA model is not limited to text and has found applications in bioinformatics and image classification, so he thinks it might also prove useful in his endeavour. Bob recognises the analogy: a restaurant’s menu (document) is a blend of dish types (topics), and a dish is a blend of ingredients (words). He concludes that the LDA model seems applicable.
To profile the cuisines, Bob devises a plan. To avoid dealing with infinite possibilities, he picks a finite number of cuisines to learn about and assumes that an average dish typical for a given cuisine contains a fixed number of ingredients; in this case, five. The very first step could be to randomly guess which ingredients are ‘typical’ for a given cuisine and which cuisines are typical for given restaurants. However, to make things a bit easier, Bob decides to gather some clues. He sneaks into back alleys and peeks at the ingredients being delivered to the restaurants and the actions being performed in their kitchens. For example, for one restaurant, Bob notices that the majority of produce is delivered from local fishmongers and includes oysters, lobsters and caviar. In addition, fresh produce from local organic farmers is also supplied. This suggests that the restaurant’s menu includes at least some proportion of seafood dishes and possibly salads.
After performing similar observations for the remaining restaurants, Bob identifies ingredients’ delivery patterns. Having some background in cooking, Bob has an idea of which ingredients are typical for certain dishes and cuisines. At the same time, for some ingredients, it turns out to be less clear for which dishes they are sourced. For example, Bob notices that many restaurants order freshly laid eggs to use in a variety of recipes – for sabayon, pasta, creams and more.
To recreate a restaurant’s menu, Bob needs to first determine the proportion of types of dishes offered by the restaurant, say 50% meat and poultry recipes, 35% salads, and 15% desserts. Then, Bob settles on a finite number of ingredients that he believes constitute each recipe. At this point, Bob already has an approximate understanding of cuisine distribution and the distribution of ingredients per cuisine.
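Bob's mental picture of how a menu comes about can be written down as a short sampling routine. The sketch below is only an illustration: the cuisine proportions reuse the 50/35/15 split from the text, while the ingredient probabilities and all names (`menu_proportions`, `cuisine_ingredients`, `generate_deliveries`) are made up for the example. It also keeps both distributions fixed, whereas full LDA additionally draws them from Dirichlet priors.

```python
import random

# Cuisine (topic) proportions for one restaurant (document), as in the text.
menu_proportions = {"meat_poultry": 0.50, "salads": 0.35, "desserts": 0.15}

# Hypothetical ingredient (word) probabilities per cuisine; values are invented.
cuisine_ingredients = {
    "meat_poultry": {"beef": 0.4, "chicken": 0.3, "butter": 0.2, "eggs": 0.1},
    "salads": {"lettuce": 0.5, "tomato": 0.3, "oil": 0.1, "eggs": 0.1},
    "desserts": {"flour": 0.3, "sugar": 0.3, "butter": 0.2, "eggs": 0.2},
}


def sample_from(distribution, rng):
    """Draw one key from a {key: probability} mapping."""
    keys = list(distribution)
    weights = [distribution[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate_deliveries(n_items, rng):
    """Generate (cuisine, ingredient) pairs: pick a topic, then a word from it."""
    items = []
    for _ in range(n_items):
        cuisine = sample_from(menu_proportions, rng)
        ingredient = sample_from(cuisine_ingredients[cuisine], rng)
        items.append((cuisine, ingredient))
    return items


rng = random.Random(0)  # fixed seed so the run is repeatable
deliveries = generate_deliveries(10, rng)
for cuisine, ingredient in deliveries:
    print(f"{ingredient:<8} (for {cuisine})")
```

Note that "eggs" can be generated by every cuisine, which mirrors Bob's difficulty in deciding which dishes the eggs are ordered for. Topic modelling runs this story in reverse: it sees only the ingredients and has to recover both distributions.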
To further refine these distributions, Bob focuses on each restaurant one by one and calculates two values:
Having these values, Bob iteratively generates a more accurate representation of the cuisine-per-restaurant and ingredients-per-cuisine distributions, using the product of the two values as a prior. For example, Bob notices that some restaurants get big orders of eggs delivered, as well as flour and butter. In many cases these ingredients are used for baked goods, and only in some instances for something else, such as pasta and sauces. The illustrative example described below shows what the output of an LDA topic model applied to this case looks like and how to interpret it.
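One common way to make this kind of iterative refinement concrete is collapsed Gibbs sampling. The sketch below assumes that procedure; the post does not name one, and gensim itself uses a variational method instead. Here the two values are taken to be how prevalent a cuisine currently is in the restaurant and how strongly the ingredient is currently tied to that cuisine. The function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, and the toy data are all illustrative.

```python
import random


def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    topics = list(range(n_topics))
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in topics]
    topic_total = [0] * n_topics

    # Start from a random guess of which topic each token belongs to.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.choice(topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment before resampling it.
                k_old = assignments[d][i]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1
                weights = []
                for k in topics:
                    # Value 1: how prevalent topic k is in this document.
                    topic_in_doc = doc_topic[d][k] + alpha
                    # Value 2: how strongly word w is tied to topic k.
                    word_in_topic = (topic_word[k][w] + beta) / (
                        topic_total[k] + vocab_size * beta
                    )
                    weights.append(topic_in_doc * word_in_topic)
                # Resample the topic in proportion to the product of the two.
                k_new = rng.choices(topics, weights=weights, k=1)[0]
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1
    return doc_topic, topic_word


# Hypothetical delivery lists for three restaurants.
restaurants = [
    ["eggs", "flour", "butter", "sugar", "flour", "butter"],
    ["oysters", "lobster", "caviar", "oysters", "lobster"],
    ["eggs", "flour", "sugar", "butter", "oysters"],
]
doc_topic, topic_word = gibbs_lda(restaurants, n_topics=2)
for counts in doc_topic:
    total = sum(counts)
    print([round(c / total, 2) for c in counts])
```

Each printed row is one restaurant's estimated cuisine mix. With so little data the result depends heavily on the seed, so it should be read as a demonstration of the mechanics rather than as a meaningful model.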
To start topic modelling with LDA, we need a dataset. To mimic the dataset of haute cuisine enthusiast Bob, we use a subset of about 650 cooking recipe instructions randomly sourced online. The dataset consists of one block of text per recipe, each describing a sequence of actions to perform using the recipe's ingredients. Unlike Bob, then, we apply the LDA topic model to actual text.
To start off, a set of typical preprocessing and cleaning steps is performed on the dataset, such as removing punctuation, uninformative words (stopwords), etc. Once the dataset is prepared, the LDA model is created using the gensim library. In essence, a model is represented by two probability distributions: a document-topic distribution (for example, document A consists of 60% of topic T1, 20% of topic T2, and 10% of topic T3) and a topic-word distribution (for example, topic T1 is defined by the words W1, W2, W3, and so on, with corresponding probabilities).
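The cleaning steps mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the pipeline actually used for the recipe dataset: the stopword list is a tiny hypothetical one, and the function names `preprocess` and `bag_of_words` are invented for the example.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "then", "to", "until", "with"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def bag_of_words(tokens):
    """Count how often each remaining word occurs in one document."""
    return Counter(tokens)


recipe = "Whisk the eggs and the sugar in a bowl, then fold in the flour."
tokens = preprocess(recipe)
print(tokens)
print(bag_of_words(tokens).most_common(3))
```

The result is a list of tokens per document, reduced to word counts. Word order is discarded at this point, which is why LDA is often described as a bag-of-words model.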
There are several inconveniences associated with unsupervised topic modelling, and the LDA model specifically. Among the main ones are the need to estimate the number of topics beforehand, and the complexities of interpreting the results (that is, assigning labels to each topic).
For our sample dataset of cooking instructions, the choice of five topics seems reasonable. We took the initiative and assigned labels to each topic based on the words they are defined by. The model representation is included in the interactive visualisation below (created with the pyLDAvis library). The circles on the left represent the topics. Once a circle is selected, the bar chart on the right highlights the words that constitute the topic. The reader is encouraged to look at the words to see if they find the topic labels reasonable. Two topics (Topic 1: wings, oil, eggs, … and Topic 5: oil, shrimp, simmer, …) were somewhat incoherent and were therefore labelled misc1 and misc2 respectively.
Once the topic model is created, one can use it to infer the distribution of topics in a new document. To see how this works with our pretrained model, let’s use an interesting pie recipe first published in 1829 as an unseen document whose topic profile we wish to obtain. The recipe is a pie, which implies baking, yet it also includes fruits and meat, an unusual combination by modern cooking standards (and one that would certainly have confused Bob).
We use the text of the pie recipe, preprocessed in the same way as our sample dataset, and our pretrained LDA model to estimate the topic distribution for this dish. The text of the recipe is included below. The highlighted words are colour-coded to indicate their most probable topic. Our pretrained model identifies this recipe as being 51% about baking and 30% about meat, which seems reasonable. It is worth noting that although some words are highlighted in orange and grey, and are therefore associated with pasta and misc1, the recipe overall is not strongly associated with these topics, because their probabilities are very small.
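The idea behind this inference step can be sketched without any library. The example below holds a set of topic-word probabilities fixed and estimates the topic mix of a new document with a simple EM-style loop, then assigns each word its most probable topic (the basis for the colour coding). It is a simplified stand-in, not gensim's own variational inference, and the two topics, their probabilities, the token list, and the function names are all hypothetical.

```python
# Hypothetical topic-word probabilities; a trained model would supply these.
topic_words = {
    "baking": {"flour": 0.30, "sugar": 0.25, "butter": 0.20,
               "fine": 0.15, "pound": 0.09, "beef": 0.01},
    "meat": {"beef": 0.40, "pound": 0.25, "fine": 0.15,
             "butter": 0.10, "flour": 0.05, "sugar": 0.05},
}


def infer_topics(tokens, topic_words, n_iters=50, floor=1e-6):
    """Estimate the topic proportions of a new document, topics held fixed."""
    topics = list(topic_words)
    # Words the model has never seen carry no information and are skipped.
    known = [t for t in tokens if any(t in topic_words[k] for k in topics)]
    theta = {k: 1.0 / len(topics) for k in topics}
    if not known:
        return theta
    for _ in range(n_iters):
        totals = dict.fromkeys(topics, 0.0)
        for w in known:
            # Share of this word credited to each topic under the current mix.
            scores = {k: theta[k] * topic_words[k].get(w, floor) for k in topics}
            norm = sum(scores.values())
            for k in topics:
                totals[k] += scores[k] / norm
        theta = {k: totals[k] / len(known) for k in topics}
    return theta


def most_probable_topic(word, theta, topic_words, floor=1e-6):
    """Pick the topic that best explains a word within this document."""
    return max(theta, key=lambda k: theta[k] * topic_words[k].get(word, floor))


pie = ["flour", "sugar", "fine", "sugar", "butter", "beef", "pound", "flour"]
theta = infer_topics(pie, topic_words)
print({k: round(v, 2) for k, v in theta.items()})
for word in ["fine", "beef"]:
    print(word, "->", most_probable_topic(word, theta, topic_words))
```

In this toy setup "fine" is equally likely under both topics, so it is assigned to whichever topic dominates the document. That is one way a neutral word can end up highlighted as belonging to a specific topic.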
This example inference illustrates several important points. First, the reader might wonder why fruit-related topics didn’t come up, since the text clearly contains words such as “orange” and “citron”. The reason lies in the dataset used to create the model: there were not enough fruit-related recipes in the training dataset for the model to identify fruit as a probable topic (so related words likely landed in the categories misc1 and misc2). Second, it might seem strange for some neutral words to be associated with a particular topic.
It is one thing for words such as “meat” and “pound” to be associated with the topic meat, but quite another for “fine” to be associated with baking or “large” with pasta. There is, however, a possible explanation: perhaps the word “fine” appears often in the context of baking, as in “fine sugar” or “fine flour”. The word “large” is likely to be related to pasta because recipes commonly suggest using a large pot of water to give pasta enough room to boil. Another surprise is that “beef” is not highlighted as meat. This is a good illustration of the fact that an algorithmic approximation of topics lacks the human ability to draw conclusions based on common knowledge.
This blog post briefly described the main idea behind the LDA topic model and showed the results obtained for a sample dataset. Given the sample dataset, the model and its application are illustrative. However, applying the LDA technique to real-world datasets can be somewhat less intuitive, since the identified topics might be less coherent and far from a human’s understanding of a topic.
To improve the interpretability of the results, one would need to estimate the number of topics and carefully preprocess the dataset, removing not only basic stopwords but also some domain-specific words. Overall, LDA topic modelling is a useful tool for gaining insight into the topics that may characterise an unlabelled dataset.
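One simple heuristic for finding domain-specific words to omit is to look for words that occur in nearly every document, since they do little to tell topics apart. The sketch below assumes that heuristic. The 80% threshold, the function names, and the toy documents are illustrative choices, not settings used for the recipe dataset.

```python
from collections import Counter


def frequent_terms(docs, max_doc_share=0.8):
    """Return words that appear in more than max_doc_share of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    threshold = max_doc_share * len(docs)
    return {w for w, n in doc_freq.items() if n > threshold}


def drop_terms(docs, terms):
    """Remove the given terms from every tokenised document."""
    return [[w for w in doc if w not in terms] for doc in docs]


# Hypothetical tokenised recipes; "add" occurs in every one of them.
docs = [
    ["add", "flour", "sugar", "bake"],
    ["add", "beef", "salt", "roast"],
    ["add", "pasta", "salt", "boil"],
    ["add", "shrimp", "oil", "simmer"],
]
domain_stopwords = frequent_terms(docs, max_doc_share=0.8)
cleaned = drop_terms(docs, domain_stopwords)
print(domain_stopwords)
print(cleaned[0])
```

gensim's `Dictionary.filter_extremes` serves a similar purpose for models built with that library. Whichever tool is used, the resulting word list is worth reviewing by hand, because a frequent word is not always an uninformative one.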
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” In: Journal of Machine Learning Research, 3, p. 993, 2003.
Lin Liu, Lin Tang, Wen Dong, Shaowen Yao, and Wei Zhou. “An Overview of Topic Modeling and its Current Applications in Bioinformatics.” In: SpringerPlus, 5, p. 1608, 2016.
Nikhil Rasiwasia and Nuno Vasconcelos. “Latent Dirichlet Allocation Models for Image Classification.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, p. 2665, 2013.
William A. Henderson. “Modern Domestic Cookery, and Useful Receipt Book: Adapted for Families.” Enl. and improved by M. D. Hughson. New York: T. Kinnersley, 1829.
Header image courtesy of Brooke Lark.
Thanks to Matt Hannah, Alex Cummaudo, Andrew Vouliotis, Tanya Frank, and Shannon Pace for reviewing this post and providing suggestions.