Latent Dirichlet Allocation using Scikit-learn

Avinash Navlani
4 min read · Dec 12, 2022

In this tutorial, we will focus on Latent Dirichlet Allocation (LDA) and perform topic modeling using Scikit-learn. LDA is an unsupervised learning algorithm that discovers a blend of different themes or topics in a set of documents.

What is Latent Dirichlet Allocation?

Latent Dirichlet Allocation is one of the most popular techniques for topic modeling. It can be viewed as a probabilistic matrix factorization approach: LDA decomposes a high-dimensional Document-Term Matrix (DTM) into two lower-dimensional matrices, M1 and M2. In the usual formulation, M1 is a document-topic matrix and M2 is a topic-term matrix.
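To make the shapes concrete, here is a minimal sketch of the factorization idea using NumPy. It does not fit anything: the matrices are filled with random Dirichlet draws, and the sizes `m`, `n`, and `k` are illustrative assumptions, not values from this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, k = 4, 6, 2  # documents, vocabulary size, topics (illustrative values)

# M1: document-topic matrix, one topic distribution per document (m x k)
M1 = rng.dirichlet(alpha=np.ones(k), size=m)

# M2: topic-term matrix, one word distribution per topic (k x n)
M2 = rng.dirichlet(alpha=np.ones(n), size=k)

# The product gives, for each document, a probability for every word (m x n)
word_probs = M1 @ M2

print(M1.shape, M2.shape, word_probs.shape)
print(word_probs.sum(axis=1))  # each row sums to 1, up to floating-point error
```

Because every row of M1 and M2 is a probability distribution, every row of their product is one as well, which is what makes the factorization "probabilistic".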

In the vector space model, a collection of text documents can be represented as a document-term matrix. This m × n matrix has one row for each of the m documents D1, D2, D3 … Dm and one column for each of the n vocabulary words W1, W2, W3 … Wn. The value in cell (i, j) is the frequency count of word Wj in document Di.

How does LDA work?

LDA iterates over every word in every document and tries to assign it to the most suitable topic. The main idea behind LDA is that each document is a mixture of topics and each topic is a mixture of words.

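For each word w in document d and each candidate topic t, LDA uses two probabilities. The first, p(topic t | document d), is the proportion of words in document d that are currently assigned to topic t. The second, p(word w | topic t), is the proportion of assignments to topic t, over all documents, that come from word w. The word is then reassigned to a topic with probability proportional to the product of the two.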
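The two probabilities can be sketched as a simplified Gibbs-style sampling loop in plain Python. This is an illustration of the idea under stated assumptions, not what scikit-learn runs internally (its `LatentDirichletAllocation` uses variational Bayes). The toy corpus of word ids, the smoothing values `alpha` and `beta`, and the number of sweeps are all hypothetical.

```python
import random

random.seed(0)

# Toy corpus: each document is a list of word ids (hypothetical data)
docs = [[0, 1, 2, 0], [0, 1, 1], [3, 4, 5, 3], [4, 5, 5]]
V, K = 6, 2              # vocabulary size, number of topics
alpha, beta = 0.1, 0.01  # smoothing hyperparameters (illustrative values)

# Random initial topic for every word token, plus the count tables
z = [[random.randrange(K) for _ in doc] for doc in docs]
n_dt = [[0] * K for _ in docs]      # words in document d assigned to topic t
n_tw = [[0] * V for _ in range(K)]  # times word w is assigned to topic t
n_t = [0] * K                       # total words assigned to topic t
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        n_dt[d][t] += 1
        n_tw[t][w] += 1
        n_t[t] += 1

for _ in range(50):  # sweeps over the whole corpus
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            # Remove the current assignment before re-estimating it
            t = z[d][i]
            n_dt[d][t] -= 1
            n_tw[t][w] -= 1
            n_t[t] -= 1

            weights = []
            for k in range(K):
                # p(topic k | document d): share of d's other words in topic k
                p_topic_given_doc = (n_dt[d][k] + alpha) / (len(doc) - 1 + K * alpha)
                # p(word w | topic k): share of topic k's assignments that are w
                p_word_given_topic = (n_tw[k][w] + beta) / (n_t[k] + V * beta)
                weights.append(p_topic_given_doc * p_word_given_topic)

            # Sample a new topic in proportion to the product of the two
            t = random.choices(range(K), weights=weights)[0]
            z[d][i] = t
            n_dt[d][t] += 1
            n_tw[t][w] += 1
            n_t[t] += 1

print(z)  # final topic assignment for every word token
```

The `alpha` and `beta` terms keep every weight strictly positive, so a topic that currently has no words in a document can still be chosen.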
