Guide to Topic Modeling

Sagar Panwar
Jul 13, 2021 · 6 min read


Introduction

Topic modeling is a machine learning technique that aids in searching and sorting documents. Finding the topic of a document, or organizing an extensive collection of documents, is hard to do by hand. Topic modeling helps by discovering the abstract “topics” that occur in a pool of documents.

Problem With Conventional Models

1. They work on a bag-of-words (BOW) representation, so they cannot handle documents that contain unseen words.

2. To work on multilingual data, they require a separate vocabulary for each language, which is computationally expensive.

Discussed Topics

Top2Vec
Contextualized Topic Modeling
Naive Bayes Classifier
LDA

Top2Vec

Top2Vec is a state-of-the-art model that is easy to implement. First, it creates jointly embedded word and document vectors using one of Doc2Vec, Universal Sentence Encoder, or a BERT Sentence Transformer. Next, the dimensionality of the embeddings is reduced with UMAP. Then it finds the densest areas of the reduced embedding space; the centroid of each dense area is a topic vector, and the words closest to it are that topic's words. It also works well on short texts.
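To make the pipeline concrete, here is a minimal conceptual sketch of the same idea assembled from off-the-shelf parts (sentence-transformers, umap-learn, hdbscan, scikit-learn). It is not Top2Vec's actual implementation; the encoder name, the toy documents and vocabulary, and all parameter values are assumptions chosen for illustration.

# Conceptual sketch of the Top2Vec pipeline: embed -> reduce -> cluster -> topic words.
import numpy as np
import umap
import hdbscan
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the rocket launch was delayed by bad weather",
    "nasa is planning a new mission to the moon",
    "astronauts boarded the space station",
    "the satellite reached orbit successfully",
    "the team won the championship game",
    "a thrilling football match last night",
    "the striker scored twice in the final",
    "fans celebrated the league title",
]
vocab = ["rocket", "moon", "orbit", "astronaut", "football", "game", "championship", "goal"]

# Jointly embed documents and words with the same encoder (model name is an assumption).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs)
word_vecs = encoder.encode(vocab)

# Lower the embedding dimension, then find dense areas (clusters) of documents.
reduced = umap.UMAP(n_neighbors=3, n_components=2, metric="cosine").fit_transform(doc_vecs)
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(reduced)

# The centroid of each dense area acts as a topic vector;
# the closest word vectors are that topic's words.
for topic in sorted(set(labels) - {-1}):
    centroid = doc_vecs[labels == topic].mean(axis=0, keepdims=True)
    sims = cosine_similarity(centroid, word_vecs)[0]
    print(topic, [vocab[i] for i in np.argsort(-sims)[:3]])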

Installation of Top2Vec

pip install top2vec
# pre-trained universal sentence encoder
pip install top2vec[sentence_encoders]
# pre-trained BERT sentence transformer
pip install top2vec[sentence_transformers]
# approximate nearest-neighbor indexing for faster search
pip install top2vec[indexing]

Training on the Data Set

from top2vec import Top2Vec
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)
model.get_num_topics()
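Once training finishes, the built-in search methods can be used to inspect the discovered topics. A minimal sketch, assuming the model trained above (the keyword "space" is only an example and must exist in the learned vocabulary):

# top words for the first five topics
topic_words, word_scores, topic_nums = model.get_topics(5)
# topics most similar to a keyword
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["space"], num_topics=3)
# documents most representative of a given topic
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=0, num_docs=5)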

Using a Pre-trained Model

universal-sentence-encoder
universal-sentence-encoder-multilingual
distiluse-base-multilingual-cased

1. For large data sets and data sets with a very unique vocabulary, doc2vec can produce better results.
2. The universal sentence encoder performs well in most cases and is also fast.
3. The “distiluse-base-multilingual-cased” encoder is suggested for multilingual documents and for languages not covered by the “universal-sentence-encoder-multilingual” encoder. Sentence transformers are slower than the other options.
model = Top2Vec(documents, embedding_model='universal-sentence-encoder')

Contextualized Topic Modeling

This model combines embedding techniques such as BERT with a neural topic model. The embedding space is used to represent the documents, so the model even works on unseen words. It is a modification of ProdLDA: the document representation is used as input, and the model is trained to reconstruct the original BOW of the document. After training, ZeroShotTM can generate topic representations for test documents.


Installation of Contextualized Topic Models

pip install contextualized-topic-models                       
pip install nltk

Train the Model

After creating an array of documents, WhiteSpacePreprocessing removes the stop words and punctuation from them. The pre-processed documents keep only the 2,000 most frequent words as the vocabulary.

# Import Libraries
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.models.ctm import ZeroShotTM
import nltk

# stopwords to be removed
nltk.download('stopwords')

# the file that contains our documents
text_file = "dbpedia_sample_abstract_20k_unprep.txt"

# we load the documents
documents = [line.strip() for line in open(text_file, encoding="utf-8").readlines()]

# removes stopwords and punctuation
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')

# returns the pre-processed and the unprocessed documents,
# plus a vocab of the 2,000 most frequent tokens;
# these tokens represent the topics
preprocessed_documents, unpreprocessed_documents, vocab = sp.preprocess()

TopicModelDataPreparation helps us create the training dataset. With ZeroShotTM we can then train our model. This model works well even with documents in a different language.

# contextualized model
tp = TopicModelDataPreparation("distiluse-base-multilingual-cased")

# create the training set
training_dataset = tp.create_training_set(
    text_for_contextual=unpreprocessed_documents,
    text_for_bow=preprocessed_documents)

# train the model
# input_size is the size of our vocabulary
# bert_input_size is the size of the embedding of the BERT model
# n_components is the number of topics we are going to get
# num_epochs is the number of epochs we are going to train our model
ctm = ZeroShotTM(input_size=len(tp.vocab), bert_input_size=512,
                 n_components=50, num_epochs=100)

# run the model
ctm.fit(training_dataset)

# Show topics
ctm.get_topics()
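Because the topics are conditioned on the contextual embeddings rather than on the BOW alone, a trained ZeroShotTM can also assign topics to unseen documents, including documents in other languages. A rough sketch is below; note that the data-preparation method names differ between versions of contextualized-topic-models (newer releases expose transform and get_doc_topic_distribution, while older ones name these steps differently), so treat the exact calls as assumptions to check against the installed version.

# unseen documents, possibly in a different language
test_documents = ["Una breve frase di prova in italiano.", "Another unseen English document."]
# prepare them with the same TopicModelDataPreparation object
testing_dataset = tp.transform(text_for_contextual=test_documents)
# topic distribution for each test document
topic_distributions = ctm.get_doc_topic_distribution(testing_dataset, n_samples=10)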

Naive Bayes Classifier

It is an older technique. Naive Bayes classifiers are a family of classification algorithms based on Bayes' theorem. A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
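To make that independence assumption concrete, here is a tiny self-contained sketch (the word counts and priors are made up) that scores a document by multiplying per-word likelihoods with a class prior, i.e. P(class | doc) ∝ P(class) · Π P(word | class), using Laplace smoothing:

# Toy Naive Bayes with made-up counts: each word contributes independently to the score.
word_counts = {
    "sports": {"game": 10, "team": 8, "win": 6},
    "space": {"rocket": 9, "orbit": 7, "launch": 5},
}
priors = {"sports": 0.5, "space": 0.5}
vocabulary = {w for counts in word_counts.values() for w in counts}

def score(doc_words, label):
    total = sum(word_counts[label].values())
    p = priors[label]
    for w in doc_words:
        # independence assumption: multiply the per-word probabilities
        p *= (word_counts[label].get(w, 0) + 1) / (total + len(vocabulary))
    return p

doc = ["rocket", "launch", "team"]
print({label: score(doc, label) for label in priors})  # the higher score wins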

Document Classification

Document classification is a form of NLP in which we classify documents into different pre-assigned categories. This makes the documents easier to sort and manage. A classifier is trained on a dataset so that it can predict a document's category.

Preprocessing

This step is crucial for creating the document vectors. A common way to do this is TF-IDF. The TF-IDF weight of a word increases with the number of times the word appears in the document, but decreases with how frequently the word appears across the entire document set. After that, we can apply various classification algorithms.

# Pre-processing
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf (documents is a list of raw text strings)
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
X = vectorizer.fit_transform(documents)

# Features and weights
feature_names = vectorizer.get_feature_names()
dense = X.todense()
denselist = dense.tolist()

# top 100 features of the first document
df = pd.DataFrame()
df['feature'] = feature_names
df['value'] = denselist[0]
df = df.sort_values(by=['value'], ascending=False)
df = df.head(100)

Naive Bayes

Naive Bayes is very useful for small datasets. Convert the top features into embeddings (with the universal sentence encoder) and multiply each embedding by the feature's TF-IDF weight. The resulting vector (= embed(word) * weight) is used for training. Use Gaussian Naive Bayes for training, since it performs well with continuous data.

import tensorflow_hub as hub
import tensorflow as tf
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics

# universal-sentence-encoder
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Prepare the data for training
# word = embed(word) * weight
# df_input is assumed to have 'words', 'value' (TF-IDF weight) and 'label' columns
labels = []
words = []
for _, row in df_input.iterrows():
    val_em = embed([row['words']])[0].numpy() * row['value']
    words.append(val_em)
    labels.append(row['label'])

# Split the data-set
x_train, x_test, y_train, y_test = train_test_split(words, labels, test_size=0.8, random_state=109)

# Create a Gaussian Classifier
gnb = GaussianNB()

# Train the model using the training sets
gnb.fit(x_train, y_train)

# Predict the response for test dataset
y_pred = gnb.predict(x_test)

# Model Accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

LDA

Latent Dirichlet Allocation (LDA) is an unsupervised topic model that represents each document as a mixture of topics and each topic as a distribution over words. LDA is particularly useful for finding a reasonably accurate mixture of topics within a given document.

Pre-processing

Before implementing the LDA model, the documents need to be processed and a corpus created. The pre-processing steps are listed below.

Tokenize
Lemmatize the tokens
Compute bigrams
Create corpus

from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import Phrases
from gensim.corpora import Dictionary

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

# Lemmatize the documents.
lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

# Compute bigrams.
# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)

# Remove rare and common tokens.
# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

Training

# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = None # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0] # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)
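
After training, the learned topics can be inspected with gensim's built-in helpers. A short sketch, assuming the model and corpus created above:

# Top words of each learned topic.
model.print_topics(num_topics=num_topics, num_words=5)
# Topics sorted by their coherence score.
top_topics = model.top_topics(corpus)
# Topic distribution of a single document.
doc_topics = model[corpus[0]]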
