Introduction
Topic modeling is a highly effective technique in machine learning and natural language processing. It takes a collection of documents as input and searches for the abstract topics that occur within them. By emphasizing the underlying structure of the text, it brings to light themes and patterns that may not be immediately apparent.

To analyze the content of large collections of documents, such as thousands of tweets, topic modeling algorithms rely on statistical techniques to find patterns in the text. By examining the frequency and co-occurrence of words in documents, these algorithms group documents into a set of discovered topics. As a result, the content becomes more structured and understandable, making it easier to identify underlying themes and patterns in your data.
Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-Negative Matrix Factorization (NMF) are some conventional topic modeling techniques. This blog article, however, uses BERT to model topics.
Learn more: Topic Modeling Using Latent Dirichlet Allocation (LDA)
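For readers who want to contrast BERT-based topic modeling with one of the conventional techniques mentioned above, here is a minimal, hedged sketch of NMF-based topic modeling with scikit-learn (assuming a recent scikit-learn version; the vectorizer settings and topic count are illustrative choices, not tuned values, and the file path is the same sample dataset used later in this article):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = pd.read_csv('../input/abc-news-sample/abcnews_sample.csv')["headline_text"]

# Turn the headlines into a TF-IDF matrix and factorize it into 10 topics
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(docs)
nmf = NMF(n_components=10, random_state=42).fit(tfidf)

# Print the top words of each discovered topic
terms = vectorizer.get_feature_names_out()
for idx, component in enumerate(nmf.components_):
    top_words = [terms[i] for i in component.argsort()[-5:][::-1]]
    print("Topic {}: {}".format(idx, ", ".join(top_words)))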
Learning Objectives
Here are the learning objectives for topic modeling with BERT:
- Learn the basics of topic modeling and how it is used in NLP.
- Learn the basics of BERT and how to create document embeddings.
- Preprocess text data to prepare it for the BERT model.
- Use the [CLS] token to extract document embeddings from the BERT output.
- Use clustering methods (such as K-means) to group related documents and find hidden topics.
- Use appropriate metrics to assess the quality of the topics you create.
Through these learning objectives, participants will gain hands-on experience using BERT to model topics, equipping them to analyze and extract hidden topics from large text datasets.
This article was published as part of the Data Science Blogathon.
Contents
- Introduction
- Loading data
- Topic modeling
- Visualization of topics
- 1. Topic terms
- 2. Intertopic distance map
- 3. Visualize the hierarchy of topics
- Topic search
- Conclusion
- Frequently asked questions and answers
Loading data
The data is news content from ABC (Australian Broadcasting Corporation), available on Kaggle and covering eight years of headlines. It contains two important columns: publish_date, the date the article was published, in yyyyMMdd format, and headline_text, the text of the headline in English. This is the information the topic model will use.
import pandas as pd

# Read the data
data = pd.read_csv('../input/abc-news-sample/abcnews_sample.csv')
data.head()

# Create a new column containing the word count of each headline
data["headline_text_len"] = data["headline_text"].apply(lambda x: len(x.split()))
print("The longest headline has: {} words".format(data.headline_text_len.max()))

# Visualize the headline length distribution
import seaborn as sns
import matplotlib.pyplot as plt

sns.displot(data.headline_text_len, kde=False)

for idx in data.sample(3).index:
    headline = data.iloc[idx]
    print("Headline #{}:".format(idx))
    print("Publish date: {}".format(headline.publish_date))
    print("Text: {}\n".format(headline.headline_text))

Topic modeling
In this example, we will cover the basics of BERTopic and the steps required to build a robust topic model.
Learn more: A Beginner's Guide to Topic Modeling in Python
Model training
The first step is to initialize the BERTopic model. Our documents are in English, so we set the language to English. If you want a model that supports multiple languages, use language="multilingual" instead.
We will also calculate topic probabilities. However, this can drastically slow down BERTopic for large datasets (more than 100,000 documents); you can turn it off to speed up the model.
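For reference, here is a minimal sketch of the two options mentioned above. The parameter names follow BERTopic's documented API, but defaults may differ between versions, so treat this as an illustration rather than the exact configuration used below:

from bertopic import BERTopic

# Multilingual variant for corpora that are not purely English
multilingual_model = BERTopic(language="multilingual")

# Request per-document topic probabilities; this can be slow on very large
# corpora, so leave it off when speed matters
probabilistic_model = BERTopic(calculate_probabilities=True)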
import warnings
warnings.filterwarnings("ignore")
!pip install bertopic

%%time
from bertopic import BERTopic

model = BERTopic(verbose=True, embedding_model='paraphrase-MiniLM-L3-v2', min_topic_size=7)
headline_topics, _ = model.fit_transform(data.headline_text)

We set verbose to True so that progress messages are shown while the model is fitting. The paraphrase-MiniLM-L3-v2 sentence transformer is the suggested model with the best ratio of speed to performance. Finally, we set min_topic_size to 7, although the default is 10; the higher this value, the fewer clusters or topics are created.
Topic extraction and representation
freq = model.get_topic_info()
print("Number of topics: {}".format(len(freq)))
freq.head()

The table above lists all 54 topics in descending order of size/count, and has three main columns:
- Topic: the topic number, which serves as an identifier. Outliers are denoted by -1 and should be ignored, since they offer no value.
- Count: the number of documents assigned to the topic.
- Name: the name assigned to the topic.
We can retrieve the top terms of each topic along with their c-TF-IDF scores. The higher the score, the more relevant the term is to the topic.
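For context, the c-TF-IDF score is a class-based variant of TF-IDF. Roughly, following the BERTopic paper (the exact implementation may differ slightly between versions):

W(x, c) = tf(x, c) * log(1 + A / f(x))

where tf(x, c) is the frequency of term x within topic c, f(x) is the frequency of term x across all topics, and A is the average number of words per topic.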
a_topic = freq.iloc[1]["Topic"]  # Select the first topic
model.get_topic(a_topic)  # Display its words and their c-TF-IDF scores

We can see that all of the terms in this topic are coherent with its underlying theme, which appears to be firefighters.
Visualization of topics
You can learn more about each topic by visualizing it. We will focus on the visualization options built into BERTopic, which include, among others, term visualization, the intertopic distance map, and hierarchical topic clustering.
1. Topic terms
You can use the c-TF-IDF scores to create a bar chart of the key terms for each topic, which provides an attractive way to compare topics graphically. Below is an illustration for the first six topics.
model.visualize_barchart(top_n_topics=6)

Topic 1 relates to crime, as its most common terms are along the lines of "man", "accused", "murder", and "jail". The same kind of analysis can be done for any of the other topics. The longer a term's horizontal bar, the more relevant that term is to the topic.
2. Intertopic distance map
Those familiar with Latent Dirichlet Allocation may know the LDAvis library, which gives the user an interactive dashboard showing the words and scores associated with each topic. BERTopic achieves the same thing with its visualize_topics() method, and goes even further by showing the distance between topics (the shorter the distance, the more related the topics).
model.visualize_topics()

3. Visualize the hierarchy of topics
Some topics are close to one another, as shown in the intertopic distance map. A question that might come to mind is: how can I reduce the number of topics? The good news is that you can organize these topics hierarchically, which lets you choose how many to keep. Visualizing the hierarchy also makes it easier to understand the relationships between topics.
model.visualize_hierarchy(top_n_topics=30)

Looking at the first level of the dendrogram (level 0), we can see that topics with similar colors are grouped together. For illustration:
- Topics 7 (health, hospital, mental) and 14 (dead, died, crash) are merged because of their closeness.
- Topics 6 (farmers, farm, farming) and 16 (cattle, sheep, meat) are grouped in a similar way.
- These details help the user understand why topics are merged with one another.
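If, after inspecting the hierarchy, you decide that fewer topics would be enough, BERTopic also offers topic reduction. Below is a minimal, hedged sketch; note that the exact reduce_topics signature has changed across BERTopic versions, so check the documentation for the version you have installed:

# Merge similar topics until roughly 30 remain
model.reduce_topics(data.headline_text, nr_topics=30)

# Inspect the reduced set of topics
model.get_topic_info().head()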
Topic search
Once the topic model has been trained, we can use the find_topics method to search for topics that are semantically similar to an input query word or phrase. For example, we can search for the top 3 topics related to the word "politics".
# Select the top 3 most similar topics
similar_topics, similarity = model.find_topics("politics", top_n=3)
- The index of topics in similar_topics is ranked from most similar to least similar.
- Similarity scores are presented in descending order within similarity.
similar_topics

most_similar = similar_topics[0]
print("Most similar topic info: \n{}".format(model.get_topic(most_similar)))
print("Similarity score: {}".format(similarity[0]))

Clearly, the most similar topic contains terms such as "elections", "Trump", and "Obama", which are unmistakably political.
Model serialization and loading
When you are satisfied with the result of your model, you can save it for further analysis by following the instructions below:
%%bash
mkdir './models_directory'

# Save the model in the previously created directory under the name 'my_best_model'
model.save("./models_directory/my_best_model")

# Load the serialized model
my_best_model = BERTopic.load("./models_directory/my_best_model")
my_best_model

Conclusion
In conclusion, topic modeling with BERT provides an effective method for discovering hidden topics in textual data. While BERT was originally developed for a variety of natural language processing tasks, it can be used for topic modeling via document embeddings and clustering methods. Here are the key takeaways from this article:
- The importance of topic modeling is that it enables companies to gain insights and make informed decisions by helping them understand underlying themes and patterns in vast amounts of unstructured textual data.
- While BERT is not a conventional topic modeling approach, it can still provide insightful document embeddings, which are essential for identifying hidden topics.
- BERT creates document embeddings by extracting semantic information from the output [CLS] token. These embeddings give documents a dense representation in vector space.
- Topic modeling with BERT is an ever-evolving field, as new research and technological advances continue to improve its efficiency.
Overall, advancing knowledge of topic modeling with BERT enables scientists, researchers, and data analysts to extract and analyze underlying themes in large text corpora, draw insightful conclusions, and make informed decisions.
Frequently asked questions and answers
Q1: What is BERT and how does it relate to topic modeling?
A1: BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model developed by Google. It is used in natural language processing tasks such as text classification and question answering. In topic modeling, researchers use BERT to create document embeddings that represent the semantic meaning of documents, and then cluster these embeddings to discover hidden topics in the corpus.
Q2: How is BERT different from traditional topic modeling algorithms such as LDA or NMF?
A2: BERT differs from traditional algorithms such as LDA (Latent Dirichlet Allocation) or NMF (Non-Negative Matrix Factorization) in that it is not specifically designed for topic modeling. LDA and NMF explicitly model the process by which documents are generated from topics, while BERT learns contextual word representations through unsupervised training on massive amounts of text.
Q3: Is BERT the best model for topic modeling?
A3: Depending on your use case, BERT can be used for topic modeling, but it's not always the best choice. The choice of the best model depends on factors such as the size of the dataset, computational resources, and the specific goals of the analysis.
Q4: What are document embeddings and why are they important in modeling topics with BERT?
A4: Document embeddings are dense vector representations that capture the semantic meaning of a document. In topic modeling with BERT, document embeddings are generated by extracting the vector representation of the output [CLS] token, which encodes the meaning of the entire document. These embeddings are crucial for clustering similar documents together to reveal hidden topics.
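To make this answer concrete, below is a minimal, hedged sketch of the two steps it describes, using the Hugging Face transformers library and scikit-learn K-means. The model name, toy headlines, and cluster count are illustrative assumptions; BERTopic itself uses sentence-transformer embeddings and a different clustering pipeline under the hood.

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

# Illustrative model choice; any BERT-style encoder exposes a [CLS] vector
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Toy example headlines
docs = ["firefighters battle blaze near sydney",
        "man charged over armed robbery",
        "new hospital funding announced"]

# Encode each document and keep the [CLS] token's hidden state as its embedding
with torch.no_grad():
    inputs = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
    outputs = encoder(**inputs)
    cls_embeddings = outputs.last_hidden_state[:, 0, :].numpy()

# Group the document embeddings; each cluster is a candidate "topic"
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(cls_embeddings)
print(labels)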
The media displayed in this article is not owned by Analytics Vidhya and is used at its sole discretion.