Artificial Vision and Language Processing for Robotics

Topic Modeling

Within NLU, which is a part of NLP, one of the many tasks that can be performed is extracting the meaning of a sentence, a paragraph, or a whole document. One approach to understanding a document is through its topics. For example, if a set of documents is from a newspaper, the topics might be politics or sports. With topic modeling techniques, we can obtain a set of words that represents the various topics. Depending on your set of documents, you will then have different topics represented by different words. The goal of these techniques is to discover the different types of documents in your corpus.

Term Frequency – Inverse Document Frequency (TF-IDF)

TF-IDF is a commonly used NLP model for extracting the most important words from a document. To perform this weighting, the algorithm assigns a weight to each word. The idea of this method is to ignore words that are not relevant to the meaning of the global concept (that is, the overall topic of a text), so those terms will be down-weighted (which means their weights will be reduced). Down-weighting them allows us to find the keywords of the document (the words with the greatest weights).

Mathematically, the algorithm to find the weight of a term in a document is as follows:

w(i,j) = tf(i,j) × log(N / df(i))

Figure 3.19: TF-IDF formula

  • w(i,j): Weight of the term, i, in the document, j
  • tf(i,j): Number of occurrences of the term, i, in the document, j
  • df(i): Number of documents containing the term, i
  • N: Total number of documents

The result is the number of times a term appears in the document, multiplied by the log of the total number of documents divided by the number of documents that contain the term.
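To make the formula concrete, here is a minimal sketch in Python (not from the original text; the toy corpus and term are made up for illustration) that computes this weight by hand:

    import math

    # Toy corpus: each document is a list of tokens (hypothetical example)
    documents = [
        ['the', 'cat', 'sleeps'],
        ['the', 'dog', 'barks'],
        ['the', 'cat', 'and', 'the', 'dog'],
    ]

    def tf_idf(term, doc_index, docs):
        tf = docs[doc_index].count(term)            # tf(i,j): occurrences of i in j
        df = sum(1 for doc in docs if term in doc)  # df(i): documents containing i
        n = len(docs)                               # N: total number of documents
        return tf * math.log(n / df)

    # 'cat' appears once in document 0 and in 2 of the 3 documents
    print(tf_idf('cat', 0, documents))              # 1 * log(3/2) ≈ 0.405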

Latent Semantic Analysis (LSA)

LSA is one of the foundational techniques of topic modeling. It analyzes the relationship between a set of documents and their terms, and produces a set of concepts related to them.

LSA is a step beyond TF-IDF. In a large set of documents, the TF-IDF matrix contains very noisy information and many redundant dimensions, so the LSA algorithm performs dimensionality reduction.

This reduction is performed with Singular Value Decomposition (SVD). SVD factorizes a matrix, A, into the product of three separate matrices:

A = U × S × V^T

Figure 3.20: Singular Value Decomposition

  • A: This is the input data matrix.
  • m: This is the number of documents.
  • n: This is the number of terms.
  • U: Left singular vectors. Our document-topic matrix.
  • S: Singular values. A diagonal matrix representing the strength of each concept.
  • V: Right singular vectors. Our term-topic matrix, representing each term as a vector over the topics.

    Note

This method is more effective on a large set of documents, but there are more advanced algorithms for this task, such as Latent Dirichlet Allocation (LDA) or Probabilistic LSA (PLSA).
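To see the decomposition itself, here is a minimal sketch using NumPy on a small random input matrix (an illustration only; the exercise below uses scikit-learn instead):

    import numpy as np

    # Hypothetical input matrix A with m=4 documents and n=5 terms
    A = np.random.rand(4, 5)

    # full_matrices=False returns the compact decomposition A = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    print(U.shape)   # (4, 4) -> document-topic matrix
    print(S.shape)   # (4,)   -> strength of each concept
    print(Vt.shape)  # (4, 5) -> term-topic matrix (transposed)

    # The product of the three matrices reconstructs A
    print(np.allclose(A, U @ np.diag(S) @ Vt))  # True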

Exercise 12: Topic Modeling in Python

In this exercise, TF-IDF and LSA will be coded in Python using a specific library. By the end of this exercise, you will be able to apply these techniques to extract the weight of each term in a document:

  1. Open up your Google Colab interface.
  2. Create a folder for the book.
  3. To generate the TF-IDF matrix, we could code the formula in Figure 3.19 by hand, but we are going to use one of the most popular machine learning libraries in Python, scikit-learn:

    from sklearn.feature_extraction.text import TfidfVectorizer

    from sklearn.decomposition import TruncatedSVD

  4. The corpus we are going to use for this exercise will be simple, with just four sentences:

    corpus = [

         'My cat is white',

         'I am the mayor of this city',

         'I love eating toasted cheese',

         'The lazy cat is sleeping',

    ]

  5. With the TfidfVectorizer method, we can convert the collection of documents in our corpus to a matrix of TF-IDF features:

    vectorizer = TfidfVectorizer()

    X = vectorizer.fit_transform(corpus)
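    As a quick sanity check (not part of the original steps), you can confirm the shape of the resulting matrix:

    print(X.shape)  # (4, 16): 4 documents and 16 extracted terms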

  6. The get_feature_names_out() method shows the extracted features.

    Note

    To understand the TfidfVectorizer function better, visit the Scikit Learn documentation - https://bit.ly/2S6lwWP

    vectorizer.get_feature_names_out()

    The output is as follows:

    Figure 3.21: Feature names of the corpus

  7. X is a sparse matrix. To see its content, we can use the todense() function:

    X.todense()

    The output is as follows:

    Figure 3.22: TF-IDF matrix of the corpus
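    As an optional check (not part of the original steps), you can pair each feature with its TF-IDF weight for a single document. This minimal sketch reuses the vectorizer and X objects from the previous steps:

    # Weights of each term in the first document, 'My cat is white'
    terms = vectorizer.get_feature_names_out()
    weights = X.toarray()[0]
    # Print the non-zero weights, highest first
    for term, weight in sorted(zip(terms, weights), key=lambda p: -p[1]):
        if weight > 0:
            print(term, round(weight, 4))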

  8. Now let's perform dimensionality reduction with LSA. The TruncatedSVD method uses SVD to transform the input matrix. In this exercise, we'll use n_components=10. For larger corpora, you should use n_components=100, as it yields better results:

    lsa = TruncatedSVD(n_components=10, algorithm='randomized', n_iter=10, random_state=0)

    lsa.fit_transform(X)

    The output is as follows:

    Figure 3.23: Dimensionality reduction with LSA
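    If you are curious, you can also check how much of the variance of X each component captures (not part of the original exercise; explained_variance_ratio_ is a standard TruncatedSVD attribute):

    # Fraction of the variance of X explained by each component
    print(lsa.explained_variance_ratio_)
    # Total variance captured by all the components
    print(lsa.explained_variance_ratio_.sum())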

  9. The .components_ attribute shows the weight of each feature extracted with vectorizer.get_feature_names_out(). Notice that the LSA matrix has a shape of 4x16: we have 4 documents in our corpus (so 4 concepts), and the vectorizer has 16 features (terms):

    lsa.components_

    The output is as follows:

Figure 3.24: The LSA components matrix, with the weight of each term per concept

The exercise has ended successfully! This was a preparatory exercise for Activity 3, Process a Corpus. Do check the seventh step of the exercise – it will give you the key to completing the activity ahead. I encourage you to read the scikit-learn documentation to see the full potential of these two methods. Now you know how to create the TF-IDF matrix. This matrix can be huge, so to manage the data better, the LSA algorithm performs dimensionality reduction on the weights of the terms in each document.
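To extract the top keywords for each concept, which is essentially what the activity asks for, here is one possible sketch that reuses the lsa and vectorizer objects from the exercise:

    terms = vectorizer.get_feature_names_out()
    for i, component in enumerate(lsa.components_):
        # Sort the terms of this concept by absolute weight, highest first
        top_terms = sorted(zip(terms, component), key=lambda p: abs(p[1]), reverse=True)[:5]
        print('Concept', i, ':', [term for term, weight in top_terms])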

Activity 3: Process a Corpus

In this activity, we will process a really small corpus to clean the data and extract the keywords and concepts using LSA.

Imagine this scenario: the newspaper vendor in your town has launched a competition. It consists of predicting the category of an article. This newspaper does not have a structured database, which means it has only raw data. They provide a small set of documents, and they need to know whether each article is political, scientific, or sports-related:

Note

You can choose between spaCy and the NLTK library to do the activity. Either solution is valid as long as the keywords output by the LSA algorithm are related.

  1. Load the corpus documents and store them in a list.

    Note

    The corpus documents can be found on GitHub, https://github.com/PacktPublishing/Artificial-Vision-and-Language-Processing-for-Robotics/tree/master/Lesson03/Activity03/dataset

  2. Pre-process the text with spaCy or NLTK.
  3. Apply the LSA algorithm.
  4. Show the first five keywords related to each concept:

    Keywords: moon, apollo, earth, space, nasa

    Keywords: yard, touchdown, cowboys, prescott, left

    Keywords: facebook, privacy, tech, consumer, data

    Note

Your output keywords will probably not be exactly the same as the ones shown here. If your keywords are not related, check the solution.

    The output is as follows:

Figure 3.25: Output example of the most relevant words in a concept

Note

The solution for this activity is available on page 306.