Feature Extraction from Texts

As we already know, machine learning algorithms do not understand textual data directly; we need to represent the text in numerical form, that is, as vectors. To convert each textual sentence into a vector, we need to represent it as a set of features. This set of features should uniquely represent the text, although, individually, some of the features may be common across many textual sentences. Features can be classified into two categories:

  • General features: These features are statistical calculations and do not depend on the content of the text. Some examples of general features could be the number of tokens in the text, the number of characters in the text, and so on.
  • Specific features: These features are dependent on the inherent meaning of the text and represent the semantics of the text. For example, the frequency of unique words in the text is a specific feature.

Let's explore these in detail.
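To make the distinction concrete, here is a minimal sketch (the sentences and variable names are purely illustrative) that computes one general feature (the token count) and one specific feature (the word frequencies) for two short sentences:

    from collections import Counter

    sentences = ['The sky is blue', 'The pillar is yellow']
    for sentence in sentences:
        tokens = sentence.lower().split()
        # General feature: a statistic that ignores what the words mean
        number_of_tokens = len(tokens)
        # Specific feature: depends on the actual content of the sentence
        word_frequencies = Counter(tokens)
        print(number_of_tokens, word_frequencies)

Both sentences yield the same token count, but their word-frequency counters differ, which is exactly the difference between a general and a specific feature.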

Extracting General Features from Raw Text

As we've already learned, general features refer to those that are not directly dependent on the individual tokens constituting a text corpus. Let's consider these two sentences: "The sky is blue" and "The pillar is yellow". Here, the sentences have the same number of words (a general feature), that is, four, but the individual constituent tokens are different. Let's complete an exercise to understand this better.

Exercise 2.11: Extracting General Features from Raw Text

In this exercise, we will extract general features from input text. These general features include the number of words, the presence of "wh" question words (such as "what", "why", and "how"), and the language in which the text is written. Follow these steps to implement this exercise:

  1. Open a Jupyter Notebook.
  2. Import the pandas library and create a DataFrame with four sentences. Add the following code to implement this:

    import pandas as pd
    from textblob import TextBlob

    df = pd.DataFrame([['The interim budget for 2019 will '\
                        'be announced on 1st February.'], \
                       ['Do you know how much expectation '\
                        'the middle-class working population '\
                        'is having from this budget?'], \
                       ['February is the shortest month '\
                        'in a year.'], \
                       ['This financial year will end on '\
                        '31st March.']])
    df.columns = ['text']
    df.head()

    The preceding code generates the following output:

    Figure 2.15: DataFrame consisting of four sentences

  3. Use the apply() function to iterate through each row of the column text, convert them into TextBlob objects, and extract words from them. Add the following code to implement this:

    def add_num_words(df):
        df['number_of_words'] = df['text'].apply(lambda x : \
                                len(TextBlob(str(x)).words))
        return df

    add_num_words(df)['number_of_words']

    The preceding code generates the following output:

    0     11
    1     15
    2      8
    3      8
    Name: number_of_words, dtype: int64

    The preceding code line will print the number_of_words column of the DataFrame to represent the number of words in each row.

  4. Use the apply() function to iterate through each row of the column text, convert the text into TextBlob objects, and extract the words from them to check whether any of them belong to the list of "wh" words that has been declared. Add the following code to do so:

    def is_present(wh_words, df):
        """
        The below line of code will find the intersection
        between set of tokens of every sentence and the
        wh_words and will return true if the length of
        intersection set is non-zero.
        """
        df['is_wh_words_present'] = df['text'].apply(lambda x : \
                                    True if \
                                    len(set(TextBlob(str(x)).\
                                    words).intersection(wh_words))\
                                    >0 else False)
        return df

    wh_words = set(['why', 'who', 'which', 'what', \
                    'where', 'when', 'how'])
    is_present(wh_words, df)['is_wh_words_present']

    The preceding code generates the following output:

    0     False
    1     True
    2     False
    3     False
    Name: is_wh_words_present, dtype: bool

    The preceding code will print the is_wh_words_present column, which was added to df by the is_present() method; for every row, it shows whether any "wh" word is present.

  5. Use the apply() function to iterate through each row of the column text, convert them into TextBlob objects, and detect their languages:

    def get_language(df):
        df['language'] = df['text'].apply(lambda x : \
                         TextBlob(str(x)).detect_language())
        return df

    get_language(df)['language']

    The preceding code generates the following output:

    0     en
    1     en
    2     en
    3     en
    Name: language, dtype: object
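    Note that TextBlob's detect_language() method relies on an online translation service and has been deprecated and removed in newer versions of the library, so this step may fail depending on your environment. If it does, one possible workaround (an assumption on our part, not part of the original exercise) is the langdetect package, which performs language detection offline:

    # pip install langdetect  (third-party package, assumed here as an alternative)
    from langdetect import detect

    def get_language_offline(df):
        df['language'] = df['text'].apply(lambda x: detect(str(x)))
        return df

    get_language_offline(df)['language']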

With that, we have learned how to extract general features from text data.

Note

To access the source code for this specific section, please refer to https://packt.live/2X9jLcS.

You can also run this example online at https://packt.live/3fgrYSK.

Let's perform another exercise to get a better understanding of this.

Exercise 2.12: Extracting General Features from Text

In this exercise, we will extract various general features from documents. The dataset that we will be using here consists of random statements. Our objective is to find the frequency of various general features such as punctuation, uppercase and lowercase words, letters, digits, words, and whitespaces.

Note

The dataset that is being used in this exercise can be found at this link: https://packt.live/3k0qCPR.

  1. Open a Jupyter Notebook.
  2. Insert a new cell and add the following code to import the necessary libraries:

    import pandas as pd
    from string import punctuation
    import nltk
    nltk.download('tagsets')
    from nltk.data import load
    nltk.download('averaged_perceptron_tagger')
    from nltk import pos_tag
    from nltk import word_tokenize
    from collections import Counter

  3. To see the different kinds of part of speech (PoS) tags that nltk provides, add the following code:

    def get_tagsets():
        tagdict = load('help/tagsets/upenn_tagset.pickle')
        return list(tagdict.keys())

    tag_list = get_tagsets()
    print(tag_list)

    The preceding code generates the following output:

    Figure 2.16: List of PoS

  4. Calculate the number of occurrences of each PoS by iterating through each document and annotating each word with the corresponding pos tag. Add the following code to implement this:

    """

    This method will count the occurrence of pos

    tags in each sentence.

    """

    def get_pos_occurrence_freq(data, tag_list):

        # Get list of sentences in text_list

        text_list = data.text

        

        # create empty dataframe

        feature_df = pd.DataFrame(columns=tag_list)

        for text_line in text_list:

            

            # get pos tags of each word.

            pos_tags = [j for i, j in \

                        pos_tag(word_tokenize(text_line))]

            

            """

            create a dict of pos tags and their frequency

            in given sentence.

            """

            row = dict(Counter(pos_tags))

            feature_df = feature_df.append(row, ignore_index=True)

        feature_df.fillna(0, inplace=True)

        return feature_df

    tag_list = get_tagsets()

    data = pd.read_csv('../data/data.csv', header=0)

    feature_df = get_pos_occurrence_freq(data, tag_list)

    feature_df.head()

    The preceding code generates the following output:

    Figure 2.17: Number of occurrences of each PoS in the sentence
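    Note that DataFrame.append(), used in the loop above, has been deprecated and removed in recent versions of pandas (2.0 onwards). If the preceding step fails in your environment, the sketch below shows one possible equivalent (same logic, but the rows are collected in a plain list and the DataFrame is built once at the end; the function name is ours):

    def get_pos_occurrence_freq_v2(data, tag_list):
        rows = []
        for text_line in data.text:
            # count the PoS tags of the words in each sentence
            pos_tags = [j for i, j in pos_tag(word_tokenize(text_line))]
            rows.append(dict(Counter(pos_tags)))
        # build the DataFrame in one go and fill missing tags with 0
        return pd.DataFrame(rows, columns=tag_list).fillna(0)

    feature_df = get_pos_occurrence_freq_v2(data, tag_list)
    feature_df.head()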

  5. To calculate the number of punctuation marks, add the following code:

    def add_punctuation_count(feature_df, data):
        feature_df['num_of_unique_punctuations'] = data['text'].\
            apply(lambda x: len(set(x).intersection\
            (set(punctuation))))
        return feature_df

    feature_df = add_punctuation_count(feature_df, data)
    feature_df['num_of_unique_punctuations'].head()

    The add_punctuation_count() method finds the intersection of the set of characters in each text with the set of punctuation marks imported from the string module, then adds the length of this intersection set for each row to the num_of_unique_punctuations column of the DataFrame. The preceding code generates the following output:

    0 0
    1 0
    2 1
    3 1
    4 0
    Name: num_of_unique_punctuations, dtype: int64

  6. To calculate the number of capitalized words, add the following code:

    def get_capitalized_word_count(feature_df, data):
        """
        The below code line will tokenize the text in every row,
        create a list of words that start with a capital letter,
        then find the length of this list and add it to the column
        'number_of_capital_words' of the dataframe.
        """
        feature_df['number_of_capital_words'] = data['text'].\
            apply(lambda x: len([word for word in \
            word_tokenize(str(x)) if word[0].isupper()]))
        return feature_df

    feature_df = get_capitalized_word_count(feature_df, data)
    feature_df['number_of_capital_words'].head()

    The preceding code will tokenize the text in every row, create a list of the words that begin with a capital letter, and add the length of this list to the number_of_capital_words column of the DataFrame. The preceding code generates the following output:

    0 1
    1 1
    2 1
    3 1
    4 1
    Name: number_of_capital_words, dtype: int64

    The last line of the preceding code will print the number_of_capital_words column, which represents the number of capitalized words in each row.

  7. To calculate the number of lowercase words, add the following code:

    def get_small_word_count(feature_df, data):
        """
        The below code line will tokenize the text in every row,
        create a list of lowercase words, then find the length of
        this list and add it to the column 'number_of_small_words'
        of the dataframe.
        """
        feature_df['number_of_small_words'] = data['text'].\
            apply(lambda x: len([word for word in \
            word_tokenize(str(x)) if word[0].islower()]))
        return feature_df

    feature_df = get_small_word_count(feature_df, data)
    feature_df['number_of_small_words'].head()

    The preceding code will tokenize the text in every row, create a list of lowercase words, and add the length of this list to the number_of_small_words column of the DataFrame. The preceding code generates the following output:

    0 4
    1 3
    2 7
    3 3
    4 2
    Name: number_of_small_words, dtype: int64

    The last line of the preceding code will print the number_of_small_words column, which represents the number of lowercase words in each row.

  8. To calculate the number of letters in the DataFrame, use the following code:

    def get_number_of_alphabets(feature_df, data):
        feature_df['number_of_alphabets'] = data['text']. \
            apply(lambda x: len([ch for ch in str(x) \
            if ch.isalpha()]))
        return feature_df

    feature_df = get_number_of_alphabets(feature_df, data)
    feature_df['number_of_alphabets'].head()

    The preceding code will collect the alphabetic characters of each row into a list and add the length of that list to the number_of_alphabets column. This will produce the following output:

    0     19
    1     18
    2     28
    3     14
    4     13
    Name: number_of_alphabets, dtype: int64

    The last line of the preceding code will print the number_of_alphabets column, which represents the count of letters in each row.

  9. To calculate the number of digits in the DataFrame, add the following code:

    def get_number_of_digit_count(feature_df, data):
        """
        The below code line will collect the digits of the text
        in each row into a list and add the count of that list to
        the column 'number_of_digits'.
        """
        feature_df['number_of_digits'] = data['text']. \
            apply(lambda x: len([ch for ch in str(x) \
            if ch.isdigit()]))
        return feature_df

    feature_df = get_number_of_digit_count(feature_df, data)
    feature_df['number_of_digits'].head()

    The preceding code will collect the digit characters of each row into a list and add the length of that list to the number_of_digits column. The preceding code generates the following output:

    0 0
    1 0
    2 0
    3 0
    4 0
    Name: number_of_digits, dtype: int64

  10. To calculate the number of words in the DataFrame, add the following code:

    def get_number_of_words(feature_df, data):
        """
        The below code line will break the text in each row into
        a list of words and add the count of that list to
        the column 'number_of_words'.
        """
        feature_df['number_of_words'] = data['text'].\
            apply(lambda x : len(word_tokenize(str(x))))
        return feature_df

    feature_df = get_number_of_words(feature_df, data)
    feature_df['number_of_words'].head()

    The preceding code will split the text in each row into a list of words and add the count of that list to the number_of_words column. We will get the following output:

    0 5
    1 4
    2 9
    3 5
    4 3
    Name: number_of_words, dtype: int64

  11. To calculate the number of whitespaces in the DataFrame, add the following code:

    def get_number_of_whitespaces(feature_df, data):
        """
        The below code line will generate a list of whitespace
        characters in each row and add the length of that list to
        the column 'number_of_white_spaces'.
        """
        feature_df['number_of_white_spaces'] = data['text']. \
            apply(lambda x: len([ch for ch in str(x) \
            if ch.isspace()]))
        return feature_df

    feature_df = get_number_of_whitespaces(feature_df, data)
    feature_df['number_of_white_spaces'].head()

    The preceding code will generate a list of whitespace characters in each row and add the length of that list to the number_of_white_spaces column. The preceding code generates the following output:

    0 4
    1 3
    2 7
    3 3
    4 2
    Name: number_of_white_spaces, dtype: int64

  12. To view the full feature set we have just created, add the following code:

    feature_df.head()

    This prints the head of the final DataFrame, that is, the first five rows across all the columns. We will get the following output:

Figure 2.18: DataFrame consisting of the features we have created

With that, we have learned how to extract general features from the given text.

Note

To access the source code for this specific section, please refer to https://packt.live/3jSsLNh.

You can also run this example online at https://packt.live/3hPFmPA.

Now, let's explore how we can extract unique features.

Bag of Words (BoW)

The Bag of Words (BoW) model is one of the most popular methods for extracting features from raw texts.

In this technique, we convert each sentence into a vector. The length of this vector is equal to the number of unique words in all the documents. This is done in two steps:

  1. The vocabulary or dictionary of all the words is generated.
  2. The document is represented in terms of the presence or absence of all words.

A vocabulary or dictionary is created from all the unique words available in the corpus (all documents) and every single word is assigned a unique index number. In the second step, every document is represented by a list whose length is equal to the number of words in the vocabulary.
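As a quick illustration of these two steps, the following sketch (the two sentences are purely illustrative) builds a vocabulary by hand and then encodes each document by the presence (1) or absence (0) of every vocabulary word:

    docs = ['the sky is blue', 'the sun is bright']

    # Step 1: build the vocabulary and give every unique word an index
    vocabulary = sorted(set(word for doc in docs for word in doc.split()))
    # ['blue', 'bright', 'is', 'sky', 'sun', 'the']

    # Step 2: represent each document by the presence or absence of every word
    vectors = [[1 if word in doc.split() else 0 for word in vocabulary] \
               for doc in docs]
    # [[1, 0, 1, 1, 0, 1], [0, 1, 1, 0, 1, 1]]

The following exercise illustrates how BoW can be implemented in Python.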

Exercise 2.13: Creating a Bag of Words

In this exercise, we will create a BoW representation for all the terms in a document and ascertain the 10 most frequent terms. We will use CountVectorizer from sklearn, which performs the following tasks:

  • Tokenizes the collection of documents, also called a corpus
  • Builds the vocabulary of unique words
  • Converts a document into vectors using the previously built vocabulary

Follow these steps to implement this exercise:

  1. Open a Jupyter Notebook.
  2. Import the necessary libraries. Add the following code to implement this:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

  3. Use the CountVectorizer function to create the BoW model. Add the following code to do this:

    def vectorize_text(corpus):
        """
        Will return a dataframe in which every row will be the
        vector representation of a document in the corpus
        :param corpus: input text corpus
        :return: dataframe of vectors
        """
        bag_of_words_model = CountVectorizer()
        """
        performs the above described three tasks on
        the given data corpus.
        """
        dense_vec_matrix = bag_of_words_model.\
                           fit_transform(corpus).todense()
        bag_of_word_df = pd.DataFrame(dense_vec_matrix)
        bag_of_word_df.columns = sorted(bag_of_words_model.\
                                        vocabulary_)
        return bag_of_word_df

    corpus = ['Data Science is an overlap between Arts and Science',\
              'Generally, Arts graduates are right-brained and '\
              'Science graduates are left-brained',\
              'Excelling in both Arts and Science at a time '\
              'becomes difficult',\
              'Natural Language Processing is a part of Data Science']
    df = vectorize_text(corpus)
    df.head()

    The vectorize_text method will take a document corpus as an argument and return a DataFrame in which every row will be a vector representation of a document in the corpus.

    The preceding code generates the following output:

    Figure 2.19: DataFrame of the output of the BoW model

  4. Create a BoW model for the 10 most frequent terms. Add the following code to implement this:

    def bow_top_n(corpus, n):
        """
        Will return a dataframe in which every row is represented
        by the counts of the n most frequently occurring words
        in the data corpus
        :param corpus: input text corpus
        :param n: number of most frequent words to keep
        :return: dataframe of vectors
        """
        bag_of_words_model_small = CountVectorizer(max_features=n)
        bag_of_word_df_small = pd.DataFrame\
        (bag_of_words_model_small.fit_transform\
        (corpus).todense())
        bag_of_word_df_small.columns = \
        sorted(bag_of_words_model_small.vocabulary_)
        return bag_of_word_df_small

    df_2 = bow_top_n(corpus, 10)
    df_2.head()

    In the preceding code, we are counting the occurrences of the 10 most frequent words of the corpus in each sentence and creating a DataFrame out of them.

    The preceding code generates the following output:

Figure 2.20: DataFrame of the output of the BoW model for the 10 most frequent terms

Note

To access the source code for this specific section, please refer to https://packt.live/3gdhViJ.

You can also run this example online at https://packt.live/3hPUTi8.

In this section, we learned what BoW is and how we can use it to convert a sentence or document into a vector. BoW is the easiest way to convert text into a vector; however, it has a severe disadvantage: it treats every word equally and takes no account of how frequent or rare a word is across the corpus, even though this relative importance matters a great deal when we care about the semantics of a sentence. To overcome this issue, we can use another feature extraction model called TFIDF, which we will discuss later in this chapter.

Zipf's Law

According to Zipf's law, the number of times a word occurs in a corpus is inversely proportional to its rank in the frequency table. In simple terms, if the words in a corpus are arranged in descending order of their frequency of occurrence, then the frequency of the word at the ith rank will be proportional to 1/i:

Figure 2.21: Zipf's law

This also means that the frequency of the most frequent word will be roughly twice the frequency of the second most frequent word. For example, if we look at the Brown University Standard Corpus of Present-Day American English, the word "the" is the most frequent word (its frequency is 69,971), while the word "of" is the second most frequent (with a frequency of 36,411). As we can see, its frequency is almost half that of the most frequent word.
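We can sanity-check this with the two Brown corpus counts quoted above; the short calculation below (using only these two published figures) shows that their ratio is indeed close to the 2:1 that Zipf's law predicts:

    freq_the, freq_of = 69971, 36411   # frequencies quoted above for "the" and "of"
    print(freq_the / freq_of)          # ~1.92, close to the predicted ratio of 2

To get a better understanding of this, let's perform a simple exercise.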

Exercise 2.14: Zipf's Law

In this exercise, we will plot both the expected and actual ranks and frequencies of tokens with the help of Zipf's law. We will be using the 20newsgroups dataset provided by the sklearn library, which is a collection of newsgroup documents. Follow these steps to implement this exercise:

  1. Open a Jupyter Notebook.
  2. Import the necessary libraries:

    from pylab import *
    import nltk
    nltk.download('stopwords')
    from sklearn.datasets import fetch_20newsgroups
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    import matplotlib.pyplot as plt
    import re
    import string
    from collections import Counter

    Add two methods: one for loading the stop words and one for loading the data into the newsgroups_data_sample variable and preparing it:

    def get_stop_words():
        stop_words = stopwords.words('english')
        stop_words = stop_words + list(string.printable)
        return stop_words

    def get_and_prepare_data(stop_words):
        """
        This method will load the 20newsgroups data and
        remove stop words from it using the given stop word list.
        :param stop_words:
        :return:
        """
        newsgroups_data_sample = \
        fetch_20newsgroups(subset='train')
        tokenized_corpus = [word.lower() for sentence in \
                            newsgroups_data_sample['data'] \
                            for word in word_tokenize\
                            (re.sub(r'([^\s\w]|_)+', ' ', sentence)) \
                            if word.lower() not in stop_words]
        return tokenized_corpus

    In the preceding code, there are two methods: get_stop_words() loads the stop word list from nltk data, while get_and_prepare_data() loads the 20newsgroups data and removes stop words from it using the given stop word list.

  3. Add the following method to calculate the frequency of each token:

    def get_frequency(corpus, n):
        token_count_di = Counter(corpus)
        return token_count_di.most_common(n)

    The preceding method uses the Counter class to count the frequency of tokens in the corpus and then return the most common n tokens.

  4. Now, call all the preceding methods to calculate the frequency of the top 50 most frequent tokens:

    stop_word_list = get_stop_words()
    corpus = get_and_prepare_data(stop_word_list)
    get_frequency(corpus, 50)

    The preceding code generates the following output:

    Figure 2.22: The 50 most frequent words of the corpus

  5. Plot the actual ranks of words that we got from the frequency dictionary and the ranks expected as per Zipf's law. Calculate the frequencies of the top 1,000 words using the preceding get_frequency() method and the expected frequencies of the same list using Zipf's law. For this, create two lists: an actual_frequencies and an expected_frequencies list. Use the log of the actual frequencies to downscale the numbers. After getting the actual and expected frequencies, plot them using matplotlib:

    def get_actual_and_expected_frequencies(corpus):
        freq_dict = get_frequency(corpus, 1000)
        actual_frequencies = []
        expected_frequencies = []
        # start ranks at 1, since Zipf's law is defined for rank >= 1
        for rank, tup in enumerate(freq_dict, start=1):
            actual_frequencies.append(log(tup[1]))
            # expected frequency 1/rank as per zipf's law
            expected_frequencies.append(1 / rank)
        return actual_frequencies, expected_frequencies

    def plot(actual_frequencies, expected_frequencies):
        plt.plot(actual_frequencies, 'g*', \
                 expected_frequencies, 'ro')
        plt.show()

    # We will plot the actual and expected frequencies
    actual_frequencies, expected_frequencies = \
    get_actual_and_expected_frequencies(corpus)
    plot(actual_frequencies, expected_frequencies)

    The preceding code generates the following output:

Figure 2.23: Illustration of Zipf's law

As we can see from the preceding output, both curves fall away in a similar fashion; in other words, the plot illustrates the proportionality between the actual frequencies and the frequencies expected under Zipf's law.
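A common alternative way to visualize Zipf's law is a log-log plot of raw frequency against rank, where the law shows up as an approximately straight line with a slope of about -1. A minimal sketch, assuming the corpus list and the get_frequency() method from the previous steps are still available:

    def plot_zipf_loglog(corpus, n=1000):
        counts = [count for _, count in get_frequency(corpus, n)]
        ranks = range(1, len(counts) + 1)
        # on log-log axes, frequency proportional to 1/rank appears as a straight line
        plt.loglog(ranks, counts, 'g*')
        plt.xlabel('rank')
        plt.ylabel('frequency')
        plt.show()

    plot_zipf_loglog(corpus)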

Note

To access the source code for this specific section, please refer to https://packt.live/30ZnKtD.

You can also run this example online at https://packt.live/3f9ZFoT.

Term Frequency–Inverse Document Frequency (TFIDF)

Term Frequency-Inverse Document Frequency (TFIDF) is another method of representing text data in a vector format. Here, once again, we'll represent each document as a list whose length is equal to the number of unique words/tokens in all the documents (the corpus), but the vector no longer records just the presence or absence of a word: it also reflects how often the word occurs in the current document and how widespread it is in the whole corpus.

This technique is based on the idea that rarely occurring words are better representatives of a document than frequently occurring words. Hence, this representation gives more weight to rarer or less frequent words than to frequently occurring ones. It does so with the following formula:

Figure 2.24: TFIDF formula

Here, term frequency is the frequency of a word in the given document. Inverse document frequency can be defined as log(D/df), where df is document frequency and D is the total number of documents in the background corpus.
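To see how the formula behaves, here is a small worked sketch that applies tf * log(D/df) to two words by hand (the counts loosely mirror the small corpus used in the next exercise, but they are stated here only for illustration; note also that sklearn's TfidfVectorizer applies a smoothed variant of this formula and normalizes the resulting vectors, so its numbers will differ):

    from math import log

    D = 4  # total number of documents in the corpus
    # word: (frequency in the current document, number of documents containing it)
    counts = {'science': (2, 4), 'overlap': (1, 1)}

    for word, (tf, df) in counts.items():
        # a word found in every document gets idf = log(4/4) = 0,
        # while a rare word gets a much larger weight
        print(word, tf * log(D / df))

Here, 'science' scores 0 because it appears in every document, whereas the rarer 'overlap' receives a positive weight, which is exactly the behavior we want.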

Now, let's complete an exercise and learn how TFIDF can be implemented in Python.

Exercise 2.15: TFIDF Representation

In this exercise, we will represent the input texts with their TFIDF vectors. We will use TfidfVectorizer from sklearn, which converts text into TFIDF vectors. Follow these steps to implement this exercise:

  1. Open a Jupyter Notebook.
  2. Import all the necessary libraries and create a method to calculate the TFIDF of the corpus. Add the following code to implement this:

    from sklearn.feature_extraction.text import TfidfVectorizer

    def get_tf_idf_vectors(corpus):
        tfidf_model = TfidfVectorizer()
        vector_list = tfidf_model.fit_transform(corpus).todense()
        return vector_list

  3. To create a TFIDF model, write the following code:

    corpus = ['Data Science is an overlap between Arts and Science',\
              'Generally, Arts graduates are right-brained and '\
              'Science graduates are left-brained',\
              'Excelling in both Arts and Science at a '\
              'time becomes difficult',\
              'Natural Language Processing is a part of Data Science']
    vector_list = get_tf_idf_vectors(corpus)
    print(vector_list)

    In the preceding code, the get_tf_idf_vectors() method will generate TFIDF vectors from the corpus. You will then call this method on a given corpus. The preceding code generates the following output:

Figure 2.25: TFIDF representation of the 10 most frequent terms

The preceding output represents the TFIDF vectors for each row. As you can see from the results, each document is represented by a list whose length is equal to the number of unique words in the corpus, and each list (vector) contains the TFIDF values of the words at their corresponding indices.
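If you want to see which column corresponds to which word, a small extension (assuming a reasonably recent version of scikit-learn, where get_feature_names_out() is available) wraps the same vectors in a DataFrame, mirroring what we did for the BoW model:

    import pandas as pd

    def get_tf_idf_dataframe(corpus):
        tfidf_model = TfidfVectorizer()
        vector_list = tfidf_model.fit_transform(corpus).todense()
        # label every column with the word it represents
        return pd.DataFrame(vector_list, \
                            columns=tfidf_model.get_feature_names_out())

    get_tf_idf_dataframe(corpus).head()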

Note

To access the source code for this specific section, please refer to https://packt.live/3gdzsHA.

You can also run this example online at https://packt.live/3fdP5gS.

In the next section, we will solve an activity to extract specific features from texts.