Text Summarization Through Natural Language Processing

Text summarization is a technique used to reduce the size of any given text while still preserving its original meaning. This is an important skill to have in natural language processing (NLP), since it lets us capture the key points of a long document in just a few sentences. In this article, we’ll discuss how to use NLTK to create a text summarizer algorithm. We’ll cover the four main steps of text summarization: removing stop words, creating a frequency table, assigning scores to sentences based on the frequency table, and building a summary by adding sentences above a specific score threshold.

What are Stop Words and Why Do We Remove Them?

Stop words are words such as “the”, “is”, and “a” that do not add much to the overall meaning of a sentence. We generally remove stop words from text because they don’t provide any insight into the original body of text. To remove them, we first need two collections: one containing every word in the text and another containing the stop words themselves.

Using NLTK for Text Summarization
We will use two NLTK modules to build our text summarizer:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
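
Before building the frequency table, we also need the two collections described above: the list of words in the text and the set of stop words. Here is a minimal sketch, assuming the document to summarize is stored in a variable named text (a hypothetical placeholder) and building on the imports above:

import nltk

nltk.download('punkt')      # tokenizer models used by word_tokenize and sent_tokenize
nltk.download('stopwords')  # the stop word list

text = "..."  # placeholder: the document to summarize

stopWords = set(stopwords.words("english"))  # stop words to ignore
words = word_tokenize(text)                  # every word in the text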

Creating the Frequency Table
The next step is to create a dictionary for our word frequency table. For this step, we are going to use only those words that are not part of our stop words array.

# Build a frequency table of the non-stop words
freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

Assigning Scores to Sentences

Now that we have imported the sentence tokenizer, we run sent_tokenize on the text to create an array of sentences. We also need a dictionary to store the score of every sentence so that we can build the summary.

sentences = sent_tokenize(text)
sentenceValue = dict()

We must go through every sentence and assign it a score depending on the words it contains. There are many algorithms available for assigning scores, but we are going to use a basic algorithm which adds the frequency of each non-stop-word in the sentence.

for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence[:12] in sentenceValue:
                sentenceValue[sentence[:12]] += freq
            else:
                sentenceValue[sentence[:12]] = freq

Here freqTable.items() yields (word, frequency) pairs, so word is the word itself and freq is the number of times it occurs in the text; the first 12 characters of each sentence serve as its key in sentenceValue. One issue with this scoring approach is that long sentences always have an advantage over shorter ones. To correct for this, we can divide each sentence’s score by the total number of words present in the sentence.
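
One way to apply that correction, as an optional extra step on top of the loop above, is to divide each stored score by the sentence’s word count:

# Optional: normalize scores by sentence length so long sentences are not favored
for sentence in sentences:
    key = sentence[:12]
    if key in sentenceValue:
        sentenceValue[key] = sentenceValue[key] / len(word_tokenize(sentence))

If you apply this step, the scores become fractions, so the int() truncation used when averaging below should be dropped.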

Building the Summary

Now we need a threshold to compare our scores against. The simplest approach is to take the average of the sentence scores; once we have it, we can easily derive a threshold from it.

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

average = int(sumValues / len(sentenceValue))

For a shorter summary we can use a higher threshold; here we keep only sentences whose score is more than 1.5 times the average. We apply the threshold and collect the qualifying sentences to form our summary.

summary = ''
for sentence in sentences:
    if sentence[:12] in sentenceValue and sentenceValue[sentence[:12]] > (1.5 * average):
        summary += " " + sentence

Once completed, we can print the summary to check its quality.
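
For example, assuming the variables from the steps above are still in scope:

print(summary)
print("Original length:", len(text), "characters")
print("Summary length:", len(summary), "characters")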

Frequency Table Enhancement
We can build a smarter frequency table by using a stemming algorithm, which reduces words to their root form. This is useful when we want different forms of the same word (for example, “run”, “running”, and “runs”) to be counted together. To implement stemming, we can use one of the stemmers provided in the NLTK library.

from nltk.stem import PorterStemmer
ps = PorterStemmer()

Once the stemmer is imported, we pass each word through it before adding the word to our frequency table, as in the sketch below. It is also good practice to stem each word when we go through every sentence to assign scores, so that the lookups still match the table.
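
Here is a minimal sketch of how the frequency table loop from earlier could be adapted, assuming ps is the PorterStemmer created above and words and stopWords are the collections built earlier:

freqTable = dict()
for word in words:
    word = ps.stem(word.lower())   # reduce each word to its root form before counting
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

When scoring sentences, the same ps.stem call should be applied to the words of each sentence so that they match the stemmed keys in the table.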

Conclusion
Text summarization is an important skill to have in natural language processing. In this article, we discussed how to use NLTK to create a text summarizer algorithm. We went over the four main steps of text summarization: removing stop words, creating a frequency table, assigning scores to sentences based on the frequency table, and building a summary by adding sentences with a specific score threshold. We also discussed how to make smarter frequency tables using a stemmer algorithm.
