Stemming and Lemmatization: What Are They and How to Use Them in Data Analysis

In data analysis, it’s often useful to group similar words together. Different word inflections can have the same meaning, so it can be beneficial to treat them as the same word. For example, instead of handling the words “swimmer”, “swimming”, and “swim” individually, we can treat them as the same word, which is “swim”. To do this, we can use two approaches, stemming and lemmatization.

What Is Stemming?

Stemming is a crude, rule-based attempt at generating the root form of a word. It usually works by chopping off common suffixes, so the result is often just the first several characters of the word and is not always a real word. There are several stemming algorithms available; the most commonly used ones are the Porter stemmer and the Snowball stemmer.
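To make the idea of suffix chopping concrete, here is a deliberately naive sketch. It is not the Porter or Snowball algorithm; the suffix list and the toy_stem function are hypothetical and exist only for illustration.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration
# of how rule-based stemming chops endings off a word.
SUFFIXES = ("ing", "ers", "er", "es", "s")  # hypothetical rule list


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["swimming", "swimmer", "swims", "swim"]:
    print(w, "->", toy_stem(w))
```

With these rules, “swimming” and “swimmer” both become “swimm”, which is not a real word. Real stemmers add further rules, such as undoubling a final consonant, to get closer to “swim”.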

To perform stemming, you must import the PorterStemmer from the NLTK library. Here’s an example of how to do it:

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
test_word = "swimming"
word_stem = stemmer.stem(test_word)
print(word_stem)

The output of this code is “swim”.
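In a data analysis setting, a stemmer is typically applied to every token before counting. The following is a minimal sketch of that idea, assuming a small hand-written word list; the variable names are illustrative only.

```python
from collections import Counter

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A small hand-written sample; in practice these would be tokens from your data.
words = ["swim", "swimming", "swims", "swam"]

# Count occurrences per stem instead of per surface form.
stem_counts = Counter(stemmer.stem(w) for w in words)
print(stem_counts)
```

Here the three regular forms are counted together under “swim”, while the irregular form “swam” stays separate, because the stemmer only applies suffix rules.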

What Is Lemmatization?

Lemmatization takes inflected word forms and returns them to their base form, or lemma. To do this correctly, the lemmatizer needs some context about the word, such as its part of speech: whether it is being used as a noun, a verb, or an adjective.

To perform lemmatization, you must import the WordNetLemmatizer from the NLTK library. Here’s an example of how to do it:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
test_word = "swimming"
word_lemmatize = lemmatizer.lemmatize(test_word)
word_lemmatize_verb = lemmatizer.lemmatize(test_word, pos="v")
word_lemmatize_adj = lemmatizer.lemmatize(test_word, pos="a")
print(word_lemmatize, word_lemmatize_verb, word_lemmatize_adj)

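The output of this code is “swimming swim swimming”. Without a part of speech, the lemmatizer treats the word as a noun and leaves it unchanged; only the verb reading reduces it to “swim”.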
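Because the result depends on the pos argument, it helps to derive that argument automatically when you already have part-of-speech tags. The sketch below assumes Penn Treebank style tags (such as “VBG” or “JJ”), which is what taggers like nltk.pos_tag produce; the helper name penn_to_wordnet is hypothetical.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to the one-letter pos code used by the lemmatizer.

    Assumes Penn Treebank tags; anything unrecognized falls back to noun,
    which is also the lemmatizer's default.
    """
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun


for tag in ["VBG", "JJ", "RB", "NN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

You could then call lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for each tagged word.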

Differences Between Stemming and Lemmatization

There are several differences between stemming and lemmatization. Stemming works on a word in isolation, without knowing its context or part of speech, so it is less accurate, but it is faster than lemmatization. Lemmatization, by contrast, returns a real dictionary word even when that word looks quite different from the input, whereas a stem may not be a word at all. Therefore, if speed matters more to you than accuracy, use stemming; if accuracy is more important, use lemmatization.

Stemming Non-English Words

You can also stem non-English words with NLTK’s Snowball stemmer. Besides English, it supports more than a dozen languages, including French, Spanish, and Portuguese; the exact set depends on the installed NLTK version.

To stem non-English words with the SnowballStemmer class, you can use the following code:

from nltk.stem import SnowballStemmer
french_stemmer = SnowballStemmer('french')
# Replace "French word" with the word you want to stem
print(french_stemmer.stem("French word"))
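To see which languages your installation supports, you can inspect the languages attribute of SnowballStemmer. The sketch below also wraps the constructor in a small helper; make_stemmer is a hypothetical name. The German example word is the one used in NLTK’s own documentation.

```python
from nltk.stem import SnowballStemmer


def make_stemmer(language: str) -> SnowballStemmer:
    """Return a Snowball stemmer, failing early with a clear message."""
    if language not in SnowballStemmer.languages:
        supported = ", ".join(sorted(SnowballStemmer.languages))
        raise ValueError(f"Unsupported language {language!r}; choose from: {supported}")
    return SnowballStemmer(language)


# List every language name the installed NLTK version accepts.
print(sorted(SnowballStemmer.languages))

german_stemmer = make_stemmer("german")
print(german_stemmer.stem("Autobahnen"))
```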

Getting Synonyms and Antonyms from WordNet

Besides lemmatization, you can also use WordNet to look up synonyms and antonyms. WordNet is a large lexical database of English that is widely used in natural language processing. It groups words into sets of synonyms (synsets), each with a brief definition and, often, example sentences.

To get a WordNet definition and its example sentences, you can use the following code:

from nltk.corpus import wordnet
syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())

The output of this code is “a symptom of some physical hurt or disorder”, followed by the list of example sentences: ['the patient developed severe pain and distension'].

You can also use WordNet to obtain synonymous words by running the code as follows:

from nltk.corpus import wordnet
synonyms = []
# Each synset is one sense of the word; its lemmas are the synonyms for that sense
for syn in wordnet.synsets('computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)

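The output of this code is the list ['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system', 'calculator', 'reckoner', 'figurer', 'estimator', 'computer']. Note that multi-word names use underscores instead of spaces, and that a word can appear more than once when it belongs to several synsets.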
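If you want a cleaner list, you can remove the duplicates and turn the underscores into spaces. The following is a minimal sketch that works on the output listed above, hard-coded here so that it runs without WordNet; tidy_lemma_names is a hypothetical helper name.

```python
# Lemma names as listed above, hard-coded so the sketch runs without WordNet.
raw_names = [
    "computer", "computing_machine", "computing_device", "data_processor",
    "electronic_computer", "information_processing_system", "calculator",
    "reckoner", "figurer", "estimator", "computer",
]


def tidy_lemma_names(names):
    """Replace underscores with spaces and drop duplicates, keeping first-seen order."""
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return list(dict.fromkeys(name.replace("_", " ") for name in names))


clean_synonyms = tidy_lemma_names(raw_names)
print(clean_synonyms)
```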

In a similar manner, you can get antonyms from WordNet by running the code as follows. Not every lemma has an antonym, so check that a lemma’s antonyms() list is not empty before you add anything to the array.

from nltk.corpus import wordnet
antonyms = []
for syn in wordnet.synsets("big"):
    for lemma in syn.lemmas():
        # Only lemmas that actually have antonyms contribute to the list
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
print(antonyms)

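The output of this code is a list of antonyms that includes “small” and “little”. The same antonym may appear more than once if it is attached to lemmas in several synsets.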

Conclusion

Stemming and lemmatization are two approaches to treating similar words as the same word. Stemming is a less thorough approach that is faster but has lower accuracy. Lemmatization is a more thorough approach that is slower but has higher accuracy. You can also use NLTK’s Snowball stemmer to stem non-English words, and WordNet to find definitions, synonyms, and antonyms.