Using NLTK for Natural Language Processing
Natural language processing (NLP) is a branch of artificial intelligence that deals with analyzing, understanding, and generating natural language. It is used in applications such as sentiment analysis, text classification, machine translation, and question answering. For many of these tasks, one of the most important steps is to correctly identify the parts of speech in a given text. This is where the NLTK library comes in.
NLTK, or the Natural Language Toolkit, is a popular open-source library for NLP. It provides tools and resources for working with written and spoken language data. In this article, we’ll look at how to use NLTK for part-of-speech tagging, which is the process of assigning each word in a sentence its corresponding part of speech (e.g. noun, verb, adjective, etc.).
What is Part-of-Speech Tagging?
Part-of-speech tagging, or POS tagging, is the process of assigning each word in a sentence to its corresponding part of speech. This is a crucial step in natural language processing tasks such as text classification, sentiment analysis, and question answering.
For example, in the sentence “I love dogs”, the words “I”, “love”, and “dogs” would be tagged as a pronoun, a verb, and a noun, respectively. Knowing the part of speech of each word helps us understand the meaning of the sentence and how the words relate to each other.
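The idea can be illustrated with a toy lookup-based tagger. This is only a sketch of the concept, not how NLTK actually works (NLTK's tagger is statistical, and the small tag dictionary below is hand-written for illustration):

```python
# Toy lookup tagger: maps each word to a hand-written Penn Treebank tag.
# This is purely illustrative; NLTK's real tagger is statistical.
TOY_LEXICON = {
    "I": "PRP",     # personal pronoun
    "love": "VBP",  # verb, non-3rd-person singular present
    "dogs": "NNS",  # plural noun
}

def toy_tag(sentence):
    """Tag each word by dictionary lookup, defaulting to 'NN' (noun)."""
    return [(word, TOY_LEXICON.get(word, "NN")) for word in sentence.split()]

print(toy_tag("I love dogs"))
# → [('I', 'PRP'), ('love', 'VBP'), ('dogs', 'NNS')]
```

A real tagger must also disambiguate words like “love”, which can be a noun or a verb depending on context; that is why statistical taggers look at surrounding words rather than the word alone.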
Downloading NLTK’s Tagger and Data
To use the NLTK library for part-of-speech tagging, we first need to download the NLTK Twitter corpus and the NLTK POS averaged perceptron tagger.
The NLTK Twitter corpus contains a sample of 20,000 tweets retrieved via the Twitter Streaming API, along with smaller sets of tweets labeled as positive and negative. The averaged perceptron tagger uses an averaged perceptron model to predict the most likely part-of-speech tag for each word.
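At a high level, a perceptron tagger scores each candidate tag as a weighted sum over features of the current word and its context, then picks the highest-scoring tag. The sketch below shows only that scoring step, with made-up features and weights; it is not NLTK's internal implementation:

```python
# Minimal perceptron-style tag scoring (illustrative weights only).
# Each (feature, tag) pair has a learned weight; a tag's score is the
# sum of weights for the features active at the current position.
WEIGHTS = {
    ("word=can", "MD"): 2.0,   # "can" is usually a modal verb
    ("word=can", "NN"): 1.0,   # ...but sometimes a noun (a tin can)
    ("prev=the", "NN"): 1.5,   # nouns often follow "the"
    ("prev=the", "MD"): -1.0,  # modals rarely follow "the"
}

def predict_tag(features, tags=("MD", "NN")):
    """Return the candidate tag with the largest feature-weight sum."""
    def score(tag):
        return sum(WEIGHTS.get((f, tag), 0.0) for f in features)
    return max(tags, key=score)

# "can" on its own leans modal, but after "the" it comes out as a noun:
print(predict_tag(["word=can"]))              # → MD
print(predict_tag(["word=can", "prev=the"]))  # → NN
```

The “averaged” part of the name refers to averaging each weight over all training updates, which makes the learned model less sensitive to the order of the training examples.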
To download the NLTK Twitter corpus, we can run the following command on the command line:
python -m nltk.downloader twitter_samples
To download the NLTK POS averaged perceptron tagger, we can run the following command on the command line:
python -m nltk.downloader averaged_perceptron_tagger
Once we have both datasets downloaded, we can double-check that they downloaded correctly by opening our Python interactive environment and importing the Twitter samples:
from nltk.corpus import twitter_samples
To get the file IDs of the JSON files in the Twitter corpus, we can run the following command:
twitter_samples.fileids()
Using the file IDs, we can then return the tweet strings with the following command:
twitter_samples.strings('tweets.20150430-223406.json')
Using NLTK for Part-of-Speech Tagging
Once we have the NLTK Twitter corpus and tagger downloaded, we can begin writing code to process the imported tweets. The main goal of our script will be to find and count how many nouns and adjectives appear in the positive subset of the Twitter corpus. Counting nouns gives a rough sense of how many topics are being discussed, while counting adjectives gives a sense of how descriptive the language is.
First, we need to import the necessary libraries and modules. (Note that word_tokenize relies on NLTK's punkt tokenizer models, which can be downloaded the same way: python -m nltk.downloader punkt.)

import nltk
from nltk.corpus import twitter_samples
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
Next, we need to create our tokenizer and tagging functions:
def tokenize(text):
    tokens = word_tokenize(text)
    return tokens

def tag(text):
    tokens = tokenize(text)
    tags = pos_tag(tokens)
    return tags
Now, we can create a function to count the number of nouns and adjectives in a given sentence:
def count_pos(text):
    tags = tag(text)
    count_nouns = 0
    count_adjectives = 0
    for word, pos in tags:
        if pos in ("NN", "NNS"):
            count_nouns += 1
        elif pos in ("JJ", "JJR", "JJS"):
            count_adjectives += 1
    return count_nouns, count_adjectives
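Because the counting step only inspects (word, tag) pairs, its logic can be checked on a hand-tagged sentence without downloading any corpora. The tagged list below is written by hand for illustration:

```python
# Hand-tagged sentence (Penn Treebank tags), used to exercise the
# noun/adjective counting logic without calling the NLTK tagger.
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"),
          ("fox", "NN"), ("jumps", "VBZ"), ("over", "IN"),
          ("lazy", "JJ"), ("dogs", "NNS")]

nouns = sum(1 for _, t in tagged if t in ("NN", "NNS"))
adjectives = sum(1 for _, t in tagged if t in ("JJ", "JJR", "JJS"))

print(nouns, adjectives)  # → 2 3
```

Note that this only counts common nouns; proper nouns carry the separate tags NNP and NNPS and would need to be added to the first tuple if you want to include them.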
Finally, we can use the count_pos() function to count the number of nouns and adjectives in the positive subset of our Twitter corpus:
# Get the positive tweets from the corpus
positive_tweets = twitter_samples.strings('positive_tweets.json')

# Count the number of nouns and adjectives
total_nouns = 0
total_adjectives = 0
for tweet in positive_tweets:
    nouns, adjectives = count_pos(tweet)
    total_nouns += nouns
    total_adjectives += adjectives

print("Total nouns:", total_nouns)
print("Total adjectives:", total_adjectives)
Conclusion
In this article, we looked at how to use NLTK for part-of-speech tagging. We downloaded the NLTK Twitter corpus and the NLTK POS averaged perceptron tagger and used them to count the number of nouns and adjectives in the positive subset of the Twitter corpus.
Part-of-speech tagging is an important step in natural language processing and can be used for tasks such as sentiment analysis, text classification, and question answering. With the NLTK library, it is easy to get started with part-of-speech tagging and use it to gain insights into a given text.