Text Classification

Classification is the task of choosing the ‘correct’ label for an input. Supervised classification refers to a labeling task where the labels are defined in advance (this is in contrast to unsupervised classification, e.g. topic modeling, where the labels are not predefined). Some examples of supervised classification include:

  • Categorizing an email as spam or not spam
  • Categorizing the topic of a news article from a fixed list of topics
  • Categorizing the sentiment of a document as positive, negative, or neutral

Supervised classification tasks use data that has already been classified to train machine learning algorithms to assign labels to unclassified data. The central idea is that we train the computer to look for certain features of a text and to build a model that relates those features to labels. This model can then be used to classify new bodies of text.

Below, we will explore different supervised classification methods.

Gender classification

Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Let’s build a classifier to model these differences more precisely.

The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we’ll start by just looking at the final letter of a given name. The following feature extractor function builds a dictionary containing relevant information about a given name.

def gender_features(word):
    return {'last_letter': word[-1]}

gender_features('Louis')
{'last_letter': 's'}

The output of our gender_features function is a dictionary, known as a feature set, which maps feature names (last_letter) to their values (word[-1]). Feature names typically provide a human-readable description of the feature, as in the example ‘last_letter’. Feature values are typically simple values, such as booleans, numbers, or strings. In this case, the value is a single-character string.

Now that we’ve defined a gender feature extractor, we need to prepare a list of examples and corresponding class labels.

import nltk
from nltk.corpus import names
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

import random
random.shuffle(labeled_names)

labeled_names[:10] #list first 10 names
[('Weider', 'male'),
 ('Dalia', 'female'),
 ('Shaun', 'male'),
 ('Page', 'female'),
 ('Casey', 'male'),
 ('Mara', 'female'),
 ('Maryangelyn', 'female'),
 ('Cindy', 'female'),
 ('Jeniece', 'female'),
 ('Annelise', 'female')]

We have just created a list of names paired with gender labels (stored in the object labeled_names).

What do our feature sets look like once we apply the feature extractor to each name?

featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
featuresets[:10] 
[({'last_letter': 'r'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'n'}, 'female'),
 ({'last_letter': 'y'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'e'}, 'female')]

Let’s now divide this list of feature sets into two parts: a training set and a test set. The training set will be used to train the classifier. The test set will be used to evaluate how well the classifier performs.

train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
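
The slicing above can also be wrapped in a small helper. The following is a minimal sketch and not part of NLTK: the name split_featuresets and its test_size parameter are hypothetical, chosen only for illustration, and the toy data stands in for the real feature sets. It assumes the list has already been shuffled, as we did with random.shuffle.

```python
def split_featuresets(featuresets, test_size):
    """Split an already-shuffled list into (train_set, test_set).

    The first `test_size` items become the test set and the rest
    become the training set, mirroring the slicing used above.
    """
    if not 0 < test_size < len(featuresets):
        raise ValueError("test_size must be between 1 and len(featuresets) - 1")
    return featuresets[test_size:], featuresets[:test_size]


# Toy data standing in for the real feature sets (hypothetical values).
toy = [({'last_letter': 'a'}, 'female'),
       ({'last_letter': 'k'}, 'male'),
       ({'last_letter': 'e'}, 'female'),
       ({'last_letter': 'o'}, 'male')]

toy_train, toy_test = split_featuresets(toy, test_size=1)
print(len(toy_train), len(toy_test))
```

Keeping the two slices in one function makes it harder to accidentally let the same item end up in both sets.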

Let’s test how our classifier works on a few names chosen by hand:

classifier.classify(gender_features('Gandalf')), classifier.classify(gender_features('Bilbo'))
('male', 'male')

These character names from The Hobbit are correctly classified. But our classifier isn’t perfect.

classifier.classify(gender_features('Katniss')), classifier.classify(gender_features('Dumbledore')) 
('male', 'female')

A classifier will never be 100% accurate. But we can systematically evaluate how well a classifier performs by looking at how well it classifies data that has already been labeled.

Below, let’s look at how our classifier assigns gender labels to the data in our test set. Comparing the labels assigned by the classifier with the labels already included in the test set gives us a measure of classifier accuracy: the fraction of test items that the classifier labels correctly.

print(nltk.classify.accuracy(classifier, test_set))
0.712

Our gender classifier is approximately 71% accurate on this test set, which is pretty good for a model that looks at a single letter.
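
To make the accuracy figure concrete, the sketch below computes the same kind of score by hand. It only illustrates the idea behind nltk.classify.accuracy and does not reproduce its implementation; simple_accuracy is a hypothetical helper and the labels are made up.

```python
def simple_accuracy(gold_labels, predicted_labels):
    """Return the fraction of predictions that match the gold labels."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("both lists must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for gold, guess in zip(gold_labels, predicted_labels)
                  if gold == guess)
    return correct / len(gold_labels)


# Made-up labels: three of the four predictions agree with the gold labels.
gold = ['male', 'female', 'female', 'male']
guesses = ['male', 'female', 'male', 'male']
print(simple_accuracy(gold, guesses))  # prints 0.75
```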

We can further examine the classifier to determine which features it found most effective for distinguishing gender.

classifier.show_most_informative_features(5)
Most Informative Features
             last_letter = 'a'            female : male   =     39.6 : 1.0
             last_letter = 'k'              male : female =     32.4 : 1.0
             last_letter = 'f'              male : female =     15.3 : 1.0
             last_letter = 'p'              male : female =     12.6 : 1.0
             last_letter = 'm'              male : female =     11.9 : 1.0

This list shows the likelihood ratios between different features and their labeled categories. For example, names in the training set that end in “a” are about 40 times more likely to be female than male, while names that end in “k” are about 32 times more likely to be male than female.
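
A ratio of this kind can be approximated by hand from counts. The sketch below uses made-up counts (not the real names corpus) and a hypothetical helper, likelihood_ratio, that compares how often a feature occurs under each of two labels. NLTK's Naive Bayes trainer smooths its probability estimates by default, so the figures it reports will not match an unsmoothed calculation like this one exactly.

```python
def likelihood_ratio(feature_counts, label_totals, label_a, label_b):
    """Return P(feature | label_a) / P(feature | label_b).

    feature_counts: how many names with each label have the feature.
    label_totals:   how many names with each label there are in total.
    """
    p_a = feature_counts[label_a] / label_totals[label_a]
    p_b = feature_counts[label_b] / label_totals[label_b]
    if p_b == 0:
        raise ZeroDivisionError("feature never occurs with label_b")
    return p_a / p_b


# Made-up counts for names ending in 'a' (for illustration only).
ends_in_a = {'female': 2000, 'male': 30}
totals = {'female': 5000, 'male': 3000}

ratio = likelihood_ratio(ends_in_a, totals, 'female', 'male')
print(round(ratio, 1))  # prints 40.0
```

Dividing by the label totals matters: without it, a feature would look "female" simply because the corpus contains more female names.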

Choosing The Right Features

Selecting relevant features and deciding how to encode them for a classifier can have an enormous impact on the classifier’s ability to extract a good model. Much of the interesting work in building a classifier is deciding what features might be relevant and how best to represent them. Although it’s often possible to get decent performance by using a fairly simple and obvious set of features, there are usually significant gains to be had by using carefully constructed features based on an understanding of the task at hand.

Typically, feature extractors are built through a process of trial and error. It’s common to start with a “kitchen sink” approach, including every feature you can think of, and then check which features are actually helpful.

def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

gender_features2('Tommy') 
{'first_letter': 't',
 'last_letter': 'y',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 0,
 'has(h)': False,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 0,
 'has(j)': False,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 2,
 'has(m)': True,
 'count(n)': 0,
 'has(n)': False,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 1,
 'has(t)': True,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 1,
 'has(y)': True,
 'count(z)': 0,
 'has(z)': False}

However, there are limits to the number of features that you should use. Too many features can make the algorithm rely on idiosyncrasies in your training data that don’t generalize well to new examples. This problem is known as overfitting, and can be especially problematic when working with small training sets.
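
One practical symptom of overfitting is a large gap between accuracy on the training data and accuracy on held-out data. The sketch below shows the extreme case with a hypothetical "memorizing" classifier and made-up names. It is not how Naive Bayes works; it only illustrates what that gap looks like.

```python
def train_memorizer(labeled_names):
    """'Train' by memorizing every (name, label) pair verbatim."""
    return dict(labeled_names)


def classify(model, name, default='female'):
    """Look the name up; fall back to a fixed guess for unseen names."""
    return model.get(name, default)


def accuracy(model, labeled_names):
    """Fraction of names whose looked-up label matches the true label."""
    correct = sum(1 for name, label in labeled_names
                  if classify(model, name) == label)
    return correct / len(labeled_names)


# Made-up data for illustration only.
train = [('Anna', 'female'), ('Mark', 'male'), ('Erik', 'male'), ('Lena', 'female')]
held_out = [('Nora', 'female'), ('Jack', 'male'), ('Otto', 'male'), ('Hugo', 'male')]

model = train_memorizer(train)
print(accuracy(model, train))     # prints 1.0
print(accuracy(model, held_out))  # prints 0.25
```

The model is perfect on names it has seen and poor on everything else, which is exactly the pattern to watch for when adding many features.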

How does a classifier trained with gender_features2 compare against one trained with our original gender_features extractor?

featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
0.762

Error analysis

Once an initial set of features has been chosen, we can refine the feature set using error analysis.

First, we select a development set, containing the corpus data for creating the model. This development set is then subdivided into the training set and the dev-test set.

train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

The training set is used to train the model and the dev-test set is used to perform error analysis. Note: it is important that we employ a separate dev-test set for error analysis rather than just using the test set. The division of the corpus data into different subsets is shown below.

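[Figure: corpus workflow, showing the corpus data divided into a development set (training set and dev-test set) and a test set]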

We train a model using the training set [1], and then run it on the dev-test set [2]:

train_set = [(gender_features2(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features2(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features2(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set) #[1]
print(nltk.classify.accuracy(classifier, devtest_set)) #[2]
0.739

Using the dev-test set, we can generate a list of errors that the classifier makes when predicting name genders:

errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features2(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

We can then examine individual error cases where the model predicted the wrong label and try to determine what additional pieces of information would allow the classifier to make the right decision (or which existing pieces of information are tricking it into making the wrong decision). The feature set can then be adjusted accordingly.

len(errors) # number of mislabeled names
261
for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))
correct=female   guess=male     name=Agnes                         
correct=female   guess=male     name=Ardis                         
correct=female   guess=male     name=Ardys                         
correct=female   guess=male     name=Audy                          
correct=female   guess=male     name=Berte                         
correct=female   guess=male     name=Beth                          
correct=female   guess=male     name=Bo                            
correct=female   guess=male     name=Bobbette                      
correct=female   guess=male     name=Bonny                         
correct=female   guess=male     name=Brear                         
correct=female   guess=male     name=Brit                          
correct=female   guess=male     name=Buffy                         
correct=female   guess=male     name=Cam                           
correct=female   guess=male     name=Cameo                         
correct=female   guess=male     name=Carroll                       
correct=female   guess=male     name=Cherish                       
correct=female   guess=male     name=Chloris                       
correct=female   guess=male     name=Coriss                        
correct=female   guess=male     name=Correy                        
correct=female   guess=male     name=Cris                          
correct=female   guess=male     name=Cristin                       
correct=female   guess=male     name=Dagmar                        
correct=female   guess=male     name=Deborah                       
correct=female   guess=male     name=Doreen                        
correct=female   guess=male     name=Dorthy                        
correct=female   guess=male     name=Drew                          
correct=female   guess=male     name=Dulce                         
correct=female   guess=male     name=Easter                        
correct=female   guess=male     name=Elsbeth                       
correct=female   guess=male     name=Em                            
correct=female   guess=male     name=Evy                           
correct=female   guess=male     name=Fallon                        
correct=female   guess=male     name=Faythe                        
correct=female   guess=male     name=Florence                      
correct=female   guess=male     name=Francesmary                   
correct=female   guess=male     name=Gladys                        
correct=female   guess=male     name=Goldy                         
correct=female   guess=male     name=Gussie                        
correct=female   guess=male     name=Gwenore                       
correct=female   guess=male     name=Gwyneth                       
correct=female   guess=male     name=Harriott                      
correct=female   guess=male     name=Honor                         
correct=female   guess=male     name=Inez                          
correct=female   guess=male     name=Ingeborg                      
correct=female   guess=male     name=Ivett                         
correct=female   guess=male     name=Jerry                         
correct=female   guess=male     name=Jo                            
correct=female   guess=male     name=Joby                          
correct=female   guess=male     name=Joell                         
correct=female   guess=male     name=Joey                          
correct=female   guess=male     name=Josee                         
correct=female   guess=male     name=Joselyn                       
correct=female   guess=male     name=Jourdan                       
correct=female   guess=male     name=Karon                         
correct=female   guess=male     name=Kit                           
correct=female   guess=male     name=Korry                         
correct=female   guess=male     name=Kristin                       
correct=female   guess=male     name=Lucky                         
correct=female   guess=male     name=Margaret                      
correct=female   guess=male     name=Marget                        
correct=female   guess=male     name=Margo                         
correct=female   guess=male     name=Marjory                       
correct=female   guess=male     name=Marylou                       
correct=female   guess=male     name=Meridith                      
correct=female   guess=male     name=Merry                         
correct=female   guess=male     name=Mignon                        
correct=female   guess=male     name=Moll                          
correct=female   guess=male     name=Mommy                         
correct=female   guess=male     name=Myriam                        
correct=female   guess=male     name=Olympe                        
correct=female   guess=male     name=Patty                         
correct=female   guess=male     name=Perl                          
correct=female   guess=male     name=Perry                         
correct=female   guess=male     name=Phoebe                        
correct=female   guess=male     name=Phylis                        
correct=female   guess=male     name=Phyllis                       
correct=female   guess=male     name=Pru                           
correct=female   guess=male     name=Prudence                      
correct=female   guess=male     name=Rebekah                       
correct=female   guess=male     name=Rey                           
correct=female   guess=male     name=Robyn                         
correct=female   guess=male     name=Rosaleen                      
correct=female   guess=male     name=Rosalyn                       
correct=female   guess=male     name=Rosamond                      
correct=female   guess=male     name=Roselyn                       
correct=female   guess=male     name=Roxane                        
correct=female   guess=male     name=Roxie                         
correct=female   guess=male     name=Roxine                        
correct=female   guess=male     name=Rubi                          
correct=female   guess=male     name=Scarlet                       
correct=female   guess=male     name=Shanon                        
correct=female   guess=male     name=Sharyl                        
correct=female   guess=male     name=Sherry                        
correct=female   guess=male     name=Sheryl                        
correct=female   guess=male     name=Shir                          
correct=female   guess=male     name=Sigrid                        
correct=female   guess=male     name=Siobhan                       
correct=female   guess=male     name=Sonny                         
correct=female   guess=male     name=Starr                         
correct=female   guess=male     name=Stormi                        
correct=female   guess=male     name=Sue                           
correct=female   guess=male     name=Terese                        
correct=female   guess=male     name=Terry                         
correct=female   guess=male     name=Theo                          
correct=female   guess=male     name=Thomasine                     
correct=female   guess=male     name=Tiffy                         
correct=female   guess=male     name=Timmy                         
correct=female   guess=male     name=Tish                          
correct=female   guess=male     name=Torie                         
correct=female   guess=male     name=Torrie                        
correct=female   guess=male     name=Trix                          
correct=female   guess=male     name=Trixi                         
correct=female   guess=male     name=Trudey                        
correct=female   guess=male     name=Tuesday                       
correct=female   guess=male     name=Ulrike                        
correct=female   guess=male     name=Venus                         
correct=female   guess=male     name=Wandis                        
correct=female   guess=male     name=Wendy                         
correct=female   guess=male     name=Willi                         
correct=female   guess=male     name=Willy                         
correct=female   guess=male     name=Wrennie                       
correct=female   guess=male     name=Yoko                          
correct=female   guess=male     name=Zorah                         
correct=male     guess=female   name=Abby                          
correct=male     guess=female   name=Aditya                        
correct=male     guess=female   name=Agamemnon                     
correct=male     guess=female   name=Alain                         
correct=male     guess=female   name=Aleks                         
correct=male     guess=female   name=Allan                         
correct=male     guess=female   name=Allie                         
correct=male     guess=female   name=Allin                         
correct=male     guess=female   name=Andrea                        
correct=male     guess=female   name=Andri                         
correct=male     guess=female   name=Anthony                       
correct=male     guess=female   name=Antone                        
correct=male     guess=female   name=Antonin                       
correct=male     guess=female   name=Antony                        
correct=male     guess=female   name=Archie                        
correct=male     guess=female   name=Arie                          
correct=male     guess=female   name=Aristotle                     
correct=male     guess=female   name=Arne                          
correct=male     guess=female   name=Arvie                         
correct=male     guess=female   name=Arvy                          
correct=male     guess=female   name=Barnaby                       
correct=male     guess=female   name=Barny                         
correct=male     guess=female   name=Benjamin                      
correct=male     guess=female   name=Benn                          
correct=male     guess=female   name=Benny                         
correct=male     guess=female   name=Binky                         
correct=male     guess=female   name=Blair                         
correct=male     guess=female   name=Bradly                        
correct=male     guess=female   name=Brinkley                      
correct=male     guess=female   name=Cal                           
correct=male     guess=female   name=Chaddie                       
correct=male     guess=female   name=Chancey                       
correct=male     guess=female   name=Charlie                       
correct=male     guess=female   name=Chaunce                       
correct=male     guess=female   name=Chrissy                       
correct=male     guess=female   name=Clair                         
correct=male     guess=female   name=Clancy                        
correct=male     guess=female   name=Clare                         
correct=male     guess=female   name=Clemente                      
correct=male     guess=female   name=Clinton                       
correct=male     guess=female   name=Constantin                    
correct=male     guess=female   name=Cyril                         
correct=male     guess=female   name=Dani                          
correct=male     guess=female   name=Dannie                        
correct=male     guess=female   name=Darby                         
correct=male     guess=female   name=Darian                        
correct=male     guess=female   name=Darrin                        
correct=male     guess=female   name=Daryl                         
correct=male     guess=female   name=Dean                          
correct=male     guess=female   name=Deryl                         
correct=male     guess=female   name=Dickie                        
correct=male     guess=female   name=Emil                          
correct=male     guess=female   name=Evelyn                        
correct=male     guess=female   name=Ezechiel                      
correct=male     guess=female   name=Flynn                         
correct=male     guess=female   name=Gail                          
correct=male     guess=female   name=Galen                         
correct=male     guess=female   name=Garey                         
correct=male     guess=female   name=Gene                          
correct=male     guess=female   name=Gil                           
correct=male     guess=female   name=Glynn                         
correct=male     guess=female   name=Granville                     
correct=male     guess=female   name=Hadleigh                      
correct=male     guess=female   name=Hamel                         
correct=male     guess=female   name=Hamlen                        
correct=male     guess=female   name=Hanan                         
correct=male     guess=female   name=Hassan                        
correct=male     guess=female   name=Hillel                        
correct=male     guess=female   name=Hymie                         
correct=male     guess=female   name=Ike                           
correct=male     guess=female   name=Izzy                          
correct=male     guess=female   name=Jeremie                       
correct=male     guess=female   name=Jimmie                        
correct=male     guess=female   name=Jimmy                         
correct=male     guess=female   name=Joaquin                       
correct=male     guess=female   name=Jonathan                      
correct=male     guess=female   name=Jule                          
correct=male     guess=female   name=Kalvin                        
correct=male     guess=female   name=Keene                         
correct=male     guess=female   name=Kendall                       
correct=male     guess=female   name=Kennedy                       
correct=male     guess=female   name=Kin                           
correct=male     guess=female   name=Lane                          
correct=male     guess=female   name=Laurance                      
correct=male     guess=female   name=Lay                           
correct=male     guess=female   name=Lennie                        
correct=male     guess=female   name=Lenny                         
correct=male     guess=female   name=Leslie                        
correct=male     guess=female   name=Lindy                         
correct=male     guess=female   name=Linoel                        
correct=male     guess=female   name=Lonnie                        
correct=male     guess=female   name=Lovell                        
correct=male     guess=female   name=Luke                          
correct=male     guess=female   name=Lyn                           
correct=male     guess=female   name=Manuel                        
correct=male     guess=female   name=Marion                        
correct=male     guess=female   name=Marlin                        
correct=male     guess=female   name=Marlon                        
correct=male     guess=female   name=Martainn                      
correct=male     guess=female   name=Marten                        
correct=male     guess=female   name=Meryl                         
correct=male     guess=female   name=Mika                          
correct=male     guess=female   name=Millicent                     
correct=male     guess=female   name=Milt                          
correct=male     guess=female   name=Neal                          
correct=male     guess=female   name=Nealon                        
correct=male     guess=female   name=Nealy                         
correct=male     guess=female   name=Neddie                        
correct=male     guess=female   name=Nevil                         
correct=male     guess=female   name=Neville                       
correct=male     guess=female   name=Nilson                        
correct=male     guess=female   name=Noble                         
correct=male     guess=female   name=Obadiah                       
correct=male     guess=female   name=Odie                          
correct=male     guess=female   name=Pascale                       
correct=male     guess=female   name=Pierce                        
correct=male     guess=female   name=Rafael                        
correct=male     guess=female   name=Raleigh                       
correct=male     guess=female   name=Randall                       
correct=male     guess=female   name=Randell                       
correct=male     guess=female   name=Ray                           
correct=male     guess=female   name=Reggie                        
correct=male     guess=female   name=Rickey                        
correct=male     guess=female   name=Salman                        
correct=male     guess=female   name=Sansone                       
correct=male     guess=female   name=Simone                        
correct=male     guess=female   name=Slade                         
correct=male     guess=female   name=Sonnie                        
correct=male     guess=female   name=Stillman                      
correct=male     guess=female   name=Sullivan                      
correct=male     guess=female   name=Tait                          
correct=male     guess=female   name=Timmie                        
correct=male     guess=female   name=Virge                         
correct=male     guess=female   name=Yale                          
correct=male     guess=female   name=Yardley                       
correct=male     guess=female   name=Zachariah                     
correct=male     guess=female   name=Zacharie                      
correct=male     guess=female   name=Zechariah                     

Looking through this list of errors, it appears that certain 2-character suffixes are indicative of gender. For example, names ending in ‘yn’ appear to be predominantly female, despite the fact that names ending in ‘n’ tend to be male; and names ending in ‘ch’ are usually male, even though names that end in ‘h’ tend to be female.
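
Spotting such patterns by eye is tedious, so it can help to tally suffixes among the errors. The sketch below is one possible way to do that. suffix_error_counts is a hypothetical helper, and sample_errors contains only a handful of entries copied from the listing above rather than the full errors list.

```python
from collections import Counter


def suffix_error_counts(errors, n=2):
    """Count (correct_label, suffix) pairs among misclassified names.

    `errors` is a list of (correct_label, guess, name) tuples, the same
    shape as the `errors` list built above.
    """
    return Counter((tag, name[-n:].lower()) for tag, guess, name in errors)


# A handful of entries taken from the error listing above.
sample_errors = [
    ('female', 'male', 'Joselyn'),
    ('female', 'male', 'Rosalyn'),
    ('female', 'male', 'Roselyn'),
    ('female', 'male', 'Robyn'),
    ('male', 'female', 'Archie'),
    ('male', 'female', 'Charlie'),
]

for (tag, suffix), count in suffix_error_counts(sample_errors).most_common(3):
    print(tag, suffix, count)
```

Suffixes that show up repeatedly under one correct label are good candidates for new features.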

Let’s adjust our feature extractor to include features for two-letter suffixes:

def gender_features(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

Let’s rebuild our classifier with the new feature extractor and see whether its accuracy improves.

train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))
0.776

This error analysis procedure can be repeated after checking for patterns in errors made by the improved classifier. Each time the error analysis procedure is repeated, we should select a different dev-test/training split, to ensure that the classifier does not start to reflect idiosyncrasies in the dev-test set.

However, once we’ve used the dev-test set to develop the model, we can no longer trust that it will give us an accurate idea of how well the model would perform on new data. It is therefore important to keep the test set separate, and unused, until our model development is complete. At that point, we can use the test set to evaluate how well our model will perform on new input values.
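
The re-splitting described above can be sketched as follows. This is a minimal illustration under stated assumptions: resplit_development and its parameters are hypothetical names, the first items of the list are assumed to be the reserved test set (as in the slices earlier), and a seeded random.Random is used only to make the sketch reproducible.

```python
import random


def resplit_development(labeled_items, test_size, devtest_size, seed=None):
    """Return (train, devtest, test) with a freshly shuffled development set.

    The first `test_size` items are always kept aside as the test set and
    are never shuffled into the development data.
    """
    if test_size + devtest_size >= len(labeled_items):
        raise ValueError("not enough items left for a training set")
    test = labeled_items[:test_size]
    development = labeled_items[test_size:]   # slicing copies the list
    random.Random(seed).shuffle(development)  # reshuffle only the copy
    return development[devtest_size:], development[:devtest_size], test


# Made-up stand-in for labeled_names.
items = [('name{}'.format(i), 'male' if i % 2 else 'female') for i in range(20)]

train_a, devtest_a, test_a = resplit_development(items, 5, 5, seed=1)
train_b, devtest_b, test_b = resplit_development(items, 5, 5, seed=2)

print(test_a == test_b)  # the test set is the same in both rounds
print(len(train_a), len(devtest_a), len(test_a))
```

Each round of error analysis gets a different training/dev-test split, while the test set stays untouched until development is finished.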

Exercise

In addition to the suffix features, add another word feature to your gender classifier and evaluate the accuracy of this new model.

Document Classification

Sometimes we want to classify an entire document rather than a particular word. Document classification is especially useful for sentiment analysis because the emotions or opinions contained in a text are not typically represented by a single word.

Sentiment analysis is a category of supervised classification tasks in which the labels represent different sentiments, often negative, positive, and neutral. The procedure is the same as before: using pre-labeled corpora, we can build classifiers that automatically tag new documents with appropriate category labels.

As an illustrative example, let’s work with the Movie Reviews Corpus that categorizes each review as positive or negative.

from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

Now let’s define a feature extractor for our documents such that the classifier will know which aspects of the data it should pay attention to.

For document topic identification, we can define a feature for each word that indicates whether the document contains that word. To limit the number of features that the classifier needs to process, we begin by constructing a list of the 2000 most frequent words in the overall corpus. We can then define a feature extractor that simply checks whether each of these words is present in a given document.

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
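# In NLTK 3, list(all_words) is not sorted by frequency, so ask for the most common words explicitly
word_features = [word for (word, count) in all_words.most_common(2000)]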

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
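
To see what such a feature extractor produces, here is a self-contained toy version that needs no corpus. The two tiny "documents" and the names toy_docs, toy_vocab, and toy_document_features are made up for illustration, and collections.Counter is used in place of nltk.FreqDist so that the sketch runs without NLTK.

```python
from collections import Counter

# Two made-up, already-tokenized "reviews".
toy_docs = [
    (['a', 'great', 'great', 'film'], 'pos'),
    (['a', 'dull', 'film'], 'neg'),
]

# Keep the 3 most frequent words across all toy documents.
toy_counts = Counter(w.lower() for words, _ in toy_docs for w in words)
toy_vocab = [w for w, _ in toy_counts.most_common(3)]


def toy_document_features(document, vocab):
    """Map each vocabulary word to whether the document contains it."""
    document_words = set(document)
    return {'contains({})'.format(w): (w in document_words) for w in vocab}


print(toy_document_features(['a', 'dull', 'film'], toy_vocab))
```

Note that "dull" is dropped from the vocabulary because it is too rare, so the second review is described only by the words it shares with the first. The same thing happens at a larger scale when the real vocabulary is capped at 2000 words.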

Now that we’ve defined our feature extractor we can use it to train a classifier to label new movie reviews. To check how reliable the resulting classifier is, we compute its accuracy on the test set. And once again, we can use show_most_informative_features() to find out which features the classifier found to be most informative.

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
0.82
classifier.show_most_informative_features(10)
Most Informative Features
 contains(unimaginative) = True              neg : pos    =      8.2 : 1.0
    contains(schumacher) = True              neg : pos    =      6.9 : 1.0
          contains(mena) = True              neg : pos    =      6.9 : 1.0
        contains(suvari) = True              neg : pos    =      6.9 : 1.0
     contains(atrocious) = True              neg : pos    =      6.5 : 1.0
        contains(shoddy) = True              neg : pos    =      6.3 : 1.0
        contains(turkey) = True              neg : pos    =      5.8 : 1.0
        contains(justin) = True              neg : pos    =      5.7 : 1.0
           contains(ugh) = True              neg : pos    =      5.7 : 1.0
       contains(unravel) = True              pos : neg    =      5.7 : 1.0
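Each ratio in this table compares how likely a feature value is under one label versus the other. As a rough sketch of the arithmetic (not necessarily NLTK's exact implementation), the example below assumes add-0.5 smoothing over two possible feature values. The helper `smoothed_prob` and all of the counts are hypothetical.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """P(feature=True | label) with additive smoothing (gamma and bins are assumptions)."""
    return (count + gamma) / (total + gamma * bins)

# Hypothetical counts: in how many reviews of each label the word appears.
neg_with_word, neg_total = 16, 1000
pos_with_word, pos_total = 1, 1000

p_neg = smoothed_prob(neg_with_word, neg_total)
p_pos = smoothed_prob(pos_with_word, pos_total)
ratio = p_neg / p_pos
print('neg : pos = {:.1f} : 1.0'.format(ratio))
```

Rare words can produce large ratios from very few occurrences, which is one reason proper names tend to show up in lists like this.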

Example using Naive Bayes Classification

Let’s run some classification tasks using the Supreme Court confirmation hearing transcripts.

This data contains the text of every Supreme Court confirmation hearing for which Senate Judiciary Committee transcripts are available (beginning in 1971 with hearings for Lewis Powell and William Rehnquist and concluding with Neil Gorsuch’s 2017 hearing).

import pandas as pd
import nltk

data = pd.read_csv('../input/Oct-7-2019-Supreme-Court-Confirmation-Hearing-Transcript.csv', encoding='latin1')

data = data.reset_index()

data.head()
   index  Total Order  Order  Year  Hearing          Title     Speaker (Party)(or nominated by)  Speaker and title            Statements                                         sentiment
0  0      1            1      2017  Neil M. Gorsuch  Chairman  R                                 Senator Chuck Grassley (IA)  Chairman Grassley. Good morning, everybody. I ...  positive
1  1      2            2      2017  Neil M. Gorsuch  Nominee   R                                 Neil M. Gorsuch              Judge Gorsuch. Pleasure to be here. Thank you.     positive
2  2      3            3      2017  Neil M. Gorsuch  Chairman  R                                 Senator Chuck Grassley (IA)  Chairman Grassley. This is a big day for you a...  positive
3  3      4            4      2017  Neil M. Gorsuch  Nominee   R                                 Neil M. Gorsuch              Judge Gorsuch. Just a little, Senator.             neutral
4  4      5            5      2017  Neil M. Gorsuch  Chairman  R                                 Senator Chuck Grassley (IA)  Chairman Grassley. Yes. Before we begin, I wou...  neutral

Classifying partisanship

Do Republicans and Democrats speak differently at judicial confirmation hearings? That is, can we infer party label based on what a speaker says?

The dataset already includes the party label of each speaker. We can use this information to create a partisanship classifier.

The first thing we need to do to create our classifier is to create a set of word features associated with a given party label. There are a few pre-processing steps we will need to do in order to extract the labeled text features from our pandas dataframe object so it can be added to our classifier.

Pre-processing steps

The first thing we want to do is create a list of all words across all documents. Recall that this information is used to calculate the posterior probabilities in Naive Bayes classification.
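As a reminder of how word information feeds into those posteriors, here is a minimal sketch of Bayes' rule under the naive independence assumption. The priors, the per-word likelihoods, and the two-word vocabulary are all hypothetical numbers chosen for illustration, not values estimated from the transcripts.

```python
import math

# Hypothetical training summary: label priors and P(word present | label).
priors = {'pos': 0.5, 'neg': 0.5}
likelihoods = {
    'pos': {'great': 0.6, 'boring': 0.1},
    'neg': {'great': 0.2, 'boring': 0.5},
}

def posterior(present_words):
    """Return P(label | words), treating words as independent given the label."""
    log_scores = {}
    for label, prior in priors.items():
        log_score = math.log(prior)
        for word, p in likelihoods[label].items():
            # Use p if the word is present, otherwise 1 - p.
            log_score += math.log(p if word in present_words else 1.0 - p)
        log_scores[label] = log_score
    # Normalize so the scores sum to one.
    total = sum(math.exp(s) for s in log_scores.values())
    return {label: math.exp(s) / total for label, s in log_scores.items()}

post = posterior({'great'})
print(post)
```

Working in log space is a common choice because multiplying thousands of small probabilities directly can underflow.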

Let’s begin by tokenizing all of the documents.

words = data['Statements'].apply(nltk.word_tokenize)
words # the tokenized words are stored in a pandas Series, one list of tokens per statement.
0        [Chairman, Grassley, ., Good, morning, ,, ever...
1        [Judge, Gorsuch, ., Pleasure, to, be, here, .,...
2        [Chairman, Grassley, ., This, is, a, big, day,...
3        [Judge, Gorsuch, ., Just, a, little, ,, Senato...
4        [Chairman, Grassley, ., Yes, ., Before, we, be...
                               ...                        
30173    [Mr., POWELL, ., I, wish, to, thank, the, chai...
30174    [The, CHAIRMAN, ., Thank, you, ,, sir, ., Now,...
30175    [Senator, SCOTT, ., IS, that, room, 2300, ,, M...
30176    [The, CHAIRMAN, ., It, is, the, Judiciary, Com...
30177    [Senator, SCOTT, ., Room, 2228, ., I, just, sa...
Name: Statements, Length: 30178, dtype: object

Turn this Series into a list object.

words = list(words) # the tokenized words are now structured as a list of lists

Create a master list of all words across all documents. This list will be used to construct our feature set below.

import itertools
all_words = list(itertools.chain.from_iterable(words)) # flat list of every token; we will use it below when we build our Naive Bayes classifier.
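If the flattening step is unfamiliar, this small sketch shows what `itertools.chain.from_iterable` does to a list of lists. The two token lists are hypothetical.

```python
import itertools

# Two hypothetical tokenized statements.
tokenized = [['Good', 'morning', '.'], ['Thank', 'you', '.']]

# chain.from_iterable walks each inner list in turn and yields its items.
flat = list(itertools.chain.from_iterable(tokenized))
print(flat)  # ['Good', 'morning', '.', 'Thank', 'you', '.']
```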

Now let’s pre-process the text in the original dataframe, tokenizing words and converting everything to lower case.

data['Statements'] = data['Statements'].apply(lambda x: x.lower())
data['Statements'] = data['Statements'].apply(nltk.word_tokenize)

Create a list object that links texts with party label. We will split this data up into a training and test set for our classifier.

documents = list(zip(data['Statements'], data['Speaker (Party)(or nominated by)']))

Now let’s use this information to create a feature extractor for our documents. The following feature extractor indicates whether a given document contains a word from our master list of words (all_words). To limit the number of features that the classifier needs to process, we begin by constructing a list of the 2000 most frequent words in the overall corpus [1]. We then define a feature extractor [2] that simply checks whether each of these words is present in a given document.

all_words = nltk.FreqDist(w.lower() for w in all_words) #[1]
# most_common() returns words in descending frequency order
word_features = [word for word, count in all_words.most_common(2000)]

def document_features(document): #[2]
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
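Whether a slice of the word list really holds the most frequent words depends on the order in which the counter is iterated. A minimal sketch with `collections.Counter` (which `nltk.FreqDist` subclasses in NLTK 3) shows the difference between slicing a list of the keys and calling `most_common()`. The token list is hypothetical.

```python
from collections import Counter

# Hypothetical tokens: 'court' is the most frequent word but is not seen first.
tokens = ['the', 'judge', 'said', 'the', 'court', 'the',
          'court', 'agreed', 'court', 'court']
counts = Counter(tokens)

first_seen = list(counts)[:2]                           # insertion order
most_frequent = [w for w, _ in counts.most_common(2)]   # frequency order

print(first_seen)     # ['the', 'judge']
print(most_frequent)  # ['court', 'the']
```

On a real corpus it is worth printing the first few entries of `word_features` to confirm they look like high-frequency words.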

Now that we’ve defined our feature extractor, we can use it to train a classifier to label new, previously unseen texts. To check how reliable the resulting classifier is, we compute its accuracy on the test set. And once again, we can use the command show_most_informative_features() to find out which features the classifier found to be most informative.

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(5)
Most Informative Features
       contains(kennedy) = True                D : nan    =     37.5 : 1.0
   contains(innumerable) = True              nan : R      =     26.6 : 1.0
       contains(gorsuch) = True                R : D      =     26.4 : 1.0
     contains(important) = True                R : nan    =     25.6 : 1.0
         contains(judge) = True                R : nan    =     25.5 : 1.0
print(nltk.classify.accuracy(classifier, test_set))
0.35

With approximately 35% accuracy, this is a pretty abysmal classifier. It performs worse than if we just flipped a coin to choose between the two party labels (i.e. 50% accuracy)!
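Two details of the run above may be worth checking before drawing conclusions. The most informative features list nan alongside D and R, which suggests that statements with no recorded party are being treated as a third label. The test set is also simply the first 100 rows, all from the start of one hearing. The sketch below shows one possible way to drop unlabeled documents and shuffle before splitting. The helper `is_missing`, the toy `documents` list, the seed, and the split size are all hypothetical.

```python
import math
import random

def is_missing(label):
    """True for None or a float NaN, which is how pandas represents blanks."""
    return label is None or (isinstance(label, float) and math.isnan(label))

# Hypothetical (tokens, party) pairs standing in for the real documents list.
documents = [
    (['thank', 'you', 'senator'], 'R'),
    (['the', 'hearing', 'is', 'adjourned'], float('nan')),
    (['i', 'have', 'a', 'question'], 'D'),
    (['good', 'morning'], 'R'),
    (['please', 'proceed'], 'D'),
]

# Keep only documents that actually carry a party label.
labeled = [(tokens, party) for tokens, party in documents if not is_missing(party)]

rng = random.Random(0)  # fixed seed so the split is repeatable
rng.shuffle(labeled)

n_test = 1  # hypothetical held-out size for this toy data
test_docs, train_docs = labeled[:n_test], labeled[n_test:]
print(len(train_docs), len(test_docs))  # 3 1
```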

What could we do to improve our classifier? How about we restrict our sample to a single confirmation hearing, i.e. Neil Gorsuch’s hearing in 2017? How well does the classifier perform for this subset?

data = pd.read_csv('../input/Oct-7-2019-Supreme-Court-Confirmation-Hearing-Transcript.csv', encoding='latin1')
data = data.reset_index()


data = data[data['Statements'].notnull() & (data['Year'] == 2017)].copy() # only hearings from 2017; .copy() so the column assignments below act on an independent dataframe
words = data['Statements'].apply(nltk.word_tokenize)
words = list(words)

all_words = (list(itertools.chain.from_iterable(words)))


data['Statements'] = data['Statements'].apply(lambda x: x.lower())
data['Statements'] = data['Statements'].apply(nltk.word_tokenize)


documents = list(zip(data['Statements'], data['Speaker (Party)(or nominated by)']))

all_words = nltk.FreqDist(w.lower() for w in all_words)
# most_common() returns words in descending frequency order
word_features = [word for word, count in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

0.87

Restricting our classifier to 2017 improved our accuracy by over 50 percentage points (from 0.35 to 0.87)! Why do you think the classifier does a better job on 2017 than for the corpus as a whole?
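One way to put an accuracy figure in context is to compare it with a majority-class baseline, that is, the accuracy we would get by always predicting the most common label in the test set. The sketch below is a minimal illustration. The function `majority_baseline` and the label counts are hypothetical and not taken from the transcripts.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

# Hypothetical test-set labels, not taken from the transcripts.
test_labels = ['R'] * 7 + ['D'] * 3
baseline = majority_baseline(test_labels)
print(baseline)  # 0.7
```

If one label dominates a test set, a high accuracy may say more about the label distribution than about the words the classifier learned.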

Exercises

  • Using the sentiment labels you constructed for the Supreme Court Confirmation Hearings transcripts, use a Naive Bayes classifier to predict the sentiment of statements made by and before the Senate Judiciary Committee.
  • Evaluate the accuracy of your classifier.
  • Run error analysis on your classifier. Which features are contributing to misclassification?
  • Try improving the accuracy of your classifier by adding or subtracting word features or subsetting the data.
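As a starting point for the error analysis exercise, here is one possible sketch that collects misclassified examples and counts which features were present in them. The names `collect_errors`, `toy_classify`, and `toy_test_set` are hypothetical. The toy classifier is a stand-in so that the example runs without a trained model.

```python
from collections import Counter

def collect_errors(classify, labeled_featuresets):
    """Return (gold, guess, features) for every misclassified example."""
    errors = []
    for features, gold in labeled_featuresets:
        guess = classify(features)
        if guess != gold:
            errors.append((gold, guess, features))
    return errors

# Hypothetical stand-in classifier: calls anything containing 'thank' positive.
def toy_classify(features):
    return 'positive' if features.get('contains(thank)') else 'negative'

toy_test_set = [
    ({'contains(thank)': True, 'contains(no)': False}, 'positive'),
    ({'contains(thank)': True, 'contains(no)': True}, 'negative'),
    ({'contains(thank)': False, 'contains(no)': True}, 'negative'),
]

errors = collect_errors(toy_classify, toy_test_set)

# Count which features were present in the misclassified examples.
present = Counter(name for _, _, feats in errors
                  for name, value in feats.items() if value)
print(errors)
print(present.most_common())
```

With a trained NLTK classifier, its `classify` method could be passed in place of `toy_classify`, and a held-out set of (features, label) pairs in place of `toy_test_set`.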