Preprocess Text in Python --- A Cleaner and Faster Approach

Preprocess Text in Python --- A Cleaner and Faster Approach

Motivation

Well, I think it all start with one of my favorite tweets from 2013:

When building NLP models, pre-processing your data is extremely important. For example, different stopwords removal, stemming and lemmization might have huge impact on the accuracy of your models. Often times, the order of how you do the cleaning is also critical. Do you want to remove certain words first then tokenize the text? Or tokenize then remove the tokens? What we need is a clear to understand and yet flexiable code to do the pre-processing job. When using R, the pipe operator %>% kind of taken care of the most part. However, there is no really good equivlent in Python because the natural different of Python and R: (Long but very good read)

In other words, it’s like saying that when OOP was born, it was also born with the Gang-of-Four design patterns baked into it’s core as its backing theory of thought (outside of types and inheritance and methods etc.), and every implemented OOP language included these patterns and abstractions by default for you to take advantage of, and that these patterns were bullet-proofed by centuries of research. But that can already never be correct — the Singleton pattern is by now widely recognized as an anti-pattern, and GoF authors said they would remove it, if only they could go back in time.

But we can definitely hack our way around this using Python Class

Design

Let’s create a snippet of text as an example:

sample = """<h1>Title Goes Here</h1>
<b>Bolded Text</b>
<i>Italicized Text</i>
<img src="this should all be gone"/>
<a href="this will be gone, too">But this will still be here!</a>
I run. He ran. She is running. Will they stop running?
I talked. She was talking. They talked to them about running. Who ran to the talking runner?
[Some text we don't want to keep is in here]
¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!
something... is! wrong() with.,; this :: sentence.
I can't do this anymore. I didn't know them. Why couldn't you have dinner at the restaurant?
My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.
Don't do it.... Just don't. Billy! I know what you're doing. This is a great little house you've got here.
[This is some other unwanted text]
John: "Well, well, well."
James: "There, there. There, there."
&nbsp;&nbsp;
There are a lot of reasons not to do this. There are 101 reasons not to do it. 1000000 reasons, actually.
I have to go get 2 tutus from 2 different stores, too.
22    45   1067   445
{{Here is some stuff inside of double curly braces.}}
{Here is more stuff in single curly braces.}
[DELETE]
</body>
</html>"""

Say you want to strip some html characters and use regular expressions to remove open and close double brackets and anything in between them:

import re, string, unicodedata
from bs4 import BeautifulSoup

def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def remove_between_square_brackets(text):
    return re.sub('\[[^]]*\]', '', text)

def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text

sample = denoise_text(sample)
print(sample)
    Title Goes Here
    Bolded Text
    Italicized Text
    
    But this will still be here!
    I run. He ran. She is running. Will they stop running?
    I talked. She was talking. They talked to them about running. Who ran to the talking runner?
    
    ¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!
    something... is! wrong() with.,; this :: sentence.
    I can't do this anymore. I didn't know them. Why couldn't you have dinner at the restaurant?
    My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.
    Don't do it.... Just don't. Billy! I know what you're doing. This is a great little house you've got here.
    
    John: "Well, well, well."
    James: "There, there. There, there."
    
    There are a lot of reasons not to do this. There are 101 reasons not to do it. 1000000 reasons, actually.
    I have to go get 2 tutus from 2 different stores, too.
    22    45   1067   445
    {{Here is some stuff inside of double curly braces.}}
     {Here is more stuff in single curly braces.}

This perfectly demonstrates our normal workflow: 1. Define couple of functions like strip_html(), remove_punctuation() … 2. Run them one by one or define another “master function” to run them all like above example 3. Found that we need add more functions or change the order of the function runs 4. Make changes to the “master function” by copy and paste, then re-define it 5. Run new “master function” on the text 6. Rinse and repeat

It is not very flexiable and easy to maintain, isn’t it?

Solution

If we put those functions in a Class and let the function return self, we can use the dot notation . to chain them together, at any order!

class cleantext():
    
    def __init__(self, text = None):
        self.text = text
        
    def strip_html(self):
        soup = BeautifulSoup(self.text, "html.parser")
        self.text = soup.get_text()
        return self

    def remove_between_square_brackets(self):
        self.text = re.sub('\[[^]]*\]', '', self.text)
        return self

    def remove_numbers(self):
        self.text = re.sub('[-+]?[0-9]+', '', self.text)
        return self
    
    def do_all(self, text):
        
        self.text = text
        self = self.strip_html()
        self = self.remove_numbers()
       
        return self.words


ct = cleantext(sample)
print(ct.strip_html().remove_between_square_brackets().remove_numbers().text)
    Title Goes Here
    Bolded Text
    Italicized Text
    
    But this will still be here!
    I run. He ran. She is running. Will they stop running?
    I talked. She was talking. They talked to them about running. Who ran to the talking runner?
    
    ¡Sebastián, Nicolás, Alejandro and Jéronimo are going to the store tomorrow morning!
    something... is! wrong() with.,; this :: sentence.
    I can't do this anymore. I didn't know them. Why couldn't you have dinner at the restaurant?
    My favorite movie franchises, in order: Indiana Jones; Marvel Cinematic Universe; Star Wars; Back to the Future; Harry Potter.
    Don't do it.... Just don't. Billy! I know what you're doing. This is a great little house you've got here.
    
    John: "Well, well, well."
    James: "There, there. There, there."
    
    There are a lot of reasons not to do this. There are  reasons not to do it.  reasons, actually.
    I have to go get  tutus from  different stores, too.
              
    {{Here is some stuff inside of double curly braces.}}
    {Here is more stuff in single curly braces.}

This makes our code readable and easy to manipulate. We can read out our code too — just read the dot . as “then” in your mind : > sample text then strip html then remove between square brackets then remove numbers”

Full implementation

So my full definition of the class looks like this: (example and many functions are from KDNugget)

import re, string, unicodedata
import nltk
import contractions
import inflect
from bs4 import BeautifulSoup
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer

class cleantext():
    
    def __init__(self, text = "test"):
        self.text = text
        
    def strip_html(self):
        soup = BeautifulSoup(self.text, "html.parser")
        self.text = soup.get_text()
        return self

    def remove_between_square_brackets(self):
        self.text = re.sub('\[[^]]*\]', '', self.text)
        return self

    def remove_numbers(self):
        self.text = re.sub('[-+]?[0-9]+', '', self.text)
        return self

    def replace_contractions(self):
        """Replace contractions in string of text"""
        self.text = contractions.fix(self.text)
        return self
    
    def get_words(self):
        self.words = nltk.word_tokenize(self.text)
        return self

    def remove_non_ascii(self):
        """Remove non-ASCII characters from list of tokenized words"""
        new_words = []
        for word in self.words:
            new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
            new_words.append(new_word)
        self.words = new_words
        return self

    def to_lowercase(self):
        """Convert all characters to lowercase from list of tokenized words"""
        new_words = []
        for word in self.words:
            new_word = word.lower()
            new_words.append(new_word)
        self.words = new_words
        return self

    def remove_punctuation(self):
        """Remove punctuation from list of tokenized words"""
        new_words = []
        for word in self.words:
            new_word = re.sub(r'[^\w\s]', '', word)
            if new_word != '':
                new_words.append(new_word)
        self.words = new_words
        return self

    def replace_numbers(self):
        """Replace all interger occurrences in list of tokenized words with textual representation"""
        p = inflect.engine()
        new_words = []
        for word in self.words:
            if word.isdigit():
                new_word = p.number_to_words(word)
                new_words.append(new_word)
            else:
                new_words.append(word)
        self.words = new_words
        return self

    def remove_stopwords(self):
        """Remove stop words from list of tokenized words"""
        new_words = []
        for word in self.words:
            if word not in stopwords.words('english'):
                new_words.append(word)
        self.words = new_words
        return self

    def stem_words(self):
        """Stem words in list of tokenized words"""
        stemmer = LancasterStemmer()
        stems = []
        for word in self.words:
            stem = stemmer.stem(word)
            stems.append(stem)
        self.words = stems
        return self

    def lemmatize_verbs(self):
        """Lemmatize verbs in list of tokenized words"""
        lemmatizer = WordNetLemmatizer()
        lemmas = []
        for word in self.words:
            lemma = lemmatizer.lemmatize(word, pos='v')
            lemmas.append(lemma)
        self.words = lemmas
        return self
    
    def join_words(self):
        self.words = ' '.join(self.words)
        return self
    
    def do_all(self, text):
        
        self.text = text
        self = self.strip_html()
        self = self.remove_numbers()
        self = self.replace_contractions()
        self = self.get_words()
        self = self.remove_punctuation()
        self = self.remove_non_ascii()
        self = self.remove_stopwords()
        self = self.stem_words()
        self = self.lemmatize_verbs()
        self = self.join_words()
        
        return self.words

Now we can pre-process our text easily, in a chain-like matter:

ct = cleantext(sample)

ct.\
strip_html().\
remove_numbers().\
replace_contractions().\
get_words().\
remove_punctuation().\
remove_non_ascii().\
remove_stopwords().\
stem_words().\
lemmatize_verbs().\
join_words().\
words
'titl goe her bold text it text but stil i run he run she run wil stop run i talk she talk they talk run who run talk run som text want keep sebast nicola alejandro jeronimo go stor tomorrow morn someth wrong send i anym i know why could din resta my favorit movy franch ord indian jon marvel cinem univers star war back fut harry pot just bil i know thi gre littl hous get thi unw text john wel wel wel jam ther ther ther lot reason ther reason reason act i go get tut diff stor her stuff insid doubl cur brac her stuff singl cur brac delet'

You can easily add or modify any step on your code, it’s like a pipeline!

Apply on Pandas DataFrame

Say right now we have a dataframe and one column contains the string we would like to process. We can easily use apply()

import pandas as pd
df = pd.DataFrame({'id': range(0,100), 'docs':sample } )
df.head()
id docs
0 0 <h1>Title Goes Here</h1>\n<b>Bolded Text</b>\n...
1 1 <h1>Title Goes Here</h1>\n<b>Bolded Text</b>\n...
2 2 <h1>Title Goes Here</h1>\n<b>Bolded Text</b>\n...
3 3 <h1>Title Goes Here</h1>\n<b>Bolded Text</b>\n...
4 4 <h1>Title Goes Here</h1>\n<b>Bolded Text</b>\n...
df['cleaned_docs'] = df['docs'].apply(ct.do_all)
df.head()
id docs cleaned_docs
0 0 <h1>Title Goes Here</h1>\n<b>Bolded Text</b>\n... titl goe her bold text it text but stil i run ...
1 1 <h1>Title Goes Here</h1>\n<b>Bolded Text</b>\n... titl goe her bold text it text but stil i run ...
2 2 <h1>Title Goes Here</h1>\n<b>Bolded Text</b>\n... titl goe her bold text it text but stil i run ...
3 3 <h1>Title Goes Here</h1>\n<b>Bolded Text</b>\n... titl goe her bold text it text but stil i run ...
4 4 <h1>Title Goes Here</h1>\n<b>Bolded Text</b>\n... titl goe her bold text it text but stil i run ...

Parallelization using dask

Care for some speed up using parallelization? No problem. dask to the rescue. (dask documents)

import dask.dataframe as dd
from dask.multiprocessing import get
import timeit

The idea here is we partition a pandas dataframe into “dask dataframe”, then we can run the job parallelly by putting different partition on different workers:

def dask_this(df):
    res = df['docs'].apply(ct.do_all)
    return res  
ddata = dd.from_pandas(df, npartitions=10)
type(ddata)
dask.dataframe.core.DataFrame

Doing a benchmark on our dataframes with 100 samples:

import time
start_time = time.time()
res = ddata.map_partitions(dask_this).compute(scheduler='processes', num_workers=10)
print("--- %s seconds ---" % (time.time() - start_time))
--- 1.2451355457305908 seconds ---
import time
start_time = time.time()
res = df['docs'].apply(ct.do_all)
print("--- %s seconds ---" % (time.time() - start_time))
--- 8.75901198387146 seconds ---

~7x speed up with 10 cores! Lets all start dasking!

Avatar
Michael Yan
Staff Data Scientist @ SpotHero

A student of the last forbidden art of Data Science.

comments powered by Disqus