"* [Rare words and spelling](#Rare-words-and-spelling)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this session we are going to learn how to preprocess texts, also known as *text wrangling*. This task involves data munging, text cleansing, specific preprocessing, tokenization, stemming or lemmatization and stop word removal.\n",
"\n",
"The main objectives of this session are:\n",
"* Learn how to preprocess text sources\n",
"* Learn to use some of the most popular NLP libraries\n",
"\n",
"We are going to use as an example part of a computer review included in [Liu's Product Review of IJCA 2015](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets) and a tweet from the [Semeval 2013 Task 2 dataset](https://www.cs.york.ac.uk/semeval-2013/task2/data/uploads/datasets/readme.txt), slightly modified for learning purposes."
"review = \"\"\"I purchased this monitor because of budgetary concerns. This item was the most inexpensive 17 inch monitor \n",
"available to me at the time I made the purchase. My overall experience with this monitor was very poor. When the \n",
"screen wasn't contracting or glitching the overall picture quality was poor to fair. I've viewed numerous different \n",
"monitor models since I 'm a college student and this particular monitor had as poor of picture quality as \n",
"any I 've seen.\"\"\"\n",
"\n",
"tweet = \"\"\"@concert Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight!!! She still listens to \n",
" her music!!!! WOW!!! #ladygaga #britney\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tools"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this session we are going to use several libraries, which provide complementary features:\n",
"* [NLTK](nltk.org/book_1ed/) - provides functionalities for sentence splitting, tokenization, lemmatization, NER, collocations ... and access to many lexical resources (WordNet, Corpora, ...)\n",
"* [Gensim](https://radimrehurek.com/gensim/) - provides functionalities for corpora management, LDA and LSI, among others.\n",
"* [TextBlob](http://textblob.readthedocs.io/) - provides a simple way to access to many of the NLP functions. It is simpler than NLTK and integrates additional functionatities, such as language detection, spelling or even sentiment analysis.\n",
"* [CLiPS](http://www.clips.ua.ac.be/pages/pattern-en#parser) -- contains a fast part-of-speech tagger for English (identifies nouns, adjectives, verbs, etc. in a sentence), sentiment and mood analysis, tools for English verb conjugation and noun singularization & pluralization, and a WordNet interface. Unfortunately, it does not support Python 3 yet.\n",
"\n",
"\n",
"In order to use nltk, we should download first the lexical resources we are going to use. We can updated them later. For this, you need:\n",
"If you inspect the window, you can get an overview of available lexical resources (corpora, lexicons and grammars). For example, you can find some relevant sentiment lexicons in corpora (SentiWordNet, Sentence Polarity Dataset, Vader, Opinion Lexicon or VADER Sentiment Lexicon). Don't forget to close the window once the data has been downloaded."
"In this case we will use raw text. In case you need to clean the documents (eliminate HTML markup, etc.), you can use libraries such as [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tokenization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tokenization is the process of transforming a text into tokens. Depending on the input, we can want to split the text into sentences or words. Moreover, some input such as Twitter can require taking into account processing special tokens, such as hashtags.\n",
"\n",
"NLTK provides good support for [tokenization](http://www.nltk.org/api/nltk.tokenize.html).\n",
"\n",
"Next we are going to practice several of these features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sentence Splitter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use a standard sentence splitter (*sent_tokenize* which uses *PunkTonenizer*), or train a sentence splitter, using the class [*PunktSentenceTokenizer*](http://www.nltk.org/api/nltk.tokenize.html).\n",
"\n",
"If the text is multilingual, we can install [*textblob*](http://textblob.readthedocs.io/) or [*langdetect*](https://pypi.python.org/pypi/langdetect?) to detect the text language and select the most suitable sentence splitter. NLTK comes with 17 trained languages for sentence splitting."
"words = [word_tokenize(t) for t in sent_tokenize(review)]\n",
"print(words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our case, we are not interested in processing every sentence, we have split into sentence just for learning purposes. So, we are going to get the word tokens."
"We can define our own word tokenizer using regular expressions (and using the class [*RegexpTokenizer*](http://www.nltk.org/api/nltk.tokenize.html).\n",
"NLTK provides support for stemming in the package [*stem*](http://www.nltk.org/api/nltk.stem.html). There are several available stemmers:PorterStemmer, lancaster or WordNetLemmatizer. Check the API for more details. Here we are going the output of different stemmers (and the execution time). You can observe that some of them use rules and do not behave properly always (e.g. Porter algorithm)."
"As we can see, we get the forms *are* and *is* instead of *be*. This is because we have not introduce the Part-Of-Speech (POS), and the default POS is 'n' (name).\n",
"\n",
"The main difference between stemmers and lemmatizers is that stemmers operate in isolated words, while lemmatizers take into account the context (e.g. POS). However, stemmers are quicker and require fewer resources.\n"
" lemmas_clean = [w for w in lemmas if w not in stoplist]\n",
" punctuation = set(string.punctuation)\n",
" words = [w for w in lemmas_clean if w not in punctuation]\n",
" return words\n",
"\n",
"print(preprocess(review))\n",
"print(preprocess(tweet, type='tweet'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Rare words and spelling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In large corpus, we may want to clean rare words (probably they are typos) and correct spelling. \n",
"\n",
"For the first task, you can exclude the least frequent words (or compare with their frequency in a corpus). NLTK provides facilities for calculating frequencies.\n",
"\n",
"For the second task, you can use spell-checker packages such as [*textblob*](http://textblob.readthedocs.io/) or [*autocorrect*](https://pypi.python.org/pypi/autocorrect/)."
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",