{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lexical Processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [Objectives](#Objectives)\n",
"* [Tools](#Tools)\n",
"* [Cleansing](#Cleansing)\n",
"* [Tokenization](#Tokenization)\n",
"* [Sentence Splitter](#Sentence-Splitter)\n",
"* [Word Splitter](#Word-Splitter)\n",
"* [Stemming and Lemmatization](#Stemming-and-Lemmatization)\n",
"* [Stop word removal](#Stop-word-removal)\n",
"* [Punctuation removal](#Punctuation-removal)\n",
"* [Rare words and spelling](#Rare-words-and-spelling)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this session we are going to learn how to preprocess texts, also known as *text wrangling*. This task involves data munging, text cleansing, specific preprocessing, tokenization, stemming or lemmatization and stop word removal.\n",
|
|
"\n",
|
|
"The main objectives of this session are:\n",
|
|
"* Learn how to preprocess text sources\n",
|
|
"* Learn to use some of the most popular NLP libraries\n",
|
|
"\n",
|
|
"We are going to use as an example part of a computer review included in [Liu's Product Review of IJCA 2015](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets) and a tweet from the [Semeval 2013 Task 2 dataset](https://www.cs.york.ac.uk/semeval-2013/task2/data/uploads/datasets/readme.txt), slightly modified for learning purposes."
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"review = \"\"\"I purchased this monitor because of budgetary concerns. This item was the most inexpensive 17 inch monitor \n",
"available to me at the time I made the purchase. My overall experience with this monitor was very poor. When the \n",
"screen wasn't contracting or glitching the overall picture quality was poor to fair. I've viewed numerous different \n",
"monitor models since I 'm a college student and this particular monitor had as poor of picture quality as \n",
"any I 've seen.\"\"\"\n",
"\n",
"tweet = \"\"\"@concert Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight!!! She still listens to \n",
" her music!!!! WOW!!! #ladygaga #britney\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tools"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this session we are going to use several libraries, which provide complementary features:\n",
|
|
"* [NLTK](nltk.org/book_1ed/) - provides functionalities for sentence splitting, tokenization, lemmatization, NER, collocations ... and access to many lexical resources (WordNet, Corpora, ...)\n",
|
|
"* [Gensim](https://radimrehurek.com/gensim/) - provides functionalities for corpora management, LDA and LSI, among others.\n",
|
|
"* [TextBlob](http://textblob.readthedocs.io/) - provides a simple way to access to many of the NLP functions. It is simpler than NLTK and integrates additional functionatities, such as language detection, spelling or even sentiment analysis.\n",
|
|
"* [CLiPS](http://www.clips.ua.ac.be/pages/pattern-en#parser) -- contains a fast part-of-speech tagger for English (identifies nouns, adjectives, verbs, etc. in a sentence), sentiment and mood analysis, tools for English verb conjugation and noun singularization & pluralization, and a WordNet interface. Unfortunately, it does not support Python 3 yet.\n",
|
|
"\n",
|
|
"\n",
|
|
"In order to use nltk, we should download first the lexical resources we are going to use. We can updated them later. For this, you need:\n",
|
|
"* install nltk. Execute 'conda install nltk'\n",
|
|
"* import nltk\n",
|
|
"* Run *nltk.download()* (the first time we use it). A window will appear. You should select just the corpus 'book' and press download.\n",
|
|
"If you inspect the window, you can get an overview of available lexical resources (corpora, lexicons and grammars). For example, you can find some relevant sentiment lexicons in corpora (SentiWordNet, Sentence Polarity Dataset, Vader, Opinion Lexicon or VADER Sentiment Lexicon). Don't forget to close the window once the data has been downloaded."
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"nltk.download()"
]
},
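{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to avoid the interactive window (e.g. on a server), *nltk.download()* also accepts resource identifiers. The sketch below downloads only some resources used in this notebook; the identifiers ('punkt', 'stopwords', 'wordnet') are standard NLTK resource names, but check the downloader if any of them is missing in your NLTK version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Non-interactive alternative: download just the resources we need\n",
"import nltk\n",
"for resource in ['punkt', 'stopwords', 'wordnet']:\n",
"    nltk.download(resource)"
]
},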
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cleansing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case we will use raw text. In case you need to clean the documents (eliminate HTML markup, etc.), you can use libraries such as [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)."
|
|
]
|
|
},
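{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of HTML cleansing with BeautifulSoup4 (the snippet and the sample HTML are illustrative, not part of our datasets). It requires the *beautifulsoup4* package ('conda install beautifulsoup4')."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative example: strip HTML markup and keep only the visible text\n",
"from bs4 import BeautifulSoup\n",
"\n",
"html_doc = \"<html><body><h1>Review</h1><p>My overall experience with this monitor was very <b>poor</b>.</p></body></html>\"\n",
"soup = BeautifulSoup(html_doc, 'html.parser')\n",
"print(soup.get_text(separator=' ', strip=True))"
]
},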
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tokenization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tokenization is the process of transforming a text into tokens. Depending on the input, we can want to split the text into sentences or words. Moreover, some input such as Twitter can require taking into account processing special tokens, such as hashtags.\n",
|
|
"\n",
|
|
"NLTK provides good support for [tokenization](http://www.nltk.org/api/nltk.tokenize.html).\n",
|
|
"\n",
|
|
"Next we are going to practice several of these features."
|
|
]
|
|
},
|
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sentence Splitter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use a standard sentence splitter (*sent_tokenize* which uses *PunkTonenizer*), or train a sentence splitter, using the class [*PunktSentenceTokenizer*](http://www.nltk.org/api/nltk.tokenize.html).\n",
|
|
"\n",
|
|
"If the text is multilingual, we can install [*textblob*](http://textblob.readthedocs.io/) or [*langdetect*](https://pypi.python.org/pypi/langdetect?) to detect the text language and select the most suitable sentence splitter. NLTK comes with 17 trained languages for sentence splitting."
|
|
]
|
|
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nltk.tokenize import sent_tokenize, word_tokenize\n",
"\n",
"sentences = sent_tokenize(review, language='english')\n",
"print(sentences)"
]
},
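{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of language detection with *langdetect* (assuming the package is installed, e.g. 'pip install langdetect'); the detected language code can then be mapped to the language name expected by *sent_tokenize*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative example: detect the language before choosing a sentence splitter\n",
"from langdetect import detect\n",
"\n",
"print(detect(review))   # expected: 'en'\n",
"print(detect(\"El monitor que he comprado es muy malo.\"))   # expected: 'es'"
]
},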
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Word Splitter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next stem is dividing every sentence (or the full step) into words."
|
|
]
|
|
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"words = [word_tokenize(t) for t in sent_tokenize(review)]\n",
"print(words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our case, we are not interested in processing every sentence, we have split into sentence just for learning purposes. So, we are going to get the word tokens."
|
|
]
|
|
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"words = word_tokenize(review)\n",
"print(words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can define our own word tokenizer using regular expressions (and using the class [*RegexpTokenizer*](http://www.nltk.org/api/nltk.tokenize.html).\n",
|
|
"\n"
|
|
]
|
|
},
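{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of a custom tokenizer built with *RegexpTokenizer*; the regular expression below (words, hashtags and mentions) is just an illustrative choice, not a recommended pattern."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative example: a regular-expression-based tokenizer\n",
"from nltk.tokenize import RegexpTokenizer\n",
"\n",
"regexp_tknzr = RegexpTokenizer(r'[@#]?\\w+')   # optional @ or # followed by word characters\n",
"print(regexp_tknzr.tokenize(tweet))"
]
},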
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nltk.tokenize import TweetTokenizer\n",
"tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)\n",
"tweet_tokens = tknzr.tokenize(tweet)\n",
"print(\"With TweetTokenizer \" + \" \".join(tweet_tokens))\n",
"print(\"With word_tokenizer \" + \" \".join(word_tokenize(tweet)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Stemming and Lemmatization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NLTK provides support for stemming in the package [*stem*](http://www.nltk.org/api/nltk.stem.html). There are several available stemmers:PorterStemmer, lancaster or WordNetLemmatizer. Check the API for more details. Here we are going the output of different stemmers (and the execution time). You can observe that some of them use rules and do not behave properly always (e.g. Porter algorithm)."
|
|
]
|
|
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer\n",
"from nltk.stem.snowball import EnglishStemmer\n",
"import time\n",
"\n",
"porter = PorterStemmer()\n",
"lancaster = LancasterStemmer()\n",
"wordnet = WordNetLemmatizer()\n",
"snowball = EnglishStemmer()\n",
"\n",
"words = \"boys children are have is has Madrid\"\n",
"\n",
"start = time.time()\n",
"print(\"Porter: \" + \" \".join([porter.stem(w) for w in word_tokenize(words)]))\n",
"end = time.time()\n",
"print(\"Execution time: \" + str(end - start))\n",
"start = time.time()\n",
"print(\"Lancaster: \" + \" \".join([lancaster.stem(w) for w in word_tokenize(words)]))\n",
"end = time.time()\n",
"print(\"Execution time: \" + str(end - start))\n",
"start = time.time()\n",
"print(\"WordNet: \" + \" \".join([wordnet.lemmatize(w) for w in word_tokenize(words)]))\n",
"end = time.time()\n",
"print(\"Execution time: \" + str(end - start))\n",
"start = time.time()\n",
"print(\"SnowBall: \" + \" \".join([snowball.stem(w) for w in word_tokenize(words)]))\n",
"end = time.time()\n",
"print(\"Execution time: \" + str(end - start))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, we get the forms *are* and *is* instead of *be*. This is because we have not introduce the Part-Of-Speech (POS), and the default POS is 'n' (name).\n",
|
|
"\n",
|
|
"The main difference between stemmers and lemmatizers is that stemmers operate in isolated words, while lemmatizers take into account the context (e.g. POS). However, stemmers are quicker and require fewer resources.\n"
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"verbs = \"are crying is have has\"\n",
"print(\"WordNet: \" + \" \".join([wordnet.lemmatize(w, pos='v') for w in word_tokenize(verbs)]))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending of the application, we can select stemmers or lemmatizers. \n",
|
|
"\n",
|
|
"Regarding Twitter, we could use specialised software for managing tweets, such as [*TweetNLP*](http://www.cs.cmu.edu/~ark/TweetNLP/).\n",
|
|
"\n",
|
|
"Now we go back to our example and we apply stemming."
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def preprocess(words, type='doc'):\n",
"    if (type == 'tweet'):\n",
"        tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)\n",
"        tokens = tknzr.tokenize(words)\n",
"    else:\n",
"        tokens = nltk.word_tokenize(words.lower())\n",
"    porter = nltk.PorterStemmer()\n",
"    lemmas = [porter.stem(t) for t in tokens]\n",
"    return lemmas\n",
"\n",
"print(preprocess(review))\n",
"print(preprocess(tweet, type='tweet'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Stop word removal"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next step is removing stop words."
|
|
]
|
|
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nltk.corpus import stopwords\n",
"\n",
"stoplist = stopwords.words('english')\n",
"print(stoplist)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def preprocess(words, type='doc'):\n",
|
|
" if (type == 'tweet'):\n",
|
|
" tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)\n",
|
|
" tokens = tknzr.tokenize(tweet)\n",
|
|
" else:\n",
|
|
" tokens = nltk.word_tokenize(words.lower())\n",
|
|
" porter = nltk.PorterStemmer()\n",
|
|
" lemmas = [porter.stem(t) for t in tokens]\n",
|
|
" stoplist = stopwords.words('english')\n",
|
|
" lemmas_clean = [w for w in lemmas if w not in stoplist]\n",
|
|
" return lemmas_clean\n",
|
|
"\n",
|
|
"print(preprocess(review))\n",
|
|
"print(preprocess(tweet, type='tweet'))"
|
|
]
|
|
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Punctuation removal"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Punctuation is useful for sentence splitting and POS tagging. Once we have used it, we can remove it easily."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import string\n",
|
|
"\n",
|
|
"def preprocess(words, type='doc'):\n",
|
|
" if (type == 'tweet'):\n",
|
|
" tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)\n",
|
|
" tokens = tknzr.tokenize(tweet)\n",
|
|
" else:\n",
|
|
" tokens = nltk.word_tokenize(words.lower())\n",
|
|
" porter = nltk.PorterStemmer()\n",
|
|
" lemmas = [porter.stem(t) for t in tokens]\n",
|
|
" stoplist = stopwords.words('english')\n",
|
|
" lemmas_clean = [w for w in lemmas if w not in stoplist]\n",
|
|
" punctuation = set(string.punctuation)\n",
|
|
" words = [w for w in lemmas_clean if w not in punctuation]\n",
|
|
" return words\n",
|
|
"\n",
|
|
"print(preprocess(review))\n",
|
|
"print(preprocess(tweet, type='tweet'))"
|
|
]
|
|
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Rare words and spelling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In large corpus, we may want to clean rare words (probably they are typos) and correct spelling. \n",
|
|
"\n",
|
|
"For the first task, you can exclude the least frequent words (or compare with their frequency in a corpus). NLTK provides facilities for calculating frequencies.\n",
|
|
"\n",
|
|
"For the second task, you can use spell-checker packages such as [*textblob*](http://textblob.readthedocs.io/) or [*autocorrect*](https://pypi.python.org/pypi/autocorrect/)."
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"frec = nltk.FreqDist(nltk.word_tokenize(review))\n",
"print(\"Most frequent\")\n",
"print(frec.most_common(10))\n",
"print(\"Least frequent\")\n",
"# most_common() sorts by frequency, so the tail of the list holds the least frequent tokens\n",
"print(frec.most_common()[-10:])"
]
},
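{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of spelling correction with *textblob* (assuming the package is installed, e.g. 'pip install textblob'); the misspelled sentence is just an illustrative input."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative example: spelling correction with TextBlob\n",
"from textblob import TextBlob\n",
"\n",
"blob = TextBlob(\"The overal experiance with this monitr was very poor\")\n",
"print(blob.correct())"
]
},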
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
|
|
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
|
|
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
|
"\n",
|
|
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
|
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}