mirror of
https://github.com/gsi-upm/sitc
synced 2024-11-13 02:12:28 +00:00
399 lines
12 KiB
Plaintext
399 lines
12 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Course Notes for Learning Intelligent Systems"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Syntactic Processing"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Table of Contents\n",
|
|
"\n",
|
|
"* [Objectives](#Objectives)\n",
|
|
"* [POS Tagging](#POS-Tagging)\n",
|
|
"* [NER](#NER)\n",
|
|
"* [Parsing and Chunking](#Parsing-and-Chunking)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Objectives"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In this session we are going to learn how to analyse the syntax of text. In particular, we will learn\n",
|
|
"* Understand and perform POS (Part of Speech) tagging\n",
|
|
"* Understand and perform NER (Named Entity Recognition)\n",
|
|
"* Understand and parse texts\n",
|
|
"\n",
|
|
"We will use the same examples than in the previous notebook, slightly modified for learning purposes."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"review = \"\"\"I purchased this Dell monitor because of budgetary concerns. This item was the most inexpensive 17 inch Apple monitor \n",
|
|
"available to me at the time I made the purchase. My overall experience with this monitor was very poor. When the \n",
|
|
"screen wasn't contracting or glitching the overall picture quality was poor to fair. I've viewed numerous different \n",
|
|
"monitor models since I 'm a college student at UPM in Madrid and this particular monitor had as poor of picture quality as \n",
|
|
"any I 've seen.\"\"\"\n",
|
|
"\n",
|
|
"tweet = \"\"\"@concert Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight!!! She still listens to \n",
|
|
" her music!!!! WOW!!! #ladygaga #britney\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# POS Tagging"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"POS Tagging is the process of assigning a grammatical category (known as *part of speech*, POS) to a word. For this purpose, the most common approach is using an annotated corpus such as Penn Treebank. The tag set (categories) depends on the corpus annotation. Fortunately, nltk defines a [universal tagset](http://www.nltk.org/book/ch05.html):\n",
|
|
"\n",
|
|
"\n",
|
|
"Tag\t| Meaning | English Examples\n",
|
|
"----|---------|------------------\n",
|
|
"ADJ\t| adjective | new, good, high, special, big, local\n",
|
|
"ADP\t| adposition | on, of, at, with, by, into, under\n",
|
|
"ADV\t| adverb | really, already, still, early, now\n",
|
|
"CONJ| conjunction | and, or, but, if, while, although\n",
|
|
"DET | determiner, article | the, a, some, most, every, no, which\n",
|
|
"NOUN | noun\t | year, home, costs, time, Africa\n",
|
|
"NUM\t| numeral | twenty-four, fourth, 1991, 14:24\n",
|
|
"PRT | particle | at, on, out, over per, that, up, with\n",
|
|
"PRON | pronoun | he, their, her, its, my, I, us\n",
|
|
"VERB | verb\t| is, say, told, given, playing, would\n",
|
|
". | punctuation marks | . , ; !\n",
|
|
"X | other | ersatz, esprit, dunno, gr8, univeristy"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from nltk import pos_tag, word_tokenize\n",
|
|
"print (pos_tag(word_tokenize(review), tagset='universal'))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We could have used another tagset for POS, such as UPenn."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"print (pos_tag(word_tokenize(review)))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The meaning of these tags can be obtained here:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import nltk\n",
|
|
"nltk.help.upenn_tagset()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We are going to use the Univeral tagset in this example. Based on this POS info, we could use correctly now the WordNetLemmatizer. The WordNetLemmatizer only is interesting for 4 POS categories: ADJ, ADV, NOUN, and VERB. This is because WordNet lemmatizer will only lemmatize adjectives, adverbs, nouns and verbs, and it needs that all the provided tags are in [n, a, r, v]."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from nltk.stem import WordNetLemmatizer\n",
|
|
"\n",
|
|
"review_postagged = pos_tag(word_tokenize(review), tagset='universal')\n",
|
|
"pos_mapping = {'NOUN': 'n', 'ADJ': 'a', 'VERB': 'v', 'ADV': 'r', 'ADP': 'n', 'CONJ': 'n', \n",
|
|
" 'PRON': 'n', 'NUM': 'n', 'X': 'n' }\n",
|
|
"\n",
|
|
"wordnet = WordNetLemmatizer()\n",
|
|
"lemmas = [wordnet.lemmatize(w, pos=pos_mapping[tag]) for (w,tag) in review_postagged if tag in pos_mapping.keys()]\n",
|
|
"print(lemmas)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# NER"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Named Entity Recognition (NER) is an information retrieval for identifying named entities of places, organisation of persons. NER usually relies in a tagged corpus. NER algorithms can be trained for new corpora. Here we are using the Brown tagset (http://www.comp.leeds.ac.uk/ccalas/tagsets/brown.html)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from nltk import ne_chunk, pos_tag, word_tokenize\n",
|
|
"ne_tagged = ne_chunk(pos_tag(word_tokenize(review)), binary=False)\n",
|
|
"print(ne_tagged) "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"NLTK comes with other NER implementations. We can also use online services, such as [OpenCalais](http://www.opencalais.com/), [DBpedia Spotlight](https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service) or [TagME](http://tagme.di.unipi.it/)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Parsing and Chunking"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Parsing** is the process of obtaining a parsing tree given a grammar. It which can be very useful to understand the relationship among the words.\n",
|
|
"\n",
|
|
"As we have seen in class, we can follow a traditional approach and obtain a full parsing tree or shallow parsing (chunking) and obtain a partial tree.\n",
|
|
"\n",
|
|
"We can use the StandfordParser that is integrated in NLTK, but it requires to configure the CLASSPATH, which can be a bit annoying. Instead, we are going to see some demos to understand how grammars work. In case you are interested, you can consult the [manual](http://www.nltk.org/api/nltk.parse.html) to run it.\n",
|
|
"\n",
|
|
"In the following example, you will run two interactive context-free parser (http://www.nltk.org/book/ch08.html): shift-reduce parser (botton-up) and recursive descent parser (top-down).\n",
|
|
"\n",
|
|
"\n",
|
|
"First, we run the shift-reduce parser. The panel on the left shows the grammar as a list of production rules. The panel on the right contains the stack and the remaining input.\n",
|
|
"\n",
|
|
"You should:\n",
|
|
"* Run pressing 'step' until the sentence is fully analyzed. With each step, the parser either shifts one word onto the stack or reduces two subtrees of the stack into a new subtree.\n",
|
|
"* Try to act as the parser. Instead of pressing 'step', press 'shift' and 'reduce'. Follow the 'always shift before reduce' rule. It is likely you will reach a state where the parser cannot proceed. You can go back with 'Undo'. You can try to change the order of the grammar rules or add new grammar rules."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from nltk.app import srparser_app\n",
|
|
"srparser_app.app()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we run the recursive descent parser. Observe the different parsing strategies and consult the [book](http://www.nltk.org/api/nltk.parse.html)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from nltk.app import rdparser_app\n",
|
|
"rdparser_app.app()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Chunking** o **shallow parsing** aims at extracting relevant parts of the sentence. There are (two main approaches)[http://www.nltk.org/book/ch07.html] to chunking: using regular expressions based on POS tags, or training a chunk parser.\n",
|
|
"\n",
|
|
"We are going to illustrate the first technique for extracting NP chunks.\n",
|
|
"\n",
|
|
"We define regular expressions for the chunks we want to get."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from nltk.chunk.regexp import *\n",
|
|
"pattern = \"\"\"NP: {<PRON><ADJ><NOUN>+} \n",
|
|
" {<DET>?<ADV>?<ADJ|NUM>*?<NOUN>+}\n",
|
|
" \"\"\"\n",
|
|
"NPChunker = RegexpParser(pattern)\n",
|
|
"\n",
|
|
"reviews_pos = (pos_tag(word_tokenize(review), tagset='universal'))\n",
|
|
"\n",
|
|
"chunks_np = NPChunker.parse(reviews_pos)\n",
|
|
"print(chunks_np)\n",
|
|
"chunks_np.draw()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we can traverse the trees and obtain the strings as follows."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"def extractTrees(parsed_tree, category='NP'):\n",
|
|
" return list(parsed_tree.subtrees(filter=lambda x: x.label()==category))\n",
|
|
"\n",
|
|
"extractTrees(chunks_np, 'NP')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"def extractStrings(parsed_tree, category='NP'):\n",
|
|
" return [\" \".join(word for word, pos in vp.leaves()) for vp in extractTrees(parsed_tree, category)]\n",
|
|
" \n",
|
|
"extractStrings(chunks_np)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## References\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
|
|
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Licence"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
|
"\n",
|
|
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.5.2"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 0
|
|
}
|