mirror of https://github.com/gsi-upm/sitc synced 2024-12-22 03:38:13 +00:00

Changed n_topics to n_components for compatibility
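
For context on the rename in this commit: as far as I know, scikit-learn deprecated the `n_topics` argument of `LatentDirichletAllocation` in favour of `n_components` (0.19) and later removed the old name, so notebooks still passing `n_topics` fail on recent versions. Below is a minimal sketch of the updated call, reusing the parameter values from the hunk at the end of this diff. The toy corpus `docs` and the default `CountVectorizer` tokenizer are assumptions made for illustration; the notebook itself uses its own `custom_tokenizer` and the essay dataset.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, standing in for the essay texts used in the notebook.
docs = [
    "computers help students learn faster",
    "libraries should not censor books",
    "the cyclist faced heat and rough terrain",
    "patience is a virtue worth practising",
    "laughter builds strong friendships",
    "students read books in the library",
]

# Bag-of-words counts; the notebook passes tokenizer=custom_tokenizer here instead.
counts = CountVectorizer().fit_transform(docs)

# Same settings as in the diff, with n_components replacing the old n_topics.
lda = LatentDirichletAllocation(
    n_components=4,
    max_iter=5,
    learning_method="online",
    learning_offset=50.0,
    random_state=0,
)

# One row per document, one column per topic.
doc_topics = lda.fit_transform(counts)
print(doc_topics.shape)
```

The number of columns in `doc_topics` equals `n_components`, which is what the surrounding pipeline step feeds into the downstream feature union.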

Carlos A. Iglesias 2019-04-22 23:50:16 +02:00
parent 9d1b88dfea
commit e42299ac7a

@ -84,17 +84,17 @@
"\n",
"Each of these files contains 28 columns:\n",
"\n",
"* essay_id: A unique identifier for each individual student essay\n",
"* essay_set: 1-8, an id for each set of essays\n",
"* essay: The ascii text of a student's response\n",
"* rater1_domain1: Rater 1's domain 1 score; all essays have this\n",
"* rater2_domain1: Rater 2's domain 1 score; all essays have this\n",
"* rater3_domain1: Rater 3's domain 1 score; only some essays in set 8 have this.\n",
"* domain1_score: Resolved score between the raters; all essays have this\n",
"* rater1_domain2: Rater 1's domain 2 score; only essays in set 2 have this\n",
"* rater2_domain2: Rater 2's domain 2 score; only essays in set 2 have this\n",
"* domain2_score: Resolved score between the raters; only essays in set 2 have this\n",
"* rater1_trait1 score - rater3_trait6 score: trait scores for sets 7-8\n",
"* **essay_id**: A unique identifier for each individual student essay\n",
"* **essay_set**: 1-8, an id for each set of essays\n",
"* **essay**: The ascii text of a student's response\n",
"* **rater1_domain1**: Rater 1's domain 1 score; all essays have this\n",
"* **rater2_domain1**: Rater 2's domain 1 score; all essays have this\n",
"* **rater3_domain1**: Rater 3's domain 1 score; only some essays in set 8 have this.\n",
"* **domain1_score**: Resolved score between the raters; all essays have this\n",
"* **rater1_domain2**: Rater 1's domain 2 score; only essays in set 2 have this\n",
"* **rater2_domain2**: Rater 2's domain 2 score; only essays in set 2 have this\n",
"* **domain2_score**: Resolved score between the raters; only essays in set 2 have this\n",
"* **rater1_trait1 score - rater3_trait6 score**: trait scores for sets 7-8\n",
"\n",
"The dataset is provided in the folder *data-kaggle/training_set_rel3.tsv*.\n",
"\n",
@ -102,7 +102,7 @@
"\n",
"The dataset has been anonymized to remove personally identifying information from the essays using the Named Entity Recognizer (NER) from the Stanford Natural Language Processing group and a variety of other approaches. The relevant entities are identified in the text and then replaced with a string such as \"@PERSON1.\"\n",
"\n",
"The entitities identified by NER are: \"PERSON\", \"ORGANIZATION\", \"LOCATION\", \"DATE\", \"TIME\", \"MONEY\", \"PERCENT\"\n",
"The entities identified by NER are: \"PERSON\", \"ORGANIZATION\", \"LOCATION\", \"DATE\", \"TIME\", \"MONEY\", \"PERCENT\"\n",
"\n",
"Other replacements made: \"MONTH\" (any month name not tagged as a date by the NER), \"EMAIL\" (anything that looks like an e-mail address), \"NUM\" (word containing digits or non-alphanumeric symbols), and \"CAPS\" (any capitalized word that doesn't begin a sentence, except in essays where more than 20% of the characters are capitalized letters), \"DR\" (any word following \"Dr.\" with or without the period, with any capitalization, that doesn't fall into any of the above), \"CITY\" and \"STATE\" (various cities and states)."
]
@ -393,7 +393,7 @@
"\n",
"The basic idea is:\n",
"* **Pipelines** consist of sequential steps: one step works on the results of the previous step\n",
"* ** FeatureUnions** consist of parallel tasks whose result is grouped when all have finished."
"* **FeatureUnions** consist of parallel tasks whose result is grouped when all have finished."
]
},
{
@ -427,7 +427,7 @@
" ])),\n",
" ('lda', Pipeline([ \n",
" ('count', CountVectorizer(tokenizer=custom_tokenizer)),\n",
" ('lda', LatentDirichletAllocation(n_topics=4, max_iter=5,\n",
" ('lda', LatentDirichletAllocation(n_components=4, max_iter=5,\n",
" learning_method='online', \n",
" learning_offset=50.,\n",
" random_state=0))\n",