mirror of
https://github.com/gsi-upm/sitc
synced 2026-03-03 02:08:17 +00:00
Updated to Pandas 3.X and corrected typos
@@ -61,19 +61,19 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is loosely the process of manually converting or mapping data from one \"raw\" form (*datos en bruto*) into another format that allows for more convenient consumption of the data with the help of semi-automated tools.\n",
|
||||
"[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is the process of manually converting or mapping data from one \"raw\" form (datos en bruto) into another format that makes the data more easily consumable with the help of semi-automated tools.\n",
|
||||
"\n",
|
||||
"*Scikit-learn* estimators which assume that all values are numerical. This is a common in many machine learning libraries. So, we need to preprocess our raw dataset. \n",
|
||||
"*Scikit-learn* estimators assume that all values are numerical. This is a common requirement in many machine learning libraries, so we need to preprocess our raw dataset. \n",
|
||||
"Some of the most common tasks are:\n",
|
||||
"* Remove samples with missing values or replace the missing values with a value (median, mean or interpolation)\n",
|
||||
"* Remove samples with missing values or replace the missing values with a value (median, mean, or interpolation)\n",
|
||||
"* Encode categorical variables as integers\n",
|
||||
"* Combine datasets\n",
|
||||
"* Rename variables and convert types\n",
|
||||
"* Transform / scale variables\n",
|
||||
"\n",
|
||||
"We are going to play again with the Titanic dataset to practice with Pandas Dataframes and introduce a number of preprocessing facilities of scikit-learn.\n",
|
||||
"We are going to play again with the Titanic dataset to practice with Pandas DataFrames and to introduce a number of preprocessing facilities from scikit-learn.\n",
|
||||
"\n",
|
||||
"First we load the dataset and we get a dataframe."
|
||||
"First, we load the dataset into a DataFrame."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -450,7 +450,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Pivot tables are an intuitive way to analyze data, and an alternative to group columns.\n",
|
||||
"Pivot tables are an intuitive way to analyze data and an alternative to grouping columns.\n",
|
||||
"\n",
|
||||
"This command makes a table with rows Sex and columns Pclass, and\n",
|
||||
"averages the result of the column Survived, thereby giving the percentage of survivors in each grouping."
|
||||
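The pivot-table description above (rows Sex, columns Pclass, mean of Survived) can be sketched as follows. This is a minimal illustration, not the notebook's own cell: the small `df` below is a made-up sample standing in for the Titanic DataFrame, and only the three column names come from the text.

```python
import pandas as pd

# Hypothetical toy sample; only the column names come from the notebook text
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Survived": [0, 1, 0, 0, 1, 1],
})

# Rows are Sex, columns are Pclass, and each cell is the mean of Survived,
# i.e. the fraction of survivors in that (Sex, Pclass) group
table = pd.pivot_table(
    df, index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)
print(table)
```

Because Survived only takes the values 0 and 1, its mean per group can be read directly as a survival rate.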
@@ -580,7 +580,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this case there not duplicates. In case we would needed, we could have removed them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
|
||||
"In this case, there are no duplicates. If there were any, we could remove them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
|
||||
]
|
||||
},
|
||||
{
|
||||
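As a rough sketch of the two behaviours of `drop_duplicates()` mentioned above (all columns versus a chosen list of columns), assuming a small made-up DataFrame rather than the Titanic data:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical in every column,
# row 2 shares only the Name with them
df = pd.DataFrame({
    "Name": ["Ann", "Ann", "Ann", "Bob"],
    "Ticket": ["A1", "A1", "C3", "B2"],
    "Fare": [7.25, 7.25, 9.0, 8.05],
})

# Number of rows that repeat an earlier row across all columns
n_duplicates = int(df.duplicated().sum())

# Default: a row is a duplicate only if every column matches
unique_rows = df.drop_duplicates()

# With subset, only the listed columns are compared
unique_names = df.drop_duplicates(subset=["Name"])

print(n_duplicates, len(unique_rows), len(unique_names))
```

By default the first occurrence of each duplicated row is the one that is kept.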
@@ -596,7 +596,7 @@
|
||||
"source": [
|
||||
"Here we check how many null values there are.\n",
|
||||
"\n",
|
||||
"We use sum() instead of count() or we would get the total number of records). Notice how we do not use size() now, either. You can print 'df.isnull()' and will see a DataFrame with boolean values."
|
||||
"We use sum() instead of count(); otherwise, we would get the total number of records. Notice that we do not use size() here, either. If you print 'df.isnull()', you will see a DataFrame with boolean values."
|
||||
]
|
||||
},
|
||||
{
|
||||
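The difference between `sum()` and `count()` on the boolean frame returned by `df.isnull()` can be checked on a tiny example. This is a sketch on made-up data; only the column names are borrowed from the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy data with some missing values
df = pd.DataFrame({
    "Age": [22.0, None, 28.0],
    "Cabin": [None, None, "C85"],
})

mask = df.isnull()           # DataFrame of booleans, True where a value is missing

null_counts = mask.sum()     # True counts as 1, so this is the number of nulls per column
record_counts = mask.count() # counts non-null entries of the mask, i.e. every record

print(null_counts)
print(record_counts)
```

Since the boolean mask itself never contains missing values, `count()` on it always returns the total number of rows, which is why `sum()` is the one to use here.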
@@ -653,7 +653,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Observe that the Passenger with 889 has now an Agent of 28 (median) instead of NaN. \n",
|
||||
"Observe that passenger 889 now has an Age of 28 (the median) instead of NaN. \n",
|
||||
"\n",
|
||||
"Regarding the *Cabin* column, there are still NaN values, since it is not numeric. We will see later how to change it.\n",
|
||||
"\n",
|
||||
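Replacing missing ages with the median, as described above, might look like the following sketch. The data is made up so that the median happens to be 28; the notebook's actual cell is not shown in this diff and may differ.

```python
import pandas as pd

# Hypothetical toy data: one missing Age, one missing Cabin
df = pd.DataFrame({
    "Age": [22.0, 28.0, None, 35.0],
    "Cabin": ["C85", None, None, "E46"],
})

# median() skips NaN by default, so it is computed from 22, 28 and 35
median_age = df["Age"].median()

# Assign the result back instead of modifying the column in place
df["Age"] = df["Age"].fillna(median_age)

print(df)
```

Only the numeric column is filled here; the missing `Cabin` values are left untouched, in line with the remark above.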
@@ -791,7 +791,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As we saw, we have several non numerical columns: **Name**, **Sex**, **Ticket**, **Cabin** and **Embarked**.\n",
|
||||
"As we saw, we have several non-numerical columns: **Name**, **Sex**, **Ticket**, **Cabin**, and **Embarked**.\n",
|
||||
"\n",
|
||||
"**Name** and **Ticket** do not seem informative.\n",
|
||||
"\n",
|
||||
@@ -863,29 +863,6 @@
|
||||
"df[-5:]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
@@ -956,25 +933,18 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Now we replace as previously the categories with integers\n",
|
||||
"# Now we replace, as previously, the categories with integers\n",
|
||||
"df[\"Embarked\"] = df[\"Embarked\"].map({\"S\": 0, \"C\": 1, \"Q\": 2})\n",
|
||||
"df[-5:]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Although this transformation can be ok, we are introducing *an error*. Some classifiers could think that there is an order in S, C, Q, and that Q is higher than S. \n",
|
||||
"Although this transformation may be acceptable, we are introducing an error. Some classifiers could think that there is an order in S, C, and Q, and that Q is higher than S. \n",
|
||||
"\n",
|
||||
"To avoid this error, Scikit learn provides a facility for transforming all the categorical features into integer ones. In fact, it creates a new dummy binary feature per category. This means, in this case, Embarked=S would be represented as S=1, C=0 and Q=0.\n",
|
||||
"To avoid this error, scikit-learn provides a facility for transforming all categorical features into integer features. In fact, it creates a new dummy binary feature per category. This means, in this case, Embarked=S would be represented as S=1, C=0, and Q=0.\n",
|
||||
"\n",
|
||||
"We will learn how to do this in the next notebook. More details can be found in the [Scikit-learn documentation](http://scikit-learn.org/stable/modules/preprocessing.html)."
|
||||
]
|
||||
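To make the idea of one dummy binary feature per category concrete, here is a minimal sketch. Note the assumption: it uses `pandas.get_dummies` as a stand-in, not the scikit-learn facility that the text refers to, and the `Embarked` values are a made-up sample.

```python
import pandas as pd

# Hypothetical toy sample of the Embarked column
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(df["Embarked"], dtype=int)

print(dummies)
```

With this encoding, a row with Embarked=S has S=1, C=0, and Q=0, so no artificial order among the ports is introduced.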