mirror of https://github.com/gsi-upm/sitc synced 2026-03-02 17:58:16 +00:00

Update 3_3_Data_Munging_with_Pandas.ipynb

Adapted to pandas 3.X and corrected typos
This commit is contained in:
Carlos A. Iglesias
2026-03-02 15:47:41 +01:00
committed by GitHub
parent ff76909b87
commit 8c82c6fbcd


@@ -63,9 +63,9 @@
"source": [
"[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is loosely the process of manually converting or mapping data from one \"raw\" form (*datos en bruto*) into another format that allows for more convenient consumption of the data with the help of semi-automated tools.\n",
"\n",
"*Scikit-learn* estimators which assume that all values are numerical. This is a common in many machine learning libraries. So, we need to preprocess our raw dataset. \n",
"*Scikit-learn* estimators assume that all values are numerical. This is a common feature in many machine learning libraries. So, we need to preprocess our raw dataset. \n",
"Some of the most common tasks are:\n",
"* Remove samples with missing values or replace the missing values with a value (median, mean or interpolation)\n",
"* Remove samples with missing values or replace the missing values with a value (median, mean, or interpolation)\n",
"* Encode categorical variables as integers\n",
"* Combine datasets\n",
"* Rename variables and convert types\n",
@@ -73,7 +73,7 @@
"\n",
"We are going to play again with the Titanic dataset to practice with Pandas Dataframes and introduce a number of preprocessing facilities of scikit-learn.\n",
"\n",
"First we load the dataset and we get a dataframe."
"First, we load the dataset and get a dataframe."
]
},
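The preprocessing tasks listed above can be sketched with a tiny hypothetical frame (not the real Titanic data; column names are illustrative):

```python
import pandas as pd

# A tiny hypothetical dataset to illustrate the common preprocessing tasks
df = pd.DataFrame({
    'Age': [22.0, None, 35.0],
    'Sex': ['male', 'female', 'female'],
    'Fare': ['7.25', '71.28', '8.05'],   # numeric values stored as strings
})

# 1. Replace missing values with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# 2. Encode a categorical variable as integers
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# 3. Rename variables and convert types
df = df.rename(columns={'Fare': 'FarePaid'}).astype({'FarePaid': float})

print(df)
```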
{
@@ -129,7 +129,7 @@
"metadata": {},
"outputs": [],
"source": [
"# We can list non numerical properties, with a boolean indexing of the Series df.dtypes\n",
"# We can list non-numerical properties using boolean indexing of the Series df.dtypes\n",
"df.dtypes[df.dtypes == object]"
]
},
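A minimal sketch of that boolean indexing on `df.dtypes`, using a small made-up frame (the `dtype=object` is forced explicitly so the comparison behaves the same across pandas versions):

```python
import pandas as pd

# Hypothetical frame mixing numeric and object columns
df = pd.DataFrame({
    'Age': [22, 35],
    'Name': pd.Series(['Braund', 'Allen'], dtype=object),
    'Fare': [7.25, 8.05],
})

# df.dtypes is a Series indexed by column name; boolean indexing
# keeps only the entries whose dtype is object (non-numerical)
non_numerical = df.dtypes[df.dtypes == object]
print(non_numerical)
```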
@@ -423,7 +423,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Mean age, SibSp , Survived of passengers older than 25 which survived, grouped by Passenger Class and Sex \n",
"# Mean age, SibSp, Survived of passengers older than 25 who survived, grouped by Passenger Class and Sex \n",
"df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].mean()"
]
},
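A self-contained sketch of that filter-then-groupby pattern, on a small hypothetical frame. Note that `&` binds tighter than `>`, so each comparison needs its own parentheses:

```python
import pandas as pd

# Hypothetical mini-Titanic frame
df = pd.DataFrame({
    'Age':      [30, 40, 22, 50],
    'Sex':      ['male', 'female', 'female', 'male'],
    'Pclass':   [1, 1, 3, 2],
    'SibSp':    [0, 1, 0, 1],
    'Survived': [1, 1, 1, 0],
})

# Parenthesize each condition: & has higher precedence than comparisons
mask = (df.Age > 25) & (df.Survived == 1)
result = df[mask].groupby(['Pclass', 'Sex'])[['Age', 'SibSp', 'Survived']].mean()
print(result)
```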
@@ -433,7 +433,7 @@
"metadata": {},
"outputs": [],
"source": [
"# We can also decide which function apply in each column\n",
"# We can also decide which function to apply to each column\n",
"\n",
"#Show mean Age, mean SibSp, and number of passengers older than 25 that survived, grouped by Passenger Class and Sex\n",
"df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].agg({'Age': np.mean, \n",
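A runnable sketch of per-column aggregation with `agg()` on a tiny hypothetical frame; string names like `'mean'` and `'sum'` are used, which avoid the deprecated direct-`np.mean` path in newer pandas:

```python
import pandas as pd

# Hypothetical mini-Titanic frame
df = pd.DataFrame({
    'Age':      [30, 40, 26, 50],
    'SibSp':    [0, 1, 2, 1],
    'Survived': [1, 1, 1, 0],
    'Pclass':   [1, 1, 1, 2],
    'Sex':      ['male', 'female', 'female', 'male'],
})

# agg() takes a dict mapping each column to its own aggregation function
out = (df[(df.Age > 25) & (df.Survived == 1)]
       .groupby(['Pclass', 'Sex'])
       .agg({'Age': 'mean', 'SibSp': 'mean', 'Survived': 'sum'}))
print(out)
```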
@@ -470,7 +470,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to analyze multi-index, the percentage of survivoers, given sex and age, and distributed by Pclass."
"Now we want to analyze, using a multi-index, the percentage of survivors, given sex and age, and distributed by Pclass."
]
},
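One way this kind of multi-index percentage can be sketched (hypothetical data, grouping by Sex and Pclass for brevity): the mean of the 0/1 *Survived* column is the survival rate, and `unstack` moves one index level into the columns.

```python
import pandas as pd

df = pd.DataFrame({
    'Sex':      ['male', 'male', 'female', 'female'],
    'Pclass':   [1, 2, 1, 2],
    'Survived': [0, 1, 1, 1],
})

# groupby two keys -> MultiIndex; mean of 0/1 column = survival rate;
# unstack pivots the Pclass level into columns
rate = df.groupby(['Sex', 'Pclass'])['Survived'].mean().unstack('Pclass')
print(rate)
```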
{
@@ -581,7 +581,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case there not duplicates. In case we would needed, we could have removed them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
"In this case, there are no duplicates. If we needed to, we could remove them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
]
},
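A minimal sketch of `drop_duplicates` with the column-subset behavior, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Braund', 'Braund', 'Allen'],
                   'Pclass': [3, 3, 1]})

# With subset=, only the listed columns are compared to identify duplicates;
# without it, all columns are used
deduped = df.drop_duplicates(subset=['Name'])
print(deduped)
```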
{
@@ -597,7 +597,7 @@
"source": [
"Here we check how many null values there are.\n",
"\n",
"We use sum() instead of count() or we would get the total number of records). Notice how we do not use size() now, either. You can print 'df.isnull()' and will see a DataFrame with boolean values."
"We use sum() instead of count(), or we would get the total number of records. Notice how we do not use size() now, either. You can print 'df.isnull()' and you will see a DataFrame with boolean values."
]
},
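The sum()-vs-count() distinction can be sketched with a toy frame: `isnull()` yields booleans, `sum()` counts the `True` values per column, while `count()` counts non-null records.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22.0, np.nan, 35.0],
                   'Cabin': [np.nan, np.nan, 'C85']})

# sum() over the boolean DataFrame counts missing values per column
nulls = df.isnull().sum()
print(nulls)
```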
{
@@ -626,7 +626,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how=all* that deletes a sample if all the values are missing, instead of the default *how=any*."
"Most of the samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how=all* that deletes a sample if all the values are missing, instead of the default *how=any*."
]
},
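The *how=any* vs *how=all* difference in a minimal sketch with made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1.0, np.nan, np.nan],
                   'B': [2.0, 5.0, np.nan]})

# how='any' (default): drop a row if any value is missing
any_dropped = df.dropna()
# how='all': drop a row only when every value is missing
all_dropped = df.dropna(how='all')
print(len(any_dropped), len(all_dropped))
```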
{
@@ -654,13 +654,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe that the Passenger with 889 has now an Agent of 28 (median) instead of NaN. \n",
"Observe that the Passenger with index 889 now has an Age of 28 (median) instead of NaN. \n",
"\n",
"Regarding the column *cabins*, there are still NaN values, since the *Cabin* column is not numeric. We will see later how to change it.\n",
"\n",
"In addition, we could drop rows with any or all null values (method *dropna()*).\n",
"\n",
"If we want to modify directly the *df* object, we should add the parameter *inplace* with value *True*."
"If we want to modify the *df* object directly, we should add the parameter *inplace* with value *True*."
]
},
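A small sketch of median imputation with `fillna` (hypothetical data): by default `fillna` returns a new object, so the result is assigned back, which is the alternative to passing *inplace=True*.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22.0, np.nan, 34.0]})

# fillna returns a new Series by default; assign it back
# (or call df['Age'].fillna(..., inplace=True) to modify df directly)
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df)
```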
{
@@ -700,7 +700,7 @@
"metadata": {},
"outputs": [],
"source": [
"# There are not labels for rows, so we use the numeric index\n",
"# There are no labels for rows, so we use the numeric index\n",
"df.iloc[889]"
]
},
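The iloc-vs-label distinction in a minimal sketch: when the index labels are not integers, positional access goes through `iloc`.

```python
import pandas as pd

# String index labels, so positional access needs iloc
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.iloc[-1], s.loc['c'])
```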
@@ -780,27 +780,27 @@
"metadata": {},
"source": [
"\n",
"**Scikit-learn** provides also a preprocessing facility for managing null values in the [**Imputer**](http://scikit-learn.org/stable/modules/preprocessing.html) class. We can include *Imputer* as a step in the *Pipeline*."
"**Scikit-learn** also provides a preprocessing facility for managing null values in the [**Imputer**](http://scikit-learn.org/stable/modules/preprocessing.html) class. We can include *Imputer* as a step in the *Pipeline*."
]
},
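A hedged sketch of imputation as a *Pipeline* step: in recent scikit-learn versions the class is `SimpleImputer` (in `sklearn.impute`, replacing the older `Imputer`), and it slots into a `Pipeline` like any other transformer. Data here is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, 6.0]])

# Impute missing values with the column median, then standardize
pipe = Pipeline([('impute', SimpleImputer(strategy='median')),
                 ('scale', StandardScaler())])
X_t = pipe.fit_transform(X)
print(X_t)
```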
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysing non numerical columns"
"# Analysing non-numerical columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we saw, we have several non numerical columns: **Name**, **Sex**, **Ticket**, **Cabin** and **Embarked**.\n",
"As we saw, we have several non-numerical columns: **Name**, **Sex**, **Ticket**, **Cabin**, and **Embarked**.\n",
"\n",
"**Name** and **Ticket** do not seem informative.\n",
"\n",
"Regarding **Cabin**, most values were missing, so we can ignore it. \n",
"\n",
"**Sex** and **Embarked** are categorical features, so we will encode as integers."
"**Sex** and **Embarked** are categorical features, so we will encode them as integers."
]
},
{
@@ -826,7 +826,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"*Sex* has been codified as a categorical feature. It is better to encode features as continuous variables, since scikit-learn estimators expect continuous input, and they would interpret the categories as being ordered, which is not the case. "
"*Sex* has been codified as a categorical feature. It is better to encode features as continuous variables, since scikit-learn estimators expect continuous input and would interpret the categories as ordered, which is not the case. "
]
},
{
@@ -835,7 +835,7 @@
"metadata": {},
"outputs": [],
"source": [
"#First we check if there is any null values. Observe the use of any()\n",
"#First, we check if there are any null values. Observe the use of any()\n",
"df['Sex'].isnull().any()"
]
},
@@ -862,8 +862,7 @@
"metadata": {},
"outputs": [],
"source": [
"df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
"df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
"df[\"Sex\"] = df[\"Sex\"].map({\"male\": 0, \"female\": 1})\n",
"df[-5:]"
]
},
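The `map`-based encoding in a self-contained sketch (hypothetical data); note that any value not present in the dict becomes NaN:

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# map() replaces each category with its integer code from the dict
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
print(df)
```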
@@ -873,7 +872,7 @@
"metadata": {},
"outputs": [],
"source": [
"#An alternative is to create a new column with the encoded valuesm and define a mapping\n",
"#An alternative is to create a new column with the encoded values and define a mapping\n",
"df = df_original.copy()\n",
"df['Gender'] = df['Sex'].map( {'male': 0, 'female': 1} ).astype(int)\n",
"df.head()"
@@ -937,7 +936,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Now we replace as previosly the categories with integers\n",
"# Now we replace, as previously, the categories with integers\n",
"df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
"df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
"df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
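The same *Embarked* encoding can be sketched as a single `map` call instead of one `.loc` assignment per category (toy data):

```python
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One map() call covers all three categories at once
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
print(df)
```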
@@ -985,7 +984,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]