mirror of https://github.com/gsi-upm/sitc synced 2026-03-02 17:58:16 +00:00

Update 3_3_Data_Munging_with_Pandas.ipynb

Adapted to pandas 3.X and corrected typos
This commit is contained in:
Carlos A. Iglesias
2026-03-02 15:47:41 +01:00
committed by GitHub
parent ff76909b87
commit 8c82c6fbcd


@@ -63,9 +63,9 @@
"source": [
"[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is loosely the process of manually converting or mapping data from one \"raw\" form (*datos en bruto*) into another format that allows for more convenient consumption of the data with the help of semi-automated tools.\n",
"\n",
"*Scikit-learn* estimators which assume that all values are numerical. This is a common in many machine learning libraries. So, we need to preprocess our raw dataset. \n",
"*Scikit-learn* estimators assume that all values are numerical. This is a common feature in many machine learning libraries. So, we need to preprocess our raw dataset. \n",
"Some of the most common tasks are:\n",
"* Remove samples with missing values or replace the missing values with a value (median, mean or interpolation)\n",
"* Remove samples with missing values or replace the missing values with a value (median, mean, or interpolation)\n",
"* Encode categorical variables as integers\n",
"* Combine datasets\n",
"* Rename variables and convert types\n",
@@ -73,7 +73,7 @@
"\n",
"We are going to play again with the Titanic dataset to practice with Pandas Dataframes and introduce a number of preprocessing facilities of scikit-learn.\n",
"\n",
"First we load the dataset and we get a dataframe."
"First, we load the dataset and get a dataframe."
]
},
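The preprocessing tasks listed above can be sketched with a tiny hypothetical frame (not the real Titanic data; column names are illustrative):

```python
import pandas as pd

# A tiny hypothetical dataset to illustrate the common preprocessing tasks
df = pd.DataFrame({
    'Age': [22.0, None, 35.0],
    'Sex': ['male', 'female', 'female'],
    'Fare': ['7.25', '71.28', '8.05'],   # numeric values stored as strings
})

# 1. Replace missing values with the median
df['Age'] = df['Age'].fillna(df['Age'].median())

# 2. Encode a categorical variable as integers
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# 3. Rename variables and convert types
df = df.rename(columns={'Fare': 'FarePaid'}).astype({'FarePaid': float})

print(df)
```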
{
@@ -129,7 +129,7 @@
"metadata": {},
"outputs": [],
"source": [
"# We can list non numerical properties, with a boolean indexing of the Series df.dtypes\n",
"# We can list non-numerical properties using boolean indexing of the Series df.dtypes\n",
"df.dtypes[df.dtypes == object]"
]
},
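A minimal sketch of that boolean indexing on `df.dtypes`, using a small made-up frame (the `dtype=object` is forced explicitly so the comparison behaves the same across pandas versions):

```python
import pandas as pd

# Hypothetical frame mixing numeric and object columns
df = pd.DataFrame({
    'Age': [22, 35],
    'Name': pd.Series(['Braund', 'Allen'], dtype=object),
    'Fare': [7.25, 8.05],
})

# df.dtypes is a Series indexed by column name; boolean indexing
# keeps only the entries whose dtype is object (non-numerical)
non_numerical = df.dtypes[df.dtypes == object]
print(non_numerical)
```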
@@ -423,7 +423,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Mean age, SibSp , Survived of passengers older than 25 which survived, grouped by Passenger Class and Sex \n",
"# Mean age, SibSp, Survived of passengers older than 25 who survived, grouped by Passenger Class and Sex \n",
"df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].mean()"
]
},
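A self-contained sketch of that filter-then-groupby pattern, on a small hypothetical frame. Note that `&` binds tighter than `>`, so each comparison needs its own parentheses:

```python
import pandas as pd

# Hypothetical mini-Titanic frame
df = pd.DataFrame({
    'Age':      [30, 40, 22, 50],
    'Sex':      ['male', 'female', 'female', 'male'],
    'Pclass':   [1, 1, 3, 2],
    'SibSp':    [0, 1, 0, 1],
    'Survived': [1, 1, 1, 0],
})

# Parenthesize each condition: & has higher precedence than comparisons
mask = (df.Age > 25) & (df.Survived == 1)
result = df[mask].groupby(['Pclass', 'Sex'])[['Age', 'SibSp', 'Survived']].mean()
print(result)
```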
@@ -433,7 +433,7 @@
"metadata": {},
"outputs": [],
"source": [
"# We can also decide which function apply in each column\n",
"# We can also decide which function to apply to each column\n",
"\n",
"#Show mean Age, mean SibSp, and number of passengers older than 25 that survived, grouped by Passenger Class and Sex\n",
"df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].agg({'Age': np.mean, \n",
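A runnable sketch of per-column aggregation with `agg()` on a tiny hypothetical frame; string names like `'mean'` and `'sum'` are used, which avoid the deprecated direct-`np.mean` path in newer pandas:

```python
import pandas as pd

# Hypothetical mini-Titanic frame
df = pd.DataFrame({
    'Age':      [30, 40, 26, 50],
    'SibSp':    [0, 1, 2, 1],
    'Survived': [1, 1, 1, 0],
    'Pclass':   [1, 1, 1, 2],
    'Sex':      ['male', 'female', 'female', 'male'],
})

# agg() takes a dict mapping each column to its own aggregation function
out = (df[(df.Age > 25) & (df.Survived == 1)]
       .groupby(['Pclass', 'Sex'])
       .agg({'Age': 'mean', 'SibSp': 'mean', 'Survived': 'sum'}))
print(out)
```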
@@ -470,7 +470,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to analyze multi-index, the percentage of survivoers, given sex and age, and distributed by Pclass."
"Now we want to analyze, using a multi-index, the percentage of survivors, given sex and age, and distributed by Pclass."
]
},
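One way this kind of multi-index percentage can be sketched (hypothetical data, grouping by Sex and Pclass for brevity): the mean of the 0/1 *Survived* column is the survival rate, and `unstack` moves one index level into the columns.

```python
import pandas as pd

df = pd.DataFrame({
    'Sex':      ['male', 'male', 'female', 'female'],
    'Pclass':   [1, 2, 1, 2],
    'Survived': [0, 1, 1, 1],
})

# groupby two keys -> MultiIndex; mean of 0/1 column = survival rate;
# unstack pivots the Pclass level into columns
rate = df.groupby(['Sex', 'Pclass'])['Survived'].mean().unstack('Pclass')
print(rate)
```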
{
@@ -581,7 +581,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case there not duplicates. In case we would needed, we could have removed them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
"In this case, there are no duplicates. If we needed to, we could remove them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
]
},
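A minimal sketch of `drop_duplicates` with the column-subset behavior, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Braund', 'Braund', 'Allen'],
                   'Pclass': [3, 3, 1]})

# With subset=, only the listed columns are compared to identify duplicates;
# without it, all columns are used
deduped = df.drop_duplicates(subset=['Name'])
print(deduped)
```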
{
@@ -597,7 +597,7 @@
"source": [
"Here we check how many null values there are.\n",
"\n",
"We use sum() instead of count() or we would get the total number of records). Notice how we do not use size() now, either. You can print 'df.isnull()' and will see a DataFrame with boolean values."
"We use sum() instead of count(), or we would get the total number of records. Notice how we do not use size() now, either. You can print 'df.isnull()' and you will see a DataFrame with boolean values."
]
},
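The sum()-vs-count() distinction can be sketched with a toy frame: `isnull()` yields booleans, `sum()` counts the `True` values per column, while `count()` counts non-null records.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22.0, np.nan, 35.0],
                   'Cabin': [np.nan, np.nan, 'C85']})

# sum() over the boolean DataFrame counts missing values per column
nulls = df.isnull().sum()
print(nulls)
```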
{
@@ -626,7 +626,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how=all* that deletes a sample if all the values are missing, instead of the default *how=any*."
"Most of the samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how=all* that deletes a sample if all the values are missing, instead of the default *how=any*."
]
},
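The *how=any* vs *how=all* difference in a minimal sketch with made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1.0, np.nan, np.nan],
                   'B': [2.0, 5.0, np.nan]})

# how='any' (default): drop a row if any value is missing
any_dropped = df.dropna()
# how='all': drop a row only when every value is missing
all_dropped = df.dropna(how='all')
print(len(any_dropped), len(all_dropped))
```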
{
@@ -654,13 +654,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe that the Passenger with 889 has now an Agent of 28 (median) instead of NaN. \n",
"Observe that the Passenger with index 889 now has an Age of 28 (median) instead of NaN. \n",
"\n",
"Regarding the column *cabins*, there are still NaN values, since the *Cabin* column is not numeric. We will see later how to change it.\n",
"\n",
"In addition, we could drop rows with any or all null values (method *dropna()*).\n",
"\n",
"If we want to modify directly the *df* object, we should add the parameter *inplace* with value *True*."
"If we want to modify the *df* object directly, we should add the parameter *inplace* with value *True*."
]
},
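A small sketch of median imputation with `fillna` (hypothetical data): by default `fillna` returns a new object, so the result is assigned back, which is the alternative to passing *inplace=True*.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22.0, np.nan, 34.0]})

# fillna returns a new Series by default; assign it back
# (or call df['Age'].fillna(..., inplace=True) to modify df directly)
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df)
```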
{
@@ -700,7 +700,7 @@
"metadata": {},
"outputs": [],
"source": [
"# There are not labels for rows, so we use the numeric index\n",
"# There are no labels for rows, so we use the numeric index\n",
"df.iloc[889]"
]
},
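The iloc-vs-label distinction in a minimal sketch: when the index labels are not integers, positional access goes through `iloc`.

```python
import pandas as pd

# String index labels, so positional access needs iloc
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.iloc[-1], s.loc['c'])
```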
@@ -780,27 +780,27 @@
"metadata": {},
"source": [
"\n",
"**Scikit-learn** provides also a preprocessing facility for managing null values in the [**Imputer**](http://scikit-learn.org/stable/modules/preprocessing.html) class. We can include *Imputer* as a step in the *Pipeline*."
"**Scikit-learn** also provides a preprocessing facility for managing null values in the [**Imputer**](http://scikit-learn.org/stable/modules/preprocessing.html) class. We can include *Imputer* as a step in the *Pipeline*."
]
},
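A hedged sketch of imputation as a *Pipeline* step: in recent scikit-learn versions the class is `SimpleImputer` (in `sklearn.impute`, replacing the older `Imputer`), and it slots into a `Pipeline` like any other transformer. Data here is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, 6.0]])

# Impute missing values with the column median, then standardize
pipe = Pipeline([('impute', SimpleImputer(strategy='median')),
                 ('scale', StandardScaler())])
X_t = pipe.fit_transform(X)
print(X_t)
```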
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysing non numerical columns"
"# Analysing non-numerical columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we saw, we have several non numerical columns: **Name**, **Sex**, **Ticket**, **Cabin** and **Embarked**.\n",
"As we saw, we have several non-numerical columns: **Name**, **Sex**, **Ticket**, **Cabin**, and **Embarked**.\n",
"\n",
"**Name** and **Ticket** do not seem informative.\n",
"\n",
"Regarding **Cabin**, most values were missing, so we can ignore it. \n",
"\n",
"**Sex** and **Embarked** are categorical features, so we will encode as integers."
"**Sex** and **Embarked** are categorical features, so we will encode them as integers."
]
},
{
@@ -826,7 +826,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"*Sex* has been codified as a categorical feature. It is better to encode features as continuous variables, since scikit-learn estimators expect continuous input, and they would interpret the categories as being ordered, which is not the case. "
"*Sex* has been codified as a categorical feature. It is better to encode features as continuous variables, since scikit-learn estimators expect continuous input and would interpret the categories as ordered, which is not the case. "
]
},
{
@@ -835,7 +835,7 @@
"metadata": {},
"outputs": [],
"source": [
"#First we check if there is any null values. Observe the use of any()\n",
"#First, we check if there are any null values. Observe the use of any()\n",
"df['Sex'].isnull().any()"
]
},
@@ -862,8 +862,7 @@
"metadata": {},
"outputs": [],
"source": [
"df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
"df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
"df[\"Sex\"] = df[\"Sex\"].map({\"male\": 0, \"female\": 1})\n",
"df[-5:]"
]
},
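The `map`-based encoding in a self-contained sketch (hypothetical data); note that any value not present in the dict becomes NaN:

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# map() replaces each category with its integer code from the dict
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
print(df)
```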
@@ -873,7 +872,7 @@
"metadata": {},
"outputs": [],
"source": [
"#An alternative is to create a new column with the encoded valuesm and define a mapping\n",
"#An alternative is to create a new column with the encoded values and define a mapping\n",
"df = df_original.copy()\n",
"df['Gender'] = df['Sex'].map( {'male': 0, 'female': 1} ).astype(int)\n",
"df.head()"
@@ -937,7 +936,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Now we replace as previosly the categories with integers\n",
"# Now we replace, as previously, the categories with integers\n",
"df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
"df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
"df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
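The same *Embarked* encoding can be sketched as a single `map` call instead of one `.loc` assignment per category (toy data):

```python
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One map() call covers all three categories at once
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
print(df)
```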
@@ -985,7 +984,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]