mirror of https://github.com/gsi-upm/sitc synced 2026-03-02 17:58:16 +00:00

Add files via upload

Updated to pandas 3.X and corrected typos
This commit is contained in:
Carlos A. Iglesias
2026-03-02 16:07:53 +01:00
committed by GitHub
parent 8e9d3cfdad
commit 722da8fc6c


@@ -63,9 +63,9 @@
"source": [
"[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is loosely the process of manually converting or mapping data from one \"raw\" form (*datos en bruto*) into another format that allows for more convenient consumption of the data with the help of semi-automated tools.\n",
"\n",
"*Scikit-learn* estimators assume that all values are numerical. This is a common feature in many machine learning libraries, so we need to preprocess our raw dataset. \n",
"Some of the most common tasks are:\n",
"* Remove samples with missing values or replace the missing values with a value (median, mean, or interpolation)\n",
"* Encode categorical variables as integers\n",
"* Combine datasets\n",
"* Rename variables and convert types\n",
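The tasks listed above can be sketched with pandas; the tiny DataFrame below is hypothetical, not the Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data illustrating the cleaning tasks listed above
raw = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "sex": ["male", "female", "female"],
    "name": ["A", "B", "C"],
})

# Replace missing values with the median
raw["age"] = raw["age"].fillna(raw["age"].median())

# Encode a categorical variable as integers
raw["sex"] = raw["sex"].map({"male": 0, "female": 1})

# Rename variables and convert types
clean = raw.rename(columns={"name": "passenger"}).astype({"sex": "int64"})
```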
@@ -73,7 +73,7 @@
"\n",
"We are going to play again with the Titanic dataset to practice with Pandas Dataframes and introduce a number of preprocessing facilities of scikit-learn.\n",
"\n",
"First, we load the dataset and get a dataframe."
]
},
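A self-contained sketch of the loading step, using a two-row stand-in CSV instead of the real Titanic file (whose path the notebook supplies):

```python
import io
import pandas as pd

# Stand-in CSV; column names follow the Titanic dataset, rows are made up
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Braund, Mr. Owen",male,22
2,1,1,"Cumings, Mrs. John",female,38
"""
df = pd.read_csv(io.StringIO(csv_text))
```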
{
@@ -120,7 +120,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that some features have a numerical type (int64 and float64), while others have the type *object*. The object type in Pandas is a String. We observe that most features are integers, except for Name, Sex, Ticket, Cabin, and Embarked."
]
},
{
@@ -129,7 +129,7 @@
"metadata": {},
"outputs": [],
"source": [
"# We can list non-numerical properties with a boolean indexing of the Series df.dtypes\n",
"df.dtypes[df.dtypes == object]"
]
},
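An equivalent way to get the same columns (a small sketch on a hypothetical frame, with the string column built as dtype object so both forms agree):

```python
import pandas as pd

# Hypothetical frame mixing numeric and object columns
df = pd.DataFrame({
    "Age": [22, 38],
    "Name": pd.Series(["A", "B"], dtype=object),
    "Fare": [7.25, 71.28],
})

# Boolean indexing of df.dtypes, as in the cell above
object_cols = df.dtypes[df.dtypes == object]

# Equivalent built-in helper
object_frame = df.select_dtypes(include="object")
```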
@@ -164,7 +164,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe that some of the statistics do not make sense in some columns (PassengerId or Pclass); we could have selected only the interesting columns."
]
},
{
@@ -181,7 +181,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selecting rows in a DataFrame"
]
},
{
@@ -281,14 +281,13 @@
"source": [
"DataFrames provide a set of functions for selection that we will need later:\n",
"\n",
"| Operation | Syntax | Result |\n",
"|--------------------------------|--------------|----------|\n",
"| Select column | `df[col]` | Series |\n",
"| Select row by label | `df.loc[label]` | Series |\n",
"| Select row by integer location | `df.iloc[loc]` | Series |\n",
"| Slice rows | `df[5:10]` | DataFrame |\n",
"| Select rows by boolean vector | `df[bool_vec]` | DataFrame |"
]
},
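The operations in the table, demonstrated on a hypothetical labeled frame:

```python
import pandas as pd

# Hypothetical frame with string labels to show each selection form
df = pd.DataFrame({"Age": [22, 38, 26], "Sex": ["male", "female", "female"]},
                  index=["a", "b", "c"])

col = df["Age"]                 # Series
by_label = df.loc["b"]          # Series
by_position = df.iloc[2]        # Series
sliced = df[0:2]                # DataFrame
filtered = df[df["Age"] > 25]   # DataFrame
```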
{
@@ -423,7 +422,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Mean Age, SibSp, and Survived of passengers older than 25 who survived, grouped by Passenger Class and Sex\n",
"df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].mean()"
]
},
@@ -433,7 +432,7 @@
"metadata": {},
"outputs": [],
"source": [
"# We can also decide which function applies in each column\n",
"\n",
"#Show mean Age, mean SibSp, and number of passengers older than 25 that survived, grouped by Passenger Class and Sex\n",
"df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].agg({'Age': np.mean, \n",
@@ -470,7 +469,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to analyze, with a multi-index, the percentage of survivors given sex and age, distributed by Pclass."
]
},
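One way to sketch such a multi-index summary is a pivot table; the rows below are made up, not the Titanic data:

```python
import pandas as pd

# Hypothetical rows; the notebook computes this on the Titanic data
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2],
    "Sex": ["male", "female", "male", "female"],
    "Survived": [0, 1, 1, 1],
})

# Survival rate by Sex, spread over Pclass columns
rates = df.pivot_table(values="Survived", index="Sex", columns="Pclass",
                       aggfunc="mean")
```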
{
@@ -581,7 +580,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, there are no duplicates. Had we needed to, we could have removed them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
]
},
{
@@ -597,7 +596,7 @@
"source": [
"Here we check how many null values there are.\n",
"\n",
"We use sum() instead of count(), or we would get the total number of records. Notice how we do not use size() now, either. You can print 'df.isnull()' and you will see a DataFrame with boolean values."
]
},
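The counting trick described above, on a hypothetical frame with one missing value per column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0],
                   "Cabin": ["C85", None, "E46"]})

# isnull() yields a boolean DataFrame; sum() counts the True values per column
nulls_per_column = df.isnull().sum()
```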
{
@@ -626,7 +625,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of the samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how=all* that deletes a sample if all the values are missing, instead of the default *how=any*."
]
},
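The difference between the two *how* arguments, on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is partially missing, row 2 is fully missing
df = pd.DataFrame({"A": [1.0, np.nan, np.nan], "B": [2.0, 3.0, np.nan]})

dropped_any = df.dropna(how="any")  # removes rows with at least one NaN
dropped_all = df.dropna(how="all")  # removes only fully-empty rows
```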
{
@@ -654,13 +653,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe that Passenger 889 now has an Age of 28 (median) instead of NaN. \n",
"\n",
"Regarding the *Cabin* column, there are still NaN values, since it is not numeric. We will see later how to change it.\n",
"\n",
"In addition, we could drop rows with any or all null values (method *dropna()*)."
]
},
{
@@ -669,7 +666,7 @@
"metadata": {},
"outputs": [],
"source": [
"df['Age'] = df['Age'].fillna(df['Age'].mean())\n",
"df[-5:]"
]
},
@@ -700,7 +697,7 @@
"metadata": {},
"outputs": [],
"source": [
"# There are no labels for rows, so we use the numeric index\n",
"df.iloc[889]"
]
},
@@ -762,7 +759,7 @@
"metadata": {},
"outputs": [],
"source": [
"df['Sex'] = df['Sex'].fillna('male')\n",
"df[-5:]"
]
},
@@ -780,27 +777,27 @@
"metadata": {},
"source": [
"\n",
"**Scikit-learn** also provides a preprocessing facility for managing null values in the [**Imputer**](http://scikit-learn.org/stable/modules/preprocessing.html) class. We can include *Imputer* as a step in the *Pipeline*."
]
},
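In recent scikit-learn releases, the *Imputer* class mentioned above was replaced by *SimpleImputer* in `sklearn.impute`; a minimal sketch with toy values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Median imputation over a single numeric column (toy values)
ages = np.array([[22.0], [np.nan], [38.0]])
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(ages)
```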
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysing non-numerical columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we saw, we have several non-numerical columns: **Name**, **Sex**, **Ticket**, **Cabin**, and **Embarked**.\n",
"\n",
"**Name** and **Ticket** do not seem informative.\n",
"\n",
"Regarding **Cabin**, most values were missing, so we can ignore it. \n",
"\n",
"**Sex** and **Embarked** are categorical features, so we will encode them as integers."
]
},
{
@@ -811,7 +808,7 @@
"source": [
"# We remove Cabin and Ticket. We should specify the axis\n",
"# Use axis 0 for dropping rows and axis 1 for dropping columns\n",
"df = df.drop(['Cabin', 'Ticket'], axis=1)\n",
"df[-5:]"
]
},
@@ -826,7 +823,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"*Sex* has been codified as a categorical feature. It is better to encode features as numerical variables, since scikit-learn estimators expect numerical input and would otherwise interpret the categories as ordered, which is not the case. "
]
},
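Since integer codes impose an artificial order, one common alternative (not used in this notebook) is one-hot encoding with *pd.get_dummies*:

```python
import pandas as pd

# Toy column with the three Embarked categories
embarked = pd.Series(["S", "C", "Q", "S"])

# One indicator column per category; no spurious ordering between codes
dummies = pd.get_dummies(embarked, prefix="Embarked")
```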
{
@@ -835,7 +832,7 @@
"metadata": {},
"outputs": [],
"source": [
"# First, we check if there are any null values. Observe the use of any()\n",
"df['Sex'].isnull().any()"
]
},
@@ -862,17 +859,40 @@
"metadata": {},
"outputs": [],
"source": [
"df[\"Sex\"] = df[\"Sex\"].map({\"male\": 0, \"female\": 1})\n",
"df[-5:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# An alternative is to replace the values in place with boolean indexing\n",
"df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Another alternative is to create a new column with the encoded values and define a mapping\n",
"df = df_original.copy()\n",
"df['Gender'] = df['Sex'].map( {'male': 0, 'female': 1} ).astype(int)\n",
"df.head()"
@@ -926,7 +946,7 @@
"outputs": [],
"source": [
"#Replace nulls with the most common value\n",
"df['Embarked'] = df['Embarked'].fillna('S')\n",
"df['Embarked'].isnull().any()"
]
},
@@ -936,13 +956,18 @@
"metadata": {},
"outputs": [],
"source": [
"# Now we replace, as previously, the categories with integers\n",
"df[\"Embarked\"] = df[\"Embarked\"].map({\"S\": 0, \"C\": 1, \"Q\": 2})\n",
"df[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -984,7 +1009,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
@@ -1006,7 +1031,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
@@ -1027,5 +1052,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 4
}