mirror of https://github.com/gsi-upm/sitc synced 2026-03-03 02:08:17 +00:00

Updated to Pandas 3.X and corrected typos

cif
2026-03-02 17:40:58 +01:00
parent 5c440527ac
commit 65da5ae714
8 changed files with 105 additions and 135 deletions


@@ -68,7 +68,7 @@
"source": [
"In the previous session, we introduced two libraries for visualisation: *matplotlib* and *seaborn*. We are going to review new functionalities in this notebook, as well as the integration of *pandas* with *matplotlib*.\n",
"\n",
"Visualisation is usually combined with munging. We have done this in separate notebooks for learning purposes. We we are going to examine again the dataset, combinging both techniques, and applying the knowledge we got in the previous notebook."
"Visualisation is usually combined with munging. We have done this in separate notebooks for learning purposes. We are going to examine the dataset again, combining both techniques, and applying the knowledge we got in the previous notebook."
]
},
{
@@ -93,11 +93,11 @@
" * 'hexbin' for hexagonal bin plots\n",
" * 'pie' for pie charts\n",
" \n",
"Every plot kind has an equivalent on the Dataframe.plot accessor. This means, you can use **df.plot(kind='line')** or **df.plot.line**. Check the [plot documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html#pandas.DataFrame.plot) to learn the rest of parameters.\n",
"Every plot kind has an equivalent on the DataFrame.plot accessor. This means you can use **df.plot(kind='line')** or **df.plot.line**. Check the [plot documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html#pandas.DataFrame.plot) to learn the rest of the parameters.\n",
"\n",
"In addition, the module *pandas.plotting* provides **scatter_matrix**.\n",
"\n",
"You can consult more details in the [documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html)."
"You can consult the [documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html) for more details."
]
},
{
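The markdown cell in this hunk says that each plot kind is reachable both through the `kind` argument and through the `DataFrame.plot` accessor. A minimal sketch of that equivalence, assuming a hypothetical two-column frame `toy` (not the notebook's data) and the `pandas.plotting` location of `scatter_matrix` used by current pandas releases:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 1.5, 3.5, 3.0]})

# Two spellings of the same line plot.
ax_kind = toy.plot(kind="line")
ax_accessor = toy.plot.line()

# scatter_matrix returns a grid of axes, one per pair of numeric columns.
axes = pd.plotting.scatter_matrix(toy)
print(axes.shape)
```

Both calls return a matplotlib axes object, so either can be customised further in the same way.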
@@ -151,12 +151,9 @@
"# Cleaning\n",
"df_clean = df.copy() # We copy to see what happens with na values\n",
"df_clean['Age'] = df['Age'].fillna(df['Age'].median())\n",
"df_clean.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
"df_clean.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
"df_clean.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n",
"df_clean.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
"df_clean.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
"df_clean.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
"df_clean['Sex'] = df_clean['Sex'].map({'male': 0, 'female': 1})\n",
"df_clean = df_clean.drop(['Cabin', 'Ticket'], axis=1)\n",
"df_clean['Embarked'] = df_clean['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})\n",
"df_clean.head()"
]
},
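The hunk above swaps the chained `.loc` assignments for `Series.map`. A minimal sketch of that pattern, assuming a hypothetical frame `toy` rather than the Titanic dataset; note that `map` leaves any value absent from the dictionary, including missing values, as NaN:

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
})

encoded = toy.copy()
# Each category is replaced by the integer given in the dictionary.
encoded["Sex"] = encoded["Sex"].map({"male": 0, "female": 1})
encoded["Embarked"] = encoded["Embarked"].map({"S": 0, "C": 1, "Q": 2})

print(encoded)
```

Because the missing `Embarked` value stays missing after the mapping, filling nulls before or after encoding remains a separate step.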
@@ -249,8 +246,8 @@
"metadata": {},
"outputs": [],
"source": [
"# General description of relationship between variables using Seaborn PairGrid\n",
"# We use df_clean, since the null values of df would give us an error, you can check it.\n",
"# General description of the relationship between variables using Seaborn PairGrid\n",
"# We use df_clean because the null values in df would cause an error; you can check it.\n",
"g = sns.PairGrid(df_clean, hue=\"Survived\")\n",
"g.map(sns.scatterplot)\n",
"g.add_legend()"
@@ -280,7 +277,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can observe, for example, that more women survived as well as more people in 3rd class. \n",
"We can observe, for example, that more women and more people in the 3rd class survived. \n",
"\n",
"We can represent these findings."
]
@@ -337,9 +334,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see the histogram is slightly *right skewed* (*sesgada a la derecha*), so we will replace null values with the median instead of the mean.\n",
"We see the histogram is slightly *right-skewed* (*sesgada a la derecha*), so we will replace null values with the median instead of the mean.\n",
"\n",
"In case we have a significant *skewed distribution*, the extreme values in the long tail can have a disproportionately large influence on our model. So, it can be good to transform the variable before building our model to reduce skewness.Taking the natural logarithm or the square root of each point are two simple transformations. "
"If we have a significantly skewed distribution, extreme values in the long tail can exert a disproportionately large influence on our model. So, it can be good to transform the variable before building our model to reduce skewness. Two simple transformations are taking the natural logarithm or the square root of each point."
]
},
{
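The markdown cell in this hunk mentions the logarithm and the square root as ways to reduce skewness. A small sketch of the effect, assuming a hypothetical right-skewed sample `values` (not the notebook's columns) and using `np.log1p`, that is log(1 + x), as a variant of the natural logarithm that tolerates zeros:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample with one long-tail value.
values = pd.Series([7.0, 8.0, 10.0, 13.0, 26.0, 30.0, 80.0, 512.0])

log_values = np.log1p(values)   # log(1 + x), an assumption made to handle x = 0
sqrt_values = np.sqrt(values)

# Series.skew() reports the sample skewness; lower means less right-skewed here.
print("raw  skew:", values.skew())
print("log  skew:", log_values.skew())
print("sqrt skew:", sqrt_values.skew())
```

Both transformations compress the long tail, so the reported skewness should be lower than for the raw values.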
@@ -410,7 +407,7 @@
"metadata": {},
"outputs": [],
"source": [
"# We can observe the detail for children\n",
"# We can observe the details for children\n",
"df[df.Age < 20].hist(column='Age', by='Survived', sharey=True)"
]
},
@@ -428,9 +425,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There were null values, we will recap at the end of this notebook how to manage them.\n",
"There were null values; we will recap at the end of this notebook how to manage them.\n",
"\n",
"We are going now to see the distribution of passengers younger than 20 that survived."
"We are now going to see the distribution of passengers younger than 20 who survived."
]
},
{
@@ -448,7 +445,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Passengers older than 25 that survived grouped by Sex\n",
"# Passengers younger than 20 who survived, grouped by Sex and Pclass\n",
"\n",
"df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().plot(kind='bar')"
]
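The cell in this hunk chains `query`, `groupby`, and `size` before plotting. A sketch of the same chain without the plot, assuming a hypothetical five-row frame `toy` that only mimics the column names of the Titanic data:

```python
import pandas as pd

# Hypothetical rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "Age": [4, 15, 18, 30, 16],
    "Survived": [1, 1, 0, 1, 1],
    "Sex": ["male", "female", "male", "female", "female"],
    "Pclass": [3, 1, 3, 1, 1],
})

# Keep survivors younger than 20, then count rows per (Sex, Pclass) pair.
counts = (
    toy.query("Age < 20 and Survived == 1")
       .groupby(["Sex", "Pclass"])
       .size()
)
print(counts)
```

The result is a Series indexed by `(Sex, Pclass)`, which is what `.plot(kind='bar')` then draws as one bar per pair.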
@@ -615,7 +612,7 @@
"source": [
"#Graphical representation\n",
"# You can add the parameter estimator to change the estimator. (e.g. estimator=np.median)\n",
"# For example, estimator=np.size is you get the same chart than with countplot\n",
"# For example, with estimator=np.size you get the same chart as with countplot\n",
"#sns.barplot(x='Sex', y='Survived', data=df, estimator=np.size)\n",
"sns.barplot(x='Sex', y='Survived', data=df)"
]
@@ -624,7 +621,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see now if men and women follow the same age distribution."
"We can see now whether men and women follow the same age distribution."
]
},
{
@@ -814,7 +811,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that most outliers are in class 1. In particular, we see some values higher thatn 500 that should be an error."
"We see that most outliers are in class 1. In particular, we see some values higher than 500 that are probably errors."
]
},
{
@@ -902,9 +899,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since there are missing values, we will replace them by the most popular value ('S'), and we will also encode it since it is a categorical variable.\n",
"Since there are missing values, we will replace them with the most popular value ('S'), and we will also encode it since it is a categorical variable.\n",
"\n",
"We can see if this has impact on its survival."
"We can see whether this has an impact on survival."
]
},
{
@@ -953,7 +950,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We have to fill null values (2 null values) and encode this variable, since it is categorical. We will do it after reviewing the rest of features."
"We have to fill in the null values (2 nulls) and encode this variable, since it is categorical. We will do it after reviewing the rest of the features."
]
},
{
@@ -995,7 +992,7 @@
"source": [
"We can see that most passengers traveled without siblings or spouses. \n",
"\n",
"We analyse if this had an impact on its survival."
"We analyse whether this had an impact on survival."
]
},
{
@@ -1020,7 +1017,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that it does not provide too much information. While the survival mean for all passengers is 38%, passengers with 0 SibSp have a 34% probability of survival. Surprisingly, passengers with 1 sibling or spouse have a higher probability, 53%. We are going to see the distribution by gender"
"We see that it does not provide much information. While the survival rate for all passengers is 38%, passengers with 0 SibSp have a 34% survival rate. Surprisingly, passengers with 1 sibling or spouse have a higher survival rate, 53%. We are going to see the distribution by gender."
]
},
{
@@ -1061,7 +1058,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We observe that when SibSp > 2, the survival probability decreases to the half. We are going to check if there is a difference in the age. "
"We observe that when SibSp > 2, the survival probability decreases to half. We are going to check if there is a difference in the age. "
]
},
{
@@ -1150,7 +1147,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The feature Parch (Parents-Children Aboard) is somewhat related to the previous one, since it reflects family ties. It is well known that in emergencies, family groups often all die or evacuate together, so it is expected that it will also have an impact on our model."
"The feature Parch (Parents-Children Aboard) is somewhat related to the previous one, since it reflects family ties. It is well known that in emergencies, family groups often die or evacuate together, so it is expected that this will also affect our model."
]
},
{
@@ -1418,13 +1415,10 @@
"#df = df_original.copy()\n",
"#df['SexEncoded'] = df.Sex\n",
"#\n",
"#df.loc[df[\"SexEncoded\"] == 'male', \"SexEncoded\"] = 0\n",
"#df.loc[df[\"SexEncoded\"] == \"female\", \"SexEncoded\"] = 1\n",
"#df[\"SexEncoded\"] = df[\"Sex\"].map({\"male\": 0, \"female\": 1})\n",
"#\n",
"#df['EmbarkedEncoded'] = df.Embarked\n",
"#df.loc[df[\"EmbarkedEncoded\"] == \"S\", \"EmbarkedEncoded\"] = 0\n",
"#df.loc[df[\"EmbarkedEncoded\"] == \"C\", \"EmbarkedEncoded\"] = 1\n",
"#df.loc[df[\"EmbarkedEncoded\"] == \"Q\", \"EmbarkedEncoded\"] = 2\n",
"#df[\"EmbarkedEncoded\"] = df[\"Embarked\"].map({\"S\": 0, \"C\": 1, \"Q\": 2})\n",
"#df.head()"
]
},
@@ -1439,9 +1433,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we see previously, translating categorical variables into integers can introduce an order. In our case, this is not a problem, since *Sex* is a binary variable, and we can consider there exists an order in *Pclass*.\n",
"As we saw previously, translating categorical variables into integers can introduce an order. In our case, this is not a problem, since *Sex* is a binary variable and we can assume an order in *Pclass*.\n",
"\n",
"Nevertheless, we are going to introduce a general approach to encode categorical variables using some facilities provided by scikit-learn."
"Nevertheless, we will introduce a general approach to encoding categorical variables using facilities provided by scikit-learn."
]
},
{
@@ -1489,7 +1483,8 @@
"outputs": [],
"source": [
"#Remove nulls\n",
"df['Embarked'].fillna('S', inplace=True)\n",
"# Fill missing values in 'Embarked'\n",
"df['Embarked'] = df['Embarked'].fillna('S')\n",
"df = pd.get_dummies(df, columns=['Embarked', 'Pclass'])\n",
"df.head()"
]
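This hunk replaces `fillna(..., inplace=True)` on a selected column with an explicit reassignment. As far as I understand pandas 3.x, Copy-on-Write is the default there, so the in-place call on `df['Embarked']` would act on a temporary object and leave `df` unchanged. A minimal sketch of the reassignment followed by one-hot encoding, assuming a hypothetical frame `toy`:

```python
import pandas as pd

# Hypothetical toy data with one missing port of embarkation.
toy = pd.DataFrame({
    "Embarked": ["S", None, "C", "Q"],
    "Pclass": [3, 1, 2, 3],
})

# Assign the result back instead of calling fillna(..., inplace=True)
# on the selected column.
toy["Embarked"] = toy["Embarked"].fillna("S")

# One indicator column per category of each listed column.
dummies = pd.get_dummies(toy, columns=["Embarked", "Pclass"])
print(dummies.columns.tolist())
```

The reassignment form also behaves the same on older pandas versions, which makes it the safer spelling for a notebook shared across environments.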
@@ -1514,7 +1509,8 @@
"metadata": {},
"outputs": [],
"source": [
"df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n",
"# Drop unwanted columns\n",
"df = df.drop(columns=['Cabin', 'Ticket'])\n",
"df.head()"
]
},
@@ -1557,7 +1553,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
@@ -1579,7 +1575,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
"version": "3.12.2"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
@@ -1600,5 +1596,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 1
"nbformat_minor": 4
}