mirror of https://github.com/gsi-upm/sitc synced 2026-03-02 17:58:16 +00:00

Update 3_4_Visualisation_Pandas.ipynb

Corrected typos
Carlos A. Iglesias
2026-03-02 15:57:39 +01:00
committed by GitHub
parent 8c82c6fbcd
commit f09997743d


@@ -59,7 +59,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction: preprocessing"
"# Introduction: preprocessing."
]
},
{
@@ -68,7 +68,7 @@
"source": [
"In the previous session, we introduced two libraries for visualisation: *matplotlib* and *seaborn*. We are going to review new functionalities in this notebook, as well as the integration of *pandas* with *matplotlib*.\n",
"\n",
"Visualisation is usually combined with munging. We have done this in separated notebooks for learning purposes. We we are going to examine again the dataset, combinging both techniques, and applying the knowledge we got in the previous notebook."
"Visualisation is usually combined with munging. We have done this in separate notebooks for learning purposes. We we are going to examine again the dataset, combinging both techniques, and applying the knowledge we got in the previous notebook."
]
},
{
@@ -93,7 +93,7 @@
" * 'hexbin' for hexagonal bin plots\n",
" * 'pie' for pie charts\n",
" \n",
"Every plot kind has an equivalent on Dataframe.plot accessor. This means, you can use **df.plot(kind='line')** or **df.plot.line**. Check the [plot documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html#pandas.DataFrame.plot) to learn the rest of parameters.\n",
"Every plot kind has an equivalent on the Dataframe.plot accessor. This means, you can use **df.plot(kind='line')** or **df.plot.line**. Check the [plot documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html#pandas.DataFrame.plot) to learn the rest of parameters.\n",
"\n",
"In addition, the module *pandas.tools.plotting* provides: **scatter_matrix**.\n",
"\n",
@@ -135,7 +135,7 @@
"metadata": {},
"outputs": [],
"source": [
"#We get a URL with raw content (not HTML one)\n",
"#We get a URL with raw content (not an HTML one)\n",
"url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
"df = pd.read_csv(url)\n",
"df_original = df.copy() # Copy to have a version of df without modifications\n",
@@ -171,7 +171,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous session we saw that *Seaborn* provides several facilities for working with DataFrames. We are going to review some of them."
"In the previous session, we saw that *Seaborn* provides several facilities for working with DataFrames. We are going to review some of them."
]
},
{
@@ -249,8 +249,8 @@
"metadata": {},
"outputs": [],
"source": [
"# General description of relationship between variables uwing Seaborn PairGrid\n",
"# We use df_clean, since the null values of df would gives us an error, you can check it.\n",
"# General description of relationship between variables using Seaborn PairGrid\n",
"# We use df_clean, since the null values of df would give us an error, you can check it.\n",
"g = sns.PairGrid(df_clean, hue=\"Survived\")\n",
"g.map(sns.scatterplot)\n",
"g.add_legend()"
@@ -260,7 +260,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two many variables, we are going to represent only a subset."
"There are too many variables, we are going to represent only a subset."
]
},
{
@@ -319,7 +319,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We saw that there are 177 missing values of age. We are going this feature with more detail."
"We saw that there are 177 missing values of age. We are going to implement this feature with more detail."
]
},
{
@@ -391,7 +391,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We observe that non survived is left skewed. Most children survived."
"We observe that non-survived is left skewed. Most children survived."
]
},
{
@@ -586,7 +586,7 @@
"metadata": {},
"outputs": [],
"source": [
"# How many passergers survived by sex\n",
"# How many passengers survived by sex\n",
"df.groupby('Sex')['Survived'].sum()"
]
},
@@ -596,7 +596,7 @@
"metadata": {},
"outputs": [],
"source": [
"# How many passergers survived by sex\n",
"# How many passengers survived by sex\n",
"df.groupby('Sex')['Survived'].mean()"
]
},
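The two cells in the hunks above carry the same comment but compute different things: `sum()` on the 0/1 `Survived` column counts survivors per sex, while `mean()` gives the fraction that survived. A minimal sketch of that difference, using a small made-up frame (the rows are illustrative assumptions, not the Titanic data):

```python
import pandas as pd

# Hypothetical toy data; Survived is encoded as 0/1, as in the Titanic training set
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Number of survivors per sex (sum of the 0/1 flags)
survivors = toy.groupby("Sex")["Survived"].sum()

# Survival rate per sex (mean of the 0/1 flags)
rate = toy.groupby("Sex")["Survived"].mean()

print(survivors)
print(rate)
```

Because the column only holds 0 and 1, the mean can be read directly as a proportion, which is how the percentages quoted in the notebook text are obtained.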
@@ -604,7 +604,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that 74% of female survived, while only 18% of male survived."
"We see that 74% of females survived, while only 18% of males survived."
]
},
{
@@ -771,7 +771,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see the distribution is right sweked. We are going to detect outliers using a box plot"
"We see the distribution is right-skewed. We are going to detect outliers using a box plot."
]
},
{
@@ -790,7 +790,7 @@
"outputs": [],
"source": [
"# We can see the same with matplotlib.\n",
"# There is a bug and if you import seaborn, you should add 'sym='k.' to show the outliers\n",
"# There is a bug, and if you import seaborn, you should add 'sym='k.' to show the outliers\n",
"df.boxplot(column='Fare', return_type='axes', sym='k.')"
]
},
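The box plot in the hunk above marks as outliers the points beyond its whiskers, which by default extend 1.5 times the interquartile range past the quartiles. The same rule can be sketched numerically in pandas; the fare values below are hypothetical and only illustrate the computation:

```python
import pandas as pd

# Hypothetical fare values; the last one lies far from the rest
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33], name="Fare")

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the default box plot rule (1.5 * IQR)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the limits are the points a box plot would draw as outliers
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers)
```

On the real data, the same expression applied to `df['Fare']` would list the rows behind the points drawn outside the whiskers.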
@@ -995,7 +995,7 @@
"source": [
"We can see that most passengers traveled without siblings or spouses. \n",
"\n",
"We analyse if this had impact on its survival."
"We analyse if this had an impact on its survival."
]
},
{
@@ -1020,7 +1020,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that it does not provide too much information. While the survival mean of all passengers is 38%, passengers with 0 SibSp has 34% of probability. Surprisingly, passengers with 1 sibling or spouse have a higher probability, 53%. We are going to see the distribution by gender"
"We see that it does not provide too much information. While the survival mean for all passengers is 38%, passengers with 0 SibSp have a 34% probability of survival. Surprisingly, passengers with 1 sibling or spouse have a higher probability, 53%. We are going to see the distribution by gender"
]
},
{
@@ -1204,7 +1204,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see the probability of surviving is higher in 2 and 3. Sincethere were too few rows for Parch >= 3, this part is not relevant."
"We see the probability of surviving is higher in 2 and 3. Since there were too few rows for Parch >= 3, this part is not relevant."
]
},
{
@@ -1229,7 +1229,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We observe that Parch has an important impact for men in first and second class. We are going to check the age."
"We observe that Parch has an important impact on men in first and second class. We are going to check the age."
]
},
{
@@ -1261,7 +1261,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We observe that there is a significant difference, so we suspect that this feature has impact of men in first and second class."
"We observe that there is a significant difference, so we suspect that this feature has an impact on men in first and second class."
]
},
{
@@ -1275,7 +1275,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Age: null values"
"## Feature Age: null values."
]
},
{
@@ -1337,7 +1337,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Embarking: null values"
"## Feature Embarking: null values."
]
},
{
@@ -1360,7 +1360,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"As we discussed previously, we will replace these missing values by the most popular one (mode): S."
"As we discussed previously, we will replace these missing values with the most popular one (mode): S."
]
},
{
@@ -1370,7 +1370,7 @@
"outputs": [],
"source": [
"#Replace nulls with the most common value\n",
"df['Embarked'].fillna('S', inplace=True)\n",
"df['Embarked'] = df['Embarked'].fillna('S')\n",
"df['Embarked'].isnull().any()"
]
},
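The hunk above swaps an in-place `fillna` on a column selection for an explicit assignment back to the column. In recent pandas releases the chained in-place form emits a warning, and with Copy-on-Write enabled it leaves the DataFrame unchanged, so the assignment form is the safer pattern. A minimal sketch on a hypothetical toy frame, which also derives the mode instead of hard-coding `'S'` (that derivation is an illustrative addition, not part of the commit):

```python
import pandas as pd

# Hypothetical toy column with missing values
toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# mode() ignores missing values by default; take the first (most frequent) entry
most_common = toy["Embarked"].mode()[0]

# Assign the filled column back instead of using inplace=True on a selection
toy["Embarked"] = toy["Embarked"].fillna(most_common)

print(toy["Embarked"].isnull().any())
```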
@@ -1378,14 +1378,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Cabin: null values"
"## Feature Cabin: null values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to analyse Cabin in the exercise"
"We are going to analyse Cabin in the exercise."
]
},
{
@@ -1399,14 +1399,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Recap: encoding categorical features"
"## Recap: encoding categorical features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous notebook we saw how to encode categorical features. We are going to explore an alternative way."
"In the previous notebook, we saw how to encode categorical features. We are going to explore an alternative way."
]
},
{
@@ -1432,14 +1432,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Encoding Categorical Variables as Binary ones"
"## Encoding Categorical Variables as Binary Ones"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we see previously, translating categorical variables into integer can introduce an order. In our case, this is not a problem, since *Sex* is a binary variable, and we can consider there exists an order in *Pclass*.\n",
"As we see previously, translating categorical variables into integers can introduce an order. In our case, this is not a problem, since *Sex* is a binary variable, and we can consider there exists an order in *Pclass*.\n",
"\n",
"Nevertheless, we are going to introduce a general approach to encode categorical variables using some facilities provided by scikit-learn."
]
@@ -1461,8 +1461,8 @@
"\n",
"df = df_original.copy() # take original df\n",
"\n",
"# We define here the categorical columns have non integer values, so we need to convert them\n",
"# into integers first with LabelEncoder. This can be omitted if the are already integers.\n",
"# We define here the categorical columns have non-integer values, so we need to convert them\n",
"# into integers first with LabelEncoder. This can be omitted if they are already integers.\n",
"\n",
"label_enc = LabelEncoder()\n",
"label_sex = label_enc.fit_transform(df['Sex'])\n",