Mirror of https://github.com/gsi-upm/sitc, synced 2026-03-02 17:58:16 +00:00
Add files via upload
Updated to pandas 3.X and corrected typos
This commit is contained in: committed by GitHub. Parent: 8e9d3cfdad. Commit: 722da8fc6c.
@@ -63,9 +63,9 @@
  "source": [
   "[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is loosely the process of manually converting or mapping data from one \"raw\" form (*datos en bruto*) into another format that allows for more convenient consumption of the data with the help of semi-automated tools.\n",
   "\n",
-  "*Scikit-learn* estimators which assume that all values are numerical. This is a common in many machine learning libraries. So, we need to preprocess our raw dataset. \n",
+  "*Scikit-learn* estimators which assume that all values are numerical. This is a common feature in many machine learning libraries. So, we need to preprocess our raw dataset. \n",
   "Some of the most common tasks are:\n",
-  "* Remove samples with missing values or replace the missing values with a value (median, mean or interpolation)\n",
+  "* Remove samples with missing values or replace the missing values with a value (median, mean, or interpolation)\n",
   "* Encode categorical variables as integers\n",
   "* Combine datasets\n",
   "* Rename variables and convert types\n",
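The preprocessing tasks listed in this hunk (encode categoricals, combine datasets, rename and convert types) can each be sketched in a few lines of pandas. A minimal illustration with invented toy frames, not the Titanic data:

```python
import pandas as pd

# Toy frames; all column names are invented for the example
passengers = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"]})
fares = pd.DataFrame({"PassengerId": [1, 2], "Fare": [7.25, 71.28]})

# Encode a categorical variable as integers
passengers["Sex"] = passengers["Sex"].map({"male": 0, "female": 1})

# Combine datasets on a shared key
combined = pd.merge(passengers, fares, on="PassengerId")

# Rename variables and convert types
combined = combined.rename(columns={"Fare": "FarePounds"}).astype({"Sex": int})
print(combined.columns.tolist())  # ['PassengerId', 'Sex', 'FarePounds']
```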
@@ -73,7 +73,7 @@
   "\n",
   "We are going to play again with the Titanic dataset to practice with Pandas Dataframes and introduce a number of preprocessing facilities of scikit-learn.\n",
   "\n",
-  "First we load the dataset and we get a dataframe."
+  "First, we load the dataset, and we get a dataframe."
  ]
 },
 {
@@ -120,7 +120,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "We see some features have a numerical type (int64 and float64), and others has a type *object*. The object type is a String in Pandas. We observe that most features are integers, except for Name, Sex, Ticket, Cabin and Embarked."
+  "We see some features have a numerical type (int64 and float64), and others have a type *object*. The object type in Pandas is a String. We observe that most features are integers, except for Name, Sex, Ticket, Cabin, and Embarked."
  ]
 },
 {
@@ -129,7 +129,7 @@
  "metadata": {},
  "outputs": [],
  "source": [
-  "# We can list non numerical properties, with a boolean indexing of the Series df.dtypes\n",
+  "# We can list non-numerical properties, with a boolean indexing of the Series df.dtypes\n",
   "df.dtypes[df.dtypes == object]"
  ]
 },
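The boolean indexing over `df.dtypes` used in this hunk works because `df.dtypes` is itself a Series indexed by column name. A self-contained check on a toy frame (the string column is forced to `object` dtype so the comparison is version-independent):

```python
import pandas as pd

df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": pd.Series(["Braund", "Heikkinen"], dtype=object),
    "Fare": [7.25, 7.92],
})

# Boolean mask over the dtypes Series selects the non-numerical columns
non_numeric = df.dtypes[df.dtypes == object]
print(list(non_numeric.index))  # ['Name']
```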
@@ -164,7 +164,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "Observe that some of the statistics do not make sense in some columns (PassengerId or Pclass), we could have selected only the interesting columns."
+  "Observe that some of the statistics do not make sense in some columns (PassengerId or Pclass); we could have selected only the interesting columns."
  ]
 },
 {
@@ -181,7 +181,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "## Selecting rows in a DataFrame."
+  "## Selecting rows in a DataFrame"
  ]
 },
 {
@@ -281,14 +281,13 @@
  "source": [
   "DataFrames provide a set of functions for selection that we will need later\n",
   "\n",
-  "\n",
-  "|Operation | Syntax | Result |\n",
-  "|-----------------------------|\n",
-  "|Select column | df[col] | Series |\n",
-  "|Select row by label | df.loc[label] | Series |\n",
-  "|Select row by integer location | df.iloc[loc] | Series |\n",
-  "|Slice rows\t | df[5:10]\t | DataFrame |\n",
-  "|Select rows by boolean vector | df[bool_vec] | DataFrame |"
+  "| Operation | Syntax | Result |\n",
+  "|--------------------------------|--------------|----------|\n",
+  "| Select column | `df[col]` | Series |\n",
+  "| Select row by label | `df.loc[label]` | Series |\n",
+  "| Select row by integer location | `df.iloc[loc]` | Series |\n",
+  "| Slice rows | `df[5:10]` | DataFrame |\n",
+  "| Select rows by boolean vector | `df[bool_vec]` | DataFrame |"
  ]
 },
 {
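The selection operations in the table above can be exercised on a toy frame (data and labels invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

col = df["a"]           # Series: select column
row_lab = df.loc[3]     # Series: select row by label (labels are integers here)
row_pos = df.iloc[3]    # Series: select row by integer position
sl = df[5:10]           # DataFrame: slice rows
mask = df[df["a"] > 7]  # DataFrame: select rows by boolean vector
print(len(sl), len(mask))  # 5 2
```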
@@ -423,7 +422,7 @@
  "metadata": {},
  "outputs": [],
  "source": [
-  "# Mean age, SibSp , Survived of passengers older than 25 which survived, grouped by Passenger Class and Sex \n",
+  "# Mean age, SibSp, Survived of passengers older than 25 who survived, grouped by Passenger Class and Sex \n",
   "df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].mean()"
  ]
 },
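One caveat about the filter in the cell above: in Python, `&` binds more tightly than `>`, so each comparison needs its own parentheses or the expression is evaluated as `df.Age > (25 & ...)`. A minimal sketch of the intended pattern on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [30, 40, 22, 35],
    "Survived": [1, 0, 1, 1],
    "Pclass": [1, 1, 2, 2],
})

# Parenthesize each condition; `&` has higher precedence than `>`
adults = df[(df.Age > 25) & (df.Survived == 1)]
result = adults.groupby("Pclass")["Age"].mean()
print(result.to_dict())  # {1: 30.0, 2: 35.0}
```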
@@ -433,7 +432,7 @@
  "metadata": {},
  "outputs": [],
  "source": [
-  "# We can also decide which function apply in each column\n",
+  "# We can also decide which function applies in each column\n",
   "\n",
   "#Show mean Age, mean SibSp, and number of passengers older than 25 that survived, grouped by Passenger Class and Sex\n",
   "df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].agg({'Age': np.mean, \n",
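The per-column aggregation dictionary shown above, sketched on invented data; string aggregation names are used instead of passing `np.mean` directly, which newer pandas versions warn about:

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 2],
    "Age": [30.0, 40.0, 28.0],
    "Survived": [1, 1, 1],
})

# A dict maps each column to its own aggregation function
out = df.groupby("Pclass").agg({"Age": "mean", "Survived": "sum"})
print(out.loc[1, "Age"])  # 35.0
```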
@@ -470,7 +469,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "Now we want to analyze multi-index, the percentage of survivoers, given sex and age, and distributed by Pclass."
+  "Now we want to analyze multi-index, the percentage of survivors, given sex and age, and distributed by Pclass."
  ]
 },
 {
@@ -581,7 +580,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "In this case there not duplicates. In case we would needed, we could have removed them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
+  "In this case, there are no duplicates. In case we would need, we could have removed them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
  ]
 },
 {
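`drop_duplicates` as discussed above; the `subset` argument restricts which columns identify a duplicate. A toy check:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Ann", "Bob"], "Age": [30, 31, 40]})

full = df.drop_duplicates()                    # uses all columns: no full-row duplicates
by_name = df.drop_duplicates(subset=["Name"])  # only the first 'Ann' row is kept
print(len(full), len(by_name))  # 3 2
```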
@@ -597,7 +596,7 @@
  "source": [
   "Here we check how many null values there are.\n",
   "\n",
-  "We use sum() instead of count() or we would get the total number of records). Notice how we do not use size() now, either. You can print 'df.isnull()' and will see a DataFrame with boolean values."
+  "We use sum() instead of count(), or we would get the total number of records. Notice how we do not use size() now, either. You can print 'df.isnull()' and will see a DataFrame with boolean values."
  ]
 },
 {
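The distinction drawn above between `sum()` and `count()` on `df.isnull()`: `sum()` adds up the `True` values per column, while `count()` counts non-NA entries, and the booleans produced by `isnull()` are never NA. A toy check:

```python
import pandas as pd

df = pd.DataFrame({"Age": [22.0, None, 35.0], "Name": ["a", "b", "c"]})

nulls = df.isnull()          # DataFrame of booleans
print(nulls["Age"].sum())    # 1 -> number of missing Age values
print(nulls["Age"].count())  # 3 -> total number of records
```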
@@ -626,7 +625,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "Most of samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how=all* that deletes a sample if all the values are missing, instead of the default *how=any*."
+  "Most of the samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how=all* that deletes a sample if all the values are missing, instead of the default *how=any*."
  ]
 },
 {
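`dropna` with the default `how='any'` versus `how='all'`, as described above, on toy rows:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, None, None], "B": [2.0, 3.0, None]})

print(len(df.dropna()))           # 1 -> drops rows with ANY missing value
print(len(df.dropna(how="all")))  # 2 -> drops only the all-missing row
```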
@@ -654,13 +653,11 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "Observe that the Passenger with 889 has now an Agent of 28 (median) instead of NaN. \n",
+  "Observe that the Passenger with 889 now has an Agent of 28 (median) instead of NaN. \n",
   "\n",
   "Regarding the column *cabins*, there are still NaN values, since the *Cabin* column is not numeric. We will see later how to change it.\n",
   "\n",
-  "In addition, we could drop rows with any or all null values (method *dropna()*).\n",
-  "\n",
-  "If we want to modify the* df* object directly, we should add the parameter *inplace* with value *True*."
+  "In addition, we could drop rows with any or all null values (method *dropna()*)."
  ]
 },
 {
@@ -669,7 +666,7 @@
  "metadata": {},
  "outputs": [],
  "source": [
-  "df['Age'].fillna(df['Age'].mean(), inplace=True)\n",
+  "df['Age'] = df['Age'].fillna(df['Age'].mean())\n",
   "df[-5:]"
  ]
 },
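The change in this hunk (assigning the result back instead of `inplace=True`) follows the Copy-on-Write behaviour of recent pandas: calling `fillna(..., inplace=True)` on a column selection operates on a temporary object, so the assignment form is the reliable one. A sketch on invented values:

```python
import pandas as pd

df = pd.DataFrame({"Age": [22.0, None, 35.0]})

# Assign the result back rather than mutating a column selection in place
df["Age"] = df["Age"].fillna(df["Age"].median())
print(df["Age"].tolist())  # [22.0, 28.5, 35.0]
```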
@@ -700,7 +697,7 @@
  "metadata": {},
  "outputs": [],
  "source": [
-  "# There are not labels for rows, so we use the numeric index\n",
+  "# There are no labels for rows, so we use the numeric index\n",
   "df.iloc[889]"
  ]
 },
@@ -762,7 +759,7 @@
  "metadata": {},
  "outputs": [],
  "source": [
-  "df['Sex'].fillna('male', inplace=True)\n",
+  "df['Sex'] = df['Sex'].fillna('male')\n",
   "df[-5:]"
  ]
 },
@@ -780,27 +777,27 @@
  "metadata": {},
  "source": [
   "\n",
-  "**Scikit-learn** provides also a preprocessing facility for managing null values in the [**Imputer**](http://scikit-learn.org/stable/modules/preprocessing.html) class. We can include *Imputer* as a step in the *Pipeline*."
+  "**Scikit-learn** also provides a preprocessing facility for managing null values in the [**Imputer**](http://scikit-learn.org/stable/modules/preprocessing.html) class. We can include *Imputer* as a step in the *Pipeline*."
  ]
 },
 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "# Analysing non numerical columns"
+  "# Analysing non-numerical columns"
  ]
 },
 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "As we saw, we have several non numerical columns: **Name**, **Sex**, **Ticket**, **Cabin** and **Embarked**.\n",
+  "As we saw, we have several non-numerical columns: **Name**, **Sex**, **Ticket**, **Cabin**, and **Embarked**.\n",
   "\n",
   "**Name** and **Ticket** do not seem informative.\n",
   "\n",
   "Regarding **Cabin**, most values were missing, so we can ignore it. \n",
   "\n",
-  "**Sex** and **Embarked** are categorical features, so we will encode as integers."
+  "**Sex** and **Embarked** are categorical features, so we will encode them as integers."
  ]
 },
 {
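The Imputer facility mentioned above lives in current scikit-learn as `sklearn.impute.SimpleImputer` (the old `sklearn.preprocessing.Imputer` was removed in later releases). A hedged sketch of its use on a tiny array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])

# Replace NaN with the column mean; this estimator also fits into a Pipeline as a step
imp = SimpleImputer(strategy="mean")
print(imp.fit_transform(X).ravel())  # [1. 2. 3.]
```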
@@ -811,7 +808,7 @@
  "source": [
   "# We remove Cabin and Ticket. We should specify the axis\n",
   "# Use axis 0 for dropping rows and axis 1 for dropping columns\n",
-  "df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n",
+  "df = df.drop(['Cabin', 'Ticket'], axis=1)\n",
   "df[-5:]"
  ]
 },
@@ -826,7 +823,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "*Sex* has been codified as a categorical feature. It is better to encode features as continuous variables, since scikit-learn estimators expect continuous input, and they would interpret the categories as being ordered, which is not the case. "
+  "*Sex* has been codified as a categorical feature. It is better to encode features as continuous variables, since scikit-learn estimators expect continuous input and would interpret the categories as ordered, which is not the case. "
  ]
 },
 {
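When integer codes would wrongly imply an order, as the cell above warns, a common alternative is one-hot encoding, for example with `pd.get_dummies` (a sketch, not the notebook's own approach):

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q"]})

# One column per category; no spurious ordering between the codes
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked")
print(sorted(dummies.columns))  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```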
@@ -835,7 +832,7 @@
  "metadata": {},
  "outputs": [],
  "source": [
-  "#First we check if there is any null values. Observe the use of any()\n",
+  "#First, we check if there are any null values. Observe the use of any()\n",
   "df['Sex'].isnull().any()"
  ]
 },
@@ -862,17 +859,40 @@
  "metadata": {},
  "outputs": [],
  "source": [
-  "df[\"Sex\"] = df["\Sex\"].map({\"male\": 0, \"female\": 1})\n",
+  "df[\"Sex\"] = df[\"Sex\"].map({\"male\": 0, \"female\": 1})\n",
   "df[-5:]"
  ]
 },
+{
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+},
+{
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+  "#An alternative is to create a new column with the encoded values and define a mapping\n",
+  "df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n"
+ ]
+},
+{
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+},
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "#An alternative is to create a new column with the encoded valuesm and define a mapping\n",
   "df = df_original.copy()\n",
   "df['Gender'] = df['Sex'].map( {'male': 0, 'female': 1} ).astype(int)\n",
   "df.head()"
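This hunk shows two routes for encoding a categorical column: `.map` with a dict, and label-based assignment with `.loc`. Both are sketched below on a toy column (forced to `object` dtype so the integer assignment is version-independent); note that with `.map`, values absent from the dict become NaN:

```python
import pandas as pd

df = pd.DataFrame({"Sex": pd.Series(["male", "female", "male"], dtype=object)})

# Route 1: map with a dict (non-destructive; unmapped values become NaN)
encoded = df["Sex"].map({"male": 0, "female": 1})

# Route 2: label-based assignment with .loc, category by category
df.loc[df["Sex"] == "male", "Sex"] = 0
df.loc[df["Sex"] == "female", "Sex"] = 1

print(encoded.tolist())  # [0, 1, 0]
```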
@@ -926,7 +946,7 @@
  "outputs": [],
  "source": [
   "#Replace nulls with the most common value\n",
-  "df['Embarked'].fillna('S', inplace=True)\n",
+  "df['Embarked'] = df['Embarked'].fillna('S')\n",
   "df['Embarked'].isnull().any()"
  ]
 },
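The cell above hard-codes 'S' as the most common value; the mode can also be computed rather than assumed, as in this sketch:

```python
import pandas as pd

s = pd.Series(["S", "C", None, "S", "Q"])

# mode() ignores NaN and returns the most frequent value(s)
most_common = s.mode()[0]
s = s.fillna(most_common)
print(s.isnull().any())  # False
```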
@@ -936,13 +956,18 @@
  "metadata": {},
  "outputs": [],
  "source": [
-  "# Now we replace as previously the categories with integers\n",
-  "df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
-  "df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
-  "df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
+  "# Now we replace, as previously, the categories with integers\n",
+  "df[\"Embarked\"] = df[\"Embarked\"].map({\"S\": 0, \"C\": 1, \"Q\": 2})\n",
   "df[-5:]"
  ]
 },
+{
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+},
 {
  "cell_type": "markdown",
  "metadata": {},
@@ -984,7 +1009,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
-  "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
+  "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
   "\n",
   "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
  ]
@@ -1006,7 +1031,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
-  "version": "3.11.5"
+  "version": "3.12.2"
 },
 "latex_envs": {
  "LaTeX_envs_menu_present": true,
@@ -1027,5 +1052,5 @@
  }
 },
 "nbformat": 4,
-"nbformat_minor": 1
+"nbformat_minor": 4
}