mirror of https://github.com/gsi-upm/sitc synced 2026-03-03 02:08:17 +00:00

Updated to Pandas 3.X and corrected typos

cif
2026-03-02 17:40:58 +01:00
parent 5c440527ac
commit 65da5ae714
8 changed files with 105 additions and 135 deletions


@@ -39,9 +39,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we are going to train a classifier with the preprocessed Titanic dataset. \n",
"In this notebook, we will train a classifier on the preprocessed Titanic dataset. \n",
"\n",
"We are going to use the dataset we obtained in the [pandas munging notebook](3_3_Data_Munging_with_Pandas.ipynb) for simplicity. You can try some of the techniques learnt in the previous notebook."
"We will use the dataset we obtained in the [pandas munging notebook](3_3_Data_Munging_with_Pandas.ipynb) for simplicity. You can try some of the techniques learnt in the previous notebook."
]
},
{
@@ -63,7 +63,7 @@
"\n",
"from pandas import Series, DataFrame\n",
"\n",
"# Training and test spliting\n",
"# Training and test splitting\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn import preprocessing\n",
"\n",
@@ -100,29 +100,33 @@
"metadata": {},
"outputs": [],
"source": [
"#We get a URL with raw content (not HTML one)\n",
"#We get a URL with raw content (not an HTML one)\n",
"url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
"df = pd.read_csv(url)\n",
"df.head()\n",
"\n",
"\n",
"#Fill missing values\n",
"df['Age'].fillna(df['Age'].mean(), inplace=True)\n",
"df['Sex'].fillna('male', inplace=True)\n",
"df['Embarked'].fillna('S', inplace=True)\n",
"# --- Fill missing values ---\n",
"## Age: fill with mean\n",
"df['Age'] = df['Age'].fillna(df['Age'].mean())\n",
"\n",
"# Encode categorical variables\n",
"df['Age'] = df['Age'].fillna(df['Age'].median())\n",
"df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
"df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
"df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
"df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
"df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
"## Sex: fill missing with 'male'\n",
"df['Sex'] = df['Sex'].fillna('male')\n",
"\n",
"# Drop colums\n",
"df.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)\n",
"## Embarked: fill missing with 'S'\n",
"df['Embarked'] = df['Embarked'].fillna('S')\n",
"\n",
"#Show proprocessed df\n",
"# --- Encode categorical variables ---\n",
"# Sex: male=0, female=1\n",
"df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})\n",
"\n",
"# Embarked: S=0, C=1, Q=2\n",
"df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})\n",
"\n",
"# --- Drop unnecessary columns ---\n",
"df = df.drop(columns=['Cabin', 'Ticket', 'Name'])\n",
"\n",
"#Show preprocessed df\n",
"df.head()"
]
},
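Note on the hunk above: the rewrite replaces `fillna(..., inplace=True)` on a column selection and the `df.loc[...] = 0` encoding with plain reassignment and `map()`. Under the copy-on-write behaviour of recent pandas versions, an in-place call on `df['Age']` acts on a temporary object and may leave `df` unchanged, and writing integers into a string column may be rejected. The sketch below reproduces the new pattern on a tiny in-memory frame; the `toy` frame and its values are made up for illustration and are not the Titanic data.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (values invented for this sketch).
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Sex": ["male", "female", None],
    "Embarked": ["S", None, "Q"],
    "Cabin": [None, "C85", None],
})

# Fill missing values by assigning the result back to the column,
# instead of calling fillna(..., inplace=True) on a column selection.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["Sex"] = toy["Sex"].fillna("male")
toy["Embarked"] = toy["Embarked"].fillna("S")

# Encode categories with map(), which returns a new numeric column
# rather than writing integers into a string column row by row.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Drop a column by name and keep the returned frame.
toy = toy.drop(columns=["Cabin"])

print(toy)
```

One caveat of `map()` with a dictionary: any category missing from the dictionary becomes `NaN`, so it is worth checking `toy.isna().sum()` after encoding.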
@@ -239,7 +243,7 @@
"metadata": {},
"outputs": [],
"source": [
"#This step will take some time \n",
"# This step will take some time \n",
"# Train - This is not needed if you use K-Fold\n",
"\n",
"model.fit(X_train, y_train)\n",
@@ -447,7 +451,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"ROC curve helps to select a threshold to balance sensitivity and recall."
"The ROC curve helps to select a threshold to balance sensitivity and specificity."
]
},
{
@@ -484,7 +488,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, the thresdhold to decide a class is 0.5, If we modify it, we should use the new thresdhold.\n",
"By default, the threshold to decide a class is 0.5. If we modify it, we should use the new threshold.\n",
"\n",
"threshold = 0.8\n",
"\n",
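Note on the threshold cell above: a minimal sketch of how a custom threshold such as 0.8 could be applied to predicted probabilities. The `proba` array is hypothetical; in the notebook it would presumably come from something like `model.predict_proba(X_test)[:, 1]`, which is an assumption, since that cell is not part of this diff. Using `>=` rather than `>` at the boundary is also a choice made for this sketch.

```python
import numpy as np

# Hypothetical positive-class probabilities (invented for this sketch).
proba = np.array([0.10, 0.55, 0.79, 0.80, 0.95])

threshold = 0.8

# Labels with the usual 0.5 cut-off versus the stricter custom threshold.
pred_default = (proba >= 0.5).astype(int)
pred_custom = (proba >= threshold).astype(int)

print(pred_default)
print(pred_custom)
```

Raising the threshold turns borderline positives into negatives, so any metric reported afterwards should be computed from the thresholded labels rather than from the model's default predictions.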
@@ -524,7 +528,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is alternative to splitting the dataset into train and test. It will run k times slower than the other method, but it will be more accurate."
"This is an alternative to splitting the dataset into training and test sets. It will run k times slower than the other method, but it will give a more accurate estimate."
]
},
{
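Note on the K-Fold cell above: a minimal sketch of k-fold evaluation with scikit-learn, which fits the model once per fold and therefore costs roughly k times a single train/test run. The synthetic data, the decision tree, and `n_splits=5` are assumptions made for this sketch; the notebook's own model and fold count are not visible in this diff.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the preprocessed Titanic features (assumption).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Any scikit-learn classifier would work here; a shallow tree keeps the sketch fast.
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Five folds: the model is refitted five times, each time scored on a held-out fold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

Reporting the mean together with the per-fold spread gives a steadier estimate than a single split, at the cost of the extra fits.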
@@ -555,7 +559,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can plot the [learning curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). The traning scores decreases with the number of samples. The cross-validation reaches the training score at the end. It seems we will not get a better result with more samples."
"We can plot the [learning curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). The training score decreases as the number of samples increases. The cross-validation score reaches the training score at the end. It seems we will not get a better result with more samples."
]
},
{
@@ -578,7 +582,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section we are going to provide an alternative version of the previous one with optimization"
"In this section, we will provide an alternative version of the previous one with optimization."
]
},
{
@@ -628,7 +632,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Any value in the blue survived while anyone in the red did not. Checkout the graph for the linear transformation. It created its decision boundary right on 50%! "
"Any value in the blue survived, while anyone in the red did not. Check out the graph of the linear transformation. It created its decision boundary right on 50%! "
]
},
{
@@ -658,7 +662,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
@@ -680,7 +684,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
"version": "3.12.2"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
@@ -701,5 +705,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 1
"nbformat_minor": 4
}