mirror of
https://github.com/gsi-upm/sitc
synced 2026-03-03 02:08:17 +00:00
Updated to Pandas 3.X and corrected typos
@@ -39,9 +39,9 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "In this notebook we are going to train a classifier with the preprocessed Titanic dataset. \n",
+ "In this notebook, we will train a classifier on the preprocessed Titanic dataset. \n",
  "\n",
- "We are going to use the dataset we obtained in the [pandas munging notebook](3_3_Data_Munging_with_Pandas.ipynb) for simplicity. You can try some of the techniques learnt in the previous notebook."
+ "We will use the dataset we obtained in the [pandas munging notebook](3_3_Data_Munging_with_Pandas.ipynb) for simplicity. You can try some of the techniques learnt in the previous notebook."
  ]
  },
  {
@@ -63,7 +63,7 @@
  "\n",
  "from pandas import Series, DataFrame\n",
  "\n",
- "# Training and test spliting\n",
+ "# Training and test splitting\n",
  "from sklearn.model_selection import train_test_split\n",
  "from sklearn import preprocessing\n",
  "\n",
@@ -100,29 +100,33 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "#We get a URL with raw content (not HTML one)\n",
+ "#We get a URL with raw content (not an HTML one)\n",
  "url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
  "df = pd.read_csv(url)\n",
  "df.head()\n",
  "\n",
  "\n",
- "#Fill missing values\n",
- "df['Age'].fillna(df['Age'].mean(), inplace=True)\n",
- "df['Sex'].fillna('male', inplace=True)\n",
- "df['Embarked'].fillna('S', inplace=True)\n",
+ "# --- Fill missing values ---\n",
+ "## Age: fill with mean\n",
+ "df['Age'] = df['Age'].fillna(df['Age'].mean())\n",
  "\n",
- "# Encode categorical variables\n",
- "df['Age'] = df['Age'].fillna(df['Age'].median())\n",
- "df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
- "df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
- "df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
- "df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
- "df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
+ "## Sex: fill missing with 'male'\n",
+ "df['Sex'] = df['Sex'].fillna('male')\n",
  "\n",
- "# Drop colums\n",
- "df.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)\n",
+ "## Embarked: fill missing with 'S'\n",
+ "df['Embarked'] = df['Embarked'].fillna('S')\n",
  "\n",
- "#Show proprocessed df\n",
+ "# --- Encode categorical variables ---\n",
+ "# Sex: male=0, female=1\n",
+ "df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})\n",
+ "\n",
+ "# Embarked: S=0, C=1, Q=2\n",
+ "df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})\n",
+ "\n",
+ "# --- Drop unnecessary columns ---\n",
+ "df = df.drop(columns=['Cabin', 'Ticket', 'Name'])\n",
+ "\n",
+ "#Show preprocessed df\n",
  "df.head()"
  ]
  },
@@ -239,7 +243,7 @@
  "metadata": {},
  "outputs": [],
  "source": [
- "#This step will take some time \n",
+ "# This step will take some time \n",
  "# Train - This is not needed if you use K-Fold\n",
  "\n",
  "model.fit(X_train, y_train)\n",
@@ -447,7 +451,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "ROC curve helps to select a threshold to balance sensitivity and recall."
+ "The ROC curve helps to select a threshold to balance sensitivity and recall."
  ]
  },
  {
@@ -484,7 +488,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "By default, the thresdhold to decide a class is 0.5, If we modify it, we should use the new thresdhold.\n",
+ "By default, the threshold to decide a class is 0.5. If we modify it, we should use the new threshold.\n",
  "\n",
  "threshold = 0.8\n",
  "\n",
@@ -524,7 +528,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "This is alternative to splitting the dataset into train and test. It will run k times slower than the other method, but it will be more accurate."
+ "This is an alternative to splitting the dataset into training and test sets. It will run k times slower than the other method but be more accurate."
  ]
  },
  {
@@ -555,7 +559,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "We can plot the [learning curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). The traning scores decreases with the number of samples. The cross-validation reaches the training score at the end. It seems we will not get a better result with more samples."
+ "We can plot the [learning curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). The training scores decrease as the number of samples increases. The cross-validation reaches the training score at the end. It seems we will not get a better result with more samples."
  ]
  },
  {
@@ -578,7 +582,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "In this section we are going to provide an alternative version of the previous one with optimization"
+ "In this section, we will provide an alternative version of the previous one with optimization"
  ]
  },
  {
@@ -628,7 +632,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Any value in the blue survived while anyone in the red did not. Checkout the graph for the linear transformation. It created its decision boundary right on 50%! "
+ "Any value in the blue survived, while anyone in the red did not. Check out the graph of the linear transformation. It created its decision boundary right on 50%! "
  ]
  },
  {
@@ -658,7 +662,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
+ "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
  "\n",
  "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
  ]
@@ -680,7 +684,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.8.12"
+ "version": "3.12.2"
  },
  "latex_envs": {
  "LaTeX_envs_menu_present": true,
@@ -701,5 +705,5 @@
  }
  },
  "nbformat": 4,
- "nbformat_minor": 1
+ "nbformat_minor": 4
  }
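
Note on the preprocessing change: with Copy-on-Write, which pandas 3 enables by default, `df['Age'].fillna(..., inplace=True)` fills a temporary object and leaves `df` unchanged, so the commit assigns the filled column back instead. The commit does not state why the `df.loc` encoding was replaced by `map`; presumably it avoids writing integers into a string-typed column. A minimal sketch of the new pattern follows, using a made-up three-row frame (an assumption for illustration, not the Titanic CSV):

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic CSV (values are made up).
df = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Sex": ["male", "female", None],
    "Embarked": ["S", None, "Q"],
})

# Assign the filled column back to the frame rather than calling
# fillna(inplace=True) on the selection df["Age"].
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Sex"] = df["Sex"].fillna("male")
df["Embarked"] = df["Embarked"].fillna("S")

# map() returns a new column of integer codes in a single step.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})

print(df)
```

Any category missing from the mapping dictionary would become `NaN` after `map`, so it is worth checking `df.isna().sum()` after encoding.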