Updated to Pandas 3.X and corrected typos

2026-04-16 22:58:16 +00:00 · 2026-03-02 17:40:58 +01:00
parent 5c440527ac
commit 65da5ae714
8 changed files with 105 additions and 135 deletions
--- a/ml2/3_7_SVM.ipynb
+++ b/ml2/3_7_SVM.ipynb
@@ -39,9 +39,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "In this notebook we are going to train a classifier with the preprocessed Titanic dataset. \n",
+    "In this notebook, we will train a classifier on the preprocessed Titanic dataset. \n",
    "\n",
-    "We are going to use the dataset we obtained in the [pandas munging notebook](3_3_Data_Munging_with_Pandas.ipynb) for simplicity. You can try some of the techniques learnt in the previous notebook."
+    "We will use the dataset we obtained in the [pandas munging notebook](3_3_Data_Munging_with_Pandas.ipynb) for simplicity. You can try some of the techniques learnt in the previous notebook."
   ]
  },
  {
@@ -63,7 +63,7 @@
    "\n",
    "from pandas import Series, DataFrame\n",
    "\n",
-    "# Training and test spliting\n",
+    "# Training and test splitting\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn import preprocessing\n",
    "\n",
@@ -100,29 +100,33 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "#We get a URL with raw content (not HTML one)\n",
+    "# We get a URL with raw content (not an HTML one)\n",
    "url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
    "df = pd.read_csv(url)\n",
    "df.head()\n",
    "\n",
    "\n",
-    "#Fill missing values\n",
-    "df['Age'].fillna(df['Age'].mean(), inplace=True)\n",
-    "df['Sex'].fillna('male', inplace=True)\n",
-    "df['Embarked'].fillna('S', inplace=True)\n",
+    "# --- Fill missing values ---\n",
+    "## Age: fill with mean\n",
+    "df['Age'] = df['Age'].fillna(df['Age'].mean())\n",
    "\n",
-    "# Encode categorical variables\n",
-    "df['Age'] = df['Age'].fillna(df['Age'].median())\n",
-    "df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
-    "df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
-    "df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
-    "df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
-    "df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
+    "## Sex: fill missing with 'male'\n",
+    "df['Sex'] = df['Sex'].fillna('male')\n",
    "\n",
-    "# Drop colums\n",
-    "df.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)\n",
+    "## Embarked: fill missing with 'S'\n",
+    "df['Embarked'] = df['Embarked'].fillna('S')\n",
    "\n",
-    "#Show proprocessed df\n",
+    "# --- Encode categorical variables ---\n",
+    "# Sex: male=0, female=1\n",
+    "df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})\n",
+    "\n",
+    "# Embarked: S=0, C=1, Q=2\n",
+    "df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})\n",
+    "\n",
+    "# --- Drop unnecessary columns ---\n",
+    "df = df.drop(columns=['Cabin', 'Ticket', 'Name'])\n",
+    "\n",
+    "# Show preprocessed df\n",
    "df.head()"
   ]
  },
@@ -239,7 +243,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "#This step will take some time \n",
+    "# This step will take some time \n",
    "# Train - This is not needed if you use K-Fold\n",
    "\n",
    "model.fit(X_train, y_train)\n",
@@ -447,7 +451,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "ROC curve helps to select a threshold to balance sensitivity and recall."
+    "The ROC curve helps to select a threshold to balance sensitivity and specificity."
   ]
  },
  {
@@ -484,7 +488,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "By default, the thresdhold to decide a class is 0.5, If we modify it, we should use the new thresdhold.\n",
+    "By default, the threshold to decide a class is 0.5. If we modify it, we should use the new threshold.\n",
    "\n",
    "threshold = 0.8\n",
    "\n",
@@ -524,7 +528,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "This is alternative to splitting the dataset into train and test. It will run k times slower than the other method, but it will be more accurate."
+    "This is an alternative to splitting the dataset into training and test sets. It will run about k times slower than a single split but give a more reliable estimate of the model's performance."
   ]
  },
  {
@@ -555,7 +559,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "We can plot the [learning curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). The traning scores decreases with the number of samples. The cross-validation reaches the training score at the end. It seems we will not get a better result with more samples."
+    "We can plot the [learning curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). The training score decreases as the number of samples increases. The cross-validation score reaches the training score at the end. It seems we will not get a better result with more samples."
   ]
  },
  {
@@ -578,7 +582,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "In this section we are going to provide an alternative version of the previous one with optimization"
+    "In this section, we will provide an alternative version of the previous one with optimization."
   ]
  },
  {
@@ -628,7 +632,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Any value in the blue survived while anyone in the red did not. Checkout the graph for the linear transformation. It created its decision boundary right on 50%! "
+    "Any value in the blue survived, while anyone in the red did not. Check out the graph of the linear transformation. It created its decision boundary right on 50%! "
   ]
  },
  {
@@ -658,7 +662,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
+    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
@@ -680,7 +684,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.8.12"
+   "version": "3.12.2"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
@@ -701,5 +705,5 @@
  }
 },
 "nbformat": 4,
- "nbformat_minor": 1
+ "nbformat_minor": 4
 }