Mirror of https://github.com/gsi-upm/sitc, synced 2024-11-17 20:12:28 +00:00
Commit 92bde106fc
@@ -65,9 +65,10 @@
 "source": [
 "This section covers different ways to inspect the distribution of samples per feature.\n",
 "\n",
-"First of all, let's take a see how many samples of each class we have, using a [histogram](https://en.wikipedia.org/wiki/Histogram). \n",
+"First of all, let's see how many samples of each class we have, using a [histogram](https://en.wikipedia.org/wiki/Histogram). \n",
+"\n",
 "\n",
-"A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable). \n",
+"A histogram is a graphical representation of the distribution of numerical data. It is an estimation of the probability distribution of a continuous variable (quantitative variable). \n",
 "\n",
 "For building a histogram, we need first to 'bin' the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. \n",
 "\n",
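The binning-and-counting procedure described in this cell can be sketched in a few lines of NumPy; the synthetic data and the choice of 10 bins below are assumptions of this illustration, not part of the notebook:

```python
import numpy as np

# Illustrative data: 200 draws from a standard normal distribution.
values = np.random.default_rng(0).normal(size=200)

# 'Bin' the range of values into intervals, then count how many fall into each.
counts, bin_edges = np.histogram(values, bins=10)
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"[{left:6.2f}, {right:6.2f}) -> {count} values")
```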
@@ -151,7 +152,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We see we have the same distribution of samples for every class.\n",
+"As can be seen, we have the same distribution of samples for every class.\n",
 "The next step is to see the distribution of the features"
 ]
 },
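As a hedged sketch of the step this cell announces (inspecting the distribution of the features), the following assumes the Iris dataset as a stand-in; the dataset choice is an assumption of the example, not confirmed by the diff:

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # feature columns plus a 'target' class column

# Samples per class: confirms the balanced distribution noted above.
print(df['target'].value_counts())

# Per-feature distributions: summary statistics for each column.
print(df.describe())
```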
@@ -50,7 +50,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The goal of this notebook is to learn how separate the dataset into training and test datasets and then preprocess the data."
+"The goal of this notebook is to learn how to split the dataset into a training and a test dataset and then preprocess the data."
 ]
 },
 {
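A minimal sketch of the split-then-preprocess workflow stated as the goal above; `StandardScaler` is one common preprocessing choice and is an assumption of this example, not necessarily the one the notebook uses:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn the preprocessing parameters on the training set only...
scaler = StandardScaler().fit(X_train)
# ...and apply the same transformation to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```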
@@ -78,7 +78,7 @@
 "source": [
 "A common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the **training set** on which we learn data properties and one that we call the **testing set** on which we test these properties. \n",
 "\n",
-"We are going to use *scikit-learn* to split the data into random training and testing sets. We follow the ration 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)."
+"We are going to use *scikit-learn* to split the data into random training and testing sets. We follow the ratio 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)."
 ]
 },
 {
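The split described in this cell can be sketched as follows; the Iris data and `random_state=0` are illustrative assumptions (the notebook may use a different dataset and seed):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 75% for training, 25% for testing; a fixed random_state makes the
# split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)  # (112, 4) and (38, 4) for 150 samples
```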
@@ -126,7 +126,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"scikit-learn has a uniform interface for all the estimators, some methods are only available is the estimator is supervised or unsupervised:\n",
+"scikit-learn has a uniform interface for all the estimators; some methods are only available if the estimator is supervised or unsupervised:\n",
 "\n",
 "* Available in *all estimators*:\n",
 " * **model.fit()**: fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).\n",
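The uniform `fit()` interface described in this cell can be illustrated with one supervised and one unsupervised estimator; both concrete estimators below are illustrative picks, not prescribed by the notebook:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised: fit takes the data X and the labels y.
supervised = DecisionTreeClassifier().fit(X, y)

# Unsupervised: fit takes only the data X.
unsupervised = KMeans(n_clusters=3, n_init=10).fit(X)
```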
@@ -54,7 +54,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The goal of this notebook is to learn how to learn how create a classification object using a [decision tree learning algorithm](https://en.wikipedia.org/wiki/Decision_tree_learning). \n",
+"The goal of this notebook is to learn how to create a classification object using a [decision tree learning algorithm](https://en.wikipedia.org/wiki/Decision_tree_learning). \n",
 "\n",
 "There are a number of well known machine learning algorithms for decision tree learning, such as ID3, C4.5, C5.0 and CART. scikit-learn uses an optimised version of the [CART (Classification and Regression Trees) algorithm](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees).\n",
 "\n",
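A minimal sketch of training scikit-learn's CART-based decision tree as introduced above; `max_depth=3` and the evaluation on a held-out split are assumptions of this example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scikit-learn's DecisionTreeClassifier implements an optimised CART.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # mean accuracy on the test set
```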