pull/3/head
cif2cif 6 years ago
commit 31603bda6a

@@ -65,9 +65,10 @@
"source": [
"This section covers different ways to inspect the distribution of samples per feature.\n",
"\n",
-"First of all, let's take a see how many samples of each class we have, using a [histogram](https://en.wikipedia.org/wiki/Histogram). \n",
+"First of all, let's see how many samples of each class we have, using a [histogram](https://en.wikipedia.org/wiki/Histogram). \n",
"\n",
"A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable). \n",
"A histogram is a graphical representation of the distribution of numerical data. It is an estimation of the probability distribution of a continuous variable (quantitative variable). \n",
"\n",
"For building a histogram, we need first to 'bin' the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. \n",
"\n",
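Note on the binning step described in this hunk: a minimal pure-Python sketch of it, assuming equal-width bins. The helper name `bin_counts` and the sample values are hypothetical and are not taken from the notebook.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: put everything in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


sample = [1, 2, 2, 3, 5, 8, 9, 10]  # hypothetical data
counts = bin_counts(sample, 3)
print(counts)
```

The counts are what a histogram draws as bar heights; a plotting library would normally do the binning and the drawing in one call.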
@@ -151,7 +152,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"We see we have the same distribution of samples for every class.\n",
+"As can be seen, we have the same distribution of samples for every class.\n",
"The next step is to see the distribution of the features"
]
},

@@ -50,7 +50,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"The goal of this notebook is to learn how separate the dataset into training and test datasets and then preprocess the data."
+"The goal of this notebook is to learn how to split the dataset into a training and a test datasets and then preprocess the data."
]
},
{
@@ -78,7 +78,7 @@
"source": [
"A common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the **training set** on which we learn data properties and one that we call the **testing set** on which we test these properties. \n",
"\n",
-"We are going to use *scikit-learn* to split the data into random training and testing sets. We follow the ration 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)."
+"We are going to use *scikit-learn* to split the data into random training and testing sets. We follow the ratio 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)."
]
},
{
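Note on the 75%/25% split described in this hunk: a minimal sketch using scikit-learn's `train_test_split`. The toy data and the seed value `33` are assumptions made for illustration; the notebook's own dataset and seed are not visible in this diff.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with one feature and a binary label.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# test_size=0.25 keeps 75% of the samples for training and 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33
)

# Calling again with the same random_state reproduces the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.25, random_state=33
)

print(len(X_train), len(X_test))
print(X_test == X_test2)
```

Leaving `random_state` unset would shuffle differently on each run, which is the behaviour the parenthetical in the text warns about.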

@@ -126,7 +126,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"scikit-learn has a uniform interface for all the estimators, some methods are only available is the estimator is supervised or unsupervised:\n",
+"scikit-learn has a uniform interface for all the estimators, some methods are only available if the estimator is supervised or unsupervised:\n",
"\n",
"* Available in *all estimators*:\n",
" * **model.fit()**: fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).\n",

@@ -54,7 +54,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"The goal of this notebook is to learn how to learn how create a classification object using a [decision tree learning algorithm](https://en.wikipedia.org/wiki/Decision_tree_learning). \n",
+"The goal of this notebook is to learn how to create a classification object using a [decision tree learning algorithm](https://en.wikipedia.org/wiki/Decision_tree_learning). \n",
"\n",
"There are a number of well known machine learning algorithms for decision tree learning, such as ID3, C4.5, C5.0 and CART. The scikit-learn uses an optimised version of the [CART (Classification and Regression Trees) algorithm](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees).\n",
"\n",
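Note on the classifier object mentioned in this hunk: a minimal sketch of creating and fitting a decision tree with scikit-learn's `DecisionTreeClassifier`, following the supervised `fit(X, y)` convention. The two-sample training set and `random_state=0` are assumptions chosen to keep the example tiny; they are not the notebook's data or settings.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training set: two samples, two features, two classes.
X = [[0.0, 0.0], [1.0, 1.0]]
y = [0, 1]

# Create the classifier object and fit it on the labelled data.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Predict the class of an unseen sample.
predicted = model.predict([[2.0, 2.0]])
print(predicted)
```

With real data the model would be fitted on the training set only and evaluated on the held-out testing set.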
