"This is an introduction of general ideas about machine learning and the interface of scikit-learn, taken from the [scikit-learn tutorial](http://www.astroml.org/sklearn_tutorial/general_concepts.html). \n",
"You can skip it during the lab session and read it later,"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Machine learning algorithms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Machine learning algorithms are programs that learn a model from a dataset with the aim of making predictions or learning structures to organize the data.\n",
"\n",
"In scikit-learn, machine learning algorithms take as an input a *numpy* array (n_samples, n_features), where\n",
"* **n_samples**: number of samples. Each sample is an item to process (i.e. classify). A sample can be a document, a picture, a sound, a video, a row in database or CSV file, or whatever you can describe with a fixed set of quantitative traits.\n",
"* **n_features**: The number of features or distinct traits that can be used to describe each item in a quantitative manner.\n",
"The number of features should be defined in advance. There is a specific type of feature sets that are high dimensional (e.g. millions of features), but most of the values are zero for a given sample. Using (numpy) arrays, all those values that are zero would also take up memory. For this reason, these feature sets are often represented with sparse matrices (scipy.sparse) instead of (numpy) arrays.\n",
"The first step in machine learning is **identifying the relevant features** from the input data, and the second step is **extracting the features** from the input data. \n",
"\n",
"[Machine learning algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/) can be classified according to learning style into:\n",
"* **Supervised learning**: input data (training dataset) has a known label or result. Example problems are classification and regression. A model is prepared through a training process where it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.\n",
"* **Unsupervised learning**: input data is not labeled. A model is prepared by deducing structures present in the input data. This may be to extract general rules. Example problems are clustering, dimensionality reduction and association rule learning.\n",
"* **Semi-supervised learning**:i nput data is a mixture of labeled and unlabeled examples. There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions. Example problems are classification and regression."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Supervised machine learning model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In *supervised machine learning models*, the machine learning algorithm takes as an input a training dataset, composed of feature vectors and labels, and produces a predictive model which is used for make prediction on new data.\n",
"![](files/images/plot_ML_flow_chart_1.png)"
]
},
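{
"cell_type": "markdown",
"metadata": {},
"source": [
"The flow above might look like the following in scikit-learn. This is only a sketch: the iris dataset, the `LogisticRegression` estimator and the 25% hold-out split are assumptions chosen for illustration, not something prescribed by the tutorial.\n",
"\n",
"```python\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Feature vectors X of shape (n_samples, n_features) and labels y.\n",
"X, y = load_iris(return_X_y=True)\n",
"\n",
"# Hold out some samples to play the role of new data.\n",
"X_train, X_new, y_train, y_new = train_test_split(\n",
"    X, y, test_size=0.25, random_state=0\n",
")\n",
"\n",
"# Learn a predictive model from the training dataset.\n",
"model = LogisticRegression(max_iter=1000)\n",
"model.fit(X_train, y_train)\n",
"\n",
"# Predict labels, and per-class probabilities, for the new data.\n",
"y_pred = model.predict(X_new)\n",
"proba = model.predict_proba(X_new)\n",
"\n",
"print(y_pred[:5])\n",
"print(proba.shape)\n",
"```\n",
"\n",
"Each row of `proba` holds one probability per class, and `y_pred` contains one predicted label per held-out sample."
]
},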
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Unsupervised machine learning model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In *unsupervised machine learning models*, the machine learning model algorithm takes as an input the feature vectors and produces a predictive model that is used to fit its parameters so as to best summarize regularities found in the data.\n",
" * **model.fit()**: fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).\n",
"\n",
"* Available in *supervised estimators*:\n",
" * **model.predict()**: given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.\n",
" * **model.predict_proba()**: For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().\n",
"\n",
"* Available in *unsupervised estimators*:\n",
" * **model.transform()**: given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.\n",
" * **model.fit_transform()**: some estimators implement this method, which performs a fit and a transform on the same input data.\n",
"* [General concepts of machine learning with scikit-learn](http://www.astroml.org/sklearn_tutorial/general_concepts.html)\n",
"* [A Tour of Machine Learning Algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",