This notebook is a short tutorial on using gsitk: managing datasets, extracting features, and evaluating classifiers.

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
import logging

logging.basicConfig(level=logging.DEBUG)
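Setting the root logger to DEBUG also surfaces messages from every other library that logs (the `gensim.utils` lines further down are an example). A possible alternative, sketched here with the standard `logging` module only, keeps the root logger quieter and enables DEBUG just for the `gsitk` logger hierarchy. The logger names are taken from the output shown in this notebook; whether this suits your setup is a judgment call.

```python
import logging

# Keep third-party libraries at WARNING by default.
logging.basicConfig(level=logging.WARNING)

# Enable verbose output only for loggers under the "gsitk" namespace,
# e.g. "gsitk.datasets.datasets" and "gsitk.features.utils".
logging.getLogger("gsitk").setLevel(logging.DEBUG)

# Explicitly quiet gensim, whose model loading logs at INFO.
logging.getLogger("gensim").setLevel(logging.WARNING)
```

Child loggers inherit the effective level of their nearest configured ancestor, which is why configuring `gsitk` is enough for all of its submodules.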

# Datasets

Dataset management is simple. You can list the available datasets:

In [14]:
from gsitk.datasets.datasets import DatasetManager

dm = DatasetManager()
dm.view_datasets()

- sentiment140:
    	 Downloaded: True
    	 # instances: 1600000




Preparing the data:

In [15]:
data = dm.prepare_datasets()

DEBUG:gsitk.datasets.datasets:Preparing data: sentiment140
DEBUG:gsitk.datasets.utils:Checking data path: /data/sentiment140
DEBUG:gsitk.datasets.utils:Verified: trainingandtestdata.zip
DEBUG:gsitk.datasets.datasets:sentiment140 data is ready


In [16]:
data.keys()

dict_keys(['sentiment140'])

Each entry in `data` is a pandas DataFrame.

In [17]:
data['sentiment140'].head()

   polarity                                               text
0        -1  ['user', 'url', 'aw', 'elong', ',', 'thats', '...
1        -1  ['is', 'upset', 'that', 'he', 'cant', 'update'...
2        -1  ['user', 'i', 'dived', 'many', 'times', 'for',...
3        -1  ['my', 'whole', 'body', 'feels', 'itchy', 'and...
4        -1  ['user', 'no', ',', 'its', 'not', 'behaving', ...
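Because the prepared data is an ordinary DataFrame, the usual pandas operations apply. The sketch below uses a made-up three-row frame with the same two columns (`polarity`, `text`) rather than the real dataset. It assumes that each `text` cell holds a list of tokens and that positive examples are labelled `1`; neither is confirmed by the output above.

```python
import pandas as pd

# Toy stand-in for data['sentiment140']: same columns, invented rows.
toy = pd.DataFrame({
    "polarity": [-1, -1, 1],
    "text": [
        ["user", "url", "aw"],
        ["is", "upset", "that"],
        ["what", "a", "great", "day"],
    ],
})

# Class balance: number of rows per polarity label.
class_counts = {
    int(label): int(count)
    for label, count in toy["polarity"].value_counts().items()
}

# Number of tokens in each row (assumes each cell is a list of tokens).
token_counts = toy["text"].apply(len).tolist()
```

The same two lines run unchanged on the full DataFrame and give a quick check of label balance and document length.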


# Preprocessing

# Features

To use a word2vec model as a feature extractor:

In [20]:
from gsitk.features.word2vec import Word2VecFeatures

w2v_feat = Word2VecFeatures(w2v_model_path='/data/w2vmodel_500d_5mc')

INFO:gensim.utils:loading Word2Vec object from /data/w2vmodel_500d_5mc
INFO:gensim.utils:loading syn0 from /data/w2vmodel_500d_5mc.syn0.npy with mmap=None
INFO:gensim.utils:loading syn1 from /data/w2vmodel_500d_5mc.syn1.npy with mmap=None
INFO:gensim.utils:setting ignored attribute syn0norm to None
INFO:gensim.utils:setting ignored attribute cum_table to None
INFO:gensim.utils:loaded /data/w2vmodel_500d_5mc


Features are extracted with the `transform` method, which every feature extractor implements.

In [48]:
transformed = w2v_feat.transform(data['sentiment140']['text'].values)
transformed.shape

(1600000, 500)
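The output has one 500-dimensional row per document. A common way to get a fixed-size document vector from word embeddings is to average the vectors of the tokens the model knows. The sketch below illustrates that idea with a hypothetical 3-dimensional embedding table; it is an assumption about the general technique, not a description of what `Word2VecFeatures` does internally.

```python
import numpy as np

# Hypothetical 3-dimensional embedding table; a real model would supply these.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}


def average_vectors(tokens, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


docs = [["good", "day"], ["unknown"], ["good", "good", "zzz"]]

# Stack one row per document, giving an array of shape (n_docs, dim).
toy_features = np.vstack([average_vectors(d, toy_vectors, 3) for d in docs])
```

Here `toy_features` has shape `(3, 3)`, mirroring the `(n_documents, n_dimensions)` layout of `transformed`.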

If feature extraction is time-consuming, you can save the features locally:

In [59]:
from gsitk.features import features

features.save_features(transformed, 'w2v__sentiment40')

And you can load them later:

In [29]:
from gsitk.features import utils

utils.load_features('w2v__sentiment')

DEBUG:gsitk.features.utils:Reading features from w2v__sentiment
DEBUG:gsitk.features.utils:Features are in /data/features/w2v__sentiment40.npy


array([[-0.03798573,  0.03630935,  0.08243822, ..., -0.0287797 ,
         0.00937027,  0.21814214],
       [-0.06142361, -0.03791333,  0.18094143, ...,  0.00306141,
         0.08196757,  0.02467711],
       [-0.03798573,  0.03630935,  0.08243822, ..., -0.0287797 ,
         0.00937027,  0.21814214],
       ..., 
       [-0.03798573,  0.03630935,  0.08243822, ..., -0.0287797 ,
         0.00937027,  0.21814214],
       [-0.03798573,  0.03630935,  0.08243822, ..., -0.0287797 ,
         0.00937027,  0.21814214],
       [-0.03798573,  0.03630935,  0.08243822, ..., -0.0287797 ,
         0.00937027,  0.21814214]])
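Saving and loading can be combined into a "compute once, reuse afterwards" pattern. The sketch below shows the idea with plain NumPy (`np.save` and `np.load`) and a temporary directory. The helper `load_or_compute` and the feature name are hypothetical; this is not how gsitk implements its feature storage.

```python
import os
import tempfile

import numpy as np


def load_or_compute(name, compute, cache_dir):
    """Return cached features if present; otherwise compute and save them."""
    path = os.path.join(cache_dir, name + ".npy")
    if os.path.exists(path):
        return np.load(path)
    values = compute()
    np.save(path, values)
    return values


calls = []


def expensive():
    """Stand-in for a slow feature extraction step."""
    calls.append(1)
    return np.arange(6, dtype=float).reshape(2, 3)


with tempfile.TemporaryDirectory() as cache_dir:
    first = load_or_compute("toy_features", expensive, cache_dir)   # computes
    second = load_or_compute("toy_features", expensive, cache_dir)  # reads file
```

The second call never runs `expensive`, which is the point of caching features that take a long time to extract.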

# Pipes and Evaluation

The evaluation process uses pipes, which organize the different elements of an evaluation. Pipes are represented by EvalTuples, which specify the datasets, features, and classifiers to evaluate.

To include a classifier in the evaluation, train it and then wrap it in a `Model`:

In [49]:
from gsitk.pipe import Model, Features, EvalTuple
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(n_jobs=-1)
sgd.fit(transformed, data['sentiment140']['polarity'].values)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=-1,
       penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False)

In [50]:
models = [Model(name='sgd', classifier=sgd)]

Including features:

In [51]:
feats = [Features(name='w2v__sentiment140', dataset='sentiment140', values=transformed)]

Putting them together:

In [52]:
ets = [EvalTuple(classifier='sgd', features='w2v__sentiment140', labels='sentiment140')]
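Note that the EvalTuple refers to the model, the features, and the dataset by name rather than by object. The sketch below illustrates how such name-based wiring could be resolved, using hypothetical `Toy*` namedtuples that only mirror the fields used in this tutorial. It is a minimal illustration of the idea, not gsitk's implementation.

```python
from collections import namedtuple

# Hypothetical stand-ins mirroring the fields used in this tutorial.
ToyModel = namedtuple("ToyModel", ["name", "classifier"])
ToyFeatures = namedtuple("ToyFeatures", ["name", "dataset", "values"])
ToyEvalTuple = namedtuple("ToyEvalTuple", ["classifier", "features", "labels"])


def resolve(tuples, models, features, datasets):
    """Replace the names in each tuple with the objects they refer to."""
    models_by_name = {m.name: m for m in models}
    feats_by_name = {f.name: f for f in features}
    resolved = []
    for t in tuples:
        resolved.append((
            models_by_name[t.classifier].classifier,
            feats_by_name[t.features].values,
            datasets[t.labels],
        ))
    return resolved


toy_models = [ToyModel(name="sgd", classifier="fitted-classifier")]
toy_feats = [ToyFeatures(name="w2v", dataset="toy", values=[[0.1, 0.2]])]
toy_datasets = {"toy": [1]}
toy_tuples = [ToyEvalTuple(classifier="sgd", features="w2v", labels="toy")]

resolved = resolve(toy_tuples, toy_models, toy_feats, toy_datasets)
```

A misspelled name fails with a `KeyError` at lookup time, so it is worth keeping the names in the three lists consistent.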

Running the evaluation:

In [57]:
from gsitk.evaluation.evaluation import Evaluation

ev = Evaluation(datasets=data, features=feats, models=models, tuples=ets)

In [58]:
ev.evaluate()
ev.results

DEBUG:gsitk.evaluation.evaluation:Model sgd predicting from features w2v__sentiment140


        Dataset           Features Model  Accuracy  Precision    Recall  F1-Score
0  sentiment140  w2v__sentiment140   sgd  0.589035   0.577554  0.663056  0.617359
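For reference, the four reported metrics can be computed from true and predicted labels as in the pure-Python sketch below. It treats `1` as the positive class, which is an assumption; how gsitk averages across classes is not stated in this notebook. The last lines check that the reported F1-Score is numerically consistent with the harmonic mean of the reported precision and recall.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Toy labels: one true positive, one false positive, one false negative.
toy_scores = binary_metrics([1, 1, -1, -1], [1, -1, 1, -1])

# Harmonic mean of the precision and recall reported in the table above.
reported_precision, reported_recall = 0.577554, 0.663056
f1_from_reported = (
    2 * reported_precision * reported_recall
    / (reported_precision + reported_recall)
)
```

`f1_from_reported` agrees with the tabulated 0.617359 to within rounding.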
