Monday, May 23, 2016

Evaluate the Performance of Machine Learning Algorithms in Python using Resampling

You need to know how well your algorithms perform on unseen data.

The best way to evaluate the performance of an algorithm would be to make predictions for new data to which you already know the answers. The second best way is to use clever techniques from statistics called resampling methods that allow you to make accurate estimates for how well your algorithm will perform on new data.

In this post you will discover how you can estimate the accuracy of your machine learning algorithms using resampling methods in Python and scikit-learn.

Let’s get started.

Photo by Doug Waldron, some rights reserved.

About The Recipes

Resampling methods are demonstrated in this post using small code recipes in Python.

Each recipe is designed to be standalone so that you can copy-and-paste it into your project and use it immediately.

The Pima Indians onset of diabetes dataset is used in each recipe. This is a binary classification problem where all of the input variables are numeric. In each recipe it is downloaded directly from the UCI Machine Learning repository. You can replace it with your own dataset as needed.

Evaluate Your Machine Learning Algorithms

Why can’t you simply train your machine learning algorithm on your dataset and use predictions from that same dataset to evaluate its performance?

The simple answer is overfitting.

Imagine an algorithm that remembers every observation it is shown. If you evaluated your machine learning algorithm on the same dataset used to train the algorithm, then an algorithm like this would have a perfect score on the training dataset. But the predictions it made on new data would be terrible.

We must evaluate our machine learning algorithms on data that is not used to train the algorithm.

The evaluation is an estimate that we can use to talk about how well we think the algorithm may actually do in practice. It is not a guarantee of performance.

Once we estimate the performance of our algorithm, we can then re-train the final algorithm on the entire training dataset and get it ready for operational use.
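
As a minimal sketch of this overall workflow, the snippet below estimates performance with cross validation and then refits the final model on all of the data. It follows the same scikit-learn API used in the recipes below; a synthetic dataset from make_classification stands in for your own data.

# Sketch: estimate performance first, then refit the final model on all data
from sklearn.datasets import make_classification
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

X, Y = make_classification(n_samples=500, n_features=8, random_state=7)
model = LogisticRegression()
scores = cross_validation.cross_val_score(model, X, Y, cv=10)  # performance estimate
print("Estimated accuracy: %.3f%%" % (scores.mean()*100.0))
model.fit(X, Y)  # final model trained on the entire dataset for operational use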

Next up we are going to look at four different techniques that we can use to split up our training dataset and create useful estimates of performance for our machine learning algorithms:

  1. Train and Test Sets.
  2. K-fold Cross Validation.
  3. Leave One Out Cross Validation.
  4. Repeated Random Test-Train Splits.

We will start with the simplest method called Train and Test Sets.

1. Split into Train and Test Sets

The simplest method that we can use to evaluate the performance of a machine learning algorithm is to use different training and testing datasets.

We can take our original dataset and split it into two parts: train the algorithm on the first part, make predictions on the second part, and evaluate the predictions against the expected results.

The size of the split can depend on the size and specifics of your dataset, although it is common to use 67% of the data for training and the remaining 33% for testing.

This algorithm evaluation technique is very fast. It is ideal for large datasets (millions of records) where there is strong evidence that both splits of the data are representative of the underlying problem. Because of the speed, it is useful to use this approach when the algorithm you are investigating is slow to train.

A downside of this technique is that it can have a high variance. This means that differences in the training and test dataset can result in meaningful differences in the estimate of accuracy.

In the example below we split the Pima Indians dataset into a 67%/33% train/test split and evaluate the accuracy of a Logistic Regression model.

# Evaluate using a train and a test set
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%") % (result*100.0)

We can see that the estimated accuracy for the model was approximately 75%. Note that in addition to specifying the size of the split, we also specify the random seed. Because the split of the data is random, we want to ensure that the results are reproducible. By specifying the random seed we ensure that we get the same random numbers each time we run the code.

This is important if we want to compare this result to the estimated accuracy of another machine learning algorithm, or the same algorithm with a different configuration. To make the comparison apples-to-apples, we must ensure that both are trained and tested on exactly the same data; a sketch of such a comparison follows the output below.

Accuracy: 75.591%
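
To make such a comparison concrete, here is a minimal sketch that evaluates two different algorithms on exactly the same train/test split by reusing the same test_size and random_state. A synthetic dataset stands in for the Pima Indians data used in the recipes.

# Sketch: apples-to-apples comparison of two algorithms on the same split
from sklearn.datasets import make_classification
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, Y = make_classification(n_samples=768, n_features=8, random_state=7)
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size=0.33, random_state=7)
for name, model in [("LR", LogisticRegression()), ("KNN", KNeighborsClassifier())]:
    model.fit(X_train, Y_train)
    print("%s accuracy: %.3f%%" % (name, model.score(X_test, Y_test)*100.0))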

2. K-fold Cross Validation

Cross validation is an approach that you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test set split.

It works by splitting the dataset into k-parts (e.g. k=5 or k=10). Each split of the data is called a fold. The algorithm is trained on k-1 folds with one held back and tested on the held back fold. This is repeated so that each fold of the dataset is given a chance to be the held back test set.

After running cross validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.

The result is a more reliable estimate of the performance of the algorithm on new data. It is more accurate because the algorithm is trained and evaluated multiple times on different data.

The choice of k must allow the size of each test partition to be large enough to be a reasonable sample of the problem, whilst allowing enough repetitions of the train-test evaluation of the algorithm to provide a fair estimate of the algorithm's performance on unseen data. For modest-sized datasets in the thousands or tens of thousands of records, k values of 3, 5 and 10 are common.

In the example below we use 10-fold cross validation.

# Evaluate using Cross Validation
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LogisticRegression()
results = cross_validation.cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)") % (results.mean()*100.0, results.std()*100.0)

You can see that we report both the mean and the standard deviation of the performance measure. When summarizing performance measures, it is a good practice to summarize the distribution of the measures, in this case assuming a Gaussian distribution of performance (a very reasonable assumption) and recording the mean and standard deviation.

Accuracy: 76.951% (4.841%)

3. Leave One Out Cross Validation

You can configure cross validation so that the size of the fold is 1 (k is set to the number of observations in your dataset). This variation of cross validation is called leave-one-out cross validation.

The result is a large number of performance measures that can be summarized in an effort to give a more reasonable estimate of the accuracy of your model on unseen data. A downside is that it can be a computationally more expensive procedure than k-fold cross validation.

In the example below we use leave-one-out cross validation.

# Evaluate using Leave One Out Cross Validation
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_instances = len(X)
loocv = cross_validation.LeaveOneOut(n=num_instances)
model = LogisticRegression()
results = cross_validation.cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)") % (results.mean()*100.0, results.std()*100.0)

You can see in the standard deviation that the score has more variance than the k-fold cross validation results described above.

Accuracy: 76.823% (42.196%)

4. Repeated Random Test-Train Splits

Another variation on k-fold cross validation is to create a random split of the data like the train/test split described above, but repeat the process of splitting and evaluation of the algorithm multiple times, like cross validation.

This has the speed of using a train/test split and the reduction in variance in the estimated performance of k-fold cross validation. You can also repeat the process many more times as needed. A downside is that repetitions may include much of the same data in the train or the test split from run to run, introducing redundancy into the evaluation.

The example below splits the data into a 67%/33% train/test split and repeats the process 10 times.

# Evaluate using Shuffle Split Cross Validation
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_samples = 10
test_size = 0.33
num_instances = len(X)
seed = 7
shuffle_split = cross_validation.ShuffleSplit(n=num_instances, n_iter=num_samples, test_size=test_size, random_state=seed)
model = LogisticRegression()
results = cross_validation.cross_val_score(model, X, Y, cv=shuffle_split)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

We can see that the distribution of the performance measure is on par with k-fold cross validation above.

Accuracy: 76.496% (1.698%)

What Techniques to Use When

  • Generally k-fold cross validation is the gold-standard for evaluating the performance of a machine learning algorithm on unseen data with k set to 3, 5, or 10.
  • Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.
  • Techniques like leave-one-out cross validation and repeated random splits can be useful intermediates when trying to balance variance in the estimated performance, model training speed and dataset size.

The best advice is to experiment and find a technique for your problem that is fast and produces reasonable estimates of performance that you can use to make decisions. If in doubt, use 10-fold cross validation.
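
Note that the recipes in this post use the sklearn.cross_validation module; later releases of scikit-learn moved the same functionality to sklearn.model_selection. If you are on a newer version, a sketch of the same 10-fold cross validation looks like the following (a synthetic dataset stands in for the Pima Indians data).

# Sketch: 10-fold cross validation with the newer sklearn.model_selection API
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, Y = make_classification(n_samples=768, n_features=8, random_state=7)
model = LogisticRegression()
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))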

Summary

In this post you discovered statistical techniques that you can use to estimate the performance of your machine learning algorithms, called resampling.

Specifically, you learned about:

  1. Train and Test Sets.
  2. Cross Validation.
  3. Leave One Out Cross Validation.
  4. Repeated Random Test-Train Splits.

Do you have any questions about resampling methods or this post? Ask your question in the comments and I will do my best to answer it.


Friday, May 20, 2016

5 Machine Learning Projects You Can No Longer Overlook

We all know the big machine learning projects out there: Scikit-learn, TensorFlow, Theano, etc. But what about the smaller niche projects that are actively developed, providing useful services to users? Here are 5 such projects.

The popular machine learning projects, in general, are popular because they either provide a wide range of needed services or they were the first (or possibly best) to provide a particular niche service to users. These popular projects include Scikit-learn, TensorFlow, Theano, MXNet (maybe?), Weka (formerly), and so on. Depending on the particular ecosystem(s) you work in, and on your machine learning goals, the projects which you consider popular may differ slightly; however, they all share the similarity that they provide services to a large base of users.

But there are all sorts of smaller machine learning projects out there that people are building and using: pipelines, wrappers, high-level APIs, cleaners, etc. They provide both niche and flexible services, usually for smaller numbers of users, for all sorts of reasons. This post will present 5 such smaller projects for readers to familiarize themselves with.

I'm not necessarily suggesting you go out and try all (or any) of these; if there is some particular requirement you are looking to fill which you happen to find a corresponding tool for in this list, then by all means give it a try. The real value here, however, at least in my view, is checking out what the projects offer, how they are implemented, and what others are building to fit into their ecosystems. You might get some good ideas of where to take your own projects. But best case: something here fills a need perfectly and you solve a problem, thanks to the work the developers listed herein are doing.

There are no real criteria for these items, I'm sorry to report. They are simply a collection of interesting projects I have noted over the past few months and thought were promising enough to share with readers. Also note that I am firmly invested in the Python ecosystem, and so these tools have been discovered accordingly. I don't have any bias against any of the projects that R or C++ or any other particular environment has to offer (and may discover and share such projects in future posts); this list came together generically, however, based on my internet wanderings as I searched for useful tools.

So here they are: 5 machine learning projects you should definitely have a look at, in no particular order (but numbered like they are in order, because I like numbering things):

1. Deepy

Deepy is an extensible deep learning framework based on Theano. It provides a clean, high-level interface for components such as LSTMs, Batch Normalization, and Auto Encoders. Deepy clearly aims for simplicity, and its documentation and examples aim for the same. It also has a sister project, which uses Deepy to implement Deep Recurrent Attentive Writer (DRAW) generative models.

For an example of Deepy's simplicity and cleanliness, here's an example of a multi-layer model with dropout, from the project's Github:

# A multi-layer model with dropout for MNIST task.
from deepy import *

model = NeuralClassifier(input_dim=28*28)
model.stack(Dense(256, 'relu'),
            Dropout(0.2),
            Dense(256, 'relu'),
            Dropout(0.2),
            Dense(10, 'linear'),
            Softmax())

trainer = MomentumTrainer(model)

annealer = LearningRateAnnealer(trainer)

mnist = MiniBatches(MnistDataset(), batch_size=20)

trainer.run(mnist, controllers=[annealer])

You may have even heard of Deepy already; its Github repo has 305 stars and has been forked 51 times, as of this writing. The project is a decent exemplar of high-level deep learning APIs and wrappers that are becoming widespread (or seem to be). Deepy is authored by Raphael Shu.

2. MLxtend

Sebastian Raschka has put together MLxtend, something he is quick to point out is a work in progress, but is also something which attempts to tick a number of different boxes. MLxtend is a collection of useful tools and extensions for machine learning tasks.

Sebastian shared the following with me, regarding the project, how it came to be, and its goals:

Essentially, it's just a collection of useful tools and reference implementations related to ML and data science in general. Why did I come up with it? There are a couple of reasons:

  1. Implementations of algorithms that I couldn't find anywhere else (e.g., the Sequential Feature Selection algorithms, the Majority Voting Classifier, the Stacking estimators, plotting decision regions, ...)
  2. Implementations for teaching purposes (logistic regression, softmax regression, multi-layer perceptron, PCA, kernel PCA...); these impl. focus on code readability rather than pure efficiency
  3. Wrappers for convenience: tensorflow softmax regression and multi-layer perceptrons, column-wise standardization for pandas data frames

This is essentially a library of commonly-used general machine learning functions that Sebastian has written and frequently uses. Additionally, Sebastian really likes to code, and thought that if he were to offer this "zoo" of different things (as he refers to it) up to others that he may keep the code "tidier" than usual.

Many of the implemented functions share similarities with scikit-learn's API, but future additional functionality will not necessarily be restricted by this. The big takeaway here: Sebastian promises that there is much more to come... so stay tuned. There's a good chance that any feature or novel algorithm that Sebastian plays with will end up being packaged in MLxtend.
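
As a taste of the library, here is a hedged sketch of sequential forward feature selection with MLxtend. The SequentialFeatureSelector import path, its arguments (k_features, forward, floating, scoring, cv) and the k_feature_idx_/k_score_ attributes are taken from the MLxtend documentation and may differ between versions, so verify them against your installed release.

# Sketch: sequential forward selection with MLxtend (API names assumed from its docs)
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SFS(knn, k_features=2, forward=True, floating=False, scoring='accuracy', cv=5)
sfs = sfs.fit(X, y)
print(sfs.k_feature_idx_)  # indices of the selected features
print(sfs.k_score_)        # cross-validated score of the selected subset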

3. datacleaner

datacleaner is the work of researcher Randal Olson, who is also responsible for the fantastic TPOT machine learning pipeline project. Olson bills Data Cleaner as a "Python tool that automatically cleans data sets and readies them for analysis." He is quick to declare that it is not magic, but also points out what it can do:

What datacleaner will do is save you a ton of time encoding and cleaning your data once it's already in a format that pandas DataFrames can handle.

datacleaner is a work in progress, but is currently capable of handling the following regular (and time-consuming) data cleaning operations: optionally drops rows with missing values; replaces missing values with either mode or median, on a column by column basis; encodes non-numerical variables with numerical equivalents. Randal tells us that he is looking for contributors, especially from those with more ideas on what data cleaning operations datacleaner could perform in an automated fashion.

Randal has an attention to detail that anyone who reads his blog or his Github repos already knows, and the concise documentation for this project is no exception. I have been using datacleaner recently, and so far it delivers on its promises.
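
Usage is about as simple as it sounds. The hedged sketch below uses the autoclean function named in the project's README; the file name is hypothetical, and you should check the README of the version you install for the exact options.

# Sketch: cleaning a pandas DataFrame with datacleaner (autoclean name taken from the README)
import pandas as pd
from datacleaner import autoclean

df = pd.read_csv("my_data.csv")  # hypothetical input file
clean_df = autoclean(df)         # impute missing values and encode non-numeric columns
clean_df.to_csv("my_data_clean.csv", index=False)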

4. auto-sklearn

auto-sklearn is automated machine learning for the Scikit-learn environment.

auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advances in Bayesian optimization, meta-learning and ensemble construction. Learn more about the technology behind auto-sklearn by reading this paper published at NIPS 2015.

Its documentation is quite thorough, and the repo includes a few concise examples. I have, admittedly, not used it yet, but many others have; it has collected nearly 400 stars on Github. Given my propensity for Scikit-learn, I imagine I will try this out in the near future.

auto-sklearn is developed mainly by the Machine Learning for Automated Algorithm Design group at the University of Freiburg.
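
The hedged sketch below follows the usage pattern shown in the auto-sklearn documentation: a drop-in classifier with the familiar fit/predict interface. Class and method names are taken from those docs; note that fit can run for a long time by default because it searches over many pipelines.

# Sketch: auto-sklearn as a drop-in scikit-learn style classifier
import autosklearn.classification
import sklearn.cross_validation
import sklearn.datasets
import sklearn.metrics

digits = sklearn.datasets.load_digits()
X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(digits.data, digits.target, random_state=1)
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)       # searches algorithms and hyperparameters
y_hat = automl.predict(X_test)
print("Accuracy:", sklearn.metrics.accuracy_score(y_test, y_hat))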

5. Deep Mining

Deep Mining is a machine learning pipeline auto-tuner, coming to us from Sebastien Dubois of the CSAIL lab at MIT. From the repo:

This software will test iteratively, and smartly, some hyperparameter sets in order to find as quickly as possible the best ones to achieve the best classification accuracy that a pipeline can offer.

Deep Mining does not seem to be a well-known project, given its relatively modest number of repo stars; however, given that it comes out of CSAIL, and has development activity within the past month, it may be worth benchmarking this against other similar automated pipeline tools. It comes with a few examples, and its usage seems to be straightforward.

More on the methods used:

The folder GCP-HPO contains all the code implementing the Gaussian Copula Process (GCP) and a hyperparameter optimization (HPO) technique based on it. The Gaussian Copula Process can be seen as an improved version of the Gaussian Process that does not assume a Gaussian prior for the marginal distributions but relies on a more complex prior. This technique has been shown to outperform GP-based hyperparameter optimization, which is already far better than randomized search.

A paper on the GCP approach is forthcoming.


Feature Selection For Machine Learning in Python

The data features that you use to train your machine learning models have a huge influence on the performance you can achieve.

Irrelevant or partially relevant features can negatively impact model performance.

In this post you will discover automatic feature selection techniques that you can use to prepare your machine learning data in python with scikit-learn.

Let’s get started.

Photo by Baptiste Lafontaine, some rights reserved.

Feature Selection

Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.

Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.

Three benefits of performing feature selection before modeling your data are:

  • Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
  • Improves Accuracy: Less misleading data means modeling accuracy improves.
  • Reduces Training Time: Less data means that algorithms train faster.

You can learn more about feature selection with scikit-learn in the article Feature selection.

Feature Selection for Machine Learning

This section lists 4 feature selection recipes for machine learning in Python.

This post contains recipes for feature selection methods.

Each recipe was designed to be complete and standalone so that you can copy-and-paste it directly into your project and use it immediately.

Each recipe uses the Pima Indians onset of diabetes dataset to demonstrate the feature selection method. This is a binary classification problem where all of the attributes are numeric.

1. Univariate Selection

Statistical tests can be used to select those features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

The example below uses the chi squared (chi^2) statistical test for non-negative features to select 4 of the best features from the Pima Indians onset of diabetes dataset.

# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])

You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): plas, test, mass and age.

[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393
   181.304]
[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]
 [  89.    94.    28.1   21. ]
 [ 137.   168.    43.1   33. ]]

2. Recursive Feature Elimination

Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain.

It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.

You can learn more about the RFE class in the scikit-learn documentation.

The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.

# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = LogisticRegression()
rfe = RFE(model, 3)
fit = rfe.fit(X, Y)
print("Num Features: %d") % fit.n_features_
print("Selected Features: %s") % fit.support_
print("Feature Ranking: %s") % fit.ranking_

You can see that RFE chose the top 3 features as preg, mass and pedi. These are marked True in the support_ array and given a ranking of 1 in the ranking_ array.

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]

3. Principal Component Analysis

Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.

Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result.

In the example below, we use PCA and select 3 principal components.

Learn more about the PCA class in scikit-learn by reviewing the PCA API. Dive deeper into the math behind PCA on the Principal Component Analysis Wikipedia article.

# Feature Extraction with PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s") % fit.explained_variance_ratio_
print(fit.components_)

You can see that the transformed dataset (3 principal components) bears little resemblance to the source data.

Explained Variance: [ 0.88854663  0.06159078  0.02579012]
[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [  2.26488861e-02   9.72210040e-01   1.41909330e-01  -5.78614699e-02
   -9.46266913e-02   4.69729766e-02   8.16804621e-04   1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]

4. Feature Importance

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.

In the example below we construct an ExtraTreesClassifier for the Pima Indians onset of diabetes dataset. You can learn more about the ExtraTreesClassifier class in the scikit-learn API.

# Feature Importance with Extra Trees Classifier
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
# load data
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

You can see that we are given an importance score for each attribute, where the larger the score, the more important the attribute. The scores suggest the importance of plas, age and mass.

[ 0.11070069  0.2213717   0.08824115  0.08068703  0.07281761  0.14548537 0.12654214  0.15415431]

Summary

In this post you discovered feature selection for preparing machine learning data in Python with scikit-learn.

You learned about 4 different automatic feature selection techniques:

  • Univariate Selection.
  • Recursive Feature Elimination.
  • Principal Component Analysis.
  • Feature Importance.

Do you have any questions about feature selection or this post? Ask your questions in the comments and I will do my best to answer them.


Monday, May 16, 2016

Animated great circles on rotating 3D Earth: Example R code

(This article was first published on rbloggers – SNAP Tech, and kindly contributed to R-bloggers)

Making still frames for great circle animations in R

Matthew Leonawicz

Here I share R code I used to produce animated great circle arcs on top of a rotating 3D Earth. The code is not entirely reproducible but you should be able to use what is shared here to create your own video frames given your unique data and computing environment and resources.

The WordPress blog is not the most elegant for displaying lots of code so go to the original full post.

When I make great circle animations, at the core of the process is always an R function that transforms a series of coordinates describing points along a great circle arc into multiple series of great circle arc segments. The goal is simple: plot a series of line segments, saving each plot as a subsequent still frame, rather than plotting the original entire arc as a single plot. The input is generally a data table (much faster to work with than a data frame if you have a lot of data) with longitude and latitude columns where the coordinates in each row describe a subsequent point along one of my paths. I also use a third column to provide a unique group ID for each path to keep them distinct.

Before getting to this process, here is some example code of how I formatted my data this way when using the geosphere package. Do not get bogged down in the details here. This is just an example for fuller context regarding my specific animation referenced above and I will not be focusing on it. Your data will be quite different. You will have to arrange it similarly, but obviously it will be in a different context.

Data prep example

As you can see below, all you need is a data frame with columns of longitude and latitude. I created a SpatialPoints object from these. In my case I wanted to connect great circle arcs between each of two specific locations and all other locations in my data set. I marked the row indices for these two locations.

library(dplyr)
library(geosphere)
library(sp)          # for SpatialPoints
library(data.table)  # for setnames

load("data.RData")  # a data frame, d, containing 'long' and 'lat' columns
p <- SpatialPoints(cbind(d$long, d$lat), proj4string = CRS("+proj=longlat +datum=WGS84"))
idx1 <- 69  # great circles from coords in all other rows to coords in this row
idx2 <- 2648  # as above

get_paths <- function(x, idx, ...) {
    gcInt <- function(x, x1, x2) {
        x <- gcIntermediate(x[x1, ], x[x2, ], ...)
        if (is.list(x)) {
            x <- x %>% purrr::map2(c(x1, x1 + 0.5), ~data.frame(.x, .y)) %>% 
                bind_rows %>% setnames(c("long", "lat", "group"))
        } else x <- data.frame(x, x1) %>% setnames(c("long", "lat", "group"))
        x
    }
    purrr::map(setdiff(1:length(x), idx), ~gcInt(x, .x, idx)) %>% bind_rows
}

paths1 <- get_paths(p, idx1, addStartEnd = TRUE)
paths2 <- get_paths(p, idx2, addStartEnd = TRUE)

get_paths uses gcIntermediate from the geosphere package to obtain vectors of points defining great circle arcs between my two locations and all others. The resulting data frame has columns long, lat, and group. I do not break the arcs at the dateline because I intend to draw them on a 3D globe, but if I were making a flat map I would do so. This is why the function handles the case of list output from gcIntermediate and adjusts the group ID by adding 0.5 to one group.

Already more than enough details. You can see what I am going for though. Given my data, I want to end up with columns of longitude and latitude defining great circle arcs and broken out by unique group IDs. I show this because it's highly likely you will want to use geosphere in a similar way even if you won't be connecting points in the same way I am here.

Transforming great circle arcs into segments

Okay. You have your data in the right format. Now you want to break up your groups of great circle arcs into a larger number of nested subgroups of great circle arc segments. Let's get some basic prep out of the way.
Here are the packages I am using.

The setup

library(parallel)
library(gridExtra)
library(raster)
library(data.table)
library(dplyr)
library(ggplot2)

eb <- element_blank()
theme_blank <- theme(axis.line = eb, axis.text.x = eb, axis.text.y = eb, axis.ticks = eb, 
    axis.title.x = eb, axis.title.y = eb, legend.position = "none", panel.background = eb, 
    panel.border = eb, panel.grid.major = eb, panel.grid.minor = eb, plot.background = element_rect(colour = "transparent", 
        fill = "transparent"))

world <- map_data("world")

The ggplot2 theme will allow for plotting without all the extraneous stuff like margins and axes and colors I don't want on my Earth. The world map is an aside and has nothing to do with the great circles. I include it here because it was part of my animation. It will be used for plotting the backdrop of nation boundaries on the globe. Having these underneath the great circle animation is helpful for geographic visual orientation.

It's not clear yet, but the reason for including the raster package is because I also use a rasterized map layer of the earth's surface as the bottom layer underneath the nation boundaries. This is also an aside. It doesn't have anything to do with the great circle animation.

Yes, I am using R's built-in parallel package. I happen to do at least the parallelized operations in this project on a Linux server with 32 CPUs and 260 GB RAM. I'm sorry to say but I can't help you if you want to do this on a local Windows PC, for example. If you rework the code for a much more restrictive environment, depending on your data, you may find yourself waiting forever for your output. I just do not recommend doing this type of thing outside of a beefy server environment, at least if time is something you value.

Now, getting to the core of the process for real this time. Here is the actual function I used for the above animation to break great circle arcs into smaller segments.

The main function

df_segs <- function(d, seg.size, n.frames, replicates = 1, direction = "fixed") {
    n <- nrow(d)
    if (n < 3) 
        stop("Data not appropriate for this operation.")
    if (seg.size < 3) 
        stop("Segment size too small.")
    z <- round(runif(2, 2, seg.size))
    z[z > n] <- n
    n1 <- ceiling(diff(c((z[1] - z[2]), n))/z[1])
    if (n.frames - n1 < 100) 
        stop("Insufficient frames")
    offset <- sample(0:(n.frames - n1), replicates)
    
    f <- function(k, d, n, n1, z, offset) {
        ind2 <- z[1] * k
        ind1 <- max(ind2 - z[2], 1)
        if (ind2 > n) 
            ind2 <- n
        d <- slice(d, ind1:ind2)
        purrr::map(offset, ~mutate(d, group = ifelse(replicates == 1, group, 
            group + as.numeric(sprintf(".%d", k))), frameID = .x + k)) %>% bind_rows
    }
    
    if (direction == "reverse") 
        d <- mutate(d, long = rev(long), lat = rev(lat))
    if (direction == "random" && rnorm(1) < 0) 
        d <- mutate(d, long = rev(long), lat = rev(lat))
    d <- purrr::map(1:n1, ~f(.x, d, n, n1, z, offset)) %>% bind_rows %>% arrange(group, 
        frameID)
    d
}

Going through the code, you can see it requires a minimum of 100 desired frames (which would make for quite a short video). You can tweak the maximum number of points in a segment. The minimum is always two. Each segment will vary in length uniformly between the minimum and maximum.

You can leave the data as is, with direction="fixed". This assumes the ordering of points/rows pertaining to each group is intended to show direction. Segments will be assembled in the same order. Alternatively, you can reverse or even randomize the order if it doesn’t matter for the given data.

This is just one example function. You can make your own if this one does not generate the type of segments or provide the kind of random variation in segments you would like.

Let’s go with 900 frames, which will result in about a 30-second video. Recall that in my case I had two different data sets. I combine them in the code snippet below.

n.frames <- 900
set.seed(1)
paths2 <- mutate(paths2, group = group + max(paths1$group))
paths <- bind_rows(paths1, paths2) %>% split(.$group)

Here I make the new table below. It has a subgroup ID column as well as a frame ID column. The group ID is now a decimal: to the left of the decimal is the original great circle ID, and to the right is the sequential segment ID. The latter is random and not all great circle arcs are broken up into the same number of segments. Some cover more rows of the table than others. They do not match up with the frame IDs and some recycling may be required if the number of desired frames is large enough.

paths <- mclapply(paths, df_segs, seg.size = 5, n.frames = n.frames, replicates = 1, 
    direction = "random", mc.cores = 32) %>% bind_rows

Alternatively, code like this will perform the same operation without using parallel processing. This may not be terribly problematic if the data set is not too large. I show this for comparison. But the need for parallel will be much greater in later steps.

paths <- paths %>% split(.$group) %>% purrr::map(~df_segs(.x, 5, n.frames, replicates = 1, 
    direction = "random")) %>% bind_rows

Aside: the background tile layer

Where I got the data

I use a rasterized bathymetry surface based on a csv file I downloaded by using the marmap package. I won’t discuss that here. Please see the marmap package for details and examples if this is important. It is straightforward to use that package to download a local copy of a subregion of the data (or the whole map) at a specified resolution.

Projecting background layer to 3D

The project_to_hemisphere function below (adapted from mapproj here) projects points onto the 3D earth and identifies which ones are within the hemisphere field of view given the centroid focal point. Identifying the half which are out of view in any given frame allows me to toss them out for some added efficiency. This is not actually a slow process. The conversion of coordinates is not complex or demanding. The slow part is drawing the high resolution orthographic projection maps with ggplot2.

d.bath <- read.csv("marmap_coord_-180;-90;180;90_res_10.csv") %>% data.table %>% 
    setnames(c("long", "lat", "z"))
r <- raster(extent(-180, 180, -90, 90), res = 1/6)
projection(r) <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
r <- setValues(r, d.bath$z)
project_to_hemisphere <- function(lat, long, lat0, long0) {
    hold <- c(lat, long)
    x <- (pi/180) * c(lat, lat0, long - long0)
    inview <- sin(x[1]) * sin(x[2]) + cos(x[1]) * cos(x[2]) * cos(x[3]) > 0
    data.table(long = hold[2], lat = hold[1], inview = inview)
}

Prepare for plotting

I want a 120-frame period, which is about four seconds of video, in which time I will complete one rotation of the earth. I define a sequence of longitude values, but this I allow to repeat for the total number of video frames, n.frames, specified earlier. Note that I keep the latitude focus (41 degrees) constant in this animation, but there is no reason this cannot vary as well. In the save_maps function below, there is a hard-coded 23.4-degree fixed orientation, but this is something that can also be made variable.

While the data we are interested in – the great circle arc segments – will plot sequentially over a series of 900 frames, both the nation boundaries and bathymetry surface map backdrops are constant so I only need to produce enough frames (120) for one rotation of each of those layers.

I split the path segments table on the frame ID. I will generate still frames in parallel.

In this animation I had originally chosen to aggregate my 10-minute resolution bathymetry surface from the marmap package because that resolution was a bit too fine and even in parallel it was taking an extremely long time for ggplot2 to draw the orthographic projection background maps. If I understand correctly, it is drawing a ton of tiny, filled polygons. It really grinds to a halt if you have a massive amount of them.

Some setup

n.period <- 120
lon_seq <- rep(seq(0, 360, length.out = n.period + 1)[-(n.period + 1)], length = n.frames)
lat_seq <- rep(41, length(lon_seq))
paths <- paths %>% split(.$frameID)
d.bath.agg <- r %>% aggregate(2) %>% rasterToPoints %>% data.table %>% setnames(c("long", 
    "lat", "z"))

Next, in parallel across frames, I project my rasterized background map cells to the 3D surface and retain only the ones which are in view in a given frame as the earth spins. I use the full range of elevation data to retain a constant color palette mapping as cells at different extreme elevations move in and out of the field of view as the earth rotates. I store the range in z.range. I add a frame ID column. Finally I add a frame ID column for the nation boundaries table. The data is constant, but simply needs to be plotted 120 times in 3-degree shifts.

d.tiles <- mclapply(1:n.period, function(i, x, lon, lat) {
    left_join(x, project_to_hemisphere(x$lat, x$long, lat[i], lon[i])) %>% filter(inview) %>% 
        dplyr::select(-inview) %>% mutate(frameID = i)
}, x = d.bath.agg, lat = lat_seq, lon = lon_seq, mc.cores = 32)

z.range <- purrr::map(d.tiles, ~range(.x$z, na.rm = TRUE)) %>% unlist %>% range
d.world <- purrr::map(1:n.period, ~mutate(world, frameID = .x))

Save plots

I've shown the more important of the two functions for making my great circle animation, df_segs, which I use to transform a table of grouped rows of great circle arc coordinates into a much larger one with nested, grouped, sequential arc segments. The other critical function is save_maps.

Here I have generalized it a bit by adding a type argument. It is primarily used for iterating over the data for each frame in a table produced by df_segs (the default type="network"). This saves a map image of sequential great circle arc segments for each frame.

I added the options, type="maplines" and type="maptiles". The former is for the nation boundaries and the latter is for the rasterized map surface.

I call the function in parallel three times to cover each type of output. I generate 120 frames (one earth rotation in three-degree increments) of both the nation boundaries and the surface tiles. The latter takes by far the longest amount of time to process, far longer than even the 900 frames of network traffic itself.

The other main function

save_maps <- function(x, lon_seq, lat_seq, col = NULL, type = "network", z.range = NULL) {
    if (is.null(col)) 
        col <- switch(type, network = c("#FFFFFF25", "#1E90FF25", "#FFFFFF", 
            "#1E90FF50"), maptiles = c("black", "steelblue4"), maplines = "white")
    i <- x$frameID[1]
    if (type == "network") 
        x.lead <- group_by(x, group) %>% slice(n())
    g <- ggplot(x, aes(long, lat))
    if (type == "maptiles") {
        if (is.null(z.range)) 
            z.range <- range(x$z, na.rm = TRUE)
        g <- ggplot(x, aes(long, lat, fill = z)) + geom_tile() + scale_fill_gradientn(colors = col, 
            limits = z.range)
    } else {
        g <- ggplot(x, aes(long, lat, group = group))
        if (type == "maplines") 
            g <- g + geom_path(colour = col)
        if (type == "network") 
            g <- g + geom_path(colour = col[2]) + geom_path(colour = col[1]) + 
                geom_point(data = x.lead, colour = col[3], size = 0.6) + geom_point(data = x.lead, 
                colour = col[4], size = 0.3)
    }
    g <- g + theme_blank + coord_map("ortho", orientation = c(lat_seq[i], lon_seq[i], 
        23.4))
    dir.create(outDir <- file.path("frames", type), recursive = TRUE, showWarnings = FALSE)
    png(sprintf(paste0(outDir, "/", type, "_%03d.png"), i), width = 4 * 1920, 
        height = 4 * 1080, res = 300, bg = "transparent")
    print(g)
    dev.off()
    NULL
}

mclapply(paths, save_maps, lon_seq, lat_seq, type = "network", mc.cores = 30)
mclapply(d.world, save_maps, lon_seq, lat_seq, type = "maplines", mc.cores = 30)
mclapply(d.tiles, save_maps, lon_seq, lat_seq, type = "maptiles", z.range = z.range, 
    mc.cores = 30)

What about the video?

For any project like this I simply drop each of the three sequences of sequentially numbered png files onto its own timeline track as a still image sequence in a standard video editor. I don't use R for this final step. While I could have plotted everything together on a single sequence of frames, it doesn't make sense to do so even when each sequence has equal length. This is because one sequence may take far longer to plot than another sequence. It is more efficient to make a different image sequence for separate layers and mix them in the editor afterward.

You might also notice the very high pixel dimensions of the png outputs. I do this in cases where I plan to zoom during an animation. The larger the image, the more I can zoom in without degradation. You can imagine that this generates a lot of data when you have thousands of frames and each image is several megabytes.

Conclusion: Don't try this at home

Instead, try something like this on a server with plenty of CPUs and RAM. Otherwise try it with a small data set. Due to the particular computing environment I used and the resources needed to process the frames in an efficient and timely fashion, it is impossible to share this example animation code in a completely reproducible fashion. However, you should be able to use df_segs or some alteration of it with a properly formatted input table to generate the type of table you would want to pass subsequently to save_maps (or some alteration of that as well).

That is the gist of this example and should be enough to get you started. Ultimately, you will have to adapt any code here to your unique data, its size and complexity, what exactly you want to do with it, and your available computing resources.


Crash Course On Multi-Layer Perceptron Neural Networks

Artificial neural networks are a fascinating area of study, although they can be intimidating when just getting started.

There is a lot of specialized terminology used when describing the data structures and algorithms used in the field.

In this post you will get a crash course in the terminology and processes used in the field of multi-layer perceptron artificial neural networks. After reading this post you will know:

  • The building blocks of neural networks including neurons, weights and activation functions.
  • How the building blocks are used in layers to create networks.
  • How networks are trained from example data.

Let’s get started.

Photo by Joe Stump, some rights reserved.

Crash Course Overview

We are going to cover a lot of ground very quickly in this post. Here is an idea of what is ahead:

  1. Multi-Layer Perceptrons.
  2. Neurons, Weights and Activations.
  3. Networks of Neurons.
  4. Training Networks.

We will start off with an overview of multi-layer perceptrons.

1. Multi-Layer Perceptrons

The field of artificial neural networks is often just called neural networks or multi-layer perceptrons after perhaps the most useful type of neural network. A perceptron is a single neuron model that was a precursor to larger neural networks.

It is a field that investigates how simple models of biological brains can be used to solve difficult computational tasks like the predictive modeling tasks we see in machine learning. The goal is not to create realistic models of the brain, but instead to develop robust algorithms and data structures that we can use to model difficult problems.

The power of neural networks comes from their ability to learn the representation in your training data and how to best relate it to the output variable that you want to predict. In this sense neural networks learn a mapping. Mathematically, they are capable of learning any mapping function and have been shown to be universal approximators.

The predictive capability of neural networks comes from the hierarchical or multi-layered structure of the networks. The data structure can pick out (learn to represent) features at different scales or resolutions and combine them into higher-order features. For example from lines, to collections of lines to shapes.

2. Neurons

The building block for neural networks are artificial neurons.

These are simple computational units that have weighted input signals and produce an output signal using an activation function.

Model of a Simple Neuron

Neuron Weights

You may be familiar with linear regression, in which case the weights on the inputs are very much like the coefficients used in a regression equation.

Like linear regression, each neuron also has a bias which can be thought of as an input that always has the value 1.0 and it too must be weighted.

For example, a neuron may have two inputs, in which case it requires three weights: one for each input and one for the bias.

Weights are often initialized to small random values, such as values in the range 0 to 0.3, although more complex initialization schemes can be used.

Like linear regression, larger weights indicate increased complexity and fragility. It is desirable to keep weights in the network small and regularization techniques can be used.

Activation

The weighted inputs are summed and passed through an activation function, sometimes called a transfer function.

An activation function is a simple mapping of summed weighted input to the output of the neuron. It is called an activation function because it governs the threshold at which the neuron is activated and strength of the output signal.

Historically simple step activation functions were used where if the summed input was above a threshold, for example 0.5, then the neuron would output a value of 1.0, otherwise it would output a 0.0.

Traditionally, non-linear activation functions are used. This allows the network to combine the inputs in more complex ways and in turn provide a richer capability in the functions they can model. Non-linear functions like the logistic function (also called the sigmoid), which outputs a value between 0 and 1 with an s-shaped distribution, and the hyperbolic tangent function (tanh), which outputs the same distribution over the range -1 to +1, are commonly used.

More recently the rectifier activation function has been shown to provide better results.
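
Putting the pieces together, a single neuron is only a few lines of code: a weighted sum of the inputs plus a bias, passed through an activation function. The sketch below uses a sigmoid; tanh or a rectifier could be substituted.

# Sketch: a single artificial neuron (weighted sum + bias, then activation)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.5, 0.3])    # two input signals
weights = np.array([0.4, -0.2])  # one weight per input
bias = 0.1                       # weight on a constant input of 1.0

activation = np.dot(weights, inputs) + bias
output = sigmoid(activation)
print(output)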

3. Networks of Neurons

Neurons are arranged into networks of neurons.

A row of neurons is called a layer and one network can have multiple layers. The architecture of the neurons in the network is often called the network topology.

Model of a Simple Network

Input or Visible Layers

The bottom layer that takes input from your dataset is called the visible layer, because it is the exposed part of the network. Often a neural network is drawn with a visible layer with one neuron per input value or column in your dataset. These are not neurons as described above, but simply pass the input value through to the next layer.

Hidden Layers

Layers after the input layer are called hidden layers because they are not directly exposed to the input. The simplest network structure is to have a single neuron in the hidden layer that directly outputs the value.

Given increases in computing power and efficient libraries, very deep neural networks can be constructed. Deep learning can refer to having many hidden layers in your neural network. They are deep because they would have been unimaginably slow to train historically, but may take seconds or minutes to train using modern techniques and hardware.

Output Layer

The final layer is called the output layer and it is responsible for outputting a value or vector of values that correspond to the format required for the problem.

The choice of activation function in the output layer is strongly constrained by the type of problem that you are modeling. For example:

  • A regression problem may have a single output neuron and the neuron may have no activation function.
  • A binary classification problem may have a single output neuron and use a sigmoid activation function to output a value between 0 and 1 to represent the probability of predicting a value for class 1. This can be turned into a crisp class value by using a threshold of 0.5 and snapping values below the threshold to 0 and values at or above it to 1, as shown in the sketch after this list.
  • A multi-class classification problem may have multiple neurons in the output layer, one for each class (e.g. three neurons for the three classes in the famous iris flowers classification problem). In this case a softmax activation function may be used to output a probability of the network predicting each of the class values. Selecting the output with the highest probability can be used to produce a crisp class classification value.
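
The sketch below illustrates both cases with placeholder output values: thresholding a single sigmoid output at 0.5 for binary classification, and taking the highest softmax probability for multi-class classification.

# Sketch: turning raw output-layer values into crisp class predictions
import numpy as np

# Binary case: one sigmoid output in [0, 1], thresholded at 0.5
prob_class_1 = 0.73
crisp_class = 1 if prob_class_1 >= 0.5 else 0
print(crisp_class)

# Multi-class case: softmax over one raw output per class (e.g. three iris classes)
raw_outputs = np.array([1.2, 0.3, -0.8])
probs = np.exp(raw_outputs) / np.sum(np.exp(raw_outputs))
print(probs, probs.argmax())  # class with the highest probability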

4. Training Networks

Once configured, the neural network needs to be trained on your dataset.

Data Preparation

You must first prepare your data for training on a neural network.

Data must be numerical, for example real values. If you have categorical data, such as a sex attribute with the values “male” and “female”, you can convert it to a real-valued representation called a one hot encoding. This is where one new column is added for each class value (two columns in the case of male and female) and a 0 or 1 is added to each row depending on the class value for that row.
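
A minimal sketch of a one hot encoding using pandas (the column and values are illustrative only) might look like this:

# Sketch: one hot encode a categorical column with pandas (illustrative data)
import pandas
frame = pandas.DataFrame({'sex': ['male', 'female', 'female', 'male']})
encoded = pandas.get_dummies(frame['sex'])
print(encoded)  # two columns, 'female' and 'male', with a 0 or 1 per row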

This same one hot encoding can be used on the output variable in classification problems with more than one class. This creates a binary vector from the single column that is easy to compare directly to the output of the network’s output layer which, as described above, outputs one value for each class.

Neural networks require the input to be scaled in a consistent way. You can rescale it to the range between 0 and 1, called normalization. Another popular technique is to standardize it so that the distribution of each column has a mean of zero and a standard deviation of one.
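
A minimal sketch of both approaches using scikit-learn (with made-up data) could look like the following:

# Sketch: normalization and standardization with scikit-learn (illustrative data)
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
normalized = MinMaxScaler().fit_transform(X)      # rescaled to the range 0 to 1
standardized = StandardScaler().fit_transform(X)  # mean of zero, standard deviation of one
print(normalized)
print(standardized)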

Scaling also applies to image pixel data. Data such as words can be converted to integers, for example using the popularity rank of the word in the dataset, or other encoding techniques.

Stochastic Gradient Descent

The classical and still preferred training algorithm for neural networks is called stochastic gradient descent.

This is where one row of data is exposed to the network at a time as input. The network processes the input upwards, activating neurons as it goes, to finally produce an output value. This is called a forward pass on the network. It is the same type of pass that is used after the network is trained in order to make predictions on new data.
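
A minimal sketch of a forward pass for a single row through one hidden layer (the weights, biases and input values below are made up) might look like this:

# Sketch: a forward pass through a network with one hidden layer (illustrative values)
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

row = np.array([0.5, 0.2])                     # one row of input data
hidden_weights = np.array([[0.1, 0.4],         # weights from 2 inputs to 2 hidden neurons
                           [0.8, 0.3]])
hidden_bias = np.array([0.1, 0.1])
output_weights = np.array([0.3, 0.9])          # weights from 2 hidden neurons to 1 output
output_bias = 0.1

hidden_activations = sigmoid(np.dot(row, hidden_weights) + hidden_bias)
output = sigmoid(np.dot(hidden_activations, output_weights) + output_bias)
print(output)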

The output of the network is compared to the expected output and an error is calculated. This error is then propagated back through the network, one layer at a time, and the weights are updated according to the amount that they contributed to the error. This clever bit of math is called the backpropagation algorithm.

The process is repeated for all of the examples in your training data. One round of updating the network for the entire training dataset is called an epoch. A network may be trained for tens, hundreds or many thousands of epochs.

Weight Updates

The weights in the network can be updated from the errors calculated for each training example and this is called online learning. It can result in fast but also chaotic changes to the network.

Alternatively, the errors can be saved up across all of the training examples and the network can be updated at the end. This is called batch learning and is often more stable.

Typically, because datasets are so large and because of computational efficiencies, the size of the batch (the number of examples the network is shown before an update) is often reduced to a small number, such as tens or hundreds of examples.

The amount that weights are updated is controlled by a configuration parameter called the learning rate. It is also called the step size and controls the step or change made to a network weight for a given error. Often small learning rates are used, such as 0.1, 0.01 or smaller.

The update equation can be complemented with additional configuration terms that you can set (a minimal sketch combining them follows the list below):

  • Momentum is a term that incorporates the properties from the previous weight update to allow the weights to continue to change in the same direction even when there is less error being calculated.
  • Learning Rate Decay is used to decrease the learning rate over epochs to allow the network to make large changes to the weights at the beginning and smaller fine tuning changes later in the training schedule.
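
A minimal sketch combining these terms; the error gradient and all parameter values are hypothetical, and the decay formula shown is just one common form:

# Sketch: weight update with learning rate, momentum and learning rate decay (illustrative)
initial_learning_rate = 0.1
decay = 0.01
momentum = 0.9
previous_update = 0.0
weight = 0.5

for epoch in range(3):
    gradient = 0.2                                                 # hypothetical error gradient
    learning_rate = initial_learning_rate / (1.0 + decay * epoch)  # decayed learning rate
    update = learning_rate * gradient + momentum * previous_update
    weight = weight - update
    previous_update = update
    print(epoch, round(learning_rate, 4), round(weight, 4))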

Prediction

Once a neural network has been trained it can be used to make predictions.

You can make predictions on test or validation data in order to estimate the skill of the model on unseen data. You can also deploy it operationally and use it to make predictions continuously.

The network topology and the final set of weights are all that you need to save from the model. Predictions are made by providing the input to the network and performing a forward pass, allowing it to generate an output that you can use as a prediction.


More Resources

There are decades of papers and books on the topic of artificial neural networks.

If you are new to the field, I recommend seeking out introductory books and survey papers as further reading.

Summary

In this post you discovered artificial neural networks for machine learning.

After reading this post you now know:

  • How neural networks are not models of the brain but are instead computational models for solving complex machine learning problems.
  • That neural networks are comprised of neurons that have weights and activation functions.
  • The networks are organized into layers of neurons and are trained using stochastic gradient descent.
  • That it is a good idea to prepare your data before training a neural network model.

Do you have any questions about neural networks or about this post? Ask your question in the comments and I will do my best to answer it.


The post Crash Course On Multi-Layer Perceptron Neural Networks appeared first on Machine Learning Mastery.




jueves, 12 de mayo de 2016

Understand Your Machine Learning Data With Descriptive Statistics in Python

You must understand your data in order to get the best results.

In this post you will discover 7 recipes that you can use in Python to learn more about your machine learning data.

Let’s get started.

Understand Your Machine Learning Data With Descriptive Statistics in Python
Photo by passer-by, some rights reserved.

Python Recipes To Understand Your Machine Learning Data

This section lists 7 recipes that you can use to better understand your machine learning data.

Each recipe is demonstrated by loading the Pima Indians Diabetes classification dataset from the UCI Machine Learning repository.

Open your Python interactive environment and try each recipe out in turn.

1. Peek at Your Data

There is no substitute for looking at the raw data.

Looking at the raw data can reveal insights that you cannot get any other way. It can also plant seeds that may later grow into ideas on how to better preprocess and handle the data for machine learning tasks.

You can review the first 20 rows of your data using the head() function on the Pandas DataFrame.

# View first 20 rows
import pandas
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
peek = data.head(20)
print(peek)

You can see that the first column lists the row number, which is handy for referencing a specific observation.

preg  plas  pres  skin  test  mass   pedi  age  class
0      6   148    72    35     0  33.6  0.627   50      1
1      1    85    66    29     0  26.6  0.351   31      0
2      8   183    64     0     0  23.3  0.672   32      1
3      1    89    66    23    94  28.1  0.167   21      0
4      0   137    40    35   168  43.1  2.288   33      1
5      5   116    74     0     0  25.6  0.201   30      0
6      3    78    50    32    88  31.0  0.248   26      1
7     10   115     0     0     0  35.3  0.134   29      0
8      2   197    70    45   543  30.5  0.158   53      1
9      8   125    96     0     0   0.0  0.232   54      1
10     4   110    92     0     0  37.6  0.191   30      0
11    10   168    74     0     0  38.0  0.537   34      1
12    10   139    80     0     0  27.1  1.441   57      0
13     1   189    60    23   846  30.1  0.398   59      1
14     5   166    72    19   175  25.8  0.587   51      1
15     7   100     0     0     0  30.0  0.484   32      1
16     0   118    84    47   230  45.8  0.551   31      1
17     7   107    74     0     0  29.6  0.254   31      1
18     1   103    30    38    83  43.3  0.183   33      0
19     1   115    70    30    96  34.6  0.529   32      1

2. Dimensions of Your Data

You must have a very good handle on how much data you have, both in terms of rows and columns.

  • Too many rows and algorithms may take too long to train. Too few and perhaps you do not have enough data to train the algorithms.
  • Too many features and some algorithms can be distracted or suffer poor performance due to the curse of dimensionality.

You can review the shape and size of your dataset by printing the shape property on the Pandas DataFrame.

# Dimensions of your data
import pandas
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
shape = data.shape
print(shape)

The results are listed in rows then columns. You can see that the dataset has 768 rows and 9 columns.

(768, 9)

3. Data Type For Each Attribute

The type of each attribute is important.

Strings may need to be converted to floating point values or integers to represent categorical or ordinal values.

You can get an idea of the types of attributes by peeking at the raw data, as above. You can also list the data types used by the DataFrame to characterize each attribute using the dtypes property.

# Data Types for Each Attribute
import pandas
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
types = data.dtypes
print(types)

You can see that most of the attributes are integers and that mass and pedi are floating point values.

preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object

4. Descriptive Statistics

Descriptive statistics can give you great insight into the shape of each attribute.

Often you can create more summaries than you have time to review. The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute:

  • Count
  • Mean
  • Standard Deviation
  • Minimum Value
  • 25th Percentile
  • 50th Percentile (Median)
  • 75th Percentile
  • Maximum Value

# Statistical Summary
import pandas
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
pandas.set_option('display.width', 100)
pandas.set_option('precision', 3)
description = data.describe()
print(description)

You can see that you do get a lot of data. You will note some calls to pandas.set_option() in the recipe to change the precision of the numbers and the preferred width of the output. This is to make it more readable for this example.

When describing your data this way, it is worth taking some time and reviewing observations from the results. This might include the presence of “NA” values for missing data or surprising distributions for attributes.

preg     plas     pres     skin     test     mass     pedi      age    class
count  768.000  768.000  768.000  768.000  768.000  768.000  768.000  768.000  768.000
mean     3.845  120.895   69.105   20.536   79.799   31.993    0.472   33.241    0.349
std      3.370   31.973   19.356   15.952  115.244    7.884    0.331   11.760    0.477
min      0.000    0.000    0.000    0.000    0.000    0.000    0.078   21.000    0.000
25%      1.000   99.000   62.000    0.000    0.000   27.300    0.244   24.000    0.000
50%      3.000  117.000   72.000   23.000   30.500   32.000    0.372   29.000    0.000
75%      6.000  140.250   80.000   32.000  127.250   36.600    0.626   41.000    1.000
max     17.000  199.000  122.000   99.000  846.000   67.100    2.420   81.000    1.000

5. Class Distribution (Classification Only)

On classification problems you need to know how balanced the class values are.

Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of your project.

You can quickly get an idea of the distribution of the class attribute in Pandas.

# Class Distribution
import pandas
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
class_counts = data.groupby('class').size()
print(class_counts)

You can see that there are nearly double the number of observations with class 0 (no onset of diabetes) than there are with class 1 (onset of diabetes).

class
0    500
1    268

6. Correlation Between Attributes

Correlation refers to the relationship between two variables and how they may or may not change together.

The most common method for calculating correlation is Pearson’s Correlation Coefficient, which assumes a normal distribution of the attributes involved. A correlation of -1 or 1 shows a full negative or positive correlation respectively, whereas a value of 0 shows no correlation at all.

Some machine learning algorithms like linear and logistic regression can suffer poor performance if there are highly correlated attributes in your dataset. As such, it is a good idea to review all of the pair-wise correlations of the attributes in your dataset. You can use the corr() function on the Pandas DataFrame to calculate a correlation matrix.

# Pairwise Pearson correlations
import pandas
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
pandas.set_option('display.width', 100)
pandas.set_option('precision', 3)
correlations = data.corr(method='pearson')
print(correlations)

The matrix lists all attributes across the top and down the side, to give the correlation between all pairs of attributes (twice, because the matrix is symmetrical). You can see that the diagonal from the top left to the bottom right of the matrix shows the perfect correlation of each attribute with itself.

preg   plas   pres   skin   test   mass   pedi    age  class
preg   1.000  0.129  0.141 -0.082 -0.074  0.018 -0.034  0.544  0.222
plas   0.129  1.000  0.153  0.057  0.331  0.221  0.137  0.264  0.467
pres   0.141  0.153  1.000  0.207  0.089  0.282  0.041  0.240  0.065
skin  -0.082  0.057  0.207  1.000  0.437  0.393  0.184 -0.114  0.075
test  -0.074  0.331  0.089  0.437  1.000  0.198  0.185 -0.042  0.131
mass   0.018  0.221  0.282  0.393  0.198  1.000  0.141  0.036  0.293
pedi  -0.034  0.137  0.041  0.184  0.185  0.141  1.000  0.034  0.174
age    0.544  0.264  0.240 -0.114 -0.042  0.036  0.034  1.000  0.238
class  0.222  0.467  0.065  0.075  0.131  0.293  0.174  0.238  1.000

7. Skew of Univariate Distributions

Skew refers to a distribution that is assumed to be Gaussian (normal or bell curve) but that is shifted or squashed in one direction or another.

Many machine learning algorithms assume a Gaussian distribution. Knowing that an attribute has a skew may allow you to perform data preparation to correct the skew and later improve the accuracy of your models.

You can calculate the skew of each attribute using the skew() function on the Pandas DataFrame.

# Skew for each attribute
import pandas
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
skew = data.skew()
print(skew)

The skew results show a positive (right) or negative (left) skew. Values closer to zero show less skew.

preg     0.901674
plas     0.173754
pres    -1.843608
skin     0.109372
test     2.272251
mass    -0.428982
pedi     1.919911
age      1.129597
class    0.635017
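
Correcting skew is beyond the scope of these recipes, but as a hedged illustration of the idea mentioned above, a log transform is one common way to reduce a strong positive skew (applied here to the test attribute):

# Reduce a strong positive skew with a log transform (one common approach)
import numpy
import pandas
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
transformed = numpy.log1p(data['test'])  # log(1 + x) handles the zero values
print(data['test'].skew(), transformed.skew())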

More Recipes

This was just a selection of the most useful summaries and descriptive statistics that you can use on your machine learning data for classification and regression.

There are many other statistics that you could calculate.
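
For example, a few other summaries worth a look (a sketch, not part of the 7 recipes above) include kurtosis, the number of unique values and counts of missing values:

# A few additional summaries (illustrative sketch)
import pandas
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
print(data.kurtosis())      # peakedness of each attribute's distribution
print(data.nunique())       # number of distinct values per attribute
print(data.isnull().sum())  # count of missing (NaN) values per attribute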

Is there a specific statistic that you like to calculate and review when you start working on a new data set? Leave a comment and let me know.

Tips To Remember

This section gives you some tips to remember when reviewing your data using summary statistics.

  • Review the numbers. Generating the summary statistics is not enough. Take a moment to pause, read and really think about the numbers you are seeing.
  • Ask why. Review your numbers and ask a lot of questions. How and why are you seeing specific numbers? Think about how the numbers relate to the problem domain in general and to the specific entities that the observations describe.
  • Write down ideas. Write down your observations and ideas. Keep a small text file or note pad and jot down all of the ideas for how variables may relate, for what numbers mean, and ideas for techniques to try later. The things you write down now while the data is fresh will be very valuable later when you are trying to think up new things to try.


Summary

In this post you discovered the importance of describing your dataset before you start work on your machine learning project.

You discovered 7 different ways to summarize your dataset using Python and Pandas:

  1. Peek At Your Data
  2. Dimensions of Your Data
  3. Data Types
  4. Class Distribution
  5. Data Summary
  6. Correlations
  7. Skewness

Action Step

  1. Open your Python interactive environment.
  2. Type or copy-and-paste each recipe and see how it works.
  3. Let me know how you go in the comments.

Do you have any questions about Python, Pandas or the recipes in this post? Leave a comment and ask your question, I will do my best to answer it.


The post Understand Your Machine Learning Data With Descriptive Statistics in Python appeared first on Machine Learning Mastery.




martes, 10 de mayo de 2016

How To Load Machine Learning Data in Python

You must be able to load your data before you can start your machine learning project.

The most common format for machine learning data is CSV files. There are a number of ways to load a CSV file in Python.

In this post you will discover the different ways that you can use to load your machine learning data in Python.

Let’s get started.

How To Load Machine Learning Data in Python
Photo by Ann Larie Valentine, some rights reserved.

Considerations When Loading CSV Data

There are a number of considerations when loading your machine learning data from CSV files.

For reference, you can learn a lot about the expectations for CSV files by reviewing the CSV request for comment titled Common Format and MIME Type for Comma-Separated Values (CSV) Files.

CSV File Header

Does your data have a file header?

If so this can help in automatically assigning names to each column of data. If not, you may need to name your attributes manually.

Either way, you should explicitly specify whether or not your CSV file has a file header when loading your data.

Comments

Does your data have comments?

Comments in a CSV file are indicated by a hash (“#”) at the start of a line.

If you have comments in your file, depending on the method used to load your data, you may need to indicate whether or not to expect comments and the character to expect to signify a comment line.

Delimiter

The standard delimiter that separates values in fields is the comma (“,”) character.

Your file could use a different delimiter like tab (“\t”) in which case you must specify it explicitly.

Quotes

Sometimes field values can have spaces. In these CSV files the values are often quoted.

The default quote character is the double quotation mark ("). Other characters can be used, and you must specify the quote character used in your file.
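
As a hedged sketch of how these considerations translate into code (the file name and column names below are hypothetical), pandas.read_csv() exposes parameters for each of them:

# Sketch: specifying header, comments, delimiter and quote character (hypothetical file)
import pandas
data = pandas.read_csv('my-data.tsv',
                       header=None,            # no header row; use header=0 if the first row has names
                       names=['a', 'b', 'c'],  # assign column names manually
                       comment='#',            # ignore lines starting with a hash
                       sep='\t',               # tab-delimited values
                       quotechar='"')          # quote character for quoted fields
print(data.shape)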

Machine Learning Data Loading Recipes

Each recipe is standalone.

This means that you can copy and paste it into your project and use it immediately.

If you have any questions about these recipes or suggested improvements, please leave a comment and I will do my best to answer.

Load CSV with Python Standard Library

The Python standard library provides the csv module and the csv.reader() function that can be used to load CSV files.

Once loaded, you convert the CSV data to a NumPy array and use it for machine learning.

For example, you can download the Pima Indians dataset into your local directory (here). All fields are numeric and there is no header line. Running the recipe below will load the CSV file and convert it to a NumPy array.

# Load CSV (using python)
import csv
import numpy
filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'r')  # text mode for csv.reader; legacy Python 2 used 'rb'
reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
x = list(reader)
data = numpy.array(x).astype('float')
print(data.shape)

The example loads an object that can iterate over each row of the data and can easily be converted into a NumPy array. Running the example prints the shape of the array.

(768, 9)

For more information on the csv.reader() function, see CSV File Reading and Writing in the Python API documentation.

Load CSV File With NumPy

You can load your CSV data using NumPy and the numpy.loadtxt() function.

This function assumes no header row and all data has the same format. The example below assumes that the file pima-indians-diabetes.data.csv is in your current working directory.

# Load CSV
import numpy
filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'rb')
data = numpy.loadtxt(raw_data, delimiter=",")
print(data.shape)

Running the example will load the file as a numpy.ndarray and print the shape of the data:

(768, 9)

This example can be modified to load the same dataset directly from a URL as follows:

# Load CSV from URL using NumPy
import numpy
from urllib.request import urlopen  # in legacy Python 2 this was urllib.urlopen
url = "http://ift.tt/1Ogiw3p"
raw_data = urlopen(url)
dataset = numpy.loadtxt(raw_data, delimiter=",")
print(dataset.shape)

Again, running the example produces the same resulting shape of the data.

(768, 9)

For more information on the numpy.loadtxt() function see the API documentation (version 1.10 of numpy).

Load CSV File With Pandas

You can load your CSV data using Pandas and the pandas.read_csv() function.

This function is very flexible and is perhaps my recommended approach for loading your machine learning data. The function returns a pandas.DataFrame that you can immediately start summarizing and plotting.

The example below assumes that the ‘pima-indians-diabetes.data.csv‘ file is in the current working directory.

# Load CSV using Pandas
import pandas
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(filename, names=names)
print(data.shape)

Note that in this example we explicitly specify the names of each attribute to the DataFrame. Running the example displays the shape of the data:

(768, 9)

We can also modify this example to load CSV data directly from a URL.

# Load CSV using Pandas from URL
import pandas
url = "http://ift.tt/1Ogiw3p;
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
print(data.shape)

Again, running the example downloads the CSV file, parses it and displays the shape of the loaded DataFrame.

(768, 9)

To learn more about the pandas.read_csv() function you can refer to the API documentation.


Summary

In this post you discovered how to load your machine learning data in Python.

You learned three specific techniques that you can use:

  • Load CSV with Python Standard Library.
  • Load CSV File With NumPy.
  • Load CSV File With Pandas.

Your action step for this post is to type or copy-and-paste each recipe and get familiar with the different ways that you can load machine learning data in Python.

Do you have any questions about loading machine learning data in Python or about this post? Ask your question in the comments and I will do my best to answer it.


The post How To Load Machine Learning Data in Python appeared first on Machine Learning Mastery.


