The data features that you use to train your machine learning models have a huge influence on the performance you can achieve.
Irrelevant or partially relevant features can negatively impact model performance.
In this post you will discover automatic feature selection techniques that you can use to prepare your machine learning data in Python with scikit-learn.
Let’s get started.
Feature Selection
Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.
Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.
Three benefits of performing feature selection before modeling your data are:
- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves Accuracy: Less misleading data means modeling accuracy improves.
- Reduces Training Time: Less data means that algorithms train faster.
You can learn more about feature selection with scikit-learn in the article Feature selection.
Feature Selection for Machine Learning
This section lists 4 feature selection recipes for machine learning in Python.
Each recipe was designed to be complete and standalone so that you can copy-and-paste it directly into your project and use it immediately.
Each recipe uses the Pima Indians onset of diabetes dataset to demonstrate a feature selection method. This is a binary classification problem where all of the attributes are numeric.
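If you want to confirm the dataset before working through the recipes, a minimal loading sketch (using the same URL and column names that appear in the recipes below) might look like this:

from pandas import read_csv
# load the Pima Indians dataset and confirm its shape and types
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
print(dataframe.shape)   # expect (768, 9): 8 input attributes plus the class
print(dataframe.dtypes)  # all attributes should be numeric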
1. Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable.
The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.
The example below uses the chi squared (chi^2) statistical test for non-negative features to select 4 of the best features from the Pima Indians onset of diabetes dataset.
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])
You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): plas, test, mass and age.
[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393   181.304]
[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]
 [  89.    94.    28.1   21. ]
 [ 137.   168.    43.1   33. ]]
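If you also want the attribute names rather than reading them off the score array, one option is the get_support() method of SelectKBest. A small sketch that continues from the recipe above (reusing fit and names):

# continues from the recipe above: fit is the fitted SelectKBest, names holds the column names
mask = fit.get_support()                           # boolean mask of the selected columns
selected = [n for n, keep in zip(names[0:8], mask) if keep]
print(selected)                                    # expect ['plas', 'test', 'mass', 'age']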
2. Recursive Feature Elimination
Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain.
It uses the model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute.
You can learn more about the RFE class in the scikit-learn documentation.
The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much as long as it is skillful and consistent.
# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
You can see that RFE chose the top 3 features as preg, pedi and age. These are marked True in the support_ array and marked with a "1" in the ranking_ array.
Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
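To turn the support_ mask into attribute names, or to get the reduced dataset itself, a short continuation of the recipe above (reusing fit, names and X) could look like this:

# continues from the recipe above
selected = [n for n, keep in zip(names[0:8], fit.support_) if keep]
print(selected)               # expect ['preg', 'pedi', 'age']
reduced_X = fit.transform(X)  # keep only the 3 selected columns
print(reduced_X.shape)        # expect (768, 3)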
3. Principal Component Analysis
Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.
Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result.
In the example below, we use PCA and select 3 principal components.
Learn more about the PCA class in scikit-learn by reviewing the PCA API. Dive deeper into the math behind PCA on the Principal Component Analysis Wikipedia article.
# Feature Extraction with PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
You can see that the transformed dataset (3 principal components) bears little resemblance to the source data.
Explained Variance: [ 0.88854663  0.06159078  0.02579012]
[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [  2.26488861e-02   9.72210040e-01   1.41909330e-01  -5.78614699e-02
   -9.46266913e-02   4.69729766e-02   8.16804621e-04   1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]
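The recipe above only summarizes the components; to obtain the compressed dataset you would apply the fitted transform to X. A minimal continuation (reusing fit and X from the recipe) might be:

# continues from the recipe above: project the data onto the 3 principal components
reduced_X = fit.transform(X)
print(reduced_X.shape)    # expect (768, 3)
print(reduced_X[0:5, :])  # first 5 rows in the transformed space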
4. Feature Importance
Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.
In the example below we construct an ExtraTreesClassifier for the Pima Indians onset of diabetes dataset. You can learn more about the ExtraTreesClassifier class in the scikit-learn API.
# Feature Importance with Extra Trees Classifier
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
# load data
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)
You can see that we are given an importance score for each attribute, where the larger the score, the more important the attribute. The scores suggest the importance of plas, age and mass.
[ 0.11070069 0.2213717 0.08824115 0.08068703 0.07281761 0.14548537 0.12654214 0.15415431]
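To make the scores easier to read, you can pair each importance with its attribute name and sort them, largest first. A small sketch continuing from the recipe above (reusing model and names):

# continues from the recipe above: rank attributes by importance, largest first
ranked = sorted(zip(model.feature_importances_, names[0:8]), reverse=True)
for score, name in ranked:
    print("%s: %.3f" % (name, score))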
Summary
In this post you discovered feature selection for preparing machine learning data in Python with scikit-learn.
You learned about 4 different automatic feature selection techniques:
- Univariate Selection.
- Recursive Feature Elimination.
- Principal Component Analysis.
- Feature Importance.
If you are looking for more information on feature selection, see these related posts:
- Feature Selection with the Caret R Package
- Feature Selection to Improve Accuracy and Decrease Training Time
- An Introduction to Feature Selection
- Feature Selection in Python with Scikit-Learn
Do you have any questions about feature selection or this post? Ask your questions in the comments and I will do my best to answer them.