You must understand your data in order to get the best results.
In this post you will discover 7 recipes that you can use in Python to learn more about your machine learning data.
Let’s get started.
Python Recipes To Understand Your Machine Learning Data
This section lists 7 recipes that you can use to better understand your machine learning data.
Each recipe is demonstrated by loading the Pima Indians Diabetes classification dataset from the UCI Machine Learning repository.
Open your Python interactive environment and try each recipe out in turn.
1. Peek at Your Data
There is no substitute for looking at the raw data.
Looking at the raw data can reveal insights that you cannot get any other way. It can also plant seeds that may later grow into ideas on how to better preprocess and handle the data for machine learning tasks.
You can review the first 20 rows of your data using the head() function on the Pandas DataFrame.
# View first 20 rows
import pandas
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
peek = data.head(20)
print(peek)
You can see that the first column lists the row number, which is handy for referencing a specific observation.
    preg  plas  pres  skin  test  mass   pedi  age  class
0      6   148    72    35     0  33.6  0.627   50      1
1      1    85    66    29     0  26.6  0.351   31      0
2      8   183    64     0     0  23.3  0.672   32      1
3      1    89    66    23    94  28.1  0.167   21      0
4      0   137    40    35   168  43.1  2.288   33      1
5      5   116    74     0     0  25.6  0.201   30      0
6      3    78    50    32    88  31.0  0.248   26      1
7     10   115     0     0     0  35.3  0.134   29      0
8      2   197    70    45   543  30.5  0.158   53      1
9      8   125    96     0     0   0.0  0.232   54      1
10     4   110    92     0     0  37.6  0.191   30      0
11    10   168    74     0     0  38.0  0.537   34      1
12    10   139    80     0     0  27.1  1.441   57      0
13     1   189    60    23   846  30.1  0.398   59      1
14     5   166    72    19   175  25.8  0.587   51      1
15     7   100     0     0     0  30.0  0.484   32      1
16     0   118    84    47   230  45.8  0.551   31      1
17     7   107    74     0     0  29.6  0.254   31      1
18     1   103    30    38    83  43.3  0.183   33      0
19     1   115    70    30    96  34.6  0.529   32      1
2. Dimensions of Your Data
You must have a very good handle on how much data you have, both in terms of rows and columns.
- Too many rows and algorithms may take too long to train. Too few and perhaps you do not have enough data to train the algorithms.
- Too many features and some algorithms can be distracted or suffer poor performance due to the curse of dimensionality.
You can review the shape and size of your dataset by printing the shape property on the Pandas DataFrame.
# Dimensions of your data
import pandas
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
shape = data.shape
print(shape)
The results are listed in rows then columns. You can see that the dataset has 768 rows and 9 columns.
(768, 9)
3. Data Type For Each Attribute
The type of each attribute is important.
Strings may need to be converted to floating point values or integers to represent categorical or ordinal values.
You can get an idea of the types of attributes by peeking at the raw data, as above. You can also list the data types used by the DataFrame to characterize each attribute using the dtypes property.
# Data Types for Each Attribute
import pandas
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
types = data.dtypes
print(types)
You can see that most of the attributes are integers and that mass and pedi are floating point values.
preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object
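The Pima dataset is already numeric, so the string conversion mentioned above is not needed here. As a minimal sketch of what it could look like, assuming a hypothetical toy DataFrame with a string column named size (not part of the Pima data), one option is to map the strings to ordered integer codes:

```python
import pandas

# Hypothetical toy data: 'size' is a string column and is NOT part of the Pima dataset.
toy = pandas.DataFrame({'size': ['small', 'large', 'medium', 'small'],
                        'weight': [1.2, 3.4, 2.2, 1.0]})

# State the category order explicitly so the integer codes respect it (ordinal values).
order = ['small', 'medium', 'large']
toy['size_code'] = pandas.Categorical(toy['size'], categories=order, ordered=True).codes

# 'size' stays a string (object) column; 'size_code' is a small integer column.
print(toy.dtypes)
print(toy)
```

Spelling out the order matters for ordinal values. Without it, pandas would assign codes in sorted (alphabetical) order, which may not match the intended ranking.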
4. Descriptive Statistics
Descriptive statistics can give you great insight into the shape of each attribute.
Often you can create more summaries than you have time to review. The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute:
- Count
- Mean
- Standard Deviation
- Minimum Value
- 25th Percentile
- 50th Percentile (Median)
- 75th Percentile
- Maximum Value
# Statistical Summary
import pandas
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
pandas.set_option('display.width', 100)
# Use the full option name; the short form 'precision' is ambiguous in recent pandas versions
pandas.set_option('display.precision', 3)
description = data.describe()
print(description)
You can see that you do get a lot of data. You will note some calls to pandas.set_option() in the recipe to change the precision of the numbers and the preferred width of the output. This is to make it more readable for this example.
When describing your data this way, it is worth taking some time and reviewing observations from the results. This might include the presence of “NA” values for missing data or surprising distributions for attributes.
          preg     plas     pres     skin     test     mass     pedi      age    class
count  768.000  768.000  768.000  768.000  768.000  768.000  768.000  768.000  768.000
mean     3.845  120.895   69.105   20.536   79.799   31.993    0.472   33.241    0.349
std      3.370   31.973   19.356   15.952  115.244    7.884    0.331   11.760    0.477
min      0.000    0.000    0.000    0.000    0.000    0.000    0.078   21.000    0.000
25%      1.000   99.000   62.000    0.000    0.000   27.300    0.244   24.000    0.000
50%      3.000  117.000   72.000   23.000   30.500   32.000    0.372   29.000    0.000
75%      6.000  140.250   80.000   32.000  127.250   36.600    0.626   41.000    1.000
max     17.000  199.000  122.000   99.000  846.000   67.100    2.420   81.000    1.000
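The check for missing values mentioned above can be sketched as follows. To keep the example runnable offline, it rebuilds a tiny DataFrame from four of the rows shown in the peek output rather than downloading the file. Treating a zero as a possible stand-in for a missing value is an assumption made for illustration only; the summary shows minimums of 0 for several columns, but it does not say what those zeros mean.

```python
import pandas

# Four rows copied from the peek output earlier in the post, so the sketch runs offline.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [10, 115, 0, 0, 0, 35.3, 0.134, 29, 0],
    [8, 125, 96, 0, 0, 0.0, 0.232, 54, 1],
]
data = pandas.DataFrame(rows, columns=names)

# Count the values that pandas itself treats as missing (NaN) in each column.
na_counts = data.isnull().sum()
print(na_counts)

# Count zeros per column. Whether a zero means "missing" is a judgment call
# (assumed here for illustration, not confirmed by the dataset itself).
zero_counts = (data == 0).sum()
print(zero_counts)
```

On the full dataset you would run the same two counts on the DataFrame returned by read_csv().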
5. Class Distribution (Classification Only)
On classification problems you need to know how balanced the class values are.
Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of your project.
You can quickly get an idea of the distribution of the class attribute in Pandas.
# Class Distribution
import pandas
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
class_counts = data.groupby('class').size()
print(class_counts)
You can see that there are nearly twice as many observations with class 0 (no onset of diabetes) as with class 1 (onset of diabetes).
class
0    500
1    268
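Raw counts are easier to compare across datasets when expressed as fractions. The sketch below is one way to do that. It rebuilds the class column from the counts printed above (an assumption made so the example runs without downloading the file) and uses value_counts() with normalize=True.

```python
import pandas

# Rebuild the class column from the counts printed above (500 zeros, 268 ones).
classes = pandas.Series([0] * 500 + [1] * 268, name='class')

# normalize=True returns the fraction of observations in each class instead of raw counts.
class_fractions = classes.value_counts(normalize=True).sort_index()
print(class_fractions.round(3))
```

Here class 0 accounts for roughly 65% of the observations and class 1 for roughly 35%. On the full DataFrame the equivalent call would be data['class'].value_counts(normalize=True).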
6. Correlation Between Attributes
Correlation refers to the relationship between two variables and how they may or may not change together.
The most common method for calculating correlation is Pearson’s Correlation Coefficient, which assumes a normal distribution of the attributes involved. A correlation of -1 or 1 shows a full negative or positive correlation respectively, whereas a value of 0 shows no correlation at all.
Some machine learning algorithms like linear and logistic regression can suffer poor performance if there are highly correlated attributes in your dataset. As such, it is a good idea to review all of the pair-wise correlations of the attributes in your dataset. You can use the corr() function on the Pandas DataFrame to calculate a correlation matrix.
# Pairwise Pearson correlations
import pandas
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
pandas.set_option('display.width', 100)
# Use the full option name; the short form 'precision' is ambiguous in recent pandas versions
pandas.set_option('display.precision', 3)
correlations = data.corr(method='pearson')
print(correlations)
The matrix lists all attributes across the top and down the side, giving the correlation between all pairs of attributes (twice, because the matrix is symmetrical). The diagonal from the top left to the bottom right shows the perfect correlation of each attribute with itself.
        preg   plas   pres   skin   test   mass   pedi    age  class
preg   1.000  0.129  0.141 -0.082 -0.074  0.018 -0.034  0.544  0.222
plas   0.129  1.000  0.153  0.057  0.331  0.221  0.137  0.264  0.467
pres   0.141  0.153  1.000  0.207  0.089  0.282  0.041  0.240  0.065
skin  -0.082  0.057  0.207  1.000  0.437  0.393  0.184 -0.114  0.075
test  -0.074  0.331  0.089  0.437  1.000  0.198  0.185 -0.042  0.131
mass   0.018  0.221  0.282  0.393  0.198  1.000  0.141  0.036  0.293
pedi  -0.034  0.137  0.041  0.184  0.185  0.141  1.000  0.034  0.174
age    0.544  0.264  0.240 -0.114 -0.042  0.036  0.034  1.000  0.238
class  0.222  0.467  0.065  0.075  0.131  0.293  0.174  0.238  1.000
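Reading a full matrix by eye gets tedious as the number of attributes grows. One possible way to pull out only the stronger pairs is sketched below. It re-enters a four-attribute slice of the matrix printed above by hand so that it runs offline, and the 0.4 cut-off is an arbitrary choice for illustration, not a recommended value.

```python
import numpy
import pandas

# Part of the correlation matrix printed above, re-entered by hand so the sketch runs offline.
cols = ['preg', 'skin', 'test', 'age']
correlations = pandas.DataFrame(
    [[1.000, -0.082, -0.074, 0.544],
     [-0.082, 1.000, 0.437, -0.114],
     [-0.074, 0.437, 1.000, -0.042],
     [0.544, -0.114, -0.042, 1.000]],
    index=cols, columns=cols)

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
upper = numpy.triu(numpy.ones(correlations.shape, dtype=bool), k=1)
pairs = correlations.where(upper).stack()

# The 0.4 cut-off is an arbitrary value chosen for illustration.
threshold = 0.4
strong = pairs[pairs.abs() > threshold].sort_values(ascending=False)
print(strong)
```

With this slice the two pairs that pass the cut-off are preg/age (0.544) and skin/test (0.437). On the full dataset you would replace the hand-entered matrix with the result of data.corr().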
7. Skew of Univariate Distributions
Skew describes a distribution that is assumed to be Gaussian (normal, or bell curve) but is shifted or squashed in one direction or another.
Many machine learning algorithms assume a Gaussian distribution. Knowing that an attribute has a skew may allow you to perform data preparation to correct the skew and later improve the accuracy of your models.
You can calculate the skew of each attribute using the skew() function on the Pandas DataFrame.
# Skew for each attribute
import pandas
url = "http://ift.tt/1Ogiw3p"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
skew = data.skew()
print(skew)
The skew results show a positive (right) or negative (left) skew. Values closer to zero show less skew.
preg     0.901674
plas     0.173754
pres    -1.843608
skin     0.109372
test     2.272251
mass    -0.428982
pedi     1.919911
age      1.129597
class    0.635017
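As a rough illustration of the skew correction mentioned above, the sketch below applies a log transform to the test attribute, which shows the largest positive skew in the output. It uses only the 20 test values from the peek output so that it runs offline, and log1p is just one common choice of transform assumed here for illustration; it is not the only option and may not suit every attribute.

```python
import numpy
import pandas

# The 'test' values from the 20 rows in the peek output earlier in the post.
test = pandas.Series([0, 0, 0, 94, 168, 0, 88, 0, 543, 0,
                      0, 0, 0, 846, 175, 0, 230, 0, 83, 96], name='test')

# log1p computes log(1 + x), which is defined at zero and compresses large values.
transformed = numpy.log1p(test)

# Compare the skew before and after the transform; closer to zero means less skew.
print("skew before:", test.skew())
print("skew after: ", transformed.skew())
```

On this sample the skew after the transform is closer to zero than before. Whether that improves a model still has to be checked on the full dataset.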
More Recipes
This was just a selection of the most useful summaries and descriptive statistics that you can use on your machine learning data for classification and regression.
There are many other statistics that you could calculate.
Is there a specific statistic that you like to calculate and review when you start working on a new data set? Leave a comment and let me know.
Tips To Remember
This section gives you some tips to remember when reviewing your data using summary statistics.
- Review the numbers. Generating the summary statistics is not enough. Take a moment to pause, read and really think about the numbers you are seeing.
- Ask why. Review your numbers and ask a lot of questions. How and why are you seeing specific numbers? Think about how the numbers relate to the problem domain in general and to the specific entities that the observations describe.
- Write down ideas. Write down your observations and ideas. Keep a small text file or note pad and jot down all of the ideas for how variables may relate, for what numbers mean, and ideas for techniques to try later. The things you write down now while the data is fresh will be very valuable later when you are trying to think up new things to try.
Summary
In this post you discovered the importance of describing your dataset before you start work on your machine learning project.
You discovered 7 different ways to summarize your dataset using Python and Pandas:
- Peek At Your Data
- Dimensions of Your Data
- Data Types
- Descriptive Statistics
- Class Distribution
- Correlations
- Skewness
Action Step
- Open your Python interactive environment.
- Type or copy-and-paste each recipe and see how it works.
- Let me know how you go in the comments.
Do you have any questions about Python, Pandas or the recipes in this post? Leave a comment with your question and I will do my best to answer it.
The post Understand Your Machine Learning Data With Descriptive Statistics in Python appeared first on Machine Learning Mastery.
from Machine Learning Mastery http://ift.tt/1qgyVJG