Thursday, March 31, 2016

Another Route to Jupyter Notebooks – Azure Machine Learning

In much the same way that the IBM Data Scientist Workbench seeks to provide some level of integration between analysis tools such as Jupyter notebooks and data access and storage, Azure Machine Learning Studio also provides a suite of tools for accessing and working with data in one location. Microsoft's offering is new to me, but […]

from R-bloggers http://ift.tt/1Tl0UnG
via IFTTT

Wednesday, March 30, 2016

100 Active Blogs on Analytics, Big Data, Data Mining, Data Science, Machine Learning

Stay on top of your data science skills game! Here's a list of about 100 of the most active and interesting blogs on Big Data, Data Science, Data Mining, Machine Learning, and Artificial Intelligence.

from KDnuggets http://ift.tt/1pHDEEs
via IFTTT

Tuesday, March 29, 2016

Linear Regression Tutorial Using Gradient Descent for Machine Learning

Stochastic Gradient Descent is an important and widely used algorithm in machine learning. In this post you will discover how to use Stochastic Gradient Descent to learn the coefficients for a simple linear regression model by minimizing the error on a training dataset. After reading this post you will know: The form of the Simple […]
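As a rough illustration of the idea (my own sketch in R, not code from the post; the toy data and learning rate below are made up), stochastic gradient descent updates the two coefficients one observation at a time:

set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.1)    # true intercept 2, slope 3
b0 <- 0; b1 <- 0; alpha <- 0.05          # starting coefficients and learning rate
for (epoch in 1:50) {
  for (i in sample(length(x))) {         # visit observations in random order
    err <- (b0 + b1 * x[i]) - y[i]       # prediction error for this observation
    b0  <- b0 - alpha * err              # update the intercept
    b1  <- b1 - alpha * err * x[i]       # update the slope
  }
}
c(b0, b1)                                # should end up close to c(2, 3)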

The post Linear Regression Tutorial Using Gradient Descent for Machine Learning appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1Rplnq4
via IFTTT

Sunday, March 27, 2016

Linear Regression for Machine Learning

Linear regression is perhaps one of the most well known and well understood algorithms in statistics and machine learning. In this post you will discover the linear regression algorithm, how it works and how you can best use it on your machine learning projects. In this post you will learn: Why linear regression belongs […]

The post Linear Regression for Machine Learning appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1S99h0F
via IFTTT

Simple Linear Regression Tutorial for Machine Learning

Linear regression is a very simple method but has proven to be very useful for a large number of situations. In this post you will discover exactly how linear regression works step-by-step. After reading this post you will know: How to calculate a simple linear regression step-by-step. How to perform all of the calculations using […]
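For reference, the step-by-step calculation the post walks through boils down to two formulas. Here is a minimal R sketch of my own (the toy data is made up, not taken from the tutorial):

x <- c(1, 2, 4, 3, 5)
y <- c(1, 3, 3, 2, 5)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(b0, b1)
coef(lm(y ~ x))   # R's built-in fit gives the same coefficients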

The post Simple Linear Regression Tutorial for Machine Learning appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1RmF8hV
via IFTTT

When it comes to science - it's the economy, stupid.



from Simply Statistics http://ift.tt/1SeW8Dd
via IFTTT

Thursday, March 24, 2016

Top 10 Data Science Resources on Github

The top 10 data science projects on GitHub are chiefly tutorials and educational resources for learning and doing data science. Have a look at the resources others are using and learning from.

from KDnuggets http://ift.tt/1SjmqXz
via IFTTT

Lift Analysis – A Data Scientist’s Secret Weapon

Gain insight into using lift analysis as a metric for doing data science. Understand how to use it for evaluating the performance and quality of a machine learning model.
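As a rough sketch of the idea in R (my own toy example, not from the article; the scores and outcomes are simulated), lift compares the response rate in each scored decile with the overall response rate:

set.seed(42)
score  <- runif(1000)                        # hypothetical model scores
actual <- rbinom(1000, 1, prob = score)      # simulated binary outcomes
decile <- cut(rank(-score, ties.method = "first"),
              breaks = seq(0, 1000, by = 100), labels = 1:10)
rate_by_decile <- tapply(actual, decile, mean)   # response rate per decile
round(rate_by_decile / mean(actual), 2)          # lift relative to the baseline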

from KDnuggets http://ift.tt/1VAUCRf
via IFTTT

Is your Classification Model making lucky guesses?

by Shaheen Gauher, PhD, Data Scientist at Microsoft. At the heart of a classification model is the ability to assign a class to an object based on its description or features. When we build a classification model, we often have to prove that the model is significantly better than random guessing. How do we know if our machine learning model performs better than a classifier built by assigning labels or classes arbitrarily (through a random guess, a weighted guess, etc.)? I will call the latter non-machine-learning classifiers, as these do not learn from the data. A machine learning classifier...
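The baseline idea can be sketched in a few lines of R (my own toy example, not the author's code): the expected accuracy of a majority-class guesser is the largest class proportion, and that of a proportion-weighted random guesser is the sum of squared class proportions.

labels <- factor(sample(c("yes", "no"), 500, replace = TRUE, prob = c(0.8, 0.2)))
p <- prop.table(table(labels))     # class proportions
c(majority_class = max(p),         # always predict the most common class
  weighted_guess = sum(p^2))       # guess classes in proportion to p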

from R-bloggers http://ift.tt/1SgRdEc
via IFTTT

Tuesday, March 22, 2016

Must Know Tips for Deep Learning Neural Networks, Part 1

Deep learning is a white-hot research topic. Pick up some solid deep learning neural network tips and tricks from a PhD researcher.

from KDnuggets http://ift.tt/1VzXn5p
via IFTTT

Monday, March 21, 2016

Ensemble Learning in R

Guest post by Stefan Feuerriegel. Previous research in data mining has devised numerous algorithms for learning tasks. While an individual algorithm might already work decently, one can usually obtain better predictive performance by combining several. This approach is referred to as ensemble learning. Common examples include random forests and boosting, AdaBoost in particular. Our slide deck sits at the intersection of teaching the basic idea of ensemble learning and providing practical insights in R. Therefore, each algorithm comes with an easy-to-understand explanation of how to use it in R. We hope that the slide deck enables practitioners to quickly adopt ensemble learning for their applications in R. Moreover, the materials might lay the groundwork for courses on data mining and machine learning. Download the slides here: http://ift.tt/1Ufkymv Download the exercise sheet here: http://ift.tt/1XHbhkT
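As a minimal taste of ensemble learning in R (my own sketch, not taken from the slide deck; it assumes the randomForest package is installed), a random forest combines a few hundred decision trees grown on bootstrap samples:

library(randomForest)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 200)  # ensemble of 200 trees
print(fit)   # shows the out-of-bag error estimate of the combined model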

from R-bloggers http://ift.tt/1XHbhkP
via IFTTT

Sunday, March 20, 2016

Improve SVM Tuning through Parallelism

As pointed out in chapter 10 of "The Elements of Statistical Learning", ANNs and SVMs (support vector machines) share similar pros and cons, e.g. lack of interpretability and good predictive power. However, in contrast to ANNs, which usually suffer from local-minimum solutions, an SVM always converges to a global optimum. In addition, SVM is less […]
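A minimal sketch of the idea in R (mine, not from the excerpted post; it assumes the e1071 and parallel packages, and note that mclapply forking is not available on Windows): tune cost and gamma over a grid, evaluating each candidate in parallel with cross-validation.

library(e1071)
library(parallel)
grid <- expand.grid(cost = 10^(-1:2), gamma = 10^(-2:0))
cv_error <- mclapply(seq_len(nrow(grid)), function(i) {
  fit <- svm(Species ~ ., data = iris, kernel = "radial",
             cost = grid$cost[i], gamma = grid$gamma[i], cross = 5)
  1 - fit$tot.accuracy / 100        # 5-fold cross-validation error
}, mc.cores = 2)                    # use 2 worker processes
grid[which.min(unlist(cv_error)), ] # best (cost, gamma) pair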

from R-bloggers http://ift.tt/1pUrHLQ
via IFTTT

Wednesday, March 16, 2016

New KDnuggets Tutorials Page: Learn R, Python, Data Visualization, Data Science, and more

Introducing new KDnuggets Tutorials page with useful resources for learning about Business Analytics, Big Data, Data Science, Data Mining, R, Python, Data Visualization, Spark, Deep Learning and more.

from KDnuggets http://ift.tt/1MmICLK
via IFTTT

Monday, March 14, 2016

Parametric and Nonparametric Machine Learning Algorithms

What is a parametric machine learning algorithm and how is it different from a nonparametric machine learning algorithm? In this post you will discover the difference between parametric and nonparametric machine learning algorithms. Let’s get started. Learning a Function Machine learning can be summarized as learning a function (f) that maps input variables (X) to output […]
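A small illustration of the distinction in R (my own sketch, not from the post): a parametric model commits to a fixed functional form, while a nonparametric one lets the data shape the function.

set.seed(1)
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)
fit_param    <- lm(y ~ x)            # parametric: a line with two coefficients
fit_nonparam <- loess(y ~ x)         # nonparametric: a local smoother
mean((y - fitted(fit_param))^2)      # large error: a line cannot capture sin(x)
mean((y - fitted(fit_nonparam))^2)   # much smaller error: the smoother adapts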

The post Parametric and Nonparametric Machine Learning Algorithms appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1M0yRbt
via IFTTT

Saturday, March 12, 2016

The Data Science Puzzle, Explained

The puzzle of data science is examined through the relationships between several key concepts in the data science realm. As we will see, these are far from concrete concepts etched in stone; divergent opinions are inevitable, and this is but another opinion to consider.

from KDnuggets http://ift.tt/225MUAr
via IFTTT

How Machine Learning Algorithms Work (they learn a mapping of input to output)

How do machine learning algorithms work? There is a common principle that underlies all supervised machine learning algorithms for predictive modeling. In this post you will discover how machine learning algorithms actually work by understanding the common principle that underlies all algorithms. Let’s get started. Learning a Function Machine learning algorithms are described as learning […]

The post How Machine Learning Algorithms Work (they learn a mapping of input to output) appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1QPvz7B
via IFTTT

Computing Classification Evaluation Metrics in R

by Said Bleik, Shaheen Gauher, Data Scientists at Microsoft. Evaluation metrics are the key to understanding how your classification model performs when applied to a test dataset. In what follows, we present a tutorial on how to compute common metrics that are often used in evaluation, in addition to metrics generated from random classifiers, which help in justifying the value added by your predictive model, especially in cases where the common metrics suggest otherwise. Topics covered: Creating the Confusion Matrix; Accuracy; Per-class Precision, Recall, and F-1; Macro-averaged Metrics; One-vs-all Matrices; Average Accuracy; Micro-averaged Metrics; Evaluation on Highly Imbalanced Datasets; Majority-class Metrics; Random-guess...
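As a rough flavor of the computations (my own minimal sketch, not the authors' code; the toy labels below are made up), the per-class metrics all come out of the confusion matrix:

actual    <- factor(c("a", "b", "a", "c", "b", "a", "c", "b"))
predicted <- factor(c("a", "b", "c", "c", "a", "a", "c", "b"), levels = levels(actual))
cm <- table(Predicted = predicted, Actual = actual)   # confusion matrix
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- diag(cm) / rowSums(cm)                   # per-class precision
recall    <- diag(cm) / colSums(cm)                   # per-class recall
f1        <- 2 * precision * recall / (precision + recall)
list(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)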

from R-bloggers http://ift.tt/1LhJMxf
via IFTTT

Friday, March 11, 2016

An Introduction to XGBoost R package

Introduction

XGBoost is a library designed and optimized for boosted tree algorithms. The gradient boosted trees model was originally proposed by Friedman et al. The underlying algorithm of XGBoost is similar; specifically, it is an extension of the classic gbm algorithm. By employing multi-threading and imposing regularization, XGBoost is able to use more computational power and produce more accurate predictions. Please refer to this tutorial for the details of the model.

One piece of evidence for its accuracy is that XGBoost is used in more than half of the winning solutions in machine learning challenges hosted on Kaggle. We have prepared an (incomplete) list of winning solutions.

There are various high-level interfaces: currently XGBoost has interfaces for C++, R, Python, Julia, Java and Scala. The core functions of XGBoost are implemented in C++, so it is easy to share models among the different interfaces. This post focuses on the R package xgboost, which has a friendly user interface and comprehensive documentation. Based on statistics from the RStudio CRAN mirror, the package has been downloaded more than 4,000 times in the last month.

The R package xgboost has won the 2016 John M. Chambers Statistical Software Award. From the very beginning of the work, our goal has been to make a package that brings convenience and joy to its users. Thus we will introduce several details of the R package xgboost that (we think) users would love to know.

A 1-minute Beginner's Guide

xgboost is available on both CRAN and GitHub. To install the stable/pre-compiled version from CRAN, simply run:

install.packages('xgboost')

You can also install from our weekly updated drat repo:

install.packages("drat", repos="http://ift.tt/1HUn6Br;)
drat:::addRepo("dmlc")
install.packages("xgboost", repos="http://ift.tt/1pATOzv;, type="source")

In order to run a machine learning algorithm, we need a data set first. xgboost has a demo data set about mushrooms. This data set records biological attributes of different mushroom species, and the target is to predict whether a mushroom is poisonous. Users can load the data with:

require(xgboost)

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test

Each of these objects is a list containing two components, label and data. The next step is to feed this data to xgboost. Besides the data, we need to specify some other training parameters:

  • nrounds: the number of decision trees in the final model
  • objective: the training objective to use, where "binary:logistic" means a binary classifier.

The simplest training command is as follows:

model <- xgboost(data = train$data, label = train$label,
                 nrounds = 2, objective = "binary:logistic")

## [0]  train-error:0.000614
## [1]  train-error:0.001228

We can make predictions on the test data set easily:

preds = predict(model, test$data)

Sometimes it is important to use cross-validation to examine our model. In xgboost we provide the function xgb.cv for that. Basically users can just copy the arguments from the xgboost call and specify nfold:

cv.res <- xgb.cv(data = train$data, label = train$label, nfold = 5,
                 nrounds = 2, objective = "binary:logistic")

## [0]  train-error:0.000921+0.000343   test-error:0.001228+0.000687
## [1]  train-error:0.001075+0.000172   test-error:0.001228+0.000687

To learn more details about the usage of xgboost, please visit our tutorial.

Efficient Algorithm

If you have experience training models on a large data set, then you probably agree that waiting for training to finish is boring. Time is a resource, so the training speed of a learning algorithm matters. Speed is determined by both the algorithm and the implementation. We paid attention to these issues when building xgboost, and we are confident that xgboost is one of the fastest implementations of gradient boosting. The reasons for its efficiency are:

  • The computational part is implemented in C++.
  • It can be multi-threaded on a single machine.
  • It preprocesses the data before the training algorithm.

This benchmark figure was generated with the dataset from the Higgs Boson Competition. Two aspects stand out:

  • With only one thread, the effect of preprocessing and the C++ implementation is already obvious.
  • The speed-up from multi-threading is almost linear in the number of threads, improving the efficiency further; a rough timing sketch follows below.
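For example, here is a rough timing comparison on the demo data (my own sketch, not from the original post; the demo set is small, so the exact numbers depend heavily on your machine):

system.time(xgboost(data = train$data, label = train$label, nrounds = 50,
                    objective = "binary:logistic", nthread = 1, verbose = 0))
system.time(xgboost(data = train$data, label = train$label, nrounds = 50,
                    objective = "binary:logistic", nthread = 2, verbose = 0))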

Convenient Interface

As the developers of xgboost, we are also heavy users of it, and we value the experience of working with the tool. During development, we have tried to shape the package to be user-friendly. Here are several details we would like to share; please click each title to visit the sample code.

Customized Objective

xgboost can take a customized objective, which means the model can be trained to optimize an objective defined by the user. This is not often seen in other tools, since most algorithms are bound to a specific objective. With xgboost, one can train a model that directly optimizes the measure that matters for the problem at hand.

Let's use the log-likelihood as a demo. We define the following function to calculate the first- and second-order gradients of the loss function:

loglossobj <- function(preds, dtrain) {
  # dtrain is the internal format of the training data
  # We extract the labels from the training data
  labels <- getinfo(dtrain, "label")
  # We compute the 1st and 2nd gradient, as grad and hess
  preds <- 1/(1 + exp(-preds))
  grad <- preds - labels
  hess <- preds * (1 - preds)
  # Return the result as a list
  return(list(grad = grad, hess = hess))
}
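As a quick sanity check on those formulas (my own addition, not part of the original post), the analytic gradient of the log loss with respect to the margin can be compared against a numerical derivative:

logloss <- function(m, y) {
  # log loss of a raw margin m against a 0/1 label y
  p <- 1 / (1 + exp(-m))
  -(y * log(p) + (1 - y) * log(1 - p))
}
m <- 0.7; y <- 1; eps <- 1e-6
analytic  <- 1 / (1 + exp(-m)) - y                                # preds - labels
numerical <- (logloss(m + eps, y) - logloss(m - eps, y)) / (2 * eps)
c(analytic, numerical)                                            # nearly identical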

Then we train the model as

model <- xgboost(data = train$data, label = train$label,
                 nrounds = 2, objective = loglossobj, eval_metric = "error")

## [0]  train-error:0.001228
## [1]  train-error:0.001228

The result should be equivalent to objective = "binary:logistic".

Early Stopping

A usual scenario: we are not sure how many trees we need, so we first try some number and check the result. If the number we try is too small, we need to make it larger; if it is too large, we waste time waiting for training to terminate. By setting the parameter early.stop.round, xgboost will terminate the training process early if the performance stops improving.

bst <- xgb.cv(data = train$data, label = train$label, nfold = 5,
              nrounds = 20, objective = "binary:logistic",
              early.stop.round = 3, maximize = FALSE)

## [0]  train-error:0.000921+0.000343   test-error:0.001228+0.000686
## [1]  train-error:0.001228+0.000172   test-error:0.001228+0.000686
## [2]  train-error:0.000653+0.000442   test-error:0.001075+0.000875
## [3]  train-error:0.000422+0.000416   test-error:0.000767+0.000940
## [4]  train-error:0.000192+0.000429   test-error:0.000460+0.001029
## [5]  train-error:0.000192+0.000429   test-error:0.000460+0.001029
## [6]  train-error:0.000000+0.000000   test-error:0.000000+0.000000
## [7]  train-error:0.000000+0.000000   test-error:0.000000+0.000000
## [8]  train-error:0.000000+0.000000   test-error:0.000000+0.000000
## [9]  train-error:0.000000+0.000000   test-error:0.000000+0.000000
## Stopping. Best iteration: 7

Here we are doing cross-validation. early.stop.round = 3 means that if the performance does not improve for 3 rounds, the program stops. maximize = FALSE means our goal is to minimize the evaluation metric; the default evaluation metric for binary classification is the classification error rate. We can see that even though we asked the model to train 20 trees, it stopped once the performance became perfect.

Continue Training

Sometimes we might want to run 1,000 iterations, check the result, and then decide whether we need another 1,000. Usually this second step could only be done by starting the training from the beginning again. In xgboost, users can continue training from the previous model, so the second step costs only the time of the additional iterations. The reason we are able to do this is that each tree is trained based only on the predictions of the previous trees; once we have the prediction from the current trees, we can start training the next one.

This feature involves the internal data format of xgboost, xgb.DMatrix. An xgb.DMatrix object contains the features, the target, and other side information such as weights and missing values.

First, let us define an xgb.DMatrix object for this data:

dtrain <- xgb.DMatrix(train$data, label = train$label)

Next we train the model with it:

model <- xgboost(data = dtrain, nrounds = 2, objective = "binary:logistic")

## [0]  train-error:0.000614
## [1]  train-error:0.001228

Note that the label is included in the dtrain object. Then we make predictions on the current training data:

pred_train <- predict(model, dtrain, outputmargin=TRUE)

Here the parameter outputmargin indicates that we want the raw margin, without the logistic transformation of the result.
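A quick check of my own (not in the original post): the default prediction is simply the logistic transformation of this raw margin.

p_prob <- predict(model, dtrain)                                   # default: probabilities
all.equal(p_prob, 1 / (1 + exp(-pred_train)), tolerance = 1e-6)    # should be TRUE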

Finally, we attach the previous prediction result as additional information to the object dtrain, so that the training algorithm knows where to start.

setinfo(dtrain, "base_margin", pred_train)

## [1] TRUE

Now observe how the result changes:

model <- xgboost(data = dtrain, nrounds = 2, objective = "binary:logistic")

## [0]  train-error:0.000614
## [1]  train-error:0.000614

Handle Missing Values

Missing values are commonly seen in real-world data sets. There is no single rule for handling them that applies to all cases, since there can be various reasons for values to be missing. In xgboost we choose a soft way to handle missing values: when splitting on a feature with missing values, xgboost assigns a direction to the missing values instead of an imputed numerical value. Specifically, xgboost sends all the data points with missing values to the left and to the right in turn, then chooses the direction with the higher gain with regard to the objective.

To enable this feature, simply set the parameter missing to the value used to mark missing entries. To demonstrate it, we can manually make a dataset with missing values.

dat <- matrix(rnorm(128), 64, 2)
label <- sample(0:1, nrow(dat), replace = TRUE)
for (i in 1:nrow(dat)) {
  ind <- sample(2, 1)
  dat[i, ind] <- NA
}

Then we only need to specify the missing value marker, NA, in our code:

model <- xgboost(data = dat, label = label, missing = NA,
                 nrounds = 2, objective = "binary:logistic")

## [0]  train-error:0.281250
## [1]  train-error:0.281250

In practice, the default value of missing is exactly NA, so we don't even need to specify it in the standard case.

Model Inspection

The model used by xgboost is gradient boosted trees, so a fitted model usually contains multiple trees. A typical ensemble of two trees looks like this:

bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
xgb.plot.tree(feature_names = agaricus.train$data@Dimnames[[2]], model = bst)

We can interpret the model easily: xgboost provides the function xgb.plot.tree to plot the model so that we get a direct impression of the result.

However, what if we have way more trees?

bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nthread = 2, nround = 10, objective = "binary:logistic")
xgb.plot.tree(feature_names = agaricus.train$data@Dimnames[[2]], model = bst)

Things start to get messy: we have a hard time inspecting every detail of the plot, and it is not easy to tell a story with so many conditions.

Multiple-in-one plot

In xgboost, we provide the function xgb.plot.multi.trees to merge several trees into a single plot! This function is inspired by this blog post: http://ift.tt/1RRDUtP. It relies on the following observations:

  • Almost all the trees in an ensemble model have the same shape. If the maximum depth is fixed, this holds for all the binary trees.
  • At each node position, more than one feature may appear across the trees, but we can describe the position by the frequency of each feature, i.e. a frequency table.

Here is an example of an "ensembled" tree visualization.

bst <- xgboost(data = train$data, label = train$label, max.depth = 15,
                 eta = 1, nthread = 2, nround = 30, objective = "binary:logistic",
                 min_child_weight = 50)
xgb.plot.multi.trees(model = bst, feature_names = agaricus.train$data@Dimnames[[2]], features.keep = 3)

The text in the nodes indicates the distribution of the features selected at that position. If we hover the mouse over a node, we get information about the path.

Feature Importance

If the trees are too deep, or the number of features is large, then it is still going to be difficult to find useful patterns. A simpler approach is to check feature importance instead. How do we define feature importance in xgboost?

In xgboost, each split tries to find the best feature and splitting point to optimize the objective. We can calculate the gain at each node, which is the contribution of the selected feature. In the end, we look at all the trees, sum up the contributions for each feature, and treat the total as its importance. If the number of features is large, we can also cluster the features before making the plot. Here's an example of the feature importance plot from the function xgb.plot.importance:

bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
importance_matrix <- xgb.importance(agaricus.train$data@Dimnames[[2]], model = bst)
xgb.plot.importance(importance_matrix)

Deepness

There is more than one way to understand the structure of the trees besides plotting them all. Since these are all binary trees, we can get a clear picture in mind once we know the depth of each leaf. The function xgb.plot.deepness is inspired by this blog post: http://ift.tt/1lg0Dob.

From the function xgb.plot.deepness, we can get two plots summarizing the distribution of leaves according to the change of depth in the tree.

bst <- xgboost(data = train$data, label = train$label, max.depth = 15,
                 eta = 1, nthread = 2, nround = 30, objective = "binary:logistic",
                 min_child_weight = 50)
xgb.plot.deepness(model = bst)

The upper plot shows the number of leaves per level of depth. The lower plot shows the normalized weighted cover per leaf (the weighted sum of instances). From this information, we can see that at the 5th and 6th levels there are actually not many leaves. To avoid overfitting, we could restrict the depth of the trees to a small number.

Further Readings

This blog post only gives you a glimpse of XGBoost, while there are many more exciting resources about it! Feel free to go through the list and look for whatever you like.

If you have any questions, please feel free to check out the issue forum.



from R-bloggers http://ift.tt/1QQMu9X
via IFTTT

4 Lessons for Brilliant Data Visualization

Get some pointers on data visualization from a noted expert in the field, and gain some insight into creating your own brilliant visualizations by following these 4 lessons.

from KDnuggets http://ift.tt/1SFUZJB
via IFTTT

3 Viable Ways to Extract Data from the Open Web

We look at 3 main ways to handle data extraction from the open web, along with some tips on when each one makes the most sense as a solution.

from KDnuggets http://ift.tt/1pBEjHD
via IFTTT

Tuesday, March 8, 2016

scikit-feature: Open-Source Feature Selection Repository in Python

scikit-feature is an open-source feature selection repository in Python, with around 40 popular algorithms from feature selection research. It is developed by the Data Mining and Machine Learning Lab at Arizona State University.

from KDnuggets http://ift.tt/21Hmic7
via IFTTT

R: Remove constant and identical features programmatically

##### Removing constant features
cat("\n## Removing the constant features.\n")
for (f in names(train)) {
  if (length(unique(train[[f]])) == 1) {
    cat(f, "is constant in train. We delete it.\n")
    train[[f]] <- NULL
    t...
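For the "identical features" half of the title, here is a sketch of my own in the same spirit (it assumes the same train data frame as above):

##### Removing identical (duplicated) features
dup_cols <- names(train)[duplicated(as.list(train))]
for (f in dup_cols) {
  cat(f, "is identical to an earlier column in train. We delete it.\n")
  train[[f]] <- NULL
}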

from R-bloggers http://ift.tt/1U9O777
via IFTTT