jueves, 30 de julio de 2015

Predict Social Network Influence with R and H2O Ensemble Learning

What is H2O? H2O is an awesome machine learning framework. It is really great for data scientists and business analysts “who need scalable and fast machine learning”. H2O is completely open source and what makes it important is that works right of the box. There seems to be no easier way to start with scalable

The post Predict Social Network Influence with R and H2O Ensemble Learning appeared first on ThinkToStart.



from R-bloggers http://ift.tt/1KBhv06
via IFTTT

miércoles, 29 de julio de 2015

Predicting Titanic deaths on Kaggle

Kaggle has a competition to predict who will die on the famous Titanic 'Machine Learning from Disaster''. It is placed as knowledge competition. Just up there to learn. I am late to the party, it has been been for 1 1/2 year, to end by end 2015. It is ...

from R-bloggers http://ift.tt/1MiwW0e
via IFTTT

This R Data Import Tutorial Is Everything You Need

You might find that loading data into R can be quite frustrating. Almost every single type of file that you want to get into R seems to require its own function, and even then you might get lost in the functions’ arguments. In short, it can be fairly easy to mix up things from time to […]

The post This R Data Import Tutorial Is Everything You Need appeared first on The DataCamp Blog .



from R-bloggers http://ift.tt/1Jv9o2c
via IFTTT

Choosing a Classifier

In order to illustrate the problem of chosing a classification model consider some simulated data, > n = 500 > set.seed(1) > X = rnorm(n) > ma = 10-(X+1.5)^2*2 > mb = -10+(X-1.5)^2*2 > M = cbind(ma,mb) > set.seed(1) > Z = sample(1:2,size=n,replace=TRUE) > Y = ma*(Z==1)+mb*(Z==2)+rnorm(n)*5 > df = data.frame(Z=as.factor(Z),X,Y) A first strategy is to split the dataset in two parts, a training dataset, and a testing dataset. > df1 = training = df[1:300,] > df2 = testing = df[301:500,] The Holdout Method: … Continue reading Choosing a Classifier

from R-bloggers http://ift.tt/1TPDOne
via IFTTT

Installing and Starting SparkR Locally on Windows OS and RStudio

Introduction With the recent release of Apache Spark 1.4.1 on July 15th, 2015, I wanted to write a step-by-step guide to help new users get up and running with SparkR locally on a Windows machine using command shell and RStudio. SparkR provides an R frontend to Apache Spark and using Spark’s distributed computation engine allows […]

from R-bloggers http://ift.tt/1JGWLRR
via IFTTT

Predicting Titanic deaths on Kaggle II: gbm

Following my previous post I have decided to try and use a different method: generalized boosted regression models (gbm). I have read the background in Elements of Statistical Learning and arthur charpentier's nice post on it. This data ...

from R-bloggers http://ift.tt/1eqZndJ
via IFTTT

Tiny Data, Approximate Bayesian Computation and the Socks of Karl Broman: The Movie

This is a screencast of my UseR! 2015 presentation: Tiny Data, Approximate Bayesian Computation and the Socks of Karl Broman. Based on the original blog post it is a quick’n’dirty introduction to approximate Bayesian computation (and is also, in a sense, an introduction to Bayesian statistics in general). Here it is, if you have 15 minutes to spare:

Regarding Quick’n’Dirty

The video is short and makes a lot of simplifications/omissions, some which are:

  • There are not just one, but many many different algorithms for doing approximate Bayesian computation, where the algorithm outlined in the video is called ABC rejection sampling. What make these approximate Bayesian computational methods (and not just Bayesian computational methods) is that they require, what I have called, a generative model and an acceptance criterion. What I call standard Bayesian computation in the video (but which is normally just called Bayesian computation) instead requires a function that calculates the likelihood given the data and some fixed parameter values.
  • I mention in the video that approximate Bayesian computation is the slowest way you can fit a statistical model, and for many common statistical models this is the case. However for some models it might be very expensive to evaluate the likelihood, and in that case approximate Bayesian computation can actually be faster. As usual, it all depends on the context…
  • I mention “drawing random parameter values from the prior”, or something similar, in the video. Invoking “randomness” always makes me a bit uneasy, and I just want to mention that the purpose of “drawing random parameters” is just to get a vector/list of parameter values that is a good enough representation of the prior probability distribution. It just happens to be the case that random number generators (like rnbinom and rbeta) are a convenient way of creating such representative distributions.

For a Slower’n’Cleaner introduction to approximate Bayesian computation I would actually recommend the Wikipedia page, which is pretty good!



from R-bloggers http://ift.tt/1LJIee1
via IFTTT

miércoles, 15 de julio de 2015

Using R to analyze pro motorcycle racing

EMC recently ran a competion to find out why John McGuinness, the legendary motorcycle racer known as the "Morecambe Missile", is outperforms the average motorcycle racer. To answer this question, EMC instrumented his bike and his suit with a number of real-time sensors. (Data collected included gear and RPM for the bike, and heart rate and acceleration for the rider.) They did the same for Adam Child, a motorcycle racing journalist who acted as a "control" in this study. The two motorcyclists then completed 10 laps of Circuit Monteblanco in Spain, and the sensor data was provided for analysis in...

from R-bloggers http://ift.tt/1O31WPf
via IFTTT

Easy Bayesian Bootstrap in R

A while back I wrote about how the classical non-parametric bootstrap can be seen as a special case of the Bayesian bootstrap. Well, one difference between the two methods is that, while it is straightforward to roll a classical bootstrap in R, there is no easy way to do a Bayesian bootstrap. This post, in an attempt to change that, introduces a bayes_boot function that should make it pretty easy to do the Bayesian bootstrap for any statistic in R. If you just want a function you can copy-n-paste into R go to The bayes_boot function below. Otherwise here is a quick example of how to use the function, followed by some details on the implementation.

A quick example

So say you scraped the heights of all the U.S. Presidents off Wikipedia (american_presidents.csv) and you want to run a Bayesian bootstrap analysis on the mean height of U.S. Presidents (don’t ask me why you would want to do this). Then, using the bayes_boot function found below, you can run the following:

presidents <- read.csv("american_presidents.csv")
bb_mean <- bayes_boot(presidents$height_cm, mean, n1 = 1000)

Here is how to get a 95% credible interval:

quantile(bb_mean, c(0.025, 0.975))

## 2.5% 97.5% 
## 177.8 181.8

And, of course, we can also plot this:

(Here, and below, I will save you from the slightly messy plotting code, but if you really want to see it you can check out the full script here.)

Now, say we want run a linear regression on presidential heights over time, and we want to use the Bayesian bootstrap to gauge the uncertainty in the regression coefficients. Then we will have to do a little more work, as the second argument to bayes_boot should be a function that takes the data as the first argument and that returns a vector of parameters/coefficients:

bb_linreg <- bayes_boot(presidents, function(data) {
  lm(height_cm ~ order, data)$coef
}, n1 = 1000)

Ok, so it is not really over time, as we use the order of the president as the predictor variable, but close enough. Again, we can get a 95% credible interval of the slope:

quantile(bb_linreg$order, c(0.025, 0.975))

## 2.5% 97.5% 
## 0.03979 0.34973

And here is a plot showing the mean posterior regression line with a smatter of lines drawn from the posterior to visualize the uncertainty:

Given the model and the data, the average height of American presidents increases by around 0.2 cm for each president elected to office. So, either we have that around the 130th president the average height of presidents will be around 2 meters (≈ 6’7’’), or perhaps a linear regression isn’t really a reasonable model here… Anyhow, it was easy to do the Bayesian bootstrap! :)

How to implement a Bayesian bootstrap in R

It is possible to characterize the statistical model underlying the Bayesian bootstrap in a couple of different ways, but all can be implemented by the same computational procedure:

To generate a Bayesian bootstrap sample of size n1, repeat the following n1 times:

  1. Draw weights from a uniform Dirichlet distribution with the same dimension as the number of data points.
  2. Calculate the statistic, using the Dirichlet draw to weight the data, and record it.

1. Drawing weights from a uniform Dirichlet distribution

One way to characterize drawing from an n-dimensional uniform Dirichlet distribution is as drawing a vector of length n where the values are positive, sum to 1.0, and where any combination of values is equally likely. Another way to characterize a uniform Dirichlet distribution is as a uniform distribution over the unit simplex, where a unit simplex is a generalization of a triangle to higher dimensions, with sides that are 1.0 long (hence the unit). The figure below pictures the one, two, three and four-dimensional unit simplex:


Image source: Introduction to Discrete Differential Geometry by Peter Schröder

Drawing from an n-dimensional uniform Dirichlet distribution can be done by drawing $text{Gamma(1,1)}$ distributed numbers and normalizing these to sum to 1.0 (source). As a $text{Gamma(1,1)}$ distribution is the same as an $text{Exponential}(1)$ distribution, the following two lines of R code implements drawing n1 draws from an n dimensional uniform Dirichlet distribution:

dirichlet_sample <- matrix( rexp(n * n1, 1) , ncol = n, byrow = TRUE)
dirichlet_sample <- dirichlet_sample / rowSums(dirichlet_sample)

With n <- 4 and n1 <- 3 you could, for example, get:

## [,1] [,2] [,3] [,4]
## [1,] 0.61602 0.06459 0.2297 0.08973
## [2,] 0.05384 0.12774 0.4685 0.34997
## [3,] 0.17419 0.42458 0.1649 0.23638

2. Calculate the statistic using a Dirichlet draw to weight the data

Here is where, if you were doing a classical non-parametric bootstrap, you would use your resampled data to calculate a statistic (say a mean). Instead, we will want to calculate our statistic of choice using the Dirichlet draw to weight the data. This is completely straightforward if the statistic can be calculated using weighted data, which is the case for weighted.mean(x, w) and lm(..., weights). For the many statistics that do not accept weights, such as median and cor, we will have to perform a second sampling step where we (1) sample from the data according to the probabilities defined by the Dirichlet weights, and (2) use this resampled data to calculate the statistic. It is important to notice that we here want to draw an as large sample as possible from the data, and not a sample of the same size as the original data. The point is that the proportion of times a datapoint occurs in this resampled dataset should be roughly proportional to that datapoint’s weight.

Note that doing this second resampling step won’t work if the statistic changes with the sample size! An example of such a statistic would be the sample standard deviation (sd), population standard deviation would be fine, however

Bringing it all together

Below is a small example script that takes the presidents dataset and does a Bayesian Bootstrap analysis of the median height. Here n1 is the number of bootstrap draws and n2 is the size of the resampled data used to calculate the median for each Dirichlet draw.

n1 <- 3000
n2 <- 1000
n_data <- nrow(presidents)
# Generating a n1 by n_data matrix where each row is an n_data dimensional
# Dirichlet draw.
weights <- matrix( rexp(n_data * n1, 1) , ncol = n_data, byrow = TRUE)
weights <- weights / rowSums(weights)

bb_median <- rep(NA, n1)
for(i in 1:n1) {
  data_sample <- sample(presidents$height_cm, size = n2, replace = TRUE, prob = weights[i,])
  bb_median[i] <- median(data_sample)
}

# Now bb_median represents the posterior median height, and we can do all
# the usual stuff, like calculating a 95% credible interval.
quantile(bb_median, c(0.025, 0.975))

## 2.5% 97.5% 
## 178 183

If we were interested in the mean instead, we could skip resampling the data and use the weights directly, like this:

bb_mean <- rep(NA, n1)
for(i in 1:n1) {
  bb_mean[i] <- weighted.mean(presidents$height_cm, w = weights[i,])
}
quantile(bb_mean, c(0.025, 0.975))

## 2.5% 97.5% 
## 177.8 181.9

If possible, you will probably want to use the weight method; it will be much faster as you skip the costly resampling step. What size of the bootstrap samples (n1) and size of the resampled data (n2) to use? The boring answers are: “As many as you can afford” and “Depends on the situation”, but you’ll probably want at least 1000 of each.

The bayes_boot function

Here follows a handy function for running a Bayesian bootstrap that you can copy-n-paste directly into your R-script. It should accept any type of data that comes as a vector, matrix or data.frame and allows you to use both statistics that can deal with weighted data (like weighted.mean) and statistics that don’t (like median). See above and below for examples of how to use it.

Caveat: While I have tested this function for bugs, do keep an eye open and tell me if you find any. Again, note that doing the second resampling step (use_weights = FALSE) won’t work if the statistic changes with the sample size!

# Performs a Bayesian bootstrap and returns a sample of size n1 representing the
# posterior distribution of the statistic. Returns a vector if the statistic is
# one-dimensional (like for mean(...)) or a data.frame if the statistic is
# multi-dimensional (like for the coefs. of lm).
# Parameters
# data The data as either a vector, matrix or data.frame.
# statistic A function that accepts data as its first argument and possibly
# the weights as its second, if use_weights is TRUE. 
# Should return a numeric vector.
# n1 The size of the bootstrap sample.
# n2 The sample size used to calculate the statistic each bootstrap draw.
# use_weights Whether the statistic function accepts a weight argument or
# should be calculated using resampled data.
# weight_arg If the statistic function includes a named argument for the
# weights this could be specified here.
# ... Further arguments passed on to the statistic function.
bayes_boot <- function(data, statistic, n1 = 1000, n2 = 1000 , use_weights = FALSE, weight_arg = NULL, ...) {
  # Draw from a uniform Dirichlet dist. with alpha set to rep(1, n_dim).
  # Using the facts that you can transform gamma distributed draws into 
  # Dirichlet draws and that rgamma(n, 1) <=> rexp(n, 1)
  dirichlet_weights <- matrix( rexp(NROW(data) * n1, 1) , ncol = NROW(data), byrow = TRUE)
  dirichlet_weights <- dirichlet_weights / rowSums(dirichlet_weights)

  if(use_weights) {
    stat_call <- quote(statistic(data, w, ...))
    names(stat_call)[3] <- weight_arg
    boot_sample <- apply(dirichlet_weights, 1, function(w) {
      eval(stat_call)
    })
  } else {
    if(is.null(dim(data)) || length(dim(data)) < 2) { # data is a list type of object
      boot_sample <- apply(dirichlet_weights, 1, function(w) {
        data_sample <- sample(data, size = n2, replace = TRUE, prob = w)
        statistic(data_sample, ...)
      })
    } else { # data is a table type of object
      boot_sample <- apply(dirichlet_weights, 1, function(w) {
        index_sample <- sample(nrow(data), size = n2, replace = TRUE, prob = w)
        statistic(data[index_sample, ,drop = FALSE], ...)
      })
    }
  }
  if(is.null(dim(boot_sample)) || length(dim(boot_sample)) < 2) {
    # If the bootstrap sample is just a simple vector return it.
    boot_sample
  } else {
    # Otherwise it is a matrix. Since apply returns one row per statistic
    # let's transpose it and return it as a data frame.
    as.data.frame(t(boot_sample))
  }
}

More examples using bayes_boot

Let’s start by drawing some fake data from an exponential distribution with mean 1.0 and compare using the following methods to infer the mean:

  • The classical non-parametric bootstrap using boot from the boot package.
  • Using bayes_boot with “two level sampling”, that is, sampling both weights and then resampling the data according to those weights.
  • Using bayes_boot with weights (use_weights = TRUE)
  • Assuming an exponential distribution (the “correct” distribution since we know where the data came from), with a flat prior over the mean.

First generating some data:

set.seed(1337)
exp_data <- rexp(8, rate = 1)
exp_data

## [1] 0.15 0.13 2.26 0.92 0.17 1.55 0.13 0.02

Then running the four different methods:

library(boot)
b_classic <- boot(exp_data, function(x, i) { mean(x[i])}, R = 10000)
bb_sample <- bayes_boot(exp_data, mean, n1 = 10000, n2 = 1000)
bb_weight <- bayes_boot(exp_data, weighted.mean, n1 = 10000, use.weights = TRUE, weight_arg = "w")

# Just a hack to sample from the posterior distribution when 
# assuming an exponential distribution with a Uniform(0, 10) prior
prior <- seq(0.001, 10, 0.001)
post_prob <- sapply(prior, function(mean) { prod(dexp(exp_data, 1/mean)) })
post_samp <- sample(prior, size = 10000, replace = TRUE, prob = post_prob)

Here are the resulting posterior/sampling distributions:

This was mostly to show off the syntax of bayes_boot, but some things to point out in the histograms above are that:

  • Using the Bayesian bootstrap with two level sampling or weights result in very similar posterior distributions, which should be the case when the size of the resampled data is large (here set to n2 = 1000).
  • The classical non-parametric bootstrap is pretty similar to the Bayesian bootstrap (as we would expect).
  • The bootstrap distributions are somewhat similar to the posterior mean assuming an exponential distribution, but completely misses out on the uncertainty in the right tail. This is due to the “somewhat peculiar model assumptions” of the bootstrap as critiqued by Rubin (1981)

Finally, a slightly more complicated example, where we do Bayesian bootstrap analysis of LOESS regression applied to the cars dataset on the speed of cars and the resulting distance it takes to stop. The loess function returns, among other things, a vector of fitted y values, one value for each x value in the data. These y values define the smoothed LOESS line and is what you would usually plot after having fitted a LOESS. Now we want to use the Bayesian bootstrap to gauge the uncertainty in the LOESS line. As the loess function accepts weighted data, we’ll simply create a function that takes the data with weights and returns the fitted y values. We’ll then plug that function into bayes_boot:

boot_fn <- function(cars, weights) {
  loess(dist ~ speed, cars, weights = weights)$fitted
}

bb_loess <- bayes_boot(cars, boot_fn, n1 = 1000, use_weights = TRUE, weight_arg = "weights")

To plot this takes a couple of lines more:

# Plotting the data
plot(cars$speed, cars$dist, pch = 20, col = "tomato4", xlab = "Car speed in mph",
     ylab = "Stopping distance in ft", main = "Speed and Stopping distances of Cars")

# Plotting a scatter of Bootstrapped LOESS lines to represent the uncertainty.
for(i in sample(nrow(bb_loess), 20)) {
  lines(cars$speed, bb_loess[i,], col = "gray")
}
# Finally plotting the posterior mean LOESS line
lines(cars$speed, colMeans(bb_loess, na.rm = TRUE), type ="l",
      col = "tomato", lwd = 4)

Fun fact: The cars dataset is from the 20s! Which explains why the fastest car travels at 25 mph. It would be interesting to see a comparison with stopping times for modern cars!

References

Rubin, D. B. (1981). The Bayesian Bootstrap. The annals of statistics, 9(1), 130-134. pdf



from R-bloggers http://ift.tt/1I0QPIs
via IFTTT

Seeing Data as the Product of Underlying Structural Forms

Matrix factorization follows from the realization that nothing forces us to accept the data as given. We start with objects placed in rows and record observations on those objects arrayed along the top in columns. Neither the objects nor the measuremen...

from R-bloggers http://ift.tt/1RzWB8v
via IFTTT

miércoles, 8 de julio de 2015

Sports Data and R – Scope for a Thematic (Rather than Task) View? (Living Post)

Via my feeds, I noticed a package announcement today for cricketR!, a new package for analysing cricket performance data. This got me wondering (again!) about what other sports related packages there might be out there, either in terms of functional thematic packages (to do with sport in general, or one sport in particular), or particular […]

from R-bloggers http://ift.tt/1CnfYKR
via IFTTT

Time series outlier detection (a simple R function)

(By Andrea Venturini) Imagine you have a lot of time series – they may be short ones – related to a lot of different measures and very little time to find outliers. You need something not too sophisticated to solve quickly the mess. This is – very shortly speaking – the typical situation in which you can adopt washer.AV() function in R language. In this linked document (washer) you have the function and an example of actual application in R language:  a data.frame (dati) with temperature and rain (phen) measures (value) in 4 periods of time (time) and in 20 geographical zones (zone). (20*4*2=160  arbitrary observations). > dati           phen time zone value 1   Temperature    1  a01   2.0 2   Temperature    1  a02  20.0 … … 160        Rain    4  a20   8.5   The example of 20 meteorological stations measuring rainfall and temperature is useful to understand in which situation you can implement the washer() methodology. This methodology considers only 3 observations in a group of time series, for instance all 20 terns between time 2 and 4: if the their shape is similar between each other than no outlier will be detected, otherwise – as it happens to the orange time […]

from R-bloggers http://ift.tt/1HN6OYJ
via IFTTT

viernes, 3 de julio de 2015

A Tutorial on Writing Simulation Apps in Shiny

I stand in awe of the R wizards who post their latest and greatest Shiny feats on R-Bloggers. It’s taken me a while to find my way around Shiny, but at last I feel ready to fill in a few gaps for others who may be just starting out and who aspire to write reasonably involved, user-friendly simulation apps. To make it a bit more fun I’ve written it up as an R Markdown document hosted on shinyapps.io: here you go.

Prerequisites are indicated in the tutorial; so are links to the source code. If you have feedback or suggestions for improvements, comment here or post an Issue on the GitHub site.



from R-bloggers http://ift.tt/1GVlhgj
via IFTTT

3-step lesson, going into the life of machine learning

Automatic Machine Learning Introduction "I want to develop a model that automatically learns over time", a really challenging objective. We'll develop in this post a procedure that loads data, build a model, make predictions and, if something chang...

from R-bloggers http://ift.tt/1HAJgb3
via IFTTT

Review: Machine Learning with R Cookbook

"Machine Learning with R Cookbook" by Chiu Yu-Wei is nothing more or less than it purports to be: a collection of 110 recipes for applying Data Analysis and Machine Learning techniques in R. I was asked by the publishers to review this book and found it to be an interesting and informative read. It will […]

The post Review: Machine Learning with R Cookbook appeared first on Exegetic Analytics.



from R-bloggers http://ift.tt/1LZD2zj
via IFTTT

jueves, 2 de julio de 2015

Variable Selection using Cross-Validation (and Other Techniques)

A natural technique to select variables in the context of generalized linear models is to use a stepŵise procedure. It is natural, but contreversial, as discussed by Frank Harrell  in a great post, clearly worth reading. Frank mentioned about 10 points against a stepwise procedure. It yields R-squared values that are badly biased to be high. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution. The method yields confidence intervals for effects and predicted values that are … Continue reading Variable Selection using Cross-Validation (and Other Techniques)

from R-bloggers http://ift.tt/1FU0Q2D
via IFTTT

miércoles, 1 de julio de 2015

How Airbnb built a data science team

From Venturebeat: Back then we knew so little about the business that any insight was groundbreaking; data infrastructure was fast, stable, and real-time (I was querying our production MySQL database); the company was so small that everyone was in the loop about every decision; and the data team (me) was aligned around a singular set

from Simply Statistics http://ift.tt/1NvtRYt
via IFTTT