Tuesday, 29 September 2015

Using Linear Regression to Predict Energy Output of a Power Plant

In this article, I will show you how to fit a linear regression to predict the energy output at a Combined Cycle Power Plant (CCPP). The dataset, obtained from the UCI Machine Learning Repository, contains five columns: Ambient Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH), Exhaust Vacuum (EV), and net hourly […]
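As a rough illustration of the kind of fit the excerpt describes, here is a minimal sketch using lm on simulated data. The data are made up, the response name PE and all coefficients are assumptions for illustration only, and nothing here is taken from the original article or the UCI dataset.

```r
# Simulated stand-in for the CCPP data. Predictor names follow the excerpt;
# the response name PE and every coefficient below are hypothetical.
set.seed(1)
n <- 200
ccpp <- data.frame(
  AT = runif(n, 2, 37),      # ambient temperature
  AP = runif(n, 990, 1035),  # ambient pressure
  RH = runif(n, 25, 100),    # relative humidity
  EV = runif(n, 25, 82)      # exhaust vacuum
)
ccpp$PE <- 450 - 2 * ccpp$AT + 0.05 * ccpp$AP - 0.15 * ccpp$RH -
  0.25 * ccpp$EV + rnorm(n, sd = 4)

# Fit the linear regression and inspect the estimated coefficients
fit <- lm(PE ~ AT + AP + RH + EV, data = ccpp)
summary(fit)$coefficients

# Predict the output for a new (made-up) observation
new_obs <- data.frame(AT = 20, AP = 1010, RH = 60, EV = 50)
pred <- predict(fit, newdata = new_obs)
pred
```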

from R-bloggers

Why Big Data? Learning Curves

by Bob Horton, Microsoft Senior Data Scientist. Learning curves are an elaboration of the idea of validating a model on a test set, and have been widely popularized by Andrew Ng’s Machine Learning course on Coursera. Here I present a simple simulation that illustrates this idea. Imagine you use a sample of your data to train a model, then use the model to predict the outcomes on data where you know what the real outcome is. Since you know the “real” answer, you can calculate the overall error in your predictions. The error on the same data set used to...
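The idea in this excerpt can be sketched in a few lines of base R. This is not the author's simulation: the data-generating function, the polynomial model and the training sizes are all assumptions chosen for illustration. For each training size it fits a model, then records the error on the training data and on a fixed held-out test set.

```r
set.seed(42)

# Hypothetical data-generating process: a noisy sine curve
simulate <- function(n) {
  x <- runif(n, -2, 2)
  data.frame(x = x, y = sin(2 * x) + rnorm(n, sd = 0.3))
}

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

test_set <- simulate(2000)           # held-out data with known outcomes
train_sizes <- c(20, 50, 100, 500, 2000)

# One row per training size: error on the training data vs. the test data
learning_curve <- do.call(rbind, lapply(train_sizes, function(n) {
  train <- simulate(n)
  fit <- lm(y ~ poly(x, 5), data = train)
  data.frame(
    n = n,
    train_rmse = rmse(train$y, fitted(fit)),
    test_rmse = rmse(test_set$y, predict(fit, newdata = test_set))
  )
}))
print(learning_curve)
```

Plotting train_rmse and test_rmse against n gives the learning curve; the gap between the two columns is what the excerpt refers to when it contrasts training error with error on data the model has not seen.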

from R-bloggers

Welcome to Vancouver, the City that Never Plays Itself

Vancouver, Canada is the third-biggest film production city in North America after LA and New York, yet it never plays itself.

from TwistedSifter

Monday, 28 September 2015

Shiny CRUD App

In this post, we write a shiny app that lets you display and modify data that is stored in a database table. Shiny and Databases Everybody loves Shiny, and rightly so. It lets you publish reproducible research, brings R applications to non-R users, and can even serve as a general purpose GUI for R code. […]

The post Shiny CRUD App appeared first on ipub.

from R-bloggers

Friday, 25 September 2015

How a Beginner Used Small Projects To Get Started in Machine Learning and Compete on Kaggle

It is valuable to get insight into how real people are getting started in machine learning. In this post you will discover how a beginner (just like you) got started and is making great progress in applying machine learning. I find interviews like this absolutely fascinating because of all of the things you can learn. […]

The post How a Beginner Used Small Projects To Get Started in Machine Learning and Compete on Kaggle appeared first on Machine Learning Mastery.

from Machine Learning Mastery

Wednesday, 23 September 2015

How do you know if your model is going to work? Part 4: Cross-validation techniques

by John Mount and Nina Zumel. In this article we conclude our four-part series on basic model testing. When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it's better than the models that you rejected? In this concluding Part 4 of our four-part mini-series "How do you know if your model is going to work?" we demonstrate cross-validation techniques. Previously we worked on: Part 1: The problem; Part 2: In-training set measures; Part 3: Out of sample...
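For readers who have not seen cross-validation before, here is a minimal k-fold sketch in base R. It is not the authors' code: the simulated data, the linear model and the choice of five folds are assumptions made for illustration.

```r
set.seed(7)

# Hypothetical data: a linear signal plus noise
n <- 150
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

k <- 5
# Assign every row to one of k folds at random (equal fold sizes here)
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
cv_rmse <- sapply(seq_len(k), function(i) {
  fit <- lm(y ~ x1 + x2, data = d[fold != i, ])
  held_out <- d[fold == i, ]
  sqrt(mean((held_out$y - predict(fit, newdata = held_out))^2))
})

cv_rmse        # one out-of-sample error estimate per fold
mean(cv_rmse)  # the cross-validated estimate
```

Every row is used for scoring exactly once and never by a model that saw it during fitting, which is what makes the averaged error an out-of-sample estimate.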

from R-bloggers

How do you know if your model is going to work?

Authors: John Mount and Nina Zumel. Our four-part article series collected into one piece. Part 1: The problem; Part 2: In-training set measures; Part 3: Out of sample procedures; Part 4: Cross-validation techniques. “Essentially, all models are wrong, but some are useful.” (George Box) Here’s a caricature of a data …

from R-bloggers

Fitting a neural network in R; neuralnet package

Neural networks have always been one of the most fascinating machine learning models in my opinion, not only because of the fancy backpropagation algorithm, but also because of their complexity (think of deep learning with many hidden layers) and structure inspired by the brain. Neural networks have not always been popular, partly because they were, […]

from R-bloggers

Tuesday, 22 September 2015

EARL London 2015: Our Highlights

We were overwhelmed by the positive comments from attendees at last week’s EARL conference in London. We are in the process of collecting survey responses from all delegates, but in the meantime a quick straw poll at Mango …

from R-bloggers

How do you know if your model is going to work? Part 4: Cross-validation techniques

Authors: John Mount and Nina Zumel. In this article we conclude our four-part series on basic model testing. When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that …

from R-bloggers

Monday, 21 September 2015

The Leek group guide to writing your first paper

“The @jtleek guide to writing your first academic paper” (Stephen Turner, @genetics_blog, September 17, 2015). I have written guides on reviewing papers, sharing data, and writing R packages. One thing I haven't touched on until now has been writing papers. Certainly for me, and I think for a lot of students, the hardest …

from Simply Statistics

Free online Data Science and Machine Learning course starts Sep 24

Microsoft is sponsoring another free MOOC starting on September 24: Data Science and Machine Learning Essentials. This course provides a five-week introduction to machine learning and data science concepts, including the open-source programming tools for data science: R and Python. (Read more about the course in this post on TechNet.) The course is organized into five weekly modules, each concluding with a quiz (and, if you wish, you can purchase a verified certificate from edX to show off your passing grade). The course is presented by Cynthia Rudin (Professor of Statistics at MIT) and Steve Elston (author of Data Science in...

from R-bloggers

Friday, 18 September 2015

From functional programming to MapReduce in R

The MapReduce paradigm has long been a staple of big data computational strategies. However, properly leveraging MapReduce can be a …
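To make the connection between functional programming and MapReduce concrete, here is a small word-count sketch using base R's Map and Reduce. It is not taken from the linked post; the two example documents and the helper merge_counts are hypothetical.

```r
# Two tiny made-up documents
docs <- c("big data is big", "data science uses data")

# Map step: turn each document into a named vector of word counts
mapped <- Map(function(doc) {
  counts <- table(strsplit(doc, " ")[[1]])
  setNames(as.integer(counts), names(counts))
}, docs)

# Reduce step: merge two count vectors by summing the counts per word
merge_counts <- function(a, b) {
  words <- union(names(a), names(b))
  total <- setNames(integer(length(words)), words)
  total[names(a)] <- total[names(a)] + a
  total[names(b)] <- total[names(b)] + b
  total
}

word_counts <- Reduce(merge_counts, mapped)
word_counts[["data"]]  # 3: once in the first document, twice in the second
```

Because merge_counts is associative, the partial results could be combined in any grouping, which is the property that lets a MapReduce framework spread the reduce step across machines.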


from R-bloggers

Thursday, 17 September 2015

Hypothesis Driven Development Part IV: Testing The Barroso/Santa Clara Rule

This post will deal with applying the constant-volatility procedure written about by Barroso and Santa Clara in their paper “Momentum …

from R-bloggers

Searching for duplicate resource names in PMC article titles

I enjoyed this article by Keith Bradnam, and the associated tweets, on the problem of duplicated names for bioinformatics software. I figured that to some degree at least, we should be able to search for such instances, since the titles of published articles that describe software often follow a particular pattern. There may even be […]

from R-bloggers

Wednesday, 16 September 2015

Philosophy Graduate to Machine Learning Practitioner (an interview with Brian Thomas)

Getting started in machine learning can be frustrating. There’s so much to learn that it feels overwhelming, so much so that many developers interested in machine learning never get started. The idea of creating models on ad hoc datasets and entering a Kaggle competition sounds exciting but feels like a far-off goal. So how did a Philosophy graduate get started in machine learning? […]

The post Philosophy Graduate to Machine Learning Practitioner (an interview with Brian Thomas) appeared first on Machine Learning Mastery.

from Machine Learning Mastery

Tuesday, 15 September 2015

How do you know if your model is going to work? Part 3: Out of sample procedures

Authors: John Mount and Nina Zumel. When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 3 of our four-part mini-series “How do …

from R-bloggers

Monday, 14 September 2015

How to perform a Logistic Regression in R

Logistic regression is a method for fitting a regression curve, y = f(x), when y is a categorical variable. The typical use of this model is predicting y given a set of predictors x. The predictors can be continuous, categorical or a mix of both. The categorical variable y, in general, can assume different values. […]
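A minimal sketch of the binary case, using glm with a binomial family on simulated data. The variable names, coefficients and sample size are assumptions for illustration and do not come from the linked post. It uses one continuous and one categorical predictor, matching the "mix of both" mentioned above.

```r
set.seed(3)

# Hypothetical data: one continuous and one categorical predictor
n <- 300
d <- data.frame(
  age = rnorm(n, mean = 40, sd = 10),
  group = factor(sample(c("a", "b"), n, replace = TRUE))
)
# True success probability on the logit scale (made-up coefficients)
p <- plogis(-4 + 0.1 * d$age + 0.8 * (d$group == "b"))
d$y <- rbinom(n, size = 1, prob = p)

# Fit the logistic regression
fit <- glm(y ~ age + group, data = d, family = binomial(link = "logit"))
coef(fit)

# Predicted probability that y = 1 for a new observation
new_x <- data.frame(age = 50, group = factor("b", levels = levels(d$group)))
prob <- predict(fit, newdata = new_x, type = "response")
prob
```

Using type = "response" returns a probability; the default type = "link" would return the linear predictor on the log-odds scale instead.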

from R-bloggers

Saturday, 12 September 2015

Recommendation Systems in R

These systems are used in cross-selling industries, and they measure correlated items as well as their user rate. This last point wasn't included in the apriori algorithm (or association rules), used in market basket analysis. The link: http://blog.yha...

from R-bloggers

Tuesday, 8 September 2015

First year books

I had to read a lot of books in graduate school. Some were life-changing, and others were forgettable.

If I could bring a reading list back in time for my ‘first year’ graduate self, it would include the following:

Bayesian Data Analysis

Third Edition, by Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin

Probably the most useful book I’ve ever owned.

The Art of R Programming

by Norman Matloff

This book made me less bad at programming in R.


Causality

Second Edition, by Judea Pearl

Ecology is complicated. We often lack replicated controlled experiments with random treatment assignment. This book helps with that.

Statistics for Spatio-Temporal Data

by Noel Cressie and Christopher Wikle

A thoughtful treatment of hierarchical modeling in a spatial, temporal, and spatiotemporal context. Has breadth with a healthy dose of outside references for depth.

Ecological Models and Data in R

by Benjamin M. Bolker

Covers fundamental ideas about likelihood and process-oriented modeling while building R proficiency.

Bayesian Models: A Statistical Primer for Ecologists

by N. Thompson Hobbs & Mevin B. Hooten

An introduction to the process of model building and estimation for non-math/stats oriented readers.

Data Analysis Using Regression and Multilevel/Hierarchical Models

by Andrew Gelman and Jennifer Hill

A gentle introduction to multilevel modeling, with plenty of graphics and integration with R.

Statistical Inference

Second Edition, by George Casella and Roger L. Berger

Essential for understanding the mathematical and probabilistic foundations of statistics. Read it after brushing up on calculus.

Linear Algebra

by George Shilov

I wish I had taken a class in linear algebra as an undergraduate, but I instead had to catch up in my first year of grad school. This book made it relatively painless.

Single and Multivariable Calculus

by David Guichard and friends

Because I took a few calculus classes in high school and college and didn’t know why.

Mathematical Tools for Understanding Infectious Disease Dynamics

by Odo Diekmann, Hans Heesterbeek & Tom Britton

Mathematical epidemiology is a huge topic. This book introduces common models and approaches from first principles, with plenty of problems along the way to make sure you’re following along. Read it with a notebook and pencil handy.

from R-bloggers

Monday, 7 September 2015

How do you know if your model is going to work? Part 2: In-training set measures

Authors: John Mount and Nina Zumel. When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it’s better than the models that you rejected? In this Part 2 of our four-part mini-series “How do …

from R-bloggers

How do you know if your model is going to work? Part 1: The problem

Authors: John Mount and Nina Zumel. “Essentially, all models are wrong, but some are useful.” (George Box) Here’s a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, …

from R-bloggers

Sunday, 6 September 2015

How to Use a Machine Learning Checklist to Get Accurate Predictions, Reliably (even if you are a beginner)

How do you get accurate results using machine learning on problem after problem? The difficulty is that each problem is unique, requiring different data sources, features, algorithms, algorithm configurations and on and on. The solution is to use a checklist that guarantees a good result every time. In this post you will discover a checklist […]

The post How to Use a Machine Learning Checklist to Get Accurate Predictions, Reliably (even if you are a beginner) appeared first on Machine Learning Mastery.

from Machine Learning Mastery

Wednesday, 2 September 2015

Logistic Regression in R – Part One

Please note that an earlier version of this post had to be retracted because it contained some content which was generated at work. I have since chosen to rewrite the document in a series of posts. Please recognize that this may take some time. Apologies for any inconvenience. Logistic regression is used to analyze the […]

from R-bloggers

Tuesday, 1 September 2015

Bayesian regression models using Stan in R

It seems the summer is coming to an end in London, so I shall take a final look at my ice cream data that I have been playing around with to predict sales statistics based on temperature for the last couple of weeks [1], [2], [3].

Here I will use the new brms (GitHub, CRAN) package by Paul-Christian Bürkner to derive the 95% prediction credible interval for the four models I introduced in my first post about generalised linear models. Additionally, I am interested in predicting how much ice cream I should hold in stock for a hot day at 35ºC, such that I only run out of ice cream with a probability of 2.5%.

Stan models with brms

As in my previous post about the log-transformed linear model with Stan, I will use Bayesian regression models to estimate the 95% prediction credible interval from the posterior predictive distribution.

Thanks to brms this will take less than a minute of coding, because brm allows me to specify my models in the usual formula syntax and I can leave it to the package functions to create and execute the Stan files.

Let's start. Here is the data again:

My models are written down in very much the same way as with glm. Only the binomial model requires a slightly different syntax. Here I use the default priors and link functions:

Last week I wrote the Stan model for the log-transformed linear model myself. Here is the output of brm. The estimated parameters are quite similar, apart from σ:

I access the underlying Stan model via log.lin.mod$model and note that the prior of σ is modelled via a Cauchy distribution, unlike the inverse Gamma I used last week. I believe that explains my small difference in σ.

To review the model I start by plotting the trace and density plots for the MCMC samples.

Prediction credible interval

The predict function gives me access to the posterior predictive statistics, including the 95% prediction credible interval.

Combining the outputs of all four models into one data frame then gives me the opportunity to compare the prediction credible intervals of the four models in one chart.

There is no big news regarding the four models, except that here I can look at the posterior prediction credible intervals, rather than the theoretical distributions from two weeks ago. The over-prediction of the log-transformed linear model is apparent again.

How much stock should I hold on a hot day?

Running out of stock on a hot summer's day would be unfortunate, because those are the days when sales will be highest. But how much stock should I hold?

Well, if I set the probability of selling out at 2.5%, then I will have enough ice cream to sell with 97.5% certainty. To estimate those statistics I have to calculate the 97.5th percentile of the posterior predictive samples.
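The percentile calculation itself is a one-liner with quantile. The sketch below uses simulated Poisson draws as a stand-in for the posterior predictive samples; the draws, their mean of 1200 and the object names are hypothetical and are not the samples or results from the fitted models in this post.

```r
set.seed(2015)

# Stand-in for posterior predictive draws of units sold at 35ºC.
# In the real analysis these would come from the fitted model;
# the Poisson distribution and lambda = 1200 are made up.
pp_samples <- rpois(4000, lambda = 1200)

# Stock level that is exceeded in only about 2.5% of the draws
stock <- quantile(pp_samples, probs = 0.975)
stock

# Share of draws in which demand exceeds the stock held
run_out <- mean(pp_samples > stock)
run_out
```

Applying the same two lines to the draws from each of the four models is what produces one stock recommendation per model.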

Ok, I have four models and four answers ranging from 761 to 2494. The highest number is more than 3 times the lowest number!

I had set the market size at 800 in my binomial model, so I am not surprised by its answer of 761. Also, I noted earlier that the log-normal distribution is skewed to the right, so that explains the high prediction of 2494. The Poisson model, like the log-transformed linear model, has the implicit exponential growth assumption. Its mean forecast is well over 1000, as I pointed out earlier, and hence the 97.5% prediction of 1510 is to be expected.


How much ice cream should I hold in stock? Well, if I believe strongly in my assumption of a market size of 800, then I should stick with the output of the binomial model, or perhaps hold 800 just in case.

Another aspect to consider would be the cost of holding unsold stock. Ice cream can usually be stored for some time, but the situation would be quite different for freshly baked bread or bunches of flowers that have to be sold quickly.

Session Info

Note, I used the developer version of brms from GitHub.
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base

other attached packages:
[1] lattice_0.20-33 brms_0.4.1.9000 ggplot2_1.0.1
[4] rstan_2.7.0-1 inline_0.3.14 Rcpp_0.12.0

loaded via a namespace (and not attached):
[1] codetools_0.2-14 digest_0.6.8 MASS_7.3-43
[4] grid_3.2.2 plyr_1.8.3 gtable_0.1.2
[7] stats4_3.2.2 magrittr_1.5 scales_0.2.5
[10] stringi_0.5-5 reshape2_1.4.1 proto_0.3-10
[13] tools_3.2.2 stringr_1.0.0 munsell_0.4.2
[16] parallel_3.2.2 colorspace_1.2-6
This post was originally published on mages' blog.

from R-bloggers

Visualizing Twitter history with streamgraphs in R

I was exploring ways to visualize my Twitter history, and ended up creating this interactive streamgraph of my 20 most used hashtags in Twitter:

The graph shows how my Twitter activity has varied a lot. The top three hashtags are #datascience, #rstats and #opendata (no surprises there). There are also event-related hashtags that show up only once, such as #tomorrow2015 and #iccss2015, and annually repeating ones, such as #apps4finland.

How was this made?

Twitter has quite a strict policy for obtaining data, but they do allow one to download the full personal Twitter history, i.e. all tweets as a convenient csv file (instructions here), so that’s what I did.

The visualization was created with the streamgraph R package that uses the great htmlwidgets framework for easy creation of JavaScript visualizations from R. The plots are designed for daily data, but this ended up being too messy, so I aggregated the data at the monthly level instead.

Embedding the streamgraph htmlwidget into this Jekyll blog required a bit of hassle. As pointed out in the comments here, the widget must first be created as a standalone html file and then embedded as an iframe. Hopefully there will be a more straightforward way to include htmlwidgets in Jekyll blogs in the future!

Some problems:

  • The size of the widget has to be fixed when creating, so it will not scale automatically. This could possibly be fixed in the streamgraph package following this.
  • The font size of the graph is very small, but I could not find a way to change it, even in the JavaScript source.

The script for producing the streamgraph from the Twitter data is here. It is also printed below, with the help of read_chunk(). See more details from the rmarkdown source for this post.

# Script for producing a streamgraph of tweet hashtags

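# Load packages
library(readr)        # read_csv()
library(dplyr)        # %>%, select(), mutate(), bind_rows(), data_frame(), ...
library(streamgraph)  # streamgraph(), sg_legend(), sg_axis_x()
library(htmlwidgets)  # saveWidget()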

# Read my tweets
tweets_df <- read_csv("files/R/tweets.csv") %>%
  select(timestamp, text) %>%
  mutate(text = tolower(text))

# Pick hashtags with regexp
hashtags_list <- regmatches(tweets_df$text, gregexpr("#[[:alnum:]]+", tweets_df$text))

# Create a new data_frame with (timestamp, hashtag) -pairs
hashtags_df <- data_frame()
for (i in which(sapply(hashtags_list, length) > 0)) {
  hashtags_df <- bind_rows(hashtags_df, data_frame(timestamp = tweets_df$timestamp[i],
                                                   hashtag = hashtags_list[[i]]))
}

# Process data for plotting
hashtags_df <- hashtags_df %>%
  # Pick top 20 hashtags
  filter(hashtag %in% names(sort(table(hashtag), decreasing=TRUE))[1:20]) %>%
  # Group by year-month (daily is too messy)
  # Need to add '-01' to make it a valid date for streamgraph
  mutate(yearmonth = paste0(format(as.Date(timestamp), format="%Y-%m"), "-01")) %>%
  group_by(yearmonth, hashtag) %>%
  summarise(value = n())

# Create streamgraph
sg <- streamgraph(data = hashtags_df, key = "hashtag", value = "value", date = "yearmonth",
                 offset = "silhouette", interpolate = "cardinal",
                 width = "700", height = "400") %>%
  sg_legend(TRUE, "hashtag: ") %>%
  sg_axis_x(tick_interval = 1, tick_units = "year", tick_format = "%Y")

# Save it for viewing in the blog post
# For some reason I cannot save it to files/R/ directly so need to use file.rename()
saveWidget(sg, file="twitter_streamgraph.html", selfcontained = TRUE)
file.rename("twitter_streamgraph.html", "files/R/twitter_streamgraph.html")

from R-bloggers

Evaluating Logistic Regression Models in R

This post provides an overview of performing diagnostic and performance evaluation on logistic regression models in R. After training a statistical model, it’s important to understand how well that model did with regard to its accuracy and predictive power. The following content will provide the background and theory to ensure that the right techniques are being used for evaluating […]

from R-bloggers