Tuesday, September 29, 2015
Using Linear Regression to Predict Energy Output of a Power Plant
from R-bloggers http://ift.tt/1juYJyY
via IFTTT
Why Big Data? Learning Curves
from R-bloggers http://ift.tt/1KP8tKe
via IFTTT
Welcome to Vancouver, the City that Never Plays Itself
from TwistedSifter http://ift.tt/1iFFHFl
via IFTTT
Monday, September 28, 2015
Shiny CRUD App
In this post, we write a shiny app that lets you display and modify data that is stored in a database table. Shiny and Databases Everybody loves Shiny, and rightly so. It lets you publish reproducible research, brings R applications to non-R users, and can even serve as a general purpose GUI for R code. […]
The post Shiny CRUD App appeared first on ipub.
from R-bloggers http://ift.tt/1MztjRx
via IFTTT
Friday, September 25, 2015
How a Beginner Used Small Projects To Get Started in Machine Learning and Compete on Kaggle
It is valuable to get insight into how real people are getting started in machine learning. In this post you will discover how a beginner (just like you) got started and is making great progress in applying machine learning. I find interviews like this absolutely fascinating because of all of the things you can learn. […]
The post How a Beginner Used Small Projects To Get Started in Machine Learning and Compete on Kaggle appeared first on Machine Learning Mastery.
from Machine Learning Mastery http://ift.tt/1LbTEYH
via IFTTT
Wednesday, September 23, 2015
How do you know if your model is going to work? Part 4: Cross-validation techniques
from R-bloggers http://ift.tt/1ivCW9F
via IFTTT
How do you know if your model is going to work?
from R-bloggers http://ift.tt/1iLoET8
via IFTTT
Fitting a neural network in R; neuralnet package
from R-bloggers http://ift.tt/1L5kL7E
via IFTTT
Tuesday, September 22, 2015
EARL London 2015: Our Highlights
from R-bloggers http://ift.tt/1NIl6fx
via IFTTT
How do you know if your model is going to work? Part 4: Cross-validation techniques
from R-bloggers http://ift.tt/1Frbp3J
via IFTTT
Monday, September 21, 2015
The Leek group guide to writing your first paper
from Simply Statistics http://ift.tt/1QLKlO3
via IFTTT
Free online Data Science and Machine Learning course starts Sep 24
from R-bloggers http://ift.tt/1ODoitJ
via IFTTT
Friday, September 18, 2015
From functional programming to MapReduce in R
from R-bloggers http://ift.tt/1FQwYFF
via IFTTT
Thursday, September 17, 2015
Hypothesis Driven Development Part IV: Testing The Barroso/Santa Clara Rule
from R-bloggers http://ift.tt/1UXB7UF
via IFTTT
Searching for duplicate resource names in PMC article titles
from R-bloggers http://ift.tt/1QGUtHO
via IFTTT
Wednesday, September 16, 2015
Philosophy Graduate to Machine Learning Practitioner (an interview with Brian Thomas)
Getting started in machine learning can be frustrating. There’s so much to learn that it feels overwhelming. So much so that many developers interested in machine learning never get started. The idea of creating models on ad hoc datasets and entering a Kaggle competition sounds exciting, but feels like a far-off goal. So how did a Philosophy graduate get started in machine learning? […]
The post Philosophy Graduate to Machine Learning Practitioner (an interview with Brian Thomas) appeared first on Machine Learning Mastery.
from Machine Learning Mastery http://ift.tt/1iPv6bp
via IFTTT
Tuesday, September 15, 2015
How do you know if your model is going to work? Part 3: Out of sample procedures
from R-bloggers http://ift.tt/1idE2H1
via IFTTT
Monday, September 14, 2015
How to perform a Logistic Regression in R
from R-bloggers http://ift.tt/1glKURb
via IFTTT
Saturday, September 12, 2015
Recommendation Systems in R
from R-bloggers http://ift.tt/1UOEoAl
via IFTTT
Tuesday, September 8, 2015
First year books
I had to read a lot of books in graduate school. Some were life-changing, and others were forgettable.
If I could bring a reading list back in time for my ‘first year’ graduate self, it would include the following:
Bayesian Data Analysis
Third Edition, by Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin
Probably the most useful book I’ve ever owned.
The Art of R Programming
by Norman Matloff
This book made me less bad at programming in R.
Causality
Second Edition, by Judea Pearl
Ecology is complicated. We often lack replicated controlled experiments with random treatment assignment. This book helps with that.
Statistics for Spatio-Temporal Data
by Noel Cressie and Christopher Wikle
A thoughtful treatment of hierarchical modeling in a spatial, temporal, and spatiotemporal context. Has breadth with a healthy dose of outside references for depth.
Ecological Models and Data in R
by Benjamin M. Bolker
Covers fundamental ideas about likelihood and process-oriented modeling while building R proficiency.
Bayesian Models: A Statistical Primer for Ecologists
by N. Thompson Hobbs & Mevin B. Hooten
An introduction to the process of model building and estimation for non-math/stats oriented readers.
Data Analysis Using Regression and Multilevel/Hierarchical Models
by Andrew Gelman and Jennifer Hill
A gentle introduction to multilevel modeling, with plenty of graphics and integration with R.
Statistical Inference
Second Edition, by George Casella and Roger L. Berger
Essential for understanding the mathematical and probabilistic foundations of statistics. Read it after brushing up on calculus.
Linear Algebra
by George Shilov
I wish I had taken a class in linear algebra as an undergraduate, but I instead had to catch up in my first year of grad school. This book made it relatively painless.
Single and Multivariable Calculus
by David Guichard and friends
Because I took a few calculus classes in high school and college and didn’t know why.
Mathematical Tools for Understanding Infectious Disease Dynamics
by Odo Diekmann, Hans Heesterbeek & Tom Britton
Mathematical epidemiology is a huge topic. This book introduces common models and approaches from first principles, with plenty of problems along the way to make sure you’re following along. Read it with a notebook and pencil handy.
from R-bloggers http://ift.tt/1itDxZQ
via IFTTT
Monday, September 7, 2015
How do you know if your model is going to work? Part 2: In-training set measures
from R-bloggers http://ift.tt/1UxpMdF
via IFTTT
How do you know if your model is going to work? Part 1: The problem
from R-bloggers http://ift.tt/1JO20mF
via IFTTT
Sunday, September 6, 2015
How to Use a Machine Learning Checklist to Get Accurate Predictions, Reliably (even if you are a beginner)
How do you get accurate results using machine learning on problem after problem? The difficulty is that each problem is unique, requiring different data sources, features, algorithms, algorithm configurations and on and on. The solution is to use a checklist that guarantees a good result every time. In this post you will discover a checklist […]
The post How to Use a Machine Learning Checklist to Get Accurate Predictions, Reliably (even if you are a beginner) appeared first on Machine Learning Mastery.
from Machine Learning Mastery http://ift.tt/1Ku8F7S
via IFTTT
Wednesday, September 2, 2015
Logistic Regression in R – Part One
from R-bloggers http://ift.tt/1Kr0tW1
via IFTTT
Tuesday, September 1, 2015
Bayesian regression models using Stan in R
Here I will use the new brms package (GitHub, CRAN) by Paul-Christian Bürkner to derive the 95% prediction credible interval for the four models I introduced in my first post about generalised linear models. Additionally, I am interested in predicting how much ice cream I should hold in stock for a hot day at 35ºC, such that I only run out of ice cream with a probability of 2.5%.
Stan models with brms
As in my previous post about the log-transformed linear model with Stan, I will use Bayesian regression models to estimate the 95% prediction credible interval from the posterior predictive distribution. Thanks to brms this will take less than a minute of coding, because brm allows me to specify my models in the usual formula syntax, and I can leave it to the package functions to create and execute the Stan files. Let's start. Here is the data again:
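The data block itself did not survive the reposting. Below is a stand-in for the ice cream dataset in the same shape (temperature in ºC and units sold); the values are illustrative, not necessarily the original post's figures:

```r
# Stand-in ice cream data: units sold at different temperatures (ºC).
# Values are illustrative placeholders, not the original post's figures.
icecream <- data.frame(
  temp  = c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1, 18.5, 19.4,
            22.1, 22.6, 23.4, 25.1),
  units = c(185L, 215L, 332L, 325L, 408L, 421L, 406L, 412L,
            522L, 445L, 544L, 614L)
)
```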
My models are written down in very much the same way as with glm. Only the binomial model requires a slightly different syntax. Here I use the default priors and link functions.
Last week I wrote the Stan model for the log-transformed linear model myself. Here is the output of brm. The estimated parameters are quite similar, apart from sigma. I access the underlying Stan model via log.lin.mod$model and note that the prior of sigma is modelled via a Cauchy distribution, unlike the inverse Gamma I used last week. I believe that explains the small difference in sigma.
To review the model, I start by plotting the trace and density plots for the MCMC samples.
Prediction credible interval
The predict function gives me access to the posterior predictive statistics, including the 95% prediction credible interval. Combining the outputs of all four models into one data frame then gives me the opportunity to compare the prediction credible intervals of the four models in one chart.
There is no big news with respect to the four models, except that here I can look at the posterior prediction credible intervals rather than the theoretical distributions from two weeks ago. The over-prediction of the log-transformed linear model is apparent again.
How much stock should I hold on a hot day?
Running out of stock on a hot summer's day would be unfortunate, because those are the days when sales will be highest. But how much stock should I hold?
Well, if I set the probability of selling out at 2.5%, then I will have enough ice cream to sell with 97.5% certainty. To estimate those statistics I have to calculate the 97.5% percentile of the posterior predictive samples.
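As a sketch of that last step: with posterior predictive draws in hand (here simulated Poisson draws stand in for the real posterior samples, and the lambda value is made up), the stock level is just the 97.5% quantile:

```r
# Illustrative only: rpois() draws stand in for posterior predictive
# samples of units sold at 35ºC; lambda = 1200 is a made-up value.
set.seed(42)
post.pred <- rpois(10000, lambda = 1200)

# Stock level that covers demand with 97.5% certainty
stock <- quantile(post.pred, probs = 0.975)
stock
```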
Ok, I have four models and four answers ranging from 761 to 2494. The highest number is more than 3 times the lowest number!
I had set the market size at 800 in my binomial model, so I am not surprised by its answer of 761. Also, I noted earlier that the log-normal distribution is skewed to the right, which explains the high prediction of 2494. The Poisson model, like the log-transformed linear model, has an implicit exponential growth assumption. Its mean forecast is well over 1000, as I pointed out earlier, and hence the 97.5% prediction of 1510 is to be expected.
Conclusions
How much ice cream should I hold in stock? Well, if I believe strongly in my assumption of a market size of 800, then I should stick with the output of the binomial model, or perhaps hold 800 just in case.
Another aspect to consider would be the cost of holding unsold stock. Ice cream can usually be stored for some time, but the situation would be quite different for freshly baked bread or bunches of flowers that have to be sold quickly.
Session Info
Note: I used the development version of brms from GitHub.
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] lattice_0.20-33 brms_0.4.1.9000 ggplot2_1.0.1
[4] rstan_2.7.0-1 inline_0.3.14 Rcpp_0.12.0
loaded via a namespace (and not attached):
[1] codetools_0.2-14 digest_0.6.8 MASS_7.3-43
[4] grid_3.2.2 plyr_1.8.3 gtable_0.1.2
[7] stats4_3.2.2 magrittr_1.5 scales_0.2.5
[10] stringi_0.5-5 reshape2_1.4.1 proto_0.3-10
[13] tools_3.2.2 stringr_1.0.0 munsell_0.4.2
[16] parallel_3.2.2 colorspace_1.2-6
This post was originally published on mages' blog.
from R-bloggers http://ift.tt/1UjMANO
via IFTTT
Visualizing Twitter history with streamgraphs in R
I was exploring ways to visualize my Twitter history, and ended up creating this interactive streamgraph of my 20 most used hashtags in Twitter:
The graph shows how my Twitter activity has varied a lot. The top three hashtags are #datascience, #rstats and #opendata (no surprises there). There are also event-related hashtags that show up only once, such as #tomorrow2015 and #iccss2015, and annually repeating ones, such as #apps4finland.
How was this made?
Twitter has quite a strict policy for obtaining data, but they do allow one to download the full personal Twitter history, i.e. all tweets as a convenient csv file (instructions here), so that’s what I did.
The visualization was created with the streamgraph R package, which uses the great htmlwidgets framework for easy creation of JavaScript visualizations from R. The plots are designed for daily data, but this ended up being too messy, so I aggregated the data at the monthly level instead.
Embedding the streamgraph htmlwidget into this Jekyll blog required a bit of hassle. As pointed out in the comments here, the widget must first be created as a standalone html file and then embedded as an iframe. Hopefully there will be a more straightforward way to include htmlwidgets in Jekyll blogs in the future!
Some problems:
- The size of the widget has to be fixed at creation time, so it will not scale automatically. This could possibly be fixed in the streamgraph package following this.
- The font size of the graph is very small, but I could not find a way to change it, even in the javascript source.
The script for producing the streamgraph from the Twitter data is here. It is also printed below, with the help of read_chunk(). See more details in the rmarkdown source for this post.
# Script for producing a streamgraph of tweet hashtags
# Load packages
library("readr")
library("dplyr")
library("lubridate")
library("streamgraph")
library("htmlwidgets")
# Read my tweets
tweets_df <- read_csv("files/R/tweets.csv") %>%
select(timestamp, text) %>%
mutate(text = tolower(text))
# Pick hashtags with regexp
hashtags_list <- regmatches(tweets_df$text, gregexpr("#[[:alnum:]]+", tweets_df$text))
# Create a new data_frame with (timestamp, hashtag) pairs
hashtags_df <- data_frame()
for (i in which(sapply(hashtags_list, length) > 0)) {
hashtags_df <- bind_rows(hashtags_df, data_frame(timestamp = tweets_df$timestamp[i],
hashtag = hashtags_list[[i]]))
}
# Process data for plotting
hashtags_df <- hashtags_df %>%
# Pick top 20 hashtags
filter(hashtag %in% names(sort(table(hashtag), decreasing=TRUE))[1:20]) %>%
# Group by year-month (daily is too messy)
# Need to add '-01' to make it a valid date for streamgraph
mutate(yearmonth = paste0(format(as.Date(timestamp), format="%Y-%m"), "-01")) %>%
group_by(yearmonth, hashtag) %>%
summarise(value = n())
# Create streamgraph
sg <- streamgraph(data = hashtags_df, key = "hashtag", value = "value", date = "yearmonth",
offset = "silhouette", interpolate = "cardinal",
width = "700", height = "400") %>%
sg_legend(TRUE, "hashtag: ") %>%
sg_axis_x(tick_interval = 1, tick_units = "year", tick_format = "%Y")
# Save it for viewing in the blog post
# For some reason I cannot save it to files/R/ directly, so need to use file.rename()
saveWidget(sg, file="twitter_streamgraph.html", selfcontained = TRUE)
file.rename("twitter_streamgraph.html", "files/R/twitter_streamgraph.html")
from R-bloggers http://ift.tt/1KBuCAo
via IFTTT
Evaluating Logistic Regression Models in R
from R-bloggers http://ift.tt/1UoA2Q9
via IFTTT