Thursday, 25 February 2016

Compare The Performance of Machine Learning Algorithms in R

How do you compare the estimated accuracy of different machine learning algorithms effectively? In this post you will discover 8 techniques that you can use to compare machine learning algorithms in R. You can use these techniques to choose the most accurate model, and be able to comment on the statistical significance and the absolute […]
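As an illustration of the general idea (not necessarily the post's own code), one common technique is to train several models on identical cross-validation folds with the caret package and compare their resampled accuracy distributions:

# A hedged sketch: compare two models on the same 10-fold CV splits using
# caret's resamples(); illustrative only, not the post's own code.
library(caret)
data(iris)

control <- trainControl(method = "cv", number = 10)

set.seed(7)   # same seed before each train() so the folds match
fit_lda <- train(Species ~ ., data = iris, method = "lda", trControl = control)
set.seed(7)
fit_cart <- train(Species ~ ., data = iris, method = "rpart", trControl = control)

# Collect the per-fold accuracy estimates and compare the distributions
results <- resamples(list(LDA = fit_lda, CART = fit_cart))
summary(results)
dotplot(results)   # visual side-by-side comparison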

The post Compare The Performance of Machine Learning Algorithms in R appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1VHk0CL
via IFTTT

Monday, 22 February 2016

Machine Learning Project Template in R

Applied machine learning is an empirical skill. You cannot get better at it by reading books and blog posts. You have to practice. In this post you will discover the simple 6-step machine learning project template that you can use to jump-start your project in R. Let’s get started. Practice Machine Learning With End-to-End Projects […]

The post Machine Learning Project Template in R appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1KBjA0b
via IFTTT

Thursday, 11 February 2016

Getting Started with Microsoft SQL Server on Azure

By Joseph Rickert. If you are an R user and work for an enterprise where Microsoft SQL Server is important, it is extremely helpful to have easy access to SQL Server databases. Over the past year, we have run several posts on this topic, including a comprehensive four-part series from Microsoft's Gregory Vandenbrouck on using various flavors of SQL with Azure as a data source (Part 1, Part 2, Part 3 and Part 4), as well as several posts on using the advanced features of Microsoft R Server (formerly Revolution R Enterprise) with SQL Server 2016. (See for example this recent post Credit...

from R-bloggers http://ift.tt/1Skp7KZ
via IFTTT

Wednesday, 10 February 2016

Tutorial: Credit Card Fraud Detection with SQL Server 2016 R Services

If you have a database of credit-card transactions with a small percentage tagged as fraudulent, how can you create a process that automatically flags likely fraudulent transactions in the future? That's the premise behind the latest Data Science Deep Dive on MSDN. This tutorial provides a step-by-step guide to using the R language and the big-data statistical models of the RevoScaleR package of SQL Server 2016 R Services to build and use a predictive model to detect fraud. To follow along with the tutorial you'll need to install SQL Server 2016 and R Services on a database server, and...

from R-bloggers http://ift.tt/1Q6h0Sb
via IFTTT

How To Load Your Machine Learning Data Into R

You need to be able to load data into R when working on a machine learning problem. In this short post, you will discover how you can load your data files into R and start your machine learning project. Let’s get started. Access To Your Data The most common way to work with data in […]
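The excerpt cuts off here, but as a minimal illustration of the most common case (the file name and URL below are hypothetical placeholders, not from the post):

# A sketch of loading CSV data with base R; "my_data.csv" and the URL
# are hypothetical placeholders.
dataset <- read.csv("my_data.csv", header = TRUE)        # from a local file
dataset <- read.csv("https://example.com/my_data.csv")   # or straight from a URL
head(dataset)   # peek at the first few rows
dim(dataset)    # check how many rows and columns loaded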

The post How To Load Your Machine Learning Data Into R appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1TaStvx
via IFTTT

Monday, 8 February 2016

Tune Machine Learning Algorithms in R (random forest case study)

It is difficult to find a good machine learning algorithm for your problem. But once you do, how do you get the best performance out of it? In this post you will discover three ways that you can tune the parameters of a machine learning algorithm in R. Walk through a real example step-by-step with […]
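As an illustration of one such approach (a caret grid search over random forest's mtry parameter; a sketch, not necessarily the post's exact code):

# A hedged sketch: grid search over random forest's mtry with caret.
library(caret)
library(randomForest)
data(iris)

control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
grid <- expand.grid(mtry = c(1, 2, 3, 4))   # candidate values to try
set.seed(7)
fit <- train(Species ~ ., data = iris, method = "rf",
             tuneGrid = grid, trControl = control)
print(fit)   # reports accuracy for each mtry and the best value chosen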

The post Tune Machine Learning Algorithms in R (random forest case study) appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1oa7Ya8
via IFTTT

How to Build an Ensemble Of Machine Learning Algorithms in R (ready to use boosting, bagging and stacking)

Ensembles can give you a boost in accuracy on your dataset. In this post you will discover how you can create three of the most powerful types of ensembles in R. This case study will step you through Boosting, Bagging and Stacking and show you how you can continue to ratchet up the accuracy of […]
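As a sketch of the three flavours (boosting and bagging via caret, stacking via the caretEnsemble package; the dataset and model choices here are illustrative, not necessarily the post's):

# A hedged sketch of boosting, bagging and stacking in R; dataset and
# model choices are illustrative only.
library(caret)
library(caretEnsemble)
library(mlbench)
data(Sonar)   # binary classification: Class is "M" or "R"

control <- trainControl(method = "cv", number = 10,
                        savePredictions = "final", classProbs = TRUE)

set.seed(7)   # boosting, via stochastic gradient boosting
fit_boost <- train(Class ~ ., data = Sonar, method = "gbm",
                   trControl = control, verbose = FALSE)
set.seed(7)   # bagging, via bagged CART
fit_bag <- train(Class ~ ., data = Sonar, method = "treebag", trControl = control)

# Stacking: fit a list of base models, then a meta-model on their predictions
set.seed(7)
base_models <- caretList(Class ~ ., data = Sonar, trControl = control,
                         methodList = c("rpart", "knn"))
stacked <- caretStack(base_models, method = "glm")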

The post How to Build an Ensemble Of Machine Learning Algorithms in R (ready to use boosting, bagging and stacking) appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1KzgxWb
via IFTTT

Data Exploration with Kaggle – R Tutorial

Ever wonder where to begin your data analysis? Exploratory Data Analysis (EDA) is often the best starting point. Take the new hands-on course from Kaggle & DataCamp, “Data Exploration with Kaggle Scripts”, to learn the essentials of data exploration and begin navigating the world of data.

By the end of the course you will know how to apply various R packages and tools in combination to get the most out of them when exploring your data. You will also be guided through the process of submitting your first Kaggle Script to your profile, and will publish analyses on Kaggle Scripts that you’ve personalized with information from your own life. (Tip: make sure to share your profile link with hiring managers and peers to easily show off and discuss your work.)

Play course now!

What you’ll learn

Throughout Data Exploration with Kaggle Scripts you will be challenged with hands-on interactive tasks and exercises that will help you learn effectively. You will learn and explore:

  • Whether it is a smart career choice to pursue a PhD, based on the 2013 American Community Survey dataset.

  • What the optimal length of chopsticks is.

  • Which racing pigeons are fastest.

  • How different the average marijuana prices are throughout the US.

  • How to make beautiful visualizations with ggplot2.

  • How to upload your first Kaggle Scripts.

  • The popularity of your name or a friend’s name in the USA over the past 100+ years.

  • Changes in enrollment, tuition, admissions rate, and degree types for different colleges.

During the entire course, the skills you have just acquired will be tested, thereby preparing you for independent sailing through Kaggle datasets.

Take this course now and you will always know where to begin your analysis and how to unravel useful information in an elegant manner. Enjoy!



from R-bloggers http://ift.tt/1QRekEG
via IFTTT

Tuesday, 2 February 2016

Better prediction intervals for time series forecasts

Forecast Combination

I’ve referred several times to this blog post by Rob Hyndman, in which he shows that a simple average of the ets() and auto.arima() functions in his {forecast} R package not only outperforms ets() and auto.arima() individually (in the long run, not every time), it outperforms nearly every method that was entered in the M3 competition in the year 2000. This is a good example of a well-known part of the forecasting craft: a “Forecast Combination” of models will produce results closer to reality than its individual component models do.
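As a minimal sketch of the idea (using R's built-in AirPassengers series as an illustrative stand-in, not anything from Hyndman's post):

# A minimal sketch of a two-model forecast combination; the series is
# an illustrative choice only.
library(forecast)

h <- 24
fc_ets <- forecast(ets(AirPassengers), h = h)
fc_aa  <- forecast(auto.arima(AirPassengers), h = h)

combined <- (fc_ets$mean + fc_aa$mean) / 2   # the averaged point forecast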

In an earlier post, I showed that adding X13-SEATS to the combination further improves the forecast, to the point where the Mean Absolute Scaled Error (compared to the actual observations once they became available) for the ets-auto.arima-SEATS combination is lower even than that of the competition’s top-ranking Theta method.

Prediction intervals

A question for the forecaster is what prediction interval to use in a forecast combination. A prediction interval is a similar but not identical concept to a confidence interval. A prediction interval is an estimate of a value (or rather, the range of likely values) that isn’t yet known but is going to be observed at some point in the future, whereas a confidence interval is an estimate of the likely range of values for a fundamentally unobservable parameter. A prediction interval needs to take into account uncertainty in the model, uncertain estimates of the parameters in the model (i.e. the confidence intervals for those parameters), and also the individual randomness associated with the particular point or points being predicted.
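The distinction is easy to see with base R's predict() on a simple linear model (an illustrative aside, not part of the forecasting workflow below):

# Confidence vs prediction intervals for a linear model on R's built-in
# cars dataset (stopping distance vs speed); an illustrative aside.
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 21)

# Confidence interval: likely range for the *mean* stopping distance at speed 21
predict(fit, new, interval = "confidence", level = 0.95)

# Prediction interval: likely range for a *single new* stopping distance at
# speed 21; wider, because it adds the randomness of one new observation
predict(fit, new, interval = "prediction", level = 0.95)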

Prediction intervals for forecasts are well known to be usually too narrow. For example, one study found that prediction intervals calculated to include the true results 95% of the time only got it right between 71% and 87% of the time (thanks to Hyndman again for making that result easily available on his blog). There are a number of contributing reasons, but the main one is that the uncertainty in the model building and selection process is not adequately taken into account. Most methods of developing prediction intervals are in effect estimating a range of values conditional on the model being correct in the first place. As our models are only simplifications of reality, we fail more often than we would if the model were exactly right.

On a cursory glance, I couldn’t find much discussion of how to produce a prediction interval from a forecast combination. Hyndman sidesteps the issue in the first post linked to above by making claims only about point estimates: “If you only want point forecasts, that (average of ets and auto.arima) is the best approach available in the forecast package.” One paper I found makes the common-sense suggestion of taking averages of the prediction intervals from the component models. But if the original prediction intervals are too narrow, is averaging them going to help?

Forecasting makes me nervous, and in my day job it causes problems when reality turns out noticeably different from the forecast points. I’d like there to be much more focus on the range of prediction intervals in communicating forecasts, and I’d like that range to be accurate, or if anything slightly conservative (e.g. I’d be happier for 83% of observations to come inside my 80% prediction interval than for 77%).

Introducing hybridf()

I like combining auto.arima() and ets() for a quick, effective hybrid forecast that is probably as good as can be hoped for with a univariate series. In fact, it’s the sort of thing that could easily be done thousands of times in a day, so to make it more convenient I created a function hybridf() that does this for me in R and produces an object of class “forecast”. This means that I can fit a forecast with one line of code, and the other functionality Hyndman developed for that class, like the standard forecast plot, can be used on the resulting object. Here it is in action with the monthly time series of accidental deaths in the USA from 1973 to 1978, used in Brockwell and Davis’ 1991 Time Series: Theory and Methods and one of R’s built-in datasets.

library(forecast)
source("http://ift.tt/1nwbYkG;) 
fc <- hybridf(USAccDeaths)
par(mfrow = c(3, 1), bty = "l")
plot(fc)
plot(fc$fc_ets)
plot(fc$fc_aa)

The dark grey areas are 80% prediction intervals and the light grey the 95% prediction intervals. The top panel shows the hybrid forecast. The dark blue line is just the average of the point forecasts of the other two methods, and the prediction interval takes the conservative view of showing the widest range of values obtained by superimposing the two. So the hybrid prediction interval will be at least as wide as the prediction intervals of either of the contributing models.
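That combination rule can be sketched as follows (this is not the actual hybridf() source, which lives at the URL above; it assumes the default 80% and 95% levels):

# A sketch of the conservative interval combination described above; not
# the actual hybridf() source. Assumes the default 80%/95% levels.
library(forecast)

h <- 24
fc_ets <- forecast(ets(USAccDeaths), h = h)
fc_aa  <- forecast(auto.arima(USAccDeaths), h = h)

point <- (fc_ets$mean + fc_aa$mean) / 2     # averaged point forecast
lower <- pmin(fc_ets$lower, fc_aa$lower)    # widest lower bounds (80% and 95% columns)
upper <- pmax(fc_ets$upper, fc_aa$upper)    # widest upper bounds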

Testing against the M3 competition series

In these days of easy access to computers and data, we don’t have to just theorise about the success rates of different prediction intervals; we can test methods against actual data. I used the 3,003 M3 competition datasets to compare the 80% and 95% prediction intervals generated by ets(), auto.arima(), and my hybridf(). After fitting the models on the given historical data and producing forecasts of the desired length, I counted how many of the actual results were in the prediction intervals. Here are the results:

variable          Success
ets_p80           0.75
ets_p95           0.90
auto.arima_p80    0.74
auto.arima_p95    0.88
hybrid_p80        0.83
hybrid_p95        0.94

My hybrid method has prediction intervals that succeed at close to the advertised rates, whereas both ets() and auto.arima() are less successful. For example, the hybrid 80% prediction interval contains the actual results 83% of the time, and the 95% prediction interval has the actual result 94% of the time; whereas for auto.arima the success rates are 74% and 88% respectively.

Here’s how I tested that on the M3 data. I built a little helper function, pi_accuracy(), which makes use of the fact that objects of class forecast contain a matrix called “lower” and another called “upper”, with a column for each prediction interval level. As this is only a temporary function for this blog, I leave it so it only works with the default values of forecast objects producing 80% and 95% intervals:

#------------------setup------------------------
library(showtext)
library(ggplot2)
library(scales)
library(forecast)
library(Mcomp)
library(tidyr)
library(dplyr)

source("http://ift.tt/1nwbYkG;)

font.add.google("Poppins", "myfont")
showtext.auto()
theme_set(theme_light(base_family = "myfont"))

pi_accuracy <- function(fc, yobs){
   # checks the success of the prediction intervals of an object of class
   # forecast against the actual observed values
   if(length(yobs) != length(fc$mean)){
      stop("yobs needs to be the same length as the forecast period.")
   }
   n <- length(yobs)
   yobsm <- cbind(yobs, yobs)
   In <- (yobsm > fc$lower & yobsm < fc$upper)
   # the two columns of fc$lower and fc$upper are the 80% and 95% levels
   colnames(In) <- c("80%", "95%")
   Success <- colMeans(In)
   return(list(In = In, Success = Success, n = n))
}

Actually fitting all the forecasts is relatively straightforward; it took about an hour on my laptop. As the hybridf() function returns an object that includes the underlying ets() and auto.arima() objects, they don’t need to be refitted, which gives a modest efficiency gain.

#============forecasting with default values===============
num_series <- length(M3) # ie 3003
results <- matrix(0, nrow = num_series, ncol = 7)

for(i in 1:num_series){
   cat(i, " ")        # let me know how it's going as it loops through...
   series <- M3[[i]]
   x <- series$x      # ie the data to be fitted
   xx <- series$xx    # ie the true, actual values of the forecast period
   h <- length(xx)    # ie the length of the forecast period
   
   fc3 <- hybridf(x, h = h)
   results[i, 5:6] <- pi_accuracy(fc3, xx)$Success
   
   fc1 <- fc3$fc_ets
   results[i, 1:2] <- pi_accuracy(fc1, xx)$Success
   
   fc2 <- fc3$fc_aa
   results[i, 3:4] <- pi_accuracy(fc2, xx)$Success
   
   results[i, 7] <- h
}

results <- as.data.frame(results)

names(results) <- c("ets_p80", "ets_p95", "auto.arima_p80", "auto.arima_p95",
                    "hybrid_p80", "hybrid_p95", "h")

# The results are saved as percentages that were in the intervals,
# and the forecast lengths are different, so we need to weight by
# forecast length (h) to get the actual total percentage of observations
# that were within prediction interval. This code produces the table
# reproduced in the blog post above: 
results %>% 
   gather(variable, value, -h) %>%
   mutate(weighted_value = value * h) %>%
   group_by(variable) %>%
   summarise(Success = round(sum(weighted_value) / sum(h), 2))

results %>%
   gather(variable, value, -h) %>%
   mutate(Level = ifelse(grepl("p80", variable), "80%", "95%"),
          Level = factor(Level, levels = c("95%", "80%")),
          variable = gsub("_p[0-9].", "", variable)) %>%
   ggplot(aes(x = h, y = value, colour = Level)) +
   facet_grid(Level~variable) +
   scale_y_continuous("Percentage of actual results within forecast prediction intervaln",
                      label = percent, breaks = c(0, .25, .5, .75, .8, .95, 1)) +
   labs(x = "Forecast period", colour = "Desired level") +
   ggtitle("Prediction interval success for three forecasting methods on 3003 M3 timeseries") +
   geom_jitter(alpha = 0.2, width = 1.3, height = 0.1, shape = 1) +
   geom_smooth(se = FALSE, method = "lm") +
   theme(panel.grid.minor = element_blank())

An interesting pattern emerges when we look at the success rates of individual forecasts, as in the image above. A small collection of unfortunates have 0% of the actual data within the prediction interval: things went wrong and stayed wrong. Generally, the longer the forecast period, the higher the success rate of the prediction intervals. Prediction intervals get wider as they forecast further periods out, and the randomness that is explicitly included in the intervals this way starts to dominate over the sunk-cost inaccuracy of having a wrong model in the first place. For longer forecast periods the standard prediction intervals tend towards performing as advertised, whereas for shorter forecast periods they are over-optimistic.

Bootstrapping

The forecast methods for both ets() and auto.arima() have the option to estimate prediction intervals by simulation and bootstrapping residuals rather than analytically, and those methods are inherited by my hybridf(). I checked the performance of these prediction intervals too. The results are very similar to the non-bootstrap results; if anything, the prediction intervals based on bootstrap and simulation are slightly less accurate, but the difference is nothing to write home about.

variable          Success
ets_p80           0.72
ets_p95           0.88
auto.arima_p80    0.70
auto.arima_p95    0.86
hybrid_p80        0.80
hybrid_p95        0.92

#=====with bootstrapping instead of formulae for the prediction intervals=============

num_series <- length(M3)
resultsb <- matrix(0, nrow = num_series, ncol = 7)

for(i in 1:num_series){
   cat(i, " ")
   series <- M3[[i]]
   x <- series$x
   xx <- series$xx
   h <- length(xx)
   
   fc3 <- hybridf(x, h = h, simulate = TRUE, bootstrap.ets = TRUE, bootstrap.aa = TRUE)
   resultsb[i, 5:6] <- pi_accuracy(fc3, xx)$Success
   
   fc1 <- fc3$fc_ets
   resultsb[i, 1:2] <- pi_accuracy(fc1, xx)$Success
   
   fc2 <- fc3$fc_aa
   resultsb[i, 3:4] <- pi_accuracy(fc2, xx)$Success
   
   resultsb[i, 7] <- h
}

resultsb <- as.data.frame(resultsb)

names(resultsb) <- c("ets_p80", "ets_p95", "auto.arima_p80", "auto.arima_p95",
                    "hybrid_p80", "hybrid_p95", "h")

resultsb %>% 
   gather(variable, value, -h) %>%
   mutate(weighted_value = value * h) %>%
   group_by(variable) %>%
   summarise(Success = round(sum(weighted_value) / sum(h), 2))

Conclusions

  • The hybridf() function at http://ift.tt/1nwbYkK provides a convenient and easy way of performing a forecast combination of ets() and auto.arima(), which gives high-performing point estimates and an easily managed object of class forecast.
  • Tested against the M3 competition data, the prediction intervals from hybridf(), formed by combining the prediction intervals of ets() and auto.arima() in a conservative manner (“take the widest range covered by superimposing the two source intervals”), perform true to the desired levels, i.e. the 80% prediction interval contains the true value just over 80% of the time, and the 95% prediction interval contains the true value just under 95% of the time.


from R-bloggers http://ift.tt/1PI8nGV
via IFTTT

Your First Machine Learning Project in R Step-By-Step (tutorial and template for future projects)

Do you want to do machine learning using R, but you’re having trouble getting started? In this post you will complete your first machine learning project using R. In this step-by-step tutorial you will: Download and install R and get the most useful package for machine learning in R. Load a dataset and understand its structure using statistical summaries […]

The post Your First Machine Learning Project in R Step-By-Step (tutorial and template for future projects) appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1Px7IMO
via IFTTT