Monday, August 31, 2015

Videographer Captures Amazing Interaction Between Seal and Diver

seal-gets-belly-rub-from-diver
Diver Gary Grayson has an unforgettable experience with a couple of Atlantic Grey Seals while diving in the Isles of Scilly off the coast of Great Britain.

from TwistedSifter http://ift.tt/1MYdQvu
via IFTTT

Sunday, August 30, 2015

Find Your Machine Learning Tribe: Get Started And Avoid Getting The Wrong Advice

Machine learning is a fascinating and powerful field of study filled with algorithms and data. The thing is, there are so many different types of people interested in machine learning, and each has different needs. It is important to understand what it is you want from machine learning and to tailor your self-study to those needs. […]

The post Find Your Machine Learning Tribe: Get Started And Avoid Getting The Wrong Advice appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1JFaVa3
via IFTTT

Friday, August 28, 2015

Get to know Cortana Analytics: Workshop and webinars

Cortana Analytics Suite is Microsoft's cloud-based big data and advanced analytics suite. It includes a complete set of all the services needed to build advanced analytics applications: from data ingestion and management to data warehousing, advanced analytics, data visualization, and solution frameworks. You can use Cortana Analytics to build applications using R, by incorporating services including Data Factory, HDInsight Hadoop, and ML Studio. If you'd like to spend some quality time with other developers and the Microsoft development team, there will be a first-ever Cortana Analytics Workshop at Microsoft HQ in Redmond, September 10-11. Or if you just want to check...

from R-bloggers http://ift.tt/1EpIj4v
via IFTTT

Understand Machine Learning Algorithms By Implementing Them From Scratch (and tactics to get around bad code)

Implementing machine learning algorithms from scratch seems like a great way for a programmer to understand machine learning. And maybe it is. But there are some downsides to this approach too. In this post you will discover some great resources that you can use to implement machine learning algorithms from scratch. You will also discover some of […]

The post Understand Machine Learning Algorithms By Implementing Them From Scratch (and tactics to get around bad code) appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1NK41C1
via IFTTT

Spatio-Temporal Kriging in R

Preface

I am writing this post mostly to remind myself of some theoretical background and of the steps needed to perform spatio-temporal kriging in gstat. This month I had some free time to spend on small projects not specifically related to my primary occupation. I decided to spend some time trying to learn this technique since it may become useful in the future. However, I have never used it before, so I first had to try to understand its basics, both in terms of theoretical background and programming. Since I have used several resources to get a handle on it, I decided to share my experience and thoughts in this blog post, because they may be useful for other people trying the same method. However, this post cannot be considered a full review of spatio-temporal kriging and its theoretical basis. I just mention some important details to guide myself and the reader through the comprehension of the topic, but these are clearly not exhaustive. At the end of the post I included some references to additional material you may want to browse for more details.

Introduction

This is the first time I have considered spatio-temporal interpolation. Even though many datasets are indexed in both space and time, in the majority of cases time is not really taken into account for the interpolation. As an example we can consider temperature observations measured hourly from various stations in a given study area. There are several different things we can do with such a dataset. We could for instance create a series of maps with the average daily or monthly temperatures. Time is clearly considered in these studies, but not explicitly during the interpolation phase. If we want to compute daily averages we first perform the averaging and then the kriging; the temporal interactions are not considered in the kriging model. An example of this type of analysis is provided by Gräler (2012) in the following image, which depicts monthly averages for some environmental parameter in Germany:

There are cases and datasets in which performing 2D kriging on “temporal slices” may be appropriate. However, there are other instances where this is not possible and therefore the only solution is to take time into account during kriging. For doing so, two possible solutions are suggested in the literature: using time as a third dimension, or fitting a covariance model with both spatial and temporal components (Gräler et al., 2013).

Time as the third dimension

The idea behind this technique is extremely easy to grasp. To better understand it we can simply take a look at the equation to calculate the sample semivariogram, from Sherman (2011):
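In standard notation the sample semivariogram can be written as:
\[
\hat{\gamma}(h) = \frac{1}{2|N(h)|} \sum_{(i,j) \in N(h)} \left[ Z(s_i) - Z(s_j) \right]^2 \qquad \text{(Eq.1)}
\]
where N(h) is the set of pairs of observations separated by the lag vector h and |N(h)| is the number of such pairs.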

Under Matheron's Intrinsic Hypothesis (Oliver et al., 1989) we can assume that the variance between two points, s_i and s_j, depends only on their separation, which we indicate with the vector h in Eq.1. If we imagine a 2D example (i.e. purely spatial), the vector h is simply the one that connects the two points, i and j, with a line, and its value can be calculated with the Euclidean distance:
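Writing the coordinates of the two points as (x_i, y_i) and (x_j, y_j), this distance is simply:
\[
\|h\| = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \qquad \text{(Eq.2)}
\]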

If we consider a third dimension, which can be depth, elevation or time, it is easy to imagine Eq.2 being adapted to accommodate the additional dimension. The only problem with this method is that in order for it to work properly the temporal dimension needs to have a range similar to the spatial dimension. For this reason time needs to be scaled to align it with the spatial dimension. Gräler et al. (2013) suggest several ways to optimize the scaling and achieve meaningful results. Please refer to this article for more information.

Spatio-Temporal Variogram

The second way of taking time into account is to adapt the covariance function to the time component. In this case each point s_i has a time t_i associated with it, and to calculate the variance between this point and another we need to calculate both their spatial separation h and their temporal separation u. Thus, the spatio-temporal variogram can be computed as follows, from Sherman (2011):
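In the same notation as Eq.1, the sample spatio-temporal variogram becomes:
\[
\hat{\gamma}(h, u) = \frac{1}{2|N(h, u)|} \sum_{(i,j) \in N(h, u)} \left[ Z(s_i, t_i) - Z(s_j, t_j) \right]^2
\]
where N(h, u) collects the pairs of observations separated by spatial lag h and temporal lag u.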

With this equation we can compute a variogram taking into account every pair of points separated by distance h and time u.

Spatio-Temporal Kriging in R

In R we can perform spatio-temporal kriging directly from gstat with a set of functions very similar to what we are used to in standard 2D kriging. The package spacetime provides ways of creating objects where the time component is taken into account, and gstat uses these formats for its space-time analysis. Here I will present an example of spatio-temporal kriging using sensors’ data.

Data

In 2011, as part of the OpenSense project, several wireless sensors to measure air pollution (O3, NO2, NO, SO2, VOC, and fine particles) were installed on top of trams in the city of Zurich. The project is now in its second phase and more information about it can be found here: http://ift.tt/1JAF8ac. On that page some example data about ozone and ultrafine particles are also distributed in CSV format. These data have the following characteristics: time is in UNIX format, while position is in degrees (WGS 84). I will use these data to test spatio-temporal kriging in R.

Packages

To complete this exercise we need to load several packages. First of all sp, for handling spatial objects, and gstat, which has all the functions needed to actually perform spatio-temporal kriging. Then spacetime, which we need to create the spatio-temporal object. These are the three crucial packages. However, I also loaded some others that I used to complete smaller tasks. I loaded the raster package, because I use the functions coordinates and projection to create spatial data. There is no need to load it, since the same functions are available under different names in sp; however, I prefer these two because they are easier to remember. The last packages are rgdal and rgeos, for performing various operations on geodata. The script therefore starts like this:
library(gstat)
library(sp)
library(spacetime)
library(raster)
library(rgdal)
library(rgeos)

Data Preparation

There are a couple of issues to solve before we can dive into kriging. The first is that we need to translate the time from UNIX to POSIXlt or POSIXct, which are standard ways of representing time in R. The very first thing we have to do, of course, is set the working directory and load the csv file:
setwd("...")
data <- read.table("ozon_tram1_14102011_14012012.csv", sep=",", header=T)

Now we need to address the UNIX time. So what is UNIX time anyway? It is a way of tracking time as the number of seconds between a particular time and the UNIX epoch, which is January 1st, 1970 GMT. Basically, I am writing the first draft of this post on August 18th at 16:01:00 CET. If I count the number of seconds from the UNIX epoch to this exact moment (there is an app for that!!) I find the UNIX time, which is equal to 1439910060. Now let's take a look at one entry in the column “generation_time” of our dataset:
> paste(data$generation_time[1])
[1] "1318583686494"

As you may notice, here the UNIX time is represented by 13 digits, while in the example above we had just 10. The UNIX time here also includes the milliseconds, which is something we cannot represent in R (as far as I know). So we cannot just convert each numerical value into POSIXlt; we first need to extract only the first 10 digits and then convert it. This can be done in one line of code but with multiple functions:
data$TIME <- as.POSIXlt(as.numeric(substr(paste(data$generation_time), 1, 10)), origin="1970-01-01")

We first need to transform the UNIX time from numerical to character format, using the function paste(data$generation_time). This creates the character string shown above, which we can then subset using the function substr. This function is used to extract characters from a string and takes three arguments: a string, a starting position and a stopping position. In this case we basically want to delete the last 3 digits from our string, so we set the start at the first character (start=1) and the stop at 10 (stop=10). Then we need to change the character string back to a numerical format, using the function as.numeric. Now we just need one last function to tell R that this particular number is a Date/Time object. We can do this using the function as.POSIXlt, which takes the number we just created plus an origin. Since we are using UNIX time, we need to set the starting point at "1970-01-01". We can apply this to the first element of the vector data$generation_time to test its output:
> as.POSIXlt(as.numeric(substr(paste(data$generation_time[1]), start=1, stop=10)), origin="1970-01-01")
[1] "2011-10-14 11:14:46 CEST"

Now the data.frame data has a new column named TIME where the Date/Time information is stored. Another issue with this dataset is the format of latitude and longitude. In the csv files these are represented as below:
> data$longitude[1]
[1] 832.88198
76918 Levels: 829.4379 829.43822 829.44016 829.44019 829.4404 ... NULL

> data$latitude[1]
[1] 4724.22833
74463 Levels: 4721.02182 4721.02242 4721.02249 4721.02276 ... NULL

Basically the geographical coordinates are represented in degrees and minutes, but without any separator. For example, for this point the longitude is 8°32.88’, while the latitude is 47°24.22’. To obtain coordinates in a more manageable format we again need to work with strings.
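For instance, the first longitude value above splits into 8 degrees and 32.88198 minutes:
# 832.88198 -> 8 degrees plus 32.88198 minutes
8 + 32.88198/60   # roughly 8.548 decimal degrees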
data$LAT <- as.numeric(substr(paste(data$latitude),1,2))+(as.numeric(substr(paste(data$latitude),3,10))/60)

data$LON <- as.numeric(substr(paste(data$longitude),1,1))+(as.numeric(substr(paste(data$longitude),2,10))/60)

We again use a combination of paste and substr to extract only the numbers we need. To convert this format into decimal degrees, we need to sum the degrees and the minutes divided by 60. So in the first part of the equation we extract the leading digits of the numerical string (two for latitude, one for longitude) and transform them back to numerical format. In the second part we extract the remainder of the string, transform it into a number and then divide it by 60. This operation creates some NAs in the dataset, for which you will get a warning message. We do not have to worry about it, as we can just exclude them with the following line:
data <- na.omit(data) 


Subset

The ozone dataset by OpenSense provides ozone readings every minute or so, from October the 14th 2011 at around 11 a.m., up until January the 14th 2012 at around 2 p.m.
> min(data$TIME)
[1] "2011-10-14 11:14:46 CEST"

> max(data$TIME)
[1] "2012-01-14 13:40:43 CET"

The size of this dataset is 200,183 rows, which makes it kind of big for performing kriging without a very powerful machine. For this reason, before we can proceed with this example, we have to subset our data to make them more manageable. To do so we can use the standard subsetting method for data.frame objects using Date/Time:
> sub <- data[data$TIME>=as.POSIXct('2011-12-12 00:00 CET')&data$TIME<=as.POSIXct('2011-12-14 23:00 CET'),]
> nrow(sub)
[1] 6734

Here I created an object named sub, in which I kept only the readings from midnight on December 12th to 11 p.m. on the 14th. This creates a subset of 6734 observations, for which I was able to perform the whole experiment using around 11 Gb of RAM. After this step we need to transform the object sub into a spatial object and then reproject it onto a metric system (EPSG:3395) so that the variogram is calculated on metres and not degrees. These are the steps required to achieve all this:
#Create a SpatialPointsDataFrame
coordinates(sub)=~LON+LAT
projection(sub)=CRS("+init=epsg:4326")

#Transform into Mercator Projection
ozone.UTM <- spTransform(sub,CRS("+init=epsg:3395"))

Now we have the object ozone.UTM, which is a SpatialPointsDataFrame with coordinates in metres.

Spacetime Package

Gstat is able to perform spatio-temporal kriging by exploiting the functionality of the package spacetime, which was developed by the same team as gstat. In spacetime we have two ways to represent spatio-temporal data: the STFDF and STIDF formats. The first represents objects with a complete space-time grid. In other words, this category includes objects such as the grid of weather stations presented in Fig.1. The spatio-temporal object is created using the n locations of the weather stations and the m time intervals of their observations; the spatio-temporal grid is of size n × m. STIDF objects are the ones we are going to use for this example. These are unstructured spatio-temporal objects, where both space and time change dynamically. For example, in this case we have data collected on top of trams moving around the city of Zurich, which means that the location of the sensors is not consistent throughout the sampling window. Creating STIDF objects is fairly simple: we just need to disassemble the data.frame we have into spatial, temporal and data components, and then merge them together to create the STIDF object. The first thing to do is to create the SpatialPoints object, with the locations of the sensors at any given time:
ozoneSP <- SpatialPoints(ozone.UTM@coords,CRS("+init=epsg:3395")) 

This is simple to do with the function SpatialPoints in the package sp. This function takes two arguments: the first is a matrix or a data.frame with the coordinates of each point. In this case I used the coordinates of the SpatialPointsDataFrame we created before, which are provided in a matrix format. Then I set the projection in UTM.
At this point we need to perform a very important check for kriging, namely whether we have any duplicated points. It may sometimes happen that there are points with identical coordinates. Kriging cannot handle this and returns an error, generally in the form of a “singular matrix”. Most of the time this happens, the problem is related to duplicated locations. So we now have to check whether we have duplicates here, using the function zerodist:
dupl <- zerodist(ozoneSP) 

It turns out that we have a couple of duplicates, which we need to remove. We can do that directly in the two lines of code we would need to create the data and temporal component for the STIDF object:
ozoneDF <- data.frame(PPB=ozone.UTM$ozone_ppb[-dupl[,2]]) 

In this line I created a data.frame with only one column, named PPB, containing the ozone observations in parts per billion. As you can see, I removed the duplicated points by excluding from the object ozone.UTM the rows with the indexes contained in one of the columns of the object dupl. We can use the same trick while creating the temporal part:
ozoneTM <- as.POSIXct(ozone.UTM$TIME[-dupl[,2]],tz="CET") 

Now all we need to do is combine the objects ozoneSP, ozoneDF and ozoneTM into a STIDF:
timeDF <- STIDF(ozoneSP,ozoneTM,data=ozoneDF) 

This is the file we are going to use to compute the variogram and perform the spatio-temporal interpolation. We can check the raw data contained in the STIDF object by using the spatio-temporal version of the function spplot, which is stplot:
stplot(timeDF) 



Variogram

The actual computation of the variogram at this point is pretty simple: we just need to use the appropriate function, variogramST. Its use is similar to the standard function for spatial kriging, even though there are some settings for the temporal component that need to be included.
var <- variogramST(PPB~1,data=timeDF,tunit="hours",assumeRegular=F,na.omit=T) 

As you can see, the first part of the call to the function variogramST is identical to a normal call to the function variogram: we first have the formula and then the data source. However, we then have to specify the time unit (tunit) or the time lags (tlags). I found the documentation around this point a bit confusing, to be honest. I tested various combinations of parameters and the line of code I presented is the only one that gives me what appear to be good results. I presume that what I am telling the function is to aggregate the data to hourly intervals, but I am not completely sure. I hope some of the readers can shed some light on this!!
I must warn you that this operation takes quite a long time, so please be aware of that. I personally ran it overnight.

Plotting the Variogram

Basically the spatio-temporal version of the variogram includes different temporal lags. Thus what we end up with is not a single variogram but a series, which we can plot using the following line:
plot(var,map=F) 

which returns the following image:

Among all the possible types of visualization for spatio-temporal variograms, this for me is the easiest to understand, probably because I am used to seeing variogram models. However, there are also other ways to visualize it, such as the variogram map:
plot(var,map=T) 


And the 3D wireframe:
plot(var,wireframe=T) 


Variogram Modelling

As in a normal 2D kriging experiment, at this point we need to fit a model to our variogram. For doing so we will use the functions vgmST and fit.StVariogram, which are the spatio-temporal counterparts of vgm and fit.variogram.
Below I present the code I used to fit all the models. For the automatic fitting I used most of the settings suggested in the following demo:
demo(stkrige) 

Regarding the variogram models, in gstat we have 5 options: separable, product sum, metric, sum metric, and simple sum metric. You can find more information about fitting these models, including all the equations presented below, in Gräler et al. (2015), which is available as a pdf (I put the link in the "More Information" section).

Separable

This covariance model assumes separability between the spatial and the temporal component, meaning that the covariance function is given by:
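In the notation of Gräler et al. (2015), the covariance factorises into a purely spatial and a purely temporal term:
\[
C_{sep}(h, u) = C_s(h) \cdot C_t(u)
\]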

According to (Sherman, 2011): “While this model is relatively parsimonious and is nicely interpretable, there are many physical phenomena which do not satisfy the separability”. Many environmental processes for example do not satisfy the assumption of separability. This means that this model needs to be used carefully.
The first thing to set are the upper and lower limits for all the variogram parameters, which are used during the automatic fitting:
# lower and upper bounds
pars.l <- c(sill.s = 0, range.s = 10, nugget.s = 0,sill.t = 0, range.t = 1, nugget.t = 0,sill.st = 0, range.st = 10, nugget.st = 0, anis = 0)
pars.u <- c(sill.s = 200, range.s = 1000, nugget.s = 100,sill.t = 200, range.t = 60, nugget.t = 100,sill.st = 200, range.st = 1000, nugget.st = 100,anis = 700)

To create a separable variogram model we need to provide a model for the spatial component, one for the temporal component, plus the overall sill:
separable <- vgmST("separable", space = vgm(-60,"Sph", 500, 1),time = vgm(35,"Sph", 500, 1), sill=0.56) 

This line creates a basic variogram model, and we can check how it fits our data using the following line:
plot(var,separable,map=F) 


One thing you may notice is that the variogram parameters do not seem to have anything in common with the image shown above. I mean, in order to create this variogram model I had to set the sill of the spatial component at -60, which is total nonsense. However, I decided to try to fit this model by eye as best as I could, just to show you how to perform this type of fitting and calculate its error; in this case it cannot be taken seriously. I also found that for the automatic fit the initial parameters passed to vgmST do not make much of a difference, so you probably do not have to worry too much about them.
We can check how this model fits our data by using the function fit.StVariogram with the option fit.method=0, which keeps this model as it is but calculates its mean squared error (MSE) compared to the actual data:
> separable_Vgm <- fit.StVariogram(var, separable, fit.method=0)
> attr(separable_Vgm,"MSE")
[1] 54.96278

This is basically the error of the eye fit. However, we can also use the same function to automatically fit the separable model to our data (here I used the settings suggested in the demo):
> separable_Vgm <- fit.StVariogram(var, separable, fit.method=11,method="L-BFGS-B", stAni=5, lower=pars.l,upper=pars.u)
> attr(separable_Vgm, "MSE")
[1] 451.0745

As you can see the error increases. This probably demonstrates that this model is not suitable for our data, even though with some magic we can create a pattern that is similar to what we see in the observations. In fact, if we check the fit by plotting the model it is clear that this variogram cannot properly describe our data:
plot(var,separable_Vgm,map=F) 


To check the parameters of the model we can use the function extractPar:
> extractPar(separable_Vgm)
range.s nugget.s range.t nugget.t sill
199.999323 10.000000 99.999714 1.119817 17.236256

Product Sum

A more flexible variogram model for spatio-temporal data is the product sum, which does not assume separability. The equation of the covariance model is given by:
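Following Gräler et al. (2015), it can be written as:
\[
C_{ps}(h, u) = k \cdot C_s(h) \cdot C_t(u) + C_s(h) + C_t(u)
\]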
with k > 0. In this case, in the function vgmST we need to provide both the spatial and the temporal component, plus the value of the parameter k (which needs to be positive):
prodSumModel <- vgmST("productSum",space = vgm(1, "Exp", 150, 0.5),time = vgm(1, "Exp", 5, 0.5),k = 50) 

I first tried to set k = 5, but R returned an error message saying that it needed to be positive, which I did not understand. However, with 50 it worked, and as I mentioned the automatic fit does not care much about these initial values; the most important things are probably the upper and lower bounds we set before.
We can then proceed with the fitting process and we can check the MSE with the following two lines:
> prodSumModel_Vgm <- fit.StVariogram(var, prodSumModel,method = "L-BFGS-B",lower=pars.l)
> attr(prodSumModel_Vgm, "MSE")
[1] 215.6392

This process returns the following model:

Metric

This model assumes identical covariance functions for both the spatial and the temporal components, but includes a spatio-temporal anisotropy (k) that allows some flexibility.
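Following Gräler et al. (2015), this corresponds to a single joint covariance evaluated on a combined spatio-temporal distance, in which the temporal lag is rescaled by the anisotropy k:
\[
C_m(h, u) = C_{joint}\left(\sqrt{h^2 + (k \cdot u)^2}\right)
\]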
In this model all the distances (spatial, temporal and spatio-temporal) are treated equally, meaning that we only need to fit a joint variogram to all three. The only parameter we have to modify is the anisotropy k. In R k is named stAni and creating a metric model in vgmST can be done as follows:
metric <- vgmST("metric", joint = vgm(50,"Mat", 500, 0), stAni=200) 

The automatic fit produces the following MSE:

> metric_Vgm <- fit.StVariogram(var, metric, method="L-BFGS-B",lower=pars.l)
> attr(metric_Vgm, "MSE")
[1] 79.30172

We can plot this model to visually check its accuracy:


Sum Metric

A more complex version of this model is the sum metric, which includes spatial and temporal covariance models, plus the joint component with the anisotropy:
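Following Gräler et al. (2015), its covariance is the sum of the three components:
\[
C_{sm}(h, u) = C_s(h) + C_t(u) + C_{joint}\left(\sqrt{h^2 + (k \cdot u)^2}\right)
\]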
This model allows maximum flexibility, since all the components can be set independently. In R this is achieved with the following line:
sumMetric <- vgmST("sumMetric", space = vgm(psill=5,"Sph", range=500, nugget=0),time = vgm(psill=500,"Sph", range=500, nugget=0), joint = vgm(psill=1,"Sph", range=500, nugget=10), stAni=500) 

The automatic fit can be done like so:

> sumMetric_Vgm <- fit.StVariogram(var, sumMetric, method="L-BFGS-B",lower=pars.l,upper=pars.u,tunit="hours")
> attr(sumMetric_Vgm, "MSE")
[1] 58.98891

Which creates the following model:

Simple Sum Metric

As the title suggests, this is a simpler version of the sum metric model. In this case, instead of having total flexibility for each component, we restrict them to having a single nugget. We basically still have to set all the parameters, even though we do not care about the nugget of each individual component, since a single nugget effect is set for all three:
SimplesumMetric <- vgmST("simpleSumMetric",space = vgm(5,"Sph", 500, 0),time = vgm(500,"Sph", 500, 0), joint = vgm(1,"Sph", 500, 0), nugget=1, stAni=500) 
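For reference, in Gräler et al. (2015) the corresponding variogram is the sum metric form with nugget-free components and a single overall nugget:
\[
\gamma_{ssm}(h, u) = \text{nug} \cdot \mathbf{1}_{h > 0 \,\vee\, u > 0} + \gamma_s(h) + \gamma_t(u) + \gamma_{joint}\left(\sqrt{h^2 + (k \cdot u)^2}\right)
\]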

This returns a model similar to the sum metric:

> SimplesumMetric_Vgm <- fit.StVariogram(var, SimplesumMetric,method = "L-BFGS-B",lower=pars.l)
> attr(SimplesumMetric_Vgm, "MSE")
[1] 59.36172


Choosing the Best Model

We can visually compare all the models we fitted using wireframes in the following way:
plot(var,list(separable_Vgm, prodSumModel_Vgm, metric_Vgm, sumMetric_Vgm, SimplesumMetric_Vgm),all=T,wireframe=T) 


The most important criterion to take into account for selecting the best model is certainly the MSE. Looking at these values it is clear that the best model is the sum metric, with an error of around 59, so I will use this for kriging.
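A quick way to put these errors side by side is to collect the MSE attribute of every fitted model; this is just a convenience sketch, not part of the original workflow:
# Collect the MSE attribute of each fitted variogram model for comparison
models <- list(separable = separable_Vgm,
               productSum = prodSumModel_Vgm,
               metric = metric_Vgm,
               sumMetric = sumMetric_Vgm,
               simpleSumMetric = SimplesumMetric_Vgm)
sapply(models, function(m) attr(m, "MSE"))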

Prediction Grid

Since we are performing spatio-temporal interpolation, it is clear that we are interested in estimating new values in both space and time. For this reason we need to create a spatio-temporal prediction grid. In this case I first downloaded the road network for the area around Zurich, then cropped it to match the extent of my study area, and then created the spatial grid:
roads <- shapefile("VEC25_str_l_Clip/VEC25_str_l.shp") 

This is the shapefile with the road network extracted from the Vector25 map of Switzerland. Unfortunately, for copyright reasons I cannot share it. This file is projected in CH93, which is the Swiss national projection. Since I wanted to perform a basic experiment, I decided not to include the whole network, but only the major roads, which in Switzerland are called Klass1. So the first thing I did was to extract from the roads object only the lines belonging to Klass1 streets:
Klass1 <- roads[roads$objectval=="1_Klass",] 

Then I changed the projection of this object from CH93 to UTM, so that it is comparable with what I used so far:
Klass1.UTM <- spTransform(Klass1,CRS("+init=epsg:3395")) 

Now I can crop this file so that I obtain only the roads within my study area. I can use the function crop from the raster package, with the object ozone.UTM that I created before:
Klass1.cropped <- crop(Klass1.UTM,ozone.UTM) 

This gives me the road network around the locations where the data were collected. I can show you the results with the following two lines:
plot(Klass1.cropped)
plot(ozone.UTM,add=T,col="red")


Here the Klass1 roads are in black and the data points are represented in red. With this selection I can now use the function spsample to create a random grid of points along the road lines:
sp.grid.UTM <- spsample(Klass1.cropped,n=1500,type="random") 

This generates the following grid, which I think I can share with you in RData format (gridST.RData):
As I mentioned, now we need to add a temporal component to this grid. We can do that again using the package spacetime. We first need to create a vector of Date/Times using the function seq:
tm.grid <- seq(as.POSIXct('2011-12-12 06:00 CET'),as.POSIXct('2011-12-14 09:00 CET'),length.out=5) 

This creates a vector with 5 elements (length.out=5), with POSIXct values between the two Date/Times provided. In this case we are interested in creating a spatio-temporal data frame, since we do not yet have any data for it. Therefore we can use the function STF to merge spatial and temporal data into a spatio-temporal grid:
grid.ST <- STF(sp.grid.UTM,tm.grid) 

This can be used as new data in the kriging function.

Kriging

This is probably the easiest step in the whole process. We have now created the spatio-temporal data frame, computed the best variogram model and created the spatio-temporal prediction grid. All we need to do now is a simple call to the function krigeST to perform the interpolation:
pred <- krigeST(PPB~1, data=timeDF, modelList=sumMetric_Vgm, newdata=grid.ST) 

We can plot the results again using the function stplot:

stplot(pred) 



More information

There are various tutorials available that offer examples and guidance for performing spatio-temporal kriging. For example, we can just write:
vignette("st", package = "gstat") 

and a pdf will open with some of the instructions I showed here. Plus there is a demo available at:
demo(stkrige) 

In the article “Spatio-Temporal Interpolation using gstat” Gräler et al. explain in detail the theory behind spatio-temporal kriging. The pdf of this article can be found here: http://ift.tt/1PAZuR4. There are also some books and articles that I found useful for better understanding the topic, for which I put the references at the end of the post.

References

Gräler, B., 2012. Different concepts of spatio-temporal kriging [WWW Document]. URL http://ift.tt/1PAZv7k (accessed 8.18.15).

Gräler, B., Pebesma, E., Heuvelink, G., 2015. Spatio-Temporal Interpolation using gstat.

Gräler, B., Rehr, M., Gerharz, L., Pebesma, E., 2013. Spatio-temporal analysis and interpolation of PM10 measurements in Europe for 2009.

Oliver, M., Webster, R., Gerrard, J., 1989. Geostatistics in Physical Geography. Part I: Theory. Trans. Inst. Br. Geogr., New Series 14, 259–269. doi:10.2307/622687

Sherman, M., 2011. Spatial statistics and spatio-temporal data: covariance functions and directional properties. John Wiley & Sons.




All the code snippets were created by Pretty R at inside-R.org

from R-bloggers http://ift.tt/1JAF7TU
via IFTTT

Wednesday, August 26, 2015

Choosing Machine Learning Algorithms: Lessons from Microsoft Azure

Microsoft recently launched support for machine learning in their Azure cloud computing platform. Buried in some of their technical documentation for the platform are some resources that you may find useful for thinking about what machine learning algorithm to use in different situations. In this post we take a look at the Microsoft recommendations for […]

The post Choosing Machine Learning Algorithms: Lessons from Microsoft Azure appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1IdggzS
via IFTTT

How to use lists in R

In the [last post](http://ift.tt/18yhf3k), I went over the basics of lists, including constructing, manipulating, and converting lists to other classes. Knowing the basics, in this post, we'll use the **apply()** functions to see just how powerful working with lists can be. I've done two posts on apply for dataframes and matrices, [here](http://ift.tt/1i25x6o) and [here](http://ift.tt/1A8JbS4), so give those a read if you need a refresher. Intro to apply-based functions for lists: There are a variety of apply functions that can be used depending on what you want to do. The table below shows the function, what it inputs, and what it outputs: For example, if we have a list and want to produce a vector (of the same length), we use **sapply()**. If we have a vector and want to produce a list of the same length, we use **lapply()**. Let's try an example. The syntax of lapply is: lapply(INPUT, function(x) (Some function here)) where INPUT, as we see from the table above, must be a vector or a list, and function(x) is any kind of function that takes **each element of the INPUT** and applies the function to it. The function can be something that already exists in R, or it can be a new function that you've written up. For example, let's construct a list of 3 vectors like so: mylist=5} #apply that function to the list sapply(mylist, span.fun) Creating a list using lapply: You don't need to have a list already created to use lapply() - in fact, lapply can be used to _make_ a list. This is because the key about **lapply()** is that it *returns* a list of the same length as whatever you input. For example, let's initialize a list to have 2 empty matrices that are size 2x3. We'll use lapply(): our input is just a vector containing 1 and 2, and the function we specify uses the matrix() function to construct a 2x3 matrix of empty cells for each element of this vector, so it returns a list of two such matrices. If instead of empty matrices you wanted to fill these matrices with random numbers, you could do so too. Check out both possibilities below. #initialize list to 2 empty matrices of 2 by 3 list2

from R-bloggers http://ift.tt/1K09Wn4
via IFTTT

Monday, August 24, 2015

Predicting Titanic deaths on Kaggle IV: random forest revisited

On July 19th I used randomForest to predict the deaths on Titanic in the Kaggle competition. Subsequently I found that both bagging and boosting gave better predictions than randomForest. This I found somewhat unsatisfactory, hence I am now revisi...

from R-bloggers http://ift.tt/1MOw2ct
via IFTTT

Sunday, August 23, 2015

5 Techniques To Understand Machine Learning Algorithms Without the Background in Mathematics

Where does theory fit into a top-down approach to studying machine learning? In the traditional approach to teaching machine learning, theory comes first requiring an extensive background in mathematics to be able to understand it. In my approach to teaching machine learning, I start with teaching you how to work problems end-to-end and deliver results. […]

The post 5 Techniques To Understand Machine Learning Algorithms Without the Background in Mathematics appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1h6CPAM
via IFTTT

Friday, August 21, 2015

Data Manipulation with dplyr

dplyr is a package for data manipulation, written and maintained by Hadley Wickham. It provides some great, easy-to-use functions that are very handy when performing exploratory data analysis and manipulation. Here, I will provide a basic overview of some of the most useful functions contained in the package. For this article, I will be using […]

from R-bloggers http://ift.tt/1JnbDIQ
via IFTTT

functional enrichment analysis with NGS data

I found that there is a Bioconductor package, seq2pathway, that can apply functional analysis to NGS data. It consists of two components, seq2gene and gene2pathway. seq2gene converts genomic coordinates to genes while gene2pathway performs functional analysis at the gene level.

from R-bloggers http://ift.tt/1Jm7Jvr
via IFTTT

Thursday, August 20, 2015

Practice Machine Learning with Small In-Memory Datasets from the UCI Machine Learning Repository

Where can you get good datasets to practice machine learning? Datasets that are real-world so that they are interesting and relevant, although small enough for you to review in Excel and work through on your desktop. In this post you will discover a database of high-quality, real-world, and well understood machine learning datasets that you […]

The post Practice Machine Learning with Small In-Memory Datasets from the UCI Machine Learning Repository appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1K90BaY
via IFTTT

Kickin’ it with elastic net regression

With the kind of data that I usually work with, overfitting regression models can be a huge problem if I'm not careful. Ridge regression is a really effective technique for thwarting overfitting. It does this by penalizing the L2 norm…

Continue reading



from R-bloggers http://ift.tt/1JjNK0s
via IFTTT

Tuesday, August 18, 2015

Evaluating Logistic Regression Models

Logistic regression is a technique that is well suited for examining the relationship between a categorical response variable and one or more categorical or continuous predictor variables. The model is generally presented in the following format, where β refers to the parameters and x represents the independent variables: log(odds) = β0 + β1∗x1 + ... + βn∗xn. The log(odds), or log-odds ratio, is defined […]

from R-bloggers http://ift.tt/1NhOpXk
via IFTTT

8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset

Has this happened to you? You are working on your dataset. You create a classification model and get 90% accuracy immediately. “Fantastic” you think. You dive a little deeper and discover that 90% of the data belongs to one class. Damn! This is an example of an imbalanced dataset and the frustrating results it can […]

The post 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1MwQBZm
via IFTTT

Monday, August 17, 2015

Machine Learning for Programmers: Leap from developer to machine learning practitioner

(a.k.a. my answer to the question: “how do I get started in machine learning?“) I’m a developer. I have read a book or some posts on machine learning. I have watched some of the Coursera machine learning course. I still don’t know how to get started… Does this sound familiar? The most common question I’m asked by developers on […]

The post Machine Learning for Programmers: Leap from developer to machine learning practitioner appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1WwvM51
via IFTTT

Friday, August 14, 2015

The Best Tools for Saving Web Pages, Forever

Your favorite content on the Internet may disappear. Learn about the best software tools and web archiving services that will help you save any web page on the Internet, forever.

from Digital Inspiration Technology Blog http://ift.tt/pY6ny7
via IFTTT

Monday, August 10, 2015

Turning your R (or Python) models into APIs



More and more real-world systems are relying on data science and analytical models to deliver sophisticated functionality or improved user experiences. For example, Microsoft combined the power of advanced predictive models and web services to develop the real-time voice translation feature in Skype. Facebook and Google continuously improve their deep learning models for better face recognition features in their photo service.

Some have characterised this trend as a shift from Software-as-a-Service (SaaS) to an era of Models-as-a-Service (MaaS). These models are often written in statistical programming languages (e.g., R, Python), which are especially well suited to analytical tasks.

With analytical models playing an increasingly important role in real-world systems, and with more models being developed in R and Python, we need powerful ways of turning these models into APIs for others to consume.

But how?

It is actually a lot easier than you might think. I wrote a step-by-step guide about deploying analytical models as REST APIs. This guide will walk you through how to set up your own MaaS WITHOUT a team of full-stack developers/engineers. All you need are the R/Python models you develop and a Domino Data Lab account. You can find the full article on ProgrammableWeb here.

You can also find my slides for the related LondonR talk here:


Deploying your Predictive Models as a Service via Domino from Jo-fai Chow

When I created this blog back in 2013, my aim was simply to learn ggplot2. Thanks to the feedback and advice from the R community, I continued to learn new stuff and somehow found an opportunity to work for Domino and Virgin Media. I wouldn't say I have seen enough to make a fair comparison with other programming communities. But so far the support from the R community, for me, has been truly special! I believe blogging is one of the best ways to contribute so I better get back to the writing habit! For the next post, I would like to talk about using R with other Microsoft tools (SQL Server, PowerPoint) in a commercial environment.

London Kaggle Meetup

I met Alex Glaser and Wojtek Kostelecki after my LondonR talk. They have already set up a meetup for Kagglers. We are working on a collaborative project to build / test / stack models on the Domino platform. For more information, join the meetup first. Let's Kaggle together!



from R-bloggers http://ift.tt/1DBeOMJ
via IFTTT

Argentine general election, 2015

The 2015 Argentine presidential election, to be held next October 25th, is approaching, and the race began to take clearer shape once the major parties announced their potential candidates last June. This Sunday, the political parties are holding their primaries for the upcoming presidential election. As in the US, in Argentina the primaries are important for […]

from R-bloggers http://ift.tt/1gSsVmT
via IFTTT

Soap analytics: Text mining “Goede tijden slechte tijden” plot summaries….

Sorry for the local nature of this blog post. I was watching Dutch television and zapping between channels the other day and I stumbled upon “Goede Tijden Slechte Tijden” (GTST). This is a Dutch soap series broadcast by RTL Nederland. … Continue reading

from R-bloggers http://ift.tt/1gSsVD7
via IFTTT

Tuesday, August 4, 2015

Feature Engineering versus Feature Extraction: Game On!

"Feature engineering" is a fancy term for making sure that your predictors are encoded in the model in a manner that makes it as easy as possible for the model to achieve good performance. For example, if you have a date field as a predictor and there are larger differences in response for the weekends versus the weekdays, then encoding the date in this way makes it easier to achieve good results.

However, this depends on a lot of things.

First, it is model-dependent. For example, trees might have trouble with a classification data set if the class boundary is a diagonal line since their class boundaries are made using orthogonal slices of the data (oblique trees excepted).

Second, the process of predictor encoding benefits the most from subject-specific knowledge of the problem. In my example above, you need to know the patterns of your data to improve the format of the predictor. Feature engineering is very different in image processing, information retrieval, RNA expressions profiling, etc. You need to know something about the problem and your particular data set to do it well.
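To make the date example from the opening paragraph concrete, here is a small illustrative sketch (with made-up dates, not taken from the post) of such an encoding:
# Sketch: derive a weekend indicator from a date predictor (day names are locale-dependent)
dates <- as.Date(c("2015-08-01", "2015-08-03"))
is_weekend <- weekdays(dates) %in% c("Saturday", "Sunday")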

Here is some training set data where two predictors are used to model a two-class system (I'll unblind the data at the end):

There is also a corresponding test set that we will use below.

There are some observations that we can make:

  • The data are highly correlated (correlation = 0.85)
  • Each predictor appears to be fairly right-skewed
  • They appear to be informative in the sense that you might be able to draw a diagonal line to differentiate the classes

Depending on what model we might choose to use, the between-predictor correlation might bother us. Also, we should look to see if the individual predictors are important. To measure this, we'll use the area under the ROC curve on the predictor data directly.

Here are univariate box-plots of each predictor (on the log scale):

There is some mild differentiation between the classes but a significant amount of overlap in the boxes. The area under the ROC curves for predictor A and B are 0.61 and 0.59, respectively. Not so fantastic.

What can we do? Principal component analysis (PCA) is a pre-processing method that does a rotation of the predictor data in a manner that creates new synthetic predictors (i.e. the principal components or PC's). This is conducted in a way where the first component accounts for the majority of the (linear) variation or information in the predictor data. The second component does the same for any information in the data that remains after extracting the first component and so on. For these data, there are two possible components (since there are only two predictors). Using PCA in this manner is typically called feature extraction.

Let's compute the components:

> library(caret)
> head(example_train)
   PredictorA PredictorB Class
2    3278.726  154.89876   One
3    1727.410   84.56460   Two
4    1194.932  101.09107   One
12   1027.222   68.71062   Two
15   1035.608   73.40559   One
16   1433.918   79.47569   One
> pca_pp <- preProcess(example_train[, 1:2],
+                      method = c("center", "scale", "pca"))
> pca_pp

Call:
preProcess.default(x = example_train[, 1:2], method = c("center",
 "scale", "pca"))

Created from 1009 samples and 2 variables
Pre-processing: centered, scaled, principal component signal extraction 

PCA needed 2 components to capture 95 percent of the variance
> train_pc <- predict(pca_pp, example_train[, 1:2])
> test_pc <- predict(pca_pp, example_test[, 1:2])
> head(test_pc, 4)
        PC1         PC2
1 0.8420447  0.07284802
5 0.2189168  0.04568417
6 1.2074404 -0.21040558
7 1.1794578 -0.20980371

Note that we computed all the necessary information from the training set and applied these calculations to the test set. What do the test set data look like?

These are the test set predictors simply rotated.

PCA is unsupervised, meaning that the outcome classes are not considered when the calculations are done. Here, the area under the ROC curve is 0.5 for the first component and 0.81 for the second component. These results jibe with the plot above; the first component has a random mixture of the classes while the second seems to separate the classes well. Box plots of the two components reflect the same thing:

There is much more separation in the second component.

This is interesting. First, despite PCA being unsupervised, it managed to find a new predictor that differentiates the classes. Secondly, it is the last component that is most important to the classes but the least important to the predictors. It is often said that PCA doesn't guarantee that any of the components will be predictive and this is true. Here, we get lucky and it does produce something good.

However, imagine that there are hundreds of predictors. We may only need to use the first X components to capture the majority of the information in the predictors and, in doing so, discard the later components. In this example, the first component accounts for 92.4% of the variation in the predictors; a similar strategy would probably discard the most effective predictor.

How does the idea of feature engineering come into play here? Given these two predictors and seeing the first scatterplot shown above, one of the first things that occurs to me is "there are two correlated, positive, skewed predictors that appear to act in tandem to differentiate the classes". The second thing that occurs to me is "take the ratio". What does that data look like?

The corresponding area under the ROC curve is 0.8, which is nearly as good as the second component. A simple transformation based on visually exploring the data can do just as good of a job as an unbiased empirical algorithm.
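A rough sketch of that calculation, assuming the pROC package (which the post itself does not use) and the example_train data from above:
# Sketch: ratio feature and its area under the ROC curve (pROC assumed)
library(pROC)
ratio_train <- example_train$PredictorA / example_train$PredictorB
roc(example_train$Class, ratio_train)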

These data are from the cell segmentation experiment of Hill et al., and predictor A is the "surface of a sphere created by rotating the equivalent circle about its diameter" (labeled as EqSphereAreaCh1 in the data) and predictor B is the perimeter of the cell nucleus (PerimCh1). A specialist in high content screening might naturally take the ratio of these two features of cells because it makes good scientific sense (I am not that person). In the context of the problem, their intuition should drive the feature engineering process.

However, in defense of an algorithm such as PCA, the machine has some benefit. In total, there are almost sixty predictors in these data whose features are just as arcane as EqSphereAreaCh1. My personal favorite is the "Haralick texture measurement of the spatial arrangement of pixels based on the co-occurrence matrix". Look that one up some time. The point is that there are often too many features to engineer and they might be completely unintuitive from the start.

Another plus for feature extraction is related to correlation. The predictors in this particular data set tend to have high between-predictor correlations and for good reasons. For example, there are many different ways to quantify the eccentricity of a cell (i.e. how elongated it is). Also, the size of a cell's nucleus is probably correlated with the size of the overall cell and so on. PCA can mitigate the effect of these correlations in one fell swoop. An approach of manually taking ratios of many predictors seems less likely to be effective and would take more time.

Last year, in one of the R&D groups that I support, there was a bit of a war being waged between the scientists who focused on biased analysis (i.e. we model what we know) versus the unbiased crowd (i.e. just let the machine figure it out). I fit somewhere in-between and believe that there is a feedback loop between the two. The machine can flag potentially new and interesting features that, once explored, become part of the standard book of "known stuff".



from R-bloggers http://ift.tt/1IHYXvX
via IFTTT

Generalised Linear Models in R

Linear models are the bread and butter of statistics, but there is a lot more to it than taking a ruler and drawing a line through a couple of points.

Some time ago Rasmus Bååth published an insightful blog article about how such models could be described from a distribution centric point of view, instead of the classic error terms convention.

I think the distribution centric view makes generalised linear models (GLM) much easier to understand as well. That’s the purpose of this post.

Using data on ice cream sales statistics I will set out to illustrate different models, starting with traditional linear least square regression, moving on to a linear model, a log-transformed linear model and then on to generalised linear models, namely a Poisson (log) GLM and Binomial (logistic) GLM. Additionally, I will run a simulation with each model.

Along the way I aim to reveal the intuition behind a GLM using Rasmus' distribution centric description. I hope to clarify why one would want to use different distributions and link functions.

The data

Here is the example data set I will be using. It shows the number of ice creams sold at different temperatures.
icecream <- data.frame(
  temp=c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1, 
         18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
  units=c(185L, 215L, 332L, 325L, 408L, 421L,
          406L, 412L, 522L, 445L, 544L, 614L)
)
I will be comparing the results from this model with others. Hence, I write a helper function to provide consistent graphical outputs:
basicPlot <- function(...){
  plot(units ~ temp, data=icecream, bty="n", lwd=2,
       main="Number of ice creams sold", col="#00526D",
       xlab="Temperature (Celsius)",
       ylab="Units sold", ...)
  axis(side = 1, col="grey")
  axis(side = 2, col="grey")
}
basicPlot()
As expected more ice cream is sold at higher temperature.

The challenge

I would like to create a model that predicts the number of ice creams sold for different temperatures, even outside the range of the available data.

I am particularly interested in how my models will behave in the more extreme cases when it is freezing outside, say when the temperature drops to 0ºC and also what it would say for a very hot summer's day at 35ºC.

Linear least square

My first approach is to take a ruler and draw a straight line through the points, such that it minimises the average distance between the points and the line. That's basically a linear least square regression line:
lsq.mod <- lsfit(icecream$temp, icecream$units)  # least-squares fit of units on temperature
basicPlot()
abline(lsq.mod, col="orange", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "linear least square"),
col=c("#00526D","orange"), pch=c(1,NA))
That's easy and looks not unreasonable.

Linear regression

For the least square method I didn't think about distributions at all. It is just common sense, or is it?

Let's start with a probability distribution centric description of the data.

I believe the observation \(y_i\) was drawn from a Normal (aka Gaussian) distribution with a mean \(\mu_i\), depending on the temperature \(x_i\), and a constant variance \(\sigma^2\) across all temperatures.

On another day with the same temperature I might have sold a different quantity of ice cream, but over many days with the same temperature the number of ice creams sold would start to fall into a range defined by \(\mu_i \pm \sigma\).

Thus, using a distribution centric notation my model looks like this:
\[
y_i \sim \mathcal{N}(\mu_i, \sigma^2), \\
\mathbb{E}[y_i] = \mu_i = \alpha + \beta x_i \; \text{for all } i
\]
Note, going forward I will drop the "for all \(i\)" statement for brevity.

Or, alternatively, I might argue that the residuals, the difference between the observations and predicted values, follow a Gaussian distribution with a mean of 0 and variance \(\sigma^2\):
\[
\varepsilon_i = y_i - \mu_i \sim \mathcal{N}(0, \sigma^2)
\]
Furthermore, the equation
\[
\mathbb{E}[y_i] = \mu_i = \alpha + \beta x_i
\]
states my belief that the expected value of \(y_i\) is identical to the parameter \(\mu_i\) of the underlying distribution, while the variance is constant.

The same model written in the classic error terms convention would be:
\[
y_i = \alpha + \beta x_i + \varepsilon_i, \text{ with }
\varepsilon_i \sim \mathcal{N}(0, \sigma^2)
\]
I think the probability distribution centric convention makes it clearer that my observation is just one realisation from a distribution. It also emphasises that the parameter of the distribution is modelled linearly.

To model this in R explicitly I use the glm function, in which I specify the "response distribution" (namely the number of ice creams) as Gaussian and the link function from the expected value of the distribution to its parameter (i.e. temperature) as identity. Indeed, this really is the trick with a GLM, it describes how the distribution of the observations and the expected value, often after a smooth transformation, relate to the actual measurements (predictors) in a linear way. The 'link' is the inverse function of the original transformation of the data.

That's what it says on the GLM tin and that's all there is to it!

(Actually, there is also the bit, which estimates the parameters from the data via maximum likelihood, but I will skip this here.)

lin.mod <- glm(units ~ temp, data=icecream,
               family=gaussian(link="identity"))
library(arm) # for 'display' function only
display(lin.mod)
## glm(formula = units ~ temp, family = gaussian(link = "identity"),
## data = icecream)
## coef.est coef.se
## (Intercept) -159.47 54.64
## temp 30.09 2.87
## ---
## n = 12, k = 2
## residual deviance = 14536.3, null deviance = 174754.9 (difference = 160218.6)
## overdispersion parameter = 1453.6
## residual sd is sqrt(overdispersion) = 38.13
Thus, to mimic my data I could generate random numbers from the following Normal distribution:
\[
y_i \sim \mathcal{N}(\mu_i, \sigma^2) \text{ with }
\mu_i = -159.5 + 30.1\, x_i \text{ and }
\sigma = 38.1
\]
Although the linear model looks fine in the range of temperatures observed, it doesn't make much sense at 0ºC. The intercept is at -159, which would mean that customers return on average 159 units of ice cream on a freezing day. Well, I don't think so.
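As a quick check, the fitted Gaussian GLM from above can be evaluated at these temperatures directly with predict():
# Predicted units sold at 0ºC and 35ºC from the Gaussian (identity link) GLM
predict(lin.mod, newdata = data.frame(temp = c(0, 35)))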

Log-transformed linear regression

Ok, perhaps I can transform the data first. Ideally I would like to ensure that the transformed data has only positive values. The first transformation that comes to my mind in those cases is the logarithm.

So, let's model the ice cream sales on a logarithmic scale. Thus my model changes to:
\[
\log(y_i) \sim \mathcal{N}(\mu_i, \sigma^2)\\
\mathbb{E}[\log(y_i)] = \mu_i = \alpha + \beta x_i
\]
This model implies that I believe the sales follow a log-normal distribution:
\[
y_i \sim \log\mathcal{N}(\mu_i, \sigma^2)
\]
Note, the log-normal distribution is skewed to the right, which means that I regard unusually high sales figures as more plausible than unusually low ones.

Although the model is still linear on a log-scale, I have to remember to transform the predictions back to the original scale because \(\mathbb{E}[\log(y_i)] \neq \log(\mathbb{E}[y_i])\). This is shown below. For a discussion of the transformation of the log-normal distribution see, for example, the help page of the R function rlnorm.
\[
y_i \sim \log\mathcal{N}(\mu_i, \sigma^2)\\
\mathbb{E}[y_i] = \exp(\mu_i + \sigma^2/2)
\]
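As a small numerical check of my own (not part of the original analysis), simulating from a log-normal distribution confirms the mean formula above:
# My own check of E[y] = exp(mu + sigma^2/2) for the log-normal distribution
set.seed(123)
mu <- 2; sigma <- 0.5
mean(rlnorm(1e5, meanlog=mu, sdlog=sigma)) # close to the theoretical value
exp(mu + sigma^2/2)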
log.lin.mod <- glm(log(units) ~ temp, data=icecream,
                   family=gaussian(link="identity"))
display(log.lin.mod)
## glm(formula = log(units) ~ temp, family = gaussian(link = "identity"),
## data = icecream)
## coef.est coef.se
## (Intercept) 4.40 0.20
## temp 0.08 0.01
## ---
## n = 12, k = 2
## residual deviance = 0.2, null deviance = 1.4 (difference = 1.2)
## overdispersion parameter = 0.0
## residual sd is sqrt(overdispersion) = 0.14
log.lin.sig <- summary(log.lin.mod)$dispersion
log.lin.pred <- exp(predict(log.lin.mod) + 0.5*log.lin.sig)
basicPlot()
lines(icecream$temp, log.lin.pred, col="red", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "log-transformed LM"),
col=c("#00526D","red"), pch=c(1,NA))

This plot looks a little better than the previous linear model and it predicts that I would sell, on average, 82 ice creams when the temperature is 0ºC:
exp(coef(log.lin.mod)[1])
## (Intercept)
## 81.62131
Although this model makes a little more sense, it appears to predict too many sales at the low and high end of the observed temperature range. Furthermore, there is another problem with this model, and with the previous linear model as well: the assumed distributions generate real numbers, but ice cream sales only occur in whole numbers. Ideally, a draw from the model distribution should therefore be a whole number as well.

Poisson regression

The classic approach for count data is the Poisson distribution.

The Poisson distribution has only one parameter, here \(\mu_i\), which is also its expected value. The canonical link function for \(\mu_i\) is the logarithm, i.e. the logarithm of the expected value is regarded as a linear function of the predictors. This means I have to apply the exponential function to the linear model to get back to the original scale. This is distinctively different to the log-transformed linear model above, where the original data was transformed, not the expected value of the data. The log function here is the link function, as it links the expected value to the linear model.

Here is my model:
\[
y_i \sim \text{Poisson}(\mu_i)\\
\mathbb{E}[y_i] = \mu_i = \exp(\alpha + \beta x_i) = \exp(\alpha) \exp(\beta x_i)\\
\log(\mu_i) = \alpha + \beta x_i
\]
Let me say this again: although the expected value of my observation is a real number, the Poisson distribution will generate only whole numbers, in line with the actual sales.
pois.mod <- glm(units ~ temp, data=icecream,
                family=poisson(link="log"))
display(pois.mod)
## glm(formula = units ~ temp, family = poisson(link = "log"), data = icecream)
## coef.est coef.se
## (Intercept) 4.54 0.08
## temp 0.08 0.00
## ---
## n = 12, k = 2
## residual deviance = 60.0, null deviance = 460.1 (difference = 400.1)
pois.pred <- predict(pois.mod, type="response")
basicPlot()
lines(icecream$temp, pois.pred, col="blue", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "Poisson (log) GLM"),
col=c("#00526D","blue"), pch=c(1,NA))

This looks pretty good. The interpretation of the coefficients should be clear now. I have to use the exp function to make a prediction of sales for a given temperature. However, R does this for me automatically if I set type="response" in the predict statement, as I did above.

From the coefficients I can read off that at 0ºC I expect to sell \(\exp(4.54) \approx 94\) ice creams and that with each one degree increase in temperature the sales are predicted to increase by \(\exp(0.076) - 1 \approx 7.9\%\).

Note, the exponential function turns the additive scale into a multiplicative one.
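
To illustrate this (a check of my own, assuming pois.mod as fitted above), the ratio of predicted sales at two consecutive temperatures is constant and equals \(\exp(\beta)\):
# My own illustration: one extra degree multiplies expected sales by exp(beta)
predict(pois.mod, newdata=data.frame(temp=21), type="response") /
  predict(pois.mod, newdata=data.frame(temp=20), type="response")
exp(coef(pois.mod)["temp"]) # the same multiplicative factor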

So far, so good. My model is in line with my observations. Additionally, it will not predict negative sales, and if I simulate from a Poisson distribution with a mean given by the above model, I will only ever get whole numbers back.

However, my model will also predict that I should expect to sell over 1000 ice creams if the temperature reaches 32ºC:
predict(pois.mod, newdata=data.frame(temp=32), type="response")
## 1
## 1056.651
Perhaps the exponential growth in my model looks a little too good to be true. Indeed, I am pretty sure that my market will be saturated at around 800. Warning: this is a modelling assumption from my side!

Binomial regression

Ok, let me think about the problem this way: if I have 800 potential sales, then I'd like to understand the proportion sold at a given temperature.

This suggests a binomial distribution for the number of successful sales out of 800. The key parameter for the binomial distribution is the probability of success, the probability that someone will buy ice cream as a function of temperature.

Dividing my sales statistics by 800 would give me a first proxy for the probability of making a sale to any one of those 800 potential customers.

Therefore I need an S-shaped curve that maps the output of my linear model, which can be any real number, onto probabilities between 0 and 100%.

A canonical choice is the logistic function:
\[
\text{logistic}(u) = \frac{e^u}{e^u + 1} = \frac{1}{1 + e^{-u}}
\]
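For a quick look at the S-shape (an illustration of my own), R provides the logistic function as plogis:
# My own illustration of the S-shaped logistic curve
curve(plogis(x), from=-6, to=6, xlab="u", ylab="logistic(u)")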
With that my model can be described as:
\[
y_i \sim \text{Binom}(n, \mu_i)\\
\mathbb{E}[y_i] = n \, \mu_i, \quad \mu_i = \text{logit}^{-1}(\alpha + \beta x_i)\\
\text{logit}(\mu_i) = \alpha + \beta x_i
\]
market.size <- 800
icecream$opportunity <- market.size - icecream$units
bin.glm <- glm(cbind(units, opportunity) ~ temp, data=icecream,
               family=binomial(link = "logit"))
display(bin.glm)
## glm(formula = cbind(units, opportunity) ~ temp, family = binomial(link = "logit"),
## data = icecream)
## coef.est coef.se
## (Intercept) -2.97 0.11
## temp 0.16 0.01
## ---
## n = 12, k = 2
## residual deviance = 84.4, null deviance = 909.4 (difference = 825.0)
bin.pred <- predict(bin.glm, type="response")*market.size
basicPlot()
lines(icecream$temp, bin.pred, col="purple", lwd=2)
legend(x="topleft", bty="n", lwd=c(2,2), lty=c(NA,1),
legend=c("observation", "Binomial (logit) GLM"),
col=c("#00526D","purple"), pch=c(1,NA))

The chart doesn't look too bad at all!

As the temperature rises further and further, this model predicts that sales will level off at market saturation, while all the other models so far would predict ever higher sales.
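
To see the saturation effect (a quick check of my own, assuming bin.glm from above), I can push the model well beyond the observed temperature range; the predictions level off towards the assumed market size of 800:
# My own check: predictions approach the assumed market size of 800
predict(bin.glm, newdata=data.frame(temp=c(40, 50, 60)), type="response") * market.size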

I can predict sales at 0ºC and 35ºC using the logistic function (the inverse of the logit), which is given in R as plogis:

# Sales at 0 Celsius
plogis(coef(bin.glm)[1])*market.size
## (Intercept)
## 39.09618
# Sales at 35 Celsius
plogis(coef(bin.glm)[1] + coef(bin.glm)[2]*35)*market.size
## (Intercept)
## 745.7449
So, that is 39 ice creams at 0ºC and 746 ice creams at 35ºC. Yet, these results will of course change if I change my assumptions on the market size. A market size of 1,000 would suggest that I can sell 55 units at 0ºC and 846 at 35ºC.
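
For completeness, here is a sketch of my own showing how that sensitivity check could be reproduced; the object names (market.size2, bin.glm2) are mine, not from the original post:
# My own sketch: refit the binomial model with a larger assumed market
market.size2 <- 1000
icecream$opportunity2 <- market.size2 - icecream$units
bin.glm2 <- glm(cbind(units, opportunity2) ~ temp, data=icecream,
                family=binomial(link="logit"))
plogis(coef(bin.glm2)[1]) * market.size2 # sales at 0ºC
plogis(coef(bin.glm2)[1] + coef(bin.glm2)[2]*35) * market.size2 # sales at 35ºC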

Summary

Let's bring all the models together into one graph, with temperatures ranging from 0 to 35ºC.
temp <- 0:35
p.lm <- predict(lin.mod, data.frame(temp=temp), type="response")
p.log.lm <- exp(predict(log.lin.mod, data.frame(temp=temp), type="response") +
                  0.5 * summary(log.lin.mod)$dispersion)
p.pois <- predict(pois.mod, data.frame(temp=temp), type="response")
p.bin <- predict(bin.glm, data.frame(temp=temp), type="response")*market.size
basicPlot(xlim=range(temp), ylim=c(-20,market.size))
lines(temp, p.lm, type="l", col="orange", lwd=2)
lines(temp, p.log.lm, type="l", col="red", lwd=2)
lines(temp, p.pois, type="l", col="blue", lwd=2)
lines(temp, p.bin, type="l", col="purple", lwd=2)
legend(x="topleft",
legend=c("observation",
"linear model",
"log-transformed LM",
"Poisson (log) GLM",
"Binomial (logit) GLM"),
col=c("#00526D","orange", "red",
"blue", "purple"),
bty="n", lwd=rep(2,5),
lty=c(NA,rep(1,4)),
pch=c(1,rep(NA,4)))

The chart shows the predictions of my four models over a temperature range from 0 to 35ºC. Although the linear model looks OK between 10 and perhaps 30ºC, it shows clearly its limitations. The log-transformed linear and Poisson models appear to give similar predictions, but will predict an ever accelerating increase in sales as temperature increases. I don't believe this makes sense as even the most ice cream loving person can only eat so much ice cream on a really hot day. And that's why I would go with the Binomial model to predict my ice cream sales.

Simulations

Having used the distribution centric view to describe my models leads naturally to simulations. If the models are good, then I shouldn't be able to identify the simulation from the real data.

In all my models the linear structure was
\[
g(\mu_i) = \alpha + \beta x_i
\]
or in matrix notation
\[
g(\mu) = A v
\]
with \(A_{i,\cdot} = [1, x_i]\) and \(v = [\alpha, \beta]\), whereby \(A\) is the model matrix, \(v\) the vector of coefficients and \(g\) the link function.

With that being said, let's simulate data from each distribution for the temperatures measured in the original data and compare against the actual sales units.
n <- nrow(icecream)
A <- model.matrix(units ~ temp, data=icecream)
set.seed(1234)
(rand.normal <- rnorm(n,
                      mean=A %*% coef(lin.mod),
                      sd=sqrt(summary(lin.mod)$dispersion)))
## [1] 152.5502 278.3509 339.2073 244.5335 374.3981 404.4103 375.2385
## [8] 403.3892 483.9470 486.5775 526.3881 557.6662
(rand.logtrans <- rlnorm(n,
                         meanlog=A %*% coef(log.lin.mod),
                         sdlog=sqrt(summary(log.lin.mod)$dispersion)))
## [1] 196.0646 266.1002 326.8102 311.5621 315.0249 321.1920 335.3733
## [8] 564.6961 515.9348 493.4822 530.8356 691.2354
(rand.pois <- rpois(n,
                    lambda=exp(A %*% coef(pois.mod))))
## [1] 220 266 279 337 364 434 349 432 517 520 516 663
(rand.bin <- rbinom(n,
                    size=market.size,
                    prob=plogis(A %*% coef(bin.glm))))
## [1] 197 249 302 313 337 386 387 414 516 538 541 594
basicPlot(ylim=c(100,700))
cols <- adjustcolor(c("orange", "red", "blue", "purple"),
                    alpha.f = 0.75)
points(icecream$temp, rand.normal, pch=19, col=cols[1])
points(icecream$temp, rand.logtrans, pch=19, col=cols[2])
points(icecream$temp, rand.pois, pch=19, col=cols[3])
points(icecream$temp, rand.bin, pch=19, col=cols[4])
legend(x="topleft",
legend=c("observation",
"linear model",
"log-transformed LM",
"Poisson (log) GLM",
"Binomial (logit) GLM"),
col=c("#00526D",cols), lty=NA,
bty="n", lwd=rep(2,5),
pch=c(1,rep(19,4)))

The chart displays one simulation per observation from each model, but I believe it already shows some interesting aspects. Not only do I see that the Poisson and Binomial models generate whole numbers, while the Normal and log-transformed Normal generate real numbers, but I also notice the skewness of the log-normal distribution in the red point at 19.4ºC.

Furthermore, the linear model generates values above and below the mean with equal probability, and at 16.4ºC the simulated value appears a little low, perhaps as a result.

Additionally, the high sales simulated at 25.1ºC by the log-transformed Normal and Poisson models shouldn't come as a surprise either.

Again, the simulation of the binomial model seems the most realistic to me.

Conclusions

I hope this little article illustrated the intuition behind generalised linear models. Fitting a model to data requires more than just applying an algorithm. In particular, it is worth thinking about:
  • the range of expected values: are they bounded, or do they range from \(-\infty\) to \(\infty\)?
  • the type of observations: do I expect real numbers, whole numbers or proportions?
  • how to link the distribution parameters to the observations

There are many aspects of GLMs which I haven't touched on here, such as:
  • all the above models assume a fixed level of volatility. In practice, however, the variability of making a sale at low temperatures might be significantly different from that at high temperatures. Check the residual plots and consider an over-dispersed model (a minimal sketch follows after this list).
  • I used so-called canonical link functions in my models, which have nice theoretical properties, but other choices are possible as well.
  • the distribution has to be chosen from the exponential family, e.g. Normal, Gamma, Poisson, binomial, Tweedie, etc.
  • GLMs use maximum likelihood as the criterion for fitting the models. I sketched the difference between least squares and maximum likelihood in an earlier post.
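
As a minimal sketch of the over-dispersion point above (my own addition, not part of the original analysis), the quasi-Poisson family estimates an extra dispersion parameter:
# My own sketch: quasi-Poisson allows the variance to exceed the mean
quasi.mod <- glm(units ~ temp, data=icecream,
                 family=quasipoisson(link="log"))
summary(quasi.mod)$dispersion # values well above 1 indicate over-dispersion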

R code

You find the code of this post also on GitHub.

Session Info

R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.4 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] arm_1.8-6 lme4_1.1-8 Matrix_1.2-1 MASS_7.3-42

loaded via a namespace (and not attached):
[1] minqa_1.2.4 tools_3.2.1 coda_0.17-1 abind_1.4-3
[5] Rcpp_0.11.6 splines_3.2.1 nlme_3.1-121 grid_3.2.1
[9] nloptr_1.0.4 lattice_0.20-31
This post was originally published on mages' blog.

from R-bloggers http://ift.tt/1Im4Xa7
via IFTTT