Monday, November 23, 2015

Setting up an AWS instance for R, RStudio, OpenCPU, or Shiny Server

While most web developers have worked with Amazon AWS, Microsoft Azure, or similar platforms before, this is still not the case for many R number crunchers. Researchers at academic institutions, in particular, have had less exposure to these commercial offerings. Time to change that! In this post, we explain how to set up an Ubuntu server instance on AWS, […]

The post Setting up an AWS instance for R, RStudio, OpenCPU, or Shiny Server appeared first on ipub.



from R-bloggers http://ift.tt/1MN3lZ2
via IFTTT

Wednesday, November 18, 2015

Deep Learning with MXNetR

Deep learning has been an active field of research for some years, with breakthroughs in image and language understanding, among other areas. However, there has not yet been a good deep learning package in R that offers state-of-the-art models and real GPU support for fast training.

In this post, we introduce MXNetR, an R package that brings fast GPU computation and state-of-the-art deep learning to the R community. MXNet allows you to flexibly configure state-of-the-art deep learning models backed by fast CPU and GPU back-ends. This post will cover the following topics:

  • Train your first neural network in five minutes
  • Use MXNet for the Handwritten Digits Classification Competition
  • Classify real-world images using state-of-the-art deep learning models

Train your first neural network in five minutes

A Classification Task

Let's use a demo dataset to demonstrate the basic syntax and parameters of mxnet. First we load the data:

require(mlbench)

## Loading required package: mlbench

require(mxnet)

## Loading required package: mxnet
## Loading required package: methods

data(Sonar, package="mlbench")

Sonar[,61] = as.numeric(Sonar[,61])-1
train.ind = c(1:50, 100:150)
train.x = data.matrix(Sonar[train.ind, 1:60])
train.y = Sonar[train.ind, 61]
test.x = data.matrix(Sonar[-train.ind, 1:60])
test.y = Sonar[-train.ind, 61]

Next we are going to use a multi-layer perceptron as our classifier. In mxnet, we offer a function called mx.mlp that lets users build a general multi-layer neural network for classification or regression.

There are several parameters we have to feed to mx.mlp:

  • Training data and label.
  • Number of hidden nodes in each hidden layer.
  • Number of nodes in the output layer.
  • Type of the activation.
  • Type of the output loss.
  • The device to train (GPU or CPU).
  • Other parameters for mx.model.FeedForward.create.

The following code shows a possible way of using mx.mlp (the output is shortened):

mx.set.seed(0)
model <- mx.mlp(train.x, train.y, hidden_node=10, out_node=2, out_activation="softmax",
                num.round=20, array.batch.size=15, learning.rate=0.07, momentum=0.9, 
                eval.metric=mx.metric.accuracy)

## Auto detect layout of input matrix, use rowmajor..
## Start training with 1 devices
## [1] Train-accuracy=0.488888888888889
## [2] Train-accuracy=0.514285714285714
## [3] Train-accuracy=0.514285714285714

...

## [18] Train-accuracy=0.838095238095238
## [19] Train-accuracy=0.838095238095238
## [20] Train-accuracy=0.838095238095238

Note that mx.set.seed is the correct function to control the random process in mxnet. The reason is that most of MXNet's random number generators can run on different devices, such as the GPU. To get fast random number generation we need massively parallel PRNGs on the GPU, and seeding these PRNGs can be quite costly, so we introduced mx.set.seed to control random numbers in MXNet. You can see the accuracy improve over the training process. It is also easy to make predictions and evaluate them.

preds = predict(model, test.x)

## Auto detect layout of input matrix, use rowmajor..

pred.label = max.col(t(preds))-1
table(pred.label, test.y)

##           test.y
## pred.label  0  1
##          0 24 14
##          1 36 33

Note that for multi-class prediction, mxnet outputs an nclass x nexamples matrix, where each row corresponds to the probability of that class.

A Regression Task with Structure Configuration

Now let's learn something new. We use the following code to load and process the data:

data(BostonHousing, package="mlbench")

train.ind = seq(1, 506, 3)
train.x = data.matrix(BostonHousing[train.ind, -14])
train.y = BostonHousing[train.ind, 14]
test.x = data.matrix(BostonHousing[-train.ind, -14])
test.y = BostonHousing[-train.ind, 14]

Although we could use mx.mlp again to do regression by changing the out_activation, this time we are going to introduce a more flexible way to configure neural networks in mxnet. The configuration is done by the "Symbol" system, which takes care of the links among nodes, the activations, dropout ratios, etc. To configure a multi-layer neural network, we can do the following:

# Define the input data
data <- mx.symbol.Variable("data")
# A fully connected hidden layer
# data: input source
# num_hidden: number of neurons in this layer
fc1 <- mx.symbol.FullyConnected(data, num_hidden=1)

# Use linear regression for the output layer
lro <- mx.symbol.LinearRegressionOutput(fc1)

What matters most for a regression task is the last function: it makes the network optimize for squared loss. In this configuration we have effectively dropped the hidden layer, so the input layer is directly connected to the output layer. We can now train on this simple data set. For more information about symbolic operations in mxnet, please check our tutorial on this topic.

Next we can train a model with this structure and other parameters using mx.model.FeedForward.create:

mx.set.seed(0)
model <- mx.model.FeedForward.create(lro, X=train.x, y=train.y,
                                     ctx=mx.cpu(), num.round=50, array.batch.size=20,
                                     learning.rate=2e-6, momentum=0.9, eval.metric=mx.metric.rmse)

## Auto detect layout of input matrix, use rowmajor..
## Start training with 1 devices
## [1] Train-rmse=16.063282524034
## [2] Train-rmse=12.2792375712573
## [3] Train-rmse=11.1984634005885

...

## [48] Train-rmse=8.26890902770415
## [49] Train-rmse=8.25728089053853
## [50] Train-rmse=8.24580511500735

It is also easy to make predictions.

preds = predict(model, test.x)

## Auto detect layout of input matrix, use rowmajor..

sqrt(mean((preds-test.y)^2))

## [1] 7.800502

Here we also changed the eval.metric for regression. Currently there are four pre-defined metrics: "accuracy", "rmse", "mae" and "rmsle". One might wonder how to customize the evaluation metric; mxnet provides an interface for users to define their own metrics of interest:

demo.metric.mae <- mx.metric.custom("mae", function(label, pred) {
  res <- mean(abs(label-pred))
  return(res)
})

This is an example of mean absolute error. We can simply plug it into the training function:

mx.set.seed(0)
model <- mx.model.FeedForward.create(lro, X=train.x, y=train.y,
                                     ctx=mx.cpu(), num.round=50, array.batch.size=20,
                                     learning.rate=2e-6, momentum=0.9, eval.metric=demo.metric.mae)

## Auto detect layout of input matrix, use rowmajor..
## Start training with 1 devices
## [1] Train-mae=13.1889538083225
## [2] Train-mae=9.81431959337658
## [3] Train-mae=9.21576419870059

...

## [48] Train-mae=6.41731406417158
## [49] Train-mae=6.41011292926139
## [50] Train-mae=6.40312503493494

Congratulations! Now you have learned the basics of using mxnet. We can go further and tackle some real-world problems!

Handwritten Digits Classification Competition

MNIST is a handwritten digits image data set created by Yann LeCun. Every digit is represented by a 28x28 image. It has become a standard data set for testing classifiers on simple image input. Neural networks are no doubt strong models for image classification. There is a long-running competition on Kaggle using this data set.

We will present the basic usage of mxnet to compete in this challenge.

Data Loading

First, let us download the data from here, and put them under the data/ folder in your working directory.

Then we can read them in R and convert to matrices.

require(mxnet)

## Loading required package: mxnet
## Loading required package: methods

train <- read.csv('data/train.csv', header=TRUE)
test <- read.csv('data/test.csv', header=TRUE)
train <- data.matrix(train)
test <- data.matrix(test)

train.x <- train[,-1]
train.y <- train[,1]

Here every image is represented as a single row in train/test. The greyscale of each image falls in the range [0, 255]; we can linearly transform it into [0, 1] by

train.x <- t(train.x/255)
test <- t(test/255)

We also transpose the input matrix to npixel x nexamples, which is the column major format accepted by mxnet (and the convention of R).

Looking at the labels, we see that the counts of the digits are fairly even:

table(train.y)

## train.y
##    0    1    2    3    4    5    6    7    8    9
## 4132 4684 4177 4351 4072 3795 4137 4401 4063 4188

Network Configuration

Now we have the data. The next step is to configure the structure of our network.

data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

  1. In mxnet, we use its own data type symbol to configure the network. data <- mx.symbol.Variable("data") uses data to represent the input data, i.e. the input layer.
  2. Then we set the first hidden layer with fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128). This layer takes data as its input, and we set its name and the number of hidden neurons.
  3. The activation is set by act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu"). The activation function takes the output of the first hidden layer fc1.
  4. The second hidden layer takes the result from act1 as its input, with its name as "fc2" and the number of hidden neurons as 64.
  5. The second activation is almost the same as act1, except with a different input source and name.
  6. Here comes the output layer. Since there are only 10 digits, we set the number of neurons to 10.
  7. Finally we set the activation to softmax to get a probabilistic prediction.

Training

We are almost ready for the training process. Before we start the computation, let's decide which device we should use.

devices <- mx.cpu()

Here we assign the CPU to mxnet. After all this preparation, you can run the following command to train the neural network! Note again that mx.set.seed is the correct function to control the random process in mxnet.

mx.set.seed(0)
model <- mx.model.FeedForward.create(softmax, X=train.x, y=train.y,
                                     ctx=devices, num.round=10, array.batch.size=100,
                                     learning.rate=0.07, momentum=0.9,  eval.metric=mx.metric.accuracy,
                                     initializer=mx.init.uniform(0.07),
                                     epoch.end.callback=mx.callback.log.train.metric(100))

## Start training with 1 devices
## Batch [100] Train-accuracy=0.6563
## Batch [200] Train-accuracy=0.777999999999999
## Batch [300] Train-accuracy=0.827466666666665
## Batch [400] Train-accuracy=0.855499999999999
## [1] Train-accuracy=0.859832935560859
## Batch [100] Train-accuracy=0.9529
## Batch [200] Train-accuracy=0.953049999999999
## Batch [300] Train-accuracy=0.955866666666666
## Batch [400] Train-accuracy=0.957525000000001
## [2] Train-accuracy=0.958309523809525

...

## Batch [100] Train-accuracy=0.9937
## Batch [200] Train-accuracy=0.99235
## Batch [300] Train-accuracy=0.991966666666668
## Batch [400] Train-accuracy=0.991425000000003
## [9] Train-accuracy=0.991500000000003
## Batch [100] Train-accuracy=0.9942
## Batch [200] Train-accuracy=0.99245
## Batch [300] Train-accuracy=0.992433333333334
## Batch [400] Train-accuracy=0.992275000000002
## [10] Train-accuracy=0.992380952380955

Prediction and Submission

To make predictions, we can simply write

preds <- predict(model, test)
dim(preds)

## [1]    10 28000

It is a matrix with 10 rows and 28,000 columns, containing the desired classification probabilities from the output layer (one column per test example). To extract the most likely label for each example, we can use max.col on the transposed matrix in R:

pred.label <- max.col(t(preds)) - 1
table(pred.label)

## pred.label
##    0    1    2    3    4    5    6    7    8    9
## 2818 3195 2744 2767 2683 2596 2798 2790 2784 2825

With a little extra effort writing to a CSV file, we have our submission to the competition!

submission <- data.frame(ImageId=1:ncol(test), Label=pred.label)
write.csv(submission, file='submission.csv', row.names=FALSE, quote=FALSE)

LeNet

Next we are going to introduce a new network structure: LeNet. It was proposed by Yann LeCun for recognizing handwritten digits. Now we will demonstrate how to construct and train a LeNet in mxnet.

First we construct the network:

# input
data <- mx.symbol.Variable('data')
# first conv
conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
tanh1 <- mx.symbol.Activation(data=conv1, act_type="tanh")
pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max",
                          kernel=c(2,2), stride=c(2,2))
# second conv
conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5), num_filter=50)
tanh2 <- mx.symbol.Activation(data=conv2, act_type="tanh")
pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max",
                          kernel=c(2,2), stride=c(2,2))
# first fullc
flatten <- mx.symbol.Flatten(data=pool2)
fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 <- mx.symbol.Activation(data=fc1, act_type="tanh")
# second fullc
fc2 <- mx.symbol.FullyConnected(data=tanh3, num_hidden=10)
# loss
lenet <- mx.symbol.SoftmaxOutput(data=fc2)

Then let us reshape the matrices into arrays:

train.array <- train.x
dim(train.array) <- c(28, 28, 1, ncol(train.x))
test.array <- test
dim(test.array) <- c(28, 28, 1, ncol(test))

Next we are going to compare the training speed on different devices, so the definition of the devices goes first:

n.gpu <- 1
device.cpu <- mx.cpu()
device.gpu <- lapply(0:(n.gpu-1), function(i) {
  mx.gpu(i)
})

As you can see, we can pass a list of devices to ask mxnet to train on multiple GPUs (you can do a similar thing for CPUs, but since internal CPU computation is already multi-threaded, there is less to gain than with GPUs).

We start by training on the CPU. Because this takes a bit of time, we will run it for only one iteration.

mx.set.seed(0)
tic <- proc.time()
model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
                                     ctx=device.cpu, num.round=1, array.batch.size=100,
                                     learning.rate=0.05, momentum=0.9, wd=0.00001,
                                     eval.metric=mx.metric.accuracy,
                                     epoch.end.callback=mx.callback.log.train.metric(100))

## Start training with 1 devices
## Batch [100] Train-accuracy=0.1066
## Batch [200] Train-accuracy=0.16495
## Batch [300] Train-accuracy=0.401766666666667
## Batch [400] Train-accuracy=0.537675
## [1] Train-accuracy=0.557136038186157

print(proc.time() - tic)

##    user  system elapsed
## 130.030 204.976  83.821

Training on GPU:

mx.set.seed(0)
tic <- proc.time()
model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
                                     ctx=device.gpu, num.round=5, array.batch.size=100,
                                     learning.rate=0.05, momentum=0.9, wd=0.00001,
                                     eval.metric=mx.metric.accuracy,
                                     epoch.end.callback=mx.callback.log.train.metric(100))

## Start training with 1 devices
## Batch [100] Train-accuracy=0.1066
## Batch [200] Train-accuracy=0.1596
## Batch [300] Train-accuracy=0.3983
## Batch [400] Train-accuracy=0.533975
## [1] Train-accuracy=0.553532219570405
## Batch [100] Train-accuracy=0.958
## Batch [200] Train-accuracy=0.96155
## Batch [300] Train-accuracy=0.966100000000001
## Batch [400] Train-accuracy=0.968550000000003
## [2] Train-accuracy=0.969071428571432
## Batch [100] Train-accuracy=0.977
## Batch [200] Train-accuracy=0.97715
## Batch [300] Train-accuracy=0.979566666666668
## Batch [400] Train-accuracy=0.980900000000003
## [3] Train-accuracy=0.981309523809527
## Batch [100] Train-accuracy=0.9853
## Batch [200] Train-accuracy=0.985899999999999
## Batch [300] Train-accuracy=0.986966666666668
## Batch [400] Train-accuracy=0.988150000000002
## [4] Train-accuracy=0.988452380952384
## Batch [100] Train-accuracy=0.990199999999999
## Batch [200] Train-accuracy=0.98995
## Batch [300] Train-accuracy=0.990600000000001
## Batch [400] Train-accuracy=0.991325000000002
## [5] Train-accuracy=0.991523809523812

print(proc.time() - tic)

##    user  system elapsed
##   9.288   1.680   6.889

As you can see, by using the GPU we get a dramatic speedup in training! Finally, we can submit the result to Kaggle again to see the improvement in our ranking!

preds <- predict(model, test.array)
pred.label <- max.col(t(preds)) - 1
submission <- data.frame(ImageId=1:ncol(test), Label=pred.label)
write.csv(submission, file='submission.csv', row.names=FALSE, quote=FALSE)

Classify Real-World Images with Pre-trained Model

After the MNIST examples, are you ready to take one step further? One of the cool things that a deep learning algorithm can do is classify real-world images.

In this example we will show how to use a pre-trained Inception-BatchNorm network to predict the class of a real-world image. The network architecture is described in [1].

The pre-trained Inception-BatchNorm network can be downloaded from this link. This model gives state-of-the-art prediction accuracy on the ImageNet dataset.

Package Loading

To get started, we load the mxnet package by

require(mxnet)

In this example, we also need the package imager to load and preprocess the images in R.

require(imager)

Load the Pretrained Model

Make sure you unzip the pre-trained model into the current folder. Then we can use the model loading function to load the model into R.

model = mx.model.load("Inception/Inception_BN", iteration=39)

We also need to load the mean image, which is used for preprocessing, with mx.nd.load.

mean.img = as.array(mx.nd.load("Inception/mean_224.nd")[["mean_img"]])

Load and Preprocess the Image

Now we are ready to classify a real image. In this example we will test the model on a photo of Mt. Baker in northern Washington, but you can always change it to other images.

Load and plot the image:

im <- load.image("Pics/MtBaker.jpg")
plot(im)

Before feeding the image to the deep net, we need to do some preprocessing to make it fit the input requirements of the network. The preprocessing includes cropping and subtracting the mean. Because mxnet is deeply integrated with R, we can do all the processing in an R function.

The preprocessing function:

preproc.image <- function(im, mean.image) {
  # crop the image to a square on its short edge
  shape <- dim(im)
  short.edge <- min(shape[1:2])
  yy <- floor((shape[1] - short.edge) / 2) + 1
  yend <- yy + short.edge - 1
  xx <- floor((shape[2] - short.edge) / 2) + 1
  xend <- xx + short.edge - 1
  cropped <- im[yy:yend, xx:xend,,]
  # resize to 224 x 224, as required by the model's input
  resized <- resize(cropped, 224, 224)
  # convert to array (x, y, channel)
  arr <- as.array(resized)
  dim(arr) <- c(224, 224, 3)
  # subtract the mean image
  normed <- arr - mean.image
  # reshape to the format needed by mxnet (width, height, channel, num)
  dim(normed) <- c(224, 224, 3, 1)
  return(normed)
}

We use the defined preprocessing function to get the normalized image.

normed <- preproc.image(im, mean.img)

Classify the Image

Now we are ready to classify the image! We can use the predict function to get the probability over classes.

prob <- predict(model, X=normed)
dim(prob)

## [1] 1000    1

As you can see, prob is a 1000 x 1 array, giving the probability over the 1000 image classes for the input.

We can extract the top-5 class indices.

max.idx <- order(prob[,1], decreasing = TRUE)[1:5]
max.idx

## [1] 981 971 980 673 975

These indices do not make much sense on their own, so let us see what they actually represent. We can read the names of the classes from the following file.

synsets <- readLines("Inception/synset.txt")

And let us print the corresponding lines:

print(paste0("Predicted Top-classes: ", synsets[max.idx]))

## [1] "Predicted Top-classes: n09472597 volcano"      
## [2] "Predicted Top-classes: n09193705 alp"          
## [3] "Predicted Top-classes: n09468604 valley, vale" 
## [4] "Predicted Top-classes: n03792972 mountain tent"
## [5] "Predicted Top-classes: n09288635 geyser"

Mt. Baker is indeed a volcano, and the second most likely guess, "alp", is also correct.

Let's see if it still does a good job on some other images. The following photo was taken in downtown Vancouver.

im <- load.image("Pics/Vancouver.jpg")
plot(im)

normed <- preproc.image(im, mean.img)
prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:5]
print(paste0("Predicted Top-classes: ", synsets[max.idx]))

## [1] "Predicted Top-classes: n09332890 lakeside, lakeshore"    
## [2] "Predicted Top-classes: n03983396 pop bottle, soda bottle"
## [3] "Predicted Top-classes: n13133613 ear, spike, capitulum"  
## [4] "Predicted Top-classes: n12144580 corn"                   
## [5] "Predicted Top-classes: n02980441 castle"

This photo is indeed taken at a lakeside. One interesting guess is the fifth one, "castle": the outline of the buildings in the city is recognized as the battlements of a castle. We might need more pictures containing "battlements with glass windows" to teach the model about modern cities.

How about this photo taken on Titlis:

im <- load.image("Pics/Switzerland.jpg")
plot(im)

normed <- preproc.image(im, mean.img)
prob <- predict(model, X=normed)
max.idx <- order(prob[,1], decreasing = TRUE)[1:5]
print(paste0("Predicted Top-classes: ", synsets[max.idx]))

## [1] "Predicted Top-classes: n04371774 swing"                         
## [2] "Predicted Top-classes: n04275548 spider web, spider's web"      
## [3] "Predicted Top-classes: n01773549 barn spider, Araneus cavaticus"
## [4] "Predicted Top-classes: n03000684 chain saw, chainsaw"           
## [5] "Predicted Top-classes: n03888257 parachute, chute"

This time the main element is small and does not stand out from the "noisy" background, so the result is not perfect, but we can still find the similarity between "swing" and "gondola".

Now, why don't you take a photo of your surroundings and ask mxnet to tell you what is in it? Have some fun!

Try it out and Contribute

You can find MXNet on GitHub. Besides R, MXNet also supports Python and Julia, and allows models and analysis results to be shared across the different language bindings. MXNet is built by an active community of users. Please fork the project on GitHub and contribute your wisdom to make the project even better :)

Acknowledgement

We would like to thank the RcppCore team for their great help in making MXNetR happen.

[1] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).



from R-bloggers http://ift.tt/1LjVOiA
via IFTTT

Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance

The New York City Taxi & Limousine Commission has released a staggeringly detailed historical dataset covering over 1.1 billion individual taxi trips in the city from January 2009 through June 2015. Taken as a whole, the detailed trip-level data is more than just a vast list of taxi pickup and drop off coordinates: it’s a story of New York. How bad is the rush hour traffic from Midtown to JFK? Where does the Bridge and Tunnel crowd hang out on Saturday nights? What time do investment bankers get to work? How has Uber changed the landscape for taxis? And could Bruce Willis and Samuel L. Jackson have made it from 72nd and Broadway to Wall Street in less than 30 minutes? The dataset addresses all of these questions and many more.

I mapped the coordinates of every trip to local census tracts and neighborhoods, then set about extracting stories and meaning from the data. This post covers a lot, but for those who want to pursue more analysis on their own: everything in this post—the data, software, and code—is freely available. Full instructions to download and analyze the data for yourself are available on GitHub.

Table of Contents

  1. Maps
  2. The Data
  3. Borough Trends, and the Rise of Uber
  4. Airport Traffic
  5. On the Realism of Die Hard 3
  6. How Does Weather Affect Taxi and Uber Ridership?
  7. NYC Late Night Taxi Index
  8. The Bridge and Tunnel Crowd
  9. Northside Williamsburg
  10. Privacy Concerns
  11. Investment Bankers
  12. Parting Thoughts

Maps

I’m certainly not the first person to use the public taxi data to make maps, but I hadn’t previously seen a map that includes the entire dataset of pickups and drop offs since 2009 for both yellow and green taxis. You can click the maps to view high resolution versions:

These maps show every taxi pickup and drop off, respectively, in New York City from 2009–2015. The maps are made up of tiny dots, where brighter regions indicate more taxi activity. The green tinted regions represent activity by green boro taxis, which can only pick up passengers in upper Manhattan and the outer boroughs. Notice how pickups are more heavily concentrated in Manhattan, while drop offs extend further into the outer boroughs.

If you think these are pretty, I recommend checking out the high resolution images of pickups and drop offs.

NYC Taxi Data

The official TLC trip record dataset contains data for over 1.1 billion taxi trips from January 2009 through June 2015, covering both yellow and green taxis. Each individual trip record contains precise location coordinates for where the trip started and ended, timestamps for when the trip started and ended, plus a few other variables including fare amount, payment method, and distance traveled.

I used PostgreSQL to store the data and PostGIS to perform geographic calculations, including the heavy lifting of mapping latitude/longitude coordinates to NYC census tracts and neighborhoods. The full dataset takes up 267 GB on disk, before adding any indexes. For more detailed information on the database schema and geographic calculations, take a look at the GitHub repository.
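
For a flavor of what that mapping step looks like, here is a minimal sketch of a PostGIS point-in-polygon update run from R via RPostgreSQL. The table and column names below are hypothetical; the actual schema is defined in the GitHub repository.

library(RPostgreSQL)

con <- dbConnect(PostgreSQL(), dbname = "nyc_taxi_data")

# Assign each pickup to the census tract polygon that contains it
# (hypothetical table/column names)
dbGetQuery(con, "
  UPDATE trips SET pickup_tract_id = ct.gid
  FROM census_tracts ct
  WHERE ST_Within(
    ST_SetSRID(ST_MakePoint(trips.pickup_longitude, trips.pickup_latitude), 4326),
    ct.geom
  );
")

dbDisconnect(con)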

Uber Data

Thanks to the folks at FiveThirtyEight, there is also some publicly available data covering nearly 19 million Uber rides in NYC from April–September 2014 and January–June 2015, which I’ve incorporated into the dataset. The Uber data is not as detailed as the taxi data; in particular, Uber provides time and location for pickups only, not drop offs. But I wanted to provide a unified dataset including all available taxi and Uber data. Each trip in the dataset has a cab_type_id, which indicates whether the trip was in a yellow taxi, green taxi, or Uber car.

Borough Trends, and the Rise of Uber

The introduction of the green boro taxi program in August 2013 dramatically increased the amount of taxi activity in the outer boroughs. Here’s a graph of taxi pickups in Brooklyn, the most populous borough, split by cab type:

From 2009–2013, a period during which migration from Manhattan to Brooklyn generally increased, yellow taxis nearly doubled the number of pickups they made in Brooklyn.

Once boro taxis appeared on the scene, though, the green taxis quickly overtook yellow taxis, so that as of June 2015, green taxis accounted for 70% of Brooklyn’s 850,000 monthly taxi pickups, while yellow taxi pickups in Brooklyn have fallen back to their 2009 rate. Yellow taxis still account for more drop offs in Brooklyn, since many people continue to take taxis from Manhattan to Brooklyn, but even in drop offs, the green taxis are closing the gap.

Let’s add Uber into the mix. I live in Brooklyn, and although I sometimes take taxis, an anecdotal review of my credit card statements suggests that I take about four times as many Ubers as I do taxis. It turns out I’m not alone: between June 2014 and June 2015, the number of Uber pickups in Brooklyn grew by 525%! As of June 2015, the most recent data available when I wrote this, Uber accounts for more than twice as many pickups in Brooklyn compared to yellow taxis, and is rapidly approaching the popularity of green taxis:

Note that Uber data is only available from Apr 2014–Sep 2014, then from Jan 2015–Jun 2015, hence the gap in the graph

Manhattan, not surprisingly, accounts for by far the largest number of taxi pickups of any borough. In any given month, around 85% of all NYC taxi pickups occur in Manhattan, and most of those are made by yellow taxis. Even though green taxis are allowed to operate in upper Manhattan, they account for only a tiny fraction of yellow taxi activity:

Uber has grown dramatically in Manhattan as well, notching a 275% increase in pickups from June 2014 to June 2015, while taxi pickups declined by 9% over the same period. Uber made 1.4 million more Manhattan pickups in June 2015 than it did in June 2014, while taxis made 1.1 million fewer pickups. However, even though Uber picked up nearly 2 million Manhattan passengers in June 2015, Uber still accounts for less than 15% of total Manhattan pickups:

Queens still has more yellow taxi pickups than green taxi pickups, but that’s entirely because LaGuardia and JFK airports are both in Queens, and they are heavily served by yellow taxis. And although Uber has experienced nearly Brooklyn-like growth in Queens, it still lags behind yellow and green taxis, though again the yellow taxis are heavily influenced by airport pickups:

If we restrict to pickups at LaGuardia and JFK Airports, we can see that Uber has grown to over 100,000 monthly pickups, but yellow cabs still shuttle over 80% of car-hailing airport passengers back into the city:

The Bronx and Staten Island have significantly lower taxi volume, but you can see graphs for both on GitHub. The most noteworthy observations are that almost no yellow taxis venture to the Bronx, and Uber is already more popular than taxis on Staten Island.

How Long does it Take to Get to an NYC Airport?

Most of these vehicles [heading to JFK Airport] would undoubtedly be using the Van Wyck Expressway; Moses's stated purpose in proposing it was to provide a direct route to the airport from mid-Manhattan. But the Van Wyck Expressway was designed to carry—under "optimum" conditions (good weather, no accidents or other delays)—2,630 vehicles per hour. Even if the only traffic using the Van Wyck was JFK traffic, the expressway's capacity would not be sufficient to handle it.
[...] The air age was just beginning: air traffic was obviously going to boom to immense dimensions. If the Van Wyck expressway could not come anywhere near handling JFK's traffic when that traffic was 10,000 persons per hour, what was going to happen when that traffic increased to 15,000 persons per hour? To 20,000?

—Robert Caro, The Power Broker: Robert Moses and the Fall of New York (1974)

A subject near and dear to all New Yorkers' hearts: how far in advance do you have to hail a cab in order to make your flight at one of the three area airports? Of course, this depends on many factors: is there bad rush hour traffic? Is the UN in session? Will your cab driver know a “secret” shortcut to avoid the day’s inevitable bottleneck on the Van Wyck?

I took all weekday taxi trips to the airports and calculated the distribution of how long it took to travel from each neighborhood to the airports at each hour of the day. In most cases, the worst hour to travel to an airport is 4–5 PM. For example, the median taxi trip leaving Midtown headed for JFK Airport between 4 and 5 PM takes 64 minutes! 10% of trips during that hour take over 84 minutes—good luck making your flight in that case.

If you left Midtown heading for JFK between 10 and 11 AM, you’d face a median trip time of 38 minutes, with a 90% chance of getting there in less than 50 minutes. Google Maps estimates about an hour travel time on public transit from Bryant Park to JFK, so depending on the time of day and how close you are to a subway stop, your expected travel time might be better on public transit than in a cab, and you could save a bunch of money.
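
As a rough sketch of how such a distribution might be computed in R with dplyr (the trips data frame and its column names are invented for illustration; in practice the aggregation would be pushed down to the database):

library(dplyr)

jfk.times <- trips %>%
  filter(dropoff_airport == "JFK", is_weekday) %>%
  mutate(pickup_hour = as.integer(format(pickup_datetime, "%H")),
         minutes = as.numeric(difftime(dropoff_datetime, pickup_datetime,
                                       units = "mins"))) %>%
  group_by(pickup_neighborhood, pickup_hour) %>%
  summarize(median_minutes = median(minutes),
            p90_minutes = quantile(minutes, 0.9))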

The stories are similar for traveling to LaGuardia and Newark airports, and from other neighborhoods. You can see the graphs for airport travel times from any neighborhood by selecting it in the dropdown below:

[Interactive charts: travel-time distributions from Midtown, Manhattan to LaGuardia, JFK, and Newark airports; a dropdown on the original post selects other neighborhoods.]

Could Bruce Willis and Samuel L. Jackson have made it from the Upper West Side to Wall Street in 30 minutes?

Airports aren’t the only destinations that suffer from traffic congestion. In Die Hard: With a Vengeance, John McClane (Willis) and Zeus Carver (Jackson) have to make it from 72nd and Broadway to the Wall Street 2/3 subway station during morning rush hour in less than 30 minutes, or else a bomb will go off. They commandeer a taxi, drive it frantically through Central Park, tailgate an ambulance, and just barely make it in time (of course the bomb goes off anyway…). Thanks to the TLC’s publicly available data, we can finally address audience concerns about the realism of this sequence.

McClane and Carver leave the Upper West Side at 9:50 AM, so I took all taxi rides that:

  • Picked up in the Upper West Side census tracts between West 70th and West 74th streets
  • Dropped off in the downtown tract containing the Wall Street 2/3 subway stop
  • Picked up on a weekday morning between 9:20 and 10:20 AM

And made a histogram of travel times:

There are 580 such taxi trips in the dataset, with a mean travel time of 29.8 minutes, and a median of 29 minutes. That means that half of such trips actually made it within the allotted time of 30 minutes! Now, our heroes might need a few minutes to commandeer a cab and get down to the subway platform on foot, so if we allot 3 minutes for those tasks and 27 minutes for driving, then only 39% of trips make it in 27 minutes or less. Still, in the movie they make it seem like a herculean task with almost zero probability of success, when in reality it’s just about average. This seems to be the rare action movie sequence which is actually easier to recreate in real life than in the movies!
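
Here is a sketch of that filter-and-histogram computation in R, again with invented column names, and with the relevant census tract IDs assumed to be known:

library(dplyr)
library(ggplot2)

# uws.tracts: Upper West Side tracts between W 70th and W 74th (assumed)
# wall.st.tract: tract containing the Wall Street 2/3 stop (assumed)
die.hard <- trips %>%
  filter(pickup_tract %in% uws.tracts,
         dropoff_tract == wall.st.tract,
         is_weekday,
         pickup_time >= "09:20", pickup_time <= "10:20") %>%
  mutate(minutes = as.numeric(difftime(dropoff_datetime, pickup_datetime,
                                       units = "mins")))

mean(die.hard$minutes)        # 29.8 in the post
median(die.hard$minutes)      # 29
mean(die.hard$minutes <= 27)  # share that makes it with 3 minutes to spare

ggplot(die.hard, aes(x = minutes)) + geom_histogram(binwidth = 2)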

How Does Weather Affect Taxi and Uber Ridership?

Since 2009, the days with the fewest citywide taxi trips all have obvious relationships to the weather:

  1. Sunday, August 28, 2011: Hurricane Irene, 28,596 trips
  2. Monday, December 27, 2010: North American blizzard, 69,650 trips
  3. Monday, October 29, 2012: Hurricane Sandy, 111,605 trips

I downloaded daily Central Park weather data from the National Climatic Data Center, and joined it to the taxi data to see if we could learn anything else about the relationship between weather and taxi rides. There are lots of confounding variables, including seasonal trends, annual growth due to boro taxis, and whether weather events happen to fall on weekdays or weekends, but it would appear that snowfall has a significant negative impact on daily taxi ridership:

On the other hand, rain alone does not seem to affect total daily ridership:

Since Uber trip data is only available for a handful of months, it’s more difficult to measure the impact of weather on Uber ridership. Uber is well-known for its surge pricing during times of high demand, which often includes inclement weather. There were a handful of rainy and snowy days in the first half of 2015 when Uber data is available, so for each rain/snow day, I calculated the total number of trips made by taxis and Ubers, and compared that to each service’s daily average over the previous week. For example, Uber’s ratio of 69% on 1/26/15 means that there were 69% as many Uber trips made that day compared to Uber’s daily average from 1/19–1/25:

Date      Snowfall (inches)   Taxi trips vs. prev week   Uber trips vs. prev week
1/26/15   5.5                 55%                        69%
1/27/15   4.3                 33%                        41%
2/2/15    5.0                 91%                        107%
3/1/15    4.8                 85%                        88%
3/5/15    7.5                 83%                        100%
3/20/15   4.5                 105%                       134%

Date      Precipitation (inches)   Taxi trips vs. prev week   Uber trips vs. prev week
1/18/15   2.1                      98%                        112%
3/14/15   0.8                      114%                       130%
4/20/15   1.4                      90%                        105%
5/31/15   1.5                      96%                        116%
6/1/15    0.7                      99%                        106%
6/21/15   0.6                      92%                        94%
6/27/15   1.1                      114%                       147%

Although this data does not conclusively prove anything, on every single inclement weather day in 2015, in both rain and snow, Uber provided more trips relative to its previous week’s average than taxis did. Part of this is probably because the number of Uber cars is still growing, so all things held constant, we’d expect Uber to provide more trips on each successive day, while total taxi trips stay flat. But for Uber’s ratio to be higher every single day seems unlikely to be random chance, though again I have no justification to make any strong claims. Whether it’s surge pricing or something else, Uber’s capacity seems less negatively impacted by bad weather relative to taxi capacity.
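
To make the previous-week comparison concrete, here is one way the ratio could be computed, assuming a daily_trips data frame with one row per date and cab type (names invented for illustration):

# e.g. weather.ratio("2015-01-26", "uber") returns ~0.69, as in the table
weather.ratio <- function(day, type) {
  day <- as.Date(day)
  day.total <- daily_trips$trips[daily_trips$date == day &
                                 daily_trips$cab_type == type]
  prev.week <- daily_trips$trips[daily_trips$date >= day - 7 &
                                 daily_trips$date < day &
                                 daily_trips$cab_type == type]
  day.total / mean(prev.week)
}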

NYC Late Night Taxi Index

Many real estate listings these days include information about the neighborhood: rankings of local schools, walkability scores, and types of local businesses. We can use the taxi data to draw some inferences about what parts of the city are popular for going out late at night by looking at the percentage of each census tract’s taxi pickups that occur between 10 PM and 5 AM—the time period I’ve deemed “late night.”
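
Computed in R, the index might look something like this sketch, assuming each trip already carries its pickup tract and hour (column names invented):

library(dplyr)

late.night.index <- trips %>%
  group_by(pickup_tract) %>%
  summarize(pickups = n(),
            late_night_pct = mean(pickup_hour >= 22 | pickup_hour < 5)) %>%
  filter(pickups >= 50000) %>%  # same threshold as the map below
  arrange(desc(late_night_pct))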

Some people want to live in a city that never sleeps, while others prefer their peace and quiet. According to the late night taxi index, if you’re looking for a neighborhood with vibrant nightlife, try Williamsburg, Greenpoint, or Bushwick in Brooklyn. The census tract with the highest late night taxi index is in East Williamsburg, where 76% of taxi pickups occur between 10 PM and 5 AM. If you insist on Manhattan, then your best bets are the Lower East Side or the Meatpacking District.

Conversely, if you want to avoid the nighttime commotion, head uptown to the Upper East or Upper West Side (if you’re not already there…). The stretch in the east 80s between 5th Avenue and Park Avenue has the lowest late night taxi index, with only 5% of all taxi pickups occurring during the nighttime hours.

Here’s a map of all census tracts that had at least 50,000 taxi pickups, where darker shading represents a higher score on the late night taxi index:

BK nights: 76% of the taxi pickups that occur in one of East Williamsburg's census tracts happen between 10 PM and 5 AM, the highest rate in the city. A paltry 5% of taxi pickups in some Upper East Side tracts occur in the late night hours

Whither the Bridge and Tunnel Crowd?

The “bridge and tunnel” moniker applies, on a literal level, to anyone who travels onto the island of Manhattan via a bridge or tunnel, most often from New Jersey, Long Island, or the outer boroughs. Typically it’s considered an insult, though, with the emerging popularity of the outer boroughs, well, let’s just say the Times is on it.

In order to measure B&T destinations from the taxi data, I isolated all trips originating near Penn Station on Saturday evenings between 6 PM and midnight. Penn Station serves as the point of disembarkation for New Jersey Transit and Long Island Rail Road, so although not everyone hailing a taxi around Penn Station on a Saturday evening just took the train into the city, it should be at least a decent proxy for B&T trends. Here’s the map of the neighborhoods where these rides dropped off:

The most popular destinations for B&T trips are in Murray Hill, the Meatpacking District, Chelsea, and Midtown. We can even drill down to the individual trip level to see exactly where these trips wind up. Here’s a map of Murray Hill, the most popular B&T destination, where each dot represents a single Saturday evening taxi trip originating at Penn Station:

As reported, repeatedly, in the NYT, the heart of Murray Hill nightlife lies along 3rd Avenue, in particular the stretch from 32nd to 35th streets. Taxi data shows the plurality of Saturday evening taxi trips from Penn Station drop off in this area, with additional clusters in the high 20s on 3rd Avenue, further east along 34th Street, and a spot on East 39th Street between 1st and 2nd avenues. With a bit more work we might be able to reverse geocode these coordinates to actual bar names, perhaps putting a more scientific spin on this classic of the genre from Complex.

Northside Williamsburg

According to taxi activity, the most ascendant census tract in the entire city since 2009 lies on Williamsburg’s north side, bounded by North 14th St to the north, Berry St to the east, North 7th St to the south, and the East River to the west:

The Northside neighborhood is known for its nightlife: a full 72% of pickups occur during the late night hours. It’s difficult to compare 2009–2015 taxi growth across census tracts and boroughs because of the introduction of the green boro taxi program, but the Northside tract had a larger increase in total taxi pickups over that time period than any other tract in the city, with the exception of the airports:

Even before the boro taxi program began in August 2013, Northside Williamsburg experienced a dramatic increase in taxi activity, growing from a mere 500 monthly pickups in June 2009, to 10,000 in June 2013, and 25,000 by June 2015. Let’s look at an animated map of taxi pickups to see if we can learn anything:

The cool thing about the animation is that it lets us pinpoint the exact locations of some of the more popular Northside businesses to open in the past few years, in particular along Wythe Avenue:

  • May 2012: Wythe Hotel, Wythe and N 11th
  • January 2013: Output nightclub, Wythe and N 12th
  • March 2014: Verboten nightclub, N 11th between Wythe and Kent

Meanwhile, I’m sure the developers of the future William Vale and Hoxton hotels hope that the Northside’s inexorable rise continues, but at least according to taxi data, pickups have remained stable since mid-2014, perhaps indicating that the neighborhood’s popularity has plateaued?

Privacy Concerns, East Hampton Edition

The first time the TLC released public taxi data in 2013, following a FOIL request by Chris Whong, it included supposedly anonymized taxi medallion numbers for every trip. In fact it was possible to decode each trip’s actual medallion number, as described by Vijay Pandurangan. This led to many discussions about data privacy, and the TLC removed all information about medallion numbers from the more recent data releases.

But the data still contains precise latitude and longitude coordinates, which can potentially be used to determine where people live, work, socialize, and so on. This is all fun and games when we’re looking at the hottest new techno club in Northside Williamsburg, but when it’s people’s homes it gets a bit weird. NYC is of course very dense, and if you take a rush hour taxi ride from one populous area to another, say Grand Central Terminal to the Upper East Side, it’s unlikely that there’s anything unique about your trip that would let someone figure out where you live or work.

But what if you’re going somewhere a bit off the beaten path for taxis? In that case, your trip might well be unique, and it might reveal information about you. For example, I don’t know who owns one of these beautiful oceanfront homes on East Hampton’s exclusive Further Lane (exact address redacted to protect the innocent):

But I do know the exact Brooklyn Heights location and time from which someone (not necessarily the owner) hailed a cab, rode 106.6 miles, and paid a $400 fare with a credit card, including a $110.50 tip. If the TLC truly wanted to remove potentially personal information, they would have to remove latitude and longitude coordinates from the dataset entirely. There’s a tension here: public data is supposed to let people know how well the taxi system serves different parts of the city, so maybe the TLC should provide census tracts instead of coordinates, or perhaps coordinates only within busy parts of Manhattan, but providing coordinates that uniquely identify a rider’s home feels excessive.

Investment Bankers

While we’re on the topic of the Hamptons: we’ve already covered the hipsters of Williamsburg and the B&Ts of Murray Hill, why not see what the taxi data can tell us about investment bankers, yet another of New York’s distinctive subcultures?

Goldman Sachs lends itself nicely to analysis because its headquarters at 200 West Street has a dedicated driveway, marked “Hudson River Greenway” on this Google Map:

We can isolate all taxi trips that dropped off in that driveway to get a sense of where Goldman Sachs employees—at least the ones who take taxis—come from in the mornings, and when they arrive. Here’s a histogram of weekday drop off times at 200 West Street:

The cabs start dropping off around 5 AM, then peak hours are 7–9 AM, before tapering off in the afternoon. Presumably most of the post-morning drop offs are visitors as opposed to employees. If we restrict to drop offs before 10 AM, the median drop off time is 7:59 AM, and 25% of drop offs happen before 7:08 AM.

A few blocks to the north is Citigroup’s headquarters at 388 Greenwich St, and although the building doesn’t appear to have a dedicated driveway the way Goldman does, we can still isolate taxis that drop off directly in front of the building to see what time Citigroup’s workers arrive in the morning:

Some of the evening drop offs near Citigroup are probably for the bars and restaurants across the street, but again the morning drop offs are probably mostly Citigroup employees. Citigroup’s morning arrival stats are comparable to Goldman’s: a median arrival of 7:51 AM, and 25% of drop offs happen before 7:03 AM.

The top neighborhoods for taxi pickups that drop off at Goldman Sachs or Citigroup on weekday mornings are:

  1. West Village
  2. Chelsea-Flatiron-Union Square
  3. SoHo-Tribeca

So what’s the deal, do bankers not live above 14th St (or maybe 23rd St) anymore? Alas, there are still plenty of trips from the stodgier parts further uptown, and it’s certainly possible that people coming from uptown are more likely to take the subway, private cars, or other modes of transport, so the taxi data is by no means conclusive. But still, the cool kids have been living downtown for a while now, why should the bankers be any exception?

Parting Thoughts

As I mentioned in the introduction, this post covers a lot. And even then, I feel like it barely scratches the surface of the information available in the full dataset. For example, did you know that in January 2009, just over 20% of taxi fares were paid with a credit card, but as of June 2015, that number has grown to over 60% of all fares?

And for more expensive taxi trips, riders now pay via credit card more than 75% of the time:
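
Both of those breakdowns are simple grouped aggregations; a sketch with invented column names:

library(dplyr)

trips %>%
  mutate(month = format(pickup_datetime, "%Y-%m"),
         fare_bucket = cut(fare_amount, breaks = c(0, 10, 25, 50, Inf))) %>%
  group_by(month, fare_bucket) %>%
  summarize(credit_card_share = mean(payment_type == "credit card"))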

There are endless analyses to be done, and more datasets that could be merged with the taxi data for further investigation. The Citi Bike program releases public ride data; I wonder if the introduction of a bike-share system had a material impact on taxi ridership? And maybe we could quantify fairweather fandom by measuring how taxi volume to Yankee Stadium and Citi Field fluctuates based on the Yankees' and Mets' records?

There are investors out there who use satellite imagery to make investment decisions, e.g. if there are lots of cars in a department store’s parking lots this holiday season, maybe it’s time to buy. You might be able to do something similar with the taxi data: is airline market share shifting, based on traffic through JetBlue’s terminal at JFK vs. Delta’s terminal at LaGuardia? Is demand for lumber at all correlated to how many people are loading up on IKEA furniture in Red Hook?

I’d imagine that people will continue to obtain Uber data via FOIL requests, so it will be interesting to see how that unfolds amidst increased tension with city government and constant media speculation about a possible IPO.

Lastly, I mentioned the “medium data revolution” in my previous post about Fannie Mae and Freddie Mac, and the same ethos applies here. Not too long ago, the idea of downloading, processing, and analyzing 267 GB of raw data containing 1.1 billion rows on a commodity laptop would have been almost laughably naive. Today, not only is it possible on a MacBook Air, but there are increasingly more open-source software tools available to aid in the process. I’m partial to PostgreSQL and R, but those are implementation details: increasingly, the limiting factor of data analysis is not computational horsepower, but human curiosity and creativity.

GitHub

If you’re interested in getting the data and doing your own analysis, or just want to read a bit about the more technical details, head over to the GitHub repository.



from R-bloggers http://ift.tt/1OOZsbY
via IFTTT

Tuesday, November 3, 2015

What’s the probability that a significant p-value indicates a true effect?

If the p-value is < .05, then the probability of falsely rejecting the null hypothesis is <5%, right? That means a maximum of 5% of all significant results are false positives (that’s what we control with the α rate). Well, no. As you will see in a minute, the “false discovery rate” (aka false-positive rate), […]
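
A quick back-of-the-envelope illustration of why the false discovery rate can be far above 5% (the prior and power below are made-up numbers, not values from the post):

alpha <- 0.05  # significance level
power <- 0.35  # chance of detecting a true effect
prior <- 0.30  # assumed share of tested hypotheses that are truly non-null

true.pos  <- prior * power          # P(significant and effect is real)
false.pos <- (1 - prior) * alpha    # P(significant and effect is null)
false.pos / (true.pos + false.pos)  # false discovery rate: 0.25, not 0.05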

from R-bloggers http://ift.tt/1NONADg
via IFTTT

Monday, November 2, 2015

Graphical Modeling Mimics Selective Attention: Customer Satisfaction Ratings

As shown by the eye tracking lines and circles, there is more on the above screenshot than we can process simultaneously. Visual perception takes time, and we must track where the eye focuses by recording sequence and duration. The "50% off" ...

from R-bloggers http://ift.tt/1N9I002
via IFTTT

Statistical Graphics and Visualization course materials

I’ve just finished teaching the Fall 2015 session of 36-721, Statistical Graphics and Visualization. Again, it is a half-semester course designed primarily for students in the MSP program (Master of Statistical Practice) in the CMU statistics department. I’m pleased that … Continue reading

from R-bloggers http://ift.tt/1jSMw7N
via IFTTT