Wednesday, July 1, 2015

How Airbnb built a data science team

From VentureBeat: Back then we knew so little about the business that any insight was groundbreaking; data infrastructure was fast, stable, and real-time (I was querying our production MySQL database); the company was so small that everyone was in the loop about every decision; and the data team (me) was aligned around a singular set …

from Simply Statistics http://ift.tt/1NvtRYt
via IFTTT

Friday, June 26, 2015

An Attempt to Understand Boosting Algorithm(s)

On Tuesday, at the annual meeting of the French Economic Association, I was having lunch with Alfred, and while we were chatting about modeling issues (econometric models against machine learning prediction), he asked me what boosting was. Since I could not be very specific, we looked at the Wikipedia page: boosting is "a machine learning ensemble meta-algorithm for primarily reducing bias, and also variance, in supervised learning, and a family of machine learning algorithms which convert weak learners to strong ones." One should admit that it is not … Continue reading An Attempt to Understand Boosting Algorithm(s)

from R-bloggers http://ift.tt/1FEEXnQ
via IFTTT
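The Wikipedia definition quoted above is abstract, but the mechanics are simple. As a rough sketch (in Python with scikit-learn, not the post's R code), here is what a minimal AdaBoost-style loop could look like: fit a decision stump, up-weight the observations it misclassifies, fit the next stump on the re-weighted data, and combine the stumps by a weighted vote.

```python
# Minimal AdaBoost-style sketch (illustrative only, not the post's code):
# decision stumps as weak learners, re-weighted each round so later stumps
# focus on the examples earlier stumps got wrong.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y = 2 * y - 1                      # AdaBoost works with labels in {-1, +1}

w = np.full(len(y), 1 / len(y))    # start with uniform sample weights
stumps, alphas = [], []

for m in range(50):                # 50 boosting rounds
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)          # weighted error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))  # this stump's vote
    w *= np.exp(-alpha * y * pred)                     # up-weight mistakes
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# The "strong" learner is the sign of the weighted sum of the weak learners.
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```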

Thursday, June 25, 2015

KDD Cup 2015: The story of how I built hundreds of predictive models….And got so close, yet so far away from 1st place!

The challenge from the KDD Cup this year was to use their data on student enrollment in online MOOCs to predict who would drop out vs. who would stay. The short story is that using H2O and a lot … Continue reading

from R-bloggers http://ift.tt/1fFLDgk
via IFTTT
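For context, a dropout-prediction model of the kind the post describes might look roughly like the sketch below, using H2O's Python interface rather than the author's R setup. The file name, column names, and parameters are invented for illustration only.

```python
# Hypothetical sketch of the workflow the post describes: a gradient boosting
# model in H2O predicting MOOC dropout. File and column names are made up;
# the actual KDD Cup 2015 features and evaluation setup differ.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

data = h2o.import_file("enrollment_features.csv")   # assumed file name
data["dropout"] = data["dropout"].asfactor()         # binary target
train, valid = data.split_frame(ratios=[0.8], seed=42)

features = [c for c in data.columns if c != "dropout"]
gbm = H2OGradientBoostingEstimator(ntrees=500, learn_rate=0.05,
                                   max_depth=5, seed=42)
gbm.train(x=features, y="dropout",
          training_frame=train, validation_frame=valid)

print(gbm.auc(valid=True))   # validation AUC, the usual metric for this task
```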

Tuesday, June 23, 2015

Illustrated Guide to ROC and AUC

(In a past job interview I failed at explaining how to calculate and interpret ROC curves – so here goes my attempt to fill this knowledge gap.) Think of a regression model mapping a number of features onto a real number … Continue reading

from R-bloggers http://ift.tt/1LrAwEW
via IFTTT
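The calculation the post walks through can be summarised in a few lines: sweep a threshold across the model's scores, compute the true-positive and false-positive rates at each cut, and integrate the resulting curve. A minimal Python sketch with made-up toy data (not the post's example):

```python
# Minimal sketch of how an ROC curve and its AUC are computed by hand.
import numpy as np

y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])                # toy labels
y_score = np.array([.1, .4, .35, .8, .45, .7, .2, .9, .6, .55])   # toy scores

thresholds = np.sort(np.unique(y_score))[::-1]   # highest cut first
tpr, fpr = [], []
for t in thresholds:
    pred = y_score >= t                          # classify at this threshold
    tpr.append(np.sum(pred & (y_true == 1)) / np.sum(y_true == 1))
    fpr.append(np.sum(pred & (y_true == 0)) / np.sum(y_true == 0))

# AUC via the trapezoidal rule over the (FPR, TPR) points, with the
# (0, 0) endpoint added.
fpr = np.concatenate(([0.0], fpr))
tpr = np.concatenate(([0.0], tpr))
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print("AUC:", auc)
```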

Friday, June 19, 2015

‘Variable Importance Plot’ and Variable Selection

Classification trees are nice. They provide an interesting alternative to logistic regression. I started to include them in my courses maybe 7 or 8 years ago. The question is nice (how to get an optimal partition), the algorithmic procedure is nice (the trick of splitting according to one variable, and only one, at each node, and then moving forward, never backward), and the visual output is just perfect (with that tree structure). But the prediction can be rather poor. The performance of that algorithm can hardly … Continue reading ‘Variable Importance Plot’ and Variable Selection

from R-bloggers http://ift.tt/1Sr1nSt
via IFTTT
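A rough analogue of the variable-importance plot discussed in the post, written in Python with scikit-learn rather than the author's R code; note that scikit-learn reports impurity-based importance by default, which is only one of several importance measures a random forest can produce.

```python
# Fit a random forest on a stock dataset, then rank and plot the features by
# the impurity-based importance the forest reports.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y)

order = rf.feature_importances_.argsort()        # least to most important
plt.barh(range(len(order)), rf.feature_importances_[order])
plt.yticks(range(len(order)), names[order])
plt.xlabel("mean decrease in impurity")
plt.tight_layout()
plt.show()
```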

Thursday, June 18, 2015

Confidence Intervals for prediction in GLMMs

With LMs and GLMs, the predict function can return the standard error of the predicted values, on either the observed data or new data. This is then used to draw confidence or prediction intervals around the fitted regression lines. The confidence intervals (CI) focus on the regression lines and can be interpreted as (assuming […]

from R-bloggers http://ift.tt/1Ilf6Db
via IFTTT
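For the LM/GLM case described above, the confidence band around the fitted line is built on the link scale and then mapped back through the inverse link. A minimal sketch in Python with statsmodels and toy data; the post itself works in R and then extends the idea to GLMMs, which this sketch does not cover.

```python
# Confidence band for a GLM's fitted line: standard errors on the link scale,
# then the interval is mapped back through the inverse link.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = rng.binomial(1, 1 / (1 + np.exp(-(4 * x - 2))))   # toy logistic data

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

eta = X @ fit.params                                   # linear predictor
cov = np.asarray(fit.cov_params())
se_eta = np.sqrt(np.einsum("ij,jk,ik->i", X, cov, X))  # SE on the link scale
lower = fit.family.link.inverse(eta - 1.96 * se_eta)   # back to the
upper = fit.family.link.inverse(eta + 1.96 * se_eta)   # probability scale
print(lower[:3], upper[:3])
```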