Sunday, December 27, 2015

Fraud Detection with R and Azure

Detecting fraudulent transactions is a key application of statistical modeling, especially in an age of online transactions. R of course has many functions and packages suited to this purpose, including binary classification techniques such as logistic regression. If you'd like to implement a fraud-detection application, the Cortana Analytics gallery features an Online Fraud Detection Template. This is a step-by-step guide to building a web service which will score transactions by likelihood of fraud, created in five steps: generate tagged data; data preprocessing; feature engineering; train and evaluate model; publish as web service. Each step makes use of the R language,...

from R-bloggers

R sucks

I’m doing an analysis and one of the objects I’m working on is a multidimensional array called “attitude.” I took a quick look: > dim(attitude) [1] 30 7 Huh? It’s not supposed to be 30 x 7. Whassup? I search through my scripts for “attitude” but all I find is the three-dimensional array. Where […]

The post R sucks appeared first on Statistical Modeling, Causal Inference, and Social Science.

from R-bloggers

Basic Concepts in Machine Learning

What are the basic concepts in machine learning? I found that the best way to discover and get a handle on the basic concepts in machine learning is to review the introduction chapters to machine learning textbooks and to watch the videos from the first module in online courses. Pedro Domingos is a lecturer and professor on machine learning […]

The post Basic Concepts in Machine Learning appeared first on Machine Learning Mastery.

from Machine Learning Mastery

How to create a Twitter Sentiment Analysis using R and Shiny

Every time you release a product or service, you want to receive feedback from users so you know what they like and what they don’t. Sentiment Analysis can help you. I will show you how to create a simple application in R and Shiny to perform Twitter Sentiment Analysis in real time. I use RStudio. We will […]

from R-bloggers

Tuesday, December 22, 2015

Useful Things To Know About Machine Learning

Do you want some tips and tricks that are useful in developing successful machine learning applications? This is the subject of a journal article from 2012 titled “A Few Useful Things to Know about Machine Learning” (PDF) by University of Washington professor Pedro Domingos. It’s an interesting read with a great opening hook: developing successful machine […]

The post Useful Things To Know About Machine Learning appeared first on Machine Learning Mastery.

from Machine Learning Mastery

Friday, December 18, 2015

Anomaly Detection in R


Inspired by this Netflix post, I decided to write a post based on this topic using R.

There are several nice packages for this; the one we're going to review is AnomalyDetection.

Download the full (and tiny) R code for this post here.

Normal Vs. Abnormal

An abnormal point, or outlier, is an element that does not follow the behaviour of the majority.

Data has noise, much like a radio with a poor signal, where you end up listening to background noise.

  • The orange section could be noise in the data, since it oscillates around a value without showing a defined pattern; in other words, white noise.
  • Are the red circles noise, or are they peaks from an underlying pattern?

A good algorithm detects abnormal points while accounting for the inherent noise and setting it aside. The AnomalyDetectionTs function in the AnomalyDetection package performs this task quite well.

Hands on anomaly detection!

In this example, the data comes from Wikipedia, which offers an API for downloading, from R, the daily page views for any {term + language}.

In this case, we've got page views for the term fifa, language en, from 2013-02-22 up to today.

After applying the algorithm, we can plot the original time series together with the abnormal points, where the page views were above the expected value.

About the algorithm

The parameters used are max_anoms=0.01 (so that at most 1% of the data points are reported as anomalies in the final result) and direction="pos" (to detect anomalies above, not below, the expected value).
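To make these two parameters concrete, here is a minimal Python sketch of a much simpler detector than the one in AnomalyDetection. It is not the package's algorithm (which combines time series decomposition with Generalized ESD); it only flags points far above the median, measured in median-absolute-deviation units, and then caps the number of flags. The function name `flag_positive_anomalies`, the `threshold` of 3 and the toy series are assumptions made for illustration, and `max_anoms` is read as a fraction of the series length.

```python
import statistics


def flag_positive_anomalies(series, max_anoms=0.01, threshold=3.0):
    """Return indices of points far above the median (illustrative only).

    A simplified stand-in, NOT the procedure used by AnomalyDetection.
    max_anoms is a fraction of the series length (0.01 means at most 1%).
    """
    median = statistics.median(series)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median(abs(v - median) for v in series)
    if mad == 0:
        return []
    # Only deviations above the median count, mirroring direction="pos".
    scores = [(v - median) / mad for v in series]
    candidates = [i for i, s in enumerate(scores) if s > threshold]
    # Keep the most extreme candidates, up to the max_anoms cap.
    limit = int(max_anoms * len(series))
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return sorted(candidates[:limit])


# Hypothetical toy series: 200 quiet days with two spikes and one dip.
views = [100 + (i % 7) for i in range(200)]
views[50] = 900
views[120] = 700
views[160] = 5  # a dip, ignored because only positive anomalies are flagged
print(flag_positive_anomalies(views, max_anoms=0.01))
```

With 200 points, max_anoms=0.01 allows at most two flagged points; halving it to 0.005 would keep only the single most extreme spike.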

As a result, 8 anomalous dates were detected. Additionally, the algorithm returns what the expected value would have been, and an extra calculation expresses the difference as a percentage, perc_diff.

If you want to know more about the maths behind it, google: Generalized ESD and time series decomposition

Something went wrong: something is strange, since the first expected value is the same as the observed value of the series (34028 page views). As a matter of fact, perc_diff is 0 while it should be a really low number. However, the anomaly is well detected, and apparently the next ones are too. If you know why, you can email and share the knowledge :)

Discovering anomalies

The last plot shows a line indicating the linear trend over a specific period (clearly decreasing) and two black circles. It's interesting to note that these black points were not detected by the algorithm because they are part of a decreasing trend (noise, perhaps?).

A really nice shot by this algorithm, since the detections focus on changes in the general pattern. Just take a look at the last detected point in that period: it was a peak that didn't follow the decreasing pattern (it occurred on 2014-07-12).

Checking with the news

These anomalies for the term fifa correlate with the news: the first group of anomalies is related to the FIFA World Cup (around Jun/Jul 2014), and the second group, centered on May 2015, is related to the FIFA scandal.

The LA Times has a timeline of the scandal with two important dates, May 27th and 28th, both of which were found by the algorithm.


Thanks for reading :)

from R-bloggers

Thursday, December 17, 2015

Using Decision Trees to Predict Infant Birth Weights

In this article, I will show you how to use decision trees to predict whether the birth weights of infants will be low or not. We will use the birthwt data from the MASS library. What is a decision tree? A decision tree is an algorithm that builds a flowchart-like graph to illustrate the […]

from R-bloggers

Tuesday, December 15, 2015

R and Python: Theory of Linear Least Squares

In my previous article, we talked about implementations of linear regression models in R, Python and SAS. On the theoretical side, however, I only briefly mentioned the estimation procedure for the parameter $\boldsymbol{\beta}$. So, to help us understand how software carries out the estimation, we'll look at the mathematics behind it. We will also perform the estimation manually in R and in Python; that is, we won't use any special packages, which will help us appreciate the theory.

Linear Least Squares

Consider the linear regression model,
\[ y_i=f_i(\mathbf{x}|\boldsymbol{\beta})+\varepsilon_i,\quad\mathbf{x}_i=\left[ \begin{array}{cccc} 1&x_{i1}&\cdots&x_{ip} \end{array}\right],\quad\boldsymbol{\beta}=\left[\begin{array}{c}\beta_0\\\beta_1\\\vdots\\\beta_p\end{array}\right], \]
where $y_i$ is the response or the dependent variable at the $i$th case, $i=1,\cdots, N$. The $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the deterministic part of the model that depends on both the parameters $\boldsymbol{\beta}\in\mathbb{R}^{p+1}$ and the predictor variable $\mathbf{x}_i$, which in matrix form, say $\mathbf{X}$, is represented as follows
\[ \mathbf{X}=\left[ \begin{array}{cccc} 1&x_{11}&\cdots&x_{1p}\\ 1&x_{21}&\cdots&x_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ 1&x_{N1}&\cdots&x_{Np} \end{array} \right]. \]
$\varepsilon_i$ is the error term at the $i$th case, which we assume to be Gaussian distributed with mean 0 and variance $\sigma^2$. So that
\[ \mathbb{E}y_i=f_i(\mathbf{x}|\boldsymbol{\beta}), \]
i.e. $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the expectation function. The uncertainty around the response variable is also modelled by a Gaussian distribution. Specifically, if $Y=f(\mathbf{x}|\boldsymbol{\beta})+\varepsilon$ and $y\in Y$ such that $y>0$, then
\begin{align*} \mathbb{P}[Y\leq y]&=\mathbb{P}[f(\mathbf{x}|\boldsymbol{\beta})+\varepsilon\leq y]\\ &=\mathbb{P}[\varepsilon\leq y-f(\mathbf{x}|\boldsymbol{\beta})]=\mathbb{P}\left[\frac{\varepsilon}{\sigma}\leq \frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]\\ &=\Phi\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right], \end{align*}
where $\Phi$ denotes the standard Gaussian distribution function, with density denoted by $\phi$ below. Hence $Y\sim\mathcal{N}(f(\mathbf{x}|\boldsymbol{\beta}),\sigma^2)$. That is,
\begin{align*} \frac{\operatorname{d}}{\operatorname{d}y}\Phi\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]&=\phi\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]\frac{1}{\sigma}=\mathbb{P}[y|f(\mathbf{x}|\boldsymbol{\beta}),\sigma^2]\\ &=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}. \end{align*}
If the errors $\varepsilon_i$ are independent and identically distributed, then the likelihood function of $\mathbf{y}$ and its logarithm are
\begin{align*} \mathcal{L}[\boldsymbol{\beta}|\mathbf{y},\mathbf{X},\sigma]&=\mathbb{P}[\mathbf{y}|\mathbf{X},\boldsymbol{\beta},\sigma]=\prod_{i=1}^N\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left[\frac{y_i-f_i(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}\\ &=\frac{1}{(2\pi)^{\frac{N}{2}}\sigma^N}\exp\left\{-\frac{1}{2}\sum_{i=1}^N\left[\frac{y_i-f_i(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}\\ \log\mathcal{L}[\boldsymbol{\beta}|\mathbf{y},\mathbf{X},\sigma]&=-\frac{N}{2}\log 2\pi-N\log\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^N\left[y_i-f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2. \end{align*}
Because the likelihood function tells us about the plausibility of the parameter $\boldsymbol{\beta}$ in explaining the sample data, we want to find the estimate of $\boldsymbol{\beta}$ that most likely generated the sample. Thus our goal is to maximize the likelihood function, which is equivalent to maximizing the log-likelihood with respect to $\boldsymbol{\beta}$. That is done by taking the partial derivative with respect to the parameter $\boldsymbol{\beta}$ and setting it to zero. The first two terms on the right-hand side of the last equation above can therefore be disregarded, since they do not depend on $\boldsymbol{\beta}$. Also, the location of the maximum of the log-likelihood with respect to $\boldsymbol{\beta}$ is not affected by multiplication by a positive scalar, so the factor $\frac{1}{2\sigma^2}$ can be omitted. We are left with the following expression,
\begin{equation}\label{eq:1} -\sum_{i=1}^N\left[y_i-f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2. \end{equation}
One last thing: instead of maximizing the log-likelihood function, we can minimize the negative log-likelihood. Hence we are interested in minimizing the negative of Equation (\ref{eq:1}), which is
\begin{equation}\label{eq:2} \sum_{i=1}^N\left[y_i-f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2, \end{equation}
popularly known as the residual sum of squares (RSS). So the RSS is a consequence of maximum log-likelihood under the Gaussian assumption on the uncertainty around the response variable $y$. For models with two parameters, say $\beta_0$ and $\beta_1$, the RSS can be visualized as in my previous article.

Differentiating with respect to the $(p+1)$-dimensional parameter $\boldsymbol{\beta}$ is manageable in the language of linear algebra. For the linear model, $f_i(\mathbf{x}|\boldsymbol{\beta})=\mathbf{x}_i\boldsymbol{\beta}$, so Equation (\ref{eq:2}) is equivalent to
\begin{align*} \lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert^2&=\langle\mathbf{y}-\mathbf{X}\boldsymbol{\beta},\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rangle=\mathbf{y}^{\text{T}}\mathbf{y}-\mathbf{y}^{\text{T}}\mathbf{X}\boldsymbol{\beta}-(\mathbf{X}\boldsymbol{\beta})^{\text{T}}\mathbf{y}+(\mathbf{X}\boldsymbol{\beta})^{\text{T}}\mathbf{X}\boldsymbol{\beta}\\ &=\mathbf{y}^{\text{T}}\mathbf{y}-\mathbf{y}^{\text{T}}\mathbf{X}\boldsymbol{\beta}-\boldsymbol{\beta}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{y}+\boldsymbol{\beta}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{X}\boldsymbol{\beta}. \end{align*}
Since $\mathbf{y}^{\text{T}}\mathbf{X}\boldsymbol{\beta}$ is a scalar, it equals its transpose $\boldsymbol{\beta}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{y}$, and the derivative with respect to the parameter is
\begin{align*} \frac{\partial}{\partial\boldsymbol{\beta}}\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert^2&=-2\mathbf{X}^{\text{T}}\mathbf{y}+2\mathbf{X}^{\text{T}}\mathbf{X}\boldsymbol{\beta}. \end{align*}
Taking the critical point by setting the above derivative to the zero vector, we have
\begin{align} \frac{\partial}{\partial\boldsymbol{\beta}}\lVert\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}\rVert^2&\overset{\text{set}}{=}\mathbf{0}\nonumber\\ -\mathbf{X}^{\text{T}}\mathbf{y}+\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&=\mathbf{0}\nonumber\\ \mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&=\mathbf{X}^{\text{T}}\mathbf{y}\label{eq:norm} \end{align}
Equation (\ref{eq:norm}) is called the normal equation. If $\mathbf{X}$ has full column rank, then we can compute the inverse of $\mathbf{X}^{\text{T}}\mathbf{X}$,
\begin{align} \mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&=\mathbf{X}^{\text{T}}\mathbf{y}\nonumber\\ (\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&=(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}\nonumber\\ \hat{\boldsymbol{\beta}}&=(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}.\label{eq:betahat} \end{align}
That's it, since both $\mathbf{X}$ and $\mathbf{y}$ are known.
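The equivalence between maximizing the Gaussian log-likelihood and minimizing the RSS can be checked numerically. The following is a small standard-library Python sketch; the data points, the two candidate parameter vectors and the helper names `rss` and `log_likelihood` are assumptions chosen for illustration, with the expectation function taken to be a straight line, $\beta_0+\beta_1x_i$.

```python
import math


def rss(beta, xs, ys):
    """Residual sum of squares for the line f(x) = beta[0] + beta[1] * x."""
    return sum((y - (beta[0] + beta[1] * x)) ** 2 for x, y in zip(xs, ys))


def log_likelihood(beta, sigma, xs, ys):
    """Gaussian log-likelihood: -N/2 log(2 pi) - N log(sigma) - RSS / (2 sigma^2)."""
    n = len(ys)
    return (-0.5 * n * math.log(2 * math.pi)
            - n * math.log(sigma)
            - rss(beta, xs, ys) / (2 * sigma ** 2))


# Hypothetical data lying close to the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

good = (1.0, 2.0)  # close to the pattern in the data
poor = (0.0, 3.0)  # a worse candidate

for beta in (good, poor):
    print(beta, rss(beta, xs, ys), log_likelihood(beta, 1.0, xs, ys))

# For any fixed sigma, the candidate with the smaller RSS has the larger
# log-likelihood, because the other two terms do not depend on beta.
```

Changing the value of sigma passed to `log_likelihood` shifts both values but never reverses their order, which is why $\sigma$ can be ignored when locating $\hat{\boldsymbol{\beta}}$.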


If $\mathbf{X}$ has full column rank and its columns span the subspace $V\subseteq\mathbb{R}^N$, where $\mathbb{E}\mathbf{y}=\mathbf{X}\boldsymbol{\beta}\in V$, then the predicted values of $\mathbf{y}$ are given by
\begin{equation}\label{eq:pred} \hat{\mathbf{y}}=\widehat{\mathbb{E}\mathbf{y}}=\mathbf{P}_{V}\mathbf{y}=\mathbf{X}(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}, \end{equation}
where $\mathbf{P}_{V}$ is the projection matrix onto the space $V$. For a proof of the projection matrix in Equation (\ref{eq:pred}), please refer to reference (1) below. Alternatively, we could use the estimate $\hat{\boldsymbol{\beta}}$ to obtain $\hat{\mathbf{y}}$, that is,
\begin{equation}\label{eq:yhbh} \hat{\mathbf{y}}=\widehat{\mathbb{E}\mathbf{y}}=\mathbf{X}\hat{\boldsymbol{\beta}}. \end{equation}


Let's fire up R and Python and see how we can apply the equations we derived. For the purpose of illustration, we're going to simulate data from a Gaussian distributed population. To do so, consider the following code

Here we have two predictors, x1 and x2, and our response variable y is generated with the parameters $\beta_1=3.5$ and $\beta_2=2.8$, plus Gaussian noise with variance 7. Although we set the same random seed in both R and Python, we should not expect the random values generated in the two languages to be identical; instead, both sets of values are independent and identically distributed (iid). For visualization, I will use Python Plotly; you can also translate it to R Plotly.

Now let's estimate the parameter $\boldsymbol{\beta}$, whose true values we set to $\beta_1=3.5$ and $\beta_2=2.8$. We will use Equation (\ref{eq:betahat}) for estimation. So that we have
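As a stand-in illustration of this step, here is a minimal Python sketch that simulates data of the kind described and solves the normal equation by hand, using only the standard library so that every operation is visible. The seed, the sample size, the predictor distributions and the absence of an intercept column are assumptions, and the helper names `transpose`, `matmul` and `solve` are hypothetical. Solving $\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}=\mathbf{X}^{\text{T}}\mathbf{y}$ directly gives the same $\hat{\boldsymbol{\beta}}$ as the explicit inverse whenever $\mathbf{X}^{\text{T}}\mathbf{X}$ is invertible.

```python
import random

random.seed(12345)  # hypothetical seed

N = 100  # hypothetical sample size
x1 = [random.gauss(3, 2) for _ in range(N)]  # assumed predictor distribution
x2 = [random.gauss(5, 1) for _ in range(N)]  # assumed predictor distribution
noise_sd = 7 ** 0.5                          # variance 7, as in the text
y = [3.5 * a + 2.8 * b + random.gauss(0, noise_sd) for a, b in zip(x1, x2)]

# Design matrix with one row per case; no intercept column is assumed here,
# because the text only mentions beta_1 and beta_2.
X = [[a, b] for a, b in zip(x1, x2)]


def transpose(A):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [u - factor * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


Xt = transpose(X)
XtX = matmul(Xt, X)                                       # X^T X
Xty = [sum(a * b for a, b in zip(col, y)) for col in Xt]  # X^T y
beta_hat = solve(XtX, Xty)                                # normal equation
print(beta_hat)
```

The printed values should land near 3.5 and 2.8, though not exactly on them, since they are estimated from one noisy sample.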

That's a good estimate, and again, just a reminder: the estimates in R and in Python are different because we have different random samples; the important thing is that both are iid. To proceed, we'll do prediction using Equations (\ref{eq:pred}) and (\ref{eq:yhbh}). That is,
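The two prediction routes can be compared on a tiny example. The sketch below is a standard-library Python illustration on a hypothetical four-point data set with an intercept column and one predictor (not the simulated data from the text); the helper names, including `inverse_2x2`, are assumptions. It prints three columns: the data, the prediction from $\mathbf{X}\hat{\boldsymbol{\beta}}$, and the prediction from the projection matrix.

```python
# Hypothetical toy data: an intercept column plus one predictor.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 4.0, 8.0]


def transpose(A):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def inverse_2x2(A):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


Xt = transpose(X)
XtX_inv = inverse_2x2(matmul(Xt, X))
y_col = [[v] for v in y]                       # y as an N x 1 column

beta_hat = matmul(matmul(XtX_inv, Xt), y_col)  # (X^T X)^{-1} X^T y
y_hat_beta = matmul(X, beta_hat)               # X beta_hat

P = matmul(matmul(X, XtX_inv), Xt)             # projection matrix onto V
y_hat_proj = matmul(P, y_col)                  # P y

for obs, a, b in zip(y, y_hat_beta, y_hat_proj):
    print(obs, a[0], b[0])
```

Up to floating-point rounding, the second and third columns agree, because $\mathbf{P}_{V}\mathbf{y}$ and $\mathbf{X}\hat{\boldsymbol{\beta}}$ are the same product grouped differently.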

The first column above is the data y, the second column is the prediction due to Equation (\ref{eq:yhbh}), and the third column is due to Equation (\ref{eq:pred}). Thus if we are to expand the prediction into an expectation plane, then we have

By the way, you have to rotate the plot to see the plane; I still can't figure out how to change the default view in Plotly. Anyway, at this point we can proceed to compute other statistics, like the variance of the error, and so on, but I will leave that for you to explore. Our aim here is just to understand what is happening inside our software when we estimate the parameters of a linear regression model.


  1. Arnold, Steven F. (1981). The Theory of Linear Models and Multivariate Analysis. Wiley.
  2. OLS in Matrix Form

from R-bloggers

Making Sense of Logarithmic Loss

Logarithmic Loss, or simply Log Loss, is a classification loss function often used as an evaluation metric in kaggle competitions. Since success in these competitions hinges on effectively minimising the Log Loss, it makes sense to have some understanding of how this metric is calculated and how it should be interpreted. Log Loss quantifies the […]
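For the binary case, log loss is the negative mean of the log-probability that a classifier assigns to the true labels. The following Python sketch is only an illustration of that formula: the function name `log_loss`, the clipping constant `eps` and the example labels and probabilities are assumptions, not taken from the post.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; y_true holds 0/1 labels, y_prob holds P(label = 1)."""
    total = 0.0
    for label, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += label * math.log(p) + (1 - label) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical labels with two sets of predicted probabilities.
labels = [1, 0, 1, 1]
confident_right = [0.9, 0.1, 0.8, 0.9]
confident_wrong = [0.1, 0.9, 0.2, 0.1]
print(log_loss(labels, confident_right))
print(log_loss(labels, confident_wrong))
```

Confident predictions that turn out wrong are penalized far more heavily than confident correct ones are rewarded, so the second value is much larger than the first.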

The post Making Sense of Logarithmic Loss appeared first on Exegetic Analytics.

from R-bloggers

Monday, December 14, 2015

Practical Data Science with R examples

One of the big points of Practical Data Science with R is to supply a large number of fully worked examples. Our intent has always been for readers to read the book, and if they wanted to follow up on a data set or technique to find the matching worked examples in the project directory … Continue reading Practical Data Science with R examples

from R-bloggers

Friday, December 11, 2015

Fitting Generalized Regression Neural Network with Python

from R-bloggers

Download Federal Reserve Economic Data (FRED) with Python

In operational loss calculations, it is important to use the CPI (Consumer Price Index) to adjust historical losses. Below is an example showing how to download CPI data directly from the Federal Reserve Bank of St. Louis and then calculate monthly and quarterly CPI adjustment factors with Python.

from R-bloggers

Tuesday, December 8, 2015

Microsoft’s new Data Science Virtual Machine

Earlier this week, Andrie showed you how to set up and provision your own virtual machine (VM) to run R and RStudio in Azure. Another option is to use the new Microsoft Data Science Virtual Machine, a pre-configured instance that includes a suite of tools useful to data scientists, including: Revolution R Open (performance-enhanced R); Anaconda Python; Visual Studio Community Edition; Power BI Desktop (with R capabilities); SQL Server Express (with R integration); Azure SDK (including the ability to run R experiments). There's no software charge associated with using this VM; you'll pay only the standard Azure infrastructure fees (starting...

from R-bloggers

Friday, December 4, 2015

Feature Selection with caret’s Genetic Algorithm Option

by Joseph Rickert

If there is anything that experienced machine learning practitioners are likely to agree on, it would be the importance of careful and thoughtful feature engineering. The judicious selection of which predictor variables to include in a model often has a more beneficial effect on overall classifier performance than the choice of the classification algorithm itself. This is one reason why classification algorithms that automatically include feature selection such as glmnet, gbm or random forests top the list of “go to” algorithms for many practitioners. There are occasions, however, when you find yourself for one reason or another...

from R-bloggers