Sunday, December 27, 2015
Fraud Detection with R and Azure
from R-bloggers http://ift.tt/1OLb3TQ
via IFTTT
R sucks
I’m doing an analysis and one of the objects I’m working on is a multidimensional array called “attitude.” I took a quick look: > dim(attitude) [1] 30 7 Huh? It’s not supposed to be 30 x 7. Whassup? I search through my scripts for a “attitude” but all I find is the three-dimensional array. Where […]
The post R sucks appeared first on Statistical Modeling, Causal Inference, and Social Science.
from R-bloggers http://ift.tt/1RHWN4w
via IFTTT
Basic Concepts in Machine Learning
What are the basic concepts in machine learning? I found that the best way to discover and get a handle on the basic concepts in machine learning is to review the introduction chapters to machine learning textbooks and to watch the videos from the first module in online courses. Pedro Domingos is a lecturer and professor on machine learning […]
The post Basic Concepts in Machine Learning appeared first on Machine Learning Mastery.
from Machine Learning Mastery http://ift.tt/1QYY6dQ
via IFTTT
How to create a Twitter Sentiment Analysis using R and Shiny
from R-bloggers http://ift.tt/1OsEWhH
via IFTTT
Tuesday, December 22, 2015
Useful Things To Know About Machine Learning
Do you want some tips and tricks that are useful in developing successful machine learning applications? This is the subject of a journal article from 2012 titled “A Few Useful Things to Know about Machine Learning” (PDF) by University of Washington professor Pedro Domingos. It’s an interesting read with a great opening hook: developing successful machine […]
The post Useful Things To Know About Machine Learning appeared first on Machine Learning Mastery.
from Machine Learning Mastery http://ift.tt/1Jsb5OQ
via IFTTT
Friday, December 18, 2015
Anomaly Detection in R
Introduction
Inspired by this Netflix post, I decided to write a post on the same topic using R.
There are several nice packages to achieve this goal; the one we're going to review is AnomalyDetection.
Download the full (and tiny) R code for this post here.
Normal Vs. Abnormal
An abnormal point, or outlier, is an element that does not follow the behaviour of the majority.
Data has noise, much like a radio with a weak signal where you end up listening to background noise.
- The orange section could be noise in the data, since it oscillates around a value without showing a defined pattern; in other words, white noise.
- Are the red circles noise, or are they peaks of an underlying pattern?
A good algorithm can detect abnormal points while accounting for the inherent noise and leaving it behind. The AnomalyDetectionTs function in the AnomalyDetection package can perform this task quite well.
Hands on anomaly detection!
In this example, the data come from the well-known Wikipedia, which offers an API for downloading, from R, the daily page views for any {term + language} combination. In this case, we've got the page views for the term fifa, language en, from 2013-02-22 up to today.
After applying the algorithm, we can plot the original time series plus the abnormal points at which the page views were above the expected value (a sketch of the call follows below).
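A minimal sketch of that call (the post's full script is at the download link above; here page_views is assumed to be a two-column data frame of POSIXct timestamps and daily counts for the term):

# The AnomalyDetection package lives on GitHub, not CRAN:
# devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

res <- AnomalyDetectionTs(page_views,
                          max_anoms = 0.01,   # flag at most 1% of the points
                          direction = "pos",  # only anomalies above the expected value
                          e_value   = TRUE,   # also return the expected values
                          plot      = TRUE)

res$anoms   # the anomalous dates with observed and expected page views
res$plot    # the original series with the anomalies highlighted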
About the algorithm
The parameters used were max_anoms=0.01 (to flag at most 1% of the points as outliers in the final result) and direction="pos" (to detect anomalies above, not below, the expected value).
As a result, 8 anomalous dates were detected. Additionally, the algorithm returns what the expected value would have been, and an extra calculation is performed to express the difference from it as a percentage, perc_diff (a sketch of this calculation follows below).
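A possible way to compute that percentage (my own guess at the calculation, assuming the call above used e_value = TRUE so that res$anoms holds both the observed count, anoms, and the expected_value column):

# Percentage difference between expected and observed page views.
res$anoms$perc_diff <- round(
  100 * abs(res$anoms$expected_value - res$anoms$anoms) / res$anoms$expected_value, 2)
res$anoms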
If you want to know more about the maths behind it, google Generalized ESD and time series decomposition.
Something went wrong: strangely, the first expected value is the same as the value the series actually has (34028 page views). As a matter of fact, perc_diff is 0 when it should be a really small number. However, the anomaly is well detected, and apparently the next ones are too. If you know why, you can email me and share the knowledge :)
Discovering anomalies
The last plot shows a line indicating the linear trend over a specific period (clearly decreasing) and two black circles. It's interesting to note that these black points were not detected by the algorithm because they are part of a decreasing trend (noise, perhaps?).
This is a really nice call by the algorithm, since detection focuses on changes in the general pattern. Just take a look at the last detected point in that period: it was a peak that didn't follow the decreasing pattern (it occurred on 2014-07-12).
Checking with the news
These anomalies for the term fifa correlate with the news: the first group of anomalies is related to the FIFA World Cup (around Jun/Jul 2014), and the second group, centered on May 2015, is related to the FIFA scandal.
The LA Times published a timeline of the scandal with two important dates, May 27th and 28th, both of which were found by the algorithm.
Our Twitter and LinkedIn Group
More posts here.
Thanks for reading :)
from R-bloggers http://ift.tt/1NsHPvm
via IFTTT
Thursday, December 17, 2015
Using Decision Trees to Predict Infant Birth Weights
from R-bloggers http://ift.tt/1O9HQrl
via IFTTT
Tuesday, December 15, 2015
R and Python: Theory of Linear Least Squares
Linear Least Squares
Consider the linear regression model,
\[
y_i = f_i(\mathbf{x}|\boldsymbol{\beta}) + \varepsilon_i, \quad \mathbf{x}_i = \left[\begin{array}{cccc} 1 & x_{i1} & \cdots & x_{ip} \end{array}\right], \quad \boldsymbol{\beta} = \left[\begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{array}\right],
\]
where $y_i$ is the response, or dependent variable, at the $i$th case, $i = 1, \cdots, N$. $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the deterministic part of the model, which depends on both the parameters $\boldsymbol{\beta} \in \mathbb{R}^{p+1}$ and the predictor variable $\mathbf{x}_i$; in matrix form, say $\mathbf{X}$, the predictors are represented as
\[
\mathbf{X} = \left[\begin{array}{cccc}
1 & x_{11} & \cdots & x_{1p} \\
1 & x_{21} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & \cdots & x_{Np}
\end{array}\right].
\]
$\varepsilon_i$ is the error term at the $i$th case, which we assume to be Gaussian distributed with mean 0 and variance $\sigma^2$, so that
\[
\mathbb{E}\,y_i = f_i(\mathbf{x}|\boldsymbol{\beta}),
\]
i.e. $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the expectation function. The uncertainty around the response variable is also modelled by a Gaussian distribution. Specifically, if $Y = f(\mathbf{x}|\boldsymbol{\beta}) + \varepsilon$ and $y \in Y$ such that $y > 0$, then
\begin{align*}
\mathbb{P}[Y \leq y] &= \mathbb{P}[f(\mathbf{x}|\boldsymbol{\beta}) + \varepsilon \leq y] \\
&= \mathbb{P}[\varepsilon \leq y - f(\mathbf{x}|\boldsymbol{\beta})] = \mathbb{P}\left[\frac{\varepsilon}{\sigma} \leq \frac{y - f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right] \\
&= \Phi\left[\frac{y - f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right],
\end{align*}
where $\Phi$ denotes the Gaussian distribution function, with density denoted by $\phi$ below. Hence $Y \sim \mathcal{N}(f(\mathbf{x}|\boldsymbol{\beta}), \sigma^2)$. That is,
\begin{align*}
\frac{\operatorname{d}}{\operatorname{d}y}\Phi\left[\frac{y - f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right] &= \phi\left[\frac{y - f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]\frac{1}{\sigma} = \mathbb{P}[y|f(\mathbf{x}|\boldsymbol{\beta}), \sigma^2] \\
&= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2}\left[\frac{y - f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}.
\end{align*}
If the data are independent and identically distributed, then the likelihood and log-likelihood functions of $\boldsymbol{\beta}$ are
\begin{align*}
\mathcal{L}[\boldsymbol{\beta}|\mathbf{y}, \mathbf{X}, \sigma] &= \mathbb{P}[\mathbf{y}|\mathbf{X}, \boldsymbol{\beta}, \sigma] = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2}\left[\frac{y_i - f_i(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\} \\
&= \frac{1}{(2\pi)^{N/2}\sigma^N}\exp\left\{-\frac{1}{2}\sum_{i=1}^{N}\left[\frac{y_i - f_i(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}, \\
\log\mathcal{L}[\boldsymbol{\beta}|\mathbf{y}, \mathbf{X}, \sigma] &= -\frac{N}{2}\log 2\pi - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left[y_i - f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2.
\end{align*}
The likelihood function tells us about the plausibility of the parameter $\boldsymbol{\beta}$ in explaining the sample data, so we want to find the estimate of $\boldsymbol{\beta}$ that most likely generated the sample. Our goal is therefore to maximize the likelihood function, which is equivalent to maximizing the log-likelihood with respect to $\boldsymbol{\beta}$, and that is done by taking the partial derivative with respect to $\boldsymbol{\beta}$. The first two terms on the right-hand side of the equation above can be disregarded since they do not depend on $\boldsymbol{\beta}$. Also, the location of the maximum of the log-likelihood with respect to $\boldsymbol{\beta}$ is not affected by multiplication by an arbitrary positive scalar, so the factor $\frac{1}{2\sigma^2}$ can be omitted.
And we are left with the following equation:
\begin{equation}
-\sum_{i=1}^{N}\left[y_i - f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2.
\label{eq:1}
\end{equation}
One last thing: instead of maximizing the log-likelihood, we can minimize its negative. Hence we are interested in minimizing the negative of Equation (\ref{eq:1}), which is
\begin{equation}
\sum_{i=1}^{N}\left[y_i - f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2,
\label{eq:2}
\end{equation}
popularly known as the residual sum of squares (RSS). So the RSS is a consequence of maximizing the log-likelihood under the Gaussian assumption on the uncertainty around the response variable $y$. For models with two parameters, say $\beta_0$ and $\beta_1$, the RSS can be visualized like the surface in my previous article.
Performing differentiation with respect to the $(p+1)$-dimensional parameter $\boldsymbol{\beta}$ is manageable in the context of linear algebra; Equation (\ref{eq:2}) is equivalent to
\begin{align*}
\lVert\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\rVert^2 &= \langle\mathbf{y} - \mathbf{X}\boldsymbol{\beta}, \mathbf{y} - \mathbf{X}\boldsymbol{\beta}\rangle = \mathbf{y}^{\text{T}}\mathbf{y} - \mathbf{y}^{\text{T}}\mathbf{X}\boldsymbol{\beta} - (\mathbf{X}\boldsymbol{\beta})^{\text{T}}\mathbf{y} + (\mathbf{X}\boldsymbol{\beta})^{\text{T}}\mathbf{X}\boldsymbol{\beta} \\
&= \mathbf{y}^{\text{T}}\mathbf{y} - \mathbf{y}^{\text{T}}\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{y} + \boldsymbol{\beta}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{X}\boldsymbol{\beta},
\end{align*}
and the derivative with respect to the parameter is
\begin{align*}
\frac{\partial}{\partial\boldsymbol{\beta}}\lVert\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\rVert^2 &= -2\mathbf{X}^{\text{T}}\mathbf{y} + 2\mathbf{X}^{\text{T}}\mathbf{X}\boldsymbol{\beta}.
\end{align*}
Taking the critical point by setting the above equation to the zero vector, we have
\begin{align}
\frac{\partial}{\partial\boldsymbol{\beta}}\lVert\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\rVert^2 &\overset{\text{set}}{=} \mathbf{0} \nonumber \\
-\mathbf{X}^{\text{T}}\mathbf{y} + \mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}} &= \mathbf{0} \nonumber \\
\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}} &= \mathbf{X}^{\text{T}}\mathbf{y}. \label{eq:norm}
\end{align}
Equation (\ref{eq:norm}) is called the normal equation. If $\mathbf{X}$ is of full rank, then we can compute the inverse of $\mathbf{X}^{\text{T}}\mathbf{X}$:
\begin{align}
\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}} &= \mathbf{X}^{\text{T}}\mathbf{y} \nonumber \\
(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}} &= (\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y} \nonumber \\
\hat{\boldsymbol{\beta}} &= (\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}. \label{eq:betahat}
\end{align}
That's it, since both $\mathbf{X}$ and $\mathbf{y}$ are known.
Prediction
If $\mathbf{X}$ is of full rank and spans the subspace $V \subseteq \mathbb{R}^N$, with $\mathbb{E}\,\mathbf{y} = \mathbf{X}\boldsymbol{\beta} \in V$, then the predicted values of $\mathbf{y}$ are given by
\begin{equation}
\hat{\mathbf{y}} = \mathbb{E}\,\mathbf{y} = \mathbf{P}_{V}\mathbf{y} = \mathbf{X}(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y},
\label{eq:pred}
\end{equation}
where $\mathbf{P}_V$ is the projection matrix onto the space $V$; for a proof of the projection matrix in Equation (\ref{eq:pred}), please refer to reference (1) below. Alternatively, we can use the estimate $\hat{\boldsymbol{\beta}}$ to obtain $\hat{\mathbf{y}}$, that is,
\begin{equation}
\hat{\mathbf{y}} = \mathbb{E}\,\mathbf{y} = \mathbf{X}\hat{\boldsymbol{\beta}}.
\label{eq:yhbh}
\end{equation}
Computation
Let's fire up R and Python and see how we can apply the equations we derived. For purposes of illustration, we simulate data from a Gaussian distributed population: two predictors, x1 and x2, and a response variable y generated with the parameters $\beta_1 = 3.5$ and $\beta_2 = 2.8$ plus Gaussian noise with variance 7 (a sketch of this step follows below). Although we set the same random seed in both R and Python, we should not expect the random values generated by the two languages to be identical; rather, the two samples are independent and identically distributed (iid). For visualization I used Python Plotly, which you can also translate to R Plotly. Now let's estimate the parameter $\boldsymbol{\beta}$, whose true values we set to $\beta_1 = 3.5$ and $\beta_2 = 2.8$, using Equation (\ref{eq:betahat}). So we have:
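Since the post's R and Python scripts are not reproduced in this excerpt, here is a minimal R sketch of the simulation and of the normal-equation estimate in Equation (\ref{eq:betahat}); this is my own reconstruction rather than the author's code, and the sample size, seed, and absence of an intercept column are assumptions based on the description above.

set.seed(123)                   # arbitrary seed
n  <- 100                       # assumed sample size
x1 <- rnorm(n)                  # first predictor
x2 <- rnorm(n)                  # second predictor
y  <- 3.5 * x1 + 2.8 * x2 + rnorm(n, sd = sqrt(7))  # Gaussian noise with variance 7

# Normal-equation estimate: beta_hat = (X'X)^{-1} X'y
X        <- cbind(x1, x2)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat                        # should land close to (3.5, 2.8)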
That's a good estimate, and again, just a reminder: the estimates in R and in Python differ because we have different random samples; the important thing is that both samples are iid. To proceed, we'll do prediction using Equations (\ref{eq:pred}) and (\ref{eq:yhbh}). That is:
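Continuing the sketch above (again a reconstruction rather than the post's script), the predictions from Equation (\ref{eq:yhbh}) and Equation (\ref{eq:pred}) can be computed and compared side by side:

# Prediction via the estimated coefficients: y_hat = X beta_hat
y_hat_beta <- X %*% beta_hat

# Prediction via the projection matrix: y_hat = X (X'X)^{-1} X' y
P          <- X %*% solve(t(X) %*% X) %*% t(X)
y_hat_proj <- P %*% y

# Data in the first column, the two (matching) predictions in the next two.
head(cbind(y, y_hat_beta, y_hat_proj))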
The first column above is the data y, the second column is the prediction due to Equation (\ref{eq:yhbh}), and the third column is due to Equation (\ref{eq:pred}). Thus if we expand the prediction into an expectation plane, we get a 3-D surface (an interactive Plotly figure in the original post). You have to rotate the plot, by the way, to see the plane; I still can't figure out how to change that in Plotly. Anyway, at this point we can proceed to compute other statistics, such as the variance of the error, and so on, but I will leave that for you to explore. Our aim here is simply to understand what is happening inside our software when we estimate the parameters of a linear regression model.
References
- Arnold, Steven F. (1981). The Theory of Linear Models and Multivariate Analysis. Wiley.
- OLS in Matrix Form
from R-bloggers http://ift.tt/1UtQfVV
via IFTTT
Making Sense of Logarithmic Loss
Logarithmic Loss, or simply Log Loss, is a classification loss function often used as an evaluation metric in Kaggle competitions. Since success in these competitions hinges on effectively minimising the Log Loss, it makes sense to have some understanding of how this metric is calculated and how it should be interpreted. Log Loss quantifies the […]
The post Making Sense of Logarithmic Loss appeared first on Exegetic Analytics.
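For reference, a minimal R sketch of the usual binary log loss formula (my own illustration; the linked post may present it with different notation or clipping):

# Log Loss: the mean negative log-probability assigned to the true class.
log_loss <- function(actual, predicted, eps = 1e-15) {
  p <- pmin(pmax(predicted, eps), 1 - eps)  # clip away from 0 and 1 to avoid log(0)
  -mean(actual * log(p) + (1 - actual) * log(1 - p))
}

log_loss(c(1, 0, 1), c(0.9, 0.1, 0.8))  # confident and correct: small loss
log_loss(c(1, 0, 1), c(0.1, 0.9, 0.2))  # confident and wrong: large loss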
from R-bloggers http://ift.tt/1Z9f4ZN
via IFTTT
Monday, December 14, 2015
Practical Data Science with R examples
from R-bloggers http://ift.tt/1IKJHk3
via IFTTT
Friday, December 11, 2015
Download Federal Reserve Economic Data (FRED) with Python
from R-bloggers http://ift.tt/1SRDxik
via IFTTT
Tuesday, December 8, 2015
Microsoft’s new Data Science Virtual Machine
from R-bloggers http://ift.tt/1QZR8GD
via IFTTT
Friday, December 4, 2015
Feature Selection with caret’s Genetic Algorithm Option
from R-bloggers http://ift.tt/1MZh1CE
via IFTTT