Tuesday, October 27, 2015

Text Mining with R

Last week, we had a great course on Text Mining with R at the European Data Innovation Hub. For those interested in text mining with R, another 1-day crash course is scheduled at the Leuven Statistics Research Center (Belgium) on November 17 (http:...

from R-bloggers http://ift.tt/1jDqeXz
via IFTTT

Tuesday, October 20, 2015

Statistical Tests: Asymptotic, Exact, ou based on Simulations?

This morning, in our mathematical statistics course, we’ve been discussing the ‘proportion test‘, i.e. given a sample of Bernoulli trials X1, …, Xn with Xi ~ B(p), we want to test H0: p = p* against H1: p ≠ p*. A natural test (which can be related to the maximum likelihood ratio test) is based on the statistic Tn = √n (X̄ − p*) / √(p*(1 − p*)), and the test function rejects H0 when Tn falls outside the acceptance region. To get the bounds of the acceptance region, we need the distribution of Tn under H0. Consider here a numerical application:

n=20
p=.5
set.seed(1)
echantillon=sample(0:1, size=n, prob=c(1-p,p), replace=TRUE)

The first (and standard) idea is to use the asymptotic distribution … Continue reading Statistical Tests: Asymptotic, Exact, ou based on Simulations?

from R-bloggers http://ift.tt/1KmRdM7
via IFTTT

Monday, October 19, 2015

Predicting sales: Pandas vs SQL



from FastML http://ift.tt/1W0xit6
via IFTTT

Basic Forecasting

Forecasting refers to the process of using statistical procedures to predict future values of a time series based on historical trends. For businesses, being able to gauge expected outcomes for a given time period is essential for managing marketing, planning, and finances. For example, an advertising agency may want to utilize sales forecasts to identify which […]

from R-bloggers http://ift.tt/1juR7wO
via IFTTT

Trying to optimize

I wanted to try some more machine learning. On Kaggle there is a competition How Much Did It Rain? II. This is a considerably bigger data set than Titanic. To quote from Kaggle: Rainfall is highly variable across space and time, making it notoriously tricky t...

from R-bloggers http://ift.tt/1RQbjo2
via IFTTT

Thursday, October 15, 2015

Review: Beautiful Data

I've just finished reading Beautiful Data (published by O'Reilly in 2009), a collection of essays edited by Toby Segaran and Jeff Hammerbacher. The 20 essays from 39 contributors address a diverse array of topics relating to data and how it's collected, analysed and interpreted. Since this is a collection of essays, the writing style and […]

The post Review: Beautiful Data appeared first on Exegetic Analytics.



from R-bloggers http://ift.tt/1LPtHrU
via IFTTT

Monday, October 12, 2015

Practicing Meta-Analytic Thinking Through Simulations

People find it difficult to think about random variation. Our mind is more strongly geared towards recognizing patterns than randomness. In this blog post, you can learn what random variation looks like, how to reduce it by running well-powered studies, and how to meta-analyze multiple small studies. This is a long read, and most educational if you follow the assignments. You'll probably need about an hour.

We'll use R, and the R script at the bottom of this post (or download it from GitHub). Run the first section (sections are differentiated by # # # #) to install the required packages and change some settings.


IQ tests have been designed such that the mean IQ of the entire population of adults is 100, with a standard deviation of 15. This will not be true for every sample we draw from the population. Let’s get a feel for what the IQ scores from a sample look like. Which IQ scores will people in our sample have?

Assignment 1
We will start by simulating a random sample of 10 individuals. Run the script in the section #Assignment 1. Both the mean and the standard deviation differ from the true population values. Simulate some more samples of 10 individuals and look at the means and SDs. They differ quite a lot. This type of variation is perfectly normal in small samples of 10 participants. See below for one example of a simulated sample.
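If you don't have the linked script at hand, a minimal base R sketch of this kind of simulation (not necessarily the exact code from the script) looks like this:

# Simulate one sample of 10 IQ scores from a population with mean 100 and SD 15
n <- 10
x <- rnorm(n, mean = 100, sd = 15)
mean(x)   # sample mean: will bounce around 100
sd(x)     # sample SD: will bounce around 15
hist(x, main = paste("M =", round(mean(x), 1), "SD =", round(sd(x), 1)))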



Let’s simulate a larger sample of 100 participants by changing n=10 in line 23 of the R script to n=100 (remember that R code is case-sensitive).


We are slowly seeing what is known as the normal distribution. This is the well-known bell-shaped curve that represents the distribution of many variables in scientific research (although some other types of distributions are quite common as well). The mean and standard deviation are much closer to the true mean and standard deviation, and this is true for most of the simulated samples. Simulate at least 10 samples with n = 10, and 10 samples with n = 100. Look at the means and standard deviations. Let’s simulate one really large sample of 1000 people (run the code, changing n=10 to n=1000). The picture shows one example.


Not every simulated study of 1000 people will yield the true mean and standard deviation, but this one did. And although the distribution is very close to a normal distribution, even with 1000 people it is not perfect.

The accuracy with which you can measure the IQ in a population is easy to calculate when you know the standard deviation and the long-run probability of making an error that you are willing to accept. If you choose a 95% confidence interval, and want to estimate IQ within an error range of 2 IQ points, you first convert the 95% confidence interval to a Z-score (1.96), and use the formula:
N = (Z × SD / error)²
In this example, (1.96 × 15 / 2)² ≈ 216 people (rounded down). Feel free to check by running the code with n = 216 (remember that this is a long-term average!).
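As a quick check of that arithmetic (this snippet is mine, not part of the linked script):

# Sample size needed to estimate IQ within +/- 2 points with 95% confidence
z <- 1.96      # z-score for a 95% confidence interval
sd_iq <- 15
error <- 2
(z * sd_iq / error)^2   # 216.09, i.e. roughly 216 people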
In addition to planning for accuracy, you can plan for power. The power of a study is the probability of observing a statistically significant effect, given that there is a true effect to be found. It depends on the effect size, the sample size, and the alpha level.

We can simulate experiments, and count how many statistically significant results are observed, to see how much power we have. For example, when we simulate 100,000 studies, and 50% of the studies reveal a p-value smaller than 0.05, this means the power of our study (given a specific effect size, sample size, and alpha level) is 50%.
We can use the code in the section of Assignment 2. Running this code will take a while. It will simulate 100,000 experiments, where 10 participants are drawn from a normal distribution with a mean of 110 and an SD of 15. To continue our experiment, let’s assume the numbers represent measured IQ, which is 110 in our samples. For each simulated sample, we test whether the effect differs from an IQ of 100. In other words, we are testing whether our sample is smarter than average.
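If you want to see the general shape of such a simulation without the linked script, here is a minimal sketch, assuming a two-sided one-sample t-test against mu = 100 (the actual script may differ in details, e.g. in how it plots the p-values):

# Simulate 100,000 one-sample t-tests of n = 10 'IQ scores' drawn from N(110, 15)
nSims <- 100000
p_values <- numeric(nSims)
for (i in 1:nSims) {
  x <- rnorm(n = 10, mean = 110, sd = 15)
  p_values[i] <- t.test(x, mu = 100)$p.value
}
mean(p_values < 0.05)        # proportion significant = simulated power (about 0.47)
hist(p_values, breaks = 20)  # the first bar holds the p-values below 0.05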

The program returns all p-values, and it will return the power, which will be somewhere around 47%. It will also yield a plot of the p-values. The first bar is the count of all p-values smaller than 0.05, so all statistically significant p-values. The percentage of p-values in this single bar visualizes the power of the study.

Instead of simulating the power of the study, you can also perform power calculations in R (see the code at the end of Assignment 2). To calculate the power of a study, we need the sample size (in our case, n = 10), the alpha level (in our case, 0.05), and the effect size, which for a one-sample t-test is Cohen’s d, calculated as d = (X̄ − μ)/σ = (110 − 100)/15 = 0.6667.
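The corresponding pwr call looks roughly like this (a sketch; the exact call is in the script at the bottom of the post):

library(pwr)
pwr.t.test(n = 10, d = (110 - 100) / 15, sig.level = 0.05,
           type = "one.sample", alternative = "two.sided")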

Assignment 2
Using the simulation and the pwr package, examine what happens with the power of the experiments when the sample size is increased to 20. How does the p-value distribution change?
Using the simulation and the pwr package, examine what happens with the power of the experiments when the mean in the sample changes from 110 to 105 (set the sample size to 10 again). How does the p-value distribution change?
Using the simulation and the pwr package, examine what happens with the power of the experiments when the mean in the sample is set to 100 (set the sample size to 10 again). Now, there is no difference between the sample and the average IQ. How does the p-value distribution change? Can we formally speak of ‘power’ in this case? What is a better name in this specific situation?
Variance in two groups, and their difference.
Now, assume we have a new IQ training program that increases people’s IQ scores by 6 points. People in condition 1 are in the control condition – they do not get IQ training. People in condition 2 get IQ training. Let’s simulate 10 people in each group, assuming the IQ in the control condition is 100 and in the experimental condition is 106 (the SD is still 15 in each group), by running the code for Assignment 3.
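A minimal sketch of this two-group simulation, leaving out the plotting code from the linked script:

# Control group: no training (mean 100); experimental group: training (mean 106)
n <- 10
control <- rnorm(n, mean = 100, sd = 15)
experimental <- rnorm(n, mean = 106, sd = 15)
round(c(M_c = mean(control), SD_c = sd(control),
        M_e = mean(experimental), SD_e = sd(experimental)), 2)
t.test(experimental, control, var.equal = TRUE)  # compare the two group means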

The graph you get will look like a version of the one below. The means and SD for each sample drawn are provided in the graph (control condition on the left, experimental condition on the right).


The two groups differ in how close they are to their true means, and as a consequence, the difference between groups varies as well. Note that this difference is the main variable in statistical analyses when comparing two groups. Run at least 10 more simulations to look at the data pattern.

Assignment 3
Compared to the one-sample case above, we now have 2 variable group means, and two variable standard deviations. If we perform a power analysis, how do you think this additional variability will influence the power of our test? In other words, for the exact same effect size (e.g., 0.6667), will the power of our study remain the same, will it increase, or will it decrease?
Test whether your intuition was correct or not by running this power analysis for an independent samples t-test:
pwr.t.test(d=0.6667,n=10,sig.level=0.05,type="two.sample",alternative="two.sided")
In dependent samples, the mean in one sample correlates with the mean in the other sample. This reduces the amount of variability in the difference scores. If we perform a power analysis, how do you think this will influence the power of our test?
Effect size calculations for dependent samples are influenced by the correlation between the means. If this correlation is 0.5, the effect size calculation for the dependent case and the independent case is identical, but the power for a dependent t-test will be identical to the power of a one-sample t-test.
Verify this by running the power analysis for a dependent samples t-test, with a true effect size of 0.6667, and compare the power with the same power analysis for a one-sample t-test we performed above:
pwr.t.test(d=0.6667,n=10,sig.level=0.05,type="paired",alternative="two.sided")


Variation across studies
Up until now, we have talked about the variation of data points within a single study. It is clear that the larger the sample size, the more closely the observed difference (in the case of two means) or the observed correlation (in the case of two related variables) mirrors the true difference or correlation. We can calculate the variation in the effects we are interested in directly. Both correlations and mean differences are effect sizes. Because mean differences are difficult to compare across studies that use different types of measures to examine an effect, or different scales to measure differences on, whenever multiple effect sizes are compared, researchers often use standardized effect sizes. In this example, we will focus on Cohen’s d, which provides the standardized mean difference.
As explained in Borenstein, Hedges, Higgins, & Rothstein (2009), a very good approximation of the variance of d for two independent groups of sizes n1 and n2 is:

Vd = (n1 + n2)/(n1 × n2) + d²/(2 × (n1 + n2))

This formula shows that the variance of d depends only on the sample size and the value of d itself.
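Wrapped in a small helper function (my own sketch of the approximation above, not code from the linked script):

# Approximate variance of Cohen's d for two independent groups (Borenstein et al., 2009)
var_d <- function(d, n1, n2) {
  (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))
}
var_d(d = 0.6667, n1 = 10, n2 = 10)  # variance; its square root is the standard error of d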
Single study meta-analysis
Perhaps you remember that whenever the 95% confidence interval around an effect size estimate excludes zero, the effect is statistically significant. When you want to test whether effect sizes across a number of studies differ from 0, you have to perform what is known as a meta-analysis. In essence, you perform an analysis over analyses. You first analyze individual studies, and then analyze the set of effect sizes you calculated from each individual study. To perform a meta-analysis, all you need are the effect sizes and the sample size of each individual study.
Let’s first begin with something you will hardly ever do in real life: a meta-analysis of a single study. This is a little silly, because a simple t-test or correlation will tell you the same thing – but it’s educational to see.
We will simulate one study examining our IQ training program. The IQ in the control condition has M = 100, SD = 15, and in the experimental condition the average IQ has improved to M = 106, SD = 15. We will randomly select the sample size, and draw between 20-50 participants in each condition.
Our simulated results for a single simulation (see the code below) give M = 97.03 for the control condition and M = 107.52 for the experimental condition. The difference (experimental condition minus control condition, so higher scores mean better performance in the experimental condition) is statistically significant, t(158) = 2.80, p = 0.007. The effect size Hedges’ g = 0.71. This effect size overestimates the true effect size substantially. The true effect size is d = 6/15 = 0.4 – calculate this for yourself.
Run the code in assignment 6 (I'm skipping some parts I do use in teaching - feel free to run that code to explore variation in correlations) to see the data. Remove the # in front of the set.seed line to get the same result as in this example.
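If you want to see the shape of such an analysis without the linked script, here is a rough sketch, assuming the meta package (the actual script may generate the data and call the package differently):

# Simulate one two-group study and meta-analyze that single study
library(meta)
n <- sample(20:50, 1)                        # random sample size per condition
control <- rnorm(n, mean = 100, sd = 15)
experimental <- rnorm(n, mean = 106, sd = 15)
m <- metacont(n.e = n, mean.e = mean(experimental), sd.e = sd(experimental),
              n.c = n, mean.c = mean(control), sd.c = sd(control),
              sm = "SMD")                    # standardized mean difference (Hedges' g)
summary(m)
forest(m)                                    # forest plot of this single 'meta-analysis'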

Assignment 6
If we perform a meta-analysis, we get almost the same result - the calculations used by the meta package differ slightly (although it will often round to the same 2 digits after the decimal point), because it uses a different (Wald-type) test and confidence interval – but that’s not something we need to worry about here.
Run the simulation a number of times to see the variation in the results, and the similarity between the meta-analytic result and the t-test.

The meta-analysis compares the meta-analytic effect size estimate (which in this example is based on a single study) to zero, and tests whether the difference from zero is statistically significant. We see the estimated effect size g = 0.7144, a 95% CI, and a z-score (2.7178), which is the test statistic for which a p-value can be calculated. The p-value of 0.0066 is very similar to that observed in the t-test.
                          95%-CI      z  p-value
 0.7143703 [0.1992018; 1.2295387] 2.7178   0.0066
Meta-analyses are often visualized using forest plots. We see a forest plot summarizing our single test below:
In this plot we see a number (1) for our single study. The effect size (0.71, Hedges's g, the unbiased estimate of Cohen's d) and the confidence interval [0.2; 1.23] are presented on the right. The effect size and confidence interval are also visualized: the effect size by the orange square (the larger the sample size, the bigger the square), and the 95% confidence interval by the length of the line running through it.
A small-scale meta-analysis
Meta-analyses are made to analyze more than one study. Let’s analyze 4 studies, with different effect sizes (0.44, 0.7, 0.28, 0.35) and sample sizes (60, 35, 23, 80 and 60, 36, 25, 80).
Researchers have to choose between a fixed effect model and a random effects model when performing a meta-analysis.
Fixed effect models assume a single true effect size underlies all the studies included in the meta-analysis. Fixed effect models are therefore only appropriate when all studies in the meta-analysis are practically identical (e.g., use the same manipulation) and when researchers do not want to generalize to different populations (Borenstein, Hedges, Higgins, & Rothstein, 2009).
By contrast, random effects models allow the true effect size to vary from study to study (e.g., due to differences in the manipulations between studies). Note the difference between fixed effect and random effects (plural, meaning multiple effects). Random effects models are therefore appropriate when a wide range of different studies is examined and there is substantial variance between studies in the effect sizes. Since the assumption that all effect sizes are identical is implausible in most meta-analyses, random effects meta-analyses are generally recommended (Borenstein et al., 2009).
The meta-analysis in this assignment, where we have simulated studies based on exactly the same true effect size, and where we don’t want to generalize to different populations, is one of the rare examples where a fixed effect meta-analysis would be appropriate – but for educational purposes, I will only show the random effects model. When variation in effect sizes is small, both models will give the same results.
In a meta-analysis, a weighted mean is computed. The reason studies are weighted when calculating the meta-analytic effect size is that larger studies are considered to provide more accurate estimates of the true effect size (as we have seen above, this is true in general). Instead of simply averaging an effect size estimate from a study with 20 people in each condition and an effect size estimate from a study with 200 people in each condition, the larger study is weighted more strongly when calculating the meta-analytic effect size.
R makes it relatively easy to perform a meta-analysis by using the meta or metafor package. Run the code related to Assignment 7. We get the following output, where we see four rows (one for each study), the effect sizes and 95% CI for each effect, and the %W (random), which is the relative weight for each study in a random effects meta-analysis.

         95%-CI            %W(random)
1  0.44 [ 0.0802; 0.7998]      30.03
2  0.70 [ 0.2259; 1.1741]      17.30
3  0.28 [-0.2797; 0.8397]      12.41
4  0.35 [ 0.0392; 0.6608]      40.26

Number of studies combined: k=4

                                      95%-CI      z  p-value
Random effects model 0.4289 [0.2317; 0.6261] 4.2631 < 0.0001

Quantifying heterogeneity:
tau^2 = 0; H = 1 [1; 1.97]; I^2 = 0% [0%; 74.2%]

Test of heterogeneity:
     Q  d.f.  p-value
  1.78     3   0.6194
The line below the summary gives us the statistics for the random effects model. First, the meta-analytic effect size estimate (0.43) with the 95% CI [0.23; 0.63], and the associated z-score and p-value. Based on the set of studies we simulated here, we would conclude it looks like there is a true effect.
The same information is visualized in a forest plot:

The meta-analysis also provides statistics for heterogeneity. Tests for heterogeneity examine whether there is large enough variation in the effect sizes included in the meta-analysis to assume there might be important moderators of the effect. For example, assume studies examine how happy receiving money makes people. Half of the studies gave people around 10 euros, while the other half gave people 100 euros. It would not be surprising to find that both these manipulations increase happiness, but that 100 euros does so more strongly than 10 euros. Many manipulations in psychological research differ similarly in their strength. If there is substantial heterogeneity, researchers should attempt to examine the underlying reason for this heterogeneity, for example by identifying subsets of studies, and then examining the effect in these subsets. In our example, there does not seem to be substantial heterogeneity (the test for heterogeneity, the Q-statistic, is not statistically significant).
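If you don't have the linked script, here is a sketch of this four-study random effects meta-analysis, assuming the metafor package and the variance formula given above (the output shown above is in the meta package's format, so this version will print differently):

# Four studies: effect sizes and per-group sample sizes from the text
library(metafor)
d <- c(0.44, 0.70, 0.28, 0.35)
n1 <- c(60, 35, 23, 80)
n2 <- c(60, 36, 25, 80)
vd <- (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2))  # variance of d (Borenstein et al.)
res <- rma(yi = d, vi = vd)                          # random effects model (REML)
res                                                  # pooled estimate, CI, Q test
forest(res)                                          # forest plot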

Assignment 7
Play around with the effect sizes and sample sizes in the 4 studies in our small meta-analysis. What happens if you increase the sample sizes? What happens if you make the effect sizes more diverse? What happens when the effect sizes become smaller (e.g., all effect sizes vary a little bit around d = 0.2)? Look at the individual studies. Look at the meta-analytic effect size.
Simulating small studies
Instead of typing in specific numbers for every meta-analysis, we can also simulate a number of studies with a specific true effect size. This is quite informative, because it will show how much variability there is in small, underpowered, studies. Remember that many studies in psychology are small and underpowered.
In this simulation, we will randomly draw data from a normal distribution for two groups. There is a real difference in means between the two groups. Like above, the IQ in the control condition has M = 100, SD = 15, and in the experimental condition the average IQ has improved to M = 106, SD = 15. We will simulate between 20 and 50 participants in each condition (and thus create a ‘literature’ that consists primarily of small studies).
You can run the code we have used above (for a single meta-analysis) to simulate 8 studies, perform the meta-analysis, and create a forest plot. The code for Assignment 8 is the same as earlier; we just changed nSims=1 to nSims=8.
The forest plot of one of the random simulations looks like:

The studies show a great deal of variability, even though the true difference between the two groups is exactly the same in every simulated study. Only 50% of the studies reveal a statistically significant effect, but the meta-analysis provides clear evidence for the presence of a true effect in the fixed effect model (p < 0.0001):

                     95%-CI %W(fixed) %W(random)
1 -0.0173 [-0.4461; 0.4116]     14.47      13.83
2 -0.0499 [-0.5577; 0.4580]     10.31      11.16
3  0.6581 [ 0.0979; 1.2183]      8.48       9.74
4  0.5806 [ 0.0439; 1.1172]      9.24      10.35
5  0.3104 [-0.1693; 0.7901]     11.56      12.04
6  0.4895 [ 0.0867; 0.8923]     16.40      14.87
7  0.7362 [ 0.3175; 1.1550]     15.17      14.22
8  0.2278 [-0.2024; 0.6580]     14.37      13.78

Number of studies combined: k=8

                                      95%-CI      z  p-value
Fixed effect model   0.3624 [0.1993; 0.5255] 4.3544 < 0.0001


Assignment 8
Pretend these would be the outcomes of studies you actually performed. Would you have continued to test your hypothesis in this line of research after study 1 and 2 showed no results?
Simulate at least 10 small meta-analyses. Look at the pattern of the studies, and how much they vary. Look at the meta-analytic effect size estimate. Does it vary, or is it more reliable? What happens if you increase the sample size? For example, instead of choosing samples between 20 and 50 [SampleSize<-sample(20:50, 1)], choose samples between 100 and 150 [SampleSize<-sample(100:150, 1)].


Meta-Analysis, not Miracles
Some people are skeptical about the usefulness of meta-analysis. It is important to realize what meta-analysis can and can’t do. Some researchers argue meta-analyses are garbage-in, garbage-out. If you calculate the meta-analytic effect size of a bunch of crappy studies, the meta-analytic effect size estimate will also be meaningless. It is true that a meta-analysis cannot turn bad data into a good effect size estimation. Similarly, meta-analytic techniques that aim to address publication bias (not discussed in this blog post) can never provide certainty about the unbiased effect size estimate.
However, meta-analysis does more than just provide a meta-analytic effect size estimate that is statistically different from zero or not. It allows researchers to examine the presence of bias, and the presence of variability. These analyses might allow researchers to identify different subsets of studies, some stronger than others. Very often, a meta-analysis will provide good suggestions for future research, such as large scale tests of the most promising effect under investigation.
Meta-analyses are not always performed with enough attention to detail (e.g., Lakens, Hilgard, & Staaks, 2015). It is important to realize that a meta-analysis has the potential to synthesize a large set of studies, but the extent to which a meta-analysis successfully achieves this is open for discussion. For example, it is well-known that researchers on opposite sides of a debate (e.g., concerning the question whether aggressive video games do or do not lead to violence) can publish meta-analyses reaching opposite conclusions. This is obviously undesirable, but it points towards the large degrees of freedom researchers have in choosing which articles to include in the meta-analysis, as well as in other choices that are made throughout the meta-analysis.
Nevertheless, meta-analyses can be very useful. First of all, small-scale meta-analyses can actually mitigate publication bias, by allowing researchers to publish individual studies that show a statistically significant effect alongside studies that do not, while the overall meta-analytic effect size provides clear support for a hypothesis. Second, meta-analyses provide us with a best estimate (or a range of best estimates, given specific assumptions about bias) of the size of effects, or of the variation in effect sizes depending on specific aspects of the performed studies, which can inspire future research.
That’s a lot of information about variation in single studies, variation across studies, meta-analyzing studies, and performing power analyses to design studies that have a high probability of showing a true effect, if it’s there! I hope this is helpful in designing studies and evaluating their results.


from R-bloggers http://ift.tt/1Lq795h
via IFTTT

Sunday, October 11, 2015

Data Analysis Workflow: ‘Omics style

Follow along with the presentation and recreate all the analysis results for yourself.

from R-bloggers http://ift.tt/1GC4tMq
via IFTTT

Tuesday, October 6, 2015

Undirected Graphs When the Causality Is Mutual

Structural equation models impose causal order on a set of observations. We start with a measurement model: a list of theoretical constructs and a table assigning what is observed (manifest) to what is hidden (latent). Although it is possible to think ...

from R-bloggers http://ift.tt/1OgK4mv
via IFTTT

Using differential privacy to reuse training data

Win-Vector LLC‘s Nina Zumel wrote a great article explaining differential privacy and demonstrating how to use it to enhance forward step-wise logistic regression. This allowed her to reproduce results similar to the recent Science paper “The reusable holdout: Preserving validity in adaptive data analysis”. The technique essentially protects and reuses test data, allowing the series … Continue reading Using differential privacy to reuse training data

from R-bloggers http://ift.tt/1jLu80n
via IFTTT

Friday, October 2, 2015

Understanding empirical Bayes estimation (using baseball statistics)

Which of these two proportions is higher: 4 out of 10, or 300 out of 1000? This sounds like a silly question. Obviously 4 out of 10 (0.4), which is greater than 300 out of 1000 (0.3).

But suppose you were a baseball recruiter, trying to decide which of two potential players is a better batter based on how many hits they get. One has achieved 4 hits in 10 chances, the other 300 hits in 1000 chances. While the first player has a higher proportion of hits, it’s not a lot of evidence: a typical player tends to achieve a hit around 27% of the time, and this player’s 4 out of 10 could be due to luck. The second player, on the other hand, has a lot of evidence that he’s an above-average batter.

This post isn’t really about baseball, I’m just using it as an illustrative example. (I actually know very little about sabermetrics. If you want a more technical version of this post, check out this great paper). This post is, rather, about a very useful statistical method for estimating a large number of proportions, called empirical Bayes estimation. It’s to help you with data that looks like this:

Success Total
11 104
82 1351
2 26
0 40
1203 7592
5 166

A lot of data takes the form of these success/total counts, where you want to estimate a “proportion of success” for each instance. Each row might represent:

  • An ad you’re running: Which of your ads have higher clickthrough rates, and which have lower? (Note that I’m not talking about running an A/B test comparing two options, but rather about ranking and analyzing a large list of choices.)
  • A user on your site: In my work at Stack Overflow, I might look at what fraction of a user’s visits are to Javascript questions, to guess whether they are a web developer. In another application, you might consider how often a user decides to read an article they browse over, or to purchase a product they’ve clicked on.

When you work with pairs of successes/totals like this, you tend to get tripped up by the uncertainty in low counts. 1 out of 2 does not mean the same thing as 50 out of 100; nor does 0 out of 1 mean the same thing as 0 out of 1000. One approach is to filter out all cases that don’t meet some minimum, but this isn’t always an option: you’re throwing away useful information.

I previously wrote a post about one approach to this problem, using the same analogy: Understanding the beta distribution (using baseball statistics). Using the beta distribution to represent your prior expectations, and updating based on the new evidence, can help make your estimate more accurate and practical. Here I’ll demonstrate the related method of empirical Bayes estimation, where the beta distribution is used to improve a large set of estimates. What’s great about this method is that as long as you have a lot of examples, you don’t need to bring in prior expectations.

Here I’ll apply empirical Bayes estimation to a baseball dataset, with the goal of improving our estimate of each player’s batting average. I’ll focus on the intuition of this approach, but will also show the R code for running this analysis yourself. (So that the post doesn’t get cluttered, I don’t show the code for the graphs and tables, only the estimation itself. But you can find all this post’s code here)

Working with batting averages

In my original post about the beta distribution, I made some vague guesses about the distribution of batting averages across history, but here we’ll work with real data. We’ll use the Batting dataset from the excellent Lahman package. We’ll prepare and clean the data a little first, using dplyr and tidyr:

library(dplyr)
library(tidyr)
library(Lahman)

career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(Pitching, by = "playerID") %>%
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  mutate(average = H / AB)

# use names along with the player IDs
career <- Master %>%
  tbl_df() %>%
  select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID") %>%
  select(-playerID)

Above, we filtered out pitchers (generally the weakest batters, who should be analyzed separately). We summarized each player across multiple years to get their career Hits (H) and At Bats (AB), and batting average. Finally, we added first and last names to the dataset, so we could work with them rather than an identifier:

career
## Source: local data frame [9,256 x 4]
## 
##                 name     H    AB average
##                (chr) (int) (int)   (dbl)
## 1         Hank Aaron  3771 12364  0.3050
## 2       Tommie Aaron   216   944  0.2288
## 3          Andy Abad     2    21  0.0952
## 4        John Abadie    11    49  0.2245
## 5     Ed Abbaticchio   772  3044  0.2536
## 6        Fred Abbott   107   513  0.2086
## 7        Jeff Abbott   157   596  0.2634
## 8        Kurt Abbott   523  2044  0.2559
## 9         Ody Abbott    13    70  0.1857
## 10 Frank Abercrombie     0     4  0.0000
## ..               ...   ...   ...     ...

I wonder who the best batters in history were. Well, here are the ones with the highest batting average:

name H AB average
Jeff Banister 1 1 1
Doc Bass 1 1 1
Steve Biras 2 2 1
C. B. Burns 1 1 1
Jackie Gallagher 1 1 1

Err, that’s not really what I was looking for. These aren’t the best batters, they’re just the batters who went up once or twice and got lucky. How about the worst batters?

name H AB average
Frank Abercrombie 0 4 0
Horace Allen 0 7 0
Pete Allen 0 4 0
Walter Alston 0 1 0
Bill Andrus 0 9 0

Also not what I was looking for. That “average” is a really crummy estimate. Let’s make a better one.

Step 1: Estimate a prior from all your data

Let’s look at the distribution of batting averages across players.

(For the sake of estimating the prior distribution, I’ve filtered out all players that have fewer than 500 at-bats, since we’ll get a better estimate from the less noisy cases. I show a more principled approach in the Appendix).

The first step of empirical Bayes estimation is to estimate a beta prior using this data. Estimating priors from the data you’re currently analyzing is not the typical Bayesian approach: usually you decide on your priors ahead of time. There’s a lot of debate and discussion about when and where it’s appropriate to use empirical Bayesian methods, but it basically comes down to how many observations we have: if we have a lot, we can get a good estimate that doesn’t depend much on any one individual. Empirical Bayes is an approximation to more exact Bayesian methods, and with the amount of data we have, it’s a very good approximation.

So far, a beta distribution looks like a pretty appropriate choice based on the above histogram. (What would make it a bad choice? Well, suppose the histogram had two peaks, or three, instead of one. Then we might need a mixture of Betas, or an even more complicated model.) So we know we want to fit the following model:

X ~ Beta(α0, β0)
We just need to pick α0 and β0, which we call the “hyper-parameters” of our model. There are many methods in R for fitting a probability distribution to data (optim, mle, bbmle, etc). You don’t even have to use maximum likelihood: you could use the mean and variance, called the “method of moments”. But we’ll use the fitdistr function from MASS.

# just like the graph, we have to filter for the players we actually
# have a decent estimate of
career_filtered <- career %>%
    filter(AB >= 500)

m <- MASS::fitdistr(career_filtered$average, dbeta,
                    start = list(shape1 = 1, shape2 = 10))

alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]

This comes up with α0 ≈ 78.7 and β0 ≈ 224.8. How well does this fit the data?

Not bad! Not perfect, but something we can work with.

Step 2: Use that distribution as a prior for each individual estimate

Now when we look at any individual to estimate their batting average, we’ll start with our overall prior, and update based on the individual evidence. I went over this process in detail in the original Beta distribution post: it’s as simple as adding α0 to the number of hits, and α0 + β0 to the total number of at-bats.

For example, consider our hypothetical batter from the introduction that went up 1000 times, and got 300 hits. We would estimate his batting average as:

(300 + α0) / (1000 + α0 + β0) = (300 + 78.7) / (1000 + 303.5) ≈ 0.290

How about the batter who went up only 10 times, and got 4 hits. We would estimate his batting average as:

(4 + α0) / (10 + α0 + β0) = (4 + 78.7) / (10 + 303.5) ≈ 0.264

Thus, even though 4/10 > 300/1000, we would guess that the 300-out-of-1000 batter is better than the 4-out-of-10 batter!
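The same calculation in R, reusing the alpha0 and beta0 objects fitted above (the 78.7 and 224.8 above are rounded, so your exact decimals may differ slightly):

# Shrunken estimates for the two hypothetical batters
(300 + alpha0) / (1000 + alpha0 + beta0)  # about 0.29
(4 + alpha0) / (10 + alpha0 + beta0)      # about 0.26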

Performing this calculation for all the batters is simple enough:

career_eb <- career %>%
    mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0))

Results

Now we can ask: who are the best batters by this improved estimate?

name H AB average eb_estimate
Rogers Hornsby 2930 8173 0.358 0.355
Shoeless Joe Jackson 1772 4981 0.356 0.350
Ed Delahanty 2596 7505 0.346 0.343
Billy Hamilton 2158 6268 0.344 0.340
Harry Heilmann 2660 7787 0.342 0.339

Who are the worst batters?

name H AB average eb_estimate
Bill Bergen 516 3028 0.170 0.178
Ray Oyler 221 1265 0.175 0.191
John Vukovich 90 559 0.161 0.196
John Humphries 52 364 0.143 0.196
George Baker 74 474 0.156 0.196

Notice that in each of these cases, empirical Bayes didn’t simply pick the players who had 1 or 2 at-bats. It found players who batted well, or poorly, across a long career. What a load off our minds: we can start using these empirical Bayes estimates in downstream analyses and algorithms, and not worry that we’re accidentally letting cases like 1-for-1 or 0-for-4 ruin everything.

Overall, let’s see how empirical Bayes changed all of the batting average estimates:

The horizontal dashed red line marks the prior mean, α0 / (α0 + β0) ≈ 0.259 – that’s what we would guess someone’s batting average was if we had no evidence at all. Notice that points above that line tend to move down towards it, while points below it move up.

The diagonal red line marks the points where the estimate equals the raw batting average (x = y). Points that lie close to it are the ones that didn’t get shrunk at all by empirical Bayes. Notice that they’re the ones with the highest number of at-bats (the brightest blue): they have enough evidence that we’re willing to believe the naive batting average estimate.
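The post doesn't show the code for this figure; a rough ggplot2 sketch along these lines (my own version, assuming the career_eb, alpha0, and beta0 objects created above) would be:

library(ggplot2)

ggplot(career_eb, aes(average, eb_estimate, color = AB)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +    # the x = y line
  geom_hline(yintercept = alpha0 / (alpha0 + beta0),
             color = "red", linetype = "dashed") +          # the prior mean
  scale_color_continuous(trans = "log",
                         breaks = c(1, 10, 100, 1000, 10000)) +
  labs(x = "Batting average (H / AB)", y = "Empirical Bayes estimate")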

This is why this process is sometimes called shrinkage: we’ve moved all our estimates towards the average. How much it moves these estimates depends on how much evidence we have: if we have very little evidence (4 hits out of 10) we move it a lot, if we have a lot of evidence (300 hits out of 1000) we move it only a little. That’s shrinkage in a nutshell: Extraordinary outliers require extraordinary evidence.

Conclusion: So easy it feels like cheating

Recall that there were two steps in empirical Bayes estimation:

  1. Estimate the overall distribution of your data.
  2. Use that distribution as your prior for estimating each average.

Step 1 can be done once, “offline”- analyze all your data and come up with some estimates of your overall distribution. Step 2 is done for each new observation you’re considering. You might be estimating the success of a post or an ad, or classifying the behavior of a user in terms of how often they make a particular choice.

And because we’re using the beta and the binomial, consider how easy that second step is. All we did was add one number to the successes, and add another number to the total. You can build that into your production system with a single line of code that takes nanoseconds to run.

// We hired a Data Scientist to analyze our Big Data
// and all we got was this lousy line of code.
float estimate = (successes + 78.7) / (total + 303.5);

That really is so simple that it feels like cheating – like the kind of “fudge factor” you might throw into your code, with the intention of coming back to it later to do some real Machine Learning.

I bring this up to disprove the notion that statistical sophistication necessarily means dealing with complicated, burdensome algorithms. This Bayesian approach is based on sound principles, but it’s still easy to implement. Conversely, next time you think “I only have time to implement a dumb hack,” remember that you can use methods like these: it’s a way to choose your fudge factor. Some dumb hacks are better than others!

But when anyone asks what you did, remember to call it “empirical Bayesian shrinkage towards a Beta prior.” We statisticians have to keep up appearances.

Appendix: How could we make this more complicated?

We’ve made some enormous simplifications in this post. For one thing, we assumed all batting averages are drawn from a single distribution. In reality, we’d expect that it depends on some known factors. For instance, the distribution of batting averages has changed over time:

Ideally, we’d want to estimate a different Beta prior for each decade. Similarly, we could estimate separate priors for each team, a separate prior for pitchers, and so on. One useful approach to this is Bayesian hierarchical modeling (as used in, for example, this study of SAT scores across different schools).

Also, as alluded to above, we shouldn’t be estimating the prior distribution only from the players with at least 500 at-bats. Really, we should use all of our data to estimate the distribution, but give more consideration to the players with a higher number of at-bats. This can be done by fitting a beta-binomial distribution. For instance, we can use the dbetabinom.ab function from VGAM, and the mle function:

library(VGAM)

# negative log likelihood of data given alpha; beta
ll <- function(alpha, beta) {
  -sum(dbetabinom.ab(career$H, career$AB, alpha, beta, log = TRUE))
}

m <- mle(ll, start = list(alpha = 1, beta = 10), method = "L-BFGS-B")
coef(m)
## alpha  beta 
##    75   222

We end up getting almost the same prior, which is reassuring!



from R-bloggers http://ift.tt/1QNLmEr
via IFTTT