Saturday, April 30, 2016

Interactions between Categorical Variables in Mixed Graphical Models

In a previous post we recovered the conditional independence structure in a dataset of mixed variables describing different aspects of the life of individuals diagnosed with Autism Spectrum Disorder, using the mgm package. While depicting the independence structure in a multivariate dataset gives a first overview of the relations between variables, in most applications we are interested in the exact parameter estimates. For instance, for interactions between continuous variables, we would like to know the sign and the size of parameters - i.e., whether the nodes in the graph are positively or negatively related, and how strong these associations are. In the case of interactions between categorical variables, we are interested in the signs and sizes of the set of parameters that describes the exact non-linear relationship between variables.

In this post, we take the analysis a step further and show how to use the output of the mgm package to take a closer look at the recovered dependencies. Specifically, we will recover the sign and weight of interaction parameters between continuous variables and zoom into interactions between categorical and continuous variables and between two categorical variables. Both the dataset and the code are available on Github.

We start out with the conditional dependence graph estimated in the previous post, but now with variables grouped by their type:

We obtained this graph by fitting a mixed graphical model using the mgmfit() function as in the previous post:

# load data; available on Github
datalist <- readRDS('autism_datalist.RDS')
data <- datalist$data
type <- datalist$type
lev <- datalist$lev

library(devtools)
install_github('jmbh/mgm') # we need version 1.1-6
library(mgm)

fit <- mgmfit(data, type, lev, lambda.sel = "EBIC", d = 2)

Display Edge Weights and Signs

We now also display the weights of the dependencies. In addition, for interactions between continuous (Gaussian, Poisson) variables, we are able to determine the sign of the dependency, as it only depends on one parameter. The signs are saved in fit$signs. To make plotting easier, there is also a matrix fit$edgecolor, which assigns colors to positive (green), negative (red) and undefined (grey) edge signs.

Now, to plot the weighted adjacency matrix with signs (where defined), we give fit$edgecolor as input to the argument edge.color in qgraph and plot the weighted adjacency matrix fit$wadj instead of the unweighted adjacency matrix fit$adj:

library(devtools)
install_github('SachaEpskamp/qgraph') # we need version 1.3.3
library(qgraph)

# define variable types
groups_typeV <- list("Gaussian"=which(datalist$type=='g'), 
                     "Poisson"=which(datalist$type=='p'),
                     "Categorical"=which(datalist$type=='c'))

# pick some nice colors
group_col <- c("#72CF53", "#53B0CF", "#ED3939")

jpeg("Autism_VarTypes.jpg", height=2*900, width=2*1300, unit='px')
qgraph(wgraph, 
       vsize=3.5, 
       esize=5, 
       layout=Q0$layout, # gives us the same layout as above
       edge.color = edgeColor, 
       color=group_col,
       border.width=1.5,
       border.color="black",
       groups=groups_typeV,
       nodeNames=datalist$colnames,
       legend=TRUE, 
       legend.mode="style2",
       legend.cex=1.5)
dev.off()

This gives us the following figure:

Red edges correspond to negative edge weights and green edges correspond to positive edge weights. The width of an edge is proportional to the absolute value of the parameter weight. Grey edges connect categorical variables to continuous variables or to other categorical variables; these edges are computed from more than one parameter, so we cannot assign a sign to them.

While the interaction between continuous variables can be interpreted as a conditional covariance similar to the well-known multivariate Gaussian case, the interpretation of edge-weights involving categorical variables is more intricate as they are comprised of several parameters.

Interpretation of Interaction: Continuous - Categorical

We first consider the edge weight between the continuous Gaussian variable ‘Working hours’ and the categorical variable ‘Type of Work’, which has the categories (a) No work, (b) Supervised work, (c) Unpaid work and (d) Paid work. We get the estimated parameters behind this edge weight from the matrix of all estimated parameters in the mixed graphical model, fit$mpar.matrix:

matrix(fit$mpar.matrix[fit$par.labels == 16, fit$par.labels == 17], ncol=1)

           [,1]
[1,] -3.7051782
[2,]  0.0000000
[3,]  0.0000000
[4,]  0.5059143

fit$par.labels indicates which pair of variables each parameter in fit$mpar.matrix belongs to. Note that in the case of jointly Gaussian data, fit$mpar.matrix is equivalent to the inverse covariance matrix and each interaction would be represented by a single value only.

The four values we got from the model parameter matrix represent the interactions of the continuous variable ‘Working hours’ with each of the categories of ‘Type of work’. These can be interpreted in a straightforward way as increasing or decreasing the probability of a category depending on ‘Working hours’. We see that the probability of category (a) ‘No work’ is greatly decreased by an increase in ‘Working hours’. This makes sense, as somebody who does not work has to work 0 hours. Next, working hours do not seem to predict the probability of categories (b) ‘Supervised work’ and (c) ‘Unpaid work’. However, increasing working hours does increase the probability of category (d) ‘Paid work’, which indicates that individuals who get paid for their work, work longer hours. Note that these interactions are unique in the sense that the influence of all other variables is partialed out!
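To make this interpretation concrete, here is a small illustrative sketch (not part of the original analysis): under the multinomial-logit link used for categorical nodes, the four slopes translate into category probabilities via a softmax. For simplicity the intercepts and all other predictors are held at zero, so the numbers are purely illustrative.

# Illustrative only: intercepts and all other predictors are set to zero
beta  <- c(-3.7051782, 0, 0, 0.5059143)  # slopes for (a) No work ... (d) Paid work
hours <- c(0, 10, 20, 40)                # hypothetical values of 'Working hours'
probs <- sapply(hours, function(h) {
  eta <- beta * h             # linear predictor for each category
  exp(eta) / sum(exp(eta))    # softmax: category probabilities
})
dimnames(probs) <- list(c("No work", "Supervised", "Unpaid", "Paid"),
                        paste0(hours, "h"))
round(probs, 3)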

Interpretation of Interaction: Categorical - Categorical

Next we consider the edge weight between the categorical variable (14) ‘Type of Housing’ and the variable (16) ‘Type of Work’ from above. ‘Type of Housing’ has two categories, (a) ‘Not independent’ and (b) ‘Independent’. As in the previous example, we take the relevant parameters from the model parameter matrix:

fit$mpar.matrix[fit$par.labels == 14, fit$par.labels == 16]

     [,1] [,2] [,3]       [,4]
[1,]   NA    0    0 -0.5418989
[2,]   NA    0    0  0.5418989

Again, the rows represent the categories of variable (14) ‘Type of Housing’. The columns indicate how the different categories of variable (16) ‘Type of Work’ predict the probability of these categories. The first column is the dummy (reference) category ‘No work’. The parameters can therefore be interpreted as follows:

Having supervised or unpaid work does not predict a probability of living independently that differs from that of individuals with no work. Having paid work, however, decreases the probability of not living independently and increases the probability of living independently, compared to the reference category ‘No work’.

The interpretations above correspond to the typical interpretation of parameters in a multinomial regression model, which is indeed what is used in the node-wise regression approach we use in the mgm package to estimate mixed graphical models. For details about the exact parameterization of the multinomial regression model, check chapter 4 in the glmnet paper. Note that because we use the node-wise regression approach, we could also look at how the categories of (16) ‘Type of Work’ predict (17) ‘Working hours’, or how the categories of (14) ‘Type of Housing’ predict the probabilities of (16) ‘Type of Work’. These parameters can be obtained by exchanging the row indices with the column indices when subsetting fit$mpar.matrix, as sketched below. For an elaborate explanation of the node-wise regression approach and the exact structure of the model parameter matrix, please check the mgm paper.
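For completeness, here is a short sketch of those two reverse look-ups (assuming the same fit object as above):

# How the categories of (16) 'Type of Work' predict (17) 'Working hours'
fit$mpar.matrix[fit$par.labels == 17, fit$par.labels == 16]

# How the categories of (14) 'Type of Housing' predict the probabilities of (16) 'Type of Work'
fit$mpar.matrix[fit$par.labels == 16, fit$par.labels == 14]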



from R-bloggers http://ift.tt/24p9ogL
via IFTTT

Monday, April 25, 2016

Boosting and AdaBoost for Machine Learning

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. In this post you will discover the AdaBoost Ensemble method for machine learning. After reading this post, you will know: What the boosting ensemble method is and generally how it works. How to learn to boost decision […]
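As a rough illustration of the idea (not code from the post), here is a minimal AdaBoost sketch in R that combines weighted decision stumps on a one-dimensional predictor; it assumes labels coded as -1/+1.

# Minimal, illustrative AdaBoost with decision stumps; assumes y in {-1, +1}
adaboost_stumps <- function(x, y, n_rounds = 20) {
  n <- length(y)
  w <- rep(1 / n, n)                       # observation weights
  stumps <- vector("list", n_rounds)
  alphas <- numeric(n_rounds)
  for (m in seq_len(n_rounds)) {
    best <- list(err = Inf)
    for (s in unique(x)) for (dir in c(1, -1)) {   # search all stump thresholds
      pred <- ifelse(dir * (x - s) > 0, 1, -1)
      err <- sum(w * (pred != y))
      if (err < best$err) best <- list(err = err, split = s, dir = dir)
    }
    alphas[m] <- 0.5 * log((1 - best$err) / max(best$err, 1e-10))  # stump weight
    stumps[[m]] <- best
    pred <- ifelse(best$dir * (x - best$split) > 0, 1, -1)
    w <- w * exp(-alphas[m] * y * pred)    # up-weight misclassified observations
    w <- w / sum(w)
  }
  list(stumps = stumps, alphas = alphas)
}

predict_adaboost <- function(model, x) {
  score <- Reduce(`+`, Map(function(st, a)
    a * ifelse(st$dir * (x - st$split) > 0, 1, -1), model$stumps, model$alphas))
  ifelse(score > 0, 1, -1)                 # weighted majority vote
}

set.seed(1)
x <- runif(200)
y <- ifelse(x > 0.6, 1, -1)
flip <- sample(200, 20); y[flip] <- -y[flip]   # add some label noise
fit_ab <- adaboost_stumps(x, y, n_rounds = 10)
mean(predict_adaboost(fit_ab, x) == y)         # training accuracy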

The post Boosting and AdaBoost for Machine Learning appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1Vw6tBm
via IFTTT

Saturday, April 23, 2016

Top 10 IPython Notebook Tutorials for Data Science and Machine Learning

A list of 10 useful Github repositories made up of IPython (Jupyter) notebooks, focused on teaching data science and machine learning. Python is the clear target here, but general principles are transferable.

from KDnuggets http://ift.tt/247EFou
via IFTTT

Thursday, April 21, 2016

Holding Your Hand Like a Small Child Through a Neural Network Part 2

The second of 2 posts expanding upon a now-classic neural network blog post and demonstration, guiding the reader through the workings of a simple neural network.

from KDnuggets http://ift.tt/1QpOswt
via IFTTT

Bagging and Random Forest Ensemble Algorithms for Machine Learning

Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation or bagging. In this post you will discover the Bagging ensemble algorithm and the Random Forest algorithm for predictive modeling. After reading this post you will know about: The […]
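As a quick, hedged illustration (not from the post), a random forest — bagged decision trees with random feature selection — can be fit in R with the randomForest package on a built-in dataset:

library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
print(rf)        # includes the out-of-bag (OOB) error estimate
importance(rf)   # variable importance scores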

The post Bagging and Random Forest Ensemble Algorithms for Machine Learning appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1plXDYS
via IFTTT

Tuesday, April 19, 2016

Support Vector Machines for Machine Learning

Support Vector Machines are perhaps one of the most popular and talked about machine learning algorithms. They were extremely popular around the time they were developed in the 1990s and continue to be the go-to method for a high-performing algorithm with little tuning. In this post you will discover the Support Vector Machine (SVM) machine […]
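For a hedged, minimal illustration (not from the post), an SVM with a radial kernel can be fit in R via the e1071 package, which wraps libsvm:

library(e1071)
set.seed(42)
svm_fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
table(predicted = predict(svm_fit, iris), actual = iris$Species)  # training confusion matrix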

The post Support Vector Machines for Machine Learning appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/23XtNcF
via IFTTT

Top 15 Frameworks for Machine Learning Experts

Whether you are a researcher, a start-up or a big organization that wants to use machine learning, you will need the right tools to make it happen. Here is a list of the most popular frameworks for machine learning.

from KDnuggets http://ift.tt/1qWlFuH
via IFTTT

Thursday, April 14, 2016

Regression & Correlation for Military Promotion: A Tutorial

A clear and well-written tutorial covering the concepts of regression and correlation, focusing on military commander promotion as a use case.

from KDnuggets http://ift.tt/2602xfG
via IFTTT

Popular Deep Learning Libraries

There are so many deep learning libraries to choose from. Which are the good, professional libraries that are worth learning, and which are someone's side project and should be avoided? It is hard to tell the difference. In this post you will discover the top deep learning libraries that you should consider learning and using […]

The post Popular Deep Learning Libraries appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1qRYCle
via IFTTT

Tuesday, April 12, 2016

From Science to Data Science, a Comprehensive Guide for Transition

An in-depth, multifaceted, and all-around very helpful roadmap for making the switch from 'science' to 'data science,' yet generally useful for data science beginners or anyone looking to get into data science.

from KDnuggets http://ift.tt/1TP1t9A
via IFTTT

Predicting Wine Quality with Azure ML and R

by Shaheen Gauher, PhD, Data Scientist at Microsoft. In machine learning, the problem of classification entails correctly identifying to which class or group a new observation belongs, by learning from observations whose classes are already known. In what follows, I will build a classification experiment in Azure ML Studio to predict wine quality based on physicochemical data. Several classification algorithms will be applied to the data set and the performance of these algorithms will be compared. I will also present a tutorial on how to do a similar exercise using MRS (Microsoft R Server, formerly Revolution R Enterprise). I will use...

from R-bloggers http://ift.tt/1S7MQvs
via IFTTT

Monday, April 11, 2016

Simulating queueing systems with simmer

We are very pleased to announce that a new release of simmer, the Discrete-Event Simulator for R, is on CRAN. There are quite a few changes and fixes, with the support of preemption as a star new feature. Check out the complete set of release notes here.

Let’s simmer for a bit and see how this package can be used to simulate queueing systems in a very straightforward way.

The M/M/1 system

In Kendall’s notation, an M/M/1 system has exponential arrivals (the first M), a single server (the 1) with exponential service time (the second M) and an infinite queue (implicitly M/M/1/∞). For instance, people arriving at an ATM at rate λ, waiting their turn in the street and withdrawing money at rate μ.

Let us remember the basic parameters of this system: the server utilization ρ = λ/μ, the average number of customers in the system N = ρ/(1 − ρ), and the average time spent in the system T = N/λ (by Little’s law), all valid whenever ρ < 1. If that is not true, the system is unstable: there are more arrivals than the server is capable of handling, and the queue will grow indefinitely.

The simulation of an M/M/1 system is quite simple using simmer. The trajectory-based design, combined with magrittr’s pipe, is very verbal and self-explanatory.

library(simmer)
set.seed(1234)

lambda <- 2
mu <- 4
rho <- lambda/mu # = 2/4

mm1.trajectory <- create_trajectory() %>%
  seize("resource", amount=1) %>%
  timeout(function() rexp(1, mu)) %>%
  release("resource", amount=1)

mm1.env <- simmer() %>%
  add_resource("resource", capacity=1, queue_size=Inf) %>%
  add_generator("arrival", mm1.trajectory, function() rexp(1, lambda)) %>%
  run(until=2000)

Our package provides convenience plotting functions to quickly visualise the usage of a resource over time, for instance. Down below, we can see how the simulation converges to the theoretical average number of customers in the system.

library(ggplot2)

# Evolution of the average number of customers in the system
graph <- plot_resource_usage(mm1.env, "resource", items="system")

# Theoretical value
mm1.N <- rho/(1-rho)
graph + geom_hline(yintercept=mm1.N)

It is also possible to visualise, for instance, the instantaneous usage of individual elements by playing with the parameters items and steps.

plot_resource_usage(mm1.env, "resource", items=c("queue", "server"), steps=TRUE) +
  xlim(0, 20) + ylim(0, 4)

We can obtain the time spent by each customer in the system and compare its average with the theoretical value.

mm1.arrivals <- get_mon_arrivals(mm1.env)
mm1.t_system <- mm1.arrivals$end_time - mm1.arrivals$start_time

mm1.T <- mm1.N / lambda
mm1.T ; mean(mm1.t_system)

## [1] 0.5

## [1] 0.5012594

It seems that it matches the theoretical value pretty well. But of course we are picky, so let’s take a closer look, just to be sure (and to learn more about simmer, why not). Replication can be done with standard R tools:

library(parallel)

envs <- mclapply(1:1000, function(i) {
  simmer() %>%
    add_resource("resource", capacity=1, queue_size=Inf) %>%
    add_generator("arrival", mm1.trajectory, function() rexp(1, lambda)) %>%
    run(1000/lambda) %>%
    wrap()
})

Et voilà! Parallelizing has the shortcoming that we lose the underlying C++ objects when each thread finishes, but the wrap function does all the magic for us, retrieving the monitored data. Let’s perform a simple test:

library(dplyr)

t_system <- get_mon_arrivals(envs) %>%
  mutate(t_system = end_time - start_time) %>%
  group_by(replication) %>%
  summarise(mean = mean(t_system))

t.test(t_system$mean)

##
##      One Sample t-test
##
## data:  t_system$mean
## t = 344.14, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.4953154 0.5009966
## sample estimates:
## mean of x
##  0.498156

Good news: the simulator works. Finally, in an M/M/1 system the time spent in the system is, in turn, an exponential random variable with mean T.

qqplot(mm1.t_system, rexp(length(mm1.t_system), 1/mm1.T))
abline(0, 1, lty=2, col="red")

M/M/c/k systems

An M/M/c/k system keeps exponential arrivals and service times, but has more than one server in general and a finite queue, which is often more realistic. For instance, a router may have several processors to handle packets, and the in/out queues are necessarily finite.

This is the simulation of an M/M/2/3 system (2 servers, 1 position in the queue). Note that the trajectory is identical to the M/M/1 case.

lambda <- 2
mu <- 4

mm23.trajectory <- create_trajectory() %>%
  seize("server", amount=1) %>%
  timeout(function() rexp(1, mu)) %>%
  release("server", amount=1)

mm23.env <- simmer() %>%
  add_resource("server", capacity=2, queue_size=1) %>%
  add_generator("arrival", mm23.trajectory, function() rexp(1, lambda)) %>%
  run(until=2000)

In this case, there are rejections when the queue is full.

mm23.arrivals <- get_mon_arrivals(mm23.env)
mm23.arrivals %>%
  summarise(rejection_rate = sum(!finished)/length(finished))

##   rejection_rate
## 1     0.02065614
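As a quick sanity check (not part of the original post), the simulated rejection rate can be compared with the theoretical blocking probability of an M/M/c/k queue, which is the steady-state probability p_k of finding the system full:

# Steady-state probabilities of an M/M/c/k queue:
#   p_n = p_0 * (lambda/mu)^n / n!                 for n <= c
#   p_n = p_0 * (lambda/mu)^n / (c! * c^(n - c))   for c < n <= k
c_srv <- 2; k <- 3; a <- lambda / mu
p_unnorm <- sapply(0:k, function(n)
  if (n <= c_srv) a^n / factorial(n)
  else            a^n / (factorial(c_srv) * c_srv^(n - c_srv)))
p <- p_unnorm / sum(p_unnorm)
p[k + 1]   # theoretical rejection probability, to compare with the simulated rate above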

Despite this, the time spent in the system still follows an exponential distribution, as in the M/M/1 case, but the average has dropped.

mm23.t_system <- mm23.arrivals$end_time - mm23.arrivals$start_time
# Comparison with M/M/1 times
qqplot(mm1.t_system, mm23.t_system)
abline(0, 1, lty=2, col="red")
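To see the drop in the average directly (a quick check, not in the original post), compare the simulated means:

mean(mm1.t_system)    # M/M/1 average time in system (theory: 0.5)
mean(mm23.t_system)   # M/M/2/3 average, noticeably smaller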



from R-bloggers http://ift.tt/1N3PFyO
via IFTTT

Sunday, April 10, 2016

Naive Bayes for Machine Learning

Naive Bayes is a simple but surprisingly powerful algorithm for predictive modeling. In this post you will discover the Naive Bayes algorithm for classification. After reading this post, you will know: The representation used by naive Bayes that is actually stored when a model is written to a file. How a learned model can be […]
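As a hedged, minimal illustration (not from the post), naive Bayes can be fit in R with the e1071 package; the stored representation is essentially a set of per-class summary tables:

library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)
nb$tables$Petal.Length    # per-class mean and standard deviation for one feature
predict(nb, head(iris))   # class predictions for the first few rows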

The post Naive Bayes for Machine Learning appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1YnGHgf
via IFTTT

Thursday, April 7, 2016

Visualising F1 Stint Strategies

With the new F1 season upon us, I’ve started tinkering with bits of code from the Wrangling F1 Data With R book and looking at the data in some new ways. For example, I started wondering whether we might be able to learn something interesting about the race strategies by looking at laptimes on a […]

from R-bloggers http://ift.tt/1Vagxj4
via IFTTT

Basics of GPU Computing for Data Scientists

With the rise of neural networks in data science, the demand for computationally intensive machines has led to GPUs. Learn how you can get started with GPUs and the algorithms that can leverage them.

from KDnuggets http://ift.tt/1RRuTiF
via IFTTT

Classification And Regression Trees for Machine Learning

Decision Trees are an important type of algorithm for predictive modeling machine learning. The classical decision tree algorithms have been around for decades and modern variations like random forest are among the most powerful techniques available. In this post you will discover the humble decision tree algorithm known by its more modern name CART which stands […]
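As a hedged, minimal illustration (not from the post), a CART-style classification tree can be grown in R with the rpart package:

library(rpart)
cart <- rpart(Species ~ ., data = iris, method = "class")
print(cart)                # the fitted tree, printed as text
# plot(cart); text(cart)   # optional: plot the tree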

The post Classification And Regression Trees for Machine Learning appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1TEw9tU
via IFTTT

Wednesday, April 6, 2016

Deep Learning for Internet of Things Using H2O

H2O is a feature-rich, open source machine learning platform known for its R and Spark integration and its ease of use. This is an overview of using H2O deep learning for data science with the Internet of Things.

from KDnuggets http://ift.tt/23hjBLI
via IFTTT

Link Analysis for Fraud Detection: How Linkurious enabled investigation of the massive Panama Papers leaks

Linkurious has been a partner of the International Consortium of Investigative Journalists (ICIJ) since the Swiss Leaks scandal. ICIJ's network of 370 journalists is using Linkurious to investigate the Panama Papers. Learn the inside story of the biggest data leak investigation in history.

from KDnuggets http://ift.tt/22crlMG
via IFTTT

Computational Actuarial Science, with R, in Barcelona

This Wednesday, I will give a graduate crash course on computational actuarial science, with R, which will be the second part of Tuesday's lecture. Slides are now available.

from R-bloggers http://ift.tt/1SPz0hi
via IFTTT

Linear Discriminant Analysis for Machine Learning

Logistic regression is a classification algorithm traditionally limited to only two-class classification problems. If you have more than two classes then Linear Discriminant Analysis is the preferred linear classification technique. In this post you will discover the Linear Discriminant Analysis (LDA) algorithm for classification predictive modeling problems. After reading this post you will know: The […]
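As a hedged, minimal illustration (not from the post), LDA on a three-class problem takes one line with MASS::lda:

library(MASS)
lda_fit <- lda(Species ~ ., data = iris)
lda_fit$scaling                               # linear discriminant coefficients
table(predict(lda_fit)$class, iris$Species)   # in-sample confusion matrix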

The post Linear Discriminant Analysis for Machine Learning appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1SNMBWz
via IFTTT

Tuesday, April 5, 2016

A bit on the F1 score floor

At Strata+Hadoop World “R Day” Tutorial, Tuesday, March 29 2016, San Jose, California we spent some time on classifier measures derived from the so-called “confusion matrix.” We repeated our usual admonition to not use “accuracy” as a project goal (business people tend to ask for it as it is the word they are most familiar … Continue reading A bit on the F1 score floor
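As a small, hedged illustration of the measures in question (not from the post), precision, recall and F1 can be read off a 2x2 confusion matrix like so:

# Hypothetical confusion matrix: rows = predicted class, columns = actual class
conf_mat <- matrix(c(40, 10,    # predicted positive: 40 true pos, 10 false pos
                      5, 45),   # predicted negative:  5 false neg, 45 true neg
                   nrow = 2, byrow = TRUE,
                   dimnames = list(predicted = c("pos", "neg"),
                                   actual    = c("pos", "neg")))
precision <- conf_mat["pos", "pos"] / sum(conf_mat["pos", ])
recall    <- conf_mat["pos", "pos"] / sum(conf_mat[, "pos"])
f1 <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, f1 = f1)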

from R-bloggers http://ift.tt/1TsbeKK
via IFTTT

Monday, April 4, 2016

El Gobierno despidió al diseñador de la nueva Patente Mercosur

El Ministerio de Justicia le encargó la nueva matrícula para autos y motos. Dice que lo echaron por pertenecer a La Cámpora. Además, un homenaje a Malvinas en respuesta a las patentes de Top Gear.

from ARGENTINA AUTOBLOG http://ift.tt/1S3vL2K
via IFTTT

Logistic Regression Tutorial for Machine Learning

Logistic regression is one of the most popular machine learning algorithms for binary classification. This is because it is a simple algorithm that performs very well on a wide range of problems. In this post you are going to discover the logistic regression algorithm for binary classification, step-by-step. After reading this post you will know: […]
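As a hedged, minimal illustration (not from the post), binary logistic regression in R needs nothing beyond base R's glm; here predicting transmission type from the built-in mtcars data:

fit_lr <- glm(am ~ hp + wt, data = mtcars, family = binomial)
summary(fit_lr)$coefficients              # log-odds coefficients and significance
predict(fit_lr, type = "response")[1:5]   # fitted probabilities that am == 1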

The post Logistic Regression Tutorial for Machine Learning appeared first on Machine Learning Mastery.



from Machine Learning Mastery http://ift.tt/1Y8Zyvv
via IFTTT

Saturday, April 2, 2016

Top 10 Essential Books for the Data Enthusiast

A unique top 10 list of book recommendations: for each of 10 categories, this list provides a top paid and a top free book recommendation. If you're interested in books on data, this diverse list of top picks should be right up your alley.

from KDnuggets http://ift.tt/1UHMtf6
via IFTTT