Friday, June 10, 2011

The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models - Tropsha - 2003 - QSAR & Combinatorial Science

The Importance of Being Earnest:

Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models

Alexander Tropsha, Paola Gramatica, Vijay K. Gombar
Article first published online: 16 APR 2003


Abstract

This paper emphasizes the importance of rigorous validation as a crucial, integral component of Quantitative Structure Property Relationship (QSPR) model development. We consider some examples of published QSPR models, which in spite of their high fitted accuracy for the training sets and apparent mechanistic appeal, fail rigorous validation tests, and, thus, may lack practical utility as reliable screening tools. We present a set of simple guidelines for developing validated and predictive QSPR models. To this end, we discuss several validation strategies including (1) randomization of the modelled property, also called Y-scrambling, (2) multiple leave-many-out cross-validations, and (3) external validation using rational division of a dataset into training and test sets. We also highlight the need to establish the domain of model applicability in the chemical space to flag molecules for which predictions may be unreliable, and discuss some algorithms that can be used for this purpose. We advocate the broad use of these guidelines in the development of predictive QSPR models.
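Of the validation strategies listed in the abstract, Y-scrambling is the most mechanical to automate: the modelled property values are randomly permuted and the model is refit; a sound model's true R² should far exceed the R² obtained on scrambled data. A minimal sketch, using a univariate least-squares fit as a stand-in for a full QSPR model (the function names and the choice of 100 permutation trials are assumptions for illustration, not from the paper):

```python
import random

def r_squared(x, y):
    """Coefficient of determination of a univariate least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def y_scrambling(x, y, n_trials=100, seed=42):
    """Refit after randomly permuting the property values y.

    If the real model's R^2 does not clearly exceed the distribution of
    scrambled R^2 values, the original correlation may well be chance.
    """
    rng = random.Random(seed)
    scrambled_r2 = []
    for _ in range(n_trials):
        y_perm = y[:]
        rng.shuffle(y_perm)
        scrambled_r2.append(r_squared(x, y_perm))
    return r_squared(x, y), scrambled_r2
```

In practice the same permute-and-refit loop would wrap whatever regression method the QSPR model actually uses.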


DOI: 10.1002/qsar.200390007
Copyright © 2003 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

Tuesday, March 29, 2011

Best Practices for QSAR Modelling

Interesting papers about the validation and predictivity of QSAR Models:

Validation and Predictivity of QSAR Models - Hugo Kubinyi (lecture slides)

Best Practices for QSAR Model Development, Validation, and Exploitation - Tropsha, A. (2010). Molecular Informatics, 29: 476–488. doi: 10.1002/minf.201000061

Tuesday, August 17, 2010

GA Multiple response models

“An important characteristic of the GA–VSS method is that a single model is not necessarily obtained but the result usually is a population of acceptable models; this characteristic, sometimes considered a disadvantage, provides an opportunity to make an evaluation of the relationships with the response from different points of view. A theoretical disadvantage is that the absolute best model could be not present in the final population. However, after a careful selection of the best models, consensus analysis can be performed contemporarily using the selected models and estimating the response as weighted average of the responses of the single models.”

Molecular Descriptors for Chemoinformatics,
Volumes I & II
Roberto Todeschini
Viviana Consonni
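The consensus step described in the quote can be sketched as a weighted average over the predictions of the selected models. A minimal illustration (the function name and the idea of supplying explicit weights, e.g. derived from each model's validation score, are assumptions; the book does not prescribe this interface):

```python
def consensus_prediction(predictions, weights):
    """Estimate the response of one compound as the weighted average
    of the responses predicted by the individually selected models."""
    if len(predictions) != len(weights):
        raise ValueError("one weight per model prediction is required")
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total
```

With equal weights this reduces to the plain mean of the single-model predictions.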

Wednesday, October 14, 2009

Data pre-processing - Normalization

Dogra, Shaillay K., "Normalization." From QSARWorld--A Strand Life Sciences Web Resource.
http://www.qsarworld.com/qsar-statistics-normalization.php

Normalization

Most of the computed descriptors differ in the scales in which their values lie. One may need to normalize them before proceeding with further statistical analysis. This mostly depends on the subsequent Machine Learning algorithms that one wants to run on the data.

Algorithms like Decision Trees, Regression Forests, Decision Forests, and Naïve Bayes do not require normalized data as input. For Linear Regression, normalization is a recommended step. For Neural Networks and Support Vector Machines, whether used for classification or regression, normalization of the data is required.

In the context of cheminformatics, a standard way to normalize data is mean shifting followed by auto-scaling. This transforms each descriptor column to have a mean of 0 and a standard deviation of 1.

Mean Shifting

As part of normalization, each value of a given descriptor (i.e., every value in a column) is shifted by subtracting the mean value, so that the new mean becomes 0. Applied to every descriptor, this gives all columns the same mean of 0; the mean, as a measure of the central location of the distribution of values, is now identical across descriptors. However, the 'spread', or 'variation', of the data about the mean is still the same as in the original data. This is taken care of next by scaling the values with the standard deviation.

This is best illustrated with an example. Consider the numbers 1, 2, 3, 4, and 5. Their total is 15 and their mean is 3. Adjusting each value by the mean gives the transformed numbers -2, -1, 0, 1, and 2. The new total is 0 and thus the new mean is 0. Note, however, that the standard deviation is still the same as the original (√2, using the population formula). This is taken care of by scaling the values with the standard deviation, in order to make the new standard deviation 1.
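The example above translates directly into code (the helper name `pop_std` is illustrative; it computes the population standard deviation used in the text):

```python
values = [1, 2, 3, 4, 5]
mean = sum(values) / len(values)        # 3.0
shifted = [v - mean for v in values]    # [-2.0, -1.0, 0.0, 1.0, 2.0]

def pop_std(xs):
    """Population standard deviation: sqrt of the mean squared deviation."""
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

# The mean of `shifted` is 0, while the spread is unchanged:
# pop_std(values) and pop_std(shifted) are both sqrt(2).
```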

Autoscaling

For a given set of values, the standard deviation can be made equal to 1 by scaling (dividing) all the values by the original standard deviation. This is a standard step in the normalization of data.

Say the values are 1, 2, 3, 4, and 5, with standard deviation √2. Dividing each value by the standard deviation gives the transformed data 1/√2, 2/√2 (= √2), 3/√2, 4/√2 (= 2√2), and 5/√2. The new standard deviation for this set of values is 1.
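The scaling step on its own, in code (note that scaling without the preceding mean shift leaves the mean nonzero; only the spread is fixed):

```python
import math

values = [1, 2, 3, 4, 5]
s = math.sqrt(2)                    # population standard deviation of these values
scaled = [v / s for v in values]    # 1/sqrt(2), 2/sqrt(2), ..., 5/sqrt(2)

# After scaling, the population standard deviation of `scaled` is 1,
# but its mean is 3/sqrt(2), not 0, since no mean shifting was done here.
```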

(The above principle is demonstrated more generally in algebraic form.)

A value x belonging to a distribution with mean 'x_mean' and standard deviation 's' can be transformed to a standard score, or z-score, in the following manner:

z = (x - x_mean)/s

The mean of standard scores is zero. When values are standardized, the units in which they are expressed are equal to the standard deviation, s. For the standardized scores, the standard deviation becomes 1. (Variance is also 1). The interpretation of the standard-score of a given value is in terms of the number of standard deviations the value is above or below the mean (of the distribution of standardized scores).

So, the standardization of a set of values involves two steps. First, the mean is subtracted from every value, which shifts the central location of the distribution to 0. Then the mean-shifted values are divided by the standard deviation, s, which makes the standard deviation equal to 1.
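Applied to a whole descriptor table, the two steps become column-wise z-scoring. A minimal sketch (the function name is illustrative; it assumes no descriptor column is constant, since a zero standard deviation would cause a division by zero, and real toolkits typically drop or special-case such columns):

```python
def standardize_columns(table):
    """Z-score each descriptor column of a row-major table:
    subtract the column mean, then divide by the column's
    population standard deviation."""
    columns = list(zip(*table))
    z_columns = []
    for col in columns:
        m = sum(col) / len(col)
        s = (sum((x - m) ** 2 for x in col) / len(col)) ** 0.5
        z_columns.append([(x - m) / s for x in col])
    # Transpose back to rows (compounds) x columns (descriptors).
    return [list(row) for row in zip(*z_columns)]
```

Each entry of the result is the number of standard deviations that descriptor value lies above or below its column mean, exactly as described for the z-score above.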