Refereed Publications
Advances in Cheminformatics Methodologies and Infrastructure to Support the Data Mining of Large, Heterogeneous Chemical Datasets
Curr. Comp. Aid. Drug Des., 2009, submitted
In recent years, there has been an explosion in the availability of publicly accessible chemical information, including chemical structures of small molecules, structure-derived properties and associated biological activities in a variety of assays. These data sources present us with a significant opportunity to develop and apply computational tools to extract and understand the underlying structure-activity relationships. Furthermore, by integrating chemical data sources with biological information (protein structure, gene expression and so on), we can attempt to build up a holistic view of the effects of small molecules in biological systems. Equally important is the ability for non-experts to access and utilize state-of-the-art cheminformatics methods and models. In this review we present recent developments in cheminformatics methodologies and infrastructure that provide a robust, distributed approach to mining large and complex chemical datasets. In the area of methodology development, we highlight recent work on characterizing structure-activity landscapes, QSAR model domain applicability and the use of chemical similarity in text mining. In the area of infrastructure, we discuss a distributed web services framework that allows easy deployment and uniform access to computational (statistics, cheminformatics and computational chemistry) methods, data and models. We also discuss the development of PubChem derived databases and highlight techniques that allow us to scale the infrastructure to extremely large compound collections, by use of distributed processing on Grids. Given that the above work is applicable to arbitrary types of cheminformatics problems, we also present some case studies related to virtual screening for anti-malarials and predictions of anti-cancer activity.
Towards Interoperable and Reproducible QSAR Analyses: Exchange of Data Sets
J. Cheminformatics, 2009, submitted
Background: QSAR/QSPR is a widely used method to relate chemical structures and responses based on
experimental observations. In QSAR, chemical structures are expressed as descriptors, which are mathematical
representations like calculated properties or enumerated fragments. Many existing QSAR data sets are based
on a combination of different software tools mixed with in-house developed solutions, with datasets manually
assembled in spreadsheets. Currently there exists no agreed-upon definition of descriptors and no standard for
exchanging data sets in QSAR, which together with numerous different descriptor implementations makes it a
virtually impossible task to reproduce and validate analyses, and significantly hinders collaborations and re-use of
data.
Results: We present a step towards standardizing QSAR analyses by defining interoperable and reproducible
QSAR/QSPR data sets, comprising an open XML format (QSAR-ML) and an open extensible descriptor ontology
(Blue Obelisk Descriptor Ontology). The ontology provides an extensible way of uniquely defining descriptors
for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these
descriptors. Hence, a data set described by QSAR-ML makes its setup completely reproducible. We also provide
an implementation as a set of plugins for Bioclipse that simplifies QSAR data set formation, and allows for
exporting in QSAR-ML as well as traditional CSV formats. The implementation facilitates addition of new
descriptor implementations, from locally installed software and remote Web services; the latter is demonstrated
with REST and XMPP Web services.
Conclusions: Standardized QSAR data sets open up new ways to store, query, and exchange data for subsequent
analyses. QSAR-ML supports completely reproducible dataset formation, solving the problems of defining which
software components were used, their versions, and the case of multiple names for the same descriptor. This
makes it easy to join, extend, and combine data sets and also to work collectively. The presented Bioclipse plugins
equip scientists with intuitive tools that make QSAR-ML widely available for the community.
Improving Usability and Accessibility of Cheminformatics Tools for Chemists Through Cyberinfrastructure and Education
Cheminformatics, 2009, in press
PubChem as a Source of Polypharmacology
J. Chem. Inf. Model., 2009, 49, 2044-2055
[DOI 10.1021/ci9001876 ]
Polypharmacology provides a new way to address the issue of high
attrition rates arising from lack of efficacy and toxicity. However,
the development of polypharmacology is hampered by the incomplete
SAR data and limited resources for validating target
combinations. The PubChem bioassay collection, reporting the activity of
compounds in multiple assays, allows us to study polypharmacological
behavior in the PubChem collection via cross-assay analysis. In this
paper, we developed a network representation of the assay collection
and then applied a bipartite mapping between this network and
various biological networks (i.e., PPI, pathway) as well as
artificial networks (i.e., drug-target network). Mapping to a
drug-target network allows us to prioritize new selective
compounds, while mapping to other biological networks enables us to
observe interesting target pairs and their associated compounds
in the context of biological systems. Our results indicate this
approach could be a useful way to investigate polypharmacology in
the PubChem bioassay collection.
Chemoinformatic Analysis of Drugs, Natural Products, Molecular Libraries Small Molecule Repository and Combinatorial Libraries
J. Chem. Inf. Model., 2009, 49, 1010-1024
[DOI 10.1021/ci800426u ]
A multiple-criteria approach is presented and used to compare four recently developed combinatorial libraries with drugs, the Molecular Libraries Small Molecule Repository (MLSMR), and natural products. The compound databases were assessed in terms of physicochemical properties, scaffolds, and fingerprints. The approach enables the analysis of property space coverage, degree of overlap between collections, scaffold and structural diversity, and overall structural novelty. The degree of overlap between combinatorial libraries and drugs was assessed using the R-NN curve methodology, which measures the density of chemical space around a query molecule embedded in the chemical space of a target collection. The combinatorial libraries studied in this work exhibit scaffolds that were not observed in the drug, MLSMR, and natural products databases. The fingerprint-based comparisons indicate that these combinatorial libraries are structurally different from current drugs. The R-NN curve methodology revealed that a proportion of molecules in the combinatorial libraries is located within the property space of the drugs. However, the R-NN analysis also showed that a significant number of molecules in several combinatorial libraries are located in sparse regions of the drug space.
Navigating Structure Activity Landscapes
Drug Discov. Today, 2009, 14, 698-705
[DOI 10.1016/j.drudis.2009.04.003 ]
The problem of how to systematically explore structure-activity relationships (SARs) is still largely unsolved in medicinal chemistry. Recently, data analysis tools have been introduced to navigate activity landscapes and assess structure-activity relationships on a large scale. Initial investigations reveal a surprising heterogeneity among SARs and shed light on the relationship between 'global' and 'local' SAR features. Moreover, insights are provided into the fundamental issue of why modeling tools work well in some cases, but not in others.
Pharmacophore Representation and Searching
Assessing How Well a Modeling Protocol Captures a Structure-Activity Landscape
J. Chem. Inf. Model., 2008, 48, 1716-1728
[DOI 10.1021/ci8001414 ]
We introduce the notion of structure-activity landscape index (SALI) curves as a way to assess a model and a modeling protocol, applied to structure-activity relationships. We start from our earlier work [J. Chem. Inf. Model., 2008, 48, 646-658], where we show how to study a structure-activity relationship pairwise, based on the notion of "activity cliffs" - pairs of molecules that are structurally similar but have large differences in activity. There, we also introduced the SALI parameter, which allows one to identify cliffs easily, and which allows one to represent a structure-activity relationship as a graph. This graph orders every pair of molecules by their activity. Here, we introduce the new idea of a SALI curve, which tallies how many of these orderings a model is able to predict. Empirically, testing these SALI curves against a variety of models, ranging over two-dimensional quantitative structure-activity relationship (2D-QSAR), three-dimensional quantitative structure-activity relationship (3D-QSAR), and structure-based design models, the utility of a model seems to correspond to characteristics of these curves. In particular, the integral of these curves, denoted as SCI and being a number ranging from -1.0 to 1.0, approaches a value of 1.0 for two literature models, which are both known to be prospectively useful.
The Structure-Activity Landscape Index: Identifying and Quantifying Activity-Cliffs
J. Chem. Inf. Model., 2008, 48, 646-658
[DOI 10.1021/ci7004093 ]
A new method for analyzing a structure-activity relationship is proposed. By use of a simple quantitative index, one can readily identify "structure-activity cliffs": pairs of molecules which are most similar but have the largest change in activity. We show how this provides a graphical representation of the entire SAR, in a way that allows the salient features of the SAR to be quickly grasped. In addition, the approach allows us to view the SARs in a data set at different levels of detail. The method is tested on two data sets that highlight its ability to easily extract SAR information. Finally, we demonstrate that this method is robust using a variety of computational control experiments and discuss possible applications of this technique to QSAR model evaluation.
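The pairwise index described above can be illustrated with a minimal sketch. The abstract does not state the formula; the code assumes the commonly cited form SALI(i, j) = |A_i - A_j| / (1 - sim(i, j)), with Tanimoto similarity on fingerprint bit sets. The function names, the `eps` guard against identical fingerprints, and the toy data are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def sali_pairs(fingerprints, activities, eps=1e-6):
    """Return (i, j, SALI) for every pair of molecules, largest SALI first.

    Assumed definition: SALI(i, j) = |A_i - A_j| / (1 - sim(i, j)).
    eps (an assumption of this sketch) avoids division by zero when two
    fingerprints are identical.
    """
    scored = []
    for i, j in combinations(range(len(activities)), 2):
        sim = tanimoto(fingerprints[i], fingerprints[j])
        sali = abs(activities[i] - activities[j]) / max(1.0 - sim, eps)
        scored.append((i, j, sali))
    return sorted(scored, key=lambda t: t[2], reverse=True)


# Toy data: molecules 0 and 1 are structurally similar but differ in activity.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
acts = [5.0, 8.0, 5.5]
pairs = sali_pairs(fps, acts)
```

Pairs at the top of the sorted list are candidate activity cliffs; thresholding the SALI values at different cutoffs gives the coarser or finer views of the SAR that the abstract mentions.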
A Flexible Web Service Infrastructure for the Development and Deployment of Predictive Models
J. Chem. Inf. Model., 2008, 48, 456-464
[DOI 10.1021/ci700188u ]
The development of predictive statistical models is a common task in
the field of drug design. The process of developing such models
involves two main steps: building the model and then deploying the
model. Traditionally such models have been deployed using web page
interfaces. This approach restricts the user to using the specified
web page, and using the model in other ways can be cumbersome. In
this paper we present a flexible and generalizable approach to the
deployment of predictive models, based on a web service
infrastructure using R. The infrastructure described allows one to
access the functionality of these models using a variety of approaches
ranging from web pages to workflow tools. We highlight the
advantages of this infrastructure by developing and subsequently
deploying random forest models for two datasets.
On the Interpretation and Interpretability of QSAR Models
J. Comp. Aid. Molec. Des., 2008, 22, 857-871
[DOI 10.1007/s10822-008-9240-5 ]
The goal of a quantitative structure-activity relationship (QSAR) model is to encode the relationship between molecular structure and biological activity or physical property. Based on this encoding, such models can be used for predictive purposes. Assuming the use of relevant and meaningful descriptors, and a statistically significant model, extraction of the encoded structure-activity relationships (SARs) can provide insight into what makes a molecule active or inactive. Such analyses by QSAR models are useful in a number of scenarios, such as suggesting structural modifications to enhance activity, explaining outliers, and exploratory analysis of novel SARs. In this paper we discuss the need for interpretation and give an overview of the factors that affect the interpretability of QSAR models. We then describe interpretation protocols for different types of models, highlighting the different types of interpretations, ranging from very broad, global trends to very specific, case-by-case descriptions of the SAR, using examples from the training set. Finally, we discuss a number of case studies where workers have provided some form of interpretation of a QSAR model.
Utilizing High Throughput Screening Data for Predictive Toxicology Models: Protocols and Application to MLSCN Assays
J. Comp. Aid. Molec. Des., 2008, 22, 367-384
[DOI 10.1007/s10822-008-9192-9 ]
Computational toxicology is emerging as an encouraging alternative to experimental testing. The Molecular Libraries Screening Center Network (MLSCN) as part of the NIH Molecular Libraries Roadmap has recently started generating large and diverse screening datasets, which are publicly available in PubChem. In this report, we investigate various aspects of developing computational models to predict cell toxicity based on cell proliferation screening data generated in the MLSCN. By capturing feature-based information in those datasets, such predictive models would be useful in evaluating cell-based screening results in general (for example from reporter assays) and could be used as an aid to identify and eliminate potentially undesired compounds. Specifically we present the results of random forest ensemble models developed using different cell proliferation datasets and highlight protocols to take into account their extremely imbalanced nature. Depending on the nature of the datasets and the descriptors employed we were able to achieve percentage correct classification rates between 70% and 85% on the prediction set, though the accuracy rate dropped significantly when the models were applied to in vivo data. In this context we also compare the MLSCN cell proliferation results with animal acute toxicity data to investigate to what extent animal toxicity can be correlated and potentially predicted by proliferation results. Finally, we present a visualization technique that allows one to compare a new dataset to the training set of the models to decide whether the new dataset may be reliably predicted.
Userscripts for the Life Sciences
BMC Bioinformatics, 2007, 8, 487
[DOI 10.1186/1471-2105-8-487 ]
The web has seen an explosion of chemistry
and biology related resources in the last 15 years: thousands of
scientific journals, databases, wikis, blogs and resources are available with
a wide variety of types of information. There is a huge need to aggregate
and organise this information. However, the sheer number of resources makes it
unrealistic to link them all in a centralised manner. Instead,
search engines to find information in those resources flourish, and formal
languages like Resource Description Framework and Web Ontology Language
are increasingly used to allow linking of resources.
A recent development is the use of userscripts to change the appearance of
web pages, by on-the-fly modification of the web content. This opens
possibilities to aggregate information and computational results from
different web resources into the web page of one of those resources.
Chemical Data Mining of the NCI Human Tumor Cell Line Database
J. Chem. Inf. Model., 2007, 47, 2063-2076
[DOI 10.1021/ci700141x ]
The NCI Developmental Therapeutics Program Human Tumor cell line data set is a publicly available database that contains cellular assay screening data for over 40 000 compounds tested in 60 human tumor cell lines. The database also contains microarray assay gene expression data for the cell lines, and so it provides an excellent information resource particularly for testing data mining methods that bridge chemical, biological, and genomic information. In this paper we describe a formal knowledge discovery approach to characterizing and data mining this set and report the results of some of our initial experiments in mining the set from a chemoinformatics perspective.
Counting Clusters Using R-NN Curves
J. Chem. Inf. Model., 2007, 47, 1308-1318
[DOI 10.1021/ci600541f ]
Clustering is a common task in the field of cheminformatics. A key parameter that needs to be set for non-hierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study we describe an approach to selecting k, a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713-1722), which uses a nearest neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the dataset which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k as well as with similar values to check that the correct number of clusters was obtained. In addition we compared the predicted value to the number indicated by the average silhouette width as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical datasets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.
A Web Service Infrastructure for Chemoinformatics
J. Chem. Inf. Model., 2007, 47, 1303-1307
[DOI 10.1021/ci6004349 ]
The vast increase of pertinent information available to drug discovery scientists means that there is strong demand for tools and techniques for organizing and intelligently mining this information for manageable human consumption. At Indiana University, we have developed an infrastructure of chemoinformatics web services that simplify the access to this information and the computational techniques that can be applied to it. In this paper, we describe this infrastructure, give some examples of its use, and then discuss our plans to use it as a platform for chemoinformatics application development in the future.
Ensemble Feature Selection: Consistent Descriptor Subsets for Multiple QSAR Models
J. Chem. Inf. Model., 2007, 47, 989-997
[DOI 10.1021/ci600563w ]
Selecting a small subset of descriptors from a large pool to build a predictive QSAR model is an important step in the QSAR modeling process. In general subset selection is very hard to solve, even approximately, with guaranteed performance bounds. Traditional approaches employ deterministic or stochastic methods to obtain a descriptor subset that leads to an optimal model of a single type (such as linear regression or a neural network). With the development of ensemble modeling approaches, multiple models of differing types are individually developed resulting in different descriptor subsets for each model type. However it is advantageous, from the point of view of developing interpretable QSAR models, to have a single set of descriptors that can be used for different model types. In this paper, we describe an approach to the selection of a single, optimal, subset of descriptors for multiple model types. We apply this approach to three datasets, covering both regression and classification, and show that the constraint of forcing different model types to use the same set of descriptors does not lead to a significant loss in predictive ability for the individual models considered. In addition, interpretations of the individual models developed using this approach indicate that they encode similar structure-activity trends.
Chemical Informatics Functionality in R
J. Stat. Soft., 2007, 18,
The flexibility and scope of the R programming environment has made it a popular choice for statistical modeling and scientific prototyping in a number of fields. In the field of chemistry, R provides several tools for a variety of problems related to statistical modeling of chemical information. However, one aspect common to these tools is that they do not have direct access to the information that is available from chemical structures, such as contained in molecular descriptors.
We describe the rcdk package, which provides the R user with access to the CDK, a Java framework for cheminformatics. As a result, it is possible to read in a variety of molecular formats, calculate molecular descriptors and evaluate fingerprints. In addition, we describe the rpubchem package, which allows access to the data in PubChem, a public repository of molecular structures and associated assay data for approximately 8 million compounds. Currently the package allows access to structural information as well as some simple molecular properties from PubChem. In addition the package allows access to bio-assay data from the PubChem FTP servers.
Local Lazy Regression: Making Use of the Neighborhood to Improve QSAR Predictions.
J. Chem. Inf. Model., 2006, 46, 1836-1847
[DOI 10.1021/ci060064e ]
Traditional quantitative structure-activity relationship (QSAR) models aim to capture global structure-activity trends present in a data set. In many situations, there may be groups of molecules which exhibit a specific set of features which relate to their activity or inactivity. Such a group of features can be said to represent a local structure-activity relationship. Traditional QSAR models may not recognize such local relationships. In this work, we investigate the use of local lazy regression (LLR), which obtains a prediction for a query molecule using its local neighborhood, rather than considering the whole data set. This modeling approach is especially useful for very large data sets because no a priori model need be built. We applied the technique to three biological data sets. In the first case, the root-mean-square error (RMSE) for an external prediction set was 0.94 log units versus 0.92 log units for the global model. However, LLR was able to characterize a specific group of anomalous molecules with much better accuracy (0.64 log units versus 0.70 log units for the global model). For the second data set, the LLR technique resulted in a decrease in RMSE from 0.36 log units to 0.31 log units for the external prediction set. In the third case, we obtained an RMSE of 2.01 log units versus 2.16 log units for the global model. In all cases, LLR led to a few observations being poorly predicted compared to the global model. We present an analysis of why this was observed and possible improvements to the local regression approach.
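The idea of predicting from a query's local neighborhood can be sketched as follows. This is a minimal illustration, not the published protocol: the neighbor count `k`, the Euclidean distance, the small `ridge` term (added so the local fit stays solvable when neighbors are collinear) and all names are assumptions of this sketch.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def llr_predict(X, y, query, k=5, ridge=1e-8):
    """Fit a linear model on the k nearest neighbors of `query` and predict.

    No global model is built; the fit is done lazily at query time.
    """
    nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], query))[:k]
    rows = [[1.0] + list(X[i]) for i in nearest]  # leading 1.0 = intercept
    p = len(rows[0])
    # Normal equations (A^T A + ridge * I) w = A^T y on the neighborhood only.
    ata = [[sum(r[a] * r[b] for r in rows) + (ridge if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    aty = [sum(r[a] * y[i] for r, i in zip(rows, nearest)) for a in range(p)]
    w = solve(ata, aty)
    return w[0] + sum(wi * qi for wi, qi in zip(w[1:], query))


# Toy data with two local trends that a single global line would blur.
X = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
y = [0.0, 2.0, 4.0, 6.0, 50.0, 49.0, 48.0, 47.0]
pred = llr_predict(X, y, [1.5], k=3)
```

In the toy data the three nearest neighbors of the query all follow the local trend y = 2x, so the local fit recovers that trend where a single global line would not.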
R-NN Curves: An Intuitive Approach to Outlier Detection Using a Distance Based Method
J. Chem. Inf. Model., 2006, 46, 1713-1722
[DOI 10.1021/ci060013h ]
Libraries of chemical structures are used in a variety of cheminformatics tasks such as virtual screening and QSAR modeling and are generally characterized using molecular descriptors. When working with libraries it is useful to understand the distribution of compounds in the space defined by a set of descriptors. We present a simple approach to the analysis of the spatial distribution of the compounds in a library in general and outlier detection in particular based on counts of neighbors within a series of increasing radii. The resultant curves, termed R-NN curves, appear to follow a logistic model for any given descriptor space, which we justify theoretically for the 2D case. The method can be applied to data sets of arbitrary dimensions. The R-NN curves provide a visual method to easily detect compounds lying in a sparse region of a given descriptor space. We also present a method to numerically characterize the R-NN curves thus allowing identification of outliers in a single plot.
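The neighbor-counting construction described above can be sketched in a few lines. The choice of radii as 1% to 100% of the maximum pairwise distance, and the `half_radius_step` summary used to flag points in sparse regions, are assumptions made for this illustration; the paper's own numerical characterization of the curves is not given in the abstract and may differ.

```python
import math


def rnn_curves(points, n_radii=100):
    """One R-NN curve per point.

    Each curve gives, for a series of increasing radii, the percentage of
    the other points that lie within that radius of the point.
    """
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    dmax = max(max(row) for row in d)
    radii = [dmax * (s + 1) / n_radii for s in range(n_radii)]
    curves = []
    for i in range(n):
        curves.append([
            100.0 * sum(1 for j in range(n) if j != i and d[i][j] <= r) / (n - 1)
            for r in radii
        ])
    return curves


def half_radius_step(curve):
    """First radius step at which at least half of the other points are
    neighbors (a hypothetical one-number summary of a curve)."""
    for step, value in enumerate(curve, start=1):
        if value >= 50.0:
            return step
    return len(curve)


# Four clustered points and one isolated point in a 2D descriptor space.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
curves = rnn_curves(pts)
steps = [half_radius_step(c) for c in curves]
```

A point in a dense region has a curve that rises early, while a point in a sparse region stays near zero until the radius is large, so the isolated point here gets the largest step value.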
The Blue Obelisk--Interoperability in Chemical Informatics.
J. Chem. Inf. Model., 2006, 46, 991-998
[DOI 10.1021/ci050400b ]
The Blue Obelisk Movement (http://www.blueobelisk.org/) is the name used by a diverse Internet group promoting reusable chemistry via open source software development, consistent and complementary chemoinformatics research, open data, and open standards. We outline recent examples of cooperation in the Blue Obelisk group: a shared dictionary of chemoinformatics algorithms and implementations drawing from our various software projects; a shared repository of chemoinformatics data including elemental properties, atomic radii, isotopes, atom typing rules, and so forth; and Web services for the platform-independent use of chemoinformatics programs.
Scalable Partitioning and Exploration of Chemical Spaces using Geometric Hashing
J. Chem. Inf. Model., 2006, 46, 321-333
[DOI 10.1021/ci050403o ]
Virtual screening (VS) has become a preferred tool to augment high-throughput screening and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4337 to 249,071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. We also apply our framework to detect outlying (diverse) compounds in a given chemical space; this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.
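As an illustration of how hashing can bring nearby descriptor vectors into the same bucket, here is a minimal sketch. The abstract does not specify the hash family or parameters, so the code assumes the standard random-projection scheme h(v) = floor((a . v + b) / w) with Gaussian a; the class name `LSHIndex`, the table counts, and the bucket width are hypothetical choices.

```python
import math
import random
from collections import defaultdict


class LSHIndex:
    """Approximate nearest-neighbor index built from several hash tables."""

    def __init__(self, dim, n_tables=8, n_hashes=4, width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.points = []
        self.tables = []
        for _ in range(n_tables):
            # Each table concatenates n_hashes random projections into one key.
            funcs = [([rng.gauss(0.0, 1.0) for _ in range(dim)],
                      rng.uniform(0.0, width)) for _ in range(n_hashes)]
            self.tables.append((funcs, defaultdict(list)))

    def _key(self, funcs, v):
        return tuple(
            math.floor((sum(a * x for a, x in zip(avec, v)) + b) / self.width)
            for avec, b in funcs
        )

    def add(self, v):
        """Store a descriptor vector and return its index."""
        idx = len(self.points)
        self.points.append(v)
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, v)].append(idx)
        return idx

    def query(self, v, k=1):
        """Collect candidates sharing a bucket with v in any table, then rank
        only those candidates by exact Euclidean distance."""
        candidates = set()
        for funcs, buckets in self.tables:
            candidates.update(buckets.get(self._key(funcs, v), ()))
        return sorted(candidates, key=lambda i: math.dist(self.points[i], v))[:k]


# Toy usage: index 200 random 3D descriptor vectors and query one of them.
rng = random.Random(1)
index = LSHIndex(dim=3)
data = [[rng.uniform(0.0, 50.0) for _ in range(3)] for _ in range(200)]
for vec in data:
    index.add(vec)
hit = index.query(data[17], k=1)
```

Only the candidates that collide with the query are compared exactly, which is where the speed-up over a full scan comes from. A query that collides with nothing returns an empty list, which could also serve as a rough signal that the point lies in a sparse region.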
Generating, Using and Visualizing Molecular Information in R
Validation of the CDK Surface Area Routine
CDK News, 2006, 3, 5-9
Recent Developments of the Chemistry Development Kit (CDK) - An Open-Source Java Library for Chemo- and Bioinformatics
Curr. Pharm. Des., 2006, 12, 2110-2120
[DOI 10.2174/138161206777585274 ]
The Chemistry Development Kit (CDK) provides methods for common tasks in molecular informatics, including 2D and 3D rendering of chemical structures, I/O routines, SMILES parsing and generation, ring searches, isomorphism checking, structure diagram generation, etc. Implemented in Java, it is used both for server-side computational services, possibly equipped with a web interface, as well as for applications and client-side applets. This article introduces the CDK's new QSAR capabilities and the recently introduced interface to statistical software.
Interpreting Computational Neural Network QSAR Models: A Detailed Interpretation of the Weights and Biases
J. Chem. Inf. Model., 2005, 45, 1109-1121
[DOI 10.1021/ci050110v ]
Interpreting Computational Neural Network QSAR Models: A Measure of Descriptor Importance
J. Chem. Inf. Model., 2005, 45, 800-806
[DOI 10.1021/ci050022a ]
We present a method to measure the relative importance of the descriptors present in a QSAR model developed with a computational neural network (CNN). The approach is based on a sensitivity analysis of the descriptors. We tested the method on three published data sets for which linear and CNN models were previously built. The original work reported interpretations for the linear models, and we compare the results of the new method to the importance of descriptors in the linear models as described by a PLS technique. The results indicate that the proposed method is able to rank descriptors such that important descriptors in the CNN model correspond to the important descriptors in the linear model.
Determining the Validity of a QSAR Model--A Classification Approach
J. Chem. Inf. Model., 2005, 45, 65-73
[DOI 10.1021/ci0497511 ]
The determination of the validity of a QSAR model when applied to new compounds is an important concern in the field of QSAR and QSPR modeling. Various scoring techniques can be applied to specific types of models. We present a technique with which we can state whether a new compound will be well predicted by a previously built QSAR model. In this study we focus on linear regression models only, though the technique is general and could also be applied to other types of quantitative models. Our technique is based on a classification method that divides regression residuals from a previously generated model into a good class and bad class and then builds a classifier based on this division. The trained classifier is then used to determine the class of the residual for a new compound. We investigated the performance of a variety of classifiers, both linear and nonlinear. The technique was tested on two data sets from the literature and a hand built data set. The data sets selected covered both physical and biological properties and also presented the methodology with quantitative regression models of varying quality. The results indicate that this technique can determine whether a new compound will be well or poorly predicted with weighted success rates ranging from 73% to 94% for the best classifier.
Using R to Provide Statistical Functionality for QSAR Modeling in CDK
Development of Linear, Ensemble, and Nonlinear Models for the Prediction and Interpretation of the Biological Activity of a Set of PDGFR Inhibitors.
J. Chem. Inf. Comput. Sci., 2004, 44, 2179-2189
[DOI 10.1021/ci049849f ]
A QSAR modeling study has been done with a set of 79 piperazinylquinazoline analogues which exhibit PDGFR inhibition. Linear regression and nonlinear computational neural network models were developed. The regression model was developed with a focus on interpretative ability using a PLS technique. However, it also exhibits a good predictive ability after outlier removal. The nonlinear CNN model had superior predictive ability compared to the linear model with a training set error of 0.22 log(IC50) units (R² = 0.93) and a prediction set error of 0.32 log(IC50) units (R² = 0.61). A random forest model was also developed to provide an alternate measure of descriptor importance. This approach ranks descriptors, and its results confirm the importance of specific descriptors as characterized by the PLS technique. In addition the neural network model contains the two most important descriptors indicated by the random forest model.
The Development of QSAR Models To Predict and Interpret the Biological Activity of Artemisinin Analogues
J. Chem. Inf. Comput. Sci., 2004, 44, 1440-1449
[DOI 10.1021/ci0499469 ]
This work presents the development of Quantitative Structure-Activity Relationship (QSAR) models to predict the biological activity of 179 artemisinin analogues. The structures of the molecules are represented by chemical descriptors that encode topological, geometric, and electronic structure features. Both linear (multiple linear regression) and nonlinear (computational neural network) models are developed to link the structures to their reported biological activity. The best linear model was subjected to a PLS analysis to provide model interpretability. While the best linear model does not perform as well as the nonlinear model in terms of predictive ability, the application of PLS analysis allows for a sound physical interpretation of the structure-activity trend captured by the model. On the other hand, the best nonlinear model is superior in terms of pure predictive ability, having a training error of 0.47 log RA units (R² = 0.96) and a prediction error of 0.76 log RA units (R² = 0.88).
Generation of QSAR Sets with a Self-Organizing Map.
J. Mol. Graph. Model., 2004, 23, 1-14
[DOI 10.1016/j.jmgm.2004.03.003 ]
A Kohonen self-organizing map (SOM) is used to classify a data set consisting of dihydrofolate reductase inhibitors with the help of an external set of Dragon descriptors. The resultant classification is used to generate training, cross-validation (CV) and prediction sets for QSAR modeling using the ADAPT methodology. The results are compared to those of QSAR models generated using sets created by activity binning and a sphere exclusion method. The results indicate that the SOM is able to generate QSAR sets that are representative of the composition of the overall data set in terms of similarity. The resulting QSAR models are half the size of those published and have comparable RMS errors. Furthermore, the RMS errors of the QSAR sets are consistent, indicating good predictive capabilities as well as generalizability.