Visualizing Scaffold Trends

Recently Barbara Zdrazil and I published an article that explored the idea of tracking the attention being paid to a scaffold in the medicinal chemistry literature (as represented by ChEMBL). The gist of the idea is that scaffolds that are more frequently enumerated or tested in more assays (or even published in increasingly high IF journals) are receiving more attention than ones that are less frequently enumerated and so on. By fitting robust regression models to per-year scaffold-aggregated properties we identified significant vs non-significant trends.
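To make the idea of a per-year trend concrete, here is a minimal sketch of a robust slope estimate over yearly scaffold counts. It uses the Theil-Sen estimator (the median of all pairwise slopes) as a stand-in, which is not necessarily the robust regression method used in the article, and it does not test for significance. The function name and the counts are made up for illustration.

```python
from itertools import combinations
from statistics import median


def theil_sen(years, values):
    """Return (slope, intercept) of the Theil-Sen line through (year, value) points.

    The slope is the median of all pairwise slopes, so a single
    outlying year has little influence on the estimated trend.
    """
    pts = list(zip(years, values))
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(pts, 2)
        if x2 != x1
    ]
    if not slopes:
        raise ValueError("need at least two distinct years")
    slope = median(slopes)
    # Intercept chosen as the median residual offset.
    intercept = median(y - slope * x for x, y in pts)
    return slope, intercept


# Hypothetical per-year compound counts for one scaffold (made-up numbers).
years = [2005, 2006, 2007, 2008, 2009, 2010]
counts = [3, 5, 7, 40, 11, 13]  # 2008 is a deliberate outlier

slope, intercept = theil_sen(years, counts)
print(f"estimated trend: {slope:+.1f} compounds per year")
```

Despite the spike in 2008, the estimated slope stays at the underlying rate of two additional compounds per year, which is the kind of behaviour one wants when yearly counts are noisy.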

The idea originated from a blog post (archived version) by Jonathan Baell, where he traced the publication history of the bis-chalcone scaffold, starting from Stoll et al, Biochemistry, 2001 and ending up at Anchoori et al, Cancer Cell, 2013. His point was that a PAINS-containing scaffold (and thus of possibly dubious biological activity) received increasing attention, resulting in a (relatively) high profile journal publication. This led to the question of whether we could systematically capture such attention trends for other scaffolds, and thus to this paper.

While the article presents a comprehensive analysis, it is limited to a fixed set of scaffolds (defined using the Bemis-Murcko scheme) and a few properties we selected because we thought they would be proxies of attention. What if we could consider any scaffold, and visualize the time evolution of an arbitrary scaffold-aggregated property? This would be something like Google Trends – except that instead of text search terms, you’d be comparing scaffolds.

So I put together the Scaffold Trend Explorer, which allows you to view trends for a number of properties, for arbitrary substructures. Searching for very frequent substructures would likely leave the tool unresponsive, so I disallow queries such as benzene and straight chain alkanes with < 8 carbons. I’ve provided a number of properties ranging from the count of enumerated compounds to drug-likeness. You can draw a structure or provide the SMILES directly. In addition there is a set of bookmarks for well known scaffolds (taken from Welsch et al, 2010). You can compare multiple (up to 9) scaffolds at a time, and compute moving window average curves, which hide the year to year variation.
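For readers unfamiliar with moving window averages, the following is a minimal sketch of the smoothing step. It assumes a trailing window (each point is the mean of the current year and the preceding ones); whether the tool uses a trailing or a centered window is not stated here, and the function name and data are hypothetical.

```python
def moving_average(values, window=3):
    """Trailing moving average over a list of per-year values.

    The first `window - 1` entries are None because a full
    window is not yet available for those years.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out


# Hypothetical per-year counts for a scaffold (made-up numbers).
yearly_counts = [1, 2, 3, 4, 5]
smoothed = moving_average(yearly_counts, window=3)
print(smoothed)
```

A wider window gives a smoother curve at the cost of losing more years at the start of the series.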

This tool should let users play around with the idea of scaffold trends. Currently, it’s a very simple visualization tool – you can download the per-year data, but that’s it. Unlike the paper, I don’t fit regression lines, though I hope to implement this in the future. There are a number of enhancements planned, including access to the underlying publications for a scaffold in a given year, simple analytics (such as differential analysis) on trends and so on. A natural next step is to go beyond the medchem literature and consider patents as well (say, via SureChEMBL). And of course, feature requests are also welcome.

Byproducts of Byproducts & Biomedical Data

Recently I came across a fantastic article that explored how far ahead Google Maps is compared to Apple Maps, focusing in particular on Areas of Interest (AOI), and how this is achieved with Google’s competencies in massive data and massive computation, resulting in a moat. The conclusion is that

Google has gathered so much data, in so many areas, that it’s now crunching it together and creating features that Apple can’t make—surrounding Google Maps with a moat of time

But the key point that caught my eye was the idea that Google Maps’ sophistication is a byproduct of byproducts. As pointed out, AOIs are a byproduct of buildings (a byproduct of satellite imagery) and places (a byproduct of Street View), and thus AOIs are byproducts of byproducts.

This observation led me to think about how it could apply in a biomedical setting. In other words, given disparate biomedical data types, what new data types can be generated from them, and what further data types could then be derived from those? (“Data type” may not be the right term to use here; “entity” may be a more suitable one.)

One interpretation of this idea is integrative resources, where disparate (but related) data types are connected to each other in a single store, allowing one to (hopefully) make non-obvious links between entities that explain a higher level system or phenomenon. Recent examples include Pharos and MARRVEL. However, these don’t really fit the concept of byproducts of byproducts, as neither of these resources actually generates new data from pre-existing data, at least by itself.

So are there better examples? One that comes to mind is the protein folding problem. While one could fold proteins de novo, it’s a little easier if constraints are provided. Thus we have constraints derived from NMR and from amino acid (AA) co-evolution. As a result we can view predicted protein structures as a byproduct of NMR constraints (a byproduct of structure determination) and a byproduct of AA co-evolution data (a byproduct of gene sequencing). An example of this is Tang et al, 2015.

Another one that comes to mind is inferred gene (or signalling, metabolic etc) networks, which go from, say, gene expression data to a network of genes. But going by the Google Maps analogy above, the gene network is the first level byproduct. One could imagine a computation that processes a set of (inferred) gene networks to generate higher level structures (say, spatial localization or differentiation). But this is a bit fuzzier than the protein structure problem.

Of course, this starts to break down when we take into account errors in the individual components. Thus sequencing errors can introduce errors in the co-evolution data, which can get carried over into the protein structure. This isn’t inevitable – but it does require validation and possibly curation. And in many cases, large, correlated datasets can allow one to account for errors (or work around them).

This is mainly speculation on my part, but it seems interesting to try and think of how one can combine disparate data types to generate new ones, and repeat this process to come up with something new that was not available (or not obvious) from the initial data types.

Waterfall Plots for Dose Response Curves

Waterfall plots are a common visualization method to view multiple spectra and have some similarities with joy plots. In the high throughput screening world, people have plotted multiple dose response curves, offset on the z-axis, to produce something that looks like a waterfall. An example is Figure 1 in Inglese et al, PNAS, 2006, 103(31). In my opinion, such visualizations are not much more than eye candy and not particularly informative, though it helps if the curves to be displayed are picked carefully so that they can be differentiated in the plot. However, people seem to like them and I’ve been asked to generate them based on dose response fit parameters.

Here’s an implementation using rgl, which results in an interactive waterfall plot. An example of the output is shown below.

A waterfall plot for active (red) and inconclusive (green) dose response curves
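To illustrate how curves can be generated from fit parameters and laid out for such a plot, here is a minimal Python sketch. It is not the rgl implementation referred to above. It assumes each fit is described by a four-parameter Hill equation (log AC50, Hill slope, zero and infinite activity), and the parameter names, function names and values are all hypothetical. It only computes the 3D coordinates, which could then be passed to any 3D plotting library.

```python
def hill(log_conc, log_ac50, hill_slope, s0, s_inf):
    """Four-parameter Hill equation evaluated at a log10 concentration."""
    return s0 + (s_inf - s0) / (1.0 + 10 ** ((log_ac50 - log_conc) * hill_slope))


def waterfall_coords(fits, log_concs):
    """Build one polyline per curve as a list of (x, y, z) tuples.

    x is the log concentration, y is the predicted response, and
    z is the curve index, which offsets each curve from the previous
    one to give the waterfall effect.
    """
    lines = []
    for idx, fit in enumerate(fits):
        lines.append([(lc, hill(lc, **fit), float(idx)) for lc in log_concs])
    return lines


# Hypothetical fit parameters for two inhibitor curves (made-up values).
fits = [
    {"log_ac50": -6.0, "hill_slope": 1.0, "s0": 0.0, "s_inf": -100.0},
    {"log_ac50": -5.0, "hill_slope": 1.5, "s0": 0.0, "s_inf": -80.0},
]
# Log10 molar concentrations from -9 to -4 in half-log steps.
log_concs = [-9 + 0.5 * i for i in range(11)]

lines = waterfall_coords(fits, log_concs)
print(len(lines), "curves with", len(lines[0]), "points each")
```

Sorting the fits (for example by AC50 or curve class) before assigning the offsets is what makes the individual curves distinguishable in the final plot.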

Who is Eligible?

Applicant(s), age 35 or younger, who have demonstrated excellence in their chemical information related research and who are developing careers that have the potential to have a positive impact on the utility of chemical information relevant to chemical structures, reactions and compounds, are invited to submit applications. While the primary focus of the Grant Program is the career development of young researchers, additional bursaries may be made available at the discretion of the Trust. All requests must follow the application procedures noted below and will be weighed against the same criteria.

Which Activities are Eligible?

Grants may be awarded to acquire the experience and education necessary to support research activities; e.g. for travel to collaborate with research groups, to attend a conference relevant to one’s area of research (including the presentation of an already-accepted research paper), to gain access to special computational facilities, or to acquire unique research techniques in support of one’s research. Grants will not be given for activities completed prior to the grant award date.

Application Requirements

Applications must include the following documentation:

1. A letter that details the work upon which the Grant application is to be evaluated as well as details on research recently completed by the applicant;
2. The amount of Grant funds being requested and the details regarding the purpose for which the Grant will be used (e.g. cost of equipment, travel expenses if the request is for financial support of meeting attendance, etc.). The relevance of the above-stated purpose to the Trust’s objectives and the clarity of this statement are essential in the evaluation of the application;
3. A brief biographical sketch, including a statement of academic qualifications and a recent photograph;
4. Two reference letters in support of the application.

Additional materials may be supplied at the discretion of the applicant only if relevant to the application and if such materials provide information not already included in items 1-4.

A copy of the completed application document must be supplied for distribution to the Grants Committee and can be submitted via regular mail or e-mail to the Committee Chair (see contact information below).