What is the intention of CancerResource?
CancerResource is a comprehensive knowledgebase for cancer-related drug-target relationships and for supporting information and experimental data. Drug-target relationships are determined by manually curated text mining of publicly available literature. Several resources that provide similar data with a slightly different background and intention are also mined, (a) for comparison with the CancerResource text mining and (b) for integration into this knowledgebase. Thus, CancerResource reflects the current knowledge on this subject in an integrative way.
To strengthen the literature mining, whose result is a compilation of direct knowledge of drug-target relationships, interactions that are known from the PDB are added to that part.
To improve the functionality of CancerResource as an oncologic web exploration tool, several kinds of experimental data are added. Indirect drug interactions on target genes that are known from whole-genome explorations by microarray technology or on cancer cell lines, characterized as cellular fingerprints, can be immediately compared with the text-mining results for direct interactions.
To this end, CancerResource provides a set of exploration tools and access points to the data. It provides information for lead compound design, on drug action on genes, for detecting novel target genes, and for estimating or predicting the activities of compounds on cancer cell lines. A key feature is the detection and visualization of the targeting of multiple genes by a particular drug as well as the targeting of a single gene by multiple drugs.
Taken together, this oncologic web exploration tool allows researchers to initiate workflows in several directions. In short, this site is designed for biochemists, pharmacists, and biomedical scientists who intend to develop novel drugs or who want to get a quick overview of the field of drug-target relationships.
What kind of information can I find in CancerResource?
First of all, CancerResource comprises a large number of drug-target interactions and serves as a discovery tool for such information, which can be found in the central part of every detail result page, for a drug page as well as for a target gene page.
When a drug is queried, all of its targets are displayed; when a potential target is queried, all drugs targeting it are displayed.
For a group of drugs or a group of targets, a matrix of drugs and targets is displayed, showing in a single picture both the multiple targets of a single drug and the targeting of a single gene by multiple drugs.
CancerResource is also a comprehensive database for information about cancer-related targets in conjunction with experimental data (such as gene expression, drug influence, drug influence on gene expression, i.e. differential gene expression, and pathway affiliation).
Cellular fingerprints are the computational representation of the activity of a drug on a defined set of cancer cell lines (integration of DTP / NCI data). With this tool it is possible to elucidate relationships between the structure and the activity of a compound.
Gene expression data of 1,821 cancer cell lines, obtained from NCI60, CCLE and CoSMIC, can be explored (by KEGG pathway affiliation or user-defined expression data). Here, the user can detect genes that are significantly differentially expressed in individual cancer cell lines.
Mutation data of 2,037 cancer cell lines, obtained from NCI60, CCLE and CoSMIC, are available. A total of 872,658 mutations in 19,834 genes is stored in CancerResource.
How to use CancerResource?
The illustrated use case provides a starting point for using CancerResource by uploading either external mutation data or mRNA expression values:
- Upload external mRNA expression data on the Cell line/Expression page using the "Query database with your own data" box.
- You can choose between Affymetrix probe names and HGNC gene symbols, each paired with normalized, log2-transformed expression values and separated by tabs.
- Please be aware that a calculation can take up to 10 minutes, depending on the number of genes that have to be analyzed.
- The most similar cancer cell lines are computed, either by Pearson correlation distance or by fold changes.
- In this example, the REH cell line is identified as the most similar cancer cell line by Pearson correlation distance.
- Clicking on the REH cell line link displays the most effective drugs for the REH cancer cell line.
- With regard to IC50, Patupilone is determined to be the most effective drug on the REH cancer cell line.
- Clicking on the Patupilone link directs the user to Patupilone's detail page, which displays information about Patupilone as well as its cellular fingerprint.
- The mutation profile of REH shows the 10 most affected cancer-relevant genes according to PolyPhen prediction.
- The heatmap gives an overview of the similarity of the REH cancer cell line to other cell lines of the same consortium, which can be selected at the top of the page.
- To compare the expression levels of the most affected cancer-relevant genes, copy their gene symbols into the "Query database by expression data" box on the Cell line/Expression page.
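The upload and similarity steps above can be sketched in a few lines of Python. This is only an illustration, not the code CancerResource runs: the function names, the cell line names LINE_A and LINE_B and all expression values are hypothetical, and the distance is assumed to be 1 - r, with r the Pearson correlation computed over the genes shared by both profiles.

```python
from math import sqrt


def parse_expression(text):
    """Parse tab-separated 'identifier<TAB>log2 value' lines into a dict."""
    values = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        identifier, value = line.split("\t")
        values[identifier.strip()] = float(value)
    return values


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y (assumed definition)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


def most_similar_cell_line(query, cell_lines):
    """Pick the cell line with the smallest distance over the shared genes."""
    def distance(name):
        profile = cell_lines[name]
        genes = sorted(set(query) & set(profile))
        return pearson_distance([query[g] for g in genes],
                                [profile[g] for g in genes])
    return min(cell_lines, key=distance)


# Hypothetical values, for illustration only.
query = parse_expression("TP53\t7.5\nEGFR\t9.0\nMYC\t11.0\n")
cell_lines = {
    "LINE_A": {"TP53": 7.0, "EGFR": 9.5, "MYC": 10.5},
    "LINE_B": {"TP53": 11.0, "EGFR": 9.0, "MYC": 7.0},
}
print(most_similar_cell_line(query, cell_lines))
```

With these made-up numbers the query correlates positively with LINE_A and negatively with LINE_B, so LINE_A is reported as the most similar cell line.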
How are compound-target gene interactions for CancerResource retrieved?
Compound-target relationships were automatically detected by our own text mining of the literature, run over 19 million PubMed abstracts using our vocabularies for drugs and targets.
The drug vocabulary was generated from compounds that have a cancer-related ATC classification in SuperDrug or whose name or synonyms appear in the NCI compound set. The cancer relationship of a gene was determined from annotations in cancer-related KEGG pathways and the Gene Ontology (GO). Abstracts, titles and MeSH terms were converted into a text index using the LingPipe (http://alias-i.com/lingpipe/index.html) and Lucene software packages. Both vocabularies were searched against each indexed abstract, and the result was scored by our own rule-based validation algorithm. This automatic procedure and a subsequent ranking yielded about 8,000 publications; a manual revision of these hits resulted in about 900 highly significant publications on direct interactions.
To complement our own literature mining, cancer-related drug-target interactions were collected from several established data sources such as ChEMBL.
How is a drug (or compound, generally) defined?
A drug is defined as a cancer-target-related compound, a chemical with known cancer-relevant activity, or a compound that has been shown to influence cancer development. Drugs in this database were collected from several data sources:
- Drugs (and compounds) identified by our own text mining of cancer-related literature in PubMed: abstracts were computationally processed to find experimentally verified drugs (see above).
- Drugs associated with target genes and a cancer disease annotation from CTD, PharmGKB, and TTD.
- Drugs associated with target genes without a particular disease annotation from DrugBank, ChEMBL and PDB, to complement the set and enable a re-positioning query.
- Compounds that have an influence on cell lines, from DTP / NCI.
Additionally, general resources for compounds are mined to provide backbone information on drugs as described.
How is a target defined?
Targets in CancerResource are genes or proteins that are involved in the appearance and development of cancer.
In CancerResource, cancer-associated targets are selected from different sources:
- Text mining: thousands of PubMed abstracts were processed computationally to find experimentally verified drugs, targets and drug-target relations. The abstracts were filtered by cancer-relevant terms (e.g. 'antineoplastic'). All results were manually curated.
- Target genes or proteins associated with drugs from ChEMBL, CTD, PharmGKB, TTD, DrugBank and PDB.
Additionally, general resources are mined to provide backbone information on genes or proteins as described.
How is KEGG used in CancerResource?
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of database resources for linking genomes to life and the environment. KEGG PATHWAY provides a collection of manually drawn pathway maps that visualize molecular interaction and reaction networks.
In CancerResource, KEGG maps for more than 50 cancer-relevant (signalling) pathways are used to picture the role of targets and of the drugs acting on them. Targets with annotated drug-target interactions are highlighted in yellow, and information about the drugs acting on them is shown on click.
Expression data are inserted into pathway maps as colored icon borders if the user has performed a corresponding search beforehand and requested the link to the pathway map. Colored maps are retrieved via a web service.
What is an Over-Representation Analysis (ORA)?
The over-representation analysis (ORA) is a statistical estimate of how strongly a given KEGG pathway is affected by a (drug) treatment.
The number of differentially expressed genes of one expression direction in the pathway (both directions are evaluated separately) is compared to all differentially expressed genes of that direction.
Additionally, the number of all genes in the pathway and the total number of genes enter the calculation.
The ORA uses the hypergeometric distribution: for each data point i outside of the event case, the hypergeometric function HF is applied. The sum of all HF(i) gives the p-value; a p-value lower than 0.05 is a good indication of a significant influence (of a drug) on the pathway considered.
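As a rough illustration of such a test, the following Python sketch computes a hypergeometric p-value from the four counts named above. It assumes that the sum runs over the observed count and all more extreme counts (the upper tail), which is the usual form of an over-representation test; the function and parameter names are hypothetical and the exact summation used by CancerResource may differ.

```python
from math import comb


def ora_p_value(total_genes, pathway_genes, de_genes, de_in_pathway):
    """Upper-tail hypergeometric p-value, P(X >= de_in_pathway).

    total_genes    -- all genes considered
    pathway_genes  -- genes annotated to the pathway
    de_genes       -- differentially expressed genes of one direction
    de_in_pathway  -- differentially expressed genes that lie in the pathway
    """
    upper = min(pathway_genes, de_genes)
    favourable = sum(
        comb(pathway_genes, i) * comb(total_genes - pathway_genes, de_genes - i)
        for i in range(de_in_pathway, upper + 1)
    )
    return favourable / comb(total_genes, de_genes)


# Hypothetical counts: 10 genes in total, 4 in the pathway,
# 3 differentially expressed, 2 of those inside the pathway.
print(ora_p_value(10, 4, 3, 2))
```

For these toy counts the p-value is 40/120, i.e. about 0.33, so the pathway would not be called significantly affected at the 0.05 level.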
What is a Cellular Fingerprint?
A cellular fingerprint represents the growth rates of 2,037 human cancer cell lines in response to treatment with a particular compound. A boolean array is generated in the following way:
1. A bit comparison is only possible for a pre-defined vector of cell lines
2. GI-50 values of the 2,037 given cell lines were normalized using the z-score normalization, z = ( x - μ ) / σ, where
- x = the 2,037 values for one compound
- μ = mean of x
- σ = standard deviation of x
3. Each single, normalized GI-50 value is transformed to a 42 bit vector
4. In consequence, one cellular fingerprint has a length of 2520 bits
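The normalization in step 2 can be sketched as follows. This is a minimal example with hypothetical GI-50 values; it assumes the population standard deviation, since the text does not say whether the population or the sample form is used. The transformation of each z-score into a 42 bit vector is not shown because its encoding is not described here.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mu) / sigma for every value of one compound."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mu) / sigma for x in values]


# Hypothetical GI-50 values of one compound on three cell lines.
print(z_scores([1.0, 2.0, 3.0]))
```

After normalization the values of one compound have mean 0 and standard deviation 1, which makes the profiles of different compounds comparable.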
What is a Tanimoto Coefficient?
A simple count of shared features (common fragment substructures) can be a measure of
chemical distance when used in some similarity coefficient. Dictionaries of predefined structural
fragments, such as MDL Information Systems MACCS keys, are used to identify features
contained in a molecule. The structural fragments or features that are present in the given
molecule are turned ON (set as 1) and the ones that are absent are kept OFF (set as 0). Thus, for
each molecule one ends up having a string containing 1s and 0s (bit string).
Once the molecules have been represented by such bit-strings the Tanimoto Coefficient can be used
as a measure to assess similarity.
Let's say we are comparing two molecules A and B. If NA is the
number of features (ON bits) in A, NB is the number of features (ON bits) in B, and NAB is the
number of features (ON bits) common to both A and B, then the Tanimoto Coefficient is:
T = NAB / (NA + NB - NAB)
NAB = number of "1" bits that occur in both row A and in row B
NA = number of "1" bits in row A
NB = number of "1" bits in row B
Row A contains the fingerprint of molecule A, and row B that of molecule B.
Two structures with a Tanimoto Coefficient greater than or equal to 0.85 (corresponding to a similarity of 85%) are considered similar enough to transfer the biological activities of one molecule to the other and thus to predict toxicities, pathways the
molecule might participate in, and potential binding partners.
Note that OFF bits do not determine the similarity. In other words, if some molecular features
are absent in both molecules then that is not taken as an indication of similarity between the two.
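A short Python sketch of the Tanimoto Coefficient for two bit strings follows. The fingerprints are made-up five-bit examples rather than real MACCS keys, and returning 0.0 for two all-zero fingerprints is a convention chosen here, not something stated in the text.

```python
def tanimoto(row_a, row_b):
    """Tanimoto Coefficient NAB / (NA + NB - NAB) for two equal-length bit lists."""
    if len(row_a) != len(row_b):
        raise ValueError("fingerprints must have the same length")
    n_a = sum(row_a)                                  # ON bits in A
    n_b = sum(row_b)                                  # ON bits in B
    n_ab = sum(a & b for a, b in zip(row_a, row_b))   # ON bits in both
    union = n_a + n_b - n_ab
    return n_ab / union if union else 0.0  # no ON bits at all: treated as 0.0


# Hypothetical fingerprints: NA = 3, NB = 3, NAB = 2.
print(tanimoto([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))
```

Here the coefficient is 2 / (3 + 3 - 2) = 0.5, well below the 0.85 threshold. The third bit is OFF in both molecules and does not contribute to the result.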
What is a mean graph?
A mean graph is the graphical representation of the vector of the cancer cell lines (vertically) and their normalized GI-50 values (horizontally; see also cellular fingerprint).
The individual cell lines are listed in the middle of the plot, and the GI-50 value for each cell line is shown as a green bar representing its difference from the mean value.
The mean value is given as the Z-score of the GI50 values in units of the negative decadic logarithm (CellMiner), as the Z-score of the GI50 values in units of the decadic logarithm (CCLE), or as the Z-score of the IC50 values in units of the decadic logarithm.
What are IC-50 and GI-50 values?
The half maximal inhibitory concentration (IC-50 value) is a measure of the effectiveness of a compound in inhibiting a biological or biochemical function.
If a significant effect is measurable, the compound in question is a drug candidate.
This quantitative measure indicates how much of a particular drug or other substance (inhibitor) is needed to inhibit a given biological process by half.
In other words, it is the half maximal (50%) Inhibitory Concentration (IC) of a substance (50% IC, or IC-50).
It is commonly used as a measure of antagonist drug potency in pharmacological research.
The NCI renamed the IC-50 value, the concentration that causes 50% growth inhibition, to GI-50 to emphasize the correction for the cell count at time zero. Thus, the GI-50 value is the concentration of test drug where 100 x (T - T0)/(C - T0) = 50. The optical density of the test well after a 48h period of exposure to test drug is T, the optical density at time zero is T0, and the control optical density is C. The "50" is called the GI50PRCNT, a T/C-like parameter that can have values from +100 to -100. The GI-50 measures the growth inhibitory power of the test agent. The TGI is the concentration of test drug where 100 x (T - T0)/(C - T0) = 0. Thus, the TGI signifies a cytostatic effect.
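The growth formula can be checked with a few lines of Python. The optical densities below are hypothetical and the function name is chosen for illustration; only the expression 100 x (T - T0)/(C - T0) itself is taken from the text.

```python
def percent_growth(t, t0, c):
    """Return 100 x (T - T0) / (C - T0) for optical densities T, T0 and C."""
    if c == t0:
        raise ValueError("control shows no growth; the value is undefined")
    return 100.0 * (t - t0) / (c - t0)


# Hypothetical optical densities.
print(percent_growth(t=1.0, t0=0.5, c=1.5))  # 50.0, the GI-50 condition
print(percent_growth(t=0.5, t0=0.5, c=1.5))  # 0.0, the TGI condition
```

The GI-50 is then the drug concentration at which this value equals 50, and the TGI the concentration at which it equals 0, i.e. the test well has not grown beyond its density at time zero.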
What is a “similarity search”?
During an activity similarity search, the fingerprint of a search compound is compared with the fingerprints of all other compounds in the database in order to find molecules with a similar reaction pattern on cell lines.
For this, a set of compounds must be pre-selected.
This can be done by their chemical structure, for example by a structure similarity search ("similarity search").
As the result of the complete query, the structure similarity vector (all structures found that correspond to a given structure) can be compared with the activity profile vector, i.e., a vector of cellular fingerprints for the compounds found.
A structure similarity search is performed by calculating the Tanimoto Coefficient.
Which Expression Data are available in CancerResource?
NCI-60, CCLE and CoSMIC cell line Expression Data
Such data represent the expression of a single gene in a single cell line compared to all cell lines in the case of CellMiner, or to all cell lines of the same tissue type in the case of CCLE and CoSMIC. The expression profile for other tissue types can also be displayed.
This is not differential gene expression, only the relative abundance compared to other cell lines.
User-defined Expression Data
A user can import their own experimental (microarray) data, a single chip, to compare it against the NCI-60, CCLE and CoSMIC cell lines.
Differential expression of genes is calculated online.
Those results can be projected onto KEGG pathways.
For the latter case, an Over-Representation Analysis is calculated to determine significance.