P Multidimensional data modelling

Advanced theoretical concepts
Institute of Bioinformatics, University of Muenster, Muenster, Germany
AIMS
The objective of PROVABES is to go beyond existing conventional approaches in strengthening the impact of biomarkers as identified by participants of this consortium. A large amount of gene expression data relevant to the scope of this network has already been generated. A database platform will be established by integrating data from previous joint research projects (ENCCA, EuroBoNet, EURO-EWING, TranSaRNet, ASSET). Integrating these data sets into the current research efforts, both to perform combined analyses or re-analyses and to use them as a reference, is essential.
However, data comparison is often not as easy and straightforward as expected; the established quality checks and normalisation procedures will indicate whether a comparison is possible. As the focus of this research proposal is directed towards biomarker validation, the expected heterogeneity of the data is certainly a challenge. This underlines the importance of establishing coordinated experimental designs according to standardised objectives and SOPs. In high-dimensional data collections, the number of system states erodes meaningful conclusions if data from different experimental sources are averaged.
WORK PLAN
The biostatistical part is complemented by a bioinformatics part focussing on exploratory data analysis strategies (top down) and data integration (bottom up). This part of the platform is based on a large portfolio of methods, ranging from sequence analysis, de novo motif search algorithms, comparative approaches, biomathematical methods in clinical research, classification algorithms and expression analysis up to de novo reconstruction of cellular dependency networks.
Specifically, we support the important aspect of defining the characteristics of the biological network environment around the selected set of biomarkers. This knowledge essentially adds stability to diagnostic interpretations and discussions by adding consistency to complex clinical observations and complex signal patterns of mid-sized biomarker panels.
In WP1.2, the bioinformatics team will focus on functional annotation of the genomic fragments that display copy number alteration (CNA). We will create a database to store and annotate CNAs detected in ES patients. The data generated by this project will be complemented by a literature search. The front end of the database will provide a graphical user interface that enables registered users to comment on annotated features. The system will be implemented using PostgreSQL, an open-source object-relational database system, and a series of in-house scripts to accommodate the project-generated data. At the analytical level, we will especially focus on genes known to be involved in ES. If any of these are discovered in CNA regions, we will investigate how this affects the cognate network(s).
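The check for known ES genes inside CNA regions can be illustrated with a minimal interval-overlap sketch. This is only an illustration under stated assumptions, not the project's implementation: the `Region` type, the function names and the 1-based inclusive coordinate convention are all hypothetical, and in the planned system such a lookup would more likely be a query against the PostgreSQL database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A genomic interval; coordinates are assumed 1-based and inclusive."""
    chrom: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions lie on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def genes_in_cnas(cnas: dict, genes: dict) -> dict:
    """Map each CNA identifier to the sorted names of the genes it overlaps.

    cnas:  hypothetical mapping of CNA id -> Region
    genes: hypothetical mapping of gene name -> Region
    CNAs that overlap no gene of interest are left out of the result.
    """
    hits = {}
    for cna_id, cna in cnas.items():
        found = sorted(name for name, gene in genes.items() if overlaps(cna, gene))
        if found:
            hits[cna_id] = found
    return hits
```

Called with a dictionary of detected CNAs and a dictionary of genes of interest, `genes_in_cnas` returns only those CNAs that contain or touch at least one such gene, which is the subset that would then be followed up at the network level.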
In the course of WP1.2, besides a multivariate statistics approach (together with the biostatistics group), data-based probabilistic methods can be applied: specifically, resampling methods following ideas in to differentiate signals and separate them from noise. A prerequisite is to set up specific algorithms to unify and integrate the diverse data sources in a standardised way. Key elements are the weighting and normalisation procedures needed to achieve the proposed data integration. These steps determine the outcome of every procedure.
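As an illustration of how a resampling method can separate signal from noise, the following sketch implements a simple two-group permutation test on the difference of means. It is a generic textbook example, not the specific method referred to above; the function name, the default number of permutations and the fixed seed are assumptions made for the illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Group labels are shuffled repeatedly; the p-value is the fraction of
    shuffles whose absolute mean difference is at least as large as the
    observed one (with the usual +1 correction so it is never exactly zero).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)
```

A small p-value indicates that the observed group difference is rarely reproduced when the labels carry no information, so it is unlikely to be noise. Because the test makes no distributional assumption, it applies equally to normalised expression values and to copy number estimates.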
In WP2, gene and miRNA expression analysis will be supported by established differential evaluation pipelines (R, Bioconductor). The network-defining approach starts with a differential miRNA analysis, followed by miRNA target prediction based on TargetScan and mRNA data. Additional procedures then form miRNA-target networks. The functional analysis is based on gene-phenotype approaches and the identification of discriminative sub-networks and enriched pathways. Further refinement to predictive miRNA sub-networks can be achieved by utilising protein interaction data.
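The step from differential miRNA results and predicted targets to a miRNA-target network can be sketched as follows. This is a simplified illustration, not the pipeline itself: the function name, the log fold change threshold and the rule of keeping only oppositely regulated miRNA-mRNA pairs are assumptions, and the `predicted_targets` dictionary merely stands in for the output of a target prediction resource such as TargetScan.

```python
def build_mirna_target_network(mirna_lfc, mrna_lfc, predicted_targets, min_abs_lfc=1.0):
    """Return a sorted list of (miRNA, gene) edges.

    mirna_lfc:         hypothetical mapping miRNA -> log fold change
    mrna_lfc:          hypothetical mapping gene  -> log fold change
    predicted_targets: hypothetical mapping miRNA -> iterable of predicted target genes
    An edge is kept only if both partners change by at least min_abs_lfc and
    in opposite directions (assumed filter: miRNA up and target down, or vice versa).
    """
    edges = []
    for mirna, lfc_mirna in mirna_lfc.items():
        if abs(lfc_mirna) < min_abs_lfc:
            continue  # miRNA is not differentially expressed
        for gene in predicted_targets.get(mirna, ()):
            lfc_gene = mrna_lfc.get(gene)
            if lfc_gene is None or abs(lfc_gene) < min_abs_lfc:
                continue  # target not measured or not differentially expressed
            if lfc_mirna * lfc_gene < 0:
                edges.append((mirna, gene))
    return sorted(edges)
```

The resulting edge list can be passed to any graph library for sub-network and pathway analysis; the filter on opposite regulation is one common choice and could be replaced by a correlation-based criterion across samples.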
WP3 will generate TMA and protein expression data. Here, we support data analysis by deciphering correlation-based dependency structures between the (expressed) biomarkers themselves and supporting factors. The first step of the procedure is similar to procedures like that of Langenfeld et al. [2008], but it then builds on a data-driven combinatorial algorithm [Korsching et al. 2005, 2013] applying ideas similar to Patil et al. [2011]. The procedure is suited to panels of up to 18 factors. More factors might be included by combining several factors into one according to certain threshold rules. The approach exhaustively analyses the factor panel to find the optimal dependency structure supported by the data. The qualitative result can be compared directly with observations described in the literature.
EXPLOITATION OF THE RESULTS
This project acts as a consolidation and cross-linking platform. It will therefore provide substantial backbone support for data validation and for building a network context around highlighted molecular markers. The focus on dependency analysis will further support proposals to include new or exclude existing molecular markers, as well as to define marker panels with a certain specificity.
The applied methodology as a whole is not specific to the sarcomas on which the PROVABES project focuses, but the experience gained will shape the evolution of improved analysis concepts and algorithms.