### abstract ###
Accurate modelling of biological systems requires a deeper and more complete knowledge about the molecular components and their functional associations than we currently have.
Traditionally, new knowledge on protein associations generated by experiments has played a central role in systems modelling, in contrast to generally less trusted bio-computational predictions.
However, we will not achieve realistic modelling of complex molecular systems if the current experimental designs lead to biased screenings of real protein networks and leave large, functionally important areas poorly characterised.
To assess the likelihood of this, we have built comprehensive network models of the yeast and human proteomes by using a meta-statistical integration of diverse computationally predicted protein association datasets.
We have compared these predicted networks against combined experimental datasets from seven biological resources at different levels of statistical significance.
These eukaryotic predicted networks resemble all the topological and noise features of the experimentally inferred networks in both species, and we also show that this observation is not due to random behaviour.
In addition, the topology of the predicted networks contains information on true protein associations, beyond the constitutive first order binary predictions.
We also observe that most of the reliable predicted protein associations are experimentally uncharacterised in our models, constituting the hidden or dark matter of networks by analogy to astronomical systems.
Some of this dark matter shows enrichment of particular functions and contains key functional elements of protein networks, such as hubs associated with important functional areas like the regulation of Ras protein signal transduction in human cells.
Thus, characterising this large and functionally important dark matter, elusive to established experimental designs, may be crucial for modelling biological systems.
In any case, these predictions provide a valuable guide to these experimentally elusive regions.
### introduction ###
Many features of biological systems cannot be inferred from a simple sum of their components but rather emerge as network properties CITATION.
Organisms comprise systems of highly integrated networks or accelerating networks CITATION in which all components are integrated and coordinated in time and space.
Given such complexity, the gaps in our current knowledge prevent us from modelling complete living organisms CITATION, CITATION.
Therefore, the development of bio-computational approaches for identifying new protein functions and protein-protein functional associations can play an important role in systems biology CITATION .
The scarce knowledge of biological systems is further compounded by experimental error.
It is common for different high-throughput experimental approaches, applied to the same biological system, to yield different outcomes, resulting in protein networks with different topological and biological properties CITATION.
However, errors are not restricted to high-throughput analysis.
For example, it has been demonstrated that high-throughput yeast two-hybrid interactions for human proteins are more precise than literature-curated interactions supported by a single publication CITATION .
There has been a great deal of work analysing biological networks across different species, giving insights into how networks evolve.
However, many of these publications have yielded disparate and sometimes contradictory conclusions.
Observations of poor overlap in protein networks across species CITATION and divergence amongst organisms CITATION suggest fast evolution.
Significant variation in subunit compositions of the functional modules has also been observed in protein networks across species CITATION.
However, in contrast to these observations, recent work using combined protein-protein interaction data suggests high conservation of the protein networks between yeast and human CITATION.
This approach, based on data combination, stresses the importance of integrating different data sources to reduce the bias associated with errors in functional prediction and to increase the coverage in network modelling, as has been demonstrated in numerous studies CITATION, CITATION.
Increasing the accuracy of networks by integrating different protein interaction data relies on the intuitive principle that combining multiple independent sources of evidence gives greater confidence than a single source.
For any genome-wide computational analysis, we expect the prediction errors to be randomly distributed amongst a large sample of true negative interactions.
Hence, it is unlikely that two independent prediction methods will both identify the same false positive data in large interactomes like yeast or human.
In general, we expect the precision to increase proportionally to the number of independent approaches supporting the same evidence.
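The independence argument above can be made concrete with a rough calculation. The sketch below assumes, purely for illustration, a proteome of 6,000 proteins and two methods whose false positives are scattered uniformly and independently over all candidate pairs; the figures are round numbers that are not taken from this study, and the function name is hypothetical.

```python
def expected_shared_false_positives(n_proteins, fp_a, fp_b):
    """Expected number of false positives reported by both methods.

    Assumes each method's false positives fall uniformly at random, and
    independently of the other method, over all unordered protein pairs.
    """
    n_pairs = n_proteins * (n_proteins - 1) // 2  # all unordered pairs
    return fp_a * fp_b / n_pairs


# Illustrative round numbers only (not taken from the study).
shared = expected_shared_false_positives(n_proteins=6000, fp_a=10_000, fp_b=10_000)
print(f"expected shared false positives: {shared:.1f}")
```

Under these assumptions only a handful of the 10,000 errors made by either method would be expected to recur in the other, which is why agreement between independent predictors is informative.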
From the available list of well-known integration methods specifically designed to integrate diverse protein-protein interaction -PPI- datasets, we chose the Fisher method CITATION in order to have a predictor that is independent from the experimental data used to validate it.
The Fisher integration method is not a trained or supervised method, unlike, for example, Naive Bayes or SVM methods.
The Fisher method presumes a Gaussian random distribution of the prediction datasets' scores as a null hypothesis and the Fisher integrated score calculation is based on Information Theory statistics CITATION, CITATION.
Therefore, the Fisher integration score is completely independent of the experimental datasets used in this study to validate and compare the predictions.
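As a minimal sketch of the kind of calculation involved, the code below converts hypothetical z-scores into one-sided p-values under a standard normal null and combines them with Fisher's method, X = -2 * sum(ln p_i), which follows a chi-square distribution with 2k degrees of freedom when the k p-values are independent. How the individual prediction scores are actually standardised in this work is not specified in this section, so the z-score step, the example values and all function names are assumptions.

```python
import math


def normal_upper_tail(z):
    """One-sided p-value for a z-score under a standard normal null."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def fisher_combine(p_values):
    """Combine independent p-values (each in (0, 1]) with Fisher's method.

    X = -2 * sum(ln p_i) is chi-square distributed with 2k degrees of
    freedom under the null. For even degrees of freedom the survival
    function has a closed form, so no external library is needed:
        P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Hypothetical z-scores for one protein pair from four prediction methods.
z_scores = [2.0, 1.5, 2.5, 1.0]
p_values = [normal_upper_tail(z) for z in z_scores]
print(fisher_combine(p_values))
```

No experimentally labelled interactions enter the calculation at any point, which is the property that makes an untrained integration of this kind independent of the validation data.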
In this work, we significantly increase the prediction power of binary protein functional associations in yeast and human proteomes by integrating different individual prediction methods using the Fisher integration method.
Three different untrained methods are implemented: GECO; hiPPI; and CODA, run with two protein domain classifications, CATH CITATION and PFAM CITATION.
The four different prediction datasets obtained by these methods were integrated using simple integration and Fisher's method as examples of untrained methods.
Similarly, ab initio prediction datasets from STRING CITATION were also integrated using Fisher integration and compared against the integrated prediction datasets from our methods.
Results from the Fisher integration of our prediction datasets were benchmarked and compared against the individual prediction methods and the results from the integrated STRING methods.
In all cases we demonstrate increased performance for the integrated approach with the Fisher integration of GECO, hiPPI, CODAcath and CODApfam datasets yielding the best results.
Protein pairs identified by significant Fisher integration p-values were used to build a protein network model for yeast and human proteomes referred to as the Predictogram (PG).
Additionally, all the protein-protein associations from several major biological databases, including Reactome CITATION, KEGG CITATION, GO CITATION, FunCat CITATION, IntAct CITATION, MINT CITATION and HPRD CITATION, were retrieved and combined into a network referred to as a Knowledgegram (KG).
As implemented in other pioneering studies CITATION, we built predicted and experimental models for further comparison.
Different network topology parameters were calculated and compared between the KG and PG models for two test species, Homo sapiens and Saccharomyces cerevisiae.
We observe how the networks change as the cut-off on the confidence score of the predictions is varied.
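To illustrate what varying the cut-off means in practice, the sketch below thresholds a toy list of scored protein pairs at several p-value cut-offs and reports a few basic topology summaries. The specific topology parameters used in this work are not listed in this section, so the choice of node count, edge count, mean degree and maximum degree is an assumption, as are the protein labels, the scores and the function name.

```python
from collections import defaultdict


def network_summary(scored_pairs, cutoff):
    """Summarise the network formed by pairs with p-value <= cutoff."""
    degree = defaultdict(int)
    edges = set()
    for a, b, p in scored_pairs:
        if p > cutoff or a == b:
            continue  # drop weak predictions and self-pairs
        edge = tuple(sorted((a, b)))  # undirected edge
        if edge not in edges:
            edges.add(edge)
            degree[a] += 1
            degree[b] += 1
    n_nodes = len(degree)
    n_edges = len(edges)
    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "mean_degree": 2 * n_edges / n_nodes if n_nodes else 0.0,
        "max_degree": max(degree.values(), default=0),
    }


# Toy predictions: (protein, protein, integrated p-value); all hypothetical.
pairs = [
    ("A", "B", 1e-6),
    ("A", "C", 1e-4),
    ("B", "C", 1e-3),
    ("C", "D", 0.02),
    ("D", "E", 0.04),
]
for cutoff in (1e-5, 1e-3, 0.05):
    print(cutoff, network_summary(pairs, cutoff))
```

Relaxing the cut-off admits more nodes and edges, and the degree of the best-connected node can change with it, so topology comparisons are only meaningful at matched confidence levels.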
Results of this PG and KG network comparison demonstrate that PG networks resemble KG networks in many of the major topological features and model a substantial fraction of real protein network associations, as previously observed in some bacterial predicted networks CITATION, CITATION .
There have been frequent observations of low overlaps between different experimental high-throughput approaches CITATION.
Our comparison of the PG and KG models also shows that the intersection between the two models is small and that the majority of predictions in the PG are novel.
However, the overlap between PG and KG is significantly higher than expected by chance in both species, supporting a correspondence between the PG and KG screenings of PPI space.
This PG and KG data overlap is significantly larger in yeast than in human, pointing to a better functional characterization of the yeast PPI network and the presence of larger dark areas in the human PPI network still hidden from current experimental knowledge.
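The statistical test behind the overlap comparison is not described in this section. One standard way to ask whether two sets of protein pairs share more members than chance would allow is the upper tail of the hypergeometric distribution, sketched below with hypothetical counts and a hypothetical function name.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def overlap_p_value(n_total, n_pg, n_kg, n_shared):
    """P(overlap >= n_shared) when n_pg pairs are drawn at random from
    n_total candidate pairs, of which n_kg are in the experimental set."""
    lo = max(n_shared, 0, n_pg + n_kg - n_total)  # smallest feasible overlap
    hi = min(n_pg, n_kg)                          # largest feasible overlap
    if lo > hi:
        return 0.0
    log_denom = log_comb(n_total, n_pg)
    log_terms = [
        log_comb(n_kg, k) + log_comb(n_total - n_kg, n_pg - k) - log_denom
        for k in range(lo, hi + 1)
    ]
    # Sum in log space to limit underflow for large networks.
    m = max(log_terms)
    return min(1.0, math.exp(m) * sum(math.exp(t - m) for t in log_terms))


# Hypothetical counts: 1,000 candidate pairs, 50 predicted, 100 known,
# 15 shared (the chance expectation would be 50 * 100 / 1000 = 5).
print(overlap_p_value(n_total=1000, n_pg=50, n_kg=100, n_shared=15))
```

A small overlap in absolute terms can still be highly significant, because the space of candidate pairs is vastly larger than either network. For genome-scale inputs the tail probability may underflow to 0.0 in floating point, in which case the log-scale terms are the more useful quantity to report.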
We suggest that this novel prediction set may provide a valuable estimate of the relative differences in dark matter of uncharacterised protein-protein associations between the two species, and we show that this dark matter contains key elements, such as hubs, with important functional roles in the cell.
By analogy CITATION, dark matter in protein network models refers to predicted protein-protein associations whose existence has not yet been experimentally verified.
In this study, we suggest that dark matter involves functional associations that are difficult to characterise with current experimental assays, making any network modelling of organisms highly incomplete and therefore inaccurate.
The results are divided into four main sections in which the predicted and experimental PPI models of human and yeast are compared.
The first section analyses the performance of the single and integrated methods predicting the protein associations and determines the correlation between the prediction scores and the degree of accuracy and noise in the predictions.
The second section compares the topological network features of the predicted and experimental PPI models at equivalent levels of accuracy and noise.
The third section searches for functional differences between the predicted and experimental models, looking for specific functional areas which appear to be illuminated by the prediction methods but elusive to the experimental approaches.
The fourth and final section explores whether the predicted PPI network graphs contain additional context-based information on protein associations beyond the sets of predicted protein pairs used to build the networks.
