Regression of a quantitative trait (e.g. gene expression, or a quantitative phenotype such as blood pressure or survival) on a large number of predictors (e.g. SNPs or CNVs) is pervasive in integrative genomics. As the number of predictors increases and largely overtakes the number of collected samples, regression of the trait on all predictors simultaneously becomes ill-defined, and finding well-supported models that link a subset of predictors to the trait becomes increasingly challenging. Stepwise procedures are unstable and unwieldy in the face of the huge space of typically multi-collinear predictors.
In the framework of ultra-high-dimensional data sets, we are investigating different modelling strategies for variable selection. Using an evolutionary Monte Carlo strategy, we have developed a prototype algorithm that currently allows efficient searches over a space of up to 10,000 predictors.
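As a rough illustration of the flavour of such a search, and not of the group's implementation, the sketch below runs a small population of Metropolis chains over binary inclusion vectors at different temperatures, scoring models by BIC and occasionally exchanging states between adjacent chains; a full evolutionary Monte Carlo algorithm would also use crossover moves that recombine the inclusion vectors of two chains. All data, settings and function names here are simulated and illustrative.

    # Minimal sketch of a tempered stochastic search over variable-inclusion vectors.
    # Simulated data and illustrative settings only.
    import numpy as np

    rng = np.random.default_rng(0)

    n, p = 100, 1000                       # samples, predictors (p >> n)
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:3] = 1.0                         # three true effects, for illustration
    y = X @ beta + rng.standard_normal(n)

    def fitness(gamma):
        """Negative BIC of the least-squares fit on the selected predictors."""
        k = gamma.sum()
        if k == 0 or k >= n:
            return -np.inf
        Xs = X[:, gamma]
        resid = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
        return -(n * np.log(resid @ resid / n) + k * np.log(n))

    temps = [1.0, 2.0, 4.0, 8.0]                      # temperature ladder
    pop = [rng.random(p) < 0.005 for _ in temps]      # one chain per temperature

    for sweep in range(200):
        for i, t in enumerate(temps):
            # Mutation: flip one inclusion indicator, Metropolis-accept at temperature t.
            prop = pop[i].copy()
            j = rng.integers(p)
            prop[j] = ~prop[j]
            if np.log(rng.random()) < (fitness(prop) - fitness(pop[i])) / t:
                pop[i] = prop
        # Exchange move: swap the states of two adjacent-temperature chains.
        i = rng.integers(len(temps) - 1)
        delta = (fitness(pop[i]) - fitness(pop[i + 1])) * (1 / temps[i + 1] - 1 / temps[i])
        if np.log(rng.random()) < delta:
            pop[i], pop[i + 1] = pop[i + 1], pop[i]

    print("selected predictors:", np.flatnonzero(pop[0]))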
The fully probabilistic framework of Bayesian variable selection models for binary classification is an alternative to resampling for estimating the uncertainty in the selection of genes. MCMC methods can be used for model space exploration, but the large scale of genomic applications makes standard algorithms either computationally very expensive or prone to poor mixing of the Markov chain. We developed a novel MCMC algorithm that uses the dependence structure between gene variables to find blocks of genes that are proposed for joint updates, thus drastically improving mixing while keeping the computational burden acceptable. This block MCMC algorithm can easily be combined with tempering methods to further improve Markov chain mixing.
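The following sketch shows only the mechanics of a block-update move: genes are grouped by correlation, and the inclusion indicators of a whole block are proposed jointly and accepted with a Metropolis-Hastings step. The block construction and the crude model score are simplifications for illustration, not the group's algorithm.

    # Hedged sketch of a block proposal over correlated genes; simulated data.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(1)
    n, p = 80, 300
    X = rng.standard_normal((n, p))
    X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(n)      # two highly correlated genes
    y = (X[:, 0] + rng.standard_normal(n) > 0).astype(float)

    # Blocks of genes from hierarchical clustering of the correlation matrix.
    dist = np.clip(1 - np.abs(np.corrcoef(X, rowvar=False)), 0, None)
    labels = fcluster(linkage(dist[np.triu_indices(p, 1)], method="average"),
                      t=0.5, criterion="distance")
    blocks = [np.flatnonzero(labels == k) for k in np.unique(labels)]

    def log_score(gamma):
        """Crude model score (least-squares fit plus BIC-type penalty), a stand-in
        for the binary-classification likelihood used in the real model."""
        k = gamma.sum()
        if k == 0 or k >= n:
            return -np.inf
        Xs = X[:, gamma]
        resid = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
        return -(n * np.log(resid @ resid / n) + k * np.log(n)) / 2

    gamma = np.zeros(p, dtype=bool)
    for it in range(500):
        block = blocks[rng.integers(len(blocks))]
        prop = gamma.copy()
        prop[block] = rng.random(block.size) < 0.5        # resample the whole block
        # The within-block proposal is symmetric, so the MH ratio is the score ratio.
        if np.log(rng.random()) < log_score(prop) - log_score(gamma):
            gamma = prop

    print("included genes:", np.flatnonzero(gamma))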
A recent development in genetics and genomics is the detection of expression QTLs (eQTLs), that is, genes whose mRNA expression level is associated with one or more genetic markers. Such associations are a first step towards understanding complex regulation processes. Data sets for these investigations are generally large, with tens of thousands of transcripts (mRNAs) measured simultaneously alongside tens of thousands of markers. The size of these data sets presents challenges to Bayesian methods aiming to analyse markers and transcripts jointly and coherently.
We are investigating the use of spike and slab type models for variable selection. The starting point is the Jia and Xu Bayesian shrinkage model (Jia and Xu, Genetics 2007). All transcripts and markers are analysed simultaneously; markers may be associated with multiple transcripts and transcripts may be associated with multiple markers. These models are attractive because they have a simple structure and can be fit in a fully Bayesian manner using Gibbs sampling. However, we find that the Gibbs sampler encounters serious mixing problems when applied to data sets with multiple outcomes. We are developing more sophisticated updating methods in order to explore the parameter space more fully.
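For a single outcome the basic machinery is standard; the sketch below is a minimal spike-and-slab Gibbs sampler for one transcript regressed on all markers, with illustrative priors and hyperparameters of our own choosing rather than those of Jia and Xu. The mixing problems described above arise when many such regressions are coupled across transcripts.

    # Minimal spike-and-slab Gibbs sampler for a single transcript; simulated data.
    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 120, 200
    X = rng.standard_normal((n, p))
    y = 1.5 * X[:, 0] - 1.0 * X[:, 5] + rng.standard_normal(n)

    tau0, tau1, prior_pi = 0.01, 1.0, 0.05    # spike sd, slab sd, prior inclusion prob
    beta = np.zeros(p)
    gamma = np.zeros(p, dtype=bool)
    sigma2 = 1.0
    xx = (X ** 2).sum(axis=0)

    for sweep in range(1000):
        resid = y - X @ beta
        for j in range(p):
            resid += X[:, j] * beta[j]                    # remove marker j's contribution
            tau = tau1 if gamma[j] else tau0
            v = 1.0 / (xx[j] / sigma2 + 1.0 / tau ** 2)
            m = v * (X[:, j] @ resid) / sigma2
            beta[j] = rng.normal(m, np.sqrt(v))           # draw the coefficient
            # Update the inclusion indicator given the sampled coefficient.
            log_odds = (np.log(prior_pi) - np.log(1 - prior_pi)
                        + np.log(tau0 / tau1)
                        + 0.5 * beta[j] ** 2 * (1 / tau0 ** 2 - 1 / tau1 ** 2))
            gamma[j] = rng.random() < 1.0 / (1.0 + np.exp(-log_odds))
            resid -= X[:, j] * beta[j]                    # restore marker j's contribution
        # Conjugate inverse-gamma update of the residual variance.
        sigma2 = 1.0 / rng.gamma(1 + n / 2, 1.0 / (1 + 0.5 * resid @ resid))

    print("markers included in the final draw:", np.flatnonzero(gamma))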
This work provides an investigation of the univariate methods commonly used to search for expression Quantitative Trait Loci (eQTL), and presents a multivariate method based on the Lasso as a superior alternative. We show that the Lasso is equivalent to univariate methods in assessing evidence for the existence of one or more eQTL at a transcript, but gives improved performance in locating eQTL. More precisely, simulation studies show that the Lasso has higher power for a given false positive rate than univariate methods, because linkage disequilibrium between markers generates false positive signals that damage the performance of univariate methods. Further simulation studies examine the properties of methods that construct confidence intervals for the location of QTL.
These suggest that, with the typical effect sizes simulated here, precise estimates of location are difficult. Finally, we apply these methods to experimental data to produce a list of genes with evidence of genetic regulation; for a subset of these, large effect sizes mean that a precise estimate of location is possible.
Properties of these eQTL imply that this subset is under monogenic control, and that the multiple eQTL proposed by univariate methods are false positives induced by marker correlation.
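The contrast between the univariate and multivariate scans described above can be sketched in a few lines; the simulated markers below are continuous surrogates for genotypes with strong local correlation standing in for linkage disequilibrium, and scikit-learn's LassoCV is used purely as an off-the-shelf stand-in for the Lasso fit.

    # Univariate scan vs a joint Lasso over all markers, on simulated data.
    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(3)
    n, m = 200, 50
    # Markers with strong local correlation, crudely mimicking linkage disequilibrium.
    z = rng.standard_normal((n, m))
    markers = np.cumsum(z, axis=1) / np.sqrt(np.arange(1, m + 1))
    expression = 0.8 * markers[:, 20] + rng.standard_normal(n)   # one true eQTL at marker 20

    # Univariate scan: marginal correlation of each marker with the transcript.
    marginal = np.abs([np.corrcoef(markers[:, j], expression)[0, 1] for j in range(m)])
    print("top univariate hit:", marginal.argmax())

    # Multivariate scan: Lasso over all markers jointly, penalty chosen by cross-validation.
    fit = LassoCV(cv=5).fit(markers, expression)
    print("markers with non-zero Lasso coefficients:", np.flatnonzero(fit.coef_))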
Finding molecular profiles based on gene expression microarray data is a variable selection problem that is particularly challenging because of the large number of variables relative to the sample size, which results in an ill-conditioned problem with many equally good solutions. One question investigated in the course of this project is therefore the stability of molecular profiles. To address this issue, a resampling study was performed comparing the similarity between resampled molecular profiles for a variety of univariate and multivariate classification methods.
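The sketch below conveys the resampling idea only: refit a gene selector on bootstrap resamples and measure how stable the selected gene set is via pairwise Jaccard overlap. The selector shown (top-k absolute t-statistics) is just one simple univariate method; the actual study compared a range of univariate and multivariate classifiers.

    # Stability of a selected gene set under bootstrap resampling; simulated data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n, p, k = 60, 2000, 20
    y = np.repeat([0, 1], n // 2)                          # two classes of 30 samples
    X = rng.standard_normal((n, p))
    X[y == 1, :5] += 1.0                                   # five weakly informative genes

    def top_k_genes(Xb, yb):
        """Select the k genes with the largest absolute two-sample t-statistics."""
        t = stats.ttest_ind(Xb[yb == 0], Xb[yb == 1], axis=0).statistic
        return set(np.argsort(-np.abs(t))[:k])

    selections = []
    for _ in range(50):
        # Bootstrap within each class so that both classes stay represented.
        idx = np.concatenate([rng.choice(np.flatnonzero(y == c), (y == c).sum())
                              for c in (0, 1)])
        selections.append(top_k_genes(X[idx], y[idx]))

    pairs = [(a, b) for i, a in enumerate(selections) for b in selections[i + 1:]]
    jaccard = [len(a & b) / len(a | b) for a, b in pairs]
    print("mean pairwise Jaccard similarity of selected gene sets:", np.mean(jaccard))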
Classical latent factor analysis seeks to discover patterns of dependence in multivariate data that allow dimension reduction by representing the observed variables as linear combinations of a smaller number of unobserved 'factors'. We are interested in finding sparse representations, in which many of the linear coefficients are zero, in the interests of parsimony, interpretability and statistical stability; we use a Bayesian hierarchical modelling approach.
Specifically, we examine the situation where there are two or more groups of variables, neither low in dimension, and the main interest is in discovering sparse representations of the dependence between them. We develop a strategy that structures the patterns of dependence, and explores the model space allowing the numbers of common and specific factors to vary.
We are motivated by a study relating profiling of metabolites with transcript and enzyme activity, and illustrate the statistical and computational performance of our methodology, and its sensitivity to prior assumptions, on both these data and a variety of simulated set-ups.
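Schematically, and in our own notation rather than that of the paper, the kind of model involved for two groups of variables can be written as

    \begin{aligned}
    \begin{pmatrix} x^{(1)}_i \\ x^{(2)}_i \end{pmatrix}
      &= \begin{pmatrix} \Lambda^{(1)}_{c} & \Lambda^{(1)}_{s} & 0 \\
                         \Lambda^{(2)}_{c} & 0 & \Lambda^{(2)}_{s} \end{pmatrix}
         \begin{pmatrix} f^{c}_{i} \\ f^{(1)}_{i} \\ f^{(2)}_{i} \end{pmatrix}
         + \varepsilon_i,
      \qquad f_i \sim N(0, I), \quad \varepsilon_i \sim N(0, \Psi), \\
    \lambda_{jk} &\sim (1 - \pi_k)\,\delta_0 + \pi_k\, N(0, \tau_k^2),
    \end{aligned}

where x^(1) and x^(2) are the two groups of observed variables, f^c are factors common to both groups, f^(1) and f^(2) are group-specific factors, and the point mass at zero in the prior on each loading induces the sparse representation; in the full model the numbers of common and specific factors are themselves allowed to vary.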
Standard regression analyses are often plagued with problems encountered when one tries to make meaningful inference using datasets that contain a large number of variables, especially in the presence of high-order interactions. This situation arises in, for example, surveys consisting of a large number of questions yielding a potentially unwieldy set of inter-related data. We propose a method that addresses these problems by using, as its basic unit of inference, a profile, formed from a sequence of covariate values. These covariate profiles are clustered into risk groups and these risk groups are used to predict a relevant outcome. This modeling framework is set in the context of a Bayesian partition model, which gives our method a number of distinct advantages over traditional clustering approaches in that it allows the number of clusters to be random, performs variable selection, allows for comparison of arbitrary subgroups of the data, and examines subgroups based on their association with an outcome of interest. The method is demonstrated with an analysis of survey data obtained from The National Survey of Children's Health (NSCH). The approach has been implemented using the standard Bayesian modeling software, WinBUGS, with code provided at the end of the manuscript.
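As a deliberately simplified, non-Bayesian caricature of the profile idea only: covariate profiles can be clustered and the outcome summarised within each cluster, as below. The actual method treats the number of clusters as random, performs variable selection and fits everything jointly in WinBUGS; none of that is reproduced in this sketch, and the data are simulated.

    # Cluster covariate profiles, then inspect the outcome rate within each cluster.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(5)
    n_subjects, n_questions = 500, 8
    profiles = rng.integers(0, 3, size=(n_subjects, n_questions))   # survey answers 0/1/2
    risk = 0.1 + 0.2 * (profiles[:, 0] == 2) + 0.3 * (profiles[:, 1] == 0)
    outcome = rng.random(n_subjects) < risk                         # binary outcome

    groups = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(profiles)
    for g in range(4):
        print(f"cluster {g}: n = {(groups == g).sum():4d}, "
              f"outcome rate = {outcome[groups == g].mean():.2f}")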
Metabonomics is the quantitative investigation of metabolic variation, including the metabolic responses to endogenous and environmental stimuli. Studies in metabonomics normally focus on variability in the concentrations of small molecule metabolites in biofluids such as urine. For the field to develop, accurate and efficient experimental and statistical procedures for measuring these metabolite concentrations are required.
Proton Nuclear Magnetic Resonance (NMR) is a technology for producing spectra that contain quantitative information on many hundreds of the metabolites in a biofluid simultaneously. For a variety of reasons it is difficult to interpret these spectra automatically. Multiple signals from the same compound frequently appear in different parts of the spectrum, while signals from different compounds are often convolved. There is also experiment-level variability in the position and shape of the resonance peaks corresponding to different compounds. We are working with the Computational Bioinformatics group, in the Department of Biomolecular Medicine, to develop Bayesian models and sampling algorithms which overcome these problems and allow efficient automatic interpretation of NMR spectra, including inference of the concentrations of metabolites present in biofluids.
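The sketch below is purely illustrative of why deconvolution is needed: a spectrum is modelled as a non-negative combination of per-compound templates, each template being a sum of Lorentzian peaks at that compound's resonance positions, and the concentrations are recovered by non-negative least squares. The compounds, peak positions and widths are hypothetical, and the group's actual models are Bayesian and also handle the peak-position and peak-shape variability that this sketch ignores.

    # Recover compound concentrations from a simulated spectrum of overlapping peaks.
    import numpy as np
    from scipy.optimize import nnls

    ppm = np.linspace(0, 10, 2000)                        # chemical-shift axis

    def lorentzian(centre, width=0.02):
        return width ** 2 / ((ppm - centre) ** 2 + width ** 2)

    # Hypothetical compounds, each producing peaks in several spectral regions.
    templates = {
        "compound_A": lorentzian(1.30) + 0.5 * lorentzian(4.1),
        "compound_B": lorentzian(1.32) + lorentzian(7.4),
    }
    T = np.column_stack(list(templates.values()))

    rng = np.random.default_rng(6)
    true_conc = np.array([2.0, 0.7])
    spectrum = T @ true_conc + 0.01 * rng.standard_normal(ppm.size)

    conc, _ = nnls(T, spectrum)                           # non-negative least squares
    print(dict(zip(templates, conc.round(2))))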
We are also researching models for multiple spectra that allow us to classify individuals according to their metabolite profiles. The principal motivation for this work is the epidemiology of chronic disease and studies of many hundreds of individuals, which investigate connections between the metabolic system, disease outcomes, and environmental and genetic risk factors.
Affymetrix microarrays are widely used to profile the expression of thousands of genes simultaneously. The predominant type of array is the 3' Genome Array, which uses probes targeting the 3' end of each gene of interest. Recently, Exon and Gene Arrays have been developed with probes covering the whole length of the gene.
Previously our group developed a Bayesian hierarchical model called BGX for obtaining measures of gene expression using 3' arrays. This project extends BGX to the new Exon and Gene Arrays. In common with BGX, the new model takes into account additive and multiplicative error, non-specific hybridisation and probe affinity effects (using probe sequence information).
The new model includes two extensions to BGX. (1) On the Exon and Gene Arrays, cross-hybridisation is measured by a smaller number of global "mis-match" probes: there is no longer a mis-match probe corresponding to every perfect-match probe. This makes the estimation of probe signals in the model simpler and more robust. (2) Probes may be mapped to more than one gene. The mapping between genes and probes is explicitly accounted for in the model, so signal from probes that match several genes is split between those genes in a statistically sound way.
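The multi-mapping idea in extension (2) can be conveyed with a toy example: a probe-to-gene incidence matrix expresses each probe's signal as the sum of the signals of every gene it maps to, so that a shared probe's signal is apportioned between genes when the model is fitted. The real BGX-style model is hierarchical and Bayesian, with additive and multiplicative error, non-specific hybridisation and probe-affinity terms; none of that detail is reproduced in this sketch, and the gene names and intensities are made up.

    # Toy illustration: sharing a multi-mapping probe's signal between two genes.
    import numpy as np

    rng = np.random.default_rng(7)
    genes = ["geneA", "geneB"]
    # Each row is a probe; a 1 means the probe matches that gene (probe 3 matches both).
    incidence = np.array([[1, 0],
                          [1, 0],
                          [1, 1],
                          [0, 1]], dtype=float)
    true_signal = np.array([5.0, 2.0])
    intensities = incidence @ true_signal + 0.3 * rng.standard_normal(4)

    # Least-squares fit of per-gene signals given the probe-to-gene mapping.
    estimate, *_ = np.linalg.lstsq(incidence, intensities, rcond=None)
    print(dict(zip(genes, estimate.round(2))))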