Bayesian Integrative Genomics

Phase I: Funded by the BBSRC Exploiting Genomics initiative, from May 2002 to February 2007.

Phase II: A collaborative programme started in 2008, coordinated by Prof Sylvia Richardson.

Multi-Mapping Bayesian Gene eXpression for Affymetrix whole-transcript arrays

MMBGX is a package for analysing whole-transcript Affymetrix arrays at the Ensembl gene or Ensembl transcript (isoform) level, taking into account the multi-mapping structure between probes and the features they target. The package relies on the annmap Bioconductor package to create the "structure directories" containing the mapping information linking probes to features. The oligo Bioconductor package is needed to map probes to oligonucleotide sequences. Please cite Turro et al. 2010 if you use this software in your work. Note that, in contrast to the description in the paper, MMBGX now runs separately on each array and condition-specific expression estimates can be generated afterwards as described below. Additionally, MMBGX now uses annmap mappings rather than Affymetrix mappings for Gene ST arrays as well as Exon ST arrays.

Getting started

Make sure you have installed R and the standard Bioconductor packages. Additionally, you need to install the following dependencies: annmap, oligo, inline and Rcpp. Then download and install the latest version of the MMBGX R package.
MMBGX requires one "structure directory" per array type, Ensembl version and analysis level (Ensembl gene or Ensembl transcript). To create a structure directory, you must install the relevant annmap MySQL database and then run the makeStructDir command in R. E.g. suppose you have installed the human Ensembl 70 database and named it "hs70" in the databases.txt file (as explained in the annmap cookbook PDF), then you could create the structure directory for analysing HuEx-1_0-st-v2 arrays at the transcript level as follows: makeStructDir(annmapDB="hs70", cdfName="HuEx-1_0-st-v2", geneLevel=FALSE). Alternatively, if you are having trouble getting the MySQL database to work or simply wish to save some time, you can download the Ensembl 64 gene and transcript level structure directories for HuEx-1_0-st-v2 and HuGene-1_0-st-v1 arrays here. Ensembl 70 directories for HuEx-1_0-st-v2 (transcript and gene level), MoGene-1_0-st-v1 (gene level) and MoEx-1_0-st-v1 (transcript and gene level) are available here. Unzip the file and place the structDirs directory inside the mmbgx R library directory (echo `R RHOME`/library/mmbgx). If you would like to be sent structure directories for other array types, species or versions of Ensembl, please contact the corresponding author.
Place your CEL files into condition-specific directories (suggested).
Load the MMBGX package and analyse your CEL files in turn. Suppose you have three CEL files in each of two conditions and they are in directories cond1 and cond2 respectively.
```
library(mmbgx)
setwd("cond1")
d <- read.celfiles(dir())
mmbgx(d, geneLevel=FALSE, outputDir="transcriptRuns")
mmbgx(d, geneLevel=TRUE, outputDir="geneRuns")
setwd("../cond2")
d <- read.celfiles(dir())
mmbgx(d, geneLevel=FALSE, outputDir="transcriptRuns")
mmbgx(d, geneLevel=TRUE, outputDir="geneRuns")
setwd("..")
```
This will analyse each Exon array CEL file in directory cond1 at the transcript and gene level. The output of the transcript-level analysis will go in directories named run.1, run.2, run.3, etc. inside cond1/transcriptRuns/. The output of the gene-level analysis will end up in cond1/geneRuns/ The same will happen in directory cond2. Type ?mmbgx for documentation. Note that each array may be analysed simultaneously. If you have many arrays and a computing cluster, we recommend using the standalone binary.

When the runs are finished, combine the output directories within conditions:

combineRuns.mmbgx(c(dir("cond1/transcriptRuns", full.names=TRUE), dir("cond2/transcriptRuns", full.names=TRUE)), sampleSets=c(3,3), outputDir="combinedTranscriptRun")
combineRuns.mmbgx(c(dir("cond1/geneRuns", full.names=TRUE), dir("cond2/geneRuns", full.names=TRUE)), sampleSets=c(3,3), outputDir="combinedGeneRun")

This will loess-normalise all the arrays and average the posterior distributions of the expression parameter "mu" within conditions.

Finally, read in the combined directories for analysis:
```
resG <- readSingle.mmbgx("combinedGeneRun")
resT <- readSingle.mmbgx("combinedTranscriptRun")
```
You may then access several data structures. E.g.:
- resT$PSids: a vector of probeset IDs (Affy gene IDs for Gene arrays; Ensembl transcript IDs (if geneLevel=FALSE) or gene IDs (if geneLevel=TRUE) for Exon arrays). In the case of Exon arrays, you may occasionally see IDs formed from multiple IDs separated by the '+' character (E.g. ENST00000009120+ENST00000234610 or ENSG00000123584+ENSG00000166008. That is because these transcripts or genes cannot be distinguished using the available probes on the array. You should treat the expression for these probesets as the aggregate expression of all the component transcripts/genes.
- resT$muave: the log expression estimate of each probeset (rows) under each condition (columns)
- resT$mu[[n]]: 1024 posterior samples of the log expression parameter for each probeset (rows) under condition n
- resT$geneIDs (for transcript-level Exon array analysis only): list of Ensembl gene IDs
- resT$geneTranscripts (only available for Exon array data): mappings from Ensembl genes to Ensembl transcript IDs

Differential expression analysis

It is useful to plot a histogram of the probability of up-regulation of each transcript or gene (cf. Hein et al. 2006, Figure 4). The peaks near 0 and 1 in the example below represent differentially expressed genes/transcripts while the hump around 0.5 represents the non-differentially expressed set. If the spline fit (the black line) is inadequate, try setting a lower value for the df argument. Type ?analysis.mmbgx for documentation.

plotDEHistogram(resT)

The genes/transcripts may be ranked by differential expression.

> ra <- rankByDE(resT, absolute=TRUE)
> head(ra)
                                Position DiffExpression
ENST00000382159                    33819       424.4531
ENST00000381340+ENST00000451599    33387       412.2417
ENST00000261637                     3963       384.4219
ENST00000371079                    26935       373.2364
ENST00000334258                    15571       365.8463
ENST00000358117                    21671       365.2377

The DiffExpression statistic is described in Turro et al. 2007, the greater its value the greater the probability of differential expression. The Position column tells you the index of the transcript/gene.

> plotDEDensity(resT, "ENST00000382159")

> tail(ra)
                Position DiffExpression
ENST00000466277    88301   0.0004721673
ENST00000360019    22569   0.0002682017
ENST00000474642    95186   0.0002639391
ENST00000461757    84570   0.0001762428
ENST00000395730    38539   0.0001606146
ENST00000328642    14258   0.0001176768

> plotDEDensity(resT, 88301)

Differential splicing analysis

The main goal of differential splicing analysis is to identify transcripts whose expression change between conditions is significantly different from the overall expression change of its gene:

ds <- differentialSplicing(resG, resT)

ds is a data frame with four columns:

g_index: the gene index in resT$geneIDs
g_maxpg0: the maximum value within a gene for the probability that a transcript's differential expression is greater than that of its gene
g_maxdiff: the log fold change for the transcript with the maximum value above
t_index: the index of the transcript with the most extreme value above
g_transconsidered: the number of transcripts considered when calculating the above values

Genes of interest have g_maxpg0 near 0 or 1 and a relatively high abs(g_maxdiff). This plot may be useful:

plot(ds$g_maxpg0, ds$g_maxdiff)

You can obtain a list of candidate genes:

> head(ds[(ds$g_maxpg0 > 0.95 | ds$g_maxpg0 < 0.05) & abs(ds$g_maxdiff) > 2.5,])
40       66 1.00000000  2.976158  111067                 8
104     168 0.02929688 -2.646944   76121                 2
128     218 0.00000000 -3.558733   57454                 7
192     304 0.00000000 -2.710747   28648                14
280     449 1.00000000  2.710265  106233                 8
645    1003 0.00781250 -3.058512  109931                18

> resT$geneIDs[ds$g_index[2]]
[1] "ENSG00000000457"

>  resT$PSids[ds$t_index[2]]
[1] "ENST00000367772"

You can plot the posterior densities of expression parameter "mu" for the various transcripts under each condition:

plotTranscriptDensities(resT, ds$g_index[2])

Standalone binary

The MMBGX software is computationally intensive so you should try to run it on a powerful computer with lots of CPU cores. If you are analysing many arrays, you might prefer to run the standalone binary on a cluster to speed things up.

To run MMBGX as a standalone binary, you must first compile it. Then, from R, run the mmbgx function with inputDirs set to a character vector of length equal to the number of arrays in the AffyBatch object and standalone=TRUE (each element of inputDirs is a temporary directory of your choice, one per array). You can then run the standalone binary once per array, passing path/to/inputdir/infile.txt as a single argument each time. E.g.

From the command line (just do this once):


tar -xzf mmbgx_0.99.27.tar.gz
cd mmbgx/src/
make -f Makefile.standalone

Before you run make above, you might want to check that -march is set appropriately for your machine in the Makefile.standalone file.

I have found that ICC-compiled binaries are about twice as fast as GCC-compiled ones. If you use 64-bit linux, I recommend you try the ICC-compiled statically-linked binary mmbgx-icc in the src/ directory.

Then in R, prepare the input directories:


library(mmbgx)
setwd("cond1")
d <- read.celfiles(dir())
# create input directories transcriptInput.1, transcriptInput.2, ..., geneInput.1, geneInput.2...
mmbgx(d, geneLevel=FALSE, inputDirs=paste("transcriptInput", c(1:length(d)), sep="."), standalone=TRUE)
mmbgx(d, geneLevel=TRUE, inputDirs=paste("geneInput", c(1:length(d)), sep="."), standalone=TRUE)
setwd("../cond2")
d <- read.celfiles(dir())
mmbgx(d, geneLevel=FALSE, inputDirs=paste("transcriptInput", c(1:length(d)), sep="."), standalone=TRUE)
mmbgx(d, geneLevel=TRUE, inputDirs=paste("geneInput", c(1:length(d)), sep="."), standalone=TRUE)

From the command line run MMBGX:


cd cond1; mkdir transcriptRuns; mkdir geneRuns
cd transcriptRuns
for i in `ls .. | grep transcriptInput.`; do 
  ../../mmbgx/src/mmbgx $i/infile.txt
done
cd ../geneRuns
for i in `ls .. | grep geneInput.`; do 
  ../../mmbgx/src/mmbgx $i/infile.txt
done
cd ../../cond2; mkdir transcriptRuns; mkdir geneRuns
# etc

Notes:

If you get the error libgomp.so.1: shared object cannot be dlopen()ed after library(mmbgx), restart R as follows: LD_PRELOAD=libgomp.so.1 R.

To get OpenMP (shared memory parallelism) support on Mac OS X Leopard, download and install Xcode 3.1 and run the following commands before installing the package:


cd /usr/bin
sudo rm gcc cc g++
sudo ln -s gcc-4.2 gcc
sudo ln -s gcc-4.2 cc 
sudo ln -s g++-4.2 g++
cd /Library/Frameworks/R.framework/Versions/Current/Resources/etc/i386/
sudo sed -i .bak 's/-isysroot\ \/Developer\/SDKs\/MacOSX10\.4u\.sdk//g' Makeconf

After installing the package, you may run:


cd /Library/Frameworks/R.framework/Versions/Current/Resources/etc/i386/
sudo cp Makeconf.bak Makeconf

There is a bug in the Intel compiler version 10 to 10.1.018 that causes the Boost pseudo-random number generator to go haywire. If you use the Intel compiler, make sure you are using version 9 or version 10.1.018 and above.