bgx logo

Bayesian Integrative Genomics

Phase I: Funded by the BBSRC Exploiting Genomics initiative, from May 2002 to February 2007.

Phase II: A collaborative programme started in 2008, coordinated by Prof Sylvia Richardson.



MMSEQ: haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads

MMSEQ has a new home on GitHub and this page is now outdated.

pipeline

The flowchart to the right depicts the MMSEQ pipeline for obtaining expression estimates from RNA-seq data. There are two routes, with starting points labelled A and B. Route A is quite fast and straightforward to run and uses pre-existing transcript sequences for alignment. Route B requires more time, as it involves the creation of custom transcript sequences based on the data. The two routes are described below. Basic knowledge of how to work in a unix-based environment is required. Route A may be run through the R-based ArrayExpressHTS pipeline. Please cite Turro et al. 2011 (Genome Biology) if you use MMSEQ in your work.

Recent news:

ROUTE A: use a ready-made transcript FASTA file consisting of isoform or haplo-isoform sequences

  1. Software requirements:

    MMSEQ consists of two main binaries (bam2hits and mmseq) and several supporting utilities (binaries and Ruby, Perl and bash scripts). Make sure the directories containing the Bowtie, SAMtools and MMSEQ binaries and scripts are in your PATH, otherwise some of the commands in this manual will not work (see Note 8).

  2. Input files needed:
  3. Running the pipeline:
    1. Create Bowtie indexes for your FASTA file using bowtie-build. It is advisable to use a lower-than-default value for --offrate (such as 2 or 3) as long as the resulting index fits in memory. E.g. if you're using the ready-made filtered file for Human, try:
      gunzip Homo_sapiens.GRCh37.64.ref_transcripts.fa.gz
      bowtie-build --offrate 3 Homo_sapiens.GRCh37.64.ref_transcripts.fa Homo_sapiens.GRCh37.64.ref_transcripts
      
    2. Optional: If your insert size distribution overlaps your read length (pity - you should aim for mean insert length > read length), you might consider trimming back the reads to exclude adapter sequences. I recommend using Trim Galore! for paired-end sequencing. E.g.:
      trim_galore --phred64 -q 15 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACTAACAAGATCTCGTATGCCGTCTTCTGCTTG \
        -a2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -s 3 -e 0.05 \
        --length 36 --trim1 --gzip --paired file_1.fq.gz file_2.fq.gz
      (TruSeq index highlighted in blue).
    3. Generate BAM file of alignments using Bowtie. Mandatory flags are -a --best --strata -S (you absolutely must remember to specify -a to ensure you get multi-mapping alignments). Suppress alignments for reads that map to a huge number of transcripts with the -m option (e.g. -m 100). Other flags/options may be needed depending on your data (e.g. -I and -X for the minimum and maximum insert size and/or --norc if you are using a forward-stranded protocol). If your reference FASTA file doesn't use the vertebrate Ensembl naming convention, then you must specify --fullref. If you are using barcoding and the read names of your second mates end with /3 or /4, you need to change your FASTQ files so that they end with /2 instead because bowtie only trims read names ending with /1 or /2 (awk 'FNR % 4==1 { sub(/\/[34]$/, "/2") } { print }' secondreads.fq > secondreads-new.fq should do the trick). If you have multiple FASTQ files from the same library, feed them all together to Bowtie in one go (separate the FASTQ file names with commas). If you are getting many "Exhausted best-first chunk memory" warnings, try increasing --chunkmbs to 128 or 256. If the read names contain any spaces, make sure the substring up to the first space in each read is unique, as Bowtie strips anything after a space in a read name. The output BAM file must be sorted by read name. With paired-end data, only pairs where both reads have been aligned are used so you can use the samtools 0xC filtering flag to reduce the size of your output. E.g. to align the paired files 3125_7_1.fastq and 3125_7_2.fastq (from the Montgomery et al. HapMap data):
      bowtie -a --best --strata -S -m 100 -X 400 --chunkmbs 256 -p 8 Homo_sapiens.GRCh37.64.ref_transcripts \
        -1 3125_7_1.fastq -2 3125_7_2.fastq | samtools view -F 0xC -bS - | samtools sort -n - 3125_7.namesorted
      
    4. Generate mappings from reads to transcript sets with bam2hits. Basic usage: bam2hits transcript_fasta namesorted_bam > hits_file (run bam2hits with no arguments for full usage instructions and see Notes 3, 4, 5 and 6 for additional guidance). E.g.:
      bam2hits Homo_sapiens.GRCh37.64.ref_transcripts.fa 3125_7.namesorted.bam > 3125_7.hits
      
      The hits file can be inspected and converted to plain text using the hitstools utility.
    5. Run the mmseq program. Basic usage: mmseq hits_file output_base (run mmseq without arguments for full usage instructions). E.g.:
      mmseq 3125_7.hits 3125_7
      
  4. Description of output: MMSEQ operates on a sample-by-sample basis, and thus the expression estimates are roughly proportional to the RNA concentrations in each sample. Some scaling of the estimates may be required to make them comparable across biological replicates and conditions. The posterior standard deviations capture the uncertainty due to Poisson counting noise and due to the ambiguity in the mappings between reads and transcripts. The biological variance between samples can only be discerned with the use of biological replicates (see section on differential expression below).

ROUTE B: create new transcript FASTA files using your data

  1. Software requirements:

    MMSEQ consists of three binaries (bam2hits, mmseq and extract_transcripts) and several Ruby, R, Perl and bash scripts. Make sure the directories containing the Bowtie, TopHat, SAMtools and MMSEQ binaries and scripts are in your PATH, otherwise some of the commands in this manual will not work. Also, add the path to polyHap2.jar to the CLASSPATH, otherwise the Java commands in this manual will not work (see Note 8).

  2. Input files needed:
  3. Running the pipeline:
    1. Create Bowtie indexes for your genome FASTA file using bowtie-build. E.g. bowtie-build -f Homo_sapiens.GRCh37.64.dna.toplevel.ref.fa Homo_sapiens.GRCh37.64.dna.toplevel.ref. Alternatively, download ready-made Bowtie indexes for Human (Ensembl v64 (GRCh37/hg19), gzipped).
    2. For each sample, align your reads to the genome using TopHat and the GFF file with options --no-novel-juncs --min-isoform-fraction 0.0 --min-anchor-length 3. The output directory should be specified with the -o option. This example is based on the Montgomery et al. HapMap dataset:
      tophat -G Homo_sapiens.GRCh37.64.ref.gff --no-novel-juncs --min-isoform-fraction 0.0 --min-anchor-length 3 \
             -r 192 -p 8 -o 3125_7.tophat Homo_sapiens.GRCh37.64.dna.toplevel.ref 3125_7_1.fastq 3125_7_2.fastq 
      
    3. Call SNPs using SAMtools mpileup and save as a filtered VCF file (see documentation). E.g. from the main directory:
      SBAMS=( `find . -wholename "*.tophat/accepted_hits.bam"` )
      samtools mpileup -ugf Homo_sapiens.GRCh37.64.dna.toplevel.ref.fa ${SBAMS[@]} | bcftools view -bvcg - > var.raw.bcf
      bcftools view var.raw.bcf | vcfutils.pl varFilter > var.flt.vcf
      
    4. Import SNPs for use with polyHap2. You need to specify the path to the GFF annotation file used in the alignment, the filtered VCF file produced above and a temporary directory in which to save files. E.g.:
      java dataFormat.ImportSNPs Homo_sapiens.GRCh37.64.ref.gff var.flt.vcf ph2temp 
      
      (If there are any '.' or '/' characters in the sample names in the VCF header, be sure to remove them before running the above command, as polyHap2 currently fails otherwise.) If you are working with a small number of human samples (< 10), it may be worth improving your phasing by combining the VCF file [coming soon] obtained from the 60 CEU samples in the Montgomery et al. HapMap dataset with your own VCF file, instead of using only your own data (see merge-vcf). However, if you intend to do arbitrary phasing just to reduce allelic mapping biases, then there is no need for this (see below).
    5. To phase the genotypes, run the phase.sh script in the temporary directory. E.g.:
      ./ph2temp/phase.sh
      Note that commands listed in phase.sh may safely be run simultaneously if you have access to many CPUs.
      Alternatively, if you are only interested in reducing allelic mapping biases and not in haplo-isoform expression, run the phase.sh script with the -r option to assign the alleles to the two haplotypes without phasing (REF alleles are assigned to the _A haplotype and ALT alleles to the _B haplotype). This option may be used even when the number of samples is as low as 1.
    6. Create the ATCG-based phased haplotypes for each transcript, saving the files in a polyHap2 output directory. You need to specify the path to the GFF annotation file used in the alignment, the path to the temporary directory created above and an output directory. E.g.:
      java dataFormat.ConvertABtoATGC Homo_sapiens.GRCh37.64.ref.gff ph2temp ph2output
      The files produced are a SNPinfo position file and sample-specific haplotype files containing the two haplotypes (suffixed _A and _B) of each transcript.
    7. Enter the polyHap2 output directory and construct custom transcriptome reference FASTA files, one for each sample, using the SNPinfo position file and the sample-specific haplotype files. This is done with the haploref.rb script. Usage: haploref.rb transcript_fasta gff_file pos_file hap_file > hapiso_fasta. E.g. for sample 3125_7:
      haploref.rb ../Homo_sapiens.GRCh37.64.ref_transcripts.fa ../Homo_sapiens.GRCh37.64.ref.gff SNPinfo 3125_7.hap > 3125_7.fa
      
      Now you have a separate haplo-isoform FASTA file for each sample. For each sample, follow the Route A pipeline using the new customised FASTA file (in the example above, 3125_7.fa) as the reference. If you just did arbitrary phasing to reduce allelic mapping biases, bear in mind Note 6.

Differential expression analysis with mmdiff

The mmdiff binary allows you to perform model comparison using the MMSEQ output. You can run mmdiff as follows:

mmdiff [OPTIONS...] matrices_file mmseq_file1 mmseq_file2... > out.mmdiff

Set the prior probability that the second model is true with -p FLOAT. Use the option -useprops to model summaries of the posterior isoform proportions rather than the expression levels from MMSEQ. Run the binary without arguments for more options.

The matrices files determines which models you wish to compare. You must define four matrices: M is a model-independent covariate matrix, C maps observations to classes for each model, P0(collapsed) is a collapsed matrix (i.e. distinct rows) defining statistical model 0 and P1(collapsed) is a collapsed matrix defining statistical model 1. If a model has no classes (i.e. just an intercept term) use 1 for the P matrix. E.g. for differential expression between two groups of 4 samples without extraneous variables the following matrices file would be appropriate:


# M; no. of rows = no. of observations
0
0
0
0
0
0
0
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 0
0 0
0 1
0 1
0 1
0 1
# P0(collapsed); no. of rows = no. of classes for model 0
1
# P1(collapsed); no. of rows = no. of classes for model 1
.5
-.5

For a three-way differential expression analysis with three, three and two observations per group respectively, the following matrices file would be appropriate:


# M; no. of rows = no. of observations
0
0
0
0
0
0
0
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 0
0 1
0 1
0 1
0 2
0 2
# P0(collapsed); no. of rows = no. of classes for model 0
1
# P1(collapsed); no. of rows = no. of classes for model 1
1 0 0
0 1 0
0 0 1

In order to assess whether the log fold change between group A and group B is different to the log fold change between group C and group D, assuming there are two observations per group, the following matrices file would be appropriate:


# M; no. of rows = no. of observations
0
0
0
0
0
0
0
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 1
0 1
1 2
1 2
1 3
1 3
# P0(collapsed); no. of rows = no. of classes for model 0
.5
-.5
# P1(collapsed); no. of rows = no. of classes for model 1
.5 .5
.5 -.5
-.5 .5
-.5 -.5

If you require advice on setting up the matrices file for your particular study design, do not hesitate to contact the corresponding author.

Description of the output:

  1. feature_id: name of feature (e.g. transcript)
  2. bayes_factor: the Bayes factor in favour of the second model
  3. posterior_probability: the posterior probability in favour of the second model (the prior probability is recorded in a # comment at the top of the file)
  4. alpha0 and alpha1: posterior mean estimates of the intercepts for each model
  5. eta0_0, eta0_1..., eta1_0, eta1_1...: posterior mean estimates of the regression coefficients under each of the models
  6. mu_sample1, mu_sample2,... sd_sample1, sd_sample2,...: the data, i.e. the posterior means and standard deviations used as the outcomes

Differential expression analysis with edgeR or DESeq

The MMSEQ expression estimates are in FPKM units (fragments per kilobase of transcript per million mapped reads or read pairs), which makes different samples broadly comparable. You may read in the output from multiple samples using the readmmseq.R R script included in the MMSEQ package (requires R ≥ 2.9). The readmmseq function takes three arguments:

  1. mmseq_files: vector of *.mmseq output files to read in
  2. sample_names: vector of strings containing the sample names
  3. normalize: logical scalar specifying whether to normalise the log expression values and log count equivalents for each sample by their median deviation from the mean (like DESeq).

The function returns a list with the following elements:

To test for differential expression with edgeR or DESeq, the count equivalents need to be used. E.g., to test for DE between two groups of two samples, you might run the following code in R from the directory containing the mmseq output files:


source("/path/to/readmmseq.R")
library(edgeR)
library(DESeq) # version >=1.5

ms = readmmseq()
tab = ms$counts[apply(ms$counts,1,function(x) any(round(x)>0)),] # only keep features with counts

# edgeR 
d = DGEList(counts = tab, group = c("group1", "group1", "group2","group2"))
d = estimateCommonDisp(d)
de.edgeR = exactTest(d)

# DESeq
cds = newCountDataSet(round(tab), c("group1", "group1", "group2", "group2") )
sizeFactors(cds) = rep(1, dim(cds)[2])
cds = estimateDispersions(cds)
de.DESeq = nbinomTest(cds,"group1", "group2")

Collapsing transcripts with mmcollapse

The mmcollapse binary takes the outputs from mmseq and collapses sets of transcripts based on posterior correlation estimates. The overall precision of the expression estimates for a collapsed set is usually much greater than for the individual component transcripts, which typically share important structural features (many shared exons, etc). Usage:


mmcollapse basename1 basename2...

Here, basename is the name of the mmseq output files without the suffixes (.mmseq, .identical.mmseq, .gene.mmseq). The output is a set of new MMSEQ files with suffix .collapsed.mmseq.

Notes

  1. fastagrep.sh is similar to unix grep except it works on multi-line sequence entries in a FASTA file. E.g., to remove alternative haplotype/supercontig entries from the Ensembl DNA, cDNA and ncRNA FASTA files and combine the filtered cDNA and ncRNA files:

    fastagrep.sh -v 'supercontig|GRCh37:Y:1:10000|GRCh37:[^1-9XMY]' Homo_sapiens.GRCh37.64.dna.toplevel.fa > Homo_sapiens.GRCh37.64.dna.toplevel.ref.fa
    fastagrep.sh -v 'supercontig|GRCh37:[^1-9XMY]' Homo_sapiens.GRCh37.64.cdna.all.fa > Homo_sapiens.GRCh37.64.cdna.ref.fa
    fastagrep.sh -v 'supercontig|GRCh37:[^1-9XMY]' Homo_sapiens.GRCh37.64.ncrna.fa > Homo_sapiens.GRCh37.64.ncrna.ref.fa
    cp Homo_sapiens.GRCh37.64.cdna.ref.fa Homo_sapiens.GRCh37.64.ref_transcripts.fa
    cat Homo_sapiens.GRCh37.64.ncrna.ref.fa >> Homo_sapiens.GRCh37.64.ref_transcripts.fa

    (note how this gets rid of the pesky 'GRCh37:Y:1:10000' entry in Ensembl human DNA FASTA files, which only contains Ns and breaks samtools indexing)

  2. It is possible to use reference transcript FASTA files with entries that are formatted differently than the cDNA and ncRNA files provided by Ensembl. To do this, tell bam2hits how to access the transcript IDs and the gene IDs in the reference entries using the -m tg_regexp t_ind g_ind option. tg_regexp specifies a regular expression that matches FASTA entries (excluding the >) where pairs of brackets are used to capture transcript and gene IDs. t_ind and g_ind are indexes for which pair of brackets capture the transcript and gene IDs respectively. E.g. for the human Ensembl file, entries look like this:

    >ENST00000416067 cdna:known chromosome:GRCh37:21:9907194:9968585:-1 gene:ENSG00000227328

    so (assuming you ran bowtie with --fullref) you can tell bam2hits how to extract the transcript and gene IDs like this:

    bam2hits -m "(E\S+).*gene:(E\S+).*" 1 2 Homo_sapiens.GRCh37.64.ref_transcripts.fa 3125_7.sam > 3125_7.hits

    If you are unsure what regular expression to use, try out by trial and error using the testregexp.rb script. Usage: testregexp.rb -m tg_regexp t_ind g_ind transcript_fasta. If you run bam2hits with the -m option, double-check the IDs in the header and main lines of the output hits file before feeding it to mmseq (use the hitstools utility to inspect binary hits files or convert them to text format).

  3. For paired-end data, bam2hits obtains the mode of the insert size distribution from the first 2M proper pairs with unambiguous insert sizes, smoothing the histogram in windows of 3, and sets it to the expected insert size, while the standard deviation is set to 1/1.96 times the distance between the mode and the top 2.5% insert sizes. For single-end data, the mean and standard deviation are set to 180 and 25 respectively. These values can be overridden with the -i expected_isize sd_of_isizes option.

    The normal approximation is used to give transcripts their effective lengths. If you would like to adjust the lengths differently, specify -u alt_lengths in bam2hits, where alt_lengths is the name of a previously produced file containing the transcript IDs and the adjusted lengths in a two column layout.

    Additionally, for paired-end reads, an insert size deviation filter removes alignments with insert sizes beyond 1.96 sd of the mean if alternative alignments exist with insert sizes within 1.96 sd of the mean. The filter can be disabled with -k.

  4. You may correct the transcript lengths for non-uniformity effects with Poisson regression method of Li et al. by specifying the -c alt_lengths option when running bam2hits, where alt_lengths is the name of an output file in which to save the adjusted lengths for future use. This is quite a time-consuming task, so if you have multiple runs with similar non-uniformity biases, you may choose to use the adjusted lengths calculated from one run when processing a subsequent run. This can be achieved by specifying -u alt_lengths in bam2hits, where alt_lengths is the name of a previously produced file containing the transcript IDs and the adjusted lengths in a two column layout. Currently the Li et al. correction and the insert size distribution correction (above) are mutually exclusive, but a combined correction is in the works.

  5. If you are not interested in haplo-isoform estimation and have constructed a new transcript reference with arbitrary-phase haplo-isoform sequences solely to reduce allelic mapping biases, then the alignments to the "_A" and "_B" haplo-isoforms should be merged. This can be done by specifying the -b flag when running bam2hits.

  6. Once you have downloaded a GTF file from Ensembl, run the filterGTF.rb and ensembl_gtf_to_gff.pl scripts to remove GTF entries which are not in the transcript FASTA and convert to GFF. E.g. for human:

    filterGTF.rb Homo_sapiens.GRCh37.64.ref_transcripts.fa Homo_sapiens.GRCh37.64.gtf > Homo_sapiens.GRCh37.64.ref.gtf
    ensembl_gtf_to_gff.pl Homo_sapiens.GRCh37.64.ref.gtf > Homo_sapiens.GRCh37.64.ref.gff
    
  7. Examples of how you might set the PATH and CLASSPATH environment variables to use the programs in this manual:
    # rename bundled Mac OS X binaries
    unzip mmseq_1.0.2.zip && cd mmseq_1.0.2
    mv bam2hits-1.0.2-Darwin-x86_64 bam2hits
    mv mmseq-1.0.2-Darwin-x86_64 mmseq 
    mv mmdiff-1.0.2-Darwin-x86_64 mmmdiff
    #...
    
    # set paths for mmseq and samtools binaries and scripts
    export PATH=$PATH:/home/bob/mmseq_1.0.2:/home/bob/samtools:/home/bob/samtools/misc
    
    # set java path for polyHap2
    export CLASSPATH=$CLASSPATH:/home/bob/polyHap2_20110607/polyHap2.jar
    
  8. You may control the number of threads spawned by mmseq with the OMP_NUM_THREADS environment variable.