Bayesian Integrative Genomics

Phase I: Funded by the BBSRC Exploiting Genomics initiative, from May 2002 to February 2007.

Phase II: A collaborative programme started in 2008, coordinated by Prof Sylvia Richardson.

MMSEQ: haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads

MMSEQ has a new home on GitHub and this page is now outdated.

The flowchart to the right depicts the MMSEQ pipeline for obtaining expression estimates from RNA-seq data. There are two routes, with starting points labelled A and B. Route A is quite fast and straightforward to run and uses pre-existing transcript sequences for alignment. Route B requires more time, as it involves the creation of custom transcript sequences based on the data. The two routes are described below. Basic knowledge of how to work in a unix-based environment is required. Route A may be run through the R-based ArrayExpressHTS pipeline. Please cite Turro et al. 2011 (Genome Biology) if you use MMSEQ in your work.

Recent news:

2013-04-26: MMSEQ 1.0.2 released, better prior for eta in mmdiff (Inverse-Gamma)
2013-04-08: MMSEQ 1.0.1 released, including new binaries mmcollapse and mmdiff (publication forthcoming)
2012-12-06: MMSEQ 1.0.0 beta released (more mmseq output: estimates for features even if they have no reads, output posterior standard deviations, whether or not features have been observed, etc. Improved normalisation procedure in readmmseq.R and mmdiff (using only features with unique hits in a minimum proportion of samples). Beta versions of mmcollapse and mmdiff programs. Inclusion of t2g_hits program.)
2012-05-28: MMSEQ 0.11.2 released (output mean posterior isoform/gene expression proportions)
2012-04-19: MMSEQ 0.11.1 released (output true sequence lengths in addition to effective transcript lengths)
2012-04-16: MMSEQ 0.11.0 released (bam2hits updates: print nice ASCII insert size histogram to check it's roughly normal; -i option now uses expected_isize and sd_of_isizes (instead of deviation_thres); unless adjusted lengths are provided or calculated with mseq (options -c and -u), transcript lengths are adjusted using the insert size distribution - the prior fragment distribution given the transcript is assumed to be uniform up to the transcript length within the main support of the insert size distribution; proper repetitive transcript filter applied before simplified isize deviation filter; deal gracefully with truncated hits files)
2012-03-26: MMSEQ 0.10.0 released (new hitsio library by Jürgen Jänes results in ~7x reduction in hits file sizes - inspection of compressed binary hits files and conversion to plain text can be done with the hitstools utility; fixed default regex to work with new (>v64) Ensembl fasta entry naming convention; more information in *.mmseq tables; output gzipped MCMC traces; improved readmmseq.R script)
2011-12-16: MMSEQ 0.9.18 released (the *.mmseq files now contain all the transcripts or genes, regardless of whether they have counts; improved R script to read in the output (readmmseq.R); fixed potential integer overflow if gibbs_iter set too high; default gibbs_iter doubled to 2^14, though there can be a noticeable reduction in autocorrelation up to 2^17 iterations)
2011-11-07: MMSEQ 0.9.17 released (misc. improvements and small bugfix in bam2hits code for identifying read mates)
2011-10-21: MMSEQ 0.9.16 released (output mean(log) rather than log(mean) of expression parameter; output expression-weighted sum of transcript lengths in gene files; misc other changes)
2011-10-17: MMSEQ 0.9.15 released (automatically set insert size deviation filter parameters; remove read length - 1 from effective transcript lengths; added extract_transcripts program to produce sequences from genome FASTA and Ensembl-like GTF files)
2011-09-22: MMSEQ 0.9.14 released (workaround for bug in Mac gcc that causes omp_get_max_threads() to incorrectly return 1 if called from a parallel region - estimates obtained on Mac with previous versions of mmseq are probably invalid)
2011-09-20: MMSEQ 0.9.13 released (changed default prior on mu to Gamma(0.1,0.1), which appears to reduce upwards bias for lowly expressed transcripts; all mmseq parameters can now be controlled from the command line; include mmseq2counts.R for DE analysis)
2011-08-15: MMSEQ 0.9.12 released (bug fix for potential seg fault)
2011-06-12: MMSEQ now available on Fedora 14/15 and EPEL 5/6 (yum install mmseq). Many thanks to Adam Huffman.
2011-06-07: new version of polyHap2 released with support for random assignment of alleles to haplotypes
2011-05-27: MMSEQ 0.9.11 released (fixed bug in insert size deviation filter in bam2hits)
2011-05-04: MMSEQ 0.9.10c released (ensured haploref.rb works properly with custom regular expressions - same behaviour as testregexp.rb and bam2hits; fixed bug in identical transcript/gene amalgamation code. Prior to 0.9.10, incorrect values outputted in *.amalgamated.mmseq and *.gene.mmseq files if one of the transcripts had zero hits and no values outputted in *.gene.mmseq for genes with log Gibbs mean < 0)
2011-03-18: MMSEQ 0.9.9 released (better integration with 'mseq'; deal gracefully with Ns in counts file; harmless buffer overflow bugfix in bam2hits)
2011-02-10: MMSEQ paper published in Genome Biology
2010-10-19: MMSEQ 0.9.8 released

ROUTE A: use a ready-made transcript FASTA file consisting of isoform or haplo-isoform sequences

Software requirements:
- Bowtie 0.12.9 (not Bowtie 2)
- SAMtools
- MMSEQ 1.0.2 (see Note 1) (old version 0.11.2a)
MMSEQ consists of two main binaries (bam2hits and mmseq) and several supporting utilities (binaries and Ruby, Perl and bash scripts). Make sure the directories containing the Bowtie, SAMtools and MMSEQ binaries and scripts are in your PATH, otherwise some of the commands in this manual will not work (see Note 8).
Input files needed:
- Bowtie-compatible (e.g. FASTQ) files containing reads from your experiment.
- A FASTA file containing transcript sequences, which will be used to create the Bowtie indexes. E.g. one of the following:
  - A FASTA file containing isoform or haplo-isoform sequences created by yourself.
  - A FASTA file containing cDNA and ncRNA Ensembl transcript sequences, but excluding alternative haplotype/supercontig entries. See Note 2 for instructions on how to produce such a file. Alternatively, you may download a ready-made filtered transcript FASTA file for Human (Ensembl v64, gzipped). If you sequenced polyadenylated transcripts only, include only the cDNA (not the ncRNA) transcript sequences in the FASTA file.
  - If you have your own GTF file, you can extract the transcript sequences from a genome FASTA file using the extract_transcripts program included in the MMSEQ package, but note that it only works with Ensembl-like GTF files. Usage: extract_transcripts genome_fasta ensembl_gtf_file > transcript_fasta.
  - If you are working with C57 or CAST inbred mice, you can download the UCSC isoform FASTA files for C57 and CAST mice and haplo-isoform FASTA file for C57xCAST crosses (produced using the Gregg et al. SNP calls on E15 embryo brains), but note that there are no gene IDs in these entries so you must use -m "(.*)" 1 1 in bam2hits (see below), which ignores the gene level.
  - N.B.: make sure the FASTA entries do not contain any tab characters.
Running the pipeline:
1. Create Bowtie indexes for your FASTA file using bowtie-build. It is advisable to use a lower-than-default value for --offrate (such as 2 or 3) as long as the resulting index fits in memory. E.g. if you're using the ready-made filtered file for Human, try:
```
gunzip Homo_sapiens.GRCh37.64.ref_transcripts.fa.gz
bowtie-build --offrate 3 Homo_sapiens.GRCh37.64.ref_transcripts.fa Homo_sapiens.GRCh37.64.ref_transcripts
```
2. Optional: If your insert size distribution overlaps your read length (pity - you should aim for mean insert length > read length), you might consider trimming back the reads to exclude adapter sequences. I recommend using Trim Galore! for paired-end sequencing. E.g.:
```
trim_galore --phred64 -q 15 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACTAACAAGATCTCGTATGCCGTCTTCTGCTTG \
  -a2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -s 3 -e 0.05 \
  --length 36 --trim1 --gzip --paired file_1.fq.gz file_2.fq.gz
```
  (TruSeq index highlighted in blue).
3. Generate BAM file of alignments using Bowtie. Mandatory flags are -a --best --strata -S (you absolutely must remember to specify -a to ensure you get multi-mapping alignments). Suppress alignments for reads that map to a huge number of transcripts with the -m option (e.g. -m 100). Other flags/options may be needed depending on your data (e.g. -I and -X for the minimum and maximum insert size and/or --norc if you are using a forward-stranded protocol). If your reference FASTA file doesn't use the vertebrate Ensembl naming convention, then you must specify --fullref. If you are using barcoding and the read names of your second mates end with /3 or /4, you need to change your FASTQ files so that they end with /2 instead because bowtie only trims read names ending with /1 or /2 (awk 'FNR % 4==1 { sub(/\/[34]$/, "/2") } { print }' secondreads.fq > secondreads-new.fq should do the trick). If you have multiple FASTQ files from the same library, feed them all together to Bowtie in one go (separate the FASTQ file names with commas). If you are getting many "Exhausted best-first chunk memory" warnings, try increasing --chunkmbs to 128 or 256. If the read names contain any spaces, make sure the substring up to the first space in each read is unique, as Bowtie strips anything after a space in a read name. The output BAM file must be sorted by read name. With paired-end data, only pairs where both reads have been aligned are used so you can use the samtools 0xC filtering flag to reduce the size of your output. E.g. to align the paired files 3125_7_1.fastq and 3125_7_2.fastq (from the Montgomery et al. HapMap data):
```
bowtie -a --best --strata -S -m 100 -X 400 --chunkmbs 256 -p 8 Homo_sapiens.GRCh37.64.ref_transcripts \
  -1 3125_7_1.fastq -2 3125_7_2.fastq | samtools view -F 0xC -bS - | samtools sort -n - 3125_7.namesorted
```
4. Generate mappings from reads to transcript sets with bam2hits. Basic usage: bam2hits transcript_fasta namesorted_bam > hits_file (run bam2hits with no arguments for full usage instructions and see Notes 3, 4, 5 and 6 for additional guidance). E.g.:
```
bam2hits Homo_sapiens.GRCh37.64.ref_transcripts.fa 3125_7.namesorted.bam > 3125_7.hits
```
  The hits file can be inspected and converted to plain text using the hitstools utility.
5. Run the mmseq program. Basic usage: mmseq hits_file output_base (run mmseq without arguments for full usage instructions). E.g.:
```
mmseq 3125_7.hits 3125_7
```
Description of output:
- output_base.mmseq contains a table with columns:
  1. feature_id: name of transcript
  2. log_mu: posterior mean of the log_e expression parameter (use this as your log expression measure)
  3. sd: posterior standard deviation of the log_e expression parameter
  4. mcse: Monte Carlo standard error
  5. iact: integrated autocorrelation time
  6. effective_length: effective transcript length
  7. true_length: length of transcript sequence
  8. unique_hits: number of reads uniquely mapping to the transcript
  9. mean_proportion: posterior mean isoform/gene proportion
  10. mean_probit_proportion: posterior mean of the probit-transformed isoform/gene proportion
  11. sd_probit_proportion: posterior standard deviation of the probit-transformed isoform/gene proportion
  12. log_mu_em: log-scale transcript-level EM estimate
  13. observed: whether or not a feature has hits
  14. ntranscripts: number of isoforms for the gene of that transcript
- output_base.identical.mmseq: as above but aggregated over transcripts sharing the same sequence (these estimates are usually far more precise than the corresponding individual estimates in the transcript-level table). Note that log_mu_em and proportion summaries are not available for aggregates.
- output_base.gene.mmseq: as above but aggregated over genes and the effective_length is an average of isoform effective lengths weighted by their expression
- Various other files are output (output_base.*.trace_gibbs.gz, output_base.M and output_base.k). These will be useful for downstream analysis in future, but they can be ignored for now.
MMSEQ operates on a sample-by-sample basis, and thus the expression estimates are roughly proportional to the RNA concentrations in each sample. Some scaling of the estimates may be required to make them comparable across biological replicates and conditions. The posterior standard deviations capture the uncertainty due to Poisson counting noise and due to the ambiguity in the mappings between reads and transcripts. The biological variance between samples can only be discerned with the use of biological replicates (see section on differential expression below).

ROUTE B: create new transcript FASTA files using your data

Software requirements:
- Bowtie 0.12.9 (not Bowtie 2)
- TopHat ≥ 1.3
- SAMtools ≥ 0.1.10
- polyHap2 (a Java program)
- MMSEQ 1.0.2 (see Note 1) (old version 0.11.2a)
MMSEQ consists of three binaries (bam2hits, mmseq and extract_transcripts) and several Ruby, R, Perl and bash scripts. Make sure the directories containing the Bowtie, TopHat, SAMtools and MMSEQ binaries and scripts are in your PATH, otherwise some of the commands in this manual will not work. Also, add the path to polyHap2.jar to the CLASSPATH, otherwise the Java commands in this manual will not work (see Note 8).
Input files needed:
- Bowtie-compatible (e.g. FASTQ) files containing reads from your experiment.
- An Ensembl transcript FASTA file for transcriptome alignment. If necessary, be sure to combine the cDNA and ncRNA files and filter out alternative haplotype/supercontig entries using the fastagrep.sh script included with the MMSEQ package (see Note 2). Alternatively, use the ready-made filtered transcript FASTA file for Human (Ensembl v64, gzipped).
- An Ensembl genome FASTA file for genome alignment. If necessary, be sure to filter out alternative haplotype/supercontig/duplicate Y entries using the fastagrep.sh script included with the MMSEQ package (see Note 2). Alternatively, use the ready-made filtered genome FASTA file for Human (Ensembl v64, gzipped).
- A GTF file containing exon/transcript/gene annotations. Filter out transcripts in the GTF file that are not in the transcript FASTA file and convert to GFF format (see Note 7). Alternatively, use the ready-made filtered GFF file for Human (Ensembl v64, gzipped). The GFF file must contain a nested gene → mRNA → exon structure in the third column (as in the Ensembl-derived files).
Running the pipeline:
1. Create Bowtie indexes for your genome FASTA file using bowtie-build. E.g. bowtie-build -f Homo_sapiens.GRCh37.64.dna.toplevel.ref.fa Homo_sapiens.GRCh37.64.dna.toplevel.ref. Alternatively, download ready-made Bowtie indexes for Human (Ensembl v64 (GRCh37/hg19), gzipped).
2. For each sample, align your reads to the genome using TopHat and the GFF file with options --no-novel-juncs --min-isoform-fraction 0.0 --min-anchor-length 3. The output directory should be specified with the -o option. This example is based on the Montgomery et al. HapMap dataset:
```
tophat -G Homo_sapiens.GRCh37.64.ref.gff --no-novel-juncs --min-isoform-fraction 0.0 --min-anchor-length 3 \
       -r 192 -p 8 -o 3125_7.tophat Homo_sapiens.GRCh37.64.dna.toplevel.ref 3125_7_1.fastq 3125_7_2.fastq 
```
3. Call SNPs using SAMtools mpileup and save as a filtered VCF file (see documentation). E.g. from the main directory:
```
SBAMS=( `find . -wholename "*.tophat/accepted_hits.bam"` )
samtools mpileup -ugf Homo_sapiens.GRCh37.64.dna.toplevel.ref.fa ${SBAMS[@]} | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter > var.flt.vcf
```
4. Import SNPs for use with polyHap2. You need to specify the path to the GFF annotation file used in the alignment, the filtered VCF file produced above and a temporary directory in which to save files. E.g.:
```
java dataFormat.ImportSNPs Homo_sapiens.GRCh37.64.ref.gff var.flt.vcf ph2temp 
```
  (If there are any '.' or '/' characters in the sample names in the VCF header, be sure to remove them before running the above command, as polyHap2 currently fails otherwise.) If you are working with a small number of human samples (< 10), it may be worth improving your phasing by combining the VCF file [coming soon] obtained from the 60 CEU samples in the Montgomery et al. HapMap dataset with your own VCF file, instead of using only your own data (see merge-vcf). However, if you intend to do arbitrary phasing just to reduce allelic mapping biases, then there is no need for this (see below).
5. To phase the genotypes, run the phase.sh script in the temporary directory. E.g.:
```
./ph2temp/phase.sh
```
  Note that commands listed in phase.sh may safely be run simultaneously if you have access to many CPUs.
  Alternatively, if you are only interested in reducing allelic mapping biases and not in haplo-isoform expression, run the phase.sh script with the -r option to assign the alleles to the two haplotypes without phasing (REF alleles are assigned to the _A haplotype and ALT alleles to the _B haplotype). This option may be used even when the number of samples is as low as 1.
6. Create the ATCG-based phased haplotypes for each transcript, saving the files in a polyHap2 output directory. You need to specify the path to the GFF annotation file used in the alignment, the path to the temporary directory created above and an output directory. E.g.:
```
java dataFormat.ConvertABtoATGC Homo_sapiens.GRCh37.64.ref.gff ph2temp ph2output
```
  The files produced are a SNPinfo position file and sample-specific haplotype files containing the two haplotypes (suffixed _A and _B) of each transcript.
7. Enter the polyHap2 output directory and construct custom transcriptome reference FASTA files, one for each sample, using the SNPinfo position file and the sample-specific haplotype files. This is done with the haploref.rb script. Usage: haploref.rb transcript_fasta gff_file pos_file hap_file > hapiso_fasta. E.g. for sample 3125_7:
```
haploref.rb ../Homo_sapiens.GRCh37.64.ref_transcripts.fa ../Homo_sapiens.GRCh37.64.ref.gff SNPinfo 3125_7.hap > 3125_7.fa
```
  Now you have a separate haplo-isoform FASTA file for each sample. For each sample, follow the Route A pipeline using the new customised FASTA file (in the example above, 3125_7.fa) as the reference. If you just did arbitrary phasing to reduce allelic mapping biases, bear in mind Note 6.

Differential expression analysis with mmdiff

The mmdiff binary allows you to perform model comparison using the MMSEQ output. You can run mmdiff as follows:

mmdiff [OPTIONS...] matrices_file mmseq_file1 mmseq_file2... > out.mmdiff

Set the prior probability that the second model is true with -p FLOAT. Use the option -useprops to model summaries of the posterior isoform proportions rather than the expression levels from MMSEQ. Run the binary without arguments for more options.

The matrices files determines which models you wish to compare. You must define four matrices: M is a model-independent covariate matrix, C maps observations to classes for each model, P0(collapsed) is a collapsed matrix (i.e. distinct rows) defining statistical model 0 and P1(collapsed) is a collapsed matrix defining statistical model 1. If a model has no classes (i.e. just an intercept term) use 1 for the P matrix. E.g. for differential expression between two groups of 4 samples without extraneous variables the following matrices file would be appropriate:


# M; no. of rows = no. of observations
0
0
0
0
0
0
0
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 0
0 0
0 1
0 1
0 1
0 1
# P0(collapsed); no. of rows = no. of classes for model 0
1
# P1(collapsed); no. of rows = no. of classes for model 1
.5
-.5

For a three-way differential expression analysis with three, three and two observations per group respectively, the following matrices file would be appropriate:


# M; no. of rows = no. of observations
0
0
0
0
0
0
0
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 0
0 1
0 1
0 1
0 2
0 2
# P0(collapsed); no. of rows = no. of classes for model 0
1
# P1(collapsed); no. of rows = no. of classes for model 1
1 0 0
0 1 0
0 0 1

In order to assess whether the log fold change between group A and group B is different to the log fold change between group C and group D, assuming there are two observations per group, the following matrices file would be appropriate:


# M; no. of rows = no. of observations
0
0
0
0
0
0
0
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 1
0 1
1 2
1 2
1 3
1 3
# P0(collapsed); no. of rows = no. of classes for model 0
.5
-.5
# P1(collapsed); no. of rows = no. of classes for model 1
.5 .5
.5 -.5
-.5 .5
-.5 -.5

If you require advice on setting up the matrices file for your particular study design, do not hesitate to contact the corresponding author.

Description of the output:

feature_id: name of feature (e.g. transcript)
bayes_factor: the Bayes factor in favour of the second model
posterior_probability: the posterior probability in favour of the second model (the prior probability is recorded in a # comment at the top of the file)
alpha0 and alpha1: posterior mean estimates of the intercepts for each model
eta0_0, eta0_1..., eta1_0, eta1_1...: posterior mean estimates of the regression coefficients under each of the models
mu_sample1, mu_sample2,... sd_sample1, sd_sample2,...: the data, i.e. the posterior means and standard deviations used as the outcomes

Differential expression analysis with edgeR or DESeq

The MMSEQ expression estimates are in FPKM units (fragments per kilobase of transcript per million mapped reads or read pairs), which makes different samples broadly comparable. You may read in the output from multiple samples using the readmmseq.R R script included in the MMSEQ package (requires R ≥ 2.9). The readmmseq function takes three arguments:

mmseq_files: vector of *.mmseq output files to read in
sample_names: vector of strings containing the sample names
normalize: logical scalar specifying whether to normalise the log expression values and log count equivalents for each sample by their median deviation from the mean (like DESeq).

The function returns a list with the following elements:

log_mu_gibbs: the expression estimates
mcse: the standard errors
counts: the count equivalents
unique_hits: the number of unique hits per feature
effective_length: the effective lengths, which differ across samples
true_length: the actual lengths of the transcript sequences

To test for differential expression with edgeR or DESeq, the count equivalents need to be used. E.g., to test for DE between two groups of two samples, you might run the following code in R from the directory containing the mmseq output files:


source("/path/to/readmmseq.R")
library(edgeR)
library(DESeq) # version >=1.5

ms = readmmseq()
tab = ms$counts[apply(ms$counts,1,function(x) any(round(x)>0)),] # only keep features with counts

# edgeR 
d = DGEList(counts = tab, group = c("group1", "group1", "group2","group2"))
d = estimateCommonDisp(d)
de.edgeR = exactTest(d)

# DESeq
cds = newCountDataSet(round(tab), c("group1", "group1", "group2", "group2") )
sizeFactors(cds) = rep(1, dim(cds)[2])
cds = estimateDispersions(cds)
de.DESeq = nbinomTest(cds,"group1", "group2")

Collapsing transcripts with mmcollapse

The mmcollapse binary takes the outputs from mmseq and collapses sets of transcripts based on posterior correlation estimates. The overall precision of the expression estimates for a collapsed set is usually much greater than for the individual component transcripts, which typically share important structural features (many shared exons, etc). Usage:


mmcollapse basename1 basename2...

Here, basename is the name of the mmseq output files without the suffixes (.mmseq, .identical.mmseq, .gene.mmseq). The output is a set of new MMSEQ files with suffix .collapsed.mmseq.

Notes

- The MMSEQ package comes with statically-linked binaries for 64-bit Mac OS X and GNU/Linux and should work out of the box on most systems. On Fedora 14/15 and EPEL 5/6 systems, you can install the package simply by running yum install mmseq. If you prefer to compile the binaries yourself, you will need the following dependencies:
  - Boost C++ Libraries. They are easy to install with a package manager:
    - Mac OS X+MacPorts: port install boost
    - Mac OS X+Homebrew: brew install boost; ln -s /usr/local/lib/libboost_regex-mt.a /usr/local/lib/libboost_regex.a; ln -s /usr/local/lib/libboost_iostreams-mt.a /usr/local/lib/libboost_iostreams.a
    - Ubuntu/Debian/etc: apt-get install libboost-all-dev
    - Fedora/CentOS/Red Hat/etc: yum install boost (or, on older systems, yum install boost141)
  - GNU Scientific Library. It is easy to install with a package manager:
    - Mac OS X+MacPorts: port install gsl
    - Mac OS X+Homebrew: brew install gsl
    - Ubuntu/Debian/etc: apt-get install libgsl0-dev
    - Fedora/CentOS/Red Hat/etc: yum install gsl
  - SAMtools library. It is easy to install with a package manager:
    - Mac OS X+Homebrew: brew install samtools; export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/include/bam
  To compile the binaries in the MMSEQ package with GCC, simply run make. Due to a bug in the default version of GCC that comes with XCode for Mac OS X, you need to use a more recent version of GCC. If the Boost, GSL or SAMtools libraries aren't found by the compiler, add the path(s) containing the header files and libraries respectively to the LIBRARY_PATH and CPLUS_INCLUDE_PATH environment variables. E.g. to add the path to SAMtools (which should contain sam.h and libbam.a, among other files):
```
export CPLUS_INCLUDE_PATH=/home/bob/seq/samtools-0.1.17:$CPLUS_INCLUDE_PATH
export LIBRARY_PATH=/home/bob/seq/samtools-0.1.17:$LIBRARY_PATH
```
  If you still can't compile and you can't or don't want to use the pre-packaged binaries, please contact the corresponding author on the paper.
- You must also make sure you have a Ruby interpreter to run the Ruby scripts. It comes bundled with Mac OS X and often with GNU/Linux. It is easy to install on the latter with a package manager:
  - Ubuntu/Debian/etc: apt-get install ruby-full
  - Fedora/CentOS/Red Hat/etc: yum install ruby
- If you wish to correct the transcript lengths for non-uniformity effects using the Poisson regression method of Li et al., you must install the R package mseq.

fastagrep.sh is similar to unix grep except it works on multi-line sequence entries in a FASTA file. E.g., to remove alternative haplotype/supercontig entries from the Ensembl DNA, cDNA and ncRNA FASTA files and combine the filtered cDNA and ncRNA files:

fastagrep.sh -v 'supercontig|GRCh37:Y:1:10000|GRCh37:[^1-9XMY]' Homo_sapiens.GRCh37.64.dna.toplevel.fa > Homo_sapiens.GRCh37.64.dna.toplevel.ref.fa
fastagrep.sh -v 'supercontig|GRCh37:[^1-9XMY]' Homo_sapiens.GRCh37.64.cdna.all.fa > Homo_sapiens.GRCh37.64.cdna.ref.fa
fastagrep.sh -v 'supercontig|GRCh37:[^1-9XMY]' Homo_sapiens.GRCh37.64.ncrna.fa > Homo_sapiens.GRCh37.64.ncrna.ref.fa
cp Homo_sapiens.GRCh37.64.cdna.ref.fa Homo_sapiens.GRCh37.64.ref_transcripts.fa
cat Homo_sapiens.GRCh37.64.ncrna.ref.fa >> Homo_sapiens.GRCh37.64.ref_transcripts.fa

(note how this gets rid of the pesky 'GRCh37:Y:1:10000' entry in Ensembl human DNA FASTA files, which only contains Ns and breaks samtools indexing)

It is possible to use reference transcript FASTA files with entries that are formatted differently than the cDNA and ncRNA files provided by Ensembl. To do this, tell bam2hits how to access the transcript IDs and the gene IDs in the reference entries using the -m tg_regexp t_ind g_ind option. tg_regexp specifies a regular expression that matches FASTA entries (excluding the >) where pairs of brackets are used to capture transcript and gene IDs. t_ind and g_ind are indexes for which pair of brackets capture the transcript and gene IDs respectively. E.g. for the human Ensembl file, entries look like this:
```
>ENST00000416067 cdna:known chromosome:GRCh37:21:9907194:9968585:-1 gene:ENSG00000227328
```
so (assuming you ran bowtie with --fullref) you can tell bam2hits how to extract the transcript and gene IDs like this:
```
bam2hits -m "(E\S+).*gene:(E\S+).*" 1 2 Homo_sapiens.GRCh37.64.ref_transcripts.fa 3125_7.sam > 3125_7.hits
```
If you are unsure what regular expression to use, try out by trial and error using the testregexp.rb script. Usage: testregexp.rb -m tg_regexp t_ind g_ind transcript_fasta. If you run bam2hits with the -m option, double-check the IDs in the header and main lines of the output hits file before feeding it to mmseq (use the hitstools utility to inspect binary hits files or convert them to text format).
For paired-end data, bam2hits obtains the mode of the insert size distribution from the first 2M proper pairs with unambiguous insert sizes, smoothing the histogram in windows of 3, and sets it to the expected insert size, while the standard deviation is set to 1/1.96 times the distance between the mode and the top 2.5% insert sizes. For single-end data, the mean and standard deviation are set to 180 and 25 respectively. These values can be overridden with the -i expected_isize sd_of_isizes option.

The normal approximation is used to give transcripts their effective lengths. If you would like to adjust the lengths differently, specify -u alt_lengths in bam2hits, where alt_lengths is the name of a previously produced file containing the transcript IDs and the adjusted lengths in a two column layout.

Additionally, for paired-end reads, an insert size deviation filter removes alignments with insert sizes beyond 1.96 sd of the mean if alternative alignments exist with insert sizes within 1.96 sd of the mean. The filter can be disabled with -k.
You may correct the transcript lengths for non-uniformity effects with Poisson regression method of Li et al. by specifying the -c alt_lengths option when running bam2hits, where alt_lengths is the name of an output file in which to save the adjusted lengths for future use. This is quite a time-consuming task, so if you have multiple runs with similar non-uniformity biases, you may choose to use the adjusted lengths calculated from one run when processing a subsequent run. This can be achieved by specifying -u alt_lengths in bam2hits, where alt_lengths is the name of a previously produced file containing the transcript IDs and the adjusted lengths in a two column layout. Currently the Li et al. correction and the insert size distribution correction (above) are mutually exclusive, but a combined correction is in the works.
If you are not interested in haplo-isoform estimation and have constructed a new transcript reference with arbitrary-phase haplo-isoform sequences solely to reduce allelic mapping biases, then the alignments to the "_A" and "_B" haplo-isoforms should be merged. This can be done by specifying the -b flag when running bam2hits.

Once you have downloaded a GTF file from Ensembl, run the filterGTF.rb and ensembl_gtf_to_gff.pl scripts to remove GTF entries which are not in the transcript FASTA and convert to GFF. E.g. for human:

filterGTF.rb Homo_sapiens.GRCh37.64.ref_transcripts.fa Homo_sapiens.GRCh37.64.gtf > Homo_sapiens.GRCh37.64.ref.gtf
ensembl_gtf_to_gff.pl Homo_sapiens.GRCh37.64.ref.gtf > Homo_sapiens.GRCh37.64.ref.gff

Examples of how you might set the PATH and CLASSPATH environment variables to use the programs in this manual:

# rename bundled Mac OS X binaries
unzip mmseq_1.0.2.zip && cd mmseq_1.0.2
mv bam2hits-1.0.2-Darwin-x86_64 bam2hits
mv mmseq-1.0.2-Darwin-x86_64 mmseq 
mv mmdiff-1.0.2-Darwin-x86_64 mmmdiff
#...

# set paths for mmseq and samtools binaries and scripts
export PATH=$PATH:/home/bob/mmseq_1.0.2:/home/bob/samtools:/home/bob/samtools/misc

# set java path for polyHap2
export CLASSPATH=$CLASSPATH:/home/bob/polyHap2_20110607/polyHap2.jar

You may control the number of threads spawned by mmseq with the OMP_NUM_THREADS environment variable.