Gmix

Gmix is a program for Bayesian analysis of univariate normal mixtures, in which one component has a fixed mean and possible fixed variance. This model is used in the Biometrics paper "A model-based approach for detecting distinctive gene expression profiles in multiclass response microarray experiments", Broët, P., Richardson, S., Lewin, A., Dalmasso, C. and Magdelenat, H. (2004). It was developed from the program Nmix, which implements the approach of Richardson and Green, Journal of the Royal Statistical Society, B, 59, 731-792 (1997).



zipped archive



Instructions (also included among the downloaded files)



INSTALLATION



The package distributed consists of:

      main source file        gmx.f

      auxiliary routines in

            fortran and C     algama.f dgauss4.f pnorm.f dgamma.f xflush.f sd.c

      makefile                Makefile_gmx

      example data files      enz.dat galx.dat lnacid.dat (used in Richardson and Green)

					AACRresiduals.dat (used in Broët et al)

      this file               readme.txt



All files are text files, so can be run under Unix or Windows (subject to correct format for line endings).



---------------------------------------------



RUNNING



Gmix is a Fortran program for Unix-like systems, implementing 

the methodology for univariate normal mixture

analysis described in Richardson and Green 

(J. R. Statist. Soc. B, 1997, 59, 731-792; see also the

correction in J. R. Statist. Soc. B, 1998, 60, 661).

These notes assume familiarity with this paper. 



The model described in the above papers has been extended

 so that one component has fixed mean and possible fixed 

variance. Instead of one parameter k for the number of components,

 there are now two parameters kleft and kright for the numbers of

 components to the left and right respectively of the fixed mean

 component. Either of these parameters can be zero.



The program reads a single main data file, and produces 

multiple output text files summarising the posterior

distribution of the model, for subsequent analysis

and display (R or Splus are good choices for these

tasks). (Only very limited summary information

is computed by this program itself.)



The run of the program is controlled by options, switches

and parameter settings that have sensible default values,

and can be set by command line arguments in Unix shell

style. 



For example, the run on the Breast Cancer dataset in Broët et al,

with run-length of 100000, burn-in of 100000, and minimal output

options could be replicated by:



gmix -range -empty -dfix1 -fixl -kil0 -kpois2 -nb100000 -n100000 AACRresiduals.dat



The principal options are as follows, listed in

several main groups. In this list, N denotes

a numerical value passed in as a parameter of the model

or sampler, other options are logical settings

that are by default off, turned on by specifying the

option.



1. Model options



      -prior      disregard the likelihood and the data (except possibly

                  in setting prior settings, etc.), thus computing the

                  prior instead of the posterior.

      -fixl       suppress dimension-changing moves, thus fixing the

      -fixr       number of mixture components (to the value set by the 

                  -kil and -kir options below).

  

2. Parameter settings



      -range      use range-based hyperparameters (default: don't use them)

      -kpoisN     set kleft,kright poisson(N) (default Uniform, default for N=2)

      -dN         set delta (hyperparameter for weights) (default 1)

      -spN        set s (spacing parameter in prior for means) (default 1)

                  (see rejoinder to discussion of Richardson and Green, page 786)

      -aN         set alpha (hyperparameter for variances) (default 2)

      -bN         set beta (hyperparameter for variances) (default Gamma(g,h))

      -gN         set g (hyperparameter for beta) (default 0.2)

      -hN         set h (hyperparameter for beta) (default 10)



                  The upper limit kmax of the number of components

                  is hard-wired equal to 15 for both kleft and kright in the program 			- it can be changed by editing line 1 of gmx.f (where they are 				termed ncmaxleft,ncmaxright)



3. Monte Carlo sampler options



      -nN         number of sweeps (default 10000)

      -nbN        length of burn-in (default 0)

      -kilN       initial number of components (default 1) similar for right

      -empty      use empty-component birth-death moves

      -seedN      random number seed (default 0, meaning use clock time to 

                  initialise). In the log file for the run, the seed

                  actually used is printed, and the run can be repeated

                  with the same random numbers by specifying that value

                  in this option on the subsequent run.



4. Output options



      -pLETTERS   additional output files, if LETTERS includes

                        d:    .avn .den .dev

                        c:    .pcl .scl

                        w:    .wms.*

                        a:    .z.*

                  (default: none of these)

                  See below for contents of these files.

      -nsN        sets nsamp: number of (equi-spaced) Monte Carlo samples 

                  dumped in .out file (default 100)

      -nspN       sets nspace: spacing in sweeps between successive records 

                  in trace files .ent, .bk, .dev, .wms*, .z* (default 20)

      -debug      turn on verbose output



Input and output file names



The input data should be in a file with extension .dat, 

with the sample size n on the first line, and the data

values y_1, y_2, ..., y_n following in free format, with arbitrary

spaces and newlines.



All output files are given names of the

form N.ext, where N is a unique integer assigned by the program

to label successive runs (it chooses the smallest positive integer

such that N.log does not already exist), and .ext signifies the

type of output the file contains.



The extensions and their meanings are:



(a) normal output:



      .log        logs information about run, parameter settings, 

                  acceptance rates, etc.

      .kl         posterior for kleft, Bayes factors, prior on kleft,

                  posterior for kposleft, empty-component statistics

                  and, optionally, average deviances, etc.  (.kr similar for right)      

      .pe         parameter estimates: posterior means for weights, means 

                  and variances for each component, conditional on each

                  pair of values of kleft,kright

      .out   [*]  sample of output states: k, kpos,

                  weight, mean, variance and number of observations

                  assigned, for each component

      .bdlogl [*] changes to kleft or kposleft: number of sweep (file not 

                  produced when -fix option set) .bdlogr same for right

      .ent [nsp]  k, kpos, entropy



(b) additionally, if beta or kappa are random:



      .bk  [nsp]  beta, kappa, k, kpos



(c) optionally, if -debug specified



      .db         verbose records of all moves of the sampler



(d) optionally, as determined by -p... options (see above)



      .avn        for each (kleft,kright), number of visits to (kleft,kright), average numbers 

                  of observations assigned to each component

      .den        posterior expected density, on a grid of 200 equi-spaced

                  y values (spanning the range of the data, expanded by

                  5% in each direction): output is grid, density, conditional

                  on k=1,2,...,6, and unconditionally

      .dev  [nsp] k, kpos, deviance, modified deviance

      .pcl        predictive classification: for each (kleft,kright), for each j=1,2,..,k-1,

                  for each y* on grid (see .den above), posterior probability 

                  that z*=j

      .scl        within-sample classification: for each (kleft,kright), for each j=1,2,..,k-1,

                  for each i = 1,2,...,n, posterior probability that z(i)=j

      .wms.K [nsp]in a separate file for each k, current values for weights,

                  means and variances of each component j=1,2,...,k (thus each

                  sample record is of k consecutive lines)

      .z.K  [nsp] in a separate file for each k, current values for allocations

                  z(i),i=1,...,n



key:  kpos = number of non-empty components

      [nsp] - trace files with output dumped every nspace sweeps (see

                  -nsp option above)

      [*]   - other trace files

      other output files contain information aggregated over the run.



Gmix is free of charge for educational and non-commercial research purposes.