INSTALLATION The package distributed consists of: main source file gmx.f auxiliary routines in fortran and C algama.f dgauss4.f pnorm.f dgamma.f xflush.f sd.c makefile Makefile_gmx example data files enz.dat galx.dat lnacid.dat (used in Richardson and Green) AACRresiduals.dat (used in Broët et al) this file readme.txt All files are text files, so can be run under Unix or Windows (subject to correct format for line endings). --------------------------------------------- RUNNING Gmix is a Fortran program for Unix-like systems, implementing the methodology for univariate normal mixture analysis described in Richardson and Green (J. R. Statist. Soc. B, 1997, 59, 731-792; see also the correction in J. R. Statist. Soc. B, 1998, 60, 661). These notes assume familiarity with this paper. The model described in the above papers has been extended so that one component has fixed mean and possible fixed variance. Instead of one parameter k for the number of components, there are now two parameters kleft and kright for the numbers of components to the left and right respectively of the fixed mean component. Either of these parameters can be zero. The program reads a single main data file, and produces multiple output text files summarising the posterior distribution of the model, for subsequent analysis and display (R or Splus are good choices for these tasks). (Only very limited summary information is computed by this program itself.) The run of the program is controlled by options, switches and parameter settings that have sensible default values, and can be set by command line arguments in Unix shell style. For example, the run on the Breast Cancer dataset in Broët et al, with run-length of 100000, burn-in of 100000, and minimal output options could be replicated by: gmix -range -empty -dfix1 -fixl -kil0 -kpois2 -nb100000 -n100000 AACRresiduals.dat The principal options are as follows, listed in several main groups. In this list, N denotes a numerical value passed in as a parameter of the model or sampler, other options are logical settings that are by default off, turned on by specifying the option. 1. Model options -prior disregard the likelihood and the data (except possibly in setting prior settings, etc.), thus computing the prior instead of the posterior. -fixl suppress dimension-changing moves, thus fixing the -fixr number of mixture components (to the value set by the -kil and -kir options below). 2. Parameter settings -range use range-based hyperparameters (default: don't use them) -kpoisN set kleft,kright poisson(N) (default Uniform, default for N=2) -dN set delta (hyperparameter for weights) (default 1) -spN set s (spacing parameter in prior for means) (default 1) (see rejoinder to discussion of Richardson and Green, page 786) -aN set alpha (hyperparameter for variances) (default 2) -bN set beta (hyperparameter for variances) (default Gamma(g,h)) -gN set g (hyperparameter for beta) (default 0.2) -hN set h (hyperparameter for beta) (default 10) The upper limit kmax of the number of components is hard-wired equal to 15 for both kleft and kright in the program - it can be changed by editing line 1 of gmx.f (where they are termed ncmaxleft,ncmaxright) 3. Monte Carlo sampler options -nN number of sweeps (default 10000) -nbN length of burn-in (default 0) -kilN initial number of components (default 1) similar for right -empty use empty-component birth-death moves -seedN random number seed (default 0, meaning use clock time to initialise). In the log file for the run, the seed actually used is printed, and the run can be repeated with the same random numbers by specifying that value in this option on the subsequent run. 4. Output options -pLETTERS additional output files, if LETTERS includes d: .avn .den .dev c: .pcl .scl w: .wms.* a: .z.* (default: none of these) See below for contents of these files. -nsN sets nsamp: number of (equi-spaced) Monte Carlo samples dumped in .out file (default 100) -nspN sets nspace: spacing in sweeps between successive records in trace files .ent, .bk, .dev, .wms*, .z* (default 20) -debug turn on verbose output Input and output file names The input data should be in a file with extension .dat, with the sample size n on the first line, and the data values y_1, y_2, ..., y_n following in free format, with arbitrary spaces and newlines. All output files are given names of the form N.ext, where N is a unique integer assigned by the program to label successive runs (it chooses the smallest positive integer such that N.log does not already exist), and .ext signifies the type of output the file contains. The extensions and their meanings are: (a) normal output: .log logs information about run, parameter settings, acceptance rates, etc. .kl posterior for kleft, Bayes factors, prior on kleft, posterior for kposleft, empty-component statistics and, optionally, average deviances, etc. (.kr similar for right) .pe parameter estimates: posterior means for weights, means and variances for each component, conditional on each pair of values of kleft,kright .out [*] sample of output states: k, kpos, weight, mean, variance and number of observations assigned, for each component .bdlogl [*] changes to kleft or kposleft: number of sweep (file not produced when -fix option set) .bdlogr same for right .ent [nsp] k, kpos, entropy (b) additionally, if beta or kappa are random: .bk [nsp] beta, kappa, k, kpos (c) optionally, if -debug specified .db verbose records of all moves of the sampler (d) optionally, as determined by -p... options (see above) .avn for each (kleft,kright), number of visits to (kleft,kright), average numbers of observations assigned to each component .den posterior expected density, on a grid of 200 equi-spaced y values (spanning the range of the data, expanded by 5% in each direction): output is grid, density, conditional on k=1,2,...,6, and unconditionally .dev [nsp] k, kpos, deviance, modified deviance .pcl predictive classification: for each (kleft,kright), for each j=1,2,..,k-1, for each y* on grid (see .den above), posterior probability that z*=j .scl within-sample classification: for each (kleft,kright), for each j=1,2,..,k-1, for each i = 1,2,...,n, posterior probability that z(i)=j .wms.K [nsp]in a separate file for each k, current values for weights, means and variances of each component j=1,2,...,k (thus each sample record is of k consecutive lines) .z.K [nsp] in a separate file for each k, current values for allocations z(i),i=1,...,n key: kpos = number of non-empty components [nsp] - trace files with output dumped every nspace sweeps (see -nsp option above) [*] - other trace files other output files contain information aggregated over the run.
Gmix is free of charge for educational and non-commercial research purposes.