TODO.txt



$Header$

GMTK TODO List/Memory

Other things that need to be done, or just other ideas that might be
incorporated into the package at some point. Things that are already
done are marked with an "X" on the left of the item. When an item is
added, please add the date to the left of the list item. Note that
some items below might be redundant as they were added at very
different times.

--------------------------------------------

X Low level file parser should support
   both integer and named pointers to
   the parameters that they share.
   (when they write out the parameters, they
    should write out the form that was written in).
    
X simple +* equations in decision tree leaf nodes,
  using variables vi and ci (e.g.,
  v0, v1, v2, ... , and c0, c1, c2, ...) for values
  and cardinalities of parent variables.
  exs:  c0*v0 + v1
        c0 + 1
        3
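
  A minimal sketch of how leaf formulas like the ones above could be
  evaluated, assuming they contain only non-negative integers, vN, cN,
  '+' and '*' (no parentheses). LeafFormula and evalLeaf are
  hypothetical names for illustration, not GMTK's actual parser.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical evaluator (not GMTK code) for leaf formulas built from
// integers, vN, cN, '+' and '*', with '*' binding tighter than '+'.
struct LeafFormula {
  const std::string& s;
  const std::vector<long>& v;  // values of the parent variables
  const std::vector<long>& c;  // cardinalities of the parent variables
  std::size_t pos;

  void skipSpace() {
    while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos]))) ++pos;
  }

  // Reads a run of decimal digits.
  long number() {
    assert(pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])));
    long n = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])))
      n = 10 * n + (s[pos++] - '0');
    return n;
  }

  // A term is an integer, vN (parent value) or cN (parent cardinality).
  long term() {
    skipSpace();
    assert(pos < s.size());
    if (s[pos] == 'v' || s[pos] == 'c') {
      const std::vector<long>& src = (s[pos] == 'v') ? v : c;
      ++pos;
      long idx = number();
      assert(static_cast<std::size_t>(idx) < src.size());
      return src[static_cast<std::size_t>(idx)];
    }
    return number();
  }

  long product() {
    long r = term();
    skipSpace();
    while (pos < s.size() && s[pos] == '*') {
      ++pos;
      r *= term();
      skipSpace();
    }
    return r;
  }

  long sum() {
    long r = product();
    while (pos < s.size() && s[pos] == '+') {
      ++pos;
      r += product();
    }
    return r;
  }
};

// Evaluates formula f given parent values v and cardinalities c.
long evalLeaf(const std::string& f, const std::vector<long>& v,
              const std::vector<long>& c) {
  LeafFormula p{f, v, c, 0};
  long r = p.sum();
  p.skipSpace();
  assert(p.pos == f.size());  // the whole formula must be consumed
  return r;
}
```

  With v = {2, 1} and c = {3, 4}, "c0*v0 + v1" gives 3*2 + 1 = 7, i.e.
  the usual mixed-radix index of (v0, v1).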

X add a MTCPT type, for deterministic cpt
  which just points to one of the decision trees
  in the DT table.

X add error check message if two low level objects
  have the same name and kill the program.
   
X discrete observed variables have only one cardinality,
  card needs to be a vector.

X Include gaussian objects for special situations.
    index -1 corresponds to, with probability 1, gaussian
    index -2 corresponds to, with probability 0, gaussian

X An anytime clique finding algorithm that searches
  for the best clique tree in an hour, day, weekend, etc.
  Save the resulting clique (and model description) to
  a file so that it can be reused multiple times.
  (removed Fri Dec 31 01:55:48 2004) 

X Produce messages to the user saying what the clique
  sizes are and that they might want to adjust the 
  structure to reduce clique sizes if they become too large.
  (removed Fri Dec 31 01:55:54 2004)

X the inference procedure should work both with clique chains and clique trees
    - no, but basic data structures are ok.
    - at some point, add tree stuff, for single frame applications
  DONE: Wed Sep  1 11:29:25 2004

- the inference procedure should work with disconnected networks

X Allocate data structures for CPTS, etc. automatically, if not explicitly 
  specified in a file. 

X use namedobject for randomvariable and any other object that
  has a string name
  DONE: Wed Sep  1 11:29:33 2004

X- include check to make sure that names for each section in parms are unique

X- todo, make sure that chunk n:m is valid.

X IMPORTANT: ability to train means and other dlinkmats in a dlinkmats structure
  independently.

- go through and start using setBasicAllocatedBit() on read, and add
  assertions to that effect on all routines that need it.
  (later) started doing this, but need to do a sweep to make sure all use
  this mechanism.

- IMPORTANT: look at and get rid of purify memory lost message
     - write destructors

- write sampling code

X update make file so that -@ works for .o files.

X Check the results of all DT query() calls to make sure
  the index it returns is valid. Report great error messages
  if it doesn't.
  Ultimately, some type of run-time check done once at the beginning
  so this query doesn't need to take place.

X if an object gets no probability, do something
  more than just issuing a warning, like setting
  to uniform probs, removing the component,
  re-randomizing, or something.
  done: set to previous values.

X add a k-means mode for initialization.
    - keep covariances at unity for a while.
    - first iteration just use random assignments
      (pick a random component who gives unity probability,
       everyone else gives zero probability)
    - done essentially since the splitting stuff
      is working and we can start with a single Gaussian
      and path counting.

X Add gaussian split program

X add linear BMM links

- add non-linear BMM links

X implement mixture component vanishing using the 
  shared Dense1DPMF. If one GM decides that
  its component should vanish based on a MCVR
  then all of them should (and should check this
  as well).

X when reading in definitions for each utterance:
    we could change the DTs with a particular name
    (e.g., read in a DT file and have a read
     mode that changes the pointer if it is there).
    This would need to have objects that use DTs
    use the integer index rather than the pointer directly.
   Essentially done, DTs can now specify a file
   containing a number of DTs one per segment.

X Re-think the 1 DT per utterance stuff so that
  we can seamlessly do parallelism.
   (dt file indexes?)
  DONE: Wed Sep  1 11:30:09 2004

- rethink DT format numbering

- add a 'fail' tag to DT leaves.

X pass .str files through cpp before reading.

X make sure that, for all files passed through cpp,
  line numbering in error messages is done right.

X GMTK_GMParms::read(), keeps open all files
  until the end. If it encounters the same name again,
  it keeps track of where it was and continues reading.
  Also, this should append rather than delete.

X Allow the file parser to pass ASCII files through cpp
  (but not binary).

- need to add a check that makes sure that the observation
  range used by the cont. parents match that used by
  the gaussian, perhaps add a length check as well.

X Fri Jun 15 12:43:49 2001, 
  MEMORY SAVINGS:
  RandomVariables and their discrete and continuous children
  have lots of redundant information when they are unrolled, thereby
  wasting lots of memory. This could be changed so that
  RV's keep a pointer to a RV common structure which when the RV is cloned
  uses the same common data.
  DONE: Wed Sep  1 11:30:38 2004

- Short term: evaluate float/double for logp on Aurora

X Long term: 
     have a log space algorithm option

- in file parser, add checks to make sure that all ints read in
  are non-negative

X change parser to keep track of multiple #include files via line directives

X change sparse PMF so that it just contains a list
  of values, and then a pointer to a dense pmf

X get the parameter writing stuff finalized.
   (get working after workshop, use simple global write in binary for now).

X get MCVR working

- if last component (or entire mixture) has no probability, 
   either 1) die
          2) do nothing, and do not change anything, reverting
             to previous values.
          3) force to uniform parameters
          4) For now, **** force to impossible Gaussian *** 
             and issue a loud warning.

- prior counts
    - need routine makeAccumulatorsPrior
    - stand alone program for writing out accumulator file to be read in.

X IMPORTANT: finish load/store/accumulate accumulators

- Identify potential issues with release (i.e., bugs, slowness, obviously needed toolkit abilities,
  things that are inconvenient, etc.).

X add VCID everywhere.

- allow deterministic relations to be enumerated out. in some cases, this is
  easier than a decision tree. 

X- add a binary/ascii parameter file conversion program.

- export all internal program variables to command line (e.g., var floor, etc)

X check on memory leak stuff

- make sure that all gaussians use means/variances that are the right dimensionality
  in read file.

- rethink sparse CPT and make it such that sparseCPTs don't use the Dense1DPMF which
  have become tailored to Gaussian mixtures (so lengths might change).
   sparse CPTs should use dense CPTs somehow.

- write C++ program to print number of free parameters for a system.

X add simple multiplication onto decision tree leafs. (or better, make it
  use real integer formulas with parens, etc.).

- export the optional training stuff (i.e., don't train means, just covars, etc)
  to command line.

- create new objects, integer to name mappings
  to map to decision tree leaves (corresponding to integers) to either
     1) Gaussian mixture objects
     2) Switching gaussian mixture objects
     3) sparse PMFs

X make sure that dlinkmatrix precompute is being
  called once the global observation matrix is ready.

- check for cardinalities in str file

- make sure dlinks are checked somewhere for validity wrt a file

- rethink the EMable thing with the virtual functions, might
  be a speedup there, esp. with emIncrement.

- figure out a good way to get (save to disk) most viterbi assignment to 
  mixture variables.

- clean up source directories.

- fix unrolling bug, where it is possible to get an assertion
  failure because of unrolling a network but having incompatible
  RVs.

- dlinks, make sure we do not point to self

- decision trees, need to deal with the issue with reading them
  in, parallelism, and so on.

- implement other forms of mappings from RV 
    - decision tree (done)
    - hash table
    - direct mapping 

- write MDCPT parameters out in nice order with smart comments.

- dlinkmats, normalize by "previous" covariance matrix, GEM alg

- more triangulation procedures to reduce large clique sizes.

- get switching parents working with triangulation

- print message at start with largest clique (members, and upper bound on
  joint state space)

- reading ascii feature files should not be by line.

X fix bug with small parallel chunks and accumulators being zero

- add option to pass definitions to cpp with cpp arg.

- add some way for DTs to refer to other DTs (i.e., a leaf
  of a DT can continue on using another DT, to make sharing
  easier, and save memory for big DTs).

X option to floor variances when they are read in.
  (for all programs including ascii/binary conversion)

X ascii/binary conversion program can go both ways.

- dynamic DTs, error messages should print cur name as well as base name. 

- make DT such that even if overlap exists, binsearch will occur.
  (add option to search from the middle outward).
  
- add option to split/vanish top/bottom N mixtures regardless of that.

- for documentation:
  - good idea to turn on conservative vanishing right after forced splitting.
    This is to make sure that the splitting is a good idea. A forced
    split might not be a good idea. The split Gaussian might 
    dwindle away during training after a split, so keeping conservative
    vanishing will keep that from happening (and will keep the
    variances from being large).
 
- option to turn off all warnings and notes.

- accumulators pretty printed

- don't allocate nextmean next covar until end of em iteration
  since they are contained in component.
  define a new bit in emable to support this (since need
  to know the first time end of em thing is called).

- ability to produce viterbi paths with mixture variables
  using the gaussian mixture objects.

- write a vector version of log(1 + exp(x))
   Wed Aug 18 19:43:41 2004: write a specific version of 
   log(1 + exp(x)) in one function log1pexp(x).
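
   A possible sketch of both the scalar and the array version. It relies
   on the identity log(1 + exp(x)) = x + log(1 + exp(-x)) for x > 0 to
   avoid overflow. The name log1pexpVec and its raw-array signature are
   assumptions, not existing GMTK routines.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// log(1 + exp(x)) without overflow for large positive x.
double log1pexp(double x) {
  if (x > 0.0) return x + std::log1p(std::exp(-x));
  return std::log1p(std::exp(x));
}

// Hypothetical array version: out[i] = log(1 + exp(x[i])) for i < n.
void log1pexpVec(const double* x, double* out, std::size_t n) {
  assert(n == 0 || (x != nullptr && out != nullptr));
  for (std::size_t i = 0; i < n; ++i) out[i] = log1pexp(x[i]);
}
```

   A truly vectorized version would replace the loop body with SIMD or a
   table lookup; the loop above only fixes the interface.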

- clean up swap and end EM in gmparams

- add command line option "-format file-type" to the main programs that
  will explain the formats of the various files. For example, if there are
  no gaussian mixtures, does the master file still have to mention them?

- make DTS such that 'default' is not required, and that if
  we have splits w/o a default, then it will have a run-time
  error if we ever get a case that doesn't match the guys in the split

- change range error messages to indicate where in the file,
  etc. the error occurs (add an extra 
  string argument, etc.).

- check that there are no self loops in dlink structures

- add startskip/endskip check

- add link checks

- when no-one left uses a component after vanishing,
  get rid of it (add a 'used' bit perhaps in EMable.h).

- make vanishing stuff vanish w/o a trace (i.e., unused
  component is gone).

- make more concise all the warning messages about vanishing
  (don't need to report all of them, report single summary
   message).   

- check on error check messages, arguments out of order
  in dlink matrix message??? (check with Geoff)

- remove all the using_files stuff in GM.cc

- clean up GM.cc with setExampleStream, and all of that.

- Wed Aug  8 21:58:36 2001 it is still the case that
  the gaussian dimensions are not being checked (since
  we don't know DT leaf values at start time).
  Once tables are in place, we can make sure
  that the tables point to all matching gaussians
  at start time.

- write our own pre-processor (doesn't have space problem
  that cpp has, and also will give standard #line/#file directives).
  Perhaps in perl.

- add tag to command line to add to all cloned objects.

- include global missed increment count in accumulators.

- arguments print default values

- viterbi option so that it prints out max likelihood of
  one variable summing over all others.

- clean up virtual functions in GMTK_EMable since some of
  the EM ones need not be virtual.

x fix arg description of beam

- for docs, when stdfracs are zero for D and B, and
  when we clonesharemeans, we might make a copy of
  the gaussians when cloning that are exactly the
  same as the parent leading to redundant copy of
  Gaussians. Make sure to mention this in docs.

- support no training names such as gmMx* for foobar*

- state clustering, ala HTK, occupancy counts.

- when segment is too short during training, skip it
  rather than exiting with an error. 


TODO:
 - file formats (table & output file)
 - record phone numbers
 - src id
 - icassp

- accumulate multiple accumulators, give a list of accumulator
  files when numeric range doesn't work to support accumulators
   on different machines.

- go through and make sure all the tying logic and not-training options work.


- change internal class names from ??CPT to the SparseCPT, DenseCPT, etc.

- create a name index type so that DT's can be used for the following.
    - in mappings to GM indexes, DT leaf specifies a relative
      offset in a table rather than in the global collection of GMs. 

    - in MSCPTs, so that row elements of the MSCPT point to
      offsets in a table for the 1dPDFs rather than in the global
      table. Add entry in MSCPT definition in data file.

- a way of adding counts to discrete CPTs without needing to specify an entire accumulator
  file.

- cpp program is determined by environment variable if it exists.

- remove from parser the integer index stuff since string names exist.

- remove all cin/cout and use printf/scanf

- add -version flag to everything.


- from Gang.
1. When I want to print the hidden variables I met the following:

suppose my varList file is the following
   wordLatticeState
   state
   phoneTransition
   wordTransition
and my fileList file is the following
   wordLatticeState.log
   state.log
   phoneTransition.log
   wordTransition.log

The output will put everything in the file wordLatticeState.log.  In that
file the first integer will be wordLatticeState(0), second will be
state(0), so on and so forth.  It didn't create other three files.

2. suggestion

in gmtkViterbi, cppCommand is very useful.  But if I have a lot of macros,
the command line string will be very long.  I wonder whether there is a way that
this will take a file of macros.

JB answer: but this can be done using cpp's #include in one of the
           files. TODO: add this to documentation.


- add implicit approach to tutorial (from Yimin).

- Graphviz and Grappa from AT&T can visualize graphs very nicely.


- 
>.
>WARNING: Ending EM iteration but 124 rows of MDCPT 'mannerMDCPT' had zero 
>counts. Using previous values for those rows.
>
>Actually, one thing that would be useful would be if it were possible to have 
>a "verbose" option where it tells you which rows had zero counts.
>
------------------------------------------------------------

Tue Jun 18 17:47:44 2002
> >(2)  Does GMTK allow you to use models that aren't strictly probabilistic?  
> >E.g. if I wanted to weight the acoustic model score relative to the language 
> >model, would I be able to do that?  I know that for some purposes I can "fool" 
> >it, as I've done with the feature model, by having multiple observation 
> >variables pointing to the same observations, but that is fairly limited.
> 
> Not at the moment, but that would be really easy to add (and I'd be happy
> to do that for you).
------------------------------------------------------------


FIXED Wed Jun 19 19:27:26 2002

Tue Jun 18 17:51:42 2002
> >(1) I am training a new model, and am getting an error with the new version of gmtkEMtrain but not the old version.  The error
> >occurs when loading accumulators (but not when training without accumulators), and the message I get is:
> >
> >error in accumulating accumulators: /t/klivescu/aurora/articulatory5/MISC/emtrain.1.log:  
> >EOF occurred in readDouble, file '/t/klivescu/aurora/articulatory5/MISC/acc_file_1.data': MDCPT load accums
> >Loading accumulators from '/t/klivescu/aurora/articulatory5/MISC/acc_file_1.data'
> 
> Hmm. Are you using accumulators that were generated with the old
> version with the new version? I don't think any of that code was
> changed, but it is possible there might have been a bug introduced.

No, the accumulators were generated with the new version.  I ran a
few more experiments to try to figure this out, and it seems that
it has to do with the size of the accumulator files--i.e. it only
happens when the accum files have only a few utterances' worth of
data.  E.g. if I train on 50 sentences broken up into 2 accum files,
it's fine; but when broken up into 10, I get the error.  Perhaps it 
is unhappy about some things not being observed in the accum files?

> That would be good. Could you set up the bug on music and/or orca?

I set up the files on both machines.  On both machines, everything is in
~klivescu/aurora/articulatory5.  The NOTES file contains the command 
lines for the old/new versions (sorry about all the long path names--
the commands were copied directly from script-generated makefiles).  
I think I made everything group-writable, so you should be able to 
run the commands.

Thanks!
Karen

------------------------------------------------------------
FIXED Wed Jun 19 19:27:26 2002

Tue Jun 18 17:51:16 2002
	
(3)  Not so much a question as just letting you know about another error 
(different from the last one) that I got with the new version but not the old 
one.  This one also occurred when combining accumulators.  The error message 
was:

Loading accumulators from '/homes/klivescu/aurora/articulatory7/MISC_NEW/acc_file_1.data'
GMTK_MeanVector.cc:718: failed assertion `emEmAllocatedBitIsSet()'
IOT/Abort trap (core dumped)

This occurred while training a feature model with clustered features.  The only special thing I can think of about this model is that there were a number of Gaussians that I wasn't training (via -objsNotToTrain) because they correspond to impossible combinations of feature values.  I put all the files necessary to replicate the error on music in ~klivescu/aurora/articulatory7.  The NOTES file in that dir has the commands that I ran for both the old and new versions.

-----------------------------------------------------------------
FIXED Wed Jun 19 19:27:26 2002

Wed Jun 19 11:34:25 2002

At the moment, we are not checking for emAmTrainingBitIsSet() in
GMTK_MeanVector.cc, GMTK_DlinkMatrix.cc, and GMTK_DiagCovarVector.cc
in all of the EM routines, except for the swap routine (i.e., if the
bit is not set, we don't swap). The reason for this is as follows.
When sharing is not on, it is fine to check this bit before each EM
routine, and if not set, do nothing. The problem is that with sharing,
when we compute the updates for the shared means, we'll need the
counts for not only the means but also for the covariances, and vice
versa. A similar situation arises when dlinks are involved. One
solution might be to activate the accumulation if it is seen that
sharing is occurring, but the problem with this is that, at the moment,
we don't know if sharing is occurring until after emIncrement is called
for the 2nd time on the same mean object. If the bit is off the first
time, we might miss the first accumulation.

The solution now is to compute all the counts for the
means/variances,etc. in all cases, even when the not_training bit is
on, but this can be very wasteful. Another problem is that we need to
save the accumulators when doing parallel training, even when the
means are not being trained. This means that the training bit could be
off, but we are accumulating accumulators for the shared object so the
accumulators need to be saved even when the training bit is off.  This
logic therefore needs to be rethought.

-----------------------------------------------------------------

- get automatic allocation of DenseCPTs working again.


Tue Jul  2 19:35:29 2002
- errors when reading in parameter files should be more informative
  and perhaps say where in the parameter files the problem is.

Tue Jul  2 19:38:36 2002
- no need for warning about accumulatedProb = 0 for MTCPT

Tue Jul  2 19:53:43 2002
add to documentation something about normalization features
and variance floor. Hints/tips on what to do here.
  - possibly a diff. var floor for each feature vector element/

XX Tue Jul  2 20:24:18 2002
   - add a 'noscore' CPT so that we can do true conditional discrete
     observations, similar to what can be done for discrete observations.

Sun Jul 07 17:16:56 2002
 - go through all parameter reading code to produce better error messages

XX Sun Jul 07 17:16:56 2002
   (e.g., realarray, and all of that, probably eliminate realarray class
    as it's not doing much).

Sun Jul 07 23:40:36 2002
 - idea, for off-line triangulation heuristic, rather than
   min-weight, choose "min dynamic weight." The dynamic
   weight is obtained by running through all possible values
   in the clique and choose the node to eliminate next which
   results in the fewest number of clique instantiations.
   Possible problem. Implementation of edges changes for  
   each utterance, but it is only the decision trees that
   change. 
   Options:
     1) Do this for one possible value of the DTs and 
        hope that this works for the rest
     2) do this for all, and average over the different
        DT values (or take min of max, etc.)
     3) have user provide "canonical" DT examples
        that are used for triangulation.
     4) 

  Question of how much to expose user to triangulation issues:
    Goal: try to make it possible for them to experiment with
    better strategies w/o asking them to understand everything,
    but still include it in the tutorial section.
  
Fri Jul 12 23:39:17 2002
  - rewrite issue with di_xxCPT type of CPT objects being
    declared. There should be one CPT object type used in
    the program, and all guys inherit from it. Q: how  
    to do the separate namespace for MDCPTs, MTCPTS, and MSCPTs?
  - this also an issue for mixGaussians. Need to re-think all
    of this.


Sat Jul 13 02:05:27 2002
 - ascii file reading should read in ascii files preprocessed
   by CPP (optionally and by default). Also, the list of files 
   itself should be processed by CPP.

Tue Jul 16 18:38:44 2002
 - when variances get floored, message should also include occupancy
   probability (to see if counts are low)


Thu Jul 18 15:38:36 2002
 - add comment to docs about problem with single quotes in GMTK comments
    because CPP has trouble

XX Thu Jul 18 15:38:51 2002
 - add neg values between -infty and -0 being log probabilities.

Mon Jul 22 13:39:02 2002
 - add neg vals cpt and dpmf to docs,
 - add to docs that normthres can be == 0 to turn it off.

Fri Jul 19 10:11:04 2002
 DONE: add ability to specify initial counts on command line for discrete objects
   (M?CPT, DPMF) during training, to do a form of Laplace smoothing.
   Perhaps similar to obsnottotrain option. 
   (done Sat Jul 23 16:02:39 2005, general Dirichlet prior model for DenseCPTs and DPMFs)
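
   A minimal sketch of the initial-count idea, assuming the prior counts
   are simply added to the accumulated counts before normalizing one row
   of a CPT or one DPMF. smoothedRow is a hypothetical helper; the
   Dirichlet prior code actually added to GMTK may differ (e.g., in how
   the prior counts are offset).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: turns accumulated counts plus prior (initial)
// counts into a normalized probability row. With all prior counts equal
// to 1 this is add-one (Laplace) smoothing.
std::vector<double> smoothedRow(const std::vector<double>& counts,
                                const std::vector<double>& prior) {
  assert(counts.size() == prior.size());
  double total = 0.0;
  for (std::size_t i = 0; i < counts.size(); ++i) total += counts[i] + prior[i];
  assert(total > 0.0);  // otherwise there is nothing to normalize
  std::vector<double> probs(counts.size());
  for (std::size_t i = 0; i < counts.size(); ++i)
    probs[i] = (counts[i] + prior[i]) / total;
  return probs;
}
```

   For counts {0, 3} and prior counts {1, 1} the row becomes {1/5, 4/5},
   so an unseen value no longer gets zero probability.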

Fri Jul 19 10:13:22 2002
 - idea. Each gmtk object should have a set of options that can
   be set when the object is defined, rather than in a separate file???

Fri Jul 19 11:59:44 2002
 - add karen's MCVR/MCSR comments to the docs

Fri Jul 19 18:48:57 2002
 - use regex library for obs to not train, and other such
   things.

Mon Jul 22 13:40:06 2002
 - idea about triangulation heuristic
   (when doing min weight, when a det. var and its parents
    live in the clique, don't multiply the child's cardinality
    into the clique weights). So, if there is only one
    random parent, entire clique weight = cardinality of
    that parent).
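
   The weight rule in this idea could be sketched as follows. Node and
   cliqueWeight are hypothetical names, and the node table is a
   stand-in for whatever graph representation triangulation uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node description; not GMTK's random variable class.
struct Node {
  unsigned cardinality;
  bool deterministic;        // true if the value is a function of the parents
  std::vector<int> parents;  // indices into the node table
};

static bool inClique(const std::vector<int>& clique, int n) {
  for (int m : clique)
    if (m == n) return true;
  return false;
}

// Clique weight = product of member cardinalities, except that a
// deterministic node whose parents all live in the clique contributes 1.
double cliqueWeight(const std::vector<Node>& nodes, const std::vector<int>& clique) {
  double w = 1.0;
  for (int n : clique) {
    assert(n >= 0 && static_cast<std::size_t>(n) < nodes.size());
    const Node& node = nodes[static_cast<std::size_t>(n)];
    bool allParentsInside = true;
    for (int p : node.parents) {
      if (!inClique(clique, p)) {
        allParentsInside = false;
        break;
      }
    }
    if (node.deterministic && allParentsInside) continue;
    w *= node.cardinality;
  }
  return w;
}
```

   With a random parent of cardinality 10 and a deterministic child of
   cardinality 5, the clique {parent, child} gets weight 10 rather than 50.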

Mon Jul 22 15:15:04 2002
  - when structure contains no observations in file, shouldn't
    require obs file (but how to determine unrolling in that case?)

Thu Jul 25 19:57:48 2002
  - add a per sentence like range option to all programs, like
    in the pfile tools. This would pass directly down to the
    file reading utilities.
  - Update the pfile tools to use the new file format.
  

Fri Jul 26 14:52:53 2002
  - add appropriate warning/informational message mechanism,
    so that users can see more/less of the warning information, with
    command line option (e.g., for dets hitting zero, and any
    other funny hacks that occur).
				   

Wed Jul 31 22:04:27 2002
  DONE: actually, rather than laplace smoothing, add an object type 
    that can include initial counts to use for MDCPTs and MSCPTs
    (use  same file format as these, but just include pos integer or fp counts
     rather than probabilities).
   (done Sat Jul 23 16:03:05 2005)

Tue Aug  6 22:49:42 2002
  - use email to chiaping on Tue Aug  6 22:49:46 2002 about unity
   score Gaussian -> docs    


Wed Aug  7 13:46:35 2002
 - LM scale factors,
   for vit decoding, give the option to raise the prob of 
   a rand var prob to a power (keep it static for now in
   the structure file).

Fri Aug  9 10:33:17 2002
  - change low level fileparser so that, when it has an error
    message, it keeps track of the line number & file name of the file
    that it is parsing. Need to parse CPP options
    as well in that case.


Sun Aug 11 18:07:53 2002
 - obsNotToTrain file formats
    - should allow multiple objects on the same line
    - should use a better format (regexpressions, etc.)


Wed Aug 14 14:08:19 2002
    given a word-int map, allow the inclusion of word matrices
    WidMatrix to map to internal GMTK matrices, using loadWordFactors  
    and Katrin's tag-word format. I.e., support observed data
    in the form of words/utterances for a word map.


Mon Sep 30 01:21:02 2002
  - next set of things might be redundant with above, but just a check
   1) deterministic CPTs shouldn't have messages/notes, etc. about
      not getting any accumulated probability, etc.
   2) error messages with MSCPTs, MDCPTs, use new names


Wed Dec  4 19:39:41 2002
  - docs: ascii file formats for data files, add bit in documentation that
    one frame per line, and fix error message on line 1272 in GMTK_GM.cc
    (see email from Au@hk).


Tue Dec 17 13:31:36 2002
>2. Usually LM scale factor is accompanied with the insertion penalty, i.e.
>
>		scale * log P + penalty
>
>  I think, including insertion penalty would be quite useful for 
>  incorporating LMs with GMTK.
 (actually, do the switching variable scale/penalty stuff).
Email from Dec 18th
> A way, then, of getting an insertion penalty effect is to have a
> switching scale and penalty, syntax could be:
> 
>       weight: 
>         value 1.0 value 0.0     // scale = 1, penalty = 0 
>       | value 5.0 value 3.0 ;   // scale = 5, penalty = 3
> 
> which mimics the conditional parents notation. In this case, it uses
> the first scale,penalty when the switching parent is in its first
> region, and the 2nd when the switching parent is in its second region.
> So, when it switches in the true bigram (meaning a word transition
> occurred).
> 
> We could have time-dependent scale,penalty by doing:
> 
>       weight: 
>         observed 0:0 observed 1:1
>       | observed 2:2 observed 3:3 ;
> 
> where it would be assumed that observations 0-3 contain only scales
> and penalties. 
> 
> Thoughts??

That's funny because I was about to ask you what happens when you have switching
parents and whether you can change the weight for the different branches of the
switch.  I didn't think about the penalty but that would be very useful too.
The syntax above looks good to me, except I might make it more explicit about
which is the scale and which is the penalty, e.g.

weight:  scale value 1.0 penalty observed 0:0 ;
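
The switching scale/penalty proposal amounts to picking one
(scale, penalty) pair per switching-parent region and applying
scale * log P + penalty. A minimal sketch, where ScalePenalty and
weightedLogProb are hypothetical names and the region index is assumed
to have been determined already from the switching parents:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One (scale, penalty) pair per switching-parent region.
struct ScalePenalty {
  double scale;
  double penalty;
};

// Returns scale * logP + penalty for the currently active region.
double weightedLogProb(double logP, const std::vector<ScalePenalty>& weights,
                       std::size_t region) {
  assert(region < weights.size());
  return weights[region].scale * logP + weights[region].penalty;
}
```

With the example values from the email, {1.0, 0.0} for the first region
and {5.0, 3.0} for the second, a log probability of -2 stays -2 in the
first region and becomes 5 * (-2) + 3 = -7 in the second.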


Thu Dec 19 17:36:22 2002
  - topological sort loop detection should indicate
    which node is involved in a loop.
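
    One way to report a node on the loop is a depth-first search that
    returns the target of the first back edge it finds. This is a
    sketch under the assumption that the graph is available as child
    adjacency lists; findLoopNode is a hypothetical name and the
    existing topological sort may be organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// state: 0 = unvisited, 1 = on the current DFS path, 2 = finished.
static int visit(int u, const std::vector<std::vector<int> >& children,
                 std::vector<int>& state) {
  state[static_cast<std::size_t>(u)] = 1;
  for (int w : children[static_cast<std::size_t>(u)]) {
    assert(w >= 0 && static_cast<std::size_t>(w) < children.size());
    if (state[static_cast<std::size_t>(w)] == 1) return w;  // back edge: w is on a loop
    if (state[static_cast<std::size_t>(w)] == 0) {
      int r = visit(w, children, state);
      if (r >= 0) return r;
    }
  }
  state[static_cast<std::size_t>(u)] = 2;
  return -1;
}

// Returns -1 if the graph is acyclic, otherwise the index of a node
// that lies on a directed loop.
int findLoopNode(const std::vector<std::vector<int> >& children) {
  std::vector<int> state(children.size(), 0);
  for (std::size_t u = 0; u < children.size(); ++u) {
    if (state[u] == 0) {
      int r = visit(static_cast<int>(u), children, state);
      if (r >= 0) return r;
    }
  }
  return -1;
}
```

    The returned index can then be mapped back to the variable name and
    frame for the error message.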

Sun Dec 22 21:09:32 2002
 - unrolling routine should not allow for P to reference into
   E and E to reference into P (i.e., P and E can not have
   a parent on the other side of a chunk).
   Why?
    - then the interface method no longer makes sense. This is
      because that constrained triangulation for DBNs
      is implicitly assumed by the interface algorithm (i.e.,
      nodes up until the 'face' are eliminated first). With
      links across chunks, then the face might live
      in both the chunk and E at the same time, might
      cause large cliques, and would make triangulation
      more difficult. 

      Note that we could have a mode that allowed for this,
      if it used the old strategy of first unrolling the
      network, then moralizing triangulating, and then
      doing inference.
      Idea: since we'll need such a mode for unrolling
          by 0 cases anyway, we could use it for the
          case when E links into P (and vice versa).


    - Makes the concept of unrolling more difficult. If
      links could span across a chunk, then depending
      on the unrolling amount, a link with child in
      say P might at one time
      link into a portion of the unrolling chunk, at
      another time, might link into a node in E.
    - update: Thu Dec 26 03:19:08 2002
       perhaps allow for this to occur when using
       the unroll graph and then triangulation mode (e.g.,
       the same mode that works for snakes will work for here as well).
    - update: Wed Aug 18 20:11:08 2004
       also, allow this when dealing with static template graph case.

Sun Dec 22 21:12:43 2002
  - idea: wrap-around mode?
     i.e., allow for negative indices at the beginning of a graph
     to link into the end (i.e., P can link to E using negative)?? 
  - but this would have same problems as linking from P to E
    in note above.
  - Wed Aug 18 20:11:34 2004: does this relate to loopy BP?

Tue Dec 24 00:25:18 2002
  - find best {left,right} interface should have the option
    to judge the quality of the interface not by size
    but also by weight (which includes deterministic 
    variables).


Tue Dec 24 08:02:50 2002  
   for min weight heuristic
	    // TODO: should also look at if this is a sparse CPT,
	    // and if so, multiply by the average density (number of non-zeros)
	    // of the CPT rather than the entire cardinality. For a very
	    // sparse CPT, for example, using just the cardinality
	    // is *very* conservative. 
   a sparse CPT can provide this estimate itself of the
   average cardinality. E.g., it can compute over
   all parent values number of columns that are non-zero,
   and take average, that is the 'effective' cardinality.
   Can do this by looking at sparcePMF.length() field
   on line 228 of GMTK_MSCPT.cc.


Tue Dec 24 08:26:10 2002
   define a cardinality of a CPT based on the
   average 2^entropy of the node, under the
   assumption that pruning will remove stuff below
   those values that are not significant.
   This is another heuristic to add (and should be
   controllable via command line parameter).
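
   A sketch of this entropy-based cardinality, assuming each CPT row
   (one per parent configuration) is a normalized distribution and that
   rows are averaged with equal weight. effectiveCardinality and
   averageEffectiveCardinality are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// 2^H(p) for one distribution p, with H measured in bits.
double effectiveCardinality(const std::vector<double>& p) {
  double h = 0.0;
  for (std::size_t i = 0; i < p.size(); ++i)
    if (p[i] > 0.0) h -= p[i] * std::log2(p[i]);  // 0*log(0) is taken as 0
  return std::exp2(h);
}

// Unweighted average of 2^H over all rows of a CPT.
double averageEffectiveCardinality(const std::vector<std::vector<double> >& rows) {
  assert(!rows.empty());
  double sum = 0.0;
  for (std::size_t r = 0; r < rows.size(); ++r) sum += effectiveCardinality(rows[r]);
  return sum / static_cast<double>(rows.size());
}
```

   A uniform row over 4 values gives 4 and a deterministic row gives 1,
   so a CPT with one row of each kind would be assigned 2.5.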

Tue Dec 24 13:58:16 2002
  - for snake structure, can have a module that
    still does regular unconstrained triangulation and
    unrolls and triangulates for each length (in that
    case, snake will be better)


Tue Dec 24 13:58:48 2002
  - for min-fill, min-weight, etc., when there
    is a tie, back off to constrained triangulation
    (e.g., eliminate the earliest node first, or
     a random node selected from among the earliest
     nodes)


Wed Dec 25 02:09:25 2002
  - with triangulation heuristics, when a tie occurs
    then use another heuristic (i.e., with a tie with
    min weight, then use min fill in)
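    Sketch of the tie-breaking idea from the two notes above: compare
    nodes lexicographically, first by weight, then by fill-in, then by
    position (earliest node first). NodeScore and pickNextNode are
    hypothetical names, not part of the triangulation code.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical per-node scores; in practice these would be recomputed
// after every elimination step.
struct NodeScore {
    double weight;   // primary heuristic (min weight)
    int fillIn;      // secondary heuristic (min fill-in)
    int position;    // final tie-breaker: earliest node first
};

// Index of the node to eliminate next: the lexicographic minimum.
std::size_t pickNextNode(const std::vector<NodeScore>& s) {
    assert(!s.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < s.size(); ++i) {
        if (std::tie(s[i].weight, s[i].fillIn, s[i].position) <
            std::tie(s[best].weight, s[best].fillIn, s[best].position))
            best = i;
    }
    return best;
}
```

    Changing the order of the fields in the std::tie calls changes the
    back-off order of the heuristics.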

Wed Dec 25 12:50:41 2002
  - also include an anytime triangulation algorithm
    exhaustive search. It should occasionally
    print out the percent searched so far, and
    should accept a SIGUSR2 to terminate search
    so far and take answer. Also, should
    include an argument that is the time (in days
    and hours) to compute triangulation for, and
    stop after that amount of time.
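    Sketch of the stopping logic only (signal flag plus time budget). The
    search itself is replaced by a scan over precomputed candidate costs,
    and all names are hypothetical. SIGUSR2 is POSIX-specific, so the
    handler installation is guarded.

```cpp
#include <cassert>
#include <chrono>
#include <csignal>
#include <cstddef>
#include <vector>

// Set by the signal handler; polled by the search loop.
volatile std::sig_atomic_t g_stopRequested = 0;

extern "C" void onStopSignal(int) { g_stopRequested = 1; }

// Install the handler; SIGUSR2 is POSIX-only, hence the guard.
void installStopHandler() {
#ifdef SIGUSR2
    std::signal(SIGUSR2, onStopSignal);
#endif
}

// Scans candidate costs (a stand-in for triangulations) and returns the
// index of the best one seen before the deadline or a stop request.
std::size_t anytimeSearch(const std::vector<double>& costs,
                          std::chrono::milliseconds budget) {
    assert(!costs.empty());
    const auto deadline = std::chrono::steady_clock::now() + budget;
    std::size_t best = 0;
    for (std::size_t i = 1; i < costs.size(); ++i) {
        if (g_stopRequested || std::chrono::steady_clock::now() >= deadline)
            break;  // take the answer found so far
        if (costs[i] < costs[best]) best = i;
    }
    return best;
}
```

    The handler only sets a flag, which keeps it async-signal-safe; the
    loop is the place to print the percent searched so far.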
    

Thu Dec 26 02:16:36 2002
  - fix: core dump issue when graph is not connected.
    (i.e., when chunk is not connected to itself)


Thu Dec 26 02:21:28 2002
  - Chris Bartels Todo:
    - write a class Anytime_Triangulation
       that has same interface as basicTriangulation routine
    - write an anytime exhaustive search triangulation
    - get MCS working with current graphs, so we can
      verify that graphs are indeed triangulated.


Thu Dec 26 03:18:13 2002
  - include something in documentation about
    cpp0: ....
    errors, and that these are generated from cpp not gmtk.


Mon Dec 30 05:16:04 2002
  - this is probably mentioned somewhere above, but
    change fileParser.{cc,h} to print out line numbers when
    there is an error message, and also to understand
    cpp's filename and line number mechanism (for ascii files
    and when they are processed with cpp).

Fri Jan  3 19:48:46 2003
  - this is an old todo item taken out of foce.
  // snake structure with constrained elimination
  // will ruin the clique_size = 2 property
  //     of the 'snake' structure. The todo is to get this working 
  //     with that (and similar) structures.
  /*
   *     idea: make complete components in C_l *only* if they
   *           are connected via nodes/edges within either (P + left_C_l)
   *           or preceding C.
   *           What we will then have is a collection of cliques for 
   *           the interface(s). In this case, we glue together
   *           the corresponding sets of cliques. Right interface
   *           algorithm should be similar (and use E rather than P).
   *           But might both left and right interface need to be used in the
   *           same repeated chunk in this case to get clique_size=2???
   *      alternatively (and easier): for snake, just use the unconstrained
   *           triangulation method (which works perfectly for snake).
   *          


Sat Jan  4 17:39:20 2003
  - build utilities to convert (at least partially) from
    other standard GM network file formats to GMTK and
    vice versa. See the 
    Bayesian Network Repository page.
        

Tue Jan  7 12:24:16 2003
  - remove 'observation' as keyword for weight and just use
    observed.

Mon Jan 13 13:31:34 2003
  - BUG: when we have a sparse cpt in a parameter file that
    refers to a name 'global' for a collection object, that
    object does not exist since loadGlobal() was called in the
    gmtk programs last (after other trainable and master files
    have been called). 

Mon Jan 13 13:32:31 2003
  - get rid of MSCPT, MDCPT, and MTCPT messages. In sparse CPT
    error messages, it says MTCPT when it should refer to sparse CPTs.
    Fri Jan 23 13:17:05 2004: change all internal program variables
     away from using MTCPT, MDCPT, etc. also
   
Mon Jan 13 16:54:33 2003
  - make sure in documentation that it says that dlink feature index 
    values are for absolute locations with respect to the
    feature files and not relative locations, relative
    to the feature range given in the .str file (if this
    is not done already).
      - parent variables are absolute, with absolute feature locations
       given in dlink definition
      - child variables are relative, where relative location is given
        starting with feature range given in .str file.

Mon Jan 13 17:18:17 2003
  - Fix bad error message, 'Error: upper n >- limit n in range string' when training range does
    not match the range of utterance numbers in the observation file.


DONE: Thu Jan 30 19:15:59 2003
  - optimization
   right now, Gaussians are updated during mean and variance updating as follows:
    *covars_p += (*f_p)*(*f_p)*fprob;
    *nextMeans_p += (*f_p)*fprob;
   but the two lines of code occur in different files, and so
   they can't re-use the common subexpression (*f_p)*fprob
  - TODO, write a routine that can update both at the same time,
    probably best way to do this is to add a bit of code to emIncrement
    in GMTK_DiagGaussian and GMTK_LinMeanCondGaussian for this purpose.
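    Sketch of a fused update that shares the product. This is an
    illustration with a hypothetical function name and std::vector
    arguments, not the emIncrement code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Accumulate first- and second-order statistics in one pass so that the
// product f[i]*fprob is computed once per dimension.
void accumulateMeanAndCovar(const std::vector<float>& f, float fprob,
                            std::vector<float>& nextMeans,
                            std::vector<float>& covars) {
    assert(f.size() == nextMeans.size() && f.size() == covars.size());
    for (std::size_t i = 0; i < f.size(); ++i) {
        const float weighted = f[i] * fprob;  // shared subexpression
        nextMeans[i] += weighted;
        covars[i] += f[i] * weighted;
    }
}
```

    Note that f*(f*fprob) may round differently from (f*f)*fprob in the
    last bit, so accumulators need not match the old code bit for bit.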


DONE: Fri Jan 31 00:30:30 2003
  - add a '-debug' option that prints various status
    messages depending on the '-debug' number.

DONE: Mon Feb 03 14:45:49 2003
  - might be above, but add the option to
    unroll more times of the C chunk and triangulate that
    rather than just one time (this would put more constraints
    on the number of allowable frames for a P,C,E graph).

Sun Feb 23 17:04:32 2003
  - have exponential partition algorithm take a SIGUSR1 to terminate
    what it has done so far and continue with the best it has found so far.

Sun Feb 23 17:05:10 2003
  - Have exponential partition algorithm also parallelize (so that space
    can be broken into chunks in some way and have search run in parallel).

Sun Feb 23 17:12:53 2003
  - when using with an observation set that does not match what unrolling
    can do for a given graph and triangulation, give options
      1) left justify, 2) right justify, 3) center, 4) unconstrained
    Make it clear that when obs length is too short, it will do unconstrained
     (or perhaps even backoff to simpler triangulation scheme, one that
      does not do any partially unrolled triangulation).
     call it "semi-constrained triangulation"
     (good for certain graphs, Brian's, Kevin's, but still doesn't 
      work for snake, the optimal graph is fully unconstrained).


Fri Feb 28 00:09:13 2003
  - not a todo, but rather a note. When the final reported
    weight is not one of the ones in the boundary alg search (and it
    is lower), it is probably the initial score (before the boundary
    alg. is run). keywords: final, face heuristic, exponential algorithm,
    weight not same, different weight, first weight, final weight

Wed Mar 05 19:45:16 2003
  - also allow language models to be specified in W3 XML format 
    see http://www.w3.org/TR/ngram-spec/
  - allow this as well as the DARPA format above.
  - 

Sat Mar  8 00:00:26 2003
  - another arguments.cc bug, in this case with -argsFile and then
    missing required option later, default argument values are printed
    incorrectly.

Sat Mar  8 16:56:48 2003
  - Write an emacs editing mode file for gmtk .str files.


Sun Mar  9 14:42:11 2003 
  - fix arguments.cc to not use C++ iostream
    and give usage() an optional message.

Sun Mar 09 19:00:59 2003
  - random variable should be super class, and 
    subclasses should add things like neighbors, etc.
    I.e., the parent class of random variable might be all the parser
    should want to deal with.

Sat Mar 15 22:31:31 2003
  - the issue of setting objects to not train, i.e., when A) a mean
    of a Gaussian is trained but a covariance is not, or when B) a covariance
    of a Gaussian is trained but not a mean. In case B, the
    covariance update should really be relative to the existing untrained mean,
    not the new mean which is accumulated, computed, and then discarded.
    The no-train thing should be re-thought out.

Sat Mar 15 22:33:27 2003
 - suggestions from Chris Bartels
	Jeff,
	
	You had mentioned that I should keep track of ideas I had to help with
	GMTK debugging and usability as I used it.  Here is the list.
	
	Debugging Support:

	Display the values for observation variables for the first several frames.
	The user can use this to verify that they are indexing into their data
	files correctly and have the byte swapping set properly.
	
	Allow the user to specify the values of the random variables in some sort
	of text file, then GMTK can display the values of each of the
	deterministic variables based on this.  For example, with the aurora
	based graphs the user would supply a text file of state transitions which
	would just look something like '0 0 1 0 0 1 0 0 1' and GMTK would print
	out 9 frames of variables, and because the model can't complete in just 9
	frames give some sort of error message saying so.
	
	Start an FAQ for common problems such as the byte swapping and the
	'skipping example due to zero probability' that I was seeing.
	
	Bugs:
	
	In one of my decision tree files I used some #define constants but this
	did not work nor did it give me an error message.
	
	When I was using the decision trees with the #define constants I was
	sometimes getting core dumps rather than an error message and an elegant
	exit.  When I didn't get the core dump it looked like my current state
	value which controlled the DPMFs I was training only was taking on one
	value.
	
	Usability suggestions:

	Give it the ability to generate initial Gaussian parameters like it
	currently can do with DPMFs.
	
	A nice GUI with integrated editing and debugging features.


Sat Mar 22 21:02:06 2003
  - environment variables for:
     - cpp (or some preprocessor) command and/or path
     - default cpp (or preprocessor) options


Thu Apr 24 13:36:20 2003
  - Add ability to have hidden variables take distributions
    from observation files (so that the distribution might
    change from frame to frame, making the distribution 
    itself an observation, but the variable using that
    distribution is still hidden). Distribution can for example
    come from a NN.
    Sat Jun 14 19:48:34 2003: This allows for "hybrid" like systems
       (where NNs can supply distributions over observed variables),
       but also allows for simulation of situations where
       one has observed continuous parents with hidden discrete children.
       Learning, of course, has to be done separately (but need to think
       more about this).

Fri May 09 17:42:07 2003
  - add automatically existing observed variable frame, so users
    can condition on frame(0), frame(-1), etc. which gives 
    the frame number in an utterance (and so users don't need 
    to have a counter for this).
    Sun Feb 22 14:45:15 2004: update, also add length(n) variable
    that for all n returns the length of the current utterance.


Fri May 09 18:52:17 2003
  - assertion failed when dlink matrices exist but are not used by
    a Gaussian component, but then the accumulator tries to accumulate
    the dlink matrix but finds that its em stuff has not been allocated.
    This happens when accumulating externally from acc file. 
    Should report an error when dlink matrices exist in data files
    that are not being used by any gaussian component, since we might
    get this error when trying to accumulate the dlink matrices without
    their em stuff existing.
    TODO: possibly have automatic allocation of em structures when
    we accumulate them rather than the other way around.
    
    
Fri May 23 15:42:30 2003
  - fix argument code so that rather than giving warnings on argument type errors, it
    gives true errors.
 

XX Wed May 28 14:45:38 2003
    - perhaps this is done already, but when checking for loops and valid structures,
     do topological sort on all of 
       - basic template (unroll(0))
       - unroll(1)
       - unroll(2)
     There is some email from me to chris on this topic, look for that.
    DONE: Sat Jul  5 20:49:37 2003 (JB)
    Wed Jan 21 23:25:44 2004: need to unroll by each amount from 1 to the number of vars in C partition
      to get this working.

Tue Jun 03 23:19:31 2003
  - need to add a check for edges spanning from P to E *after* moralization.
    This can happen with backwards dlinks.
    (i.e., consider fig 7 graph in uai03 paper, if only a three frame
     template is used and graph is moralized, then unrolled
     chunk will have edges from P to E).


Tue Jul 01 17:36:05 2003
  - go through all file reading code and insert the is.fileName()  (and ideally a line number
    as well) for all error reports about that file.

DONE: Fri Jul 04 00:39:18 2003
  - make the MCS implementation truly O(n+e) rather than O(n^2)

Sat Jul  5 17:30:27 2003
  - make sure that certain core dumps with M>1 on some of the graphs
    are due to running out of memory.
  - Implement a fixed upper bound memory usage, and turn off memoize when
    that limit is reached, so that memory usage does not grow unboundedly. 

Sat Jul  5 17:34:38 2003
  - in timer code comments, give a bunch (lots) of examples of valid timer strings.
  - include these examples in documentation.

Tue Jul 08 15:19:03 2003
  - improve error message in GMTK_FileParser.cc
	  error("Error: RV \"%s\" at frame %d (line %d), num parents cond. %d different than required by %s \"%s\".\n",

Thu Jul 10 16:25:06 2003
  - set up GMTK web page with links to all papers, etc.

Thu Jul 10 16:25:23 2003
  - BMM link adaptation

Thu Jul 10 16:25:33 2003
  - Adding DARPA format Language Model CPTs

Thu Jul 10 16:25:33 2003
  - Adding FLM Language Model CPTs

Sat Aug 23 15:52:47 2003
  - see if it helps to do log(size)+weight rather than just weight
    in triangulation code. 10^weight is just number of "rows" in the 
    table, but 10^(log(size)+weight) is full table size (with zeros 
    removed).
        - this will probably be moot now that packed clique values
          are used.

Sat Sep  6 18:24:29 2003
  - add a flush buffer mode to debug.h so that output when error
    occurs doesn't get lost.

Sun Sep  7 00:34:12 2003
  - fix bug with partition algorithm
    with graph parents_before_after.str and similar such things.


Sun Oct 12 23:06:28 2003
  - add an extra internal variable that is constant and
    gives an integer corresponding to the total number of frames
    in an utterance. It is available at every frame, but
    is always the same value for an utterance.
  - add another variable that gives the frame number of
    the variable in question (i.e., frame(0) is the current
    frame, frame(-1) is the previous frame, etc.). In
    first frame of template can't do frame(t<0), and in
    last frame of template can't do frame(t>0)
  - Possibly tie this in with the observation matrix by having
    variables that:
      - give number of changes in a discrete stream of features
      - length of original unexpanded number of frames, before
        obs matrix has done upsampling, etc.
  - Make a general module to add such variables.

Wed Oct 15 19:18:58 2003
  - allow for frames without variables to do multirate and 
    automatic downsampling.

Sun Oct 19 13:36:47 2003
  - Should either add "noisy-or" and "noisy-and" implementations 
    of CPTs, and their multivariate generalizations, or
    mention in documentation how they can easily be obtained  
    using a deterministic 'or' function with the original
    variables as grand-parents.
    Add variational training to this sort of model.
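    Sketch of the standard noisy-or probability for reference. Assumptions:
    binary parents, one activation probability q[i] per parent, and an
    optional leak term; the function name is hypothetical. This is the
    distribution obtained when each original variable feeds a noisy
    intermediate copy and the child is the deterministic 'or' of those
    copies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// P(child = 1 | parents) under noisy-OR: each active parent i independently
// turns the child on with probability q[i]; `leak` covers unmodeled causes.
double noisyOrProb(const std::vector<bool>& active,
                   const std::vector<double>& q, double leak) {
    assert(active.size() == q.size());
    double allFail = 1.0 - leak;  // probability that no cause fires
    for (std::size_t i = 0; i < q.size(); ++i)
        if (active[i]) allFail *= 1.0 - q[i];
    return 1.0 - allFail;
}
```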


Sun Nov  2 21:49:43 2003
 - make other variables available in DT formulas, such as:
    - current frame number of child
    - total number of frames in current utterance.
    - 

DONE: Sun Nov  4 22:32:50 2003:
Triangulation idea: 
  - add all deterministic ancestors as neighbors when going from DN to UGM.
Before triangulation:
 do:
   1. for each node, if any of its parents are
      deterministic, add those parents as neighbors.
   2. If any such newly added neighbors are themselves deterministic,
      add those new neighbors' parents as neighbors.
      If none of the newly added neighbors are deterministic, then stop.
   3. Goto 2.
  - Possibly do similar thing for sparse CPT nodes,
    i.e., add sparse ancestors if their effective cardinality
    is small enough.
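    Sketch of the procedure above for a single node. Node and
    extraNeighbors are hypothetical names; the graph is a DAG given as
    parent lists, and the result is the set of nodes that would be added
    as neighbors of v before triangulation.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical node record; not the GMTK RV class.
struct Node {
    bool deterministic;
    std::vector<std::size_t> parents;
};

// Deterministic parents of v, plus (repeatedly) the parents of every
// deterministic node added so far.
std::set<std::size_t> extraNeighbors(const std::vector<Node>& g,
                                     std::size_t v) {
    assert(v < g.size());
    std::set<std::size_t> added;
    std::vector<std::size_t> frontier;  // deterministic nodes to expand
    for (std::size_t p : g[v].parents)
        if (g[p].deterministic && added.insert(p).second)
            frontier.push_back(p);
    while (!frontier.empty()) {
        const std::size_t d = frontier.back();
        frontier.pop_back();
        for (std::size_t gp : g[d].parents) {
            // Every parent of a deterministic node is added; only the
            // deterministic ones are expanded further.
            if (added.insert(gp).second && g[gp].deterministic)
                frontier.push_back(gp);
        }
    }
    return added;
}
```

    The loop stops when no newly added neighbor is deterministic, which
    matches the stopping rule in step 2.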
    

Sun Nov  9 22:35:40 2003:
 Other things for inference
   - for viterbi decoding mode, don't store the probabilities for
     all time, just for the current and next clique, only   
     store the back pointers.
   - Also have a mode where only the back pointers for a particular
     "word" (variable, set of variables) are stored rather than
     all variables? (see what this would mean).
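     Sketch of the first point on a plain HMM rather than on cliques:
     scores are kept for the current and next time step only, while back
     pointers are kept for all time. The function name and the log-domain
     table layout are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// logInit[j]: log P(state j at t=0); logTrans[i][j]: log P(j | i);
// logEmit[t][j]: log-score of frame t in state j.
std::vector<std::size_t> viterbiPath(
        const std::vector<double>& logInit,
        const std::vector<std::vector<double>>& logTrans,
        const std::vector<std::vector<double>>& logEmit) {
    const std::size_t T = logEmit.size(), S = logInit.size();
    assert(T > 0 && S > 0);
    std::vector<double> cur(S), next(S);  // only two score vectors
    std::vector<std::vector<std::size_t>> back(
        T, std::vector<std::size_t>(S, 0));
    for (std::size_t j = 0; j < S; ++j) cur[j] = logInit[j] + logEmit[0][j];
    for (std::size_t t = 1; t < T; ++t) {
        for (std::size_t j = 0; j < S; ++j) {
            double bestScore = -std::numeric_limits<double>::infinity();
            std::size_t bestPrev = 0;
            for (std::size_t i = 0; i < S; ++i) {
                const double s = cur[i] + logTrans[i][j];
                if (s > bestScore) { bestScore = s; bestPrev = i; }
            }
            next[j] = bestScore + logEmit[t][j];
            back[t][j] = bestPrev;
        }
        std::swap(cur, next);  // previous scores are no longer needed
    }
    std::size_t state = 0;
    for (std::size_t j = 1; j < S; ++j)
        if (cur[j] > cur[state]) state = j;
    std::vector<std::size_t> path(T);
    for (std::size_t t = T; t-- > 0;) {
        path[t] = state;
        if (t > 0) state = back[t][state];
    }
    return path;
}
```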


Mon Dec 29 16:40:08 2003

 - nodes assigned to cliques, there are several categories.

 1) node is iterated by a separator coming into the clique
    iterate over: only values that exist in all incoming
    separators.
 2) node is actually assigned to this clique, which means
    that its parents must either be iterated before we iterate
    over this node
    iterate over: only values with non-zero probability.
 3) a node that exists because one or more of its children need 
    the values of the node, so we need to 
    iterate over: all possible values of node


Wed Dec 31 17:45:52 2003
  need to do something with
     1) continuous nodes (which are observations)
     2) discrete observed nodes (ideally, break the graph).

  cpts have their own value which gets set and then
  copied over again to RV. Make it such that CPT uses
  RV val directly rather than copying value over many times.
 

Sun Jan 11 06:53:27 2004
 - Go through all code and change variables to use
   unsigned when variables are really unsigned
 - Look for machine specific assumptions since problems
   might arise when compiling on 64b machines.
 - when this is done, remove all (unsigned**) casts
   in GMTK_MaxClique.cc in call to pack()

Tue Jan 13 00:07:02 2004
 - start doing many subclasses of RV.
   Right now there are virtual functions in RV and its
   derivatives that are doing checks that could
   be setup by an appropriate subclass that
   is created by the file parser. For example,
     1) the weight stuff 
     2) switching parent stuff
     3) etc.
    (i.e., if we're paying the cost of a virtual
     function, we might as well switch right
     to a function that does exactly what we want).
 - make a packed bit length member function for GMTK_PackedCliqueValue.h  
 - for hash tables and dynamic arrays, and when doing multiple iterations, use simple 
   learning algorithm to decide starting size (rather than fixed constants), 
   to avoid extraneous memory accesses.

Sat Jan 17 14:38:59 2004  (but updated since then) 
  - rewrite RV class hierarchy so that virtual function cost
    is all we pay for prob, begin(), next(), etc.
  - all RVs currently use STL, change to use sarrays.
  - summary: new features to add:
     - inference features
          - viterbi decoder and word only (or variable only) 
            backtrace, generalizing LVCSR decoders (see decoder papers).
          - n-best and lattice
          - loopy
          - importance sampling
          - variational inference
          - Sat Apr 24, 2004: sampling and hidden continuous mixture Gaussians.
     - usability features
          - ARPA LMs and FLMs
          - non-linear Gaussians and non-linear BMMs w. pseudo-2nd-order training methods.
          - DONE: smoothing for CPTs in EM
          - generalized parameter adaptation directly within GMTK
          - ??? (something I'm forgetting). (Sat Apr 24 22:21:09 2004, probably "decision tree"
            (or some form of) state clustering, generalized to
            graphical models)
          - generalized probability weights/exponents
          - decision tree formulas (Chris)
          - integrate in new obs code.
     - theory/algorithms
          - switching parents in inference in forward/backward
          - characterize nets which require particular M and S
          - triangulation for nets with sparse CPTs (Chris)       
     - tutorials
          - update and get aurora tutorial into CVS
          - get Karim's fully implicit tutorial integrated in
          - write other tutorials for
              - language modeling
              - pronunciation modeling
          - integrate in small test suite examples (exp1, exp2, etc.) which will
            both serve as tutorial material and something to
            test things out to ensure compiled version works.  


Sun Jan 18 02:11:09 2004
  - in boundary search, traverse to parents of children
    in next partition first, to try to find something useful sooner.

Wed Jan 21 23:22:18 2004
  - pre-allocation size
    - have a function that computes the clique weight relative to the
      junction tree and the JT root, one that includes the
      notion of separator driven clique iterations.
    - Use this notion of weight to compute a value for initial allocation
      size. Also, use 'beam' value to help guide how large the first
      allocation should be (if beam is large, we should allocate
      a smaller first size).


Fri Jan 23 16:02:09 2004
   - consider writing a parameter file format in XML.


Fri Jan 23 18:32:41 2004
   - add a section to the documentation about the "design" of decision
     trees, how there can be good ways and bad ways to do it, and how 
     bad ways can be slower. Give examples.
   - explain in documentation that a DT's job is to map every possible
     input variable values (each input variable can have values
     from 0:(card-1)) to a valid output value (from 0:(card-1)). Not
     all output values need to be produced, but all possible input
     variable values (the cross-product) must be considered, or otherwise
     you might either get 1) unexpected results since a value
     occurs you didn't expect, or 2) hopefully, you'll get a runtime error 
     saying that the output value is invalid.


Tue Jan 27 15:56:38 2004
  - have observation files also be able to read streams
    of words. The words are converted to integers using
    one of the internal word hash tables, where the
    hash tables were built using some vocabulary.
  - Perhaps look into "perfect" hashing functions for
    vocabularies.

Tue Jan 27 18:31:29 2004
  - add note in documentation about how cardinalities
    of 2^n have the  least efficient use of bits
    whereas cards of 2^n-1 have the most efficient
    use of their allocated space (assuming in each
    case that the variables use all their values with
    non-zero probability).

Wed Jan 28 20:40:22 2004
  - write GMTK's own tokenizer for .str files to avoid problems
    with flex library, etc.


Fri Jan 30 19:17:21 2004
  - triangulator, add option to just re-triangulate
    one of P, C, or E.
    Allows for parallelism. 
    DONE, but need way to merge different trifiles together.
    create program: gmtkMergeTrifiles


Sat Jan 31 11:45:06 2004
  - arguments code, when printing out default arguments, it
    prints out the ones after they potentially have been modified on
    command line, i.e.,
         foo -arg1 a -arg2 b -arg3 c -helpk
    will print out default values of a for arg 1 b for arg2 regardless of
    what they were in program. Fix this as it is confusing. 

Sat Jan 31 19:01:06 2004
  - get new inference working with disconnected networks.


Mon Feb  2 23:41:53 2004
  - figure out a way to have discrete RVs in different
    frames be observed or hidden (i.e., perhaps
    the hidden/observed status of a RV can be obtained
    from the observation file for a given utterance in
    some nice way). This would need to gel with
    the new RV class hierarchy. 
  - Wed Aug 18 20:34:24 2004: update, this is incompatible
    with new RV hierarchy, since in the new hierarchy,
    the hidden/observed status is determined by the RV type
    (but perhaps unroll() could pay attention to this).    

Mon Feb  9 21:07:55 2004
  - fix bug where
      If E = empty in right interface case, and where
           resulting E only contains interface nodes (which might
           result in a clique of size one if the interface has one node)
      If P = empty in left interface case (symmetric case as above).
   - probable solution. Allow for P or E to still be empty in .trifile.
   - other (possibly easy) solution: when P (or E) are empty, unroll an extra
    time to create a P' (or E') that is not empty. 


Mon Feb  9 21:11:40 2004
  - fix bug where buggy graph has core dump in triangulation code:
   frame : 0 {
   variable : state {
      type: discrete hidden cardinality 3 ;
      switchingparents: nil;
      conditionalparents: nil using DenseCPT("state0");
   }
   variable : obs  {
      type: continuous observed 0:5 ;
      switchingparents: nil;
      conditionalparents: state(0) using mixture 
            collection(CN) mapping("DT::state2obs");
   }
   }
   frame : 1 {
   variable : state {
      type: discrete hidden cardinality 3 ;
      switchingparents: nil;
      conditionalparents: state(-1) using DenseCPT("state1_with_state_pars");
   }
   variable : obs {
      type: continuous observed 0:5 ;
      switchingparents: nil;
      conditionalparents: state(0) using mixture 
           collection(CN) mapping("DT::state2obs");
   }
  }
  chunk 0:0
     
  This should probably check that unrolled graph produces a connected graph.
  (or allow for unconnected graphs).


Thu Feb 12 11:59:35 2004
  - also allow triangulation routine to take a SIGUSR1 to stop searching
    and report back what it has found so far.
  - add check pointing in triangulation code (say every 10 minutes, save  
    the best .trifile so far found).


Sat Feb 14 22:09:30 2004
In backward pass during EM, we might have zero probs that were only
discovered in backward pass, so check for 0 (or threshold) before
doing EM update. Should have an EM pruning threshold as well.
So, we'll have three pruning thresholds:
   1) cbeam - clique beam
   2) sbeam - separator beam
   3) ebeam - EM posterior probability beam, below this we don't
              bother to update.


Wed Feb 18 12:11:08 2004
  - have a program that goes through and checks all DTs for all
    values of all parents and ensures that all children are
    valid values for each location a DT is used. Possibly
    do this for all iterable DTs as well. 
    this will
      1) help reduce bugs by telling a user when DTs are problem
         before running the program.
      2) speed things up as dynamic checking can be removed.
   - Wed Aug 18 20:36:32 2004 update: if we switch to unsigned, we need only check
      upper bound, and no longer check for < 0 condition.
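    Sketch of such a checker. The DecisionTree type is a hypothetical
    stand-in (a callable from parent values to a child value), not the
    GMTK DT class. With unsigned values only the upper bound needs to be
    tested, as the update above notes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical DT signature: parent values in, child value out.
using DecisionTree = std::function<unsigned(const std::vector<unsigned>&)>;

// Returns true if the tree yields a value below childCard for every
// combination of parent values (parent i ranges over 0..parentCards[i]-1).
bool dtCoversAllInputs(const DecisionTree& dt,
                       const std::vector<unsigned>& parentCards,
                       unsigned childCard) {
    for (unsigned c : parentCards) assert(c > 0);
    std::vector<unsigned> vals(parentCards.size(), 0);
    while (true) {
        if (dt(vals) >= childCard) return false;  // unsigned: upper bound only
        // Advance vals like a mixed-radix odometer.
        std::size_t i = 0;
        while (i < vals.size() && ++vals[i] == parentCards[i]) vals[i++] = 0;
        if (i == vals.size()) return true;  // wrapped around: all combos seen
    }
}
```

    The cost is the product of the parent cardinalities per use of a DT,
    which is why it is better run once offline than during inference.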

Wed Feb 18 15:12:25 2004
  - add mention of Chiaping email from today about splitting
    to manual.

Wed Feb 18 23:57:26 2004
Check the following and use the one that is faster.
> Do you know if:
>
>   set<RandomVariable*>::iterator it;
>   const  set<RandomVariable*>::iterator it_end = foo.end();
>   for (it=foo.begin(); it != it_end; it ++) { ...}
>
> is indeed faster than:
>
>   set<RandomVariable*>::iterator it;
>   for (it=foo.begin(); it != foo.end(); it ++) { ...}
>
> I've been noticing it in your code, and if it is faster, will want to
> start converting over.
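  Sketch of a small harness for answering the question above by
  measurement. It uses std::set<int> instead of set<RandomVariable*> and
  hypothetical function names; results depend on compiler and optimization
  level, so no expected timings are given.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <set>

// Counts elements, calling end() on every iteration.
std::size_t countCallingEnd(const std::set<int>& s) {
    std::size_t n = 0;
    for (std::set<int>::const_iterator it = s.begin(); it != s.end(); ++it) ++n;
    return n;
}

// Counts elements with end() hoisted out of the loop.
std::size_t countCachedEnd(const std::set<int>& s) {
    std::size_t n = 0;
    const std::set<int>::const_iterator itEnd = s.end();
    for (std::set<int>::const_iterator it = s.begin(); it != itEnd; ++it) ++n;
    return n;
}

// Wall-clock nanoseconds for `reps` calls of f(s).
template <typename F>
long long timeNs(F f, const std::set<int>& s, int reps) {
    const auto t0 = std::chrono::steady_clock::now();
    volatile std::size_t sink = 0;  // keeps the calls from being optimized out
    for (int r = 0; r < reps; ++r) sink = sink + f(s);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

  Compare timeNs(countCallingEnd, s, reps) against
  timeNs(countCachedEnd, s, reps) on a large set, with the optimization
  flags used for the real build.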


Thu Feb 19 23:41:00 2004
  - tree-structured lexicon: it should be possible to code
    this up using sparse CPTs, and use the same standard inference procedure.
    Should investigate this.


Sat Feb 21 19:11:22 2004
  - investigate and use faster hash functions for the vhash stuff.
   -  Look at Bob Jenkins' hash function
      and the FNV (Fowler/Noll/Vo) Hash. 
  also:
    http://www.mobydisk.com/softdev/techinfo/hash_paper/Near%20Minimal%20Perfect%20Hashing.html
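    Sketch of the 32-bit FNV-1a variant for reference. It hashes a byte
    string; how keys would be fed in from the vhash code is not shown, and
    the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 32-bit FNV-1a: xor each byte into the hash, then multiply by the FNV prime.
std::uint32_t fnv1a32(const std::string& key) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 16777619u;             // FNV prime
    }
    return h;
}
```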


Sun Feb 22 14:36:17 2004
  - Allow soft evidence to come from an observation
    file
  - Allow probability distributions each frame to
    come from an observation file, so that distributions
    can come from NN output probabilities.
      - variable w/o parents can get its distribution from
        observation file. Define a new CPT that
        obtains its probabilities in this way.
      - NN probabilities in some sort of conditional distribution.
     In other words:
          - time-inhomogeneous CPTs (for vars without parents)
          - time-inhomogeneous CPTs for one parent, and observed binary children set to 1.
            i.e., we have a --> b,  where b is observed == 1 , and the
            score comes from hidden variable a, as P(b|a) = score_from_file(a).
  - This functionality should subsume that of hybrid decoders.

Tue Feb 24 11:01:06 2004
  - add an option controlling how the separators are sorted
    (not at all, by max overlap, or by min weight), and make it available 
    on the command line.


Mon Mar  1 23:09:05 2004
  - cpp include files should be smarter about the current working directory.
    Currently, if a file says include "foo.file", it uses the user's
    current working directory even if the file was specified using a long path name.
    Try to pre-pend the path to the file somehow (or add that to CPP's -I options by
    default)
     - include in documentation that user can give "-I path" options in the -cppOptions
       commands in GMTK.
     - Use environment variable for GMTK, GMTK_CPP_INCLUDE to include other
       paths for GMTK. 
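    Sketch of the "pre-pend the path" idea: derive the directory of the
    file being read and add it as a -I option. Function names are
    hypothetical, only '/' separators are handled, and paths containing
    spaces would need quoting before being handed to a shell.

```cpp
#include <cassert>
#include <string>

// Directory part of a path ("." if there is none). POSIX-style '/' only.
std::string directoryOf(const std::string& path) {
    const std::string::size_type slash = path.rfind('/');
    if (slash == std::string::npos) return ".";
    if (slash == 0) return "/";
    return path.substr(0, slash);
}

// Preprocessor options with the file's own directory added as a search path.
std::string cppOptionsFor(const std::string& file,
                          const std::string& userOptions) {
    std::string opts = "-I" + directoryOf(file);
    if (!userOptions.empty()) opts += " " + userOptions;
    return opts;
}
```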


Mon Mar  1 23:41:09 2004
  - in DT leaf formulas, add
     variables fp0 fp1 fp2 for frame numbers of parents 0 1 2, and fc for frame number of child.
  - also, be able to add cardinality of both parent and child.
  - also, be able to add number of frames in current segment.
    (this might make moot the need for special internal variables, as
    it is easy to have RVs with DTs to do this.) Make sure that we can have
    RVs without parents and with a MTCPT that always returns a single value,
    and that can have cardinality 1 (and so should never be placed in any clique
    value, since it is essentially another way of doing an observation, but here
    the observation changes each frame).


Mon Mar  1 23:50:32 2004
  - variables with card = 1 should be treated as observations, and triangulation
    code and methods should pay attention to this. 


Sat Mar  6 14:52:42 2004
  - let cbeam and sbeam come from an observation vector, so we can have dynamic beam pruning.
  - Also, give a file with a beam schedule based on percentage (i.e.,
    cbeam argument can either be a fp value, or a name of a file that has something like:
        0-10: beam1
        11-50: beam2
        51-95: beam3
        96-100: beam4
    where 0-10 says the first 10% of the file, 11-50 the next 40%, and so on.
  - also have a beam for each clique in some way (i.e., perhaps a growth function
    so that larger cliques have larger beams, etc.)
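    Sketch of reading a beam schedule in the "lo-hi: beam" layout shown
    above. Assumptions: beams are plain floating point numbers, ranges are
    inclusive percentages, and malformed lines are skipped; the names are
    hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

struct BeamRange { int lo; int hi; double beam; };

// Parses lines of the form "lo-hi: beam"; malformed lines are skipped.
std::vector<BeamRange> parseBeamSchedule(const std::string& text) {
    std::vector<BeamRange> out;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ls(line);
        BeamRange r;
        char dash = 0, colon = 0;
        if (ls >> r.lo >> dash >> r.hi >> colon >> r.beam &&
            dash == '-' && colon == ':' && r.lo <= r.hi)
            out.push_back(r);
    }
    return out;
}

// Beam for a position given as a percentage of the utterance (0..100);
// `fallback` is returned when no range matches.
double beamAt(const std::vector<BeamRange>& sched, int percent,
              double fallback) {
    for (const BeamRange& r : sched)
        if (percent >= r.lo && percent <= r.hi) return r.beam;
    return fallback;
}
```

    The caller would map the current frame to a percentage of the
    utterance length before calling beamAt.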

Sat Mar  6 19:52:15 2004
 - write an any-time script that iterates for a given amount of time:
   1) generate a random triangulation (meaning it chooses
      one of the triangulation heuristics at random).
   2) run one utterance of inference, timing the result
      (kill process if it takes too long).
 - Stores the N-best triangulation files by run-time. 

 
Sun Mar  7 14:13:59 2004
  - allow parameters to be read in any order, and allow not only backward
    but also forward references to items not yet defined. Do a final pass
    over parameters to resolve all references.

Sun Mar  7 19:41:24 2004
  - with '-verb 100' option, set up so that some form of indenting is
    working within a clique for RV assignment printing (easier to read).

X- Mon Mar  8 01:47:26 2004
   - fix two bugs that Karim found around this date (see gmtk mail)
     1) DT leaves with p0 rather than (p0)
     2) cliques with all obs, see karim learning string-edit distance example.

Mon Mar  8 12:27:41 2004
  - in DT leaves, add an abs() function.
  1) absolute value function abs(expression)
  2) unary negation
  3) floor, ceil, round (implement without floating point)
  4) new variables: 
       - cardinality of child: c
       - current frame number of parent: fp0, fp1, fp2, etc.
       - current frame number of children: fc
       - number of frames in current utterance: nframes
  5) right shift / left shift, rotate left, rotate right.
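  Sketch of items 3 and 5 using integer arithmetic only. It assumes that
  floor, ceil, and round apply to an integer quotient a/b, which the note
  does not state; names and the 32-bit word size are also assumptions.

```cpp
#include <cassert>
#include <cstdint>

// floor(a / b) for b != 0, using only integer arithmetic.
int floorDiv(int a, int b) {
    assert(b != 0);
    const int q = a / b, r = a % b;  // C++ division truncates toward zero
    return (r != 0 && ((r < 0) != (b < 0))) ? q - 1 : q;
}

// ceil(a / b) for b != 0.
int ceilDiv(int a, int b) {
    assert(b != 0);
    const int q = a / b, r = a % b;
    return (r != 0 && ((r < 0) == (b < 0))) ? q + 1 : q;
}

// a / b rounded to nearest, halves rounded up (b > 0 assumed).
int roundDiv(int a, int b) {
    assert(b > 0);
    return floorDiv(2 * a + b, 2 * b);
}

// Rotate x left by n bits within a 32-bit word.
std::uint32_t rotateLeft(std::uint32_t x, unsigned n) {
    n %= 32;
    return n == 0 ? x : (x << n) | (x >> (32 - n));
}
```

  Rotate right by n can be expressed as rotateLeft(x, 32 - n % 32).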
        
Mon Mar  8 20:48:40 2004
  Allow hidden continuous mixtures, and do inference simply
  by (importance) sampling from that distribution (so it is like discrete).
  This should be easy to get working with new inference code.

Wed Mar 10 21:56:35 2004
  - DONE: change to allow cliques with all observations, and all observed graphs. This
    should be fairly easy, just do less "error" checking.
  - change to allow disconnected frames (i.e., just don't send a message between partitions).
  - DONE: add a feature where we topologically sort nodes in a clique based on
    expected cost. I.e., if a deterministic node uses a complicated DT, then
    it should come as early as possible in a clique to avoid redundantly calling
    the expensive DT. Cheap DTs can go later in the clique order (i.e., extend
    the topological sort with continuous vars first to be a topological sort based
    on expected cost). Perhaps use DT description length as an estimate of the DT
    cost, or perhaps just the number of DT leaf nodes.


Fri Mar 12 17:13:42 2004
 - instead of unity score CPTs and Gaussians, produce alpha-score CPTs and Gaussians
   where alpha can be any value.

Sat Mar 13 18:45:03 2004
 - docs: place note in documentation about FPEs, and to first make sure that
   you don't have any NaNs in the feature files (since this will
   cause FPEs in GMTK and people might think this is a GMTK bug).
   You can use 'od' to check the status of raw binary feature files.
   I.e., 'od -f' will print things out in 32-bit floats. You can
   also use od to skip a bit (say if it is a 12-byte header).
   Also mention the use of obsPrint (the new pfile functions Karim
   did). All should go in documentation
 - Have obs file check for NaNs on input and report error (todo for Karim's
   new version of obsfile).


Sat Mar 13 22:24:53 2004
  - DONE (Wed Aug 18 20:48:12 2004): implement another form of pruning, where you provide the
    maximum number of clique or separator entries that survive.
    Number should be able to be specified either as percent (relative
    to the current total) or as an absolute number.
  - Note: clique pruning will be easy, but separator pruning
    in this case will be a bit more difficult.
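
  For the clique case, the idea can be sketched as below. This is an
  illustration only, not the code that went into GMTK: pruneToKBest and
  pruneToPercent are hypothetical names, and a clique is reduced here to a
  plain vector of log scores:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Keep only the k highest-scoring entries (the order of the survivors is
// not preserved). Scores are assumed to be log probabilities.
void pruneToKBest(std::vector<double>& scores, std::size_t k) {
  if (k >= scores.size()) return;
  // Partition so that the k largest scores occupy the first k slots.
  std::nth_element(scores.begin(), scores.begin() + k, scores.end(),
                   std::greater<double>());
  scores.resize(k);
}

// Percent variant: keep `percent` (0..100) of the current number of
// entries, but always at least one entry if the clique is non-empty.
void pruneToPercent(std::vector<double>& scores, double percent) {
  if (scores.empty()) return;
  const std::size_t k =
      static_cast<std::size_t>(scores.size() * percent / 100.0);
  pruneToKBest(scores, std::max<std::size_t>(k, 1));
}
```

  nth_element is used rather than a full sort since only the cut-off
  matters, which keeps the cost linear on average.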

Sun Mar 14 23:04:25 2004
  - make 1-page reference card for GMTK DTs with everything from syntax, rules, tips for speeding up,
    and all possible formulas, etc.

DONE: Sun Mar 21 05:24:38 2004
  - in inner inference loops, rather than have:
        rv->begin()
            computeProb
        while (rv->next())
  - have rv->next() also provide the probability, thereby
    avoiding yet another virtual function call (which is expensive). 
    i.e., have rv->next(prob) return true if prob was filled with the next probability.
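
  A toy illustration of the proposed interface. ToyRV is a hypothetical
  stand-in and not GMTK's RV class; the point is only that each loop
  iteration makes one virtual call instead of two:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a discrete random variable.
class ToyRV {
 public:
  explicit ToyRV(std::vector<double> probs) : probs_(std::move(probs)) {}
  virtual ~ToyRV() = default;

  // Reset the iteration to just before the first value.
  virtual void begin() { next_ = 0; }

  // Advance to the next value. Returns true if `prob` was filled with the
  // probability of that value, false once all values are exhausted.
  virtual bool next(double& prob) {
    if (next_ >= probs_.size()) return false;
    prob = probs_[next_++];
    return true;
  }

 private:
  std::vector<double> probs_;
  std::size_t next_ = 0;
};

// Inner loop: a single virtual call per value yields both the advance and
// the probability.
double sumOverValues(ToyRV& rv) {
  double total = 0.0;
  double p = 0.0;
  rv.begin();
  while (rv.next(p)) total += p;
  return total;
}
```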
   

Thu Mar 25 17:04:41 2004
  - define a pre-defined CPP variable that gives the major and minor version of GMTK.

Fri Mar 26 14:17:15 2004
  - make output word strings compatible with NIST word error scoring tools.
    In general, look at NIST tools, and make GMTK output as compatible as possible with them.

Fri Mar 26 17:38:32 2004
  - script to convert from an easy rep. of a SparseCPT into single GMTK sparseCPT in one file.

Sat Mar 27 15:14:17 2004
  - create a .gmtktriangulaterc file for common triangulation options.
    In addition, have an ENV variable for this purpose.

Sat Mar 27 20:57:04 2004
 - island algorithm should interact with gaussian component cache, in that
   cache memory should be re-claimed for different islands so that
   we don't use O(T) mem for cache but rather O(log T) mem.
 - Mon Aug 30 19:11:30 2004: for now, issue message saying that
   user might want to turn off caching when using island.

Sat Mar 27 21:08:06 2004
 - add bit about gaussian component cache to docs. 


Tue Mar 30 13:08:33 2004
  - add a noCPTCheck option, to allow cpts to read in non-probabilities, etc.
    Figure out how to get working with log probs, etc.

Fri Apr  2 12:41:23 2004
  - look into issue of n-P triangulation being different for different n
    (see email from Ozgur today regarding that).


Sun Apr  4 17:16:40 2004
  - not really a TODO, but an idea. Another completely different way of doing
    inference would be to off-line compile out all cliques into single
    integer states, and then have all loops be single iterations through
    all non-zero prob states (conditioned on current previous states).
    This would need to go through all utterances, and do this separately
    for each, since each utterance having a different DT would mean
    that there can't be one single compiled version (this still could be
    done offline, and stored on disk). 
 

Tue Apr  6 19:27:48 2004
  - make it so that verb, if set high enough, will also print
    out values of continuous observed RVs (rather than just C).
    This means making the 90-100 range print out everything, but
    higher 90s print out cont. observed vectors as well (this
    might produce tons of output but it might be useful for debugging).


Tue Apr  6 19:29:56 2004
  Karen's latest wish list:

   Thanks for the reminder about the GMTK wish list.  Here's my wish list so 
   far...  Some of these I've mentioned before, but just for completeness:

  -Arbitrary sets of variable values, including strings, rather than [0..card-1] 
  (I think you said this is already being done)

  -Clustering.  Ideally, I imagine a clustering scheme which is not separate 
  from training, but instead, gmtkEMtrain can optionally, at the end of every N 
  iterations, cluster two distributions whenever they meet some similarity 
  criterion.

  -Soft evidence

  -Very minor, but:  having the option to not have the "zero accumulator" 
  warnings when saving the accumulator files.  This could be a verbosity setting 
  (maybe this is already done in gmtkEMtrainNew--I haven't tried it yet).  I 
  tend to have a large number of small chunks, so my log files sometimes wind up 
  having huge numbers of warnings.

  -An optional probability floor for dense CPTs in gmtkEMtrain

  -Debugging...  I'm still thinking about what kind of output would be helpful.  
  For starters, it would be good if -showVitVals showed the conditional (log) 
  prob of each variable value.  One question that I often have is "why do these 
  observations have zero probability?" and it's hard to say what kind of 
  debugging output would help to answer that question--maybe someone more 
  familiar with the internals of the toolkit would know better.  gmtkJT 
  -verbosity 100 might do the trick, but I don't know how to parse it yet.  In 
  the long term, I can imagine an application that lets you interactively 
  suggest different observations and tells you where the zero probabilities 
  happen.  I'll keep thinking about other types of debugging output.


Wed Apr  7 11:51:32 2004
  - get compiling with intel compiler. Also, see:
    http://www.coyotegulch.com/reviews/intel_comp/intel_gcc_bench2.html

Thu Apr  8 00:05:03 2004
  - get island decoding working (need to have separator keep track of which
    entry is current max rather than having it assume that its values
    are set from right most clique).

      C0 -- s0 -- C1 -- s1 -- C2 -- s2 -- C3 --
               1>    2>    3>    <4          <b  

    1> and 3> are ceGatherIntoRoot msgs
    2> is a ceSendToNextPartition msg
    <4 is a deReceveFromIncommingSep msg
       and deScatterOutofRoot is (currently) a nop
    <b is a deReceveFromIncommingSep msg from before.

   The problem is that <4 assumes that its 's2' is currently
   set to appropriate values (max of C3 done by <b), but message 3>
   changes clique C2 and thus changes s2 so that it no longer
   holds max value.

  Possible fix: add two variables in InferenceSep
    // Two indices to get at the Viterbi values for current separator.
    // unsigned viterbiAccIndex;
    // unsigned viterbiRemIndex;
  to keep track of which entries in seps (such as s2) are max.
  Then <4 would restore those values for each separator. 
  and deScatterOutofRoot would compute them (since deScatterOutofRoot
  is called at the time C3 is at its true max value).
  

Mon Apr 12 10:31:10 2004
  - convert all CPT cards and vals to unsigned rather than int.
    (and thus convert all checks from 0<=v && v<card to just v<card)
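
  A small illustration of why the single comparison is enough once the
  types are unsigned (the function names are made up for this note):

```cpp
// With signed values, both bounds have to be tested.
inline bool inRangeSigned(int v, int card) { return 0 <= v && v < card; }

// With unsigned values, v >= 0 always holds, so one comparison suffices.
// A negative int that was converted to unsigned shows up as a very large
// value and is therefore still rejected by v < card.
inline bool inRangeUnsigned(unsigned v, unsigned card) { return v < card; }
```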

Thu Apr 15 12:14:09 2004
  - make an option when reading triangulation files to ignore RV cardinality differences
    between trifiles and corresponding .str files.

Fri Apr 16 11:36:36 2004
  - add to documentation note about '-w' option to cpp to suppress warnings such as:
     /homes/bilmes/g/tests_and_development/xwrd/str/masterFile.ears_decode_wordT:13:61: warning: pasting "/" and "stateTransition" does not give a valid preprocessing token


Sun Apr 18 20:02:59 2004
  - create a new version of GMTK_JunctionTree.{cc,h} called
    GMTK_JunctionTreeInwardOutward.{cc,h}, that does
    inward/outward form of collect/distribute evidence. There will be some graphs for which this
    will have a hugely beneficial effect as there are cases where the full state
    space does not occur until about 100 frames into the utterance.
    Set it up so that it has a min number of unrollings case (say 4 or 5) since it won't
    help at all for short numbers of unrollings.


Thu Apr 22 14:40:34 2004
  - support filtering p(x_t | y_{1:t}), prediction p(x_t | y_{1:s}) s < t,
    as well as smoothing p(x_t| y_{1:T}), where y is observed, and x are
    the hidden variables in the DBN.

   - filtering is easy, just give an option to command line
     that gives length other than true length of segment.
   - prediction can be done with a similar function, but using
     a long E partition.
    (but probably better ways exist).


Fri Apr 23 19:02:36 2004
  - arguments.cc, when string argument is given "" it shouldn't give warning.

Sat Apr 24 20:57:47 2004
  - look at log/exp implementations
    http://openbsd.secsup.org/src/lib/libm/src/e_log.c
  - and produce a fast function that does:
     f(x)= log(1+exp(x)) as one accurate and fast
     function, has to be faster than random lookup in table
     (i.e., the cost is less than the typical cache miss rate in
     the table, under a normal workload). 
    alternatively, could be log10(1+10^x)
    Note: if x << 0 then f(x) = 0
          if x >> 0 then f(x) = x
    non-linear part is when x ~ 0
    But we only care about the case when x <= 0
    since we form x as a difference between smaller and larger value in logadd.
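
  For reference, a plain (not yet fast) version of this function and of the
  logadd that uses it might look as follows. It relies on the standard
  library's std::log1p and std::exp, so it is a correctness baseline to
  compare a faster routine against rather than the fast routine itself;
  log1pExp and logAdd are names chosen for this sketch:

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// f(x) = log(1 + exp(x)), only needed for x <= 0.
inline double log1pExp(double x) {
  assert(x <= 0.0);
  return std::log1p(std::exp(x));
}

// log(exp(a) + exp(b)), computed as max + f(min - max) so that the
// argument passed to f is never positive.
inline double logAdd(double a, double b) {
  if (a < b) std::swap(a, b);              // now a >= b
  if (std::isinf(b) && b < 0) return a;    // adding log(0) changes nothing
  return a + log1pExp(b - a);
}
```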

  
Sat Apr 24 22:21:09 2004
  - make one global config.h file to define all constants used by the program
    including default floating point size (32, 64 bits), growth rates of memory
    allocators, default program options, etc. This gives us a one stop shop
    to set all defaults.

Wed Apr 28 23:37:01 2004
  - add option to turn off hash table sharing of clique and sep values and figure out a way
    to do this without taking an additional branch hit.

Mon May 10 13:06:14 2004 
  - in documentation, add that all mixtures in a collection must have the same dim.
  - 

Fri Jun  4 18:41:44 2004
From Simon.
The 'clean' Makefile targets should probably also remove the
depends.make files - otherwise these can still contain references to
files that have been removed or have moved to another directory, such
as pfile.h

Sat Jun  5 19:14:59 2004
  - convert binary dump variables in gmtkViterbiNew to ascii as well, redo all such
    outputs.

Tue Jun  8 12:14:48 2004
  - staged verbose outputs, so that when you want to be verbose in inference, you don't
    get so much verbose output during initial JT creation, etc.


Sat Jun 12 14:22:47 2004
  - add new observation component type: ChordalGaussian,
    which allows user to specify a sparse representation of
    a Gaussian (i.e., sparse inverse covariance). The triangulation
    engine will triangulate it (finding a good triangulation), and
    the result will be the chordal sparse Gaussian used. Since it
    is chordal, ML estimates still exist.


Sun Jun 13 15:14:51 2004
  - make quick reference card for decision trees 
  - make quick reference card for sparse CPTs 

Mon Jun 14 18:28:13 2004
  - become aware of parent values, comment should mention that parents won't change, so that
    when people add new cpts, they know that they don't need to copy parent values.

Mon Jun 14 19:24:21 2004
  - NGramCPT always has its own vocab object.
    RVs can also have their own Vocab objects.
    when a RV with a vocab object uses an NGramCPT, we check that the
    two vocab objects are the same.


Tue Jun 15 17:44:56 2004
  - in tri-file partition information ugm, say if moralized for
    triangulated graph as comment.

Tue Jun 22 15:41:46 2004
  - design function to do log(1+exp(x)) directly.
  - make use of log1p() = log(1+x) when x is very small.
      (just include source code of log1p() for portability).
  - check that log1p() is available, use HAVE_LOG1P config.h
  - include ASM directive for microprocessors that have fnlog1p functions (i387??)  

  - fast log function from web:
      
       Submitted by Laurent de Soras, posted on 30 March 2001 
       Fast log() Function, by Laurent de Soras: 
       --------------------------------------------------------------------------------
       
       Here is a code snippet to replace the slow log() function... It
       just performs an approximation, but the maximum error is below
       0.007. Speed gain is about x5, and probably could be increased
       by tweaking the assembly code.
       
       The function is based on floating point coding. It's easy to
       get floor (log2(N)) by isolating exponent part. We can refine
       the approximation by using the mantissa. This function returns
       log2(N) :
       
       
       inline float fast_log2 (float val)
       {
          int * const    exp_ptr = reinterpret_cast <int *> (&val);
          int            x = *exp_ptr;
          const int      log_2 = ((x >> 23) & 255) - 128;
          x &= ~(255 << 23);
          x += 127 << 23;
          *exp_ptr = x;
       
          val = ((-1.0f/3) * val + 2) * val - 2.0f/3;   // (1)
       
          return (val + log_2);
       }
        
       
       The line (1) computes 1+log2(m), m ranging from 1 to 2. The
       proposed formula is a 3rd degree polynomial keeping first
       derivate continuity. Higher degree could be used for more
       accuracy. For faster results, one can remove this line, if
       accuracy is not the matter (it gives some linear interpolation
       between powers of 2).
       
       Now we got log2(N), we have to multiply it by ln(2) to get the natural log : 
       
       
       inline float fast_log (const float &val)
       {
          return (fast_log2 (val) * 0.69314718f);
       }
       -- Laurent 
   - See also:
       http://algo.inria.fr/papers/html/RiSaShVH96/RiSaShVH96.html 
       BSD source code of log1p.c
          http://www.tundraware.com/Technology/Docs/FreeBSD-Source/lib/libm/common_source/log1p.c       


Tue Jun 22 18:36:23 2004
 new sequence CPTs (and lattice CPTs).
   one case for just A -> B
      where B has prob 1 for having value in position A of some list
   another case 
                    C
                    | 
                    v
              A --> B
      where B retains A's value if C = 0, and B is next value in
      sequence when C = 1.
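
      A tiny sketch of the two deterministic mappings, treating A and B as
      positions in the list. The function names are hypothetical, and
      staying at the last position when C = 1 at the end of the sequence is
      an assumption, since the note does not say what should happen there:

```cpp
#include <cstddef>
#include <vector>

// First case (A -> B): B takes, with probability 1, the value stored at
// position a of the list. Throws std::out_of_range for an invalid position.
inline int valueAtPosition(const std::vector<int>& seq, std::size_t a) {
  return seq.at(a);
}

// Second case (A -> B <- C): B keeps A's position when c == 0 and moves to
// the next position when c == 1, clamping at the end of the sequence.
inline std::size_t nextPosition(std::size_t a, unsigned c, std::size_t length) {
  if (c == 0 || a + 1 >= length) return a;
  return a + 1;
}
```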

  The sequences should be read-inable using HTK's MLF format. 


Wed Jun 23 00:24:01 2004
  DT formula reading errors when {} characters are not there, causes core dump. 
      

Fri Jul 16 00:52:15 2004
  >When gmtkTriangulate is run with -anyTime and a non-integer time string, the decimal point is ignored.  E.g., -anyTime 1.5h will
  >run for 15 hours, and 15.0m will run for 150 minutes.  It doesn't make a big difference if non-integers are supported or not, but
  >if not, it would be good to have a warning if a user enters a non-integer.
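
  One possible way to accept fractional values, sketched with strtod.
  parseDurationSeconds is a hypothetical helper, and the s/m/h unit
  suffixes are assumed from the examples in the report rather than taken
  from the actual -anyTime parser:

```cpp
#include <cstdlib>
#include <string>

// Parse strings such as "90s", "15.0m", or "1.5h" into seconds.
// A bare number is taken as seconds. Returns a negative value on
// malformed input so the caller can print a warning.
double parseDurationSeconds(const std::string& text) {
  if (text.empty()) return -1.0;
  char* end = nullptr;
  const double amount = std::strtod(text.c_str(), &end);
  if (end == text.c_str() || amount < 0.0) return -1.0;  // no usable number
  const std::string unit(end);  // whatever follows the number
  if (unit.empty() || unit == "s") return amount;
  if (unit == "m") return amount * 60.0;
  if (unit == "h") return amount * 3600.0;
  return -1.0;  // unknown unit suffix
}
```

  With this, "1.5h" gives 5400 seconds rather than 15 hours.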


Sun Jul 18 22:52:56 2004
  unityScore CPT iterators should work (right now they just die with an assertion).
   (but this is OK since unity score CPTs apply only to observed variables without parents).
   Also, fix class hierarchy in CPT class.

Tue Jul 27 14:29:03 2004
 -  in the (yet to exist) config.h file, give option to make means, covariances, dlinks double (64-bit) rather than 32-bit. 
 -  modify error messages about the variance floor to mention that, as an option, the user can also increase variance precision to 64-bit. 


Tue Jul 27 14:57:19 2004
  - modified form of separator driven inference, where a separator can drive other cliques coming into a clique even
    if the separator doesn't contribute probability. I.e.,

           A --  B  
               /
             C

    with messages from A->B and C->B. If the AB sep has intersection with C, and if AB is quite sparse,
    rather than generate all of C only to have it squashed when going into B, we can use AB sep to
    keep C from generating in the first place the stuff that will ultimately just get squashed.


Tue Jul 27 17:44:28 2004
  - trifile prints out more partition information, interface, number of variables, etc.


Thu Jul 29 14:29:41 2004
  - export to the command line an option to not process input via cpp (i.e., to assume no preprocessing is needed).

Fri Jul 30 01:12:00 2004
  - allow DT formulas to have variables corresponding to the integer
    portion of the global observation files using the frame of the child and/or parent.
    I.e., p3o21 would be the integer in the observation file at position 21 using
    the frame of parent3, or co12 would be the int in the obs file at the frame of the child.

Tue Aug  3 15:07:03 2004
  -jt_info.txt dispositions, print text rather than numbers.


Thu Aug 12 12:21:15 2004
  - improve clique packing code to be smarter about how it packs cliques into multiples of words.
    Right now, we always allocate multiples of machine words to pack cliques. There will most often
    be wasted space at the end of the last word. For multi-word packings, code should figure out
    how to shift values around so that we minimize the number of values that cross word boundaries
    since they cost more to pack and unpack, while staying within the constraint of using the same number
    of words.


Wed Aug 18 18:36:47 2004
  - add support for immediate parameters in str file, i.e., ability to give
    immediate DTs  as in:
           DeterministicCPT(" decision tree")
           DenseCPT("etc.")
           FNGramCPT("fngram spec.")
    etc.

Wed Aug 18 18:44:42 2004
 - Figure out some way to allow for a variable number of variables per frame
   rather than a fixed number of variables (this is a simple version of switching
   existence, update inference algorithms for this).

Wed Aug 18 19:30:28 2004
  - DT formulas:
     add functions for mapping together lots of functions, i.e.,:
         -  add(e1,e2,...,eN) // adds N expressions together
         -  and(e1,e2,...,eN) // ands N expressions together
         -  or(e1,e2,...,eN) // ors N expressions together
         -  median(e1,e2,...,eN) // takes the median of N expressions
         -  max(), min(), 
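
    A sketch of what such N-ary functions could compute once the argument
    expressions have been evaluated. The names are hypothetical, and two
    choices here are assumptions not settled by the note: and/or are
    treated as logical (0/1) rather than bitwise, and median returns the
    lower median for an even number of arguments:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

using Args = std::vector<unsigned>;  // already-evaluated e1..eN

inline unsigned dtAdd(const Args& e) {
  return std::accumulate(e.begin(), e.end(), 0u);
}
inline unsigned dtAnd(const Args& e) {  // 1 if every argument is non-zero
  return std::all_of(e.begin(), e.end(),
                     [](unsigned v) { return v != 0; }) ? 1u : 0u;
}
inline unsigned dtOr(const Args& e) {   // 1 if any argument is non-zero
  return std::any_of(e.begin(), e.end(),
                     [](unsigned v) { return v != 0; }) ? 1u : 0u;
}
inline unsigned dtMedian(Args e) {      // e must be non-empty; taken by value
  auto mid = e.begin() + (e.size() - 1) / 2;
  std::nth_element(e.begin(), mid, e.end());
  return *mid;
}
inline unsigned dtMax(const Args& e) {  // e must be non-empty
  return *std::max_element(e.begin(), e.end());
}
inline unsigned dtMin(const Args& e) {  // e must be non-empty
  return *std::min_element(e.begin(), e.end());
}
```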
    

Wed Aug 18 19:42:28 2004
  - trifiles include username of creator as well as time of creation.

Wed Aug 18 20:23:43 2004
  - add fixed memory usage command line option so that we exit with an error
    if a certain amount of memory is exceeded (rather than just swapping and
    killing the machine). Integrate this with gmtkTime -multi so that
    if the fixed memory (or time) is exceeded, we free-up everything
    done so far in current inference pass, and return to beginning state.
  - This might be hard since many OSs (e.g., currently linux) don't
    support setrlimit with fixed memory

Wed Aug 18 21:05:31 2004
Summary of many of the new features compiled from the above up to the
current date (this includes major new features, but not the bugs or
other minor things mentioned above).

     - fast inference and other inference features 
          - viterbi decoder and word only (or variable only) 
            backtrace, generalizing LVCSR decoders (see decoder papers).
          - n-best and lattice generation
          - loopy propagation
          - importance sampling
          - variational and mean-field inference
          - sampling and hidden continuous mixture Gaussians.
          - fast boundary search, using max-flow/min-cut algorithms.
          - ability to max/sum variables (max over some variables, sum over others,
            for hybrid decoding).
          - gmtkFlatten
          - gmtkDecode, and support for n-best and lattice generation.
          - option for inward/outward inference rather than forward-backwards.
            This can significantly speed things up since we don't keep things
            around that are zero (try to have a picture explaining this).
         - support filtering p(x_t | y_{1:t}), prediction p(x_t |
           y_{1:s}) s < t, as well as smoothing p(x_t| y_{1:T}), where
           y is observed, and x are the hidden variables in the DBN.
         - island algorithm interacts with Gaussian caching, so that we cache current island only.
         - virtual evidence extended to separator driven inference so that
           the virtual evidence in a clique acts like a separator, thereby effectively
           pruning away what is not necessary.
         - Single-Pass Retraining, ala HTK (sec 8.6). (single pass retraining).

     - underlying different models
        - native support for time-inhomogeneous DBNs
        - non-linear Gaussians and non-linear BMMs w. pseudo-2nd-order training methods.
          non-linear regression models, lasso, etc.
        - smoothing for CPTs in EM
             Laplace smoothing (DONE), 
             Good Turing within EM training, false counts.
        - generalized parameter adaptation directly within GMTK
        -  (Sat Apr 24 22:21:09 2004, probably "decision tree"
           (or some form of) state clustering, generalized to
           graphical models). Clustering of two distributions when they
           get very similar (presumably, p(a|b), when two values of b yield
           the same dist over a, we cluster bs together.
        - generalized probability weights/exponents
        - switching weight and penalty.
        - native mode for virtual evidence 
            - enables native-mode hybrid DBN/ANNs and DBN/SVMs.
        - hidden continuous nodes, both GC models, and also non-linear hidden
          continuous nodes, trained using sampling approaches.
        - sparse inverse covariance matrices (corresponding to sparse undirected graphical
          models as Gaussians). To avoid needing to do an iterative scheme, triangulate
          this model giving chordal Gaussians.
        - ancestral graphs and Gaussians.
        - sparse inverse covariance matrices, non-chordal, using iterative proportional scaling (IPS)
          to train within EM 


     - usability features
          - ARPA LMs and FLMs
          - sequence and lattice CPTs (that iterates through
            a lattice depending on a specific set of parents).
            Like a sparse CPT but easier to specify since it keeps track
            of where it is. 
          - decision tree formulas (Chris)
          - integrate in new obs code.
          - disconnected and static networks
          - add own cpp-like GMTK pre-processor
               (with many but not all of the features of cpp).

          - much better user-level error message reporting
               - not only that there is a cycle but which vars are in cycle 
          - much improved and more complete documentation with many examples.
               - quick reference cards (1 page for decision trees).
          - tools for easy automatic parameter creation and structure creation.
            (meta level description of parameters rather than needing to specify
             them all). Also meta tools for Gaussians & continuous densities.
             (converting from dense to sparse cpt).
          - graphical user interface for graph specification and parameter specification
          - regular expressions to determine objects not to train.
          - symbolic observations such as support for language files, etc.
          - Arbitrary sets of variable values, including strings, rather than [0..card-1] 
            so that user interacts with variable values as strings (even though
            internally they are unsigned integers).
          - utilities to convert to and from other standard formats:
             - other GM systems and other speech systems.
             - other systems, HTK, etc.
             - compatibility with all NIST word scoring tools.
          - emacs editing mode for GMTK structure files.
          - user debugging features
              - (see Sat Mar 15 22:33:27 2003 in TODO file)
              - gmtk user-level debugger to help users figure out why they
                get zero probabilities (i.e., bugs in their deterministic cpt specifications).
          - fixed upper bounded memory usage.
          - allow for empty frames to do instant down-sampling and multi-rate
            processing.
          - dynamic beam pruning (so that beam widths or counts can come from global
            observation file). beam comes from obs frame corresponding to first 
            variable in the clique. Also, allow beam formulas on command line,
            to adjust variable beams. (might want to have wider beams at beginning
            of segment, and smaller in the middle).
          - allow parameters to be read in in any order, rather than in
            an order that requires any dependencies to have been read in beforehand.
            Do final pass to ensure all is there.
          - set up GMTK wiki page.


     - software speedups
         - custom floating point routines to implement log(1+exp(x)) as a
           single function.
         - much better hash tables

     - theory/algorithms
          - better boundary search algorithms & predicting of when boundary
            will and won't be useful.
          - optimal separator iteration orders
          - switching parents in inference in forward/backward
          - characterize nets which require particular M and S
          - triangulation for nets with sparse CPTs (Chris)

     - tutorials
          - many tutorials and examples to be created and released on the web.
              (list different types).
          - update and get aurora tutorial into CVS
          - get Karim's fully implicit tutorial integrated in
          - write other tutorials for
              - language modeling
              - pronunciation modeling
          - integrate in small test suite examples (exp1, exp2, etc.) which will
            both serve as tutorial material and something to
            test things out to ensure compiled version works.  


DONE: Fri Aug 20 15:01:48 2004
  - remove wrong copyright note from gmtkNGramIndex.cc


DONE: Tue Aug 24 17:50:48 2004
  shift, scale, and penalty definitions:
  given a probability p, we do:

   Two options:
        1) penalty*p^scale+shift
        2) scale*p^penalty+shift

   We choose 1, since it corresponds to HTK.

      Or logged:
         log(penalty*p^scale+shift)
      =  log(penalty*p^scale) ++ log(shift)
      =  (log(penalty) + scale*log(p)) ++ log(shift)

       weight: 
         scale value 1.0 penalty 0:0 shift value 0.5 ;
       | scale 0:0 penalty 1:1
       | scale 2:2;
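
   A sketch of option 1 evaluated in the log domain, following the three
   lines of the derivation above, with "++" implemented as a log-add.
   logSum and modifiedLogProb are names made up for this note, and it
   assumes p > 0, penalty > 0, and shift >= 0 are given in the linear
   domain:

```cpp
#include <cmath>
#include <utility>

// The "++" operator above: log(exp(a) + exp(b)).
inline double logSum(double a, double b) {
  if (a < b) std::swap(a, b);              // now a >= b
  if (std::isinf(b) && b < 0) return a;    // adding log(0) changes nothing
  return a + std::log1p(std::exp(b - a));
}

// log(penalty * p^scale + shift), computed from logP = log(p) as
// (log(penalty) + scale*log(p)) ++ log(shift).
inline double modifiedLogProb(double logP, double scale,
                              double penalty, double shift) {
  const double scaled = std::log(penalty) + scale * logP;
  return logSum(scaled, std::log(shift));  // log(0) is -inf, handled above
}
```

   With scale 1, penalty 1, and shift 0 this returns log(p) unchanged.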


Wed Aug 25 14:54:03 2004
  - change makefile so that all RV objects in one library.

Fri Aug 27 05:15:25 2004
  - Add to docs. In DT formulas, the 'card' of a continuous random variable
    evaluates to '0'. 
  - add child card() to DT leafs of discrete children.

Sat Aug 28 15:54:29 2004
  Remove all the extraneous "f >=0" for typeof f  == unsigned in ObservationMatrix
  as in:
  float*const floatVecAtFrame(unsigned f, 
			      const unsigned startFeature) {
    assert (f >= 0 && f < _numFrames);
    return (float*)(featuresBase + _stride*f + startFeature);
  }

Sun Aug 29 07:53:40 2004
  - perhaps modify syntax so that you can specify just 'parents' rather than
    'conditionalparents'

DONE: Sun Aug 29 08:01:18 2004
  - allow immediate value observed RV to have a special value 'frame' which
    is the frame of the RV.
 
Mon Aug 30 04:03:05 2004
  - Time spent triangulating should go mostly to C; after that, E is often
    much more important to triangulate well than P, since in P we are often starting with
    smaller state spaces, but by the time we get to E, everything has been expanded.
  - DONE in BoundaryTriangulate,
  - TODO: do in scripts.

DONE: Mon Aug 30 06:36:52 2004
  - for immediate observed RV, other constants to add:
     frame - returns frame value of var
     numFrames - returns number of frames in current segment
     segmentNum - returns current segment number
     numSegments - returns number of segments
  - also add to DT leaf formulas.

DONE: Mon Aug 30 09:04:32 2004
  - change gmtkTime -multi to use fork() set/getrlimits to avoid blowing up on
    bad triangulations.

Mon Aug 30 10:08:14 2004
  - Speedup: In, MaxClique::ceIterateAssignedNodesNoRecurse(), consider
    adding ability to the clique packing code to update one RV's value directly
    in the packed clique value. That way, at the innermost iteration, rather than
    needing to sweep through all the RVs reconstructing the entire packed clique value,
    we just update a few bits, and copy the packed clique value to its appropriate
    place in the clique value list.


Mon Aug 30 10:59:48 2004
- New things since europe:
 1) new RV hierarchy, much code changed.
 2) new immediate constants
       frameNum, numFrames, segmentNum, numSegments
 3) switching weights
 3.3) new simplified but more flexible CPT interface
 4) virtual evidence CPTs.
 5) better and more informative error messages (many now include line number location).
 6) working viterbi island algorithm ??? (need to check)
 7) better use of -debug values. 60,65,70,80 for inference are the levels of increasing 
    verbosity + better printing format.

DONE: Tue Aug 31 10:19:20 2004
  - fix bug in error message bad parsing of cpp.
ERROR: In file '"PARAMS/masterFile.Mar9_16" 2
', DT 'endOfUtterance', invalid leaf node value '(p0)':  RngDecisionTree:: readRecurse leaf node value

Wed Sep  1 07:20:49 2004
   for DOCS, nice equation to describe penalty:
   W* = argmax_W ( log p(X|W) + \alpha \log p(W) + n(W)*\beta )
    where \alpha is the scale, and \beta is the word insertion penalty,
     and n(W) is number of words in W (so the more words we choose,
     the more the penalty if we choose \beta to be negative).

Fri Sep  3 16:10:39 2004
  - now that VECPTs are in, there is more need to not need an obs file on the command line and to get the
    current utterance length either from a file of lengths, or from the VECPTs themselves.
  - add back in the .active() checks for the global observation files.

Fri Sep  3 21:21:22 2004
  - check for bug in DT formulas: don't accept 'p2' as a variable when there are only 2 parents (i.e.,
    a formula should check, for each p_k, that k is valid for the current number of parents).
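  - sketch of such a check. parentRefsValid is a hypothetical helper, not part of the
    existing DT parser, and it assumes a parent reference is written as 'p' followed by
    decimal digits and is not part of a longer identifier.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns true if every parent reference p<k> in the formula has k < numParents.
bool parentRefsValid(const std::string& formula, unsigned numParents) {
  const std::size_t n = formula.size();
  for (std::size_t i = 0; i < n; ++i) {
    if (formula[i] != 'p') continue;
    // Skip a 'p' that sits inside a longer identifier, e.g. "step3".
    if (i > 0) {
      unsigned char prev = static_cast<unsigned char>(formula[i - 1]);
      if (std::isalnum(prev) || prev == '_') continue;
    }
    std::size_t j = i + 1;
    unsigned long long k = 0;
    while (j < n && std::isdigit(static_cast<unsigned char>(formula[j]))) {
      // Stop growing k once it is already out of range, so it cannot overflow.
      if (k < numParents) k = k * 10 + static_cast<unsigned>(formula[j] - '0');
      ++j;
    }
    if (j == i + 1) continue;           // 'p' with no digits: not a parent reference
    if (k >= numParents) return false;  // e.g. 'p2' with only 2 parents
    i = j - 1;
  }
  return true;
}
```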

Mon Sep  6 11:02:55 2004
  - DONE: get cliques of size 1 working (there is no reason they should not work as they are valid cliques). 
Tue Sep  7 13:42:48 2004
  - for DOCS: include discussion about difference between observed RV and RV with no parents and
    a DT that always returns the same value, and that observed is better and why.

DONE: Tue Sep  7 18:14:56 2004
  - bug:
  - cd ~bartels/A/word_baseline 
  - run -of1 ../DATA/testing_2.scp -nf1 39 -ni1 0 -fmt1 htk -iswp1 true -inputMasterFile PARAMS/masterFile.decoding -inputTrainable ./LEARNED_PARAMS/initial.gmp -str ./PARAMS/acoustic_decoding.str -printWordVar Word -varMap ../dictionary/word_list_2 -transitionLabel Word_Transition -island T -base 2 -ckbeam 5000 -dcdrng 0:0 -verb 50

Sat Sep 11 14:36:36 2004
  - gmtkTime, report performance also in number frames/second.

Tue Sep 14 01:19:08 2004
  - new DT types and features.
    - sequence  - successive sorted sequence of integers,no gaps, can start/end at any int, implemented
                  as an array lookup. 
       Options can be things like:
           sequence 11  0 ... 9 default (would be 0 1 2 3 4 5 6 7 8 9)
       same as:
           sequence 11  0 1 2 3 4 5 6 7 8 9 default (would be 0 1 2 3 4 5 6 7 8 9)
       skipby option???
           sequence 11  0 ... 9  skipby 1 default (would be 0 1 2 3 4 5 6 7 8 9)
           sequence 11  0 ... 19 skipby 2 default (would be 0 2 4 6 8 10 12 14 16 18)
           sequence 11  0 ... 29 skipby 3 default ( would be 0 3 6 9 ...)

    - set - arbitrary set of individual integers, implemented with a hash table.
    - ranges - arbitrary set of integer sets. Implemented with a hash table for the single integers,
               and a (sorted if possible) array of bp_range objects for the remainder. 
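
    Sketch of how a 'sequence' query with 'skipby' could become a constant-time index
    computation. The syntax above is still undecided, so this only assumes a sequence of
    the form start, start+skipby, ..., not exceeding end; sequenceBranch is a hypothetical
    name and -1 stands for the 'default' branch.

```cpp
#include <cassert>

// Map a parent value v to its branch index within the sequence
// start, start+skipby, start+2*skipby, ... (<= end); -1 means "default".
int sequenceBranch(int v, int start, int end, int skipby) {
  assert(skipby > 0 && start <= end);
  if (v < start || v > end) return -1;
  int off = v - start;
  return (off % skipby == 0) ? off / skipby : -1;
}
```

    The returned index would then be used directly as an array lookup into the child nodes.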


Wed Sep 15 21:17:36 2004
  - docs/manual idea: give some arbitrary DBN and by-hand derive forward/backward
    equations for the model using the CI statements made by the model. But
    are these equations good? the best we can do? Show the triangulation
    of the model corresponding to these equations, and then say how
    better to think of this in terms of finding a different triangulation,
    which means finding the cheapest set of equations. 
    It can be faster to write a general toolkit and search for a good triangulation 
    since we've got a systematic way of iterating over all possible sets of equations,
    than it is to write one set of equations in C by hand (general phipac philosophy).


Sat Sep 18 21:15:19 2004
  - Speedup/memory reduction: change so that P, Cu0, C, and E
    partitions use the same origin clique (and thus
    hash tables, memory arena, memory size memorization, etc.) when they are compatible. Right now they
    each have their own separate ones.

Sat Sep 18 21:18:05 2004
  - fix error messages with the bp_range class, since currently range can kill the program
    without any indication of where/when it is wrong.

Tue Sep 21 15:59:32 2004
  - bug: divide by zero error from bartels, 9/17/04. 

Wed Sep 22 20:34:42 2004
  - add check to MTCPT children that their card is not > \prod_i card(par_i) -1.

Thu Sep 23 16:51:19 2004
  DONE:  -cppCommandOptions, add default -I. and -I<dir_that_str_file_is_in>
     last (after -cppCommandOptions) when calling cpp.
  , need to add this to docs, that this is done behind the scenes.

Mon Sep 27 15:07:26 2004
  - Test a bunch of .str files, such as downsampling right within the structure file.
  - empty frames.
  - see Gang's message of 9/27/04
  - fix bug suggested in fix in response to Gang's message, 9/27 that I sent out 9/28 (having to
    do with re-writing gmtkViterbiNew).

Sun Oct 10 20:27:38 2004
  - add option to gmtkTriangulate to add extra string in comment in trifile.

Tue Oct 19 19:14:58 2004
  - add argument option to send to whatever command PAGER is set to (so command line arguments don't scroll off screen).

Tue Oct 19 19:38:02 2004
  - clique printing for just forward probability in gmtkJT as well as doDist.

Tue Oct 19 20:08:48 2004
  - VECPT should be able to use global obs 

Tue Oct 26 00:52:12 2004
  - the new gmtk viterbi should have option to place output of each
    utterance in separate file, so that each file contains score and viterbi path.

Wed Oct 27 12:48:48 2004
  - new docs. Write a section on DT optimization ideas, i.e., that you should write DTs for efficiency.
    Give example of right and wrong way to write:
        f(0,x) = 0 for all x
        f(1,x) = f(x)
    so query the first variable first. 

Wed Oct 27 13:26:52 2004
  - make program gmtkJunctionTree that creates a junction tree from a
    graph and a trifile.
    Normally the inference programs make their own junction tree, but gmtkJunctionTree makes
    one offline. Also .jt files can be edited by hand to specify your own junction trees for
    a given triangulation. .jt files also specify interface cliques (from the set of valid ones).
    gmtkJunctionTree should create .jt files that indicate which cliques are potential interface cliques.


Thu Oct 28 00:20:56 2004
  - do phipac [4,2,1] like optimizations for:
      1) Gaussian evaluation
      2) Hash key evaluation.
   
Thu Oct 28 00:21:20 2004
  - in separator sort, put least intersection last. 
    - when a tie occurs, put the bigger (more variables) separator earlier.
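    - sketch of that ordering as a comparator. SepInfo and sortSeparators are hypothetical
      names; it assumes each separator has already been summarized by an intersection size
      and a variable count.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-separator summary used only for ordering.
struct SepInfo {
  int id;                     // which separator this is
  unsigned intersectionSize;  // size of its intersection (larger goes earlier)
  unsigned numVars;           // number of variables in the separator
};

// Least intersection goes last; on ties the bigger separator goes earlier.
void sortSeparators(std::vector<SepInfo>& seps) {
  std::stable_sort(seps.begin(), seps.end(),
                   [](const SepInfo& a, const SepInfo& b) {
                     if (a.intersectionSize != b.intersectionSize)
                       return a.intersectionSize > b.intersectionSize;
                     return a.numVars > b.numVars;
                   });
}
```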

Thu Oct 28 13:21:03 2004
  - timing limit arguments to gmtkTriangulate for boundary and triangulate.

Mon Nov 15 13:55:16 2004
  - make sure we are compatible with quicknet for NN-based VE.

Thu Nov 17 07:24:50 2005
  - add max-margin discriminative parameter training.
  - also add simple MMIE, and MCE training.

Fri Nov 19 12:15:41 2004
  - for docs: include bit about how different structures can be used to train same set of parameters  
    by using the same master file - e.g., if you have three templates, one that is 1 frame long,
    another that is 2 frame, and a third which is 3 frame, each of these templates can use the 
    same master file, write out accumulators that are all compatible with each other (i.e.,
    accumulator file is defined according to the master file, not by the structure file, I think).
  - for docs: include bit about how to use short utterances by concatenating things, and using a VE node
    with value 1, and with an always 1 DT. Also, why it works.
    update Fri Dec 31 02:01:06 2004: once disconnected networks are working, this will be moot. 

Wed Nov 24 12:43:32 2004
   gmtkEMTrainNew
  - parallel EM training, when no more training, just acc step, confusing to
    need to do -trrng nil -maxE 1 (give option to do maxE 0)

Wed Nov 24 17:44:12 2004
  - figure out why right interface case is sometimes faster.

Thu Nov 25 19:24:08 2004
  - docs: recommended names of objects. E.g., use standard conventions to make
    things easier to find/debug. E.g., a mixture might have suffix "_MX" or "MX",
    a DT might have suffix "DT", a cpt "CPT", etc. It is valid to name, say, an
    MDCPT and its DT the same (since the names live in different name spaces),
    but try not to do that. Issue informational warning (debug level 11) if this is done.


Sat Nov 27 09:44:40 2004
  - for docs: talk about how unrolling works. I.e., that vars are first duplicated in unrolled graph parentless,
    and that edges are determined by children (i.e., after duplication, parents are added
    for each corresponding child, not vice versa).
    Example graph:

                c1 -> b
                ^
                |
          x0 -> x1 -> x2
          |     |     | 
          v     v     v     
         o0     o1    o2

          P      C     E

     I.e., The C partition is duplicated without the link to b, and it is only the very last partition that
     has a link to the E partition (since that is where the b variable exists as a child,
     and has the parent in the previous frame).
 
    Way to explain it: parents are determined to satisfy CPTs (e.g., b's CPT needs a parent) based
    on the current child (i.e., each child gets a CPT and then the child searches for parents
    according to after unrolling, and according to the spec in the structure file).

 
Sat Nov 27 22:10:37 2004
  - gmtkTime -multi should also take vcap, icap, etc. options along with trifile so you don't
    need to re-start to try these options.
  - also, gmtkTime should take just vcap, icap, etc. options to try same trifile over and over
    with different JT creation options.


Mon Dec 27 19:40:22 2004
 - veseps
   1) In InferenceSeparatorClique make 
       iAccHashMap and separatorValues pointers. 
      - Also, add field
        corresponding to VE sep so that these pointers are not deleted
        on destruction, and so that backwards pass doesn't propagate into these separators.
     -  During forward pass, only do intersection without any probability update (rather
        than doing a log multiply by one).

   2) MaxClique contains a pointer to a data structure consisting of an array of
          - parents, child, grandchild
      which are used to create VE separator cliques.
 
   3) we create and sort the separator cliques as normal.

   4) Just before inference, the SeparatorClique computes the tables, reads them
      from disk, or writes them to disk. SeparatorClique creates its own temporary
      InferenceSeparatorClique for the VE separators.

   5) InferenceSeparatorClique looks at the separator clique and if it is a VE sep,
      uses those pointers rather than creating its own new ones.

   6) SeparatorClique deletes the inference separator clique when it dies.

- DONE: remove 'clamp' keyword.

Tue Dec 28 13:38:22 2004
  - extend .str file to allow for undirected edges so that we can do
      -  maxent models
      -  iterable VE constraints on graph

Fri Dec 31 00:08:39 2004
  - for DOCS: mention how to look at and store big log files:
       script |& gzip --best -c > biglogfile.gz
       zcat biglogfile | egrep bla

Fri Dec 31 01:54:39 2004
  - for DOCS. Encourage people to make their scripts relative to a top level directory,
     - don't use absolute names
     - should be able to tar up a top level directory, and untar it some place else
       and have it still work.
     - when including files in cpp, use #include "  " and use "-I dir" on gmtk cpp command line.


Fri Dec 31 22:03:17 2004
  - iterate unassigned nodes, should iterate in order of increasing cardinality.


Fri Dec 31 22:21:42 2004
  - change names of P1, Co, and E1 to P', C', and E' (or something).
  - DONE: have one global place for their names and use everywhere (rather than
          multiple strings).


Sat Jan  1 23:07:03 2005
  - for DOCS
     for normalize(), makeUniform(), makeRandom() at the same time, if things are shared,
     there is no guarantee on behaviour (i.e., it depends on the internal order which
     the last operation will be). When in doubt, load with a table
     or write out individual parameters separately and then read back in jointly.


Fri Jan  7 21:32:19 2005
  - DONE: add verbose messages to parameter reading (to give status updates on reading
         on long parameters).  


Fri Jan 14 17:52:24 2005
  - system speedups
      - hash tables
      - diagGAussian:log_p
      - remove sArray operator [] when possible
      - logp log add operations
      - PackCliqueValue, pack and unpack
      - ceIterateAssignedNodesNoRecurse
      - 

Sun Jan 30 11:30:02 2005
  - viterbi output should allow:
     1) ascii.
     2) separate files per segment.
  - space before 'hidden' or 'observed' keyword in showvitvals.

Tue Feb  8 16:51:19 2005
 - possible addition to think about: in addition to unity and zero
   score Gaussian, include a min/max/median/mean-score Gaussian that gives
   the min-max score seen for the current frame??? Goal:
   simulate DTW's ability to skip speech frames for certain constraint templates, but
   not change dynamic range of best path.

Wed Feb  9 18:35:35 2005
  - when a trifile doesn't match the id, say why it doesn't match.

Wed Feb  9 18:54:52 2005
  - for -verb 55, include printouts at every partition boundary giving memory usage summaries
     (clique hash tables, separator hash tables, clique entries, separator entries, etc.).
 
Wed Feb  9 20:37:23 2005
  - for docs: include mention in DT section that DTs do not specify cardinalities of parents,
    only number of parents. This means that a DT can be used in a number of contexts (with
    different sets of RVs with different cardinalities), and that the evaluation of DT formulas will
    be different for the same parent values with different parent cardinalities.

Thu Feb 10 09:22:32 2005
  - fix zombie process issue:
   Yes, that is on my todo list. I do know that the zombies go away once
  the process has ended (basically, the problem is that there is an input
  pipe that stays open for the duration of the process, but the pipe has
  called cpp which has already terminated, but the pipe doesn't get closed
  until the end of the process, and it's a pain to close it earlier).

Thu Feb 17 13:52:17 2005
  - right now, log arithmetic is used to deal with the scaling problem. At some point
    we should include optional compile time choices to use online scaling, where
    probabilities are re-scaled at the output of each clique. This should be easy 
    to do, it will mean that no matter how long the utterance, things will be
    in range, and it might even speed things up since we no longer need to use
    log arithmetic (so no log-add to do, but on the other hand we need to iterate
    through each clique once again after it is constructed to do the scaling). Note
    that since we already have the max clique value computed, we could perhaps
    just scale by that value.

    In fact, logp.h should have a #define option to use linear arithmetic, and when
    this option is active, it activates the scaling code in GMTK_MaxClique.cc
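
    Sketch of the "scale by the max clique value" variant. rescaleClique is a hypothetical
    name, and storing clique scores as a plain vector of linear-domain doubles is an
    assumption; the real clique storage is packed and would need its own loop.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Rescale a clique's linear-domain scores so that the largest becomes 1.
// Returns log(max) so the caller can accumulate the total log scale factor
// and still report a correct log probability at the end of the utterance.
double rescaleClique(std::vector<double>& scores) {
  double mx = 0.0;
  for (double s : scores)
    if (s > mx) mx = s;
  assert(mx > 0.0);  // an all-zero clique cannot be rescaled
  for (double& s : scores) s /= mx;
  return std::log(mx);
}
```

    The caller would keep a running sum of the returned values, one per clique, and add
    it back to the log of the final (scaled) score.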


Fri Feb 18 20:14:44 2005
 - another speedup:
  - to reduce virtual functions in inner loops, change CPTs so that
    begin also returns a direct function pointer to the next() function,
    and we use that pointer to function rather than the virtual mechanism
    which goes through two virtual functions currently. This doesn't
    eliminate dynamic dispatch, but it reduces it by a factor of 2 on the next calls at least.
  - in addition, given a clique, the first thing we do is iterate through all the
    variables in the clique constructing an array of begin() and next() function pointers
    for each variable in the clique.
    Q: but how to do this when virtual functions in RVs modify values coming from CPTs?

Sat Feb 26 18:31:27 2005
  - docs, for -cpbeam, it is important for this to work not to have
    any positive log likelihoods
    (which can arise using either Gaussians or VECPTs which provide positive log likelihoods).
    For Gaussians, this can arise when the variance is small and an x value falls close to
    the mean. The solution for Gaussians is to multiply each value by a constant > 1 during
    training and testing (and mention how this is easy to do using observation distribution)
    to scale out (and therefore reduce the value of) the
    probabilities. Give example
    of distribution of P(Y) = P(F(x)) in terms of P(x).
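
    Possible worked example for the docs, assuming the scaling is Y = cX with one scalar
    c > 1 applied to every element of a d-dimensional observation (c and d are notation
    introduced here, not existing option names). By change of variables:

```latex
p_Y(y) = \frac{1}{c^{d}}\, p_X\!\left(\frac{y}{c}\right),
\qquad
\log p_Y(c\,x) = \log p_X(x) - d \log c .
```

    So every log likelihood drops by the same amount d log c, and c can be chosen
    large enough that the peak log density is no longer positive.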

Sat Mar  5 17:17:49 2005
  - split up large .cc files into smaller more logical ones.


Wed Mar 16 18:01:22 2005
  - add a viterbi training approach to EM training (i.e., backward pass does viterbi, and rather
    than sum, we just train gaussian using viterbi path).
     - also have N-best viterbi EM training, training on the N-best list, weighted by normalized
       score.
  - also, allow sample training (where we sample from the posterior and train on that rather
    than on all cases). N-sample training, sample N times. 
  - N-best training, train on N-best, do this after N-best list has been created.

Thu Mar 17 13:50:47 2005
  - add DT constraint leaf nodes: I.e.,

   -1 constraints 3 
         p0 p1 p2 p3
      0: 0  1  0  1 - 3;
      1: 0  0  1  0 - 4;
      3: 0  0  1  0 - 4;
      default: - 5;

Thu Mar 17 13:50:47 2005   
  - change file parser to have simple tokenizer for ints, floats, chars, and specific
    token ids (specific strings).
      
        
Sat Mar 26 18:44:34 2005
  - run AC-3 or AC-4 constraint satisfaction on VE separators prior to saving them to disk to remove
    incompatibilities.
  - try optimal separator order

Tue Apr  5 13:47:10 2005
  - for Gaussian splitting, it should be possible to also utilize the state posterior, since low 
    state probs. should have less splitting, high state probs. should have more splitting.
  - 

Mon Apr 18 02:46:32 2005
  - for speed:
       - have a proper var sorting routine, that doesn't do everything through the generic topological sort.
         I.e., when we start with cont. variable and go backwards, it might not leave much left to work with.

Wed Apr 20 00:46:19 2005
  - do conflict-directed backjumping not only over vars, but also over previous separators
  - create one unified separator and variable routines, iterating seps first, so that we can
     backjump over both vars and seps w/o C++ exceptions.
  - get A* like heuristic working for cpbeam
      - for cpbeam, figure out a way so that iterators iterate in descending order, so that
        once we prune, we can prune the rest of the vals of the current variable being iterated.
         - for sparse CPT's it is easy.

  - 
  - Add new DT leaf formulas:

   allEqual(p1,p2,p3,7) which returns 1 if (p1==p2 && p2 == p3 && p3 == 7)
   allDiff(p1,p2,p3,7) which returns 1 if none of p1, p2, p3, and 7 are the same.
                        (i.e., p1 != p2 &&  p1 != p3 && p1 != 7 && p2 != p3  etc etc.)
   allUnequal (same as allDiff)
   increasingOrder(p1,p2,p3)  true if p1,p2,p3 are ordered increasing
   nonIncreasingOrder(p1,p2,p3) true if p1,p2,p3 are not increasing (decreasing or equal)
   decreasingOrder(p1,p2,p3)  true if p1,p2,p3 are ordered decreasing
   nonDecreasingOrder(p1,p2,p3)  true if p1,p2,p3 are not decreasing (increasing or equal).
   parity(p1,p2,p3) = 1 if (p1+p2+p3) is odd, and zero otherwise.

 - other things possibly to add (for speed??)
    sum(p1,p2,...,pn) computes the sum of the expressions.
    prod(   ) computes the product of the expressions
    random(l,u) returns a random number between 0<= l and u inclusive.
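
    Sketch of three of the proposed leaf functions, taking the already-evaluated
    argument values as a vector. The vector-based signatures are an assumption for
    illustration; the real versions would live inside the DT formula evaluator.

```cpp
#include <cstddef>
#include <vector>

// True if all arguments have the same value.
bool allEqual(const std::vector<unsigned>& a) {
  for (std::size_t i = 1; i < a.size(); ++i)
    if (a[i] != a[0]) return false;
  return true;
}

// True if no two arguments have the same value (pairwise comparison).
bool allDiff(const std::vector<unsigned>& a) {
  for (std::size_t i = 0; i < a.size(); ++i)
    for (std::size_t j = i + 1; j < a.size(); ++j)
      if (a[i] == a[j]) return false;
  return true;
}

// 1 if the sum of the arguments is odd, 0 otherwise. XOR of the low bits
// gives the parity of the sum without risking overflow.
unsigned parity(const std::vector<unsigned>& a) {
  unsigned p = 0;
  for (unsigned v : a) p ^= (v & 1u);
  return p;
}
```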


Mon Apr 25 15:49:15 2005
  - for the constraint version of the algorithm,
     for conflict-directed backtracking, keep track of two conditions:
           1) if current variable failed itself, jump back to one of
              the parents/constraints for the current variable.
           2) if the current variable failed because of some later variable,
              then jump back to the deepest child variable's parents/constraints.
  - variable ordering for switching parents.


Mon Apr 25 21:55:36 2005
  - create a gmtkFilter program that does filtering. This will be easy by
    printing out sum normalized forward cliques during collect evidence, and
    then printing out. We might need to do one step of distribute evidence starting
    from the C depending on the triangulation of C.


Tue Apr 26 21:30:04 2005
  - create a new CPT type called a ConstraintCPT that
    asks for a set of variables (the parents) and an observed-to-be-1
    child, and imposes the constraint with a given weight in the CPT.
    The Constraint CPT is specified as a table of parent (unsigned integer) values and child score
    (float) for that parent value.
    Options:
       - scores can default to unity (and thus need not be specified).
       - scores can be in normal domain (so need to log them), meaning
             scores are always non-negative.
       - scores can be in natural log domain already, so can be positive or negative.

Mon May  2 16:45:39 2005
  - for RV and CPT iterators, have begin return if there is a next right away
    to avoid the extra call to next that will fail for deterministic functions.
    I.e., begin() returns true if the next call to next will not fail, but
     begin() returns false if the next call to next is guaranteed to fail, useful
     for deterministic CPTs.

Tue May  3 17:35:00 2005
  - during inference, within a clique, when expanding a clique, if there
    is a sub-tree that is identical and used/expanded multiple times for
    different values of previous variables (meaning the previous variables
    do not affect the expansion of the subtree), we should:
         1) figure this out offline (and be able to)
         2) expand the subtree once, and the first time we expand the subtree
            we store it in a packed table with the entries consisting of the
            RV values and the score,
         3) the next time we encounter the first variable of the table, rather than
            expanding out the tree again, we just iterate through the table, unpacking
            variables, multiplying in the score, and then jumping down to the
            variable just after the last variable in the table.
    This is similar to the "good" processing in value elimination.
    Q: how do we detect such an independent tree of values?


Tue May  3 20:50:38 2005
   - change RV and CPT interface to have a new routine, rv->nextAbove(cur_p,thres)
     which returns the next probability that is above thres or returns failure. This can
     be done rapidly by doing a quick scan with comparison tests rather than
     repeatedly calling virtual functions, returning cur_p, multiplying it in
     and then checking at that point. This would be in lieu of sorting
     the CPTs and then pruning the rest if it ever falls below a threshold.
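
     Sketch of the scan, assuming the CPT row for the current parent values is available
     as a dense array of linear-domain probabilities. The free-function form, the 'from'
     argument and the -1 failure value are assumptions; the proposed interface above is
     rv->nextAbove(cur_p,thres).

```cpp
#include <cstddef>
#include <vector>

// Return the index of the first entry at or after 'from' whose probability,
// multiplied by the running score cur_p, is above thres; -1 if there is none.
long nextAbove(const std::vector<double>& cptRow, std::size_t from,
               double cur_p, double thres) {
  for (std::size_t i = from; i < cptRow.size(); ++i)
    if (cur_p * cptRow[i] > thres) return static_cast<long>(i);
  return -1;
}
```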

Tue May  3 22:13:05 2005
   - see email in gmtk emails to sheila from today around this time. Problem
     with many disconnected observed parents making triangulation's job harder.


Tue May 10 14:00:20 2005
   - when compiling a VESEP, iterate variables in increasing cardinality (rather than arbitrarily),
       so that the longest loop is deeply nested.
     Also, when iterating parents, check if there is any deterministic relationship between
     the parents, and if so, iterate them out according to that relationship rather than
     based on just prod. of cardinalities.
   - optimize hash tables. In hash functions, during table build insertions, sometimes
     we know we'll only be inserting unique keys. Create a insertUnique() function that
     speeds up insertion by not redundantly checking for key equality (i.e., it assumes
     that the keys will always be different when a collision occurs).
   - (something else in the GH paper) == form JT as chain if it is possible to do so
      with current set of cliques.

Tue May 10 14:19:41 2005
  - have a generic variable order version of iterate that can take any of:
      - separator
      - VE separator
      - unassigned variable
      - assigned variable
    and do this in a dynamic order.
    Also, chooses the next thing to use if any based on previous values.
      VE separators should be chosen only optionally (sometimes we might
      skip them entirely, which potentially changes the assignment of later
      variables).
  - have a generalized notion of VEseparators, so that rather than having to,
    say, have a table of a,b representing a=b, we can just have the relation
    a=b, so that when we switch on a, it assigns b to a.


Tue May 10 16:42:18 2005
  - potential bug: when cpbeam, need sometimes to call cpt->end() to reclaim storage for FNGramCPT and similar things.
  - when there are a cluster of variables that have many MTCPTs associated with them within the cluster,
    it may be efficient to run over all sets of assignments and do it as a constraint rather than a set of MTCPTs.
  - connection between FSM minimization and SAT solvers.

Tue May 10 21:37:14 2005
  - create new option to JT creation to try to find and to use chain if it exists in the triangulation
    (so that it is more of a branchwidth).

Mon May 16 15:28:39 2005
  - another time/space tradeoff: In a clique, we don't need to store values for
    the variables that are a deterministic function of other variables that
    are in the clique. This will cut down on size potentially (if the det. variable
    is large cardinality) at the time expense of needing to set the det. values when
    projecting down to separators.
    (mention this also in Tri journal paper).

Mon May 16 16:06:08 2005
  - clique printing. In addition to posterior probabilities for cliques, have ability to print out also
    entropy of clique (good for when cliques have lots of entries: for binary variable cliques there are not
    too many, but with large cardinalities or for many variables, printing out posteriors on cliques might
    be too much). Also think about printing out MIs of subsets of variables for cliques.
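
    Sketch of the entropy computation from unnormalized clique scores. cliqueEntropy is a
    hypothetical name, and it assumes the scores have been converted to the linear domain;
    a log-domain version would use the same identity with log-adds.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Entropy (in nats) of the posterior obtained by normalizing non-negative
// clique scores s_i with Z = sum_i s_i:
//   H = log Z - (1/Z) * sum_i s_i log s_i
double cliqueEntropy(const std::vector<double>& scores) {
  double z = 0.0, sLogS = 0.0;
  for (double s : scores) {
    assert(s >= 0.0);
    z += s;
    if (s > 0.0) sLogS += s * std::log(s);  // 0 log 0 is taken as 0
  }
  assert(z > 0.0);
  return std::log(z) - sLogS / z;
}
```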

Tue May 17 18:22:30 2005
  - finish gmtkSample, so that you can sample not so much from p(x_{1:N}) but instead from
     p(X_{V\E}|X_E), i.e., sample from the posterior distribution of the JT given the evidence.
     This can be done by sampling from cliques in the junction tree and by renormalizing each
     clique as a function of the current values of the incoming separators. Create an array
     of clique entries that correspond to the entries that have the same values as the variables
     of the incoming separators. While doing so, sum up the value of the clique and form
     the cumulative array as well, then uniformly sample from 0 to the sum, and use binary search
     in the cumulative array to find the appropriate clique entry index, then instantiate that
     clique entry and proceed to the next clique.
      - compare: viterbi training vs. sample training vs. "typical" training.
      - add a "sample training" approach to EM, where rather than iterating over all values,
        we just sample from the distribution N times per observation vector.

   Tue Sep 2 22:20:58 2008: update: I'd guess that "posterior sampling
      training" (which is what it could be called) probably would
      converge in the expected value.
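
   Sketch of the per-clique sampling step described above (cumulative array plus binary
   search). sampleEntry is a hypothetical name; it assumes the scores of the clique entries
   compatible with the incoming separators are given in the linear domain, and that u is
   supplied by the caller as a uniform draw from [0,1).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pick an entry index with probability proportional to its score.
std::size_t sampleEntry(const std::vector<double>& scores, double u) {
  assert(!scores.empty() && u >= 0.0 && u < 1.0);
  std::vector<double> cum(scores.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < scores.size(); ++i) {
    sum += scores[i];
    cum[i] = sum;  // cumulative array
  }
  assert(sum > 0.0);
  const double target = u * sum;  // uniform draw between 0 and the sum
  // Binary search for the first entry whose cumulative value exceeds target.
  std::size_t idx = static_cast<std::size_t>(
      std::upper_bound(cum.begin(), cum.end(), target) - cum.begin());
  return std::min(idx, scores.size() - 1);  // guard against rounding at the top
}
```

   The chosen entry would then be instantiated and the same step repeated on the next clique.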

Thu May 19 17:34:17 2005
  - new faster log table implementation.
     Log table implementation of f(x) = log(1+exp(x)), instead of
     doing this as a linear table, do it using interpolation with
     different density sampling. Use high-order bits to index into bucket
     and in a given bucket low-order bits to find location in interpolation.
     (higher density sampling near y-axis). This should significantly reduce
     needed table size while maintaining (or perhaps increasing) accuracy, thus
     reducing demands on L1 cache. But we'd need to do interpolation.
     See papers:
          wan99: 
             Wan, Y., Khalil, M.A., Wey, C.L., logarithmic numbers and
             logic implementation, IEE Proc. Comput. Digit. Tech., Vol. 146, No.
             6, November 1999.
          lewis93:
             Lewis, D.M., "An accurate LNS arithmetic unit using interleaved
             memory function interpolator", Computer Arithmetic, 1993.
             Proceedings., 11th Symposium on, 29 June-2 July 1993, Pages: 2-9.
             Digital Object Identifier 10.1109/ARITH.1993.378115
              (get at ieeexplore.ieee.org)
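
     Sketch of a table-plus-interpolation log-add. For simplicity it uses uniform spacing
     with linear interpolation, not the non-uniform bucket scheme proposed above, and it
     tabulates log(1+exp(-d)) for d >= 0. LogAddTable and its parameters are hypothetical
     names, not the current logp.h implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

class LogAddTable {
 public:
  // Tabulate f(d) = log(1 + exp(-d)) at n+1 evenly spaced points on [0, maxD].
  LogAddTable(double maxD, std::size_t n)
      : step_(maxD / static_cast<double>(n)), maxD_(maxD), tab_(n + 1) {
    assert(maxD > 0.0 && n > 0);
    for (std::size_t i = 0; i <= n; ++i)
      tab_[i] = std::log1p(std::exp(-(static_cast<double>(i) * step_)));
  }

  // Approximate log(exp(a) + exp(b)) = max(a,b) + f(|a-b|).
  double logAdd(double a, double b) const {
    const double hi = a > b ? a : b;
    const double d = std::fabs(a - b);
    if (d >= maxD_) return hi;  // correction term is negligible out here
    const double pos = d / step_;
    const std::size_t i = static_cast<std::size_t>(pos);
    if (i + 1 >= tab_.size()) return hi + tab_.back();
    const double frac = pos - static_cast<double>(i);
    return hi + tab_[i] + frac * (tab_[i + 1] - tab_[i]);  // linear interpolation
  }

 private:
  double step_, maxD_;
  std::vector<double> tab_;
};
```

     With interpolation, a coarser table gives the same accuracy as a much larger plain
     lookup table, at the cost of one multiply-add per call.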


Thu May 19 18:04:37 2005
  - unityScore CPT viterbi: Is it possible for unity score CPTs when used to make
    it such that viterbi can produce different scores for different triangulations?
    No. But sum of final clique might change for different triangulations when
    in viterbi mode.

Fri May 20 00:22:02 2005
   - in hash table, use the insertUnique() idea when doing a hash resize.
     I.e., need a version of  entryOf() routine that does not check for
     equality, since for a hash set or hash map, when we resize, we're
     guaranteed that each insert will be unique, so not-empty guarantees
     inequality.
   - search for all uses of hash tables to use insertUnique().
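
     Sketch of the insertUnique() idea on a small linear-probing set. UIntSet is a
     hypothetical stand-in, not the existing GMTK hash classes; the multiplicative hash
     and the 1/2 load factor are arbitrary choices for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class UIntSet {
 public:
  explicit UIntSet(std::size_t capacityPow2 = 8)
      : keys_(capacityPow2), used_(capacityPow2, false), size_(0) {
    assert(capacityPow2 >= 2 && (capacityPow2 & (capacityPow2 - 1)) == 0);
  }

  // Normal insert: compares keys at every occupied slot it probes.
  bool insert(uint32_t k) {
    if (2 * (size_ + 1) > keys_.size()) grow();
    std::size_t i = slot(k);
    while (used_[i]) {
      if (keys_[i] == k) return false;  // already present
      i = (i + 1) & (keys_.size() - 1);
    }
    place(i, k);
    return true;
  }

  // Caller guarantees k is not in the set, so an occupied slot always
  // holds a different key and no equality test is needed.
  void insertUnique(uint32_t k) {
    if (2 * (size_ + 1) > keys_.size()) grow();
    placeUnique(k);
  }

  bool contains(uint32_t k) const {
    std::size_t i = slot(k);
    while (used_[i]) {
      if (keys_[i] == k) return true;
      i = (i + 1) & (keys_.size() - 1);
    }
    return false;
  }

  std::size_t size() const { return size_; }

 private:
  std::size_t slot(uint32_t k) const {
    return static_cast<std::size_t>(k * 2654435761u) & (keys_.size() - 1);
  }
  void place(std::size_t i, uint32_t k) {
    keys_[i] = k;
    used_[i] = true;
    ++size_;
  }
  void placeUnique(uint32_t k) {
    std::size_t i = slot(k);
    while (used_[i]) i = (i + 1) & (keys_.size() - 1);  // not-empty implies not-equal
    place(i, k);
  }
  // On resize every re-inserted key is known to be unique.
  void grow() {
    std::vector<uint32_t> oldKeys;
    std::vector<bool> oldUsed;
    oldKeys.swap(keys_);
    oldUsed.swap(used_);
    keys_.assign(oldKeys.size() * 2, 0);
    used_.assign(oldUsed.size() * 2, false);
    size_ = 0;
    for (std::size_t i = 0; i < oldKeys.size(); ++i)
      if (oldUsed[i]) placeUnique(oldKeys[i]);
  }

  std::vector<uint32_t> keys_;
  std::vector<bool> used_;
  std::size_t size_;
};
```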

Fri May 20 14:38:26 2005
  -  add a neural network (multi-layered MLP) CPT. NeuralCPT, MLP-CPT
  -  add an SVM binary and multi-way CDT.
  -  docs: email too email from today on 'pruning, triangulation, scores, and CPT -> clique assignment'

Mon May 23 20:18:34 2005
  - in addition to a .trifile, have a .jtfile that contains all the information
    about forming the JT, including JT, variable ordering information, CPT assignment
    information, and so on. Make it in ascii so users can edit it. Default
    should be to re-compute it on the fly. This is starting to get important
    as some .trifiles differ in speed (with pruning) only because
    of differences in CPT assignment. Keep a triangulation ID in
    the file.

Tue May 24 02:55:02 2005
  - change format of clique printing to be more like debug messages,
    i.e., Pr[var=val|var1=val,var2=val2,...] = p


Tue May 24 23:58:05 2005
  - another speedup. Currently, when pruning has occurred in previous partitions,
    the result of that pruning is captured by the separator between partitions.
    I.e., more pruning leads to lower state space separator. The separator scores
    can only be applied once (to one clique) in the next partition, but the separator
    itself can also be used as a constraint (set of non-zero variable combinations)
    used to iterate cliques in the next partition other than the left-interface
    clique in that partition. This could significantly reduce the cost of
    iterating cliques. In other words, this would be sort of a loopy
    graph, but the scores would only be applied once, but the inherent constraints
    specified by the JT during pruning would be applied multiple times.
    This might also be another form of approximate inference speedup under pruning.
       - modification: each separator is projected down to the sub-separator consisting
         of variables that intersect with each clique the separator will be used
         as a constraint for.

Wed May 25 02:26:37 2005
  - have the ability to sample not only from P(H|E) but also from
    P(A|E) where A \subset H. Also, ability to compute \argmax_A P(A|E)
    and sample from P(A|E) by projecting JT down to a sub-JT 
    containing A. As long as sub-JT is connected tree still satisfying RIP, we can just 
    project down individual cliques. The problem is that if we sample from P(H|E), we are
    not getting a sample from P(A|E) if we just use A in H.
    - call program: gmtkMLDecode (for maximum likelihood decode)??
    - note: this type of decoding in ASR is a research topic itself,
      namely if we do ML decoding, while more expensive, is the WER 
      significantly reduced?
    - might need to use a separator like data structure for the resulting projected
      down JT cliques.
    - need clique packing equality for subsets of arguments.


Fri Jun 10 01:57:36 2005
  - both cpts and separators have conditional cardinality, i.e., 
     separators conditioned on previous stuff have card. of remainder
      (as do cpts, such as sparse cpts), and random variables, depending
      on switching. In either case, this info should be used in a variable
     iteration order approach. This can't be done for all separators (i.e.,
     a separator can never go earlier than its AI variables have appeared, but it can
     always go later than its AI variables have appeared). Goal is to decide a-priori
     on the AI and REM variables of a separator to give the optimal tradeoff between
     maximizing the number of AI variables and the greatest flexibility in variable
     ordering.

Wed Jun 22 12:28:34 2005
  - in addition to the latest -cmbeam, also have a -cbeam that rather than using
    the beam relative to the max probability, uses it relative to the i'th from the
    max (since the max might be artificially high).
  - alternatively/in addition, can combine -cmbeam and -cbeam, so that we take that percentage
    of the top mass, and then do -cbeam style pruning starting from the one below the
    top mass.
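
    Sketch of the -cbeam idea above (beam measured from the i'th best entry rather
    than from the max). This is only an illustration under assumptions: scores are
    taken to be log probabilities, and the name pruneRelativeToKth and its
    parameters are hypothetical, not existing GMTK code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not GMTK code): keep the entries whose log score is
// within `logBeam` of the k-th best score. k = 0 uses the max itself, which
// is ordinary beam pruning. Entries above the reference score always survive.
std::vector<double> pruneRelativeToKth(std::vector<double> logScores,
                                       std::size_t k, double logBeam) {
  assert(logBeam >= 0.0);
  if (logScores.empty()) return logScores;
  k = std::min(k, logScores.size() - 1);

  // Partial sort on a copy so that position k holds the k-th largest score.
  std::vector<double> tmp = logScores;
  std::nth_element(tmp.begin(), tmp.begin() + k, tmp.end(),
                   std::greater<double>());
  const double threshold = tmp[k] - logBeam;

  // Drop everything strictly below the threshold, keeping original order.
  logScores.erase(
      std::remove_if(logScores.begin(), logScores.end(),
                     [threshold](double s) { return s < threshold; }),
      logScores.end());
  return logScores;
}
```

    With k > 0, a single artificially high max no longer determines the threshold,
    which is the motivation given above.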
  
Wed Jun 22 14:38:17 2005
  - gmtkTime: remove error (change to warning) about not running lin-mem case (line 373 or so).

Tue Jul 5 23:33:15 2005
  - add a DirectionalConstraintCPT, where we have n parents, and the n'th parent
    is a det. function of the first (n-1) parents, and the child is satisfied only
    if the nth parent is that det function of the (n-1) parents. This is an easy
    way of imposing constraints that can be expressed in this way:
       extension:
            child is satisfied only if
                  set of DTs, each relating some subset of parents and one other parent
                               not in that subset. Child satisfied only if all such
                               directional constraints are satisfied. 
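
    A minimal sketch of how such a constraint check could look, including the
    extension to a set of constraints. The struct and function names are
    hypothetical; std::function stands in for the decision trees (DTs).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch (not GMTK code): one directional constraint says that
// parent `target` must equal a deterministic function of the parents listed
// in `inputs`.
struct DirectionalConstraint {
  std::vector<std::size_t> inputs;                 // indices of source parents
  std::size_t target;                              // index of constrained parent
  std::function<int(const std::vector<int>&)> f;   // stand-in for a DT
};

// Returns 1.0 when every constraint holds for the given parent values and
// 0.0 otherwise, i.e., the probability that the child is "satisfied".
double constraintProb(const std::vector<DirectionalConstraint>& constraints,
                      const std::vector<int>& parentVals) {
  for (const DirectionalConstraint& c : constraints) {
    assert(c.target < parentVals.size());
    std::vector<int> args;
    for (std::size_t i : c.inputs) {
      assert(i < parentVals.size());
      args.push_back(parentVals[i]);
    }
    if (c.f(args) != parentVals[c.target]) return 0.0;
  }
  return 1.0;
}
```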


Fri Jul  8 01:16:04 2005
  - for new inference, when backjumping, the CPT iters need to be ended since
    otherwise, those CPTs that allocate memory will not free it, and when going
    forward again, will look like memory leak.


Mon Jul 11 22:47:31 2005
 - variable order. Might be able to benefit not only switching parents, but also
   DTs (since inside DTs, we have value specific independence), and also Backoff CPTs.

 - not a GMTK idea, but rather a LVCSR idea with switching weights: give a smaller acoustic weight
   at word endings since pronunciation is more variable at the end of words. This can be done by
   having a word-position dependent variable that is a switching parent for a switching weight on
   an acoustic variable.

Wed Jul 13 10:31:41 2005
  - memory savings: many of the variables in a clique are det. functions of other variables.
    Have an option where det. variables whose parents are either non-det variables
    or other det. variables that have already been computed are not stored in the clique.
    Then on an unpack, have a phase of assigning the RV values of these det variables. This should:
          1) cut down on memory (perhaps significantly in some cases) by reducing the width
             of the stored clique table
          2) reduce hashing costs (smaller keys, and keys that are more defining of their entry).
   This probably won't work for separators since we need to hash into separators, but it is less
   important there since separators will be smaller.          
    TODO: some large separators (non-minimal separators for example), this might be beneficial.

 - two modes of inference:
       1) static variable order
       2) dynamic variable order
   for cliques without any "switching", could use static variable order??


Wed Jul 13 13:32:06 2005
  - new inference should do clause/good caching and conflict/no-good learning across time
    for common cliques for things that are not related to the observations
    this will work in a DBN, DGM, PRMs, and other structured networks since
    many of the CPTs and variables are almost identical.
  - DTs should also be able to take advantage of dynamic variable order, DT needs
    to export stuff to inference code.
  - New CPTS:
    - MLP-CPT
    - Logistic regression CPT
    - relational constraint CPT - same as directional constraint CPT.

Sun Jul 17 19:10:28 2005
  - added today, but from before:
     add a boundary heuristic that adds edges corresponding to ancestral pairs
     *before* running a size or weight based boundary search. After finding
     a boundary, then remove edges so that triangulation can proceed as normal.
     This way, the resulting partitions will more likely be such that
     it will be possible for triangulation to find a case where there are no
     det. variables in clique w/o parents.
  - In clique entries, have a mode where the clique entries do not include the variables
    that are deterministic functions of other variables in the clique. This will reduce
    the width of the clique and if it spans a 32-bit boundary, could significantly reduce
    memory requirements, especially when there are many large-cardinality deterministic RVs in a clique with
    their parents. When iterating through a clique, we'll need to re-expand the
    RVs associated with the clique entry with new RV routines, rv->setDetValueFromParents()
    which is nil for non-det variables, and sets the value from the current parent values
    for deterministic variables. 
    Note, only should do this if the packed clique shrinks enough to cross a 32-bit boundary,
    as otherwise there will be no memory savings, and only a speed hit.
    - do this for big separators also. 
    - compare memory savings & speed increase with the use of island algorithm.

Sun Jul 17 19:33:11 2005
  - for new inference, combine all together:
     1) conflict/clause/nogood learning but over time (stuff learned for one clique saved for next)
     2) domain pruning
     3) component caching over time (stuff cached for one clique saved for next)
     4) A* pruning within clique, using polynomial estimate of max probability and
        prod. of most likely RV values as admissible heuristic.


Thu Aug 11 15:46:24 2005
  - When combining pruning with sampling from what was pruned away, have
    a mode that the sampling not only uses the probabilities but biases things
    to try to increase "diversity" of the samples (i.e., if we sample from exactly
    the same high & mid-level but only a slight difference in the low level, that is not
    likely to yield a significant increase in the ultimate effect of the sampled states). Instead,
    reduce the probability of similar samples somehow.

Thu Aug 25 14:04:39 2005
  - undirected models in GMTK, part of a clique, and trained using the perceptron and ballceptron
    algorithm (should be easy, just iterate through the clique a few times, with posterior
    given to perceptron/ballceptron algorithm)
  - When unpacking clique within EM, only unpack values that are being used for
    parameter updating (i.e., this might be useful when det. values are not stored within clique).


Fri Aug 26 00:35:20 2005
  // TODO: have some form of RV::currentlyDeterministic() function, which based on
  // the current switching parents values, checks if the current CPT is deterministic.
     If so (based on switching values), we can set all det variables. 

Sun Aug 28 01:23:06 2005
  - memory management: if it is the case that same clique over time almost never has the
    same values, then the clique value structure, and the hash table to find old clique values
    is just wasting memory. I.e.,:
      with -probE, then both hash table and shared value storage is a complete waste, since
            a) since values aren't same, hash table is always a miss
            b) since values are kept around in shared space even with -probE, early values are
               never discarded.
      without -probE:  
            a) then hash table is a waste since we never find shared values.
      with island:
            a) values are kept around (not deleted) even though they should be.
   Thus, have a mode that doesn't use shared values and hash tables over different instances
   of the same clique.


Sun Aug 28 13:31:02 2005
  - another memory savings idea, clear the clique/separator value holder and cache every so often (N frames). 
    This will be good esp. if values are shared with close frames but not far away frames.
  - place the clique value storage at end, and then dynamically allocate it depending on if the value is contained
    in the clique, or if it is a pointer to a shared hash value.
  - smart mode, where we keep track of the number of clique values that are being shared, and if the
    percentage is high, we continue sharing, but if it is low, we start freeing old memory.


Fri Sep 16 15:25:52 2005
  - smarter debug/trace information, that can be turned on only on particular
    frames (i.e., for frame <rng> turn the debugging/trace from n to m,
    and then back to n afterwards). 
    This will help with log files getting too large before zeros occur.

Thu Oct 27 21:55:54 2005
   For docs:
    Not a valid decision tree:

    >3 % number of parents
    >0 2 {p1} default % query parent 0, 2 splits, 
    >     -1 {p2}
    >     -1 0

    I.e., one of the ranges cannot be the value of another parent.
    This won't work (ranges must be const), but you can get something just
    as easy using:

    3 % number of parents
    -1 { (p0 == p1) ? p2 : 0 }


Thu Oct 27 21:56:54 2005
  - fix observation file error messages so that it tells which feature file is in trouble.
    Improve all of these messages.

Wed Dec 14 00:30:37 2005

  - for DOCS: mention that we can train with a lattice (i.e., rather than a forced alignment
       with a word or phone sequence, that we can force train according to a lattice which is
      a generalization of that).

Fri Dec 23 21:21:16 2005
  - for DOCS, give an example graph for how to do arbitrary duration modeling
        - base case with geometrics
        - negative binomial with sums of geometrics
        - mixtures of negative binomials
        - as a fixed duration distribution "histogram" and a reverse
          counter started off at the sample of the histogram.

Sun Jan  8 16:25:04 2006
  - for Gang's Lattice cpt implementation: 
   - GMTK needs file of utterance lengths (to determine how much to unroll)
     when there are no observations. Remove need in this case to insist that
     there are observations. File of lengths must have the same number of lengths
     as there are utterance files (in this case, the number of iterable LatticeCPTs).

Sun Jan 15 19:01:47 2006
   - fix bug that still remains in sheila_bug2.


Tue Jan 17 16:03:34 2006
   - variable trimming (see SUPERLINK paper), where any hidden
     variables that have no children or are not connected are removed
     from the graph, and do this recursively (but only trim variables
     if they are not needed for either training or producing 
     most likely value thereof).

Wed Jan 18 00:19:03 2006
  - another idea for inference: wave propagation, back and forth in
    the chain-tree, where we start with aggressive pruning, and if we
    get to zeros, we keep increasing the beam (but don't re-expand the
    states that are already expanded). Once we get to the right end,
    we start going backwards, pruning again along the way. We then
    go to the right, expanding some unexpanded states that look promising,
    and keep going left, right, etc. until things settle down.

  - add a HashSparseCPT, where the CPT is just a list of parents,child values
    in a big hash-table. We can start with an empty such table, and during
    EM training, add entries corresponding to cliques that don't get pruned, 
    or alternatively we can read in a table of the form:
 
       p1v p2v ... pNv cv prob
       ...

    where p1v is parent 1 value, pNv is parent N value, cv is child value, and
    prob is probability. This can be stored in a big hash table, where
    prob. queries (with parent/child values) are computed by hashing into
    the table. We should be able to read & write in this format as well. If
    we read in such a table, we can either have it that only those entries
    will survive EM training, or we can have the non-existing entries correspond
    to an epsilon (implicit laplace smoothing), and then prune the table again
    on writing back out to disk.

    Sun Mar 12 17:50:40 2006: Have a flag for the hash-sparse CPT such that
     if there is only one cv for a given set of parent values, then the CPT
     is completely deterministic, so we can take advantage of this during
     inference (i.e., this type of CPT might dynamically be deterministic or not).
     Also, do the same for SparseCPTs (i.e., if they happen to be deterministic,
     then flag it as such). There should be an abstract CPT flag for "deterministic"
     almost deterministic, etc.
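
     Rough sketch of the HashSparseCPT lookup described above, assuming integer
     parent/child values. The class name, method names, and the simple vector
     hash are all hypothetical choices for illustration, not GMTK code. Missing
     entries return epsilon, and a per-parent-set count supports the
     "dynamically deterministic" flag.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical hash for a vector of RV values (simple polynomial hash).
struct ValueVectorHash {
  std::size_t operator()(const std::vector<int>& v) const {
    std::size_t h = 0;
    for (int x : v) h = h * 1000003u + std::hash<int>()(x);
    return h;
  }
};

// Hypothetical sketch (not GMTK code) of a hash-based sparse CPT.
class HashSparseCPT {
 public:
  explicit HashSparseCPT(double epsilon = 0.0) : epsilon_(epsilon) {}

  // Store one "p1v p2v ... pNv cv prob" line.
  void set(const std::vector<int>& parentVals, int childVal, double prob) {
    assert(prob >= 0.0 && prob <= 1.0);
    std::vector<int> key = parentVals;
    key.push_back(childVal);
    if (table_.find(key) == table_.end()) ++childCount_[parentVals];
    table_[key] = prob;
  }

  // Entries that are not in the table fall back to epsilon.
  double prob(const std::vector<int>& parentVals, int childVal) const {
    std::vector<int> key = parentVals;
    key.push_back(childVal);
    auto it = table_.find(key);
    return it == table_.end() ? epsilon_ : it->second;
  }

  // True when exactly one child value is stored for these parent values.
  bool deterministicFor(const std::vector<int>& parentVals) const {
    auto it = childCount_.find(parentVals);
    return it != childCount_.end() && it->second == 1;
  }

 private:
  double epsilon_;
  std::unordered_map<std::vector<int>, double, ValueVectorHash> table_;
  std::unordered_map<std::vector<int>, int, ValueVectorHash> childCount_;
};
```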
   
  - Domain pruning array: Keep an array of valid RV values, and when
    we want to prune an entry away, we swap it with the ending value, and 
    decrease N, thereby providing an O(1) prune operation (while keeping the
    table size the same). If we keep track of starting and ending size
    during a phase, it should be relatively easy to swap back values into
    place during a prune pop operation.
        - Rather than keeping one value per word, we can have an array of
          packed words, where we use enough words (say K) to hold more than 
          K values. Example, say we want to store 17-bit values, we could
          store 32 17 bit values in 17 packed words rather than 32 words
          which can be a big savings. Abstract class can still ask for
          word number i, provide swapping routine, etc. To avoid integer
          mod, can do power of two sized groups of words (pages??).
           Wed Nov 08 00:16:14 2006 - on the other hand, this might not be needed
              since the domains will only be one per variable (and the domain arrays
              can probably be re-used over time).
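
    Sketch of the swap-to-end domain array (one value per word, no packing).
    The class and method names are hypothetical. Note one simplification: the
    restore below brings back the pruned values but not their original
    positions, which is enough if only domain membership matters.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch (not GMTK code): the first `live_` slots hold the
// values still in the domain; pruned values sit in the tail of the array.
class PrunableDomain {
 public:
  explicit PrunableDomain(std::size_t card) : vals_(card), live_(card) {
    for (std::size_t i = 0; i < card; ++i) vals_[i] = static_cast<int>(i);
  }

  std::size_t size() const { return live_; }

  int at(std::size_t i) const {
    assert(i < live_);
    return vals_[i];
  }

  // O(1) prune: swap slot i with the last live slot, then shrink N.
  void prune(std::size_t i) {
    assert(i < live_);
    std::swap(vals_[i], vals_[live_ - 1]);
    --live_;
  }

  // Record the live size at the start of a phase.
  std::size_t mark() const { return live_; }

  // "Prune pop": pruned values are still stored in the tail, so growing the
  // live size back to the mark restores them.
  void restore(std::size_t markedSize) {
    assert(markedSize >= live_ && markedSize <= vals_.size());
    live_ = markedSize;
  }

 private:
  std::vector<int> vals_;
  std::size_t live_;
};
```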


Mon Jan 23 17:50:36 2006
   Karim's idea: for certain cases with value-elimination style inference, it would be
   possible to reduce cache size (and cache creation) when some or all of the CPTs are shared
   for certain switching parent values. I.e., even if a set of RVs is different, if they
   share the same CPTs, then they should be able also to share some of the same cache entries.

Thu Jan 26 10:34:12 2006
   Two new pruning ideas:
     - diversity based pruning: where we make sure that enough diversity exists in the
       clique after it is pruned, e.g., say that we want to make sure that each     
       "category" in the list of items to be diverse is well represented enough,
       even if it is low probability. For example, say that we want to keep
       the top k entries, we keep the top k (or beam) in each category, where each category
       might correspond to a particular word, or n-gram. (if we do beam like pruning,
       we'd have a different max score and beam threshold in each category).

     - cluster or partition pruning: cluster the clique entries into categories, and then prune based
       on these unsupervised derived categories. I.e., here the clustering algorithm
       would be based on some measure or cost. Use a distance function on clique entries
       (could use something as simple as hamming, or something more categorical entry based).
     - Thu Mar 23 21:24:53 2006: could also use hamming distance at the RV level, i.e.,
        the number of RV values that are different from each other, or weighted
        RV difference, where the difference in a RV might cost more or less depending
        on how important the RV is. for entries m and n, diff = \sum_i w_i diff(x(m,i),x(n,i))
        where x(m,i) is the ith random variable value for clique entry m.
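
        Sketch of the weighted RV-level difference above, taking diff(a,b) to be
        1 when the two values differ and 0 otherwise. The function name is
        hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not GMTK code):
//   diff = sum_i w_i * [x(m,i) != x(n,i)]
// where entryM and entryN hold the RV values of two clique entries.
double weightedHamming(const std::vector<int>& entryM,
                       const std::vector<int>& entryN,
                       const std::vector<double>& w) {
  assert(entryM.size() == entryN.size());
  assert(entryM.size() == w.size());
  double diff = 0.0;
  for (std::size_t i = 0; i < entryM.size(); ++i) {
    if (entryM[i] != entryN[i]) diff += w[i];
  }
  return diff;
}
```

        Setting every w_i to 1 gives the plain hamming distance.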

   Memory savings idea: 
     - do some form of LZ-compression of cliques and their entries (so bit encode not only
       one entry, but an entire list of entries).
 
Sun Jan 29 16:55:08 2006
  - LatticeCPTs currently use hash table for each node in the lattice, but we could
    significantly reduce memory by using a globally shared array (even if hash tables live inside
    the array). Ideally, use packed hash tables in a global array (or alternatively, sorted arrays with
    binary search) to reduce memory usage for lattices.

Tue Jan 31 00:08:25 2006
  - regarding value elimination, rather than just having a dset to cache, allow
    the dset items to be not only a list of values but a list of ranges (or sets) of values,
    to produce more compact versions of things to cache. Thus, the cached table can be shared
    more easily. This should reduce cache size considerably in some cases.
  - alternatively, have multiple dset value lists point to the same cached value/table (i.e., 
    merge the cached value/table when we see that both dsets are valid for this case).

Tue Jan 31 01:35:39 2006
  - when possible, implement "variable trimming" as done in Superlink, where hidden variables
    that are childless in the BN can be trimmed away without affecting likelihood (assuming
    that we're just summing over hidden variables, and/or not computing viterbi paths).
    This can be applied recursively (i.e., if we have A --> B --> C, and we trim C, we'd get
    A --> B, and can trim B, etc.)
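
    Sketch of the recursive trimming, under the assumption that the BN is given
    as a child list per variable and that a `needed` flag marks variables that
    must stay (observed, or required for training/decoding). Names are
    hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not GMTK code). children[v] lists the children of
// variable v. Returns a mask that is true for the variables that remain
// after repeatedly removing un-needed variables with no remaining children.
std::vector<bool> trimChildlessHidden(
    const std::vector<std::vector<int>>& children,
    const std::vector<bool>& needed) {
  const std::size_t n = children.size();
  assert(needed.size() == n);
  std::vector<bool> kept(n, true);
  bool changed = true;
  while (changed) {  // repeat until no more variables can be trimmed
    changed = false;
    for (std::size_t v = 0; v < n; ++v) {
      if (!kept[v] || needed[v]) continue;
      bool hasKeptChild = false;
      for (int c : children[v]) {
        assert(c >= 0 && static_cast<std::size_t>(c) < n);
        if (kept[c]) {
          hasKeptChild = true;
          break;
        }
      }
      if (!hasKeptChild) {
        kept[v] = false;
        changed = true;
      }
    }
  }
  return kept;
}
```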

Tue Jan 31 02:47:47 2006
  - When choosing next variable to sum over (condition on) inside of clique,
    we use (as mentioned above) fail first heuristic (i.e., minimum remaining
    value, MRV, or "most constrained value" heuristic). We break ties
    with the degree heuristic, choosing next variable that is connected to
    the most things. SUPERLINK paper has probably an improvement on
    degree heuristic where they choose node v based on sqrt(N(v))*C(v)
    where C(v) is either just the product card of the closure-neighbors
    of v, or where C(v) is the product card of the closure-neighbors of 
    v in the original graph (i.e., just involving factors). 
    (there are a few other small ideas in SUPERLINK paper that would be easy to incorporate).


Mon Feb  6 16:27:55 2006
  - for DOCS: See email from today in gmtk list (Mon, 06 Feb 2006 14:28:33 -0800) about
    difference between Mealy and Moore FSAs and their representation with a graphical
    model, and/or what to do about null transitions in Mealy machines. Relate to LatticeCPTs.


Sat Feb 11 13:33:17 2006
  - fix bug in reading MTCPTs mentioned from bartels email on 2/4/2006.
  - idea: rather than construct the table of clique values in one pass, consider first building
    the tree of the clique (so RV values are not duplicated in the tree), and then traversing it again to build the clique
    table (but perhaps this'll be slower since we don't get sub-clique entry copy). Also, this might
    be harder to do with dynamic variable orderings. 

Thu Feb 16 09:58:42 2006
   - Add ability to do sub-Clique printing during collect evidence stage
     (for all cliques, even if they are sub hanging dummy subset cliques)
                

Sun Feb 19 19:48:12 2006
  - for docs and for thinking: when modifying the trifile to do clique printing,
    the modification might make things much slower. For example, suppose
    there is a big cardinality variable F that is deterministic given its parents.
    Normally, F always lives in a clique with its parents, but if we create an
    extra clique with just F and during the collect evidence phase, we might
    do unassigned iteration over F. 
    Solution: need a way to defer iterating over F based on its
       parent clique in the JT
  - This also relates to entry on Feb 16, 2006 where we want to print
    CE-based clique entries. If we project from parent clique down to the
    child clique, this would solve both problems.
  - for the docs (at least for now): when adding extra print cliques for
    printing, do them in a partition (or hang them off a clique) where
    they are subsets of the incoming separator from the JT.

Wed Feb 22 13:39:36 2006
  - have different type of "known" constraints that inference can deal with:
       1) undirectional constraints, such as equals(a,b,c), notequal(a,b,c), etc.
       2) directional constraints, such as a=f(b,c) where f() is a deterministic function.
          These are constraints, and impose a directionality order on order of evaluation.
          Also, these constraints can be implemented with integer formulas, so they can
          be specified by user (but still can be fast). 
    Both deterministic CPTs and deterministic directional constraints put order constraints
    on variable order of evaluation.
    Question: but what if "child" variable of MTCPT or DirConstraint comes in via a separator?


Wed Feb 22 23:53:46 2006
  - memory savings: have the pack-clique code also include a mode that rather than
    using \sum_i ceil(log(card(var_i))) bits, instead use: ceil(log(\prod_i card(var_i)))
    bits using int mults, divides, and mods. This will be of course slower than
    the bit-packing, but in certain cases, it will probably be faster to use this
    than to move to island algorithm. Make this a command line option and default
    is still to use bit packing.
      - Wed Nov 08 00:23:29 2006: this probably won't result in any real memory savings since
        case A is not much worse than case B, and due to the rounding to machine words for storage. 
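
    Sketch comparing the two widths and showing the mult/divide/mod packing.
    Assumptions: logs are base 2, cardinalities are at least 1, and the product
    fits in 64 bits. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ceil(log2(n)) for n >= 1, i.e., bits needed for values 0..n-1.
unsigned bitsFor(std::uint64_t n) {
  assert(n >= 1);
  unsigned b = 0;
  while (b < 64 && (std::uint64_t(1) << b) < n) ++b;
  return b;
}

// Bit packing width: sum_i ceil(log2(card_i)).
unsigned bitPackedWidth(const std::vector<std::uint64_t>& cards) {
  unsigned total = 0;
  for (std::uint64_t c : cards) total += bitsFor(c);
  return total;
}

// Product packing width: ceil(log2(prod_i card_i)).
unsigned productPackedWidth(const std::vector<std::uint64_t>& cards) {
  std::uint64_t prod = 1;
  for (std::uint64_t c : cards) {
    assert(c >= 1 && prod <= UINT64_MAX / c);  // no 64-bit overflow
    prod *= c;
  }
  return bitsFor(prod);
}

// Mixed-radix encode using integer multiplies.
std::uint64_t packProduct(const std::vector<std::uint64_t>& vals,
                          const std::vector<std::uint64_t>& cards) {
  assert(vals.size() == cards.size());
  std::uint64_t code = 0;
  for (std::size_t i = 0; i < vals.size(); ++i) {
    assert(vals[i] < cards[i]);
    code = code * cards[i] + vals[i];
  }
  return code;
}

// Mixed-radix decode using integer divides and mods.
std::vector<std::uint64_t> unpackProduct(
    std::uint64_t code, const std::vector<std::uint64_t>& cards) {
  std::vector<std::uint64_t> vals(cards.size());
  for (std::size_t i = cards.size(); i-- > 0;) {
    vals[i] = code % cards[i];
    code /= cards[i];
  }
  return vals;
}
```

    For cardinalities 3, 5, 6 this gives 8 bits with bit packing and 7 bits with
    product packing, consistent with the later note that the gain is small once
    storage is rounded to machine words.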
 
  - inside-outside inference (todo when inference engine v3 is ready).
     New modes of inference:
       - CE = left-to-right followed by DE right-to-left (what currently exists)
       - CE = right-to-left followed by DE left-to-right (good for many arrows pointing left)
       - inside-outside: CE = outside-to-inside followed by DE = inside-to-outside (like parsing)
         - this third way might have two advantages:
            a) if there are lots of zeros or pruning at the beginning and end, then we'll take advantage of them.
            b) this is easy to make to use two threads in parallel (good for dual-core microprocessors giving
               a one time factor of 2 speedup relatively easily).

Tue Mar  7 00:36:07 2006
  - in island algorithm, have an option that the portion whose memory is freed (and so normally
    needs to be re-generated) is instead saved to disk, and then when next needed is just loaded
    in from a disk file. This might be faster than having to always re-generate it.

  - Modification to decision trees. Add the ability to produce more compressed Decision trees
    by allowing to have decision DAGs, where nodes can merge back together. I.e., perhaps
    have a goto statement (or some form of link connect) so that common sub-trees do not
    need to be duplicated over and over (which is the case with trees). Several ways
    to do this:
      - function calls, or DT calls, where a leaf of a DT can "call" another DT.
      - gotos, where a leaf of a DT can call another non-terminal of a DT.
         Syntax:
           Rather than:
                 P=parNo N=numSplits s1 s2 ... s(N-1) default
           we could add an optional label to this:
                 foo: P=parNo N=numSplits s1 s2 ... s(N-1) default
           and then a leaf node could say:
                 -1 goto foo;
         when reading things in, keep a hash-table or set of labeled DT non-terminals,
          and an array of pointers to goto, and do a 2nd pass updating all goto tags (so shouldn't
          take too long to do).
     Note: this could cause infinite loops, need to detect directed cycles at read-in time
             (have a command line option that optionally doesn't check for cycles). If cycles,
              and no check, we'll just get a run-time infinite loop. (for docs).
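
     Sketch of the read-in cycle check, assuming the DT nodes have been numbered
     and that edges[u] lists every node reachable in one step from node u (both
     ordinary tree children and goto targets). The representation and names are
     hypothetical; the check is a standard depth-first search with three colors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace {
// color: 0 = unvisited, 1 = on the current DFS path, 2 = fully explored.
bool dfsFindsCycle(std::size_t u, const std::vector<std::vector<int>>& edges,
                   std::vector<int>& color) {
  color[u] = 1;
  for (int t : edges[u]) {
    assert(t >= 0 && static_cast<std::size_t>(t) < edges.size());
    const std::size_t v = static_cast<std::size_t>(t);
    if (color[v] == 1) return true;  // edge back into the current path
    if (color[v] == 0 && dfsFindsCycle(v, edges, color)) return true;
  }
  color[u] = 2;
  return false;
}
}  // namespace

// Hypothetical helper (not GMTK code): true if following children and gotos
// can loop forever.
bool hasGotoCycle(const std::vector<std::vector<int>>& edges) {
  std::vector<int> color(edges.size(), 0);
  for (std::size_t u = 0; u < edges.size(); ++u) {
    if (color[u] == 0 && dfsFindsCycle(u, edges, color)) return true;
  }
  return false;
}
```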

    Take a look at binary decision diagrams (BDDs) and manipulations thereof, as in the paper:
     "Graph-Based Algorithms for Boolean Function Manipulation", by Randal E. Bryant.

 
Wed Mar  8 12:27:44 2006
  - pruning option: have pruning numbers come from observation files, and have pruning syntax
    so that pruning numbers can change over sequence. (i.e., break obs range into percentages,
    with diff. pruning option in each range).


Wed Mar  8 18:30:25 2006
 - when doing backjumping, if we backjump to a variable that is
   a deterministic function of other variables, we directly backjump instead
   to the latest non-deterministic variable.
    Wed Nov 08 00:27:46 2006: the backjump level should backjump directly to the first
     non-deterministic variable rather than do this in multiple stages, this should be computable "offline" (depending
     on if we use static or dynamic variable selection order).
 
Thu Mar  9 02:02:29 2006
  - for clique packing, have a mode that packs at byte boundaries
    rather than word boundaries (perhaps have a compile-time switch
    that changes the code to pack at different boundaries, perhaps
    using 1, 2, 4, 8 bytes, to work at byte boundaries, shorts,
    32-bit words, or 64-bit word machines).

Fri Mar 10 00:33:50 2006
  - give gibbs sampling style inference a whirl for learning and viterbi.
  - (something else, can't remember)

Sun Mar 12 18:23:32 2006
  - in cases where variables and their parents live in a clique but do not give the clique probability (but
    when they could if they were so assigned), we still use them to trim off zero probability
    entries in the clique. We should also use them for cpbeam style pruning as well (even though
    if the entry survives, it won't be the case that the probability of those variables contributes
    to any probability).

Sun Mar 12 20:04:44 2006
 - for docs: see email to 'benson limketkai' on
        Date: Sat, 11 Mar 2006 16:21:10 -0800
     for explanation of normalization in DGMs.


Tue Mar 21 01:15:19 2006
 - Get regularized adaptation for Gaussians and other parameters working (i.e.,
   for Gaussians, use L2 norm distance from unadapted model). For discrete parameters,
   use Dirichlet prior to unadapted model somehow.
   Note this will be in coordination with MLLR style adaptation (i.e., there'll be two
   forms of adaptation).

Tue Mar 21 03:53:13 2006
  - for docs: see email with subject 'Bug?' from today regarding what kind of means
    are being stored when using Gaussians with dlinks (i.e., make sure people
    realize these are of the form E[X - prediction_of_(X)], so the means are residual).


Wed Mar 22 17:52:43 2006
   - replace set intersection things with counters (see new code in GMTK_RV.cc).
   - set_intersection(factorClique.nodes.begin(),factorClique.nodes.end(),
		       P1.nodes.begin(),P1.nodes.end(),
		       inserter(res,res.end()));

Thu Mar 23 18:54:47 2006
  - with factors, it will be valid to have discrete RVs without any associated
    CPTs. Allow the use of NullCPT, i.e.,
       conditionalparents: nil using NullCPT; (or UnityCPT).
    which is a CPT that effectively always gives probability unity to everything.

 
Fri Mar 24 00:32:02 2006
  - for docs: make sure to indicate that to get cpp output, we need to give
    default -I options that GMTK gives (i.e., cpp -I. ,etc. in order to
    see what cpp does).

Mon Mar 27 16:23:14 2006
  - three different tags types on dcache collections
      - always applicable
         - so once learned, should always be retained except when to reclaim memory.
      - applicable only in this utterance across time frames
         - so once learned, should always be retained within the same observation sequence, 
           except when to reclaim memory. When 
      - applicable only to this frame
         - should be reclaimed as soon as we leave the frame (e.g., dcaches with things
           like non-constant observations, VECPTs, lattice CPTs, etc.). 
           Idea: if observation is from an obs file, and it happens to be the same value in the
           neighboring frame, can keep the dcache entry for that one.
           Might want to keep these around on the sidelines during island???

Thu Mar 30 19:41:58 2006
  - in new inference, make sure that all the cached factors and so on are based on
    the actual factors (cpts, and factors) rather than on the undirected edges. 
    For example, if we iterate over the incoming separator first, we are
       essentially conditioning on the variables in the separator which might render
       other variables independent (and can thus be solved separately).
    All variables currently with values we are conditioning on, and might render
    remainder of elements in clique independent.
    If we're doing left-interface method, this is particularly important to
     note in light of the fact that the left-interface of the next chunk
     has completed 'neighbors' even though those variables might be quite independent
     conditioned on the current value set that we're conditioning on.
    
       
Sun Apr  2 17:14:34 2006
  - change so that when we detect a zero-prob utterance, we backout to next utterance.   


Tue Apr  4 19:35:52 2006
  - island and hash tables. If we are doing the second CE during island algorithm,
    and we do not clear out the clique's value hash table, we don't need to
    do the extra check, since we know the entry will exist (so we can do the
    faster hash find, under assumption entry will be found). 

Tue Apr  4 20:15:12 2006
  - hash tables clique value holders for P and E. Code should check to see if they can be shared
    with C clique versions (rather than having separate one for P, C, and E cliques).

Wed Apr  5 19:34:39 2006
  - gmtkViz, change so that it removes automount string from .gvp file (i.e., change /atm/foo to /n/foo).

Sat Apr  8 19:06:20 2006
  - add other CPT types
     - Dirichlet distribution for continuous vectors on simplexes (for distributions over observed
       values which are really probability vectors)
     - Poisson distributions (for length variables), but need to figure out how to make them finite
       valued (since Poisson is n >= 0)
     - 

Sun Apr  9 07:41:00 2006
  - regarding the 'diversity pruning' idea from above: the question is how
    to get a distance measure. The distance should probably be based on
    original RV value equality (or distance), with weights to determine
    which RVs being different are more important. I.e., with K variables
    we'll have \sum_k w_k*d_k(x^{(k)}_1,x^{(k)}_2).
    We could have the user specify both the weights and the d() function. This could
    take the form of, for each variable, a length card vector and card x card table.
    Simpler approach might be just to set w_k = 1, and use hamming distance on variables 
         (i.e., d(a,b) = (a != b))
    Need fast implementation of MST.

Mon Apr 10 13:25:12 2006
  - VE style inference, we can simulate 'factored' representations by
    shrinking d-sets artificially and in a value specific nature, which
    corresponds to increasing the factorization. a d-set shrunk to zero
    means factorization (entire independence). 

Tue Apr 11 21:09:47 2006 
 - another elimination heuristic:
    - do min-fillin if it is below or equal to a threshold (default is 0), and then
      if it is above a threshold, do another heuristic (e.g., size, weight, etc).
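
    Sketch of this two-stage choice. Assumptions: the graph is an undirected
    adjacency-set structure, and the fallback heuristic is "fewest neighbors"
    as a stand-in for size/weight. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

using Graph = std::vector<std::set<int>>;  // g[v] = neighbors of v

// Number of fill-in edges that eliminating v would add among its neighbors.
std::size_t fillIn(const Graph& g, int v) {
  std::size_t missing = 0;
  for (auto a = g[v].begin(); a != g[v].end(); ++a) {
    for (auto b = std::next(a); b != g[v].end(); ++b) {
      if (g[*a].count(*b) == 0) ++missing;
    }
  }
  return missing;
}

// Hypothetical helper (not GMTK code): take the min-fill-in vertex if its
// fill-in is at or below `threshold`; otherwise fall back to the vertex with
// the fewest neighbors.
int chooseNext(const Graph& g, const std::vector<int>& alive,
               std::size_t threshold) {
  assert(!alive.empty());
  int bestFill = alive[0];
  int bestSize = alive[0];
  for (int v : alive) {
    if (fillIn(g, v) < fillIn(g, bestFill)) bestFill = v;
    if (g[v].size() < g[bestSize].size()) bestSize = v;
  }
  return fillIn(g, bestFill) <= threshold ? bestFill : bestSize;
}
```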


Mon Apr 24 22:05:20 2006
 - another memory savings (and savings for unrolling).
   Rather than unroll all RVs for the entire time, unroll only
   enough to use, and then re-use the RVs in place,
   changing their frame number and observed values when necessary.
   This might also improve speed since the RV values will always be in
   cache. Viterbi paths might be printed out in reverse though.


Wed Apr 26 13:36:36 2006
  - add observation 'T-frame' in addition to 'frame', i.e., we should
    have a "distance to end" observation.
  - new CPTs:
      - MLP CPT for discrete distributions, learned using 2nd order inside EM
      - Sigmoidal CPT for binary {0,1}-valued variables
         P(S_i = s_i | S_j = s_j, j \in J) = sigmoid( (2s_i - 1) \sum_{j \in J} s_j w_{ij} )
         See Radford Neal's paper "Learning Stochastic Feedforward Networks"
      - log-linear CPT, i.e., each CPT can essentially be a CRF with a log-linear implementation
        where feature functions are given by a list of decision trees.
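
      Sketch of evaluating the sigmoidal CPT formula above for {0,1}-valued
      variables. The function names are hypothetical; w[j] plays the role of
      w_{ij} for a fixed child i.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Hypothetical helper (not GMTK code):
//   P(S_i = s_i | parents) = sigmoid((2 s_i - 1) * sum_j s_j w_ij)
double sigmoidCPTProb(int childVal, const std::vector<int>& parentVals,
                      const std::vector<double>& w) {
  assert(childVal == 0 || childVal == 1);
  assert(parentVals.size() == w.size());
  double activation = 0.0;
  for (std::size_t j = 0; j < parentVals.size(); ++j) {
    assert(parentVals[j] == 0 || parentVals[j] == 1);
    activation += parentVals[j] * w[j];
  }
  return sigmoid((2 * childVal - 1) * activation);
}
```

      Since sigmoid(x) + sigmoid(-x) = 1, the two child values always sum to one
      for any parent configuration.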

  - noisy OR CPT, that inference knows about and can do smarter things with.
      - common CPTs that inference can exploit.
  - 


Sun Apr 30 20:39:50 2006
 - for docs:
    - include description of the hash table options, where using primes can decrease memory usage.
    - include also description in GMTK_MaxClique.cc where memory growth rate can be changed at compile time.
    - add these options to the Makefile for easy compile time change.
    - include a 'if memory gets large' section in the documentation.


Fri May  5 15:48:30 2006
  - on cygwin, if cpp doesn't exist, when called from GMTK, it seems to return an empty file
    rather than a proper error message saying that cpp doesn't exist. Perhaps system() should be called
    with a different set of options on cygwin, etc.

Tue May 16 19:55:14 2006
  - for docs, include all memory optimization operations, don't forget
     about all memory options:
     reduce memory options
     reduction memory
     Reduce

            1) find a different triangulation somehow.
            2) -hashLoadFactor option (set to 0.98)
            3) -deterministic F
            4) use island algorithm  
            5) compile with prime hash tables
              #ifdef OPTIMIZE_FOR_MEMORY_USAGE in GMTK_MaxClique.cc         
               could decrease the following even further in GMTK_MaxClique.cc
                #define MEM_OPT_GROWTH_RATE 1.25
            6) Gaussian cache if using Gaussians.
                -componentCache F
            7) if there is no clique sharing, possibly change in GMTK_MaxClique.h
			#define IMC_NWWOH (1)
			#define ISC_NWWOH_AI (1)
			#define ISC_NWWOH_RM (1)
            8) typedef logp<double,double> logpr; in logpr.h, 
               change to using floats in first argument, i.e., change to:
               typedef logp<float,double> logpr; in logpr.h (but in such
               a case, start using the -localCliqueNorm option in EM training, esp.
               when segments are long).
    Fri Oct 19 04:30:24 2007
     - other things:
        1) logpr, use 32-bit FP rather than default of 64bit
        2) compile with 'make EXCXXFLAGS=-DUSE_TEMPORARY_LOCAL_CLIQUE_VALUE_POOL'


Wed May 17 20:45:12 2006
  - when caching-based inference is working, have option to save cache
    to disk between runs, especially if it takes a while to
    reconstruct.
  - Update: Mon Sep  4 01:35:10 2006: we would cache to disk all factors
    that are both time-inhomogeneous (meaning that
    they are activatable at all time frames) and segment-inhomogeneous (meaning
    that they are activatable over different segments, so they don't involve
    an iterable DT or lattice).


Wed May 17 21:44:05 2006
  - Time based pruning, where we have a fixed upper limit on the
    amount of time per C chunk, and if we exceed the time, we decrease
    a pruning threshold, and
    if we are less than that time, we increase a pruning threshold. Time can
    be given in absolute times, or relative to average value, or some other units?
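
    Sketch of such a controller in C++. It is an assumption-laden
    illustration: the class name TimedBeam, the multiplicative update, and
    the reading of "pruning threshold" as a beam width (smaller means more
    pruning) are all hypothetical, and time is taken in absolute seconds.

```cpp
#include <algorithm>
#include <cassert>

// Adapts a beam width so that per-chunk inference time tracks a budget.
class TimedBeam {
public:
    TimedBeam(double initialBeam, double budgetSeconds, double shrink,
              double minBeam, double maxBeam)
        : beam_(initialBeam), budget_(budgetSeconds), shrink_(shrink),
          minBeam_(minBeam), maxBeam_(maxBeam) {
        assert(shrink > 0.0 && shrink < 1.0);
        assert(minBeam > 0.0 && minBeam <= maxBeam);
    }

    double beam() const { return beam_; }

    // Call after each C chunk with the time that chunk took.
    void update(double elapsedSeconds) {
        if (elapsedSeconds > budget_)
            beam_ *= shrink_;   // too slow: prune harder
        else if (elapsedSeconds < budget_)
            beam_ /= shrink_;   // time to spare: prune less
        beam_ = std::min(maxBeam_, std::max(minBeam_, beam_));
    }

private:
    double beam_, budget_, shrink_, minBeam_, maxBeam_;
};
```

    A relative budget (e.g., a multiple of the running average chunk time)
    could be handled by recomputing the budget before each update.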


Sat May 27 22:59:17 2006
  - docs, don't forget about 'emarfNum' observation.


Tue May 30 03:09:27 2006
  - in -probE mode, we should also free up the memory of a
    clique inside a partition, not just between partitions. This will
    make it more possible to do static graphs with P(evidence). Also,
    in this case, we should really only keep memory for the outgoing separators,
    and free them up as well as soon as we get to the next clique.

Fri Jun  2 20:08:56 2006
  - redo the MST algorithm for JT computation to use proper fast MST algorithm.
    (important for static graphs).

Tue Jun 13 20:09:10 2006
  - go through LM and lattice error messages and add more info to/fix them.


Tue Jun 13 20:34:09 2006
  - added triangulation heuristic syntax, ability to specify sequence of heuristics, i.e.,
         1000-W-3,1000-F-3
    to do one after the other and take the best out of both heuristics.

Mon Jul 10 03:44:01 2006
  - another logp.h compiled mode that uses all normal arithmetic w/o any normalization. Will need to go through
    and make sure that code is not breaking the interface (i.e., reaching in and
    doing a divide). Need to implement a logp.h divide with no check routine (rather than reaching
    in). This should help out on static networks where we can get a speedup if we're
    doing just single or double precision.

Mon Jul 17 21:54:29 2006
 - add MLLR and regularized MLLR as in bookchapter, as well as regularized adaptation
   (so we can do combinations of all of the above).
 - c(x1,x2|x3,x4) such that for any x3, and x4, we define
   a joint domain on x1 and x2, such that iteration
   for (x1,x2) \in D(x1,x2|x3,x4) for fixed x3,x4.  We build
   these conditional domains 

Tue Jul 25 21:05:30 2006
 - A new time-inhomogeneous CPT that, assuming we are doing inference from left to right,
   has p(a|b) depending on the clique frequency in preceding cliques. I.e., 
   after we are done with a clique, we can iterate over it to update the p(a|b) counts
   which can produce a different prob. model for the next time frame.

Mon Jul 31 17:07:28 2006
  - for docs: there is lots of good text documentation in the NSF progress reports
    over the years, don't forget to include that as well.
  - like the above memory options, also in docs include a summary of all possible
    'speed options' to try to help to improve speed (not just triangulation, but
    also things like order, etc.)

Tue Aug  1 00:52:06 2006

  - in the non-training objects, add a special flag to not train either
       1) all means, 2) all diag covariances, or 3) all B matrices, so that it is easy
       to do proper 3x EM training.

Sun Aug 13 23:40:28 2006
  - for docs: explain how factored structure significantly reduces parameters (perhaps
    an exponential reduction) but that it may not reduce time if the tree-width is
    still big, when doing exact inference.

Thu Aug 17 20:24:58 2006
  - have a decision tree and table to produce switching parent dependent scales/shifts/penalties
    i.e., scale mapping("foo_dt") with table("bla") or something.

Thu Aug 24 13:19:58 2006
  - note, packing of RVs into set of integers is the same problem as backup of a set of
    files on a set of disks (CDs/DVDs) such that the total space is the same (minimal number
    of disks) but the number of file splits is minimized.

Wed Aug 30 13:17:03 2006
  - decision trees:
   - allow leaf nodes to have routine calls to other decision trees,
     i.e.,  a leaf node can have a function call of the
     form otherDT(p0+1,3,45,mod(p1+p2,3)) to call another DT called 'otherDT'
     such that at that other one, the arguments (parents) are defined
     appropriately as call by value.
  - integrate routine calls with tags, so that 'goto tag' and routine
    calls work together.

Thu Aug 31 13:25:13 2006
  - for new inference, have option -inference {v2,v3} to choose which version of inference
    to use. -inference v2 is easier when the user is debugging a graph.

Thu Aug 31 19:10:46 2006
  - current running syntax for factor


   factor: firstFactor {
      variables: fooA(0),fooA(-1);

      // only one type of constraint can be defined at a time??
      symmetricConstraint:
	allVarsEqual;
        allVarsUnequal;
        varsNotEqual;
        varsSumTo(n);  // only values that sum to n
        varsMultiplyTo(n); // only values that multiply to n
        varsSumMod(m,n);  // only values such that (sum(vars) % m) = n
                     // e.g., force even parity done by sumMod(2,0)
        varsSatisfy using mapping("bla");
           // where mapping is a DT

        (added Sun Jun  1 01:43:36 2008)
            allSorted(varset) (i.e., true only if vars of the argument are sorted).
            isPermutationOf(varSetA,varSetB) true if the values of A can be permuted to get the
                variables in set B, both sets must be same size and consist of vars with 
                same domain (cardinality)
            permutationDistance(varSetA,varSetB) returns score of distance of two var sets.


     directionalConstraint: fooA(0) = functionOf(fooA(-1)) using mapping("bla");
        // where 'bla' is the name of a decision tree
        // should it be possible to have more variables be a function? (and why)

     softConstraint: using table("foo"); // table is a new hash table object, EM learnable, will also
                                            be used for TableCPT.
     softConstraint: using logLinear("bar"); // logLinear is a new loglinear object, EM learnable
  }

  - some of these constraints will be jointly iterable
        (i.e., the table soft constraint)
    while some aren't or there is no reason to as they are always positive. 
  - allVarsEqual is such that once one variable is set, all the rest can be set (so should be).


Mon Sep 4 19:07:51 2006
 - memory optimization: in inference cliques, direct pointers member variable
   arrays can be computed anew each time we are about to use a clique/separator
   rather than storing them for all cliques since they are not being used 
   most of the time (the only things that need to be retained are the 
   clique value tables and scores themselves).


Sun Oct 8 17:06:25 2006
  - another form of entropy increaser/smoother for discrete probabilities. For binary
    random variables with prob p = 1 - q, we can warp the probability with
    parameter 0 <= a <= 1 as follows:
         f(p) = a + (1-2a)p
         f(q) = 1 - f(p)
    This will map f(0) = a, and f(1) = 1-a. We can generalize this to
    multinomial distributions both during and after training to keep probabilities
    from ever hitting zero if that is what we want.
    Compare with good-turing and other forms of estimator.
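
    Sketch of the warp in C++. warpBinary follows the formula above
    directly. warpMultinomial is only one possible generalization (an
    assumption, not something fixed by this note): f(p_k) = a + (1 - K a) p_k
    for K outcomes, which needs 0 <= a <= 1/K to stay a valid distribution
    and reduces to the binary case when K = 2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Binary case: f(p) = a + (1 - 2a) p, so f(0) = a and f(1) = 1 - a.
inline double warpBinary(double p, double a) {
    assert(a >= 0.0 && a <= 1.0);
    assert(p >= 0.0 && p <= 1.0);
    return a + (1.0 - 2.0 * a) * p;
}

// Assumed K-way generalization: every entry ends up >= a and the
// result still sums to one when the input does.
inline std::vector<double> warpMultinomial(std::vector<double> p, double a) {
    const double K = static_cast<double>(p.size());
    assert(!p.empty() && a >= 0.0 && a * K <= 1.0);
    for (double& pk : p) pk = a + (1.0 - K * a) * pk;
    return p;
}
```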


Tue Oct 24 18:48:10 2006
  - in reading/writing accumulators and mean/covariance vectors, extend file IO library
    to read/write vectors rather than scalars, so that we read/write in batch.


Thu Oct 26 20:03:34 2006
  - create a version of the island algorithm that caches frames that are low memory
    rather than just doing a log decomposition of the number of frames (which might
    potentially cache frames with very large state spaces). This way,
    the frames that are cached will take up less mem. This of course will only
    work when there is both 1) lots of determinism or probability-based beam pruning,
    and 2) lots of variance in the resulting state spaces over time.


Thu Oct 26 20:05:25 2006
  - have version that has command line versions of IMC_NWWOH, and those
    other parameters (so that island will work more effectively).
 
Thu Oct 26 20:17:06 2006
  - island decoding, store islands to disk in cases where they use lots of memory.
  - in island decoding, in a partition, the separators don't need to be stored.
    Each island should only store the minimum necessary in order to do island
    (rather than everything about a partition), and then we regenerate it as
    needed. 

Tue Nov 07 21:19:06 2006 -----

- Summary of many of the new features compiled from the above up to the
  current date (this includes major new features, but not the bugs or
  other minor things mentioned above). This also includes the
  features from the date 'Wed Aug 18 21:05:31 2004', but modified
  based on new knowledge acquired over the past 2 years.

Categories:


- support for new underlying different models
          - DGM-based Fisher kernels, accumulator kernels. Use sequences for classification.
          - native support for time-inhomogeneous DBNs
          - non-linear Gaussians and non-linear BMMs w. pseudo-2nd-order training methods.
            non-linear regression models, L2 and L1 (lasso) regression, etc.
            NLPs, kernel regression, and Gaussian process giving the non-linearity.
          - smoothing for CPTs in EM
               Laplace smoothing (DONE), 
               Good Turing within EM training, false counts.
          - generalized parameter adaptation directly within GMTK
          -  (Sat Apr 24 22:21:09 2004) probably "decision tree"
             (or some form of) state clustering, generalized to
             graphical models. Clustering of two distributions when they
             get very similar (presumably, for p(a|b), when two values of b yield
             the same dist over a, we cluster the bs together).
          - generalized probability weights/exponents
          - switching weight and penalty.
          - native mode for virtual evidence 
              - enables native-mode hybrid DBN/ANNs and DBN/SVMs.
          - hidden continuous nodes, both GC models, and also non-linear hidden
            continuous nodes, trained using sampling approaches.
          - sparse inverse covariance matrices (corresponding to sparse undirected graphical
            models as Gaussians). To avoid needing to do an iterative scheme, triangulate
            this model giving chordal Gaussians.
          - ancestral graphs and Gaussians.
          - sparse inverse covariance matrices, non-chordal, using iterative proportional scaling (IPS)
            to train within EM 

     - new CPTs
        - table and/or hash CPT (table of parent values is specified, could be very sparse).
          Easier to specify than sparse CPT with DTs. 4/26/05, 1/18/06
        - NN-based CPT 
        - SVM binary and multi-class CPT
        - Gaussian process-based CPT
        - logistic regression CPT (i.e., 1 layer MLP)
        - kernel classifier CPT (i.e., kernel nearest neighbor).
        - native negative binomial CPT for length distributions. Also Poisson? 4/8/06
        - log linear CPTs, letting the user specify features, weights can be learnt.
        - dynamically updated CPT based on clique frequencies (7/31/06).

      - new types of factors/constraints:
         - rather than table a=b, have constraint/factor that says a=b=c=d= ... all equal constraint.
           (see current GMTK part implementation).
         - directional constraints. 7/5/2005
         - other constraints 2/19/06
         - still more constraints 8/31/2006

       - ARPA LMs and FLMs
       - sequence and lattice CPTs (that iterates through
          a lattice depending on a specific set of parents).
          Like a sparse CPT but easier to specify since it keeps track
          of where it is. 

   
- theory/algorithms
       - generalized discriminative training procedures
           stochastic gradient descent. 
           max margin/MCE-based training, MMIE using regularized gradient descent 
       - submodular flows for various things.
       - new Gaussian splitting algorithms, use state posterior as well as mixture posterior (dasgupta material??)
       - approximate inference, pruned separator driven, improves only the left interface clique,
         but if this separator intersects other cliques in the graph (and we have a JT-based implementation),
         it can speed up that clique as well (but it will take the form of no-goods/constraints). 5/24/2005.
       - dynamic variable order based on conditional cardinality 6/10/05
       - VE-based algorithm (see notes file). 
       - diversity pruning (8/11/05, 1/26/06 4/9/06)
       - better island algorithm interaction with global shared clique values data structure. 9/28/05
           - automatic way to improve memory in these cases (at commandline option). Also, see 9/28/05
           - save to disk island portions. 
       - wave pruning: 1/18/06
           - dynamic pruning, prune values from observation files.
           - HTK form of pruning, start again with wider beam if we get to a zero.
       - time pruning - upper bound on amount of time 4/17/2006.
 
       - caching CPTs in same clique if RVs happen to use same CPTs (this might be hard). 1/23/06.
       - left-to-right/right-to-left outside-to-in (to take advantage of zeros).
             - also, exploitation of 2-core CPUs in outside-to-in  2/22/06
             - pipelining of segments to support quad- and higher core (new)
       - adaptation algorithms
           - regularized adaptation of Gaussians, MLPs, etc. Use Dirichlet prior for unadapted model.
           - MLLR type from speech recognition, but applied to DBNs.
           - regularized MLLR adaptation.
       - real-time online decoding.
            - standard algorithms for fixed lookahead 
            - Mukund's ICML'06 algorithm for real-time variable 
              lookahead viterbi approximation

          - better boundary search algorithms & predicting of when boundary
            will and won't be useful.
          - optimal separator iteration orders
          - switching parents in inference in forward/backward
          - characterize nets which require particular M and S
          - triangulation for nets with sparse CPTs (Chris)


- fast exact & approximate inference and other inference features 
          - describe new inference algorithm in some detail (see notes file)
              l-factor types 3/27/06.
          - CSP features into inference
            - AC-3/AC-4 on factors before running and/or after factor creation, for l-time factors that are re-used in
              next iterations, can run AC on these to make smaller (could do this on each next frame for new factors??)
            - fail-first heuristics for variable selection. 1/31/06
          - A*-like search, with continuation costs (both admissible and non-admissible).
             - see Charniak parser heuristics, also heuristics used for stat-MT.
          - New (viterbi) decoder program, that will
               do the max-sum over subsets of cliques rather than
               entire cliques. This will mean that we can
               do a word only (or variable only) 
               backtrace, generalizing LVCSR decoders (see decoder papers), as long
               as the sub-variables we wish to decode are a sub-junction tree of the
               underlying junction tree.
                - ability to max/sum variables (max over some variables, sum over others,
                  for hybrid decoding).
                - n-best and lattice generation
                - should be able to do this without editing the trifile.

          - New inference methods:
              hybrid inference/search based inference methods, generalizing value elimination.
          - iterative approaches, using a form of importance sampling.
          - loopy propagation will not be used, as many of the graphs involve determinism,
            and we have found that large cliques decrease state space, so loopy would not
            be able to take advantage of this. Give example A -> B -> C.
          - sampling and hidden continuous mixture Gaussians.
             - variational and mean-field inference                 
          - fast boundary search, using max-flow/min-cut algorithms.
             - submodular approximation to entropy based approach.
          - gmtkDecode, and support for n-best and lattice generation.
             - graph lattices, generalization of lattices in speech recognition
               which give reduced state space for a DGM.
          - option for inward/outward inference rather than forward-backwards.
            This can significantly speed things up since we don't keep things
            around that are zero (try to have a picture explaining this).

         - support filtering p(x_t | y_{1:t}), prediction p(x_t |
           y_{1:s}) s < t, as well as smoothing p(x_t| y_{1:T}), where
           y is observed, and x are the hidden variables in the DBN.
         - island algorithm interacts with Gaussian caching, so that we cache current island only.
             - island algorithm better integrated with global clique packing.
         - Single-Pass Retraining, ala HTK (sec 8.6). (single pass retraining).
       - sampling-based inference - 
          1) gibbs sampling, and other forms of sampling.
             Block sampling (to sample entire clique entries and use deterministic variables when their parents have
             been sampled).
          2) distribution posterior sampling, i.e., P(x_H|\bar x_E) 5/17/05
          3) sample from p(x_A|x_E) where A \subseteq H, and where it is a sub junction tree.
          4) 


- new usability features
      - automatic unrolling, given only length, where the CPTs are per segment  (i.e., length
         can come from either VECPTs, or just from a string length to do new string alignment models).
      - new DT types: 
          - skipby feature (9/14/2004)
      - clique printing for forward and backwards.
          - subclique printing (w/o having to edit .trifile) 2/16/06.

      - trace output filtering -- right now, trace outputs are enormous; they can take up many 10s of gigabytes
        before getting to the part of the trace that the user needs in order to debug their graph. Trace is too
        big for egrep since egrep spends all its time searching before it gets to the desired part and it takes
        a long time to both generate all the ASCII output and search through it. Internal trace filtering 
        will use syntax to filter output before it is printed (e.g., by frame (or percentage), by variable value, 
        by variable set of values, etc. etc.). Also see 9/16/05.

      - min/max/median/mean-score Gaussian (to allow hidden variable to essentially skip a frame,
        not have a Gaussian score at a particular frame. E.g.,
            \prod_t p(x_t | q_t) p(q_t|q_{t-1})
        previous/next-frame score Gaussian (i.e., could just copy previous Frame's Gaussian score for current state??).
        Some way for a particular state, if it exists, to ignore the current
        x_t (like unity/zero-score) but not make the hyps that extend this state excessively (or insufficiently) probable.

     - extended DT features:
         - constraints 3/17/05
         - new DT functions 4/20/2005
         - compressed DTS. Decision graphs, BDDs, 3/7/2006
         - leafs can have gotos. 8/30/06
         - decision tree formulas (Chris)

      - much better user-level error message reporting, 
          - not only that there is a cycle but which vars are in cycle 
       - quick reference cards (1 page for decision trees).
       - tools for easy automatic parameter creation and structure creation.
          (meta level description of parameters rather than needing to specify
             them all). Also meta tools for Gaussians & continuous densities.
             (converting from dense to sparse cpt).
       - graphical user interface for graph specification and parameter specification
       - regular expressions to determine objects not to train.
       - symbolic observations such as support for language files, etc.
              - word vocabulary objects, so RVs have symbol names associated with their values.
          - Arbitrary sets of variable values, including strings, rather than [0..card-1] 
            so that user interacts with variable values as strings (even though
            internally they are unsigned integers).
          - utilities to convert to and from other standard formats:
             - other GM systems and other speech systems.
             - other systems, HTK, etc.
             - compatibility with all NIST word scoring tools.
          - emacs editing mode for GMTK structure files.
          - user debugging features
              - (see Sat Mar 15 22:33:27 2003 in TODO file)
              - gmtk user-level debugger to help users figure out why they
                get zero probabilities (i.e., bugs in their deterministic cpt specifications).
          - fixed upper bounded memory usage.
          - allow for empty frames to do instant down-sampling and multi-rate
            processing.
          - dynamic beam pruning (so that beam widths or counts can come from global
            observation file). beam comes from obs frame corresponding to first 
            variable in the clique. Also, allow beam formulas on command line,
            to adjust variable beams. (might want to have wider beams at beginning
            of segment, and smaller in the middle).
          - allow parameters to be read in in any order, rather than in
            an order that requires any dependencies to have been read in beforehand.
            Do final pass to ensure all is there.
          - set up GMTK wiki page.


- other software engineering speedups, & software infrastructure improvements
       - common cliques in P, C, and E share mother clique.
       - better error messages.
       - currently, trifiles are written out but JT and variable iteration order are computed
         internally. Also have .jtfile to allow this to be edited by a human so that if
         the current heuristics are not ideal for a given graph, a knowledgeable user can change it.
       - phipac generator for gaussian evaluation and hash-key evaluations.
       - inference clique pointers computed at start of each clique (rather than once) saving memory.
            9/4/2006
       - speed up logadd ops, + have a scaling mode for logp.h
            - better log table, see 5/19/2006 for software version of hardware log-add implementation.
            - normal floating point for static graphs 7/10/2006
       - ability to report how much memory is currently being used. Memory tracking.
       - numerical scaling in addition to log arithmetic. 2/17/05
       - value elimination table storage, shared packed dset value tables.
       - pack to bytes rather than words (to save some memory).
       - don't unroll entire length, just enough for expanded template and re-use RVs changing
         their frame number 4/24/06 (but be careful with forward/backward and island as it might
         assume some variables maintain their state).
       - save l-caches/l-factors to disk between runs 4/17/06.
         - custom floating point routines to implement log(1+exp(x)) as a
           single function.
         - much better hash tables
     - general software infrastructure
           - new makefile system, break large .cc files into smaller (better organization of code)
           - use of gnu config for easy compilation on multiple platforms.
           - sourceforge integration, proper source release and bug
             tracking procedure. Distribution of both sources and pre-compiled native-mode binaries for  
             multiple platforms (Linux, Windows, Mac OS).

- tutorials/documentation
        - new unrolling semantics (Nov 2004)
          - many tutorials and examples to be created and released on the web.
              (list different types).
          - update and get aurora tutorial into CVS
          - get Karim's fully implicit tutorial integrated in
          - write other tutorials for
              - language modeling
              - pronunciation modeling
          - integrate in small test suite examples (exp1, exp2, etc.) which will
            both serve as tutorial material and something to
            test things out to ensure compiled version works.  
          - much improved and more complete documentation with many examples.


Wed Nov 08 00:03:29 2006
  - boundary algorithm to create non-minimal interface separators 
     - don't store det values for seps after all (see 5/13/2005)
         - once we instantiate acc-inter + residual, we set all deterministic values and
           then continue.
     - since separators are bigger, might be smaller state space even.
     - how to generalize notion of cut, might be sufficient to just add 
       extra ancestral edges in the graph which will produce a non-minimal separator
       in the original graph (see 7/17/05)
 - 


Mon Nov 13 16:03:18 2006
  - add to gmtkViz, ability to display color display of clique state spaces so that
    we can see how cliques in the C partition evolve over time.


Wed Nov 15 18:40:27 2006
  - on lattices:
    1) support lattices that have word definitions on nodes, right now it doesn't work,
       look for set of emails from Hui/Amar on this issue around this time.
    2) support zero time-length links in lattices, still doesn't support that.


Wed Dec 13 05:53:29 2006
  - create an int32 type for domains (since an array of domain values
    shouldn't use 64 bits when we move to 64 bit machines).
  - for DOCS: have a large comparison with the FST approach. 


Thu Dec 21 16:20:46 2006
  - algorithm for clustering that is O(kn) and that keeps outliers.
     (pick a random point, find a point that is max distance away,
      then cluster into two, then for each do the same thing or some
      modification).

    Dorit S. Hochbaum and David B. Shmoys. A best possible heuristic
    for the k-center problem. Mathematics of Operations Research,
    10:180--184, 1985.
    http://www.cs.utexas.edu/~abhinay/ee382v/Project/Papers/fft85.pdf
    (or it could be in ).
    T. F. Gonzalez. Clustering to minimize the maximum intercluster distance.
    Theoretical Computer Science, 38(2-3):293--306, June 1985.
  - Using l-infinity norm for removing outliers in l-2 norm problems
    (Christy Sim & Hartly).    
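
    Sketch of an O(kn) center selection in C++, in the farthest-first
    style of the Gonzalez reference above rather than the recursive
    two-way split described in the note. The names Point, sqDist, and
    farthestFirst are hypothetical, and squared Euclidean distance is an
    assumed choice of metric.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

inline double sqDist(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

// Returns the indices of k centers. Each new center is the point
// farthest from all centers chosen so far, so outliers become centers
// instead of being absorbed. Uses k * n distance evaluations.
inline std::vector<std::size_t> farthestFirst(const std::vector<Point>& pts,
                                              std::size_t k,
                                              std::size_t first = 0) {
    assert(!pts.empty() && k >= 1 && k <= pts.size() && first < pts.size());
    std::vector<std::size_t> centers{first};
    std::vector<double> d(pts.size(), std::numeric_limits<double>::infinity());
    while (true) {
        // Refresh each point's distance to its nearest center using only
        // the newest center, and track the farthest point.
        std::size_t far = 0;
        for (std::size_t i = 0; i < pts.size(); ++i) {
            const double di = sqDist(pts[i], pts[centers.back()]);
            if (di < d[i]) d[i] = di;
            if (d[i] > d[far]) far = i;
        }
        if (centers.size() == k) break;
        centers.push_back(far);
    }
    return centers;
}
```

    Assigning every point to its nearest returned center then gives the
    clusters; the starting index could be drawn at random as in the note.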

Thu Dec 21 19:49:32 2006
  - for Gaussian mixtures, implement the following algorithm:
    Learning Mixtures of Arbitrary Gaussians (2001),
    Sanjeev Arora, Ravi Kannan
    ACM Symposium on Theory of Computing


DONE: Sun Feb 11 20:05:10 2007
  - change priorities so that reading/writing of parameter messages come
    out much earlier.
   Mon Feb 26 14:05:33 2007
   - new -verb 39 (to get parameter reading messages out).


Mon Feb 12 22:45:34 2007
  - reg adaptation to implement Bayesian tying, i.e., with reg adapt,
    large coefficient, this is like tying, but we can do Bayesian reasoning
    with multiple such tyings. Parameter tying approach.

Mon Mar 12 02:22:03 2007
  - Alternate to quadratic programming for lasso-style regression.
    Osborne, M. R., Presnell, B. & Turlach, B. (2000b). On the lasso and its dual.
    Journal of Computational and Graphical Statistics 9(2), 319.

Tue Mar 20 01:48:42 2007
  - alternate inference method: using a GA to expand the clique during forward
    pass, having multiple individual populations of states, which can 
    merge and the most probable of them are left to re-generate, all within
    one time clique.

Mon Apr 23 18:10:35 2007
  - add fast log implementation: see email in gmtk mail history
    from today by Oriol Vinyals @ ICSI.
    Also see:
       http://www.flipcode.com/cgi-bin/fcarticles.cgi?show=63828
    (and entry above from June 22, 2004).

================================================================================
================================================================================
================================================================================


Fast log() Function
  Submitted by Laurent de Soras 

Here is a code snippet to replace the slow log() function... It just performs an approximation, but the maximum error is below 0.007. Speed gain is about x5, and probably could be increased by tweaking the assembly code.

The function is based on floating point coding. It's easy to get floor (log2(N)) by isolating exponent part. We can refine the approximation by using the mantissa. This function returns log2(N) : 


inline float fast_log2 (float val)
{
   int * const    exp_ptr = reinterpret_cast <int *> (&val);
   int            x = *exp_ptr;
   const int      log_2 = ((x >> 23) & 255) - 128;
   x &= ~(255 << 23);
   x += 127 << 23;
   *exp_ptr = x;

   val = ((-1.0f/3) * val + 2) * val - 2.0f/3;   // (1)

   return (val + log_2);
} 
 

The line (1) computes 1+log2(m), m ranging from 1 to 2. The proposed formula is a 2nd degree polynomial keeping first derivative continuity. Higher degree could be used for more accuracy. For faster results, one can remove this line, if accuracy is not the matter (it gives some linear interpolation between powers of 2). 

Now we got log2(N), we have to multiply it by ln(2) to get the natural log : 


inline float fast_log (const float &val)
{
   return (fast_log2 (val) * 0.69314718f);
} 

-- Laurent
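
A possible variant of the snippet above, sketched here as an assumption and
not part of Laurent's submission: it copies the bits with std::memcpy
instead of casting the pointer, which avoids relying on pointer aliasing
between int and float. The name fast_log2_portable is hypothetical. It
assumes a 32-bit IEEE-754 float and a positive, normal (non-denormal)
input.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

inline float fast_log2_portable(float val) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "assumes a 32-bit float");
    assert(val > 0.0f);  // zero, negatives, NaN are not handled

    std::uint32_t x;
    std::memcpy(&x, &val, sizeof x);          // read the raw bits

    // Biased exponent minus 128 (one less than the true exponent,
    // because the polynomial below returns 1 + log2(mantissa)).
    const int log_2 = static_cast<int>((x >> 23) & 255u) - 128;

    x &= ~(255u << 23);                       // clear the exponent field
    x += 127u << 23;                          // force exponent 0: val in [1,2)
    std::memcpy(&val, &x, sizeof val);        // write the bits back

    val = ((-1.0f / 3) * val + 2) * val - 2.0f / 3;  // approx 1 + log2(m)
    return val + log_2;
}

inline float fast_log_portable(float val) {
    return fast_log2_portable(val) * 0.69314718f;    // multiply by ln(2)
}
```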

================================================================================
================================================================================
================================================================================


Sun May 13 22:35:12 2007
  - when changing the r.v. unrolling process to only use the modified template, 
    we still need to keep track of r.v. values when doing things like viterbi.
    for this process, use the packing process to store the values as we compute
    the viterbi path so that even if we unroll by quite a bit, it will be fast.

Mon Jun  4 19:19:43 2007
  - when we add the neg binomial/poisson dist, also give option to
    reverse the ints (so that most prob. are at the end). Also give
    option to supply a permutation to the int meanings. 


Tue Jun  5 03:44:44 2007
  - memory speedup: in cases where clique pool has very little
    re-use, just turn off the hash table (i.e., just have the
    hash table always miss, so that we insert, but never allocate
    memory for the hash table).
  - actually, a better idea (which is probably mentioned above) is to do the
    old C trick where the last entry in a struct is one element long,
    but malloc is used to allocate enough storage so that the last element
    is large enough.


Fri Jun 22 16:12:28 2007
  - in cmbeam option, add a scale option as well, so that when
    clique tables are very low entropy, they can be adjusted so that
    cmbeam pruning is more effective (without needing the cmmin option).
  useful options:
    ./jtcommand -cmbeam 1e-5 -cmexp 0.1 -cpbeam 200 -dcdrng 0:0 -verb 59


Fri Jun 22 21:24:58 2007
  - when estimating the max clique entry in next clique, rather than
    using a fixed polynomial estimate of max probability from clique to clique,
    instead use online update of least squares estimate (i.e., either
    using the LMS algorithm or RLS recursive least squares) to update coefficients 
    (i.e., adaptively learn) as the process proceeds. 
    (plot these out for various values).
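
    A minimal sketch of the LMS half of the idea above, assuming the
    predictor input is a small feature vector (e.g., a bias term plus the
    previous clique's max log score). The class name LmsPredictor and the
    choice of a normalized step are assumptions, not existing GMTK code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a normalized-LMS linear predictor: y_hat = w . x.
// The weights are adapted online each time the true target becomes known.
class LmsPredictor {
public:
   LmsPredictor(std::size_t order, double step) : w_(order, 0.0), step_(step) {
      assert(order > 0 && step > 0.0 && step < 2.0);
   }

   double predict(const std::vector<double>& x) const {
      assert(x.size() == w_.size());
      double y = 0.0;
      for (std::size_t i = 0; i < w_.size(); ++i) y += w_[i] * x[i];
      return y;
   }

   // Move the weights toward the observed target; returns the error made
   // by the prediction before the update.
   double update(const std::vector<double>& x, double target) {
      const double err = target - predict(x);
      double energy = 1e-12;                    // guards against division by zero
      for (double xi : x) energy += xi * xi;
      for (std::size_t i = 0; i < w_.size(); ++i)
         w_[i] += step_ * err * x[i] / energy;  // normalized LMS step
      return err;
   }

private:
   std::vector<double> w_;
   double step_;
};
```

    Usage would be: predict() before expanding the next clique, then
    update() with the max that was actually found. RLS would replace the
    update step with one that also tracks an inverse correlation matrix.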


Sun Jul  1 23:04:18 2007
  - iterable CPTs (such as iterable decision trees) are similar to conditional
    evidence, i.e., evidence of the form "if A then B", which might affect
    a model by parameter adjustment rather than by any evidence of the
    form A=a or virtual evidence p(V=1|A=a). See Pearl's paper on Jeffrey's rule
    and the Geffner&Pearl88 reference contained therein.


Tue Jul  3 12:33:09 2007
  - memory optimization: when reading in lots of gaussians in one go, allocate
    one pool of memory for the means and variances (since it can be pre-allocated)
    and have each Gaussian's pointers point therein rather than having each gaussian component
    allocate its own little memory thus wasting space.
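
    A rough sketch of the pooled layout described above, assuming diagonal
    Gaussians of a common dimension. GaussianPool and its members are
    hypothetical names; the real parameter objects are organized
    differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: one contiguous allocation holds the mean and variance vectors of
// every component, laid out as [mean_0 | var_0 | mean_1 | var_1 | ...].
class GaussianPool {
public:
   GaussianPool(std::size_t num_components, std::size_t dim)
      : dim_(dim), storage_(2 * num_components * dim, 0.0f) {
      assert(dim > 0);
   }

   // Pointers into the shared pool; no per-component allocation happens.
   float* mean(std::size_t c)     { return &storage_[offset(c)]; }
   float* variance(std::size_t c) { return &storage_[offset(c) + dim_]; }
   std::size_t dim() const        { return dim_; }

private:
   std::size_t offset(std::size_t c) const {
      assert(2 * (c + 1) * dim_ <= storage_.size());   // component index in range
      return 2 * c * dim_;
   }

   std::size_t dim_;
   std::vector<float> storage_;
};
```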

Tue Jul  3 13:23:51 2007
  - memory opt: when doing the search, right now the shared clique value pool stores a value
    of a clique entry even if it is pruned away. To save memory, don't store the values
    that are ultimately about to get pruned away (perhaps put them in a temporary
    buffer and only commit them to the shared pool once they don't get pruned away).
    Q: how does this interact with island algorithm.


Thu Jul 12 18:10:09 2007
  - another new CPT. Include the case where the probabilities come from an 
    exponential, but where parent integer variables can multiply into the exponential
    coefficients. I.e., if we have p_i \propto exp(\lambda_i x) where p_i is prob of a child
    in state i, and \lambda_i is a vector of parameters and x is a set of parents, we might
    instead have a special parent that affects all of the parameters, as in:
         p_i \propto exp( y (\lambda_i x)) where y is a count. This could be used
    if y has a very large (or unbounded) cardinality.


Thu Sep 27 14:00:37 2007
  - use 'fast gauss transform' to speed up Gaussian evaluation on long sequences,
    since dimensionality is low.

Wed Oct  3 12:34:03 2007
  - add a viterbi training option to EMtrain (i.e., accumulate counts only based
    on viterbi path, or n-best path).

Thu Oct  4 12:36:35 2007
  - add a training score observation. I.e., this observation would be
    a value from 0 to 1, and it would be multiplied by what gets
    accumulated into each CPT at each frame. Setting this to 1 at each
    frame would do what training is doing now, but it would allow the
    training of a model based on how reliable a given time region is.


Fri Oct  5 13:03:56 2007
  - this is already mentioned above (7/3/2007): to save memory, we can have
    more than one shared packed clique array. The first one is used when the
    clique is being constructed. Then, after pruning, only those entries that
    have survived are inserted into the shared ones, along with pointer modifications.
    Otherwise, what happens is that the shared one contains clique values most of which
    have been pruned away, and pruning has a worse effect on memory.
    An alternative might be to get the better -cpbeam option working.


Mon Oct 15 15:28:13 2007
  - look into new Intel Yorkfield Q9000-series 45nm chips with instructions
    supporting the "Super Shuffle Engine and Radix 16" technique,
    which should speed up packing/unpacking of clique values.

Thu Oct 18 22:38:40 2007
    keywords: size, long, unsigned long, size_t, 64bit, 64-bit
  - go through all code and make sure anything that is representing a size
    object is now either a size_t type, or is an 'unsigned long' (which
    apparently seems on all relevant 32 and 64-bit architectures to be the
    right size, 32 or 64 bits). See /u/bilmes/bin/sizesizet.sh.
    
Sat Oct 20 15:33:47 2007
 The below applies to all forms of pruning, 'beam' = the appropriate beam settings for a given pruning option.
  - add smart decoding pruning, so that if we fail to decode, we increase the
    beam options. If we do decode and it does not take much memory and/or time,
    we re-decode but with a wider set of beam options. 
  - add ability to have time-dependent beams, with a beam observation
    per frame. Also, have beam options associated with individual RVs,
    and in a clique, where the beam might be set based on cardinality,
    or something else and then in a clique, have some form of beam
    aggregation function that combines the beams for all RVs in a
    clique to get a current clique beam.
   
  
Wed Nov  7 11:00:12 2007
  - add variance floor to each Gaussian (rather than needing to be global).
    It could be associated with a Gaussian object, and override any command line
    if it exists. It would be an optional arg to an object.


Tue Nov 13 14:00:34 2007
  - length distributions
     1) Poisson distribution, with Gamma conjugate priors. Use
         the mode as the Poisson parameter point estimate.
     2) Negative binomial distributions (which is also a gamma-poisson mixture).
        (this is just like using the mean of the Poisson with Gamma prior).
     3) Negative binomial distribution with Beta distribution conjugate prior.
        Use the mode as the parameter point estimate.
     4) Negative binomial distribution with Beta distribution conjugate prior.
        Use the mean as the parameter point estimate.
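
    For the Poisson case in 1), a sketch of the standard conjugate update,
    assuming a Gamma(shape, rate) prior on the Poisson rate. The struct
    and function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Gamma distribution over the Poisson rate, in shape/rate form.
struct GammaPosterior {
   double shape;
   double rate;
};

// Conjugate update: each observed length adds its value to the shape,
// and each observation adds 1 to the rate.
inline GammaPosterior poisson_gamma_posterior(double prior_shape, double prior_rate,
                                              const std::vector<unsigned>& lengths)
{
   assert(prior_shape > 0.0 && prior_rate > 0.0);
   double sum = 0.0;
   for (unsigned n : lengths) sum += n;
   return GammaPosterior{prior_shape + sum,
                         prior_rate + static_cast<double>(lengths.size())};
}

// Mode of the Gamma posterior (the MAP point estimate); 0 when shape < 1.
inline double posterior_mode(const GammaPosterior& g)
{
   return g.shape >= 1.0 ? (g.shape - 1.0) / g.rate : 0.0;
}

// Mean of the Gamma posterior.
inline double posterior_mean(const GammaPosterior& g)
{
   return g.shape / g.rate;
}
```

    E.g., a Gamma(2,1) prior and observed lengths 3, 4, 5 give a
    Gamma(14,4) posterior, with mode 3.25 and mean 3.5.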

Fri Feb 29 12:28:19 2008
  - probability models with additional scaling functions. I.e., for conditional
    distribution p(a|b) given two distributions p1(a|b) and p2(a|b) and we want to combine
    them with an event that occurs with probability f(a,b) (note function of both a,b), the
    ability to form the event:
       p(a|b) = p1(a|b)f(a,b) + p2(a|b)[ 1 - \sum_a p1(a|b)f(a,b) ]
    here, f(a,b) is the probability that the first model p1 is used for the probability,
    and the term on the right ensures proper normalization \sum_a p(a|b) = 1.
    This is similar to a backoff model but this will allow us to construct them, and is also
    similar to proof of metropolis-hastings alg.
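
    A small sketch of the combination rule above for one fixed value of
    b, with the distributions stored as dense vectors indexed by a. The
    function name combine_with_scaling is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: p(a|b) = p1(a|b) f(a,b) + p2(a|b) [ 1 - sum_a' p1(a'|b) f(a',b) ]
// for a single fixed b. p1 and p2 are normalized over a, and f holds
// probabilities in [0,1].
inline std::vector<double> combine_with_scaling(const std::vector<double>& p1,
                                                const std::vector<double>& p2,
                                                const std::vector<double>& f)
{
   assert(p1.size() == p2.size() && p1.size() == f.size());

   // Total probability mass claimed by the first model.
   double used = 0.0;
   for (std::size_t a = 0; a < p1.size(); ++a) used += p1[a] * f[a];

   // The second model distributes whatever mass is left over.
   std::vector<double> p(p1.size());
   for (std::size_t a = 0; a < p1.size(); ++a)
      p[a] = p1[a] * f[a] + p2[a] * (1.0 - used);
   return p;
}
```

    With p1 = (0.5, 0.5), p2 = (0.25, 0.75) and f = (1, 0), the first
    model claims mass 0.5 and the result is (0.625, 0.375), which sums
    to 1 as required.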

        
Fri Feb 29 13:14:27 2008
  - question: for MCMC based inference in a clique, how well do MCMC algorithms work for
    distributions (or clique functions) where the values are orders of magnitude different
    from each other. I.e., when the dynamic range of values of cliques is large (e.g., 10^50 or
    so difference, which can happen with high dimensional Gaussians), it seems like MCMC
    methods would require many samples whether or not the proposal distribution is accurate (i.e.,
    if it is accurate, one would rarely get such a sample, and if it is not (say uniform),
    then the sample would likely be rejected (unless the proposal's reverse position cancels out 
    the effect)). Need to look into this ...

Fri Apr 11 00:30:44 2008
  - viterbi decoding front end should have the ability to give a score for each decoded unit.
    E.g., for word decoding, each word should give a score, such that when all the scores
    are added up they add up to the total viterbi score of the utterance.


Tue Apr 25 02:26:46 2008
 - see paper by Streeter & Smith about new methods for algorithm
   portfolio design, which talks about choosing and scheduling
   heuristics for solving NP-complete optimization problems.

Tue Apr 15 23:38:51 2008
  - add another form of bayesian prior that can do parameter
    tying. I.e., this prior would be, say, over a subset of Gaussians,
    and would allow their means to be anything as long as they don't
    get too far away from each other. I.e., let's say that \mu_i is the
    mean for the i^{th} gaussian. Let \mu = average(\mu_i, over i) be the
    mean of the means. Then a Gaussian prior placed over each \mu_i would
    be something like N(\mu,\sigma). I.e., the l2 version of this would
    be to have || \mu_i - \mu ||_2 for each i. There could be approximate
    or exact l1 versions of this as well.

Wed May  7 00:33:19 2008
  - yet another idea for pruning.
    Each clique entry has a bit string and a probability.
       - regular pruning uses only the probabilities
       - diversity pruning (above) prunes but maintains the diversity of the population.
    Another idea is a form of Kolmogorov complexity pruning, where the bit strings
      that are somehow regular are considered more likely than the bit strings that
      are irregular. Pruning then would remove the more irregular bit strings keeping
      the regular ones, appealing to Occam's razor that the regular ones (i.e., the
      ones that are easy or simple or require a short program to describe) are more
      likely to come from a simple generative process, and since there are fewer
      simple than complex processes, the regular strings would be more probable.
    To do this, we'd need for each bit string $b$ a way to get to $K(b)$ which is
     a description length of $b$, and it needs to be fast.
      - possible simple functions to try: 
          1) number of 1 bits (i.e., if l(b) = n, and N(1|b) = n/2, this is more complex
              than if N(1|b) is small).
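
    One way to turn the "number of 1 bits" idea into a score, as a sketch
    only: count the ones and use n*H(k/n) (binary entropy) as a cheap
    stand-in for K(b). The entropy form is an assumption made here; the
    note above only says that k near n/2 should count as more complex
    than small k.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Number of set bits in b (clears the lowest set bit once per iteration).
inline unsigned count_ones(std::uint64_t b)
{
   unsigned c = 0;
   while (b) { b &= b - 1; ++c; }
   return c;
}

// Crude description-length proxy for an n-bit string with k ones:
// n * H(k/n) bits. It is 0 for all-zeros/all-ones and n when k = n/2.
// Assumes the bits of b above position n-1 are zero.
inline double entropy_code_length(std::uint64_t b, unsigned n)
{
   assert(n > 0 && n <= 64);
   const double k = count_ones(b);
   if (k == 0.0 || k == n) return 0.0;
   const double p = k / n;
   return -static_cast<double>(n) * (p * std::log2(p) + (1.0 - p) * std::log2(1.0 - p));
}
```

    Note this proxy ignores where the ones sit, so a string like 0101...01
    still scores as maximally complex.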


Thu May  8 00:14:07 2008
  - add observed RV multi-trial multinomial discrete distributions. I.e., a vector of integers
    gives a histogram over N trials, and this distribution models that histogram (so this would
    use up a vector of integer observations for each frame, similar to how a Gaussian uses
    up a vector of cont. observations)
    Allow Dirichlet priors over this, ability to learn this.
  - one simple way to do this is that when the random variable is observed, we allow
    a range of integer features where the range must match the cardinality of the random variable.
    In such case the accumulation will be over the range of counts and the number of trials is implicitly
    given by \sum_i n_i.
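
    A sketch of the per-frame score such a distribution would produce,
    assuming the observed histogram is a vector of counts n_i and the
    parameters are a probability vector p_i. The function name is
    hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Multinomial log-probability of a histogram of counts:
//   log N! - sum_i log n_i! + sum_i n_i log p_i,   with N = sum_i n_i.
// The number of trials N is implicit in the counts.
inline double multinomial_log_prob(const std::vector<unsigned>& counts,
                                   const std::vector<double>& probs)
{
   assert(counts.size() == probs.size());
   double total = 0.0;
   double lp = 0.0;
   for (std::size_t i = 0; i < counts.size(); ++i) {
      total += counts[i];
      lp -= std::lgamma(counts[i] + 1.0);                    // - log n_i!
      if (counts[i] > 0) lp += counts[i] * std::log(probs[i]);
   }
   return lp + std::lgamma(total + 1.0);                     // + log N!
}
```

    For EM, the accumulators would simply collect posterior-weighted
    counts n_i per component, since the factorial terms do not depend on p.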

Thu May  8 00:15:23 2008
  - add regularization to discrete distributions to favor max-ent and min-ent. I.e.,
    add a penalty of -\lambda H(p) so that during the update, we encourage the distribution
    to be min/max entropy depending on the sign of \lambda.

 
Tue May 27 23:47:27 2008
  - in cases when objects are not trainable, it should be possible to make them
    all potentially iterable (e.g., iterable sparse CPTs, iterable MDCPTs, etc. etc.) just
    like decision trees and lattices are currently.

Mon Jun  9 00:57:20 2008
  - yet another form of diversity pruning. The diversity pruning above
    uses the clique values to determine diversity. But we can cluster
    not only clique values but also the scores into clusters, and then
    prune separately within each cluster. I.e., if the
    probabilities/scores bunch into clusters, we might want to keep
    the most probable within each cluster.
   - k-means might do a good job here, but maybe a better idea is to:
       1) choose a bunch of points at random w/o replacement and throw them into random bins.
       2) keep choosing points randomly w/o replacement and throw them into the bin which
          has the closest mean, updating the bin mean accordingly.
      - simple way of doing this using random w replacement. Randomly sample L points and
          randomly place them into K bins. Randomly sample N-L points and place them into the
          closest bin updating means accordingly. Next, go through all N points and
          place them into the appropriate bin. This clusters the points in time 2N.
        (should do some analysis here on clique distributions, are they clustered,
         what sort of distributions do we see for certain applications?)
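
    A simplified sketch of the two-pass binning above on scalar scores.
    Simplifications that are assumptions made here: each bin starts from
    exactly one seed point whose index the caller supplies (chosen at
    random in the scheme above), and the remaining points are visited in
    index order rather than random order.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns a bin label for every score. seeds[k] is the index of the point
// that initializes bin k; seed indices are expected to be distinct.
inline std::vector<std::size_t> two_pass_cluster(const std::vector<double>& scores,
                                                 const std::vector<std::size_t>& seeds)
{
   assert(!seeds.empty());
   const std::size_t K = seeds.size();
   std::vector<double> mean(K);
   std::vector<std::size_t> count(K, 1);
   std::vector<bool> is_seed(scores.size(), false);
   for (std::size_t k = 0; k < K; ++k) {
      assert(seeds[k] < scores.size());
      mean[k] = scores[seeds[k]];
      is_seed[seeds[k]] = true;
   }

   // Index of the bin whose current mean is closest to v.
   auto closest = [&mean, K](double v) {
      std::size_t best = 0;
      for (std::size_t k = 1; k < K; ++k)
         if (std::fabs(v - mean[k]) < std::fabs(v - mean[best])) best = k;
      return best;
   };

   // Pass 1: fold each non-seed point into its closest bin, updating the running mean.
   for (std::size_t i = 0; i < scores.size(); ++i) {
      if (is_seed[i]) continue;
      const std::size_t k = closest(scores[i]);
      ++count[k];
      mean[k] += (scores[i] - mean[k]) / static_cast<double>(count[k]);
   }

   // Pass 2: with the means frozen, label all N points.
   std::vector<std::size_t> label(scores.size());
   for (std::size_t i = 0; i < scores.size(); ++i) label[i] = closest(scores[i]);
   return label;
}
```

    Pruning would then keep the most probable entries within each label.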


Tue Jul  1 15:57:53 2008
  - add ability to give a list of all paths that are involved in GMTK.
    I.e., in all files, objects, etc. give a list of all paths that any parameter
    somewhere might have a dependence on.

Tue Jul  8 13:43:08 2008
  - this is probably mentioned somewhere above, but just in case:
      when choosing variable order to iterate over, when we have switching
      parents, that is the case that makes the most sense to do variable ordering.
      I.e., say we have switching parent S and conditional parents P1 and P2.
      We would first iterate over S, and then depending on the value of S,
      next iterate over *only* P1, and then C (the child) without iterating
      over P2 (assuming that S switches in P1). This is especially true if
      the relationship between P1 and C is sparse or deterministic as we can
      do a lot of pruning by iterating over only those values of C that
      survive P1 without having to extraneously do it over and over
      for different values of P2. Once we instantiate S, P1, and C, then
      we can (at some point later) iterate over P2 to get the final clique entries.
      Thus, variable order will need to be built in at the beginning.


Tue Jul  8 16:14:02 2008
  - right now, accumulator files are compatible if the parameter files are identical
    (even if the structure is different). Add an ability to create compatible
    accumulator files from different parameter files (i.e., the ability to
    say something like only save the 'trainable' parameters to the accumulator
    files rather than all of the accumulators for all parameters).


Wed Jul 16 22:49:32 2008
  - regarding diversity pruning, have the ability for a user to
    specify the 'meanings' of random variables for the purposes of calculating diversity. E.g., in a lattice
    CPT, each edge has its own integer, but if the label on two edges are the same, but they have different integer ids,
    then that shouldn't count as more diverse (diversity should be measured by the meaning of the random variables rather
    than their integer id). This could take the form of the vocabulary/strings associated with RVs. Alternatively,
    for each random variable, it could optionally be given a table of 'meaning' integers which are used for calculating
    diversity.


Wed Jul 23 11:54:30 2008
  - add a scalar gamma observed distribution for continuous values with parameters k & theta
  - mean is u =  k\theta, variance is s = k \theta^2, so we can get
        \theta = s/u and k = u^2/s.
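
    The moment matching for the gamma item above, spelled out as a sketch:
    with mean u = k\theta and variance s = k\theta^2, dividing gives
    \theta = s/u, and substituting back gives k = u^2/s. The struct and
    function names are hypothetical.

```cpp
#include <cassert>

// Shape k and scale theta of a gamma distribution.
struct GammaParams {
   double k;
   double theta;
};

// Method-of-moments estimate from a sample mean and variance:
//   theta = variance / mean,   k = mean^2 / variance.
inline GammaParams gamma_from_moments(double mean, double variance)
{
   assert(mean > 0.0 && variance > 0.0);
   return GammaParams{mean * mean / variance, variance / mean};
}
```

    E.g., mean 6 and variance 12 give k = 3 and theta = 2.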


Wed Jul 23 12:25:13 2008
  - add a clique diversity measure. I.e., this is similar to the diversity pruning above, but also similar
    to how GMTK can print out clique entropy. Here, suppose we have a distance measure d(a,b) for clique
    entries a and b. Clique diversity would be 
          1/N^2 \sum_{a,a'} d(a,a') where a and a' sum over all clique entries. 
    This would cost N^2 so to speed this up, we would choose a random subset of size 
          N' ~=~ sqrt(N log N) to do the computation, which would give an N\log N algorithm
         (which will still probably be fast enough)
    This could be printed in addition or in place of the clique entropy.
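
    A sketch of the measure above, assuming clique entries are packed
    into a single 64-bit word and d(a,a') is Hamming distance on the
    packed bits. Both are assumptions made here; real packed values may
    span several words and the distance might be per-RV rather than
    per-bit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hamming distance between two packed entries.
inline unsigned hamming(std::uint64_t a, std::uint64_t b)
{
   std::uint64_t x = a ^ b;
   unsigned c = 0;
   while (x) { x &= x - 1; ++c; }
   return c;
}

// Mean pairwise distance 1/M^2 sum_{a,a'} d(a,a') over a random subset of
// M = min(subset_size, N) entries. Passing subset_size >= N gives the exact
// N^2 computation; a subset of size ~sqrt(N log N) costs about N log N.
inline double clique_diversity(const std::vector<std::uint64_t>& entries,
                               std::size_t subset_size, unsigned seed = 0)
{
   assert(!entries.empty() && subset_size > 0);
   std::vector<std::size_t> idx(entries.size());
   std::iota(idx.begin(), idx.end(), std::size_t(0));
   std::mt19937 rng(seed);
   std::shuffle(idx.begin(), idx.end(), rng);   // first M indices form the random subset

   const std::size_t m = std::min(subset_size, idx.size());
   double sum = 0.0;
   for (std::size_t i = 0; i < m; ++i)
      for (std::size_t j = 0; j < m; ++j)
         sum += hamming(entries[idx[i]], entries[idx[j]]);
   return sum / (static_cast<double>(m) * static_cast<double>(m));
}
```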

Tue Aug 12 18:19:30 2008
  - it may be possible to construct cliques at later times using earlier cliques. I.e., it is possible (and perhaps even
    probable) that most of the entries in the cliques at successive times are redundant (i.e., the same entry). While
    there is probably little innovation at each time step t, we can save memory (which we're already doing by
    the fact that we're using the global clique value pool) but we could also potentially save computation. Several ways:
       1) by clustering, using the idea that the previous cluster could be re-fined rather than reconstructed, and then
          used for pruning.
       2) By pruning, perhaps the previous clique's probabilities could be used for a score adjustment in some way
    Key thing is: what's the typical innovation at each time step?          

Wed Aug 13 13:23:00 2008
  - implement parallel backtracking search where different threads eliminate
    part of a clique.

Thu Aug 21 13:27:15 2008
  - take a look at "Incorporating Diversity in Active Learning with SVMs"
     in ICML-2003, by Brinker as a way to do submodular selection of points.
     Potentially relevant to diversity pruning.

Thu Sep 18 23:26:47 2008
  - add an environment variable to get cpp, say GMTK-CPP-COMMAND

Wed Oct 15 14:23:45 2008
  - look at getting a student-t distribution working as an observation distribution.
  - also, get a real Dirichlet prior thing working (where we integrate over the prior's parameters) useful for protein stuff.

Fri Oct 17 02:12:46 2008
  - add parametric semi-supervised learning to GMTK using the MP objective function. 

Wed Oct 22 15:47:02 2008
 - get Fisher kernel working with language models
 - get FK working with gamma distributions, and check to make sure working with gaussians. 

Thu Oct 23 06:19:49 2008
 - remove GMTK prefix from file names.

Thu Oct 23 06:24:51 2008
 - integrate rest of Simon King's tie support.

Fri Oct 24 02:14:14 2008
 - when some objects are shared that might not make sense, give a warning message.
   I.e., when a dpmf is shared between a discrete variable and a mixture Gaussian variable,
   or when a real matrix is shared between, say, a NN and a Gamma distribution, this might
   not be something the user intends to do.

Fri Oct 24 04:29:34 2008
  - for gradient descent based discriminative learning (and other discriminative learning methods),
    figure out how to best parallelize since otherwise training will be extremely slow.

Fri Oct 24 12:24:56 2008
  - also allow cpp command name to be on command line in addition to env variable.
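
    A sketch of how the cpp command could be resolved from the two sources
    above. Assumptions made here: the command line wins over the
    environment, the fallback is plain "cpp", and the variable is spelled
    GMTK_CPP_COMMAND with underscores, since POSIX shells only accept
    letters, digits and underscores in variable names. The function names
    are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Pick the preprocessor command: a non-empty command-line value first,
// then a non-empty environment value, then the default "cpp".
inline std::string choose_cpp_command(const char* cli_value, const char* env_value)
{
   if (cli_value && *cli_value) return cli_value;
   if (env_value && *env_value) return env_value;
   return "cpp";
}

// Same, reading the (hypothetical) GMTK_CPP_COMMAND environment variable.
// std::getenv returns a null pointer when the variable is not set.
inline std::string resolve_cpp_command(const char* cli_value)
{
   return choose_cpp_command(cli_value, std::getenv("GMTK_CPP_COMMAND"));
}
```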

Fri Oct 24 13:34:17 2008
  - give simple ability to produce posteriors of variables after
    for/back without needing to specify a hand-written trifile.

Sun Oct 26 15:38:11 2008 possible relevant paper on dynamic update
  clustering for diversity pruning.  N. J. Mitra, S. Floery,
  M. Ovsjanikov, N. Gelfand, L. Guibas, and H. Pottmann, "Dynamic
  geometry registration," in Proc. of the Eurographics Symposium on
  Geometry Processing, 2007.

Sat Nov  8 20:32:56 2008
  get beta observation distribution working, which gives us observation distributions
   over fixed ranges of continuous values. Should mimic the gamma observation distribution
   (i.e., diagonal independent beta distributions). 
   
Tue Dec  9 13:44:16 2008: easy
 - this is probably mentioned somewhere above, but having time-inhomogeneous CPTs, i.e.,
   CPTs similar to an MDCPT but where the CPT table comes from the observation file, should
   be added (in fact, see entry at Feb 22nd, 2004).

Tue Dec  9 13:58:01 2008
 - get accumulator/fisher kernel working with beta & gamma distributions.

Sun Dec 14 13:19:59 2008
 - in clique expansion code, have backing off beam, i.e., 
   if we hit a zero clique, then use a wider beam just for that clique.

Fri Dec 19 16:27:09 2008
 - another form of pruning: each clique entry
   can (when compiled that way) look in the global clique
   manager to see how frequently that entry has occurred so far,
   and base pruning on that relative count. I.e., very frequent
   improbable entries might be pruned but rare improbable 
   ones might be retained for a bit.
 - general idea of using the clique manager as a map rather than
   a set. The map can map to information that pruning can utilize.
   I.e., the map can map to 
       1) frequency of use, as in a count > 0.
       2) most recent use (e.g., frame number of max frame variable
          in the clique)
       3) most recent (relative) probability score and/or integer rank
       4) most recent pruning status (did it get pruned or not)
       5) keep track of shared/not-shared binary/{0,1} status. I.e., we might
          want to prune shared/non-shared entries differently.
          (this is a subset of 1).

Mon Dec 22 12:19:46 2008
 - some sort of switching pruning, i.e., when we expand a switching parent,
   and when it switches away other parents, rather than
   iterate over all of the other parent values, we just prune
   all but one value away.
   i.e., let's say that p(c|a,b,s) where c is child, a, b are
   parents, and s is switching parent, and when s=1, 
   it is such that b is irrelevant. Then we
   expand s first, and rather than expand all values of
   b (thereby creating a clique entry for each value of b), 
   we expand only one value of b.
   Q: but what if b's value determines other things that happen
   later in the clique?
   Q: maybe cluster pruning will handle this case (but it won't 
      handle the generation of the clique in the first place).
   Here is what should work. If we have a clique with c,a,s but
   not b, then for s=1, we should be able to expand a,c for s=1
   without needing to expand b.

   
Mon Dec 22 16:20:58 2008
 - look into loop unrolling and other such compiler opts
   for the packing stuff and inner inference recursions.

Tue Dec 23 00:18:45 2008
 - yet another idea for pruning, a little like wave propagation of 1/18/06 but
   not exactly the same. Here, during ce, we move forward expanding cliques and
   if we get a zero, we increase beam for a while in local clique and if it still
   fails, we go back one clique expand, and move forward. This perhaps could
   be called wave pruning but with expanding boundaries. I.e., an example
   of the clique numbers being expanded might be:
      1 2 3 4 5 6 6 6 5 6 6 6 5 6 6 6 5 6 6 6 4 5 6 6 6 5 6 6 6 5 6 6 6 5 6 6 6 4 5 6 6 6 ...
    (i.e., where we have a max number of expansions of 3).
  If this continues, eventually we would go back to the first clique and expand there.
 - compare this with just going back to the beginning and increasing beam width right
   at the start (which is more like what HTK does).

Wed Dec 24 13:43:43 2008
 - remove all exception handling code so we can compile with -fno-exceptions


Fri Dec 26 01:38:19 2008
 - figure out how to get diversity pruning to persist in future expansions (i.e.,
   we may diversity prune a clique, but then the next clique's expansion might
   just prune away all of the diversity that was previously maintained).
 - get the various clique pruning methods working in separator pruning.
 - get pruning expansion working for other sorts of pruning options.
     (both at the local clique level and at the segment level).
 - 

Sat Dec 27 20:38:48 2008
 - add Bayesian conjugate priors to the gamma/beta observation distributions.
   
Sun Dec 28 00:41:09 2008
 - for numeric beam arguments, rather than needing to specify 1e10 or whatever happens
   to be log zero, instead have a special value 'off' that is extensible.
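
    A sketch of parsing such an argument, assuming 'off' should map to
    +infinity (so that nothing ever falls outside the beam). The function
    name parse_beam is hypothetical and not part of the existing argument
    parser.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <limits>

// Parse a beam argument. Returns true and stores the value on success.
// The special value "off" maps to +infinity; anything else must be a
// complete floating point number.
inline bool parse_beam(const char* text, double* beam)
{
   assert(text && beam);
   if (std::strcmp(text, "off") == 0) {
      *beam = std::numeric_limits<double>::infinity();
      return true;
   }
   char* end = nullptr;
   const double v = std::strtod(text, &end);
   if (end == text || *end != '\0') return false;   // empty or trailing garbage
   *beam = v;
   return true;
}
```

    Further keywords could be added next to "off" without touching the
    numeric path.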

Sun Dec 28 00:44:18 2008
 -cpbeam style pruning works on C partitions, but not on P and E partitions,
  which is bad since E partition can be big. Figure out a solution.
   a) use right interface??

Fri Jan  2 20:54:08 2009
 - for packed clique value distance, user can supply the weight of a rv to affect the
   distance without causing a performance hit. I.e., each rv can have a 
   non-negative weight which, rather than the log cardinality, indicates the importance
   of that rv when the values are not equal.

Sat Jan  3 19:07:36 2009
  - have special projection-only cliques. I.e., cliques that are used only
    after forward/backward, and when a parent clique is complete, we will project
    down to the smaller clique.
    Useful for:
        - fast training with large D continuous vectors
        - a non slow-down when needing to use clique posteriors.

Tue Jan 13 10:09:43 2009
  - island algorithm's islands should keep only the smallest set of
    rv's that render left and right independent. Therefore, could keep
    only the separator between two partitions rather than the entire
    partition. This could reduce memory.
  - Option during various inference styles to free/delete the
    separator tables since it is not expensive to recreate them. The
    expensive part is creating the cliques, and they are the things
    that are most important to preserve.

  - to save more memory, the separator AI and REM could all be sorted
    after creation and then a binary search used to find an entry
    rather than a hash table. This could probably be a compile
    option, although if we delete the separators as we go along,
    this wouldn't be necessary (it might be faster just to delete and
    then re-create them). Also, the sorting will need a multi-word
    comparison operator on packed items - perhaps the packer
    can provide a comparison operator.
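
    A sketch of the sorted-table alternative above, assuming a packed
    value that fits in two 32-bit words. PackedKey, SepEntry and the
    helper names are hypothetical stand-ins for the real separator
    structures.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical two-word packed separator value and its table entry.
using PackedKey = std::array<std::uint32_t, 2>;
struct SepEntry {
   PackedKey key;
   double log_prob;
};

// Multi-word comparison: lexicographic over the packed words.
inline bool key_less(const PackedKey& a, const PackedKey& b)
{
   return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
}

inline bool entry_less(const SepEntry& a, const SepEntry& b)
{
   return key_less(a.key, b.key);
}

// Sort once after the separator has been created.
inline void sort_entries(std::vector<SepEntry>& table)
{
   std::sort(table.begin(), table.end(), entry_less);
}

// Binary search in place of a hash lookup; returns nullptr on a miss.
inline const SepEntry* find_entry(const std::vector<SepEntry>& table, const PackedKey& key)
{
   assert(std::is_sorted(table.begin(), table.end(), entry_less));
   auto it = std::lower_bound(table.begin(), table.end(), key,
                              [](const SepEntry& e, const PackedKey& k) { return key_less(e.key, k); });
   return (it != table.end() && it->key == key) ? &*it : nullptr;
}
```

    Lookups become O(log N) instead of expected O(1), in exchange for
    dropping the hash table's bucket storage.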

Wed Jan 14 19:06:40 2009
  - right now, a partition's RI separator is kept in the neighboring partition
    to the right. It may be useful to free-up the separator storage after 
    it has been gathered from and re-create it later (since it is relatively quick to create),
    and it might be easier to do if a partition's RI separator is kept within the
    partition itself.
     - the problem with this is that when we create the separators for gathering into a clique,
       it is very useful to have them all.

Fri Jan 16 21:54:23 2009
  - in the observation code (observation matrix), add another option that is like -startSkip but what it does
    is that it takes the first N real frames, computes mean/variance, and adds dummy noise
    to the beginning of each observation file. That way we can have dlinks into the past without
    the need to modify the observation file at all, and will help with edge effects.
   - have a bunch of different distributions for this, i.e., have
     Gaussian, uniform, constant values as well as duplicated first
     value, and mirrored values.
   
Sat Jan 17 01:39:21 2009
  - get tied parameter Gamma and Beta observations working
  - get conjugate priors for Gamma & Betas working
  - allow linear conditional continuous dependencies for Gamma and Beta distributions.


Wed Jan 21 23:54:15 2009
  - when constructing a clique, we have an estimate of the max value. We also
    have, at any given time, a set of clique values that have already been inserted.
     we prune if:
          cur_score + cont_score < est_max_value - threshold

     1) if any of the clique values are greater than the estimated max
        value, we should update the max value (it is possible for this
        to happen not because of the continuation scores, but because
        of prediction error). This gives us an updated est_max_value
        that is more accurate, and this is easy since we're already
        estimating the max as we construct the clique.

     2) Most of the clique entries will probably be below the max value.

              cur_mn          (est_mx-thr)                cur_mx       est_mx
                 

        heuristic: if cur_mx has not increased for a while (some
                   threshold count, perhaps a percentage of the
                   previous instance's state space), treat this as an
                   estimate of the real max - we might make some
                   pruning decisions we otherwise would not have
                   made. Of course, if we find a max that is greater,
                   then we bump up the cur_mx as in step 1.
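
    A sketch of the pruning test and of step 1 only (raising the estimate
    when an inserted value exceeds it), with all scores in the log domain.
    The stale-count heuristic of step 2 is left out. The class and member
    names are hypothetical.

```cpp
#include <cassert>

// Tracks the estimated max clique value while a clique is being built.
class CliqueMaxEstimate {
public:
   explicit CliqueMaxEstimate(double initial_estimate) : est_max_(initial_estimate) {}

   // The rule from the note: prune if
   //   cur_score + cont_score < est_max_value - threshold.
   bool should_prune(double cur_score, double cont_score, double threshold) const {
      assert(threshold >= 0.0);
      return cur_score + cont_score < est_max_ - threshold;
   }

   // Step 1: an inserted clique value above the estimate means the
   // prediction was too low, so the estimate is raised to that value.
   void inserted(double clique_value) {
      if (clique_value > est_max_) est_max_ = clique_value;
   }

   double est_max() const { return est_max_; }

private:
   double est_max_;
};
```

    With an initial estimate of -10 and a threshold of 15, a partial
    score of -20 survives; once an entry of -2 has been inserted, the
    same score is pruned.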

   - another idea: in speech (at least) there is spatial locality. We could use frame
     t to estimate better continuation scores and use that estimate for frame (t+1)
     rather than using the most optimistic estimate (which is what we're doing now).
     I.e., we'd keep, for each rv, a prev_frame_max_val[rv] and a cur_frame_max_val,
     then use that during the expansion.
     **** In fact, we could use multiple linear predictors to predict
          the cur_frame_max_val based on a set of
          previous_frame_max_values and use that to predict not only
          the clique max value but also the continuation heuristic as
          well!! This should be an optional thing, as the added
          expense of doing the prediction might be a little high (but
          then maybe not, compared to large state space).
      Add a -predictContinuationScores {fixed | lms,i,j | rls,i,j } option.
      We should not bother predicting those scores that are fixed (i.e., those rvs that
      are based on a time-homogeneous markov chain and/or that have observed values).
      In fact, this might really work only for continuous data and/or speech.
   
  - is there a benefit to not having an E? I.e., a problem with E is that it is
    hard to predict. Perhaps this would be another advantage of an outside-inside
    procedure, since as we are moving in from both sides, we'll have good prediction
    of the inner partitions (and will have expanded with, presumably, a reduced
    state space at the E side).

Fri Jan 23 23:37:04 2009
  - perhaps a crazy idea, but we could use a corpus of speech data to learn
    a switching AR_HMM on the max-scores for a clique and/or for random variables.
    The idea being that it would not be an adaptive filter, but rather
    a model that learns the distribution of max-clique scores on a corpus of data.
    That model could then be included in a "production" ASR system and be used to
    predict next-frame scores (like a Kalman-filter) and used to improve pruning.
  - perhaps this could be combined with an adaptive kalman-filter to adapt to the
    local observation???


Sun Jan 25 00:09:02 2009
  - it is beginning (or has been for a while) to become the case that it is not really valid
    to run ascii files through cpp, since cpp makes C/C++ specific assumptions about
    the tokens. For example, you can not really concatenate things like 
          dt/ 
    and
          name
    in order to get:
          dt/name
    as you get an error message saying that 
         error: pasting "/" and "name" does not give a valid preprocessing token

    The solution for now is to back off to gcc-2.95.3's cpp as that is the latest one that
    doesn't conform to the latest C/C++ conventions. A longer term solution is to implement
    GMTK's own macro expansion processor.

Wed Jan 28 10:31:35 2009
  - observation code needs to be able to work better with island algorithm (i.e., we should be able
    to not load the entire observation into memory at the same time if so desired). This will
    take a bit of reworking as right now some parts of the code use a base_offset + stride*frame_num
    to get the current frame_num base position.

Wed Jan 28 11:31:43 2009
  - for docs, emails from today and yesterday to zafer

Thu Jan 29 16:31:42 2009
  - take mean or median of the three predictors (fixed, rls, lms) for M_t.
     (can produce a new adaptive filter class that combines the 3 existing adaptive filter
      classes).
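
    A sketch of the combining step above, assuming each of the three
    predictors has already produced a scalar prediction for M_t. The
    function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Median of three predictions (e.g., fixed, RLS and LMS): robust to one
// predictor being far off.
inline double median3(double a, double b, double c)
{
   assert(!std::isnan(a) && !std::isnan(b) && !std::isnan(c));
   return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// Mean of the same three predictions, for comparison.
inline double mean3(double a, double b, double c)
{
   return (a + b + c) / 3.0;
}
```

    A combining filter class would call the three underlying filters,
    feed their outputs through one of these, and pass each update through
    to all three.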


Fri Jan 30 15:02:04 2009
  - on diversity pruning, once we have a cluster, is there a way to choose a summary of the
    cluster, rather than only representatives from that cluster?
  - when choosing cluster centers, break ties by choosing high probability values.
       i.e., the issue is that suppose that we choose the most different state
       w.r.t. probability. We want to avoid the case that we have all of the
       high prob. states in one cluster.


Thu Feb 12 14:57:59 2009
  - for docs: change message about FP exception to include useful information
      - Gaussians getting small, use reg parameter, increase variance floor, vanish more aggressively, etc.
        (FP errors can be due to a variety of things, describe them in the FP exception message rather than
         just aborting).
        Oh, parameter initialization if doing EM, etc.
        Perhaps have a global state variable that says what is currently happening that the exception
        handler can check to use to determine what is going on.


Sat Feb 28 00:01:44 2009
 - for docs, email to zafer today, giving an exercise about cholesky and UDU,
   make this an exercise to understand GMTK parameters.
   (see a few more emails in gmtk directory about this).

Sat Feb 28 00:17:48 2009
  - simple single compile time option that does all internal numerical math in 
    double  vs. single precision (a one-stop switch for changing this at compile time).
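    One possible shape for such a switch, as a sketch only. The macro
    GMTK_USE_DOUBLE and the type name gmtk_real are hypothetical; the
    idea is that all internal math would be declared with the one typedef:

```cpp
// Hypothetical one-stop precision switch: compile with -DGMTK_USE_DOUBLE
// to do all internal numerical math in double precision.
#ifdef GMTK_USE_DOUBLE
typedef double gmtk_real;
#else
typedef float gmtk_real;
#endif

// Example of code written against the typedef rather than float/double.
inline gmtk_real scaleValue(gmtk_real x, gmtk_real factor) {
  return x * factor;
}
```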

      
Sat Feb 28 03:44:45 2009
  - STL set optimizations
  - for inserting new elements into a set, do:
        copy(cur_set.begin(),cur_set.end(),
             inserter(allrvs,allrvs.end()));
  - for erasing one set from another, do:
         s.erase(start_iter,end_iter)
        where both iterators must point into s.
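    A self-contained sketch of the two idioms above, using std::set<int>
    for illustration. The helper names addAll and eraseKeyRange are
    hypothetical, and the erase helper assumes lo <= hi so that the
    iterator range is valid:

```cpp
#include <algorithm>
#include <iterator>
#include <set>

// Insert every element of cur_set into allrvs, passing end() as the
// insertion hint (cheap when elements tend to land near the end).
inline void addAll(const std::set<int>& cur_set, std::set<int>& allrvs) {
  std::copy(cur_set.begin(), cur_set.end(),
            std::inserter(allrvs, allrvs.end()));
}

// Erase all keys in the half-open range [lo, hi) from s. Both iterators
// passed to erase() come from s itself, as required. Assumes lo <= hi.
inline void eraseKeyRange(std::set<int>& s, int lo, int hi) {
  s.erase(s.lower_bound(lo), s.lower_bound(hi));
}
```

    Note that the range form of erase only removes a contiguous (in sort
    order) run of s; removing an arbitrary second set would still need
    one erase per key or std::set_difference.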

Sat Mar  7 20:09:08 2009
  - have an option so that only the partition separator memory is
    saved when doing a forward pass, rather than the partition
    separator and all other cliques/separators. For forward-inference,
    this is like an even lower-memory version of probE, except that
    all separators are saved. For EM training/Viterbi, we'll need to
    re-generate the forward pass in each partition again on each
    backwards pass, starting from the saved separators. For island
    inference, the islands that are saved should only be the
    separators, not the entire partition.
  - when doing this, do not insert the clique values into the
    globally shared clique pool (i.e., only use the private pool).
  - have the same option work for viterbi decoding.
  - For the island algorithm, will also need some sort of 'save state'
    and 'restore state' for the origin of the clique since there might
    be certain things (such as the LP prediction code, etc.) that
    are based on going left-to-right in the inference.
  - have a version of gather_into_root that rather than storing things
    at the clique level, immediately projects down to the outgoing
    separator (so it never needs to store the clique values even locally).


Tue Mar 10 00:50:55 2009
  - for E partition, we might want to use a different variable
    order, say where we try to reach the special end-of-utterance
    observation first rather than the other observations, so that
    we can observe the fail first principle. Possibly
    have a separate -vcap option for E partition.
  - in general, have a different order style/options for P, C, and E.


Thu Mar 12 10:57:01 2009
  - have parametric discrete observation distributions.
  - E.g., truncated geometric distributions for integer observations
          truncated negative binomial distributions for integer observations
    Perhaps ML estimates of the truncated forms are the same
     as the truncated parametric length distributions above
      (e.g., truncated Poisson distributions).

Tue Apr 28 12:35:18 2009
  - when the strings are very long (bio, conversational, etc) rather than doing forward/backward complete,
    it seems like we should have a bunch of limited extent messages going from left to right and doing
    that a number of times.
    I.e., rather than doing for i = 1 ... N, left_to_right(i,i+1) and then doing for i = N ... 1, doing
    right_to_left(i+1,i), do something like:
       for all i in parallel
            send messages a few stages to the right and/or left (randomly)
    - This would still be O(N), so would it be faster?
    - can we show convergence in some cases?

Tue Apr 28 12:37:35 2009
  - for debugging message, when we say message C',part[N], clique ...
    include the total length, i.e., "message C',part[N of M], clique" where M is total length.


Sun Jun 21 22:54:04 2009
  - this is probably mentioned above, but have conditional exponential-based local CPTs.
      p(c|parents) = 1/Z(parents)exp(\lambda f) where f is a vector function
      of indicators. this is like a local CRF for a CPT. 
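      A sketch of evaluating such a CPT for one fixed parent
      configuration, assuming feats[c] holds the feature vector
      f(c, parents) for each child value c. The function name expCpt and
      the data layout are hypothetical; the maximum score is subtracted
      before exponentiating only for numerical stability:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Returns p(c | parents) = exp(lambda . f(c, parents)) / Z(parents)
// for every child value c, for one fixed parent configuration.
// Assumes every feats[c] has the same length as lambda.
inline std::vector<double> expCpt(
    const std::vector<double>& lambda,
    const std::vector<std::vector<double> >& feats) {
  std::vector<double> p(feats.size());
  double mx = -std::numeric_limits<double>::infinity();
  for (std::size_t c = 0; c < feats.size(); ++c) {
    double score = 0.0;
    for (std::size_t k = 0; k < lambda.size(); ++k)
      score += lambda[k] * feats[c][k];
    p[c] = score;
    mx = std::max(mx, score);
  }
  double Z = 0.0;  // local partition function Z(parents), up to exp(mx)
  for (std::size_t c = 0; c < p.size(); ++c) {
    p[c] = std::exp(p[c] - mx);
    Z += p[c];
  }
  for (std::size_t c = 0; c < p.size(); ++c) p[c] /= Z;
  return p;
}
```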
       
Tue Jun 23 12:12:51 2009
  - give sarray to have its own allocator (or do a subclass) and in
    such case, it always allocates in chunks of some number of units
    of the main class.

Tue Jun 23 12:13:26 2009
  - symmetric MDCPT - i.e., an MDCPT that is constrained to be symmetric.
  - in 2d case, it always says (and trained with) p(i|j) = p(j|i)
  - in 3d case it says p(i|j,k) = p(j|i,k) = p(k|i,j)
  - in nd case it says that p(i_k | i_{\neg k}) = p(i_j | i_{\neg j}) for all k \neq j
  - clearly, only works for cubic distributions.
  - one way to do this is, when given a set of values i,j,k, etc.,
    sort them first (so i<=j<=k) and use that to look up the probability.
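    A sketch of the sort-then-look-up idea. The class SymmetricTable is
    hypothetical and uses a std::map keyed on the sorted index tuple
    purely for clarity; it shows only the lookup, not how such a table
    would be normalized or trained:

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Hypothetical symmetric table: any permutation of the same index
// values maps to the same stored entry.
class SymmetricTable {
 public:
  // Store a value under the sorted form of idx.
  void set(std::vector<int> idx, double value) {
    std::sort(idx.begin(), idx.end());
    table_[idx] = value;
  }
  // Look up using the sorted form of idx; returns 0.0 if absent.
  double get(std::vector<int> idx) const {
    std::sort(idx.begin(), idx.end());
    std::map<std::vector<int>, double>::const_iterator it = table_.find(idx);
    return it == table_.end() ? 0.0 : it->second;
  }
 private:
  std::map<std::vector<int>, double> table_;
};
```

    Only one entry per sorted tuple is stored, so the table holds
    roughly 1/n! of the entries of the full n-dimensional cube.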


Tue Jun 23 15:29:08 2009
  Fix island so that there is no (zero, nada) linear dependence: Right now
  it still does w.r.t. 1) the observation sequence, 2) the stored
  Viterbi values, and 3) the shared clique value pool. 

  In this last case, it is not really linear in the length, but
  instead holds the clique values for all partitions even if they
  don't need to be held at all times. Getting this last case to work
  will be a bit tricky. One way would be to allow the clique value
  pool values to be removed, but this will almost surely slow down the
  data structure (which is bad of course). Another option is to have
  local pools (kind of like what we currently have for each frame)
  and then only insert the values into the true global pool for the
  islands -- this is probably the best idea.

Wed Jun 24 02:34:36 2009
 - update makefile with Arthur Cantor's makefile mods for intel compiler.

Fri Jun 26 11:09:06 2009
 - many objects still allocate memory inefficiently (i.e., sometimes
   we want only blocks of size k of objects). Go through and make
   special allocators for each of these objects to save memory.

 - for debugging output, have a filter that only prints out (curframes
   mod K == 0) where K is some integer.
 - 

Fri Jun 26 23:47:27 2009
  - have a bunch of perl scripts consolidated to parse GMTK output
    (e.g., parsing new GMTK Viterbi output into perl/python arrays).
  - 

Sat Jul  4 01:11:42 2009
  - allow symbol names to be used everywhere, including
      a) when reading in observations
      b) when specifying cpts and sparse cpts
      c) in DT formulas (need to figure out how to do this since
         right now DTs are specified unassociated from where they'll 
         ultimately be used. Perhaps one way to do this is to
         allow a DT to allow a symbol table to be used for
         each random variable (all the parents and the child)).
  - perhaps a way to do this with observations is to have a
    file format for symbolic observation names, and then they
    are translated into values as needed as the observation becomes
    associated with a CPT (so this means that the obs would mean
    something different depending on the context it is used in).
      - the same could be done with DT formulas, i.e., constants
        could be symbolic tokens rather than integers (this would
        preclude optimization from occurring, but this is ok).

     
Fri Jul 10 15:07:00 2009
  - dynamic pruning algorithm similar to the TCP congestion algorithm?
    i.e., additive increase and multiplicative decrease.
   - for pruning, when things are going well (i.e., we are
     nowhere close to getting zero cliques), we linearly decrease
     the beam widths. Once we hit a zero, we backtrack to
     previous frames exponentially increasing it as we go.
     Rationale for this: the goal is to find a good hyp at the end (or
       many of them if we're doing k-best). 
  - need to get frame-backtracking beams working in the first place
     (one simple option is to immediately backtrack to the beginning
      of the segment, which is what HTK does, but perhaps this
      is overkill and would waste potential effort).
   - basically, we need a 'back and forth' algorithm, or a 'wave'
     algorithm that moves back and forth and has the chance ultimately
     to backtrack to the very beginning (a la HTK) but can also do
     something much better (by avoiding backtracking to the
     beginning).
   - it's probably better to get the new message code working first,
     so that each forward message can be sub-classed by a separate
     algorithm (and also the various memory options for this as well,
     such as to eliminate memory for the clique expansion and save
     only memory for the separators within each partition).
   - steps include: as a function of what we have already done
       1) how far (how many partitions) to backtrack to
       2) how much to increase the beam (function of each backtrack position).
           2a) and each different beam might need to change very differently.
       3) show approximation guarantees. 
   - how does this interact with the island algorithm? 
       - during forward pass, no problem since island first does forward
         along the way.
   - Two phase algorithm:
        1) phase 1: a discover phase, where we use gmtkJT like low-memory inference
           to explore the space of pruning/backtracking schedules that work.
           "work" should be more than completing the decoding with non-zero,
           since there could be search errors not just model errors.
        2) second phase, given we've found a good one, a decoding phase that uses
           the pruning schedule found to do a final decode.
   - 
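   - A sketch of just the beam-update rule described in this entry:
     shrink the beam additively while frames succeed, and widen it
     multiplicatively when a zero clique is hit. The struct
     BeamSchedule, its field names, and the clamping to
     [minBeam, maxBeam] are all assumptions for illustration; how far to
     backtrack is not addressed here:

```cpp
#include <algorithm>

// Hypothetical beam controller in the spirit of TCP congestion control,
// applied to beam width (wider beam = less pruning).
struct BeamSchedule {
  double beam;     // current beam width
  double minBeam;  // never prune harder than this
  double maxBeam;  // never widen beyond this
  double step;     // additive decrease per successful frame
  double growth;   // multiplicative increase factor (> 1) on failure

  // Things are going well: tighten the beam linearly.
  void onSuccess() { beam = std::max(minBeam, beam - step); }

  // A zero clique was hit: widen the beam multiplicatively; repeated
  // calls while backtracking give exponential growth.
  void onZeroClique() { beam = std::min(maxBeam, beam * growth); }
};
```

     Each separate beam (cbeam, ckbeam, etc.) would presumably need its
     own instance, since per 2a) above they may need to change very
     differently.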


Mon Jul 13 00:51:31 2009
  - on software and what happens when there are so many options.
    http://www.joelonsoftware.com/uibook/chapters/fog0000000059.html

Wed Jul 15 13:34:09 2009
  - use Brian Lucena's MCS algorithm to find a lower bound on tree-width
    (his SIAM 2003 paper, "a new lower bound for tree-width using MCS").
    given this lower bound once (and if) the triangulation heuristics have found
    a triangulation with this clique size (since they are all upper bounds)
    we can stop triangulating since we've found the optimal in these cases.

Fri Sep 18 14:37:23 2009
  - in rngdecisiontree, add back in code that allows/special-cases not
    only single integers to be fast (i.e., not use the expression parser)
    but also single variables (e.g., parent values, such as p0, p1, etc).
 

Fri Sep 25 13:38:17 2009
 - another form of diversity pruning - at each clique expansion,
   do a local backtrace a few frames (i.e., from t to t-\tau) and
   then for each clique entry, develop a diversity measure that
   is a function of that back trace path (could be the best path).
   This would be a form of hash function, where two clique entries
   are similar if they have a very similar history, and are different
   if they are very different. It could also combine the score
   of the backtrace as well. 


Sat Oct  3 15:04:38 2009
 - re-write the flex/lex tokenizer to not any longer be dependent on
   lex since that is broken on some platforms (e.g., on the
   mac with gdb and emacs).


Mon Oct 19 23:36:03 2009
 - new pruning options:

   1) do initial pass and compute the posteriors. Add those posteriors
      (to the appropriate variable) on the 2nd pass.

   2) summary pruning: Expand the clique, then contract down to a
      particular subset of variables, prune on that subset (using any
      of the standard pruning options, including cbeam, ckbeam,
      diversity, etc.) and then prune the expanded clique based on the
      pruned subset.  (need a way to specify a set of rvs and offsets
      on the command line, have a general parser for
      "foo(-3),bar(2),baz(+2),foo(0)") Need a way to take a clique,
      project it down to a sub-clique, do the pruning there, and then
      re-expand. We could either: 

          1) go through original clique, and hash-lookup the
             sub-clique and if not found, then remove the entry.

          2) could go through subclique for each surviving value and
             then do
             

Wed Oct 28 15:00:24 2009
  - in rngdecisiontree, add a special value that computes the lexicographic entry
    to collapse all the parent values down to a unique integer. I.e.,
    something like p1*const1 + p2*const2 + ... + pn*constn. This can be done
    using a DT formula, but this is very slow; much better to have a variable that
    computes that for you. 
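    A sketch of that computation, assuming the constants are the
    products of the cardinalities of the later parents (a mixed-radix
    index). The function name lexIndex is hypothetical, and there is no
    overflow or range checking:

```cpp
#include <cstddef>
#include <vector>

// Collapse parent values into one integer, lexicographically.
// Requires vals[i] < cards[i] for uniqueness. Evaluated in Horner
// form, which equals sum_i vals[i] * prod_{j>i} cards[j].
inline unsigned long lexIndex(const std::vector<unsigned>& vals,
                              const std::vector<unsigned>& cards) {
  unsigned long idx = 0;
  for (std::size_t i = 0; i < vals.size(); ++i)
    idx = idx * cards[i] + vals[i];
  return idx;
}
```

    For example, values (1,2,0) with cardinalities (2,3,4) give
    1*12 + 2*4 + 0 = 20.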


Tue Nov 17 17:27:35 2009

   - another particle filter based idea: at time t, we keep the same
     clique values that existed at time t-1 and then rescore them.  We
     do this rather than re-expanding the clique at time t and then
     pruning them down (which is a waste).

     Every so often, we will do a full regeneration of time t, based on
     what happened at time t-1.

     Also, while at time t we might keep the same values and rescore 
     what happened at time t-1, we might also do a partial regeneration 
     at time t.
 
     So the general algorithm would be:

     at time t, do one or more of the following steps:
        - keep some (or all) of the clique values that were at time t-1
        - rescore all of the clique values at time t.
        - regenerate some or all of the remaining values at time t
        - do one of the forms of pruning at time t.


Fri Dec  4 16:41:01 2009
  - for diversity pruning, allow user to enter a square matrix object
    that gives the distance between any pair of random variables.
  - have a bunch of simple matrices built in (i.e., equals, unequals,
    absolute value of difference, etc.).


Fri Feb  5 17:25:02 2010
  - for island algorithm: should have an option to set base = sqrt(T) for whatever
    the current length is.    

  - need to add some sort of sparsity encouraging prior to the CPT learning
    (is posterior weight of > 1 enough to do this???).

   
Mon Mar 15 14:59:17 2010
  - add a reverse option, to run the model in reverse (i.e., do beta first and then
    alpha, this should be easy to do, but maybe not very fast).


Thu Apr  8 17:22:41 2010
  in addition to InferenceMaxClique::cliqueEntropy()   
   also add a compute clique variance that does so in the real space (good for Gaussian variability in cliques).


Wed Apr 21 20:23:00 2010
  for DOCS, add 'cpp -std=c89' tip to get good working cpp for GMTK.
   See email from Ajit from today.
   
Sun May  9 00:40:25 2010
  Add an option to gmtkTriangulate that duplicates one (or more)
  copies of C into P' and/or into E'. This would allow easy
  specification of different P and/or E without running into the
  interface constraint P|CE, P|CCE, PC|CE.

Fri May 14 11:20:15 2010
  To logp.{cc,h}, add high level operations that are optimized, such
  as summing a bunch of logp objects (that first compute the max and only
  sum the ones that are in range of the max).
  Pipeline the max computation as well (compile this with unroll, etc.).
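  A sketch of the max-then-sum idea on plain doubles rather than logp
  objects. It assumes natural-log values; the function name logSum and
  the cutoff minDiff (terms more than |minDiff| below the max are
  skipped) are hypothetical, and no unrolling or pipelining is shown:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Returns log(sum_i exp(logs[i])), skipping terms that are too far
// below the maximum to affect the result.
inline double logSum(const std::vector<double>& logs,
                     double minDiff = -40.0) {
  const double negInf = -std::numeric_limits<double>::infinity();
  if (logs.empty()) return negInf;
  // First pass: find the max.
  double mx = logs[0];
  for (std::size_t i = 1; i < logs.size(); ++i) mx = std::max(mx, logs[i]);
  if (mx == negInf) return negInf;  // all terms are log(0)
  // Second pass: sum only the terms within range of the max.
  double sum = 0.0;
  for (std::size_t i = 0; i < logs.size(); ++i) {
    const double d = logs[i] - mx;
    if (d >= minDiff) sum += std::exp(d);
  }
  return mx + std::log(sum);
}
```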

Mon Jun 28 11:31:34 2010
  - as a criterion for finding a good interface separator factorization
    (for the combined BK algorithm with interface separator), perhaps 
    a good interface separator would be one that can factor into a set of
    factors that does not break apart any original factor. I.e.,
    if the factors are f(a,b)f(b,c)f(c,d) and if ac was a separator,
    factoring this into g(a)g(c) might be good since it breaks 
    apart no factor. relate this to submodularity.


Wed Jul 21 13:35:00 2010
  - add an option to -copyNFramesOfCIntoP N and -copyNFramesOfCIntoE N 
    which will help to reduce the interface error messages when they
    occur of the form "P|CE, P|CCE, PC|CE". This will need to be
    an option to gmtkTriangulate (but hopefully won't modify the trifile 
    format).

Mon Aug 23 22:07:53 2010
  - for DOCS: in a latticeCPT, what happens if the number of observations runs out
    before we get to the end of the lattice? In such case, it is still
    possible to get a non-zero score, even though we have not fully decoded
    the lattice. In such case, an extra virtual child (=1) needs to be
    included in the epilogue that insists that we make a transition out of
    the last node of the current lattice. For iterable lattices, this
    CPT will also need to be iterable.
          - include a picture of this in the docs.

Tue Aug 24 15:50:38 2010
  - parallel, thread, thread safe
    - take a look at use of strtok in various places. This	
      can't be done in inference code, but it should be ok to do it during
      reading of CPTS (such as lattices).

Tue Aug 24 20:59:55 2010
  - make lattice CPTs trainable. I.e., the lattice node CPT would be an easy
    way to specify a sparse cpt over pair variables P(N_t |N_{t-1}). We could
    easily make them trainable if the scores were simple, i.e., the
    score could just be, say, the LM score.
    We would then need to be able to write the CPT back out to disk.
    Iterable lattice CPTs would not work.

Tue Oct 19 15:08:22 2010
  - When T is just too large, have an option to print out viterbi values backwards
    but that never stores anything of length O(T).
    This will also need the observation code not to load in all O(T) of the observations
    (i.e., the observation code needs to include an interface that the island algorithm
     can use).

Wed Nov 24 15:58:24 2010
  - have a compile time option that turns off all bounds checking. I.e.,
    things like checking the output of a DT/deterministic mapping, array bounds
    checking in DT evaluations, and any other bounds checking. This
    compiler flag should interact with the NDEBUG compile time macro. I.e.,
    there should be a one-stop option to turn off all bounds checking
    and assertions at compile time. Or perhaps there could be separate
    macros for different ones, and then a one-stop macro definition that
    turns them all off.
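    A sketch of how the macros might be layered. All macro names
    (GMTK_NO_CHECKS, GMTK_NO_BOUNDS_CHECK, GMTK_BOUNDS_CHECK) and the
    helper checkedGet are hypothetical. The master switch has to be seen
    before <cassert> is included for the NDEBUG part to take effect:

```cpp
// Hypothetical one-stop switch: -DGMTK_NO_CHECKS turns off both
// assertions (via NDEBUG) and explicit bounds checks.
#ifdef GMTK_NO_CHECKS
#  ifndef NDEBUG
#    define NDEBUG 1
#  endif
#  ifndef GMTK_NO_BOUNDS_CHECK
#    define GMTK_NO_BOUNDS_CHECK 1
#  endif
#endif

#include <cstdio>
#include <cstdlib>

// Separate macro for bounds checks, so it can also be disabled alone.
#ifdef GMTK_NO_BOUNDS_CHECK
#  define GMTK_BOUNDS_CHECK(i, n) ((void)0)
#else
#  define GMTK_BOUNDS_CHECK(i, n)                                   \
     do {                                                           \
       if (!((i) < (n))) {                                          \
         std::fprintf(stderr, "index out of range at %s:%d\n",      \
                      __FILE__, __LINE__);                          \
         std::abort();                                              \
       }                                                            \
     } while (0)
#endif

// Example use: array read that is checked unless checks are compiled out.
inline int checkedGet(const int* a, unsigned n, unsigned i) {
  GMTK_BOUNDS_CHECK(i, n);
  return a[i];
}
```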

Tue Nov 16 23:50:10 2010
   - Regarding the DT entry from Wed Apr 20 00:46:19 2005, instead of
     having all of these functions, have instead user-defined DT
     mappers (i.e., there'd be one extra file called something like
     GMTK_UserDTs.cc which by default is a stub, but has a few
     commented examples of how to define and then register named
     decision trees that become available as would a DT that is
     defined in a master file). In fact, this could be the way to add a
     number of internal DTs that I've been wanting to add and that are
     often used (such as copy parent if other parent is non-zero,
     etc). This would also allow loops, etc. (and would go along with
     the source release we're going to do soon). The overhead would
     still be calling a function via a pointer-to-function but that's
     still much better than what we've currently got ...
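     A sketch of what such a stub file might contain. Everything here is
     hypothetical (the UserDT signature, registerUserDT, lookupUserDT,
     and the example mapper); it only illustrates registering named
     mappers that are then called via a pointer-to-function. The "else
     0" branch of the example is an assumption:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical signature for a user-defined DT mapper: parent values in,
// one integer out.
typedef unsigned (*UserDT)(const std::vector<unsigned>& parents);

// Name -> function registry (function-local static avoids init-order issues).
inline std::map<std::string, UserDT>& userDTRegistry() {
  static std::map<std::string, UserDT> registry;
  return registry;
}

// Register a named mapper; returns false if the name is already taken.
inline bool registerUserDT(const std::string& name, UserDT fn) {
  return userDTRegistry().insert(std::make_pair(name, fn)).second;
}

// Look up a mapper by name; returns NULL if not registered.
inline UserDT lookupUserDT(const std::string& name) {
  std::map<std::string, UserDT>::const_iterator it =
      userDTRegistry().find(name);
  return it == userDTRegistry().end() ? NULL : it->second;
}

// Example mapper: copy parent 0 if parent 1 is non-zero, else return 0.
inline unsigned copyIfOtherNonZero(const std::vector<unsigned>& p) {
  return p[1] != 0 ? p[0] : 0;
}
```

     Since these are ordinary C++ functions, loops and arbitrary logic
     come for free, which is the main attraction over DT formulas.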