Parameters of language modeling tools

Num. Name Last modification date
1. cache 2005.08.24
2. calc_perplexity 2006.04.26
3. cluster 2003.09.10
4. decay 2005.08.26
5. dl 2006.04.26
6. evallm_1 2002.12.02
7. gram2gramcl 2003.09.03
8. idg2ids_idg 2006.03.02
9. idg2ids_idgs 2005.11.15
10. idg2rev_idg 2004.11.18
11. idngram2lm 2005.11.24
12. initClasses 2004.01.27
13. intCache 2004.07.07
14. interpolateEM 2005.08.23
15. kn1 2006.04.26
16. lmManager 2006.06.06
17. mergeGrams 2004.02.05
18. rescorer 2006.06.06
19. skaldymas 2006.03.07
20. skaldymas_class 2003.12.02
21. text2idngram 2005.11.24
22. text2idngram_s 2006.11.24
23. text2wfreq 2004.05.13
24. topicCluster 2004.08.23
25. vocabFromTrigram 2004.08.23
26. wfreq2id_wc 2004.03.30
27. wfreq2idsg 2006.10.09
28. wfreq2vocab 2004.03.30
29. zodynasClass 2003.04.28

1. cache

Last modification date: 2005.08.24

Info: Calculates cache probabilities.

N. Name In/Out Type Info
1. cacheSize In int 1..integer - cache size;
2. decay In file Decay file for cache models
3. fileout Out file Probability file((Word Num.); probability;)
4. fileout1 Out file File for output of bigram cache information: 1-if w(i-1) in cache, 0-otherwise
5. start In string start - starts normal cache; start_bigram - starts bigram cache; start_bigram_decay - starts bigram cache with decay function; start_decay - starts cache with decay function;
Settings file: 16_cache.set ; 17_cache.set
Bat file: 16_cache.bat
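
The plain unigram cache (start="start") can be illustrated with a short sketch. It assumes that the cache probability of a word is its relative frequency among the last cacheSize words; the exact estimate used by the tool, and the decay and bigram variants, are not specified in this document. The function name cache_probabilities is hypothetical.

```python
from collections import Counter, deque


def cache_probabilities(words, cache_size):
    """Return an assumed cache probability for every word position.

    The probability of words[i] is its relative frequency among the
    previous cache_size words (0.0 while the cache is still empty).
    """
    window = deque()    # the last cache_size words, oldest first
    counts = Counter()  # word counts inside the window
    probs = []
    for w in words:
        probs.append(counts[w] / len(window) if window else 0.0)
        window.append(w)
        counts[w] += 1
        if len(window) > cache_size:
            oldest = window.popleft()
            counts[oldest] -= 1
    return probs
```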


2. calc_perplexity

Last modification date: 2006.04.26

Info: Evaluates perplexity and OOV of language model

N. Name In/Out Type Info
1. close In int 1 - Program exits after calculation has been completed;
2. count In int 0 - word count is detected automatically from "datafile"; k - word count in the test corpus;
3. datafile In file Probability file((Word Num.); probability;)
4. info Out file Out text file for the information about the calculation process
5. start In int 1 - Starts calculating;
6. startin In string - Starting directory;
Settings file: 09_calc_perplexity.set
Bat file: 09_calc_perplexity.bat
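
As an illustration of what the tool reports, the sketch below computes perplexity as exp(-(1/N) * sum(ln p_i)) over the in-vocabulary words and the OOV rate as the share of OOV words. It works on an in-memory list rather than on a probability file, and marking OOV words with None is an assumption made only for this example; the function name is hypothetical.

```python
import math


def perplexity_and_oov(probs):
    """probs holds one probability per test word; None marks an OOV word.

    Assumes at least one in-vocabulary word and strictly positive
    probabilities.
    """
    known = [p for p in probs if p is not None]
    oov_rate = (len(probs) - len(known)) / len(probs)
    log_sum = sum(math.log(p) for p in known)
    return math.exp(-log_sum / len(known)), oov_rate
```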


3. cluster

Last modification date: 2003.09.10

Info: Word clustering according to MI criterion.

N. Name In/Out Type Info
1. cCount In int k - Class count k;
2. classf In file Class map file (Word class)
3. classfOut Out file Class map file (Word class)
4. info Out file Out text file for the information about the calculation process
5. iteration In int 1.. - Iterations of clustering algorithm;
6. vocSize In int v - Vocabulary size v;
Settings file: 11_cluster.set
Bat file: 11_cluster.bat


4. decay

Last modification date: 2005.08.26

Info: Estimates decay function for cache models using EM algorithm.

N. Name In/Out Type Info
1. cacheSize In int 1.. - cache size for decay function;
2. d In int 1..3 - specifies the reoccurrence position to calculate; used with "start" and "startbigram";
3. fileout Out file Decay file for cache models
4. start In string start - Calculates histogram of word reoccurrences; startbigram - Calculates histogram of bigram reoccurrences; startdecayem - decay function is being evaluated using EM; startdecayembi - bigram decay function is being evaluated using EM; startword - Calculates histogram of word reoccurrences for specified "word";
5. word In string - specifies the word to estimate decay for;
Settings file: 15_decay.set
Bat file: 15_decay.bat
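
The histogram step (start="start") might look like the following sketch. It assumes that a reoccurrence is counted by its distance, in words, to the previous occurrence of the same word, and that distances larger than cacheSize are ignored. The role of the "d" parameter and the EM fit of the decay function are not covered here, and the function name is hypothetical.

```python
from collections import Counter


def reoccurrence_histogram(words, cache_size):
    """Count how often a word reappears at each distance up to cache_size."""
    last_seen = {}  # word -> index of its most recent occurrence
    hist = Counter()
    for i, w in enumerate(words):
        if w in last_seen:
            distance = i - last_seen[w]
            if distance <= cache_size:
                hist[distance] += 1
        last_seen[w] = i
    return dict(hist)
```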


5. dl

Last modification date: 2006.04.26

Info: Very large vocabulary recognition engine. Acoustic models are read from an HTK text file; the language model is provided by lmManager.

N. Name In/Out Type Info
1. beamSize In int p - beam size for Viterbi search;
2. contextWords In int 0 - no cross-word triphones; k>0 - context phonemes are used by expanding k last words;
3. featuresIn In file Text file of feature vectors (in HTK format);
4. fileOut Out file N-best list file in mlf format (HTK), recognized sequences;
5. hmm In file Acoustic models in HTK format;
6. info Out file Out text file for the information about the calculation process
7. lmContext In int (n-1)>0 - order of ngram model; 0 - unigram; 1 - bigram;
8. lmScale In int 0 - language model is not used; >0 - language model weight;
9. nbest In int 1 - returns only the best word sequence; N - returns N-best list;
10. pron In file Pronunciation vocabulary (in HTK format);
11. skipFrames In int s - skips s frames in stack search;
12. start In string decode - decodes; load - load acoustic models for online decoding;
13. tiedList In file Tied list of phonemes in HTK format;
14. tree In file Question tree (in HTK format) for synthesis of new phonemes;
15. useSynthesis In int 0 - none; 1 - synthesizes new phonemes according to question tree;
16. wordEndBeamSize In int p - beam size at the word level;
17. wordInsertionPenalty In int p - word insertion penalty;
Settings file: 31_dl.set
Bat file: 31_dl.bat


6. evallm_1

Last modification date: 2002.12.02

Info: Evaluates perplexity and OOV of language model.

N. Name In/Out Type Info
1. -annotate Out file Annotation
2. -binary In file Binary language model file
3. -fl In file File list of text corpus
4. -oovs Out file OOV file (word)
Bat file: 05_evallm_1.bat


7. gram2gramcl

Last modification date: 2003.09.03

Info: Generates class idtrigram file.

N. Name In/Out Type Info
1. classf In file Class map file (Word class)
2. gram In file Idngram file (id0 id1 ... idN count), Word trigram;
3. gramOut Out file Idngram file (id0 id1 ... idN count), Class trigram;
4. info Out file Out text file for the information about the calculation process
5. start In int 1 - starts working;
Settings file: 25_gram2gramcl.set
Bat file: 25_gram2gramcl.bat
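
The conversion from a word trigram to a class trigram can be sketched as follows, assuming that every word id is replaced by its class id from the class map and that the counts of identical class ngrams are summed. The data is held in dictionaries instead of the tool's files, and the function name is hypothetical.

```python
from collections import Counter


def word_to_class_ngrams(ngram_counts, word_class):
    """Map {(id0, ..., idN): count} over word ids to the same over class ids."""
    class_counts = Counter()
    for ids, count in ngram_counts.items():
        class_ids = tuple(word_class[i] for i in ids)
        class_counts[class_ids] += count
    # Sorted by id tuple so the result is deterministic.
    return dict(sorted(class_counts.items()))
```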


8. idg2ids_idg

Last modification date: 2006.03.02

Info: Splits words into two parts and generates ngram files of word beginning and word ending.

N. Name In/Out Type Info
1. galuniuFile In file Ordered list of endings
2. idngram In file Idngram file (id0 id1 ... idN count)
3. idngramG Out file Idngram file (id0 id1 ... idN count), idtrigram of word endings;
4. idngramS Out file Idngram file (id0 id1 ... idN count), idtrigram of word beginning;
5. neskaidytiFile In file List of words that will not be split;
6. vocab In file Vocabulary
7. vocabG Out file Vocabulary, Vocabulary of word endings;
8. vocabGFreq Out file Word frequency file, frequency file of word endings;
9. vocabS Out file Vocabulary, Vocabulary of word beginning;
10. vocabSkaldymas Out file Word splitting information;
Settings file: 33_idg2ids_idg.set
Bat file: 33_idg2ids_idg.bat


9. idg2ids_idgs

Last modification date: 2005.11.15

Info: Splits words into two parts and generates ngram files of word beginning and word ending.

N. Name In/Out Type Info
1. galuniuFile In file Ordered list of endings
2. idngram In file Idngram file (id0 id1 ... idN count)
3. idngramGS Out file Idngram file (id0 id1 ... idN count), idngram ggsg (ending ending beginning ending);
4. idngramS Out file Idngram file (id0 id1 ... idN count), idtrigram of word beginning;
5. neskaidytiFile In file List of words that will not be split;
6. vocab In file Vocabulary
7. vocabG Out file Vocabulary, Vocabulary of word endings;
8. vocabGFreq Out file Word frequency file, frequency file of word endings;
9. vocabS Out file Vocabulary, Vocabulary of word beginning;
10. vocabSkaldymas Out file Word splitting information;
Settings file: 32_idg2ids_idgs.set
Bat file: 32_idg2ids_idgs.bat


10. idg2rev_idg

Last modification date: 2004.11.18

Info: Reverses the idngram file.

N. Name In/Out Type Info
1. -ascii_input In - Idngram is loaded from text file;
2. -fin In file Idngram file (id0 id1 ... idN count)
3. -fout Out file Reversed idngram file (idN id(N-1) ... id0 count)
4. -n N In int 1..integer - Order of the ngram;
Bat file: 07_idg2rev_idg.bat
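
Reversing an idngram turns every entry (id0 id1 ... idN count) into (idN id(N-1) ... id0 count). A minimal in-memory sketch is shown below; re-sorting the reversed entries is an assumption, and the function name is hypothetical.

```python
def reverse_idngrams(ngram_counts):
    """Reverse the id order of every ngram and keep its count."""
    reversed_counts = {}
    for ids, count in ngram_counts.items():
        reversed_counts[tuple(reversed(ids))] = count
    # Sorted by the reversed id tuples (assumed ordering).
    return dict(sorted(reversed_counts.items()))
```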


11. idngram2lm

Last modification date: 2005.11.24

Info: Builds language model.

N. Name In/Out Type Info
1. -ascii_input In - Idngram is loaded from text file;
2. -binary Out file Binary language model file
3. -good_turing In - Smoothing type;
4. -idngram In file Idngram file (id0 id1 ... idN count)
5. -n N In int 1..integer - Order of the ngram;
6. -vocab In file Vocabulary
Bat file: 04_idngram2lm.bat


12. initClasses

Last modification date: 2004.01.27

Info: Initializes classes.

N. Name In/Out Type Info
1. cCount In int k - Class count k;
2. fOut Out file Class map file (Word class)
3. info Out file Out text file for the information about the calculation process
4. wCount In int v - Vocabulary size v;
Settings file: 10_initClasses.set
Bat file: 10_initClasses.bat


13. intCache

Last modification date: 2004.07.07

Info: Interpolates cache model probabilities using EM algorithm

N. Name In/Out Type Info
1. ann Out file File for dynamic interpolated lambdas
2. cache In int 1..integer - cache for dynamic interpolate;
3. datafile0..2 In file Probability file((Word Num.); probability;)
4. datafile2i In file File for output of bigram cache information: 1-if w(i-1) in cache, 0-otherwise
5. lamdastart In file Lambda file for models interpolation arrayd: double double ..., array of lambdas for ngram, cache unigram and bigram;
6. lamdastart1 In file Lambda file for models interpolation arrayd: double double ..., array of lambdas for ngram and cache unigram;
7. probCount In int 3 - count of interpolation models;
8. probOut Out file Probability file((Word Num.); probability;)
9. start In string calculate - Calculates perplexity and OOV; interpolate_dynamic - Dynamic interpolation. Model weights are set using "cache" history before calculating word probability estimate;
Settings file: 19_intCache.set
Bat file: 19_intCache.bat


14. interpolateEM

Last modification date: 2005.08.23

Info: Interpolates several prob files using EM algorithm

N. Name In/Out Type Info
1. ann Out file File for dynamic interpolated lambdas
2. cache In int 1..integer - cache for dynamic interpolate;
3. datafile0..(probCount-1) In file Probability file((Word Num.); probability;)
4. lamdaout Out file Lambda file for models interpolation arrayd: double double ...
5. lamdastart In file Lambda file for models interpolation arrayd: double double ...
6. probCount In int 1..integer - count of interpolation models;
7. probOut Out file Probability file((Word Num.); probability;)
8. start In string calculate - Calculates perplexity and OOV; interpolate - Interpolates; interpolate_dynamic - Dynamic interpolation. Model weights are set using "cache" history before calculating word probability estimate;
Settings file: 12_interpolateEM.set ; 13_interpolateEM.set ; 14_interpolateEM.set
Bat file: 12_interpolateEM.bat
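
The static case (start="interpolate") corresponds to the standard EM estimate of linear interpolation weights. The sketch below shows that textbook procedure under the assumption that every probability file gives one probability per test word; it is not taken from the tool's source, and the function name and the fixed iteration count are hypothetical.

```python
def em_interpolation_weights(prob_streams, lambdas, iterations=20):
    """Estimate mixture weights for probCount probability streams.

    prob_streams[m][i] is the probability model m assigns to word i;
    lambdas is the starting weight vector (it should sum to 1).
    """
    n_models = len(prob_streams)
    n_words = len(prob_streams[0])
    for _ in range(iterations):
        expected = [0.0] * n_models
        for i in range(n_words):
            mix = sum(lambdas[m] * prob_streams[m][i] for m in range(n_models))
            if mix == 0.0:
                continue  # no model explains this word; skip it
            for m in range(n_models):
                # E-step: posterior share of model m for word i.
                expected[m] += lambdas[m] * prob_streams[m][i] / mix
        total = sum(expected)
        if total == 0.0:
            break
        # M-step: renormalise the expected counts into new weights.
        lambdas = [e / total for e in expected]
    return lambdas
```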


15. kn1

Last modification date: 2006.04.26

Info: Evaluates probabilities with language model using Kneser-Ney smoothing.

N. Name In/Out Type Info
1. ann Out file Annotation
2. close In int 1 - Program exits after calculation has been completed;
3. filein In file Text file for word ngram probabilities evaluation (Word num ; X; word0 ;...; wordX; )
4. fileout Out file Probability file((Word Num.); probability;)
5. gramKN In file Reversed idngram file (idN id(N-1) ... id0 count)
6. gramVoc0..N In int X - gramVocA=X means that the Ath id of the ngram is taken from vocabulary vocabKNX;
7. info Out file Out text file for the information about the calculation process
8. n=N In int 1..integer - Order of the ngram;
9. oov Out file OOV file (word)
10. start In string calc - Calculates ngram probabilities;
11. startin In string - Starting directory;
12. vocabKN0..N In file Vocabulary
13. wordOrder In int 1 - "filein" represents a data file for evaluation of model P(g|ggs); 2 - "filein" represents a word endings ngram file; 3 - "filein" represents a word beginning ending ngram file for evaluation of model P(g|s); 4 - "filein" represents a word ngram file;
Settings file: 06_kn1.set
Bat file: 06_kn1.bat


16. lmManager

Last modification date: 2006.06.06

Info: Loads language models and provides a COM interface for using them from other applications. Implements word ngram, skip bigram, cache, topic mixture, and class-based models.

N. Name In/Out Type Info
1. EMHist In int 1.. - word count for setting dynamic model weights (lamdaDinamic=1);
2. info Out file Out text file for the information about the calculation process
3. lamdaDinamic In int 0 - static model weights; 1 - dynamic model weights;
4. lm In file Settings file for model information;
5. start In string load - load language model;
6. type In int 1 - word ngram with Good-Turing smoothing; 2 - cache models; 3 - topic mixture model; 5 - class-based model; 8 - word ngram with Kneser-Ney smoothing; 10 - skip bigram model. See the settings file for each type;
Settings file: 30_lmManager.set ; 301_topicmixture.zip ; 302_skip.zip ; 303_good_turing.ZIP ; 304_class.zip ; 305_Cache.zip
Bat file: 30_lmmanager.bat


17. mergeGrams

Last modification date: 2004.02.05

Info: Prepares trigram file for particular topic.

N. Name In/Out Type Info
1. fileList In file List of links to text idunigram files of particular topic
2. fileOut Out file Idngram file (id0 id1 ... idN count), topic idtrigram file;
3. info Out file Out text file for the information about the calculation process
4. start In int 1 - starts working;
5. vocab In file Vocabulary
6. vocabOut In file Vocabulary, topic vocabulary;
Settings file: 23_mergeGrams.set
Bat file: 23_mergeGrams.bat


18. rescorer

Last modification date: 2006.06.06

Info: Rescores an N-best list and calculates perplexity. The language model is provided by lmManager.

N. Name In/Out Type Info
1. amFactor In float x - acoustic model weight;
2. format In int v - Vocabulary size v;
3. info Out file Out text file for the information about the calculation process
4. insertPenalty In float x - word insertion penalty;
5. listIn In file Text file for word ngram probabilities evaluation (Word num ; X; word0 ;...; wordX; )
6. lmFactor In float x - language model weight;
7. nbest In int N - rescores the list of N sequences;
8. nbestIn In file N-best list file in mlf format (HTK), N-best list;
9. out Out file N-best list file in mlf format (HTK), rescored best list out (start="nbest"); Probability file((Word Num.); probability;), start="calcperplexity";
10. saveall In int 0 - only the best sequence is saved to out file; 1 - all sequences are saved to out file;
11. start In string calcperplexitylist - calculates perplexity; nbest - rescores N-best list;
12. usecache In int 0 - the last best word sequence is not added to word history that is used for evaluation of further sequences; 1 - the last best word sequence is added to word history that is used for evaluation of further sequences;
13. usereverse In int 0 - none; 1 - reversed model evaluation;
Settings file: 29_rescorer.set ; 291_rescorer_nbest.set
Bat file: 29_rescorer.bat
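
How amFactor, lmFactor and insertPenalty interact can be shown with a small sketch. It assumes the usual log-linear combination, total = amFactor * acoustic log score + lmFactor * language model log score + insertPenalty * number of words; the formula actually used by rescorer is not stated in this document. The function name and the hypothesis tuples are hypothetical.

```python
def rescore(hypotheses, am_factor, lm_factor, insert_penalty):
    """Return (total_score, words) pairs, best hypothesis first.

    Each hypothesis is (words, am_logprob, lm_logprob).
    """
    scored = []
    for words, am_logprob, lm_logprob in hypotheses:
        total = (am_factor * am_logprob
                 + lm_factor * lm_logprob
                 + insert_penalty * len(words))
        scored.append((total, words))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored
```

With saveall=0 only the first entry of such a list would be written; with saveall=1 the whole list.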


19. skaldymas

Last modification date: 2006.03.07

Info: Splits words into two parts and generates a word ngram file for calculating probabilities

N. Name In/Out Type Info
1. filein In file File list of text corpus
2. fileout Out file Text file for word ngram probabilities evaluation (Word num ; X; word0 ;...; wordX; )
3. galuniuFile In file Ordered list of endings
4. neskaidytiFile In file List of words that will not be split;
Settings file: 35_skaldymas.set
Bat file: 35_skaldymas.bat
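
A possible reading of the splitting step is sketched below. It assumes that the first ending from the ordered list ("galuniuFile") that matches the end of a word is cut off, and that words from the no-split list ("neskaidytiFile") are kept whole. The actual matching rule is not described here; the function name and the sample endings are hypothetical.

```python
def split_word(word, endings, no_split=frozenset()):
    """Split word into (beginning, ending); ending is "" if not split."""
    if word in no_split:
        return word, ""
    for ending in endings:  # the order of the list decides which ending wins
        if ending and len(word) > len(ending) and word.endswith(ending):
            return word[:-len(ending)], ending
    return word, ""
```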


20. skaldymas_class

Last modification date: 2003.12.02

Info: Tool creates a word ngram file for calculating probabilities

N. Name In/Out Type Info
1. close In int 1 - Program exits after calculation has been completed;
2. filein In file File list of text corpus
3. fileout Out file Text file for word ngram probabilities evaluation (Word num ; X; word0 ;...; wordX; )
4. info Out file Out text file for the information about the calculation process
5. n=N In int 1..integer - Order of the ngram;
6. skip=s In int 0..integer - skip of s words for skip bigram model (start="start_2g_skip");
7. start In string start - prepares file for class evaluation; start_2g_skip - prepares file for skip bigram evaluation; start_3g - prepares file for trigram evaluation; start_3g_reverse - prepares file for reverse trigram evaluation; start_reverse - prepares file for reverse class evaluation;
8. startin In string - Starting directory;
Settings file: 08_skaldymas_class.set
Bat file: 08_skaldymas_class.bat


21. text2idngram

Last modification date: 2005.11.24

Info: Generates idngram file.

N. Name In/Out Type Info
1. -fl In file File list of text corpus
2. -fout Out file Idngram file (id0 id1 ... idN count)
3. -n N In int 1..integer - Order of the ngram;
4. -vocab In file Vocabulary
5. -write_ascii In - Result in text format;
Bat file: 03_text2idngram.bat
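
The content of an idngram file (id0 id1 ... idN count) can be illustrated with an in-memory sketch. Mapping out-of-vocabulary words to id 0 and ignoring sentence boundaries are assumptions made for this example, and the function name is hypothetical.

```python
from collections import Counter


def count_id_ngrams(tokens, vocab, n, unk_id=0):
    """Count id ngrams of order n; vocab maps word -> id."""
    ids = [vocab.get(t, unk_id) for t in tokens]
    counts = Counter(tuple(ids[i:i + n]) for i in range(len(ids) - n + 1))
    # Sorted by id tuple so the output order is deterministic.
    return dict(sorted(counts.items()))
```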


22. text2idngram_s

Last modification date: 2006.11.24

Info: Generates skip idbigram file.

N. Name In/Out Type Info
1. -fl In file File list of text corpus
2. -fout Out file Idngram file (id0 id1 ... idN count)
3. -skip s In int 0..integer - skip words;
4. -vocab In file Vocabulary
5. -write_ascii In - Result in text format;
Bat file: 24_text2idngram_s.bat
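
A skip bigram pairs each word with the word that precedes it after skipping s words in between. The sketch below assumes that "-skip 0" yields the ordinary bigram and "-skip s" pairs word i-s-1 with word i; this reading of the parameter is not confirmed by the document, and the function name is hypothetical.

```python
from collections import Counter


def count_skip_bigrams(ids, skip):
    """Count (id[i - skip - 1], id[i]) pairs over a sequence of word ids."""
    gap = skip + 1
    counts = Counter((ids[i - gap], ids[i]) for i in range(gap, len(ids)))
    return dict(sorted(counts.items()))
```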


23. text2wfreq

Last modification date: 2004.05.13

Info: Generates word frequency file.

N. Name In/Out Type Info
1. -fl In file File list of text corpus
2. -fout Out file Word frequency file
Bat file: 01_text2wfreq.bat
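
The word frequency file is a list of words with their corpus counts. A minimal sketch, assuming whitespace tokenization and taking the texts as strings instead of reading the file list, could look like this (the function name is hypothetical):

```python
from collections import Counter


def word_frequencies(texts):
    """Count every whitespace-separated token across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return dict(sorted(counts.items()))
```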


24. topicCluster

Last modification date: 2004.08.23

Info: Text clustering into topics using unigram perplexity criterion or TFIDF.

N. Name In/Out Type Info
1. articleCount In int a - text count a;
2. articleList In file List of links to text idunigram files (number; idunigram)
3. classCount In int k - Class count k;
4. classf In file Class map file (Word class)
5. classfOut Out file Class map file (Word class)
6. fileList Out file List of links to text idunigram files of particular topic, file name for topic idngram list (start="start_filelist");
7. info Out file Out text file for the information about the calculation process
8. iteration In int 1.. - Iterations of clustering algorithm;
9. newExt Out string - extension of idngrams (start="start_filelist");
10. newPath Out string - path of idngrams (start="start_filelist");
11. start In string start_filelist - prepares idngram file list for every topic; start_tfidf - clustering using TFIDF criterion; start_unigram - clustering using unigram perplexity criterion;
12. wordCount In int v - Vocabulary size v;
Settings file: 20_topicCluster.set ; 21_topicCluster.set
Bat file: 20_topicCluster.bat


25. vocabFromTrigram

Last modification date: 2004.08.23

Info: Prepares topic vocabulary.

N. Name In/Out Type Info
1. fileList In file List of links to text idunigram files of particular topic
2. info Out file Out text file for the information about the calculation process
3. newVocab Out file Vocabulary, topic vocabulary;
4. start In int 1 - starts working;
5. vocab In file Vocabulary
Settings file: 22_vocabFromTrigram.set
Bat file: 22_vocabFromTrigram.bat


26. wfreq2id_wc

Last modification date: 2004.03.30

Info: Generates class-word idbigram for class model P(w|c).

N. Name In/Out Type Info
1. 2gram Out file Idngram file (id0 id1 ... idN count), Class-word idbigram;
2. classf In file Class map file (Word class)
3. vocab In file Vocabulary
4. wfreq In file Word frequency file
Settings file: 27_wfreq2id_wc.set
Bat file: 27_wfreq2id_wc.bat
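
The class-word idbigram holds, for every word, its class id, its word id and its frequency; P(w|c) then follows as the word count divided by the total count of its class. The sketch below uses dictionaries in place of the vocabulary, class map and word frequency files. The row layout (class id, word id, count) and both function names are assumptions.

```python
def class_word_bigrams(wfreq, vocab, word_class):
    """Build sorted (class_id, word_id, count) rows."""
    rows = []
    for word, count in wfreq.items():
        if word in vocab and word in word_class:
            rows.append((word_class[word], vocab[word], count))
    return sorted(rows)


def word_given_class(rows):
    """Relative-frequency estimate of P(w|c) from the rows above."""
    totals = {}
    for class_id, _, count in rows:
        totals[class_id] = totals.get(class_id, 0) + count
    return {(c, w): n / totals[c] for c, w, n in rows}
```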


27. wfreq2idsg

Last modification date: 2006.10.09

Info: Splits words into two parts and generates idbigram for word beginning and ending.

N. Name In/Out Type Info
1. 2gram Out file Idngram file (id0 id1 ... idN count), Idbigram of word beginning and ending;
2. galuniuFile In file Ordered list of endings
3. neskaidytiFile In file List of words that will not be split;
4. vocabG In file Vocabulary, Vocabulary of word endings;
5. vocabS In file Vocabulary, Vocabulary of word beginning;
6. wfreq In file Word frequency file
Settings file: 34_wfreq2idsg.set
Bat file: 34_wfreq2idsg.bat


28. wfreq2vocab

Last modification date: 2004.03.30

Info: Generates vocabulary.

N. Name In/Out Type Info
1. -gt N In int 1..integer - Keeps the words that appear in the text corpus more than N - 1 times;
2. -top N In int 1..integer - Keeps the N most frequent words;
3. < In file Word frequency file
4. > Out file Vocabulary
Bat file: 02_wfreq2vocab.bat
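
The two selection modes can be sketched as follows. "-gt N" is read as described in the table above (more than N - 1 occurrences, that is, at least N), and "-top N" keeps the N most frequent words. Breaking frequency ties alphabetically and returning the vocabulary in sorted order are assumptions, and the function names are hypothetical.

```python
def vocab_gt(wfreq, n):
    """Words that occur more than n - 1 times."""
    return sorted(w for w, c in wfreq.items() if c > n - 1)


def vocab_top(wfreq, n):
    """The n most frequent words (ties broken alphabetically)."""
    ranked = sorted(wfreq.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(w for w, _ in ranked[:n])
```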


29. zodynasClass

Last modification date: 2003.04.28

Info: Generates class vocabulary.

N. Name In/Out Type Info
1. classf In file Class map file (Word class)
2. info Out file Out text file for the information about the calculation process
3. start In int 1 - starts working;
4. vocabclass Out file Vocabulary, Class vocabulary;
Settings file: 26_zodynasClass.set
Bat file: 26_zodynasClass.bat


Files used by modeling tools

N. Name Generated by Sample
1. Annotation evallm_1 01.anot
2. Binary language model file idngram2lm
3. Class map file (Word class) topicCluster ; initClasses ; cluster 01.cla
4. Decay file for cache models decay 01.decay
5. File for dynamic interpolated lambdas interpolateEM
6. File for output of bigram cache information: 1-if w(i-1) in cache, 0-otherwise cache 03.ca
7. File list of text corpus 01.fsr
8. File sample of text corpus 10004.txt
9. Idngram file (id0 id1 ... idN count) gram2gramcl ; mergeGrams ; text2idngram 01.2gram
10. Lambda file for models interpolation arrayd: double double ... interpolateEM 01_lambdas.txt
11. List of links to text idunigram files (number; idunigram) 02.list
12. List of links to text idunigram files of particular topic topicCluster topic.list_0
13. N-best list file in mlf format (HTK) dl ; rescorer
14. OOV file (word) evallm_1 01.oov
15. Ordered list of endings galunes.txt
16. Out text file for the information about the calculation process
17. Probability file((Word Num.); probability;) interpolateEM ; kn1 01.prob
18. Reversed idngram file (idN id(N-1) ... id0 count) idg2rev_idg 01_r.2gram
19. Text file for word ngram probabilities evaluation (Word num ; X; word0 ;...; wordX; ) skaldymas_class 01.2txt
20. Vocabulary zodynasClass ; vocabFromTrigram ; wfreq2vocab 01.vocab
21. Word frequency file text2wfreq 01.wfreq