Parameters of SLM tools

Parameters of language modeling tools

Num.	Name	Last modification date
1.	cache	2005.08.24
2.	calc_perplexity	2006.04.26
3.	cluster	2003.09.10
4.	decay	2005.08.26
5.	dl	2006.04.26
6.	evallm_1	2002.12.02
7.	gram2gramcl	2003.09.03
8.	idg2ids_idg	2006.03.02
9.	idg2ids_idgs	2005.11.15
10.	idg2rev_idg	2004.11.18
11.	idngram2lm	2005.11.24
12.	initClasses	2004.01.27
13.	intCache	2004.07.07
14.	interpolateEM	2005.08.23
15.	kn1	2006.04.26
16.	lmManager	2006.06.06
17.	mergeGrams	2004.02.05
18.	rescorer	2006.06.06
19.	skaldymas	2006.03.07
20.	skaldymas_class	2003.12.02
21.	text2idngram	2005.11.24
22.	text2idngram_s	2006.11.24
23.	text2wfreq	2004.05.13
24.	topicCluster	2004.08.23
25.	vocabFromTrigram	2004.08.23
26.	wfreq2id_wc	2004.03.30
27.	wfreq2idsg	2006.10.09
28.	wfreq2vocab	2004.03.30
29.	zodynasClass	2003.04.28

1. cache

Last modification date: 2005.08.24

Info: Calculates cache probabilities.

N.	Name	In/Out	Type	Info
1.	cacheSize	In	int	1..integer - cache size;
2.	decay	In	file	Decay file for cache models
3.	fileout	Out	file	Probability file((Word Num.); probability;)
4.	fileout1	Out	file	File for output of bigram cache information: 1-if w(i-1) in cache, 0-otherwise
5.	start	In	string	start - starts normal cache; start_bigram - starts bigram cache; start_bigram_decay - starts bigram cache with decay function; start_decay - starts cache with decay function;

Settings file: 16_cache.set ; 17_cache.set

Bat file: 16_cache.bat
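
The table above does not state the cache formula. As a rough illustration only, the Python sketch below assumes the common unigram cache estimate: the number of times the word occurs among the last cacheSize words, divided by the number of words currently in the cache. The function name and the zero probability for an empty cache are assumptions, not the tool's documented behavior.

```python
from collections import deque


def unigram_cache_probs(words, cache_size):
    """Return a cache probability for every word in `words`.

    Assumption: plain relative frequency inside a sliding window of the
    last `cache_size` words; the real tool may smooth or decay the counts.
    """
    cache = deque(maxlen=cache_size)  # oldest word drops out automatically
    probs = []
    for w in words:
        if cache:
            probs.append(cache.count(w) / len(cache))
        else:
            probs.append(0.0)  # nothing has been seen yet
        cache.append(w)
    return probs
```

Each returned value corresponds to one row of the probability file (word number; probability).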

2. calc_perplexity

Last modification date: 2006.04.26

Info: Evaluates perplexity and OOV of language model

N.	Name	In/Out	Type	Info
1.	close	In	int	1 - Program exits after calculation has been completed;
2.	count	In	int	0 - word count will be detected automatically according to "datafile"; k - word count in test corpus;
3.	datafile	In	file	Probability file((Word Num.); probability;)
4.	info	Out	file	Out text file for the information about the calculation process
5.	start	In	int	1 - Starts calculating;
6.	startin	In	string	- Starting directory;

Settings file: 09_calc_perplexity.set

Bat file: 09_calc_perplexity.bat
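
For orientation, perplexity over a probability file is conventionally exp(-(1/N) * sum(ln p_i)). The Python sketch below follows that definition. Treating zero probabilities as OOV words and excluding them from the sum is an assumption, as is the function name; the "count" argument mirrors the documented parameter (0 means the word count is taken from the data).

```python
import math


def perplexity(probs, count=0):
    """Return (perplexity, oov_count) for a list of word probabilities.

    Assumption: a probability of 0 marks an OOV word and is skipped.
    """
    scored = [p for p in probs if p > 0.0]
    n = count if count > 0 else len(scored)  # count=0: detect automatically
    if n == 0:
        raise ValueError("no scored words")
    log_sum = sum(math.log(p) for p in scored)
    return math.exp(-log_sum / n), len(probs) - len(scored)
```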

3. cluster

Last modification date: 2003.09.10

Info: Word clustering according to MI criterion.

N.	Name	In/Out	Type	Info
1.	cCount	In	int	k - Class count k;
2.	classf	In	file	Class map file (Word class)
3.	classfOut	Out	file	Class map file (Word class)
4.	info	Out	file	Out text file for the information about the calculation process
5.	iteration	In	int	1.. - Iterations of clustering algorithm;
6.	vocSize	In	int	v - Vocabulary size v;

Settings file: 11_cluster.set

Bat file: 11_cluster.bat

4. decay

Last modification date: 2005.08.26

Info: Estimates decay function for cache models using EM algorithm.

N.	Name	In/Out	Type	Info
1.	cacheSize	In	int	1.. - cache size for decay function;
2.	d	In	int	1..3 - specifies which reoccurrence position is calculated; used with "start" and "startbigram";
3.	fileout	Out	file	Decay file for cache models
4.	start	In	string	start - calculates histogram of word reoccurrences; startbigram - calculates histogram of bigram reoccurrences; startdecayem - evaluates the decay function using EM; startdecayembi - evaluates the bigram decay function using EM; startword - calculates histogram of word reoccurrences for the specified "word";
5.	word	In	string	- specifies the word to estimate decay for;

Settings file: 15_decay.set

Bat file: 15_decay.bat

5. dl

Last modification date: 2006.04.26

Info: Very large vocabulary recognition engine. Acoustic models are read from an HTK text file; the language model is provided by lmManager.

N.	Name	In/Out	Type	Info
1.	beamSize	In	int	p - beam size for Viterbi search;
2.	contextWords	In	int	0 - no cross-word triphones; k>0 - context phonemes are used by expanding k last words;
3.	featuresIn	In	file	Text file of feature vectors (in HTK format);
4.	fileOut	Out	file	N-best list file in mlf format (HTK), recognized sequences;
5.	hmm	In	file	Acoustic models in HTK format;
6.	info	Out	file	Out text file for the information about the calculation process
7.	lmContext	In	int	(n-1)>0 - order of ngram model; 0 - unigram; 1 - bigram;
8.	lmScale	In	int	0 - language model is not used; >0 - language model weight;
9.	nbest	In	int	1 - returns only the best word sequence; N - returns N-best list;
10.	pron	In	file	Pronunciation vocabulary (in HTK format);
11.	skipFrames	In	int	s - skips s frames in stack search;
12.	start	In	string	decode - decodes; load - load acoustic models for online decoding;
13.	tiedList	In	file	Tied list of phonemes in HTK format;
14.	tree	In	file	Questions tree (in HTK format) for synthesis of new phonemes;
15.	useSynthesis	In	int	0 - none; 1 - synthesizes new phonemes according to question tree;
16.	wordEndBeamSize	In	int	p - beam size at the word level;
17.	wordInsertionPenalty	In	int	p - word insertion penalty;

Settings file: 31_dl.set

Bat file: 31_dl.bat

6. evallm_1

Last modification date: 2002.12.02

Info: Evaluates perplexity and OOV of language model.

N.	Name	In/Out	Type	Info
1.	-annotate	Out	file	Annotation
2.	-binary	In	file	Binary language model file
3.	-fl	In	file	File list of text corpus
4.	-oovs	Out	file	OOV file (word)

Bat file: 05_evallm_1.bat

7. gram2gramcl

Last modification date: 2003.09.03

Info: Generates class idtrigram file.

N.	Name	In/Out	Type	Info
1.	classf	In	file	Class map file (Word class)
2.	gram	In	file	Idngram file (id0 id1 ... idN count), Word trigram;
3.	gramOut	Out	file	Idngram file (id0 id1 ... idN count), Class trigram;
4.	info	Out	file	Out text file for the information about the calculation process
5.	start	In	int	1 - starts working;

Settings file: 25_gram2gramcl.set

Bat file: 25_gram2gramcl.bat
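
Conceptually, turning a word idngram file into a class idngram file means replacing every word id by its class id and summing the counts of ngrams that collapse to the same class sequence. The Python sketch below shows that idea on in-memory data. The function name, the tuple layout, and the sorted output order are assumptions made for illustration.

```python
from collections import Counter


def words_to_class_ngrams(ngrams, word2class):
    """Map word-id ngrams to class-id ngrams and merge their counts.

    ngrams:     iterable of ((id0, id1, ..., idN), count)
    word2class: dict from word id to class id (the class map)
    """
    merged = Counter()
    for ids, count in ngrams:
        merged[tuple(word2class[i] for i in ids)] += count
    return sorted(merged.items())
```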

8. idg2ids_idg

Last modification date: 2006.03.02

Info: Splits words into two parts and generates ngram files of word beginning and word ending.

N.	Name	In/Out	Type	Info
1.	galuniuFile	In	file	Ordered list of endings
2.	idngram	In	file	Idngram file (id0 id1 ... idN count)
3.	idngramG	Out	file	Idngram file (id0 id1 ... idN count), idtrigram of word endings;
4.	idngramS	Out	file	Idngram file (id0 id1 ... idN count), idtrigram of word beginning;
5.	neskaidytiFile	In	file	List of words that will not be split;
6.	vocab	In	file	Vocabulary
7.	vocabG	Out	file	Vocabulary, Vocabulary of word endings;
8.	vocabGFreq	Out	file	Word frequency file, frequency file of word endings;
9.	vocabS	Out	file	Vocabulary, Vocabulary of word beginning;
10.	vocabSkaldymas	Out	file	Word splitting information;

Settings file: 33_idg2ids_idg.set

Bat file: 33_idg2ids_idg.bat
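
The splitting rule itself is not described here. One plausible reading of "ordered list of endings" is that the first ending in the list that matches the end of a word wins, and that words in the no-split list are kept whole. The Python sketch below implements that reading only; the rule, the function name, and the requirement that the beginning be non-empty are all assumptions.

```python
def split_word(word, endings, no_split=()):
    """Split `word` into (beginning, ending).

    Assumptions: `endings` is tried in file order and the first match is
    used; words listed in `no_split` and words with no matching ending are
    returned whole with an empty ending.
    """
    if word in no_split:
        return word, ""
    for ending in endings:
        # Skip empty endings and endings that would consume the whole word.
        if ending and word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)], ending
    return word, ""
```

With endings ["as", "s"], the word "namas" would be split into "nam" and "as" under this rule.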

9. idg2ids_idgs

Last modification date: 2005.11.15

Info: Splits words into two parts and generates ngram files of word beginning and word ending.

N.	Name	In/Out	Type	Info
1.	galuniuFile	In	file	Ordered list of endings
2.	idngram	In	file	Idngram file (id0 id1 ... idN count)
3.	idngramGS	Out	file	Idngram file (id0 id1 ... idN count), idngram ggsg (ending ending beginning ending);
4.	idngramS	Out	file	Idngram file (id0 id1 ... idN count), idtrigram of word beginning;
5.	neskaidytiFile	In	file	List of words that will not be split;
6.	vocab	In	file	Vocabulary
7.	vocabG	Out	file	Vocabulary, Vocabulary of word endings;
8.	vocabGFreq	Out	file	Word frequency file, frequency file of word endings;
9.	vocabS	Out	file	Vocabulary, Vocabulary of word beginning;
10.	vocabSkaldymas	Out	file	Word splitting information;

Settings file: 32_idg2ids_idgs.set

Bat file: 32_idg2ids_idgs.bat

10. idg2rev_idg

Last modification date: 2004.11.18

Info: Reverses the idngram file.

N.	Name	In/Out	Type	Info
1.	-ascii_input	In		- Idngram is loaded from text file;
2.	-fin	In	file	Idngram file (id0 id1 ... idN count)
3.	-fout	Out	file	Reversed idngram file (idN id(N-1) ... id0 count)
4.	-n N	In	int	1..integer - Order of the ngram;

Bat file: 07_idg2rev_idg.bat
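
Based on the two file formats named above, reversing one text idngram line means reversing the id sequence and leaving the trailing count in place. The Python sketch below shows this for a single line. It is an illustration only: the function name is hypothetical, and whether the tool re-sorts the reversed file is not stated here.

```python
def reverse_idngram_line(line):
    """Turn 'id0 id1 ... idN count' into 'idN ... id1 id0 count'."""
    *ids, count = line.split()
    return " ".join(ids[::-1] + [count])
```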

11. idngram2lm

Last modification date: 2005.11.24

Info: Builds language model.

N.	Name	In/Out	Type	Info
1.	-ascii_input	In		- Idngram is loaded from text file;
2.	-binary	Out	file	Binary language model file
3.	-good_turing	In		- Smoothing type;
4.	-idngram	In	file	Idngram file (id0 id1 ... idN count)
5.	-n N	In	int	1..integer - Order of the ngram;
6.	-vocab	In	file	Vocabulary

Bat file: 04_idngram2lm.bat

12. initClasses

Last modification date: 2004.01.27

Info: Initializes classes.

N.	Name	In/Out	Type	Info
1.	cCount	In	int	k - Class count k;
2.	fOut	Out	file	Class map file (Word class)
3.	info	Out	file	Out text file for the information about the calculation process
4.	wCount	In	int	v - Vocabulary size v;

Settings file: 10_initClasses.set

Bat file: 10_initClasses.bat

13. intCache

Last modification date: 2004.07.07

Info: Interpolates cache model probabilities using EM algorithm

N.	Name	In/Out	Type	Info
1.	ann	Out	file	File for dynamic interpolated lambdas
2.	cache	In	int	1..integer - cache for dynamic interpolate;
3.	datafile0..2	In	file	Probability file((Word Num.); probability;)
4.	datafile2i	In	file	File for output of bigram cache information: 1-if w(i-1) in cache, 0-otherwise
5.	lamdastart	In	file	Lambda file for models interpolation arrayd: double double ..., array of lambdas for ngram, cache unigram and bigram;
6.	lamdastart1	In	file	Lambda file for models interpolation arrayd: double double ..., array of lambdas for ngram and cache unigram;
7.	probCount	In	int	3 - count of interpolation models;
8.	probOut	Out	file	Probability file((Word Num.); probability;)
9.	start	In	string	calculate - Calculates perplexity and OOV; interpolate_dynamic - Dynamic interpolation. Model weights are set using "cache" history before calculating word probability estimate;

Settings file: 19_intCache.set

Bat file: 19_intCache.bat

14. interpolateEM

Last modification date: 2005.08.23

Info: Interpolates several prob files using EM algorithm

N.	Name	In/Out	Type	Info
1.	ann	Out	file	File for dynamic interpolated lambdas
2.	cache	In	int	1..integer - cache for dynamic interpolate;
3.	datafile0..(probCount-1)	In	file	Probability file((Word Num.); probability;)
4.	lamdaout	Out	file	Lambda file for models interpolation arrayd: double double ...
5.	lamdastart	In	file	Lambda file for models interpolation arrayd: double double ...
6.	probCount	In	int	1..integer - count of interpolation models;
7.	probOut	Out	file	Probability file((Word Num.); probability;)
8.	start	In	string	calculate - Calculates perplexity and OOV; interpolate - Interpolates; interpolate_dynamic - Dynamic interpolation. Model weights are set using "cache" history before calculating word probability estimate;

Settings file: 12_interpolateEM.set ; 13_interpolateEM.set ; 14_interpolateEM.set

Bat file: 12_interpolateEM.bat
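
As background, the standard EM update for linear interpolation weights re-estimates each lambda as the average posterior share of its model over the data. The Python sketch below implements that textbook update for probability streams of equal length, one per model (datafile0..probCount-1). It is not taken from the tool's source; the function name, the fixed iteration count, and the skipping of positions where every model assigns zero are assumptions.

```python
def em_lambdas(streams, lambdas, iterations=20):
    """Estimate interpolation weights with EM.

    streams: list of probability lists, one list per model, same length
    lambdas: starting weights (one per model, summing to 1)
    """
    n = len(streams[0])
    for _ in range(iterations):
        totals = [0.0] * len(streams)
        for i in range(n):
            mix = sum(l * s[i] for l, s in zip(lambdas, streams))
            if mix == 0.0:
                continue  # no model explains this word; skip it
            for k, s in enumerate(streams):
                totals[k] += lambdas[k] * s[i] / mix  # posterior of model k
        norm = sum(totals)
        if norm == 0.0:
            return lambdas
        lambdas = [t / norm for t in totals]
    return lambdas
```

The interpolated probability of word i is then sum(lambda_k * p_k(i)) over the models.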

15. kn1

Last modification date: 2006.04.26

Info: Evaluates probabilities with language model using Kneser-Ney smoothing.

N.	Name	In/Out	Type	Info
1.	ann	Out	file	Annotation
2.	close	In	int	1 - Program exits after calculation has been completed;
3.	filein	In	file	Text file for word ngram probabilities evaluation (Word num ; X; word0 ;...; wordX; )
4.	fileout	Out	file	Probability file((Word Num.); probability;)
5.	gramKN	In	file	Reversed idngram file (idN id(N-1) ... id0 count)
6.	gramVoc0..N	In	int	X - gramVocA=X means that the A-th id of the ngram is taken from vocabulary vocabKNX;
7.	info	Out	file	Out text file for the information about the calculation process
8.	n=N	In	int	1..integer - Order of the ngram;
9.	oov	Out	file	OOV file (word)
10.	start	In	string	calc - Calculates ngram probabilities;
11.	startin	In	string	- Starting directory;
12.	vocabKN0..N	In	file	Vocabulary
13.	wordOrder	In	int	1 - "filein" represents a data file for evaluation of model P(g|ggs); 2 - "filein" represents a word endings ngram file; 3 - "filein" represents a word beginning ending ngram file for evaluation of model P(g|s); 4 - "filein" represents a word ngram file;

Settings file: 06_kn1.set

Bat file: 06_kn1.bat

16. lmManager

Last modification date: 2006.06.06

Info: Loads language models and provides a COM interface for using them from other applications. Implements word ngram, skip bigram, cache, topic mixture, and class-based models.

N.	Name	In/Out	Type	Info
1.	EMHist	In	int	1.. - word count for setting dynamic model weights (lamdaDinamic=1);
2.	info	Out	file	Out text file for the information about the calculation process
3.	lamdaDinamic	In	int	0 - static model weights; 1 - dynamic model weights;
4.	lm	In	file	Settings file for model information;
5.	start	In	string	load - load language model;
6.	type	In	int	1 - word ngram with Good-Turing smoothing. See settings file; 2 - cache models. See settings file; 3 - topic mixture model. See settings file; 5 - class-based model. See settings file; 8 - word ngram with Kneser-Ney smoothing. See settings file; 10 - skip bigram model. See settings file;

Settings file: 30_lmManager.set ; 301_topicmixture.zip ; 302_skip.zip ; 303_good_turing.ZIP ; 304_class.zip ; 305_Cache.zip

Bat file: 30_lmmanager.bat

17. mergeGrams

Last modification date: 2004.02.05

Info: Prepares trigram file for particular topic.

N.	Name	In/Out	Type	Info
1.	fileList	In	file	List of links to text idunigram files of particular topic
2.	fileOut	Out	file	Idngram file (id0 id1 ... idN count), topic idtrigram file;
3.	info	Out	file	Out text file for the information about the calculation process
4.	start	In	int	1 - starts working;
5.	vocab	In	file	Vocabulary
6.	vocabOut	Out	file	Vocabulary, topic vocabulary;

Settings file: 23_mergeGrams.set

Bat file: 23_mergeGrams.bat

18. rescorer

Last modification date: 2006.06.06

Info: Rescores an N-best list and calculates perplexity. The language model is provided by lmManager.

N.	Name	In/Out	Type	Info
1.	amFactor	In	float	x - acoustic model weight;
2.	format	In	int	v - Vocabulary size v;
3.	info	Out	file	Out text file for the information about the calculation process
4.	insertPenalty	In	float	x - word insertion penalty;
5.	listIn	In	file	Text file for word ngram probabilities evaluation (Word num ; X; word0 ;...; wordX; )
6.	lmFactor	In	float	x - language model weight;
7.	nbest	In	int	N - rescores the list of N sequences;
8.	nbestIn	In	file	N-best list file in mlf format (HTK), N-best list;
9.	out	Out	file	N-best list file in mlf format (HTK), rescored best list out (start="nbest"); Probability file((Word Num.); probability;), start="calcperplexity";
10.	saveall	In	int	0 - only the best sequence is saved to out file; 1 - all sequences are saved to out file;
11.	start	In	string	calcperplexitylist - calculates perplexity; nbest - rescores N-best list;
12.	usecache	In	int	0 - the last best word sequence is not added to word history that is used for evaluation of further sequences; 1 - the last best word sequence is added to word history that is used for evaluation of further sequences;
13.	usereverse	In	int	0 - none; 1 - reversed model evaluation;

Settings file: 29_rescorer.set ; 291_rescorer_nbest.set

Bat file: 29_rescorer.bat
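
How amFactor, lmFactor and insertPenalty are combined is not spelled out above. A common convention in N-best rescoring is a weighted sum of log scores plus a per-word penalty. The Python sketch below assumes exactly that formula, so treat it as an illustration of the parameter roles rather than the tool's actual scoring; the function name and the tuple layout are hypothetical.

```python
def rescore(hyps, am_factor, lm_factor, insert_penalty):
    """Sort hypotheses from best to worst combined score.

    hyps: list of (words, am_logprob, lm_logprob)
    Assumed score: am_factor * am + lm_factor * lm + insert_penalty * len(words)
    """
    def total(hyp):
        words, am, lm = hyp
        return am_factor * am + lm_factor * lm + insert_penalty * len(words)

    return sorted(hyps, key=total, reverse=True)
```

Under this reading, saveall=0 would correspond to writing only the first element of the returned list.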

19. skaldymas

Last modification date: 2006.03.07

Info: Splits words into two parts and generates a word ngram file for calculating probabilities

N.	Name	In/Out	Type	Info
1.	filein	In	file	File list of text corpus
2.	fileout	Out	file	Text file for word ngram probabilities evaluation (Word num ; X; word0 ;...; wordX; )
3.	galuniuFile	In	file	Ordered list of endings
4.	neskaidytiFile	In	file	List of words that will not be split;

Settings file: 35_skaldymas.set

Bat file: 35_skaldymas.bat

20. skaldymas_class

Last modification date: 2003.12.02

Info: Tool creates a word ngram file for calculating probabilities

N.	Name	In/Out	Type	Info
1.	close	In	int	1 - Program exits after calculation has been completed;
2.	filein	In	file	File list of text corpus
3.	fileout	Out	file	Text file for word ngram probabilities evaluation (Word num ; X; word0 ;...; wordX; )
4.	info	Out	file	Out text file for the information about the calculation process
5.	n=N	In	int	1..integer - Order of the ngram;
6.	skip=s	In	int	0..integer - skip of s words for skip bigram model (start="start_2g_skip");
7.	start	In	string	start - prepares file for class evaluation; start_2g_skip - prepares file for skip bigram evaluation; start_3g - prepares file for trigram evaluation; start_3g_reverse - prepares file for reverse trigram evaluation ; start_reverse - prepares file for reverse class evaluation ;
8.	startin	In	string	- Starting directory;

Settings file: 08_skaldymas_class.set

Bat file: 08_skaldymas_class.bat

21. text2idngram

Last modification date: 2005.11.24

Info: Generates idngram file.

N.	Name	In/Out	Type	Info
1.	-fl	In	file	File list of text corpus
2.	-fout	Out	file	Idngram file (id0 id1 ... idN count)
3.	-n N	In	int	1..integer - Order of the ngram;
4.	-vocab	In	file	Vocabulary
5.	-write_ascii	In		- Result in text format;

Bat file: 03_text2idngram.bat

22. text2idngram_s

Last modification date: 2006.11.24

Info: Generates skip idbigram file.

N.	Name	In/Out	Type	Info
1.	-fl	In	file	File list of text corpus
2.	-fout	Out	file	Idngram file (id0 id1 ... idN count)
3.	-skip s	In	int	0..integer - skip words;
4.	-vocab	In	file	Vocabulary
5.	-write_ascii	In		- Result in text format;

Bat file: 24_text2idngram_s.bat
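
The exact meaning of "skip words" is not defined above. The Python sketch below assumes that -skip s pairs each word with the word s + 1 positions later, so that s = 0 yields ordinary bigrams. Both that interpretation and the function name are assumptions.

```python
from collections import Counter


def skip_bigram_counts(ids, skip):
    """Count (id_i, id_{i+skip+1}) pairs in a sequence of word ids."""
    gap = skip + 1
    return Counter(zip(ids, ids[gap:]))
```

Each (pair, count) entry corresponds to one row "id0 id1 count" of the resulting idngram file.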

23. text2wfreq

Last modification date: 2004.05.13

Info: Generates word frequency file.

N.	Name	In/Out	Type	Info
1.	-fl	In	file	File list of text corpus
2.	-fout	Out	file	Word frequency file

Bat file: 01_text2wfreq.bat

24. topicCluster

Last modification date: 2004.08.23

Info: Text clustering into topics using unigram perplexity criterion or TFIDF.

N.	Name	In/Out	Type	Info
1.	articleCount	In	int	a - text count a;
2.	articleList	In	file	List of links to text idunigram files (number; idunigram)
3.	classCount	In	int	k - Class count k;
4.	classf	In	file	Class map file (Word class)
5.	classfOut	Out	file	Class map file (Word class)
6.	fileList	Out	file	List of links to text idunigram files of particular topic, file name for topic idngram list (start="start_filelist");
7.	info	Out	file	Out text file for the information about the calculation process
8.	iteration	In	int	1.. - Iterations of clustering algorithm;
9.	newExt	Out	string	- extension of idngrams (start="start_filelist");
10.	newPath	Out	string	- path of idngrams (start="start_filelist");
11.	start	In	string	start_filelist - prepares idngram file list for every topic; start_tfidf - clustering using TFIDF criterion; start_unigram - clustering using unigram perplexity criterion;
12.	wordCount	In	int	v - Vocabulary size v;

Settings file: 20_topicCluster.set ; 21_topicCluster.set

Bat file: 20_topicCluster.bat

25. vocabFromTrigram

Last modification date: 2004.08.23

Info: Prepares topic vocabulary.

N.	Name	In/Out	Type	Info
1.	fileList	In	file	List of links to text idunigram files of particular topic
2.	info	Out	file	Out text file for the information about the calculation process
3.	newVocab	Out	file	Vocabulary, topic vocabulary;
4.	start	In	int	1 - starts working;
5.	vocab	In	file	Vocabulary

Settings file: 22_vocabFromTrigram.set

Bat file: 22_vocabFromTrigram.bat

26. wfreq2id_wc

Last modification date: 2004.03.30

Info: Generates class-word idbigram for class model P(w|c).

N.	Name	In/Out	Type	Info
1.	2gram	Out	file	Idngram file (id0 id1 ... idN count), Class-word idbigram;
2.	classf	In	file	Class map file (Word class)
3.	vocab	In	file	Vocabulary
4.	wfreq	In	file	Word frequency file

Settings file: 27_wfreq2id_wc.set

Bat file: 27_wfreq2id_wc.bat
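
The class-word idbigram pairs each class with its member words and their frequencies. The Python sketch below shows how such counts lead to the maximum-likelihood estimate P(w|c) = count(w) / count(c). The tool itself only writes the counts; the probability step, the function name, and the dictionary layout are added here as assumptions for illustration.

```python
from collections import defaultdict


def class_word_probs(wfreq, word2class):
    """Return {(class, word): P(word | class)} from word frequencies.

    wfreq:      dict word -> frequency
    word2class: dict word -> class (the class map)
    """
    class_total = defaultdict(int)
    for word, count in wfreq.items():
        class_total[word2class[word]] += count
    return {
        (word2class[word], word): count / class_total[word2class[word]]
        for word, count in wfreq.items()
    }
```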

27. wfreq2idsg

Last modification date: 2006.10.09

Info: Splits words into two parts and generates idbigram for word beginning and ending.

N.	Name	In/Out	Type	Info
1.	2gram	Out	file	Idngram file (id0 id1 ... idN count), Idbigram of word beginning and ending;
2.	galuniuFile	In	file	Ordered list of endings
3.	neskaidytiFile	In	file	List of words that will not be split;
4.	vocabG	In	file	Vocabulary, Vocabulary of word endings;
5.	vocabS	In	file	Vocabulary, Vocabulary of word beginning;
6.	wfreq	In	file	Word frequency file

Settings file: 34_wfreq2idsg.set

Bat file: 34_wfreq2idsg.bat

28. wfreq2vocab

Last modification date: 2004.03.30

Info: Generates vocabulary.

N.	Name	In/Out	Type	Info
1.	-gt N	In	int	1..integer - Keeps the words that appear in the text corpus more than N - 1 times;
2.	-top N	In	int	1..integer - The N most frequent words;
3.	<	In	file	Word frequency file
4.	>	Out	file	Vocabulary

Bat file: 02_wfreq2vocab.bat
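
The two selection options can be pictured as simple filters over the word frequency file. In the Python sketch below, vocab_gt keeps words seen more than N - 1 times and vocab_top keeps the N most frequent words, as described in the table. The function names, the alphabetical output order, and the tie-breaking by word are assumptions.

```python
def vocab_gt(wfreq, n):
    """Words that occur more than n - 1 times (the -gt N option)."""
    return sorted(w for w, c in wfreq.items() if c > n - 1)


def vocab_top(wfreq, n):
    """The n most frequent words (the -top N option).

    Ties are broken alphabetically, which is an assumption.
    """
    ranked = sorted(wfreq.items(), key=lambda wc: (-wc[1], wc[0]))
    return sorted(w for w, _ in ranked[:n])
```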

29. zodynasClass

Last modification date: 2003.04.28

Info: Generates class vocabulary.

N.	Name	In/Out	Type	Info
1.	classf	In	file	Class map file (Word class)
2.	info	Out	file	Out text file for the information about the calculation process
3.	start	In	int	1 - starts working;
4.	vocabclass	Out	file	Vocabulary, Class vocabulary;

Settings file: 26_zodynasClass.set

Bat file: 26_zodynasClass.bat

Files used by modeling tools

N.	Name	Generated by	Sample
1.	Annotation	evallm_1	01.anot
2.	Binary language model file	idngram2lm
3.	Class map file (Word class)	topicCluster ; initClasses ; cluster	01.cla
4.	Decay file for cache models	decay	01.decay
5.	File for dynamic interpolated lambdas	interpolateEM
6.	File for output of bigram cache information: 1-if w(i-1) in cache, 0-otherwise	cache	03.ca
7.	File list of text corpus		01.fsr
8.	File sample of text corpus		10004.txt
9.	Idngram file (id0 id1 ... idN count)	gram2gramcl ; mergeGrams ; text2idngram	01.2gram
10.	Lambda file for models interpolation arrayd: double double ...	interpolateEM	01_lambdas.txt
11.	List of links to text idunigram files (number; idunigram)		02.list
12.	List of links to text idunigram files of particular topic	topicCluster	topic.list_0
13.	N-best list file in mlf format (HTK)	dl ; rescorer
14.	OOV file (word)	evallm_1	01.oov
15.	Ordered list of endings		galunes.txt
16.	Out text file for the information about the calculation process
17.	Probability file((Word Num.); probability;)	interpolateEM ; kn1	01.prob
18.	Reversed idngram file (idN id(N-1) ... id0 count)	idg2rev_idg	01_r.2gram
19.	Text file for word ngram probabilities evaluation (Word num ; X; word0 ;...; wordX; )	skaldymas_class	01.2txt
20.	Vocabulary	zodynasClass ; vocabFromTrigram ; wfreq2vocab	01.vocab
21.	Word frequency file	text2wfreq	01.wfreq