HAPAX LEGOMENA AS A PLATFORM FOR TEXT ALIGNMENT
Rūta Marcinkevičienė
Centre of Computational Linguistics at Vytautas Magnus University
Daukanto 28, 3000 Kaunas, Lithuania
Tel.: 370 7 224515 Fax: 370 7 203858
E-mail: ruta.marcinkeviciene@vdu.lt
ABSTRACT
The paper deals with single occurrences of words in a text, here called hapax legomena (understood more narrowly than single occurrences of a lexeme in a language). They are supposed to be a distinctive feature of an author's style and as such to be retained, to a certain extent, in the translation of a text. The hypothesis is that in two parallel texts many of them should have another hapax as their translation equivalent. A procedure of several steps was used to check the hypothesis: listing and lemmatising single occurrences, measuring their distribution in the text of Orwell's novel 1984 and its Lithuanian translation, and checking their matching and equivalence both automatically and manually. Correspondence of hapaxes depends on lemma structures and differences in lemmatisation. It was found that 31.6% of English hapaxes have corresponding Lithuanian hapaxes. Correspondence for the Lithuanian part of hapaxes is 48.3%. Thus hapax legomena are a useful but insufficient basis for text alignment.
1. Introduction
Hapax legomena, originally understood as single occurrences of a lexeme in a language and presently interpreted as single occurrences of a word form in a text, were discussed by Rev. A. Q. Morton as a distinctive feature of an author's style. Their distribution was supposed to be unique and therefore possibly retained in the translation of a text. The hypothesis of the present study is that in two parallel texts many hapaxes should have another hapax as their translation equivalent, because they will be uncommon words for the topic area of the text. If so, the positions of hapax legomena could serve as anchor points for text alignment.
The texts chosen for testing the hypothesis are Orwell's famous novel 1984 and its translation into Lithuanian by Virgilijus Čepliejus. The reason for the choice is that the texts exist in both aligned and non-aligned forms. That makes it possible to check the method of alignment by hapax legomena. The software used for text processing was Mike Scott's WordSmith Tools (for statistics, word listing and concordancing of both texts and for lemmatising the English text), the lemmatizer by Vytautas Zinkevičius (for the Lithuanian translation) and the text aligner Vanilla, developed by Pernilla Danielsson and Daniel Ridings.
2. Comparison of statistics
The counts of tokens, types and sentences, the type/token ratio, and the average length of words (in letters) and of sentences (in words) revealed considerable differences between the two texts (see Table 1). These differences need comment, as they influence and explain the outcome of the procedure.
| | George Orwell's 1984 | Džordžo Orvelo 1984-ieji |
|---|---|---|
| Running words (tokens) | 104,407 | 71,210 |
| Different words (types) | 8,957 | 17,939 |
| Type/token ratio, % | 8.57 | 25.19 |
| Average word length (letters) | 4.44 | 5.84 |
| Sentences | 6,587 | 6,538 |
| Average sentence length (words) | 15.81 | 10.87 |
Table 1: Comparison of statistics
The difference in the number of tokens (104,407 : 71,210) has to do with the difference in the average length of words. There are many more English tokens, as words are shorter in English than in Lithuanian (4.44 : 5.84 letters). The reverse difference in the number of types (8,957 : 17,939) shows a smaller amount of repeated words and word forms in the translation and can be explained by the inflectional nature of the Lithuanian language. Lithuanian is rich in morphological categories (the verb "to be" alone, together with its nine verbal manifestations, has 307 forms across three moods and persons, two genders and numbers, and six cases), so the Lithuanian translation has a greater variety of types. The different type/token ratios (8.57% : 25.19%) show that there are more distinct word types and fewer repeated word forms in the Lithuanian text. For the same reason, roughly every 12th token in the English text and every 4th token in the Lithuanian text is new.
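The ratios quoted above follow directly from the counts in Table 1. A minimal Python sketch of the arithmetic is given here for illustration; the function name is an assumption and is not part of the tools used in the study, and the last digit may differ from Table 1 depending on rounding.

```python
def type_token_ratio(types: int, tokens: int) -> float:
    """Return the type/token ratio as a percentage."""
    return 100.0 * types / tokens


# Counts taken from Table 1.
english_ttr = type_token_ratio(8_957, 104_407)
lithuanian_ttr = type_token_ratio(17_939, 71_210)

# Average number of running words per distinct type,
# i.e. "every n-th token is new".
english_interval = 100.0 / english_ttr
lithuanian_interval = 100.0 / lithuanian_ttr

print(f"English: TTR {english_ttr:.2f}%, one new type per {english_interval:.1f} tokens")
print(f"Lithuanian: TTR {lithuanian_ttr:.2f}%, one new type per {lithuanian_interval:.1f} tokens")
```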
The combination of a greater number of shorter English running words used in longer sentences, and of fewer but longer Lithuanian words used in shorter sentences, results in only a negligible difference in the number of sentences (6,587 : 6,538). The nearly equal number of sentences could make it possible to compare the position of each hapax in both texts automatically. That could not be done at word level because of the great difference in the number of running words.
The comparative statistics suggest that the identification and comparison of hapax legomena in the selected texts is made more difficult by the different nature of the two languages.
3. Identification, lemmatisation and comparison of hapaxes
The first step in that direction was to produce word lists for both languages automatically. They revealed the percentage of hapaxes and of the so-called rare words, i.e. tokens which occur two or three times in the texts (see Figure 1). The translation contained a greater number of both hapaxes and rare words.
Figure 1: Occurrences of tokens in English and Lithuanian versions of George Orwell's 1984
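The word-list step can be illustrated with a short Python sketch. It is not the WordSmith Tools procedure used in the study: the tokenisation rule (lower-cased runs of letters, optionally joined by hyphens or apostrophes) and all function names are assumptions made for the example.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count word forms using a crude, assumed tokenisation rule."""
    tokens = re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())
    return Counter(tokens)


def hapaxes(counts: Counter) -> list[str]:
    """Word forms that occur exactly once."""
    return sorted(w for w, n in counts.items() if n == 1)


def rare_words(counts: Counter) -> list[str]:
    """Word forms that occur two or three times."""
    return sorted(w for w, n in counts.items() if n in (2, 3))


def frequency_spectrum(counts: Counter) -> Counter:
    """Map each frequency n to the number of types occurring n times."""
    return Counter(counts.values())


sample = "the cat saw the dog and the cat ran"
counts = word_counts(sample)
print(hapaxes(counts))
print(rare_words(counts))
print(frequency_spectrum(counts))
```

Running the same functions over the source text and the translation would give the two frequency spectra that a chart such as Figure 1 compares.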
A closer look at the lists of hapaxes and their later comparison disclosed that they fall into five clear-cut categories according to their degree of correspondence:
1. proper names that have 100% correspondence,
2. hapaxes that make pairs of translation equivalents,
3. hapaxes that aren't translation equivalents but are used within a reasonable distance (in the same sentence in our case) from each other in the original version and translation. They can be classified as strong correspondences since their occurrence is bi-unique,
4. hapaxes that are positioned outside the boundaries of the same sentence but can be found in the translation no further than two sentences above or below. They form weak correspondences,
5. hapaxes that don't have any of the above-mentioned types of correspondence, i.e. non-corresponding hapaxes.
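The five categories above can be expressed as a simple decision rule. The sketch below is only an illustration: it assumes that each candidate pair comes with flags for proper names and translation equivalence, and that both sentence numbers refer to a shared aligned numbering. All names are hypothetical.

```python
from enum import Enum


class Correspondence(Enum):
    PROPER_NAME = 1
    TRANSLATION_EQUIVALENT = 2
    STRONG = 3   # bi-unique: both hapaxes fall in the same aligned sentence
    WEAK = 4     # within two sentences above or below
    NONE = 5


def classify(is_proper_name: bool, is_equivalent: bool,
             source_sentence: int, target_sentence: int,
             weak_window: int = 2) -> Correspondence:
    """Assign a pair of hapaxes to one of the five categories."""
    if is_proper_name:
        return Correspondence.PROPER_NAME
    if is_equivalent:
        return Correspondence.TRANSLATION_EQUIVALENT
    distance = abs(source_sentence - target_sentence)
    if distance == 0:
        return Correspondence.STRONG
    if distance <= weak_window:
        return Correspondence.WEAK
    return Correspondence.NONE


print(classify(False, False, 120, 120))  # same sentence
print(classify(False, False, 120, 122))  # two sentences apart
print(classify(False, False, 120, 130))  # too far apart
```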
The next step was to pick out proper names from the list of hapaxes, to locate them in the texts and to establish their distribution (see Figure 2). Proper names are distributed very unevenly in the texts, as they tend to be used in clusters. The greatest distance between them is more than 20,000 running words, the average distance being 1,710. Thus, if used alone, proper names could serve as a reliable but insufficient platform for text alignment.
Figure 2: Distribution of proper names
Later the remaining lists were lemmatised, and those cases were discarded where
a) the lemma occurs more than once,
b) the lemma occurs among multiple occurrences.
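The filtering step can be sketched in Python as follows. This is an illustration only: the `lemmatise` callable stands in for whichever lemmatizer is available, and the toy lemma table is invented for the example. Requiring the lemma to occur exactly once in the whole text covers both discard conditions at once.

```python
from collections import Counter
from typing import Callable, Iterable


def lemma_hapaxes(tokens: Iterable[str],
                  lemmatise: Callable[[str], str]) -> list[str]:
    """Return lemmas of word-form hapaxes whose lemma is itself unique.

    A word-form hapax is dropped if its lemma is shared with another
    hapax (case a) or with a word form that occurs several times (case b).
    """
    tokens = list(tokens)
    form_counts = Counter(tokens)
    lemma_counts = Counter(lemmatise(t) for t in tokens)
    kept = [
        lemmatise(form)
        for form, n in form_counts.items()
        if n == 1 and lemma_counts[lemmatise(form)] == 1
    ]
    return sorted(kept)


# Toy lemma table, invented for the example.
TOY_LEMMAS = {"walks": "walk", "walked": "walk", "ran": "run"}


def toy_lemmatise(form: str) -> str:
    return TOY_LEMMAS.get(form, form)


tokens = ["walks", "walked", "sash", "ran", "run", "run"]
print(lemma_hapaxes(tokens, toy_lemmatise))
```

In the toy input, "walks" and "walked" are both word-form hapaxes but share a lemma, and "ran" shares its lemma with the repeated "run", so only "sash" survives.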
Because of the obvious differences between the source and target languages, lemmatisation reversed the ratio between the English and Lithuanian hapaxes (see Figure 3). After lemmatisation there remained almost twice as many English as Lithuanian hapaxes, though before it Lithuanian hapaxes had exceeded the English ones tenfold.
Figure 3: Hapaxes Before and After Lemmatisation
The result of lemmatisation could have been influenced not only by the different numbers of morphological categories but also by the mismatch of lemmas, which cannot be avoided because of the differences in the lemma structures of the two languages. E.g. Lithuanian verbal forms were lemmatised up to the level of the verb, while irregular English participles were left untouched by the automatic lemmatizer. Besides, Lithuanian verbal nouns, which correspond to the English gerund, were attributed to the lemma of the respective verb. Suffixed words were lemmatised more systematically than prefixed ones, e.g. Lithuanian diminutives were attributed to their non-diminutive forms. This factor might have caused the loss of some hapaxes on the side of the more heavily lemmatised language.
Before the next step, i.e. the manual search for translation equivalents among the lemmatised hapaxes, an attempt was made to locate the hapaxes in their respective texts and to check their distribution. The numerous English hapaxes are distributed rather evenly at an average distance of 54.84 words, i.e. they can be found in roughly every 3rd sentence of Orwell's novel (see Figure 4).
Figure 4: Distribution of English hapaxes
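A distribution measure of this kind can be computed from the token positions of the hapaxes. The paper does not state exactly how the distances were measured, so the sketch below assumes the distance is the gap between consecutive hapax positions in running words; the function names and sample positions are illustrative.

```python
def gaps(positions: list[int]) -> list[int]:
    """Distances in running words between consecutive hapax positions."""
    ordered = sorted(positions)
    if len(ordered) < 2:
        raise ValueError("at least two positions are needed")
    return [b - a for a, b in zip(ordered, ordered[1:])]


def average_gap(positions: list[int]) -> float:
    distances = gaps(positions)
    return sum(distances) / len(distances)


def largest_gap(positions: list[int]) -> int:
    return max(gaps(positions))


# Invented token positions of four hapaxes.
sample_positions = [3, 10, 60, 61]
print(average_gap(sample_positions), largest_gap(sample_positions))
```

Dividing the reported average of 54.84 words by the average English sentence length of 15.81 words from Table 1 gives about 3.5 sentences, which is roughly consistent with one hapax in every 3rd sentence.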
If all of them had corresponding hapaxes in the target language, they could be used to align the texts. However, there are only half as many hapaxes in the translation, which sets the upper limit on the number of correspondences even in an ideal case. In such an ideal case hapaxes could still provide a good platform for alignment, as they appear on average in every 5th sentence of the translation (see Figure 5).
Figure 5: Distribution of Lithuanian hapaxes
Unfortunately, only a small part of the Lithuanian hapaxes (ca. 300 words) were found to be translation equivalents (see Figure 6). Some translation equivalents, especially contextual ones, could have been missed, since only the two lists were compared, and these revealed only systemic or dictionary equivalence.
Figure 6 depicts the ratio of overlapping and non-overlapping areas of single occurrences in the source and target texts. Only 31.6% of English hapaxes have correspondences in the translation. The proportion of non-corresponding hapaxes is smaller on the Lithuanian side, where 48.3% of hapaxes have correspondences. Overlapping hapaxes differ in their degree of correspondence: they range from proper names to translation equivalents and finally to bi-unique hapaxes. The latter share only an approximate position in the two texts.
Figure 6: Correspondence of hapaxes
Translation equivalents, like proper names, produce precise connections between source and target texts. Nevertheless, those connections are far from sufficient, as they appear at an average distance of 247.53 words (see Figure 7).
Figure 7: Distribution of translation equivalents
Translation equivalents were analysed from the point of view of their lexical and semantic structure. The analysis revealed that 31% of them are pairs of cognates, e.g. optimism - optimizmo, contrast - kontrastas, aureole - aureolė, etc. Besides, there were words from "Newspeak", phrases in Latin (primae noctis), cases of irregular pronunciation and proper names. 35% are hapaxes which appear in the text because of a unique topic area, peculiarities of the author's style, or both. Specific categories of translation equivalents, such as cognates, could be detected ad hoc from the word lists and used for alignment in the same way as proper names.
Though proper names and translation equivalents form only a small part of hapax legomena, the remaining single occurrences could also serve for text alignment if they appeared within a reasonable window. In that case hapaxes could be called bi-unique and used as approximate but strong connections. Attempts were made to compare the location of hapaxes both automatically and manually.
Automatic comparison at word level turned out to be impossible, as the source and target texts differ in the number of words they contain. Besides, it was impossible to calculate their relative positions because their ordinal positions in the two texts diverge inconsistently (see Figure 8). The wave in the figure illustrates an increase of words at the beginning of the translation.
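One possible way to pick cognate candidates out of two hapax lists is plain string similarity. The sketch below is not the procedure used in the study: it relies on Python's standard `difflib.SequenceMatcher`, and the 0.7 threshold is an arbitrary assumption that would need tuning on real data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity of two word forms, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cognate_candidates(source_hapaxes: list[str],
                       target_hapaxes: list[str],
                       threshold: float = 0.7) -> list[tuple[str, str, float]]:
    """Return (source, target, score) pairs at or above the threshold."""
    pairs = []
    for s in source_hapaxes:
        for t in target_hapaxes:
            score = similarity(s, t)
            if score >= threshold:
                pairs.append((s, t, score))
    # Best-scoring candidates first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


english = ["contrast", "aureole", "spanner"]
lithuanian = ["kontrastas", "aureolė", "veržlėrakčiu"]
for s, t, score in cognate_candidates(english, lithuanian):
    print(f"{s} - {t}: {score:.2f}")
```

With these inputs the cognate pairs contrast - kontrastas and aureole - aureolė pass the threshold, while the non-cognate pair spanner - veržlėrakčiu does not. Candidates found this way would still need a manual check.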
Figure 8: Ratio between word position in English and Lithuanian versions
Better results were obtained by comparing the positions of hapaxes at sentence level. The almost equal number of sentences in both texts seemed very promising for the purpose. Still, sentence boundaries could not be used to limit the range of hapax location automatically, as there is a great mismatch in sentence correspondence. Because of different conventions for presenting direct speech, and for some other, less obvious reasons, one source sentence is translated by two or more sentences, or vice versa. As a result, a translation equivalent is sometimes located as far as 22 sentences ahead of or 46 sentences behind its place in the source text (see Figure 9).
Figure 9: Distance between sentence positions
The failure to detect bi-unique hapaxes automatically encouraged a manual comparison of hapaxes in order to detect those that are bi-unique. For that purpose an extract from Chapter 1 was selected. It contained 278 English and 143 Lithuanian hapaxes. A careful comparison revealed that 41 of them were translation equivalents (proper names included), 35 were bi-unique hapaxes, i.e. a hapax in the source sentence had a hapax in the corresponding target sentence, and 12 hapaxes had correspondences in the two nearest sentences, while the rest (190 in English and 55 in Lithuanian) were translated as non-hapaxes. Thus out of every 100 Lithuanian hapaxes almost half could be called bi-unique. They add a considerable number of strong connections to the translation equivalents.
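The counts reported for the Chapter 1 extract can be checked with a few lines of Python. The sketch assumes that each matched category counts pairs, i.e. one English and one Lithuanian hapax per correspondence, which is what the reported remainders imply; the variable names are illustrative.

```python
english_hapaxes = 278
lithuanian_hapaxes = 143

equivalents = 41  # translation equivalents, proper names included
bi_unique = 35    # hapax pairs found in corresponding sentences
weak = 12         # correspondences in the two nearest sentences

matched = equivalents + bi_unique + weak
english_unmatched = english_hapaxes - matched
lithuanian_unmatched = lithuanian_hapaxes - matched
print(english_unmatched, lithuanian_unmatched)  # 190 55

# Share of each category among the Lithuanian hapaxes of the extract.
for label, n in [("translation equivalents", equivalents),
                 ("bi-unique", bi_unique),
                 ("weak", weak),
                 ("non-hapax translations", lithuanian_unmatched)]:
    print(f"{label}: {100 * n / lithuanian_hapaxes:.1f}%")
```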
Manual comparison of previously aligned versions helped not only to detect bi-unique hapaxes but also to clarify their semantic links. In most cases two non-equivalent hapaxes (put in italics) are used either a) in the same collocations or, if they are more remote, b) in the descriptions of the same situation, e.g.:
a)
He hated her because she was young and pretty and sexless, because he wanted to go to bed with her and would never do so, because round her sweet supple waist, which seemed to ask you to encircle it with your arm, there was only the odious scarlet sash, aggressive symbol of chastity.
Todėl, kad ji jauna, graži ir belytė, todėl, kad nori gulėti su ja lovoje, bet niekada negulės, todėl, kad jos gražus lieknas juosmuo, kuris prašyte prašosi apkabinamas, tėra tik tas rėksmingas raudonas raištis, agresyvus nekaltybės simbolis.
It gave off a sickly, oily smell, as of Chinese rice spirit. . .
Į nosį trenkė šleikštus, riebus kvapas, panašiai kaip kinų ryžių spirito. . .
b)
The little sandyhaired woman had flung herself forward over the back of the chair in front of her.
Nedidukė šiaudaplaukė moteris buvo užsikniaubusi ant priekinės kėdės atlošo.
Besides, sometimes a cluster of several hapaxes in one language corresponds to one hapax in the other, e.g.:
a)
Were there always these vistas of rotting nineteenth-century houses, their sides shored up with baulks of timber, their windows patched with cardboard and their roofs with corrugated iron, their crazy garden walls sagging in all directions?
Ar visada stūksojo eilės tų puvančių devyniolikto amžiaus namų su rąstais paramstytomis sienomis, kartonu užkalinėtais langais, rifliuotos skardos stogais ir į visas puses kvailiausiai išsiklaipiusiomis sodų tvoromis?
b)
Presumably -- since he had sometimes seen her with oily hands and carrying a spanner -- she had some mechanical job on one of the novel-writing machines.
Kadangi dažnai matydavo ją tepaluotomis rankomis ir su veržlėrakčiu, tai spėjo, kad tikriausiai yra techninė darbuotoja, aptarnaujanti kurią nors romanų rašymo mašiną.
An attempt was made to investigate those cases where English hapaxes were translated by Lithuanian non-hapaxes and vice versa. Analysis of the translation equivalents revealed that in most cases this was caused by a mismatch of compounding patterns. If a rarely used English compound was translated by a Lithuanian word combination, there was little chance that those words would be hapaxes:
then you saw a lifeboat full of children with a helicopter hovering over it.
paskui pasirodė gelbėjimosi valtis, pilna vaikų, o virš jos malūnsparnis, valties priekyje sėdėjo pusamžė moteris,
Another reason for a hapax changing into a non-hapax is determinologization, i.e. all those cases where a specific term or a word with restricted usage is translated by a non-terminological, frequently used lexeme:
The smell was already filling the room, a rich hot smell which seemed like an emanation from his early childhood, but which one did occasionally meet with even now,
Kvapas jau buvo pasklidęs po kambarį, aromatingas, stiprus kvapas, tartum atlėkęs iš tolimos vaikystės, bet vis dėlto kartais sutinkamas ir šiandien
At the other side of him stood a man in a white coat, holding a hypodermic syringe.
Kitoje pusėje buvo žmogus su baltu chalatu, jis laikė švirkštą.
Still, in very many cases hapaxes were translated by lemmas used rarely in the text, i.e. two or three times. Taking this into account, rare words could be added to the list of hapaxes and in this way help to increase correspondence significantly.
4. Conclusions
Thus those English hapax legomena that are proper names, translation equivalents or bi-unique correspondences are translated into Lithuanian by hapaxes and therefore help to align the texts precisely. Besides, a considerable part of English hapaxes are translated by rare words. Taking this into account, it can be concluded that hapax legomena are a useful but insufficient basis for text alignment. The effectiveness of alignment with the help of hapaxes depends heavily on the structures of the source and target languages, differences in their lemma structures, the mismatch of lemmatizers and the quality of a particular translation. Even if not highly effective when used alone, hapaxes could be a good supplement to the structural mark-up of texts.
Published in: Proceedings of the Third European Seminar "Translation Equivalence", Montecatini Terme, Italy, October 16-18, 1997, pp. 125-137