3. Quantitative Analysis

    The exploration of corpora is most often connected with the assignment of frequencies to language phenomena. For example, one may count the relative frequencies of all the verbs, or the number of occurrences of the article "the", and so on. This kind of analysis is referred to as quantitative analysis of corpora. It may be contrasted with qualitative analysis of corpora, which does not try to assign frequencies to language phenomena. In qualitative research, corpora are used only to provide examples of particular phenomena.
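
    As an illustration of the counting described above, the following minimal sketch computes absolute and relative frequencies for a toy text. The sample sentence, the regular-expression tokenizer, and the function names tokenize and relative_frequencies are assumptions made for this example only; a real corpus study would normally rely on a proper tokenizer and part-of-speech tagger.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens.

    This is a deliberately naive tokenizer (an assumption for the sketch).
    """
    return re.findall(r"[a-z']+", text.lower())

def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the sample."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

# Hypothetical toy "corpus"; real corpora contain millions of tokens.
sample = "The cat sat on the mat. The dog saw the cat."
tokens = tokenize(sample)
counts = Counter(tokens)
rel = relative_frequencies(tokens)

print("occurrences of 'the':", counts["the"])
print("relative frequency of 'the':", round(rel["the"], 3))
```

    Relative frequencies, unlike raw counts, can be compared across samples of different sizes, which is why they are usually the quantity reported.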

    Quantitative analysis allows generalization so long as valid sampling and significance techniques have been employed. It thus enables one to separate the wheat from the chaff: to discover which phenomena are likely to be genuine reflections of the behaviour of a language or its variety and which are merely chance occurrences. It also gives a precise picture of the frequency or rarity of particular phenomena and hence, arguably, of their relative regularity or irregularity (McEnery and Wilson 1996). On the other hand, quantitative analysis tends to ignore rare phenomena of the language. One reason for this is that many significance techniques are not reliable for low frequencies.
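
    To make the point about significance and low frequencies concrete, here is a sketch of one commonly used technique, the Pearson chi-square test on a 2x2 table, applied to the frequency of a word in two corpora. The choice of test, the function name chi_square_2x2, and all counts are assumptions for illustration; the source does not prescribe a particular technique. The threshold 3.841 is the standard critical value for one degree of freedom at the 0.05 level, and the requirement that expected counts be at least 5 is a conventional rule of thumb.

```python
def chi_square_2x2(freq_a, size_a, freq_b, size_b):
    """Pearson chi-square for a word's frequency in two corpora.

    Returns (statistic, smallest expected cell count).
    """
    total = size_a + size_b
    word_total = freq_a + freq_b
    if size_a <= 0 or size_b <= 0 or word_total == 0 or word_total == total:
        raise ValueError("every row and column of the table must be non-empty")

    # Rows: corpus A, corpus B. Columns: the word, all other tokens.
    observed = [
        [freq_a, size_a - freq_a],
        [freq_b, size_b - freq_b],
    ]
    row_totals = [size_a, size_b]
    col_totals = [word_total, total - word_total]

    statistic = 0.0
    min_expected = float("inf")
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            min_expected = min(min_expected, expected)
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic, min_expected

CRITICAL_0_05 = 3.841  # chi-square critical value, 1 degree of freedom

# Hypothetical counts: a frequent word and a rare word in two 10,000-token samples.
for label, fa, fb in [("frequent", 120, 60), ("rare", 3, 0)]:
    stat, min_exp = chi_square_2x2(fa, 10_000, fb, 10_000)
    reliable = min_exp >= 5  # rule of thumb, not a strict law
    print(label, round(stat, 2), "significant:", stat > CRITICAL_0_05,
          "expected counts adequate:", reliable)
```

    In the second case the smallest expected count falls well below 5, so the statistic should not be trusted whatever its value. This is the sense in which rare phenomena tend to fall outside the reach of such tests.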

    Statistical methods require grouping and classification of language phenomena. For statistical purposes it is necessary to decide strictly whether an item belongs to class A or not. However, we know that language phenomena most often do not belong strictly to one particular class; they are ambiguous. Nor is the computer able to distinguish all language phenomena. Consequently, the data is often idealized by the corpus researcher, and sometimes the researcher is forced to make a decision which is not 100 per cent reliable.

    It is worth noting that quantitative analysis is much more than simple counting. It uses sophisticated statistical techniques in order to give reliable evaluations of findings in text corpora. Quantitative analysis assumes that language is inherently probabilistic (Halliday 1991, p. 31). Consequently, linguistic phenomena can be assigned relative probabilities, and thus we may evaluate their occurrence with some degree of certainty. Many of these techniques are successfully used in other disciplines such as medicine, economics and sociology, but their application in linguistic studies has often been criticized or judged inappropriate.

    The most famous critic of quantitative analysis is Chomsky. His most frequently quoted words are these:

Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [based upon it] would be no more than a mere list. (quoted in Leech 1991, p. 8)

    Chomsky pointed to a very important issue in corpus linguistics, namely the representativeness of corpora. The efficiency and reliability of statistical evaluation depend crucially on the representativeness of the corpus. Corpora may be deceptive if they are sampled inappropriately. McEnery and Wilson (1996, p. 64) emphasized that, before making any claims about findings in a corpus, it is very important to remember that the corpus is a sample of a much larger population (a genre or variety) and thus to know as clearly as possible the limits of that population. As Leech (1991, p. 11) has aptly pointed out, a corpus differs from a mere archive, since it is designed or required for a particular representative function. So corpora should be as representative as possible of the larger population which is studied. Only then may the generalizations and findings be considered reliable representations of particular phenomena.

    Even though the size of a corpus does not ensure its representativeness, it has been observed that small corpora are representative only of high-frequency linguistic phenomena. However, today's computerized corpora run to many millions of words. Such an enormous amount of text can be representative even of quite rare linguistic phenomena.
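
    The relation between corpus size and rare phenomena can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, a phenomenon that occurs at a rate of 2 per million tokens and treats tokens as independent, which real text is not; the function names and the figures are hypothetical.

```python
def expected_occurrences(rate_per_million, corpus_size):
    """Expected number of hits for a phenomenon in a corpus of the given size."""
    return rate_per_million * corpus_size / 1_000_000

def chance_of_at_least_one(rate_per_million, corpus_size):
    """Probability of seeing the phenomenon at least once.

    Assumes independent tokens (a simplification made for this sketch).
    """
    p = rate_per_million / 1_000_000
    return 1 - (1 - p) ** corpus_size

RATE = 2  # hypothetical: 2 occurrences per million tokens

for size in (100_000, 1_000_000, 100_000_000):
    print(f"{size:>11,} tokens: "
          f"expected hits = {expected_occurrences(RATE, size):g}, "
          f"P(at least one) = {chance_of_at_least_one(RATE, size):.3f}")
```

    Under these assumptions a 100,000-word corpus will most likely not contain the phenomenon at all, whereas a corpus of 100 million words should yield a couple of hundred instances, enough for frequency-based statements to become meaningful.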

    It must also be mentioned that quantitative research is not an end in itself. The purpose of quantification is to inspire new ideas and new insights. It is perhaps true that quantitative research very often just confirms or rejects facts that were found by purely rationalistic methods, but it is also true that the quantitative approach has found facts that could not have been discovered in any other way.