V. Stetsyuk - Statistical Linguistics

We shall calculate the probability of a chance coincidence for a concrete case between the Chuvash and Slavic languages. To illustrate the simple mathematical calculations involved in computing linguistic probabilities, we take the Chuvash word salat "to throw" and the Slovak sálat’ and Cz. sálat "to throw".

The first step is purely linguistic: we determine the properties specific to word formation in the Chuvash language by establishing how often particular letters occur in given positions of Chuvash words. We compile a database of all five-letter words of the type cvcvc, where c is any consonant and v is any vowel, such as kalax, salax, palax, valax, etc. This database contains 2,100 Chuvash words.
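The selection of cvcvc words can be sketched in a few lines of Python. This is only an illustration: the vowel inventory below is an assumed Latin transliteration, and the sample word list is a toy stand-in, not the 2,100-word database described in the text.

```python
# Assumed vowel inventory for a Latin transliteration (illustrative only).
VOWELS = set("aeiouyăĕü")

def is_cvcvc(word: str) -> bool:
    """True if the word has five letters in the pattern consonant-vowel-consonant-vowel-consonant."""
    if len(word) != 5 or not word.isalpha():
        return False
    # Positions 1 and 3 (0-based) must be vowels; positions 0, 2, 4 must not be.
    return all((ch in VOWELS) == (i % 2 == 1) for i, ch in enumerate(word.lower()))

# Toy sample, not the real database.
sample = ["salat", "kalax", "palax", "atlas", "sal", "valax"]
cvcvc_words = [w for w in sample if is_cvcvc(w)]
print(cvcvc_words)
```

Here "atlas" is rejected because it starts with a vowel, and "sal" because it is too short.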

In the database of 2,100 Chuvash words, approximately 210 words begin with the letter s, so the probability that a Chuvash five-letter word of the type cvcvc begins with s is 210/2,100 = 1/10.

Repeating the count for the letter a in second position in the cvcvc words, we find approximately 350 words, so the probability is 350/2,100 = 1/6.

Repeating the count for the letter l in third position, we find approximately 175 words, so the probability is 175/2,100 = 1/12.

Repeating the count for the letter a in fourth position, we find approximately 260 words, so the probability is 260/2,100, or approximately 1/8.

Repeating the count for the letter t in last position, we find approximately 210 words, so the probability is 210/2,100 = 1/10.
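The positional counts described above could be obtained from a word list roughly as follows. The function name and the toy word list are hypothetical, chosen only to show the counting step; the real figures (210, 350, 175, 260, 210 out of 2,100) come from the author's database.

```python
from collections import Counter
from fractions import Fraction

def positional_frequencies(words, position):
    """Relative frequency of each letter at a 0-based position across the given words."""
    counts = Counter(w[position] for w in words)
    total = len(words)
    return {letter: Fraction(n, total) for letter, n in counts.items()}

# Toy list of cvcvc-shaped strings (illustrative, not real data).
toy_words = ["salat", "kalax", "salax", "palax", "valax", "sular"]

first = positional_frequencies(toy_words, 0)
second = positional_frequencies(toy_words, 1)
print(first["s"])   # 3 of the 6 toy words begin with s
print(second["a"])  # 5 of the 6 toy words have a in second position
```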

To estimate the probability that the word salat occurs in the Chuvash language, we multiply the individual probabilities, treating the letter positions as independent: 1/10 × 1/6 × 1/12 × 1/8 × 1/10 = 1/57,600.
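The multiplication can be checked with exact fractions. The sketch below simply takes the five rounded positional probabilities from the text and multiplies them; the variable names are illustrative, and the product is meaningful only under the assumption that the letter positions are independent.

```python
from fractions import Fraction
from math import prod

# Rounded positional probabilities from the text: s, a, l, a, t.
position_probs = [
    Fraction(1, 10),  # s in first position
    Fraction(1, 6),   # a in second position
    Fraction(1, 12),  # l in third position
    Fraction(1, 8),   # a in fourth position
    Fraction(1, 10),  # t in fifth position
]

phonetic_prob = prod(position_probs)
print(phonetic_prob)  # 1/57600
```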

The next step is a semantic calculation. We calculate the probability that the word salat would have a meaning close to "to throw". We divide the 2,100 Chuvash words in our list into semantic groups of words matching a general criterion. This division is subjective, since the borders between semantic fields are always somewhat blurred. However, dividing the database of 2,100 words into 100 separate semantic groups is enough to keep the overlap between the semantic fields of different groups sufficiently small. Then the probability that the Chuvash word salat has, by random accident, a meaning close to "to throw, scatter, move briskly, fly out", etc. is equal to or less than 1/100. Since dividing the 2,100-word vocabulary into 100 groups implies that an average group contains an exaggerated 21 synonyms, this estimate is extremely conservative, by at least a factor of 5; dividing the vocabulary into a still conservative 500 groups would ensure acceptable semantic separation and lower the probability estimate to a conservative 1/500.

Combining the phonetic and semantic probabilities, the extremely conservatively calculated probability that a five-letter word of the type cvcvc, phonetically and semantically identical to the Slovak sálat’ and Cz. sálat "to throw", emerges in the Chuvash language by chance is 1/57,600 × 1/100 = 1/5,760,000. A more realistic estimate, with 500 semantic groups, is approximately 1/25,000,000 (1/57,600 × 1/500 = 1/28,800,000). This probability applies to the case when the phonetic coincidence of the words is nearly exact, as in the case of Ch. salat vs. Sl. sálat’ and Cz. sálat.
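As a rough check of the combined figures, the phonetic estimate can be multiplied by each semantic estimate. The names below are illustrative. Note that with 500 groups the exact product is 1/28,800,000, which the text appears to round to about 1/25,000,000.

```python
from fractions import Fraction

phonetic_prob = Fraction(1, 57600)  # phonetic estimate from the text
semantic_100 = Fraction(1, 100)     # 100 semantic groups (extremely conservative)
semantic_500 = Fraction(1, 500)     # 500 semantic groups (conservative)

combined_100 = phonetic_prob * semantic_100
combined_500 = phonetic_prob * semantic_500

print(combined_100)  # 1/5760000
print(combined_500)  # 1/28800000

# Average group size implied by 100 groups over 2,100 words.
print(2100 // 100)   # 21
```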

Following the above procedure, we get a similar probability value for each phonetic change of the first letter s to z, sh, ch, j, jd, etc. Accepting 6 modifications as a conservative allowance, the probability accounting for these possibilities is the sum of the probabilities of the individual cases, 6/5,760,000. Similarly, allowing 6 modifications of the second letter a to o, e, u, i, etc. gives a probability of 6/5,760,000. With a liberal allowance for changes to four of the five letters, we have 6⁴ = 1,296, or approximately 1,000, phonetic siblings. Including all the siblings, the estimated probability of a chance phonetic and semantic coincidence is 1/5,760 (extremely conservative) or 1/25,000 (conservative).
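The allowance for phonetic variants amounts to summing the single-word probability over all variants, which gives an upper bound on the chance of matching any one of them. A minimal sketch, using the text's figures and illustrative names, compares the exact count 6⁴ with the rounded figure of 1,000 used in the text:

```python
from fractions import Fraction

exact_match_prob = Fraction(1, 5760000)  # extremely conservative estimate from the text

variants_per_letter = 6   # allowed modifications per letter (text's allowance)
letters_allowed = 4       # letters allowed to vary, out of five

siblings_exact = variants_per_letter ** letters_allowed  # 6**4 = 1296
siblings_rounded = 1000                                   # round figure used in the text

# Summing equal probabilities over the siblings.
prob_rounded = siblings_rounded * exact_match_prob
prob_exact = siblings_exact * exact_match_prob

print(siblings_exact)  # 1296
print(prob_rounded)    # 1/5760
print(prob_exact)      # 9/40000, about 1/4444
```

Using the exact count of 1,296 siblings rather than 1,000 makes the chance estimate slightly larger, about 1/4,444 instead of 1/5,760.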

If we have a few such coincidences, the probability of their joint random occurrence in different languages, assuming they are independent, is a decimal fraction with a dozen or more zeroes after the decimal point. For two words, the probability is 1/5,760,000 × 1/5,760,000 = 1/33,177,600,000,000; for three words, 1/191,102,976,000,000,000,000; and so on. In practice this means that if two words of five or more phonemes in unrelated languages show a good phonetic and semantic coincidence, one of them was somehow borrowed.
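The figures for several coinciding words are simply powers of the single-word probability, which holds only if the coincidences are treated as independent. A short sketch with an illustrative helper name:

```python
from fractions import Fraction

single = Fraction(1, 5760000)  # single-word estimate from the text

def joint_probability(p: Fraction, n_words: int) -> Fraction:
    """Probability that n_words independent coincidences all occur by chance."""
    return p ** n_words

two = joint_probability(single, 2)
three = joint_probability(single, 3)

print(f"1/{two.denominator:,}")    # 1/33,177,600,000,000
print(f"1/{three.denominator:,}")  # 1/191,102,976,000,000,000,000
```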

A minor caveat to the above statement is that neither word should be onomatopoeic, since onomatopoeia could hypothetically cause similar words to emerge independently in different languages. For example, the widespread Slavic word duda, dudka "a wind pipe" closely matches the Chagat. and Turk. düdük "a pipe". Miklosich and Berneker considered this Slavic word to be borrowed from Türkic, but Vasmer and Brückner believe that the close sound rendition of these onomatopoeic words is a "mere chance" (Vasmer M., 1964, Vol. 1, 550). Clearly, these doubts about a Slavic loanword from Türkic may be considered a reasonable hypothesis, and therefore the Slavic duda cannot be counted as an unquestioned loanword, even though, as a sanity check, no other unrelated language of the world created a name for the pipe like duda, dudka, düdük, etc.

Statistical Linguistics Phonetical and Semantical Concurrence