In Russian later
|Ogur and Oguz||Russian version needs a translation||
How likely are chance resemblances between languages?
© 1998-99, 2002 by Mark Rosenfelder
http://www.zompist.com/chance.htm - this is a mirror posting
All I's in this posting refer to the author, Mark Rosenfelder. All links in the text refer to the author's website. The method is a scientific tool that helps to bring reality to philological claims, it is aimed toward mass comparison results, but is equally applicable for any linguistic comparison lists. It complements the statistical method that evaluates chances that two words in compared languages belong to the same linguistic phyla, allowing to verify results against chance coincidences.
On sci.lang we are often presented with lists of resemblances between far-flung languages (e.g. Basque and Ainu, Welsh and Mandan, Hebrew and Quechua, Hebrew and every other language, Basque and every other language), along with the claim that such resemblances "couldn't be due to chance", or are "too many" to be due to chance.
Linguists dismiss these lists, for several reasons. Often a good deal of work has gone into them, but little linguistic knowledge. Borrowings and native compounding are not taken into account; the semantic equivalences proffered are quirky; and there is no attempt to find systematic sound correspondences. And linguists know that chance correspondences do happen.
All this is patiently explained, but it doesn't always convince those with no linguistic training-- especially the last point. Human beings have been designed by evolution to be good pattern matchers, and to trust the patterns they find; as a corollary their intuition about probability is abysmal. Lotteries and Las Vegas wouldn't function if it weren't so.
So, even one resemblance (one of my favorites was gaijin vs. goyim) may be taken as portentous. More reasonably, we may feel that one resemblance may be due to chance; but some compilers have amassed dozens of resemblances. Such lists may be criticized on other grounds, but even linguists may not know if the chance argument applies. Could a few dozen resemblances be due to chance? If not, what is the approximate cutoff?
The same question comes up in evaluating the results of Greenbergian mass comparisons; or proposals relating language families (e.g. Japanese and Tibeto-Burman) based on very small numbers of cognates. Again, it would be useful to know how many chance resemblances to expect.
I will propose a simple but linguistically informed statistical model for estimating the probability of such resemblances, and show how to adjust it to match the specific proposal being evaluated.
A trivial model: Abstract languages sharing a phonology
Let's start with a simplified case (we'll complicate it later). We will compare two unrelated languages A and B, each of which has 1,000 lexemes of the form CVC, and an identical semantics and phonology. That is, if there is a lexeme a in A with some meaning M, there will be a lexeme bp in B phonetically identical to a, and a lexeme bs with the same meaning as a.
What is the probability that bp is bs?-- that is, that there is a chance resemblance with a? It can be read off from the phonology of the typical root. Supposing there are 14 consonants and 5 vowels, it is 1/14 * 1/5 * 1/14, or 1 in 980. (This assumes that the vowels and consonants are equiprobable, which of course they are not.) For ease of calculation we'll round this to 1 in 1000.
How many chance resemblances are there? As a first approximation we might note that with a thousand chances at 1 in 1000 odds, there's a good chance of getting one match.
However, this is not enough. How likely is it exactly that we get one match? What's the chance of two matches? Would three be quite surprising? Fortunately probability theory has solved this problem for us; the chance that you'll find r matches in n words where the probability of a single match is p is the binomial probability
(n! / (r! (n-r)!)) pr (1 - p)(n-r)or in this case
(1000! / (r! (1000-r)!)) .001r .999(1000-r)For the first few r:
p(1) = .368
So the probability of between 1 and 6 matches is .368 + .184 + .0613 + .0153 + .00305 + .000506 = .632, or about 63%. It would be improbable, in other words, if we found no exact matches in the entire dictionary. (But not very improbable; p(0), which we can find by subtracting the above p's from 1.0, is 37%.)
Semantic and phonetic leeway
Proffered resemblances are rarely exact. There is always some phonetic and semantic leeway. Either can be seen as increasing the set of words in B we would consider as a match to a given word a in A.
For instance, suppose for each consonant we would accept a match with 3 related consonants, and for each vowel, 3 related vowels. Since we're assuming a CVC root structure, this gives 3*3*3 = 27 words in B which might match any given a.
And suppose for each word a we will accept 10 possible meanings for b. This applies to each of the 27 phonetic matches; so a can now match a pool of 27*10 = 270 lexemes. The probability that it does so is of course 270 in 1000, or .27. Every lexeme in A, in other words, has a better than 1 in 4 chance of having a random match in B!
How many chance resemblances are there now? The same formula can be used, with the revised estimate for p:
(1000! / (r! (1000-r!))) .27r .73(1000-r).There is a significant probability for very high numbers of matches, so we must continue calculating for r well into the hundreds. The results can be summarized as follows:
This looks a lot like a normal distribution, and in fact it is one, if np and n(1-p) are both over 5. (For typical lexicon sizes, the distribution will be normal if p > .01.) The 'expected case'-- the number of matches with the highest probability-- will be np; in the above case, 270, with a probability of .0284.
I will suggest refinements to this model below, but the basic features are in place: a probability for a single match; a calculation for number of expected matches; and an adjustment for phonetic and semantic leeway.
Vocabulary size and other variablesWhat happens as various parameters of the model are changed?
In the model above, vocabulary size is not independent of the sample phonology. If syllables are all CVC, with 5 vowels and 14 consonants, then there are only 980 phonologically possible words.
If we use a larger vocabulary size in the formula, we are conceptually re-using some of these possible words. This might not be a bad model for a language with many homonyms, or if we are simply ignoring some of the phonology of the language (e.g. tones, or final vowels).
We can also try different phonologies. For instance, here are a the numbers of expected matches for a few different word types, again allowing 3 matches per sound, and 10 semantic matches per word:
The number of matches isn't increasing as fast as the vocabulary size. For instance, as we go from CVC to CVCV, we increase the lexicon fivefold (for each CVC syllable, there are 5 CVCV syllables, one with each vowel), while the number of matches only increases threefold (the extra vowel can be matched only 3 ways).
(For the first (CV) case, the expected number of matches is larger than the lexicon size! This is because, with the rules as stated, we're often going to find more than one match per word. We'll come back to this later.)
What about exact matches? Curiously, the number of exact matches averages about 1 for any phonology, and therefore for widely differing vocabulary sizes. The reason is not hard to find: as the lexicon size increases, the chance of a given match goes down, but we get more chances.
(The average of 1 exact match needs to be qualified, because of homonymy: a word can have several meanings, and thus match several possible words. In effect, languages come with a built-in semantic leeway; a handful of exact matches is therefore no surprise.)
What about different phoneme inventories? What if we have 20 consonants and 10 vowels? There are now 4000 possible CVC syllables; yet without varying any other parameters, we would still expect only 270 matches.
But the number of phonetic matches is not really independent of the phonology we use. In a five-vowel system, a phonetic leeway of 3 per sound matches means (e.g.) accepting pat and pit as matches for pet. In a ten-vowel system, might we not also accept pEt (open e) and pöt (rounded e) as well?
Contrariwise, suppose there are only 3 vowels, as in Quechua or Classical Arabic. A 3-vowel leeway means that all vowels match all vowels!
Here's some 20C/10V languages with a few different phonetic leeways (and the same semantic leeway of 10):
The reader may well conclude that with the appropriate choice of parameters, any number of matches is possible! That would be more or less accurate; but note:
A computer modelI've written a computer program which generates two random languages and compares them for matches. Phonology, phoneme inventory, matching algorithm, phonetic leeway, and semantic leeway can all be varied.
You can get the source code for the program here.
The program gives several counts, corresponding to slightly differing match-counting methodologies. For instance, here's the output of a sample run simulating one of the models discussed above: 980 possible words with phonetic leeway of 3 and semantic leeway of 10.
Matches in 980 words:
The program generates two random languages A and B. Then, for each word in language A, the program tests each word in language B. There is obviously a chance that this will result in more than one match for each word in A. The total number of matches is reported in the top left; the value here (290) is close to our expectation based on np (980 * .275 = 270). In statistical terms, this is a binomial probability distribution with replacement.
The bottom row counts only one match for each word in A. So, even if we found three matches for a given word in A, we count only one. So in the example, though there were 290 total matches, there were 256 words in A that had at least one match, and some had more than one.
The right-hand column counts matches if we allow each word in B to be used only once. That is, if any word B is ever counted as a match for a word in A, it's withdrawn from the pool, and no further word in A can match it.
It might seem obvious that we only want one match per word; but again, we must follow the practice of the comparison we're evaluating. A comparison of Basque and Ainu posted to sci.lang, for instance, matched Ainu poro monpeh 'big finger (=thumb)' to erpuru, and poro aynu 'big man (=adult)' to porrokatu 'tired'. Yes, the semantic leeway there is formidable; but the point is that one Ainu word, poro, is being used to match two entirely different Basque words.
It's not hard to find the expected number of words in A with one or more matches (the number in the lower left of the program output). Let's start by subtracting out the cases where we found two matches in B. How many of these will there be?
The probability that word Ai has the same meaning as Bi is of course 1 in n (the lexicon size). If we allow phonetic leeway, we get additional chances for a match; each of these phonetic near-matches is multiplied further by the semantic leeway. In the example above we effectively get 270 chances for each word in A.
This is really another binomial probability problem: we have 270 trials, each with probability 1 in 980, and we want to know the chances for 2 successes. Using the formula given, the chance is (270! / (2! (270-2)!)) (1/980)r (979/980)(270-r) = .0288. Now since there are 980 words in A, we would expect about .0288 * 980 = 28 two-word matches. Since we expected 270 total matches, this leaves about 242 words with one or more matches. This is not far from the 256 found in the sample run.
The same formula will tell us how often a word in A will match three words in B-- the probability is .00262, leading us to expect 2.57 matches. Since these would count three times in the total count and we want them to count only once, we have to subtract another 2 * 2.57 = 5 from that 242, giving us a more accurate estimate of 237. A four-word match has a probability of .00018, which doesn't affect the final number any. In practice it's not really worth correcting for anything beyond two-word matches.)
(Some readers may worry that the numbers from the sample run (290, 256) are 'too far' from the expected values (270, 237). But no, these numbers are well within the variance. If I run the program a hundred times, I get an average of 270.6 total matches, 240.5 words matched.)
The model above assumes that both languages have the same phonology, which obviously isn't usually the case. If you're evaluating a proposed set of resemblances, it's usually fairly evident what the compiler counted as a match. For instance, the Quechua/Semitic comparison discussed below normally just (loosely) matches initial and medial consonants. To estimate p, then, we multiply the probability of a match on initial consonants with that of a match on medial consonants.
To do this, we need to know how many possible consonants there are, and how many of them are considered a match. The latter of course must be taken from the proposed resemblances; if there's enough of them the number of consonants can be too, otherwise we need to find out the language's phonology.
Phonemes don't actually appear with equal frequencies, and this can be important. For instance, Quechua has just three vowels, a/i/u (though allophonic o/e are found in some dictionaries). The chance that the main vowel of a root is an a, however, is not 33.3% but 56%. A match with medial a is therefore correspondingly less surprising. In a sidebar, I've estimated the chances of random matches between Quechua and Chinese taking phoneme frequencies into account.
What if we're only given a word or two-- say, that cute Japanese gaijin / Hebrew goyim pair? Even with one word, we can estimate phonetic leeway segment by segment:
Of course it's worse than that, since there's semantic leeway as well. Goyim means 'nations' (from goy 'nation'); though it's metaphorically used for 'non-Jews', it's certainly not an exact semantic match for a word meaning 'foreigner'. (Gaijin is a borrowing from Chinese wairén 'outside-person'.) I can't emphasize enough that inexact matches greatly multiply the chances for finding random resemblances.
About all that can be done is to emphasize that most actual attempts to find "cognates" accept very considerable semantic leeway, and that this greatly increases the chance of random matches.
It's helpful to see just what a semantic leeway of 10 or 100 meanings looks like. For instance, let's take the word "eat".
If we allow 10 matches, we'd accept something like:
eat, dine, bite, chew, swallow, feed, hungry, edible, meal, meat
If we allow 25 matches, we'd accept:
eat, dine, bite, nibble, chew, munch
If we allow 100 matches, we'd accept:
Ten matches may seem like a lot; but nothing in even the list of 100 matches is a real stretch. The real question is: looking for matches for 'to eat' in A, would our language comparer take a word meaning 'nourish' or 'delicious' or 'gnaw' as a match--indicating that his semantic leeway is more like 100 than like 10? The temptation is almost irresistible.
It's worth pointing out that semantic leeway is really multidimensional. The 100-word list shows several directions we can explore starting from the word 'eat': types of eating; other types of ingestion; types of food; types of feeding or food preparation; qualities or states associated with eating; associated parts of the body; differences in connotation or register; differences in the type of eater; metaphorical extensions. You may accept only a little variation along each dimension, but because there are many dimensions the total variation is high.
(This may suggest a rough method for estimating semantic leeway: count the dimensions in which the words differ. For instance, Greenberg & Ruhlen's mekeli ('nape', from Faai) and amu'ul ('to suck', from Mixe) seems to follow the semantic links nape / (nearby body part) neck or throat / (associated verb) swallow / (another bodily verb) suck; if there are as few as five alternatives at each of these four steps, the semantic leeway is at least 5 x 5 x 5 = 125. (The reader may enjoy tracing how many steps are required to reach their proposed Indo-European cognate, 'milk'.)
Effect of semantic leeway
What happens to the number of expected matches as semantic leeway increases?
We are traipsing through language A, word by word. For each word a, the probability p of finding a match in B is, say, .01. But now suppose the semantic leeway m is 10. So for word a, we have to check for a match 10 times.
What does this mean numerically? It depends on how we count matches.
Below, I'll pursue the first of these alternatives, on the grounds that it most closely resembles the methodology of the comparisons we want to evaluate. As we've seen with the Basque/Ainu comparison, comparers generally don't scruple at using the same word twice. My impression, however, is that this only applies to one of the languages.
In other words, their method is, I take it, to loop through the words of language A and for each one find, if possible, one match in B. This methodology avoids multiple matches per word in A, but doesn't prevent us from using the same word in B more than once. The formulas suggested here model this methodology.
Semantic leeway is not really independent of vocabulary size.
As the vocabulary grows into the thousands, we can expect finer and finer distinctions of meaning; but the language comparer is likely to ignore them. A language of 10,000 words may well distinguish 'eat (of humans)' vs. 'eat (of animals)' vs. 'devour' vs. 'bite' and so on. Yet the language comparer, looking for a match for 'eat (of humans)', is not going to skip over 'eat (of animals)'.
If the comparer is working only with (say) a 200-item wordlist, not much semantic variation will be available. Quite a bit is available in a 2000 or 20,000-word lexicon.
Mismatched lexiconsI've assumed so far that the languages being compared share word meanings. What if they don't?
Let's say we're comparing two lexicons of 1000 words each, and the chance of a single phonetic match is .03. If we know that the meaning of a word in language A is also found in language B, then we can expect about 30 exact semantic matches.
Now let's say that 1/2 of the meanings recorded in A aren't found in B. In effect, as we're comparing two words, there's only a 1 in 2 chance that the meaning we're looking at is even found in the other lexicon. So in our example, we'd expect 15 matches.
Another complication is that words are polysemous, and the particular set of meanings a word has in language A are not likely to be found in language B. At first glance this might be expected to lower the chances for a match; but from what I've seen of alleged resemblances, it increases it. For instance, Quechua simi means 'mouth, word, language'; a comparer is liable to feel free to match on any of these senses. In effect, polysemy brings with it its own increased semantic leeway.
We will need to know:
Greenberg & RuhlenGreenberg & Ruhlen are comparing multiple languages. For a truly rigorous judgment we would need phoneme and root structure frequency analyses, as well as vocabulary size estimates, for each language concerned. This information is not available; but fortunately G&R's resemblance list alone is enough to deduce most of the parameters required.
To estimate the amount of phonetic leeway they allow, we can simply count the correspondences they allow. For instance, the vowels seem to be completely ignored. The middle consonant can be one of l, ly, lh, n, r, or zero-- (at least) 6 possibiliites. The end consonant can be one of g, j, d, k, q, q', kh, k', X, zero-- (at least) 10 possibilities.
The initial consonant must, it seems, be m. There is a reason for that, I think. Our brains, as psycholinguists have found, respond very strongly to initials. Find words in A and B that begin with the same letter, and the battle is half-won-- the brain is predisposed to find them very similar. Indeed, posters to sci.lang sometimes offer "resemblances" that correspond in nothing but the first letter. (The fact that they find the coincidence remarkable is more evidence for human beings' lousy intuition about probabilities.)
(It's also worth pointing out that a very high 7% of all words in my Quechua sample text begin with m. This initial isn't particularly common in Chinese, but one has to wonder if this choice of initial doesn't increase the possibility of false cognates, simply by being more common.)
Let's consider (arbitrarily) that G&R's languages have about 25 phonemes each. Then the minimum probability of a single, exact-meaning random match is 1/25 * 6/25 * 10/25 or .004. It must be emphasized that this is a minimum; if we saw more of G&R's potential cognates we might find that they allow more leeway than this. And of course it's only a loose estimate, because it's an abstract measure intended to cover any two arbitrary languages.
As for semantic matches, we can also form a lower bound on the semantic leeway they allow by counting the separate meanings in their glosses. I count at least 12; but I would consider it highly misleading not to at least double this number, and based on our lists of words related to 'eat' I'd consider 100 to be quite reasonable. If you can accept 'breast', 'cheek', 'swallow', 'drink', 'milk', and 'nape of the neck' as cognates, it's hard to seriously claim that you wouldn't accept 'eat', 'mouth', 'guzzle', 'vomit', 'suckle', and 'stomach'.
Let's say the semantic leeway is 25 words. Then the probability that we'll find at least one match for a word is 1 - ((1 - p)m = 1 - 0.99625 = .0953.
How many random matches can R&G expect to find?
Let's assume the lexicon is 2000 words-- hopefully a good estimate of the size of the lexicons available for many of the obscure languages they work with. Our formula produces (omitting ranges with minimal probabilities):
p( 126 to 150 ) = .0080
In other words we should expect close to 200 random matches, or a tenth of the lexicon.
Of course, that's between two languages. What's the likelihood of finding matches across languages?
If the languages are related, of course, we would expect many non-random resemblances (though with G&R-style phonetic and semantic laxity we will find many random matches as well). We should not be cowed by the size of G&R's cognate list; there are multiple entries per family. They are not really comparing hundreds of languages, but a much smaller number of language families.
There's no general number of random matches to expect this way, since language families vary in number of languages, and in how similar their lexicons are. But even closely related languages, like Spanish and French, have significant numbers of non-cognate words (chien vs. perro, chapeau vs. sombrero, manger vs.comer, regarder vs. mirar, etc.). So if you don't find enough resemblances in language A1, you can try A2, A3, etc. Or even search dialects for cognates, as it seems they've done with Quechua and Aymara. The list of random resemblances between families will be larger than the list of random resemblances between languages.
Another key question is how many families they're comparing. If G&R are right about certain high-level categories, such as Amerind and Eurasiatic, this turns out to be "three or four". If families A and B have 500 random resemblances, they'll have about 125 with family C and 31 with family D-- a quite respectable listing for "proto-World".
There are a number of obvious problems with this list.
The above criticisms cannot answer this question; but the statistical model developed here can.
First let's estimate the degree of phonetic laxness the compiler is allowing. I'll use my Quechua frequency table, but I don't have similar data for the Semitic languages.
What's the level of semantic laxness? This is hard to say. Some matches are quite close (burp, copper, basket, son, frog, owl); others are fairly remote (leader/priest; snake/teach, jerky/drought, pancreas/belly; sickness/afflict; door/open; sweet tooth/glutton; small thing/suck; god/father; build/disturbed; domineering/youthful; wall/fortress; stone seat/throne room; two herb names). I think it would be quite conservative to assume 1 Quechua word can match 20 Semitic words in the compiler's mind.
Given this semantic leeway, the probability of a match on a single Quechua word is 1 - ((1 - p)m = 1 - 0.91120 = .845. That's quite telling right there-- it means that, given his phonetic and semantic laxness, the comparer is ordinarily going to find a random match for almost every Quechua word.
My own Quechua dictionary has about 2000 non-Spanish roots. Our comparer will very likely find more than 1500 chance resemblances.
With a vowel match-- which, I emphasize, the comparer bothers with only half the time-- the match probability becomes .302, for roughly 600 chance resemblances.
It's just not that hard to match just two consonants. How about the three-consonant matches? Given the initial and medial consonant match probabilities calculated above, the probability of a single 3-consonant exact-semantic match is .17 * .524 * .524 = 0.047. With the same semantic leeway of 20 words, the probability of a match per word becomes 1 - 0.95320 = .618.
Our formula gives:
p( 1101 to 1200 ) = .051
Is it a mere coincidence that there are so many correspondences between these languages? No, it isn't; what would be really surprising would be if there weren't a thousand more of the same quality.
With a vowel match, the match probability becomes .171 and we'd expect over 300 resemblances. However, there are only eight words in the list that can be described as matching three consonants and a vowel. Three of those are ruined by including -na as part of the root, and the remaining five include such uninspiring matches as q/t and ll/g.
Characteristics of the modelWhat we have seen offers quantified support for what many linguists would have suspected: the number of chance resemblances soars as phonetic and semantic matches are loosened.
The actual variation is almost linear. That is, allow word a to match n words phonetically and m words semantically, and you very nearly increase the expected number of matches by n * m.
Thus, the results depend almost entirely on the value of n and m, and small variations in either n or m become very large variations in the number of matches.
Calculating an expected number of matches, then, it is essential to estimate phonetic and semantic laxness from the claim being evaluated. There is no such thing as a general "number of random chances expected". It depends almost entirely on what you count as a match.
Equally, it's necessary to carefully evaluate any probability calculations offered by the comparer. Comparers typically present calculations for extremely narrow matches, and then give very loose matches in their word lists. And they often ignore semantic looseness entirely. Such errors are not trivial in this game; they can produce numbers that are off by several orders of magnitude.
Why are we so easy to fool?
Why do people fool themselves so easily in this area? Why is it so hard for even highly intelligent people to convince themselves that random matches will be few? I think there's several reasons.
A computer program for calculating matchesHere's a very simple computer program for applying the probability calculation described above. It will work for most C compilers, but if the probability is very high, you'll need a very robust double data type, which I have on the Mac but not Windows.
The important parameters are the lexicon size n, the probability of a phonetic match p, and the semantic leeway semN (how many meanings count as a match).
The program uses some algebraic tricks to simplify the calculation of the binomial probability. For instance, we're always dividing by r!, but we always calculate the probability for each r starting at one. So we can start with rfac = 1.0, and just multiply by r each iteration to get r!.
The rest of the factorial calculation is n! / (n-k)!. For r = 1, this is n! / (n - 1)! which can be rewritten n (n-1)! / (n-1)!; the (n-1)! terms cancel, leaving us with just n. For r = 2, we have n! / (n - 2)!, which is the same as n (n-1) (n-2)! / (n-2)!, which reduces to n (n-1). That in turn is just (n-1) times the value we got for r = 1. It works out that factor for each value of r is just (n - r + 1) times the value for the previous iteration, and that's how nfac_nrfac is calculated below.
#include <stdio.h>Here's the program output generated when the parameters are set as shown. Note that stopat and bin are reporting parameters. stopat = 70 told the computer to calculate propabilities for r = 1 to 70; bin = 10 told it to report these probabilities in groups of 10. With a little trial and error you can adjust these values for the most useful reporting of results.
The probability reported at the top is the phonetic probability, adjusted to reflect the given semantic leeway.
When the probability (after the semantic adjustment) is very low, .002 or less, you should set the bin size to 1 for best results.
AcknowledgmentsThanks to the people on sci.lang, particularly Jacques Guy, Mark Hubey, Ross Clark, Mikael Thompson, and Brian M. Scott, for discussions which led to the writing of this paper.
© 1998-99, 2002 by Mark Rosenfelder
In Russian later
|Ogur and Oguz||