Tonight I’ve been running Sukhotin’s algorithm against some of the Voynich transcriptions.
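In brief, Sukhotin’s algorithm identifies vowels in text. It accepts text, counts how often each letter sits next to every other letter, and starts by treating every letter as a consonant. It then repeatedly promotes the letter that appears most often next to the remaining consonants to a vowel, on the assumption that vowels and consonants tend to alternate, and stops when no remaining letter qualifies.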
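For anyone who wants to try this at home, here is a minimal sketch of the algorithm as it is usually described. This is not the program I used for the results on this page. The function name `sukhotin` and the decision to count neighbours only inside words (not across spaces) are my assumptions for illustration.

```python
from collections import defaultdict

def sukhotin(words):
    """Return the symbols classified as vowels, most likely first.

    `words` is an iterable of words; each word can be a string or a
    list of glyph tokens. Illustrative sketch, not a reference version.
    """
    # Symmetric counts of how often two *different* symbols are neighbours.
    adj = defaultdict(lambda: defaultdict(int))
    for word in words:
        for a, b in zip(word, word[1:]):
            if a != b:  # doubled symbols are ignored (zero diagonal)
                adj[a][b] += 1
                adj[b][a] += 1

    # Every symbol starts as a consonant; its score is its row sum.
    sums = {s: sum(row.values()) for s, row in adj.items()}

    vowels = []
    while sums:
        best = max(sums, key=sums.get)
        if sums[best] <= 0:
            break  # no remaining symbol looks like a vowel
        vowels.append(best)
        del sums[best]
        # Being next to a known vowel now counts against the others.
        for s in sums:
            sums[s] -= 2 * adj[s][best]
    return vowels

print(sukhotin(["banana", "cabana", "panama"]))  # → ['a']
```

Because it only looks at neighbour counts, a consonant that sits between other consonants often enough can still score as a vowel.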
Yes, I’m aware that Jacques Guy ran a series of experiments on this in the past. He identified o and c as likely vowels. That seems to have been back in 1992, and there doesn’t seem to have been much else done with this algorithm since then. (I only have an abbreviated copy of his article, so please send me the full thing if anyone has it!) The transcriptions available are likely to have changed substantially since then.
It’s not, naturally enough, foolproof. It just works on averages, and says this letter appears frequently enough in the correct positions to be a vowel. Some letters will fool it: in English, for example, it has a nasty habit of flagging T as a vowel (think of words like The, Three...).
In a random sample of 1181 English words extracted from a Project Gutenberg ebook, the algorithm identified the following as vowels (always ordered by most likely to least likely):
For a random sample of Spanish, I took an article from ElMundo.com which was 3550 in length, and the algorithm identified the following as vowels:
I did think about introducing a condition that if a vowel had a likelihood of less than x% of the most likely it would be dropped, but I can’t be bothered to experiment with that at the moment.
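For what it’s worth, that cut-off would only take a few lines. The sketch below is untested against real data. It assumes the algorithm hands back a score alongside each candidate vowel. The name `drop_weak_vowels`, the `min_fraction` parameter (the “x%”) and the example scores are all invented for illustration.

```python
def drop_weak_vowels(scored, min_fraction=0.25):
    """Keep candidates whose score is at least min_fraction of the best.

    `scored` is a list of (glyph, score) pairs, in any order.
    """
    if not scored:
        return []
    top = max(score for _, score in scored)
    return [(g, s) for g, s in scored if s >= min_fraction * top]

# Made-up scores, purely to show the effect of the threshold.
candidates = [("e", 100), ("a", 80), ("t", 10)]
print(drop_weak_vowels(candidates, min_fraction=0.25))
# → [('e', 100), ('a', 80)]
```

Picking a sensible value for the threshold is exactly the experimenting I haven’t done.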
In any case, I took small sub-samples (sometimes just a dozen words) from each of these texts and got broadly the same results. Sometimes additional letters crept in, but the main list was always there.
Having established all this, let’s have a look at what happens when we apply Sukhotin’s algorithm to some of the Voynich pages. All transcripts were in basic EVA from the VIB with suspicious spaces removed. Since my program strips out all ! and * signs, we’re sometimes jumping over illegible glyphs, but that’s the price you gotta pay to work with the transcriptions.
Updated! It’s been pointed out that the first draft of this page incorrectly calculated “combo glyphs” in EVA. A schoolboy mistake (told you I was working late at night!) which has now been corrected. The rest of this page has been analysed afresh and updated. I now, in accordance with general theory, treat ch, sh, cth, cph, ckh and cfh as a single glyph each. I did consider including “ii”, or derivatives such as “iin” and “iir”, but decided against it.
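To make the combo glyph handling concrete, here is a rough sketch of how an EVA word could be split into glyphs before counting neighbours. It is not my actual code. The names `COMBOS`, `GLYPH_RE` and `eva_glyphs` are made up, and it assumes lowercase basic EVA with only ! and * as filler signs.

```python
import re

# EVA sequences treated as one glyph each. Longer ones are listed first
# as a general precaution, since regex alternation tries them in order.
COMBOS = ["cth", "cph", "ckh", "cfh", "ch", "sh"]
GLYPH_RE = re.compile("|".join(COMBOS) + "|[a-z]")

def eva_glyphs(word):
    """Split one EVA word into a list of glyphs.

    '!' and '*' are removed first, so the glyphs on either side of an
    illegible position end up counted as neighbours.
    """
    cleaned = word.replace("!", "").replace("*", "")
    return GLYPH_RE.findall(cleaned)

print(eva_glyphs("chol"))         # → ['ch', 'o', 'l']
print(eva_glyphs("ckhy!d*aiin"))  # → ['ckh', 'y', 'd', 'a', 'i', 'i', 'n']
```

Anything that is not a lowercase letter is silently skipped, and “ii” stays as two separate glyphs, in line with the decision above.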
Let’s look at the following vowels that it found (Vowels ordered by the frequency of occurrence, most frequent first):
- PAGE 1r: Vowels are: a,o,y,n,s,t
- PAGE 1v: Vowels are: o,y,a
- PAGE 2r: Vowels are: a,y,o,n,t
- PAGE 2v: Vowels are: a,o,y,n,t
- PAGE 3r: Vowels are: o,a,y,n
- PAGE 3v: Vowels are: o,a,y,t,n,s
- PAGE 4r: Vowels are: a,y,o,n,t
- PAGE 4v: Vowels are: o,y,i,s
- PAGE 5r: Vowels are: o,y,a
- PAGE 5v: Vowels are: o,a,n,y,g,s
- PAGES 1r to 10v : Vowels are: a,o,y,n,t,s,g
These pages are all fairly short in text length, with the exception of folio 1r. I thus ran the checker against all of the first 10 folios together and saw that the vowels were much the same, although we got seven back.
So I took a few random pages from further on in the book:
- f66r: Vowels are: o,a,y,n,e (stripped out individual glyphs that appear)
- f66v: Vowels are: a,o,y,t,e,n
- f67r1: Vowels are: a,o,y,t,n
- f100r: Vowels are: o,a,y,n,c
- f93v: Vowels are: o,y,a,t,n,s
- f115r: Vowels are: a,o,y,t,n (this was a very text heavy page)
- f110r to f115r: Vowels are: a,o,y,n
Longer text-heavy pages seem to give around five vowels, shorter pages between three and six. Interestingly enough, the number of vowels is similar (about 5.4) to the average length of a Voynich word (5.5 glyphs).
However – on natural text, the number of vowels returned by Sukhotin is generally higher than for Voynichese. English returns 8-10 “vowels” depending on the text, Spanish 7-8 in my previous experiments. Of course, English has 5+1(y) vowels, Spanish only 5.
So why is Voynich returning a much lower vowel count? After all, if we assume a 30% false positive rate (roughly what the English and Spanish figures above suggest) then we’re only seeing 3 or 4 real vowels in the Voynichese: about 5 × 0.7 ≈ 3.5. The obvious answer is that the glyphs being used as “vowels” have a much tighter grammatical function than in natural languages, as we see with other grammatical “rules” that have been identified (see the work of Stolfi or Rene for more) as well as the CLS rule previously described by Cham and myself.
This is only my first dabble into this angle of work but it appears at first sight to be backing up the deductions already made about the Voynichese “language”: tight grammatical rules and a limited glyph interaction with strong rules on placement.