Tonight I’ve been running Sukhotin’s algorithm against some of the Voynich transcriptions.
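In brief, Sukhotin’s algorithm identifies vowels in text. It counts how often each letter appears next to every other letter, takes the letter with the most neighbours to be a vowel, and then repeats the process on the remaining letters, on the assumption that vowels mostly sit next to consonants rather than next to each other.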
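For anyone who wants to follow along, here is a minimal Python sketch of the usual textbook formulation of Sukhotin's procedure. It is not the program used for the results on this page; the function name `sukhotin_vowels` and the alphabetical tie-breaking rule are assumptions made for illustration. It expects a sequence of symbols with spaces already removed, so adjacency is counted across word breaks.

```python
from collections import defaultdict


def sukhotin_vowels(symbols):
    """Return the symbols classed as vowels, in the order they were picked.

    Illustrative sketch only; not the program behind the results on this page.
    """
    symbols = list(symbols)

    # Symmetric adjacency counts. Pairs of identical symbols are ignored,
    # which keeps the diagonal of the matrix at zero.
    adj = defaultdict(lambda: defaultdict(int))
    for a, b in zip(symbols, symbols[1:]):
        if a != b:
            adj[a][b] += 1
            adj[b][a] += 1

    # Every symbol starts out as a consonant; its score is its row sum.
    scores = {s: sum(adj[s].values()) for s in set(symbols)}

    vowels = []
    while scores:
        # Highest remaining score; ties broken alphabetically (an assumption).
        best = max(sorted(scores), key=scores.get)
        if scores[best] <= 0:
            break
        vowels.append(best)
        del scores[best]
        # Contacts with the new vowel now count against the remaining symbols.
        for s in scores:
            scores[s] -= 2 * adj[s][best]
    return vowels


print(sukhotin_vowels("banana"))  # ['a']
```

The input can be a plain string or a list of glyph tokens, so multi-character glyphs can be passed in as single items.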
Yes, I’m aware that Jacques Guy ran a series of experiments on this in the past. He identified o and c as likely vowels. That seems to have been back in 1992, and there doesn’t seem to have been much else done with this algorithm since then. (I only have an abbreviated copy of his article, so please send me the full thing if anyone has it!) The transcription I’m using is likely to have changed substantially since then.
It’s not, naturally enough, foolproof. It just works on averages, and says this letter appears frequently enough in the correct positions to be a vowel. Some letters will fool it – in English, for example, it has a nasty habit of flagging T as a vowel (think of words like The, Three….).
In a random sample of 1181 English words extracted from a Project Gutenberg ebook, the algorithm identified the following as vowels (always ordered by most likely to least likely):
e,a,o,i,t,y,u,g,w,z
For a sample of Spanish, I took an article from ElMundo.com which was 3550 in length, and the algorithm identified the following as vowels:
e,a,o,i,u,y,z
I did think about introducing a condition that if a vowel had a likelihood of less than x% of the most likely it would be dropped, but I can’t be bothered to experiment with that at the moment.
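A cut-off like that might look something like the sketch below. It assumes the checker can hand back each candidate vowel together with a numeric score (for instance its adjacency sum at the moment it was picked). The `(glyph, score)` pairs, the function name `drop_weak` and the 10% default are all hypothetical, and none of this has been tested against the samples above.

```python
def drop_weak(scored, min_ratio=0.10):
    """Keep only candidates scoring at least min_ratio of the best candidate.

    `scored` is a list of (glyph, score) pairs, best first. Both the input
    format and the default threshold are assumptions for illustration.
    """
    if not scored:
        return []
    top = scored[0][1]
    return [glyph for glyph, score in scored if score >= min_ratio * top]


# Made-up scores, purely to show the call; not real measurements.
example = [("e", 100), ("a", 60), ("z", 3)]
print(drop_weak(example))  # ['e', 'a']
```

The right value for the threshold would have to be found by trial on known languages first.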
In any case, I took small sub-samples (sometimes just a dozen words) from each of these texts and got broadly the same results. Sometimes additional letters crept in, but the main list was always there.
Having established all this, let’s have a look at what happens when we apply Sukhotin’s to some of the Voynich pages. All transcripts were in basic EVA from the VIB with suspicious spaces removed. Since my program strips out all ! and * signs, we’re sometimes jumping over illegible glyphs, but that’s the price you gotta pay to work with the transcriptions.
Updated! It’s been pointed out that the first draft of this page incorrectly calculated “combo glyphs” in EVA. A schoolboy mistake (told you I was working late at night!) which has now been corrected. The rest of this page has been analysed afresh and updated. I now, in accordance with general theory, treat ch, sh, cth, cph, ckh and cfh as a single glyph each. I did consider including “ii”, or derivatives such as “iin” and “iir”, but decided against it.
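To make the combo-glyph handling concrete, here is a rough Python sketch of one way to split an EVA string into glyphs. It is an illustration and not the actual program: the function name `eva_glyphs` is made up, and dropping whitespace is an assumption that matches ignoring word breaks. Other separator characters in a given transcription file would need adding to `skip`.

```python
import re

# Multi-character EVA glyphs treated as single units. The three-letter
# forms are listed first so "cth" is not split into "c" + "t" + "h".
COMBOS = ("cth", "cph", "ckh", "cfh", "ch", "sh")
_GLYPH = re.compile("|".join(COMBOS) + "|.", re.DOTALL)


def eva_glyphs(text, skip="!*"):
    """Split an EVA string into glyph tokens (illustrative sketch).

    Characters in `skip` and all whitespace are removed before splitting,
    so glyphs either side of a removed character become neighbours.
    """
    cleaned = "".join(c for c in text if c not in skip and not c.isspace())
    return _GLYPH.findall(cleaned)


print(eva_glyphs("fachys ykal"))
# ['f', 'a', 'ch', 'y', 's', 'y', 'k', 'a', 'l']
```

The resulting list can be passed straight to a vowel checker that accepts a sequence of symbols.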
Let’s look at the following vowels that it found (Vowels ordered by the frequency of occurrence, most frequent first):
- PAGE 1r: Vowels are: a,o,y,n,s,t
- PAGE 1v: Vowels are: o,y,a
- PAGE 2r: Vowels are: a,y,o,n,t
- PAGE 2v: Vowels are: a,o,y,n,t
- PAGE 3r: Vowels are: o,a,y,n
- PAGE 3v: Vowels are: o,a,y,t,n,s
- PAGE 4r: Vowels are: a,y,o,n,t
- PAGE 4v: Vowels are: o,y,i,s
- PAGE 5r: Vowels are: o,y,a
- PAGE 5v: Vowels are: o,a,n,y,g,s
- PAGES 1r to 10v: Vowels are: a,o,y,n,t,s,g
These pages are all fairly short in text length, with the exception of folio 1r. I thus ran the checker against all of the first 10 folios together and saw that the vowels were much the same, although we got seven back.
So I took a few random pages from further on in the book:
- f66r: Vowels are: o,a,y,n,e (stripped out individual glyphs that appear)
- f66v: Vowels are: a,o,y,t,e,n
- f67r1: Vowels are: a,o,y,t,n
- f100r: Vowels are: o,a,y,n,c
- f93v: Vowels are: o,y,a,t,n,s
- f115r: Vowels are: a,o,y,t,n (this was a very text heavy page)
- f110r to f115r: Vowels are: a,o,y,n
Longer text heavy pages seem to give around five vowels, shorter pages between three and six. Interestingly enough, the number of vowels is similar (about 5.4) to the average length of a Voynich word (5.5 glyphs).
However – on natural text, the number of vowels returned by Sukhotin is generally higher than for Voynichese. English returns 8-10 “vowels” depending on the text, Spanish 7-8 in my previous experiments. Of course, English has 5+1(y) vowels, Spanish only 5.
So why is Voynich returning a much lower vowel count? After all, if we assume a 30% false positive rate then we’re only seeing 3 or 4 vowels in the Voynichese (5.4 × 0.7 ≈ 3.8). The obvious answer is that the glyphs being used as “vowels” have a much tighter grammatical function than in natural languages, as we see with other grammatical “rules” that have been identified (see the work of Stolfi or Rene for more) as well as the CLS rule previously described by Cham and myself.
This is only my first dabble into this angle of work but it appears at first sight to be backing up the deductions already made about the Voynichese “language”: tight grammatical rules and a limited glyph interaction with strong rules on placement.
Remember the prototype “BR” algorithm I showed you, that also identified vowels as a side effect? IIRC the results on English and Spanish were more accurate than Sukhotin’s identifications here. Anyway I wonder what would happen if I ran it on each folio individually and compared to what you have here. No time for that yet though.
Yes I do Brian, but Sukhotin’s is less complicated 🙂
I’ve found four sets of letter values in my work; three for the labels, one for the text. Assuming you tested only the paragraph on f67, here are my values for the vowels found:
o – O
a – I
y – T
h – not used
n – K
t – M or R
s – G
So that pesky T showed up again, but let’s count that as a positive, so 3/6.
I’ve found the Voy alphabet to be 22 letters; the 20-letter Latin alphabet plus K, and q as a special case.
VFQ lists, as vowels in order of likelihood:
O C A Y N S E
Which, I think, changes according to the length of words. I don’t remember which sections were input. Probably all the running text.
I am writing as if VMs “words” were real words. You can do better by looking at the frequency of individual letter contacts. It happens that vowels are glyphs that look like vowels — a e o y . Therein lies a mystery. Other EVA letters are Consonants, Probable Consonants, or Unassigned (i q & i-series). It depends on what letters the glyphs represent. As in some real languages, ch and sh should be single glyphs. When not followed by h, s is not the same as the s in sh. When the scribe follows the rules, m is a decorative glyph that is a substitute for in. Gallows and bench-gallows are anyone’s guess.
You can download VFQ from
http://ixoloxi.com/voynich/tools.html
It will run in a command window on a 32-bit machine or in DOSBox on a 64-bit machine. It is also useful for finding initial, final, internal, and isolated letters in a text. I once read that VFQ uses a *modification* of Sukhotin’s algorithm. I don’t know about that. MONKEY is another useful app on that page but should be re-written to overcome its limitations. Scores for letters obtained by Findnull (available for download on the same page) can be used to find one aspect of page similarity.
Hey Knox,
Thanks for the link! The findings are similar to what I’m starting to see.
You’ll see I’ve changed the article to take into account “combo” glyphs now.
I’m not respecting word breaks by the way.
Hi David,
I’m reasonably sure that adapting your code to take account of word breaks will give you fewer vowels (specifically, fewer EVA ‘n’s), and also will change the way that EVA ‘y’ is assessed (because it has such strong word-initial and word-final placement preferences).
To my eyes, Voynichese’s “vowel problem” (haha) is simply that the in-word vowels then reduce to ‘a’ and ‘o’, which not only are (real) vowel-shapes, but are far too few in number to support a real language.
And the longer I look at ‘a’ and ‘o’, the less I believe that they are vowels in anything apart from the covertext. But you knew that anyway. 🙂
Cheers, Nick
Hi Nick,
Yes it probably will, but Sukhotin’s is designed to ignore word breaks, and indeed start and end of sentences.
Still, to adapt slightly what you said, the problem with Voynichese is that it has far too high a fixed glyph position within “words”.
It indicates, to me, either a category based artificial language (but then we run into the problem of not enough variety to construct all permutations you’d expect to find) or mechanically generated gibberish (my Volvelle theory again).
Now to prove it (don’t wait up!)
Hello David, I wonder if you would be interested in my argument which also puts [a, o, y] forward as vowels. It uses very different reasoning to your own argument, but they are complementary and not contradicting.
https://agnosticvoynich.wordpress.com/2015/03/24/are-o-a-y-vowels/
I do hope this is useful.
Hi David, I just found your blog 🙂
I wonder what would happen if you switch around the way you treat bench ligatures and iin-kinds of things? For example, I think the various bench chars are ligatures of vowels and consonants, while the iin might just be an m. If you allow the bench to contain two different sounds, I’m sure that would take some of the stress off the T.