How many glyphs are there in the Voynich alphabet?
Note: This page is a work in edit. Still fiddling. Comments and feedback more than welcome, they’re encouraged!
The very question itself is imbued with menace. Before we even get going, we first have to define the semiotic basis of “glyph” within the manuscript: we need a paradigm for what a glyph actually is.
Note: This page is mainly a compilation of work that is already out there. I wished to collate and to define the very basics of Voynichese before delving into some more complex topics, and to re-examine the assumptions that underpin all of our bigger theories. Most of this is NOT based on my own examination of Voynichese, but upon a compilation of what other people / working groups have observed, with sources, although I do attempt an analysis of certain combination glyphs further on. Certainly none of this is written “in stone” – the question, by its very nature, is subjective. And I can only work from previous work, from the transcription alphabets and the transcription corpus.
The alphabet of a language is the set of symbols, letters, or tokens (which in Voynichese are called glyphs) from which the strings of the language may be formed. The content strings (the signifier) formed from this alphabet are called words. A formal language is often defined by means of a formal grammar such as a regular grammar or context-free grammar, also called its formation rule. [^]
We thus need to agree on an alphabet. The alphabet must be finite and include all glyphs found in the Voynich corpus. But we have no reference as to what the glyphs mean and hence must use logic to decide upon the parameters of this alphabet.
Glyphs, by the way, are what the original scribe(s) wrote in the manuscript, on the page.
We have two options available to us. We can use the paradigm of a formal language, taking the text at its face value, and assume that we are looking at what we are used to – distinct words, separated by spaces (although, of course, a space could be a glyph in Voynichese!), formed by a finite number of glyphs that can be combined in a finite number of ways to form new words. Languages written in the Arabic, Cyrillic and Roman scripts all look very different on a page, but their paradigm is the same.
Or we can postulate that a deliberate deception has been created, that the glyphs we assume are letters are not the base unit, and the base unit is something else.
Let us examine the glyphs.
Some look like they’re copied from the Roman alphabet post late 14th century (a, o, 4, 8, 9, etc). Others are unique to this corpus.
Currier made the observation that almost all glyphs within the manuscript, with the exception of “a,o,4,8,9” and the bench glyphs, can be divided into line or curve glyphs. Brian Cham made this observation independently and has a website on the subject here. If we take this hypothesis, that the alphabet was created according to these line shapes, then we can assume each glyph is used as a letter, thus discarding any hypotheses that involve breaking glyphs down into their component strokes (which leads, as the work of Prof Newbold shows, to a reductio ad absurdum).
The difficulty with breaking down these glyphs into “components” is that there are a finite number of glyphs which repeat time and again. If we assume a solution like Prof Newbold’s “stroke theory”, in which each glyph is formed of minuscule sub-glyphs, then we cannot explain how the glyphs continue to form the same pattern time and again. What is more, since all of the “mini-glyphs” which form the main glyph are in the same pattern, they would have the same meaning and thus we can ignore them, taking the main glyph as the signifier.
Therefore, we take as a starting point that each glyph functions as a letter in an alphabet that is unknown to us. We take the text “at its face value”.
First off, we must understand the difference between the original glyphs and the various transcription alphabets used to convert the manuscript into the Roman alphabet.
(A full overview of the different types of transcription alphabets used by different transcribers is given here at Rene Zandbergen’s page. All examples from now on will be in the EVA transcription.)
Transcription alphabets are the (usually) Roman alphabet transcriptions that we see on our computer screens. Somebody has sat down and decided that the glyph that looks like a “4” can be represented on our keyboard by the number 4 – but the original glyph has none of the connotations that the number “4” has for us. We cannot continue without leaving behind all prejudices about our own alphabet. Look at the example transcription alphabet image below – the glyph which is represented by “k” on our keyboard has nothing in common with “k”. “k” is arbitrary, we could have represented that glyph with any letter, number or symbol.
For that reason, all transcription alphabets are incomplete. The original manuscript is handwritten, and many of the words are blurred, either by the passage of time or by the simple vagaries of handwriting. All the transcriptions are “best guesses”.
Cham and Jackson [^] developed the Curve Line System (CLS) which postulates a word structure theory for Voynichese. Here’s a handy flowchart developed by Cham explaining it:
Let’s start the hunt!
Remember that we are not looking for bigrams, only separate glyphs that would appear in the original alphabet. To take an example from English, “th” is a common enough bigram, but it is formed by two letters and so would not appear in an alphabet. However, “w”, despite looking like a double “v”, is a separate letter. One way for someone unfamiliar with English to work out the difference would be to look at capital letters. “th” and “Th” are interchangeable (in certain locations within the sentence group), so recognising that t and T function as the same letter allows the distinction to be made. But since no apparent punctuation is used in the Voynich corpus, this is not as easy as would be expected. We must also recognise that the manuscript is assumed to have been written at a time when punctuation was not standard in many languages, and thus punctuation may be a modern artifice imposed by us.
Reddy & Knight examine the possibility of case and punctuation in the corpus using mathematical models, and dismiss both possibilities.
A lot of the work of breaking apart the glyphs has of course been carried out for us by the original transcribers. Let’s take a look at their work, and follow the path that they have beaten.
If we look at the main transcription alphabets (FSG, Currier, EVA, Frogguy) then we find they roughly agree on a total of 26 different base glyphs. From now on, let us use EVA, being the latest and most used alphabet, as it includes all of these base glyphs in its main alphabet.
[^] The EVA alphabet was created by René Zandbergen and Gabriel Landini and is composed of different types of characters:
- Basic EVA characters which can be used to write the great majority of the manuscript. All basic EVA characters appear at least 10 times in the corpus of the manuscript.
- Extended EVA characters are a collection of weird forms, variations, embellishments that occasionally appear in the manuscript
- Punctuation characters are used to mark the layout of the text.
- Meta codes are used for extra-script coding.
The new EVA font contains all these and a few extras including 2 unofficial EVA characters (borrowed from Jacques Guy’s Frogguy alphabet), a few “doodles” (most probably embellishments) and the “Dee-like numerals” which resemble the pagination numbers. The idea of using a ‘splat’ symbol (‘*’) to mark unreadable characters was introduced by Julie Porter.
The fact that EVA basic finds 26 glyphs, the same as the English alphabet, has been met with some suspicion. Are we forcing our own alphabet onto the Voynichese alphabet?
Well, that question misses the point, because these are the base glyphs that can then be combined to form other glyphs, so we have more than 26 glyphs. These are the contentious glyphs, the ones that people argue over. Let’s look at them.
We have three types of additional glyphs not in basic EVA – repetitions, ligatures and weirdos.
There are a number of glyphs that appear with repetitions that have provoked much discussion. Are these distinct glyphs in their own right, or simply a single glyph written several times in a row? For example:
Is “iin” “i-in”, “ii-n”, or just “iin”?
Other such examples include “ee”, “eee”. We must look at the corpus to try to identify rules controlling how these repetitions appear.
Timm postulates the following rules for these repetitions:
Based on the observation that it is possible to generate other words, which exist in the VMS, by replacing similar shaped glyphs, it is possible to list the following rules:
“in”, “iin” and “iiin” can replace each other (in-iin-iiin)
“e”, “ee” and “eee” can replace each other (e-ee-eee)
So we see that the following words all appear in the corpus:
Since both variations are valid words in the corpus, this would indicate that the double i and the single i are repetitions of the glyph “i”, not two separate glyphs. Whilst some languages recognise a double letter as a separate entry in the alphabet (e.g. Spanish, where ll was counted as an additional letter until 2010), this is usually a formality, as recognised by the Spanish linguistic authority when it dropped that entry from the alphabet as being unworthy of a separate place.
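The two substitution rules quoted above can be tried out mechanically. The following is only a minimal sketch: the function name `variants`, the regular expressions and the three-word sample set are my own illustrative assumptions, not Timm's code, and the sample is not a real transcription file.

```python
import re

# The two groups from the rules quoted above. The lookarounds make sure we
# only match a complete run (e.g. not the tail of a longer string of "i").
PATTERNS = [
    (re.compile(r"(?<!i)i{1,3}n"), ["in", "iin", "iiin"]),
    (re.compile(r"(?<!e)e{1,3}(?!e)"), ["e", "ee", "eee"]),
]

def variants(word):
    """Return every word obtained by swapping ONE run for another member of its group."""
    out = set()
    for pattern, group in PATTERNS:
        for m in pattern.finditer(word):
            for alt in group:
                if alt != m.group(0):
                    out.add(word[:m.start()] + alt + word[m.end():])
    return out

# Hypothetical mini word list, standing in for a full transcription.
sample_words = {"daiin", "dain", "daiiin"}

# Which generated variants are themselves attested in the sample?
print(sorted(variants("daiin") & sample_words))
```

Run against a full transcription instead of the toy set, the same intersection would show how many of the generated variants are attested words.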
Triple characters are extremely rare in modern languages. Romanian [^] has some, and English has a few combined words like headmistressship [^], but in such cases the triple almost always spans a morpheme boundary, where a doubled letter meets a single instance of that same letter. The reason for not finding triple characters in words in most languages is pronunciation, but agglutinating languages do permit these combinations. An agglutinating language is one that joins morphemes together without modifying them. Words such as headmistressship in English are agglutinated morphemes and are rare exceptions that have slipped into the language. In practice, even such languages tend to discourage the appearance of long vowel strings via grammatical rules, but given that Voynichese is a unique example, we cannot assume the same has happened here.
So the presence of the triple “i” and “e” could indicate one of the following:
- Voynichese is an agglutinative language
- The double “i” and “e” exist as a separate glyph and here are being followed by their singular counterpart (ii-i or i-ii)
- The grammar of Voynichese permits triple character repetition
- Their presence is a mistake by the scribe (although the high number of appearances suggests otherwise)
Voynichese is unlikely to be an agglutinative language, simply because of the length of its words, which average 5.5 glyphs. Agglutinative languages tend to form much longer words (because you are combining shorter words into a new long word) and we don’t see that in Voynichese.
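An average like the one above depends on what we decide to count as one glyph, which is the very question of this page. As a rough sketch only: the helper names and the five-word sample below are illustrative assumptions, and the sample's average is not a measurement of the corpus.

```python
def glyph_count(word, multi=("ch", "sh")):
    """Count glyphs in an EVA word, treating each entry of `multi` as ONE glyph."""
    n = len(word)
    for m in multi:
        # Every occurrence of a two-character glyph saves one character.
        n -= word.count(m) * (len(m) - 1)
    return n

def mean_glyphs(words, multi=("ch", "sh")):
    """Mean glyphs per word under the chosen glyph inventory."""
    return sum(glyph_count(w, multi) for w in words) / len(words)

# Hypothetical sample, not a real transcription file.
sample = ["daiin", "chedy", "qokeedy", "ol", "shedy"]

print(mean_glyphs(sample))            # "ch" and "sh" counted as single glyphs
print(mean_glyphs(sample, multi=()))  # every EVA character counted separately
```

The two printed values differ, which is the point: the reported mean word length shifts with every decision about which combinations are single glyphs.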
If we look at the subsequent glyph, there seems to be no pattern with “eee”. However, “iii” (168 occurrences) is followed by n 156 times, by k, o, 8 and i twice each, and by m and r once each.
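A tally of this kind is easy to reproduce on any transcription. The sketch below is an assumption about how one might count it, not the method used for the figures above; the function name `followers` and the four-word list are hypothetical.

```python
from collections import Counter

def followers(words, run):
    """Count the character immediately after each occurrence of `run`.

    An empty string as key means the run ended the word. Overlapping
    occurrences are counted, so a run of four "i" yields "i" as a follower.
    """
    counts = Counter()
    for w in words:
        start = w.find(run)
        while start != -1:
            end = start + len(run)
            counts[w[end:end + 1]] += 1
            start = w.find(run, start + 1)
    return counts

# Hypothetical mini word list, standing in for a full transcription.
words = ["daiiin", "aiiin", "oiiir", "daiin"]

print(followers(words, "iii").most_common())
```

Whether overlapping matches should be counted is itself a transcription decision; changing `start + 1` to `end` would count only non-overlapping runs.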
What can be said here? The example of “iii” suggests that the glyph is actually a double “i” followed by “in”, but is more likely to be a bigram.
I believe the key is in the scribe’s writing method. Look at this example from the first line of f27v (two examples):
or this example from the last paragraph of f76v, which has a number of examples:
There are, of course, many more examples which I omit for now.
The strokes of the “i” are clearly separate in these examples, as is the n. We are looking at the glyph sequence i-i-i-n .
There are other examples where the glyphs tend to run together, but I believe this simply to be a characteristic of the handwriting of those pages. If we concentrate our gaze upon the clearer examples of the handwriting, we see an indication that the scribe was writing the same letter repeatedly.
So I conclude that the repeated letter examples given above are combinations of glyphs, not unique glyphs as has been suggested.
(For an English example, contemplate the common suffix -ing. It appears often enough, but is made up of three letters. Clearer handwriting would show the suffix to be made up of i,n,g letters, but sloppier writers might run the three letters together).
Note: I am aware that Rene Zandbergen considers some of the repetitions to be unique glyphs (which is why he includes them in his transcription comparison pages) although he has not made any public conclusion that I am aware of. I await any further research from him with interest.
Ligatures are glyphs that appear to be made up of several independent glyphs. Are they really unique glyphs or are they ngrams masquerading as glyphs? We look at them on a case by case basis.
To examine “ch” we compare it to the similar bigram, the double “e”.
Interestingly, Timm has shown that the bigram and the ch glyph are interchangeable. “ee” and “ch” can be substituted for one another. However, in the manuscript they are clearly written as separate glyphs, with different penstrokes.
This would indicate that “ee” is not a repetition of “e” or a scribal mistake for “ch”. Instead, we can conjecture (were it not for the fact that this sort of conjecture is beyond the scope of this article) that the two glyphs are homophones – that “ch” is pronounced the same as the double “e”, thus leading to this confusion on the part of the scribe.
We thus have two glyphs forming three ngrams: “c” and “ch”.
Here we have an example of “c”, “cc” and “ch” from folio f76v:
But what about other ligatures – the “park benches”? These are more problematic.
However, Timm again shows that each can be substituted for one another within words. This would appear to indicate that they are two or more separate glyphs being linked together in different syllables. By using the deduction for the “ee/ch” combination above, we can say that “ckh” and “eke” are separate ngrams built using the base glyphs “k, e, ch”. The same rule is assumed to extend to the other park benches, which in any case are rare (cth for example). All of these can be deconstructed into valid glyphs, so we assume they are ngrams and not unique glyphs.
Let us look at diacritics.
Sh appears 4501 times in the corpus, ch 11,003 times (meaning Sh appears 41% as often as ch). This frequency, together with the very real distinction made by the scribe when lettering this glyph, plus the fact that the diacritic does not appear on any other glyph combination, makes me inclined to agree that Sh should be in our base list of glyphs.
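For anyone wanting to check the percentage, the arithmetic is simply one count divided by the other. The variable names below are mine; the two counts are the ones quoted above.

```python
sh_count = 4501    # occurrences of Sh quoted above
ch_count = 11003   # occurrences of ch quoted above

ratio = sh_count / ch_count
print(f"Sh occurs {ratio:.1%} as often as ch")  # rounds to 41%
```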
Let us look at weirdos.
Actually, let’s not bother.
Weirdos are characters that appear only once or twice in the whole corpus. By their very nature they can be dismissed as embellishments, mistakes and doodles. Some may very well have meaning, but because of their scarcity they are unlikely to form part of the base alphabet and hence fall outside of this study.
The core alphabet
Here is my provisional core alphabet of the Voynich manuscript. Over 99% of all words in the transcription corpus can be created using these EVA representations (or their equivalent in other languages).
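A coverage figure like this can be checked by trying to split every word into glyphs from the chosen alphabet and counting the words that split cleanly. The sketch below is a minimal illustration under stated assumptions: the `alphabet` shown is a hypothetical placeholder and NOT the core alphabet proposed on this page, the word list is a toy sample, and greedy longest-match splitting is a simplification that could in principle miss a valid split for some alphabets.

```python
def tokenize(word, alphabet):
    """Greedy longest-match split of `word` into glyphs.

    Returns the list of glyphs, or None if some character cannot be covered.
    """
    glyphs = sorted(alphabet, key=len, reverse=True)  # try longer glyphs first
    out, i = [], 0
    while i < len(word):
        for g in glyphs:
            if word.startswith(g, i):
                out.append(g)
                i += len(g)
                break
        else:
            return None  # no glyph in the alphabet matches at position i
    return out

def coverage(words, alphabet):
    """Fraction of words that can be written entirely with `alphabet`."""
    ok = sum(tokenize(w, alphabet) is not None for w in words)
    return ok / len(words)

# Hypothetical placeholder alphabet: some single EVA characters plus "ch" and "sh".
alphabet = set("adeiklnoqrty") | {"ch", "sh"}

# Hypothetical sample words, not a real transcription file.
words = ["daiin", "chedy", "shol", "qokeedy", "cthy"]

print(tokenize("chedy", alphabet))
print(coverage(words, alphabet))
```

In this toy run the word “cthy” fails because the placeholder alphabet has no entry for the bench glyph; adding or removing entries and re-running is a quick way to see how much each contested glyph matters.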
Of course, our logic could be wrong.
I am not suggesting here any correlation between letters – simply the method the scribe used to create the alphabet.
It seems fairly clear that the creator of the alphabet created an alphabet using stroke and curve forms. He then discovered he hadn’t enough letters and added a number more.
But he was from a simpler age, and hence did not have the benefit of our Information Age overload. Hell, if Wilfrid Voynich himself had written it, he would have been from a simpler age. So it is unlikely that the alphabet sprang from the scribe’s hand in a perfect form, and we are likely to see additions and modifications as the scribe came across unexpected twists and turns.
But it’s the best we can make of it.