A study upon word spaces in the Voynich manuscript

This article is an attempt to answer the most basic of questions: are the spaces between Voynich words arbitrary or purposeful?

Despite the essential simplicity of this question, it’s still a burning issue. If we can prove that spaces are arbitrary, then it’s a push towards the theories that the text is encoded or gibberish. But if we can prove that the spaces are purposeful, that they separate words in the same way as our modern usage, then it’s a push towards a natural or artificial language.

But how can we prove this either way?

Are words “words”?

As I have argued before, the text of the manuscript is divided up into clearly defined word-like glyph groups (what we would call words if we could assign a sense unit to each glyph group). These glyph groups have a non-trivial internal structure which is manifest in the severe restrictions imposed upon the positioning of glyphs within the glyph groups. From now on I will refer to these glyph groups as “words” (I am not a fan of Stolfi’s terminology of token as I find it confuses people).

Voynichese has a very strict phonotactic structure for glyphs, which appears to indicate that these words are assembled intentionally. They are bound together as if they were words.

We are used to the paradigm that words form a sentence with spaces between the words. The Voynich corpus (with the exception of labels, single words that are attached to images) appears to follow this paradigm (albeit with no punctuation). But it is possible that this is a deception. The spaces between words could be an encoded null character, or an arbitrary sp acet om akei t mo rediff cult for the uniniti atedto read.

If this were so, we would expect the words to have a low repeat value. Words would be broken up into sub-sections, or jumbled around, and this would mean that they would not repeat very often. On the other hand, if spaces are separating words, then we would expect words to be repeated throughout the corpus.
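To make this expectation concrete, here is a minimal Python sketch. The sample sentence, the function names and the chunk lengths are all illustrative assumptions of mine, not Voynich data. It measures what fraction of word tokens belong to a repeated word, first with the original spaces and then after the spaces are stripped and reinserted at arbitrary positions.

```python
from collections import Counter
from itertools import cycle


def repeat_rate(words):
    """Return the fraction of tokens whose word occurs more than once."""
    if not words:
        return 0.0
    counts = Counter(words)
    repeated_tokens = sum(c for c in counts.values() if c > 1)
    return repeated_tokens / len(words)


def respace(words, lengths=(3, 5, 2, 4)):
    """Drop the real spaces and cut the letter stream into arbitrary chunks.

    The chunk lengths cycle through `lengths` (positive integers); they are
    an arbitrary choice made for this illustration only.
    """
    stream = "".join(words)
    chunks, pos = [], 0
    for n in cycle(lengths):
        if pos >= len(stream):
            break
        chunks.append(stream[pos:pos + n])
        pos += n
    return chunks


# Toy English sentence standing in for a transcription (not Voynich text).
sample = "the cat saw the dog and the dog saw the cat".split()

print(repeat_rate(sample))           # spaces mark real word boundaries
print(repeat_rate(respace(sample)))  # spaces placed arbitrarily
```

On this toy sentence the first figure is high and the second drops to zero, because none of the arbitrary chunks happens to recur. A real test would run the same comparison over a full transcription.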

Knight and Reddy (What we know about the Voynich manuscript) show that words are repeated throughout the VMS, and furthermore that the word frequency distribution of the manuscript follows Zipf’s law.
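For readers unfamiliar with it, Zipf’s law says that a word’s frequency is roughly proportional to 1/rank, so a plot of log frequency against log rank falls close to a straight line with slope -1. The sketch below is my own illustration of that check, not the method used in the paper; the function name and the toy corpus are assumptions.

```python
import math
from collections import Counter


def zipf_slope(words):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is what Zipf's law predicts. Needs at least two
    distinct words.
    """
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Made-up corpus whose counts are exactly 60 / rank (60, 30, 20, 15, 12).
toy_zipf = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15 + ["e"] * 12
print(round(zipf_slope(toy_zipf), 3))
```

Real text never fits this perfectly; the point is only that a slope in the neighbourhood of -1 is what one would look for.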

Furthermore, they note that Landini (2001) found that the corpus follows Zipf’s law of word lengths: there is an inverse relationship between the frequency and the length of a word.
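One simple way to look for this inverse relationship is to correlate each distinct word’s length with the logarithm of its frequency. This is a sketch under my own assumptions (the function name, the use of a Pearson correlation and the toy corpus are not taken from Landini); a clearly negative value would be consistent with frequent words being shorter.

```python
import math
from collections import Counter


def length_frequency_correlation(words):
    """Pearson correlation between word length and log frequency, per word type.

    Needs at least two word types that differ in both length and frequency.
    """
    counts = Counter(words)
    lengths = [len(w) for w in counts]
    log_freqs = [math.log(c) for c in counts.values()]
    n = len(counts)
    mean_l = sum(lengths) / n
    mean_f = sum(log_freqs) / n
    cov = sum((l - mean_l) * (f - mean_f) for l, f in zip(lengths, log_freqs))
    sd_l = math.sqrt(sum((l - mean_l) ** 2 for l in lengths))
    sd_f = math.sqrt(sum((f - mean_f) ** 2 for f in log_freqs))
    return cov / (sd_l * sd_f)


# Made-up corpus in which every extra letter halves the frequency.
toy_lengths = ["a"] * 8 + ["bb"] * 4 + ["ccc"] * 2 + ["dddd"]
print(round(length_frequency_correlation(toy_lengths), 3))
```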

From a slightly different angle, let us look at how often labels repeat within the corpus, as this allows us to see if words are repeated through different contexts. If the labels truly function as “labels”, i.e. sense units denominating illustrations or objects, we would expect a fair number of them to be repeated within the main corpus. And indeed we do: MarcoP found that 70% of all labels appear within the main corpus (study here).
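The kind of figure quoted here can be computed as a simple set overlap. The sketch below is an assumption about the general approach, not a reconstruction of MarcoP’s study, and the word lists are invented placeholder strings rather than real transcription data.

```python
def label_overlap(labels, main_text_words):
    """Fraction of distinct labels that also occur in the main text."""
    label_types = set(labels)
    if not label_types:
        return 0.0
    vocabulary = set(main_text_words)
    return len(label_types & vocabulary) / len(label_types)


# Invented placeholder strings, not real transcription data.
labels = ["otol", "okary", "otedy", "sarol"]
main_text = ["daiin", "otol", "chedy", "otedy", "okary", "daiin"]
print(label_overlap(labels, main_text))
```

Here three of the four toy labels occur in the toy main text, so the function returns 0.75.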

So we find that Voynich “words” obey frequency distribution laws, are repeated with a frequency that is normal for language, and are repeated across different contexts.

These conclusions lead us towards the assertion that glyph groups are indeed words, and the spaces between said words are significant, serving to separate sense units.


Is the Voynich a natural language?

This article is a work in progress. Comments and feedback are enthusiastically welcomed!

First off, let’s discuss what we mean by a natural language.

A natural language is one that has evolved spontaneously amongst a group of people (I include creoles, pidgins and other bricolage in this study) or an artificial language that is capable of being used as a primary source of transmitting information in a natural way (think Esperanto or other a posteriori languages).

In short, I here define a natural language as one that any cognitively normal human being is able to learn, understand and use without recourse to artificial means. (This is as opposed to the a priori, code-based artificial languages that require the memorisation of thousands of ciphers; these would be artificial languages under my definition.)

Shorthand (Tironian notes) and notarial code are banished to the “artificial languages” page when they occupy most of the text; I briefly discuss their limited use within a written natural language below.

Oneiric languages (basically those spontaneous languages such as the languages of the insane, or glossolalia) are consigned to the “gibberish” pigeon hole.


OK, so what script did Kircher mean in his 1639 letter?

Diane quite rightly pointed out on my “Kircher to Moretus reply” that we don’t know exactly which script Kircher meant when he said “Illyrian”. It’s usually taken to mean Glagolitic, but does it really?

(There’s also the question of does it matter? Close examination of the letter tends to discard the subject under discussion from being the Voynich MS. But it’s still a widely quoted quote, and I was interested, so let’s apply ourselves).

First off, let’s remind ourselves of what Kircher wrote:

Alterum denique folium quem ipsi ignoti characteris genere scriptum videbatur illyrico idiomate, charactere quem D. Hieronymi vulgo vocant, impressum sciat; utuntur eodem charactere hic Romae in missalibus aliisque sacris libri illyrico sermone imprimendis.

Which I translate as:

Finally the other leaf upon which are written types of unknown characters I observe are in the Illyric language, characters the printing of which I know are commonly called D. Hieronymi;  characters used here in Rome in various Holy Books and Illyrician printed sermons.

Some terms:

  • Illyric: We would now call this area Croatia. Kircher, as was the wont of the time, used the Roman provincial name for the area.
  • D. Hieronymi: Saint Jerome.

Now, Glagolitic is an ancient Slavic alphabet. The name Glagolitic probably wasn’t applied to the script until the 14th century. The Glagolitic alphabet was invented during the 9th century by the missionaries St Cyril (827-869 AD) and St Methodius (c. 815-885 AD) in order to translate the Bible and other religious works into the language of the Great Moravia region. It’s not a language, it’s a script that could be used for any of the proto-Slavic languages (in the same way that our alphabet can be used for French, Spanish, English etc). Here’s an example of the script:

This chart shows the Glagolitic alphabet with the names of the letters in Old Church Slavonic, the Cyrillic equivalents of the letters, and IPA transcription. Image from Omniglot.com

So Glagolitic proper dates from the 9th century, and then started to evolve. When it was adopted and standardised by the Church, it became known as Old Church Slavonic, with loads of variations across different regions (see the prior Wikipedia link for more on that).

By the 12th century the first Slavic languages were evolving in different directions. In the late 14th century, a new script evolved for use by the Church: Church Slavonic. It’s still in use today.

Right. Where does Jerome fit in?

Well, there was a persistent myth that one of the founding fathers of the Church, St Jerome, was the chap who had invented the script. The intention appears to have been to use his authority to counter attacks by Rome upon the local Church. The alphabet was thus called Hieronymian by some in pre-Renaissance times, after his Roman name, and that’s the word Kircher used.

And here’s a 16th century Vatican printed work showing “the characters of the Illyrian language in Hieronymian script”:

Pages from a book describing Glagolitic script. (A. Rocca: Biblioteca Apostolica Vaticana a Sixto V… translata, Roma, 1591: Alfabeto glagolitico). Wikipedia.

So far, so clear. Kircher used the term “Hieronymi” to refer to a specific Slavic script, and furthermore identified the base language as Croatian (Illyrian). Can we corroborate this? What did Kircher himself understand by “Hieronymi”? Let’s find out.

Here’s the “72 names of God” image from Athanasius Kircher’s Oedipus Aegyptiacus:

72 names of God in the languages of the 72 nations of the world
Kircher has carefully used Cabalah and then written out the name of God in the 72 different languages of the world. As always, philology is carefully ignored or manipulated to gain his end: you’ll notice that for English, he wrote GOOD instead of GOD. Why? Because all of the 72 names had to be four letters long. For a full interpretation of this diagram click here. I’m only interested in entry 13: Illyrici.

Damn it, he’s only gone and written it out in the Roman alphabet! BOOG. Why? I don’t know. He’s done the same in Japanese and Chinese, and where he got BOSA from for “Mexican”, or “SOLV” for Californian is beyond me (local native American dialects?). Frankly, the more I learn of Kircher, the more I agree with Descartes’s opinion of the bloke. And despite a morning searching, I have yet to find any other example of Illyrian in any of Kircher’s works.

Let’s look elsewhere. This is the Virga Aurea of James Bonaventure Hepburn, published at Rome in 1616. The Virga Aurea, or to give the full title, “The Heavenly Golden Rod of the Blessed Virgin Mary in Seventy-two Praises”, consists of a list of seventy-two alphabets (actually seventy, plus Latin and Hebrew, which are the two languages of the text of the plate). Usefully enough, “Illyricum” is included: eighth down, left column.

Comparing the Virga Aurea with Rocca’s Alfabeto glagolitico we see obvious similarities, but at the same time, differences. Remember, they were published just 25 years apart. Hepburn’s sources are unknown, but it’s assumed (since he was head of rare books at the Vatican) that he was probably getting the info out of the books there.

Hepburn has 37 glyphs, whereas Rocca only has 33. The glyphs are also slightly different if you compare them. And neither of them really corresponds with the Omniglot table I posted above. So what’s going on?

Well, Hepburn & Rocca are both confusing different versions of the script, as shown below, and certainly Rocca is missing a number from his book:

So yes, the above scripts would appear to be what Kircher knew as “Hieronymian”.

Here’s a page from the first Croat language printed book, a 1483 work entitled Misal po zakonu rimskoga dvora:

1483 “Misal po zakonu rimskoga dvora”


It turns out that Glagolitic is a right pain. A typographer writing on the ministryoftype.co.uk website commented that:

One of the things I noticed when looking at examples of Glagolitic is the way some characters appear and disappear; I was trying to set some text in it, and whichever bit of text I tried had some extra characters that weren’t in the font or in any other examples – each one seemed to have characters unique to it. Of course, this isn’t a deficiency of the font (or of the language), but more a sign of the evolution of the written language and of the strong influences on it from Latin, Cyrillic and Church Slavonic over the years. Croatian was written in all three systems in parallel, and as a local system not widely known outside of the Balkans (despite being the oldest of the Slavic alphabets), the form of written Glagolitic has perhaps been more influenced than influencing; In some written examples there are Cyrillic characters, while in others the characters are presumably the original Glagolitic ones, or newer hybrid forms.

So it seems clear enough that for Kircher, Hieronymian would be Glagolitic. He used the name “Hieronymian” because of the myth, current at the time, that St Jerome had invented the script, and because of the attempts to link his name to it. As Kircher would only have known of the script via his Catholic Church contacts, the name Hieronymian would have been the correct one to use at this period, even if elsewhere it was known as Glagolitic. The Church was printing books in the script; indeed, it was even standardising a version of the script for its own use.

It also turns out that Glagolitic is still alive and thriving in Croatia, where it’s treated as a national treasure and part of their identity.

Early modern universal languages as seen through the thoughts of Kircher

I have spoken before of Kircher’s Universal Language dictionary, which aimed to reduce all words to numbers which could then be transmitted to a recipient, who would then look up those numbers in his vernacular, and so read the message.
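The mechanism is easy to sketch in Python. Everything in the example is hypothetical: the three-entry dictionaries and the numbers assigned to the words are invented for illustration and are not taken from Kircher’s tables.

```python
# Hypothetical three-entry dictionaries; the numbers are invented for this
# illustration and are not taken from Kircher's tables.
ENGLISH = {1: "dog", 2: "water", 3: "drinks"}
LATIN = {1: "canis", 2: "aqua", 3: "bibit"}


def encode(words, dictionary):
    """Replace each word with its number in the sender's dictionary."""
    word_to_number = {word: number for number, word in dictionary.items()}
    return [word_to_number[word] for word in words]


def decode(numbers, dictionary):
    """Look each number up in the recipient's dictionary."""
    return [dictionary[number] for number in numbers]


message = encode(["dog", "drinks", "water"], ENGLISH)
print(message)
print(decode(message, LATIN))
```

The recipient gets the sender’s word order, one word per number, with nothing in the scheme to carry grammar across.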

Of course, all this translation system really does, from our modern day viewpoint, is slow translations down by introducing a dictionary reference instead of the actual word. Direct translation of words leads to interesting mixups like this one:

Anyway, in 1663 Kircher was to publish Polygraphia nova et universalis (the New and Universal Polygraphy), a grandiose tract that was to promise far more than it could deliver. It was a typical Kircher work in that it took the ideas of others, spun them around and presented them in a new manner. Kircher was a master of this art, and his Polygraphia is one of his masterpieces. Notwithstanding that, the Polygraphia is a good example of how the intellectuals of Europe considered language and codes at this point in time, so let’s look at the state of “universal languages” in the pre-Renaissance via this book, because its content is intellectually very much on the dividing line between medieval and Renaissance thinking.

