Is the Voynich a natural language?

This article is a work in progress. Comments and feedback are enthusiastically welcomed!

First off, let’s discuss what we mean by a natural language.

A natural language is one that has evolved spontaneously amongst a group of people (I include creoles, pidgins and other bricolage in this study) or an artificial language that is capable of being used as a primary source of transmitting information in a natural way (think Esperanto or other a posteriori languages).

In short, I here define a natural language as one that any cognitively normal human being is able to learn, understand and use without recourse to artificial means. (As opposed to the a priori code based artificial languages that require the memorisation of thousands of ciphers; these would be artificial languages under my definition).

Shorthand (Tironian notes) or notarial code are banished to the “artificial languages” page when they occupy most of the text; I briefly discuss their limited use within a written natural language below.

Oneiric languages (basically those spontaneous languages such as the languages of the insane, or glossolalia) are consigned to the “gibberish” pigeon hole.

So under this definition, English, Esperanto, Chinese and Nahuatl are all natural languages. Latin written with scribal abbreviations likewise is a natural language, but a page full of shorthand is not as it is properly a code. I do not distinguish between the linguistic definitions of a prescribed and unprescribed natural language.

An a priori language is one that is created from scratch (like logic based artificial languages or a code language); an a posteriori language is one that is built upon the foundations of a pre-existing natural language (like Esperanto or, I don’t know, Pig Latin).

Some thoughts on phonetic mapping and the script

It has been argued that Voynichese cannot have a 1:1 mapping to phonetic words. By itself, this is a simplistic argument. Prof. Stephen Bax points out that English does not have a 1:1 mapping to words either: look at the pronunciation of through (T H U R A amongst other pronunciations!).

At the other end of the spectrum, shorthand is essentially a personalised script for the base language, capable of being read aloud in a natural language by those in the know, and so has almost a 1:1 mapping. Each symbol or word is a homophone for a word and one word alone.

Note that we want a written natural language. A sign language is considered to be a natural language, but it is not a written language (well, it can be but under strictly controlled situations).

In theory, any script can be mapped onto any language (see my page on the Aljamiadas of Spain for examples of Spanish written in the Arabic script). Of course, in practice a pressing need is always behind such attempts (in the case of the aljamiadas, it was a way for the Moors in Christian Spain to maintain their heritage and get around the proscriptions of the Inquisition).

So yes, it is possible that a new script has been invented for a pre-existing language. But do the characteristics of Voynichese permit this to be a possibility?

Some terms

Language is the ability to acquire and use complex systems of communication, particularly the human ability to do so, and a language is any specific example of such a system.

Script is the collection of visual signs used to represent textual information on the page.

Glyphs are the “letters” used in Voynichese. There are 25 glyphs in the Voynich alphabet.

Let’s go!

We are looking for evidence to suggest that Voynichese is a natural language (but not what language it is). So we want to test the following hypothesis:

Voynichese is a natural language. It encodes the natural communication of a group.

In other words, is Voynichese a language capable of being spoken and written in the same way as English, Russian, Hawaiian or Chinese? It could be a new script for an existing language (imagine someone understood Chinese but couldn’t write it, so they invented an alphabet based on their own Latin) or a dead language that has only given us one written copy.

It could be:

  • A unique example of a natural language
  • An attempt at writing down a language that had no natural written system (or not one that was understood by the scribe)
  • A copy of a natural language that was not understood by the scribes

Let us consider some of the philosophical characteristics of natural languages and see if they fit Voynichese:

  • All languages are systematic. They are governed by a set of interrelated systems that include phonology, graphics (usually), morphology, syntax, lexicon, and semantics.
  • All natural languages are conventional and arbitrary. They obey rules, such as assigning a particular word to a particular thing or concept. But there is no reason that this particular word was originally assigned to this particular thing or concept.
  • All natural languages are redundant, meaning that the information in a sentence is signalled in more than one way.
  • All natural languages change. There are various ways a language can change and various reasons for this change.

(C. M. Millward and Mary Hayes, A Biography of the English Language, 3rd ed. Wadsworth, 2011).

Sadly, we cannot answer “yes” or “no” to any of the above propositions, because we don’t understand Voynichese! But we cannot exclude any of them. Let’s go a bit more in depth.

Charles Hockett proposed the “Ten fundamental characteristics of human language as system of communications”

  • Human language is learned
    • 1) acquired through cultural transmission
    • 2) speakers of one language can learn another
  • . . . is discrete
    • language consists of minimal units
  • . . . is recombinable
    • these minimal units can be combined in infinite varieties
  • . . . is unconscious/intuitive
    • structural knowledge of a language is not necessarily conscious or articulated by its speakers
  • . . . is interchangeable
    • any speaker potentially can create and utter any message
  • . . . is reflexive
    • people can talk about language; language has the ability to refer to itself
  • . . . is arbitrary
    • meaning depends on arbitrary association of meaning with sign or symbol, on conventions shared by sender and receiver of message
  • . . . is redundant
    • language contains redundant communicative elements (message may be conveyed or reinforced twice in same utterance)
  • . . . can displace
    • language can convey imaginary, distant, past, present, future, conjectural, and/or counterfactual statements (including lies)
  • . . . is productive
    • a speaker can create totally novel statements and a listener can understand them

Again, we cannot answer our question by examining it from this philosophical angle – because we don’t understand Voynichese. Currently, these statements can be neither confirmed nor refuted.

It seems clear that instead we must change our philosophical approach to a more analytical approach.

Is there a grammar?

These concepts evolve from the theory of Universal Grammar, a theory developed by Noam Chomsky. In brief:

The theory of Universal Grammar proposes that if human beings are brought up under normal conditions (not conditions of extreme sensory deprivation), then they will always develop language with a certain property X (e.g., distinguishing nouns from verbs, or distinguishing function words from lexical words). As a result, property X is considered to be a property of universal grammar in the most general sense.

So we need to look for elements of grammar in the Voynichese corpus. If it is a natural language, it will perforce contain a grammar; and if there are syntax rules within the corpus, they should be empirically identifiable.

So, first question:

Can we identify morphemes?

A morpheme is the smallest grammatical unit in a language.

Yes, we can identify morphemes in the corpus. Voynichese glyph combinations are highly position-sensitive within words – glyph groups are non-trivial in their internal positioning. We can identify, and have identified, a long list of suffixes and prefixes within Voynichese. We know that certain glyphs only appear as suffixes; we know that certain glyphs only appear as prefixes; and we know that other glyphs are free-form. We have also identified (via the CLS theorem) that glyphs appear in a certain pattern.

We assume these are bound morphemes because they obey certain rules of positioning. We make no assumptions about words that do not include such bound morphemes as we are unable to identify a meaning for such unbound morphemes.
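As a rough illustration of how such positional rules can be checked, here is a minimal Python sketch. It assumes a transliteration in which every character stands for exactly one glyph (real transcription alphabets do not always work this way), and the toy word list, the function names and the 90% threshold are my own illustrative assumptions rather than part of any published method.

```python
from collections import Counter, defaultdict


def positional_profile(words):
    """Count how often each glyph occurs word-initially, medially and finally."""
    profile = defaultdict(Counter)
    for word in words:
        last = len(word) - 1
        for i, glyph in enumerate(word):
            if i == 0:
                profile[glyph]["initial"] += 1
            if i == last:
                profile[glyph]["final"] += 1
            if 0 < i < last:
                profile[glyph]["medial"] += 1
    return profile


def position_bound(profile, threshold=0.9):
    """Return glyphs that sit in a single slot at least `threshold` of the time."""
    bound = {}
    for glyph, counts in profile.items():
        total = sum(counts.values())
        slot, n = counts.most_common(1)[0]
        if n / total >= threshold:
            bound[glyph] = slot
    return bound


# Hypothetical toy sample, NOT real corpus data.
sample = ["qokedy", "qokeedy", "chedy", "shedy",
          "daiin", "qokain", "otedy", "okaiin"]
print(position_bound(positional_profile(sample)))
```

On a real transcription the interesting output is the list of glyphs that never leave their slot; a language written in an ordinary alphabet would be expected to give a much shorter list.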

Second question:

Do these morphemes function as they do in natural languages?

Now, the following is based on the transcriptions we have (no particular one, but a general average between all of them). It also depends upon the transcription alphabet we use (because the spelling of words in the same transcription can change when we alter the transcription alphabet used).

Note: Usually when people refer to a “grammar” in Voynichese they are actually referring to the construction of words, not syntax. In this article I’m actually referring to sentence syntax when I use the concept of “grammar”.

Now, the text of the manuscript is divided up into clearly defined word-like glyph groups. These glyph groups have a non-trivial internal structure which is manifest in the severe restrictions imposed upon the positioning of glyphs within the word groups, as we mentioned before when talking about morphemes.

The works of Stolfi, Brig. J. Tiltman (1951) (reproduced in D’Imperio, Fig. 27), Vogt, Reddy & Knight, Cham & Jackson and many others demonstrate this statement.

This internal distribution of words makes comparison of average word lengths against other natural languages somewhat irrelevant, as the strict internal construction of words means they cannot be compared with most alphabet based languages.

Whilst all alphabetic languages contain bound morphemes, they do not restrict internal letter positioning to anything like the extent that Voynichese does: very few of its glyphs are free to appear in any position.

Irrespective of that, D. Abbot et al conclude that

  • Based on characteristics such as word length distribution (WLD) and WRI, the text appears similar to languages such as Hebrew and Latin.

But this conclusion is based solely on WLD and the team point out that “There is a distinctly low word order in comparison to known languages.”

The approach of D. Abbot et al demonstrates the shortcoming of most analytical comparison tests. The analysis only looks at factors such as WLD or basic entropy, without attempting to analyse the syntax of sentence structure or positioning of glyphs within glyph groups (words).

Can we thus make any deduction as to the phonotactic structure of the words?

Voynichese has a very strict phonotactic structure – syllables appear in predefined places, and only there.

Such high phonological structure in languages tends to be associated with a tendency towards monosyllabism (see, for example, Language Complexity: Typology, contact, change; Matti Miestamo (ed)). We see this in many Asian and African languages. To compensate for the low morphological complexity, such languages tend to introduce concepts such as tones, which vary the significance of the syllable. These tones then need to be identified in any written version of that language.

As words based on a finite alphabet appear in the manuscript, we are not looking at a direct monosyllabic language, i.e. it’s not an ideographic language like Chinese or hieroglyphics. So we could therefore be looking at a phonetic transcription (similar to Pinyin) in which glyph groups represent syllables.

But given that there appears to be no tonal markers (as marked by diacritics or other symbols) in Voynichese, we must assume the contrary, ie, that there are no tonal markers in Voynichese. And so it is not transcribing a language with a high phonotactic structure.

This is an interesting point, for most languages do have some sort of diacritic, even if it’s only to identify the stress. English is one of the few that doesn’t (loan words excepted), although of course there are more.

In fact, given the lack of diacritics or other verbal modifiers, we would assume a free phonotactic structure – an assumption which is contradicted by the evidence of non-trivial internal structure of words. The ways of resolving this contradiction all argue against a natural language.

Due to the restrictive nature of glyph positioning, we cannot assume a 1:1 translation from Voynichese to Romance languages. However, there are many other families of language. Can we make an intuitive leap and try to identify sentence matching with a different family of language?

The new question is thus:

Is there a sentence syntax in Voynichese?

So, can we say that there is a basic grammar in Voynichese? Can we identify nouns, adverbs, sentence subject and other such concepts?

To do this, we must first identify sentences.

The works of Reddy & Knight (section 3.2 and 3.3) show that there is no natural punctuation in Voynichese, nor do letters appear to have case. However, this is not unnatural in manuscripts of the era (assuming the early half of the 15th century as its terminus a quo), as punctuation and case were still developing in written manuscripts, although this is a messy and complex subject.

However, to balance this, in formal writing a system called per cola et commata was in use, in which a sentence was treated as a sense-unit, and the end of each sense-unit was marked by starting a new line. This method does seem to be in use in the Voynich, inasmuch as we have paragraphs. We also seem to have lines that are longer than others, possibly indicating sentence breaks within the paragraphs.

Of course, to assume per cola et commata was in use is to assume a European tradition for the manuscript. There are sufficient visual clues (not entered into here) for this to be a distinct possibility. But even without assuming this, we can see that words are ordered into rows of distinct length; and that rows are grouped into paragraphs with additional space between them. From the visual evidence we can assume paragraphs are used, and if we take a contrary position then we are introducing an element of deliberate deception into the argument which by itself negates the possibility of a natural language.

But is there word order? This is a delicate subject. Word order refers to how words must be arranged in a sentence. Latin, for example, has weak word order, inasmuch as the words in a sentence can be rearranged according to the needs of the inflection.

There are six theoretically possible basic word orders for the transitive sentence: subject–verb–object (SVO), subject–object–verb (SOV), verb–subject–object (VSO), verb–object–subject (VOS), object–subject–verb (OSV) and object–verb–subject (OVS). The overwhelming majority of the world’s languages are either SVO or SOV, with a much smaller but still significant portion using VSO word order. The remaining three arrangements are exceptionally rare, with VOS being slightly more common than OVS, and OSV being significantly rarer than the two preceding orders.

Some languages have no fixed word order (189 according to this). These languages often use a significant amount of morphological marking to disambiguate the roles of the arguments. However, some languages use a fixed word order, even if they provide a degree of marking that would support free word order. Also, some languages with free word order—such as some varieties of Datooga—combine free word order with a lack of morphological distinction between arguments.

There is very weak word order within the Voynich.

But there is an important point here – when written, such languages do tend to obey the word order selected by the author. In other words, Latin prose and Latin legalese have very different word order even for the same subject yet throughout the text will keep the same rough order.

It’s just common sense. It may not make much sense to a native English speaker, where word order is relatively strong (OK, both “Jim likes beans” and “Beans are liked by Jim” are valid sentences, but which is more likely?) but in other languages you’ll find two basic ideas that are especially important when you’re talking about word order: topic and focus.

The topic of a sentence is “what the sentence is about”. A sequence of sentences will have the same or related topics. But when you want to change topics between sentences, or to contrast two topics with one another — speakers tend to mark the new topic in some way when they do this.

For example, a change in topic: “Jim likes beans. Me, I hate them.” We were talking about Jim, but then switched to me, and the left-dislocated “Me” marks the switch.

The other important idea is focus. Focus has many different uses but one of the biggest is to mark the answers to questions. In English we use prosody to mark focus. For example: “Who likes beans?” “Jim likes beans.” What’s the focus? Jim, the new information that answers the question.

So, in any natural language, semantic analysis should be able to detect:

  • Sentence syntax (words that commonly appear before others)
  • Topic (words that appear commonly in certain locations)
  • Focus (words that appear commonly in certain order)

Schinner (2007 –  The Voynich Manuscript: Evidence of the hoax hypothesis) has shown that the probability of similar words repeating themselves at a given distance within the same text follows a geometric distribution. This is not obeyed by the Voynich corpus (language B) according to his study.

In other words, sentence syntax does not repeat itself over distances. Words appear in any order.
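For readers who want to try this kind of test themselves, here is a minimal Python sketch of a repetition-distance comparison. It is a simplification under my own assumptions: it tracks exact repeats of a word rather than “similar” words, it is not Schinner’s actual procedure, and the function names are hypothetical. It measures the gap between successive occurrences of each word and sets the observed frequencies beside a geometric distribution with the same mean.

```python
from collections import Counter


def repetition_gaps(words):
    """Distances (in tokens) between successive occurrences of the same word."""
    last_seen = {}
    gaps = []
    for i, word in enumerate(words):
        if word in last_seen:
            gaps.append(i - last_seen[word])
        last_seen[word] = i
    return gaps


def geometric_pmf(d, p):
    """P(gap = d) for a geometric distribution on d = 1, 2, 3, ..."""
    return p * (1 - p) ** (d - 1)


def compare_to_geometric(gaps):
    """Map each observed gap to (observed frequency, geometric prediction)."""
    p = len(gaps) / sum(gaps)  # maximum-likelihood estimate: 1 / mean gap
    counts = Counter(gaps)
    n = len(gaps)
    return {d: (counts[d] / n, geometric_pmf(d, p)) for d in sorted(counts)}


# Hypothetical token sequence, NOT real corpus data.
tokens = "a b a a c b".split()
print(compare_to_geometric(repetition_gaps(tokens)))
```

A serious comparison would need a proper goodness-of-fit test over a whole corpus; the sketch only shows which quantities are being compared.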

Reddy & Knight (5.1) state that:

Bigram contexts only provide marginal improvement in predictability for the VMS, compared to the other texts. For comparison with a language that has ‘weak word order’, we also compute the same numbers for the first 22766 word tokens of the Hungarian Bible, and find that the empirical word order is not that weak after all.

In their example, we see that Voynichese has half the predictability of English, and a third that of Arabic. Word order is very weak even compared to the weakest natural language samples.
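One simple way to put a number on “how much does the previous word help predict the next one” is to compare the entropy of single words with the entropy of a word given its predecessor. The Python sketch below does this. It is my own stand-in for the idea and not necessarily the exact measure used in the paper, and the function names and sample sequence are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a Counter of observations."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def unigram_entropy(words):
    """H(word): uncertainty about a word with no context."""
    return entropy(Counter(words))


def conditional_bigram_entropy(words):
    """H(next | previous) = H(previous, next) - H(previous) over adjacent pairs."""
    pairs = Counter(zip(words, words[1:]))
    firsts = Counter(words[:-1])
    return entropy(pairs) - entropy(firsts)


# Hypothetical sequence with perfectly rigid order, NOT real corpus data.
rigid = ["a", "b"] * 4
print(unigram_entropy(rigid), conditional_bigram_entropy(rigid))
```

The larger the drop from the first number to the second, the stronger the word order. In the rigid toy sequence the previous word determines the next one completely, so the conditional entropy falls to zero. Note that on a short text the conditional figure is biased downwards simply because most word pairs occur only once, so texts should be compared at equal length.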

Meanwhile, Montemurro and Zanette note that:

In natural languages, the degree of specificity of words over different parts of a text is determined by their individual semantic role. In particular, content-bearing words appear in texts in a sort of clustered pattern, while structural and functional words tend to be distributed more uniformly. We have analysed the text in the Voynich manuscript using methods derived from Information Theory, that assign a value of information to the individual words in a text without any aprioristic assumption about the structure of the language. Words that are related by their semantic contents tend to co-occur along the text. This property is the basis of standard methods in automatic information retrieval. We compared the patterns of use of the most informative words in the text and found that some of them bear strong relationships in their use. Interestingly, the network of relationships that we obtained showed that related words share similar morphological patterns, either in their prefixes or suffixes. This fact suggests that any underlying code or language in the Voynich manuscript has a strong connection between morphology and semantics, recalling scripts where -as in the cases of Chinese and hierographical Ancient Egyptian- the graphical form of words directly derives from their meaning. (emphasis mine).

Let us consider this.

What all studies agree on is that there is a link between word classes, not sentence order.

  • In short, we see evidence of a non-trivial statistical correlation structure between certain word morphemes. Words do appear next to one another, but based on suffixes and prefixes, not the actual words.
  • Also, there is an affinity between certain words and where they appear in the manuscript. This suggests that there are topics.
  • Words exhibit a very regular structure.
  • There do not appear to be any articles (a, the) within the text.
  • Words have a comparatively uniform length.

We can thus say that whilst we can identify structure within words, we cannot identify how words should be structured within sentences.
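The affinity between words and where they appear, mentioned in the list above, can be measured crudely by cutting the text into equal parts and asking how evenly a word is spread across them. The Python sketch below is a deliberately simplified stand-in of my own and not the information-theoretic method of Montemurro and Zanette; the function name, the default number of parts and the sample text are all assumptions.

```python
import math
from collections import Counter


def part_entropy(words, target, parts=4):
    """Entropy in bits of `target`'s occurrences across equal-sized parts of the text.

    Values near 0 mean the word is clustered in one part (topic-like);
    values near log2(parts) mean it is spread evenly (function-word-like).
    """
    size = math.ceil(len(words) / parts)
    counts = Counter(i // size for i, w in enumerate(words) if w == target)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"{target!r} does not occur in the text")
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical text: "x" and "y" are topical, "f" is evenly spread.
text = ["x", "f", "x", "f", "y", "f", "y", "f"]
for word in ("x", "f"):
    print(word, part_entropy(text, word, parts=2))
```

A fair comparison between words would also have to correct for frequency, since a word that occurs only twice cannot be spread over more than two parts.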

In conclusion:

  • Voynichese words have a non-trivial internal structure – certain glyphs only appear as prefixes or suffixes.
  • There is very weak word order in the corpus. When distinct paragraphs repeat the same words, they do not tend to repeat them in the same order.

Overall, no study has been able to match these characteristics to any known natural language.

Thus, it is unlikely that Voynichese is a natural language (as defined in this article), and very unlikely it is a known European natural language.

If it is a natural language, then we should be able to discard language families that it cannot be, and then try to match it with the remaining language families. After all, if it’s a natural language, it must have evolved in the same way as every other one in the world. Even if we can’t match it to a language, surely we could narrow it down by matching it to a family by typological pattern, if such a comparison exists. And that would be the first step for anyone arguing for a natural language solution – to prove the pattern matches.

3 thoughts on “Is the Voynich a natural language?”

  1. I think there are plainer arguments against the natural language. For a ready example, of 97 botanical folios that I examined so far, the first “letter” of 93 folios is one of the four gallows “letters” – k, p, f or t.

    Is it reasonable to expect such regularity in a natural language? I would rather expect all folios to start with the same letter – this would be appropriate for reference books’ sections entitled something like “De ….”, e.g. “De rosis”, “De papaveris” &c.

    On the other hand, as I noted somewhere else, consideration of the natural language Voynich theory (and, generally, of any other Voynich theory) usually relies too heavily on the assumption of the underlay text that flows freely. But have a look at this query which maps labels onto the Voynichese text:

    What it suggests is a highly compacted conspectus rather than a general narrative. In other words, it is not

    “Take a good glass, visit the bishop’s hostel, place yourself comfortably in the devil’s seat, look twenty-one degrees and thirteen minutes northeast and by north, note the main branch, count the seventh limb on the east side… &c”

    but rather:

    “A good glass in the bishop’s hostel in the devil’s seat twenty-one degrees and thirteen minutes northeast and by north main branch seventh limb east side…”

    1. I agree. It reads very much like Talmudic text, i.e. half of the meaning is inferred and just enough words are provided to tease out the meaning. The text is so dense and irregular, I seriously doubt that it will ever be translated. It will take a Rosetta Stone to crack it.

  2. “But given that there appears to be no tonal markers (as marked by diacritics or other symbols) in Voynichese, we must assume the contrary, ie, that there are no tonal markers in Voynichese. And so it is not transcribing a language with a high phonotactic structure.”

    I don’t know if this is a sound argument. It supposes that the Voynich language 1) must have tones, and 2) that they must be represented. The structure of words in the manuscript is obvious and fairly rigid, and it is a reasonable assumption that individual characters may stand for sounds. This would naturally lead to a theory (even if wrong) that the language has a fairly simple but rigid syllable structure. I think the argument used to dismiss this is weaker than the logic and evidence for it.

