This article is a work in progress. Comments and feedback are enthusiastically welcomed!
First off, let’s discuss what we mean by a natural language.
A natural language is one that has evolved spontaneously amongst a group of people (I include creoles, pidgins and other bricolage in this study) or an artificial language that is capable of being used as a primary source of transmitting information in a natural way (think Esperanto or other a posteriori languages).
In short, I here define a natural language as one that any cognitively normal human being is able to learn, understand and use without recourse to artificial means (as opposed to the a priori, code-based artificial languages that require the memorisation of thousands of ciphers; these would be artificial languages under my definition).
Shorthand (Tironian notes) or notarial code are banished to the “artificial languages” page when they occupy most of the text; I briefly discuss their limited use within a written natural language below.
Oneiric languages (basically those spontaneous languages such as the languages of the insane, or glossolalia) are consigned to the “gibberish” pigeon hole.
So under this definition, English, Esperanto, Chinese and Nahuatl are all natural languages. Latin written with scribal abbreviations likewise is a natural language, but a page full of shorthand is not as it is properly a code. I do not distinguish between the linguistic definitions of a prescribed and unprescribed natural language.
An a priori language is one that is created from scratch (like logic based artificial languages or a code language); an a posteriori language is one that is built upon the foundations of a pre-existing natural language (like Esperanto or, I don’t know, Pig Latin).
Some thoughts on phonetic mapping and the script
It has been argued that Voynichese cannot have a 1:1 mapping to phonetic words. By itself, this is a simplistic argument. Prof. Stephen Bax points out that English does not have a 1:1 mapping to words either: look at the pronunciation of through (T H U R A amongst other pronunciations!).
At the other end of the spectrum, shorthand is essentially a personalised script for the base language, capable of being read aloud in a natural language by those in the know, and so has almost a 1:1 mapping. Each symbol or word is a homophone for a word and one word alone.
Note that we want a written natural language. A sign language is considered to be a natural language, but it is not a written language (well, it can be but under strictly controlled situations).
In theory, any script can be mapped onto any language (see my page on the Aljamiadas of Spain for examples of Romance languages written in the Arabic script). Of course, in practice a pressing need is always behind such attempts (in the case of the aljamiadas, it was a way for the Moors in Christian Spain to maintain their heritage and get around the proscriptions of the Inquisition).
So yes, it is possible that a new script has been invented for a pre-existing language. But do the characteristics of Voynichese permit this to be a possibility?
Language is the ability to acquire and use complex systems of communication, particularly the human ability to do so, and a language is any specific example of such a system.
Script is the collection of visual signs used to represent textual information on the page.
Glyphs are the “letters” used in Voynichese. There are 25 glyphs in the Voynich alphabet.
We are looking for evidence to suggest that Voynichese is a natural language (but not what language it is). So we want to test the following hypothesis:
Voynichese is a natural language. It encodes the natural communication of a group.
In other words, is Voynichese a language capable of being spoken and written in the same way as English, Russian, Hawaiian or Chinese? It could be a new script for an existing language (imagine someone understood Chinese but couldn’t write it, so they invented an alphabet based on their own Latin) or a dead language that has only given us one written copy.
It could be:
- A unique example of a natural language
- An attempt at writing down a language that had no natural written system (or not one that was understood by the scribe)
- A copy of a natural language that was not understood by the scribes
Let us consider some of the philosophical characteristics of natural languages and see if they fit Voynichese:
- All languages are systematic. They are governed by a set of interrelated systems that include phonology, graphics (usually), morphology, syntax, lexicon, and semantics.
- All natural languages are conventional and arbitrary. They obey rules, such as assigning a particular word to a particular thing or concept. But there is no reason that this particular word was originally assigned to this particular thing or concept.
- All natural languages are redundant, meaning that the information in a sentence is signalled in more than one way.
- All natural languages change. There are various ways a language can change and various reasons for this change.
(C. M. Millward and Mary Hayes, A Biography of the English Language, 3rd ed. Wadsworth, 2011).
Sadly, we cannot answer “yes” or “no” to any of the above propositions, because we don’t understand Voynichese! But we cannot exclude any of them. Let’s go a bit more in depth.
Charles Hockett proposed the “Ten fundamental characteristics of human language as system of communications”:
- Human language is learned: (1) it is acquired through cultural transmission; (2) speakers of one language can learn another.
- It is discrete: language consists of minimal units.
- It is recombinable: these minimal units can be combined in infinite varieties.
- It is unconscious/intuitive: structural knowledge of a language is not necessarily conscious or articulated by its speakers.
- It is interchangeable: any speaker potentially can create and utter any message.
- It is reflexive: people can talk about language; language has the ability to refer to itself.
- It is arbitrary: meaning depends on arbitrary association of meaning with sign or symbol, on conventions shared by sender and receiver of message.
- It is redundant: language contains redundant communicative elements (message may be conveyed or reinforced twice in same utterance).
- It can displace: language can convey imaginary, distant, past, present, future, conjectural, and/or counterfactual statements (including lies).
- It is productive: a speaker can create totally novel statements and a listener can understand them.
Again, we cannot answer our question by examining it from this philosophical angle – because we don’t understand Voynichese. Currently, these statements are irrefutable.
It seems clear that instead we must change our philosophical approach to a more analytical approach.
Is there a grammar?
These concepts evolve from the theory of Universal Grammar, a theory developed by Noam Chomsky. In brief:
The theory of Universal Grammar proposes that if human beings are brought up under normal conditions (not conditions of extreme sensory deprivation), then they will always develop language with a certain property X (e.g., distinguishing nouns from verbs, or distinguishing function words from lexical words). As a result, property X is considered to be a property of universal grammar in the most general sense.
So we need to look for elements of grammar in the Voynichese corpus. If it is a natural language, it will perforce contain a grammar; and if there are syntax rules within the corpus, they should be empirically identifiable.
So, first question:
Can we identify morphemes?
A morpheme is the smallest grammatical unit in a language.
Yes, we can identify morphemes in the corpus. Voynichese glyph combinations are highly position-aware within words – glyph groups are non-trivial in their internal positioning. We can identify, and have identified, a long list of suffixes and prefixes within Voynichese. We know that certain glyphs only appear as suffixes; we know that certain glyphs only appear as prefixes; and we know that other glyphs are free form. We have also identified (via the CLS theorem) that glyphs appear in a certain pattern.
We assume these are bound morphemes because they obey certain rules of positioning. We make no assumptions about words that do not include such bound morphemes as we are unable to identify a meaning for such unbound morphemes.
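As a rough illustration of how such positional restrictions can be checked, here is a minimal Python sketch. It assumes a transcription in which each character stands for one glyph (which, as noted below, depends on the transcription alphabet chosen). The function names and the toy word list are my own inventions for the example, not real corpus data.

```python
from collections import Counter, defaultdict

def glyph_position_profile(words):
    """Count how often each glyph occurs word-initially, medially and finally.

    A one-glyph word is counted as "initial" only.
    """
    profile = defaultdict(Counter)
    for word in words:
        for i, glyph in enumerate(word):
            if i == 0:
                slot = "initial"
            elif i == len(word) - 1:
                slot = "final"
            else:
                slot = "medial"
            profile[glyph][slot] += 1
    return profile

def position_bound_glyphs(words, slot, min_count=2):
    """Glyphs seen at least min_count times and only ever in the given slot."""
    profile = glyph_position_profile(words)
    return sorted(
        glyph for glyph, counts in profile.items()
        if counts[slot] >= min_count and sum(counts.values()) == counts[slot]
    )

# Made-up, transcription-like tokens used purely as toy input.
sample_words = ["qokal", "qotar", "dain", "okain", "qodal", "sharn"]
prefix_only = position_bound_glyphs(sample_words, "initial")
suffix_only = position_bound_glyphs(sample_words, "final")
```

The `min_count` threshold is there only to stop glyphs seen once from being reported as position-bound; on a real transcription it would need to be much higher.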
Do these morphemes function as they do in natural languages?
Now, the following is based on the transcriptions we have (no particular one, but a general average between all of them). It also depends upon the transcription alphabet we use (because the spelling of words in the same transcription can change when we alter the transcription alphabet used).
Note: Usually when people refer to a “grammar” in Voynichese they are actually referring to the construction of words, not syntax. In this article I’m actually referring to sentence syntax when I use the concept of “grammar”.
Now, the text of the manuscript is divided up into clearly defined word-like glyph groups. These glyph groups have a non-trivial internal structure which is manifest in the severe restrictions imposed upon the positioning of glyphs within the word groups, as we mentioned before when talking about morphemes.
This internal distribution of words makes comparison of average word lengths against other natural languages somewhat irrelevant, as the strict internal construction of words means they cannot be compared with most alphabet based languages.
Whilst all alphabet languages contain bound morphemes, none constrains internal letter positioning to the extent that Voynichese does: it severely restricts the free positioning of unbound glyphs.
Irrespective of that, D. Abbot et al conclude that
- Based on characteristics such as word length distribution (WLD) and WRI, the text appears similar to languages such as Hebrew and Latin.
But this conclusion is based solely on WLD and the team point out that “There is a distinctly low word order in comparison to known languages.”
The approach of D. Abbot et al demonstrates the shortcoming of most analytical comparison tests. The analysis only looks at factors such as WLD or basic entropy, without attempting to analyse the syntax of sentence structure or positioning of glyphs within glyph groups (words).
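For reference, a word length distribution is simple to compute, which is part of why it is so often the only test applied. The sketch below is my own minimal version, assuming whitespace-separated words and one character per glyph; it is not the procedure used by D. Abbot et al.

```python
from collections import Counter

def word_length_distribution(words):
    """Return the relative frequency of each word length, keyed by length."""
    counts = Counter(len(word) for word in words)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}

# Toy input only: four placeholder "words".
wld = word_length_distribution(["ab", "cd", "efg", "h"])
```

Two texts can share a very similar distribution while having nothing in common at the level of glyph positioning or sentence syntax, which is the shortcoming described above.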
Can we thus make any deduction as to the phonotactic structure of the words?
This is an interesting point, for most languages do have some sort of diacritic, even if it’s only to identify the stress. English is one of the few that doesn’t (loan words excepted), although of course there are more.
In fact, given the lack of diacritics or other verbal modifiers, we would assume a free phonotactic structure – an assumption which is contradicted by the evidence of non-trivial internal structure of words. This is a contradiction whose possible resolutions argue against a natural language.
Due to the restrictive nature of glyph positioning, we cannot assume a 1:1 translation from Voynichese to Romance languages. However, there are many other families of language. Can we make an intuitive leap and try to identify sentence matching with a different family of language?
The new question is thus:
Is there a sentence syntax in Voynichese?
So, can we say that there is a basic grammar in Voynichese? Can we identify nouns, adverbs, sentence subject and other such concepts?
To do this, we must first identify sentences.
The works of Reddy & Knight (section 3.2 and 3.3) show that there is no natural punctuation in Voynichese, nor do letters appear to have case. However, this is not unnatural in manuscripts of the era (assuming the early half of the 15th century as its terminus a quo), as punctuation and case were still developing in written manuscripts, although this is a messy and complex subject.
However, to balance this, in formal writing a layout called per cola et commata was in use, in which a sentence was described as a sense-unit, and coming to the end of that sense-unit involved starting a new line. This method does seem to be in use in the Voynich, inasmuch as we have paragraphs. We also seem to have lines that are longer than others, possibly indicating sentence breaks within the paragraphs.
Of course, to assume per cola et commata was in use is to assume a European tradition for the manuscript. There are sufficient visual clues (not entered into here) for this to be a distinct possibility. But even without assuming this, we can see that words are ordered into rows of distinct length; and that rows are grouped into paragraphs with additional space between them. From the visual evidence we can assume paragraphs are used, and if we take a contrary position then we are introducing an element of deliberate deception into the argument which by itself negates the possibility of a natural language.
But is there word order? This is a delicate subject. Word order refers to how words must be arranged in a sentence. Latin, for example, has weak word order, inasmuch as the words in a sentence can be rearranged according to the needs of the inflection.
There are six theoretically possible basic word orders for the transitive sentence: subject–verb–object (SVO), subject–object–verb (SOV), verb–subject–object (VSO), verb–object–subject (VOS), object–subject–verb (OSV) and object–verb–subject (OVS). The overwhelming majority of the world’s languages are either SVO or SOV, with a much smaller but still significant portion using VSO word order. The remaining three arrangements are exceptionally rare, with VOS being slightly more common than OSV, and OVS being significantly more rare than the two preceding orders.
Some languages have no fixed word order (189 according to this). These languages often use a significant amount of morphological marking to disambiguate the roles of the arguments. However, some languages use a fixed word order, even if they provide a degree of marking that would support free word order. Also, some languages with free word order—such as some varieties of Datooga—combine free word order with a lack of morphological distinction between arguments.
There is very weak word order within the Voynich.
But there is an important point here – when written, such languages do tend to obey the word order selected by the author. In other words, Latin prose and Latin legalese have very different word order even for the same subject yet throughout the text will keep the same rough order.
It’s just common sense. It may not make much sense to a native English speaker, where word order is relatively strong (OK, both Jim likes beans and Beans are liked by Jim are valid sentences, but which is more likely?) but in other languages you’ll find two basic ideas that are especially important when you’re talking about word order: topic and focus.
The topic of a sentence is “what the sentence is about”. A sequence of sentences will have the same or related topics. But when you want to change topics between sentences, or to contrast two topics with one another — speakers tend to mark the new topic in some way when they do this.
For example, a change in topic: Jim likes beans. Me, I hate them. We were talking about Jim, but the left-dislocated “Me” marks the switch to a new topic.
The other important idea is focus. Focus has many different uses but one of the biggest uses is to mark the answers to questions. In English we use prosody to mark focus. For example: Who likes beans? Jim likes beans. What’s the focus? Jim, the part that answers the question.
So, in any natural language, semantic analysis should be able to detect:
- Sentence syntax (words that commonly appear before others)
- Topic (words that appear commonly in certain locations)
- Focus (words that appear commonly in certain order)
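The first of these checks can be sketched very simply: count which words sit directly before which, then ask whether a given pair prefers one order over the other. This is only a minimal illustration under my own assumptions (one line of text per string, words separated by spaces); the function names are invented for the example.

```python
from collections import Counter

def adjacent_pair_counts(lines):
    """Count every ordered pair of neighbouring words, line by line."""
    pairs = Counter()
    for line in lines:
        tokens = line.split()
        pairs.update(zip(tokens, tokens[1:]))
    return pairs

def order_asymmetry(pairs, a, b):
    """Score from -1 to 1: 1 means a always precedes b, -1 the reverse.

    Returns None when the two words are never adjacent.
    """
    forward, backward = pairs[(a, b)], pairs[(b, a)]
    total = forward + backward
    if total == 0:
        return None
    return (forward - backward) / total

# Toy English input, not Voynichese.
toy_lines = ["jim likes beans", "jim likes rice", "beans likes jim"]
toy_pairs = adjacent_pair_counts(toy_lines)
```

In a text with strong word order, many frequent pairs should score close to 1 or -1; scores clustered around 0 point to weak or absent ordering.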
Schinner (2007 – The Voynich Manuscript: Evidence of the hoax hypothesis) has shown that the probability of similar words repeating themselves at a given distance within the same text follows a geometric distribution. This is not obeyed by the Voynich corpus (language B) according to his study.
In other words, sentence syntax does not repeat itself over distances. Words appear in any order.
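To make the idea of a repeat-distance test concrete, here is a simplified Python sketch. Note the assumptions: it measures the gap between successive occurrences of identical tokens, whereas Schinner works with similar words, and it fits the geometric parameter by the usual maximum-likelihood estimate (one over the mean gap). It is an illustration of the general technique, not a reconstruction of his method.

```python
from collections import Counter

def repeat_gaps(tokens):
    """Distances (in words) between each token and its previous occurrence."""
    last_seen = {}
    gaps = []
    for i, token in enumerate(tokens):
        if token in last_seen:
            gaps.append(i - last_seen[token])
        last_seen[token] = i
    return gaps

def geometric_fit(gaps):
    """Fit a geometric distribution on {1, 2, ...} to the observed gaps.

    Returns (p, observed counts, expected counts for each observed gap).
    """
    n = len(gaps)
    p = n / sum(gaps)  # maximum-likelihood estimate: 1 / mean gap
    observed = Counter(gaps)
    expected = {k: n * p * (1 - p) ** (k - 1) for k in sorted(observed)}
    return p, observed, expected

# Toy token stream, for illustration only.
toy_gaps = repeat_gaps("a b a b a c a".split())
p_hat, observed, expected = geometric_fit(toy_gaps)
```

Comparing `observed` with `expected` (by eye, or with a goodness-of-fit test) shows how closely a text follows the geometric pattern.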
Knight & Reddy (5.1) state that:
Meanwhile, Montemurro and Zanette note that:
Let us consider this.
What all studies agree on is that there is a link between word classes, not sentence order.
- In short, we see evidence of a non-trivial statistical correlation structure between certain word morphemes. Words do appear next to one another, but based on suffixes and prefixes, not the actual words.
- Also, there is an affinity between certain words and where they appear in the manuscript. This suggests that there are topics.
- Words exhibit a very regular structure.
- There do not appear to be any articles (a, the) within the text.
- Words have a comparatively uniform length.
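The first point, that neighbouring words correlate through their affixes rather than as whole words, can be probed with a count like the one below. This is a minimal sketch under my own assumptions: the "suffix" and "prefix" are crudely taken as the last and first `n` characters of each word, and the sample lines are made-up tokens rather than real transcription data.

```python
from collections import Counter

def affix_transitions(lines, n=2):
    """Count (ending of word, beginning of next word) pairs across each line."""
    transitions = Counter()
    for line in lines:
        tokens = line.split()
        for left, right in zip(tokens, tokens[1:]):
            transitions[(left[-n:], right[:n])] += 1
    return transitions

# Made-up, transcription-like lines used purely as toy input.
toy_transitions = affix_transitions(["qokedy qokain", "shedy qotar"])
```

If a handful of ending/beginning pairs dominate the counts while the whole-word pairs stay diffuse, that matches the pattern described above.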
We can thus say that whilst we can identify structure within words, we cannot identify how words should be structured within sentences.
- Voynichese words have a non-trivial internal structure – certain glyphs only appear as prefixes or suffixes.
- There is very weak word order in the corpus. When distinct paragraphs repeat the same word, they do not tend to repeat them in the same order.
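The second point can be expressed as a simple score: take the words two paragraphs share, and see how often a pair of them turns up in the same relative order in both. The Python below is a minimal sketch of that idea using first occurrences only; the function name and the scoring choice are my own assumptions.

```python
from itertools import combinations

def shared_order_agreement(par_a, par_b):
    """Fraction of shared word pairs that keep the same relative order.

    Each paragraph is a list of words; only first occurrences are compared.
    Returns None when fewer than two words are shared.
    """
    first_a, first_b = {}, {}
    for i, word in enumerate(par_a):
        first_a.setdefault(word, i)
    for i, word in enumerate(par_b):
        first_b.setdefault(word, i)
    shared = sorted(set(first_a) & set(first_b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return None
    agreeing = sum(
        (first_a[x] < first_a[y]) == (first_b[x] < first_b[y])
        for x, y in pairs
    )
    return agreeing / len(pairs)

# Toy paragraphs, for illustration only.
same = shared_order_agreement("a b c d".split(), "a b c d".split())
reversed_order = shared_order_agreement("a b c d".split(), "d c b a".split())
```

A score near 1 means the shared words keep their order, a score near 0 means the order is reversed, and a score near 0.5 is what unordered words would give.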
In conclusion, no study has been able to match these characteristics to any known natural language.
Thus, it is unlikely that Voynichese is a natural language (as defined in this article), and very unlikely it is a known European natural language.
If it is a natural language, then we should be able to discard language families that it cannot be. And then try to match it with the remaining language families. After all, if it’s a natural language, it must have evolved in the same way as every other one in the world. Even if we can’t match it to a language, surely we could match it to a family by type pattern if such a comparison exists to “narrow it down”. And that would be the first step for anyone arguing for a natural language solution – to prove the pattern matches.