I was musing on the application of Zipf's Law to the Voynich. Several people have carried out studies into this, and all report that the text falls within expected parameters. I did my own Zipf study and generally got the same results.
Now, Zipf's law says that the frequency with which words are used within a written language should follow a logarithmic scale when plotted against their rank. Voynichese does. That proves nothing other than that the text is not purely random.
For more, and a full explanation of how Zipf can be used to analyse the text, see Sravana Reddy and Kevin Knight in section 4 of their paper What We Know About The Voynich Manuscript.
But I then tried to turn Zipf around. After all, Zipf's "law" is really only a sub-clause of Benford's law, which states that in any large dataset, the leading digits 1-9 should follow a logarithmic scale.
You might think that in any large dataset, a leading 1 would appear about 11% of the time (it being one of 9 possible digits). It usually doesn't. It appears about 30% of the time. Strange, eh?
So the percentage of times each leading digit appears results in a chart like this:
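The expected rates behind that chart come straight from the Benford formula, log10(1 + 1/d) for each leading digit d. A minimal sketch, using nothing beyond the standard library:

```python
import math

def benford_rate(d: int) -> float:
    """Expected share of leading digit d (1-9) under Benford's law."""
    return math.log10(1 + 1 / d)

# Digit 1 comes out at about 30.1%, digit 9 at only about 4.6%.
for d in range(1, 10):
    print(f"{d}: {benford_rate(d):.1%}")
```

The nine rates sum to exactly 1, which is why the digits can be read as shares of the whole dataset.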
Does Benford’s Law apply for, let us say, paragraph length? If it does for word distribution, why not for paragraph length?
Now, Benford's is often used in fraud detection, and I've used it myself for similar things. It applies to any exponential dataset (i.e. one that doubles, then doubles again, in the same time span), but it also applies to a lot of datasets where an exponential growth pattern isn't obvious, yet there is constant and natural variation. The size of cities (within an economic area). Income distribution. Etc. It doesn't work for datasets that are limited in some way (shoe sizes, growth patterns, IQ scores).
So, in principle, I can’t see why it wouldn’t work for language. Zipf is a logarithmic judgement of how often words appear in text. Let’s expand that idea to the number of words in every paragraph.
I loaded an English-language text selected at random from Project Gutenberg: The Maid of Sker by R. D. Blackmore. I trimmed it to the same number of words as the entire Voynich transcript dump from the Voynich Information Browser (VIB), approx 41,000 words. The first 16 chapters, as I remember.
One of the principles of Benford's is that 0.1, 1, 10 and 100 all count as the same digit: 1. We take only the first digit on the left, and ignore 0, which never appears as a leading digit.
So I ran the text through a word parser, individually counting the number of words in every paragraph. I dumped this into OpenOffice Calc and counted the number of times the digits 1-9 appear as the first digit of each count (it doesn't matter whether there are 2, 20 or 200 words in the paragraph, they all go into the 2 bucket). I then compared each digit's share of the total count against the Benford logarithmic rates.
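The whole pipeline — split into paragraphs, count words, keep the leading digit of each count — can be sketched like this. The paragraph-splitting rule (blank lines) is my assumption about how a Gutenberg plain-text file divides paragraphs, not something stated in the post:

```python
import re
from collections import Counter

def paragraph_word_counts(text: str) -> list[int]:
    """Split on blank lines and count the words in each paragraph."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [len(p.split()) for p in paragraphs if p.strip()]

def leading_digit(n: int) -> int:
    """First digit on the left: 2, 20 and 200 all land in the 2 bucket."""
    return int(str(n)[0])

def digit_rates(counts: list[int]) -> dict[int, float]:
    """Observed share of each leading digit 1-9 across all paragraphs."""
    tally = Counter(leading_digit(c) for c in counts)
    total = len(counts)
    return {d: tally.get(d, 0) / total for d in range(1, 10)}
```

Feed `digit_rates(paragraph_word_counts(open("maid_of_sker.txt").read()))` the trimmed Gutenberg text (filename hypothetical) and you get the blue "sample rate" line of the charts below.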
Here are the results for The Maid of Sker (525 paragraphs):
Now, what we're looking for is for the blue and red lines to match. The red Benford rate is the theoretical expectation for each digit. The blue sample rate is the actual occurrence. If the blue is a lot bigger or a lot smaller than the red, we're seeing… well, something artificial, is the best way I can put it.
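One way to put a number on "a lot bigger or a lot smaller" — this quantification is my addition, not part of the original eyeball test — is a chi-square goodness-of-fit statistic against the Benford expectations:

```python
import math

def benford_chi_square(observed: dict[int, int]) -> float:
    """Chi-square statistic of observed leading-digit counts vs Benford.

    Rule of thumb: with 8 degrees of freedom, a statistic above ~15.5
    suggests a significant departure from Benford at the 5% level.
    """
    n = sum(observed.values())
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (observed.get(d, 0) - expected) ** 2 / expected
    return stat
```

A sample that "matches pretty damn perfectly" scores near zero; one dominated by a single digit blows well past the threshold.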
And as you can see, this randomly chosen text matches pretty damn perfectly.
Here's the same analysis done on Il piccolo santo, an Italian-language play by Roberto Bracco. Only 4,051 words and 350 paragraphs, but it's in a different language from English AND in a different style, being a script designed to be read (or sung) aloud.
Not a good match for Benford. Why? I'm assuming it's because it's a script designed to be performed. There are an awful lot of short, tense sentences, with word counts in the low teens. What we have here is a dataset that's artificially limited, inasmuch as it's designed to be read (or sung) aloud.
Let's try something in German. Die Tote und andere Novellen by Heinrich Mann. About 10,000 words, and only 85 paragraphs! Germans, it seems, don't like short and snappy paragraphs. What the hey, I've formatted it now, so in it goes.
It's a short data sample, but it fits nicely, and it's a perfect example of Benford. Only 85 paragraphs, yet the word lengths are a good logarithmic fit. Don't get carried away, though: the digits 6, 7 and 8 only appeared 4 times each.
See the difference between the English & German, and the Italian?
The English & German texts are novels, stream of consciousness. As such, paragraph length is pretty random yet interlinked: a perfect dataset for applying Benford's to.
The Italian text was limited, as it was designed to be spoken aloud, and as such, was mainly short, stubby phrases. So it was a bad match for Benford.
Let’s try the Voynich.
I ran the same process, and at once came up with a quandary. The transcript includes an awful lot of "labels", each of which counts as its own one-word paragraph in the transcript. Most of them are the single words written beside objects in the VM. Include, or leave out?
I tried it both ways, first with all the one-word paragraphs (labels) included (1788 paragraphs):
There were hundreds of labels included in the text, and it seems they're skewing the data by gobbling up almost half of the paragraph count.
Let's try it without the single-word labels (1016 paragraphs):
See what I mean about Benford not working if you artificially limit the dataset in any way? I've removed ALL the single-word labels and, it seems, cut out too many "1"s.
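That over-correction is easy to see in code. A sketch with hypothetical word counts (the numbers are illustrative, not the real transcript): dropping every one-word paragraph removes the labels, but also removes every genuine one-word paragraph, starving the "1" bucket:

```python
from collections import Counter

def leading_digit(n: int) -> int:
    return int(str(n)[0])

# Hypothetical words-per-paragraph list: genuine paragraphs of varying
# length mixed with hundreds of one-word labels.
counts = [1] * 700 + [3, 7, 12, 15, 23, 48, 9, 140, 31, 5] * 30

with_labels = Counter(leading_digit(c) for c in counts)

# Crude filter: throw away EVERY one-word paragraph.
without_labels = Counter(leading_digit(c) for c in counts if c > 1)

print(with_labels[1], without_labels[1])
```

With the labels in, the 1 bucket is bloated far past Benford's ~30%; with them all out, only the paragraphs of 10-19 or 100-199 words feed the 1 bucket, which now undershoots.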
Back to the VIB. I then extracted the complete T.T. transcript of the Text, Biological, Cosmological and Pharmaceutical sections, on the basis that these are the ones with the fewest labels. This limited me to 706 paragraphs.
Back to the spreadsheet and I produced this:
Back to square uno. There are just too many labels in the text: over 60% of the restricted text is one-word label paragraphs. What I need is a transcription that detects and strips out the labels whilst leaving the genuine one-word paragraphs in the main text.
Ask, and ye shall receive: Stolfi has already worked out the paragraph text, which I "borrowed". He actually wanted the opposite: he was analysing the labels, and wanted to discard all the other text. The actual transcription scheme doesn't matter; the important things are the word boundaries and the paragraph breaks. Stolfi gave me 1134 paragraphs.
There are still quite a few one-word label paragraphs, but very, very few "teens" (paragraphs with 10-19 words) appear in Stolfi's text.
What we're seeing is a lot of paragraphs bunched around the 40-word mark.
Here’s the stats for the Stolfi text:
| Digit | Occurrence | Sample Rate | Benford Rate |
The total number of words per paragraph, in CSV format, is as follows:
…just in case you want to copy that into Excel.
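If you'd rather skip the spreadsheet, the CSV can be tallied directly. The line below is a stand-in for the real list of counts, just to show the mechanics:

```python
from collections import Counter

csv_line = "23,41,7,112,58,9,36,140,5,19"  # stand-in for the real counts
counts = [int(x) for x in csv_line.split(",")]

# Tally the leading digit of each paragraph's word count.
first_digits = Counter(int(str(c)[0]) for c in counts)
print(first_digits)
```

From there, dividing each tally by the total paragraph count gives the sample rate column of the table above.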
So, what do I think we’re seeing with this text?
There are two options here. One is that instead of stream of consciousness (i.e. a novel, or someone writing ideas down in paragraph form as they think of them), it's a manual.
The second, somewhat more likely option, is that the text just doesn't contain paragraphs. Instead, the scribe is running all the sentences together until they hit a natural "barrier", whether that's a change of topic, an image, or the end of the page.
So don't bother looking for paragraphs as denoting blocks of information. There aren't any, statistically speaking.
Why is this important?
- I don’t know how consistent this is with the writing style of the 15th century. A point to investigate.
- It indicates a specific pattern to the text. What sorts of texts are written in a similar style?
- If it’s a modern forgery, the forger wasn’t subconsciously attempting to imitate a writing style, but instead attempting to follow some sort of pattern.
I reserve the right to change my opinion as I keep looking, though – I'm running more tests as we speak!