CLS unigram correlation study


This article extends the summary given in section 4.5 of the Curve Line System (CLS) paper presented by Brian Cham, to which I contributed.

This article attempts to test all curve and line glyph groups to check that every member is in the correct group as proposed by the CLS theory. This is done by selecting a glyph and assembling all bigrams containing that glyph found in the text. The other glyph in each bigram is then checked to see if it belongs to the same group. Aberrant bigrams are noted for analysis.


A transcription file was loaded into the glyph parser. The transcription file consists of an extraction from the  transcription file. Only Currier A language is tested in this document. All words with weirdos and exclamations are stripped out, and paragraph breaks are ignored. There were 10,645 words in the corpus after formatting was complete.

The parser is given a glyph and a group to test against. All examples of that glyph are then flagged up. In theory, the glyph should only be found residing amongst glyphs of its own group or a special character such as “a”.
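The check described above can be sketched in a few lines of Python. This is only an illustration, not the parser used for this study: the function name `aberrant_bigrams` and the sample words are made up for the example, the group sets are the ones listed under "Glyph Groups" below, and the special handling of <a> is left out.

```python
from collections import Counter

# Glyph groups as listed under "Glyph Groups" in this article.
CURVE = set("abcdefghkopqstuy")
LINE = set("ijlmnrvxz")

def aberrant_bigrams(words, glyph):
    """Count bigrams containing `glyph` whose other member is in the opposite group."""
    own = CURVE if glyph in CURVE else LINE
    counts = Counter()
    for word in words:
        # Bigrams never cross a word break because each word is scanned separately.
        for left, right in zip(word, word[1:]):
            if glyph not in (left, right):
                continue
            other = right if left == glyph else left
            if other not in own:
                counts[left + right] += 1
    return counts

# Illustrative EVA-style strings, not taken from the corpus.
sample = ["daiin", "chol", "okol"]
print(aberrant_bigrams(sample, "o"))
```

With these three sample words, the only flagged bigram for <o> is <ol>, found twice.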

A series of tests was run. After each test, the results are examined to detect “special cases” and aberrations. Follow-up tests are then run to support the conclusions. Analysis is done in Excel; results are checked manually in Notepad++ to ensure they are correct.

A “rule of thumb” for transcription errors has been assumed. In essence, when an aberrant glyph has a very low occurrence (under 0.5% of the corpus, or about 50 occurrences), I ignore it. When such occurrences are manually checked against the original manuscript, they usually turn out to be ambiguous, smudged or erroneous. The results are still noted in the graphs below.
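As a rough illustration of this rule of thumb, the cut-off can be written as a simple comparison. The names `CORPUS_WORDS`, `THRESHOLD` and `is_negligible` are hypothetical, and I am assuming the 0.5% is taken against the corpus word count quoted above.

```python
CORPUS_WORDS = 10645   # corpus size quoted in this article
THRESHOLD = 0.005      # 0.5%, the rule-of-thumb cut-off

def is_negligible(count, corpus_size=CORPUS_WORDS, threshold=THRESHOLD):
    """True when an aberrant count falls under the rule-of-thumb cut-off."""
    return count < corpus_size * threshold

# The cut-off works out at a little over 53, i.e. "about 50 occurrences".
print(round(CORPUS_WORDS * THRESHOLD))
print(is_negligible(4), is_negligible(155))
```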


A word is defined in the standard way: a string of glyphs with a word break (space) on either side.

A glyph (abbreviated g) is a Voynich glyph as defined under the EVA transcription alphabet. Several EVA characters may be combined to create combo glyphs or pedestalled gallows (example: <cfh>). Curve glyphs are abbreviated c, line glyphs are abbreviated l. A glyph group is the whole group being tested (see “Glyph groups”, below). An aberrant glyph is a glyph in an n-gram that belongs to the opposite group from the one being tested.

Prefixed means (glyph group)+glyph. Suffixed means glyph+(glyph group). In both cases the affix refers to the glyph group being tested, not the glyph being tested.

Mixed refers to a glyph surrounded by glyphs from distinct groups (i.e. c-g-l: a curve glyph, the glyph being tested, a line glyph). Mixed left means an aberrant glyph has been found to the left of the glyph in question; mixed right means an aberrant glyph has been found to the right.
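To make these definitions concrete, here is a minimal sketch that labels the neighbours of one glyph in a word. It rests on an assumption of mine: that "the glyph group being tested" is the group of the glyph itself. The function name `describe_neighbours` is hypothetical.

```python
CURVE = set("abcdefghkopqstuy")
LINE = set("ijlmnrvxz")

def describe_neighbours(word, i):
    """Label the neighbours of word[i] relative to that glyph's own group."""
    own = CURVE if word[i] in CURVE else LINE
    labels = []
    if i > 0:  # there is a glyph to the left
        labels.append("prefixed" if word[i - 1] in own else "mixed left")
    if i < len(word) - 1:  # there is a glyph to the right
        labels.append("suffixed" if word[i + 1] in own else "mixed right")
    return labels

# <o> in the illustrative string "chol": curve glyph before it, line glyph after it.
print(describe_neighbours("chol", 2))
```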

A special case is an aberrant bigram which cannot be explained away as a scribal or transcription error but which appears to have a specific rule.

Theoretical bases

There are two separate cases to test, depending on whether the glyph is in the middle of a word (forming a trigram) or whether it is a prefix or suffix. Each case has possible combinations within.

When the glyph is centred in a trigram, there are four possible combinations:

lgl lgc cgl cgc

In this case, the parser returns a maximum of two counts per occurrence, one for each neighbour (so lgl counts twice towards the l count and cgc counts twice towards the c count).

When the glyph is at the end or the beginning of a word, there are four possible combinations:

gl* gc* *gl *gc

In this case, the parser returns one count per occurrence. Since the parser takes word boundaries into consideration, it should not double count between the two cases.

Therefore, the total number of occurrences will not coincide with the results returned by the parser, as the parser is counting combinations for each case, not occurrences.
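The difference between occurrences and combination counts can be shown with a small sketch. It is an illustration under my own naming (`count_combinations`, `group_of`), assuming every character in the input belongs to one of the two glyph groups; it is not the parser itself.

```python
from collections import Counter

CURVE = set("abcdefghkopqstuy")
LINE = set("ijlmnrvxz")

def group_of(glyph):
    # Assumes the glyph is in one of the two groups listed in this article.
    return "c" if glyph in CURVE else "l"

def count_combinations(words, glyph):
    """Return (occurrences, combination counts) for `glyph`."""
    occurrences = 0
    combos = Counter()
    for word in words:
        for i, g in enumerate(word):
            if g != glyph:
                continue
            occurrences += 1
            if i > 0:                      # left neighbour: cg or lg
                combos[group_of(word[i - 1]) + "g"] += 1
            if i < len(word) - 1:          # right neighbour: gc or gl
                combos["g" + group_of(word[i + 1])] += 1
    return occurrences, combos

occ, combos = count_combinations(["chol", "or"], "o")
print(occ, dict(combos), sum(combos.values()))
```

Here <o> occurs twice, but the mid-word occurrence contributes two combinations and the word-initial one contributes one, so three combinations are returned for two occurrences.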

Glyph Groups

Unless otherwise noted, I am using the standard CLS glyph groups as devised by Brian Cham. There is of course the eternal bugbear of deciding what is a glyph and what is not. The CLS system returns to basic EVA for this, and the reasons have been expounded upon by Brian Cham in the original article. Combo glyphs (pedestalled gallows) have been broken apart into their constituent glyphs (i.e. <cfh> becomes <c><f><h>) by the regex parser.

Curve glyphs: a, b, c, d, e, f, g, h, k, o, p, q, s, t, u, y

Line glyphs: i, j, l, m, n, r, v, x, z
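For readers who want to reproduce the grouping, the two lists can be held as sets and a word reduced to its curve/line pattern. This is a minimal sketch: `split_glyphs` and `group_pattern` are hypothetical names, and the regex simply treats every lowercase EVA letter as its own glyph, which is my reading of how combo glyphs are broken apart.

```python
import re

CURVE = set("abcdefghkopqstuy")
LINE = set("ijlmnrvxz")

def split_glyphs(word):
    """Split a word into single EVA letters, so <cfh> yields c, f, h."""
    return re.findall(r"[a-z]", word)

def group_pattern(word):
    """Render a word as a string of c (curve), l (line) or ? (unlisted)."""
    return "".join(
        "c" if g in CURVE else "l" if g in LINE else "?"
        for g in split_glyphs(word)
    )

print(split_glyphs("cfh"))
print(group_pattern("daiin"))
```

Under CLS, a conforming word should show unbroken runs of c and l, with the transitions explained by the <a> rule or by the special cases discussed later.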


Test 1

(Group glyphs defined as above)

All bigrams were analysed to identify those in which a c is prefixed or suffixed, or an l is prefixed or suffixed. I contented myself with a simple bar graph to identify trends.

I first test all bigrams containing a curve glyph:
Table of concordance showing prefixes and suffixes of c glyphs

Here, non-conforming glyphs in this group should show excessive yellow or green bars.

The aberrant glyphs are as follows:

  • <a> appears to be mainly prefixed with a curve, suffixed with a line, as expected under the CLS [a] rule. It is not aberrant.
  • <o> appears as a suffix to a line quite often.

Conclusion: The only non-conforming glyph in this test appears to be <o>. The other glyphs behave as expected.

I next test all bigrams containing a line glyph:
Table of concordance showing prefixes and suffixes of line glyphs

In this graph, non-conforming glyphs should show high blue or orange bars.

Tables of concurrence

Having established the basics of the CLS theory, I proceeded to draw up a series of tables of concurrence. In essence, I drew up an Excel sheet of all aberrant bigrams to identify the major trends. I then checked the number of occurrences to see whether aberrant bigrams follow an order or are random.

Table 1: Aberrant bigrams in which a curve glyph comes first (c-l)

Table of Concurrence: incidence of [c-l] aberrant bigrams
Comments on table 1:

The important column here is the % of Occurrence, which shows what percentage of the total occurrences of each glyph falls in an aberrant bigram. Totals refers to the total of the left-hand counts; Occurrences refers to the total number of times the glyph appears in the transcription file.
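As a worked example of the % of Occurrence column, here is the calculation for <o>. I am assuming the denominator is the 8204 figure quoted for <o> later in this article, and the numerator is its 3652 aberrant <o>-l bigrams; the function name `percent_aberrant` is hypothetical.

```python
def percent_aberrant(aberrant, occurrences):
    """Share of a glyph's occurrences that fall in an aberrant bigram, in percent."""
    return 100.0 * aberrant / occurrences

# <o>: 3652 aberrant bigrams out of 8204 (figures quoted in this article).
print(round(percent_aberrant(3652, 8204), 2))  # 44.51
```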

No real surprises here. <a> has a high number of l, as expected under the <a> rule of CLS (across all bigrams it appears before a line glyph in 97% of cases, which is well within tolerance levels). The only aberrant glyph here is <o>, which in 44.5% of cases appears before a line glyph, as discussed in the special cases and the tests. <d>, <s>, <e>, <h> and <t> all have some very small occurrence, but so few as to be irrelevant. Overall, the percentage of aberrant bigrams compared to total occurrences is under 10%; if we exclude glyph <o>, the rate of non-conformity drops to just 0.26%, as we can see in the following table (where <a> and <o> are excluded from the totals):

Table of Concurrence: glyphs <a> (which acts as expected) and <o> have been excluded.
Table 2: Aberrant bigrams in which a line glyph comes first (l-c)
Table of Concurrence: line glyphs that appear in aberrant bigrams (l-c).

Comments on table 2:

Looking at the % of occurrence, we can see that <i> is aberrant just 1.75% of the time, whereas <l> is aberrant 26.83% of the time. Overall, a line glyph will be aberrant 11.43% of the time. In the transcription file, a manual check of 25 occurrences against the original manuscript suggests that <y> (valid) will be substituted for <l> (invalid) about 28% of the time (seven of the 25 checks), although this is a very rough and ready check.

Parsing the text – checking for rules

The above tables suggest that certain bigrams will always appear in a certain order. Can we examine the text, identify the aberrant bigrams and see if any rules emerge? It turns out that we can.

Note: I have been counting <ee> and <e>, and <ii> and <i>, as different glyphs here, following the example of many Romance languages where a double letter such as ll or rr is considered separate from the single letter. <e> appears 2.5 times as often as <ee>, but <i> appears 382 times (26%) more frequently than <ii>.
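Counting doubled letters as glyphs in their own right can be sketched with a regex alternation that tries the pairs first. This is an assumption about the mechanics rather than a description of my parser: the name `tokenize` is hypothetical, and pairing greedily from the left (so a run of three <e> becomes <ee> + <e>) is a choice made for the example.

```python
import re
from collections import Counter

# Try the doubled glyphs before falling back to a single letter.
TOKEN = re.compile(r"ee|ii|[a-z]")

def tokenize(word):
    """Split a word into glyphs, keeping <ee> and <ii> as single tokens."""
    return TOKEN.findall(word)

# Illustrative EVA-style strings, not taken from the corpus.
counts = Counter(g for w in ["daiin", "cheey", "dain"] for g in tokenize(w))
print(tokenize("daiin"), counts["ii"], counts["i"])
```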

Here is a list of the aberrant glyphs in their respective bigrams:

  • <o> appears in 8204 bigrams. Counting aberrant bigrams, we find that it appears 3652 times with a line glyph as a suffix (<o>-l) and just 189 times with a line glyph as a prefix (l-<o>).
    • When it is the second glyph, it appears mainly as <lo> (111 times) or <ro> (68 times). Given the similarity between glyphs <l> and <r>, it seems we can combine the two into one bigram: <lo>. There do not seem to be any trigrams associated with <lo>.
    • When it is the first glyph, it appears 3652 times in a more distributed fashion, as follows:
      • <oi> 155 times
      • <ol> 2033 times
      • <om> 122 times
      • <or> 1338 times
      • and 4 other occurrences which can be discounted.
    • Note: <om> appears with a frequency of about 9% of <or>. It could be a confusion (<om> / <or>). <oi> appears with a frequency of about 7% of <ol>. It could be a confusion (<oi> / <ol>).
    • Conclusion: When <o> appears as an aberrant glyph, it is as <lo>, <ol> or <or>. These are special cases. The possible confusion between <o> and <a> is not addressed here.
  • <a> is always aberrant, sitting to the left of a line glyph. This is what the CLS <a> rule predicts.
  • <l> is aberrant 26.83% of the time, when it appears as:
    • <lo> (see above)
    • <ly> (159 times) divided into:
      • <oly> (73 times, 46%)
      • <aly> (79 times, 50%)
      • 7 other occurrences
    • <ld> (170 times) divided into:
      • <old> (52 times, 31%)
      • <ald> (114 times, 67%)
    • Conclusion: When <l> appears as an aberrant glyph, it is as <lo> or <ol> (see the rule of <o>), <ly> and <ld>. The last two bigrams can also be grouped into four trigrams which account for 97% of all their occurrences: <oly>, <aly>, <old>, <ald>. An argument exists for further amalgamation of these four trigrams due to the similarity of their glyphs, but this requires a manual check against the manuscript that has not yet been performed.
  • <r> is aberrant 346 times (15.76% of all occurrences). This can be divided into:
    • <ra> (105 times, 30%) when it appears as either <ara> or <ora>.
    • <ry> (97 times, 28%) when it appears as either <ary> or <ory>.
    • <ro> (81 times, 24%) (see rule for <o>)
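The step from aberrant bigram to trigram in the list above amounts to looking at the glyph immediately before the bigram. A minimal sketch follows, with a hypothetical function name and made-up sample strings; the final lines only re-check the 97% figure from the <ly> and <ld> counts quoted above.

```python
from collections import Counter

def preceding_glyphs(words, bigram):
    """Count the trigrams formed by the glyph immediately before `bigram`."""
    trigrams = Counter()
    for word in words:
        start = word.find(bigram)
        while start != -1:
            if start > 0:
                trigrams[word[start - 1:start + len(bigram)]] += 1
            else:
                trigrams["*" + bigram] += 1  # bigram is word-initial
            start = word.find(bigram, start + 1)
    return trigrams

# Made-up strings for illustration only.
print(preceding_glyphs(["oly", "daly", "ly", "qoly"], "ly"))

# Re-check of the quoted share: <oly> + <aly> + <old> + <ald> over all <ly> + <ld>.
covered = 73 + 79 + 52 + 114
total = 159 + 170
print(round(100 * covered / total))  # 97
```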


Here is the summary of the above results showing the special cases to the CLS theory. When I have been able to link an aberrant bigram into a larger n-gram I have indicated this fact; otherwise, the bigram appears in words “as is” with no discernible pattern to the rest of the word.

  1. <o> is aberrant 44.51% of the time, when it appears in the following bigrams: <ol>, <or> and (rarely) <lo>, <ro> (where <ro> could be a confusion for <lo>).
  2. <l> is aberrant 26.83% of the time, when it appears in the following bigrams: <lo> (see rule 1), <ly>, <ld>. Furthermore, the last two bigrams almost always (97% of occurrences) appear in the following trigrams: <oly>, <aly>, <old>, <ald>.
  3. <r> is aberrant 15.76% of the time, when it appears in the following bigrams: <ro> (see rule 1), <ry>, <ra>. These last two bigrams are almost always part of the following trigrams: <ara>, <ora>, <ary>, <ory>.


I have identified three aberrant glyphs which have only medium or high conformity to the proposed CLS system. However, these three aberrant glyphs conform to very specific rules, and seem to be part of specific n-grams that occur for some as-yet-unidentified, but very specific, reason. The percentage of occurrence of these aberrant n-grams suggests that the aberrant glyphs usually appear freely (in which case their location conforms to CLS) but occasionally as part of n-grams which have a very specific meaning.
