The Cracking Code Book

Author: Simon Singh

Year written: 2018

In addition to a greater understanding of secular subjects, the invention of cryptanalysis also depended on the growth of religious education. Major theological schools were established in Basra, Kufa and Baghdad, where theologians studied the revelations of Muhammad as contained in the Koran. The theologians were interested in establishing the chronology of the revelations, which they did by counting the frequencies of words contained in each revelation. The theory was that certain words had evolved relatively recently, and hence if a revelation contained a high number of these newer words, this would indicate that it came later in the chronology. Theologians also studied the Hadīth which consists of the Prophet’s daily utterances. They tried to demonstrate that each statement was indeed attributable to Muhammad. This was done by studying the etymology of words and the structure of sentences, to test whether particular texts were consistent with the linguistic patterns of the Prophet.

Significantly, the religious scholars did not stop their investigation at the level of words. They also analyzed individual letters, and in particular they discovered that some letters are more common than others. The letters a and l are the most common in Arabic, partly because of the definite article al-, whereas the letter j appears only a tenth as frequently. This apparently minor observation would lead to the first great breakthrough in cryptanalysis.

The earliest known description of the technique is by the ninth-century scientist Abū Yūsūf Ya‘qūb ibn Is-hāq ibn as-Sabbāh ibn ‘omrān ibn Ismaīl al-Kindī. Known as “the philosopher of the Arabs”, al-Kindī was the author of 290 books on medicine, astronomy, mathematics, linguistics and music. His greatest treatise, which was rediscovered only in 1987 in the Sulaimaniyyah Ottoman Archive in Istanbul, is entitled A Manuscript on Deciphering Cryptographic Messages. Although it contains detailed discussions on statistics, Arabic phonetics and Arabic syntax, al-Kindī’s revolutionary system of cryptanalysis is summarized in two short paragraphs:

One way to solve an encrypted message, if we know its language, is to find a different plaintext of the same language long enough to fill one sheet or so, and then we count the occurrences of each letter. We call the most frequently occurring letter the “first”, the next most occurring letter the “second”, the following most occurring letter the “third”, and so on, until we account for all the different letters in the plaintext sample.

Then we look at the ciphertext we want to solve and we also classify its symbols. We find the most occurring symbol and change it to the form of the “first” letter of the plaintext sample, the next most common symbol is changed to the form of the “second” letter, and the third most common symbol is changed to the form of the “third” letter, and so on, until we account for all the symbols of the cryptogram we want to solve.

Al-Kindī’s explanation is easier to follow in terms of the English alphabet. First of all, it is necessary to study a lengthy piece of normal English text, perhaps several, in order to establish the frequency of each letter of the alphabet. In English, e is the most common letter, followed by t, then a, and so on, as given in Table 1. Next, examine the ciphertext in question, and work out the frequency of each letter. If the most common letter in the ciphertext is, for example, J, then it would seem likely that this is a substitute for e. And if the second most common letter in the ciphertext is P, then this is probably a substitute for t, and so on. Al-Kindī’s technique, known as frequency analysis, shows that it is unnecessary to check each of the billions of potential keys. Instead, it is possible to reveal the contents of a scrambled message simply by analyzing the frequency of the characters in the ciphertext.
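Al-Kindī’s two paragraphs translate almost directly into a few lines of code. What follows is only a minimal sketch of the unrefined recipe: the function names `rank_letters` and `naive_frequency_guess` are invented for illustration, and the sample text is whatever plaintext the reader has to hand.

```python
from collections import Counter


def rank_letters(text):
    """Return the letters of text, most frequent first ("first", "second", ...)."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    return [letter for letter, _ in counts.most_common()]


def naive_frequency_guess(ciphertext, sample_plaintext):
    """Pair the n-th commonest cipher symbol with the n-th commonest sample letter."""
    cipher_ranked = [ch.upper() for ch in rank_letters(ciphertext)]
    plain_ranked = rank_letters(sample_plaintext)
    return dict(zip(cipher_ranked, plain_ranked))


# Toy illustration: J is commonest in the "ciphertext", e in the "sample".
guess = naive_frequency_guess("JJJPPQ", "eeetta")
print(guess)
```

In the toy case J is paired with e, P with t and Q with a. On real messages this rank-for-rank pairing is only a first guess, for reasons that become clear below.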

Table 1 This table of relative frequencies is based on passages taken from newspapers and novels, and the total sample was 100,362 alphabetic characters. The table was compiled by H. Beker and F. Piper, and originally published in Cipher Systems: The Protection of Communication.

However, it is not possible to apply al-Kindī’s recipe for cryptanalysis unconditionally, because the standard list of frequencies in Table 1 is only an average, and it will not correspond exactly to the frequencies of every text. For example, a brief message discussing the effect of the atmosphere on the movement of striped quadrupeds in Africa (“From Zanzibar to Zambia and Zaire, ozone zones make zebras run zany zigzags”) would not, if encrypted, yield to straightforward frequency analysis. In general, short texts are likely to deviate significantly from the standard frequencies, and if there are fewer than a hundred letters, then decipherment will be very difficult. On the other hand, longer texts are more likely to follow the standard frequencies, although this is not always the case. In 1969, the French author Georges Perec wrote La Disparition, a 200-page novel that did not use words that contain the letter e. Doubly remarkable is the fact that the English novelist and critic Gilbert Adair succeeded in translating La Disparition into English while still following Perec’s avoidance of the letter e. Entitled A Void, Adair’s translation is surprisingly readable (see Appendix A). If the entire book were encrypted via a monoalphabetic substitution cipher, then a naive attempt to decipher it might be prevented by the complete lack of the most frequently occurring letter in the English alphabet.
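The zebra sentence can be checked in a moment. This short sketch simply counts the letters of the quoted sentence; nothing in it is specific to any cipher.

```python
from collections import Counter

SENTENCE = ("From Zanzibar to Zambia and Zaire, "
            "ozone zones make zebras run zany zigzags")

# Count letters only, ignoring case, spaces and punctuation.
counts = Counter(ch for ch in SENTENCE.lower() if ch.isalpha())

for letter, n in counts.most_common(5):
    print(letter, n)
```

Here z appears ten times, twice as often as e, so a cryptanalyst who assumed that the commonest symbol stood for e would be led astray at the very first step.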

Having described the first tool of cryptanalysis, I shall continue by giving an example of how frequency analysis is used to decipher a ciphertext. I have avoided littering the whole book with examples of cryptanalysis, but with frequency analysis I make an exception. This is partly because frequency analysis is not as difficult as it sounds, and partly because it is the primary cryptanalytical tool. Furthermore, the example that follows provides insight into the method of the cryptanalyst. Although frequency analysis requires logical thinking, you will see that it also demands cunning, intuition, flexibility and guesswork.

CRYPTANALYZING A CIPHERTEXT

PCQ VMJYPD LBYK LYSO KBXBJXWXV BXV ZCJPO EYPD KBXBJYUXJ LBJOO KCPK. CP LBO LBCMKXPV XPV IYJKL PYDBL, QBOP KBO BXV OPVOV LBO LXRO CI SX’XJMI, KBO JCKO XPV EYKKOV LBO DJCMPV ZOICJO BYS, KXUYPD: “DJOXL EYPD, ICJ X LBCMKXPV XPV CPO PYDBLK Y BXNO ZOOP JOACMPLYPD LC UCM LBO IXZROK CI FXKL XDOK XPV LBO RODOPVK CI XPAYOPL EYPDK. SXU Y SXEO KC ZCRV XK LC AJXNO X IXNCMJ CI UCMJ SXGOKLU?”

OFYRCDMO, LXROK IJCS LBO LBCMKXPV XPV CPO PYDBLK

Imagine that we have intercepted this scrambled message. The challenge is to decipher it. We know that the text is in English, and that it has been scrambled according to a monoalphabetic substitution cipher, but we have no idea of the key. Searching all possible keys is impractical, so we must apply frequency analysis. What follows is a step-by-step guide to cryptanalyzing the ciphertext, but if you feel confident, then you might prefer to ignore this and attempt your own independent cryptanalysis.

The immediate reaction of any cryptanalyst upon seeing such a ciphertext is to analyze the frequency of all the letters, which results in Table 2. Not surprisingly, the letters vary in their frequency. The question is, can we identify what any of them represent, based on their frequencies? The ciphertext is relatively short, so we cannot rely wholly on frequency analysis. It would be naive to assume that the commonest letter in the ciphertext, O, represents the commonest letter in English, e, or that the eighth most frequent letter in the ciphertext, Y, represents the eighth most frequent letter in English, h. An unquestioning application of frequency analysis would lead to gibberish. For example, the first word, PCQ, would be deciphered as aov.
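The count itself is mechanical, and readers who would rather not tally letters by hand might use something like the following sketch. It assumes the ciphertext exactly as printed above, including the attribution line, and counts only the capital letters A to Z.

```python
from collections import Counter
import string

CIPHERTEXT = (
    "PCQ VMJYPD LBYK LYSO KBXBJXWXV BXV ZCJPO EYPD KBXBJYUXJ LBJOO KCPK. "
    "CP LBO LBCMKXPV XPV IYJKL PYDBL, QBOP KBO BXV OPVOV LBO LXRO CI SX’XJMI, "
    "KBO JCKO XPV EYKKOV LBO DJCMPV ZOICJO BYS, KXUYPD: “DJOXL EYPD, ICJ X "
    "LBCMKXPV XPV CPO PYDBLK Y BXNO ZOOP JOACMPLYPD LC UCM LBO IXZROK CI FXKL "
    "XDOK XPV LBO RODOPVK CI XPAYOPL EYPDK. SXU Y SXEO KC ZCRV XK LC AJXNO X "
    "IXNCMJ CI UCMJ SXGOKLU?” "
    "OFYRCDMO, LXROK IJCS LBO LBCMKXPV XPV CPO PYDBLK"
)

# Tally the cipher letters, skipping spaces and punctuation.
counts = Counter(ch for ch in CIPHERTEXT if ch in string.ascii_uppercase)

for letter, n in counts.most_common():
    print(letter, n)

# Letters of the alphabet that never occur in the ciphertext.
unused = [l for l in string.ascii_uppercase if l not in counts]
print("unused:", unused)
```

Run on this message, the tally puts O first with 38 occurrences, followed by X with 34 and P with 31, while H and T do not occur at all.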

Table 2 Frequency analysis of enciphered message.

However, we can begin by focusing attention on the only three letters that appear more than thirty times in the ciphertext, namely O, X and P. Let us assume that the commonest letters in the ciphertext probably represent the commonest letters in the English alphabet, but not necessarily in the right order. In other words, we cannot be sure that O = e, X = t and P = a, but we can make the tentative assumption that

O = e, t or a
X = e, t or a
P = e, t or a

In order to proceed with confidence and pin down the identity of the three most common letters, O, X and P, we need a more subtle form of frequency analysis. Instead of simply counting the frequency of the three letters, we can focus on how often they appear next to all the other letters. For example, does the letter O appear before or after several other letters, or does it tend to neighbour just a few special letters? Answering this question will be a good indication of whether O represents a vowel or a consonant. If O represents a vowel, it should appear before and after most of the other letters, whereas if it represents a consonant, it will tend to avoid many of the other letters. For example, the vowel e can appear before and after virtually every other letter, but the consonant t is rarely seen before or after b, d, g, j, k, m, q or v.

The table below takes the three most common letters in the ciphertext, O, X and P, and lists how frequently each appears before or after every letter. For example, O appears before A on one occasion but never appears immediately after it, giving a total of one in the first box. The letter O neighbours the majority of letters, and there are only seven that it avoids completely, represented by the seven zeroes in the O row. The letter X is equally sociable, because it too neighbours most of the letters and avoids only eight of them. However, the letter P is much less friendly. It tends to lurk around just a few letters and avoids fifteen of them. This evidence suggests that O and X represent vowels, while P represents a consonant.
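The sociability of a letter can be measured directly. The sketch below is one possible reading of the procedure: it treats two letters as neighbours only when they sit side by side within a word (spaces, punctuation and the apostrophe all break the chain), and it counts how many of the twenty-six letters each of O, X and P never touches. The helper name `neighbours` is invented for illustration.

```python
import re

CIPHERTEXT = (
    "PCQ VMJYPD LBYK LYSO KBXBJXWXV BXV ZCJPO EYPD KBXBJYUXJ LBJOO KCPK. "
    "CP LBO LBCMKXPV XPV IYJKL PYDBL, QBOP KBO BXV OPVOV LBO LXRO CI SX’XJMI, "
    "KBO JCKO XPV EYKKOV LBO DJCMPV ZOICJO BYS, KXUYPD: “DJOXL EYPD, ICJ X "
    "LBCMKXPV XPV CPO PYDBLK Y BXNO ZOOP JOACMPLYPD LC UCM LBO IXZROK CI FXKL "
    "XDOK XPV LBO RODOPVK CI XPAYOPL EYPDK. SXU Y SXEO KC ZCRV XK LC AJXNO X "
    "IXNCMJ CI UCMJ SXGOKLU?” "
    "OFYRCDMO, LXROK IJCS LBO LBCMKXPV XPV CPO PYDBLK"
)

# Unbroken runs of capital letters; anything else separates neighbours.
words = re.findall(r"[A-Z]+", CIPHERTEXT)


def neighbours(letter):
    """Set of letters that appear immediately before or after `letter`."""
    found = set()
    for word in words:
        for left, right in zip(word, word[1:]):
            if left == letter:
                found.add(right)
            if right == letter:
                found.add(left)
    return found


# Number of letters (out of 26) that each candidate never stands next to.
avoided = {letter: 26 - len(neighbours(letter)) for letter in "OXP"}
print(avoided)
```

Under these assumptions the tallies come out as seven letters avoided by O, eight by X and fifteen by P, in agreement with the figures quoted above.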

Now we must ask ourselves which vowels are represented by O and X. They are probably e and a, the two most popular vowels in the English language, but does O = e and X = a, or does O = a and X = e? An interesting feature in the ciphertext is that the combination OO appears twice, whereas XX does not appear at all. Since the letters ee appear far more often than aa in plaintext English, it is likely that O = e and X = a.
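The doubled letters are easy to confirm by machine. This sketch counts the pairs OO and XX in the ciphertext as printed; note that the apostrophe in SX’XJMI keeps its two X’s apart, so they are not counted as a pair.

```python
CIPHERTEXT = (
    "PCQ VMJYPD LBYK LYSO KBXBJXWXV BXV ZCJPO EYPD KBXBJYUXJ LBJOO KCPK. "
    "CP LBO LBCMKXPV XPV IYJKL PYDBL, QBOP KBO BXV OPVOV LBO LXRO CI SX’XJMI, "
    "KBO JCKO XPV EYKKOV LBO DJCMPV ZOICJO BYS, KXUYPD: “DJOXL EYPD, ICJ X "
    "LBCMKXPV XPV CPO PYDBLK Y BXNO ZOOP JOACMPLYPD LC UCM LBO IXZROK CI FXKL "
    "XDOK XPV LBO RODOPVK CI XPAYOPL EYPDK. SXU Y SXEO KC ZCRV XK LC AJXNO X "
    "IXNCMJ CI UCMJ SXGOKLU?” "
    "OFYRCDMO, LXROK IJCS LBO LBCMKXPV XPV CPO PYDBLK"
)

# Count each doubled letter as a literal two-character substring.
doubles = {pair: CIPHERTEXT.count(pair) for pair in ("OO", "XX")}
print(doubles)
```

The two occurrences of OO are in LBJOO and ZOOP.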

At this point, we have confidently identified two of the letters in the ciphertext. Our conclusion that X = a is supported by the fact that X appears on its own in the ciphertext, and a is one of only two English words that consist of a single letter. The only other letter that appears on its own in the ciphertext is Y, and it seems highly likely that this represents the only other one-letter English word, which is i. Focusing on words with only one letter is a standard cryptanalytic trick, and I have included it among a list of cryptanalytic tips in Appendix B. This particular trick works only because this ciphertext still has spaces between the words. Often, a cryptographer will remove all the spaces to make it harder for an enemy interceptor to unscramble the message.

Although we have spaces between words, the following trick would also work where the ciphertext has been merged into a single string of characters. The trick allows us to spot the letter h once we have already identified the letter e. In the English language, the letter h frequently goes before the letter e (as in the, then, they, etc.), but rarely after e. The table below shows how frequently the O, which we think represents e, goes before and after all the other letters in the ciphertext. The table suggests that B represents h, because it appears before O on nine occasions but never goes after it. No other letter in the table has such an asymmetric relationship with O.
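The before-and-after tally for O might be sketched as follows. As before, letters are paired only within words, and the two counters `before` and `after` are names chosen for this illustration.

```python
import re
from collections import Counter

CIPHERTEXT = (
    "PCQ VMJYPD LBYK LYSO KBXBJXWXV BXV ZCJPO EYPD KBXBJYUXJ LBJOO KCPK. "
    "CP LBO LBCMKXPV XPV IYJKL PYDBL, QBOP KBO BXV OPVOV LBO LXRO CI SX’XJMI, "
    "KBO JCKO XPV EYKKOV LBO DJCMPV ZOICJO BYS, KXUYPD: “DJOXL EYPD, ICJ X "
    "LBCMKXPV XPV CPO PYDBLK Y BXNO ZOOP JOACMPLYPD LC UCM LBO IXZROK CI FXKL "
    "XDOK XPV LBO RODOPVK CI XPAYOPL EYPDK. SXU Y SXEO KC ZCRV XK LC AJXNO X "
    "IXNCMJ CI UCMJ SXGOKLU?” "
    "OFYRCDMO, LXROK IJCS LBO LBCMKXPV XPV CPO PYDBLK"
)

before = Counter()  # how often each letter stands immediately before O
after = Counter()   # how often each letter stands immediately after O

for word in re.findall(r"[A-Z]+", CIPHERTEXT):
    for left, right in zip(word, word[1:]):
        if right == "O":
            before[left] += 1
        if left == "O":
            after[right] += 1

# Print the letters in order of how lopsided their relationship with O is.
for letter in sorted(set(before) | set(after),
                     key=lambda l: before[l] - after[l], reverse=True):
    print(letter, "before O:", before[letter], "after O:", after[letter])
```

Sorting by the difference between the two counts brings the most lopsided letter to the top of the list, which is where B appears: nine times before O and never after it.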

Each letter in the English language has its own unique personality, which includes its frequency and its relation to other letters. It is this personality that allows us to establish the true identity of a letter, even when it has been disguised by monoalphabetic substitution.

We have now confidently established four letters, O = e, X = a, Y = i and B = h, and we can begin to replace some of the letters in the ciphertext with their plaintext equivalents. I shall stick to the convention of keeping ciphertext letters in uppercase, while putting plaintext letters in lowercase. This will help to distinguish between those letters we still have to identify and those that have already been established.

PCQ VMJiPD LhiK LiSe KhahJaWaV haV ZCJPe EiPD KhahJiUaJ LhJee KCPK. CP Lhe LhCMKaPV aPV IiJKL PiDhL, QheP Khe haV ePVeV Lhe LaRe CI Sa’aJMI, Khe JCKe aPV EiKKeV Lhe DJCMPV ZeICJe hiS, KaUiPD: “DJeaL EiPD, ICJ a LhCMKaPV aPV CPe PiDhLK i haNe ZeeP JeACMPLiPD LC UCM Lhe IaZReK CI FaKL aDeK aPV Lhe ReDePVK CI aPAiePL EiPDK. SaU i SaEe KC ZCRV aK LC AJaNe a IaNCMJ CI UCMJ SaGeKLU?”

eFiRCDMe, LaReK IJCS Lhe LhCMKaPV aPV CPe PiDhLK
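The uppercase and lowercase convention lends itself to a very small program. In this sketch the dictionary `KNOWN` holds the four substitutions established so far, and every other character is left untouched; the function name `partial_decipher` is invented for illustration.

```python
# Cipher letter -> plaintext letter, for the letters identified so far.
KNOWN = {"O": "e", "X": "a", "Y": "i", "B": "h"}


def partial_decipher(ciphertext, known):
    """Replace identified cipher letters with lowercase plaintext letters."""
    return ciphertext.translate(str.maketrans(known))


first_sentence = ("PCQ VMJYPD LBYK LYSO KBXBJXWXV BXV ZCJPO EYPD "
                  "KBXBJYUXJ LBJOO KCPK.")
print(partial_decipher(first_sentence, KNOWN))
```

Adding further entries to `KNOWN` as each new letter is identified reproduces the successive stages of the decipherment.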

This simple step helps us to identify several other letters, because we can guess some of the words in the ciphertext. For example, the most common three-letter words in English are the and and, and these are relatively easy to spot – Lhe, which appears six times, and aPV, which appears five times. Hence, L probably represents t, P probably represents n and V probably represents d. We can now replace these letters in the ciphertext with their true values:

nCQ dMJinD thiK tiSe KhahJaWad had ZCJne EinD KhahJiUaJ thJee KCnK. Cn the thCMKand and IiJKt niDht, Qhen Khe had ended the taRe CI Sa’aJMI, Khe JCKe and EiKKed the DJCMnd ZeICJe hiS, KaUinD: “DJeat EinD, ICJ a thCMKand and Cne niDhtK i haNe Zeen JeACMntinD tC UCM the IaZReK CI FaKt aDeK and the ReDendK CI anAient EinDK. SaU i SaEe KC ZCRd aK tC AJaNe a IaNCMJ CI UCMJ SaGeKtU?”

eFiRCDMe, taReK IJCS the thCMKand and Cne niDhtK

Once a few letters have been established, cryptanalysis progresses very rapidly. For example, the word at the beginning of the second sentence is Cn. Every word has a vowel in it, so C must be a vowel. There are only two vowels that remain to be identified, u and o; u does not fit, so C must represent o. We also have the word Khe, which implies that K represents either t or s. But we already know that L = t, so it becomes clear that K = s. Having identified these two letters, we insert them into the ciphertext, and there appears the phrase thoMsand and one niDhts. A sensible guess for this would be thousand and one nights, and it seems likely that the final line is telling us that this is a passage from Tales from the Thousand and One Nights. This implies that M = u, I = f, J = r, D = g, R = l and S = m.

We could continue trying to establish other letters by guessing other words, but instead let us have a look at what we know about the plain alphabet and cipher alphabet. These two alphabets form the key, and they were used by the cryptographer to perform the substitution that scrambled the message. Already, by identifying the true values of letters in the ciphertext, we have effectively been working out the details of the cipher alphabet. A summary of our achievements, so far, is given in the plain and cipher alphabets below.
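The state of the key at this stage can be laid out from the fifteen substitutions established in the text so far. The sketch below prints the plain alphabet above the cipher alphabet, with a dash for each cipher letter that is still unknown.

```python
import string

# Cipher letter -> plaintext letter, as established step by step above.
KNOWN = {
    "O": "e", "X": "a", "Y": "i", "B": "h",   # frequency and neighbour analysis
    "L": "t", "P": "n", "V": "d",             # from "the" and "and"
    "C": "o", "K": "s",                       # from "Cn" and "Khe"
    "M": "u", "I": "f", "J": "r",             # from the attribution line
    "D": "g", "R": "l", "S": "m",
}

# Invert the mapping so it can be read in plain-alphabet order.
plain_to_cipher = {plain: cipher for cipher, plain in KNOWN.items()}

cipher_row = "".join(plain_to_cipher.get(p, "-") for p in string.ascii_lowercase)

print(string.ascii_lowercase)
print(cipher_row)
```

The second line printed is X--VOIDBY--RSPC--JKLM-----, in which the run of letters beneath d to i stands out at once.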

By examining the partial cipher alphabet, we can complete the cryptanalysis. The sequence VOIDBY in the cipher alphabet suggests that the cryptographer has chosen a keyphrase as the basis for the key. Some guesswork is enough to suggest the keyphrase might be A VOID BY GEORGES PEREC, which is reduced to AVOIDBYGERSPC after removing spaces and repetitions. Thereafter, the letters continue in alphabetical order, omitting any that have already appeared in the keyphrase. In this particular case, the cryptographer took the unusual step of not starting the keyphrase at the beginning of the cipher alphabet, but rather starting it three letters in. This is possibly because the keyphrase begins with the letter A, and the cryptographer wanted to avoid encrypting a as A. At last, having established the complete cipher alphabet, we can unscramble the entire ciphertext, and the cryptanalysis is complete.
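The construction of the key from the keyphrase can be sketched as follows. The function names are invented for illustration, and the offset of two places, which puts the first letter of the keyphrase beneath the plain letter c, is inferred from the substitutions already established (a = X and d = V) rather than stated outright in the text.

```python
import string


def build_cipher_alphabet(keyphrase, offset):
    """Keyphrase letters (repeats dropped), then the unused letters in
    alphabetical order, the whole sequence shifted `offset` places along."""
    sequence = []
    for ch in keyphrase.upper():
        if ch in string.ascii_uppercase and ch not in sequence:
            sequence.append(ch)
    sequence += [ch for ch in string.ascii_uppercase if ch not in sequence]
    k = offset % 26
    # Rotate so that sequence[0] sits under the plain letter at index k.
    rotated = sequence[26 - k:] + sequence[:26 - k]
    return "".join(rotated)


def decipher(ciphertext, cipher_alphabet):
    """Map each cipher letter back to its lowercase plaintext letter."""
    table = str.maketrans(cipher_alphabet, string.ascii_lowercase)
    return ciphertext.translate(table)


CIPHER_ALPHABET = build_cipher_alphabet("A VOID BY GEORGES PEREC", offset=2)
print(string.ascii_lowercase)
print(CIPHER_ALPHABET)

first_sentence = ("PCQ VMJYPD LBYK LYSO KBXBJXWXV BXV ZCJPO EYPD "
                  "KBXBJYUXJ LBJOO KCPK.")
print(decipher(first_sentence, CIPHER_ALPHABET))
```

The cipher alphabet produced is XZAVOIDBYGERSPCFHJKLMNQTUW, and applying it to the opening sentence yields the plaintext in lowercase.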

Now during this time Shahrazad had borne King Shahriyar three sons. On the thousand and first night, when she had ended the tale of Ma’aruf, she rose and kissed the ground before him, saying: “Great King, for a thousand and one nights I have been recounting to you the fables of past ages and the legends of ancient kings. May I make so bold as to crave a favour of your majesty?”

Epilogue, Tales from the Thousand and One Nights

RENAISSANCE IN THE WEST

Between AD 800 and 1200 Arab scholars enjoyed a vigorous period of intellectual achievement. At the same time, Europe was firmly stuck in the Dark Ages. While al-Kindī was describing the invention of cryptanalysis, Europeans were still struggling with the basics of cryptography. The only European institutions to encourage the study of secret writing were the monasteries, where monks would study the Bible in search of hidden meanings, a fascination that has persisted through to modern times (see Appendix C).

By the fifteenth century, however, European cryptography was a growing industry. The revival in the arts, sciences and scholarship during the Renaissance nurtured the capacity for cryptography, while an explosion in political intrigue offered ample motivation for secret communication. Italy, in particular, provided the ideal environment for cryptography. As well as being at the heart of the Renaissance, it consisted of independent city-states, each trying to outsmart the others. Diplomacy flourished, and each state would send ambassadors to the courts of the others. Each ambassador received messages from his respective head of state, describing details of the foreign policy he was to implement. In response, each ambassador would send back any information that he had gathered. Clearly there was a great incentive to encrypt communications in both directions, so each state established a cipher office, and each ambassador had a cipher secretary.

At the same time that cryptography was becoming a routine diplomatic tool, the science of cryptanalysis was beginning to emerge in the West. Diplomats had only just familiarized themselves with the skills required to establish secure communications, and already there were individuals attempting to destroy this security. It is quite probable that cryptanalysis was independently discovered in Europe, but there is also the possibility that it was introduced from the Arab world. Islamic discoveries in science and mathematics strongly influenced the rebirth of science in Europe, and cryptanalysis might have been among the imported knowledge.

Arguably the first great European cryptanalyst was Giovanni Soro, appointed as Venetian cipher secretary in 1506. Soro’s reputation was known throughout Italy, and friendly states would send intercepted messages to Venice for cryptanalysis. Even the Vatican, probably the second most active centre of cryptanalysis, would send Soro seemingly impenetrable messages that had fallen into its hands.

This was a period of transition, with cryptographers still relying on the monoalphabetic substitution cipher, while cryptanalysts were beginning to use frequency analysis to break it. Those yet to discover the power of frequency analysis continued to trust monoalphabetic substitution, ignorant of the extent to which cryptanalysts such as Soro were able to read their messages.

Meanwhile, countries that were alert to the weakness of the straightforward monoalphabetic substitution cipher were anxious to develop a better cipher, something that would protect their own nation’s messages from being unscrambled by enemy cryptanalysts. One of the simplest improvements to the security of the monoalphabetic substitution cipher was the introduction of nulls, symbols or letters that were not substitutes for actual letters, merely blanks that represented nothing. For example, one could substitute each plain letter with a number between 1 and 99, which would leave seventy-three numbers that represent nothing, and these could be randomly sprinkled throughout the ciphertext with varying frequencies. The nulls would pose no problem to the intended recipient, who would know that they were to be ignored. However, the nulls would baffle an enemy interceptor because they would confuse an attack by frequency analysis.
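The effect of nulls can be illustrated with a toy scheme along the lines just described. Everything specific here is an assumption made for the sake of the sketch: the random assignment of numbers to letters, the fixed seed, the `null_rate` parameter and the function names are not taken from any historical cipher.

```python
import random
import string

rng = random.Random(0)  # fixed seed, only so that the sketch is repeatable

# Assign 26 of the numbers 1..99 to the letters; the other 73 are nulls.
numbers = list(range(1, 100))
rng.shuffle(numbers)
LETTER_TO_NUMBER = dict(zip(string.ascii_lowercase, numbers[:26]))
NUMBER_TO_LETTER = {n: letter for letter, n in LETTER_TO_NUMBER.items()}
NULLS = numbers[26:]


def encipher(message, null_rate=0.3):
    """Replace each letter by its number, sprinkling nulls in at random."""
    out = []
    for ch in message.lower():
        if ch not in LETTER_TO_NUMBER:
            continue  # spaces and punctuation are dropped
        while rng.random() < null_rate:
            out.append(rng.choice(NULLS))
        out.append(LETTER_TO_NUMBER[ch])
    return out


def decipher(cipher_numbers):
    """The intended recipient simply ignores every null."""
    return "".join(NUMBER_TO_LETTER[n] for n in cipher_numbers
                   if n in NUMBER_TO_LETTER)


secret = encipher("attack at dawn")
print(secret)
print(decipher(secret))
```

The recipient recovers the message without effort, whereas an interceptor counting symbols must first work out which of the ninety-nine numbers carry no meaning.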

Another attempt to strengthen the monoalphabetic substitution cipher involved the introduction of codewords. The term code has a very broad meaning in everyday language, and it is often used to describe any method for communicating in secret. However, it actually has a very specific meaning, and applies only to a certain form of substitution. So far we have concentrated on the idea of a substitution cipher, whereby each letter is replaced by a different letter, number or symbol. However, it is also possible to have substitution at a much higher level, whereby each word is represented by another word or symbol – this would be a code. For example,

Using this very limited set of coded words, we can encode a simple message as follows:

Technically, a code is defined as substitution at the level of words or phrases, whereas a cipher is defined as substitution at the level of letters. Hence the term encipher means to scramble a message using a cipher, while encode means to scramble a message using a code. Similarly, the term decipher applies to unscrambling an enciphered message, and decode to unscrambling an encoded message. The terms encrypt and decrypt are more general, and cover scrambling and unscrambling with respect to both codes and ciphers. Figure 6 presents a brief summary of these definitions. In general, I shall keep to these definitions, but when the sense is clear, I might use a term such as codebreaking to describe a process that is really cipher breaking – the latter phrase might be technically accurate, but the former phrase is widely accepted.

At first sight, codes seem to offer more security than ciphers, because words are much less vulnerable to frequency analysis than letters. To decipher a monoalphabetic cipher you need only identify the true value of each of the twenty-six characters, whereas to decipher a code you need to identify the true value of hundreds or even thousands of codewords. However, if we examine codes in more detail, we see that they suffer from two major practical failings when compared with ciphers. First, once the sender and receiver have agreed upon the twenty-six letters in the cipher alphabet (the key), they can encipher any message, but to achieve the same level of flexibility using a code they would need to go through the painstaking task of defining a codeword for every one of the thousands of possible plaintext words. The codebook would consist of hundreds of pages, and would look something like a dictionary. In other words, compiling a codebook is a major task, and carrying it around is a major inconvenience.

Second, the consequences of having a codebook captured by the enemy are devastating. Immediately, all the encoded communications would become transparent to the enemy. The senders and receivers would have to go through the process of having to compile an entirely new codebook, and then this hefty new book would have to be distributed to everyone in the communications network, which might mean securely transporting it to every ambassador in every state. In comparison, if the enemy succeeds in capturing a cipher key, then it is relatively easy to compile a new cipher alphabet of twenty-six letters, which can be memorized and easily distributed.

Even in the sixteenth century, cryptographers appreciated the inherent weaknesses of codes and instead relied largely on ciphers, or sometimes nomenclators. A nomenclator is a system of encryption that relies on a cipher alphabet, which is used to encrypt the majority of a message, and a limited list of codewords. For example, a nomenclator book might consist of a front page containing the cipher alphabet, and then a second page containing a list of codewords. Despite the addition of codewords, a nomenclator is not much more secure than a straightforward cipher, because the bulk of a message can be deciphered using frequency analysis, and the remaining encoded words can be guessed from the context.
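A nomenclator can be sketched as a cipher alphabet combined with a short codeword list. In the illustration below the cipher alphabet is borrowed from the worked example earlier in the chapter, while the two codewords and their numbers are purely hypothetical.

```python
import string

# Cipher alphabet from the worked example (plain a..z -> cipher letters).
CIPHER_ALPHABET = "XZAVOIDBYGERSPCFHJKLMNQTUW"

# Hypothetical codeword list: whole words replaced by a single symbol.
CODEWORDS = {"king": "27", "venice": "28"}

TABLE = str.maketrans(string.ascii_lowercase, CIPHER_ALPHABET)


def encrypt(message):
    """Encode words found in the codeword list; encipher all the others."""
    out = []
    for word in message.lower().split():
        out.append(CODEWORDS.get(word, word.translate(TABLE)))
    return " ".join(out)


print(encrypt("the king rose"))
```

The output is LBO 27 JCKO. Only one word is hidden by the code, and since the enciphered remainder yields to frequency analysis, the meaning of 27 can soon be guessed from its surroundings.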

Figure 6 The science of secret writing and its main branches.

As well as coping with the introduction of the nomenclator, the best cryptanalysts were also capable of dealing with the presence of nulls. In short, they were able to break the majority of encrypted messages. Their skills provided a steady flow of uncovered secrets, which influenced the decisions of their masters and mistresses, thereby affecting Europe’s history at critical moments.

Nowhere is the impact of cryptanalysis more dramatically illustrated than in the case of Mary Queen of Scots. The outcome of her trial depended wholly on the battle between her codemakers and Queen Elizabeth’s codebreakers. Mary was one of the most significant figures of the sixteenth century – queen of Scotland, queen of France, pretender to the English throne – yet her fate would be decided by a slip of paper, the message it bore, and whether or not that message could be deciphered.
