This document is a proposal for the UTF-DNA standard, the Unicode Transformation Format for DNA, a compact representation of text in a theoretical DNA sequence. It draws heavily, though not exactly, from UTF-8 which has become the overwhelmingly predominant text encoding standard on the web and in computing in general.
DNA is a molecule found in every living organism which defines the instructions for its functioning. DNA is composed of a sequence of organic molecules called nucleotides where each nucleotide can be one of four molecules, Adenine, Cytosine, Guanine and Thymine. These molecules are commonly given the abbreviations A, C, G and T.
If each of the four nucleotides is considered to be a digit, there would be four digits (0 to 3) and DNA could effectively be considered a quaternary (base 4) representation. For the purposes of this standard, each nucleotide corresponds to a digit as follows:

- A = 0
- C = 1
- G = 2
- T = 3
For our purposes the DNA sequence is theoretical: it is simply a string of digits drawn from A, C, G and T, and we consider only the mathematics without reference to the underlying biochemistry.
Modern computing is based on the binary (base 2) system where each bit (binary digit) can be either 0 or 1. Bits are grouped into bytes, where a byte almost exclusively refers to eight bits. For positive values, a byte can represent numbers 0 to 255. So how long should a byte be in a DNA sequence? Mathematically, four quaternary nucleotides map exactly to eight bits, so we would have a direct mapping from any text encoding that uses 8-bit bytes, such as UTF-8, to a nucleotide equivalent. However, nature has its own answer: DNA works on a grouping of three nucleotides, where each unique triplet (also known as a codon) encodes one of 20 or so “amino acids”. Amino acids are yet more molecules which are the building blocks of proteins. For the purposes of this document we will follow nature’s cue, and a byte will therefore be three quaternary digits. Three quaternary digits can represent numbers from 0 to 63, the equivalent of six binary digits, so UTF-DNA will have 6-bit bytes.
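As a quick sanity check, here is a throwaway Python sketch (names of mine, purely illustrative) using the digit assignment above, showing that one codon covers exactly the range of a 6-bit byte:

```python
DIGITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def codon_value(codon: str) -> int:
    """Interpret a three-nucleotide codon as a base 4 number."""
    value = 0
    for nucleotide in codon:
        value = value * 4 + DIGITS[nucleotide]
    return value

print(codon_value("AAA"), codon_value("TTT"))  # 0 63 -- i.e. 64 values, or 6 bits
```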
There are many thousands of unique characters across a multitude of languages and non-languages (think arrows, emojis, chess pieces etc.) that require encoding. The Unicode Standard defines a unique number, called a code point, for each character. At the time of writing, Unicode 16.0 defines 154,998 characters, though the standard can accommodate 1,114,112 code points (U+0000 to U+10FFFF). The purpose of a text encoding is to describe a method of transforming each character into one or more bytes. A single byte, whether 6-bit or 8-bit, cannot come close to representing every Unicode code point, so text encodings for Unicode must be able to use multiple bytes per character.
UTF-8 defines a so-called “variable-width” encoding, where the number of bytes used to represent a character varies depending on the character; in the case of UTF-8, one, two, three or four bytes are used. The uppermost one or more bits of each 8-bit byte are hijacked and used as “control bits” which determine what that byte means. The rest of the bits are “data bits” which actually define the code point. Here is a template 2-byte representation:
          Byte 1            Byte 2
      ______|______     ______|______
     |             |   |             |
     1 1 0 x x x y y   1 0 y y z z z z
     |___| |_______|   |_| |_________|
       |       |        |       |
    Control  Data    Control  Data
     bits    bits     bits    bits
In this example:

- The control bits 110 signify that this is a 2-byte representation. In the single byte representation the control bit is 0, for two bytes it is 110, for three bytes 1110, and for four bytes 11110.
- The control bits 10 are “continuation bits”, signifying that this byte is the 2nd or higher byte of a multi-byte representation.
- The data bits xxxyyyyzzzz define the code point itself.

Unicode code points take values from U+0000 to U+10FFFF, where the number after U+ is in hexadecimal (base 16) representation. Here is the full scheme:
Bytes represented in binary
First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|
U+0000 | U+007F | 0yyyzzzz | |||
U+0080 | U+07FF | 110xxxyy | 10yyzzzz | ||
U+0800 | U+FFFF | 1110wwww | 10xxxxyy | 10yyzzzz | |
U+010000 | U+10FFFF | 11110uvv | 10vvwwww | 10xxxxyy | 10yyzzzz |
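As a quick check of this layout, Python’s built-in UTF-8 codec (used here purely for illustration) shows the two-byte pattern for é (U+E9):

```python
# é is U+E9, which falls in the two-byte range U+0080 to U+07FF.
encoded = "é".encode("utf-8")
print([format(byte, "08b") for byte in encoded])
# ['11000011', '10101001'] -- control bits 110 and 10,
# data bits 00011 + 101001 = 000 1110 1001 = 0xE9
```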
Let’s try DNA encoding in the same manner as UTF-8, but with 6-bit bytes instead of 8-bit bytes:
Bytes represented in binary
First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 |
---|---|---|---|---|---|---|
U+0000 | U+001F | 0yzzzz | ||||
U+0020 | U+007F | 110yyy | 10zzzz | |||
U+0080 | U+03FF | 1110xx | 10yyyy | 10zzzz | ||
U+0400 | U+1FFF | 11110w | 10xxxx | 10yyyy | 10zzzz | |
U+2000 | U+FFFF | 111110 | 10wwww | 10xxxx | 10yyyy | 10zzzz |
U+010000 | U+10FFFF | not representable | | | | |
This doesn’t work. With a 6-bit encoding we can only get as far as five bytes before we run out of space in the initial control byte. The highest code point representable is only U+FFFF, which falls far short of the highest Unicode code point of U+10FFFF.
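To double-check that claim, here is a quick tally (an illustrative Python snippet, not part of the proposal) of the data bits available in each row of the table above:

```python
# The first byte contributes 5, 3, 2, 1 or 0 data bits depending on the row;
# every continuation byte adds 4 more.
for n_bytes, first_byte_bits in [(1, 5), (2, 3), (3, 2), (4, 1), (5, 0)]:
    data_bits = first_byte_bits + 4 * (n_bytes - 1)
    print(n_bytes, "bytes ->", data_bits, "data bits, max U+%X" % (2 ** data_bits - 1))
# The last line prints: 5 bytes -> 16 data bits, max U+FFFF
```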
Can we tweak our 6-bit encoding scheme to allow us to reach U+10FFFF?
Notice how for 2-byte representations, three control bits are used up in the first byte, and for every extra byte added, yet another control bit is consumed. As we’re working in binary, surely each control bit could be used to signal one of two possibilities rather than just one extra byte? There is a good reason why UTF-8 uses just one control bit for the single byte representation, and it is one of UTF-8’s most important properties. UTF-8 was designed so that code points U+0000 to U+007F, the characters of the longstanding ASCII standard of mostly Latin characters which requires 7 bits, fit in just a single byte. So for backward compatibility, valid ASCII is also valid UTF-8. Also, a significant number of the world’s languages, and likely the majority of text encoded and stored digitally, will have a high proportion of ASCII characters. UTF-8 therefore minimises the number of bytes used for most text, which markedly reduces storage and bandwidth requirements. This comes at the expense of using up control bits more greedily.
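That backward compatibility is easy to demonstrate (a trivial Python check, for illustration only):

```python
# ASCII-only text produces exactly the same bytes under ASCII and UTF-8.
text = "A plain ASCII sentence."
assert text.encode("ascii") == text.encode("utf-8")
print(list(text.encode("utf-8"))[:4])  # [65, 32, 112, 108] -- the raw ASCII values
```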
With 6-bit bytes, we have no choice but to use a second byte to represent the higher end of ASCII’s 7-bit range. In fact, considering that the upper case letter A has code point U+0041, and all other upper and lower case letters have higher code points, the majority of ASCII text will require two bytes. With this knowledge, we can craft a scheme which forces the use of two bytes for these earlier code points, allowing us to use fewer control bits as we increase the number of bytes.
Here’s one option:
Bytes represented in binary
First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
---|---|---|---|---|---|---|---|
U+0000 | U+000F | 00zzzz | |||||
U+0010 | U+00FF | 01yyyy | 10zzzz | ||||
U+0100 | U+03FF | 1100xx | 10yyyy | 10zzzz | |||
U+0400 | U+3FFF | 1101ww | 10xxxx | 10yyyy | 10zzzz | ||
U+4000 | U+3FFFF | 1110vv | 10wwww | 10xxxx | 10yyyy | 10zzzz | |
U+40000 | U+3FFFFF | 1111uu | 10vvvv | 10wwww | 10xxxx | 10yyyy | 10zzzz |
The tweaks here are:

- Knowing that we only need to reach U+10FFFF, we can use just two extra control bits for the 3, 4, 5 and 6 byte representations, at the cost of precluding a representation for a 7th byte, should we need it in the future. But we already have headroom insofar as our maximum code point is U+3FFFFF, which far exceeds UTF-8’s maximum of U+10FFFF.
- The resulting scheme for the control bits is reminiscent of Huffman encoding, where we use the control bits as efficiently as possible in the knowledge that we only need to encode up to six bytes and no more.
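The same data-bit tally as before (again just an illustrative snippet) confirms the capacity of each row in this revised scheme:

```python
# Byte 1 now carries 4, 4, 2, 2, 2 or 2 data bits; continuation bytes still carry 4 each.
for n_bytes, first_byte_bits in [(1, 4), (2, 4), (3, 2), (4, 2), (5, 2), (6, 2)]:
    data_bits = first_byte_bits + 4 * (n_bytes - 1)
    print(n_bytes, "bytes -> max U+%X" % (2 ** data_bits - 1))
# 6 bytes -> max U+3FFFFF, comfortably beyond Unicode's U+10FFFF
```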
By commandeering two bits at a time as control bits, we gain another valuable property. As mentioned earlier, two binary bits correspond to a single quaternary digit, so we can recreate this table in native quaternary:
Bytes represented in quaternary
First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
---|---|---|---|---|---|---|---|
U+0000 | U+000F | 0zz | |||||
U+0010 | U+00FF | 1yy | 2zz | ||||
U+0100 | U+03FF | 30x | 2yy | 2zz | |||
U+0400 | U+3FFF | 31w | 2xx | 2yy | 2zz | ||
U+4000 | U+3FFFF | 32v | 2ww | 2xx | 2yy | 2zz | |
U+40000 | U+3FFFFF | 33u | 2vv | 2ww | 2xx | 2yy | 2zz |
One final improvement. There is, in my mind, just a little more beauty if the first control digit for the first byte represents the number of extra bytes needed, and if it’s a 2, then the second control digit is the additional number of bytes needed. This would incidentally make the continuation digit 3:
Bytes represented in quaternary
First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
---|---|---|---|---|---|---|---|
U+0000 | U+000F | 0zz | |||||
U+0010 | U+00FF | 1yy | 3zz | ||||
U+0100 | U+03FF | 20x | 3yy | 3zz | |||
U+0400 | U+3FFF | 21w | 3xx | 3yy | 3zz | ||
U+4000 | U+3FFFF | 22v | 3ww | 3xx | 3yy | 3zz | |
U+40000 | U+3FFFFF | 23u | 3vv | 3ww | 3xx | 3yy | 3zz |
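One nice consequence, sketched below in Python (the helper function is hypothetical, purely to illustrate the property), is that a decoder can read the total byte count straight from the first byte’s control digits:

```python
def utf_dna_length(first_byte: str) -> int:
    """Total bytes in a character, read from the first byte's quaternary digits."""
    d0 = int(first_byte[0])
    # The first control digit is the number of extra bytes; when it is 2,
    # the second control digit adds to that count.
    extra = 2 + int(first_byte[1]) if d0 == 2 else d0
    return extra + 1

print([utf_dna_length(b) for b in ["0zz", "1yy", "20x", "21w", "22v", "23u"]])
# [1, 2, 3, 4, 5, 6]
```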
The final step is to translate our quaternary digits to their nucleotide equivalents. The resulting table is the final proposal for UTF-DNA:
Bytes represented in quaternary nucleotide digits
First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
---|---|---|---|---|---|---|---|
U+0000 | U+000F | Azz | |||||
U+0010 | U+00FF | Cyy | Tzz | ||||
U+0100 | U+03FF | GAx | Tyy | Tzz | |||
U+0400 | U+3FFF | GCw | Txx | Tyy | Tzz | ||
U+4000 | U+3FFFF | GGv | Tww | Txx | Tyy | Tzz | |
U+40000 | U+3FFFFF | GTu | Tvv | Tww | Txx | Tyy | Tzz |
Where the lower case letters u, v, w, x, y and z stand for the quaternary data digits of the code point, each of which is likewise translated to its nucleotide equivalent (A = 0, C = 1, G = 2, T = 3).
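To make the scheme concrete, here is a minimal Python sketch of an encoder following the table above (the function, names and structure are mine and purely illustrative, not part of the proposal):

```python
DIGIT_TO_NUCLEOTIDE = "ACGT"  # quaternary digits 0, 1, 2, 3

# (highest code point, control digits of byte 1, number of base-4 data digits)
RANGES = [
    (0x00000F, (0,),    2),   # 1 byte:  0zz
    (0x0000FF, (1,),    4),   # 2 bytes: 1yy 3zz
    (0x0003FF, (2, 0),  5),   # 3 bytes: 20x 3yy 3zz
    (0x003FFF, (2, 1),  7),   # 4 bytes: 21w 3xx 3yy 3zz
    (0x03FFFF, (2, 2),  9),   # 5 bytes: 22v 3ww 3xx 3yy 3zz
    (0x3FFFFF, (2, 3), 11),   # 6 bytes: 23u 3vv 3ww 3xx 3yy 3zz
]

def to_base4(value, num_digits):
    """Return value as a fixed-width list of base-4 digits, most significant first."""
    digits = []
    for _ in range(num_digits):
        digits.append(value % 4)
        value //= 4
    return digits[::-1]

def encode_utf_dna(text):
    """Encode text as a UTF-DNA nucleotide string."""
    out = []
    for char in text:
        cp = ord(char)
        for max_cp, prefix, num_digits in RANGES:
            if cp <= max_cp:
                break
        else:
            raise ValueError("code point U+%X exceeds U+3FFFFF" % cp)
        data = to_base4(cp, num_digits)
        # Byte 1: control digits plus as many data digits as fit into three digits.
        split = 3 - len(prefix)
        byte_digits = list(prefix) + data[:split]
        # Continuation bytes: the digit 3 followed by two data digits each.
        rest = data[split:]
        for i in range(0, len(rest), 2):
            byte_digits += [3, rest[i], rest[i + 1]]
        out.extend(DIGIT_TO_NUCLEOTIDE[d] for d in byte_digits)
    return "".join(out)

print(encode_utf_dna("A"))  # CCATAC -- 'A' (U+41) needs two bytes, as noted earlier
```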
Let us take the following text as an example:
À votre santé 😊
First convert each character to its Unicode code point:
À = U+C0
[space] = U+20
v = U+76
o = U+6F
t = U+74
r = U+72
e = U+65
[space] = U+20
s = U+73
a = U+61
n = U+6E
t = U+74
é = U+E9
[space] = U+20
😊 = U+1F60A
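These values can be checked with Python’s built-in ord (a quick illustrative loop):

```python
for char in "À votre santé 😊":
    print(char, "= U+%X" % ord(char))
```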
Next, encode each code point in quaternary using the scheme above:
À = 130 300
[space] = 102 300
v = 113 312
o = 112 333
t = 113 310
r = 113 302
e = 112 311
[space] = 102 300
s = 113 303
a = 112 301
n = 112 332
t = 113 310
é = 132 321
[space] = 102 300
😊 = 221 333 312 300 322
Then translate each digit to its equivalent nucleotide:
À = CTA TAA
[space] = CAG TAA
v = CCT TCG
o = CCG TTT
t = CCT TCA
r = CCT TAG
e = CCG TCC
[space] = CAG TAA
s = CCT TAT
a = CCG TAC
n = CCG TTG
t = CCT TCA
é = CTG TGC
[space] = CAG TAA
😊 = GGC TTT TCG TAA TGG
Finally we have our complete DNA sequence:
CTATAACAGTAACCTTCGCCGTTTCCTTCACCT
TAGCCGTCCCAGTAACCTTATCCGTACCCGTTG
CCTTCACTGTGCCAGTAAGGCTTTTCGTAATGG
Incidentally, in this example the UTF-8 representation is 160 bits long (20 8-bit bytes) and the UTF-DNA representation is 198 bits long (33 6-bit bytes).
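For completeness, here is a matching decoder sketch (again illustrative, assuming the same digit mapping) which recovers the original text from the sequence above:

```python
NUCLEOTIDE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def decode_utf_dna(sequence):
    """Decode a UTF-DNA nucleotide sequence back to text."""
    digits = [NUCLEOTIDE_TO_DIGIT[n] for n in sequence]
    text = []
    i = 0
    while i < len(digits):
        first = digits[i : i + 3]
        # Number of extra bytes, read from the control digits of the first byte.
        extra = (2 + first[1]) if first[0] == 2 else first[0]
        # Data digits in the first byte: two when the control prefix is a single
        # digit (0 or 1), one when it is two digits (first digit 2).
        data = first[1:] if first[0] < 2 else first[2:]
        # Each continuation byte contributes its last two digits (the first is always 3).
        for b in range(1, extra + 1):
            data += digits[i + 3 * b + 1 : i + 3 * b + 3]
        code_point = 0
        for d in data:
            code_point = code_point * 4 + d
        text.append(chr(code_point))
        i += 3 * (extra + 1)
    return "".join(text)

sequence = ("CTATAACAGTAACCTTCGCCGTTTCCTTCACCT"
            "TAGCCGTCCCAGTAACCTTATCCGTACCCGTTG"
            "CCTTCACTGTGCCAGTAAGGCTTTTCGTAATGG")
print(decode_utf_dna(sequence))  # À votre santé 😊
```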
An important question is whether the choice of three nucleotides per byte instead of four is warranted. As mentioned, the justification for three nucleotides over a more convenient (and more compact) four is that “this is how nature does it”. On the basis that we’re dealing with theoretical DNA sequences, this argument is weak. However, the interesting consequence of three-nucleotide bytes is that they permit a direct mapping from text to a corresponding sequence of bytes, in turn to a corresponding sequence of amino acids, and therefore potentially to a protein. This may or may not be valuable, and in truth, there is probably space for both forms of encoding.
The aim of this standard has been to propose a straightforward and compact representation of text as a theoretical DNA sequence. However, in practice, at the time of writing, real-world synthetic DNA data storage avoids repeated nucleotides to reduce problems with accurate reading, also known as sequencing. For example, the sequence GCTCAGCTCTGA is fine, whereas the sequence AGCTTCAAAG is not because the nucleotide T appears twice in a row and A three times in a row. With this in mind, either further work can be done to refine UTF-DNA, perhaps as a separate standard, to make it suitable for real-world use (though the representation will be far less compact), or perhaps in future, technology will improve such that repeated nucleotides no longer present issues with sequencing.
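A simple check for the repeated-nucleotide problem described above (a throwaway sketch using Python’s re module):

```python
import re

def has_repeated_nucleotides(sequence: str) -> bool:
    """True if any nucleotide appears two or more times in a row."""
    return re.search(r"(.)\1", sequence) is not None

print(has_repeated_nucleotides("GCTCAGCTCTGA"))  # False -- fine for sequencing
print(has_repeated_nucleotides("AGCTTCAAAG"))    # True  -- TT and AAA are problems
```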
What if the Unicode standard increases the number of code points? UTF-8 has plenty of room for expansion. This proposal has headroom up to U+3FFFFF, about four times the current limit, but no further. The trade-off was to reduce the number of bytes used in most representations. Considering there are 150,000 characters defined in a standard of over 1 million code points, this is likely a good trade-off.
Single byte code points in this proposal cover Unicode code points 0 to 15, which are all non-printable control characters and rarely used, except for the line feed (U+0A) and carriage return (U+0D) characters. This raises the question of whether we should entertain a single byte representation at all, or rather go straight to double bytes for ASCII. Alternatively, to maximise the use of single byte representations and reduce bandwidth and storage, we could depart even further from UTF-8 and devise an encoding where as many as possible of the 52 upper and lower case Latin letters are accommodated in a single byte. Unfortunately, as a minimum, the uppermost bit of a 6-bit byte must be a control bit, which leaves five data bits, encompassing only 32 characters and not 52. We might consider covering the lower case letters alone in a single byte, though the deviation from UTF-8 would be even more marked.
Naming is hard. Most UTF standards hint at the number of bits used in the encoding, giving rise to names such as UTF-5, UTF-6, UTF-8 and UTF-16. UTF-DNA is a 6-bit encoding, so it could be argued that the number 6 should feature in the name. However, the number of bits is a consequence of mimicking real-world DNA and its nucleotide triplets, and it was therefore felt that the DNA nucleotides were the overriding feature of this standard. There is also precedent in the rarely used UTF-EBCDIC standard, both in naming format and insofar as it is a “translation” of the UTF-8 standard, hence settling on the name UTF-DNA.
This document outlines a proposed UTF-DNA standard, a variable-width encoding of the Unicode 16.0 standard in one to six 6-bit bytes, represented in base 4 DNA nucleotides.