UTF-DNA: A Text Encoding for DNA Sequences

This document is a proposal for the UTF-DNA standard, the Unicode Transformation Format for DNA: a compact representation of text in a theoretical DNA sequence. It draws heavily, though not exactly, on UTF-8, which has become the overwhelmingly predominant text encoding on the web and in computing in general.

What is a DNA sequence?

DNA is a molecule found in every living organism which encodes the instructions for its functioning. DNA is composed of a sequence of organic molecules called nucleotides, where each nucleotide can be one of four molecules: Adenine, Cytosine, Guanine and Thymine. These are commonly abbreviated to A, C, G and T.

If each of the four nucleotides is considered to be a digit, there would be four digits (0 to 3) and DNA could effectively be considered a quaternary (base 4) representation. For the purposes of this standard, each nucleotide corresponds to a digit as follows:

| Nucleotide | Digit |
|------------|-------|
| A          | 0     |
| C          | 1     |
| G          | 2     |
| T          | 3     |

The DNA sequence for our purposes is theoretical; it is a string of digits drawn from A, C, G and T which considers the mathematics without reference to the underlying biochemistry.
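
As a minimal sketch (the names below are illustrative, not part of the proposal), this mapping and the base-4 interpretation can be expressed in a few lines of Python:

    # Map each nucleotide to its quaternary digit (A=0, C=1, G=2, T=3) and back.
    NUCLEOTIDE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}
    DIGIT_TO_NUCLEOTIDE = {digit: nucleotide for nucleotide, digit in NUCLEOTIDE_TO_DIGIT.items()}

    def sequence_to_int(sequence: str) -> int:
        """Interpret a nucleotide string as a base-4 number, e.g. 'CTA' -> 1*16 + 3*4 + 0 = 28."""
        value = 0
        for nucleotide in sequence:
            value = value * 4 + NUCLEOTIDE_TO_DIGIT[nucleotide]
        return value

    assert sequence_to_int("CTA") == 28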

How large is a byte?

Modern computing is based on the binary (base 2) system where each bit (binary digit) can be either 0 or 1. Bits are grouped into bytes, where a byte almost exclusively refers to eight bits. Treated as an unsigned value, a byte can represent the numbers 0 to 255. So how long should a byte be in a DNA sequence? Mathematically, four quaternary nucleotides map exactly to eight bits, which would give a direct mapping from any text encoding based on 8-bit bytes, such as UTF-8, to a nucleotide equivalent. However, nature has its own answer; DNA works on groupings of three nucleotides, where each unique triplet (also known as a codon) encodes one of 20 or so “amino acids”. Amino acids are yet more molecules which are the building blocks of proteins. For the purposes of this document we will follow nature’s cue, and a byte will therefore be three quaternary digits. Three quaternary digits can represent the numbers 0 to 63, and the equivalent number of binary digits is six, so UTF-DNA will have 6-bit bytes.
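
As a quick illustrative check of that arithmetic:

    # Three quaternary digits and six binary digits both give 64 values (0 to 63),
    # whereas four quaternary digits would match a conventional 8-bit byte (0 to 255).
    assert 4 ** 3 == 2 ** 6 == 64
    assert 4 ** 4 == 2 ** 8 == 256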

The Unicode Standard

There are many thousands of unique characters across a multitude of languages and non-languages (think arrows, emojis, chess pieces etc.) that require encoding. The Unicode Standard defines a unique number, called a code point, for each character. At the time of writing, Unicode 16.0 defines 154,998 characters, though the standard can accommodate 1,114,112 code points (U+0000 to U+10FFFF). The purpose of a text encoding is to describe a method of transforming each character into one or more bytes. A single byte, whether 6-bit or 8-bit, cannot come close to representing every Unicode code point, so text encodings for Unicode use multiple bytes per character.
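
For example, Python’s built-in ord function returns the code point of a character (these particular characters reappear in the worked example later):

    # A few characters and their Unicode code points, shown in hexadecimal.
    print(hex(ord("A")))   # 0x41
    print(hex(ord("é")))   # 0xe9
    print(hex(ord("😊")))  # 0x1f60a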

The UTF-8 standard

UTF-8 defines a so-called “variable-width” encoding where the number of bytes used to represent a character varies depending on the character. In the case of UTF-8, one, two, three or four bytes are used. The uppermost one or more bits of each 8-bit byte are hijacked and used as “control bits” which determine what that byte means. The rest of the bits are “data bits” which actually define the code point. Here is a template 2-byte representation:

       Byte 1                     Byte 2
    ______|______              ______|______
   |             |            |             |
   1 1 0 x x x y y            1 0 y y z z z z
   |___| |_______|            |_| |__________|
     |       |                 |       |
  Control   Data            Control   Data
   bits     bits             bits     bits

In this example, the 110 prefix in the first byte marks the start of a 2-byte sequence, the 10 prefix in the second byte marks a continuation byte, and the letters x, y and z stand for the data bits of the code point.

Unicode code points range from U+0000 to U+10FFFF, where the number after U+ is in hexadecimal (base 16). Here is the full scheme:

Bytes represented in binary

| First code point | Last code point | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
|------------------|-----------------|----------|----------|----------|----------|
| U+0000           | U+007F          | 0yyyzzzz |          |          |          |
| U+0080           | U+07FF          | 110xxxyy | 10yyzzzz |          |          |
| U+0800           | U+FFFF          | 1110wwww | 10xxxxyy | 10yyzzzz |          |
| U+010000         | U+10FFFF        | 11110uvv | 10vvwwww | 10xxxxyy | 10yyzzzz |
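
As a sketch of how the table is applied (the function name is illustrative, not part of UTF-8), the two-byte case can be written straight from the bit template and checked against Python’s built-in encoder:

    def utf8_encode_2byte(code_point: int) -> bytes:
        """Encode a code point in the range U+0080 to U+07FF as 110xxxyy 10yyzzzz."""
        assert 0x80 <= code_point <= 0x7FF
        byte1 = 0b11000000 | (code_point >> 6)          # 110 + top 5 data bits
        byte2 = 0b10000000 | (code_point & 0b00111111)  # 10 + bottom 6 data bits
        return bytes([byte1, byte2])

    assert utf8_encode_2byte(ord("À")) == "À".encode("utf-8")  # b'\xc3\x80'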

The DNA equivalent of UTF-8

Let’s try DNA encoding in the same manner as UTF-8, but with 6-bit bytes instead of 8-bit bytes:

Bytes represented in binary

| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 |
|------------------|-----------------|--------|--------|--------|--------|--------|
| U+0000           | U+001F          | 0yzzzz |        |        |        |        |
| U+0020           | U+007F          | 110yyy | 10zzzz |        |        |        |
| U+0080           | U+03FF          | 1110xx | 10yyyy | 10zzzz |        |        |
| U+0400           | U+1FFF          | 11110w | 10xxxx | 10yyyy | 10zzzz |        |
| U+2000           | U+FFFF          | 111110 | 10wwww | 10xxxx | 10yyyy | 10zzzz |
| U+010000         | U+10FFFF        | FULL   |        |        |        |        |

This doesn’t work. With a 6-bit encoding we can only get as far as five bytes before we run out of space in the initial control byte. The highest code point representable is only U+FFFF, which falls far short of the highest Unicode code point of U+10FFFF.
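
A quick capacity check (illustrative arithmetic only) makes the shortfall clear: in the five-byte form, the first byte is all control bits, so only the four continuation bytes carry data.

    # Five-byte form: first byte 111110 carries no data bits; each of the
    # four continuation bytes carries 4 data bits, giving 16 bits in total.
    data_bits = 0 + 4 * 4
    assert (1 << data_bits) - 1 == 0xFFFF  # the highest representable code point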

Can we tweak our 6-bit encoding scheme to allow us to reach U+10FFFF?

Notice how for 2-byte representations, three control bits are used up in the first byte, and for every extra byte added, yet another control bit is added. As we’re working in binary, each control bit could in principle distinguish between two possibilities, say one or two extra bytes, rather than simply signalling one more byte.

There is a good reason why UTF-8 chooses to use just one control bit when representing a single byte, and it is one of UTF-8’s most important properties. UTF-8 was designed so that code points U+0000 to U+007F, the characters of the longstanding ASCII standard of mostly Latin characters which requires just 7 bits, fit in a single byte. So for backward compatibility, valid ASCII is also valid UTF-8. Also, a significant number of the world’s languages, and likely the majority of text encoded and stored digitally, will have a high proportion of ASCII characters. UTF-8 therefore minimises the number of bytes used for most text, which markedly reduces storage and bandwidth requirements. This is at the expense of using up control bits more greedily.

With 6-bit bytes, we have no choice but to use a second byte to represent the higher end of ASCII’s 7-bit range. In fact, considering that the upper case letter A has code point U+0041, and all other upper and lower case letters have higher code points, the majority of ASCII text will require two bytes. With this knowledge, we can craft a scheme which forces the use of two bytes for earlier code points, allowing us to use fewer control bits as we increase the number of bytes.

Here’s one option:

Bytes represented in binary

| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
|------------------|-----------------|--------|--------|--------|--------|--------|--------|
| U+0000           | U+000F          | 00zzzz |        |        |        |        |        |
| U+0010           | U+00FF          | 01yyyy | 10zzzz |        |        |        |        |
| U+0100           | U+03FF          | 1100xx | 10yyyy | 10zzzz |        |        |        |
| U+0400           | U+3FFF          | 1101ww | 10xxxx | 10yyyy | 10zzzz |        |        |
| U+4000           | U+3FFFF         | 1110vv | 10wwww | 10xxxx | 10yyyy | 10zzzz |        |
| U+40000          | U+3FFFFF        | 1111uu | 10vvvv | 10wwww | 10xxxx | 10yyyy | 10zzzz |

The tweaks here are:

  1. To use two control bits rather than one for the single byte representation. This affords us more headroom in control bits once we have further bytes.
  2. Because we know we don’t need numbers higher than U+10FFFF, we can use just two extra control bits for the 3, 4, 5 and 6 byte representations, at the cost of precluding a representation for a 7th byte, should we need it in the future. But we already have headroom insofar as our maximum code point is U+3FFFFF, which far exceeds UTF-8’s maximum of U+10FFFF.

The resulting schema for the control bits is reminiscent of Huffman coding, where we use the control bits as efficiently as possible in the knowledge that we only need to encode up to six bytes and no more.

By commandeering two bits at a time as control bits, we gain a further valuable property. As mentioned earlier, quaternary representation is such that two binary bits represent a single quaternary digit. We can therefore recreate this table in native quaternary:

Bytes represented in quaternary

| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
|------------------|-----------------|--------|--------|--------|--------|--------|--------|
| U+0000           | U+000F          | 0zz    |        |        |        |        |        |
| U+0010           | U+00FF          | 1yy    | 2zz    |        |        |        |        |
| U+0100           | U+03FF          | 30x    | 2yy    | 2zz    |        |        |        |
| U+0400           | U+3FFF          | 31w    | 2xx    | 2yy    | 2zz    |        |        |
| U+4000           | U+3FFFF         | 32v    | 2ww    | 2xx    | 2yy    | 2zz    |        |
| U+40000          | U+3FFFFF        | 33u    | 2vv    | 2ww    | 2xx    | 2yy    | 2zz    |

One final improvement. There is, in my mind, just a little more beauty if the first control digit of the first byte represents the number of extra bytes needed, and, if it is a 2, the second control digit gives the number of additional bytes needed beyond two. This would incidentally make the continuation digit 3:

Bytes represented in quaternary

| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
|------------------|-----------------|--------|--------|--------|--------|--------|--------|
| U+0000           | U+000F          | 0zz    |        |        |        |        |        |
| U+0010           | U+00FF          | 1yy    | 3zz    |        |        |        |        |
| U+0100           | U+03FF          | 20x    | 3yy    | 3zz    |        |        |        |
| U+0400           | U+3FFF          | 21w    | 3xx    | 3yy    | 3zz    |        |        |
| U+4000           | U+3FFFF         | 22v    | 3ww    | 3xx    | 3yy    | 3zz    |        |
| U+40000          | U+3FFFFF        | 23u    | 3vv    | 3ww    | 3xx    | 3yy    | 3zz    |
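
To illustrate this self-describing property, here is a minimal sketch in Python (the function name is illustrative) of how a decoder could determine a character’s length from the control digits of its first byte alone:

    def sequence_length(first_byte: str) -> int:
        """Given the first byte of a character as three quaternary digits
        (e.g. '221'), return the total number of bytes in that character."""
        d0, d1 = int(first_byte[0]), int(first_byte[1])
        if d0 == 3:
            raise ValueError("a continuation byte cannot start a character")
        if d0 < 2:
            return 1 + d0       # 0 or 1 extra bytes follow
        return 1 + 2 + d1       # at least 2 extra bytes, plus d1 more

    # One byte form up to six byte form, using the Byte 1 patterns above.
    assert [sequence_length(b) for b in ("000", "100", "200", "210", "220", "230")] == [1, 2, 3, 4, 5, 6]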

The final step is to translate our quaternary digits to their nucleotide equivalents…

The proposal for UTF-DNA

This table is the final proposal for UTF-DNA.

Bytes represented in quaternary nucleotide digits

| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
|------------------|-----------------|--------|--------|--------|--------|--------|--------|
| U+0000           | U+000F          | Azz    |        |        |        |        |        |
| U+0010           | U+00FF          | Cyy    | Tzz    |        |        |        |        |
| U+0100           | U+03FF          | GAx    | Tyy    | Tzz    |        |        |        |
| U+0400           | U+3FFF          | GCw    | Txx    | Tyy    | Tzz    |        |        |
| U+4000           | U+3FFFF         | GGv    | Tww    | Txx    | Tyy    | Tzz    |        |
| U+40000          | U+3FFFFF        | GTu    | Tvv    | Tww    | Txx    | Tyy    | Tzz    |

Where the nucleotides A, C, G and T stand for the quaternary digits 0, 1, 2 and 3 respectively, and the letters u, v, w, x, y and z stand for the data digits of the code point.
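
To make the scheme concrete, here is a minimal sketch of an encoder in Python; the function and table names (utf_dna_encode_char, FORMS and so on) are illustrative and not part of the proposal:

    # Nucleotide for each quaternary digit: A=0, C=1, G=2, T=3.
    NUCLEOTIDES = "ACGT"

    # (last code point, number of quaternary data digits) for the 1 to 6 byte forms.
    FORMS = [(0xF, 2), (0xFF, 4), (0x3FF, 5), (0x3FFF, 7), (0x3FFFF, 9), (0x3FFFFF, 11)]

    def utf_dna_encode_char(char: str) -> str:
        """Encode a single character as a string of UTF-DNA nucleotides."""
        code_point = ord(char)
        num_bytes, num_digits = next(
            (n + 1, d) for n, (last, d) in enumerate(FORMS) if code_point <= last
        )
        # The code point as base-4 data digits, most significant first.
        data = [(code_point >> (2 * i)) & 0b11 for i in reversed(range(num_digits))]
        if num_bytes == 1:
            out, data = [0] + data, []                            # 0zz
        elif num_bytes == 2:
            out, data = [1] + data[:2], data[2:]                  # 1yy ...
        else:
            out, data = [2, num_bytes - 3] + data[:1], data[1:]   # 2, extra bytes beyond two, one data digit
        # Each continuation byte is the digit 3 followed by two data digits.
        for i in range(0, len(data), 2):
            out += [3] + data[i:i + 2]
        return "".join(NUCLEOTIDES[digit] for digit in out)

    def utf_dna_encode(text: str) -> str:
        return "".join(utf_dna_encode_char(c) for c in text)

    # Matches the worked example below, e.g. À (U+C0) -> CTA TAA.
    assert utf_dna_encode_char("À") == "CTATAA"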

Example

Let us take the following text as an example:

À votre santé 😊

First convert each character to its Unicode code point:

À       = U+C0
[space] = U+20
v       = U+76
o       = U+6F
t       = U+74
r       = U+72
e       = U+65
[space] = U+20
s       = U+73
a       = U+61
n       = U+6E
t       = U+74
é       = U+E9
[space] = U+20
😊      = U+1F60A

Next encode in quaternary encoding:

À       = 130 300
[space] = 102 300
v       = 113 312
o       = 112 333
t       = 113 310
r       = 113 302
e       = 112 311
[space] = 102 300
s       = 113 303
a       = 112 301
n       = 112 332
t       = 113 310
é       = 132 321
[space] = 102 300
😊      = 221 333 312 300 322

Then translate each digit to its equivalent nucleotide:

À       = CTA TAA
[space] = CAG TAA
v       = CCT TCG
o       = CCG TTT
t       = CCT TCA
r       = CCT TAG
e       = CCG TCC
[space] = CAG TAA
s       = CCT TAT
a       = CCG TAC
n       = CCG TTG
t       = CCT TCA
é       = CTG TGC
[space] = CAG TAA
😊      = GGC TTT TCG TAA TGG

Finally we have our complete DNA sequence:

CTATAACAGTAACCTTCGCCGTTTCCTTCACCT
TAGCCGTCCCAGTAACCTTATCCGTACCCGTTG
CCTTCACTGTGCCAGTAAGGCTTTTCGTAATGG

Incidentally, in this example the UTF-8 representation is 160 bits long (20 8-bit bytes) and the UTF-DNA representation is 198 bits long (33 6-bit bytes).
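
This comparison can be reproduced with a short calculation; the helper below is a sketch that simply reads the byte count off the ranges in the proposal’s table:

    TEXT = "À votre santé 😊"

    # UTF-8 size: Python's built-in encoder, 8 bits per byte.
    utf8_bits = len(TEXT.encode("utf-8")) * 8

    def utf_dna_byte_count(code_point: int) -> int:
        """Number of 6-bit UTF-DNA bytes needed for a code point, per the table above."""
        limits = [0xF, 0xFF, 0x3FF, 0x3FFF, 0x3FFFF, 0x3FFFFF]
        return next(n + 1 for n, last in enumerate(limits) if code_point <= last)

    utf_dna_bits = sum(utf_dna_byte_count(ord(c)) for c in TEXT) * 6

    print(utf8_bits, utf_dna_bits)  # 160 198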

Issues and questions

  1. An important question is whether the choice of three nucleotides per byte instead of four is warranted. As mentioned, the justification for three nucleotides over a more convenient (and more compact) four is that “this is how nature does it”. On the basis that we’re dealing with theoretical DNA sequences, this argument is weak. However, the interesting consequence of three-nucleotide bytes is that it permits a direct mapping from text to a corresponding sequence of bytes, and in turn to a corresponding sequence of amino acids, and therefore potentially to a protein. This may or may not be valuable, and in truth, there is probably space for both forms of encoding.

  2. The aim of this standard has been to propose a straightforward and compact representation of text as a theoretical DNA sequence. In practice, however, at the time of writing, real-world synthetic DNA data storage avoids repeated nucleotides to reduce problems with accurate reading, also known as sequencing. For example, the sequence GCTCAGCTCTGA is fine whereas the sequence AGCTTCAAAG is not, because the nucleotide T appears twice in a row and A three times in a row (see the sketch after this list). With this in mind, either further work can be done to refine UTF-DNA, perhaps as a separate standard, to make it suitable for real-world use (though the representation will be far less compact), or perhaps in future, technology will improve such that repeated nucleotides no longer present issues with sequencing.

  3. What if the Unicode standard increases the number of code points? UTF-8 has plenty of room for expansion. This proposal has headroom up to U+3FFFFF, about four times the current limit, but no further. The trade-off was to reduce the number of bytes used in most representations. Considering around 155,000 characters are currently defined in a standard of over 1 million code points, this is likely a good trade-off.

  4. Single byte code points in this proposal cover Unicode code points 0 to 15, which are all non-printable control characters and are rarely used, except for the line feed (U+0A) and carriage return (U+0D) characters. This raises the question of whether we should entertain a single byte representation at all, and instead go straight to double bytes for ASCII. Alternatively, to maximise the use of single byte representations and reduce bandwidth and storage, we could depart even further from UTF-8 and devise an encoding where as many as possible of the 52 upper and lower case Latin letters are accommodated in a single byte. Unfortunately, as a minimum, the uppermost bit of a 6-bit byte must be a control bit, which leaves five bits for data, encompassing only 32 characters and not 52. We might consider covering the lower case letters alone in a single byte, though the deviation from UTF-8 would be even more marked.

  5. Naming is hard. Most UTF standards hint at the number of bits used in the encoding, giving rise to names such as UTF-5, UTF-6, UTF-8 and UTF-16. UTF-DNA is a 6-bit encoding, so it could be argued that the number 6 should feature in the name. However, the number of bits used is a consequence of the fact that we are mimicking real-world DNA with its nucleotide triplets, and it was therefore felt that the use of DNA nucleotides was the overriding feature of this standard. There is also precedent in the rarely used UTF-EBCDIC standard, both in naming format and insofar as it is a “translation” of the UTF-8 standard, hence settling on the name UTF-DNA.
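
As referenced in point 2 above, a simple run-length check (a sketch, not part of the proposal) is enough to spot sequences that today’s synthesis and sequencing technology would find problematic:

    def has_repeated_nucleotides(sequence: str) -> bool:
        """Return True if any nucleotide appears two or more times in a row."""
        return any(a == b for a, b in zip(sequence, sequence[1:]))

    assert not has_repeated_nucleotides("GCTCAGCTCTGA")  # fine
    assert has_repeated_nucleotides("AGCTTCAAAG")        # contains TT and AAA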

Conclusion

This document outlines a proposed UTF-DNA standard, a variable-width encoding of the Unicode 16.0 standard in one to six 6-bit bytes, represented in base 4 DNA nucleotides.


