Basics - Encodings

In computer systems, all data is stored in binary, that is, as sequences of 1s and 0s. These bits are usually grouped into units of 8, called bytes. Even the text that you are reading right now is, at a low level, stored as bytes. Computers have various ways of interpreting those bytes. For example, when a computer sees the byte 01100001, it can interpret it as the binary representation of the number 97. However, if we treated all bytes as numbers, we would not be able to use bytes as text, pictures, or even computer programs. Therefore, we can tell a computer how to interpret a byte, e.g. as a character instead of a number. In that case the same byte 01100001 represents the character a. When playing capture the flag (CTF) challenges, you will encounter many different ways of representing data. These representations are called encodings. Here we discuss some of the basic encodings of data.
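As a minimal sketch of this idea in Python, the same 8 bits can be read either as a number or as a character. The variable names are illustrative only.

```python
# The same byte, written as a binary literal
byte_value = 0b01100001

# Interpreted as a number
as_number = byte_value             # 97

# Interpreted as a character (code point 97)
as_character = chr(byte_value)     # 'a'

# Going the other way: character -> number -> 8-bit binary string
as_bits = format(ord("a"), "08b")  # '01100001'

print(as_number, as_character, as_bits)
```

The bits never change; only the interpretation we ask for does.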

ASCII & UTF-8

Historically, nearly all modern computer systems encode the English alphabet (a-zA-Z) and Arabic numerals (0-9) using the ASCII encoding scheme. Here, each character is assigned a number, and the binary representation of that number (a byte) is interpreted as that character. ASCII uses some clever tricks to ease the interpretation of bytes. E.g., the characters for the Arabic numerals 0-9 all have the form 0011 xxxx, where 0011 0000 = 0, 0011 0001 = 1, ..., 0011 1001 = 9. Furthermore, the capital letters A-Z have the form 010 xxxxx, where 010 00001 = A, 010 00010 = B, ..., 010 11010 = Z. Then, we have the lowercase letters a-z of the form 011 xxxxx, where 011 00001 = a, 011 00010 = b, ..., 011 11010 = z. Finally, the remaining bytes represent other characters such as spaces, punctuation, or control characters that were used to control peripherals such as printers. Fortunately, programming languages offer easy methods to convert between ASCII text and bytes. E.g., in Python:

string_ascii = "Hello World!"            # Create a normal (text) string
string_bytes = b"Hello World!"           # Create a bytes object, using the b prefix in front of the string
as_bytes = string_ascii.encode('ascii')  # Convert ASCII string to bytes: b'Hello World!'
as_text  = string_bytes.decode('ascii')  # Convert bytes to ASCII string: 'Hello World!'
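The bit patterns behind digits and letters can be inspected directly. The following is a small sketch using only Python built-ins; the variable names are made up for illustration.

```python
# Print the 8-bit pattern of a few ASCII characters
for ch in "019AZaz":
    print(ch, format(ord(ch), "08b"))

# Digits: the low four bits hold the digit's value
digit_value = ord("7") & 0b00001111             # 7

# Upper and lower case differ in a single bit (0b00100000 = 32)
lower_from_upper = chr(ord("A") | 0b00100000)   # 'a'
upper_from_lower = chr(ord("a") & ~0b00100000)  # 'A'

# The position in the alphabet sits in the low five bits (A/a = 1)
alphabet_index = ord("c") & 0b00011111          # 3
```

This regular layout is why switching case or extracting a digit's value can be done with a single bit operation.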

While the English alphabet and Arabic numerals are widely used, there are many more characters than the 7-bit ASCII scheme can capture. Therefore, if we want to represent the characters used in languages such as Arabic (عربى), Russian (русский), Hebrew (עִברִית), or Hindi (हिंदी), we require a different encoding scheme. The Unicode Consortium introduced a system in which each character is assigned a number, called a code point (yes, including the 💩 emoji, number 128169, or 1f4a9 in hexadecimal). To represent these larger numbers, most computer systems use the UTF-8 format. This format is identical to ASCII for the first 128 code points (0-127, which fit in 7 bits), but has a clever way of extending the encoding to multiple bytes for Unicode code points that exceed the 7-bit ASCII limit. As with ASCII, programming languages such as Python can easily convert between bytes and UTF-8:

string_utf8  = "Hello 中国"                 # Create a string containing non-ASCII characters
string_bytes = string_utf8.encode('utf-8') # Convert string to UTF-8 bytes
string_bytes.decode('utf-8')               # Convert UTF-8 bytes back to a string
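To see the multi-byte extension in action, the sketch below encodes a few sample characters and compares the number of bytes each one needs. The sample characters and variable names are chosen here for illustration.

```python
# Characters with larger code points need more UTF-8 bytes
samples = ["a", "é", "中", "💩"]
for ch in samples:
    encoded = ch.encode("utf-8")
    print(ch, ord(ch), hex(ord(ch)), len(encoded), encoded.hex())

poo = "💩"
poo_code_point = ord(poo)        # 128169, i.e. 0x1f4a9
poo_bytes = poo.encode("utf-8")  # 4 bytes: f0 9f 92 a9

# Plain ASCII text yields the same bytes in both encodings
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")

# ASCII cannot represent characters outside its 7-bit range
try:
    "中国".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Because ASCII text is already valid UTF-8, decoding old ASCII data as UTF-8 gives the same result.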

Hexadecimal

As all data on a computer is stored as bytes, we need a way to exchange these bytes in a format that is (at least somewhat) readable for both humans and computers. We could use ASCII for this, but some bytes such as 00000000 (the null character) do not correspond to a character that humans can read. If we represented 00000000 as Null, how would we distinguish it from the sequence of English characters N, u, l and l? Therefore, one common way of transforming bytes into a human-readable format is hexadecimal (sometimes simply called hex) encoding. Hexadecimal is a base-16 number system in which each digit represents one of the numbers 0-15, according to the following scheme:

decimal    hexadecimal    binary
      0              0      0000
      1              1      0001
      2              2      0010
      3              3      0011
      4              4      0100
      5              5      0101
      6              6      0110
      7              7      0111
      8              8      1000
      9              9      1001
     10              a      1010
     11              b      1011
     12              c      1100
     13              d      1101
     14              e      1110
     15              f      1111

We observe that a byte (8 bits) consists of two groups of 4 bits (each sometimes called a nibble). From the table above, we see that 4 bits can be represented by one hexadecimal character, and therefore every byte can be represented as 2 hexadecimal characters. Take the example of the ASCII character a = 0110 0001. In hexadecimal, 0110 becomes 6 and 0001 becomes 1. Therefore, the hexadecimal representation of a in ASCII is 61. Programming languages such as Python have easy ways of converting between encodings such as ASCII, bytes, hexadecimal and integers:

character   = "a"                         # ASCII/UTF-8 character "a"
raw_bytes   = character.encode("utf-8")   # Bytes representation of "a" (named raw_bytes to avoid shadowing the built-in bytes type)
hexadecimal = raw_bytes.hex()             # Transform bytes to hexadecimal: '61'
bytes_again = bytes.fromhex(hexadecimal)  # Transform hexadecimal to bytes: b'a'
integer     = int(hexadecimal, 16)        # Transform hexadecimal (base-16) to integer: 97
binary      = "{:08b}".format(integer)    # Transform integer to single byte: '01100001' (In case of integers larger than 1 byte, change 8 to 8*number_of_bytes)
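The nibble-by-nibble conversion described above can also be written out by hand. The following is a sketch for illustration only; byte_to_hex and to_hex are hypothetical helper names, and in practice the built-in hex() method does the same job.

```python
HEX_DIGITS = "0123456789abcdef"

def byte_to_hex(value):
    """Convert one byte (0-255) to two hexadecimal characters by hand."""
    high_nibble = value >> 4         # upper 4 bits
    low_nibble = value & 0b00001111  # lower 4 bits
    return HEX_DIGITS[high_nibble] + HEX_DIGITS[low_nibble]

def to_hex(data):
    """Hex-encode a bytes object one byte at a time."""
    return "".join(byte_to_hex(b) for b in data)

manual = to_hex(b"Hello")
builtin = b"Hello".hex()
print(manual, builtin)
```

Both approaches produce the same string, which shows that hex encoding is nothing more than a table lookup per nibble.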

Base64

While hexadecimal can represent any byte data, it is quite wasteful. After all, each byte is transformed into 2 hexadecimal characters. These hexadecimal characters are then usually encoded in UTF-8. In our example the byte for the letter a = 0110 0001 is 61 in hexadecimal, and will be sent or stored as the two bytes for 6 and 1 which is 00110110 00110001. This is a doubling of data, which for longer sequences can become huge.
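The doubling can be checked with a short sketch. The 1000-byte payload is an arbitrary example and the variable names are illustrative.

```python
data = b"a" * 1000                 # 1000 bytes of raw data
hex_text = data.hex()              # 2000 hexadecimal characters
stored = hex_text.encode("utf-8")  # the bytes actually sent or stored

print(len(data), len(stored))

# The single-byte example from the text: "a" -> "61" -> two bytes
single = "a".encode("utf-8").hex().encode("utf-8")
single_bits = " ".join(format(b, "08b") for b in single)
print(single_bits)
```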

Base64 tries to improve upon hexadecimal encoding by using 64 different printable characters to represent 6-bit binary sequences (2^6 = 64). To this end, it uses the characters A-Z, a-z, 0-9 (26+26+10 = 62 different characters) and two other characters, usually + and / (the full table is defined in RFC 4648), to encode bytes. There is one catch. Base64 splits byte data into groups of 6 bits, so 3 bytes (24 bits) map exactly onto 4 characters. A sequence of 4 bytes (32 bits), however, would require 5.333... characters. Base64 therefore uses 6 characters, filling the unused bits of the last one with zeros, and appends the padding character (=) until the output length is a multiple of 4, which gives 8 characters in total. Again, programming languages such as Python offer convenient ways of decoding and encoding in base64:

from base64 import b64encode, b64decode
text           = "Text to encode / decode into base64" # Text to encode into bytes, and then base64
text_as_bytes  = text.encode('utf-8')                  # Encode text to bytes
base64_bytes   = b64encode(text_as_bytes)              # Outputs b'VGV4dCB0byBlbmNvZGUgLyBkZWNvZGUgaW50byBiYXNlNjQ='
base64_text    = base64_bytes.decode('utf-8')          # Transform to string "VGV4dCB0byBlbmNvZGUgLyBkZWNvZGUgaW50byBiYXNlNjQ="
original_bytes = b64decode(base64_bytes)               # Convert back to bytes data
original_text  = original_bytes.decode('utf-8')        # Convert back to text data
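To illustrate how the padding depends on the input length, and how base64 compares to hexadecimal in size, here is a small sketch. The inputs and the 300-byte payload are arbitrary examples.

```python
from base64 import b64encode

# Padding depends on the input length modulo 3
for data in (b"a", b"ab", b"abc", b"abcd"):
    print(len(data), b64encode(data).decode("ascii"))

one   = b64encode(b"a").decode("ascii")     # 'YQ=='
two   = b64encode(b"ab").decode("ascii")    # 'YWI='
three = b64encode(b"abc").decode("ascii")   # 'YWJj'
four  = b64encode(b"abcd").decode("ascii")  # 'YWJjZA=='

# Size comparison for 300 bytes of data
payload = bytes(range(256)) + bytes(44)
hex_length = len(payload.hex())             # 600 characters
base64_length = len(b64encode(payload))     # 400 characters
```

Base64 output is about 4/3 the size of the input, compared to twice the size for hexadecimal.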

Challenge

The byte size of 8 bits was chosen because it is a power of 2; other than that, it is an arbitrary length. Instead, we think that the 9-bit byte, i.e. the (k)nyte, is much cooler! The string below is a hexadecimal encoding of our flag, represented as ASCII in knytes. Can you retrieve the flag?

2a120a67b1412c526e3c950665f1e97cc4332a15066722f9d0d0343717c84793a0cce67d