Character Packing - Pico CTF 2021 :: Greg Scharf

The following is a breakdown of solving a simple challenge named “enc” from PicoCTF 2021. It’s a useful introduction to bit shifting, character encoding and conversion between different base numbering systems.

The challenge starts with a file containing a string of character glyphs: 灩捯䍔䙻ㄶ形楴獟楮獴㌴摟潦弸彤㔲挶戹㍽

In addition to the file containing that string, the python one liner below was included, which presumably encoded the string above by running some text (i.e the flag) through that code.

''.join([chr((ord(flag[i]) << 8) + ord(flag[i + 1])) for i in range(0, len(flag), 2)])

If you’re not familiar with Python, the above is a “list comprehension”, which creates a sort of shorthand syntax for manipulating lists. A List in Python is just an array. The for loop clause on the right will loop through the length of the flag string starting at index 0, 2 characters at a time: for i in range(0, len(flag), 2). Each iteration of the loop is passed to the first part of the comprehension: chr((ord(flag[i]) « 8) + ord(flag[i + 1])). The ”.join() surrounding the entire expression simply takes all the values in the array and puts them into a string.

ord(flag[i + 1]) takes the character from the variable named flag at index i + 1 and converts it to decimal/base 10. Assume the first two characters in the string are “pi”, so in this example flag[0] is ‘p’ and flag[1] is ‘i’. These are in fact the first two characters retrieved in the first iteration of the for loop.

The ord() function in Python will convert a character value to its corresponding decimal value. If you look up the character ‘i’ at asciitable.com you’ll see that it’s decimal value is 105. Basic and extended ascii values range from decimal 0 – 255, which can all fit within an octet, or 8 bits. 105 in binary/base 2 is: 0 1 1 0 1 0 0 1.

The letter p (decimal 112) represented in bits is: 0 1 1 1 0 0 0 0. The expression (ord(flag[i]) « 8) will shift that value 8 bits to the left. A left shift of 8 bits, executed by this operator «, is the same as multiplying the left part of that expression by 256. The result of (ord(flag[i]) « 8) is 28,672.

Why shifting a number 8 bits to the left is the same as multiplying by 256. Below is a visual representation of 8 bits with corresponding decimal values above each bit. Each successive bit is simply 2n, n being the index of the bit and the first index being 0. So below, the first bit is 20 or 1 in decimal and the last bit is 27 or 128 decimal

128  64   32   16   8   4   2   1
0    1    1    1    0   0   0   0

If you add up each of the decimal numbers above the 1 bits you get 112, which again is the character ‘p’ in ascii. After shifting 8 bits to the left you’d have the following.

32768 16384 8192 4096 2048 1024  512  256  128   64  32   16   8   4   2   1
0     1     1     1     0    0    0    0    0    0    0    0   0   0   0   0

Now there are 8 extra bits to the right of the original 8 bits. You might assume a character is only 8 bits and for example in C the “char” datatype is only 1 byte. However, what we have in the challenge are characters that are UTF-8 encoded. If you run “file” on the challenge file you can verify this. UTF-8, the character encoding used in *nix based systems and html5 among others, is a variable width encoding of 1 – 4 octets representing a single character with 1 octet being the minimum size. In the original UTF-8 specification the max number of octets was 6 but in the revised spec it was reduced to 4 because of security concerns ranging from overflows to a web sever virus dating back to 2001 (see rfc3629. UTF-16, shortsightedly adopted early on by Microsoft, has a minimum of 16 bits per character with 32 bits being the maximum. If you have problems with files you’ve moved from Windows to Linux it’s most likely because the Windows file was UTF-16 encoded. Microsoft has recently come to their senses on this. From wikipedia: “Microsoft failed to support UTF-8 until 2017. In May 2019 Microsoft reversed course and started recommending using UTF-8 exclusively.” This article is a brief, beginner friendly introduction to Unicode and encoding. Taking a few hours to fully understand Unicode now could save you many more hours of confusion in the future.

Referring back to the visual representation of those 16 bits above: adding up the 1 bits now equals 28,672. As mentioned previously, shifting 8 bits to the left is the same as multiplying by 256, therefore both 112 * 256 and 112«8 equal 28,672.

Ultimately, this isn’t the best way to convert binary to decimal but if you’re unfamiliar or it’s the first time you’re seeing it, a visual representation makes the concept easier to grasp. Doing math, or at least understanding math, in bases other than 10 (Base 2 and Base 16 especially) can sometimes be a valuable skill for various IT roles. binarymath.info provides a introduction to binary math.

Putting all of that together, the expression (ord(‘p’) « 8) + ord(‘i’) or (28,672 + 105) equals 28,777. The Unicode value of which is the first character in the encoded string: 灩. So two ascii characters were packed into a single UTF-8 encoded character. Unicode is represented in base 16/hexadecimal. 28,777 converted to hex is 7069. So the Unicode value of the character above is formally represented as U+7069 and can be looked up in this table of Unicode characters spanning hex values 7000-7FFF.

Below is a QaD packer/unpacker of the encoding described above using some random text rather than the actual flag so as not to spoil the challenge.

#!/usr/bin/python3
 
def unpack(encoded_flag):
    flag = ""
    for packed_char in encoded_flag:
        start_int = ord(packed_char) #get the integer value of the packed unicode character
        highChar = ord(packed_char)>>8  #bit shift the integer 8 bits to the right, which in essence divides by 256
        flag += chr(highChar)
        lowChar = start_int - (highChar<<8) #difference between sum of 2 packed characters, and the first character multiplied by 256 or <<8
        lowChar = ord(packed_char)%256 #or you could simply use the modulus operator to get the remainder, which will be the lower char
        # explanation for retrieving lowChar.  If you multiple a number by 256 
        # and then add another number that is less than 256, 
        # the number added will always be the remainder when dividing the 
        # sum of those two numbers by 256.
        flag += chr(lowChar)
    print(flag)
 
def pack(flag):
    if(len(flag)%2):  #must be divisible by 2 since two chars are packed into a single char during encoding
        flag+=" "     #add blank space for padding
    encoded_flag = ''.join([chr((ord(flag[i]) << 8) + ord(flag[i + 1])) for i in range(0, len(flag), 2)])
    print(encoded_flag)
    return encoded_flag
 
 
flag = "This is the text that will be packed"
enc = pack(flag)
unpack(enc)