Converting from utf-8 code units to unicode codepoints

To go from utf-8 code units to a unicode codepoint, follow the steps below.
We'll use the code units for these unicode characters: 全然 真棒 (the space is left out of the bytes shown below).

If we open a utf-8 encoded text file containing those characters in a hex editor, we should see something like the following:
e5 85 a8 e7 84 b6 e7 9c 9f e6 a3 92
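We'll assume there is no BOM (in utf-8 it would show up as ef bb bf at the start of the file). utf-8 is a sequence of single bytes, so there is no byte order to worry about and we simply read the hex left to right.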
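
If no hex editor is at hand, the same bytes can be checked from Python. This is only a minimal sketch, assuming Python 3.8 or newer (for the separator argument of bytes.hex); the helper name hex_dump is made up for illustration.

```python
def hex_dump(text: str) -> str:
    """Return the utf-8 code units of `text` as space-separated hex pairs."""
    # str.encode gives the raw utf-8 bytes; bytes.hex(" ") formats them
    # the way a hex editor would show them (no BOM is added).
    return text.encode("utf-8").hex(" ")


print(hex_dump("全然真棒"))  # e5 85 a8 e7 84 b6 e7 9c 9f e6 a3 92
```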

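Some understanding of utf-8 is required here. Since utf-8 is a multibyte encoding, the first byte (8 bits) of each character tells us how many code units to "concatenate" together to get to the codepoint: the number of leading '1' bits is the number of bytes in the sequence, and a first byte that starts with '0' is a complete single-byte character.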
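
The first-byte rule can be sketched in a few lines of Python. The function name sequence_length is hypothetical, and the sketch only covers the 1 to 4 byte sequences that utf-8 allows.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many code units belong to the character starting here."""
    # Count the '1' bits before the leftmost '0', scanning from the top bit.
    ones = 0
    mask = 0x80
    while first_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 0:
        return 1  # 0xxxxxxx: a single-byte (ASCII) character
    if ones == 1 or ones > 4:
        # 10xxxxxx is a continuation byte, and utf-8 stops at 4 bytes.
        raise ValueError(f"{first_byte:#04x} cannot start a utf-8 sequence")
    return ones


print(sequence_length(0xE5))  # 3
```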
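The concat algorithm goes something like this: work out the number of bytes from the first byte, strip the marker bits from each byte, and join what is left.

The example below walks through it for the first character (e5 85 a8).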

Example

  1. Convert code units to binary
    e5 = 1110 0101
    85 = 1000 0101
    a8 = 1010 1000
  2. Determine number of bytes needed to find the codepoint
    We look at the first byte ("e5") and note that there are three '1's before the leftmost '0'.
    This means we need to take the current byte and the next two bytes (we're going to look at a total of 24 bits, or 3 bytes).
  3. Start chopping the bits up
    Now that we know all the bytes we need, we're going to start chopping off bits.
    11100101 - We remove the leading '1's and the first '0'. We are left with
    0101

    We look at the next byte.
    10000101 - We remove the first two bits, which we know will be 10. Remaining bits are
    000101

    Let's look at the last byte.
    10101000 - We remove the first two bits, which we know will be 10. Remaining bits are
    101000

  4. Concatenate the bits
    Concatenate all of our remaining bits in order.
    0101 - 000101 - 101000 becomes
    0101000101101000
    Note we are left with 16 bits (left-fill with zeros if needed so the length is a multiple of 8).
  5. Convert to hex
    Break the binary string into nibbles (4 bits each) for easy conversion.
    0101 0001 0110 1000
    5 1 6 8
  6. Look up
    Check one of those unicode lookup sites for the codepoint U+5168.
    全 Nice, this looks like what we wanted.
  7. Celebrate with a cold beverage
    🍻
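
The whole walkthrough can be sketched in Python, working on strings of '0' and '1' so that each line mirrors a step above. The function name decode_codepoints is made up for illustration, and this is not a full validator: it checks the marker bits but does not reject overlong encodings or surrogates.

```python
from typing import List


def decode_codepoints(data: bytes) -> List[int]:
    """Decode utf-8 code units into codepoints by chopping and joining bits."""
    codepoints = []
    i = 0
    while i < len(data):
        # Convert the first code unit to binary.
        first = format(data[i], "08b")
        # Count the '1's before the leftmost '0' to get the byte count.
        ones = len(first) - len(first.lstrip("1"))
        length = 1 if ones == 0 else ones
        if ones == 1 or ones > 4 or i + length > len(data):
            raise ValueError(f"bad utf-8 sequence at offset {i}")
        # Chop off the leading '1's and the first '0'.
        bits = first[ones + 1:]
        for unit in data[i + 1:i + length]:
            cont = format(unit, "08b")
            if not cont.startswith("10"):
                raise ValueError(f"bad continuation byte at offset {i}")
            # Chop off the '10' marker and concatenate the rest.
            bits += cont[2:]
        # int() with base 2 does the binary-to-number conversion for us.
        codepoints.append(int(bits, 2))
        i += length
    return codepoints


raw = bytes.fromhex("e5 85 a8 e7 84 b6 e7 9c 9f e6 a3 92")
print([f"U+{cp:04X}" for cp in decode_codepoints(raw)])
# ['U+5168', 'U+7136', 'U+771F', 'U+68D2']
```

In everyday code, raw.decode("utf-8") does all of this (plus stricter validation); the sketch is only meant to make the bit chopping visible.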