To go from UTF-8 code units to a Unicode codepoint, do the following.
We'll use the code units for these Unicode characters: 全然 真棒.
If we open a UTF-8 encoded text file containing them in a hex editor, we should see something like the following (the byte for the space, 20, is left out here):
e5 85 a8 e7 84 b6 e7 9c 9f e6 a3 92
We'll assume there is no BOM at the start of the file. UTF-8 has no byte-order variants, so we always read the bytes left to right.
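If no hex editor is at hand, the same bytes can be reproduced with a few lines of Python. This is only a small sketch; the variable names `text` and `hex_dump` are made up for illustration, and the space between the two words is left out to match the dump.

```python
# Encode the four example characters as UTF-8 and show each code unit in hex.
text = "全然真棒"  # the example characters, without the space

# bytes are iterated as integers 0..255; format each one as two hex digits
hex_dump = " ".join(f"{b:02x}" for b in text.encode("utf-8"))
print(hex_dump)  # e5 85 a8 e7 84 b6 e7 9c 9f e6 a3 92
```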
Some slight understanding of UTF-8 is required here. UTF-8 is a variable-length
encoding: a codepoint takes between 1 and 4 bytes (code units), and the first byte
(8 bits) tells us how many code units to "concatenate" together to get to the codepoint.
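The concat algorithm goes something like this:
- Convert to binary.
- Get the first byte (8 bits).
- If the leftmost bit is a 0, the byte stands alone (a single-byte ASCII character) and we return it for decoding.
- If the leftmost bit is a 1, keep reading 1's until we hit a 0.
- The number of 1's tells us how many bytes to use INCLUDING the first byte. A valid first byte has two, three, or four leading 1's; a byte starting with 10 is a continuation byte, not the start of a character.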
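The length check in the steps above could be sketched in Python roughly as follows. The function name `sequence_length` is hypothetical, and the error handling is one possible choice rather than anything prescribed here.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with first_byte uses."""
    if (first_byte & 0b10000000) == 0:
        return 1  # leftmost bit is 0: the byte stands alone

    # Count the 1 bits from the left until we hit a 0.
    ones = 0
    mask = 0b10000000
    while first_byte & mask:
        ones += 1
        mask >>= 1

    # One leading 1 means a continuation byte; more than four is not valid UTF-8.
    if ones == 1 or ones > 4:
        raise ValueError(f"0x{first_byte:02x} cannot start a UTF-8 sequence")
    return ones


print(sequence_length(0xE5))  # 3
```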
Once we know the number of bytes to use, do the following:
- For the first byte, chop off the repeating 1's up to and including the first 0. The remaining bits start the bits of the codepoint.
- For each of the next bytes, remove the leftmost two bits, which should always be a 1 and a 0.
- The remaining bits all get concatenated together, in order.
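The chop-and-concatenate steps map fairly directly onto bit masks and shifts. Below is a minimal sketch, assuming well-formed input; the name `decode_first` is made up for illustration, and it does not verify that the following bytes really start with 10.

```python
def decode_first(data: bytes) -> int:
    """Decode the first codepoint in data (assumes well-formed UTF-8)."""
    first = data[0]
    if first < 0x80:
        return first  # leftmost bit is 0: the byte is the codepoint

    # Count the leading 1 bits; this is the total number of bytes to use.
    length = 0
    while first & (0x80 >> length):
        length += 1

    # Chop off the leading 1's and the 0 after them, keeping the low bits.
    codepoint = first & (0x7F >> length)

    for byte in data[1:length]:
        # Drop the leading "10" (keep the low 6 bits) and append them.
        codepoint = (codepoint << 6) | (byte & 0x3F)
    return codepoint


print(hex(decode_first(bytes.fromhex("e585a8"))))  # 0x5168
```

Shifting left by 6 before OR-ing in each continuation byte is the bitwise equivalent of writing the remaining bits next to each other.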
Example
- Convert code units to binary
  e5 = 1110 0101
  85 = 1000 0101
  a8 = 1010 1000
- Determine number of bytes needed to find the codepoint
  We look at the first byte ("e5") and take note that there are 3 '1's before the leftmost '0'.
  This means we need to take the current byte and the next two bytes (we're going to look at a total of 24 bits, or 3 bytes).
- Start chopping the bits up
  Now that we know all the bytes we need, we're going to start chopping off bits.
  11100101 - We remove the beginning '1's and the first '0'. We are left with
  0101
  We look at the next byte.
  10000101 - We remove the first two bits, which we know will be 10. Remaining bits are
  000101
  Let's look at the last byte.
  10101000 - We remove the first two bits, which we know will be 10. Remaining bits are
  101000
- Concatenate the bits
  Concatenate all of our remaining bits in order.
  0101 - 000101 - 101000 becomes
  0101000101101000
- Convert to hex
  Break the binary string into nibbles (4 bits each) for easy conversion.
  Note we are left with 16 bits (left-fill with zeros if needed to reach a multiple of 4).
  0101 0001 0110 1000
  5    1    6    8
- Look up
  Check one of those Unicode lookup sites for the codepoint U+5168.
  全 Nice, this looks like what we wanted.
- Celebrate with a cold beverage 🍻
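To double-check the worked example, the same manual steps can be replayed on strings of '0' and '1' characters in Python, applied to all twelve bytes of the hex dump. This is a rough sketch that assumes well-formed input; `payload_bits` and the other names are hypothetical.

```python
def payload_bits(data: bytes) -> list:
    """Return the concatenated payload bits of each character, as bit strings."""
    result = []
    i = 0
    while i < len(data):
        lead = f"{data[i]:08b}"                # first byte as 8 binary digits
        n = len(lead) - len(lead.lstrip("1"))  # number of leading 1's
        if n == 0:
            payload, n = lead[1:], 1           # single byte: drop the leading 0
        else:
            # Chop the 1's plus the first 0, then the "10" of each following byte.
            rest = (f"{b:08b}"[2:] for b in data[i + 1:i + n])
            payload = lead[n + 1:] + "".join(rest)
        result.append(payload)
        i += n
    return result


dump = bytes.fromhex("e5 85 a8 e7 84 b6 e7 9c 9f e6 a3 92")
payloads = payload_bits(dump)
codepoints = [int(bits, 2) for bits in payloads]
decoded = "".join(chr(cp) for cp in codepoints)

print(payloads[0])                          # 0101000101101000
print([f"U+{cp:04X}" for cp in codepoints])
print(decoded == dump.decode("utf-8"))      # compare with Python's own decoder
```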