Encoding Hangeul, Korea’s writing system.

Introduction

Hangeul (한글) is the modern writing system for the Korean language, created in 1443 by King Sejong the Great, the fourth king of the Joseon dynasty [1]. Before its invention, Korean was written using Hanja (한자), Chinese characters adapted for Korean. Even after Hangeul’s creation, Hanja remained in regular use for centuries, particularly in formal and scholarly contexts [2].

Hangeul is a syllabic writing system where each syllable is composed of letters called Jamo (자모). These Jamo function like Latin letters, each representing an individual sound. In total, 11,172 different syllables can be formed by combining modern Jamo [3].

However, learning Hangeul doesn’t require memorizing thousands of combinations. Instead, one only needs to learn the 68 different Jamo shapes and the rules governing the construction of Hangeul. This is much easier than it sounds: the basics of the Korean alphabet can be learned within 60-90 minutes.

This post explores how Hangeul is encoded in Unicode, a universal standard that assigns unique values to characters across all languages [4]. It also delves into the mathematical operations used to construct and deconstruct individual Jamo from encoded Hangeul.


Hangeul, Jamo, and Unicode

A Korean syllable consists of a lead consonant, a medial vowel, and an optional tail consonant. For example,

한 (han) can be deconstructed like so: ㅎ (h, lead) + ㅏ (a, vowel) + ㄴ (n, tail)

자 (ja) can be deconstructed like so: ㅈ (j, lead) + ㅏ (a, vowel), with no tail

Syllables that begin with a vowel sound, for example 안 (an), must be prefixed with ㅇ (ieung), a special lead consonant that produces no sound.

There are 19 different lead consonants, with Unicode values from U+1100 to U+1112. There are 21 different vowels, with Unicode values from U+1161 to U+1175. Finally, there are 27 tail consonants, with Unicode values from U+11A8 to U+11C2, plus a null consonant used when there is no tail. These code points belong to the Unicode block called Hangeul Jamo (U+1100 to U+11FF).
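
As a quick sanity check on those ranges, here is a minimal Python sketch that counts the code points in each one and prints the official Unicode name of its first character. The constant and function names are made up for illustration; the only library used is Python's standard unicodedata module.

```python
import unicodedata

# Inclusive code point ranges quoted above (names are illustrative only).
LEAD_RANGE = (0x1100, 0x1112)
VOWEL_RANGE = (0x1161, 0x1175)
TAIL_RANGE = (0x11A8, 0x11C2)


def count(code_range):
    """Number of code points in an inclusive (first, last) range."""
    first, last = code_range
    return last - first + 1


for label, code_range in [
    ("lead", LEAD_RANGE),
    ("vowel", VOWEL_RANGE),
    ("tail", TAIL_RANGE),
]:
    first_char = chr(code_range[0])
    print(label, count(code_range), unicodedata.name(first_char))

# lead 19 HANGUL CHOSEONG KIYEOK
# vowel 21 HANGUL JUNGSEONG A
# tail 27 HANGUL JONGSEONG KIYEOK
```

Note that Unicode itself spells the script "Hangul" and labels the three positions choseong (lead), jungseong (vowel), and jongseong (tail).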

Below is a set of tables providing each Jamo, its canonical order, and its Unicode value. When Jamo are used in combination with each other, they change their shape, size, and position slightly to fit into a single Hangeul. Unicode has a second range of Jamo characters called “Hangeul Compatibility Jamo”, starting at U+3130, which represents an initial and a final consonant with the same code point, as opposed to the Hangeul Jamo block, which assigns them separate code points [5]. These compatibility Jamo typically render much more clearly.

The tables below show both the conjoining Jamo and the [compatibility Jamo], with the character reference hex value referring to the former [6]. Depending on your browser, both Jamo may render the same.

Number  Lead      Jamo     Character reference
1       G         ᄀ [ㄱ]  0x1100
2       GG        ᄁ [ㄲ]  0x1101
3       N         ᄂ [ㄴ]  0x1102
4       D         ᄃ [ㄷ]  0x1103
5       DD        ᄄ [ㄸ]  0x1104
6       R         ᄅ [ㄹ]  0x1105
7       M         ᄆ [ㅁ]  0x1106
8       B         ᄇ [ㅂ]  0x1107
9       BB        ᄈ [ㅃ]  0x1108
10      S         ᄉ [ㅅ]  0x1109
11      SS        ᄊ [ㅆ]  0x110A
12      (silent)  ᄋ [ㅇ]  0x110B
13      J         ᄌ [ㅈ]  0x110C
14      JJ        ᄍ [ㅉ]  0x110D
15      C         ᄎ [ㅊ]  0x110E
16      K         ᄏ [ㅋ]  0x110F
17      T         ᄐ [ㅌ]  0x1110
18      P         ᄑ [ㅍ]  0x1111
19      H         ᄒ [ㅎ]  0x1112

Number  Vowel     Jamo     Character reference
1       A         ᅡ [ㅏ]  0x1161
2       AE        ᅢ [ㅐ]  0x1162
3       YA        ᅣ [ㅑ]  0x1163
4       YAE       ᅤ [ㅒ]  0x1164
5       EO        ᅥ [ㅓ]  0x1165
6       E         ᅦ [ㅔ]  0x1166
7       YEO       ᅧ [ㅕ]  0x1167
8       YE        ᅨ [ㅖ]  0x1168
9       O         ᅩ [ㅗ]  0x1169
10      WA        ᅪ [ㅘ]  0x116A
11      WAE       ᅫ [ㅙ]  0x116B
12      OE        ᅬ [ㅚ]  0x116C
13      YO        ᅭ [ㅛ]  0x116D
14      U         ᅮ [ㅜ]  0x116E
15      WEO       ᅯ [ㅝ]  0x116F
16      WE        ᅰ [ㅞ]  0x1170
17      WI        ᅱ [ㅟ]  0x1171
18      YU        ᅲ [ㅠ]  0x1172
19      EU        ᅳ [ㅡ]  0x1173
20      YI        ᅴ [ㅢ]  0x1174
21      I         ᅵ [ㅣ]  0x1175

Number  Tail      Jamo     Character reference
1       G         ᆨ [ㄱ]  0x11A8
2       GG        ᆩ [ㄲ]  0x11A9
3       GS        ᆪ [ㄳ]  0x11AA
4       N         ᆫ [ㄴ]  0x11AB
5       NJ        ᆬ [ㄵ]  0x11AC
6       NH        ᆭ [ㄶ]  0x11AD
7       D         ᆮ [ㄷ]  0x11AE
8       L         ᆯ [ㄹ]  0x11AF
9       LG        ᆰ [ㄺ]  0x11B0
10      LM        ᆱ [ㄻ]  0x11B1
11      LB        ᆲ [ㄼ]  0x11B2
12      LS        ᆳ [ㄽ]  0x11B3
13      LT        ᆴ [ㄾ]  0x11B4
14      LP        ᆵ [ㄿ]  0x11B5
15      LH        ᆶ [ㅀ]  0x11B6
16      M         ᆷ [ㅁ]  0x11B7
17      B         ᆸ [ㅂ]  0x11B8
18      BS        ᆹ [ㅄ]  0x11B9
19      S         ᆺ [ㅅ]  0x11BA
20      SS        ᆻ [ㅆ]  0x11BB
21      NG        ᆼ [ㅇ]  0x11BC
22      J         ᆽ [ㅈ]  0x11BD
23      C         ᆾ [ㅊ]  0x11BE
24      K         ᆿ [ㅋ]  0x11BF
25      T         ᇀ [ㅌ]  0x11C0
26      P         ᇁ [ㅍ]  0x11C1
27      H         ᇂ [ㅎ]  0x11C2
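
To make the difference between the conjoining Jamo and the bracketed compatibility Jamo concrete, the sketch below looks up the Unicode names of the three characters that all look like "G" in the tables above. The variable names are illustrative, and the example assumes nothing beyond Python's standard unicodedata module.

```python
import unicodedata

lead_g = "\u1100"    # conjoining lead G (row 1 of the lead table)
tail_g = "\u11A8"    # conjoining tail G (row 1 of the tail table)
compat_g = "\u3131"  # compatibility Jamo, shown in brackets in both rows

for ch in (lead_g, tail_g, compat_g):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# U+1100 HANGUL CHOSEONG KIYEOK
# U+11A8 HANGUL JONGSEONG KIYEOK
# U+3131 HANGUL LETTER KIYEOK
```

The two conjoining characters carry their position in the name, while the compatibility character is a single position-neutral letter.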

Encoding Hangeul

Although each Jamo has its own individual Unicode value, Unicode values U+AC00 to U+D7A3 define a precomposed character for every combination of modern Jamo, that is, every modern Hangeul syllable. This range of Unicode is called the Hangeul Syllables block [7].
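
The size of this block lines up exactly with the Jamo counts from earlier. Here is a small Python check; the constant names are illustrative, and the tail count of 28 assumes the "no tail" case is counted alongside the 27 tail consonants.

```python
FIRST_SYLLABLE = 0xAC00
LAST_SYLLABLE = 0xD7A3

LEAD_COUNT = 19
VOWEL_COUNT = 21
TAIL_COUNT = 28  # 27 tail consonants plus "no tail"

block_size = LAST_SYLLABLE - FIRST_SYLLABLE + 1
combinations = LEAD_COUNT * VOWEL_COUNT * TAIL_COUNT

print(block_size, combinations)                 # 11172 11172
print(chr(FIRST_SYLLABLE), chr(LAST_SYLLABLE))  # 가 힣
```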

These syllables can be mapped algorithmically back to sequences of two or three Jamo in the Hangeul Jamo block mentioned earlier [8]. Because of this, it’s theoretically possible to encode Korean text with Jamo only and let the font renderer handle Hangeul construction. This is not recommended in practice, though, due to limitations and lack of support in current renderers.
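
This mapping is what Unicode normalization applies to Hangeul, so it can be observed without writing any arithmetic. The sketch below uses Python's standard unicodedata.normalize: the NFD form splits a precomposed syllable into conjoining Jamo, and the NFC form puts it back together. The variable names are illustrative.

```python
import unicodedata

syllable = "한"  # precomposed syllable, U+D55C

# NFD: canonical decomposition into conjoining Jamo.
decomposed = unicodedata.normalize("NFD", syllable)
print([f"U+{ord(ch):04X}" for ch in decomposed])
# ['U+1112', 'U+1161', 'U+11AB']

# NFC: canonical composition back into a single code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == syllable)  # True
```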

Isolated Jamo are rarely found in Korean text, which instead uses the precomposed Hangeul syllables. The code point of a Hangeul can be calculated from its Jamo components using the following formula [9].

Hangeul Unicode = (((lead − 1) * 588) + ((vowel − 1) * 28) + tail) + 44032

In the formula, lead, vowel, and tail refer to the small integers given in the above tables. If there is no tail, the value 0 is used. The value 28 is the number of possible tail values (27 tail consonants plus the empty tail). The value 588 is the product of the vowel count and that tail count (21 × 28). 44032 (0xAC00) is the code point of the first character in the Hangeul Syllables Unicode block.

As an example, let’s take the Hangeul 한 (han), which in Unicode is U+D55C. Using the Jamo that construct it, let’s try to reach the final value.

ㅎ = 19 // Lead consonant
ㅏ = 1  // Medial vowel
ㄴ = 4  // Tail consonant

한 = U+D55C = 54620 = (((19 − 1) * 588) + ((1 − 1) * 28) + 4) + 44032
                    = (10584 + 0 + 4) + 44032
                    = 10588 + 44032
                    = 54620
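
The same calculation can be sketched as a small Python function. The function and constant names here are hypothetical, chosen for this example; the numbers passed in are the 1-based lead and vowel numbers from the tables, with 0 meaning "no tail".

```python
HANGEUL_BASE = 44032   # 0xAC00, first code point of the syllables block
LEAD_STRIDE = 588      # 21 vowels * 28 tail values
VOWEL_STRIDE = 28      # 27 tail consonants plus "no tail"


def compose(lead, vowel, tail=0):
    """Build a syllable from table numbers (lead 1-19, vowel 1-21, tail 0-27)."""
    if not (1 <= lead <= 19 and 1 <= vowel <= 21 and 0 <= tail <= 27):
        raise ValueError("Jamo number out of range")
    code_point = (
        (lead - 1) * LEAD_STRIDE
        + (vowel - 1) * VOWEL_STRIDE
        + tail
        + HANGEUL_BASE
    )
    return chr(code_point)


print(compose(19, 1, 4))  # 한 (H + A + N)
print(compose(13, 1))     # 자 (J + A, no tail)
print(compose(12, 1, 4))  # 안 (silent lead + A + N)
```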

The above formula can be rearranged to extract the Jamo of a Hangeul.

tail = mod(Hangeul Unicode - 44032, 28)
vowel = 1 + mod(Hangeul Unicode - 44032 - tail, 588) / 28
lead = 1 + floor[(Hangeul Unicode - 44032) / 588]

As another example, let’s work backwards. Given the Hangeul 한 (han), let’s try to deconstruct it.

한 = U+D55C = 54620

// tail
ㄴ = 4 = mod(54620 - 44032, 28)
       = mod(10588, 28)
       = 4
          
// vowel
ㅏ = 1 = 1 + mod(54620 - 44032 - 4, 588) / 28
       = 1 + mod(10584, 588) / 28
       = 1 + 0 / 28
       = 1

// lead
ㅎ = 19 = 1 + floor[(54620 - 44032) / 588]
        = 1 + floor[10588 / 588]
        = 1 + 18
        = 19
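
Going in this direction, the three formulas collapse into two integer divisions with remainder. Below is a minimal Python sketch using the built-in divmod; the function and constant names are hypothetical, and the returned numbers follow the same numbering as the tables (0 means no tail).

```python
HANGEUL_BASE = 44032   # 0xAC00, first code point of the syllables block
HANGEUL_LAST = 0xD7A3  # last code point of the syllables block


def decompose(syllable):
    """Return (lead, vowel, tail) table numbers for one precomposed syllable."""
    code_point = ord(syllable)
    if not HANGEUL_BASE <= code_point <= HANGEUL_LAST:
        raise ValueError("not a precomposed Hangeul syllable")
    offset = code_point - HANGEUL_BASE
    lead_index, remainder = divmod(offset, 588)   # 588 = 21 * 28
    vowel_index, tail = divmod(remainder, 28)
    return lead_index + 1, vowel_index + 1, tail


print(decompose("한"))  # (19, 1, 4)
print(decompose("글"))  # (1, 19, 8)
```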

Obsolete Hangeul

Over time, Hangeul has undergone many changes, and several Jamo are no longer used in modern Korean [10].

Since these obsolete characters still appear in older literature and historical texts, there’s still a need for some way of using them. However, because they’re obsolete, Unicode does not offer precomposed Hangeul syllables for them. Instead, syllables using them must be encoded as sequences of Jamo.
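
One way to see this in practice is to compare how Unicode normalization treats a modern Jamo sequence and one containing an obsolete vowel. The sketch below picks the vowel at U+119E as its obsolete example; that choice, like the variable names, is an assumption made for illustration and is not taken from the footnoted source.

```python
import unicodedata

OBSOLETE_VOWEL = "\u119E"  # an obsolete vowel, chosen for illustration
print(unicodedata.name(OBSOLETE_VOWEL))

modern = "\u1112\u1161\u11AB"                  # lead H + vowel A + tail N
archaic = "\u1112" + OBSOLETE_VOWEL + "\u11AB"  # same, with the obsolete vowel

# NFC composes the modern sequence into one precomposed syllable...
print(len(unicodedata.normalize("NFC", modern)))   # 1
# ...but leaves the archaic sequence as three separate Jamo.
print(len(unicodedata.normalize("NFC", archaic)))  # 3
```

Whether the three Jamo then display as a single syllable block is up to the font and renderer.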

Conclusion

Next time you type “한글” on your keyboard or read a text in Korean, you’ll know there’s a beautifully logical system at play behind the scenes — one that blends centuries of linguistic history with the modern efficiency of Unicode.

Thanks for reading,
- Brook ❤

Footnotes

1: https://en.wikipedia.org/wiki/Hangul
2: https://en.wikipedia.org/wiki/Hanja
3: https://www.namhansouthkorea.com/how-many-korean-characters-are-there/
4: https://www.unicode.org/standard/WhatIsUnicode.html
5: https://en.wikipedia.org/wiki/Hangul_Compatibility_Jamo
6: http://www.gernot-katzers-spice-pages.com/var/korean_hangul_unicode.html
7: https://en.wikipedia.org/wiki/Hangul_Syllables
8: https://www.unicode.org/reports/tr15/tr15-29.html#Hangul
9: https://en.wikipedia.org/wiki/Korean_language_and_computers#Hangul_in_Unicode
10: https://colab.research.google.com/github/bebechien/gemma/blob/main/Translator_of_Old_Korean_Literature.ipynb