An Introduction to Unicode
This article walks through the concept of Unicode as a means of handling Greek (and Hebrew as well). I have in mind scholars who have had little success understanding or appreciating the movement to Unicode fonts and texts in their work. A more precise title for this document might be "An Introduction to Unicode Greek on Mac OS X"; however, this conceptual introduction will be helpful to those using other OSes, including Windows. Also, while the examples I give center on Greek, they could just as easily involve Hebrew or any other language. The level of discussion is designed for the bible scholar who considers themselves a non-techie -- whose response to Unicode so far has mostly been, "Uni-what?" This first article centers only on beginning concepts, in contrast to the old system we are familiar with. I will later move toward "trying things out" on your computer.
(If the Unicode characters on this web page are not displaying correctly and you're using OS X, use Safari or Firefox, not Explorer. If you're running OS 8 or 9, have a look here to get you started.)
The Way We Were: Legacy Fonts
For years, we typed our Greek (and Hebrew) with the same ASCII characters we use for English (or other indigenous language). In order to achieve the correct appearance, we would switch the font of the Greek text to a font that used Greek character forms. Each different font (Helena, SuperGreek, Graeca, TekniaGreek, SBL Greek, etc.) had its own unique character associations, although the primary letters were mostly the same: an "a" became an Alpha; a "b" became a Beta. We typed in "abba" and it became "αββα". When you type the letter "a", the computer encodes that key press and stores it in memory as ASCII character number 97, and then you can change fonts in your program to customize how that specific ASCII character is rendered -- as an Alpha or an Aleph or a goofy glyph from Zapf Dingbats if you so desire.
Hence, you had to change fonts for each change in language.
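(For the curious, here's a minimal sketch of this in Python -- my own illustration, not anything the old system itself involved. The point is that the stored numbers never change; only the font's rendering of them does.)

    # A minimal sketch (Python 3): the word "abba" as the old system stored it.
    text = "abba"
    print([ord(ch) for ch in text])   # [97, 98, 98, 97] -- plain ASCII numbers
    # Changing the font never changes these numbers; it only changes which
    # glyph gets drawn for each one -- "abba" in Times, "αββα" in SuperGreek.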
Accents, breathing marks, vowel points and other diacriticals were added by assigning otherwise unneeded letters and punctuation to non-spacing versions of those diacriticals, which combined to create the desired character form. This is where the keyboard layouts associated with each font really began to differ. An Alpha with an acute accent (ά) might be typed as "av" or "a/", depending on which font you were going to use.
Hence, a text created in one font system needs that font on any computer that tries to display or print the text.
Some technologies for embedding fonts into documents (PDF, PowerPoint, Word) are an attempt to help the situation. Another limitation is that the font you choose ends up dictating the keyboard layout you use to enter the text.
This western-centric system really begins to break down once we take a global perspective on characters. Ask yourself: which Roman letter (and its associated ASCII number) should we assign to ཛྷ, the Tibetan Dzha, or to す, the Japanese Su? In languages with larger numbers of character forms, you quickly run out of options, since the extended ASCII range is limited to 256 characters.
Another limitation really shows itself once texts go digital. You can't create a web page that uses a specific Greek font, because a large percentage of visitors won't have that font installed on their computers. And consider searching... it didn't take long for the documents you compiled and maintained to use more than one font system. So, you could try a search for τάς by searching for "tav", or perhaps "ta/s", or maybe "tavV", depending on which font is being used... but good luck getting consistent and reliable results.
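To make the search mess concrete, here's a small Python sketch. The three stored spellings are just the hypothetical legacy transliterations I mentioned above.

    # One Greek word, three legacy font schemes, three different stored strings
    # (the hypothetical transliterations from the paragraph above):
    stored_forms = ["tav", "ta/s", "tavV"]
    query = "tav"                   # a search string valid for one scheme only
    for form in stored_forms:
        print(form, "->", "match" if form == query else "no match")

    # With Unicode, τάς is stored as the same code points in every document,
    # so one search string suffices, whatever the font (setting aside the
    # normalization wrinkles mentioned in the note at the end):
    unicode_text = "τάς"
    print("τάς" in unicode_text)    # True -- no font knowledge needed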
The "old" font system proves insufficient.
The Unicode Standard:
Expanding the range
Unicode was invented by computer geeks, linguists and scholars as an evolving solution for creating and encoding text in a multilingual world. Unicode is based on a system similar to the original ASCII system, where each character is encoded as a unique number called a code point, except that the range of numbers is greatly expanded. Instead of just 128 characters (or 256 in extended ASCII ranges), Unicode currently has room for just over a million different characters. The intent is to have space to assign a unique number to every known character in human history, in addition to supplemental drawing lines ┎, mini pictures ✈, icons ⌥, symbols ϗ, dingbats ☚, emoji, etc.
(including a small collection of crosses and religious symbols: ✙ ✚ ✛ ✜ ✝ ✞ ✟ ✠ ☓ ☦ ☧ ☨ ☩ ♰ ♱ † ‡ ⁜ ☥ ✡ ☪ ☫ ☬ ☰)
The Unicode ranges are divided into Character Blocks, each associated with a different language or a type of punctuation or symbol.
Code points 0 through 127 are English letters (Basic Latin, technically) and punctuation, identical to the old ASCII number assignments, which preserves backwards-compatibility. So, "a" is decimal 97, "b" is 98, and so on.
Code points 768-879 contain combining diacriticals.
Code points 880-1023 are the basic Greek letters and symbols. By "basic", I mean modern Greek letters with no diacriticals other than a simple tonos accent or diaeresis. The "α" is decimal number 945, "β" is 946, etc.
Code points 1424-1535 are the Hebrew character block.
Code points 7936-8191 are the Extended Greek range. These are Greek characters with the full set of accents and breathing marks added on.
The range 8592-8703 presents various types of arrows, for instance.
Code points 11392-11519 are Coptic characters (which used to be conflated into the Basic Greek range).
The range 65536-65786 is Linear B. Well, you get the idea.
In addition to the unique number given to each language or alphabet's character, the published Unicode Standard provides a unique name. The letter "a" is "Latin Small Letter A"; the letter "α" is "Greek Small Letter Alpha". The ranges above are given as decimal numbers for the character blocks, but Unicode code points are most often written in their hexadecimal equivalents (for computer reasons you really don't need to care about). So, a lower case Alpha is Unicode decimal 945, which is the same as hexadecimal 03B1 (normally written U+03B1).
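(You can verify these numbers and names yourself. Here's a minimal Python sketch using the built-in unicodedata module; any Python 3 interpreter will do.)

    import unicodedata

    print(ord("a"))                  # 97  -- same as the old ASCII number
    print(ord("α"))                  # 945 -- the decimal code point
    print(hex(ord("α")))             # 0x3b1 -- i.e., U+03B1
    print(unicodedata.name("a"))     # LATIN SMALL LETTER A
    print(unicodedata.name("α"))     # GREEK SMALL LETTER ALPHA
    print(chr(0x03B1))               # α -- from code point back to character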
All that is to illustrate the only thing you need to know, which is that
each character is encoded in your document with its own unique code point.
From now on, an "α" will never be mistaken for an "a", and an "a" will never be mistaken for an "α", and the "a" and the "α" can be side by side in a document, even in the same font!
The New Way of doing things:
Unicode Code Points versus Font Glyphs
Unicode is actually much more than just an expanded range of characters. It represents a new way of approaching the relationship between a character as an abstract concept, identified by a code point, and the glyphs (character forms) that fonts use to display that character.
A code point is not precisely synonymous with a character. Code points from the combining or modifying diacritical ranges combine with the previous code point to create one text element, one character. So, U+03B1 α plus U+0301 ´ becomes ά, which your Unicode-savvy application treats as one character -- quite unlike the old legacy font system, where you could move your cursor through the text with an arrow key and see the insertion point pause between the letter and the accent, letting you know that there were really still two characters behind the one character being displayed.
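Here's that same combination sketched in Python. Unicode also defines a precomposed ά (U+03AC), and the normalization machinery relating the two forms is one of the issues I'm deliberately glossing over (see the note at the end), but the sketch shows the basic idea:

    import unicodedata

    combined = "\u03b1\u0301"         # α followed by a combining acute accent
    print(combined)                   # displays as ά, one text element
    print(len(combined))              # 2 -- still two code points underneath

    # Normalization (Form C) folds the pair into the precomposed character:
    precomposed = unicodedata.normalize("NFC", combined)
    print(len(precomposed))           # 1
    print(hex(ord(precomposed)))      # 0x3ac -- GREEK SMALL LETTER ALPHA WITH TONOS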
As I've indicated, the most powerful and beneficial aspect of the new approach represented by the Unicode Standard is that a lower case Alpha is always encoded with the same code point, so that no matter what font you are using, or what font someone else uses to view your document or web page, a lower case Alpha will always be a lower case Alpha. A ἇ will always be a ἇ, and a ᾬ will always be a ᾬ. Just how the letters look will change with each font, of course; you'll most certainly prefer the typography of some fonts' Greek character forms to others', but you'll never encounter the problem of needing a specific font to correctly render an Alpha loaded with diacriticals. You'll never run your Keynote or PowerPoint presentation on a classroom computer and encounter the problem of missing fonts.
Truth is, when you're reading a web page or other document that has Unicode Greek in it, you often won't even know what font you are using to view it. As I said, the Unicode character set is divided into Character Blocks associated with different languages (or some kind of widget or punctuation). But no Unicode font contains a character form for every code point in every Character Block. Some large Unicode fonts that come installed on the latest Mac OS X contain most Character Blocks, but still not all. Some specialized fonts, such as those made for Greek scholars, may contain character forms for only a handful of blocks -- some diacriticals, Basic and Extended Greek, Coptic -- plus a few more character forms from various other blocks that are typically used in Greek scholarship.
So, what happens when you are reading a document whose text is set to a font that does not contain a character form for a character you encounter in the text? By default, your smart Mac goes searching through its font mapping database and displays that character in a font that does contain a character form for that code point. For instance, let's say you find the word τᾷς. The ᾷ is code point U+1FB7, which is from the Extended Greek Character Block. Your text is set to the Arial font, which does not contain that Unicode range, so that character is displayed by another font that does. If you have OS X 10.2 or 10.3 and have not installed any other Unicode Greek fonts, I can assure you that the ᾷ is being displayed by the Lucida Grande font, because that is the only font in a standard install that contains all the characters from the Extended Greek Character Block.
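(If you're curious whether a particular font on your machine covers a given code point, here's a rough sketch using the third-party fontTools Python library. The font path is just an example; yours will vary by system.)

    from fontTools.ttLib import TTFont   # third-party: pip install fonttools

    # Example path -- point this at any .ttf or .otf font file on your system.
    font = TTFont("/Library/Fonts/Arial.ttf")
    cmap = font["cmap"].getBestCmap()    # maps code points to glyph names

    # Does this font contain a character form for ᾷ (U+1FB7)?
    print(0x1FB7 in cmap)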
Some specialized applications (such as Mellel) override this default substitution behavior, so that you'll see a garbled character until you manually change the font to one that contains a character form for that code point. Firefox can also stymie font substitution on the fly. This highlights another powerful aspect of the Unicode approach to characters... each code point within the Unicode Standard is an abstract concept for a character, and the issue of which glyphs are used to display that character is left up to the font, application and OS that are using the encoded text.
Well, I think I'll end the abstract introduction here. I hope to offer concise summaries of further issues regarding Unicode Greek on your Mac. If you have requests for topics to cover, be sure and send them along.
Future ideas for articles on Unicode
- Unicode Tutorial -- a brief glance at Unicode on your Mac OS X
- Embedding Unicode into your html documents and blog entries
- Unicode Keyboards -- typing in your language
- Unicode Fonts -- displaying the character the way you want
- Unicode tools, helps, and other links
- Understanding Extended Greek letters and combining diacriticals
- Converting your "old" style, legacy font Greek documents into Unicode
- Unicode Tables of interest to bible scholars
Note: If you are uncomfortable with the way I have glossed over some issue or made a statement that is "not quite right, actually," then this document is probably not for you. For instance, I know the difference between a code point and the encoding a document uses to store that code point. I've ignored the issues regarding combining diacriticals versus precomposed forms and Normalization Forms C and D. I may address some of these issues in a followup discussion; they're just beyond the scope of this basic introduction.
If some part of this document is confusing, or you have a suggestion for how better to explain a concept, please pass it along so that we might improve the explanations given.
Last updated: 2/11/05