February 09, 2005

Unicode: A Bible Scholar's Introduction

An Introduction to Unicode

This article walks through the concept of Unicode as a means for handling Greek (and Hebrew as well). I have in mind scholars who have had little success understanding or appreciating the movement to Unicode fonts and texts in their work. A more precise title for this document might be "An Introduction to Unicode Greek on Mac OS X". However, this conceptual introduction will also be helpful to those using other OSes, including Windows. And while the examples I give center on Greek, they could just as easily use Hebrew or any other language. The level of discussion is designed for the bible scholar who considers themselves a non-techie, whose response to Unicode so far has mostly been, "Uni-what?" This first article covers only beginning concepts, in contrast to the old system we are familiar with. Later I will move towards "trying things out" on your computer. (If the Unicode characters on this web page are not displaying correctly, and you're using OS X, use Safari or Firefox, not Explorer. If you're running OS 8 or 9, have a look here to get you started.)

The Way We Were: Legacy Fonts

For years, we typed our Greek (and Hebrew) with the same ASCII characters we use for English (or other native language). To achieve the correct appearance, we would switch the font of the Greek text to a font that used Greek character forms. Each font (Helena, SuperGreek, Graeca, TekniaGreek, SBL Greek, etc.) had its own unique character associations, although the primary letters were mostly the same: an "a" became an Alpha; a "b" became a Beta. We typed in "abba" and it became "αββα". When you type the letter "a", the computer encodes that key press and stores it in memory as ASCII character number 97 (61 in hexadecimal). You can then change fonts in your program to customize how that specific ASCII character is rendered: as an Alpha, an Aleph, or a goofy glyph from Zapf Dingbats if you so desire. Hence, you had to change fonts for each change in language.
Accents, breathing marks, vowel points and other diacriticals were added by assigning otherwise unneeded letters and punctuation to non-spacing versions of those diacriticals, which combined with the base letter to create the character form desired. This is where the keyboard layouts associated with each font really began to differ. An Alpha with an Acute accent (ά) might be typed as "av" or "a/" depending on which font you were going to use. Hence, a text created in one font system needs that font on any computer that tries to display or print the text. Some technologies for embedding fonts into documents (PDF, PowerPoint, Word) are an attempt to help the situation. Another limitation is that the font you choose ends up dictating the keyboard layout you use to enter the text.
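A minimal Python sketch of the legacy approach just described. The two font tables are hypothetical stand-ins (real legacy fonts had their own, much larger mappings), and the pairing of "av" and "a/" with a particular table is for illustration only.

```python
# Hypothetical legacy font tables: each maps a stored ASCII character to the
# glyph that font would draw. "\u0301" stands in for a non-spacing acute accent.
FONT_A = {"a": "α", "b": "β", "v": "\u0301"}   # in this font, "v" draws the acute
FONT_B = {"a": "α", "b": "β", "/": "\u0301"}   # in this font, "/" draws the acute

def render(stored: str, font: dict) -> str:
    """Return what would appear on screen; unmapped characters show as-is."""
    return "".join(font.get(ch, ch) for ch in stored)

stored = "abba"
print([ord(ch) for ch in stored])   # the document itself only holds ASCII numbers
print(render(stored, FONT_A))       # it looks Greek only while a Greek font is applied

# The same accented alpha needs different keystrokes in each font system.
print(render("av", FONT_A), render("a/", FONT_B))

# Text typed for one font comes out wrong when shown in the other.
print(render("av", FONT_B))
```

The stored text never changes; only the table used to draw it does, which is why a missing font leaves the reader with plain Latin letters.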
This western-centric system really begins to break down once we take a global perspective on characters. Ask yourself: which Roman letter and its associated ASCII number should we assign to ཛྷ, the Tibetan Dzha, or す, the Japanese Su? In languages with larger numbers of character forms, you quickly ran out of options, since the extended ASCII range is limited to 256 characters.
Another limitation shows up in texts at the digital level. You can't create a web page that uses a specific Greek font, because a large percentage of visitors won't have that font installed on their computers. And consider searching... it didn't take long before the documents you compiled and maintained used more than one font system. So, you could try a search for τάς by searching for "tav"" or perhaps "ta/s" or maybe "tavV" depending on what font is being used... but good luck getting consistent and reliable results.
The "old" font system proves insufficient.
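The search problem can be sketched in a few lines of Python. The three stored strings are the ones quoted above; which font produced which string is left unspecified, and the "converted" documents are an assumption made for the comparison.

```python
# One Greek word, τάς, as three legacy font systems might store it.
legacy_docs = {"doc1": 'tav"', "doc2": "ta/s", "doc3": "tavV"}

def search(docs: dict, query: str) -> list:
    """Return the names of the documents whose stored text contains the query."""
    return sorted(name for name, text in docs.items() if query in text)

# A search string written for one font system finds only that font's document.
print(search(legacy_docs, "ta/s"))

# Assumption: the same documents after conversion to Unicode. Each one now
# stores the word as the code points U+03C4, U+03AC, U+03C2.
word = "\u03c4\u03ac\u03c2"
unicode_docs = {name: word for name in legacy_docs}
print(search(unicode_docs, word))   # a single query matches every document
```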

The Unicode Standard: Expanding the range

Unicode was invented by computer geeks, linguists and scholars as an evolving solution to creating and encoding text in a multilingual world. Unicode is based on a system similar to the original ASCII system, where each character is encoded as a unique number called a code point, except that the range of numbers is greatly expanded. Instead of just 128 characters (or 256 in extended ASCII ranges), Unicode has room for just over a million different characters (1,114,112 code points). The intent is to have space to assign a unique number to every known character in human history, in addition to supplemental drawing lines ┎, mini pictures ✈, icons ⌥, symbols ϗ, dingbats ☚, emoji, etc.
(including a small collection of crosses and religious symbols: ✙ ✚ ✛ ✜ ✝ ✞ ✟ ✠ ☓ ☦ ☧ ☨ ☩ ♰ ♱ † ‡ ⁜ ☥ ✡ ☪ ☫ ☬ ☰)
The Unicode ranges are divided into Character Blocks, each associated with different languages or a type of punctuation or symbol.
Code points in the range of decimal numbers 0 through 127 are English letters (technically, Basic Latin) and punctuation, identical to the old ASCII number assignments, which preserves backwards compatibility. So, "a" is 97, "b" is 98, and so on. Code points 768-879 contain combining diacriticals.
Code points 880-1023 are the basic Greek letters and symbols. By "basic", I mean modern Greek letters with no diacriticals other than a simple tonos accent or diaeresis. The "α" is decimal number 945, "β" is 946, etc. Code points 1424-1535 are the Hebrew character block. Code points 7936-8190 are the Extended Greek range. These are characters with the full set of accents and breathing marks added on. The range 8592-8703 presents various types of arrows, for instance. Code points 11392-11519 are Coptic characters (which used to be conflated into the Basic Greek range). The range 65536-65786 is Linear B. Well, you get the idea. In addition to the unique numbers given for each language or alphabet's character, the published Unicode Standard provides a unique name. The letter "a" is "Latin Small Letter A"; the letter "α" is "Greek Small Letter Alpha". These ranges are the decimal numbers for the character blocks, but Unicode code points are most often given in their hexadecimal equivalents (for computer reasons you really don't need to care about). So, a lower case Alpha is Unicode decimal 945, which is the same as hexadecimal 03B1 (normally given as U+03B1). All that is to illustrate the only thing you need to know, which is that each character is encoded in your document with its own unique code point. From now on, an "α" will never be mistaken for an "a", and an "a" will never be mistaken for an "α", and the "a" and the "α" can be side by side in a document, even in the same font!
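For readers who want to see these numbers for themselves, here is a short Python sketch using the standard `unicodedata` module. The `BLOCKS` table and the `block_of` helper are illustrative names, not part of any library. The table holds only a few of the decimal ranges quoted above, with Basic Latin taken as 0 to 127.

```python
import unicodedata

# Decimal number, hexadecimal U+ form, and official name for a few characters.
for ch in ("a", "b", "α", "β"):
    cp = ord(ch)
    print(f"{ch}  decimal {cp:>4}  U+{cp:04X}  {unicodedata.name(ch)}")

# A code point can be turned back into its character, from decimal or hex.
print(chr(945), chr(0x03B1))

# A few of the block ranges quoted in the text, as (first, last) decimal values.
BLOCKS = {
    "Basic Latin": (0, 127),
    "Combining Diacriticals": (768, 879),
    "Basic Greek": (880, 1023),
    "Hebrew": (1424, 1535),
    "Extended Greek": (7936, 8190),
}

def block_of(ch: str) -> str:
    """Return the name of the listed block containing ch, or 'other'."""
    cp = ord(ch)
    for name, (first, last) in BLOCKS.items():
        if first <= cp <= last:
            return name
    return "other"

print(block_of("a"), "|", block_of("α"), "|", block_of("\u1fb7"))
```

The Latin "a" and the Greek "α" come back with different numbers, names, and blocks, which is the whole point: the two can no longer be confused, whatever font is applied.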

The New Way of doing things: Unicode Code Points versus Font Glyphs

Unicode is actually much more than just an expanded range of characters. It represents a new way of approaching the relationship between the abstract concept of a character represented by a code point and the glyphs (character forms) that fonts use to represent the characters.
A code point is not precisely synonymous with a character. The code points from the combining or modifying diacritical ranges combine with the previous code point to create one text element, one character. So, U+03B1 α plus U+0301 ´ becomes ά, which your Unicode-savvy application treats as one character. This is much unlike the old legacy font system, where you could move your cursor through the text using an arrow key and see the insertion point pause between the letter and the accent, letting you know that there were really still two characters behind the one character being displayed.
As I've indicated, the most powerful and beneficial aspect of the new approach represented by the Unicode Standard is that a lower case Alpha is always encoded with the same code point, so that no matter what font you are using or what font someone else uses to view your document or web page, a lower case Alpha will always be a lower case Alpha. A ἇ will always be a ἇ, and a ᾬ will always be a ᾬ. Now, just how the letters look will change with each font, of course; you'll most certainly prefer the typography of the Greek character forms in some fonts over others, but you'll never encounter the problem of needing a specific font to correctly render an Alpha loaded with diacriticals. You'll never run your Keynote or PowerPoint presentation on a classroom computer and encounter the problem of missing fonts. Truth is, often when you're reading a web page or other document that has Unicode Greek in it, you won't even know what font you are using to view it.
As I said, the Unicode character set is divided into Character Blocks associated with different languages (or some kind of widget or punctuation). But no Unicode font contains a character form for every code point in every Character Block. Some large Unicode fonts that come installed on the latest Mac OS X contain most Character Blocks, but still not all.
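To make the earlier point about combining code points concrete, here is a minimal Python sketch using the standard `unicodedata` module. It touches on normalization only as far as needed to show that a letter plus a combining accent, and the single precomposed letter, are two spellings of the same character.

```python
import unicodedata

combined = "\u03b1\u0301"        # alpha followed by a combining acute accent
print(combined, len(combined))   # drawn as one letter, stored as 2 code points

# Normalization Form C composes the pair into one precomposed code point.
nfc = unicodedata.normalize("NFC", combined)
print(nfc, len(nfc), f"U+{ord(nfc):04X}", unicodedata.name(nfc))

# Normalization Form D splits it back into base letter plus accent.
nfd = unicodedata.normalize("NFD", nfc)
print([f"U+{ord(c):04X}" for c in nfd])

# The two spellings look identical but only compare equal once normalized.
print(combined == nfc)
print(unicodedata.normalize("NFC", combined) == nfc)
```

A practical consequence, offered here as a general observation rather than a rule from the article: software that compares or searches Greek text usually normalizes both sides to the same form first.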
Some specialized fonts, such as those made for Greek scholars, may only contain Character forms for a handful of blocks, such as some diacriticals, Greek Basic and Extended, Coptic, and a few more Character forms from within various Character Blocks that are typically used in Greek scholarship. So, what happens when you are reading a document whose text is set to a font which does not contain a Character form for the character you are encountering in the text? Your smart Mac by default goes searching in its font mapping database and displays that character in a font that contains a character form for that code point. For instance, let's say you find the word τᾷς. The ᾷ is the code point U+1FB7, which is from the Extended Greek Character Block. Your text is set to the Arial font, which does not contain that Unicode range. So, that character is displayed by another font that does contain the range. If you have OS X 10.2 or 10.3, and you have not installed any other Unicode Greek fonts, I can assure you that the ᾷ is being displayed by the Lucida Grande font, because that is the only font that comes on a standard install that contains all the characters from the Extended Greek Character Block. Some specialized applications (such as Mellel) can override this default substitution behavior, so that you'll see a garbled character until you manually change the font to one that contains a Character form for that code point. Firefox also can stymie font substitution on the fly.
This highlights another powerful aspect of the Unicode approach to characters... each code point for every character under the sun within the Unicode Standard is an abstract concept, and it leaves the issue of what glyphs are used to display that character up to the font, the application, and the OS that are using the encoded text.
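The font substitution described above can be sketched roughly in Python. The coverage tables below are hypothetical and heavily simplified (real fonts cover many more ranges, and the OS font mapping database is far more detailed), and `covers` and `pick_font` are illustrative names, not Mac OS X functions.

```python
# Hypothetical coverage: each font lists the code point ranges it has glyphs for.
FONT_COVERAGE = {
    "Arial": [(0x0000, 0x007F), (0x0370, 0x03FF)],
    "Lucida Grande": [(0x0000, 0x007F), (0x0370, 0x03FF), (0x1F00, 0x1FFF)],
}

def covers(font: str, ch: str) -> bool:
    """True if the font has a glyph for this character's code point."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in FONT_COVERAGE[font])

def pick_font(ch: str, requested: str, fallbacks: list):
    """Use the requested font if it can draw ch, else the first fallback that can."""
    for font in (requested, *fallbacks):
        if covers(font, ch):
            return font
    return None   # no font can draw it; an application might show a placeholder box

word = "\u03c4\u1fb7\u03c2"   # τᾷς, with ᾷ at U+1FB7
for ch in word:
    print(ch, f"U+{ord(ch):04X}", pick_font(ch, "Arial", ["Lucida Grande"]))
```

In this sketch the tau and the final sigma stay in the requested font, while only the ᾷ is quietly drawn from the fallback. The stored code points are the same either way.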
Well, I think I'll end the abstract introduction here. I hope to offer concise summaries of further issues regarding Unicode Greek on your Mac. If you have requests for topics to cover, be sure and send them along.

Future ideas for articles on Unicode

  • Unicode Tutorial -- a brief glance on your Mac OSX
  • Embedding Unicode into your html documents and blog entries
  • Unicode Keyboards -- typing in your language
  • Unicode Fonts -- displaying the character the way you want
  • Unicode tools, helps, and other links
  • Understanding Extended Greek letters and combining diacriticals
  • Converting your "old" style, legacy font Greek documents into Unicode
  • Unicode Tables of interest to bible scholars
Note: If you are uncomfortable with the way I have glossed over some issue or made a statement that is "not quite right actually," then this document is probably not for you. For instance, I know the difference between a code point and the encoding a document uses to encode that code point. I've ignored the issues regarding combining diacriticals and precomposed forms and Normalization Forms C and D. I may address some of these issues in a followup discussion. They are just beyond the scope of this basic introduction. If some part of this document is confusing, or you have a suggestion for how to explain a concept, please pass it along so that we might improve the explanations given.
Last updated: 2/11/05

10 comments:

Anonymous said...

I am not confused by how Unicode works. What I really need to know is how I start using it. I am on the B-Greek mailing list and I see no Greek fonts in the messages. I don't know if this is my problem or someone else's, or if there is some special code of capitals and lower case letters that I am unaware of.
I am new to Greek but very interested and I can follow things if I look up the verses but I can't read them on my screen.

Anonymous said...

Great Article. Maybe Accordance folks will link to it from their site?

It's too bad (very too bad) that B-Greek insists on not using Unicode. They have some lame excuse about some email clients not being able to read it. Ha! We're in 2005, and I'm absolutely sure that's not an issue. It really boils down to people who got familiar with one way of doing things and have too much trouble changing.

Joe Weaks said...

Dear anonymous commenter #1... my initial response to your comment was, "huh?" Now I understand you were simply sounding off your frustration and confusion that B-Greek doesn't use Unicode, but rather a transliteration scheme in its messages.
And for commenter #2, it is a fact that some B-Greek subscribers can't receive UTF-encoded messages. I lament the lack of Unicode on the list, as well. A really good discussion of the issue took place just this last November.
Have a look at the archive thread Change of prefered Language.

Helen Brown said...

Thanks, Joe, for an excellent introduction to the topic. This is extremely relevant to Accordance users, especially those on the PC Emulator who must use Unicode to export the text to Word. We will certainly link to your article from our website.

Anonymous said...

I was not just sounding off about the B-Greek list not using Unicode. I am still confused as to how it works. I am not a computer geek at all. All I know is that when I want to switch and type in Greek I click Greek in the little language box in the corner of my screen (I have Windows XP), same as when I want to type in Spanish. Is this using Unicode? I really have no idea what I use. It seems to work for posting in the few forums I ask questions in.

Sarah Dylan Breuer said...

Oh my.

I learned how to enter things in the SuperGreek font without having to refer to a character map. I learned basic HTML, so I could engage more fully with scholarly communities online. Then I learned basic CSS, the better to create faster-loading and attractive pages.

I never thought I'd say this, but I think that after learning Greek, Hebrew, Latin, Italian, Spanish, German, and Swahili, in addition to my native English, and then smatterings of HTML and CSS (after a series of not-very-useful programming languages of my youth), I think I may have just reached the limit of learning additional languages/codes. I'm ashamed to say it, but when I'm talking about Hebrew or Greek on my blog, I just type the standard characters from the alphabet I use for my native tongue that come closest to transliterating what I mean to get across. I depend on readers' forgiveness when they can't tell an omicron from an omega or I can't supply a rough breathing.

I absolutely adore what I can do with technology, but I wish that someday I could say that I'd learned my last bit of code.

Dylan

Joe Weaks said...

Hi Dylan, Since you "learned how to enter things in the SuperGreek font without having to refer to a character map", you can at least say you've learned all the Greek keymapping you need to. I wrote a Unicode Greek Keyboard that uses the same keymapping as SuperGreek. So, no problemo there. In an upcoming Unicode article, I explain Unicode keyboards... hopefully soon.

Steve Walton said...

Very helpful and clear, Joe - I'm now finding that I can easily provide documents for my Greek students which Windoze users can handle (formerly, SPIonic was the best shot I could make at providing stuff for them that their computers could read). What would help me, as someone who has documents using Graeca II (same keyboard as SuperGreek), SPIonic, Helena, Corinth, etc. from different times in the past, is to know how I can convert these easily into Unicode. Any wisdom?

Matthew C. Baldwin said...

Joe, this was great. I think it answered a question I was trying to "google," which is, why does "Microsoft Word" keep changing the font to Lucida Grande when I am entering polytonic Greek in Unicode?

The answer seems to be that the font I prefer to use (Times New Roman) lacks the full character set? So Microsoft Word just chooses the font Lucida Grande for me?

Why doesn't Times New Roman have the full set? Will it ever?

Finally, I use Accordance too. Is that stuff in Unicode?

Joe Weaks said...

Jacob Wright said...
Thanks for your wonderful site.
I have a Mac and I want to be able to easily transliterate Hebrew and Akkadian, preferably with hotkeys or by switching to a new font. I have tried Assur, but it is only good for Akkadian. Is there another good font out there for Hebrew? I need the "s caron", the h with dot and u under it, vowels with long and short signs, s with dot under it, etc. What would you recommend?
It is strange that when I go to symbols, there is no choice for Times/Times New Roman, which supposedly has the best selection.
Thanks for your help,
Jacob Wright
Prof. of Hebrew Bible
Emory University