
AmigaOS 4.0 - About OS4 - Miscellaneous

Understanding Character Sets

With the release of AmigaOS 4.0 Update #4 there has been some confusion regarding character sets, in particular where they were previously used incorrectly or not fully understood. To help clarify things, Detlef Wuerkner, who did the development, compiled the following:

What is a character set?

A character set is simply a table that defines which numerical value stands for which character. Example: in many charsets code 65 means the Latin uppercase letter A, code 66 means B, etc.
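The table idea can be sketched in a few lines of Python. The table name and its three entries are purely illustrative and are not part of any AmigaOS interface:

```python
# A charset modelled as a plain lookup table from numeric code to character.
# Only three entries are shown; TINY_CHARSET is a hypothetical name.
TINY_CHARSET = {65: "A", 66: "B", 67: "C"}


def decode(codes, table):
    """Translate a sequence of numeric codes into text using the table."""
    return "".join(table[code] for code in codes)


text = decode([67, 65, 66], TINY_CHARSET)
print(text)
```

A different charset is simply a different table: the stored numbers stay the same, but the characters they stand for may change.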

Popular character sets

US-ASCII
US-ASCII (American Standard Code for Information Interchange) is a 7-bit charset (values 0 - 127) originally designed for American English. It includes the C0 control codes (ESC=Escape, BS=Backspace, CR=Carriage Return, LF=Linefeed, FF=Formfeed etc) in the range 0 - 31 and "normal" characters, digits, punctuation etc in the range 32 - 126; 127 is the Delete control character.
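The ranges just described can be written down as a small classifier. This is only a sketch of the layout; the function name and the returned labels are made up for illustration:

```python
def classify_ascii(code):
    """Name the US-ASCII range that a numeric code falls into."""
    if not 0 <= code <= 127:
        return "not US-ASCII"
    if code <= 31:
        return "C0 control"   # ESC, BS, CR, LF, FF etc
    if code == 127:
        return "Delete"
    return "printable"        # letters, digits, punctuation, space


for code in (10, 27, 65, 127, 200):
    print(code, classify_ascii(code))
```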

ISO-646
ISO-646 defines various national variants of 7-bit US-ASCII in which the dollar sign and some punctuation characters are replaced by non-English alphabetic characters needed for French, German, Spanish etc. Those charsets should be considered obsolete.

ISO-8859
ISO-8859-1 is a superset of US-ASCII. It adds the C1 control codes (used e.g. by AmigaOS) in the range 128 - 159, and the range 160 - 255 contains more punctuation (copyright sign, trademark, French quotes etc) and many national characters. Most of them are just accented versions of standard characters (A with acute, A with grave, A with tilde etc), but some "unique" characters like the German sharp s or the Icelandic thorn are also included.

Before AmigaOS 4.0 the Amiga character set was defined to be ECMA-94 Latin 1, later also adopted by the ISO committee as ISO/IEC 8859-1 Latin Alphabet #1.

AmigaOS 4.0 only supports US-ASCII and charsets which are a superset of it, so the numerical values 0 - 127 always mean the same characters.

Only charsets that are identical to ISO-8859-1 in the range 0 - 160 are supported as system default charset. The reason for this limitation is that important parts of AmigaOS (console.device, filesystems etc) interpret the C1 control codes in the range from 128 to 159.
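The "identical to ISO-8859-1 in the range 0 - 160" rule can be tried out with a small check. The sketch below uses the codec tables that ship with Python, not anything from AmigaOS, and the function name is an assumption chosen for this example:

```python
def matches_latin1_low_range(codec_name):
    """Return True if the codec decodes bytes 0..160 exactly like ISO-8859-1."""
    for value in range(161):
        raw = bytes([value])
        try:
            decoded = raw.decode(codec_name)
        except UnicodeDecodeError:
            return False  # this value is undefined in the charset
        if decoded != raw.decode("iso8859_1"):
            return False
    return True


for name in ("iso8859_15", "iso8859_5", "koi8_r", "cp1252"):
    print(name, matches_latin1_low_range(name))
```

With Python's tables the two ISO-8859 charsets pass, while KOI8-R and windows-1252 fail because they put printable characters where the C1 control codes belong.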

Because most operating systems (including AmigaOS) use(d) a byte to store a character, and ISO-8859-1 defined all 256 possible values, it was not possible to extend it (to add more characters).

The ISO committee then decided to publish additional ISO-8859-X character set standards. All are identical to ISO-8859-1 in the range 0 - 160 and vary only in the "national" range 161 - 255. This implies that you have to use a different font for a different charset (reading a Greek text with a Cyrillic font is impossible) and that you have to mark your mails, web pages etc with a charset tag which allows the software to choose the right font for display.
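Why the charset tag matters can be seen by decoding one and the same byte with three different ISO-8859 tables. The sketch relies on Python's built-in codecs; the variable names are illustrative only:

```python
# One stored byte, three interpretations depending on the declared charset.
raw = bytes([0xE1])

as_latin1 = raw.decode("iso8859_1")    # Latin small letter a with acute
as_cyrillic = raw.decode("iso8859_5")  # Cyrillic small letter es
as_greek = raw.decode("iso8859_7")     # Greek small letter alpha

for label, char in (("ISO-8859-1", as_latin1),
                    ("ISO-8859-5", as_cyrillic),
                    ("ISO-8859-7", as_greek)):
    print(f"{label}: U+{ord(char):04X}")
```

Without the tag, the receiving software has no way to tell which of the three characters the sender meant.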

ISO-8859-1 covers Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, English, Faroese, Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic (new orthography), Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish. Plus some non-European languages like Indonesian/Malay, Tagalog (Philippines), Swahili, Afrikaans.

The following charsets cover more than the languages mentioned, but for simplicity I have listed only the languages that are not covered by a charset with a lower number:

ISO-8859-2 covers Croatian, Czech, Hungarian, Polish, Slovak, Slovenian and Sorbian, which cannot be written in ISO-8859-1.

ISO-8859-3 covers Maltese and Esperanto.

ISO-8859-4 covers Estonian, Finnish, Latvian, Lithuanian, Sami.

ISO-8859-5 covers some languages written in Cyrillic: Bulgarian, Byelorussian, (Slavic) Macedonian, Russian, Serbian and Ukrainian.

ISO-8859-6 covers Arabic (it tries to).

ISO-8859-7 covers Greek. Note that it was updated in 2003: some previously undefined values now define the Euro sign, the Drachma sign and the Greek Ypogegrammeni.

ISO-8859-8 covers Hebrew.

ISO-8859-9 covers Turkish.

ISO-8859-10 covers some Sami variants not covered by ISO-8859-4.

ISO-8859-11 covers Thai. Unfortunately it was still not listed in the IANA registry (see L:charsets/character-sets) when this document was written, so I've added support for the very similar TIS-620 instead.

ISO-8859-12 does not exist (yet?).

ISO-8859-13 covers Latvian and Lithuanian (also covered by ISO-8859-4).

ISO-8859-14 covers Irish Gaelic (old orthography), Manx Gaelic and Welsh.

ISO-8859-15 covers French. Yes, ISO-8859-1 does not contain all Finnish and French characters; these are added in ISO-8859-15, plus the Euro sign. This makes it important for us, so I'll list all covered European languages: Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic (new orthography), Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish.
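To see exactly which positions ISO-8859-15 changes relative to ISO-8859-1, the two tables can be compared value by value. This is a sketch based on Python's codec tables; the helper name is hypothetical:

```python
def charset_differences(codec_a, codec_b):
    """Map each byte value to (char_a, char_b) where the two charsets disagree."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        a = raw.decode(codec_a, errors="replace")
        b = raw.decode(codec_b, errors="replace")
        if a != b:
            diffs[value] = (a, b)
    return diffs


latin9_changes = charset_differences("iso8859_1", "iso8859_15")
for value, (old, new) in sorted(latin9_changes.items()):
    print(f"0x{value:02X}: U+{ord(old):04X} -> U+{ord(new):04X}")
```

Only eight positions differ, among them 0xA4, where the generic currency sign gives way to the Euro sign.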

ISO-8859-16 covers Romanian, plus the Euro sign, so here is the list: Albanian, Croatian, English, Finnish, French, German, Hungarian, Irish Gaelic (new orthography), Italian, Latin, Polish, Romanian, Slovenian.

TIS-620
Thai character set very similar to ISO-8859-11.

Amiga-1251
Amiga-1251 covers the missing combination Russian plus Euro, see the extra documentation for this character set.

KOI8
KOI8-R and the Ukrainian variant KOI8-U are Cyrillic character sets which are more popular than ISO-8859-5. They don't contain the C1 control codes, so they can't be used as OS4 system default charset.

windows-125X
The widely used character sets windows-125X (X currently ranges from 0 to 8) can't be used as system default charset in OS4 because they redefine the C1 control codes as normal characters, punctuation etc. windows-1252 is a special case because it is identical to ISO-8859-1 except for the C1 control codes, so OS4 accepts a windows-1252 font as sufficient when an ISO-8859-1 font was requested (assuming no normal application will try to display C1 control codes).
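The claim that windows-1252 and ISO-8859-1 differ only in the C1 range can be checked against Python's codec tables (cp1252 is Python's name for windows-1252). This is a sketch, not a description of how OS4 matches fonts:

```python
# Collect every byte value that the two charsets interpret differently.
# Values that windows-1252 leaves undefined decode to U+FFFD here and
# therefore also count as "different".
differing = []
for value in range(256):
    raw = bytes([value])
    latin1 = raw.decode("iso8859_1")
    win1252 = raw.decode("cp1252", errors="replace")
    if latin1 != win1252:
        differing.append(value)

print(min(differing), max(differing), len(differing))
```

All differences fall between 128 and 159, so a windows-1252 font shows every printable ISO-8859-1 character correctly.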

Unicode
Unicode and ISO-10646 both describe the same universal character set. It uses the value range from 0 to 0x10FFFF, so up to 21 bits are needed to store a character value. Currently (Unicode 4.0.1) about 96,000 characters are defined; the rest is reserved for future additions.

What is UTF-X?

UTF-X are encodings for Unicode characters. Because there is probably no computer that stores values in exactly the 21 bits needed for a Unicode character, Unicode has to be stored in an encoding that uses a more common number of bits.

The simplest encoding is UTF-32, which uses 32 bits per character. It exists in the two variants UTF-32BE and UTF-32LE, where BE stands for Big Endian and LE for Little Endian. Big endian means the most significant byte is stored first, little endian means the least significant byte is stored first. The Unicode code point U+E007D (TAG RIGHT CURLY BRACKET) is stored as the byte sequence 0x00, 0x0E, 0x00, 0x7D in UTF-32BE and as the byte sequence 0x7D, 0x00, 0x0E, 0x00 in UTF-32LE.
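The two byte orders can be reproduced with Python's built-in UTF-32 codecs. The variable names are illustrative:

```python
# TAG RIGHT CURLY BRACKET, the code point used in the example above.
tag_bracket = chr(0xE007D)

utf32_be = tag_bracket.encode("utf-32-be")  # most significant byte first
utf32_le = tag_bracket.encode("utf-32-le")  # least significant byte first

print("UTF-32BE:", utf32_be.hex(" "))
print("UTF-32LE:", utf32_le.hex(" "))
```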

Another encoding is UTF-16 in the two variants UTF-16BE and UTF-16LE. It uses either 16 or 32 bits per character: 16 bits for the characters in the BMP (Basic Multilingual Plane) with numeric values 0 to 0xFFFF, and 32 bits for all other characters. Reserved value ranges in the BMP (the surrogates) make it possible to decide whether a 16-bit value will be followed by another 16-bit value to form a 32-bit value or not.
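How the reserved ranges work can be sketched as follows. The reserved values are the surrogates 0xD800 - 0xDBFF (first unit) and 0xDC00 - 0xDFFF (second unit). The function names are made up for this example, and the sketch does not validate its input (it does not reject code points that themselves fall into the surrogate range):

```python
def utf16_units(code_point):
    """Split a Unicode code point into one or two 16-bit UTF-16 units."""
    if code_point <= 0xFFFF:
        return [code_point]               # BMP: a single 16-bit value
    offset = code_point - 0x10000         # 20 bits remain
    high = 0xD800 + (offset >> 10)        # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # lower 10 bits
    return [high, low]


def needs_second_unit(unit):
    """True if this 16-bit value must be followed by a second one."""
    return 0xD800 <= unit <= 0xDBFF


units = utf16_units(0xE007D)
print([f"0x{u:04X}" for u in units], needs_second_unit(units[0]))
```

For U+E007D this yields 0xDB40 followed by 0xDC7D, which matches what Python's own utf-16-be codec produces.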

The most popular Unicode encoding is UTF-8. It uses one to four bytes (8, 16, 24 or 32 bits) per character. (Note: older definitions of UTF-8 which defined 5-byte and 6-byte sequences are obsolete and not supported by AmigaOS 4.0.) It has the following advantages:
US-ASCII characters are always represented as themselves. This means this text is both valid US-ASCII and valid UTF-8.
Latin text can be stored in fewer bytes than in UTF-16 or UTF-32.
The basic string functions used by many applications (strcpy(), strcat(), strcmp() etc), which were designed for 8-bit charsets, also work with UTF-8.

AmigaOS 4.0 currently supports UTF-8 in catalog files and in keymap files.
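The size properties of UTF-8 can be illustrated with Python's built-in codecs. The sample characters and the German sample phrase are arbitrary choices made for this sketch:

```python
# One sample code point for each possible UTF-8 length.
samples = {"A": 0x41, "e-acute": 0xE9, "euro": 0x20AC, "tag bracket": 0xE007D}
lengths = {name: len(chr(cp).encode("utf-8")) for name, cp in samples.items()}
print(lengths)

# Pure US-ASCII text is byte-for-byte identical in UTF-8.
ascii_text = "Plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# Mostly-Latin text needs fewer bytes in UTF-8 than in UTF-16 or UTF-32.
latin_text = "Grüße aus Zürich"
sizes = {enc: len(latin_text.encode(enc))
         for enc in ("utf-8", "utf-16-be", "utf-32-be")}
print(sizes)
```

The 16-character phrase contains three non-ASCII letters, each of which takes two bytes in UTF-8, so it needs 19 bytes instead of 32 (UTF-16) or 64 (UTF-32).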

A special UTF variant is UTF-7 which can be used by email applications to send mail over 7-bit-only paths.
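Python happens to ship a UTF-7 codec, which makes the 7-bit property easy to observe. This is only an illustration of the encoding itself, not of how any Amiga mail program handles it, and the sample text is arbitrary:

```python
text = "Price: 5 €"

wire = text.encode("utf-7")                # bytes as they would travel
seven_bit_clean = all(byte < 128 for byte in wire)
restored = wire.decode("utf-7")

print(wire, seven_bit_clean, restored == text)
```

Every byte of the encoded form stays below 128, and decoding gives back the original text including the Euro sign.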

Which charset should I use on AmigaOS 4.0?

In AmigaOS 4.0, the first language driver in the list of selected languages in Prefs/Locale specifies the system default charset, language and catalog search path. If additional language drivers are selected, they only specify additional catalog search paths.

For the novice user it should be relatively easy to decide which charset variant to use: start Prefs/Locale, clear the language list, then look at which charset variants are offered for your native language and choose one. Remember this charset and try to use the same variant when selecting the country and the keymap.

For some languages I was unable to reduce the number of variants offered (e.g. Serbian in ISO-8859-2, -5, -16 and X-ATO-E2, because all of them are in use, IMHO). For many languages I offer two variants (with and without Euro), and often there is only one variant.

You don't need to care about the font variants; diskfont.library converts them on the fly if needed.

Also the catalog charset does not matter because locale.library converts it if necessary.

Some country definitions exist in several charset variants because e.g. the currency symbol is charset dependent. I tried to use e.g. "(Euro sign)" as a marker where possible.

The keymap variants are often minor and only affect deadkeys (e.g. DeadCaron is available in ISO-8859-15 variants but not in ISO-8859-1 variants, where this accent does not exist). Sometimes the differences are large, e.g. between Latin Serbian and Cyrillic Serbian.

Disclaimer: Amiga Auckland have prepared the above information for the use of its members based on our experiences and as such is subject to revision at any time. Amiga Auckland cannot guarantee any of the information and cannot be held accountable for any issues that may result from using it.


Copyright 2005 Amiga Auckland Inc. All rights reserved.
Revised: December 15, 2005.