Understanding Character Sets
With the release of AmigaOS 4.0 Update #4 there has been some confusion regarding character sets, in particular where they were previously used incorrectly or not fully understood. To help clarify things, Detlef Wuerkner, who did the development, compiled the following:
What is a character set?
A character set is just a table that defines which numerical value means which character. Example: In many charsets code 65 means the Latin uppercase letter A, code 66 means B, etc.
Popular character sets
US-ASCII
ISO-646
ISO-8859
Before AmigaOS 4.0 the Amiga character set was defined to be ECMA-94 Latin 1, later also adopted by the ISO committee as ISO/IEC 8859-1 Latin Alphabet #1.
AmigaOS 4.0 supports only US-ASCII and charsets which are a superset of it, so the numerical values 0 - 127 always mean the same characters.
For use as the system default charset, only charsets that are identical to ISO-8859-1 in the range 0 - 160 are supported. The reason for this limitation
is that important parts of AmigaOS (console.device, filesystems etc.) interpret the C1 control codes in the range from 128 to 159.
Because most operating systems (including AmigaOS) use(d) a byte to store a character, and ISO-8859-1 defined all 256 possible values, it was not
possible to extend it (to add more characters).
The ISO committee then decided to publish additional ISO-8859-X character set standards. All are identical to ISO-8859-1 in the range 0 - 160 and
vary in the "national" range 161 - 255. This implies that you have to use a different font for a different charset (reading a Greek text with a
Cyrillic font is impossible) and that you have to mark your mails, web pages etc. with a charset tag which allows the software to choose the right
font for display.
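As an illustration of the shared and the "national" range, the following Python sketch decodes the same byte values under three ISO-8859 parts. It relies on the codecs shipped with Python's standard library, not on any AmigaOS API, and the helper name `decode_byte` is made up for this example.

```python
# Sketch: the same byte value means different characters in different
# ISO-8859 parts, but only above 160. Uses Python's built-in codecs.
import unicodedata

CHARSETS = ["iso8859_1", "iso8859_5", "iso8859_7"]  # Latin 1, Cyrillic, Greek


def decode_byte(value, charset):
    """Return the character that byte `value` (0-255) stands for in `charset`."""
    return bytes([value]).decode(charset)


# Range 0 - 160: every value decodes to the same character in all three.
shared_range_identical = all(
    decode_byte(v, cs) == decode_byte(v, "iso8859_1")
    for v in range(161)
    for cs in CHARSETS
)

# "National" range: code 228 (0xE4) differs from charset to charset.
national = {cs: decode_byte(0xE4, cs) for cs in CHARSETS}

print("0 - 160 identical:", shared_range_identical)
for cs, ch in national.items():
    # Print the Unicode name instead of the glyph to stay ASCII-only.
    print(f"{cs}: 0xE4 -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The byte 0xE4 comes out as a Latin, a Cyrillic and a Greek letter respectively, which is why a text needs its charset tag to be displayed with the right font.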
ISO-8859-1 covers Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, English, Faroese, Frisian, Galician, German, Greenlandic, Icelandic,
Irish Gaelic (new orthography), Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish. Plus some
non-European languages like Indonesian/Malay, Tagalog (Philippines), Swahili, Afrikaans.
The following charsets cover more than the mentioned languages, but for simplicity I list only the languages that are not covered by a charset
with a lower number:
ISO-8859-2 covers Croatian, Czech, Hungarian, Polish, Slovak, Slovenian and Sorbian, which cannot be written in ISO-8859-1.
ISO-8859-3 covers Maltese and Esperanto.
ISO-8859-4 covers Estonian, Finnish, Latvian, Lithuanian, Sami.
ISO-8859-5 covers some languages written in Cyrillic: Bulgarian, Byelorussian, (Slavic) Macedonian, Russian, Serbian and Ukrainian.
ISO-8859-6 covers Arabic (it tries to).
ISO-8859-7 covers Greek. Note that it was updated in 2003; some previously undefined values now define the Euro sign, the Drachma sign and the
Greek Ypogegrammeni.
ISO-8859-8 covers Hebrew.
ISO-8859-9 covers Turkish.
ISO-8859-10 covers some Sami variants not covered by ISO-8859-4.
ISO-8859-11 covers Thai. Unfortunately it was still not listed in the IANA registry (see L:charsets/character-sets) when this document was written,
so I've added support for the very similar TIS-620 instead.
ISO-8859-12 does not exist (yet?).
ISO-8859-13 covers Latvian and Lithuanian (also covered by ISO-8859-4).
ISO-8859-14 covers Irish Gaelic (old orthography), Manx Gaelic and Welsh.
ISO-8859-15 covers French. Yes, ISO-8859-1 does not contain all Finnish and French characters; these are added in ISO-8859-15, plus the Euro sign.
This makes it important for us, so I'll list all covered European languages: Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Estonian,
Faroese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic (new orthography), Italian, Latin, Luxemburgish, Norwegian,
Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish.
ISO-8859-16 covers Romanian. Plus the Euro sign, so here is the list: Albanian, Croatian, English, Finnish, French, German, Hungarian, Irish Gaelic
(new orthography), Italian, Latin, Polish, Romanian, Slovenian.
TIS-620
Amiga-1251
KOI8
windows-125X
Unicode
What is UTF-X?
UTF-X are encodings for Unicode characters. Because there is probably no computer that uses exactly the 21 bits needed to store a Unicode character,
Unicode has to be encoded using a more common number of bits that computers can handle.
The simplest encoding is UTF-32 which uses 32 bits per character. It exists in the two variants UTF-32BE and UTF-32LE where BE stands for Big Endian
and LE for Little Endian. Big endian means the most significant byte is stored first, little endian means the least significant byte is stored first.
The Unicode codepoint U+E007D (TAG RIGHT CURLY BRACKET) is stored as byte sequence 0x00, 0x0E, 0x00, 0x7D in UTF-32BE and as byte sequence 0x7D, 0x00,
0x0E, 0x00 in UTF-32LE.
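The byte order can be checked with a few lines of Python. This is only a sketch: it builds the four bytes by hand with `int.to_bytes` and cross-checks them against Python's own `utf-32-be` and `utf-32-le` codecs.

```python
# Sketch: byte order of U+E007D (TAG RIGHT CURLY BRACKET) in UTF-32.
codepoint = 0xE007D

# Built by hand: one 32-bit integer, most or least significant byte first.
utf32_be = codepoint.to_bytes(4, "big")
utf32_le = codepoint.to_bytes(4, "little")

# Cross-check against Python's built-in codecs.
assert chr(codepoint).encode("utf-32-be") == utf32_be
assert chr(codepoint).encode("utf-32-le") == utf32_le

print(utf32_be.hex(" "))  # 00 0e 00 7d
print(utf32_le.hex(" "))  # 7d 00 0e 00
```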
Another encoding is UTF-16 in the two variants UTF-16BE and UTF-16LE. It uses either 16 or 32 bits per character: 16 bits for the characters in the
BMP (Basic Multilingual Plane) with numeric values 0 to 0xFFFF and 32 bits for other characters. Reserved value ranges in the BMP (the surrogates,
0xD800 to 0xDFFF) make it possible to decide whether a 16-bit value will be followed by another 16-bit value to form a 32-bit value or not.
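How the reserved ranges work can be sketched in Python. The range values (0xD800 - 0xDBFF for the first unit, 0xDC00 - 0xDFFF for the second) come from the Unicode standard; the function names `to_utf16_units` and `is_first_of_pair` are made up for this example, and the input is assumed to be a valid codepoint outside the reserved ranges.

```python
# Sketch: splitting a codepoint into UTF-16 code units.
# Assumes `codepoint` is a valid Unicode value that is not itself
# inside the reserved range 0xD800 - 0xDFFF.

def to_utf16_units(codepoint):
    """Return one 16-bit unit for BMP characters, two for all others."""
    if codepoint < 0x10000:
        return [codepoint]
    offset = codepoint - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)        # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # lower 10 bits
    return [high, low]


def is_first_of_pair(unit):
    """True if this 16-bit value must be followed by a second one."""
    return 0xD800 <= unit <= 0xDBFF


units = to_utf16_units(0xE007D)
print([hex(u) for u in units])            # ['0xdb40', '0xdc7d']
print(is_first_of_pair(units[0]))         # True
print(is_first_of_pair(ord("A")))         # False
```

Because no ordinary character lives in the reserved ranges, a decoder only has to look at a single 16-bit value to know whether a second one follows.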
The most popular Unicode encoding is UTF-8. It uses one to four bytes (8, 16, 24 or 32 bits) per character. (Note: Older definitions of UTF-8 which
defined 5-byte and 6-byte sequences are obsolete and not supported by AmigaOS 4.0.) It has the following advantages:
US-ASCII characters are always represented as themselves. This means this text is both valid US-ASCII and valid UTF-8.
Latin text can be stored in fewer bytes than in UTF-16 or UTF-32.
The basic string functions used by many applications (strcpy(), strcat(), strcmp() etc.) that were designed for 8-bit charsets also work with UTF-8.
AmigaOS 4.0 currently supports UTF-8 in catalog files and in keymap files.
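These properties can be tried out with Python's built-in codecs. The sample strings and the helper `sizes` are made up for this sketch; the byte-wise concatenation at the end only imitates what a function like strcat() does and is not AmigaOS code.

```python
# Sketch: UTF-8 properties, checked with Python's built-in codecs.

ascii_text = "Amiga"
latin_text = "Gr\u00fc\u00dfe"            # German "Gruesse" with u-umlaut and sharp s

# 1. US-ASCII text is byte-for-byte identical in UTF-8.
ascii_identical = ascii_text.encode("utf-8") == ascii_text.encode("ascii")


def sizes(text):
    """Number of bytes `text` needs in each encoding."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-be", "utf-32-be")}


# 2. One to four bytes per character: A, a-umlaut, Euro sign, U+E007D.
utf8_lengths = [len(chr(cp).encode("utf-8"))
                for cp in (0x41, 0xE4, 0x20AC, 0xE007D)]

# 3. Byte-wise concatenation (what strcat() does) yields valid UTF-8.
joined = ascii_text.encode("utf-8") + latin_text.encode("utf-8")

print(ascii_identical)        # True
print(sizes(latin_text))      # {'utf-8': 7, 'utf-16-be': 10, 'utf-32-be': 20}
print(utf8_lengths)           # [1, 2, 3, 4]
print(joined.decode("utf-8") == ascii_text + latin_text)   # True
```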
A special UTF variant is UTF-7 which can be used by email applications to send mail over 7-bit-only paths.
Which charset should I use on AmigaOS 4.0?
In AmigaOS 4.0, the first language driver in the list of selected languages in Prefs/Locale specifies the system default charset, language and
catalog search path. If additional language drivers are selected, they specify only additional catalog search paths.
For the novice user it should be relatively easy to decide which charset variants to use: Simply start Prefs/Locale, clear the language list,
then see which charset variants are offered for your native language and choose one. Remember this charset and try to use the same variant when
selecting the country and the keymap.
For some languages I was unable to reduce the number of variants offered (e.g. Serbian in ISO-8859-2, -5, -16 and X-ATO-E2, because all are used IMHO),
for many I offer two variants (with and without Euro), and often there exists only one variant.
You don't need to care about the font variants; diskfont.library converts them on the fly if needed.
The catalog charset does not matter either, because locale.library converts it if necessary.
Some country definitions exist in charset variants because e.g. the currency is charset dependent. I tried to use e.g. "(Euro sign)" as a marker where
possible.
The differences between keymap variants are often minor and affect only deadkeys (e.g. DeadCaron is available in ISO-8859-15 variants but not in
ISO-8859-1 variants, where this accent does not exist). Sometimes the differences are large, e.g. between Latin Serbian and Cyrillic Serbian.