# ******************************************************************************
# *
# * Copyright (C) 1995-2014, International Business Machines
# * Corporation and others. All Rights Reserved.
# *
# ******************************************************************************

# If this converter alias table looks very confusing, a much easier to
# understand view can be found at this demo:
# http://demo.icu-project.org/icu-bin/convexp

# IMPORTANT NOTE
#
# This file is not read directly by ICU. If you change it, you need to
# run gencnval, and eventually run pkgdata to update the representation that
# ICU uses for aliases. The gencnval tool will normally compile this file into
# cnvalias.icu. The gencnval -v verbose option will help you when you edit
# this file.

# Please be friendly to the rest of us that edit this table by
# keeping this table free of tabs.

# This is an alias file used by the character set converter.
# A lot of converter information can be found in unicode/ucnv.h, but here
# is more information about this file.
#
# If you are adding a new converter to this list and want to include it in the
# icu data library, please be sure to add an entry to the appropriate ucm*.mk file
# (see ucmfiles.mk for more information).
#
# Here is the file format using BNF-like syntax:
#
#   converterTable ::= tags { converterLine* }
#   converterLine  ::= converterName [ tags ] { taggedAlias* }'\n'
#   taggedAlias    ::= alias [ tags ]
#   tags           ::= '{' { tag+ } '}'
#   tag            ::= standard['*']
#   converterName  ::= [0-9a-zA-Z:_'-']+
#   alias          ::= converterName
#
# Except for the converter name, aliases are case insensitive.
# Names are separated by whitespace.
# Line continuation and comment syntax are similar to the GNU make syntax.
# Any lines beginning with whitespace (e.g. U+0020 SPACE or U+0009 HORIZONTAL
# TABULATION) are presumed to be a continuation of the previous line.
# The # symbol starts a comment and the comment continues until the end of
# the line.
#
# The converter
#
# All names can be tagged by including a space-separated list of tags in
# curly braces, as in ISO_8859-1:1987{IANA*} iso-8859-1 { MIME* } or
# some-charset{MIME* IANA*}. The order of tags does not matter, and
# whitespace is allowed between the tagged name and the tags list.
#
# The tags can be used to get standard names using ucnv_getStandardName().
#
# The complete list of recognized tags used in this file is defined in
# the affinity list near the beginning of the file.
#
# The * after the standard tag denotes that the previous alias is the
# preferred (default) charset name for that standard. There can only
# be one of these default charset names per converter.
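#
# As an illustration only, a tagged entry written in this format might look
# like the following sketch. The converter name "example-converter" and its
# aliases are hypothetical and do not correspond to any real ICU converter.
#
#   example-converter
#       Example-Charset { MIME* HTML* }   # preferred name for MIME and HTML
#       x-example                         # untagged alias
#       csExampleCharset { IANA }         # tagged for IANA, but not marked preferred
#
# The indented lines are continuations of the "example-converter" line.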


# The world is getting more complicated...
# Supporting XML parsers, HTML, MIME, and similar applications
# that mark encodings with a charset name can be difficult.
# Many of these applications and operating systems will update
# their codepages over time.

# This means that a new codepage that differs from an old one
# by changing a code point (e.g., to the Euro sign)
# must not get an old alias, because old files with this alias
# would then be interpreted differently.

# If a codepage gets updated by assigning characters to previously
# unassigned code points, then a new name is not necessary.
# Also, some codepages map unassigned codepage byte values
# to the same numbers in Unicode for roundtripping. It may be
# industry practice to keep the encoding name in such a case, too
# (example: Windows codepages).

# The aliases listed in the list of character sets
# that is maintained by the IANA (http://www.iana.org/) must
# not be changed to mean encodings different from what this
# list shows. Currently, the IANA list is at
# http://www.iana.org/assignments/character-sets
# It should also be mentioned that the exact mapping table used for each
# IANA name usually isn't specified. This means that some other applications
# and operating systems are left to interpret the exact mappings for the
# underspecified aliases. For instance, Shift-JIS on a Solaris platform
# may be different from Shift-JIS on a Windows platform. This is why
# some of the aliases can be tagged to differentiate different mapping
# tables with the same alias. If an alias is given to more than one converter,
# it is considered to be an ambiguous alias, and the affinity list will
# choose the converter to use when a standard isn't specified with the alias.

# Name matching is case-insensitive. Also, dashes '-', underscores '_'
# and spaces ' ' are ignored in names (thus cs-iso_latin-1, csisolatin1
# and "cs iso latin 1" are the same).
# However, the names in the left column are used directly as file names
# or as names of algorithmic converters, so their case must not
# be changed; otherwise code and/or file names must be changed to match.
# For example, the converter ibm-921 is expected to be the file ibm-921.cnv.


# The immediately following list is the affinity list of supported standard tags.
# When multiple converters have the same alias under different standards,
# the standard nearest to the top of this list with that alias will
# be the first converter that will be opened. The ordering of the aliases
# after this affinity list does not affect the preferred alias, but it may
# affect the order of the returned list of aliases for a given converter.
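#
# A hypothetical sketch of an ambiguous alias (neither converter exists in ICU;
# the names are made up for illustration):
#
#   example-conv-a
#       shared-name { IANA }
#   example-conv-b
#       shared-name { MIME }
#
# Assuming IANA appears above MIME in the affinity list, opening "shared-name"
# without naming a standard would be expected to select example-conv-a.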
#
# The general ordering is from specific and frequently used to more general
# or rarely used at the bottom.
{
    UTR22           # Name format specified by http://www.unicode.org/unicode/reports/tr22/
    HTML            # WHATWG's encoding spec; https://encoding.spec.whatwg.org
    IANA            # Source: http://www.iana.org/assignments/character-sets
    MIME            # Source: http://www.iana.org/assignments/character-sets
    }

UTF-8 { MIME* HTML* }
    unicode-1-1-utf-8
    utf8

utf-16be { MIME* HTML* }

utf-16le { MIME* HTML* }
    utf-16

ibm866-html
    IBM866 { MIME* HTML* }
    866
    cp866
    csibm866

iso-8859-2-html
    ISO-8859-2 { MIME* HTML* }
    csisolatin2
    iso-ir-101
    iso8859-2
    iso88592
    iso_8859-2
    iso_8859-2:1987
    l2
    latin2

iso-8859-3-html
    ISO-8859-3 { MIME* HTML* }
    csisolatin3
    iso-ir-109
    iso8859-3
    iso88593
    iso_8859-3
    iso_8859-3:1988
    l3
    latin3

iso-8859-4-html
    ISO-8859-4 { MIME* HTML* }
    csisolatin4
    iso-ir-110
    iso8859-4
    iso88594
    iso_8859-4
    iso_8859-4:1988
    l4
    latin4

iso-8859-5-html
    ISO-8859-5 { MIME* HTML* }
    csisolatincyrillic
    cyrillic
    iso-ir-144
    iso8859-5
    iso88595
    iso_8859-5
    iso_8859-5:1988

iso-8859-6-html
    ISO-8859-6 { MIME* HTML* }
    arabic
    asmo-708
    csiso88596e
    csiso88596i
    csisolatinarabic
    ecma-114
    iso-8859-6-e
    iso-8859-6-i
    iso-ir-127
    iso8859-6
    iso88596
    iso_8859-6
    iso_8859-6:1987

iso-8859-7-html
    ISO-8859-7 { MIME* HTML* }
    csisolatingreek
    ecma-118
    elot_928
    greek
    greek8
    iso-ir-126
    iso8859-7
    iso88597
    iso_8859-7
    iso_8859-7:1987
    sun_eu_greek

iso-8859-8-html
    ISO-8859-8 { MIME* HTML* }
    csiso88598e { MIME }
    csisolatinhebrew
    hebrew
    ISO-8859-8-E
    ISO-8859-8-I
    iso-ir-138
    iso8859-8
    iso88598
    iso_8859-8
    iso_8859-8:1988
    visual
    # adding this one leads to a failure in encoding-labels.html
    # csiso88598i

# This alias has to be dealt with by TextCodecICU unless
# multiple encodings can share a single mapping table.
#ISO-8859-8-I { MIME* HTML* }
#    csiso88598i
#    logical

iso-8859-10-html
    ISO-8859-10 { MIME* HTML* }
    csisolatin6
    iso-ir-157
    iso8859-10
    iso885910
    l6
    latin6

iso-8859-13-html
    ISO-8859-13 { MIME* HTML* }
    iso8859-13
    iso885913

iso-8859-14-html
    ISO-8859-14 { MIME* HTML* }
    iso8859-14
    iso885914

iso-8859-15-html
    ISO-8859-15 { MIME* HTML* }
    csisolatin9
    iso8859-15
    iso885915
    iso_8859-15
    l9

iso-8859-16-html
    ISO-8859-16 { MIME* HTML* }

koi8-r-html
    KOI8-R { MIME* HTML* }
    cskoi8r
    koi
    koi8
    koi8_r

koi8-u-html
    KOI8-U { MIME* HTML* }
    koi8-ru

macintosh-html
    macintosh { MIME* HTML* }
    csmacintosh
    mac
    x-mac-roman

windows-874-html
    windows-874 { MIME* HTML* }
    dos-874
    iso-8859-11
    iso8859-11
    iso885911
    tis-620

windows-1250-html
    windows-1250 { MIME* HTML* }
    cp1250
    x-cp1250

windows-1251-html
    windows-1251 { MIME* HTML* }
    cp1251
    x-cp1251

windows-1252-html
    windows-1252 { MIME* HTML* }
    ansi_x3.4-1968
    ascii
    cp1252
    cp819
    csisolatin1
    ibm819
    iso-8859-1
    iso-ir-100
    iso8859-1
    iso88591
    iso_8859-1
    iso_8859-1:1987
    l1
    latin1
    us-ascii
    x-cp1252

windows-1253-html
    windows-1253 { MIME* HTML* }
    cp1253
    x-cp1253

windows-1254-html
    windows-1254 { MIME* HTML* }
    cp1254
    csisolatin5
    iso-8859-9
    iso-ir-148
    iso8859-9
    iso88599
    iso_8859-9
    iso_8859-9:1989
    l5
    latin5
    x-cp1254

windows-1255-html
    windows-1255 { MIME* HTML* }
    cp1255
    x-cp1255

windows-1256-html
    windows-1256 { MIME* HTML* }
    cp1256
    x-cp1256

windows-1257-html
    windows-1257 { MIME* HTML* }
    cp1257
    x-cp1257

windows-1258-html
    windows-1258 { MIME* HTML* }
    cp1258
    x-cp1258

x-mac-cyrillic-html
    x-mac-cyrillic { MIME* HTML* }
    x-mac-ukrainian

# Keep GBK and GB18030 separate for now until we decide
# what to do about them: crbug.com/339862
# The encoding spec requires that decoding to Unicode should use GB18030
# while encoding from Unicode should use GBK.

windows-936-2000
    GBK { MIME* IANA* }
    chinese { IANA }
    iso-ir-58 { IANA }
    GB2312 { IANA MIME }
    GB_2312-80 { IANA }
    gb_2312
    csGB2312 { IANA }
    csiso58gb231280
    x-gbk

# GB 18030 is partly algorithmic, using the MBCS converter
gb18030 { IANA* } gb18030 { HTML* MIME* }

big5-html
    Big5 { MIME* HTML* }
    cn-big5
    csbig5
    x-x-big5
    Big5-HKSCS

euc-jp-html
    EUC-JP { MIME* HTML* }
    cseucpkdfmtjapanese
    x-euc-jp

ISO_2022,locale=ja,version=0
    ISO-2022-JP { MIME* HTML* }
    csiso2022jp

shift_jis-html
    Shift_JIS { MIME* HTML* }
    csshiftjis
    ms_kanji
    ms932
    shift-jis
    sjis
    windows-31j
    x-sjis

euc-kr-html
    EUC-KR { MIME* HTML* }
    cseuckr
    csksc56011987
    iso-ir-149
    korean
    ks_c_5601-1987
    ks_c_5601-1989
    ksc5601
    ksc_5601
    windows-949

# We need to keep these aliases so that documents labelled with them
# are converted to a single U+FFFD instead of being rendered as gibberish.
ISO-2022-KR { HTML* MIME* } csISO2022KR { IANA }
ISO-2022-CN { IANA* HTML* } csISO2022CN x-ISO-2022-CN-GB
ISO-2022-CN-EXT { IANA* HTML* }
HZ-GB-2312 { HTML* IANA* } HZ