HZ (character encoding)

{{Short description|Format for sending GB 2312 text over a 7-bit ASCII channel}}

{{Infobox character encoding

| name = HZ encoding

| image =

| mime = HZ-GB-2312

| alias =

| standard = {{IETF RFC|1843}}

| lang = Simplified Chinese, English, Russian

| encodes = GB 2312

| status =

| prev = zW

| next = Quoted-printable, UTF-7, 8BITMIME

| classification = CJK encoding, ASCII armor, variable-width encoding, stateful encoding

| by = Fung Fung Lee

}}

The HZ character encoding{{cite web|url=http://umunhum.stanford.edu/~lee/chicomp/HZ_spec.html|title=HZ — A Data Format for Exchanging Files of Arbitrarily Mixed Chinese and ASCII Characters|archive-url=https://web.archive.org/web/20051027040810/http://umunhum.stanford.edu/~lee/chicomp/HZ_spec.html|archive-date=2005-10-27}} is an encoding of GB 2312 that was once widely used in e-mail and USENET postings. It was designed in 1989 by Fung Fung Lee ({{zh|李楓峰}}) of Stanford University and codified in 1995 as RFC 1843.{{IETF RFC|1843}}

The HZ encoding, whose name is short for Hanzi ({{zh|t=漢字|s=汉字|l=Chinese Characters}}), was invented to facilitate the use of Chinese characters in e-mail, which at that time allowed only 7-bit characters. Therefore, instead of standard ISO 2022 escape sequences (as in ISO-2022-JP) or 8-bit characters (as in EUC), HZ uses only printable 7-bit characters to represent Chinese characters.

It was also popular on USENET, which in the late 1980s and early 1990s generally did not allow transmission of 8-bit characters or escape characters.

History

HZ superseded the earlier "zW" encoding, which marked entire lines as being GB 2312 text by beginning them with the characters zW.{{cite web|url=https://ccjktype.fonts.adobe.com/wp-content/uploads/2013/09/cjk_inf.txt|last=Lunde|first=Ken|author-link=Ken Lunde|title=CJK.INF Version 1.9|date=1995-12-18}}

Structure and use

In the HZ encoding system, the character sequences "~{" and "~}" act as escape sequences: the bytes between them are interpreted in pairs as Chinese characters encoded in GB 2312 (the most significant bit of each byte is ignored). Outside the escape sequences, characters are assumed to be ASCII.

The following example illustrates the relationship between GB 2312, EUC-CN, and the HZ code:

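{| border=1 cellpadding=4 style="border-collapse: collapse;"
|+ Various forms of the GB 2312 code (0xD2BB) for the character "一" (one)
|-
! Form
! Code
! With escape sequences
! Remarks
|-
| Kuten / Qūwèi / {{lang|zh-hans|区位}} form
| 5027
|
| Zone/ward/row (ku/qū/{{lang|zh-hans|区}}) 50, point (ten/wèi/{{lang|zh-hans|位}}) 27
|-
| ISO 2022 form
| 52<sub>16</sub> 3B<sub>16</sub>
| 0E<sub>16</sub> 52<sub>16</sub> 3B<sub>16</sub> 0F<sub>16</sub>
| 50 + 32 = 82 = 52<sub>16</sub>
|-
| EUC-CN form
| D2<sub>16</sub> BB<sub>16</sub>
| D2<sub>16</sub> BB<sub>16</sub>
| 52<sub>16</sub> ∨ 80<sub>16</sub> = D2<sub>16</sub>
|-
| HZ form (standard)
| 52<sub>16</sub> 3B<sub>16</sub>
| 7E<sub>16</sub> 7B<sub>16</sub> 52<sub>16</sub> 3B<sub>16</sub> 7E<sub>16</sub> 7D<sub>16</sub>
| Appears as {{mono|~{R;~}}} without HZ decoder
|-
| HZ form (alternate)
| D2<sub>16</sub> BB<sub>16</sub>
| 7E<sub>16</sub> 7B<sub>16</sub> D2<sub>16</sub> BB<sub>16</sub> 7E<sub>16</sub> 7D<sub>16</sub>
| EUC form acceptable to at least some decoders
|}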
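The arithmetic in the table can be reproduced with a short Python sketch. The function name <code>kuten_to_forms</code> and the dictionary keys are illustrative choices, not part of any specification; the sketch assumes Python's built-in <code>gb2312</code> codec, which decodes EUC-CN bytes.

```python
def kuten_to_forms(row: int, cell: int) -> dict:
    """Derive the byte forms of a GB 2312 character from its row and cell."""
    # ISO 2022 form: add 32 (0x20) to the row and to the cell.
    iso = bytes([row + 0x20, cell + 0x20])
    # EUC-CN form: set the most significant bit of each byte.
    euc = bytes(b | 0x80 for b in iso)
    return {
        "kuten": f"{row:02d}{cell:02d}",
        "iso2022": iso,
        # Shift Out (0x0E) and Shift In (0x0F) bracket the two bytes.
        "iso2022_escaped": b"\x0e" + iso + b"\x0f",
        "euc_cn": euc,
        # HZ brackets the 7-bit bytes with the printable "~{" and "~}".
        "hz": b"~{" + iso + b"~}",
        "hz_alternate": b"~{" + euc + b"~}",
        "char": euc.decode("gb2312"),
    }


forms = kuten_to_forms(50, 27)
print(forms["kuten"], forms["iso2022"].hex(), forms["euc_cn"].hex(), forms["hz"])
```

For row 50, cell 27 this yields the 7-bit pair 52<sub>16</sub> 3B<sub>16</sub>, which is the ASCII text "R;" and explains why the character shows up as {{mono|~{R;~}}} when no HZ decoder is present.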

HZ was originally designed to be used purely as a 7-bit code. However, when situations allow, the escape sequences "~{" and "~}" sometimes surround characters represented in EUC-CN; this alternative use allows Chinese to be readable either with the help of HZ decoder software, or with a system that understands EUC-CN.
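A minimal encoder sketch in Python can produce either form. The function <code>hz_encode</code> and its <code>eight_bit</code> flag are hypothetical names chosen for illustration; the sketch relies on Python's built-in <code>gb2312</code> codec for the character lookup and does not wrap long lines.

```python
def hz_encode(text: str, eight_bit: bool = False) -> bytes:
    """Encode text as HZ; eight_bit=True keeps EUC-CN bytes between the escapes."""
    out = bytearray()
    in_gb = False  # True while inside a "~{" ... "~}" run
    for ch in text:
        if ord(ch) < 0x80:
            if in_gb:
                out += b"~}"  # return to ASCII mode
                in_gb = False
            # A literal "~" is written as "~~" so it is not read as an escape.
            out += b"~~" if ch == "~" else ch.encode("ascii")
        else:
            # Raises UnicodeEncodeError if the character is not in GB 2312.
            pair = ch.encode("gb2312")
            if not in_gb:
                out += b"~{"  # enter GB 2312 mode
                in_gb = True
            # Standard HZ clears the high bit; the alternate form keeps it.
            out += pair if eight_bit else bytes(b & 0x7F for b in pair)
    if in_gb:
        out += b"~}"  # always finish in ASCII mode
    return bytes(out)


print(hz_encode("one: 一."))
print(hz_encode("一", eight_bit=True))
```

The only difference between the two modes is whether the high bit of each GB 2312 byte is cleared, which is why the alternate form stays readable on a system that understands EUC-CN.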

Additionally, the specification defines that:

  • the sequence "~~" encodes a single ASCII "~", and
  • the character "~" followed by a newline is a line-continuation marker; both characters are discarded.

However, not all HZ decoders follow these two rules.
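The decoding rules described in this section can be sketched in Python as follows. This is an illustrative decoder, not the reference implementation: the name <code>hz_decode</code> is hypothetical, the handling of malformed input is an assumption, and setting the high bit of every byte inside an escaped run is simply one way to accept both the standard and the alternate form.

```python
def hz_decode(data: bytes) -> str:
    """Decode HZ bytes, accepting 7-bit or EUC-CN pairs between the escapes."""
    out = []
    in_gb = False  # True while inside a "~{" ... "~}" run
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b == 0x7E and i + 1 < n:  # "~" may start an escape sequence
            nxt = data[i + 1]
            if not in_gb:
                if nxt == 0x7E:  # "~~" stands for a single "~"
                    out.append("~")
                    i += 2
                    continue
                if nxt == 0x0A:  # "~" plus newline: line continuation, dropped
                    i += 2
                    continue
                if nxt == 0x7B:  # "~{" enters GB 2312 mode
                    in_gb = True
                    i += 2
                    continue
            elif nxt == 0x7D:  # "~}" returns to ASCII mode
                in_gb = False
                i += 2
                continue
        if in_gb:
            if i + 1 >= n:
                raise ValueError("truncated two-byte character")
            # Setting the high bit maps both forms onto EUC-CN.
            pair = bytes([data[i] | 0x80, data[i + 1] | 0x80])
            out.append(pair.decode("gb2312"))
            i += 2
        else:
            out.append(bytes([b]).decode("ascii"))  # rejects 8-bit bytes here
            i += 1
    return "".join(out)


print(hz_decode(b"one: ~{R;~}."))
print(hz_decode(b"a~~b ab~\ncd"))
```

A decoder that skips the "~~" and "~"-newline branches would pass those bytes through unchanged, which is the kind of deviation noted above.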

HZ encoders and decoders

The first HZ encoder and decoder were written in 1989 by the code's inventor for the Unix operating system.{{cite web|url=http://ftp.cuhk.hk/pub/chinese/ifcss/software/unix/convert/HZ-2.0.tar.gz|title=HZ package 2.0 — HZ spec, reference encoder and decoder source code}}

The {{mono|hztty}} program, also for the Unix operating system, was among the first and most popular HZ decoders. It deviates from the specification in that it displays the escape sequences (i.e., "~{" and "~}") and does not treat "~~" or "~" followed by a newline specially. This was probably to allow software that assumes each character occupies one screen position (on a text screen) to function correctly without modification.

Support on Microsoft Windows came later, and a number of third-party "Chinese systems" support HZ. These systems may provide an option to hide the escape sequences.

Disadvantages

Because HZ relies on escape sequences, and moreover because its escape delimiters are printable ASCII characters, it is fairly easy to construct attack byte sequences that round-trip from HZ to Unicode and back. Use of HZ encoding is thus treated as suspicious by malware protection suites.{{Cite web|url=https://bugzilla.mozilla.org/show_bug.cgi?id=935453|title=935453 - Gather telemetry about HZ and other encodings we might try to remove|access-date=2018-06-18|archive-date=2017-05-19|archive-url=https://web.archive.org/web/20170519153033/https://bugzilla.mozilla.org/show_bug.cgi?id=935453|url-status=live}}{{better source needed|date=September 2020}}

References

{{character encoding}}

Category:Chinese character encodings