开发者

Anyone can explain the "same thing" issue of UTF-8?

开发者 https://www.devze.com 2023-04-12 02:35 出处:网络
Quoted from here: Security may also be impacted by a characteristic of several character encodings, including UTF-8: the \"same thing\" (as far as a user can

Quoted from here:

Security may also be impacted by a characteristic of several character encodings, including UTF-8: the "same thing" (as far as a user can tell) can be represented by several distinct character sequences. For instance, an e with acute accent can be represented by the precomposed U+00E9 E ACUTE character or by the canonicall开发者_开发技巧y equivalent sequence U+0065 U+0301 (E + COMBINING ACUTE). Even though UTF-8 provides a single byte sequence for each character sequence, the existence of multiple character sequences for "the same thing" may have security consequences whenever string matching, indexing,

Is this a hidden feature of UTF-8 that I've never tackled before?


This issue is not actually specific to UTF-8 at all. It happens with all encodings that can represent all (or at least most) Unicode codepoints.

The general idea of Unicode is to not provide so-called pre-composed characters (e.g. U+00E9 E ACUTE), instead they usually like to provide the base character (e.g. U+0065 LATIN SMALL LETTER E) and the combining character (e.g. U+0301 COMBINING ACUTE ACCENT). This has the advantage of not having to provide every possible combination as its own character.

Note: the U+xxxx notation is used to refer to unicode codepoints. It's the encoding-independent way to refer to Unicode characters.

However when Unicode was first designed an important goal was to have round-trip compatibility for existing, widely-used encodings, so some pre-composed characters were included (in fact most of the diacritic characters from the latin and related alphabets are included).

So yes (and tl;dr): in a correctly working Unicode-capable application U+00E9 should render the same way and be treated the same way as U+0065 followed by U+0301.

There's a non-trivial process called normalization that helps work with these differences by reducing a given string to one of four normal forms.

For example passing both strings (U+00E9 and U+0065 U+0301) will result in U+00E9 when using NFC and will result in U+0065 U+0301 when using NFD.


Very short and visualized example: the character "é" can either be represented using the Unicode code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE, é), or the sequence U+0065 (LATIN SMALL LETTER E, e) followed by U+0301 (COMBINING ACUTE ACCENT, ´), which together look like this: é.

In UTF-8, é has the byte sequence xC3 xA9, while é has the byte sequence x65 xCC x81.

Note: Due to technical limitations this post does not contain the actual combination characters.


Actually I don't understand what it means by :

"Even though UTF-8 provides a single byte sequence for each character sequence[...]"

What the quote wants to say is:

"Any given sequence of Unicode code points is mapped to one (and precisely one) sequence of bytes by the UTF-8 encoding." That is, UTF-8 is a bijection between sequences of (abstract) Unicode code points and bytes.

The problem, which the text wants to illustrate, is that there is no bijection between "letters of a text" (as commonly understood) and Unicode code points, because the same text can be represented by different sequences of Unicode code points (as explained in the example).

Actually, this has nothing to do with UTF-8 specifically; it is a fundamental property of Unicode: Many texts have more than one representations as Unicode code points. This is important to keep in mind when comparing texts expressed in Unicode (no matter in what encoding).

One (partial) solution to this is normalization. It defines various Normal forms for Unicode text, which are unique representations of a text.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号