How does a program read unicode? [closed]

Developer https://www.devze.com 2023-04-12 12:41 Source: web
Closed. This question needs details or clarity. It is not currently accepting answers. Want to improve this question? Add details and clarify the problem by editing this post.

Closed 9 years ago.

Unicode characters can be of variable size, since a character can be represented by 2 bytes or by more bytes (a sequence of 2-byte units). So if they are stored in binary format, how can a program know how to read them back?

Let's say 'a' is represented by 0F0F 13F3 and 'b' is represented by 02AD BC39 09F3 459F.

If I write them to a file foo.txt:

0F0F 13F3 02AD BC39 09F3 459F

Then how would I know where to stop for 'a' and 'b'?

Guys, here I am talking about reading and writing pure Unicode, i.e. without converting it into any other format based on a popular charset such as UTF-8.


First, not all Unicode representations are variable length. UTF-32 and UCS-2 are fixed length. UTF-8 and UTF-16 are each, in their own way, variable length.
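To see the difference concretely, here is a minimal sketch in Python (the language and the sample characters are my own choice, not part of the question). It prints how many bytes a few characters occupy in each encoding form; the helper name `encoded_lengths` is hypothetical.

```python
# Byte lengths of a few sample characters in each encoding form.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, é, €, 😀


def encoded_lengths(ch):
    """Return the number of bytes ch occupies in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-be" variants are used so no byte order mark is prepended.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

UTF-32 reports 4 bytes for every character, while the UTF-8 and UTF-16 sizes vary with the code point.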

Second, if you read the specification, you will learn that the sequences are self-describing. In UTF-8, the byte values that can start a sequence can never appear as a second, third, or later byte. The same holds for the surrogate pairs that represent non-BMP characters in UTF-16: the first unit of a pair comes from a range that the second unit never uses.
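For the UTF-16 case, the idea can be sketched roughly as follows. This is an illustration rather than a full decoder: the function name `split_utf16_units` is hypothetical, and big-endian input without a byte order mark is assumed. A unit in the range D800..DBFF (a high surrogate) announces that exactly one more unit, in DC00..DFFF, belongs to the same character; any other unit stands alone.

```python
import struct


def split_utf16_units(units):
    """Group a list of 16-bit code units into one list per character."""
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError("high surrogate without a low surrogate")
            chars.append(units[i:i + 2])
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate can never start a character.
            raise ValueError("unexpected low surrogate")
        else:
            # Any other unit is a complete BMP character by itself.
            chars.append([unit])
            i += 1
    return chars


data = "a\U0001F600b".encode("utf-16-be")
units = list(struct.unpack(f">{len(data) // 2}H", data))
print([[f"{u:04X}" for u in ch] for ch in split_utf16_units(units)])
# prints [['0061'], ['D83D', 'DE00'], ['0062']]
```

So the reader never has to guess where a character stops: the value of each unit says whether another unit follows.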


A commonly used encoding is UTF-8. The way it's structured is that some predefined bits of the character's bytes tell you whether there are more bytes to come.

See http://en.wikipedia.org/wiki/UTF-8#Design for a nice diagram.
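As a rough sketch of what those predefined bits look like in practice (Python, with hypothetical helper names `utf8_sequence_length` and `split_utf8`), the leading bits of the first byte give the total length of the sequence. This simplified version does not check the continuation bytes or reject overlong and out-of-range forms, which a real decoder must do.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes the sequence starting with first_byte occupies."""
    if first_byte < 0x80:            # 0xxxxxxx: single byte (ASCII)
        return 1
    if 0xC0 <= first_byte < 0xE0:    # 110xxxxx: one continuation byte follows
        return 2
    if 0xE0 <= first_byte < 0xF0:    # 1110xxxx: two continuation bytes follow
        return 3
    if 0xF0 <= first_byte < 0xF8:    # 11110xxx: three continuation bytes follow
        return 4
    # 10xxxxxx is a continuation byte and cannot start a sequence;
    # 0xF8 and above are never valid in UTF-8.
    raise ValueError(f"byte 0x{first_byte:02X} cannot start a UTF-8 sequence")


def split_utf8(data):
    """Split UTF-8 bytes into one bytes object per character (no validation)."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


print(split_utf8("a\u20ac\U0001F600".encode("utf-8")))
# prints [b'a', b'\xe2\x82\xac', b'\xf0\x9f\x98\x80']
```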
