Why is 'U+' used to designate a Unicode code point?

Question

Why do Unicode code points appear as U+<codepoint>?

For example, U+2202 represents the character ∂.

Why not U- (dash or hyphen character) or anything else?

Jukka K. Korpela · Accepted Answer · 2012-01-17 07:39:31Z

142

The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler’s explanation in the Unicode mailing list.
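As a small check of the character named above, the following sketch looks up the official names of U+228E and of U+2202 from the question, using Python's standard unicodedata module. The variable names are illustrative only.

```python
import unicodedata

# The character the answer says "U+" is derived from.
multiset_union = "\u228e"
# The example character from the question.
partial = "\u2202"

union_name = unicodedata.name(multiset_union)
partial_name = unicodedata.name(partial)

print(multiset_union, union_name)
print(partial, partial_name)
```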

answered Jan 17, 2012 at 7:39

community wiki

Jukka K. Korpela


2 revs · Accepted Answer · 2017-11-20 17:18:52Z

The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on the Unicode Consortium web site).

The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".
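A minimal sketch of that notation in practice, assuming Python 3. The helper names format_code_point and parse_code_point are hypothetical, not part of any library. The formatter pads to at least four hexadecimal digits, matching the "four or more" convention described above.

```python
def format_code_point(ch: str) -> str:
    """Return the U+ label for a single character, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


def parse_code_point(label: str) -> str:
    """Turn a label such as 'U+2202' back into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a U+ label: {label!r}")
    return chr(int(label[2:], 16))


print(format_code_point("∂"))
print(parse_code_point("U+2202"))
```

Code points above U+FFFF simply come out with more than four digits, so no separate "U-" form is needed for them.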

My personal recollection from early-1990s software industry discussions about Unicode is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points of U+010000 and above, the convention of "U-" followed by six hexadecimal digits came into use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.

I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990s should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (p. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines symbols U+HHHH and U-HHHHHHHH (p. 559).

The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the Python language defines the following string literals:

u'xyz' to indicate a Unicode string, a sequence of Unicode characters
'\uxxxx' to indicate a string with a Unicode character denoted by four hex digits
'\Uxxxxxxxx' to indicate a string with a Unicode character denoted by eight hex digits
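The literals above can be tried out as follows. This is a sketch assuming Python 3.3 or later, where every str is already a Unicode string and the u prefix is accepted but optional. The variable names are illustrative.

```python
# Four-digit and eight-digit escapes for the same code point, U+2202.
short_form = "\u2202"
long_form = "\U00002202"

# The u prefix is redundant in Python 3 but still accepted.
prefixed = u"\u2202"

# Code points above U+FFFF need the eight-digit form.
beyond_bmp = "\U0001F600"

print(short_form == long_form == prefixed)
print(len(beyond_bmp), hex(ord(beyond_bmp)))
```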

Thanks for this explanation, @Jim. It is really helpful. I will look at those linked docs. – Senthil Kumaran, Commented Jan 17, 2012 at 14:57
unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html also supports U+HHHH and U-HHHHHHHH. – Shawn Kovac, Commented Sep 8, 2015 at 18:20

Sean Bright · Accepted Answer · 2009-08-13 18:19:28Z

8

It depends on what version of the Unicode standard you are talking about. From Wikipedia:

Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" to be used only with exactly four digits to indicate a code unit, not a code point.

answered Aug 13, 2009 at 18:19

community wiki

Sean Bright

That was the helpful reference. But the reason for that change is not mentioned. Was it just a whim of the committee?
– Senthil Kumaran
Commented Aug 13, 2009 at 18:23
2

I don't see the "U-" convention in either The Unicode Standard 3.0.0 or The Unicode Standard 2.0.0 as archived on the Unicode Consortium's web site. I think Wikipedia is wrong here.
– Jim DeLaHunt
Commented Jan 17, 2012 at 7:08
1

It's in the preface (unicode.org/versions/Unicode3.0.0/Preface.pdf), but only mentioned briefly.
– Sean Bright
Commented Jan 17, 2012 at 11:33


2 revs, 2 users 67% · Accepted Answer · 2015-12-22 14:25:47Z

4

It is just a convention to show that the value is Unicode. A bit like '0x' or 'h' for hex values (0xB9 or B9h). Why 0xB9 and not 0hB9 (or &hB9 or $B9)? Just because that's how the coin flipped :-)

edited Dec 22, 2015 at 14:25

community wiki

2 revs, 2 users 67%
Mihai Nita

1

They didn't even have to flip a coin: x (/ˈɛks/) sounds more like hex than h (/eɪtʃ/) does.
– Frédéric Hamidi
Commented May 28, 2011 at 10:03
2

@FrédéricHamidi but VB uses &hB9, Pascal uses $B9, Intel syntax assembly uses 0B9h
– phuclv
Commented May 10, 2017 at 0:57
Thanks phuclv :-) Yes, the examples were not random :-)
– Mihai Nita
Commented Jul 15, 2019 at 15:28

