The \u{HHHH...} escape can be an arbitrary length, potentially including \u{}. Restrict it to 1 to 8 hexadecimal digits.
Proposal #199: String literals says "any number of hexadecimal characters" is valid for \u{HHHH}. This is undesirable because it means \u{000 ... 000E9} is a valid escape sequence for any number of 0 characters. Additionally, it's not clear whether \u{} is meant to be valid.
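For illustration, under that wording all of the following would denote the same character, and the status of the last is unclear:

```
"\u{E9}"               // é (U+00E9)
"\u{000000E9}"         // also é, with leading zeros
"\u{00000000000000E9}" // still é; any number of 0's is accepted
"\u{}"                 // valid or not? The wording doesn't say.
```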
Proposal #199: String literals says:

> As in JavaScript, Rust, and Swift, Unicode code points can be expressed by number using \u{10FFFF} notation, which accepts any number of hexadecimal characters. Any numeric code point in the ranges 0₁₆-D7FF₁₆ or E000₁₆-10FFFF₁₆ can be expressed this way.
When it comes to the number of digits, the languages differ:

- JavaScript accepts any number of digits, as long as the value is at most 10FFFF.
- Rust accepts 1 to 6 digits.
- Swift accepts 1 to 8 digits.

Unicode's codespace is 0 to 10FFFF.
The \u{H...} syntax is only valid for 1 to 8 hexadecimal digits.
We could allow \u{} as a shorthand for \u{0}. However, it doesn't save much: \x00 is equally long and clearer.
Rather than allowing this syntax, we prefer to disallow it for consistency with other languages.
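To make the length comparison concrete:

```
"\u{}"   // hypothetical shorthand for U+0000: 4 characters
"\x00"   // the same code point at the same length, and clearer
"\u{0}"  // the shortest explicit form: 5 characters
```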
We could allow any number of digits in the \u escape. However, this requires parsing escapes of completely arbitrary length, which creates unnecessary complexity in the parser: we need to consider what happens when the result exceeds 32 bits, far beyond Unicode's current 10FFFF limit. One way to handle this would be to store the result in a 32-bit integer and error as invalid as soon as the accumulated value exceeds 10FFFF. This would allow an arbitrary number of leading 0's to parse correctly.
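A minimal C++ sketch of that approach follows; the function and helper names are illustrative, not taken from any existing implementation:

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Value of one hexadecimal digit, or -1 if `c` is not a hex digit.
static auto HexValue(char c) -> int {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Parses the digits of an unbounded `\u{...}` escape. Accumulates into a
// 32-bit integer and errors as soon as the value exceeds 10FFFF, so an
// arbitrary number of leading 0's still parses correctly.
static auto ParseUnboundedEscape(std::string_view digits)
    -> std::optional<char32_t> {
  if (digits.empty()) {
    return std::nullopt;  // Or is `\u{}` valid? The wording doesn't say.
  }
  uint32_t value = 0;
  for (char c : digits) {
    int digit = HexValue(c);
    if (digit < 0) return std::nullopt;
    value = value * 16 + static_cast<uint32_t>(digit);
    // `value` was at most 10FFFF before this step, so this check also
    // keeps the accumulator from ever overflowing 32 bits.
    if (value > 0x10FFFF) return std::nullopt;
  }
  // Surrogates (D800-DFFF) are not valid code points.
  if (value >= 0xD800 && value <= 0xDFFF) return std::nullopt;
  return static_cast<char32_t>(value);
}
```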
Limiting the number of digits to a reasonable amount instead makes it easier to write a simple parser.
A limit of 6 digits is reasonable as the minimum needed to represent Unicode's codespace. A limit of 8 digits is reasonable as a standard 4-byte value, and roughly matches UTF-32.
While the advantage is weak, this proposal leans towards 8.
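For comparison, a sketch of the proposed rule with the 8-digit limit, reusing the hypothetical HexValue helper from the sketch above; with at most 8 digits the value always fits in 32 bits, so no mid-loop overflow check is needed:

```cpp
// Parses the digits of a `\u{...}` escape under the proposed rule:
// 1 to 8 hexadecimal digits, value in 0-D7FF or E000-10FFFF.
static auto ParseBoundedEscape(std::string_view digits)
    -> std::optional<char32_t> {
  if (digits.empty() || digits.size() > 8) return std::nullopt;
  uint32_t value = 0;
  for (char c : digits) {
    int digit = HexValue(c);
    if (digit < 0) return std::nullopt;
    // Eight hex digits fit exactly in 32 bits, so no overflow is possible.
    value = value * 16 + static_cast<uint32_t>(digit);
  }
  // A single range check at the end replaces the per-digit check.
  if (value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF)) {
    return std::nullopt;
  }
  return static_cast<char32_t>(value);
}
```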