3 years ago · 15d1e07133
--- a/docs/design/lexical_conventions/string_literals.md
+++ b/docs/design/lexical_conventions/string_literals.md
@@ -204,8 +204,8 @@ While octal escape sequences are expected to remain not permitted (even though
 
															 In the above table, `H` represents an arbitrary hexadecimal character, `0`-`9`
														
 
															 or `A`-`F` (case-sensitive). Unlike in C++, but like in Python, `\x` expects
														
 
															 exactly two hexadecimal digits. As in JavaScript, Rust, and Swift, Unicode code
														
 
															-points can be expressed by number using `\u{10FFFF}` notation, which accepts any
														
 
															-number of hexadecimal characters. Any numeric code point in the ranges
														
 
															+points can be expressed by number using `\u{10FFFF}` notation. This accepts
														
 
															+between 1 and 8 hexadecimal characters. Any numeric code point in the ranges
														
 
															 0<sub>16</sub>-D7FF<sub>16</sub> or E000<sub>16</sub>-10FFFF<sub>16</sub> can be
														
 
															 expressed this way.
														
@@ -338,6 +338,10 @@ string in the type system. In such string literals, we should consider rejecting
 
															     -   [Leading whitespace removal](/proposals/p0199.md#leading-whitespace-removal)
														
 
															     -   [Terminating newline](/proposals/p0199.md#terminating-newline)
														
 
															 -   [Escape sequences](/proposals/p0199.md#escape-sequences-1)
														
 
															+    -   Unicode escape sequences:
														
 
															+        -   [Allow zero digits](/proposals/p2040.md#allow-zero-digits)
														
 
															+        -   [Allow any number of hexadecimal characters](/proposals/p2040.md#allow-any-number-of-hexadecimal-characters)
														
 
															+        -   [Limiting to 6 digits versus 8](/proposals/p2040.md#limiting-to-6-digits-versus-8)
														
 
															 -   [Raw string literals](/proposals/p0199.md#raw-string-literals-1)
														
 
															     -   [Trailing whitespace](/proposals/p0199.md#trailing-whitespace)
														
 
															     -   [Line separators](/proposals/p0199.md#line-separators)
														
@@ -347,3 +351,5 @@ string in the type system. In such string literals, we should consider rejecting
 
															 -   Proposal
														
 
															     [#199: String literals](https://github.com/carbon-language/carbon-lang/pull/199)
														
 
															+-   Proposal
														
 
															+    [#2040: Unicode escape code length](https://github.com/carbon-language/carbon-lang/pull/2040)
														
--- a/proposals/p2040.md
+++ b/proposals/p2040.md
@@ -0,0 +1,109 @@
 
															+# Unicode escape code length
														
 
															+
														
 
															+<!--
														
 
															+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
														
 
															+Exceptions. See /LICENSE for license information.
														
 
															+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
														
 
															+-->
														
 
															+
														
 
															+[Pull request](https://github.com/carbon-language/carbon-lang/pull/2040)
														
 
															+
														
 
															+<!-- toc -->
														
 
															+
														
 
															+## Table of contents
														
 
															+
														
 
															+-   [Abstract](#abstract)
														
 
															+-   [Problem](#problem)
														
 
															+-   [Background](#background)
														
 
															+-   [Proposal](#proposal)
														
 
															+-   [Rationale](#rationale)
														
 
															+-   [Alternatives considered](#alternatives-considered)
														
 
															+    -   [Allow zero digits](#allow-zero-digits)
														
 
															+    -   [Allow any number of hexadecimal characters](#allow-any-number-of-hexadecimal-characters)
														
 
															+    -   [Limiting to 6 digits versus 8](#limiting-to-6-digits-versus-8)
														
 
															+
														
 
															+<!-- tocstop -->
														
 
															+
														
 
															+## Abstract
														
 
															+
														
 
															+The `\u{HHHH...}` escape can have any length, potentially including `\u{}`.
														
 
															+This proposal restricts it to between 1 and 8 hexadecimal characters.
														
 
															+
														
 
															+## Problem
														
 
															+
														
 
															+[Proposal #199: String literals](https://github.com/carbon-language/carbon-lang/pull/199)
														
 
															+says "any number of hexadecimal characters" is valid for `\u{HHHH}`. This is
														
 
															+undesirable, because it means `\u{000 ... 000E9}` is a valid escape sequence,
														
 
															+for any number of `0` characters. Additionally, it's not clear if `\u{}` is
														
 
															+meant to be valid.
														
 
															+
														
 
															+## Background
														
 
															+
														
 
															+[Proposal #199: String literals](https://github.com/carbon-language/carbon-lang/pull/199)
														
 
															+says:
														
 
															+
														
 
															+> As in JavaScript, Rust, and Swift, Unicode code points can be expressed by
														
 
															+> number using `\u{10FFFF}` notation, which accepts any number of hexadecimal
														
 
															+> characters. Any numeric code point in the ranges
														
 
															+> 0<sub>16</sub>-D7FF<sub>16</sub> or E000<sub>16</sub>-10FFFF<sub>16</sub> can
														
 
															+> be expressed this way.
														
 
															+
														
 
															+When it comes to the number of digits, the languages differ:
														
 
															+
														
 
															+-   In [JavaScript](https://262.ecma-international.org/13.0/#prod-CodePoint),
														
 
															+    between 1 and 6 digits are supported, and the value must be at most
														
 
															+    `10FFFF`.
														
 
															+-   In [Rust](https://doc.rust-lang.org/reference/tokens.html), between 1 and 6
														
 
															+    digits are supported.
														
 
															+-   In
														
 
															+    [Swift](https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html),
														
 
															+    between 1 and 8 digits are supported.
														
 
															+
														
 
															+Unicode's codespace is 0 to [`10FFFF`](https://unicode.org/glossary/#codespace).
														
 
															+
														
 
															+## Proposal
														
 
															+
														
 
															+The `\u{H...}` syntax is only valid with 1 to 8 hexadecimal characters.
														
 
															+
														
 
															+## Rationale
														
 
															+
														
 
															+-   [Code that is easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write)
														
 
															+    -   This restriction does not affect the ability to write valid Unicode.
														
 
															+        Instead, it restricts the ability to write confusing or invalid Unicode,
														
 
															+        which should make it easier to detect errors.
														
 
															+-   [Fast and scalable development](/docs/project/goals.md#fast-and-scalable-development)
														
 
															+    -   Simplifies tooling by reducing the number of syntaxes that need to be
														
 
															+        supported, and allowing early failure on obviously invalid inputs.
														
 
															+
														
 
															+## Alternatives considered
														
 
															+
														
 
															+### Allow zero digits
														
 
															+
														
 
															+We could allow `\u{}` as a version of `\u{0}`. However, as shorthand, it doesn't
														
 
															+save much, and `\x00` is both equal in length and clearer.
														
 
															+
														
 
															+Rather than allowing this syntax, we prefer to disallow it for consistency with
														
 
															+other languages.
														
 
															+
														
 
															+### Allow any number of hexadecimal characters
														
 
															+
														
 
															+We could allow any number of digits in the `\u` escape. However, this has the
														
 
															+consequence of requiring parsing of escapes of completely arbitrary length.
														
 
															+
														
 
															+This creates unnecessary complexity in the parser because we need to consider
														
 
															+what happens if the result is greater than 32 bits, significantly larger than
														
 
															+Unicode's current `10FFFF` limit. One way to do this would be to store the
														
 
															+result in a 32-bit integer and keep parsing until the value goes above `10FFFF`,
														
 
															+then error as invalid if that's exceeded. This would allow an arbitrary number
														
 
															+of leading `0`s to parse correctly.
														
 
															+
														
 
															+It should make it easier to write a simple parser if we instead limit the number
														
 
															+of digits to a reasonable amount.
														
 
															+
														
 
															+### Limiting to 6 digits versus 8
														
 
															+
														
 
															+A limit of 6 digits offers a reasonable limit as the minimum needed to represent
														
 
															+Unicode's codespace. A limit of 8 digits offers a reasonable limit as a standard
														
 
															+4-byte value, and roughly matches UTF-32.
														
 
															+
														
 
															+While it seems a weak advantage, this proposal leans towards 8.
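
Not part of the change itself: the rule proposed above can be illustrated with a minimal sketch. It assumes the uppercase-only hexadecimal digits and the valid code point ranges described in `string_literals.md`. The helper name `decode_unicode_escape` and the use of a regular expression are hypothetical, chosen for illustration, and do not reflect how the Carbon toolchain implements this.

```python
import re

# Hypothetical pattern: `\u{` followed by 1 to 8 uppercase hex digits, then `}`.
# The {1,8} bound is the restriction this proposal introduces.
_UNICODE_ESCAPE = re.compile(r"\\u\{([0-9A-F]{1,8})\}")


def decode_unicode_escape(text: str) -> str:
    """Decode one `\\u{H...}` escape, rejecting anything outside the rule."""
    match = _UNICODE_ESCAPE.fullmatch(text)
    if match is None:
        # Covers `\u{}`, more than 8 digits, and lowercase hex digits.
        raise ValueError(f"malformed escape: {text!r}")
    # At most 8 hex digits, so the value always fits in 32 bits.
    value = int(match.group(1), 16)
    # Only 0-D7FF and E000-10FFFF are expressible code points.
    if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        raise ValueError(f"not a valid code point: {value:#x}")
    return chr(value)
```

Under these assumptions, `\u{E9}` and `\u{000000E9}` both decode to `é`, while `\u{}`, `\u{0000000E9}` (nine digits), and `\u{110000}` are rejected. Because the digit count is bounded before conversion, the sketch never needs to handle a value wider than 32 bits, which is the simplification the "any number of hexadecimal characters" alternative gives up.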