
Proposal: Unicode source files (#142)

Factored out of the Lexical Conventions proposal (https://github.com/carbon-language/carbon-lang/pull/173). This proposal covers only the encoding aspect: Carbon text is Unicode, source files are encoded in UTF-8, and we will follow the Unicode Consortium's recommendations to the extent that they make sense to us.
Richard Smith 5 years ago
Parent
Commit
c197215bca
2 changed files with 335 additions and 0 deletions
  1. proposals/README.md: +2 -0
  2. proposals/p0142.md: +333 -0

+ 2 - 0
proposals/README.md

@@ -38,6 +38,8 @@ request:
 -   [0107 - Code and name organization](p0107.md)
 -   [0120 - Add idiomatic code performance and developer-facing docs to goals](p0120.md)
     -   [Decision](p0120_decision.md)
+-   [0142 - Unicode source files](p0142.md)
+    -   [Decision](p0142_decision.md)
 -   [0143 - Numeric literals](p0143.md)
     -   [Decision](p0143_decision.md)
 -   [0149 - Change documentation style guide](p0149.md)

+ 333 - 0
proposals/p0142.md

@@ -0,0 +1,333 @@
+# Unicode source files
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+[Pull request](https://github.com/carbon-language/carbon-lang/pull/142)
+
+## Table of contents
+
+<!-- toc -->
+
+-   [Problem](#problem)
+-   [Background](#background)
+-   [Proposal](#proposal)
+-   [Details](#details)
+    -   [Character encoding](#character-encoding)
+    -   [Source files](#source-files)
+    -   [Normalization](#normalization)
+    -   [Characters in identifiers and whitespace](#characters-in-identifiers-and-whitespace)
+        -   [Homoglyphs](#homoglyphs)
+-   [Alternatives considered](#alternatives-considered)
+    -   [Character encoding](#character-encoding-1)
+    -   [Byte order marks](#byte-order-marks)
+    -   [Normalization forms](#normalization-forms)
+
+<!-- tocstop -->
+
+## Problem
+
+Portable use and maintenance of Carbon source files requires a common
+understanding of how they are encoded on disk. Further, deciding which
+characters are valid in names and what constitutes whitespace is a complex
+area in which we do not expect to have local expertise.
+
+## Background
+
+[Unicode](https://www.unicode.org/versions/latest/) is a universal character
+encoding, maintained by the
+[Unicode Consortium](https://home.unicode.org/basic-info/overview/). It is the
+canonical encoding used for textual information interchange across all modern
+technology.
+
+The [Unicode Standard Annex 31](https://www.unicode.org/reports/tr31/), "Unicode
+Identifier and Pattern Syntax", provides recommendations for the use of Unicode
+in the definitions of general-purpose identifiers.
+
+## Proposal
+
+Carbon programs are represented as a sequence of Unicode code points. Carbon
+source files are encoded in UTF-8.
+
+Carbon will follow lexical conventions for identifiers and whitespace based on
+Unicode Annex 31.
+
+## Details
+
+### Character encoding
+
+Before being divided into tokens, a program starts as a sequence of Unicode code
+points -- integer values between 0 and 10FFFF<sub>16</sub> -- whose meaning as
+characters or non-characters is defined by the Unicode standard.
+
+Carbon is based on Unicode 13.0, which is currently the latest version of the
+Unicode standard. Newer versions should be considered for adoption as they are
+released.
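As an illustrative aside (not part of the proposal), Python's `chr` and `ord` model exactly this code point range:

```python
# Unicode code points are integers from 0 to 0x10FFFF; values beyond
# that range are not code points at all.
assert ord("\u30ea") == 0x30EA       # U+30EA KATAKANA LETTER RI
assert chr(0x10FFFF)                 # the largest valid code point
try:
    chr(0x110000)                    # one past the end of the range
    raised = False
except ValueError:
    raised = True
assert raised
```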
+
+### Source files
+
+Program text can come from a variety of sources, such as an interactive
+programming environment (a so-called "Read-Evaluate-Print-Loop" or REPL), a
+database, a memory buffer of an IDE, or a command-line argument.
+
+The canonical representation for Carbon programs is in files stored as a
+sequence of bytes in a file system on disk, and such files are expected to be
+encoded in UTF-8. Such files may begin with a UTF-8 BOM, that is, the byte
+sequence EF<sub>16</sub>,BB<sub>16</sub>,BF<sub>16</sub>. This prefix, if
+present, is ignored.
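A minimal sketch of this decoding step, in Python; the function name `decode_source` is illustrative, not from the proposal (Python's built-in `utf-8-sig` codec has the same BOM-skipping behavior):

```python
# Decode a UTF-8 source file, ignoring an optional leading BOM (EF BB BF).
UTF8_BOM = b"\xef\xbb\xbf"

def decode_source(raw: bytes) -> str:
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]   # the BOM prefix, if present, is ignored
    return raw.decode("utf-8")

assert decode_source(b"\xef\xbb\xbfvar x") == "var x"
assert decode_source(b"var x") == "var x"
```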
+
+Regardless of how program text is concretely stored, the first step in
+processing any such text is to convert it to a sequence of Unicode code points
+-- although such conversion may be purely notional. The result of this
+conversion is a Carbon _source file_. Depending on the needs of the language, we
+may require each such source file to have an associated file name, even if the
+source file does not originate in anything resembling a file system.
+
+### Normalization
+
+Background:
+
+-   [wikipedia article on Unicode normal forms](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms)
+-   [Unicode Standard Annex #15: Unicode Normalization Forms](https://www.unicode.org/reports/tr15/tr15-50.html)
+
+Carbon source files, including comments and string literals, are required to be
+in Unicode Normalization Form C ("NFC"). The Carbon source formatting tool will
+convert source files to NFC as necessary to satisfy this constraint.
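A formatter-style check might look like the following sketch, using Python's `unicodedata` module (`ensure_nfc` is an illustrative name, not part of any Carbon tool):

```python
import unicodedata

def ensure_nfc(text: str) -> str:
    # is_normalized (Python 3.8+) implements the NFC quick check, so
    # already-normalized input takes a fast path with no copy made.
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

decomposed = "o\u0304"                     # 'o' followed by COMBINING MACRON
assert ensure_nfc(decomposed) == "\u014d"  # U+014D, the composed form
```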
+
+The choice to require NFC is really four choices:
+
+1. Equivalence classes: we use a canonical normalization form rather than a
+   compatibility normalization form or no normalization form at all.
+
+    - If we use no normalization, invisibly-different ways of representing the
+      same glyph, such as with pre-combined diacritics versus with diacritics
+      expressed as separate combining characters, or with combining characters
+      in a different order, would be considered different characters.
+    - If we use a canonical normalization form, all ways of encoding diacritics
+      are considered to form the same character, but ligatures such as `ffi` are
+      considered distinct from the character sequence that they decompose into.
+    - If we use a compatibility normalization form, ligatures are considered
+      equivalent to the character sequence that they decompose into.
+
+    For a fixed-width font, a canonical normalization form is most likely to
+    consider characters to be the same if they look the same. Unicode annexes
+    [UAX#15](https://www.unicode.org/reports/tr15/tr15-18.html#Programming%20Language%20Identifiers)
+    and
+    [UAX#31](https://www.unicode.org/reports/tr31/tr31-33.html#normalization_and_case)
+    both recommend the use of Normalization Form C for case-sensitive
+    identifiers in programming languages.
+
+    See also the discussion of [homoglyphs](#homoglyphs) below.
+
+2. Composition: we use a composed normalization form rather than a decomposed
+   normalization form. For example, `ō` is encoded as U+014D (LATIN SMALL
+   LETTER O WITH MACRON) in a composed form and as U+006F (LATIN SMALL LETTER
+   O), U+0304 (COMBINING MACRON) in a decomposed form. The composed form results
+   in smaller representations whenever the two differ, but the decomposed form
+   is a little easier for algorithmic processing (for example, typo correction
+   and homoglyph detection).
+
+3. We require source files to be in our chosen form, rather than converting to
+   that form as necessary.
+
+4. We require that the entire contents of the file be normalized, rather than
+   restricting our attention to only identifiers, or only identifiers and string
+   literals.
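The composed/decomposed trade-off in point 2 above can be seen directly (a Python illustration, not normative):

```python
import unicodedata

composed = unicodedata.normalize("NFC", "o\u0304")
decomposed = unicodedata.normalize("NFD", "\u014d")
assert composed == "\u014d"     # a single pre-combined code point
assert decomposed == "o\u0304"  # base letter plus combining macron
# NFC yields the smaller UTF-8 encoding whenever the two forms differ.
assert len(composed.encode("utf-8")) < len(decomposed.encode("utf-8"))  # 2 < 3
```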
+
+### Characters in identifiers and whitespace
+
+We will largely follow [Unicode Annex 31](https://www.unicode.org/reports/tr31/)
+in our selection of identifier and whitespace characters. This Annex does not
+provide specific rules on lexical syntax, instead providing a framework that
+permits a selection of choices of concrete rules.
+
+The framework provided by Annex 31 includes suggested sets of characters that
+may appear in identifiers, including uppercase and lowercase ASCII letters,
+with reasonable extensions to many non-ASCII letters, with some characters
+restricted to not appear as the first character. For example, this list includes
+U+30EA (KATAKANA LETTER RI), but not U+2603 (SNOWMAN), both of which are
+permitted in identifiers in C++20. Similarly, it indicates which characters
+should be classified as whitespace, including all the ASCII whitespace
+characters plus some non-ASCII whitespace characters. It also supports
+language-specific "profiles" to alter these baseline character sets for the
+needs of a particular language -- for instance, to permit underscores in
+identifiers, or to include non-breaking spaces as whitespace characters.
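Python's own identifier rules follow just such an Annex 31 profile (`XID_Start`/`XID_Continue` plus underscore), which makes the examples above easy to check:

```python
# U+30EA (KATAKANA LETTER RI) is an XID_Start character;
# U+2603 (SNOWMAN) is not an identifier character at all.
assert "\u30ea".isidentifier()
assert not "\u2603".isidentifier()
# The underscore is a language-specific profile addition in Python.
assert "_count".isidentifier()
```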
+
+This proposal does not specify concrete choices for lexical rules, nor does it
+promise conformance to Annex 31 in every concrete area. We may find cases where
+we wish to take a different direction than that of the Annex.
+However, we should use Annex 31 as a basis for our decisions, and should expect
+strong justification for deviations from it.
+
+Note that this aligns with the current direction for C++, as described in WG21
+paper [P1949R6](http://wg21.link/P1949R6).
+
+#### Homoglyphs
+
+The sets of identifier characters suggested by Annex 31's `ID_Start` /
+`XID_Start` / `ID_Continue` / `XID_Continue` characters include many pairs of
+homoglyphs and near-homoglyphs -- characters that would be interpreted
+differently but may render identically or very similarly. This problem would
+also be present if we restricted the character set to ASCII -- for example,
+`kBa11Offset` and `kBall0ffset` may be very hard to distinguish in some fonts --
+but there are many more ways to introduce such problems with the broader
+identifier character set suggested by Annex 31.
+
+One way to handle this problem would be by adding a restriction to name lookup:
+if a lookup for a name is performed in a scope and that lookup would have found
+nothing, but there is a confusable identifier, as defined by
+[UAX#39](http://www.unicode.org/reports/tr39/#Confusable_Detection), in the same
+scope, the program is ill-formed. However, this idea is only provided as weak
+guidance to future proposals and to demonstrate that UAX#31's approach is
+compatible with at least one possible solution for the homoglyph problem. The
+concrete rules for handling homoglyphs are considered out of scope for this
+proposal.
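The underlying problem is easy to demonstrate: confusable identifiers differ by code point even after canonical normalization, so NFC alone cannot unify them (a Python illustration; no UAX#39 skeleton computation is attempted here):

```python
import unicodedata

latin = "bank"
mixed = "b\u0430nk"   # U+0430 CYRILLIC SMALL LETTER A, a homoglyph of 'a'
assert latin != mixed
# Both strings are already in NFC, so normalization cannot unify them.
assert unicodedata.normalize("NFC", latin) == latin
assert unicodedata.normalize("NFC", mixed) == mixed
```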
+
+## Alternatives considered
+
+There are a number of different design choices we could make, as divergences
+from the above proposal. Those choices, along with the arguments that led to
+choosing the proposed design rather than each alternative, are presented below.
+
+### Character encoding
+
+We could restrict programs to ASCII.
+
+Pro:
+
+-   Reduced implementation complexity.
+-   Avoids all problems relating to normalization, homoglyphs, text
+    directionality, and so on.
+-   We have no intention of using non-ASCII characters in the language syntax or
+    in any library name.
+-   Provides assurance that all names in libraries can reliably be typed by all
+    developers -- we already require that keywords, and thus all ASCII letters,
+    can be typed.
+
+Con:
+
+-   An overarching goal of the Carbon project is to provide a language that is
+    inclusive and welcoming. A language that does not permit names and comments
+    in programs to be expressed in the developer's native language will not meet
+    that goal for at least some of our developers.
+-   Quoted strings will be substantially less readable if non-ASCII printable
+    characters are required to be written as escape sequences.
+
+### Byte order marks
+
+We could disallow byte order marks.
+
+Pro:
+
+-   Marginal implementation simplicity.
+
+Con:
+
+-   Several major editors, particularly on the Windows platform, insert UTF-8
+    BOMs and use them to identify file encoding.
+
+### Normalization forms
+
+We could require a different normalization form.
+
+Pro:
+
+-   Some environments might more naturally produce a different normalization
+    form.
+-   Normalization Form D is more uniform, in that characters are always
+    maximally decomposed into combining characters; in NFC, characters may or
+    may not be decomposed depending on whether a composed form is available.
+    -   NFD may be more suitable for certain uses such as typo correction,
+        homoglyph detection, or code completion.
+
+Con:
+
+-   The C++ standard and community are moving towards using NFC:
+
+    -   WG21 is in the process of adopting an NFC requirement for C++
+        identifiers.
+    -   GCC warns on C++ identifiers that aren't in NFC.
+
+    As a consequence, we should expect that the tooling and development
+    environments that C++ developers are using will provide good support for
+    authoring NFC-encoded source files.
+
+-   The W3C recommends using NFC for all content, so code samples distributed on
+    webpages may be canonicalized into NFC by some web authoring tools.
+
+-   NFC produces smaller encodings than NFD in all cases where they differ.
+
+We could require no normalization form and compare identifiers by code point
+sequence.
+
+Pro:
+
+-   This is the rule in use in C++20 and before.
+
+Con:
+
+-   This is not the rule planned for the near future of C++.
+-   Different representations of the same character may result in different
+    identifiers, in a way that is likely to be invisible in most programming
+    environments.
+
+We could require no normalization form, and normalize the source code ourselves:
+
+Pro:
+
+-   We would treat source text identically regardless of the normalization form.
+-   Developers would not be responsible for ensuring that their editing
+    environment produces and preserves the proper normalization form.
+
+Con:
+
+-   There is substantially more implementation cost involved in normalizing
+    identifiers than in detecting whether they are in normal form. While this
+    proposal would require the implementation complexity of converting into NFC
+    in the formatting tool, it would not require the conversion cost to be paid
+    during compilation.
+
+    A high-quality implementation may choose to accept this cost anyway, in
+    order to better recover from errors. Moreover, it is possible to
+    [detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization)
+    and do the conversion only when necessary. However, if non-canonical source
+    is formally valid, there are more stringent performance constraints on such
+    conversion than if it is only done for error recovery.
+
+-   Tools such as `grep` do not perform normalization themselves, and so would
+    be unreliable when applied to a codebase with inconsistent normalization.
+-   GCC already diagnoses identifiers that are not in NFC, and WG21 is in the
+    process of adopting an
+    [NFC requirement for C++ identifiers](http://wg21.link/P1949R6), so
+    development environments should be expected to increasingly accommodate
+    production of text in NFC.
+-   The byte representation of a source file may be unstable if different
+    editing environments make different normalization choices, creating problems
+    for revision control systems, patch files, and the like.
+-   Normalizing the contents of string literals, rather than using their
+    contents unaltered, will introduce a risk of user surprise.
+
+We could require only identifiers, or only identifiers and comments, to be
+normalized, rather than the entire input file.
+
+Pro:
+
+-   This would provide more freedom in comments to use arbitrary text.
+-   String literals could contain intentionally non-normalized text in order to
+    represent non-normalized strings.
+
+Con:
+
+-   Within string literals, this would result in invisible semantic differences:
+    strings that render identically can have different meanings.
+-   The semantics of the program could vary if its sources are normalized, which
+    an editing environment might do invisibly and automatically.
+-   If an editing environment were to automatically normalize text, it would
+    introduce spurious diffs into changes.
+-   We would need to be careful to ensure that no string or comment delimiter
+    ends with a code point sequence that is a prefix of a decomposition of
+    another code point; otherwise, different normalizations of the same source
+    file could tokenize differently.