4 years ago · 6c5cf38879
--- a/docs/design/code_and_name_organization/source_files.md
+++ b/docs/design/code_and_name_organization/source_files.md
@@ -12,11 +12,8 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 
				 
			
 
				 -   [Overview](#overview)
			
 
				 -   [Encoding](#encoding)
			
 
				+-   [Alternatives considered](#alternatives-considered)
			
 
				 -   [References](#references)
			
 
				--   [Alternatives](#alternatives)
			
 
				-    -   [Character encoding](#character-encoding)
			
 
				-    -   [Byte order marks](#byte-order-marks)
			
 
				-    -   [Normalization forms](#normalization-forms)
			
 
				 
			
 
				 <!-- tocstop -->
			
 
				 
			
@@ -46,8 +43,17 @@ a Carbon source file, so the byte representation is required to be normalized in
 
				 Normalization Form C. The Carbon source formatting tool will convert source
			
 
				 files to NFC as necessary.
			
 
				 
			
 
				+## Alternatives considered
			
 
				+
			
 
				+-   [Character encoding](/proposals/p0142.md#character-encoding-1)
			
 
				+-   [Byte order marks](/proposals/p0142.md#byte-order-marks)
			
 
				+-   [Normalization forms](/proposals/p0142.md#normalization-forms)
			
 
				+
			
 
				 ## References
			
 
				 
			
 
				+-   Proposal
			
 
				+    [#142: Unicode source files](https://github.com/carbon-language/carbon-lang/pull/142)
			
 
				+
			
 
				 -   [Unicode](https://www.unicode.org/versions/latest/) is a universal character
			
 
				     encoding, maintained by the
			
 
				     [Unicode Consortium](https://home.unicode.org/basic-info/overview/). It is
			
@@ -60,185 +66,4 @@ files to NFC as necessary.
 
				 
			
 
				 -   [Unicode Standard Annex #15: Unicode Normalization Forms](https://www.unicode.org/reports/tr15/tr15-50.html)
			
 
				 
			
 
				--   [wikipedia article on Unicode normal forms](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms)
			
 
				-
			
 
				-## Alternatives
			
 
				-
			
 
				-The choice to require NFC is really four choices:
			
 
				-
			
 
				-1. Equivalence classes: we use a canonical normalization form rather than a
			
 
				-   compatibility normalization form or no normalization form at all.
			
 
				-
			
 
				-    - If we use no normalization, invisibly-different ways of representing the
			
 
				-      same glyph, such as with pre-combined diacritics versus with diacritics
			
 
				-      expressed as separate combining characters, or with combining characters
			
 
				-      in a different order, would be considered different characters.
			
 
				-    - If we use a canonical normalization form, all ways of encoding diacritics
			
 
				-      are considered to form the same character, but ligatures such as `ﬃ` are
			
 
				-      considered distinct from the character sequence that they decompose into.
			
 
				-    - If we use a compatibility normalization form, ligatures are considered
			
 
				-      equivalent to the character sequence that they decompose into.
			
 
				-
			
 
				-    For a fixed-width font, a canonical normalization form is most likely to
			
 
				-    consider characters to be the same if they look the same. Unicode annexes
			
 
				-    [UAX#15](https://www.unicode.org/reports/tr15/tr15-18.html#Programming%20Language%20Identifiers)
			
 
				-    and
			
 
				-    [UAX#31](https://www.unicode.org/reports/tr31/tr31-33.html#normalization_and_case)
			
 
				-    both recommend the use of Normalization Form C for case-sensitive
			
 
				-    identifiers in programming languages.
			
 
				-
			
 
				-2. Composition: we use a composed normalization form rather than a decomposed
			
 
				-   normalization form. For example, `ō` is encoded as U+014D (LATIN SMALL
			
 
				-   LETTER O WITH MACRON) in a composed form and as U+006F (LATIN SMALL LETTER
			
 
				-   O), U+0304 (COMBINING MACRON) in a decomposed form. The composed form results
			
 
				-   in smaller representations whenever the two differ, but the decomposed form
			
 
				-   is a little easier for algorithmic processing (for example, typo correction
			
 
				-   and homoglyph detection).
			
 
				-
			
 
				-3. We require source files to be in our chosen form, rather than converting to
			
 
				-   that form as necessary.
			
 
				-
			
 
				-4. We require that the entire contents of the file be normalized, rather than
			
 
				-   restricting our attention to only identifiers, or only identifiers and string
			
 
				-   literals.
			
 
				-
			
 
				-### Character encoding
			
 
				-
			
 
				-**We could restrict programs to ASCII.**
			
 
				-
			
 
				-Advantages:
			
 
				-
			
 
				--   Reduced implementation complexity.
			
 
				--   Avoids all problems relating to normalization, homoglyphs, text
			
 
				-    directionality, and so on.
			
 
				--   We have no intention of using non-ASCII characters in the language syntax or
			
 
				-    in any library name.
			
 
				--   Provides assurance that all names in libraries can reliably be typed by all
			
 
				-    developers -- we already require that keywords, and thus all ASCII letters,
			
 
				-    can be typed.
			
 
				-
			
 
				-Disadvantages:
			
 
				-
			
 
				--   An overarching goal of the Carbon project is to provide a language that is
			
 
				-    inclusive and welcoming. A language that does not permit names and comments
			
 
				-    in programs to be expressed in the developer's native language will not meet
			
 
				-    that goal for at least some of our developers.
			
 
				--   Quoted strings will be substantially less readable if non-ASCII printable
			
 
				-    characters are required to be written as escape sequences.
			
 
				-
			
 
				-### Byte order marks
			
 
				-
			
 
				-**We could disallow byte order marks.**
			
 
				-
			
 
				-Advantages:
			
 
				-
			
 
				--   Marginal implementation simplicity.
			
 
				-
			
 
				-Disadvantages:
			
 
				-
			
 
				--   Several major editors, particularly on the Windows platform, insert UTF-8
			
 
				-    BOMs and use them to identify file encoding.
			
 
				-
			
 
				-### Normalization forms
			
 
				-
			
 
				-**We could require a different normalization form.**
			
 
				-
			
 
				-Advantages:
			
 
				-
			
 
				--   Some environments might more naturally produce a different normalization
			
 
				-    form.
			
 
				--   Normalization Form D is more uniform, in that characters are always
			
 
				-    maximally decomposed into combining characters; in NFC, characters may or
			
 
				-    may not be decomposed depending on whether a composed form is available.
			
 
				-    -   NFD may be more suitable for certain uses such as typo correction,
			
 
				-        homoglyph detection, or code completion.
			
 
				-
			
 
				-Disadvantages:
			
 
				-
			
 
				--   The C++ standard and community is moving towards using NFC:
			
 
				-
			
 
				-    -   WG21 is in the process of adopting an NFC requirement for C++
			
 
				-        identifiers.
			
 
				-    -   GCC warns on C++ identifiers that aren't in NFC.
			
 
				-
			
 
				-    As a consequence, we should expect that the tooling and development
			
 
				-    environments that C++ developers are using will provide good support for
			
 
				-    authoring NFC-encoded source files.
			
 
				-
			
 
				--   The W3C recommends using NFC for all content, so code samples distributed on
			
 
				-    webpages may be canonicalized into NFC by some web authoring tools.
			
 
				-
			
 
				--   NFC produces smaller encodings than NFD in all cases where they differ.
			
 
				-
			
 
				-**We could require no normalization form and compare identifiers by code point
			
 
				-sequence.**
			
 
				-
			
 
				-Advantages:
			
 
				-
			
 
				--   This is the rule in use in C++20 and before.
			
 
				-
			
 
				-Disadvantages:
			
 
				-
			
 
				--   This is not the rule planned for the near future of C++.
			
 
				--   Different representations of the same character may result in different
			
 
				-    identifiers, in a way that is likely to be invisible in most programming
			
 
				-    environments.
			
 
				-
			
 
				-**We could require no normalization form, and normalize the source code
			
 
				-ourselves.**
			
 
				-
			
 
				-Advantages:
			
 
				-
			
 
				--   We would treat source text identically regardless of the normalization form.
			
 
				--   Developers would not be responsible for ensuring that their editing
			
 
				-    environment produces and preserves the proper normalization form.
			
 
				-
			
 
				-Disadvantages:
			
 
				-
			
 
				--   There is substantially more implementation cost involved in normalizing
			
 
				-    identifiers than in detecting whether they are in normal form. While this
			
 
				-    proposal would require the implementation complexity of converting into NFC
			
 
				-    in the formatting tool, it would not require the conversion cost to be paid
			
 
				-    during compilation.
			
 
				-
			
 
				-    A high-quality implementation may choose to accept this cost anyway, in
			
 
				-    order to better recover from errors. Moreover, it is possible to
			
 
				-    [detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization)
			
 
				-    and do the conversion only when necessary. However, if non-canonical source
			
 
				-    is formally valid, there are more stringent performance constraints on such
			
 
				-    conversion than if it is only done for error recovery.
			
 
				-
			
 
				--   Tools such as `grep` do not perform normalization themselves, and so would
			
 
				-    be unreliable when applied to a codebase with inconsistent normalization.
			
 
				--   GCC already diagnoses identifiers that are not in NFC, and WG21 is in the
			
 
				-    process of adopting an
			
 
				-    [NFC requirement for C++ identifiers](http://wg21.link/P1949R6), so
			
 
				-    development environments should be expected to increasingly accommodate
			
 
				-    production of text in NFC.
			
 
				--   The byte representation of a source file may be unstable if different
			
 
				-    editing environments make different normalization choices, creating problems
			
 
				-    for revision control systems, patch files, and the like.
			
 
				--   Normalizing the contents of string literals, rather than using their
			
 
				-    contents unaltered, will introduce a risk of user surprise.
			
 
				-
			
 
				-**We could require only identifiers, or only identifiers and comments, to be
			
 
				-normalized, rather than the entire input file.**
			
 
				-
			
 
				-Advantages:
			
 
				-
			
 
				--   This would provide more freedom in comments to use arbitrary text.
			
 
				--   String literals could contain intentionally non-normalized text in order to
			
 
				-    represent non-normalized strings.
			
 
				-
			
 
				-Disadvantages:
			
 
				-
			
 
				--   Within string literals, this would result in invisible semantic differences:
			
 
				-    strings that render identically can have different meanings.
			
 
				--   The semantics of the program could vary if its sources are normalized, which
			
 
				-    an editing environment might do invisibly and automatically.
			
 
				--   If an editing environment were to automatically normalize text, it would
			
 
				-    introduce spurious diffs into changes.
			
 
				--   We would need to be careful to ensure that no string or comment delimiter
			
 
				-    ends with a code point sequence that is a prefix of a decomposition of
			
 
				-    another code point, otherwise different normalizations of the same source
			
 
				-    file could tokenize differently.
			
 
				+-   [Wikipedia Unicode equivalence page: Normal forms](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms)
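
The text moved by this change contrasts composed and decomposed forms (the `ō` example) and canonical versus compatibility normalization (the `ﬃ` ligature). The following is a minimal sketch of those distinctions using Python's standard `unicodedata` module. It is not part of the Carbon toolchain. The helper names `is_nfc` and `to_nfc` are hypothetical and chosen only for illustration, and `unicodedata.is_normalized` assumes Python 3.8 or later.

```python
import unicodedata

# The two encodings of "ō" described in the removed "Composition" item.
composed = "\u014d"      # U+014D LATIN SMALL LETTER O WITH MACRON
decomposed = "o\u0304"   # U+006F followed by U+0304 COMBINING MACRON


def is_nfc(text: str) -> bool:
    """Check whether text is already in NFC (hypothetical helper name)."""
    return unicodedata.is_normalized("NFC", text)


def to_nfc(text: str) -> str:
    """Convert text to NFC, as a formatting tool might (hypothetical helper name)."""
    return unicodedata.normalize("NFC", text)


# Without normalization the two spellings compare as different strings.
print(composed == decomposed)

# Only the composed spelling is in NFC; normalizing makes them agree.
print(is_nfc(composed), is_nfc(decomposed))
print(to_nfc(decomposed) == composed)

# Where the forms differ, the composed form has the shorter UTF-8 encoding.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Canonical normalization keeps the ligature; compatibility normalization
# replaces it with the sequence it decomposes into.
ligature = "\ufb03"      # U+FB03 LATIN SMALL LIGATURE FFI
print(unicodedata.normalize("NFC", ligature) == ligature)
print(unicodedata.normalize("NFKC", ligature))
```

A check such as `is_nfc` corresponds to the cheaper "detect whether the file is normalized" path discussed in the removed text, while `to_nfc` corresponds to the conversion that the design leaves to the formatting tool.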