# Unicode source files

<!--
Part of the Carbon Language project, under the Apache License v2.0 with LLVM
Exceptions. See /LICENSE for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-->

[Pull request](https://github.com/carbon-language/carbon-lang/pull/142)

<!-- toc -->

## Table of contents

-   [Problem](#problem)
-   [Background](#background)
-   [Proposal](#proposal)
-   [Details](#details)
    -   [Character encoding](#character-encoding)
    -   [Source files](#source-files)
    -   [Normalization](#normalization)
    -   [Characters in identifiers and whitespace](#characters-in-identifiers-and-whitespace)
        -   [Homoglyphs](#homoglyphs)
-   [Alternatives considered](#alternatives-considered)
    -   [Character encoding](#character-encoding-1)
    -   [Byte order marks](#byte-order-marks)
    -   [Normalization forms](#normalization-forms)
-   [Rationale](#rationale)

<!-- tocstop -->

## Problem

Portable use and maintenance of Carbon source files requires a common
understanding of how they are encoded on disk. Further, the decision as to what
characters are valid in names and what constitutes whitespace are a complex area
in which we do not expect to have local expertise.

## Background

[Unicode](https://www.unicode.org/versions/latest/) is a universal character
encoding, maintained by the
[Unicode Consortium](https://home.unicode.org/basic-info/overview/). It is the
canonical encoding used for textual information interchange across all modern
technology.

The [Unicode Standard Annex 31](https://www.unicode.org/reports/tr31/), "Unicode
Identifier and Pattern Syntax", provides recommendations for the use of Unicode
in the definitions of general-purpose identifiers.

## Proposal

Carbon programs are represented as a sequence of Unicode code points. Carbon
source files are encoded in UTF-8.

Carbon will follow lexical conventions for identifiers and whitespace based on
Unicode Annex 31.

## Details

### Character encoding

Before being divided into tokens, a program starts as a sequence of Unicode code
points -- integer values between 0 and 10FFFF<sub>16</sub> -- whose meaning as
characters or non-characters is defined by the Unicode standard.

Carbon is based on Unicode 13.0, which is currently the latest version of the
Unicode standard. Newer versions should be considered for adoption as they are
released.

### Source files

Program text can come from a variety of sources, such as an interactive
programming environment (a so-called "Read-Evaluate-Print-Loop" or REPL), a
database, a memory buffer of an IDE, or a command-line argument.

The canonical representation for Carbon programs is in files stored as a
sequence of bytes in a file system on disk, and such files are expected to be
encoded in UTF-8. Such files may begin with an optional UTF-8 BOM, that is, the
byte sequence EF<sub>16</sub>,BB<sub>16</sub>,BF<sub>16</sub>. This prefix, if
present, is ignored.

Regardless of how program text is concretely stored, the first step in
processing any such text is to convert it to a sequence of Unicode code points
-- although such conversion may be purely notional. The result of this
conversion is a Carbon _source file_. Depending on the needs of the language, we
may require each such source file to have an associated filename, even if the
source file does not originate in anything resembling a file system.

### Normalization

Background:

-   [wikipedia article on Unicode normal forms](https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms)
-   [Unicode Standard Annex #15: Unicode Normalization Forms](https://www.unicode.org/reports/tr15/tr15-50.html)

Carbon source files, including comments and string literals, are required to be
in Unicode Normalization Form C ("NFC"). The Carbon source formatting tool will
convert source files to NFC as necessary to satisfy this constraint.

The choice to require NFC is really four choices:

1. Equivalence classes: we use a canonical normalization form rather than a
   compatibility normalization form or no normalization form at all.

    - If we use no normalization, invisibly-different ways of representing the
      same glyph, such as with pre-combined diacritics versus with diacritics
      expressed as separate combining characters, or with combining characters
      in a different order, would be considered different characters.
    - If we use a canonical normalization form, all ways of encoding diacritics
      are considered to form the same character, but ligatures such as `ﬃ` are
      considered distinct from the character sequence that they decompose into.
    - If we use a compatibility normalization form, ligatures are considered
      equivalent to the character sequence that they decompose into.

    For a fixed-width font, a canonical normalization form is most likely to
    consider characters to be the same if they look the same. Unicode annexes
    [UAX#15](https://www.unicode.org/reports/tr15/tr15-18.html#Programming%20Language%20Identifiers)
    and
    [UAX#31](https://www.unicode.org/reports/tr31/tr31-33.html#normalization_and_case)
    both recommend the use of Normalization Form C for case-sensitive
    identifiers in programming languages.

    See also the discussion of [homoglyphs](#homoglyphs) below.

2. Composition: we use a composed normalization form rather than a decomposed
   normalization form. For example, `ō` is encooded as U+014D (LATIN SMALL
   LETTER O WITH MACRON) in a composed form and as U+006F (LATIN SMALL LETTER
   O), U+0304 (COMBINING MACRON) in a decomposed form. The composed form results
   in smaller representations whenever the two differ, but the decomposed form
   is a little easier for algorithmic processing (for example, typo correction
   and homoglyph detection).

3. We require source files to be in our chosen form, rather than converting to
   that form as necessary.

4. We require that the entire contents of the file be normalized, rather than
   restricting our attention to only identifiers, or only identifiers and string
   literals.

### Characters in identifiers and whitespace

We will largely follow [Unicode Annex 31](https://www.unicode.org/reports/tr31/)
in our selection of identifier and whitespace characters. This Annex does not
provide specific rules on lexical syntax, instead providing a framework that
permits a selection of choices of concrete rules.

The framework provided by Annex 31 includes suggested sets of characters that
may appear in identifier, including uppercase and lowercase ASCII letters, along
with reasonable extensions to many non-ASCII letters, with some characters
restricted to not appear as the first character. For example, this list includes
U+30EA (KATAKANA LETTER RI), but not U+2603 (SNOWMAN), both of which are
permitted in identifiers in C++20. Similarly, it indicates which characters
should be classified as whitespace, including all the ASCII whitespace
characters plus some non-ASCII whitespace characters. It also supports
language-specific "profiles" to alter these baseline character sets for the
needs of a particular language -- for example, to permit underscores in
identifiers, or to include non-breaking spaces as whitespace characters.

This proposal does not specify concrete choices for lexical rules, nor that we
will not deviate from conformance to Annex 31 in any concrete area. We may find
cases where we wish to take a different direction than that of the Annex.
However, we should use Annex 31 as a basis for our decisions, and should expect
strong justification for deviations from it.

Note that this aligns with the current direction for C++, as described in WG21
paper [P1949R6](http://wg21.link/P1949R6).

#### Homoglyphs

The sets of identifier characters suggested by Annex 31's `ID_Start` /
`XID_Start` / `ID_Continue` / `XID_Continue` characters include many pairs of
homoglyphs and near-homoglyphs -- characters that would be interpreted
differently but may render identically or very similarly. This problem would
also be present if we restricted the character set to ASCII -- for example,
`kBa11Offset` and `kBall0ffset` may be very hard to distinguish in some fonts --
but there are many more ways to introduce such problems with the broader
identifier character set suggested by Annex 31.

One way to handle this problem would be by adding a restriction to name lookup:
if a lookup for a name is performed in a scope and that lookup would have found
nothing, but there is a confusable identifier, as defined by
[UAX#39](http://www.unicode.org/reports/tr39/#Confusable_Detection), in the same
scope, the program is ill-formed. However, this idea is only provided as weak
guidance to future proposals and to demonstrate that UAX#31's approach is
compatible with at least one possible solution for the homoglyph problem. The
concrete rules for handling homoglyphs are considered out of scope for this
proposal.

## Alternatives considered

There are a number of different design choices we could make, as divergences
from the above proposal. Those choices, along with the arguments that led to
choosing the proposed design rather than each alternative, are presented below.

### Character encoding

We could restrict programs to ASCII.

Pro:

-   Reduced implementation complexity.
-   Avoids all problems relating to normalization, homoglyphs, text
    directionality, and so on.
-   We have no intention of using non-ASCII characters in the language syntax or
    in any library name.
-   Provides assurance that all names in libraries can reliably be typed by all
    developers -- we already require that keywords, and thus all ASCII letters,
    can be typed.

Con:

-   An overarching goal of the Carbon project is to provide a language that is
    inclusive and welcoming. A language that does not permit names and comments
    in programs to be expressed in the developer's native language will not meet
    that goal for at least some of our developers.
-   Quoted strings will be substantially less readable if non-ASCII printable
    characters are required to be written as escape sequences.

### Byte order marks

We could disallow byte order marks.

Pro:

-   Marginal implementation simplicity.

Con:

-   Several major editors, particularly on the Windows platform, insert UTF-8
    BOMs and use them to identify file encoding.

### Normalization forms

We could require a different normalization form.

Pro:

-   Some environments might more naturally produce a different normalization
    form.
-   Normalization Form D is more uniform, in that characters are always
    maximally decomposed into combining characters; in NFC, characters may or
    may not be decomposed depending on whether a composed form is available.
    -   NFD may be more suitable for certain uses such as typo correction,
        homoglyph detection, or code completion.

Con:

-   The C++ standard and community is moving towards using NFC:

    -   WG21 is in the process of adopting a NFC requirement for C++
        identifiers.
    -   GCC warns on C++ identifiers that aren't in NFC.

    As a consequence, we should expect that the tooling and development
    environments that C++ developers are using will provide good support for
    authoring NFC-encoded source files.

-   The W3C recommends using NFC for all content, so code samples distributed on
    webpages may be canonicalized into NFC by some web authoring tools.

-   NFC produces smaller encodings than NFD in all cases where they differ.

We could require no normalization form and compare identifiers by code point
sequence.

Pro:

-   This is the rule in use in C++20 and before.

Con:

-   This is not the rule planned for the near future of C++.
-   Different representations of the same character may result in different
    identifiers, in a way that is likely to be invisible in most programming
    environments.

We could require no normalization form, and normalize the source code ourselves:

Pro:

-   We would treat source text identically regardless of the normalization form.
-   Developers would not be responsible for ensuring that their editing
    environment produces and preserves the proper normalization form.

Con:

-   There is substantially more implementation cost involved in normalizing
    identifiers than in detecting whether they are in normal form. While this
    proposal would require the implementation complexity of converting into NFC
    in the formatting tool, it would not require the conversion cost to be paid
    during compilation.

    A high-quality implementation may choose to accept this cost anyway, in
    order to better recover from errors. Moreover, it is possible to
    [detect NFC on a fast path](http://unicode.org/reports/tr15/#NFC_QC_Optimization)
    and do the conversion only when necessary. However, if non-canonical source
    is formally valid, there are more stringent performance constraints on such
    conversion than if it is only done for error recovery.

-   Tools such as `grep` do not perform normalization themselves, and so would
    be unreliable when applied to a codebase with inconsistent normalization.
-   GCC already diagnoses identifiers that are not in NFC, and WG21 is in the
    process of adopting an
    [NFC requirement for C++ identifiers](http://wg21.link/P1949R6), so
    development environments should be expected to increasingly accommodate
    production of text in NFC.
-   The byte representation of a source file may be unstable if different
    editing environments make different normalization choices, creating problems
    for revision control systems, patch files, and the like.
-   Normalizing the contents of string literals, rather than using their
    contents unaltered, will introduce a risk of user surprise.

We could require only identifiers, or only identifiers and comments, to be
normalized, rather than the entire input file.

Pro:

-   This would provide more freedom in comments to use arbitrary text.
-   String literals could contain intentionally non-normalized text in order to
    represent non-normalized strings.

Con:

-   Within string literals, this would result in invisible semantic differences:
    strings that render identically can have different meanings.
-   The semantics of the program could vary if its sources are normalized, which
    an editing environment might do invisibly and automatically.
-   If an editing environment were to automatically normalize text, it would
    introduce spurious diffs into changes.
-   We would need to be careful to ensure that no string or comment delimiter
    ends with a code point sequence that is a prefix of a decomposition of
    another code point, otherwise different normalizations of the same source
    file could tokenize differently.

## Rationale

We need to specify the source file encoding in order to move on to higher-level
lexical issues, and the proposal generally gives sound rationales for the
specific choices it makes. UTF-8 and Unicode Annex 31 are non-controversial
choices for a new language in 2020. The proposal provides a good rationale for
requiring NFC.

We support treating homoglyphs and confusability as out of scope. We favor of
accepting of a wide range of scripts to start, in order to be welcoming to as
wide a community as possible, assuming that underhanded code concerns can be
adequately addressed outside of the compiler proper.