A Carbon source file is a sequence of Unicode code points in Unicode Normalization Form C ("NFC"), and represents a portion of the complete text of a program.
Program text can come from a variety of sources, such as an interactive programming environment (a so-called "Read-Evaluate-Print-Loop" or REPL), a database, a memory buffer of an IDE, or a command-line argument.
The canonical representation for Carbon programs is in files stored as a
sequence of bytes in a file system on disk. Such files have a .carbon
extension.
The on-disk representation of a Carbon source file is encoded in UTF-8. Such files may begin with an optional UTF-8 BOM, that is, the byte sequence 0xEF 0xBB 0xBF. This prefix, if present, is ignored.
No Unicode normalization is performed when reading an on-disk representation of a Carbon source file, so the byte representation is required to be normalized in Normalization Form C. The Carbon source formatting tool will convert source files to NFC as necessary.
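These input rules can be sketched as follows. This is an illustrative Python model, not Carbon tooling; the function name `read_carbon_source` is invented for the example, and `unicodedata.is_normalized` requires Python 3.8+.

```python
import unicodedata

def read_carbon_source(raw: bytes) -> str:
    """Model of the decoding rules above: UTF-8, optional BOM, must be NFC."""
    # Strip the optional UTF-8 byte order mark (0xEF 0xBB 0xBF) if present.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    # Decode as UTF-8; invalid byte sequences raise UnicodeDecodeError.
    text = raw.decode("utf-8")
    # The reader only *checks* normalization; it performs no conversion.
    if not unicodedata.is_normalized("NFC", text):
        raise ValueError("source file is not in Normalization Form C")
    return text
```

Note that the conversion itself is left to the formatting tool; the reader rejects non-NFC input rather than silently normalizing it.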
Unicode is a universal character encoding, maintained by the Unicode Consortium. It is the canonical encoding used for textual information interchange across all modern technology.
Carbon is based on Unicode 13.0, which is currently the latest version of the Unicode standard. Newer versions will be considered for adoption as they are released.
The choice to require NFC is really four choices:
Equivalence classes: we use a canonical normalization form rather than a compatibility normalization form or no normalization form at all.
For example, a compatibility normalization form would identify the ligature ﬃ with the character sequence f, f, i, whereas under a canonical form the ligature is considered distinct from the character sequence that it decomposes into. For a fixed-width font, a canonical normalization form is most likely to consider characters to be the same if they look the same. Unicode annexes UAX#15 and UAX#31 both recommend the use of Normalization Form C for case-sensitive identifiers in programming languages.
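The distinction between the two kinds of normalization can be demonstrated with Python's `unicodedata` module (an illustrative aside, unrelated to any Carbon implementation):

```python
import unicodedata

lig = "\ufb03"  # ﬃ, U+FB03 LATIN SMALL LIGATURE FFI

# Canonical normalization (NFC) leaves the ligature alone: it is not
# canonically equivalent to the plain sequence "ffi".
assert unicodedata.normalize("NFC", lig) == lig

# Compatibility normalization (NFKC) folds it into that sequence.
assert unicodedata.normalize("NFKC", lig) == "ffi"
```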
Composition: we use a composed normalization form rather than a decomposed
normalization form. For example, ō is encoded as U+014D (LATIN SMALL
LETTER O WITH MACRON) in a composed form and as U+006F (LATIN SMALL LETTER
O), U+0304 (COMBINING MACRON) in a decomposed form. The composed form results
in smaller representations whenever the two differ, but the decomposed form
is a little easier for algorithmic processing (for example, typo correction
and homoglyph detection).
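The ō example, and the size difference between the two forms, can be checked directly (again an illustrative Python aside):

```python
import unicodedata

composed = "\u014d"     # ō as a single code point, the NFC form
decomposed = "o\u0304"  # o followed by U+0304 COMBINING MACRON, the NFD form

# The two forms are canonically equivalent and convert into each other.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# The composed form is the smaller encoding whenever the two differ.
assert len(composed.encode("utf-8")) < len(decomposed.encode("utf-8"))
```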
We require source files to be in our chosen form, rather than converting to that form as necessary.
We require that the entire contents of the file be normalized, rather than restricting our attention to only identifiers, or only identifiers and string literals.
We could restrict programs to ASCII.
Advantages:
Disadvantages:
We could disallow byte order marks.
Advantages:
Disadvantages:
We could require a different normalization form.
Advantages:
Disadvantages:
The C++ standard and community are moving towards using NFC:
As a consequence, we should expect that the tooling and development environments that C++ developers are using will provide good support for authoring NFC-encoded source files.
The W3C recommends using NFC for all content, so code samples distributed on webpages may be canonicalized into NFC by some web authoring tools.
NFC produces smaller encodings than NFD in all cases where they differ.
We could require no normalization form and compare identifiers by code point sequence.
Advantages:
Disadvantages:
We could require no normalization form, and normalize the source code ourselves.
Advantages:
Disadvantages:
There is substantially more implementation cost involved in normalizing identifiers than in detecting whether they are in normal form. While this proposal would require the implementation complexity of converting into NFC in the formatting tool, it would not require the conversion cost to be paid during compilation.
A high-quality implementation may choose to accept this cost anyway, in order to better recover from errors. Moreover, it is possible to detect NFC on a fast path and do the conversion only when necessary. However, if non-canonical source is formally valid, there are more stringent performance constraints on such conversion than if it is only done for error recovery.
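The fast-path structure described above can be sketched as follows; this is a minimal model in Python, not a description of any particular compiler, and it assumes `unicodedata.is_normalized` (Python 3.8+) as the cheap check:

```python
import unicodedata

def ensure_nfc(text: str) -> str:
    # Fast path: is_normalized only scans the input and can answer "yes"
    # without building a normalized copy of the string.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Slow path, taken only for non-canonical input (error recovery, or
    # the formatting tool): perform the actual conversion.
    return unicodedata.normalize("NFC", text)
```

Whether the slow path is reached only during error recovery, or also for formally valid input, is exactly the performance question raised above.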
Tools such as grep do not perform normalization themselves, and so would
be unreliable when applied to a codebase with inconsistent normalization.
GCC already diagnoses identifiers that are not in NFC, and WG21 is in the process of adopting an NFC requirement for C++ identifiers, so development environments should be expected to increasingly accommodate production of text in NFC.
The byte representation of a source file may be unstable if different editing environments make different normalization choices, creating problems for revision control systems, patch files, and the like.
Normalizing the contents of string literals, rather than using their contents unaltered, will introduce a risk of user surprise.
We could require only identifiers, or only identifiers and comments, to be normalized, rather than the entire input file.
Advantages:
Disadvantages: