|
|
@@ -0,0 +1,405 @@
|
|
|
+# Subscript syntax and semantics
|
|
|
+
|
|
|
+<!--
|
|
|
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
|
|
|
+Exceptions. See /LICENSE for license information.
|
|
|
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
|
|
|
+-->
|
|
|
+
|
|
|
+[Pull request](https://github.com/carbon-language/carbon-lang/pull/2274)
|
|
|
+
|
|
|
+<!-- toc -->
|
|
|
+
|
|
|
+## Table of contents
|
|
|
+
|
|
|
+- [Abstract](#abstract)
|
|
|
+- [Problem](#problem)
|
|
|
+- [Background](#background)
|
|
|
+- [Proposal](#proposal)
|
|
|
+- [Details](#details)
|
|
|
+ - [Examples](#examples)
|
|
|
+- [Rationale based on Carbon's goals](#rationale-based-on-carbons-goals)
|
|
|
+- [Alternatives considered](#alternatives-considered)
|
|
|
+ - [Different subscripting syntaxes](#different-subscripting-syntaxes)
|
|
|
+ - [Multiple indices](#multiple-indices)
|
|
|
+ - [Read-only subscripting](#read-only-subscripting)
|
|
|
+ - [Rvalue-only subscripting](#rvalue-only-subscripting)
|
|
|
+ - [Map-like subscripting](#map-like-subscripting)
|
|
|
+ - [Usage-based rewriting](#usage-based-rewriting)
|
|
|
+
|
|
|
+<!-- tocstop -->
|
|
|
+
|
|
|
+## Abstract
|
|
|
+
|
|
|
+Adds support for subscripting using the conventional square-bracket syntax, with
|
|
|
+support for both array-like and slice-like semantics.
|
|
|
+
|
|
|
+Cloned from [#1356](https://github.com/carbon-language/carbon-lang/pull/1356),
|
|
|
+which was cloned from
|
|
|
+[#1059](https://github.com/carbon-language/carbon-lang/pull/1059).
|
|
|
+
|
|
|
+## Problem
|
|
|
+
|
|
|
+Carbon needs a convenient way of indexing into objects like arrays and slices.
|
|
|
+
|
|
|
+## Background
|
|
|
+
|
|
|
+For purposes of this proposal, a "slice" is an object that refers to some region
|
|
|
+of an array, and provides access to it, such as a `std::string_view` or
|
|
|
+`std::span` in C++. Slices have subtly different indexing API semantics than
|
|
|
+arrays: indexing into an array can produce an l-value only if the array itself
|
|
|
+is an l-value, but indexing into a slice can produce an l-value regardless of
|
|
|
+the slice's value category.
|
|
|
+
|
|
|
+Interfaces are Carbon's
|
|
|
+[only static open extension mechanism](/docs/project/principles/static_open_extension.md),
|
|
|
+so user-defined indexing must be defined in terms of interfaces. In order to
|
|
|
+support
|
|
|
+[definition checking of generic functions](/docs/design/generics/goals.md#generics-will-be-checked-when-defined),
|
|
|
+it must be possible to fully typecheck indexing operations using only the
|
|
|
+information exposed by those interfaces. And in order to provide
|
|
|
+[coherence](/docs/design/generics/goals.md#coherence), a given indexing
|
|
|
+expression must execute the same code regardless of how much type information we
|
|
|
+have about it, so long as we have enough type information to establish that the
|
|
|
+expression is valid.
|
|
|
+
|
|
|
+The primary problem this proposal needs to solve is how to support both
|
|
|
+array-like and slice-like indexing semantics while satisfying those
|
|
|
+requirements.
|
|
|
+
|
|
|
+## Proposal
|
|
|
+
|
|
|
+Carbon will support indexing using the conventional `a[i]` subscript syntax.
|
|
|
+When `a` is an l-value, the result of subscripting will always be an l-value,
|
|
|
+but when `a` is an r-value, the result can be an l-value or an r-value,
|
|
|
+depending on which interface the type implements:
|
|
|
+
|
|
|
+- If subscripting an r-value produces an r-value result, as with an array, the
|
|
|
+ type should implement `IndexWith`.
|
|
|
+- If subscripting an r-value produces an l-value result, as with a slice, the
|
|
|
+ type should implement `IndirectIndexWith`.
|
|
|
+
|
|
|
+`IndirectIndexWith` is a subtype of `IndexWith`, and subscript expressions are
|
|
|
+rewritten to method calls on `IndirectIndexWith` if the type is known to
|
|
|
+implement that interface, or to method calls on `IndexWith` otherwise.
|
|
|
+
|
|
|
+A type can implement at most one of these two interfaces. `IndirectIndexWith`
|
|
|
+provides a final blanket `impl` of `IndexWith`, which ensures coherence.
|
|
|
+
|
|
|
+One implication of this design is that the requirement of coherence must apply
|
|
|
+to value categories as well as types (and for those purposes, "l-value" is a
|
|
|
+subcategory of "r-value"). This implies that for any valid expression containing
|
|
|
+an r-value sub-expression, replacing that sub-expression with an l-value
|
|
|
+expression cannot change the validity or semantics of the outer expression, so
|
|
|
+long as the r-value expression has the same type and evaluates to the same
|
|
|
+value. This, in turn, ensures that once we know a value implements `IndexWith`,
|
|
|
+additionally learning that it implements `IndirectIndexWith` can't invalidate
|
|
|
+code that previously typechecked, or change its semantics.
|
|
|
+
|
|
|
+## Details
|
|
|
+
|
|
|
+A subscript expression has the form "_lhs_ `[` _index_ `]`". As in C++, this
|
|
|
+syntax has the same precedence as `.`, `->`, and function calls, and associates
|
|
|
+left-to-right with all of them.
|
|
|
+
|
|
|
+Its semantics are defined in terms of the following interfaces:
|
|
|
+
|
|
|
+```
|
|
|
+interface IndexWith(SubscriptType:! Type) {
|
|
|
+ let ElementType:! Type;
|
|
|
+ fn At[me: Self](subscript: SubscriptType) -> ElementType;
|
|
|
+ fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType*;
|
|
|
+}
|
|
|
+
|
|
|
+interface IndirectIndexWith(SubscriptType:! Type) {
|
|
|
+ impl as IndexWith(SubscriptType);
|
|
|
+ fn Addr[me: Self](subscript: SubscriptType) -> ElementType*;
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+A subscript expression where _lhs_ has type `T` and _index_ has type `I` is
|
|
|
+rewritten based on the value category of _lhs_ and whether `T` is known to
|
|
|
+implement `IndirectIndexWith(I)`:
|
|
|
+
|
|
|
+- If `T` implements `IndirectIndexWith(I)`, the expression is rewritten to
|
|
|
+ "`*((` _lhs_ `).(IndirectIndexWith(I).Addr)(` _index_ `))`".
|
|
|
+- Otherwise, if _lhs_ is an l-value, the expression is rewritten to "`*((`
|
|
|
+ _lhs_ `).(IndexWith(I).Addr)(` _index_ `))`".
|
|
|
+- Otherwise, the expression is rewritten to "`(` _lhs_ `).(IndexWith(I).At)(`
|
|
|
+ _index_ `)`".
|
|
|
+
|
|
|
+On their own, these rules would oblige `IndirectIndexWith` types to define three
|
|
|
+methods to support a single syntax. Worse, they would permit those types to
|
|
|
+define those methods in inconsistent ways, which would violate coherence. To
|
|
|
+avoid those problems, `IndirectIndexWith` provides a blanket `final impl` for
|
|
|
+`IndexWith`:
|
|
|
+
|
|
|
+```
|
|
|
+final external impl forall
|
|
|
+ [SubscriptType:! Type, T:! IndirectIndexWith(SubscriptType)]
|
|
|
+ T as IndexWith(SubscriptType) {
|
|
|
+ let ElementType:! Type = T.(IndirectIndexWith(SubscriptType)).ElementType;
|
|
|
+ fn At[me: Self](subscript: SubscriptType) -> ElementType {
|
|
|
+ return *(me.(IndirectIndexWith(SubscriptType).Addr)(index));
|
|
|
+ }
|
|
|
+ fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType* {
|
|
|
+ return me->(IndirectIndexWith(SubscriptType).Addr)(index);
|
|
|
+ }
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+Thus, a type that implements `IndirectIndexWith` need not, and cannot, provide
|
|
|
+its own definitions of `IndexWith.At` and `IndexWith.Addr`.
|
|
|
+
|
|
|
+### Examples
|
|
|
+
|
|
|
+An array type could implement subscripting like so:
|
|
|
+
|
|
|
+```
|
|
|
+class Array(template T:! Type) {
|
|
|
+ external impl as IndexWith(like i64) {
|
|
|
+ let ElementType:! Type = T;
|
|
|
+ fn At[me: Self](subscript: i64) -> T;
|
|
|
+ fn Addr[addr me: Self*](subscript: i64) -> T*;
|
|
|
+ }
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+And a slice type could look like this:
|
|
|
+
|
|
|
+```
|
|
|
+class Slice(T:! Type) {
|
|
|
+ external impl as IndirectIndexWith(like i64) {
|
|
|
+ let ElementType:! Type = T;
|
|
|
+ fn Addr[me: Self](subscript: i64) -> T*;
|
|
|
+ }
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+We can even approximate support for indexing on tuple types. We don't have
|
|
|
+enough of a design for metaprogramming or variadics to specify it generically,
|
|
|
+but we can illustrate how it would look for a particular tuple type, such as
|
|
|
+`(i8, f64)`:
|
|
|
+
|
|
|
+```
|
|
|
+external impl (i8, f64) as IndexWith(typeof(0)) {
|
|
|
+ let ElementType:! Type = i8;
|
|
|
+ fn At[me: Self](0) -> i8;
|
|
|
+ fn Addr[addr me: Self*](0) -> i8*;
|
|
|
+}
|
|
|
+
|
|
|
+external impl (i8, f64) as IndexWith(typeof(1)) {
|
|
|
+ let ElementType:! Type = f64;
|
|
|
+ fn At[me: Self](1) -> f64;
|
|
|
+ fn Addr[addr me: Self*](1) -> f64*;
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+However, this approach presupposes that every integer literal has a distinct
|
|
|
+type, which is not yet settled. It also only supports indexing with integer
|
|
|
+literals, not with constant values of ordinary integer types. This is somewhat
|
|
|
+contrary to Carbon's metaprogramming philosophy: Carbon treats types as values
|
|
|
+in order to make metaprogramming more like ordinary programming, but here we're
|
|
|
+doing the reverse, treating values as types. Consequently, the eventual design
|
|
|
+for tuple-style indexing may look very different.
|
|
|
+
|
|
|
+## Rationale based on Carbon's goals
|
|
|
+
|
|
|
+Indexing is a fundamental operation in procedural programming, and is especially
|
|
|
+common in performance-critical code, so supporting it with a concise, familiar
|
|
|
+syntax helps us make
|
|
|
+[code easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write),
|
|
|
+and support
|
|
|
+[performance-critical software](/docs/project/goals.md#performance-critical-software).
|
|
|
+Using the same syntax as C++, even for slice-like types, supports
|
|
|
+[interoperability with and migration from existing C++ code](/docs/project/goals.md#interoperability-with-and-migration-from-existing-c-code).
|
|
|
+
|
|
|
+## Alternatives considered
|
|
|
+
|
|
|
+### Different subscripting syntaxes
|
|
|
+
|
|
|
+The reason that slices have different semantics from arrays is that a slice
|
|
|
+represents an indirection, like a pointer. Indexing behaves the same for slice
|
|
|
+l-values and r-values for the same reason that dereferencing behaves the same
|
|
|
+for pointer l-values and r-values. Thus, our attempts to make a single
|
|
|
+subscripting syntax work for both arrays and slices are arguably analogous to
|
|
|
+trying to make the same method-call syntax work for both pointers and objects.
|
|
|
+Or, to put it another way, the difficulty with slices is that they behave like
|
|
|
+_references_.
|
|
|
+
|
|
|
+From that point of view, it would be more consistent to have different
|
|
|
+subscripting syntaxes for the two cases. The most naive version of that would be
|
|
|
+to say that only array-like types support subscripting, and slice-like types
|
|
|
+should instead support dereferencing, so indexing into a slice would look like
|
|
|
+`(*s)[i]`. However, that would be syntactically onerous, and it's not clear how
|
|
|
+to define the type of `*s` in a way that doesn't beg the question.
|
|
|
+
|
|
|
+Instead, we could define a distinct syntax for slice subscripting, such as
|
|
|
+`s*[i]`, and define separate interfaces for the two subscript syntax. The two
|
|
|
+interfaces would be very similar to `IndexWith` and `IndirectIndexWith`, but
|
|
|
+neither would extend the other, and the rewrite from subscript syntax to method
|
|
|
+calls would not depend on type information (although it would depend on value
|
|
|
+category). However, any such syntax would be quite novel and unfamiliar, and
|
|
|
+there seem to be few good choices for how to spell it. `s*[i]` is almost the
|
|
|
+only syntax that seems both viable and somewhat mnemonic, but it would add
|
|
|
+further complexity to the already-fraught parsing of `*` (for humans as well as
|
|
|
+tools).
|
|
|
+
|
|
|
+### Multiple indices
|
|
|
+
|
|
|
+This proposal does not support multiple comma-separated indices (such as
|
|
|
+`a[i, j]`), which is desirable for types like multidimensional arrays. We do
|
|
|
+support syntax like `a[(i, j)]`, which is a single index whose type is a tuple,
|
|
|
+but the extra parens are syntactically noisy. We could add those parens
|
|
|
+implicitly, but that would effectively move the syntactic noise to the
|
|
|
+implementation, even for the single-index case (for example
|
|
|
+`impl Foo as IndexWith((i64,))`). It should be much cleaner to make the
|
|
|
+interfaces and their methods variadic, once we have a design for variadics.
|
|
|
+
|
|
|
+### Read-only subscripting
|
|
|
+
|
|
|
+This proposal does not provide an obvious path for supporting types like C++'s
|
|
|
+`std::string_view` or `std::span<const T>`, whose subscripting operations expose
|
|
|
+read-only access to their contents. It is tempting to try to extend this
|
|
|
+proposal to support those use cases, both because of their inherent importance
|
|
|
+and because it already has to deal with read-only versus read-write access in
|
|
|
+order to support array r-values.
|
|
|
+
|
|
|
+However, there's a fundamental difference between the true immutability of
|
|
|
+something like an array r-value, and the contextual lack of mutable _access_
|
|
|
+provided by something like `string_view`. While value categories can express the
|
|
|
+former, they are not well-suited to expressing the latter. To address these use
|
|
|
+cases, Carbon will probably need something like C++'s `const` type system, but
|
|
|
+that should be largely orthogonal to this proposal.
|
|
|
+
|
|
|
+### Rvalue-only subscripting
|
|
|
+
|
|
|
+This proposal does not support subscripting operations that can't produce
|
|
|
+l-values. In particular, this means it does not support using subscript syntax
|
|
|
+to form slices, as in Python's `a[i:j]` or Swift's `a[i...j]`. To support this,
|
|
|
+we would need a separate pair of interfaces that return by value:
|
|
|
+
|
|
|
+```
|
|
|
+interface RValueIndexWith(SubscriptType:! Type) {
|
|
|
+ let ElementType:! Type;
|
|
|
+ fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType;
|
|
|
+}
|
|
|
+
|
|
|
+interface RValueIndirectIndexWith(SubscriptType:! Type) {
|
|
|
+ extends RValueIndexWith(SubscriptType);
|
|
|
+ let ElementType:! Type;
|
|
|
+ fn Addr[me: Self](subscript: SubscriptType) -> ElementType;
|
|
|
+}
|
|
|
+
|
|
|
+final external impl [SubscriptType:! Type,
|
|
|
+ T:! RValueIndirectIndexWith(SubscriptType)]
|
|
|
+ T as IndexWith(SubscriptType) {
|
|
|
+ let ElementType:! Type =
|
|
|
+ T.(RValueIndirectIndexWith(SubscriptType)).ElementType;
|
|
|
+ fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType {
|
|
|
+ return me->Addr(subscript);
|
|
|
+ }
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+Note that we still need a pair of interfaces, and a blanket final `impl` to
|
|
|
+enforce consistency, because arrays and slices have different semantics in this
|
|
|
+context as well: taking a slice of an r-value array is invalid, because taking a
|
|
|
+slice is equivalent to (and presumably implemented in terms of) taking an
|
|
|
+address. On the other hand, taking a slice of an r-value slice is valid and
|
|
|
+should be supported.
|
|
|
+
|
|
|
+We would likewise need to extend the rewrite rules for subscript syntax to
|
|
|
+detect and use implementations of these interfaces. This should not lead to a
|
|
|
+combinatorial explosion of cases, though; if `T` implements both `IndexWith(I)`
|
|
|
+and `RValueIndexWith(I)`, the program should be ill-formed due to ambiguity.
|
|
|
+
|
|
|
+However, it's not clear if this would provide enough benefit to justify the
|
|
|
+added complexity.
|
|
|
+
|
|
|
+### Map-like subscripting
|
|
|
+
|
|
|
+This proposal does not support subscripting operations that insert new elements
|
|
|
+into a collection, as in C++'s `std::map` and `std::unordered_map`, because it
|
|
|
+requires subscriptable types to support subscripting of r-values, which are
|
|
|
+immutable. We could support this with an additional interface for types that are
|
|
|
+subscriptable only as l-values, and a corresponding extension to the rewrite
|
|
|
+rules.
|
|
|
+
|
|
|
+However, it's debatable whether such insertion behavior is desirable; it has not
|
|
|
+been a clear-cut success in C++. Code like `x = m[i];` looks like it reads from
|
|
|
+`m`, and the fact that it can write to `m` is surprising, and easy for even
|
|
|
+experienced programmers to overlook. Even for wary readers, it doesn't convey
|
|
|
+the author's intent, because it's not clear whether the author assumed `i` is
|
|
|
+present, or is relying on the implicit insertion. Furthermore, the implicit
|
|
|
+insertion means that `x = m[i];` won't compile when `m` is const. On the other
|
|
|
+hand, while it's relatively unsurprising that `m[i] = x;` might insert, that
|
|
|
+insertion is also potentially inefficient, since it must default-construct a new
|
|
|
+value before assigning `x` to it.
|
|
|
+
|
|
|
+We could fix most of these problems by giving special treatment to syntax of the
|
|
|
+form `m[i] = x;`, and defining separate methods for it:
|
|
|
+
|
|
|
+```
|
|
|
+interface IndexWith(SubscriptType:! Type) {
|
|
|
+ let ElementType:! Type;
|
|
|
+ fn At[me: Self](subscript: SubscriptType) -> ElementType;
|
|
|
+ fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType*;
|
|
|
+ fn Assign[addr me: Self*](subscript: SubscriptType, element: ElementType) {
|
|
|
+ (*me->Addr(index)) = element;
|
|
|
+ }
|
|
|
+}
|
|
|
+
|
|
|
+interface IndirectIndexWith(SubscriptType:! Type) {
|
|
|
+ extends IndexWith(SubscriptType);
|
|
|
+ let ElementType:! Type;
|
|
|
+ fn Addr[me: Self](subscript: SubscriptType) -> ElementType*;
|
|
|
+ fn Assign[me: Self](subscript: SubscriptType, element: ElementType) {
|
|
|
+ me->Addr(subscript) = element;
|
|
|
+ }
|
|
|
+}
|
|
|
+
|
|
|
+final external impl [SubscriptType:! Type, T:! IndirectIndexWith(SubscriptType)]
|
|
|
+ T as IndexWith(SubscriptType) {
|
|
|
+ ...
|
|
|
+ fn Assign[addr me: Self*](subscript: SubscriptType, element: ElementType) {
|
|
|
+ me->(IndirectIndexWith.AssignAsRValue)(subscript, element);
|
|
|
+ }
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+With this approach, expressions of the form `m[i] = x` would be rewritten to
|
|
|
+call the appropriate `Assign` method, while all other subscript expressions are
|
|
|
+rewritten as in the primary proposal. However, this does add some complexity to
|
|
|
+both the interfaces and the rewrite rules (for example, we presumably want
|
|
|
+`(m[i]) = x` to be treated the same as `m[i] = x`), so I'm leaving this as a
|
|
|
+potential future extension.
|
|
|
+
|
|
|
+#### Usage-based rewriting
|
|
|
+
|
|
|
+As an alternative way of getting `std::map`-like behavior, we could rewrite
|
|
|
+subscript expressions to use `At` instead of `Addr` when the usage context only
|
|
|
+needs an r-value, even if the operand is an l-value, in much the same way that
|
|
|
+Rust rewrites subscript expressions depending on whether the usage expects the
|
|
|
+result to be mutable. This would enable type authors to make `m[i] = x;`
|
|
|
+potentially-inserting while ensuring that `x = m[i];` is not, by making `Addr`
|
|
|
+but not `At` potentially-inserting.
|
|
|
+
|
|
|
+However, this would mean that `m[i].Method()` is potentially inserting if
|
|
|
+`Method` takes `me` by pointer, and would have the same performance drawback as
|
|
|
+in C++. Furthermore, it would violate the principle (which we have discussed but
|
|
|
+not yet adopted) that type information should flow from subexpressions to the
|
|
|
+parent expression, but never the reverse.
|
|
|
+
|
|
|
+A variant of this approach would be to allow the compiler to choose between `At`
|
|
|
+and `Addr` when the operand is an l-value but the usage context only needs an
|
|
|
+r-value. This might enable the compiler to make the generated code more
|
|
|
+efficient and/or safer, and might create pressure for libraries to ensure `At`
|
|
|
+and `Addr` are consistent with each other.
|
|
|
+
|
|
|
+However, if the compiler predictably chooses `At` under those conditions, that
|
|
|
+could create a temptation for library authors to leverage that implementation
|
|
|
+detail to try to implement a `std::map`-like API, by having `Addr` implicitly
|
|
|
+insert while `At` does not. That would have all the same drawbacks as if we
|
|
|
+supported it intentionally, plus the evolutionary drawbacks of having user code
|
|
|
+depend on unsupported implementation details.
|