Просмотр исходного кода

Subscript syntax and semantics (#2274)

Adds support for subscripting using the conventional square-bracket syntax, with support for both array-like and slice-like semantics.

Co-authored-by: Richard Smith <richard@metafoo.co.uk>
Co-authored-by: josh11b <josh11b@users.noreply.github.com>
Geoff Romer 3 лет назад
Родитель
Сommit
4885a2bf8e
1 измененных файлов с 405 добавлено и 0 удалено
  1. 405 0
      proposals/p2274.md

+ 405 - 0
proposals/p2274.md

@@ -0,0 +1,405 @@
+# Subscript syntax and semantics
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+[Pull request](https://github.com/carbon-language/carbon-lang/pull/2274)
+
+<!-- toc -->
+
+## Table of contents
+
+-   [Abstract](#abstract)
+-   [Problem](#problem)
+-   [Background](#background)
+-   [Proposal](#proposal)
+-   [Details](#details)
+    -   [Examples](#examples)
+-   [Rationale based on Carbon's goals](#rationale-based-on-carbons-goals)
+-   [Alternatives considered](#alternatives-considered)
+    -   [Different subscripting syntaxes](#different-subscripting-syntaxes)
+    -   [Multiple indices](#multiple-indices)
+    -   [Read-only subscripting](#read-only-subscripting)
+    -   [Rvalue-only subscripting](#rvalue-only-subscripting)
+    -   [Map-like subscripting](#map-like-subscripting)
+        -   [Usage-based rewriting](#usage-based-rewriting)
+
+<!-- tocstop -->
+
+## Abstract
+
+Adds support for subscripting using the conventional square-bracket syntax, with
+support for both array-like and slice-like semantics.
+
+Cloned from [#1356](https://github.com/carbon-language/carbon-lang/pull/1356),
+which was cloned from
+[#1059](https://github.com/carbon-language/carbon-lang/pull/1059).
+
+## Problem
+
+Carbon needs a convenient way of indexing into objects like arrays and slices.
+
+## Background
+
+For purposes of this proposal, a "slice" is an object that refers to some region
+of an array, and provides access to it, such as a `std::string_view` or
+`std::span` in C++. Slices have subtly different indexing API semantics than
+arrays: indexing into an array can produce an l-value only if the array itself
+is an l-value, but indexing into a slice can produce an l-value regardless of
+the slice's value category.
+
+Interfaces are Carbon's
+[only static open extension mechanism](/docs/project/principles/static_open_extension.md),
+so user-defined indexing must be defined in terms of interfaces. In order to
+support
+[definition checking of generic functions](/docs/design/generics/goals.md#generics-will-be-checked-when-defined),
+it must be possible to fully typecheck indexing operations using only the
+information exposed by those interfaces. And in order to provide
+[coherence](/docs/design/generics/goals.md#coherence), a given indexing
+expression must execute the same code regardless of how much type information we
+have about it, so long as we have enough type information to establish that the
+expression is valid.
+
+The primary problem this proposal needs to solve is how to support both
+array-like and slice-like indexing semantics while satisfying those
+requirements.
+
+## Proposal
+
+Carbon will support indexing using the conventional `a[i]` subscript syntax.
+When `a` is an l-value, the result of subscripting will always be an l-value,
+but when `a` is an r-value, the result can be an l-value or an r-value,
+depending on which interface the type implements:
+
+-   If subscripting an r-value produces an r-value result, as with an array, the
+    type should implement `IndexWith`.
+-   If subscripting an r-value produces an l-value result, as with a slice, the
+    type should implement `IndirectIndexWith`.
+
+`IndirectIndexWith` is a subtype of `IndexWith`, and subscript expressions are
+rewritten to method calls on `IndirectIndexWith` if the type is known to
+implement that interface, or to method calls on `IndexWith` otherwise.
+
+A type can implement at most one of these two interfaces. `IndirectIndexWith`
+provides a final blanket `impl` of `IndexWith`, which ensures coherence.
+
+One implication of this design is that the requirement of coherence must apply
+to value categories as well as types (and for those purposes, "l-value" is a
+subcategory of "r-value"). This implies that for any valid expression containing
+an r-value sub-expression, replacing that sub-expression with an l-value
+expression cannot change the validity or semantics of the outer expression, so
+long as the r-value expression has the same type and evaluates to the same
+value. This, in turn, ensures that once we know a value implements `IndexWith`,
+additionally learning that it implements `IndirectIndexWith` can't invalidate
+code that previously typechecked, or change its semantics.
+
+## Details
+
+A subscript expression has the form "_lhs_ `[` _index_ `]`". As in C++, this
+syntax has the same precedence as `.`, `->`, and function calls, and associates
+left-to-right with all of them.
+
+Its semantics are defined in terms of the following interfaces:
+
+```
+interface IndexWith(SubscriptType:! Type) {
+  let ElementType:! Type;
+  fn At[me: Self](subscript: SubscriptType) -> ElementType;
+  fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType*;
+}
+
+interface IndirectIndexWith(SubscriptType:! Type) {
+  impl as IndexWith(SubscriptType);
+  fn Addr[me: Self](subscript: SubscriptType) -> ElementType*;
+}
+```
+
+A subscript expression where _lhs_ has type `T` and _index_ has type `I` is
+rewritten based on the value category of _lhs_ and whether `T` is known to
+implement `IndirectIndexWith(I)`:
+
+-   If `T` implements `IndirectIndexWith(I)`, the expression is rewritten to
+    "`*((` _lhs_ `).(IndirectIndexWith(I).Addr)(` _index_ `))`".
+-   Otherwise, if _lhs_ is an l-value, the expression is rewritten to "`*((`
+    _lhs_ `).(IndexWith(I).Addr)(` _index_ `))`".
+-   Otherwise, the expression is rewritten to "`(` _lhs_ `).(IndexWith(I).At)(`
+    _index_ `)`".
+
+On their own, these rules would oblige `IndirectIndexWith` types to define three
+methods to support a single syntax. Worse, they would permit those types to
+define those methods in inconsistent ways, which would violate coherence. To
+avoid those problems, `IndirectIndexWith` provides a blanket `final impl` for
+`IndexWith`:
+
+```
+final external impl forall
+    [SubscriptType:! Type, T:! IndirectIndexWith(SubscriptType)]
+    T as IndexWith(SubscriptType) {
+  let ElementType:! Type = T.(IndirectIndexWith(SubscriptType)).ElementType;
+  fn At[me: Self](subscript: SubscriptType) -> ElementType {
+    return *(me.(IndirectIndexWith(SubscriptType).Addr)(index));
+  }
+  fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType* {
+    return me->(IndirectIndexWith(SubscriptType).Addr)(index);
+  }
+}
+```
+
+Thus, a type that implements `IndirectIndexWith` need not, and cannot, provide
+its own definitions of `IndexWith.At` and `IndexWith.Addr`.
+
+### Examples
+
+An array type could implement subscripting like so:
+
+```
+class Array(template T:! Type) {
+  external impl as IndexWith(like i64) {
+    let ElementType:! Type = T;
+    fn At[me: Self](subscript: i64) -> T;
+    fn Addr[addr me: Self*](subscript: i64) -> T*;
+  }
+}
+```
+
+And a slice type could look like this:
+
+```
+class Slice(T:! Type) {
+  external impl as IndirectIndexWith(like i64) {
+    let ElementType:! Type = T;
+    fn Addr[me: Self](subscript: i64) -> T*;
+  }
+}
+```
+
+We can even approximate support for indexing on tuple types. We don't have
+enough of a design for metaprogramming or variadics to specify it generically,
+but we can illustrate how it would look for a particular tuple type, such as
+`(i8, f64)`:
+
+```
+external impl (i8, f64) as IndexWith(typeof(0)) {
+  let ElementType:! Type = i8;
+  fn At[me: Self](0) -> i8;
+  fn Addr[addr me: Self*](0) -> i8*;
+}
+
+external impl (i8, f64) as IndexWith(typeof(1)) {
+  let ElementType:! Type = f64;
+  fn At[me: Self](1) -> f64;
+  fn Addr[addr me: Self*](1) -> f64*;
+}
+```
+
+However, this approach presupposes that every integer literal has a distinct
+type, which is not yet settled. It also only supports indexing with integer
+literals, not with constant values of ordinary integer types. This is somewhat
+contrary to Carbon's metaprogramming philosophy: Carbon treats types as values
+in order to make metaprogramming more like ordinary programming, but here we're
+doing the reverse, treating values as types. Consequently, the eventual design
+for tuple-style indexing may look very different.
+
+## Rationale based on Carbon's goals
+
+Indexing is a fundamental operation in procedural programming, and is especially
+common in performance-critical code, so supporting it with a concise, familiar
+syntax helps us make
+[code easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write),
+and support
+[performance-critical software](/docs/project/goals.md#performance-critical-software).
+Using the same syntax as C++, even for slice-like types, supports
+[interoperability with and migration from existing C++ code](/docs/project/goals.md#interoperability-with-and-migration-from-existing-c-code).
+
+## Alternatives considered
+
+### Different subscripting syntaxes
+
+The reason that slices have different semantics from arrays is that a slice
+represents an indirection, like a pointer. Indexing behaves the same for slice
+l-values and r-values for the same reason that dereferencing behaves the same
+for pointer l-values and r-values. Thus, our attempts to make a single
+subscripting syntax work for both arrays and slices are arguably analogous to
+trying to make the same method-call syntax work for both pointers and objects.
+Or, to put it another way, the difficulty with slices is that they behave like
+_references_.
+
+From that point of view, it would be more consistent to have different
+subscripting syntaxes for the two cases. The most naive version of that would be
+to say that only array-like types support subscripting, and slice-like types
+should instead support dereferencing, so indexing into a slice would look like
+`(*s)[i]`. However, that would be syntactically onerous, and it's not clear how
+to define the type of `*s` in a way that doesn't beg the question.
+
+Instead, we could define a distinct syntax for slice subscripting, such as
+`s*[i]`, and define separate interfaces for the two subscript syntax. The two
+interfaces would be very similar to `IndexWith` and `IndirectIndexWith`, but
+neither would extend the other, and the rewrite from subscript syntax to method
+calls would not depend on type information (although it would depend on value
+category). However, any such syntax would be quite novel and unfamiliar, and
+there seem to be few good choices for how to spell it. `s*[i]` is almost the
+only syntax that seems both viable and somewhat mnemonic, but it would add
+further complexity to the already-fraught parsing of `*` (for humans as well as
+tools).
+
+### Multiple indices
+
+This proposal does not support multiple comma-separated indices (such as
+`a[i, j]`), which is desirable for types like multidimensional arrays. We do
+support syntax like `a[(i, j)]`, which is a single index whose type is a tuple,
+but the extra parens are syntactically noisy. We could add those parens
+implicitly, but that would effectively move the syntactic noise to the
+implementation, even for the single-index case (for example
+`impl Foo as IndexWith((i64,))`). It should be much cleaner to make the
+interfaces and their methods variadic, once we have a design for variadics.
+
+### Read-only subscripting
+
+This proposal does not provide an obvious path for supporting types like C++'s
+`std::string_view` or `std::span<const T>`, whose subscripting operations expose
+read-only access to their contents. It is tempting to try to extend this
+proposal to support those use cases, both because of their inherent importance
+and because it already has to deal with read-only versus read-write access in
+order to support array r-values.
+
+However, there's a fundamental difference between the true immutability of
+something like an array r-value, and the contextual lack of mutable _access_
+provided by something like `string_view`. While value categories can express the
+former, they are not well-suited to expressing the latter. To address these use
+cases, Carbon will probably need something like C++'s `const` type system, but
+that should be largely orthogonal to this proposal.
+
+### Rvalue-only subscripting
+
+This proposal does not support subscripting operations that can't produce
+l-values. In particular, this means it does not support using subscript syntax
+to form slices, as in Python's `a[i:j]` or Swift's `a[i...j]`. To support this,
+we would need a separate pair of interfaces that return by value:
+
+```
+interface RValueIndexWith(SubscriptType:! Type) {
+  let ElementType:! Type;
+  fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType;
+}
+
+interface RValueIndirectIndexWith(SubscriptType:! Type) {
+  extends RValueIndexWith(SubscriptType);
+  let ElementType:! Type;
+  fn Addr[me: Self](subscript: SubscriptType) -> ElementType;
+}
+
+final external impl [SubscriptType:! Type,
+                     T:! RValueIndirectIndexWith(SubscriptType)]
+    T as IndexWith(SubscriptType) {
+  let ElementType:! Type =
+      T.(RValueIndirectIndexWith(SubscriptType)).ElementType;
+  fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType {
+    return me->Addr(subscript);
+  }
+}
+```
+
+Note that we still need a pair of interfaces, and a blanket final `impl` to
+enforce consistency, because arrays and slices have different semantics in this
+context as well: taking a slice of an r-value array is invalid, because taking a
+slice is equivalent to (and presumably implemented in terms of) taking an
+address. On the other hand, taking a slice of an r-value slice is valid and
+should be supported.
+
+We would likewise need to extend the rewrite rules for subscript syntax to
+detect and use implementations of these interfaces. This should not lead to a
+combinatorial explosion of cases, though; if `T` implements both `IndexWith(I)`
+and `RValueIndexWith(I)`, the program should be ill-formed due to ambiguity.
+
+However, it's not clear if this would provide enough benefit to justify the
+added complexity.
+
+### Map-like subscripting
+
+This proposal does not support subscripting operations that insert new elements
+into a collection, as in C++'s `std::map` and `std::unordered_map`, because it
+requires subscriptable types to support subscripting of r-values, which are
+immutable. We could support this with an additional interface for types that are
+subscriptable only as l-values, and a corresponding extension to the rewrite
+rules.
+
+However, it's debatable whether such insertion behavior is desirable; it has not
+been a clear-cut success in C++. Code like `x = m[i];` looks like it reads from
+`m`, and the fact that it can write to `m` is surprising, and easy for even
+experienced programmers to overlook. Even for wary readers, it doesn't convey
+the author's intent, because it's not clear whether the author assumed `i` is
+present, or is relying on the implicit insertion. Furthermore, the implicit
+insertion means that `x = m[i];` won't compile when `m` is const. On the other
+hand, while it's relatively unsurprising that `m[i] = x;` might insert, that
+insertion is also potentially inefficient, since it must default-construct a new
+value before assigning `x` to it.
+
+We could fix most of these problems by giving special treatment to syntax of the
+form `m[i] = x;`, and defining separate methods for it:
+
+```
+interface IndexWith(SubscriptType:! Type) {
+  let ElementType:! Type;
+  fn At[me: Self](subscript: SubscriptType) -> ElementType;
+  fn Addr[addr me: Self*](subscript: SubscriptType) -> ElementType*;
+  fn Assign[addr me: Self*](subscript: SubscriptType, element: ElementType) {
+    (*me->Addr(index)) = element;
+  }
+}
+
+interface IndirectIndexWith(SubscriptType:! Type) {
+  extends IndexWith(SubscriptType);
+  let ElementType:! Type;
+  fn Addr[me: Self](subscript: SubscriptType) -> ElementType*;
+  fn Assign[me: Self](subscript: SubscriptType, element: ElementType) {
+    me->Addr(subscript) = element;
+  }
+}
+
+final external impl [SubscriptType:! Type, T:! IndirectIndexWith(SubscriptType)]
+    T as IndexWith(SubscriptType) {
+  ...
+  fn Assign[addr me: Self*](subscript: SubscriptType, element: ElementType) {
+    me->(IndirectIndexWith.AssignAsRValue)(subscript, element);
+  }
+}
+```
+
+With this approach, expressions of the form `m[i] = x` would be rewritten to
+call the appropriate `Assign` method, while all other subscript expressions are
+rewritten as in the primary proposal. However, this does add some complexity to
+both the interfaces and the rewrite rules (for example, we presumably want
+`(m[i]) = x` to be treated the same as `m[i] = x`), so I'm leaving this as a
+potential future extension.
+
+#### Usage-based rewriting
+
+As an alternative way of getting `std::map`-like behavior, we could rewrite
+subscript expressions to use `At` instead of `Addr` when the usage context only
+needs an r-value, even if the operand is an l-value, in much the same way that
+Rust rewrites subscript expressions depending on whether the usage expects the
+result to be mutable. This would enable type authors to make `m[i] = x;`
+potentially-inserting while ensuring that `x = m[i];` is not, by making `Addr`
+but not `At` potentially-inserting.
+
+However, this would mean that `m[i].Method()` is potentially inserting if
+`Method` takes `me` by pointer, and would have the same performance drawback as
+in C++. Furthermore, it would violate the principle (which we have discussed but
+not yet adopted) that type information should flow from subexpressions to the
+parent expression, but never the reverse.
+
+A variant of this approach would be to allow the compiler to choose between `At`
+and `Addr` when the operand is an l-value but the usage context only needs an
+r-value. This might enable the compiler to make the generated code more
+efficient and/or safer, and might create pressure for libraries to ensure `At`
+and `Addr` are consistent with each other.
+
+However, if the compiler predictably chooses `At` under those conditions, that
+could create a temptation for library authors to leverage that implementation
+detail to try to implement a `std::map`-like API, by having `Addr` implicitly
+insert while `At` does not. That would have all the same drawbacks as if we
+supported it intentionally, plus the evolutionary drawbacks of having user code
+depend on unsupported implementation details.