11.4. Defining New Syntax🔗

Lean's uniform representation of syntax is very general and flexible. This means that extensions to Lean's parser do not require extensions to the representation of parsed syntax.

11.4.1. Syntax Model

Lean's parser produces a concrete syntax tree, of type Lean.Syntax. Lean.Syntax is an inductive type that represents all of Lean's syntax, including commands, terms, tactics, and any custom extensions. All of these are represented by a few basic building blocks:

Atoms

Atoms are the fundamental terminals of the grammar, including literals (such as those for characters and numbers), parentheses, operators, and keywords.

Identifiers

Identifiers represent names, such as x, Nat, or Nat.add. Identifier syntax includes a list of pre-resolved names that the identifier might refer to.

Nodes

Nodes represent the parsing of nonterminals. Nodes contain a syntax kind, which identifies the syntax rule that the node results from, along with an array of child Syntax values.

Missing Syntax

When the parser encounters an error, it returns a partial result, so Lean can provide some feedback about partially-written programs or programs that contain mistakes. Partial results contain one or more instances of missing syntax.

Atoms and identifiers are collectively referred to as tokens.

🔗inductive type
Lean.Syntax : Type

Syntax objects used by the parser, macro expander, delaborator, etc.

Constructors

  • missing : Lean.Syntax

    A missing syntax corresponds to a portion of the syntax tree that is missing because of a parse error. The indexing operator on Syntax also returns missing for indexing out of bounds.

  • node (info : Lean.SourceInfo) (kind : Lean.SyntaxNodeKind) (args : Array Lean.Syntax) : Lean.Syntax

    Node in the syntax tree.

    The info field is used by the delaborator to store the position of the subexpression corresponding to this node. The parser sets the info field to none. The parser sets the info field to none, with position retrieval continuing recursively. Nodes created by quotations use the result from SourceInfo.fromRef so that they are marked as synthetic even when the leading/trailing token is not. The delaborator uses the info field to store the position of the subexpression corresponding to this node.

    (Remark: the node constructor did not have an info field in previous versions. This caused a bug in the interactive widgets, where the popup for a + b was the same as for a. The delaborator used to associate subexpressions with pretty-printed syntax by setting the (string) position of the first atom/identifier to the (expression) position of the subexpression. For example, both a and a + b have the same first identifier, and so their infos got mixed up.)

  • atom (info : Lean.SourceInfo) (val : String) : Lean.Syntax

    An atom corresponds to a keyword or piece of literal unquoted syntax. These correspond to quoted strings inside syntax declarations. For example, in (x + y), "(", "+" and ")" are atom and x and y are ident.

  • ident (info : Lean.SourceInfo) (rawVal : Substring) (val : Lean.Name) (preresolved : List Lean.Syntax.Preresolved) : Lean.Syntax

    An ident corresponds to an identifier as parsed by the ident or rawIdent parsers.

    • rawVal is the literal substring from the input file

    • val is the parsed identifier (with hygiene)

    • preresolved is the list of possible declarations this could refer to

🔗inductive type
Lean.Syntax.Preresolved : Type

Binding information resolved and stored at compile time of a syntax quotation. Note: We do not statically know whether a syntax expects a namespace or term name, so a Syntax.ident may contain both preresolution kinds.

Constructors

11.4.2. Syntax Node Kinds

Syntax node kinds typically identify the parser that produced the node. This is one place where the names given to operators or notations (or their automatically-generated internal names) occur. While only nodes contain a field that identifies their kind, identifiers have the kind identKind by convention, while atoms have their internal string as their kind by convention. Lean's parser wraps each keyword atom KW in a singleton node whose kind is ​`token.KW. The kind of a syntax value can be extracted using Syntax.getKind.

🔗def
Lean.SyntaxNodeKind : Type

A SyntaxNodeKind classifies Syntax.node values. It is an abbreviation for Name, and you can use name literals to construct SyntaxNodeKinds, but they need not refer to declarations in the environment. Conventionally, a SyntaxNodeKind will correspond to the Parser or ParserDesc declaration that parses it.

🔗def
Lean.Syntax.isOfKind (stx : Lean.Syntax)
    (k : Lean.SyntaxNodeKind) : Bool

Is this a syntax with node kind k?

🔗def
Lean.Syntax.getKind (stx : Lean.Syntax)
    : Lean.SyntaxNodeKind

Gets the kind of a Syntax node. For non-node syntax, we use "pseudo kinds": identKind for ident, missing for missing, and the atom's string literal for atoms.

🔗def
Lean.Syntax.setKind (stx : Lean.Syntax)
    (k : Lean.SyntaxNodeKind) : Lean.Syntax

Changes the kind at the root of a Syntax node to k. Does nothing for non-node nodes.

11.4.3. Token and Literal Kinds

A number of named kinds are associated with the basic tokens produced by the parser. Typically, single-token syntax productions consist of a node that contains a single atom; the kind saved in the node allows the value to be recognized. Atoms for literals are not interpreted by the parser: string atoms include their leading and trailing double-quote characters along with any escape sequences contained within, and hexadecimal numerals are saved as a string that begins with "0x". Helpers such as Lean.TSyntax.getString are provided to perform this decoding on demand.

🔗def
Lean.identKind : Lean.SyntaxNodeKind

ident is not actually used as a node kind, but it is returned by getKind in the ident case so that things that handle different node kinds can also handle ident.

🔗def
Lean.strLitKind : Lean.SyntaxNodeKind

str is the node kind of string literals like "foo".

🔗def
Lean.interpolatedStrKind : Lean.SyntaxNodeKind

interpolatedStrKind is the node kind of an interpolated string literal like "value = {x}" in s!"value = {x}".

🔗def
Lean.interpolatedStrLitKind
    : Lean.SyntaxNodeKind

interpolatedStrLitKind is the node kind of interpolated string literal fragments like "value = { and }" in s!"value = {x}".

🔗def
Lean.charLitKind : Lean.SyntaxNodeKind

char is the node kind of character literals like 'A'.

🔗def
Lean.numLitKind : Lean.SyntaxNodeKind

num is the node kind of number literals like 42.

🔗def
Lean.scientificLitKind : Lean.SyntaxNodeKind

scientific is the node kind of floating point literals like 1.23e-3.

🔗def
Lean.nameLitKind : Lean.SyntaxNodeKind

name is the node kind of name literals like `foo.

🔗def
Lean.fieldIdxKind : Lean.SyntaxNodeKind

fieldIdx is the node kind of projection indices like the 2 in x.2.

11.4.4. Internal Kinds

🔗def
Lean.groupKind : Lean.SyntaxNodeKind

The group kind is by the group parser, to avoid confusing with the null kind when used inside optional.

🔗def
Lean.nullKind : Lean.SyntaxNodeKind

The null kind is used for raw list parsers like many.

🔗def
Lean.choiceKind : Lean.SyntaxNodeKind

The choice kind is used when a piece of syntax has multiple parses, and the determination of which to use is deferred until typing information is available.

🔗def
Lean.hygieneInfoKind : Lean.SyntaxNodeKind

hygieneInfo is the node kind of the hygieneInfo parser, which is an "invisible token" which captures the hygiene information at the current point without parsing anything.

They can be used to generate identifiers (with Lean.HygieneInfo.mkIdent) as if they were introduced by the calling context, not the called macro.

11.4.5. Source Positions🔗

Atoms, identifiers, and nodes optionally contain source information that tracks their correspondence with the original file. The parser saves source information for all tokens, but not for nodes; position information for parsed nodes is reconstructed from their first and last tokens. Not all Syntax data results from the parser: it may be the result of macro expansion, in which case it typically contains a mix of generated and parsed syntax, or it may be the result of delaborating an internal term to display it to a user. In these use cases, nodes may themselves contain source information.

Source information comes in two varieties:

Original

Original source information comes from the parser. In addition to the original source location, it also contains leading and trailing whitespace that was skipped by the parser, which allows the original string to be reconstructed. This whitespace is saved as offsets into the string representation of the original source code (that is, as Substring) to avoid having to allocate copies of substrings.

Synthetic

Synthetic source information comes from metaprograms (including macros) or from Lean's internals. Because there is no original string to be reconstructed, it does not save leading and trailing whitespace. Synthetic source positions are used to provide accurate feedback even when terms have been automatically transformed, as well as to track the correspondence between elaborated expressions and their presentation in Lean's output. A synthetic position may be marked canonical, in which case some operations that would ordinarily ignore synthetic positions will treat it as if it were not.

🔗inductive type
Lean.SourceInfo : Type

Source information of tokens.

Constructors

  • original (leading : Substring) (pos : String.Pos) (trailing : Substring) (endPos : String.Pos) : Lean.SourceInfo

    Token from original input with whitespace and position information. leading will be inferred after parsing by Syntax.updateLeading. During parsing, it is not at all clear what the preceding token was, especially with backtracking.

  • Lean.SourceInfo.synthetic (pos endPos : String.Pos) (canonical : Bool := false) : Lean.SourceInfo

    Synthesized syntax (e.g. from a quotation) annotated with a span from the original source. In the delaborator, we "misuse" this constructor to store synthetic positions identifying subterms.

    The canonical flag on synthetic syntax is enabled for syntax that is not literally part of the original input syntax but should be treated "as if" the user really wrote it for the purpose of hovers and error messages. This is usually used on identifiers, to connect the binding site to the user's original syntax even if the name of the identifier changes during expansion, as well as on tokens where we will attach targeted messages.

    The syntax token%$stx in a syntax quotation will annotate the token token with the span from stx and also mark it as canonical.

    As a rough guide, a macro expansion should only use a given piece of input syntax in a single canonical token, although this is sometimes violated when the same identifier is used to declare two binders, as in the macro expansion for dependent if:

    `(if $h : $cond then $t else $e) ~>
    `(dite $cond (fun $h => $t) (fun $h => $t))
    

    In these cases if the user hovers over h they will see information about both binding sites.

  • none : Lean.SourceInfo

    Synthesized token without position information.

11.4.6. Inspecting Syntax

There are three primary ways to inspect Syntax values:

The Repr Instance

The Repr Syntax instance produces a very detailed representation of syntax in terms of the constructors of the Syntax type.

The ToString Instance

The ToString Syntax instance produces a compact view, representing certain syntax kinds with particular conventions that can make it easier to read at a glance. This instance suppresses source position information.

The Pretty Printer

Lean's pretty printer attempts to render the syntax as it would look in a source file, but fails if the nesting structure of the syntax doesn't match the expected shape.

Representing Syntax as Constructors

The Repr instance's representation of syntax can be inspected by quoting it in the context of Lean.Parser.Command.eval : command`#eval e` evaluates the expression `e` by compiling and evaluating it. * The command attempts to use `ToExpr`, `Repr`, or `ToString` instances to print the result. * If `e` is a monadic value of type `m ty`, then the command tries to adapt the monad `m` to one of the monads that `#eval` supports, which include `IO`, `CoreM`, `MetaM`, `TermElabM`, and `CommandElabM`. Users can define `MonadEval` instances to extend the list of supported monads. The `#eval` command gracefully degrades in capability depending on what is imported. Importing the `Lean.Elab.Command` module provides full capabilities. Due to unsoundness, `#eval` refuses to evaluate expressions that depend on `sorry`, even indirectly, since the presence of `sorry` can lead to runtime instability and crashes. This check can be overridden with the `#eval! e` command. Options: * If `eval.pp` is true (default: true) then tries to use `ToExpr` instances to make use of the usual pretty printer. Otherwise, only tries using `Repr` and `ToString` instances. * If `eval.type` is true (default: false) then pretty prints the type of the evaluated value. * If `eval.derive.repr` is true (default: true) then attempts to auto-derive a `Repr` instance when there is no other way to print the result. See also: `#reduce e` for evaluation by term reduction. #eval, which can run actions in the command elaboration monad CommandElabM. To reduce the size of the example output, the helper removeSourceInfo is used to remove source information prior to display.

partial def removeSourceInfo : Syntax Syntax | .atom _ str => .atom .none str | .ident _ str x pre => .ident .none str x pre | .node _ k children => .node .none k (children.map removeSourceInfo) | .missing => .missing Lean.Syntax.node (Lean.SourceInfo.none) `«term_+_» #[Lean.Syntax.node (Lean.SourceInfo.none) `num #[Lean.Syntax.atom (Lean.SourceInfo.none) "2"], Lean.Syntax.atom (Lean.SourceInfo.none) "+", Lean.Syntax.missing]#eval do let stx `(2 + $(.missing)) logInfo (repr (removeSourceInfo stx.raw))
Lean.Syntax.node
  (Lean.SourceInfo.none)
  `«term_+_»
  #[Lean.Syntax.node (Lean.SourceInfo.none) `num #[Lean.Syntax.atom (Lean.SourceInfo.none) "2"],
    Lean.Syntax.atom (Lean.SourceInfo.none) "+", Lean.Syntax.missing]

In the second example, macro scopes inserted by quotation are visible on the call to List.length.

Lean.Syntax.node (Lean.SourceInfo.none) `Lean.Parser.Term.app #[Lean.Syntax.ident (Lean.SourceInfo.none) "List.length".toSubstring (Lean.Name.mkNum `List.length._@.Manual.NotationsMacros.SyntaxDef._hyg 2) [Lean.Syntax.Preresolved.decl `List.length [], Lean.Syntax.Preresolved.namespace `List.length], Lean.Syntax.node (Lean.SourceInfo.none) `null #[Lean.Syntax.node (Lean.SourceInfo.none) `«term[_]» #[Lean.Syntax.atom (Lean.SourceInfo.none) "[", Lean.Syntax.node (Lean.SourceInfo.none) `null #[Lean.Syntax.node (Lean.SourceInfo.none) `str #[Lean.Syntax.atom (Lean.SourceInfo.none) "\"Rose\""], Lean.Syntax.atom (Lean.SourceInfo.none) ",", Lean.Syntax.node (Lean.SourceInfo.none) `str #[Lean.Syntax.atom (Lean.SourceInfo.none) "\"Daffodil\""], Lean.Syntax.atom (Lean.SourceInfo.none) ",", Lean.Syntax.node (Lean.SourceInfo.none) `str #[Lean.Syntax.atom (Lean.SourceInfo.none) "\"Lily\""]], Lean.Syntax.atom (Lean.SourceInfo.none) "]"]]]#eval do let stx `(List.length ["Rose", "Daffodil", "Lily"]) logInfo (repr (removeSourceInfo stx.raw))

The contents of the pre-resolved identifier List.length are visible here:

Lean.Syntax.node
  (Lean.SourceInfo.none)
  `Lean.Parser.Term.app
  #[Lean.Syntax.ident
      (Lean.SourceInfo.none)
      "List.length".toSubstring
      (Lean.Name.mkNum `List.length._@.Manual.NotationsMacros.SyntaxDef._hyg 2)
      [Lean.Syntax.Preresolved.decl `List.length [], Lean.Syntax.Preresolved.namespace `List.length],
    Lean.Syntax.node
      (Lean.SourceInfo.none)
      `null
      #[Lean.Syntax.node
          (Lean.SourceInfo.none)
          `«term[_]»
          #[Lean.Syntax.atom (Lean.SourceInfo.none) "[",
            Lean.Syntax.node
              (Lean.SourceInfo.none)
              `null
              #[Lean.Syntax.node (Lean.SourceInfo.none) `str #[Lean.Syntax.atom (Lean.SourceInfo.none) "\"Rose\""],
                Lean.Syntax.atom (Lean.SourceInfo.none) ",",
                Lean.Syntax.node (Lean.SourceInfo.none) `str #[Lean.Syntax.atom (Lean.SourceInfo.none) "\"Daffodil\""],
                Lean.Syntax.atom (Lean.SourceInfo.none) ",",
                Lean.Syntax.node (Lean.SourceInfo.none) `str #[Lean.Syntax.atom (Lean.SourceInfo.none) "\"Lily\""]],
            Lean.Syntax.atom (Lean.SourceInfo.none) "]"]]]

The ToString instance represents the constructors of Syntax as follows:

  • The ident constructor is represented as the underlying name. Source information and pre-resolved names are not shown.

  • The atom constructor is represented as a string.

  • The missing constructor is represented by <missing>.

  • The representation of the node constructor depends on the kind. If the kind is ​`null, then the node is represented by its child nodes order in square brackets. Otherwise, the node is represented by its kind followed by its child nodes, both surrounded by parentheses.

Syntax as Strings

The string representation of syntax can be inspected by quoting it in the context of Lean.Parser.Command.eval : command`#eval e` evaluates the expression `e` by compiling and evaluating it. * The command attempts to use `ToExpr`, `Repr`, or `ToString` instances to print the result. * If `e` is a monadic value of type `m ty`, then the command tries to adapt the monad `m` to one of the monads that `#eval` supports, which include `IO`, `CoreM`, `MetaM`, `TermElabM`, and `CommandElabM`. Users can define `MonadEval` instances to extend the list of supported monads. The `#eval` command gracefully degrades in capability depending on what is imported. Importing the `Lean.Elab.Command` module provides full capabilities. Due to unsoundness, `#eval` refuses to evaluate expressions that depend on `sorry`, even indirectly, since the presence of `sorry` can lead to runtime instability and crashes. This check can be overridden with the `#eval! e` command. Options: * If `eval.pp` is true (default: true) then tries to use `ToExpr` instances to make use of the usual pretty printer. Otherwise, only tries using `Repr` and `ToString` instances. * If `eval.type` is true (default: false) then pretty prints the type of the evaluated value. * If `eval.derive.repr` is true (default: true) then attempts to auto-derive a `Repr` instance when there is no other way to print the result. See also: `#reduce e` for evaluation by term reduction. #eval, which can run actions in the command elaboration monad CommandElabM.

(«term_+_» (num "2") "+" <missing>)#eval do let stx `(2 + $(.missing)) logInfo (toString stx)
(«term_+_» (num "2") "+" <missing>)

In the second example, macro scopes inserted by quotation are visible on the call to List.length.

(Term.app `List.length._@.Manual.NotationsMacros.SyntaxDef._hyg.2 [(«term[_]» "[" [(str "\"Rose\"") "," (str "\"Daffodil\"") "," (str "\"Lily\"")] "]")])#eval do let stx `(List.length ["Rose", "Daffodil", "Lily"]) logInfo (toString stx)
(Term.app
 `List.length._@.Manual.NotationsMacros.SyntaxDef._hyg.2
 [(«term[_]» "[" [(str "\"Rose\"") "," (str "\"Daffodil\"") "," (str "\"Lily\"")] "]")])

Pretty printing syntax is typically most useful when including it in a message to a user. Normally, Lean automatically invokes the pretty printer when necessary. However, ppTerm can be explicitly invoked if needed.

Pretty-Printed Syntax

The string representation of syntax can be inspected by quoting it in the context of Lean.Parser.Command.eval : command`#eval e` evaluates the expression `e` by compiling and evaluating it. * The command attempts to use `ToExpr`, `Repr`, or `ToString` instances to print the result. * If `e` is a monadic value of type `m ty`, then the command tries to adapt the monad `m` to one of the monads that `#eval` supports, which include `IO`, `CoreM`, `MetaM`, `TermElabM`, and `CommandElabM`. Users can define `MonadEval` instances to extend the list of supported monads. The `#eval` command gracefully degrades in capability depending on what is imported. Importing the `Lean.Elab.Command` module provides full capabilities. Due to unsoundness, `#eval` refuses to evaluate expressions that depend on `sorry`, even indirectly, since the presence of `sorry` can lead to runtime instability and crashes. This check can be overridden with the `#eval! e` command. Options: * If `eval.pp` is true (default: true) then tries to use `ToExpr` instances to make use of the usual pretty printer. Otherwise, only tries using `Repr` and `ToString` instances. * If `eval.type` is true (default: false) then pretty prints the type of the evaluated value. * If `eval.derive.repr` is true (default: true) then attempts to auto-derive a `Repr` instance when there is no other way to print the result. See also: `#reduce e` for evaluation by term reduction. #eval, which can run actions in the command elaboration monad CommandElabM. Because new syntax declarations also equip the pretty printer with instructions for displaying them, the pretty printer requires a configuration object. This context can be constructed with a helper:

def getPPContext : CommandElabM PPContext := do return { env := ( getEnv), opts := ( getOptions), currNamespace := ( getCurrNamespace), openDecls := ( getOpenDecls) } 2 + 5#eval show CommandElabM Unit from do let stx `(2 + 5) let fmt ppTerm ( getPPContext) stx logInfo fmt
2 + 5

In the second example, the macro scopes inserted on List.length by quotation cause it to be displayed with a dagger ().

List.length✝ ["Rose", "Daffodil", "Lily"]#eval do let stx `(List.length ["Rose", "Daffodil", "Lily"]) let fmt ppTerm ( getPPContext) stx logInfo fmt
List.length✝ ["Rose", "Daffodil", "Lily"]

Pretty printing wraps lines and inserts indentation automatically. A coercion typically converts the pretty printer's output to the type expected by logInfo, using a default layout width. The width can be controlled by explicitly calling pretty with a named argument.

List.length✝ ["Rose", "Daffodil", "Lily", "Rose", "Daffodil", "Lily", "Rose", "Daffodil", "Lily"]#eval do let flowers := #["Rose", "Daffodil", "Lily"] let manyFlowers := flowers ++ flowers ++ flowers let stx `(List.length [$(manyFlowers.map (quote (k := `term))),*]) let fmt ppTerm ( getPPContext) stx logInfo (fmt.pretty (width := 40))
List.length✝
  ["Rose", "Daffodil", "Lily", "Rose",
    "Daffodil", "Lily", "Rose",
    "Daffodil", "Lily"]

11.4.7. Typed Syntax

Syntax may additionally be annotated with a type that specifies which syntax category it belongs to. The TSyntax structure contains a type-level list of syntax categories along with a syntax tree. The list of syntax categories typically contains precisely one element, in which case the list structure itself is not shown.

🔗structure
Lean.TSyntax (ks : Lean.SyntaxNodeKinds) : Type

A Syntax value of one of the given syntax kinds. Note that while syntax quotations produce/expect TSyntax values of the correct kinds, this is not otherwise enforced and can easily be circumvented by direct use of the constructor. The namespace TSyntax.Compat can be opened to expose a general coercion from Syntax to any TSyntax ks for porting older code.

Constructor

Lean.TSyntax.mk

Fields

raw : Lean.Syntax

The underlying Syntax value.

Quasiquotations prevent the substitution of typed syntax that does not come from the correct syntactic category. For many of Lean's built-in syntactic categories, there is a set of coercions that appropriately wrap one kind of syntax for another category, such as a coercion from the syntax of string literals to the syntax of terms. Additionally, many helper functions that are only valid on some syntactic categories are defined for the appropriate typed syntax only.

The constructor of TSyntax is public, and nothing prevents users from constructing values that break internal invariants. The use of TSyntax should be seen as a way to reduce common mistakes, rather than rule them out entirely.

In addition to TSyntax, there are types that represent arrays of syntax, with or without separators. These correspond to repeated elements in syntax declarations or antiquotations.

🔗def
Lean.TSyntaxArray (ks : Lean.SyntaxNodeKinds)
    : Type

An array of syntaxes of kind ks.

🔗structure
Lean.Syntax.TSepArray
    (ks : Lean.SyntaxNodeKinds) (sep : String)
    : Type

A typed version of SepArray.

Constructor

Lean.Syntax.TSepArray.mk

Fields

elemsAndSeps : Array Lean.Syntax

The array of elements and separators, ordered like #[el1, sep1, el2, sep2, el3].

11.4.8. Aliases

A number of aliases are provided for commonly-used typed syntax varieties. These aliases allow code to be written at a higher level of abstraction.

🔗def
Lean.Syntax.Term : Type
🔗def
Lean.Syntax.Command : Type
🔗def
Lean.Syntax.Level : Type
🔗def
Lean.Syntax.Tactic : Type
🔗def
Lean.Syntax.Prec : Type
🔗def
Lean.Syntax.Prio : Type
🔗def
Lean.Syntax.Ident : Type
🔗def
Lean.Syntax.StrLit : Type
🔗def
Lean.Syntax.CharLit : Type
🔗def
Lean.Syntax.NameLit : Type
🔗def
Lean.Syntax.NumLit : Type
🔗def
Lean.Syntax.ScientificLit : Type
🔗def
Lean.Syntax.HygieneInfo : Type

11.4.9. Helpers for Typed Syntax🔗

For literals, Lean's parser produces a singleton node that contains an atom. The inner atom contains a string with source information, while the node's kind specifies how the atom is to be interpreted. This may involve decoding string escape sequences or interpreting base-16 numeric literals. The helpers in this section perform the correct interpretation.

🔗def
Lean.TSyntax.getId (s : Lean.Ident) : Lean.Name
🔗def
Lean.TSyntax.getName (s : Lean.NameLit)
    : Lean.Name
🔗def
Lean.TSyntax.getNat (s : Lean.NumLit) : Nat
🔗def
Lean.TSyntax.getScientific
    (s : Lean.ScientificLit) : Nat × Bool × Nat
🔗def
Lean.TSyntax.getString (s : Lean.StrLit)
    : String
🔗def
Lean.TSyntax.getChar (s : Lean.CharLit) : Char
🔗def
Lean.TSyntax.getHygieneInfo
    (s : Lean.HygieneInfo) : Lean.Name

11.4.10. Syntax Categories🔗

Lean's parser contains a table of syntax categories, which correspond to nonterminals in a context-free grammar. Some of the most important categories are terms, commands, universe levels, priorities, precedences, and the categories that represent tokens such as literals. Typically, each syntax kind corresponds to a category. New categories can be declared using Lean.Parser.Command.syntaxCat : commanddeclare_syntax_cat.

syntax

Declares a new syntactic category.

command ::= ...
    | A `docComment` parses a "documentation comment" like `/-- foo -/`. This is not treated like
a regular comment (that is, as whitespace); it is parsed and forms part of the syntax tree structure.

A `docComment` node contains a `/--` atom and then the remainder of the comment, `foo -/` in this
example. Use `TSyntax.getDocString` to extract the body text from a doc string syntax node. docComment?
      declare_syntax_cat ident ((behavior := (catBehaviorBoth | catBehaviorSymbol)))?

The leading identifier behavior is an advanced feature that usually does not need to be modified. It controls the behavior of the parser when it encounters an identifier, and can sometimes cause the identifier to be treated as a non-reserved keyword instead. This is used to avoid turning the name of every tactic into a reserved keyword.

🔗inductive type
Lean.Parser.LeadingIdentBehavior : Type

The type LeadingIdentBehavior specifies how the parsing table lookup function behaves for identifiers. The function prattParser uses two tables leadingTable and trailingTable. They map tokens to parsers.

We use LeadingIdentBehavior.symbol and LeadingIdentBehavior.both and nonReservedSymbol parser to implement the tactic parsers. The idea is to avoid creating a reserved symbol for each builtin tactic (e.g., apply, assumption, etc.). That is, users may still use these symbols as identifiers (e.g., naming a function).

Constructors

  • default : Lean.Parser.LeadingIdentBehavior

    LeadingIdentBehavior.default: if the leading token is an identifier, then prattParser just executes the parsers associated with the auxiliary token "ident".

  • symbol : Lean.Parser.LeadingIdentBehavior

    LeadingIdentBehavior.symbol: if the leading token is an identifier <foo>, and there are parsers P associated with the token <foo>, then it executes P. Otherwise, it executes only the parsers associated with the auxiliary token "ident".

  • both : Lean.Parser.LeadingIdentBehavior

    LeadingIdentBehavior.both: if the leading token an identifier <foo>, the it executes the parsers associated with token <foo> and parsers associated with the auxiliary token "ident".

11.4.11. Syntax Rules

Each syntax category is associated with a set of syntax rules, which correspond to productions in a context-free grammar. Syntax rules can be defined using the Lean.Parser.Command.syntax : commandsyntax command.

syntax
command ::= ...
    | A `docComment` parses a "documentation comment" like `/-- foo -/`. This is not treated like
a regular comment (that is, as whitespace); it is parsed and forms part of the syntax tree structure.

A `docComment` node contains a `/--` atom and then the remainder of the comment, `foo -/` in this
example. Use `TSyntax.getDocString` to extract the body text from a doc string syntax node. docComment?
      attributes?
      `attrKind` matches `("scoped" <|> "local")?`, used before an attribute like `@[local simp]`. attrKind
      syntax(:prec)? ((name := ident))? ((priority := prio))? `p*` is shorthand for `many(p)`. It uses parser `p` 0 or more times, and produces a
`nullNode` containing the array of parsed results. This parser has arity 1.

If `p` has arity more than 1, it is auto-grouped in the items generated by the parser.
stx* : ident

As with operator and notation declarations, the contents of the documentation comments are shown to users while they interact with the new syntax. Attributes may be added to invoke compile-time metaprograms on the resulting definition.

Syntax rules interact with section scopes in the same manner as attributes, operators, and notations. By default, syntax rules are available to the parser in any module that transitively imports the one in which they are established, but they may be declared scoped or local to restrict their availability either to contexts in which the current namespace has been opened or to the current section scope, respectively.

When multiple syntax rules for a category can match the current input, the local longest-match rule is used to select one of them. Like notations and operators, if there is a tie for the longest match then the declared priorities are used to determine which parse result applies. If this still does not resolve the ambiguity, then all the results that tied are saved. The elaborator is expected to attempt all of them, succeeding when exactly one can be elaborated.

The syntax rule's precedence, written immediately after the Lean.Parser.Command.syntax : commandsyntax keyword, restricts the parser to use this new syntax only when the precedence context is at least the provided value. Just as with operators and notations, syntax rules may be manually provided with a name; if they are not, an otherwise-unused name is generated. Whether provided or generated, this name is used as the syntax kind in the resulting node.

The body of a syntax declaration is even more flexible than that of a notation. String literals specify atoms to match. Subterms may be drawn from any syntax category, rather than just terms, and they may be optional or repeated, with or without interleaved comma separators. Identifiers in syntax rules indicate syntax categories, rather than naming subterms as they do in notations.

Finally, the syntax rule specifies which syntax category it extends. It is an error to declare a syntax rule in a nonexistent category.

syntax

The syntactic category stx is the grammar of specifiers that may occur in the body of a Lean.Parser.Command.syntax : commandsyntax command.

String literals are parsed as atoms (including both keywords such as if, #eval, or where):

stx ::=
    str

Leading and trailing spaces in the strings do not affect parsing, but they cause Lean to insert spaces in the corresponding position when displaying the syntax in proof states and error messages. Ordinarily, valid identifiers occurring as atoms in syntax rules become reserved keywords. Preceding a string literal with an ampersand (&) suppresses this behavior:

stx ::= ...
    | &str

Identifiers specify the syntactic category expected in a given position, and may optionally provide a precedence:

stx ::= ...
    | ident(:prec)?

The * modifier is the Kleene star, matching zero or more repetitions of the preceding syntax. It can also be written using many.

stx ::= ...
    | `p*` is shorthand for `many(p)`. It uses parser `p` 0 or more times, and produces a
`nullNode` containing the array of parsed results. This parser has arity 1.

If `p` has arity more than 1, it is auto-grouped in the items generated by the parser.
stx *

The + modifier matches one or more repetitions of the preceding syntax. It can also be written using many1.

stx ::= ...
    | `p+` is shorthand for `many1(p)`. It uses parser `p` 1 or more times, and produces a
`nullNode` containing the array of parsed results. This parser has arity 1.

If `p` has arity more than 1, it is auto-grouped in the items generated by the parser.
stx +

The ? modifier makes a subterm optional, and matches zero or one, but not more, repetitions of the preceding syntax. It can also be written as optional.

stx ::= ...
    | `(p)?` is shorthand for `optional(p)`. It uses parser `p` 0 or 1 times, and produces a
`nullNode` containing the array of parsed results. This parser has arity 1.

`p` is allowed to have arity n > 1 (in which case the node will have either 0 or n children),
but if it has arity 0 then the result will be ambiguous.

Because `?` is an identifier character, `ident?` will not work as intended.
You have to write either `ident ?` or `(ident)?` for it to parse as the `?` combinator
applied to the `ident` parser.
stx ?
stx ::= ...
    | Returns `some x` if `f` succeeds with value `x`, else returns `none`.
optional(stx)

The ,* modifier matches zero or more repetitions of the preceding syntax with interleaved commas. It can also be written using sepBy.

stx ::= ...
    | `p,*` is shorthand for `sepBy(p, ",")`. It parses 0 or more occurrences of
`p` separated by `,`, that is: `empty | p | p,p | p,p,p | ...`.

It produces a `nullNode` containing a `SepArray` with the interleaved parser
results. It has arity 1, and auto-groups its component parser if needed.
stx ,*

The ,+ modifier matches one or more repetitions of the preceding syntax with interleaved commas. It can also be written using sepBy1.

stx ::= ...
    | `p,+` is shorthand for `sepBy(p, ",")`. It parses 1 or more occurrences of
`p` separated by `,`, that is: `p | p,p | p,p,p | ...`.

It produces a `nullNode` containing a `SepArray` with the interleaved parser
results. It has arity 1, and auto-groups its component parser if needed.
stx ,+

The ,*,? modifier matches zero or more repetitions of the preceding syntax with interleaved commas, allowing an optional trailing comma after the final repetition. It can also be written using sepBy with the allowTrailingSep modifier.

stx ::= ...
    | `p,*,?` is shorthand for `sepBy(p, ",", allowTrailingSep)`.
It parses 0 or more occurrences of `p` separated by `,`, possibly including
a trailing `,`, that is: `empty | p | p, | p,p | p,p, | p,p,p | ...`.

It produces a `nullNode` containing a `SepArray` with the interleaved parser
results. It has arity 1, and auto-groups its component parser if needed.
stx ,*,?

The ,+,? modifier matches one or more repetitions of the preceding syntax with interleaved commas, allowing an optional trailing comma after the final repetition. It can also be written using sepBy1 with the allowTrailingSep modifier.

stx ::= ...
    | `p,+,?` is shorthand for `sepBy1(p, ",", allowTrailingSep)`.
It parses 1 or more occurrences of `p` separated by `,`, possibly including
a trailing `,`, that is: `p | p, | p,p | p,p, | p,p,p | ...`.

It produces a `nullNode` containing a `SepArray` with the interleaved parser
results. It has arity 1, and auto-groups its component parser if needed.
stx ,+,?

The <|> operator, which can be written orelse, matches either syntax. However, if the first branch consumes any tokens, then it is committed to, and failures will not be backtracked:

stx ::= ...
    | `p1 <|> p2` is shorthand for `orelse(p1, p2)`, and parses either `p1` or `p2`.
It does not backtrack, meaning that if `p1` consumes at least one token then
`p2` will not be tried. Therefore, the parsers should all differ in their first
token. The `atomic(p)` parser combinator can be used to locally backtrack a parser.
(For full backtracking, consider using extensible syntax classes instead.)

On success, if the inner parser does not generate exactly one node, it will be
automatically wrapped in a `group` node, so the result will always be arity 1.

The `<|>` combinator does not generate a node of its own, and in particular
does not tag the inner parsers to distinguish them, which can present a problem
when reconstructing the parse. A well formed `<|>` parser should use disjoint
node kinds for `p1` and `p2`.
stx <|> stx
stx ::= ...
    | orelse(stx, stx)

The ! operator matches the complement of its argument. If its argument fails, then it succeeds, resetting the parsing state.

stx ::= ...
    | `!p` parses the negation of `p`. That is, it fails if `p` succeeds, and
otherwise parses nothing. It has arity 0.
! stx

Syntax specifiers may be grouped using parentheses.

stx ::= ...
    | (stx)

Repetitions may be defined using many and many1. The latter requires at least one instance of the repeated syntax.

stx ::= ...
    | many(stx)
stx ::= ...
    | many1(stx)

Repetitions with separators may be defined using sepBy and sepBy1, which respectively match zero or more occurrences and one or more occurrences, separated by some other syntax. They come in three varieties:

  • The two-parameter version uses the atom provided in the string literal to parse the separators, and does not allow trailing separators.

  • The three-parameter version uses the third parameter to parse the separators, using the atom for pretty-printing.

  • The four-parameter version optionally allows the separator to occur an extra time at the end of the sequence. The fourth argument must always literally be the keyword allowTrailingSep.

stx ::= ...
    | sepBy(stx, str)
stx ::= ...
    | sepBy(stx, str, stx)
stx ::= ...
    | sepBy(stx, str, stx, allowTrailingSep)
stx ::= ...
    | sepBy1(stx, str)
stx ::= ...
    | sepBy1(stx, str, stx)
stx ::= ...
    | sepBy1(stx, str, stx, allowTrailingSep)
Parsing Matched Parentheses and Brackets

A language that consists of matched parentheses and brackets can be defined using syntax rules. The first step is to declare a new syntax category:

declare_syntax_cat balanced

Next, rules can be added for parentheses and square brackets. To rule out empty strings, the base cases consist of empty pairs.

syntax "(" ")" : balanced syntax "[" "]" : balanced syntax "(" balanced ")" : balanced syntax "[" balanced "]" : balanced syntax balanced balanced : balanced

In order to invoke Lean's parser on these rules, there must also be an embedding from the new syntax category into one that may already be parsed:

syntax (name := termBalanced) "balanced " balanced : term

These terms cannot be elaborated, but reaching an elaboration error indicates that parsing succeeded:

/-- error: elaboration function for 'termBalanced' has not been implemented balanced () -/ #guard_msgs in example := balanced () /-- error: elaboration function for 'termBalanced' has not been implemented balanced [] -/ #guard_msgs in example := balanced [] /-- error: elaboration function for 'termBalanced' has not been implemented balanced [[]()([])] -/ #guard_msgs in example := balanced [[] () ([])]

Similarly, parsing fails when they are mismatched:

example := balanced [() (expected ')' or balanced]]
<example>:1:25: expected ')' or balanced
Parsing Comma-Separated Repetitions

A variant of list literals that requires double square brackets and allows a trailing comma can be added with the following syntax:

syntax "[[" term,*,? "]]" : term

Adding a macro that describes how to translate it into an ordinary list literal allows it to be used in tests.

macro_rules | `(term|[[$e:term,*]]) => `([$e,*]) ["Dandelion", "Thistle"]#eval [["Dandelion", "Thistle",]]
["Dandelion", "Thistle"]

11.4.12. Indentation🔗

Internally, the parser maintains a saved source position. Syntax rules may include instructions that interact with these saved positions, causing parsing to fail when a condition is not met. Indentation-sensitive constructs, such as Lean.Parser.Term.do : termdo, save a source position, parse their constituent parts while taking this saved position into account, and then restore the original position.

In particular, Indentation-sensitvity is specified by combining withPosition or withPositionAfterLinebreak, which save the source position at the start of parsing some other syntax, with colGt, colGe, and colEq, which compare the current column with the column from the most recently-saved position. lineEq can also be used to ensure that two positions are on the same line in the source file.

🔗parser alias
withPosition(p)

Arity is sum of arguments' arities

withPosition(p) runs p while setting the "saved position" to the current position. This has no effect on its own, but various other parsers access this position to achieve some composite effect:

  • colGt, colGe, colEq compare the column of the saved position to the current position, used to implement Python-style indentation sensitive blocks

  • lineEq ensures that the current position is still on the same line as the saved position, used to implement composite tokens

The saved position is only available in the read-only state, which is why this is a scoping parser: after the withPosition(..) block the saved position will be restored to its original value.

This parser has the same arity as p - it just forwards the results of p.

🔗parser alias
withoutPosition(p)

Arity is sum of arguments' arities

withoutPosition(p) runs p without the saved position, meaning that position-checking parsers like colGt will have no effect. This is usually used by bracketing constructs like (...) so that the user can locally override whitespace sensitivity.

This parser has the same arity as p - it just forwards the results of p.

🔗parser alias
withPositionAfterLinebreak
  • Arity: 1
  • Automatically wraps arguments in a null node unless there's exactly one
🔗parser alias
colGt
  • Arity: 0
  • Automatically wraps arguments in a null node unless there's exactly one

The colGt parser requires that the next token starts a strictly greater column than the saved position (see withPosition). This can be used for whitespace sensitive syntax for the arguments to a tactic, to ensure that the following tactic is not interpreted as an argument.

example (x : False) : False := by
  revert x
  exact id

Here, the revert tactic is followed by a list of colGt ident, because otherwise it would interpret exact as an identifier and try to revert a variable named exact.

This parser has arity 0 - it does not capture anything.

🔗parser alias
colGe
  • Arity: 0
  • Automatically wraps arguments in a null node unless there's exactly one

The colGe parser requires that the next token starts from at least the column of the saved position (see withPosition), but allows it to be more indented. This can be used for whitespace sensitive syntax to ensure that a block does not go outside a certain indentation scope. For example it is used in the lean grammar for else if, to ensure that the else is not less indented than the if it matches with.

This parser has arity 0 - it does not capture anything.

🔗parser alias
colEq
  • Arity: 0
  • Automatically wraps arguments in a null node unless there's exactly one

The colEq parser ensures that the next token starts at exactly the column of the saved position (see withPosition). This can be used to do whitespace sensitive syntax like a by block or do block, where all the lines have to line up.

This parser has arity 0 - it does not capture anything.

🔗parser alias
lineEq
  • Arity: 0
  • Automatically wraps arguments in a null node unless there's exactly one

The lineEq parser requires that the current token is on the same line as the saved position (see withPosition). This can be used to ensure that composite tokens are not "broken up" across different lines. For example, else if is parsed using lineEq to ensure that the two tokens are on the same line.

This parser has arity 0 - it does not capture anything.

Aligned Columns

This syntax for saving notes takes a bulleted list of items, each of which must be aligned at the same column.

syntax "note " ppLine withPosition((colEq "• " str ppLine)+) : term

There is no elaborator or macro associated with this syntax, but the following example is accepted by the parser:

#check elaboration function for '«termNote__•__»' has not been implemented note • "One" • "Two" note "One" "Two"
elaboration function for '«termNote__•__»' has not been implemented
  note
    • "One"
    • "Two"
    

The syntax does not require that the list is indented with respect to the opening token, which would require an extra withPosition and a colGt.

#check elaboration function for '«termNote__•__»' has not been implemented note • "One" • "Two" note "One" "Two"
elaboration function for '«termNote__•__»' has not been implemented
  note
    • "One"
    • "Two"
    

The following examples are not syntactically valid because the columns of the bullet points do not match.

#check  note    • "One"   expected end of input• "Two"
<example>:4:3: expected end of input
#check  note   • "One"     expected end of input• "Two"
<example>:4:5: expected end of input