Example: Writing a Language Package

Language support in CodeMirror takes the form of specific packages (with names like @codemirror/lang-python) that implement the support features for working with that language. Those may be...

In this example, we'll go through implementing a language package for a very minimal Lisp-like language. A similar project, with build tool configuration and such set up for you, is available as an example Git repository at codemirror/lang-example. It may be useful to start from that when building your own package.

Parsing

The first thing we'll need is a parser, which is used for highlighting but also provides the structure for things like syntax-aware selection, auto-indentation, and code folding. There are several ways to implement a parser for CodeMirror.

You generally cannot use existing parsers, written for a different purpose, to parse editor content. The way the editor parses code needs to be incremental, so that it can quickly update its parse when the document changes, without re-parsing the entire text. It also needs to be error-tolerant, so that highlighting doesn't break when you have a syntax error somewhere in your file. And finally, it helps if the parser produces a syntax tree in a format that the highlighter can consume. Very few existing parsers can easily be integrated in such a context.

If your language defines a formal context-free grammar, you may be able to base a Lezer grammar on that with relative ease, depending on how many dodgy tricks the language uses. Almost all languages do some things that don't fit the context-free formalism, but Lezer has some mechanisms to deal with that.

The Lezer guide provides a more complete explanation of how to write a grammar. But roughly, the way it works is that you declare a number of tokens, which describe the way the document is split into meaningful pieces (identifiers, strings, comments, braces, and so on), and then provide rules that describe bigger constructs in terms of those tokens and other rules.

The notation borrows from extended Backus-Naur notation and regular expression syntax, using | to indicate a choice between several forms, * and + for repetition, and ? for optional elements.

The grammar should be put in its own file, typically with a .grammar extension, and run through lezer-generator to create a JavaScript file.

This first rule means that a document should be parsed as any number of expressions, and the top node of the syntax tree should be called Program.

@top Program { expression* }

The next rule is a bit more involved. It declares that an expression can either be an identifier, a string, a boolean literal, or an application, which is any number of expressions wrapped in parentheses. (The branch for Application uses an inline rule to combine the definition of the rule with its only use.)

expression {
  Identifier |
  String |
  Boolean |
  Application { "(" expression* ")" }
}

Rule names that start with a capital letter will end up in the syntax tree produced by the parser. Other rules, such as expression, which are only there to structure the grammar, will be left out (to keep the tree small and clean).

Next, we define our tokens.

@tokens {
  Identifier { $[a-zA-Z_0-9]+ }

  String { '"' (!["\\] | "\\" _)* '"' }

  Boolean { "#t" | "#f" }

  LineComment { ";" ![\n]* }

  space { $[ \t\n\r]+ }

  "(" ")"
}

These use a syntax similar to the rule definitions, but can only express a regular language, which roughly means they can't be recursive. Quoted literals match exactly the text in the quotes, sets of characters can be specified with $[] syntax, and ![] is used to match all characters except the ones between the brackets.
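
To make the token notation more concrete, here are rough JavaScript regular-expression equivalents of the token definitions above. This is only an illustration: Lezer compiles its own tokenizer and does not use these expressions, and the names tokenPatterns and matchToken are hypothetical.

```javascript
// Rough regular-expression equivalents of the token definitions, for
// illustration only. The sticky flag (y) makes each pattern match exactly
// at lastIndex, similar to how a tokenizer reads from a given position.
const tokenPatterns = {
  Identifier: /[a-zA-Z_0-9]+/y,          // $[a-zA-Z_0-9]+
  String: /"(?:[^"\\]|\\[\s\S])*"/y,     // '"' (!["\\] | "\\" _)* '"'
  Boolean: /#t|#f/y,                     // "#t" | "#f"
  LineComment: /;[^\n]*/y,               // ";" ![\n]*
  space: /[ \t\n\r]+/y                   // $[ \t\n\r]+
}

// Hypothetical helper: return the text the named token matches at `pos`,
// or null if that token does not match there.
function matchToken(name, input, pos = 0) {
  const pattern = tokenPatterns[name]
  pattern.lastIndex = pos
  const match = pattern.exec(input)
  return match ? match[0] : null
}
```

Note how ![\n] becomes the negated class [^\n], and how the _ wildcard in the String token is written as [\s\S] so that a backslash can escape any character.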

By default, tokens implicitly created by using literal strings in the (non-token) grammar won't be part of the syntax tree. By mentioning such tokens (like "(" and ")") explicitly in the @tokens block, we indicate that they should be included.

Skippable tokens, like space and comments, are defined in the same way as other tokens, and then marked as skippable with a declaration like this.

@skip { space | LineComment }

And finally, the parser generator can be asked to automatically infer matching delimiters with a @detectDelim directive. This will cause it to add metadata to those node types, which the editor can use for things like bracket matching and automatic indentation.

@detectDelim

If that grammar lives in example.grammar, you can run lezer-generator example.grammar to create a JavaScript module holding the parse tables. Or, as the example repository does, include the Rollup plugin provided by that tool in your build process, so that you can directly import the parser from the grammar file.
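
To give a feel for the tree shape this grammar describes, here is a small hand-written sketch in plain JavaScript. It is not how Lezer works (Lezer generates an incremental, error-tolerant LR parser, while this sketch simply throws on bad input), and it leaves out the "(" and ")" nodes for brevity. All function names are hypothetical.

```javascript
// One alternative per token, tried in order at the current position.
const tokenRE = /(?<space>[ \t\n\r]+)|(?<LineComment>;[^\n]*)|(?<String>"(?:[^"\\]|\\[\s\S])*")|(?<Boolean>#t|#f)|(?<Identifier>[a-zA-Z_0-9]+)|(?<open>\()|(?<close>\))/y

function tokenize(input) {
  const tokens = []
  let pos = 0
  while (pos < input.length) {
    tokenRE.lastIndex = pos
    const match = tokenRE.exec(input)
    if (!match) throw new SyntaxError(`Unexpected character at ${pos}`)
    const name = Object.keys(match.groups).find(key => match.groups[key] !== undefined)
    // Mirrors @skip { space | LineComment }: skipped tokens never reach the rules.
    if (name !== "space" && name !== "LineComment")
      tokens.push({name, from: pos, to: pos + match[0].length})
    pos += match[0].length
  }
  return tokens
}

function parseProgram(input) {
  const tokens = tokenize(input)
  let index = 0
  // Like the lowercase `expression` rule, this function only dispatches.
  // It never creates a node named "expression".
  function expression() {
    const token = tokens[index++]
    if (token.name === "close") throw new SyntaxError(`Unexpected ")" at ${token.from}`)
    if (token.name !== "open") return {name: token.name, from: token.from, to: token.to}
    const children = []
    while (index < tokens.length && tokens[index].name !== "close") children.push(expression())
    if (index === tokens.length) throw new SyntaxError("Unclosed application")
    const close = tokens[index++]
    return {name: "Application", from: token.from, to: close.to, children}
  }
  const children = []
  while (index < tokens.length) children.push(expression())
  return {name: "Program", from: 0, to: input.length, children}
}

// Print a tree as nested node names.
function show(node) {
  return node.children ? `${node.name}(${node.children.map(show).join(",")})` : node.name
}
```

For example, show(parseProgram('(a "b" #t) ; note\nx')) returns "Program(Application(Identifier,String,Boolean),Identifier)": the comment and whitespace are skipped, and only capitalized names appear as nodes.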

CodeMirror integration

Lezer is a generic parser tool, and our grammar so far doesn't know anything about highlighting or other editor-related functionality.

A Lezer parser comes with a number of node types, each of which can have props with extra metadata added to them. We'll create an extended copy of the parser to include node-specific information for highlighting, indentation, and folding.

import {parser} from "./parser.js"
import {foldNodeProp, foldInside, indentNodeProp} from "@codemirror/language"
import {styleTags, tags as t} from "@codemirror/highlight"

let parserWithMetadata = parser.configure({
  props: [
    styleTags({
      Identifier: t.variableName,
      Boolean: t.bool,
      String: t.string,
      LineComment: t.lineComment,
      "( )": t.paren
    }),
    indentNodeProp.add({
      Application: context => context.column(context.node.from) + context.unit
    }),
    foldNodeProp.add({
      Application: foldInside
    })
  ]
})

styleTags is a helper that attaches highlighting information. We give it an object mapping node names (or space-separated lists of node names) to highlighting tags. These tags describe the syntactic role of the elements, and are used by highlight styles to style the text.
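
As a rough illustration of the space-separated key convention, the sketch below expands such a spec into one entry per node name. It is a simplification under stated assumptions: plain strings stand in for real highlighting tags, the function name expandStyleSpec is hypothetical, and the real helper supports a richer selector syntax than is shown here.

```javascript
// Expand keys like "( )" into separate entries for "(" and ")".
// Strings such as "paren" are placeholders for real tag objects.
function expandStyleSpec(spec) {
  const byNodeName = {}
  for (const [names, tag] of Object.entries(spec))
    for (const name of names.split(" ")) byNodeName[name] = tag
  return byNodeName
}

const expanded = expandStyleSpec({
  Identifier: "variableName",
  "( )": "paren"
})
```

Here expanded maps "Identifier" to "variableName" and both "(" and ")" to "paren".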

The information added by @detectDelim would already allow the automatic indentation to do a reasonable job, but because Lisps tend to indent continued lists one unit beyond the start of the list, and the default behavior is similar to how you'd indent parenthesized things in C or JavaScript, we'll have to override it.

The indentNodeProp prop associates functions that compute an indentation with node types. The function is passed a context object holding the relevant values and some indentation-related helper methods. In this case, the function computes the column position at the start of the application node and adds one indent unit to that. The language package exports a number of helpers to easily implement common indentation styles.
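
The arithmetic behind that indentation function can be sketched in plain JavaScript, independent of the editor. The helper columnAt below is a hypothetical stand-in for context.column that simply counts characters since the start of the line, and applicationIndent is likewise an illustrative name.

```javascript
// Hypothetical stand-in for context.column: the number of characters
// between the start of the line and `pos`.
function columnAt(doc, pos) {
  const lineStart = doc.slice(0, pos).lastIndexOf("\n") + 1
  return pos - lineStart
}

// Indentation for lines inside an application that starts at `nodeFrom`:
// the column of its opening parenthesis plus one indent unit.
function applicationIndent(doc, nodeFrom, unit) {
  return columnAt(doc, nodeFrom) + unit
}

const doc = "(defun f\n  (cons a\n"
const innerIndent = applicationIndent(doc, doc.indexOf("(cons"), 2) // 2 + 2 = 4
const outerIndent = applicationIndent(doc, 0, 2)                    // 0 + 2 = 2
```

So a line continuing the (cons ...) list would be indented four columns, and a line continuing the outer (defun ...) list two columns, assuming an indent unit of two.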

Finally, foldNodeProp associates folding information with node types. We allow application nodes to be folded by hiding everything but their delimiters.
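
The idea of folding everything but the delimiters can be sketched as follows. This is a simplified illustration using plain objects with a hypothetical children array rather than real syntax nodes, and foldInsideSketch is an illustrative name, not the library function.

```javascript
// Given a node whose first and last children are its delimiters, return
// the range between them, or null when there is nothing to fold.
function foldInsideSketch(node) {
  const first = node.children[0]
  const last = node.children[node.children.length - 1]
  if (!first || first.to >= last.from) return null
  return {from: first.to, to: last.from}
}

// Node for the text "(foo bar)".
const application = {
  name: "Application", from: 0, to: 9,
  children: [
    {name: "(", from: 0, to: 1},
    {name: "Identifier", from: 1, to: 4},
    {name: "Identifier", from: 5, to: 8},
    {name: ")", from: 8, to: 9}
  ]
}
const foldRange = foldInsideSketch(application) // {from: 1, to: 8}
```

Folding that range would leave only the two parentheses visible.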

That gives us a parser with enough editor-specific information encoded in its output to use it for editing. Next we wrap that in a Language instance, which wraps a parser and adds a language-specific facet that can be used by external code to register language-specific metadata.

import {LRLanguage} from "@codemirror/language"

export const exampleLanguage = LRLanguage.define({
  parser: parserWithMetadata,
  languageData: {
    commentTokens: {line: ";"}
  }
})

That code provides one piece of metadata (line comment syntax) right away, and allows us to do something like this to add additional information, such as an autocompletion source for this language.

import {completeFromList} from "@codemirror/autocomplete"

export const exampleCompletion = exampleLanguage.data.of({
  autocomplete: completeFromList([
    {label: "defun", type: "keyword"},
    {label: "defvar", type: "keyword"},
    {label: "let", type: "keyword"},
    {label: "cons", type: "function"},
    {label: "car", type: "function"},
    {label: "cdr", type: "function"}
  ])
})
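
completeFromList is enough for a fixed word list. When more control is needed, a completion source can also be written by hand as a function from a completion context to a result. The following is a minimal sketch, assuming only the context's matchBefore method and its explicit and pos properties. The names lispWords and lispCompletions are hypothetical.

```javascript
// Hypothetical word list, reusing a few entries from the example above.
const lispWords = [
  {label: "defun", type: "keyword"},
  {label: "let", type: "keyword"},
  {label: "cons", type: "function"}
]

// A hand-written completion source. It offers completions when the cursor
// follows an identifier-like word, or when completion was requested
// explicitly, and otherwise returns null.
function lispCompletions(context) {
  const word = context.matchBefore(/[a-zA-Z_0-9]+/)
  if (!word && !context.explicit) return null
  return {
    from: word ? word.from : context.pos, // start of the text to replace
    options: lispWords
  }
}
```

Such a function could be passed as the autocomplete value in place of the completeFromList call.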

Finally, it is conventional for language packages to export a main function (named after the language, so it's called css in @codemirror/lang-css, for example) that takes a configuration object (if the language has anything to configure) and returns a LanguageSupport object. That object bundles a Language instance with any additional supporting extensions that one might want to enable for the language.

import {LanguageSupport} from "@codemirror/language"

export function example() {
  return new LanguageSupport(exampleLanguage, [exampleCompletion])
}

The result looks like this: