Example: Mixed-Language Parsing

A lot of file formats contain other formats inside them—things like JavaScript inside HTML <script> tags, HTML inside template literals in that JavaScript, or template languages that wrap processing instructions around some other language.

The way Lezer, and thus CodeMirror, handles this is by treating the composite language as a combination of an outer language (which parses the entire document) and one or more inner languages (which parse only some regions, determined by the structure of the outer parse tree).

Hierarchical Nesting

For example, in HTML with CSS and JavaScript, HTML provides the outer structure, and the content of <style> and <script> tags in that structure is given to the CSS and JavaScript parsers. In a template language, the outer parser would parse the directive syntax, since that determines the structure of the document, and then the space between the directives would be given to the target language (often HTML).

The feature that handles this kind of parsing is parseMixed, which can be attached to the outer parser to manage the inner parsing.

Let's pretend the @codemirror/lang-html package doesn't already provide mixed-language parsing, and implement parsing of <script> tags ourselves:

import {parser as htmlParser} from "@lezer/html"
import {parser as jsParser} from "@lezer/javascript"
import {parseMixed} from "@lezer/common"
import {LRLanguage} from "@codemirror/language"

const mixedHTMLParser = htmlParser.configure({
  wrap: parseMixed(node => {
    return node.name == "ScriptText" ? {parser: jsParser} : null
  })
})

const mixedHTML = LRLanguage.define({parser: mixedHTMLParser})

The function given to parseMixed will be called on the outer tree's nodes, and determines whether their content should be parsed with a nested parser. The HTML parser conveniently emits a syntax tree node ScriptText for the content of <script> tags, and here we're telling the mixed parser to parse the content of such nodes using the JavaScript parser, and “mount” the resulting tree in place of the script text node.

As you can see, the highlighting works for both languages. (Lots of functionality is missing though, since this code completely bypasses features such as folding information and autocompletion, which are defined in @codemirror/lang-html and @codemirror/lang-javascript.)
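Because the callback passed to parseMixed is a plain function, the mapping from node names to inner parsers can be factored into a lookup table. The following is a minimal sketch, not part of any Lezer package: nestByNodeName is a hypothetical helper, and the parser value is a placeholder object standing in for a real parser such as the jsParser imported above.

```javascript
// Hypothetical helper (not a Lezer API): builds a callback of the shape
// parseMixed expects from a table of node name -> inner parser.
function nestByNodeName(table) {
  const parsers = new Map(Object.entries(table))
  // Return {parser} to request a nested parse, or null to leave the node alone
  return node => parsers.has(node.name) ? {parser: parsers.get(node.name)} : null
}

// Placeholder object; in an editor this would be a real Lezer parser.
const jsStandIn = {label: "javascript"}
const nest = nestByNodeName({ScriptText: jsStandIn})

console.log(nest({name: "ScriptText"})) // an object whose parser is jsStandIn
console.log(nest({name: "Element"}))    // null
```

With real parsers in the table, the resulting function would be used the same way as the inline callback above, as in htmlParser.configure({wrap: parseMixed(nest)}).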

Overlay Nesting

In the situation above, the nested region neatly matches the structure of the outer language: the script text is a single node, covering the content of an HTML tag. In some other mixed-language systems, the nesting isn't quite so straightforward. For example, in the Twig templating language, the nesting of the templating directives might not match the nesting of the HTML in the template, as in this somewhat odd template:

<div>
  {{ content }}
{% if extra_content %}
  <hr></div><div class=extra>{{ extra_content }}
{% endif %}
  <hr>
</div>

We could use the template syntax as the outer language, parsing that document like this:

Template(
  Text("<div>"),
  Insert("content"),
  Conditional(
    ConditionalOpen("extra_content"),
    Text("<hr></div><div class=extra>"),
    Insert("extra_content"),
    ConditionalClose),
  Text("<hr></div>"))

But to parse the HTML text, there's no single node in this tree to target—it is spread out over multiple nodes, with different parent nodes.

The way Lezer models nesting like this is with an “overlay” mounted tree. Instead of replacing a given node, an overlay overlays parts of it with the content from a different tree.

Assuming we have a grammar for our outer parser, we could define a mixed parser like this:

import {parser as twigParser} from "./twig-parser.js"
import {htmlLanguage} from "@codemirror/lang-html"
import {parseMixed} from "@lezer/common"
import {foldNodeProp, foldInside, indentNodeProp, LRLanguage} from "@codemirror/language"

const mixedTwigParser = twigParser.configure({
  props: [
    // Add basic folding/indent metadata
    foldNodeProp.add({Conditional: foldInside}),
    indentNodeProp.add({Conditional: cx => {
      let closed = /^\s*\{% endif/.test(cx.textAfter)
      return cx.lineIndent(cx.node.from) + (closed ? 0 : cx.unit)
    }})
  ],
  wrap: parseMixed(node => {
    return node.type.isTop ? {
      parser: htmlLanguage.parser,
      overlay: node => node.type.name == "Text"
    } : null
  })
})

const twigLanguage = LRLanguage.define({parser: mixedTwigParser})

Here we set up mixed parsing for the template's top node, but also include an overlay property in the object returned from the callback. This property can be an array of ranges to parse, or a further function that is called for each descendant node. The latter is preferable when the target node might be large, because it allows more efficient incremental reparsing.
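To make the array form of overlay concrete, the sketch below collects the ranges of all Text nodes from a small tree. It assumes a hypothetical plain-object tree ({name, from, to, children}) rather than a real Lezer tree, and the sample document and its positions are made up for illustration.

```javascript
// Hypothetical stand-in for an outer parse of "<p>{{ a }}</p>".
const doc = "<p>{{ a }}</p>"
const tree = {name: "Template", from: 0, to: 14, children: [
  {name: "Text", from: 0, to: 3},
  {name: "Insert", from: 3, to: 10},
  {name: "Text", from: 10, to: 14}
]}

// Walk the tree depth-first and gather {from, to} for every Text node.
function collectTextRanges(node, ranges = []) {
  if (node.name == "Text") ranges.push({from: node.from, to: node.to})
  for (const child of node.children || []) collectTextRanges(child, ranges)
  return ranges
}

const ranges = collectTextRanges(tree)
// The inner parser only sees the text covered by the ranges.
const visible = ranges.map(r => doc.slice(r.from, r.to)).join("")
console.log(ranges, visible)
```

An array like ranges has the {from, to} shape that the overlay property accepts. The text covered by those ranges forms well-nested HTML even though it is split across template nodes, which is why an overlay suits this case.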

Note that, since we used the HTML parser from @codemirror/lang-html, parsing inside <script> and <style> tags just works in this editor. Mixed parsers can be nested.

Support Extensions

CodeMirror's languageDataAt feature can be used to look up values associated with the language active at a given point in the document. If you define things like autocompletion using this mechanism, it will automatically look up the proper completion source in a mixed-language document.

const twigAutocompletion = twigLanguage.data.of({
  autocomplete: context => /* Twig completion logic here */ null 
})

Mixed-language extensions should include the support extensions for their nested languages, because extensions only take effect once they are loaded into an editor. For example, the editor above doesn't load the HTML support extensions, and thus doesn't have HTML autocompletion. Here's a twig function that exposes both the language and the support extensions:

import {html} from "@codemirror/lang-html"

export function twig() {
  return [
    twigLanguage,
    twigAutocompletion,
    html().support
  ]
}

Loading that into an editor gives you HTML (and JavaScript, and CSS) completion support.