Lingo: A Language Workbench Powered by Datalog

2022-5-16

This is a transcript of my talk at Jamie Brandon's HYTRADBOI Conference. Thanks for having me, Jamie!

Lingo is a generalization of my earlier work, described here: Codebase as Database: Turning the IDE inside-out with Datalog.

Try it out here: lingo-workbench.dev

Intro

Hi, my name is Pete and I've always been fascinated by IDEs. Their features, like jump to definition, rename refactoring, and scope-aware autocompletion are useful to me every day as a developer.

But, I never felt it was easy to see how they worked on the inside or modify them to add new visualizations or support for new languages. I wondered: what if there was an IDE that was turned inside out, so we could inspect and visualize its internal data structures and algorithms in any way we wanted to?

Today, I'm going to show you Lingo, an environment I've been building, which is a prototype of this idea. It allows you to define rich IDE support for new languages by declaring a grammar and some datalog rules. Lingo was built around the idea of representing a codebase as a database. Specifically, it represents syntax trees as rows in relational tables, and then evaluates User-specified declarative rules to derive information about things like scope and types. These rules, written in datalog, are then queried by the UI to provide IDE features like semantic highlighting, jump to definition, rename refactoring and autocompletion.

The Need for Declarative Reasoning

So, what does this look like in practice? Well, let's say we're defining a simple functional language with syntax that looks like this:

The first thing we'll want to do is provide syntax highlighting for the language. The simplest way to do this, which many editors do is a regular expression based tokenizer. That is, define a regular expression for each type of token in our language. For instance, variables, keywords, integers, and strings.

This is a good way to get started and it's supported by many editors, but it won't get us all the way to the rich IDE features we want. To provide these features we'll need to parse our code into a syntax tree, and to do that, we'll need to define a grammar for our language.

Zooming out, we can see that Lingo provides a box for us to type into grammar. I'll paste one in.

Zooming out a bit more. We can see that the environment is now displaying a syntax tree. It took our example and our grammar and parsed it into this tree. We can see that each node in the tree has the rule which was matched, with source code range for which it was matched, a star if the cursor is currently over that node, and a unique ID for that node.

Now, when we go to the left and look at these tables. Notice that there's one table for every rule in our grammar. Let's open up a couple of key ones. And see that there's one function call. Two variables. And one let expression.

So the environment is inserting a row into the each one of these tables for e ach node in the syntax tree. It has a unique ID and has a parent ID, thereby encoding the tree.

Now that we have a full syntax tree, The environment can provide highlighting if we just tell it which color to highlight each of these nodes.

I'll paste in a mapping here, which maps it to predefined token types. And now we can see that we have basic highlighting.

But as I moved my cursor over, nothing happens. I can't jump to definition. I can't jump to a placeholder and there's no highlighting over certain variables.

So the question is, how to get these features? How can we explain to the environment that this X is defined here and used here? This is usually where there's a steep increment in complexity. In Visual Studio Code, we need to write a language server. In the IntelliJ family of IDs, we need to write some Java code.

This is frustrating and because it would be nice to say this in a declarative manner. All we really want to say is if something is defined on the left side of the let expression it's in scope on the right side of the let expression.

So, is there a programming paradigm which allows us to state this directly?

Typing Judgements in Datalog

The closest thing I've seen is a notation common in academic programming language papers called typing judgements. Now I'm no expert on the programming languages or logic, but the basic principle here is that for each rule if the expressions on the top of this line are true, then the expression on the bottom is true. And variables bound up here apply down here. And this fact here can then be fed into other facts or other rules in the system.

Now, this is great, but how can we make it into executable code?

It turns out that there's a programming paradigm which closely matches typing judgments called logic programming. In logic programming, you have a database that you can insert facts into and you can define rules to draw inferences from those facts. There are two main logic programming languages, datalog, and prologue. They're similar conceptually, but at different guarantees around termination.

Lingo sticks a little closer to datalog.

As we can see in this comparison with SQL, programming in datalog is a lot like creating views on SQL database, but datalog syntax is much more concise. Most importantly, its use of variables makes complex joins much more ergonomic to express.

Example: Simple Functional Language

So what does this look like in Lingo? Well, we already have all the facts we need inserted by the environment into the database from the AST. But now we needed to find some datalog rules to teach the environment what's in scope at each node in the tree, and where are those scope items were defined. Let's switch to a pre-built version of language with some rules defined.

Suddenly, with the addition of these datalog rules, we can notice that the editor has come to life.

We have this highlighting where pink is the definition and gray is a usage. We can jump between definition and usage. We can rename variables. If we type in a placeholder, we get suggestions for things that are in scope. And if we introduce a type error, we get a red underline. If we move the cursor, we get an error message. All with about 130 lines of datalog!

Datalog Walkthrough

So how do these rules actually work? Well everything is defined in terms of scope, so let's start there.

Two most basic rules are scope.Defn and scope.Var. We'll focus on scope.Defn.

So it's saying that there are three things that can define a variable, which goes into the scope. The first one is a built-in, like plus2. The second is a let expression, which is what we have here. And the third is a Lambda perimeter. Let's click on this rule.

So, what it's saying is that if there's a let expression that has a body, a binding, and a variable, and if the type of that binding is type T, then we know that there's going to be a variable of name N with type T defined at the span of the variable within the body.

Generalizing to More Languages

So that's the basics of adding IDE support for a simple, functional language. But does this generalize to other languages? It turns out that yes, a lot of other languages have notions of definitions, variables and scoping. So we can use the same framework for those languages too.

Language Interface

Your language can have whatever AST it wants, but for editor support to work, you just have to define, at base, these three relations:

scope.Defn{id: I, name: N, span: S}
scope.Var{id: I, name: N, span: S}
scope.Parent{parent: P, child: C}

scope.Defn tells Lingo where the AST variables are being defined and scope.Var tells the environment where they're being used. scope.Parent tells the environment which scopes are nested within other scopes, inheriting their definitions.

Defining these two relations unlocks more features:

scope.Placeholder{id: I, span: S}: tells the environment which scope to use for autocompletion.
tc.Problem{desc: D, span: S}: tells the environment that there's a type error at a given span.

SQL

Let's select another example language.

Here's a simple subset of SQL as an example of a language which is substantially different than our functional language, but still fits into the framework.

It's different because scoping works differently. For instance. The table name, which I say after the FROM clause, affects which columns are in scope here in the SELECT clause:

Also, instead of let expressions, everything is defined with CREATE TABLE expressions.

SQL Schema Viz

Scrolling down a bit, we can see that there's a live Graphviz visualization of the schema of the database:

Now, how is this working? Well, it's actually defined in just a few lines of datalog at the bottom of this file as a couple of queries, one for nodes and one for edges:

Now, this is an example of what I call turning the IDE inside out. Normally these data structures would be hidden inside of the IDE. But here, this is just data that we can work with and query as if we were in a spreadsheet.

Self-Hosting: Grammars and Datalog

Now you might notice that the grammar editor here and the datalog editor on the right have a lot of the same affordances as the example on the left. For instance there's semantic highlighting and we can do renames.

That's because these editors are powered by Lingo as well. In fact, we can go look at the definition of the grammar language itself. This is the grammar of grammars:

And, we can look at the definition of Lingo's datalog dialect:

Now, don't worry — you can't break Lingo by editing these because there's a hard-coded copy as well.

Conclusion & Future Work

So that's my demo for today, but where could this project go in the future? Well, there are a lot of interesting directions:

The core datalog engine could benefit from implementing some well-known techniques for improving the speed and adding a negation operator, which would allow us to specify more complicated type errors (note: this has now been implemented).
I'd love to spend more time on the UI for database exploration and rule authoring by example, instead of just typing in a datalog query.
Adding support for more languages would really push the environment to its limits, like languages with generics or perhaps language with a Rust-like borrow checker. In fact, you could say that this whole project is the world's most roundabout way of learning Rust.
In addition to type checking, we could perhaps program operational semantics with datalog, thereby allowing Lingo to not just type check code, but actually execute it, and maybe even giving us a time-traveling debugger for free.

That's it, thanks! This is live at https://lingo-workbench.dev if you want to try it out. Also, you can go to https://github.com/vilterp/datalog-ts, or talk to me on Twitter. I'd love to hear from you.

JetBrains MPS is the first language workbench I encountered, focused on structured editing.
Spoofax is a very cool language workbench based on plaintext editing, which also uses a datalog-like language for scoping and typing rules.
IncA is an incrementalized datalog-powered analysis framework, integrated into Jetbrains MPS.
Soufflé is a datalog engine optimized for program analysis use cases.

Intro
Generalizing to More Languages
Conclusion & Future Work
- Related Work

Lingo: A Language Workbench Powered by Datalog

Intro #

The Need for Declarative Reasoning #

Typing Judgements in Datalog #

Example: Simple Functional Language #

Datalog Walkthrough #

Generalizing to More Languages #

Language Interface #

SQL #

SQL Schema Viz #

Self-Hosting: Grammars and Datalog #

Conclusion & Future Work #

Related Work #