How To Create Pragmatic, Lightweight Languages

Author: Federico Tomassetti
Publisher: Leanpub
Pages: 370
Audience: Beginning but aspiring language engineers
Rating: 4
Reviewer: Nikos Vaggalis

A book that makes language design and compiler construction accessible.

At last, a guide that makes creating a language, with its associated baggage of lexers, parsers and compilers, accessible to mere mortals rather than to the small group of hardcore specialists it has been reserved for until now.

The first thing that catches the eye is the subtitle:

"The unix philosophy applied to language design, for GPLs and DSLs"

What is meant by "unix philosophy"? It means taking simple, high-quality components and combining them in smart ways to obtain a complex result, which is exactly the approach the book adopts.

I'm getting ahead of myself here, but a first sample of this philosophy becomes apparent at the beginning of Chapter 5, where the parser calls the lexer much like a Unix pipe, as in lexer|parser. By the end of the book this pipeline grows longer, like a chain, as more and more components end up interacting.

The book opens by putting things into perspective in Chapter 1: Motivation: why do you want to build language tools?

There are two different scenarios in which you may want to do that:

1. you want to create a new language: maybe a general purpose language (GPL), maybe a domain specific language (DSL). In any case you may want to build some support for this language of yours. Maybe you want to generate C and compile the generated code, maybe you want to interpret it. Maybe you want to build a compiler or a simulator for your language. Or you want to do all of this stuff and more.

2. you want to create an additional tool for an existing language. Do you want to perform static analysis on your Java code? Or build a translator from Python to JavaScript? Maybe a web editor for some less known language?

Nowadays, we are surrounded by those: languages targeting a common runtime like .NET or the JVM; language A to language B transpilers, such as Perlito, which translates Perl 5/6 to JavaScript; or dedicated DSLs within larger frameworks, such as the one in the Perl Dancer web framework, which makes implementing a web application trivial.

It's also important to note that the author is not using any GUIs or IDEs that generate code behind the scenes, but rather hand-codes everything in Kotlin, using Gradle to build and run it.

And why Kotlin? Because it is very concise, reduces boilerplate, is well supported and is reasonably clear. No worries though, as the ideas discussed throughout the book should be applicable to any language.

The first few pages close with a brief rundown of General Purpose Languages (GPLs) versus Domain Specific Languages (DSLs), or rather when to use which.

Chapter 2: The general plan continues with a more detailed look at what we're going to build:
   lexers
   parsers
   compilers
   code generators
   static analysis tools
   editors
   simulators

that is, build, not just use...

At this point the author points out that the theory underlying grammar construction is the very first barrier that puts people off progressing any further. He therefore keeps theory to a minimum, adopting a "practical to the bone" approach instead. As he puts it, "there is no better way to learn design principles than by building things".

Quickly then, Chapter 3: The example languages we are going to build provides a high-level overview of the toy languages we're going to work on. The first is MiniCalc, a language that shows how to work with expressions. At its core it will support:

integer and decimal literals
variable definition and assignment
the basic mathematical operations (addition, subtraction, multiplication, division)
the usage of parentheses

and has the following particularities:

newlines will be meaningful
supports string interpolation like “hi #{name}!”

Example:

input Int width
input Int height
var area = width * height
print("A rectangle #{width}x#{height} has an area #{area}")

MiniCalcFun is a pumped-up variant of MiniCalc that adds support for functions, and finally StaMac is a language that represents state machines.

Part I: The Basics

At this point the intro ends and we reach the main material, beginning with Chapter 4: Writing a lexer. The lexer is the piece of code that takes a textual document and breaks it down into elements called tokens, which are essentially portions of text with a specific role. Such tokens can be numeric literals, string literals, comments and keywords.

A lexer's purpose can be clearly observed in the context of the syntax highlighting built into IDEs and text editors. Do you want to show the keywords in green? You first need to recognize which parts are the keywords.
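To make the idea of tokens concrete, here is a minimal hand-rolled sketch in Python. It is not the book's approach (the book generates its lexer with ANTLR and writes the surrounding code in Kotlin); the names `Token`, `TOKEN_SPEC`, `tokenize` and `keywords` are hypothetical, and only a small subset of MiniCalc is covered.

```python
import re
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    type: str
    text: str


# Assumption: a small subset of MiniCalc tokens, for illustration only.
# Python tries alternatives in order (it does not pick the longest match
# the way ANTLR does), so keywords precede ID and DECLIT precedes INTLIT.
TOKEN_SPEC = [
    ("NEWLINE",   r"\r\n|\r|\n"),
    ("INPUT",     r"input\b"),
    ("VAR",       r"var\b"),
    ("PRINT",     r"print\b"),
    ("DECLIT",    r"(?:0|[1-9][0-9]*)\.[0-9]+"),
    ("INTLIT",    r"0|[1-9][0-9]*"),
    ("ID",        r"_*[a-z][A-Za-z0-9_]*"),
    ("ASTERISK",  r"\*"),
    ("ASSIGN",    r"="),
    ("WS",        r"[\t ]+"),
    ("UNMATCHED", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))
KEYWORD_TYPES = {"INPUT", "VAR", "PRINT"}


def tokenize(text: str) -> Iterator[Token]:
    """Yield tokens in source order, dropping whitespace."""
    for match in MASTER.finditer(text):
        if match.lastgroup != "WS":
            yield Token(match.lastgroup, match.group())


def keywords(text: str) -> List[str]:
    """Return the pieces of text a highlighter would paint as keywords."""
    return [token.text for token in tokenize(text) if token.type in KEYWORD_TYPES]


if __name__ == "__main__":
    for token in tokenize("var area = width * height"):
        print(token)
```

A highlighter built on this would only need the token types: anything whose type is in `KEYWORD_TYPES` gets the keyword color, and everything else is left alone.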

To build our lexer we are going to use ANTLR, a very mature tool for writing lexers and parsers. Indeed we will use ANTLR to generate both our lexer and our parser, as typically a lexer and a parser need to work together, therefore it makes sense that just one tool generates both of them.

In addition to that, ANTLR 4 makes it easy to write simple grammars as it solves the left recursive definition for you, so you do not have to write many intermediate node types for specifying precedence rules for your expressions ... More on this when we look into the parser.

However, I could not find that explanation; as already mentioned, the book is very light on theory.
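For readers unfamiliar with the issue, the "intermediate" layers that a parser without left-recursion support needs can be sketched as a hand-written recursive-descent evaluator. This is an illustration in Python, not code from the book, and the function names (`evaluate`, `expr`, `term`, `factor`) are hypothetical. Each precedence level gets its own function: `expr` handles + and -, `term` handles * and /, and `factor` handles literals and parentheses. ANTLR 4 lets a single left-recursive expression rule stand in for this layering.

```python
import re


def evaluate(source: str) -> int:
    """Evaluate an integer expression using one function per precedence level."""
    tokens = re.findall(r"\d+|[-+*/()]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        return token

    def expr():
        # Lowest precedence: addition and subtraction, left-associative.
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():
        # Higher precedence: multiplication and division.
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value //= factor()  # integer division, to keep the sketch simple
        return value

    def factor():
        # Highest precedence: integer literals and parenthesized expressions.
        token = advance()
        if token == "(":
            value = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token)

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result


if __name__ == "__main__":
    print(evaluate("1 + 2 * 3"))
    print(evaluate("(1 + 2) * 3"))
```

Adding another precedence level, say exponentiation, would mean adding yet another function between `term` and `factor`, which is the kind of bookkeeping the quoted passage says ANTLR 4 takes off your hands.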

The complete lexer grammar for MiniCalc is then provided, and subsequently dissected and explained line by line. I reproduce part of it below.

lexer grammar MiniCalcLexer;

channels { WHITESPACE }

// Whitespace
NEWLINE            : '\r\n' | '\r' | '\n' ;
WS                 : [\t ]+ -> channel(WHITESPACE) ;

// Keywords
INPUT              : 'input' ;
VAR                : 'var' ;
PRINT              : 'print';
AS                 : 'as';
INT                : 'Int';
DECIMAL            : 'Decimal';
STRING             : 'String';

// Literals
INTLIT             : '0'|[1-9][0-9]* ;
DECLIT             : '0'|[1-9][0-9]* '.' [0-9]+ ;

// Operators
PLUS               : '+' ;
MINUS              : '-' ;
ASTERISK           : '*' ;
DIVISION           : '/' ;
ASSIGN             : '=' ;
LPAREN             : '(' ;
RPAREN             : ')' ;

// Identifiers
ID                 : [_]*[a-z][A-Za-z0-9_]* ;

STRING_OPEN        : '"' -> pushMode(MODE_IN_STRING);

UNMATCHED          : . ;

mode MODE_IN_STRING;

ESCAPE_STRING_DELIMITER : '\\"' ;
ESCAPE_SLASH            : '\\\\' ;
ESCAPE_NEWLINE          : '\\n' ;
ESCAPE_SHARP            : '\\#' ;
STRING_CLOSE            : '"' -> popMode ;
INTERPOLATION_OPEN      : '#{' -> pushMode(MODE_IN_INTERPOLATION) ;
STRING_CONTENT          : ~["\n\r\t\\#]+ ;

and so on.
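The pushMode and popMode actions in the grammar amount to a stack of lexer modes: which rules apply depends on the mode on top of the stack. The Python sketch below imitates that mechanism by hand for a string with one interpolation. It is an illustration, not ANTLR's implementation. The rules of the interpolation mode are not part of the excerpt above, so the `INTERPOLATION_CLOSE` rule and the contents of `IN_INTERPOLATION` are assumptions, as are the names `RULES` and `tokenize`.

```python
import re

# Rules per mode: (token type, regex, action).
# An action is None, ("push", mode) or ("pop",), mimicking pushMode/popMode.
RULES = {
    "DEFAULT": [
        ("ID", r"_*[a-z][A-Za-z0-9_]*", None),
        ("STRING_OPEN", r'"', ("push", "IN_STRING")),
        ("WS", r"[\t ]+", None),
    ],
    "IN_STRING": [
        ("STRING_CLOSE", r'"', ("pop",)),
        ("INTERPOLATION_OPEN", r"#\{", ("push", "IN_INTERPOLATION")),
        ("STRING_CONTENT", r'[^"\n\r\t\\#]+', None),
    ],
    # Assumption: this mode is not shown in the excerpt; the rules are guesses.
    "IN_INTERPOLATION": [
        ("INTERPOLATION_CLOSE", r"\}", ("pop",)),
        ("ID", r"_*[a-z][A-Za-z0-9_]*", None),
        ("WS", r"[\t ]+", None),
    ],
}


def tokenize(text):
    """Return (type, text) pairs, switching rule sets via a mode stack."""
    modes = ["DEFAULT"]
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the rules of the current mode in order; the first match wins.
        for token_type, pattern, action in RULES[modes[-1]]:
            match = re.compile(pattern).match(text, pos)
            if match:
                break
        else:
            raise SyntaxError(f"unexpected {text[pos]!r} in mode {modes[-1]}")
        if token_type != "WS":
            tokens.append((token_type, match.group()))
        if action is not None:
            if action[0] == "push":
                modes.append(action[1])
            else:
                modes.pop()
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize('"hi #{name}!"'):
        print(token)
```

Because the stack remembers where the lexer came from, the closing brace returns to string mode and the closing quote returns to the default mode, which is what lets an identifier like `name` be recognized inside a string literal.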

Last Updated ( Tuesday, 08 August 2017 )