How To Create Pragmatic, Lightweight Languages
Article Index
How To Create Pragmatic, Lightweight Languages
Part I: The Basics
Part II Compiling


In languages with strong regular expression capabilities such as Perl, this grammar could be defined, although tediously, in the host language itself, something often described as the "poor man's lexer"::

sub lexer {
 my $sql = shift;
 return sub {
  return ['KEYWORD', $1] if $sql =~ /\G ($keyword) /gcx;
  return ['COMMA',   ''] if $sql =~ /\G ($comma)   /gcx;
  return ['OP',      $1] if $sql =~ /\G ($op)      /gcx;
  return ['PAREN',    1] if $sql =~ /\G $lparen    /gcx;
  return ['PAREN',   -1] if $sql =~ /\G $rparen    /gcx;
  return ['TEXT',    $1] if $sql =~ /\G ($text)    /gcx;
  redo LEXER             if $sql =~ /\G \s+        /gcx;

my $lexer = lexer($sql);

while (defined (my $token = $lexer->())) {
    # do something with the token
but this is not what we are after here.

We also set an associated rule which our identifiers should follow:
ID : [_]*[a-z][A-Za-z0-9_]* ;

this means that the following are considered valid identifiers in our language:


while the following are not:


There's a few more other things before we consider our grammar complete, like making the lexer context sensitive so that it can recognize that "\n" is significant only inside a String or enabling interpolation inside a String    :

STRING_OPEN : '"' -> pushMode(MODE_IN_STRING);


 4 ESCAPE_SLASH            : '\\\\' ;
 5 ESCAPE_NEWLINE          : '\\n' ;
 6 ESCAPE_SHARP            : '\\#' ;
 7 STRING_CLOSE            : '"' -> popMode ;
 9 STRING_CONTENT          : ~["\n\r\t\\#]+ ;
11 STR_UNMATCHED           : . -> type(UNMATCHED)


Now that our lexer grammar has been defined we need to invoke ANTLR through Kotlin code :


and invoke the lexer to print the list of tokens included in
file "" :

input Int width
input Int height
var area = width * height
print("A rectangle #{width}x#{height} has an area #{area}"

We build and run the solution by running the "gradlew" wrapper from the CLI, something that produces the following output:

  1 L1(0-4) INPUT 'input'
 2 L1(5-5) WS ' '
 3 L1(6-8) INT 'Int'
 4 L1(9-9) WS ' '
 5 L1(10-14) ID 'width'
 6 L1(15-15) NEWLINE '\n'
 7 L2(16-20) INPUT 'input'
 8 L2(21-21) WS ' '
 9 L2(22-24) INT 'Int'
10 L2(25-25) WS ' '
11 L2(26-31) ID 'height'
12 L2(32-32) NEWLINE '\n'
13 L3(33-35) VAR 'var'
14 L3(36-36) WS ' '
15 L3(37-40) ID 'area'
16 L3(41-41) WS ' '
17 L3(42-42) ASSIGN '='
18 L3(43-43) WS ' '
19 L3(44-48) ID 'width'
20 L3(49-49) WS ' '
21 L3(50-50) ASTERISK '*'
22 L3(51-51) WS ' '
23 L3(52-57) ID 'height'
24 L3(58-58) NEWLINE '\n'
25 L4(59-63) PRINT 'print'
26 L4(64-64) LPAREN '('
27 L4(65-65) STRING_OPEN '"'
28 L4(66-77) STRING_CONTENT 'A rectangle '
29 L4(78-79) INTERPOLATION_OPEN '#{'
30 L4(80-84) ID 'width'
32 L4(86-86) STRING_CONTENT 'x'
33 L4(87-88) INTERPOLATION_OPEN '#{'
34 L4(89-94) ID 'height'
36 L4(96-108) STRING_CONTENT ' has an area '
37 L4(109-110) INTERPOLATION_OPEN '#{'
38 L4(111-114) ID 'area'
39 L4(115-115) INTERPOLATION_CLOSE '}'
40 L4(116-116) STRING_CLOSE '"'
41 L4(117-117) RPAREN ')'
42 L4(118-118) NEWLINE '\n'
43 L5(119-118) EOF '<EOF>'

which proves that our lexer functioned as expected and broke the text down to its tokens.

In similar fashion, the chapter continuous with describing the lexer grammar for StaMac too.

5: Writing a parser builds on the work done by the lexer, with the Parser now having to arrange the tokens into an organized structure called a parse-tree. To get that  tree we'll use ANTLR as a parser generator in accordance to out parser grammar.

ANTLR version 4 in contrast to other recursive-descent parser generators, like Perl's famous Parse::RecDescent, is that it eliminates left recursion.For example if Parse::RecDescent would be fed with a left-recursive production, it would stall in an infinite loop.

So the parser grammar in effect consumes the lexer grammar but also introduces the new  notions, of statements, expressions, operations between expressions and precedance .

Here is our parser grammar of our first example language, MiniCalc:



Line  4 options { tokenVocab=MiniCalcLexer; } is proof of the unix philosophy in connecting self-contained and well defined components.We simply create a lexer for our code and pass that lexer to our parser.

The approach adopted in going through the parser grammar remains the same as in the previous lexer grammar example - extracting blocks off the listing and going through them.

For example  :

Labels are more useful when we have more than one terminal of the same kind in the same rule.This is for example the case in the binaryOperation alternative of the expression rule. You can see that we have two expressions: one with label left and one with label right. Later we will be able to ask for the left or the right expression of a binaryOperation avoiding every confusion.

Then it's just a matter of invoking the parser with similar Kotlin code in order to get our parse tree.

At this point we do have our first usable parser and, while not going behind the underlying theory of EBNF grammars, examining the listings of our toy grammars and their associated light explanations of their rules, allows us to deduct how simple lexer and parser grammars can be defined.

So that's quite a few things already; defining a lexer grammar; building a lexer; defining a parser grammar; building a parser; combining the parser with the lexer and output a parse tree!  And we're still on Chapter 5.


Last Updated ( Tuesday, 08 August 2017 )