Parsing with Perl 6 Regexes and Grammars

Written by Nikos Vaggalis

Author: Moritz Lenz
Publisher: Apress
Pages: 201
ISBN: 978-1484232279
Print: 1484232275
Kindle: B0785L6WR6
Audience: Programmers, Developers, Scientists
Rating: 5
Reviewer: Nikos Vaggalis

This book's title, together with its subtitle "A Recursive Descent into Parsing", reveals that it focuses on a very specific area of Perl 6: its regular expression capabilities.

This is Moritz Lenz's second book release. One could argue that you have to tackle the first one, "Perl 6 Fundamentals", before progressing to this one. This is not the case, however, as the book is written in such a way that it can be read independently, without requiring any prior knowledge of Perl 6. That said, it does dedicate a chapter to getting started with Perl 6, just enough to ready you to work with its regular expressions. Beyond that, the author goes into detail explaining the Perl 6 features used in the examples.

A question that I also posed to Moritz was whether the knowledge obtained from studying this book is transferable to Perl 5 or to other languages, or whether it is unique to Perl 6.

Moritz's reply was that

the general skills/knowledge is transferable, but p6 regexes and grammars offer much awesome stuff that other implementations simply don't have.

So while the material is only applicable to Perl 6, since its implementation and the tricks you can pull off go beyond PCRE, previous experience with regular expressions in general will certainly make the material easier to follow.

About the author himself, Moritz is a well-known figure in Perl and especially Perl 6 circles, being a contributor to the Rakudo Perl 6 compiler as well as the initiator of the official Perl 6 documentation project. I Programmer's first contact with him was back in 2012(!), when I interviewed him about "Perl 6 and Parrot", at a time when Parrot was Perl 6's primary VM and well ahead of any plans for Rakudo.

The interview does not stop there, though. It also investigates the differences, cultural or otherwise, between Perl 5 and Perl 6, as well as the then yet-to-be-announced regular expression capabilities in Perl 6. It's worth recalling relevant snippets of this interview:

NV: How do you create a parser for a language on Parrot? I understand that it is not done in the traditional lex/yacc/BNF grammar way but with Perl 6 regexes?

ML: Yes. Perl 6 regexes are somewhat of a mixture of traditional regexes (for lexing), and a BNF-like structure for parsing.

One of the reasons that people think regexes are stopgap solutions is that most languages don't have a proper way to reuse regexes. In Perl 6 regexes, a regex is a method on the grammar, and you can call other regexes just as you can call other methods in normal code and you can use all the usual code reuse mechanisms, like calling other grammars (basically delegation), inheritance, and role composition.

And more than that, it provides you with a way to trigger actions that are tied to grammar rules. So for each grammar rule, a method in a corresponding "actions" class is triggered, and that makes it easy to create a syntax tree right away.

NV: Are those actions generated behind the scenes or is the developer required to write and wire them as well?

ML: The developer needs to write them, but the grammar + actions mechanism provides a convenient way to trigger them, and to assemble the result in a tree.

NV: What about Perl 6 mutable grammar? How does it work and how is that an advantage?

ML: For example you can allow the language you are parsing to add a new operator, and immediately use it. It also means that it is easy to create a dialect of a grammar. For example, if you have a grammar for JSON, it is trivial to override the whitespace rule, and make it accept comments too (which aren't part of standard JSON).

Speaking of which, here is a JSON grammar in Perl 6. I think it is a nice example, because it's self-contained, useful and short. So to extend that, you'd simply write

token ws { \s* [ '//' \N* \n ]* }

and then everywhere that whitespace is allowed, // comments until the end of the line are allowed too, and you can do that in a subclass if you want.

You can easily infer that Moritz is the definitive authority to talk to about the subject, and this book is his attempt to impart the knowledge he has accumulated over the years.

Now, as far as the book itself goes, it is broken down into 13 chapters:

Chapter 1: What are Regexes and Grammars?
Chapter 2: Getting Started with Perl 6
Chapter 3: The Building Blocks of Regexes
Chapter 4: Regexes and Perl 6 Code
Chapter 5: Extracting Data from Regex Matches
Chapter 6: Regex Mechanics
Chapter 7: Regex Techniques
Chapter 8: Reusing and Composing Regexes
Chapter 9: Parsing with Grammars
Chapter 10: Extracting Data from Matches
Chapter 11: Generating Good Parse Error Messages
Chapter 12: Unicode and Natural Language
Chapter 13: Case Studies

In more detail, Chapter 1 contrasts regexes with grammars and briefly outlines popular use cases such as searching for data, ensuring that input is in the described format, or extracting specific components. These tasks can be more or less grouped into three broad categories: Searching, Validating and Parsing.

As well as cleanly defining the boundaries, this grouping also defines the book's target audience: people with some programming experience but novices in regular expressions, who are actively looking for a tool to make their work easier but do not want to deal with the cognitive load that the next best thing, Perl 5's regexes, carries. This is detailed a bit further down in section 1.3 "What’s So Special about Perl 6 Regexes?":

Unfortunately, in making regexes so useful, Perl had assigned special meaning to almost every ASCII character (except those that match literally). And, as newer and more powerful regex features were created, this led to using obscure character sequences for the new features while continuing to maintain backward compatibility with existing regex syntax. A good example of such a character sequence is (?<=pattern) for look-behind assertions. Perl 6 regexes clean up this historical syntactic baggage.

Besides programmers and developers, other candidates I could imagine would be scientists in fields like NLP or Data Science, which Python currently dominates due to its data science extensions that bring coding closer to the scientific community. Perl 6 aims to do the same with its regular expression capabilities.

Chapter 2 offers a quick introduction to installing Rakudo and obtaining the code examples, as well as a rundown of the basics of the Perl 6 language, such as variables, strings, functions, classes, methods and control structures.

Chapter 3 "The Building Blocks of Regexes" is pretty much straightforward regex basics, the likes of Literals, Meta characters, Anchors and more.

Chapter 4 moves along the same lines. It goes through the idea of "smart matching", abolished in Perl 5 but reinvented the right way in Perl 6, and treats regular expressions as objects, with an equivalent of Perl 5's qr operator. It then looks at the regex modifiers, split into those that affect compile-time behaviour and those that affect the runtime behaviour of a compiled regex, such as :exhaustive, :continue and :pos.

The comb and split methods, which are called on a string object, follow:

my @numbers = "1308 5th Avenue".comb(/\d+/);
say @numbers; # Output: [1308 5]

my ($city, $area, $popul) = 'Berlin;891.8;3671000'.split(/';'/);
say $area; # Output: 891.8

We then "Cross the Code and Regex Boundary", where "regexes and regular Perl 6 code compile to the same byte code, so that you can mix Perl 6 code and regexes", for example incorporating Perl 6 code into the body of a regex simply by embedding it in curly braces:

my $count = 0;
my $str = "between 23 and 42 numbers";
if $str ~~ / [ \d+ { $count++ } \D* ]+ / {
    say $count; # Output: 2
}

There's also more advanced code embedding functionality, such as executing arbitrary code with <?{ ... }> in order to actually influence the matching process. This is demonstrated with a great example of matching integers of up to three digits and checking on the fly, while the matching is still going on, whether they are valid parts of an IP address (in the range 0 to 255).
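To give a feel for how such an assertion steers a match, here is a minimal sketch. It is not the book's own listing, and the names octet and is-ipv4 are made up for this illustration. The code inside <?{ ... }> runs during matching and rejects any digit run whose value exceeds 255.

```raku
# Hypothetical names: `octet` and `is-ipv4` are illustrative, not from the book.
# An octet is one to three digits whose numeric value must not exceed 255.
my token octet { \d ** 1..3 <?{ $/.Int <= 255 }> }

# Four octets separated by dots, anchored at both ends of the string.
sub is-ipv4(Str $s --> Bool) {
    so $s ~~ / ^ <octet> ** 4 % '.' $ /
}

say is-ipv4('127.0.0.1');   # True
say is-ipv4('256.1.1.1');   # False
```

Because the check lives inside the regex, an out-of-range number makes the match fail at that point rather than needing a separate validation pass afterwards.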

The ways to embed code into regexes are summed up in the following table:

{ CODE }       Runs Perl 6 code; no effect on the regex match
<?{ CODE }>    Code needs to return a true value for the match to succeed
<!{ CODE }>    Code needs to return a false value for the match to succeed
<{ CODE }>     Result of the code is interpreted as a regex
<$STRING>      Interprets $STRING as regex source code

If you are interested in the ways that Perl 5 goes about it, you might want to check the following two I Programmer articles:

Advanced Perl Regular Expressions - Extended Constructs and

Advanced Perl Regular Expressions - The Pattern Code Expression.

Chapter 5 on extracting data from regex matches covers the usual regex primitives:

The Match Object
Positional Captures
Nesting of Captures
Named Captures
Backreferences

but looked at from a Perl 6 perspective. Even someone with little regex experience should be able to follow along without problems.

Then again, knowing how the tool works under the hood is not only useful for understanding why your attempts to match fail, but also enables you to optimize your expressions by rewriting them so that they run much faster, something especially desirable when computationally intensive behaviour such as backtracking kicks in. That and more is covered in Chapter 6 on "Regex Mechanics".

Chapter 7 "Regex Techniques" is a treasure trove of considerations when looking to parse text according to a format or specification, say INI or JSON. If you think that parsing an INI file is trivial, you might very well have left the following cases out of your planning:

Can the list of sections be empty (that is, is an empty file a valid INI file)?
Can a section be empty (not contain any key/value pairs)?
Are comments allowed after a key/value pair? For instance, if you write port=443; only one not blocked by firewall, is the comment part of the value or not?
Where is whitespace allowed? Are [ database ] or port = 443 valid?
What’s allowed in a section name? For example whitespace, or an opening bracket? What about line breaks?
Same for keys. Is a dash (-) allowed in a key?

The chapter not only explores such potential gotchas and exotic cases but also shows how to recover from them.

Chapter 8 on reusability and composability looks at grammars, not in the traditional sense but as namespaces grouping invokable methods. As such you gain the benefits of OOP: you can inherit from a more generic grammar to create a more specific variant, or employ composition rather than inheritance by composing roles to assemble a grammar from smaller independent parts, hence making your regexes accessible for reuse.
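As a rough sketch of both mechanisms, and not an example taken from the book, the hypothetical grammars below compose a role that supplies a shared token and then derive a more specific variant by overriding a single token. The names Digits, NumberList and SignedList are invented for this illustration.

```raku
# All names here (Digits, NumberList, SignedList) are hypothetical.

# A role contributes reusable tokens to any grammar that composes it.
role Digits {
    token digits { \d+ }
}

# A generic grammar: a comma-separated list of unsigned numbers.
grammar NumberList does Digits {
    token TOP { <num>+ % ',' }
    token num { <digits> }
}

# A more specific variant: inherit everything, override only `num`.
grammar SignedList is NumberList {
    token num { '-'? <digits> }
}

say so NumberList.parse('1,2,3');    # True
say so NumberList.parse('1,-2,3');   # False
say so SignedList.parse('1,-2,3');   # True
```

The subclass never restates TOP; it picks up the overridden num token through ordinary method resolution, which is the same idea as overriding the whitespace rule in the JSON example from the interview.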

Chapter 9 "Parsing with Grammars" could very well be re-titled "Parsing Essentials", as it offers a complete overview of the subject. It begins by analysing a humble mathematical expression that adds and multiplies numbers, which, as the example delves deeper to incorporate operator precedence and recursion, turns out to be more complex than initially thought. Again, direct instructions on how to tackle potential obstacles are given, like how to eliminate left recursion:

A common technique to avoid left recursion is to have a structure where you can order regexes from generic (here sum) to specific (number). You only have to be careful and check for consumed characters when a regex deviates from that order (e.g., group calling sum).

The chapter then moves up to parsing larger structures. At its close, it leaves this simple mathematical expression behind to tackle more interesting data formats, such as those of programming languages like C, which require the parser to store state, albeit temporarily.

Chapter 10 "Extracting Data from Matches" is about manipulating the parse tree or, better said, the Abstract Syntax Tree produced by the parsing operation, by attaching special methods, through so-called Action Objects, to its nodes:

An action object is one that you pass to a .parse or .subparse call on a grammar. Henceforth, whenever a named regex matches successfully, the regex engine calls a method for you; it searches for a method with the same name as the regex, and if one exists, calls it with the match object as the argument. If such a method does not exist, nothing happens, and no error is raised.

With Action Objects you can do pretty much anything to a parse tree, even transform it into other kinds of trees.
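A minimal sketch of the mechanism, under the assumption of a tiny made-up grammar for sums (Sum and SumActions are illustrative names, not the book's), could look like this. Each action method uses make to attach a value to its match, and TOP combines the values of its children.

```raku
# Hypothetical grammar and actions class, for illustration only.
grammar Sum {
    token TOP { <num>+ % <op> }
    token num { \d+ }
    token op  { '+' }
}

class SumActions {
    # Called whenever `num` matches: attach the numeric value to the match.
    method num($/) { make +$/ }
    # Called when `TOP` matches: add up the values attached to each `num`.
    method TOP($/) { make $<num>.map(*.made).sum }
    # There is deliberately no `op` method; the engine simply skips it.
}

say Sum.parse('1+20+300', actions => SumActions).made;   # 321
```

The grammar stays free of evaluation logic, so a different actions class could build a tree or some other structure from the same parse.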

Chapter 12 "Unicode and Natural Language" discusses issues that you have to be aware of when processing multi-language input. A rundown of what constitutes Unicode as a concept follows, with material on "Writing Systems", "Bytes, Code Points, Graphemes and Glyphs" and "Unicode Properties".

Finally, Chapter 13 examines some real-world case studies: parsing S-Expressions, more elaborate Mathematical Expressions, and an example toy language, a small subset of Python that the author named Pythonesque, used to show how to parse an indentation-based, YAML-like format.

Conclusion

All in all, I liked the clean-cut, hands-on approach to regular expressions and parsing taken by the book. These are capabilities that Perl 6 builds in, making them much more accessible than more traditional approaches such as ANTLR.

That said, Perl 6 is a magnificent language whose depth is only starting to emerge, and the more you look the more you find something new. Great books like this and "Think Perl 6" (check our Mega Review) play a pivotal role in educating the public about the language's purposes, benefits, capabilities and place in the pantheon of programming languages.


Last Updated ( Wednesday, 20 December 2017 )