Grammar and Torture
Written by Mike James   
Thursday, 26 January 2017

Computational grammar is a subject that is sometimes viewed as a form of torture by computer science students, but understanding something about it really does help ....

In The Heart Of A Compiler we saw how assembly languages evolved into high level languages and the problems caused by the apparently simple “arithmetic expression”.

What we haven’t looked at so far is the impact this practical problem had on theoretical computer science.

There is an argument that the arithmetic expression is the whole reason that grammar and parsing methods were invented, but once you have the theory of how to do it you might as well use it for the entire structure of a language.



Computational Grammar

To deal with the whole problem of describing and compiling artificial languages we had to invent a new subject – computational grammar.

This is a subject that has, over the years, been used as a form of torture for thousands of computer science students, but it deserves better!

Backus Naur Form - BNF

In the early days of high level languages the only really difficult part was the translation of arithmetic expressions to machine code and much of the theory of grammar was invented to deal with just this problem. 

At the core of this theory, and much used in the definition of programming languages, is BNF, or Backus Normal Form. Because Donald Knuth pointed out that it really isn't a normal form, in the sense of providing a unique normalized grammar, it is now usually known as Backus-Naur Form, after John Backus, the inventor of Fortran, and Peter Naur, one of the people involved in creating Algol 60.

Not only isn't it a normal form, there isn't even a single standard notation for BNF and everyone feels free to make up their own variation on the basic theme. However, it is always easy enough to understand.

For example, using "arrow notation" you might write:

<additive expression> -> <variable> + <variable>

You can read this as saying that an additive expression is formed by taking a variable, a plus sign and another variable.

This rule only fits expressions like A+B, and not A+B+C, but you get the general idea.

Quantities in angle brackets like

<additive expression>

are called “non-terminal” symbols because they are defined by a BNF rule and don’t appear in the final language.

The plus sign on the other hand is a “terminal” symbol because it isn't further defined by a BNF rule.

You might notice that there is a slight cheat going on, in that the non-terminal <variable> was replaced by terminals in the examples, but without a rule allowing this to happen.

Well, in proper BNF you have to have a rule that defines every non-terminal in terms of terminal symbols, but in the real world this becomes so tedious that we often don't bother and rely on common sense.

In this case the missing rule is something like:

<variable> -> A|B|C|D etc .. |Z

where the vertical bar is read as “OR”. Notice that this defines a <variable> as just a single character - you need a slightly more complicated rule for a full multicharacter variable name.
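The two rules so far can be turned into a tiny recognizer. This is just an illustrative sketch in Python - the function names are made up for this example - but it shows that a BNF rule translates almost directly into a test for each non-terminal:

```python
import string

def is_variable(s):
    # <variable> -> A|B|C|D ... |Z : exactly one upper-case letter
    return len(s) == 1 and s in string.ascii_uppercase

def is_additive_expression(s):
    # <additive expression> -> <variable> + <variable>
    # i.e. exactly two variables separated by a single plus sign
    parts = s.split("+")
    return len(parts) == 2 and all(is_variable(p) for p in parts)

print(is_additive_expression("A+B"))    # True
print(is_additive_expression("A+B+C"))  # False - the rule allows only one +
```

Notice that, just like the grammar, the code rejects A+B+C because the rule only provides for a single plus sign.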

If you want to define a variable that is allowed to have a name of more than one letter you might use:

1 <variable> -> <variable> <letter> | <letter>
2 <letter> -> A|B|C|D ... |Z

This is the BNF equivalent of a repeating operation. It says that a variable is either a letter or a variable followed by a letter. 

For example, to use this definition you start with:

<variable>

and use rule one to replace it by:

<variable> <letter>

then use rule two to replace <letter> by A to give:

<variable> A

Next we use the first rule to replace <variable> to get:

<variable> <letter> A

and so on, building up the variable name one character at a time.
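The derivation can be mechanized. The sketch below - the function name is invented for this example - prints the sequence of sentential forms for a given name, with each printed step collapsing one use of rule one and one use of rule two. Because rule one is left-recursive, the letters of the name are revealed from the right:

```python
def derive(name):
    # Derive an upper-case name from the two rules:
    #   1 <variable> -> <variable> <letter> | <letter>
    #   2 <letter>   -> A|B|C|D ... |Z
    steps = ["<variable>"]
    # Repeated use of rule one then rule two reveals letters right-to-left
    for i in range(len(name) - 1, 0, -1):
        steps.append("<variable>" + name[i:])
    # Final step: the "|<letter>" alternative of rule one, then rule two
    steps.append("<letter>" + name[1:])
    steps.append(name)
    return steps

for form in derive("CAB"):
    print(form)
```

Running this for "CAB" prints <variable>, <variable>B, <variable>AB, <letter>AB and finally CAB - exactly the sort of sequence worked through above. The "|<letter>" alternative is what ends the recursion.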

Boring isn’t it?

But it is an ideal way to tell a computer what the rules of a language are.

Q)  Just to check that you are following – why does rule one include the alternative “|<letter>”?

A) The reason is that this is the version of the rule we use to stop a variable name growing forever, because once it is selected the next step is to pick a letter and then we have no non-terminals left in the result.

Notice also that the BNF definition of a variable name cannot restrict its length. Rules like “fewer than 8 characters” have to be imposed as notes that are supplied outside of the BNF grammar.

Extended BNF

To make repeats easier you will often find that an extended BNF notation is used, in which curly brackets indicate that the item can occur as many times as you like - including zero. In this notation:

<variable> -> <letter> {<letter>}

means “a variable is a letter followed by any number of letters”. Another extension is the use of square brackets to indicate an optional item. For example:

<integer constant> -> [+|-] <digit> {<digit>}

means that you can have a plus sign, a minus sign, or no sign at all in front of a sequence of at least one digit.

The notation isn't universal by any means but something like it is often used to simplify BNF rules.
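The extended-BNF rule for an integer constant maps naturally onto code: the square brackets become an "if" and the curly brackets become a loop, or here a check over the remaining characters. This is an illustrative sketch, with a made-up function name:

```python
DIGITS = "0123456789"

def is_integer_constant(s):
    # <integer constant> -> [+|-] <digit> {<digit>}
    # [+|-] : an optional leading sign
    if s[:1] in ("+", "-"):
        s = s[1:]
    # <digit> {<digit>} : at least one digit, then any number more
    return len(s) >= 1 and all(c in DIGITS for c in s)

print(is_integer_constant("-42"))  # True
print(is_integer_constant("+"))    # False - a sign alone has no digit
```

The test for at least one digit corresponds exactly to the fact that the rule writes <digit> once outside the curly brackets.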

BNF in Pictures - Syntax Diagrams

You will also encounter BNF in picture form. Syntax diagrams or "railroad diagrams" are often used as an easy way to understand and even use BNF rules.

The idea is simple - each diagram shows how a non-terminal can be constructed from other elements. Alternatives are shown as different branches of the diagram and repeats are loops. You construct a legal string by following a path through the diagram, and you can determine if a string is legal by finding a path through the diagram that it matches.

For example, Pascal defined a variable name by a syntax diagram: a letter, followed by a loop that can be traversed any number of times picking up a letter or a digit on each pass.

It corresponds to the rule:

<variable> -> <letter> {<letter or digit>}


If you want to see more syntax diagrams simply see if your favourite language has a list of them.

A classic example is the 1973 document defining Pascal - scroll to the end of the PDF for the syntax diagrams.


Why bother?

So what have we got so far?

A grammar is a set of rules which defines the types of things that you can legitimately write in a given language.

As such this is very useful because it allows designers of computer languages to define the language unambiguously.

If you look at the definition of almost any modern language - C, C++, C#, Java, Python, Ruby and so on - you will find that it is defined using some form of BNF. However, you will also find languages that don't bother. For example, PHP just sort of grew and was defined by its implementation. Only later did anyone go back and work out what the BNF needed to define it actually was. See PHP Gets A Formal Specification.

For example, if you look up the grammar of C++ you will discover that an expression is defined as:

    assignment-expression
    expression , assignment-expression

which in our BNF notation would be:

<expression> -> <assignment-expression> | <expression> , <assignment-expression>
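A rule like this is left-recursive - the non-terminal being defined appears first on the right-hand side - and in a hand-written parser it is usually handled as a loop rather than by recursion. The sketch below is purely illustrative (the function name is invented, and an "assignment-expression" is simplified to a single token) but it shows the standard trick:

```python
def parse_expression(tokens):
    # Handles the left-recursive rule
    #   <expression> -> <assignment-expression> |
    #                   <expression> , <assignment-expression>
    # iteratively, as the equivalent
    #   <expression> -> <assignment-expression> {, <assignment-expression>}
    exprs = [tokens.pop(0)]          # first <assignment-expression>
    while tokens and tokens[0] == ",":
        tokens.pop(0)                # consume the comma
        exprs.append(tokens.pop(0))  # next <assignment-expression>
    return exprs

print(parse_expression(["A", ",", "B", ",", "C"]))  # ['A', 'B', 'C']
```

The rewrite from left recursion to a curly-bracket repeat is the same move we met in extended BNF, and it is exactly what makes the rule easy to parse with a simple loop.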

As already mentioned, sometimes the BNF has to be supplemented by the occasional note restricting what is legal – for example no BNF grammar can restrict the length of a terminal string to a given number of characters. 

This use of BNF in defining a language is so obviously useful that most students of computer science and practicing programmers immediately take it to heart as a good method.

Things really only seem to begin to go wrong when the torture of parsing methods is introduced. Most people just get lost in the sheer number and complexity of the methods introduced and can't see why they should bother with any of it.

Let’s find out the true story while avoiding the torture.

Last Updated ( Thursday, 26 January 2017 )