|
Page 1 of 2
Computational Grammar
In other looks into Babbage’s Bag we have been concerned with the development of computer languages. We have seen how assembly languages evolved into high level languages and the problems caused by the apparently simple “arithmetic expression”. What we haven’t looked at is the impact this practical problem had on theoretical computer science. To deal with the whole problem of describing and compiling artificial languages they had to invent a new subject – computational grammar. This is a subject that has over the years been used as a form of torture for thousands of computer science students but it deserves better! So this Babbage’s Bag is full of something a little more theoretical than usual. You don’t have to know all of this to write a compiler but it might help you understand things a little better.
BNF
Babbage's Bag has previously introduced a notation to describe what constitutes a legal arithmetic expression but without a great deal of explanation. In the early days of high level languages the only really difficult part was the translation of arithmetic expressions to machine code and much of the theory of grammar was invented to deal with just this problem. At the core of this theory, and much used in the definition of programming languages, is BNF, or Backus Normal Form. The only problem is that there isn’t one standard notation for BNF but it is always easy enough to understand. For example, using arrow notation you might write:
<additive expression> -> <variable> + <variable>
You can read this as saying that an additive expression is formed by taking a variable, a plus sign and another variable. This rule only fits expressions like A+B, and not even A+B+C, but you get the general idea. Quantities in angle brackets like <additive expression> are called “non-terminal” symbols because they are defined by a BNF rule and don’t appear in the final language. The plus sign on the other hand is a “terminal” symbol because it isn’t further defined by a BNF rule. You might notice that there is a slight cheat going on in that the non-terminal <variable> was replaced by a terminal in the examples but without a rule allowing this to happen. Well in proper BNF you have to have a rule that defines every non-terminal in terms of terminal symbols but in the real world this becomes so tedious that we often don’t bother and rely on common sense. In this case the missing rule is something like:
<variable> -> A|B|C|D etc .. |Z
where the vertical bar is read as “OR”. If you wanted to define a variable that was allowed to have more than a one-letter name you might use:
1 <variable> -> <variable> <letter>| <letter> 2 <letter> -> A|B|C|D etc .. |Z
This is the BNF equivalent of a repeating operation. For example, to use this definition you start with:
<variable>
and use rule one to replace it by:
<variable><letter>
then use rule two to replace <letter> by A to give:
<variable>A
Next we use the first rule to replace <variable> to get:
<variable><letter>A
and so on building up the variable name one character at a time.
Boring isn’t it, but it is an ideal way to tell a computer what the rules of a language are.
Q) Just to check that you are following – why does rule one include the alternative “|<letter>”?
A) The reason is that this is the version of the rule we use to stop a variable name growing forever because once it is selected the next step it to pick a letter and then we have no non-terminals left in the result.
Notice also that the BNF definition of a variable name cannot restrict its length. Rules like “fewer than 8 characters” have to be imposed as notes that are supplied outside of the BNF grammar.
Extended BNF
To make repeats easier you will often find that an extended BNF notation is used where curly brackets are used to indicate that the item can occur as many times as you like – including zero. In this notation:
variable -> <letter> {<letter>}
means “a variable is a letter followed by any number of letters”. Another extension is the use of square brackets to indicate an optional item. For example:
<integer constant> -> [+|-] <digit> {<digit>}
means that you can have a plus or minus sign or no sign at all in front of a sequence of at least one digit.
Why bother?
So what have we got so far? A grammar is a set of rules which defines the types of things that you can legitimately write in a given language. As such this is very useful because it allows designers of computer languages to define the language unambiguously. If you look at the definition of almost any modern language – C, C++, Pascal, Java and so on - you will find that it is defined using some form of BNF. For example, if you lookup the grammar of Visual C++ you will discover than an expression is defined as:
expression: assignment-expression expression , assignment-expression
Which in our BNF notation would be
<expression>-> <assignment-expression>| <expression>,<assignment-expression>
As already mentioned, sometimes the BNF has to be supplemented by the occasional note restricting what is legal – for example no BNF grammar can restrict the length of a terminal string to a given number of characters. This use of BNF in defining a language is so obviously useful that most students of computer science and practicing programmers immediately take it to heart as a good method. Things really only seem to begin to go wrong when the torture of parsing methods is introduced. Most people just get lost in the number and range of complexity of the methods introduced and quite frankly can’t see why they should bother with any of it. Let’s find out the true story while avoiding the torture.
<ASIN: 0321544285>
<ASIN: 0763741493>
|