Compilers, Interpreters, VMs and JIT
Written by Mike James   
Thursday, 21 December 2017
Two approaches

The distinction between a compiler and an interpreter is one that can cause controversy. One programmer's compiler is another's interpreter and the whole subject gets very murky when you throw in the idea of the Virtual Machine and Just In Time compilation. So what is it all about?

If you want to start an argument between programmers or software engineers then just say “language X is implemented by an interpreter”.

If that doesn’t cause a heated exchange, then change “interpreter” to “compiler” in the same sentence. For while some languages are generally accepted as being implemented one way or the other, the situation is rarely clear cut.

Back in the early days of computing the distinction was much more important. The debate about compilers versus interpreters would rage in the pages of any magazine or journal, and verbal wars were common. Today, with more powerful machines, the issue isn’t quite as important, but it can still raise a good argument!

It can also be the deeper cause of a language war. For example, C++ v C# isn't just about which language is better; it is also about the way that the languages are generally implemented. C++ is usually compiled to machine code, whereas C# is usually compiled to an intermediate code and then interpreted or JITed.

But this is getting ahead of ourselves.

As well as being controversial, the subject is also still at the leading edge of development, although now we tend to talk more about “virtual machines” and “Just In Time compilers” than straightforward interpreters and compilers, and this makes everything dissolve into shades of grey.

In other articles we have looked at the progression from machine code to assembler and on to high-level languages. An inherent assumption in this discussion is that the language would always be translated, or compiled, to machine code before it was expected to do anything much - but this isn’t the only possible approach. 

The run time

The idea of an interpreter as opposed to a compiler evolved slowly and it isn’t very easy to say exactly where the idea came from.

In the early days compilers would translate every last instruction of a high-level language program into machine code, and the resulting machine code was then taken and run on a "real" machine. When you think about it, what other way could you get a program written in, say, Fortran to run on hardware that knew nothing of Fortran and only understood machine code?

In other words the process of compiling from a language to machine code seemed unavoidable and the only way to do things. This was a major problem, however, as no one had any idea how to do it, and it wasn't at all obvious that the resulting automatically generated code would be fast enough to be useful. Eventually the first compilers were created and they proved valuable.

Then some clever programmer had an idea. Why not make use of a library of machine code subroutines to make life easier?

For example, if the high level language program contained the line

A=B*C

then a standard compiler approach would take the “B times C” part of the line and translate it to the machine code equivalent, which, if you were lucky, would be something like:

MUL B,C

However, many early machines didn’t have a multiply command and so the multiplication had to be built up using add and shift instructions. A true compiler would and should take the high level multiplication B*C and convert it into a complete sequence of machine code instructions that performed the multiplication. 

You can see that, implemented this way, a simple operation like multiplication would generate a lot of machine code each time you used it.

The clever idea was to create a subroutine that would multiply two numbers together and then a multiplication operation would be compiled to a call to the new subroutine.

That is B*C compiles to:

CALL Multiply

Of course the values in B and C would have to be loaded into the locations in which the subroutine expected to find them before the call, but in principle this is still a great simplification.
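
In C-like terms, a minimal sketch of the idea might look something like this, with multiply standing in for the routine supplied once by the run time library. The shift-and-add body is the sort of thing a machine with no multiply instruction would have to do, and the names are purely illustrative:

#include <stdio.h>

/* One copy of this routine lives in the run time library. On a machine
   with no multiply instruction it builds the product from shifts and adds. */
unsigned int multiply(unsigned int b, unsigned int c) {
    unsigned int product = 0;
    while (c != 0) {
        if (c & 1)          /* low bit of c set?            */
            product += b;   /* add in the current shifted b */
        b <<= 1;            /* shift b left one place       */
        c >>= 1;            /* shift c right one place      */
    }
    return product;
}

int main(void) {
    unsigned int b = 13, c = 11;
    /* A = B*C compiles to a call, once the operands are where the routine
       expects to find them - here simply passed as arguments. */
    unsigned int a = multiply(b, c);
    printf("%u\n", a);      /* prints 143 */
    return 0;
}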

The advantage of this approach is that the compiled program is smaller because the multiplication code isn’t repeated every time it is used.

However there are some disadvantages to the method. The first is that there is usually an overhead in calling a subroutine that makes this approach slower than simply compiling in the instructions needed to multiply two numbers together. The second is that now any program that runs on the machine compiled in this way needs access to a chunk of standard code – usually called a “run time library”.

Compilers that use run time libraries are common and they can hide the fact that they are doing so by including a copy of the library with each and every program. This, of course, wastes space, but it does produce programs that can run without the obvious help of a run time library.

Smarter compilers only compile in the bits of the run time library that a program actually uses. This is what a linker does. It takes the separate sections of a program, including any routines needed from the run time library, and stitches them all together to create a complete program.

You can even package much of the run time library into separate modules which are used at run time and not compiled into the application - this is what Windows DLLs and Linux SO files are all about. They are shared libraries that are precompiled and available for use by any program.

As time went on run time libraries tended to get bigger and bigger and more sophisticated. The consequence is that more and more of the compiler's time is spent not in compiling to machine code but simply putting in calls to routines in the run time. 

The Interpreter

The idea of using a run time library to implement multiplication and similar operations doesn’t seem like a deep philosophical principle – but it is!

What we have done is to notice that the high level language makes use of a facility that the machine doesn’t actually have.

The run time subroutine can be thought of as providing the missing facility by simulating it in software. If at a later date the machine suddenly gets a piece of hardware that does multiplication, the compiler can be changed to make use of it by issuing a single multiply instruction instead of a call to the multiply subroutine. Alternatively the run time library can be changed to make use of the new hardware – this is very flexible.

Now think a little harder about this idea and you quickly see that it can be extended to greater things.

The multiplication routine is there to make up for the lack of a multiply command in the real hardware.

Why not simulate all of the hardware you need and make the run time library a software machine that runs the high-level language in question?

After all, B*C is an instruction in, say, Fortran or Basic, and the run time can simply do the multiplication as soon as it encounters the instruction. The same is true for all of the instructions written in the high-level language. Each instruction results in a subroutine being called to do whatever it is the instruction specifies.

Now we don't have an obvious translation step from the high level language to machine code. In fact no machine code is generated. All that happens is that the "run time" package now reads the high level language and obeys its instructions by calling appropriate subroutines. 

That is, the run time library has become a computer, implemented in software, which has the high-level language as its machine code.

This is an amazing idea and another name for the simulation is an interpreter.

In the early days languages such as Basic were implemented using interpreters and no compile step was required. The Basic interpreter read each of the Basic instructions and did what it was told to do, usually by calling predefined routines.

One of the earliest to have an impact was Palo Alto Tiny Basic, a miracle of economy by today's standards - a complete integer-only Basic interpreter that used just 2K of an 8-bit processor's memory. It worked by using a big selection statement to call the appropriate subroutine to handle whatever keyword came next - LET, GOTO and so on. The actual Basic was never converted to machine code; it simply activated the subroutines in the run time library, now more correctly called an interpreter.
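
A minimal sketch in C of that style of dispatch might look something like this - the keyword handlers are just printing stubs and all of the details are illustrative, not Tiny Basic's actual code:

#include <stdio.h>
#include <string.h>

/* One subroutine per keyword - stubs that just report what they would do. */
static void do_let(const char *args)   { printf("LET handler:   %s\n", args); }
static void do_print(const char *args) { printf("PRINT handler: %s\n", args); }
static void do_goto(const char *args)  { printf("GOTO handler:  %s\n", args); }

/* The heart of a keyword interpreter: look at the keyword that starts the
   statement and call the subroutine that does the work. No machine code
   is ever generated from the Basic text. */
static void interpret(const char *line) {
    if      (strncmp(line, "LET",   3) == 0) do_let(line + 3);
    else if (strncmp(line, "PRINT", 5) == 0) do_print(line + 5);
    else if (strncmp(line, "GOTO",  4) == 0) do_goto(line + 4);
    else                                     printf("SYNTAX ERROR: %s\n", line);
}

int main(void) {
    interpret("LET A = B * C");
    interpret("PRINT A");
    interpret("GOTO 10");
    return 0;
}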

The Virtual Machine

However notice that there is another way to look at the interpreter code. It is a software implementation of a machine that runs the high-level language as its "machine code". This is what we generally refer to as a Virtual Machine or VM. 

In the case of Tiny Basic the interpreter can be viewed as a simulation of a machine that has Tiny Basic as its machine code or assembly language.

This idea can be generalized and you can design VMs that have a lower level language than Tiny Basic as their machine code. Such a VM can be used as a target for a range of languages with the help of a compiler that translates the high level language into the VM's machine code. And at this point you might think that the world has gone mad - a compiler that compiles to a made-up machine code which is then run on a VM in the style of an interpreter! Surely this is crazy?!
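
To make the idea concrete, here is a minimal sketch in C of such a VM, with a handful of invented stack-based instructions standing in for the made-up machine code - the opcodes and the "compiled" program are purely illustrative:

#include <stdio.h>

/* An invented machine code for the VM. */
enum { OP_PUSH, OP_MUL, OP_PRINT, OP_HALT };

int main(void) {
    /* What a compiler might emit for A = B*C followed by PRINT A,
       with B = 13 and C = 11. */
    int code[] = { OP_PUSH, 13, OP_PUSH, 11, OP_MUL, OP_PRINT, OP_HALT };

    int stack[16];
    int sp = 0;     /* stack pointer   */
    int pc = 0;     /* program counter */

    /* The VM itself: a CPU implemented in software whose machine code
       is the bytecode held in the code[] array. */
    for (;;) {
        switch (code[pc++]) {
            case OP_PUSH:  stack[sp++] = code[pc++];          break;
            case OP_MUL:   sp--; stack[sp - 1] *= stack[sp];  break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]);     break;  /* prints 143 */
            case OP_HALT:  return 0;
        }
    }
}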

 



