Perl Threading
Written by Nikos Vaggalis   
Wednesday, 06 April 2011

Components of UE&R

UE&R is written in Perl and consists of the following parts:

  • Unrar.pm - CPAN Archive::Unrar module, the engine that processes RAR files
  • gui.pl, gui_ui.pm, gui.ui - the GUI front end, built with the Tk toolkit
  • Unrar_Extract_and_Recover.pl - the main program

The Archive::Unrar module

This is the backbone that provides the core functionality. It interacts with the RAR-lab unrar.dll (http://www.rarlab.com/), written in C++, which encapsulates all the functionality for extracting .rar archives (it extracts only and cannot compress). The module also reads the .rar header format to make important decisions.

There is much more going on inside the module, but I will briefly introduce the most crucial parts:

list_files_in_archive() is the procedure that lists the details embedded in the archive (the files bundled into the .rar archive and the archive's comments).

extract_headers() is the procedure that reads the headers.

process_file() is the public procedure that is called by the client. It receives all the information it needs from the client (file, output directory, password, an optional callback and a selection parameter, for example ERAR_MAP_DIR_YES, that controls whether a directory named after the file is created) in the form of a hash of named parameters:

(Unrar_Extract_and_Recover.pl)
($result, $directory) = process_file(
    file            => $file,
    password        => $password,
    output_dir_path => $outputdir,
    selection       => $selection,
    callback        => $callback,
);

(Unrar.pm)
sub process_file {
    my %params = @_;
    my ($file, $password, $output_dir_path, $selection, $callback) =
        @params{qw(file password output_dir_path selection callback)};
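
The named-parameter idiom used above can be tried in isolation. The following is only a minimal sketch: describe_request() and its defaults are hypothetical and are not part of Archive::Unrar; it simply shows how a hash slice pulls several named values out of @_ in one statement.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a sub that accepts named parameters.
sub describe_request {
    my %params = @_;

    # A hash slice extracts several named values in one statement.
    my ( $file, $password, $output_dir_path ) =
      @params{qw(file password output_dir_path)};

    # Parameters that were not supplied come back as undef,
    # so optional ones can be given defaults here (assumed defaults).
    $password        = ''  unless defined $password;
    $output_dir_path = '.' unless defined $output_dir_path;

    return "file=$file password=$password output=$output_dir_path";
}

# The caller may pass the pairs in any order and omit optional ones.
print describe_request( output_dir_path => '/tmp', file => 'a.rar' ), "\n";
```

Because the arguments are key/value pairs, adding a new optional parameter later does not break existing callers, which is presumably why the idiom suits a public entry point like process_file().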

It then calls the private extract_headers() function, which runs various tests on the file by examining its headers. It checks whether the file is valid, whether it is multipart and whether it is encrypted, extracts the embedded comments into an internal Perl buffer for display to the user later on, and runs some more complex tests using bitwise operations such as:

(Unrar.pm)
if (   !( $flagsEX & 256 )
    && !( $flagsEX & 128 )
    && ( $flagsEX & 1 ) )
{
    # not block-encrypted, not the first volume,
    # but part of a multipart archive:
    # no need to process it, so skip it
    $continue = ERAR_MULTI_BRK;
}
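
The same kind of test reads more easily with named bit masks. This is a sketch under stated assumptions, not the module's code: the constant names and is_skippable_part() are hypothetical, and the mapping assumes that 1 marks a volume, 128 marks encrypted block headers and 256 marks the first volume, which matches the comments in the excerpt.

```perl
use strict;
use warnings;

# Hypothetical names for the bit values tested in the excerpt.
use constant {
    FLAG_VOLUME       => 1,      # part of a multipart archive
    FLAG_ENCRYPTED    => 128,    # block headers are encrypted
    FLAG_FIRST_VOLUME => 256,    # first part of the chain
};

# Returns 1 when a part belongs to a chain but is neither the first
# volume nor block-encrypted, i.e. when it can be skipped; 0 otherwise.
sub is_skippable_part {
    my ($flags) = @_;
    return ( ( $flags & FLAG_VOLUME )
          && !( $flags & FLAG_FIRST_VOLUME )
          && !( $flags & FLAG_ENCRYPTED ) ) ? 1 : 0;
}

for my $flags ( 1, 1 | 256, 1 | 128, 0 ) {
    printf "flags=%3d skippable=%d\n", $flags, is_skippable_part($flags);
}
```

Each `&` isolates one bit of the flags word, so several independent yes/no properties can be read from a single integer taken from the header.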

After extract_headers() returns, execution is resumed inside the process_file() subroutine. It uses the information collected by extract_headers() for actually extracting the file.

Multi-part .rar archives are chained together. Their headers contain the file names of the rest of the parts. When you extract just one part of the chain, the rest of the parts are extracted as well since they are linked. We use that to our advantage for caching purposes.

After process_file() successfully extracts a file, the file name, together with its absolute path and its password (if it was password protected), is forwarded to list_files_in_archive(), which re-opens the file. This is a straightforward operation even if the file is password protected, since the password has already been retrieved and does not have to be looked up again. The procedure then extracts the file names of the rest of the parts and stores them in a hash:

(Unrar.pm)
my @files = unpack( 'Z260Z260',
$RARHeaderData_struct );
$files[0] =~ s/\0//g;
$donotprocess{ $files[0] } = 1;
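
To see what the 'Z260Z260' template does, the buffer can be simulated without the dll. This is a minimal sketch: the two-field layout, the sample names and the read_names() helper are assumptions made for illustration, and the real structure filled by unrar.dll contains more fields than these two.

```perl
use strict;
use warnings;

# Reads two fixed-width, NUL-padded 260-byte strings from a raw buffer.
# 'Z260' consumes 260 bytes and returns the text up to the first NUL.
sub read_names {
    my ($buffer) = @_;
    return unpack( 'Z260 Z260', $buffer );
}

# Simulate a buffer as a C library might fill it: two char[260] fields
# laid out back to back (assumed layout, hypothetical file names).
my $buffer = pack( 'Z260 Z260', 'Movie.part1.rar', 'Movie.part2.rar' );

my @names = read_names($buffer);

# Record a name in a lookup hash, as the excerpt does with %donotprocess.
my %seen;
$seen{ $names[0] } = 1;

printf "%d bytes: %s\n", length $buffer, join( ', ', @names );
```

With the 'Z' format the trailing NUL padding is already dropped by unpack, so the names come back as ordinary Perl strings that can be used directly as hash keys.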

Later on, when a file that has not yet been processed is encountered while traversing the directory, it is looked up in the hash. If there is a match, there is no need to process it, and an ERAR_CHAIN_FOUND message ('found in chain.already processed') is returned to the caller.

The issue here is that a batch application is designed primarily for speed: all files are thrown into a bucket (the common directory), and individual files cannot be selected because the process has to be fully automated. Caching therefore plays a very important role performance-wise, and is particularly beneficial when multipart archives are encountered.

For example, suppose we have three files which are parts of a multi-part archive:

File1.part1.rar
File1.part2.rar
File1.part3.rar

A batch application has to go through each one of them, whereas in an interactive application the user would select the files they wanted to extract by clicking on them. The batch application cannot do that; it goes through all the files regardless. The only information it has at its disposal comes from the files' headers. Encrypted files, however, do not reveal much: their headers are encrypted, so the only thing they disclose is that they are encrypted (which is why the module handles them differently).

So after we successfully process the first part, we also get the file names of the rest of the chain and cache them in the hash. The process therefore does not need to be repeated when the remaining files are encountered later while reading the directory.
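
The caching idea can be sketched without any RAR files at all. Everything below is illustrative: extract_chain() is a hypothetical stand-in that pretends extraction succeeded and returns the names of all parts in the chain, and process_directory() is not the module's traversal code.

```perl
use strict;
use warnings;

my %donotprocess;    # parts already extracted as members of a chain

# Hypothetical stand-in for extraction: pretend that extracting any part
# also extracts its siblings, and return every file name in that chain.
sub extract_chain {
    my ( $file, $chains ) = @_;
    return @{ $chains->{$file} || [$file] };
}

# Walk the file list once and skip anything the cache already knows about.
sub process_directory {
    my ( $files, $chains ) = @_;
    my @extracted;
    for my $file (@$files) {
        next if $donotprocess{$file};    # found in chain, already processed
        push @extracted, $file;
        $donotprocess{$_} = 1 for extract_chain( $file, $chains );
    }
    return @extracted;
}

my @parts  = qw(Movie.part1.rar Movie.part2.rar Movie.part3.rar);
my %chains = map { $_ => \@parts } @parts;

my @done = process_directory( [ @parts, 'Other.rar' ], \%chains );
print "opened: @done\n";
```

In this run only Movie.part1.rar and Other.rar are actually opened; the second and third parts are skipped with a single hash lookup each.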

Furthermore, only successfully extracted files are entered into this hash, so we instantly know which files failed and which did not.

The added advantage of this approach is that since the hash is global, the caller/client has access to it. For example, UE&R uses it for deleting the successfully extracted files:

(Unrar_Extract_and_Recover.pl)
my $bool = Recycle(keys %donotprocess);

Choosing a language

As it was my own project, I was free to choose between two languages: C# or Perl. The winner was Perl, hands down.

Although C# would have given me rich GUI features, crisp graphics, easy threading and the .NET libraries, there was a fundamental issue: interoperating with the unmanaged C++ dll was not easy before C# 4.0. You had to go through P/Invoke, map the C++ structures to corresponding C# ones, cross the boundary between the managed and unmanaged environments and use unsafe pointers, while the GC could cause more harm than good (hence the fixed statement for pinning unsafe pointers). One main reason that the dynamic type was introduced in C# 4.0 is easier interop with unmanaged libraries.

With Perl I had much easier interoperation with the C++ dll because of Perl's dynamic typing. Perl structures can be more easily mapped to the structures of the dll because the Perl 'pack' function maps and serializes the structure and then you can move it back and forth between the dll and the Perl client application. And of course I had the power of CPAN at my disposal.
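
As a rough illustration of that mapping, the sketch below serializes a small C-style structure with pack and reads it back with unpack. The structure layout, the template and the helper names are hypothetical and chosen only for the example; they are not the structures that unrar.dll defines.

```perl
use strict;
use warnings;

# Hypothetical C layout, for illustration only:
#   struct request { char name[16]; unsigned int mode; unsigned int flags; };
# 'Z16' is a NUL-terminated string in 16 bytes, 'L' an unsigned 32-bit integer.
my $template = 'Z16 L L';

# Serialize Perl values into the byte layout a C library would expect.
sub to_struct {
    my ( $name, $mode, $flags ) = @_;
    return pack( $template, $name, $mode, $flags );
}

# Turn the raw bytes back into Perl values after the library has used them.
sub from_struct {
    my ($bytes) = @_;
    return unpack( $template, $bytes );
}

my $bytes = to_struct( 'a.rar', 1, 2 );
printf "%d bytes -> %s\n", length $bytes, join( ',', from_struct($bytes) );
```

Note that pack inserts no alignment padding of its own, so a template for a real structure has to account for whatever padding the C compiler adds.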

Although Perl is a dynamic language that is heavily criticized for its performance because it late-binds everything, in this case I believed its performance would be better than that of a static language like C#. As I am already working in an unmanaged environment when I access the C++ dll from within Perl, I don't have to marshal anything or cross the boundary from a managed environment to an unmanaged one.

Another decision was which programming approach to use: OOP or procedural. (I use the term 'procedural', not 'functional', since nowadays 'functional' is taken to mean functional programming, as in F#.)

OOP in this case would involve more complexity and would be likely to hinder the approach, rather than benefit it. It is an application in which not everything can be wrapped as an object and as I needed a more direct low level approach I chose the procedural one.



Last Updated ( Friday, 19 August 2011 )
 
 

   