Unicode issues in Perl
Written by Nikos Vaggalis   
Monday, 14 February 2011

Readdir in action

We have a directory with three files, and we use the readdir function to get the filenames into our script:

 

Figure: Three test files with "interesting" names

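#Example1.pl

use strict;
use warnings;
use Devel::Peek;

opendir( my $MYFILE, "." )
    || die "Can't open directory $!";

while ( my $file = readdir($MYFILE) ) {
    next if ( $file eq "." || $file eq ".." );    # skip the current and parent directory entries
    next if ( $file =~ /\.pl$|\.db$/i );          # skip .pl and .db files
    Dump($file);                                  # show the internals of the scalar
}
closedir $MYFILE;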

Using the Devel::Peek module we can inspect the internals of what our Perl program is fed after the conversion has taken place:

C:\unicode>perl example1.pl
SV = PV(0x36bd4) at 0x182a764
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x18242a4 "f.rar"\0
CUR = 5
LEN = 8
SV = PV(0x36bd4) at 0x182a764
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x18241e4 "C33E~1.RAR"\0
CUR = 10
LEN = 12
SV = PV(0x36bd4) at 0x182a764
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x18242a4 "\304.rar"\0
CUR = 5
LEN = 8
C:\unicode>

“f.rar” and “Δ.rar” arrived intact but “à.rar” did not (readdir returned “C33E~1.RAR” instead).

In the ANSI code page, “Δ” is represented by the single byte \304 (octal), so “Δ.rar” arrives as “\304.rar”. Our Perl program gets the correct value, but it does not know that the byte should be treated as Greek, because all it receives is raw bytes with no encoding attached. We will return later to explore this issue in detail.
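To make the point about raw bytes concrete, here is a minimal sketch of what telling Perl the encoding could look like. It assumes the ANSI code page is 1253, as in this walkthrough, and uses the core Encode module; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The raw bytes readdir returned for the Greek file
# (assumption: the ANSI code page is 1253).
my $bytes = "\304.rar";

# Decoding maps each byte to the character it stands for in that code page.
my $name = decode( 'cp1253', $bytes );

printf "first byte:      0x%02X\n", ord($bytes);    # 0xC4
printf "first character: U+%04X\n", ord($name);     # U+0394, GREEK CAPITAL LETTER DELTA
```

Until such a decode step happens, Perl has no way of knowing whether \304 was meant as a Greek, Cyrillic or Latin letter.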

Let’s add a few directories and observe again:

 

Figure: Some more test directories

C:\unicode>perl example1.pl
SV = PV(0x36bd4) at 0x182a764
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x1857884 "f"\0
CUR = 1
LEN = 4
SV = PV(0x36bd4) at 0x182a764
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x18241e4 "f.rar"\0
CUR = 5
LEN = 8
SV = PV(0x36bd4) at 0x182a764
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x18242a4 "0E00~1"\0
CUR = 6
LEN = 8
SV = PV(0x36bd4) at 0x182a764
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x18241e4 "C33E~1.RAR"\0
CUR = 10
LEN = 12
SV = PV(0x36bd4) at 0x182a764
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x1857884 "\304"\0
CUR = 1
LEN = 4
SV = PV(0x36bd4) at 0x182a764
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x18241e4 "\304.rar"\0
CUR = 5
LEN = 8
C:\unicode>

Directory "f" and file "f.rar" have been converted correctly, and so have directory "Δ" and file "Δ.rar". File "à.rar", however, produced "C33E~1.RAR" and directory "à" produced "0E00~1". Both have the form of Windows short (8.3) names rather than the real names, so both are erroneous. The hexadecimal prefixes in the garbage resemble, but do not exactly match, the encodings of the character à (U+00E0): its UTF8 byte sequence is 0xC3 0xA0 and its UTF16 code unit is 0x00E0.
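For comparison, the actual encodings of à (U+00E0) can be checked with the core Encode module. This is only a sketch for verifying the byte values; it does not touch the file system, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{E0}";    # LATIN SMALL LETTER A WITH GRAVE

# Hex dump of the encoded octets in each encoding.
my $utf8_hex    = uc( unpack( 'H*', encode( 'UTF-8',    $char ) ) );
my $utf16be_hex = uc( unpack( 'H*', encode( 'UTF-16BE', $char ) ) );
my $utf16le_hex = uc( unpack( 'H*', encode( 'UTF-16LE', $char ) ) );

print "UTF-8:    $utf8_hex\n";       # C3A0
print "UTF-16BE: $utf16be_hex\n";    # 00E0
print "UTF-16LE: $utf16le_hex\n";    # E000
```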

Console Input and Output

Console input and output are controlled with the "chcp" command. For example, chcp 1253 changes the input and output code page of the console to code page 1253 (Greek). This means that the console will treat the output of our Perl program as Greek.

Let's reuse our example but this time instead of dumping the internals we will print the filenames to STDOUT:

 


#Example2.pl
opendir( my $MYFILE, "." )
    || die "Can't open directory $!";

while ( my $file = readdir($MYFILE) ) {
    next if ( $file eq "." || $file eq ".." );    # skip the current and parent directory entries
    next if ( $file =~ /\.pl$|\.db$/i );          # skip .pl and .db files
    print $file, "\n";                            # emit the raw bytes returned by readdir
}
closedir $MYFILE;

First we set console output to 1251 and then to 1253:

C:\unicode>chcp 1251
Active code page: 1251
C:\unicode>perl example2.pl
f.rar
C33E~1.RAR
Д.rar
C:\unicode>chcp 1253
Active code page: 1253
C:\unicode>perl example2.pl
f.rar
C33E~1.RAR
Δ.rar

Our Perl program emits bytes, which the console translates into characters according to the active code page. Thus in the first case the file named with the Greek letter "Δ", whose byte value is \304, is displayed as the Cyrillic character "Д", because the code page is set to 1251 (Cyrillic) and \304 corresponds to that Cyrillic character.

Microsoft Windows Code Page 1253
char  dec  oct  hex  description
[Δ]   196  304  C4   GREEK CAPITAL LETTER DELTA

Microsoft Windows Code Page 1251
char  dec  oct  hex  description
[Д]   196  304  C4   CYRILLIC CAPITAL LETTER DE

Source : http://www.columbia.edu/kermit/cp1251.html

By setting the code page with chcp to the correct page, 1253, we make the console interpret the bytes correctly.
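The same lookup can be reproduced inside Perl. The following sketch decodes the single byte \304 under both code pages with the core Encode module and prints the resulting Unicode character names via charnames; it is an illustration of the mapping, not of how the console itself is implemented.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

my $byte = "\304";    # octal 304 = decimal 196 = hex C4

my %code_point;       # code page name => Unicode code point of the decoded byte
for my $code_page (qw(cp1251 cp1253)) {
    my $char = decode( $code_page, $byte );
    $code_point{$code_page} = ord($char);
    printf "%s: 0x%02X -> U+%04X %s\n",
        $code_page, ord($byte), ord($char), charnames::viacode( ord($char) );
}
# cp1251: 0xC4 -> U+0414 CYRILLIC CAPITAL LETTER DE
# cp1253: 0xC4 -> U+0394 GREEK CAPITAL LETTER DELTA
```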

However, we can also set the console output code page programmatically by using:

Win32::Console::OutputCP( 1253 );

which supersedes any chcp settings.

In the following example we set both the input and the output code page of the console to 1251 with chcp, but we override the output code page from within our Perl program, hence we still get the correct output:

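#Example3.pl
use Win32::Console;
Win32::Console::OutputCP( 1253 );    # override the console output code page

opendir( my $MYFILE, "." )
    || die "Can't open directory $!";

while ( my $file = readdir($MYFILE) ) {
    next if ( $file eq "." || $file eq ".." );    # skip the current and parent directory entries
    next if ( $file =~ /\.pl$|\.db$/i );          # skip .pl and .db files
    print $file, "\n";                            # emit the raw bytes returned by readdir
}
closedir $MYFILE;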

This produces:

C:\unicode>chcp 1251
Active code page: 1251
C:\unicode>perl example3.pl
f.rar
C33E~1.RAR
Δ.rar
C:\unicode>
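One caveat worth considering: as far as we are aware, the output code page belongs to the console rather than to the Perl process, so the change made by the script may still be in effect after it exits. The sketch below wraps the switch in a hypothetical helper, with_output_cp (not part of Win32::Console), which restores the previous value afterwards. It relies on Win32::Console::OutputCP returning the current code page when called without an argument, and it simply runs the code unchanged on non-Windows systems.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code with the console output code page set to $cp,
# then restore the previous code page. Returns 1 if a switch took place, else 0.
sub with_output_cp {
    my ( $cp, $code ) = @_;

    # Win32::Console exists only on Windows; elsewhere just run the code.
    unless ( $^O eq 'MSWin32' && eval { require Win32::Console; 1 } ) {
        $code->();
        return 0;
    }

    my $previous = Win32::Console::OutputCP();    # no argument: query current value
    Win32::Console::OutputCP($cp);
    $code->();
    Win32::Console::OutputCP($previous);          # put things back as we found them
    return 1;
}

my @printed;    # record of what was sent to the console
with_output_cp( 1253, sub {
    my $name = "\304.rar";    # the raw bytes readdir returned for the Greek file
    print $name, "\n";
    push @printed, $name;
} );
```

This sketch does not restore the code page if the callback dies; a more careful version would need to handle that case as well.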




Last Updated ( Monday, 04 April 2011 )