Unicode issues in Perl
Written by Nikos Vaggalis   
Monday, 14 February 2011

The UTF8 flag

Let's return to the earlier example where we fed our program with the file called 'Δ.rar':

#Example1.pl
use Devel::Peek;
opendir (my $MYFILE, ".")
|| die "Can't open directory $!";

while ( my $file = readdir($MYFILE) ) {
next if ( $file eq "." || $file eq ".." );
next if ($file =~ /\.pl$|\.db$/i );
Dump($file);
}
closedir $MYFILE;

This produces:

C:\unicode>perl example1.pl
SV = PV(0x36bd4) at 0x182a764
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x18242a4 "\304.rar"\0
CUR = 5
LEN = 8

C:\unicode

Here the conversion succeeded: we get the correct byte value in the ANSI code page (\304 is Δ in cp1253). However, our Perl program still does not know that the string is encoded in Greek/cp1253, so it treats the byte sequence as a Latin1 string.

This poses issues such as not being able to use the Unicode regex facilities, for example:

$file=~s/\N{GREEK CAPITAL LETTER DELTA}/!/;
(replace "delta" with "!")

because the string is not Unicode encoded. Hence we must decode the bytes into characters:

#Example6.pl
use Devel::Peek;
use Encode;
use charnames ":full";
opendir (my $MYFILE, ".") ||
  die "Can't open directory $!";
while ( my $file = readdir($MYFILE) ) {
  next if ( $file eq "." || $file eq ".." );
  next if ($file =~ /\.pl$|\.db$/i );
  # raw bytes as returned by readdir
  Dump($file);
  # decode the cp1253 bytes into Unicode characters
  $unicode_file=decode("cp1253",$file);
  Dump($unicode_file);
  # the Unicode-aware substitution can now match
  $unicode_file=~s/\N{GREEK CAPITAL LETTER DELTA}/!/;
  Dump($unicode_file);
}
closedir $MYFILE;

which produces

C:\unicode>perl example6.pl
SV = PV(0x243ba4) at 0x18207b4
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x1824cc4 "\304.rar"\0
CUR = 5
LEN = 8
SV = PV(0x18ded34) at 0x18e479c
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x1866164 "\316\224.rar"\0
[UTF8 "\x{394}.rar"]
CUR = 6
LEN = 8
SV = PVIV(0x1820874) at 0x18e479c
REFCNT = 1
FLAGS = (POK,OOK,pPOK,UTF8)
IV = 1 (OFFSET)
PV = 0x1866165 ( "\316" . ) "!.rar"\0
[UTF8 "!.rar"]
CUR = 5
LEN = 7
C:\unicode>

After decoding the string into Unicode characters, the UTF8 flag is on, which shows that Perl now holds the string as Unicode characters, and the regex works as well.
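The same effect can be reproduced without a directory listing. The following is a minimal sketch, assuming a hard-coded byte string "\xC4.rar" as a stand-in for the name returned by readdir; the variable names are illustrative and not part of the original examples.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ':full';

# Assumed stand-in for a readdir() result: byte 0xC4 followed by ".rar".
my $bytes = "\xC4.rar";

# Decode the cp1253 bytes into Perl characters.
my $chars = decode('cp1253', $bytes);

# utf8::is_utf8() reports the same UTF8 flag that Devel::Peek shows.
my $flag_before = utf8::is_utf8($bytes) ? 1 : 0;
my $flag_after  = utf8::is_utf8($chars) ? 1 : 0;
my $first_cp    = sprintf 'U+%04X', ord($chars);

# Run the substitution on a copy of the decoded string.
(my $replaced = $chars) =~ s/\N{GREEK CAPITAL LETTER DELTA}/!/;

print "UTF8 flag before: $flag_before, after: $flag_after\n";
print "first character: $first_cp\n";
print "after substitution: $replaced\n";
```

The first character of the decoded string is U+0394, the same code point that the Dump output above shows as \x{394}.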

If you do not specify an input code page, as we did with:

$unicode_file=decode("cp1253",$file);

whereby we specifically instruct Perl to decode bytes into characters using the cp1253 code page, Perl treats the string as Latin1.

To prove this point we will use the following example:
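How much the chosen code page matters can be seen by decoding the same bytes twice. This is only a sketch, assuming the hard-coded bytes "\xC4.rar" in place of a real file name; the %first_char hash is an illustrative name.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Assumed stand-in for the bytes returned by readdir().
my $bytes = "\xC4.rar";

my %first_char;
for my $encoding ('cp1253', 'iso-8859-1') {
    my $chars = decode($encoding, $bytes);
    # charnames::viacode() maps a code point to its Unicode name.
    $first_char{$encoding} = charnames::viacode(ord($chars));
    printf "%-10s U+%04X %s\n",
        $encoding, ord($chars), $first_char{$encoding};
}
```

The byte is identical in both cases; only the code page named in the decode call determines which character Perl ends up with.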

#Example7.pl
use Devel::Peek;
use Encode;
use charnames ":full";
use encoding::warnings;
opendir (my $MYFILE, ".") ||
  die "Can't open directory $!";
while ( my $file = readdir($MYFILE) ) {
  next if ( $file eq "." || $file eq ".." );
  next if ($file =~ /\.pl$|\.db$/i );
  Dump($file);
  # no decode here: $file still holds raw bytes
  $file=~s/\N{GREEK CAPITAL LETTER DELTA}/!/;
  print "Greek? ", $file,"\n";
  $file=~s/\N{LATIN CAPITAL LETTER A WITH DIAERESIS}/!!!/;
  print "latin1? ",$file,"\n";
}
closedir $MYFILE;

Which produces

E:\unicode> chcp 1253
Active codepage : 1253
E:\unicode>example7.pl
Bytes implicitly upgraded into wide characters as iso-8859-1 at example7.pl line 10
Bytes implicitly upgraded into wide characters as iso-8859-1 at example7.pl line 10
SV = PV(0x243ba4) at 0x18207b4
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x1824cc4 "\304.rar"\0
CUR = 5
LEN = 8
Greek? Δ.rar
latin1? !!!.rar
C:\unicode>

GREEK CAPITAL LETTER DELTA in cp1253 and LATIN CAPITAL LETTER A WITH DIAERESIS in Latin1 share the same byte value, \304.

For the Unicode-enabled regex to work, Perl treated the string as Latin1-encoded bytes and implicitly upgraded it to the UTF8 equivalent of those Latin1 characters. That is why our regex search for the Greek character CAPITAL LETTER DELTA failed, while the search for LATIN CAPITAL LETTER A WITH DIAERESIS succeeded.
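The two matches can also be checked in isolation. A minimal sketch, assuming a hard-coded byte string rather than a name read from disk; the variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ':full';

# Assumed stand-in for the undecoded readdir() result.
my $bytes = "\xC4.rar";

# The byte 0xC4 is compared as code point U+00C4, never as U+0394.
my $matches_delta = $bytes =~ /\N{GREEK CAPITAL LETTER DELTA}/ ? 1 : 0;
my $matches_a_dia =
    $bytes =~ /\N{LATIN CAPITAL LETTER A WITH DIAERESIS}/ ? 1 : 0;

print "matches Greek delta: $matches_delta\n";
print "matches Latin A with diaeresis: $matches_a_dia\n";
```

Only the Latin1 reading of the byte matches, which agrees with the output of Example7.pl.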

 

As perlunicode states:


"By default, there is a fundamental asymmetry in Perl's Unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in ISO 8859-1 (Latin-1), but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 code points in Unicode happens to agree with Latin-1."


If you need to upgrade the bytes to UTF8 explicitly, you can use utf8::upgrade(), which upgrades a string in native format (Latin1) to Unicode:

E:\unicode>example7a.pl
SV = PV(0x243ba4) at 0x18207b4
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x1824cc4 "\304.rar"\0   #Latin1 bytes
CUR = 5
LEN = 8
SV = PV(0x243ba4) at 0x18207b4
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x183097c "\303\204.rar"\0
[UTF8 "\x{c4}.rar"]   #mapped to Unicode
CUR = 6
LEN = 7
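The source of example7a.pl is not shown here, so the following is only a sketch of the same idea, assuming a hard-coded byte string instead of a name read with readdir. All variable names are illustrative.

```perl
use strict;
use warnings;

# Assumed stand-in for the file name bytes returned by readdir().
my $name = "\xC4.rar";

my $flag_before = utf8::is_utf8($name) ? 1 : 0;

# Changes the internal representation only; the bytes are read as Latin1.
utf8::upgrade($name);

my $flag_after = utf8::is_utf8($name) ? 1 : 0;
my $length     = length $name;                    # counted in characters
my $first_cp   = sprintf 'U+%04X', ord($name);

# Count the bytes of the UTF-8 form on a copy, leaving $name untouched.
my $octets = $name;
utf8::encode($octets);
my $octet_count = length $octets;

print "UTF8 flag before: $flag_before, after: $flag_after\n";
print "characters: $length, first: $first_cp, UTF-8 bytes: $octet_count\n";
```

The string still holds five characters and its first character is still U+00C4, not the Greek delta; only the storage grows to six bytes, which corresponds to CUR = 6 in the second dump.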




Last Updated ( Monday, 04 April 2011 )