Perl Unicode Forensics
Written by Nikos Vaggalis
Monday, 28 October 2019
Are character encodings and environment incompatibilities messing with your data? Why it happens and what to do about it.
A SOAP message is delivered to an Apache server which runs a SOAP::Lite powered Perl CGI script that acts as the SOAP server. The script interacts with Ingres, reading and inserting data. Both Perl and Ingres are fine-tuned to speak ISO-8859-7 Greek.
The issue was that the same CGI script produced different results when run on different servers. In the first case, the Greek characters sent by the client and consumed by the server get into the database as they should, while in the second case the very same data under the same workflow ends up as "garbage". For example, the Greek capital letter A, or Alpha, ends up as the sequence "Γ\201".
Server A is the erroneous one. Its specs are:
Server B predates Server A by some years, bears the original installation, and produces the expected results. Its specs are:
The sample code
The CGI script source, test.cgi, is heavily simplified and trimmed down for the purpose of demonstration, but still reveals the underlying concept:
And before you scream SQL INJECTION!! Yes, I know, legacy code... but anyway, the server is not open to the public and is only accessible "locally" by VPN.
Now to the client, test.pl:
Note that argument "A" is in Greek while the rest are in English. Both servers are set to the locale el_GR.ISO8859-7, that is, LANG=el_GR.ISO8859-7 and so on.
The assumption was that because the script was copied and pasted as is from Server B to Server A, two *almost* identical environments, everything would work just fine out of the box. Unfortunately, that wasn't the case. On the contrary, it marked the beginning of a debugging journey deep into the depths of character encodings and Perl's handling of them.
Testing the environments
So I set out to test the new Server A.
Firing test.pl to send the SOAP packet to test.cgi resulted in writing the following into the database (the database's charset is fixed to ISO-8859-7):
Greek capital Alpha was ending up as "Γ\201", that is, Greek capital Gamma plus \201.
On Server B, by contrast:
Greek capital Alpha was indeed ending up as Greek "A".
Why is that? What is at fault? Perl, SOAP::Lite, the OS?
Time for some forensics
Let's do a:
It seems that the single byte "A" ended up as the two-byte sequence C3 81; the rest is padded white space.
This set off the UTF-8 alarm and prompted a look at the UTF-8 table for the byte sequence C3 81:
This is the representation of the Latin-1 character A with acute.
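The table lookup can also be done from Perl itself. What follows is only a minimal sketch, assuming the core Encode and charnames modules; the variable names are made up for illustration and are not part of test.cgi.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# The two bytes found in the database column (hypothetical variable name).
my $stored = "\xC3\x81";

# Decode them as UTF-8 and look up the resulting code point.
my $char = decode('UTF-8', $stored);
my $code = ord($char);

# Prints: U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
printf "U+%04X %s\n", $code, charnames::viacode($code);
```

The two bytes decode to a single code point, which is why the sequence smells of UTF-8 rather than of two unrelated single-byte characters.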
Let's take a look at the Latin1/iso-8859-1 table:
But how did this get mixed up with the Greek capital letter A?
Let's look at the Greek/iso-8859-7 table:
And we have a match! Both characters share the decimal value 193, or C1 hex. This explains how Perl received the correct underlying byte but thought that it was in ISO-8859-1, the default, and not in ISO-8859-7.
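The clash between the two tables can be checked with a short script. This is a sketch under the assumption that the core Encode module is available; the helper name_of_byte is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Hypothetical helper: the Unicode name a single byte maps to in a given charset.
sub name_of_byte {
    my ($charset, $byte) = @_;
    my $char = decode($charset, $byte);
    return charnames::viacode(ord($char));
}

my $byte = "\xC1";    # decimal 193

for my $charset ('iso-8859-1', 'iso-8859-7') {
    # iso-8859-1 -> LATIN CAPITAL LETTER A WITH ACUTE
    # iso-8859-7 -> GREEK CAPITAL LETTER ALPHA
    printf "%-10s -> %s\n", $charset, name_of_byte($charset, $byte);
}
```

The byte is identical in both runs; only the charset used to interpret it changes the character.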
In other words, SOAP::Lite receives the byte with value 193, but Server A interprets it as Latin-1 A with acute rather than as Greek capital Alpha.
This is not the end of the story, however. It gets even more interesting, since this Latin-1 single-byte character is subsequently somehow upgraded to UTF-8, which would explain the two-byte sequence C3 81. C3 81 cannot be stored in the database as one character because the database reads single-byte ISO-8859-7 and not multi-byte UTF-8, so it is treated as two separate bytes, C3 and 81.
This is evident when running the SQL query from within the SQL terminal client:
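The suspected chain of events can be reproduced in isolation. The sketch below is not what SOAP::Lite does internally, which the text has not established at this point; it only assumes that the byte is decoded as Latin-1 and later serialised as UTF-8, and compares that with a round trip that stays in ISO-8859-7. All variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $wire_byte = "\xC1";    # Greek capital Alpha in ISO-8859-7

# Assumed faulty path: byte taken as Latin-1, then written out as UTF-8.
my $wrong = encode('UTF-8', decode('iso-8859-1', $wire_byte));

# Expected path: byte taken as ISO-8859-7 and written back in the same charset.
my $right = encode('iso-8859-7', decode('iso-8859-7', $wire_byte));

printf "wrong: %s\n", uc(unpack('H*', $wrong));    # wrong: C381
printf "right: %s\n", uc(unpack('H*', $right));    # right: C1
```

Only the first path grows the value from one byte to two, matching what ended up in the database on Server A.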
"select hex(greek) from testtable where id=3519999;"
C381 is being displayed as that Γ\201 sequence.
That happens because when the terminal tries to read, interpret and display the value on the monitor, it reads hex C3, which in the ISO-8859-7 table is the Greek capital letter Gamma, or "Γ", while hex 81 has no printable character assigned to it, so its octal value \201 is displayed instead.
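The terminal's behaviour can be imitated with a few lines of Perl. This is a rough sketch, assuming the core Encode module; render_iso7 is a hypothetical helper that mimics the display logic described above and is not how any particular terminal client is implemented.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: show raw bytes the way an ISO-8859-7 terminal might,
# printable characters as themselves, everything else as an octal escape.
sub render_iso7 {
    my ($bytes) = @_;
    my $out = '';
    for my $byte (split //, $bytes) {
        my $char = decode('iso-8859-7', $byte);
        $out .= $char =~ /\A[[:print:]]\z/
            ? $char
            : sprintf('\\%03o', ord($byte));
    }
    return $out;
}

binmode STDOUT, ':encoding(UTF-8)';
print render_iso7("\xC3\x81"), "\n";    # Γ\201
```

Byte C3 maps to a printable Gamma, while byte 81 lands on a control code and falls through to the octal escape.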
Last Updated ( Monday, 28 October 2019 )