Getting Started With jQuery - Advanced Ajax Characters & Encoding

Written by Ian Elliot

Tuesday, 20 June 2017


One of the biggest problems you encounter in using Ajax is the dreaded character encoding. No matter what data format you select, the data is actually transmitted as text. But it isn't as simple as this sounds. This is the final chapter in the newly published Just jQuery: Events, Async and Ajax.


Advanced Attributes

Before we start to look at character encoding within Ajax we have to understand the problem it tries to solve. At its most basic, data on the Internet consists of groups of 8 bits known as an "octet" but usually just called a "byte". Obviously to represent character data we need a mapping between numeric values and characters. One of the first standards for this was, and is, ASCII. This defines 128 characters, coded 0 to 127: the letters A-Z and a-z; the digits 0-9; control characters such as carriage return and backspace; and assorted special characters. Of course, using 8 bits you can represent 256 characters, but this isn't enough to represent all of the characters used by even a small selection of the written languages of the world.

ISO-8859

The first solution to this problem was to simply reuse the same 256 numeric codes and associate them with different sets of characters. The most commonly used on the internet is ISO 8859-n where n is between 1 and 16. Each value of n maps a different set of characters onto the 0 to 255 values that a byte can represent. For example, ISO 8859-1 is Latin-1 Western European and if selected provides characters for most Western European languages. ISO-8859-2 is Latin-2 Central European and provides characters for Bosnian, Polish, Croatian and so on.
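The reuse of codes is easy to see by decoding one byte under two different character sets. The following is a minimal sketch, assuming a runtime that provides the standard TextDecoder class with support for the legacy labels (current browsers, or Node.js built with full ICU). The helper name decodeByte is made up for illustration. Note that TextDecoder treats the label iso-8859-1 as windows-1252, which agrees with Latin-1 for the byte used here.

```javascript
// Decode a single byte using the named character set.
// decodeByte is a hypothetical helper, not part of any library.
function decodeByte(byteValue, charset) {
  const decoder = new TextDecoder(charset);
  return decoder.decode(new Uint8Array([byteValue]));
}

// The same byte, 0xB1, read under two ISO 8859 character sets.
const asLatin1 = decodeByte(0xB1, 'iso-8859-1'); // plus-minus sign in Latin-1
const asLatin2 = decodeByte(0xB1, 'iso-8859-2'); // a with ogonek in Latin-2

console.log(asLatin1, asLatin2);
```

The byte never changes; only the table used to look it up does.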

Notice that we now have a situation where a single character code can correspond to different characters depending on which ISO-8859 character set is selected. This is the source of problems if a server sends data using one ISO-8859 character set and the browser displays it using another. The data hasn't changed but what is displayed on each system is different. To stop this from happening, servers send a header stating the character set in use. For example:

Content-Type: text/html; charset=ISO-8859-1

sets the character set to Latin-1. The problem with this is that it is often inconvenient to configure the server to adjust its headers for an individual page. Setting the HTTP header for an entire site is reasonable, but you still might want to send a page in another character set. To allow this you can use the meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

This has to be the first tag in the <head> section because the page cannot be rendered until the browser knows the charset in use.

Notice that adding an HTTP header or an HTML meta tag only tells the browser what encoding is in use. It doesn't actually enforce the encoding or convert anything from one encoding to another. What matters is the encoding the file is stored in. For example, to use ISO-8859-2 when you save a file using an editor such as Notepad++, select encoding ANSI and character set Eastern European. The encoding used for the file determines how all of the characters it contains are represented, and this includes string literals used in JavaScript or PHP programs.
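To see why the label and the stored bytes have to agree, the sketch below encodes a character as UTF-8 and then decodes the same bytes as if they had been declared as Latin-1. It assumes the standard TextEncoder and TextDecoder classes are available; decodeAs is a made-up helper name.

```javascript
// Decode bytes according to the charset that was declared for them.
// decodeAs is a hypothetical helper used only for this illustration.
function decodeAs(bytes, declaredCharset) {
  return new TextDecoder(declaredCharset).decode(bytes);
}

// What is actually stored: e-acute (U+00E9) saved as UTF-8
// becomes the two bytes 0xC3 0xA9.
const storedBytes = new TextEncoder().encode('\u00E9');

// Declared correctly, the two bytes decode back to one character.
const correct = decodeAs(storedBytes, 'utf-8');

// Declared as Latin-1, each byte is treated as a separate character.
const mislabeled = decodeAs(storedBytes, 'iso-8859-1');

console.log(correct, mislabeled);
```

Nothing was converted in either case. The declaration only changed how the same two bytes were read.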

Unicode

Most of what we have just looked at is legacy because the proper way to do character representation today is to use Unicode. However, you will still encounter websites using ISO character sets and you need to understand how they work. By comparison Unicode is more logical and complete. Unicode is just a list of characters indexed by a numeric value called the character's code point, which runs from U+0000 to U+10FFFF and so needs at most 21 bits. There are enough characters in Unicode to represent every language in use and some that aren't.

Unicode defines the characters, but it doesn't say how the code point should be represented. The simplest is to use a 32-bit index for every character. This is UTF-32 and it is simple, but very inefficient. It is roughly four times bigger than ASCII. In practice we use more efficient encodings.
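The size difference can be checked with a few lines of JavaScript. This is a rough sketch: utf32ByteLength and utf8ByteLength are names invented here, and the comparison uses UTF-8 as an example of a more efficient encoding, relying on the standard TextEncoder class.

```javascript
// Bytes needed if every code point is stored as a 32-bit value (UTF-32).
function utf32ByteLength(text) {
  // Spreading a string iterates by code point, not by UTF-16 code unit.
  return [...text].length * 4;
}

// Bytes needed for the same text in UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

const sample = 'Hello, Ajax'; // 11 ASCII characters
console.log(utf32ByteLength(sample)); // 44
console.log(utf8ByteLength(sample));  // 11
```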

UTF-8

There are a number of encodings of Unicode, but the most important for the web is UTF-8. UTF-8 can encode all 1,112,064 valid Unicode code points and clearly these cannot all be represented by a value in a single byte, as the 256 characters of an ISO 8859 character set could. Instead UTF-8 is a variable-length code that uses up to four bytes to represent a character. How many bytes are used to code a character is indicated by the most significant bits of the first byte.

Byte 1

0xxxxxxx one byte
110xxxxx two bytes
1110xxxx three bytes
11110xxx four bytes

All subsequent bytes have their most significant two bits set to 10. This means that you can always tell a follow on byte from a first byte. The bits in the table shown as x carry the information about which character is represented. To get the character code you simply extract the bits and concatenate them to get a 7, 11, 16 or 21-bit character code. Notice that, unlike the ISO scheme, there is only one character assigned to a character code. This means that if the server sends UTF-8 and the browser interprets the data as UTF-8 then there is no ambiguity.
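The bit layout in the table can be turned into code directly. The function below is an illustrative sketch, not something you would normally write, because the built-in TextEncoder already does the job; encodeCodePoint and toHex are names made up for this example.

```javascript
// Encode one Unicode code point as an array of UTF-8 byte values.
function encodeCodePoint(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF ||
      (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
    throw new RangeError('Not an encodable code point: ' + codePoint);
  }
  if (codePoint < 0x80) {            // 0xxxxxxx (7 bits)
    return [codePoint];
  }
  if (codePoint < 0x800) {           // 110xxxxx 10xxxxxx (11 bits)
    return [0xC0 | (codePoint >> 6),
            0x80 | (codePoint & 0x3F)];
  }
  if (codePoint < 0x10000) {         // 1110xxxx 10xxxxxx 10xxxxxx (16 bits)
    return [0xE0 | (codePoint >> 12),
            0x80 | ((codePoint >> 6) & 0x3F),
            0x80 | (codePoint & 0x3F)];
  }
  return [0xF0 | (codePoint >> 18),  // 11110xxx + three follow-on bytes (21 bits)
          0x80 | ((codePoint >> 12) & 0x3F),
          0x80 | ((codePoint >> 6) & 0x3F),
          0x80 | (codePoint & 0x3F)];
}

// Format byte values as two-digit hex for display.
function toHex(bytes) {
  return bytes
    .map(b => b.toString(16).toUpperCase().padStart(2, '0'))
    .join(' ');
}

console.log(toHex(encodeCodePoint(0x41)));   // 41
console.log(toHex(encodeCodePoint(0x2211))); // E2 88 91
```

The summation sign U+2211 needs 14 bits, so it falls into the three-byte row: 1110 0010, 10 001000, 10 010001.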

The first 128 characters of UTF-8 are the same as ASCII, so if you use a value less than 128 stored in a single byte then you have backward-compatible ASCII text. That is, Unicode characters U+0000 to U+007F can be represented in a single byte. Going beyond this needs two, three or four bytes. Almost all the Latin alphabets plus Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko can be represented with just two bytes.

As long as the server sends out a header:

Content-Type: text/html; charset=UTF-8

or you include:

<meta charset="UTF-8">

as the first tag in the web page, the browser will interpret the stream of bytes as UTF-8. Nothing can go wrong as long as the server is actually sending UTF-8 encoded text.

One complication is that if the server sends out a header that sets the character coding this takes precedence over the meta tag. This suggests that if you want to serve pages in different encodings you should set the server not to send out character encoding headers.
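The precedence rule can be written down as a small function. This is a deliberately simplified sketch of the rule just described. Real browsers take other signals into account as well, and both function names are invented for the example.

```javascript
// Extract the charset parameter from a Content-Type header value,
// or return null if the header is missing or has no charset.
function charsetFromContentType(headerValue) {
  if (!headerValue) return null;
  const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(headerValue);
  return match ? match[1].toLowerCase() : null;
}

// The HTTP header wins; the meta tag is only consulted as a fallback.
function effectiveCharset(headerValue, metaCharset) {
  const fromHeader = charsetFromContentType(headerValue);
  if (fromHeader) return fromHeader;
  return metaCharset ? metaCharset.toLowerCase() : null;
}

// Header present: the meta tag is ignored.
console.log(effectiveCharset('text/html; charset=ISO-8859-1', 'UTF-8'));
// No charset in the header: the meta tag decides.
console.log(effectiveCharset('text/html', 'UTF-8'));
```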

If you want to include a Unicode character that is outside the usual range, i.e. one you cannot type using the default keyboard, then you can enter it using:

&#decimal;

&#xhex;

where decimal and hex are the character codes in decimal and hex. For example:

&#8721;

will display a mathematical summation sign, i.e. a sigma.

∑

If you don't see a Greek sigma above then your browser isn't using UTF-8. Many of the common symbols have HTML entity names. For example, you can enter the summation symbol using &sum;.
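If you need the numeric forms for a character you already have, JavaScript can work them out from the code point. A small sketch, with toCharacterReferences being a name chosen for this example:

```javascript
// Build the decimal and hex character references for the first
// code point of the given string.
function toCharacterReferences(character) {
  const codePoint = character.codePointAt(0);
  return {
    decimal: '&#' + codePoint + ';',
    hex: '&#x' + codePoint.toString(16).toUpperCase() + ';'
  };
}

const refs = toCharacterReferences('\u2211'); // the summation sign
console.log(refs.decimal); // &#8721;
console.log(refs.hex);     // &#x2211;
```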

You also have to be careful about text that is processed by the server. For example, text stored in a database needs to be in the same representation as the server is going to use. Similarly, you have to pay attention to text processed by server-side languages like PHP.

The most important single idea is:

The browser always works with UTF-8 encoded data and, if it can, will convert any other encoding as the web page is read in.

To do this it has to know what the encoding is and it has to support converting it.


Last Updated ( Thursday, 05 May 2022 )