JavaScript Data Structures - The String Object
Written by Ian Elliot   
Thursday, 17 January 2019
Characters

The JavaScript String is a fundamental type in the sense that there is no character/char data type. If you want to work with a single character then you use a string of length 1. To access a single character in a string it is best to use the charAt(i) method which will return the character at the ith position. That is, "ABC".charAt(1) returns "B" as the first character in a string is character zero.

You can also treat a string as if it were an array of characters - for example "ABC"[1] returns "B". However, array-style access to strings was only introduced in ECMAScript 5 and it is read-only, i.e. you cannot assign to a string element.
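As a quick check of the two ways of accessing a character, the following sketch compares them. The variable names are only for illustration, and the try/catch is there because the failed assignment is silently ignored in ordinary code but throws a TypeError in strict mode:

```javascript
var s = "ABC";

// Both forms return a string of length 1.
var viaMethod = s.charAt(1);
var viaIndex = s[1];
console.log(viaMethod, viaIndex); // B B

// Outside the string the two forms differ.
console.log(s.charAt(5) === ""); // true - charAt gives an empty string
console.log(s[5] === undefined); // true - indexing gives undefined

// Strings are read-only: the assignment has no effect.
try {
  s[1] = "X"; // ignored in sloppy mode, TypeError in strict mode
} catch (e) {
  // strict mode ends up here
}
console.log(s); // ABC
```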

Of course characters are represented by character codes and the day of the simple ASCII code has long gone. The charCodeAt method returns the Unicode code of the character at a given position. For example:

var s1 = "ABC";
alert(s1.charCodeAt(1));

displays 66 - which is of course the ASCII code for B.

For the ASCII characters the two codes are the same, so if you want to you can simply ignore Unicode and continue as if you were still using ASCII.
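To see that the codes really are the familiar ASCII values, you could collect the code of every character in a string. This is just a minimal sketch and the variable names are arbitrary:

```javascript
var text = "ABC";
var codes = [];

// charCodeAt returns the 16-bit code stored at each position.
for (var i = 0; i < text.length; i++) {
  codes.push(text.charCodeAt(i));
}

console.log(codes); // [ 65, 66, 67 ]
```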

JavaScript strings are UTF-16 coded – this has some unexpected results for any programmer under the impression that they can assume that one character is one byte.

UTF-16 is a variable length way of coding Unicode, but as the basic unit is 16 bits we only need to allow for the possibility of one additional two-byte word. Unfortunately most of JavaScript's string facilities don't deal with the multiword encoding. That is, the length property and methods such as charAt and charCodeAt work in terms of 16-bit values, 0 to hex FFFF, and not the full Unicode range of 0 to hex 10FFFF. This means that things only work simply if you restrict yourself to the part of Unicode called the BMP - Basic Multilingual Plane. If you need to use Unicode characters that need two 16-bit words then you have to work with very little help from these methods - see later.
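To make the two-word encoding concrete, here is a sketch of the standard UTF-16 arithmetic that splits a code above hex FFFF into its two 16-bit words. The function name toSurrogatePair is made up for this example and the code assumes it is only given values in the range hex 10000 to hex 10FFFF:

```javascript
// Hypothetical helper: split a code above 0xFFFF into two 16-bit words.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;      // leaves a 20-bit value
  var high = 0xD800 + (offset >> 10);    // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

// U+1F638 is the "grinning cat face with smiling eyes".
var pair = toSurrogatePair(0x1F638);
console.log(pair[0].toString(16), pair[1].toString(16)); // d83d de38

// The two words can be passed to String.fromCharCode like any other codes.
var cat = String.fromCharCode(pair[0], pair[1]);
console.log(cat === "\uD83D\uDE38"); // true
```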

The inverse function to charCodeAt is the static method String.fromCharCode, which returns a string composed of the characters specified by the given Unicode codes. For example:

String.fromCharCode(65,66,67)

returns the string "ABC". You can go beyond the usual ASCII characters, however.

For example to create a string consisting of the single Unicode character 09B2 (hex) which is the Bengali LA character you would use:

var s1 = String.fromCharCode(0x09B2);

(recall that 0x is the prefix for hexadecimal numbers).
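A short sketch shows that fromCharCode and charCodeAt really are inverses, even outside the ASCII range. The variable names are only illustrative:

```javascript
var abc = String.fromCharCode(65, 66, 67);
console.log(abc); // ABC

// The Bengali LA character is inside the BMP, so it is one 16-bit value.
var la = String.fromCharCode(0x09B2);
console.log(la.length); // 1

// Going back the other way recovers the original code.
console.log(la.charCodeAt(0).toString(16)); // 9b2
```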

Unicode characters are only supported in JavaScript after version 1.3 so some care is needed with older browsers.

You can also enter Unicode characters within a string using escape codes. If you include \uhexcode in a string then the Unicode character corresponding to the code is inserted. For example to insert the Bengali LA character you would use:

alert("\u09B2");

You can use \DDD with DDD in octal or \xDD with DD in hex for any Latin-1 encoded character. Note that the octal form is not allowed in strict mode.
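As a small illustration, the hex and Unicode escapes for the same Latin-1 character produce identical strings. The example uses é (hex E9) and leaves out the octal form because strict mode rejects it:

```javascript
var viaHex = "\xE9";       // two hex digits
var viaUnicode = "\u00E9"; // four hex digits

console.log(viaHex === viaUnicode); // true
console.log(viaHex.charCodeAt(0));  // 233
console.log(viaHex.length);         // 1
```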

There are also escape characters to allow you to enter a range of non-printing control codes:

\' single quote
\" double quote
\\ backslash
\n new line
\r carriage return
\t tab
\b backspace
\f form feed

There are also Unicode equivalents of these old ASCII escape codes:

\u0008 Backspace
\u0009 Horizontal Tab
\u000A Line Feed
\u000B Vertical Tab
\u000C Form Feed
\u000D Carriage Return
\u0020 Space
\u0022 Double Quote
\u0027 Single Quote
\u005C Backslash
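You can confirm that the two forms are interchangeable by comparing them directly. This sketch pairs each short escape with its Unicode equivalent. It also includes \v for the vertical tab, which is a valid JavaScript escape although it isn't in the earlier list:

```javascript
// Each entry is [short escape, Unicode escape].
var pairs = [
  ["\b", "\u0008"],
  ["\t", "\u0009"],
  ["\n", "\u000A"],
  ["\v", "\u000B"],
  ["\f", "\u000C"],
  ["\r", "\u000D"],
  ["\"", "\u0022"],
  ["\'", "\u0027"],
  ["\\", "\u005C"]
];

var allMatch = pairs.every(function (p) {
  return p[0] === p[1];
});

console.log(allMatch); // true
```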

 

The big problem with Unicode when used outside of the standard ASCII characters is that the browser, or other display device, has to support Unicode and there has to be an appropriate Unicode font available. Most modern browsers do support Unicode, but there is still the problem of Unicode input.

There is also the problem of what to do if you want to go outside of the BMP.

For example, the "grinning cat face with smiling eyes" needs two 16-bit words in UTF-16 - \uD83D\uDE38. However, if you try:

var s = "\uD83D\uDE38";
console.log(s.length);

it reports the length of the string as two, even though only one character is coded.
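If you need the number of characters rather than the number of 16-bit words, one approach is to count by hand and treat each surrogate pair as a single character. The following is only a sketch; the function name countCharacters is invented for this example and it counts an unpaired surrogate as one character:

```javascript
// Hypothetical helper: count characters, treating a surrogate pair as one.
function countCharacters(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
    // A high word (D800-DBFF) followed by a low word (DC00-DFFF) is one character.
    if (unit >= 0xD800 && unit <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
      i++; // skip the low word
    }
    count++;
  }
  return count;
}

var oneCat = "\uD83D\uDE38";
console.log(oneCat.length);           // 2
console.log(countCharacters(oneCat)); // 1
console.log(countCharacters("ABC"));  // 3
```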

At the moment most of the JavaScript functions only work when you use characters from the BMP and there is a one-to-one correspondence between 16-bit values and characters. JavaScript may display surrogate pairs correctly, but in general it doesn't process them correctly.  For example, consider the string that represents two cat emoji:

s= "\uD83D\uDE38\uD83D\uDE38";
alert(s.charAt(1));

The charAt call doesn't give you the final cat emoji, but the 16-bit value \uDE38, the second word of the first emoji, which is not a valid Unicode character on its own. That is, it returns the second 16-bit word rather than the second character.
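One way round this is to write your own version of charAt that steps through the string one character at a time and keeps surrogate pairs together. This is a minimal sketch, not a complete solution; the names characterAt and pairStartsAt are made up for the example:

```javascript
// Hypothetical helper: true if a surrogate pair starts at position i.
function pairStartsAt(str, i) {
  if (i + 1 >= str.length) return false;
  var high = str.charCodeAt(i);
  var low = str.charCodeAt(i + 1);
  return high >= 0xD800 && high <= 0xDBFF && low >= 0xDC00 && low <= 0xDFFF;
}

// Hypothetical helper: return the nth character, counting a pair as one.
function characterAt(str, n) {
  var count = 0;
  var i = 0;
  while (i < str.length) {
    var width = pairStartsAt(str, i) ? 2 : 1;
    if (count === n) {
      return str.slice(i, i + width);
    }
    count++;
    i += width;
  }
  return ""; // same convention as charAt when out of range
}

var cats = "\uD83D\uDE38\uD83D\uDE38";
console.log(cats.charAt(1) === "\uDE38");             // true - half a character
console.log(characterAt(cats, 1) === "\uD83D\uDE38"); // true - the second cat
```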


Last Updated ( Thursday, 17 January 2019 )