Unicode Version 14 Announced
Written by Nikos Vaggalis   
Friday, 24 September 2021

The venerable Unicode standard gets an update. We report the news and go behind the scenes with a brief look at the standard's philosophy and practical use.

Most people think about Unicode only when new emoji characters are introduced. However, the main purpose of the Unicode standard isn't to supply expressive characters for fun on mobile apps; it is to facilitate communication in every human-readable language, as well as to support science and research with its scientific symbols and ancient language scripts.


In the Unicode consortium's own words:

The Unicode Standard is the foundation for all modern software and communications around the world, including all modern operating systems, browsers, laptops, and smart phones - plus the Internet and Web (URLs, HTML, XML, CSS, JSON, etc.).

With that said, Unicode 14 adds 838 characters, including five new scripts and 37 new emoji characters.

The scripts are:

  • Toto, used to write the Toto language in northeast India
  • Cypro-Minoan, an undeciphered historical script primarily used on the island of Cyprus and surrounding areas during the Late Bronze Age (ca. 1550-1050 BCE)
  • Vithkuqi, an historic script used to write Albanian, and undergoing a modern revival
  • Old Uyghur, an historic script used in Central Asia and elsewhere to write Turkic, Chinese, Mongolian, Tibetan, and Arabic languages
  • Tangsa, a modern script used to write the Tangsa language, which is spoken in India and Myanmar

This goes to show that Unicode is not just useful for communication in the modern world, but is also the gatekeeper that safeguards the memory of niche or extinct cultures.

To elaborate, a Unicode Script is technically (according to Wikipedia):

A collection of letters and other written signs used to represent textual information in one or more writing systems. Some scripts support one and only one writing system and language, for example, Armenian.

Other scripts support many different writing systems; for example, the Latin script supports English, French, German, Italian, Vietnamese, Latin itself, and several other languages.

In regular expressions, you'll usually find scripts written in the \p{..} notation, for example \p{Latin}.

As far as the fun aspect goes, v14 also added the following 37 emoji characters: 

  • Melting Face
  • Face with Open Eyes and Hand Over Mouth
  • Face with Peeking Eye
  • Saluting Face
  • Dotted Line Face
  • Face with Diagonal Mouth
  • Face Holding Back Tears
  • Rightwards Hand
  • Leftwards Hand
  • Palm Down Hand
  • Palm Up Hand
  • Hand with Index Finger and Thumb Crossed
  • Index Pointing at the Viewer
  • Heart Hands
  • Biting Lip
  • Person with Crown
  • Pregnant Man
  • Pregnant Person
  • Troll
  • Coral
  • Lotus
  • Empty Nest
  • Nest with Eggs
  • Beans
  • Pouring Liquid
  • Jar
  • Playground Slide
  • Wheel
  • Ring Buoy
  • Hamsa
  • Mirror Ball
  • Low Battery
  • Crutch
  • X-Ray
  • Bubbles
  • Identification Card
  • Heavy Equals Sign 

At I Programmer we have extensive coverage of the Emoji world. Check Emoji SubCommittee ReOpens Submissions Process and World Emoji Day Chooses Syringe To Sum Up 2021 for the latest.

 

Some other minor additions found their way in, including: 

  • Many Latin additions for extended IPA

  • Arabic script additions used to write languages across Africa and in Iran, Pakistan, Malaysia, Indonesia, Java, and Bosnia, and to write honorifics, and additions for Quranic use

  • Character additions to support the languages of North America and of the Philippines, India, and Mongolia 

All fine, but to get your hands on the new characters you'll have to wait until your favorite apps and fonts are upgraded to support the new standard. The same delay applies to programming language support. Perl is usually among the fastest to adopt new Unicode versions. For instance, Unicode 10 support came with Perl 5.28 back in 2018, while Perl 5.32.0 came with Unicode 13. The latest version of Perl is 5.34.0, released in May 2021, before Unicode 14 was announced, so it does not incorporate the latest standard, but I guess that the next one will.
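A quick way to see where a given runtime stands is to ask it which Unicode database it ships with. The sketch below uses Python rather than Perl, purely as an illustration. The helper name knows_unicode_14 is hypothetical, and it assumes Melting Face (U+1FAE0), one of the new emoji, as a representative Unicode 14 addition.

```python
import unicodedata

# U+1FAE0 MELTING FACE is one of the Unicode 14 emoji; used here as a probe.
MELTING_FACE = "\U0001FAE0"


def knows_unicode_14() -> bool:
    """Return True if this runtime's character database has a name for U+1FAE0."""
    # unicodedata.name() returns the default ("") for unassigned code points.
    return unicodedata.name(MELTING_FACE, "") != ""


if __name__ == "__main__":
    print("Unicode database version:", unicodedata.unidata_version)
    print("Recognizes Melting Face:", knows_unicode_14())
```

On a runtime built against an older database the second line prints False, which is the delay described above in miniature.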

And what can you do with Scripts programming-wise? You can use them to manipulate text, for example in regular expressions. This is described in Advanced Perl Regular Expressions - Extended Constructs, where I have a file:

 myimageऄwithधDevanagariमcharsफ'.png 

in which Hindi Devanagari characters are intermixed with Latin ones. The file needs to be distributed to multiple platforms and operating systems that might not be Unicode compatible. Thus its file name needs to be portable and compatible with the file systems of the various operating systems.

What is the best way to achieve this? By renaming the file so that it contains only characters from the universally recognizable ASCII character set, which means we have to strip it of all non-ASCII characters. But to do that, we first have to introduce Blocks in addition to Scripts. According to perlunicode:

Unicode also defines blocks of characters. The difference between scripts and blocks is that the concept of scripts is closer to natural languages, while the concept of blocks is more of an artificial grouping based on groups of Unicode characters with consecutive ordinal values. For example, the "Basic Latin" block is all the characters whose ordinals are between 0 and 127, inclusive; in other words, the ASCII characters. The "Latin" script contains some letters from this as well as several other blocks, like "Latin-1 Supplement", "Latin Extended-A", etc., but it does not contain all the characters from those blocks. 
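The difference between the two can be sketched in a few lines. The example is in Python rather than Perl, and since Python's standard library exposes no script property, it approximates the script by checking whether the character's Unicode name starts with "LATIN". That shortcut is an assumption made for illustration only and is not the real Script property; both function names are hypothetical.

```python
import unicodedata


def in_basic_latin_block(ch: str) -> bool:
    """Block test: Basic Latin is the contiguous range U+0000..U+007F."""
    return ord(ch) <= 0x7F


def looks_latin_script(ch: str) -> bool:
    """Rough script test: the character name starts with 'LATIN'.

    A heuristic stand-in, not the real Unicode Script property.
    """
    return unicodedata.name(ch, "").startswith("LATIN ")


# 'a' is in the block and the script; U+00E9 (e with acute) is Latin script
# but lives in Latin-1 Supplement; '7' is in the block but is not a Latin
# letter; U+0927 is a Devanagari letter and is in neither.
for ch in ("a", "\u00e9", "7", "\u0927"):
    print(repr(ch), in_basic_latin_block(ch), looks_latin_script(ch))
```

The digit and the accented letter are the interesting rows: one belongs to the block without belonging to the script, the other the reverse.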

Armed with this knowledge we can proceed to solve the portability issue. Both the POSIX class [[:ascii:]] and the Unicode block property \p{InBasic_Latin} match all ASCII characters, so their negations, [^[:ascii:]] and \P{InBasic_Latin}, match all non-ASCII ones. As with everything in Perl, TMTOWTDI (there's more than one way to do it), and this example can be the basis for forming more elaborate use cases later on.
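As a rough illustration of the negated-class idea, here is a sketch in Python rather than Perl. Python's built-in re module supports neither POSIX classes such as [[:ascii:]] nor \p{..} properties, so the same set is spelled out as the range \x00-\x7F. The function name strip_non_ascii is hypothetical, and this is not the solution given in the referenced article.

```python
import re

# Negated range: one or more characters outside U+0000..U+007F.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def strip_non_ascii(name: str) -> str:
    """Remove every non-ASCII character from a file name."""
    return NON_ASCII.sub("", name)


print(strip_non_ascii("myimageऄwithधDevanagariमcharsफ'.png"))
# prints: myimagewithDevanagarichars'.png
```

Simply deleting characters can make two different names collide, so a real renaming script would probably also want to check for clashes.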

But what do we actually mean by ASCII?

We mean characters with ordinal values below 128 (in other words, unaccented US English only), so we need to remove those with values above 127. This gives us the condition 'remove all characters whose ordinal value is > 127' to use in constructing the regex.
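The ordinal condition can also be written out literally, without a regex at all. This is a minimal Python sketch (not Perl, and not the article's own solution); keep_ascii is a hypothetical name.

```python
def keep_ascii(text: str) -> str:
    """Keep characters whose ordinal value is <= 127 and drop the rest."""
    return "".join(ch for ch in text if ord(ch) <= 127)


# U+00EF (i with diaeresis) and U+00E9 (e with acute) are both above 127.
sample = "na\u00efve caf\u00e9"
print(keep_ascii(sample))  # prints: nave caf
```

In Python 3.7 and later, str.isascii() offers a quick check of whether a string needs this treatment in the first place.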

For the solution check the rest of the article, but the point is that the Unicode standard organizes concepts into concrete blocks so that you can work with them intuitively.

All the information about Scripts, Blocks and the rest can be found in the crisp documentation of the standard up on Unicode.org. And you can find all the new Emoji additions at Emoji recently added.

 

More Information

Announcing The Unicode® Standard, Version 14.0

Related Articles

 Advanced Perl Regular Expressions - Extended Constructs

Advanced Perl Regular Expressions - The Pattern Code Expression

Query Unicode From The Command Line

Taming Regular Expressions

Automatically Generating Regular Expressions with Genetic Programming

 


Last Updated ( Friday, 24 September 2021 )