Character Encoding

Introduction

Encoding refers to the details of how the characters in the source file for a web page are coded for transmission over the web. Encoding often involves a character being represented by two or more other characters.

Although the subject is wide-ranging, the basics are not difficult to grasp. If encoding is handled incorrectly, however, errors occur, and they usually show up as uninterpretable glyphs (shapes) on the screen. If you see the character '�' it is almost always an indication of an encoding error, though the glyph that appears depends on the browser in use and the fonts available. This character is the Unicode 'Replacement character' (U+FFFD). On other occasions encoding errors produce other unexpected characters or pairs of characters.
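Both kinds of error are easy to reproduce. The following is a minimal Python sketch, not part of the original page; the word "café" is only an illustrative sample. It reads bytes back with the wrong encoding in each direction.

```python
# Bytes for "café" as they would be stored in an ISO-8859-1 (Latin-1) file.
latin1_bytes = "café".encode("latin-1")

# Reading those bytes as UTF-8 fails on the lone 0xE9 byte; with
# errors="replace" the decoder substitutes U+FFFD for it.
misread_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# The opposite mistake: UTF-8 bytes read as Latin-1. The two-byte
# sequence for "é" turns into a pair of unrelated characters.
utf8_bytes = "café".encode("utf-8")
misread_as_latin1 = utf8_bytes.decode("latin-1")

print(repr(misread_as_utf8))
print(repr(misread_as_latin1))
```

The first result ends in the replacement character; the second ends in a pair of characters where a single letter was intended.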

Use UTF-8

HTML files may be encoded in a large number of different ways. Almost all are equally valid and usable but, as we will see later, W3C now recommends UTF-8 in preference to others and that makes life somewhat simpler. If you adopt this recommendation you can skip the next section on 'ASCII and Latin 1 characters' and the one on 'ISO-8859'. You can also skip the first bit under the heading 'Character encoding', and pick up and carry on from the emphasised paragraph.

Encoding text

ASCII and Latin 1 characters

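Table 1
Printable ASCII and Latin-1 characters with corresponding hex codes
(most significant digit in the first column, least significant digit in the first row; rows 0x to 7x are ASCII, rows 8x to Fx are the Latin 1 additions)
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF
0x
1x
2x sp ! " # $ % & ' ( ) * + , - . /
3x 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4x @ A B C D E F G H I J K L M N O
5x P Q R S T U V W X Y Z [ \ ] ^ _
6x ` a b c d e f g h i j k l m n o
7x p q r s t u v w x y z { | } ~
8x
9x
Ax nbsp ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ shy ® ¯
Bx ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
Cx À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Dx Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
Ex à á â ã ä å æ ç è é ê ë ì í î ï
Fx ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
The characters sp (space), nbsp (no-break space) and shy (soft hyphen) are printable but (normally) invisible. Rows 0x, 1x, 8x and 9x are reserved for control codes.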

Early computers used a 7 bit byte. This was adequate for addressing the ASCII (American Standard Code for Information Interchange) character set which provides a set of 95 printable characters dating from the teleprinter era. A modern eight bit byte however allows a doubling of this number (while reserving a number of codes for control purposes) and gives rise to the Latin-1 set illustrated in table 1. The row and column headings indicate the more and less significant parts of the code (in hexadecimal) corresponding to each character. For instance, the code for character 'A' is 41.

Latin-1 corresponds to the ISO-8859-1 set which is sufficient for web pages in English and many other western European languages. Include the appropriate code in a file and the corresponding character will appear.
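The row and column lookup can be checked mechanically. The sketch below is my own illustration; latin1_hex is a hypothetical helper name, not something from a library. It encodes a character as ISO-8859-1 and splits the resulting hex code into its two digits.

```python
def latin1_hex(ch: str) -> str:
    """Return the two-digit hex code of a character in ISO-8859-1."""
    return ch.encode("latin-1").hex().upper()

for ch in ["A", "~", "é", "\u00a0"]:
    row, col = latin1_hex(ch)  # most and least significant hex digits
    print(f"{ch!r}: code {row}{col} (row {row}x, column x{col})")
```

A character outside the Latin-1 set (a Greek letter, say) makes encode raise UnicodeEncodeError, which is exactly the limitation the following sections address.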

ISO-8859

The needs of many languages, European and other, can be satisfied by similar sets of characters. All share the ASCII characters and substitute others in the remaining positions. This gives rise to the 15 standards in the ISO-8859 series. You can find which languages, and corresponding characters, are supported by each encoding in the article ISO/IEC 8859 at Wikipedia [Ref 9].

To implement this it is clear that more than 256 characters are needed although only 256 locations (less control positions) are available to address them. The characters required to satisfy all in the series are drawn from a much larger set.

Unicode – The Universal Character Set

The Unicode Consortium [Ref 17] have standardised a universal character set (UCS), i.e. a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded.

Unicode (as the UCS is commonly referred to) can address over a million characters, of which about 110,000 had been defined by 2010. These include characters for all the world's main languages along with a selection of symbols for various purposes. An excellent introduction to Unicode is provided by Richard Ishida's article Character encodings: Essential concepts [Ref 22], and he has a fuller tutorial at An Introduction to Writing Systems & Unicode [Ref 23].

HTML specifies a Document Character Set which is a list of the character repertoire available along with the corresponding code points (sometimes referred to as code positions). For HTML the Document Character Set is identical to the UCS which means that, in principle, any character in the UCS may be used in any HTML document. In practice support for the complete character range is uneconomic and systems provide support for subsets only.

Table 2
Printable ASCII and Greek characters
Using ISO-8859-7 encoding
(most significant digit in the first column, least significant digit in the first row; rows 0x to 7x are ASCII; unassigned positions are shown as --)
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF
0x
1x
2x sp ! " # $ % & ' ( ) * + , - . /
3x 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4x @ A B C D E F G H I J K L M N O
5x P Q R S T U V W X Y Z [ \ ] ^ _
6x ` a b c d e f g h i j k l m n o
7x p q r s t u v w x y z { | } ~
8x
9x
Ax nbsp ‘ ’ £ € ₯ ¦ § ¨ © ͺ « ¬ shy -- ―
Bx ° ± ² ³ ΄ ΅ Ά · Έ Ή Ί » Ό ½ Ύ Ώ
Cx ΐ Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο
Dx Π Ρ -- Σ Τ Υ Φ Χ Ψ Ω Ϊ Ϋ ά έ ή ί
Ex ΰ α β γ δ ε ζ η θ ι κ λ μ ν ξ ο
Fx π ρ ς σ τ υ φ χ ψ ω ϊ ϋ ό ύ ώ --

Character Encoding

Character encoding, at its simplest, refers to the process whereby the codes for the characters are mapped to the code points of the Unicode characters appropriate to the language in use. In the case of ISO-8859-1 the character codes are mapped to identical Unicode code points (the first 256 Unicode characters being the same as the Latin-1 set). As another example, ISO-8859-7 encodes Greek characters, displacing many from the Latin-1 set to make room (compare Table 2 with Table 1). In this case the code EA, instead of being mapped to Unicode code point 00EA (giving e circumflex, ê), is mapped to code point 03BA, which gives a small kappa, κ. In fact ISO-8859-7 does not include the character ê at all.

Note All ISO-8859 encodings retain the ASCII characters at the original positions.
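The mapping of the same byte to different code points can be seen directly. This is a small Python sketch of my own, using the codec names iso8859_1 and iso8859_7 from Python's standard library. It decodes the single byte EA under each encoding.

```python
byte = bytes([0xEA])

as_latin1 = byte.decode("iso8859_1")  # code point U+00EA
as_greek = byte.decode("iso8859_7")   # code point U+03BA

for name, ch in [("ISO-8859-1", as_latin1), ("ISO-8859-7", as_greek)]:
    print(f"{name}: byte EA -> U+{ord(ch):04X} {ch}")

# ASCII positions are shared by both encodings.
assert b"A".decode("iso8859_1") == b"A".decode("iso8859_7") == "A"
```

The byte never changes; only the table used to interpret it does, which is why a wrongly declared encoding produces the wrong characters.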

This document uses ISO-8859-1 encoding but, in spite of this, has no difficulty in representing the full repertoire of the Greek characters covered by ISO-8859-7 as can be seen in the table. How this is achieved is explained in the next section.

Authors should note that every page uses one character encoding, and one only, irrespective of the number or range of languages encountered on a page.

In HTML pages character encoding is specified using the 'charset' parameter in the head area of each page. Several options are possible but the form
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">

is recommended for HTML4 and
<meta charset="UTF-8">

for HTML5. See Declaring character encodings in HTML [Ref 24].

Note 'charset', in spite of its name, does not specify a character set. The character set for HTML documents is always the UCS. 'charset' specifies the encoding.

Character escapes (Character references)

A character escape is a way of representing a character without actually using the character itself.

Table 3
Important entity references
Character Entity Numeric character reference (decimal, hexadecimal)
€ &euro; &#8364; &#x20AC;
< &lt; &#60; &#x3c;
> &gt; &#62; &#x3e;
× &times; &#215; &#xd7;
÷ &divide; &#247; &#xf7;
& &amp; &#38; &#x26;
" &quot; &#34; &#x22;
no-break space &nbsp; &#160; &#xa0;

ISO-8859 uses a single byte per character to represent all the characters commonly expected in a language, but there may clearly be a need to represent uncommon characters. This is done using a technique called character escapes. HTML provides two mechanisms: character entity references (entities) and numeric character references (ncrs). Using these methods any character in the UCS may be reached by using a sequence of ASCII characters to point to the required character. Entities take the form &euro; and numeric references the form &#8364; or &#x20AC;, all representing the euro symbol. The 8364 and x20AC represent the Unicode code point for the symbol in decimal and hexadecimal notation respectively.

These methods free the author to employ Unicode characters, irrespective of the encoding in use, at the expense of increasing file size. Where such use is limited this is inconsequential.
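To confirm that the three forms are equivalent, Python's standard html module can resolve them. This is only an illustrative sketch; a browser does the same job when it parses the page.

```python
import html

references = ["&euro;", "&#8364;", "&#x20AC;"]
decoded = [html.unescape(ref) for ref in references]
print(decoded)

# The numeric forms are simply the code point in decimal and in hex.
euro = decoded[0]
print(ord(euro), hex(ord(euro)))

# A numeric reference always names a Unicode code point, never a byte
# of the page's encoding: A4 is the generic currency sign.
currency = html.unescape("&#xA4;")
print(currency)
```

All three references resolve to the same single character, whatever encoding the page itself uses.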

The list of entities is included at section 24 of the HTML specification [Ref 16]. About 250 are defined; numeric character references must be used for characters outside this range. Characters do not have to be outside the range of the encoding for entity references to be provided, as is clear from Table 3, which lists some of the most frequently used, including some in the ASCII set.

Note Entities are case sensitive; thus &Eacute; represents upper case E with an acute accent (É) while &eacute; represents the corresponding lower case letter (é). &EacutE; does not represent anything (&EacutE;). (The error just gets printed out.)

Note Irrespective of the ISO-8859 encoding employed, the entity or numeric reference to be input remains the same. So, although in ISO-8859-7 the euro symbol is represented as byte A4, entering the code &#xA4; will generate a ¤ (generic currency) symbol, not a euro symbol. The code to be input is the entity or numeric character reference for the character required.

Note The need to use character escapes is reduced if you use UTF-8 encoding but it is not completely eliminated. You are also permitted to use escapes even though they may be unnecessary.

UTF encodings

ISO-8859 is fine when using one language at a time but becomes clumsy and slow when languages are mixed. UTF coding releases us from this restriction and provides a mechanism for addressing the full range of Unicode characters quickly. UTF offers alternative formats UTF-8, UTF-16 or UTF-32 which are based on units of 8, 16 or 32 bits respectively. UTF-32 is not usually used for coding web pages.

UTF-8 uses 1 to 4 bytes to represent a character. It uses one byte for characters in the ASCII set, two bytes for the next 1920 characters (including the Latin alphabet characters with diacritics, and Greek, Cyrillic, Coptic, Armenian, Hebrew and Arabic characters) and three bytes for the rest of the roughly 65,000 characters in the Basic Multilingual Plane (BMP). Supplementary characters use 4 bytes.

UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.

UTF-32 uses 4 bytes for all characters.
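These byte counts can be verified with a few sample characters. The Python sketch below is illustrative only; the sample characters and the helper name encoded_length are my own choices. The "-le" codec variants are used so that no byte order mark is added to the count.

```python
samples = {
    "A": "ASCII letter",
    "é": "Latin letter with diacritic",
    "κ": "Greek letter",
    "€": "other BMP character",
    "\U0001D11E": "supplementary character (musical G clef)",
}

def encoded_length(ch: str, encoding: str) -> int:
    """Number of bytes used for one character, without a byte order mark."""
    return len(ch.encode(encoding))

for ch, description in samples.items():
    sizes = [encoded_length(ch, enc) for enc in ("utf-8", "utf-16-le", "utf-32-le")]
    print(f"U+{ord(ch):04X} {description}: {sizes}")
```

For text that is mostly ASCII, which includes all HTML markup, UTF-8 is therefore the most compact of the three.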

Authors are encouraged to use UTF-8

While HTML4 allows authors to choose from a large number of possible encodings, the HTML 5 draft currently says "Authors are encouraged to use UTF-8." Since the advantages of adopting this practice are numerous and the disadvantages trivial, now is probably a good time to start.

Symbols

Traditionally computers have relied on special fonts like 'Symbol' or 'Wingdings' to produce symbols. This is not necessary on web pages. Since such fonts do not support Unicode any attempt to use them will yield unreliable results which may vary from browser to browser.

Fortunately Unicode supports a large range of symbols which fulfills many needs.

Inputting special symbols

Table 4
Some useful symbols
Symbol Key code Unicode
En dash alt+0150 U+2013
Em dash alt+0151 U+2014
Ellipsis alt+0133 U+2026
Left single quote alt+0145 U+2018
Right single quote alt+0146 U+2019
Left double quote alt+0147 U+201C
Right double quote alt+0148 U+201D
Euro alt+0128 U+20AC
Pound alt+0163 U+00A3
Generic currency sign alt+0164 U+00A4
Degree sign alt+0176 U+00B0
Multiply sign alt+0215 U+00D7
Divide sign alt+0247 U+00F7

There are several ways of inserting symbols into a page.

  1. Dedicated editors often offer a special menu of keystroke options.
  2. Using Windows Character Map. Using the standard font in use, find the symbol required, select and copy it, then paste it as required. Recent versions of Character Map allow you to group characters by Unicode subrange, which makes it easier to find a particular symbol.
  3. If you use a particular symbol frequently it may be easier to insert it by keystroke. Several characters permit this and, where it can be done, the keystroke is shown at the bottom right corner of Character Map. The euro symbol, €, for instance, may be inserted using ALT+0128.
  4. AllChars [Ref 21] is a useful utility that allows any program to insert any Windows-1252 (see below) character using a few easily discovered keystrokes.
  5. When editing source code any character may be inserted using the numeric character reference. In the bottom left corner of Character Map this is given in hex format. Thus for euro you insert &#x20AC;
    Note The code is shown in Character Map as U+20AC. This is the conventional way of representing Unicode characters, i.e. by 'U+' followed by the hexadecimal code point.
  6. In source view, entity references may be inserted similarly. Many are easy to memorise, e.g. &euro; &gt; &lt;

The keen-eyed may observe that the key codes are neither the ISO-8859-1 character codes nor the Unicode code points for the character required. In fact they are the (decimal) character codes from the Windows-1252 encoding. Windows-1252 [Ref 13] is a possible alternative to ISO-8859-1 suitable for western languages. It increases the number of available characters to 218 by re-allocating some of the codes in the range 80 to 9F, which ISO-8859-1 leaves without printable characters.
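The relationship between key code, Windows-1252 byte and Unicode code point can be checked with a short Python sketch. The helper name alt_code_to_char is hypothetical; cp1252 is Python's name for the Windows-1252 codec.

```python
def alt_code_to_char(code: int) -> str:
    """Interpret a decimal key code as a Windows-1252 byte."""
    return bytes([code]).decode("cp1252")

for name, code in [("En dash", 150), ("Ellipsis", 133), ("Euro", 128), ("Pound", 163)]:
    ch = alt_code_to_char(code)
    print(f"{name}: alt+0{code} -> byte {code:02X} -> U+{ord(ch):04X}")
```

Only for codes from A0 upwards (the pound sign here) do the Windows-1252 byte and the Unicode code point coincide; in the range 80 to 9F they differ.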

Alan Wood's website [Ref 2] is a useful resource listing entities (where defined) and numeric character references for a large number of characters from the Monotype Typography Symbol font (as on Windows), including Greek, Mathematical and Punctuation [Ref 6], and also the Microsoft Wingdings font [Ref 7]. (For Wingdings there is in several cases no Unicode equivalent.)

Unicode support

Fonts

Although Unicode offers tremendous potential, the usual caveats apply when choosing fonts. When building a font stack it is important to check that all fonts in the list include the characters required. No font covers the full range of Unicode, and most cover only a small percentage of it. To check the supported Unicode ranges of a font, Microsoft supply an extension [Ref 12] for Windows Explorer. With it installed, right-click any TrueType (TTF) font file in Windows Explorer and select the Properties tab. Particular characters can be searched for using Character Map.

Checking for support is more than usually difficult if unusual characters are required. Compatible fonts must be installed on any visitor's computer and, where in a style sheet the font-family is specified as a prioritised list of font family names (as it should be), ideally all fonts in the list should be checked.

Note A font stack lists fonts the first of which will be used if available. A browser should check that the character required is supported by the font selected and, if not, pass on to the next in the stack. Not all browsers do this. Older versions of IE (≤7) will use the first font found even if it does not include the character required probably printing a small square.

If a page uses unusual characters, check the rendering on as many different browsers as possible, but even this may not be adequate. If the author has a well configured system with fonts with good Unicode support, such as "Arial Unicode MS" or even Code2000, the browser, after searching the listed fonts, will pass on to the generic font family and may find the character needed there. A visitor with fewer fonts installed may fare less well. It is important to ensure that the character needed can be found in the font stack before the generic font is reached.

Alan Wood offers several pages which are extremely useful in this respect. A list of the characters in the WGL4 set, which are likely to be widely available, may be found at Using special characters from Windows Glyph List 4 (WGL4) in HTML [Ref 3]. Unicode fonts for Windows computers [Ref 4] lists which fonts carry specific ranges of Unicode characters and, more interestingly, shows how the fonts are distributed so that authors may check their likely availability to visitors. Those wishing to use a rarer character may check which fonts include it at Unicode character ranges and the Unicode fonts that support them [Ref 5].

Examples

Example

Upwards double arrow Character U+21D1 not included in any font from the list specified (Tahoma, Arial, Helvetica, sans-serif).

The arrow appears as a square when using MSIE ≤ 7

Same demonstration but set up spanning the arrow using a class with a font stack starting with font-family: 'Lucida Sans Unicode' .

While preparing this page, for instance, Table 1 displayed correctly in Firefox but in MSIE the arrows originally appeared as squares. The issue is reproduced in the box on the right. The arrows use comparatively rare characters that do not appear in the Trebuchet font used but, on the writer's machine at least, the Gecko engine was able to retrieve them, possibly from Lucida Sans Unicode.

The result is that visitors using older MSIE see boxes instead of arrows but those using a modern browser may see the arrows if Lucida Sans Unicode or some other font with the characters is installed on their machine.

A work-around for this issue is possible, as also shown in the box. The list specifying the font is modified so that the first in the list becomes 'Lucida Sans Unicode'. If this is available it will be used, otherwise the choice passes down the list. According to Wikipedia this font has been supplied with all Windows OS since Windows 98, which ensures high availability.

This is a moderately, but not very, robust solution. Had the availability of the arrows been critical to understanding the table it would have been necessary to change the design.

While the arrows may be considered rare and unusual characters, even characters covered by some ISO-8859 options may not be reliable. In viewing Table 2, depending on the browser in use and the fonts installed, there are two characters, Drachma sign (code A5) and Greek ypogegrammeni (code AA), which may not display correctly. In cases like these checking the WGL4 list may provide a warning, because neither of the characters is listed.

Special characters

HTML uses certain characters for specific purposes and demands that when they appear in text strings they are encoded so that the special nature is preserved.

< (Less than)
This is used to demark the opening of an html tag. When used for other purposes, e.g. in text, it must be encoded as an entity or numeric character reference. (&lt; or &#60; or &#x3C;)
> (Greater than)
This is used to demark the closing of an html tag. Similarly it must be encoded. (&gt; or &#62; or &#x3E;)
& (Ampersand)
This is used to demark the opening of an encoded character so it itself must be encoded. (&amp; or &#38; or &#x26;)
The encoding for the last bracketed example would therefore be
&amp;amp; or &amp;#38; or &amp;#x26; when encoding as entities or
&#38;amp; or &#38;#38; or &#38;#x26; when encoding as decimal ncrs or
&#x26;amp; or &#x26;#38; or &#x26;#x26; when encoding as hexadecimal ncrs
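When pages are generated by a program, this escaping is normally left to a library routine rather than done by hand. As a sketch, Python's standard html.escape handles the three characters above; the sample text is my own.

```python
import html

text = "if a < b && b > c"
escaped = html.escape(text, quote=False)
print(escaped)

# Escaping a second time shows why the reference "&lt;" itself has to
# be written as "&amp;lt;" when it is meant to appear literally.
double_escaped = html.escape("&lt;", quote=False)
print(double_escaped)
```

The ampersand is replaced first, so that the ampersands introduced by the other replacements are not themselves escaped again.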

In addition there are a few characters which are normally encoded when used, because they would otherwise be indistinguishable in the source code.

nbsp (non-breaking space)
This is used between two words if you do not wish them to be split across a line break. Apart from the space taken it is invisible. It should be encoded as
&nbsp; or &#160; or &#xA0;
shy (soft hyphen)
The soft hyphen may be inserted between syllables of long words to show where a line may be broken. If the line is not broken the hyphen is invisible but when broken it appears as a normal hyphen. It is encoded as
&shy; or &#173; or &#xAD;

Encoding and CSS

Use ASCII only and avoid all encoding issues

I almost called this 'CSS encoding' but this might give the impression that CSS needs to be encoded. This is almost always untrue, but occasionally it is true and, for some authors, that might amount to 'CSS always needs to be encoded'. Let's see why and when.

CSS files include CSS properties and values. All the properties and most of the values are written using ASCII characters. Very occasionally values might include non ASCII characters – possibly in the name of a font. CSS files also include selectors. These include HTML tag names, which use only ASCII, and also class and id names which are entirely at the discretion of the author so can include almost any character.

If you need to include non-ASCII characters in a CSS file you must encode. There are two possible approaches.

  • Using character escapes
  • Fully encode the CSS

Using character escapes for CSS

If you have only occasional need to use a non-ASCII character in a CSS file, the simplest, safest and most painless approach is to escape the character. While HTML provides entities and ncrs for escapes, CSS recognises neither of these but provides a simple alternative. Take the example

p.édition_1 { color: green; }

Since the letter é, though Latin 1 and included in the ISO-8859-1 set, is not ASCII it should be encoded. You could use

p.\E9 dition_1 {color: green;}

The escape starts with a backslash followed by the Unicode code point of the required character in hexadecimal. The end of the escape is indicated by a character that cannot occur in a hex code, by a space, or by reaching the maximum of six hex digits. If a space ends the escape it is ignored.

For the first 256 (Latin 1) characters the code points are the same as the ISO-8859-1 codes. See table 1. For others they may be obtained from the Windows character map.

Thus the following are all possible

p.\E9tape_1 {color: blue;}
p.\E9 tape_1 {color: blue;}

p.\00E9tape_1 {color: blue;}
p.\0000E9tape_1 {color: blue;}
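If class names are generated by a script, the escapes can be produced automatically. The following Python sketch is an assumption-laden illustration, not a complete CSS serialiser. The function name css_escape_non_ascii is hypothetical, and it deals only with non-ASCII characters, always using the 'escape followed by a space' form.

```python
def css_escape_non_ascii(identifier: str) -> str:
    """Replace each non-ASCII character with a CSS hex escape.

    A single space follows every escape so that a following hex digit
    (the "d" in "dition", for instance) cannot be read as part of the
    code point.
    """
    parts = []
    for ch in identifier:
        if ord(ch) > 0x7F:
            parts.append(f"\\{ord(ch):X} ")
        else:
            parts.append(ch)
    return "".join(parts)

print("p." + css_escape_non_ascii("étape_1") + " {color: blue;}")
print("p." + css_escape_non_ascii("édition_1") + " {color: green;}")
```

Always adding the space is the cautious choice: it is required before "dition" because "d" is a hex digit, and harmless before "tape" where it is not strictly needed.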

It is important to note that these are the codes to be used in the CSS file and not the HTML. So the content of the HTML fragment below will appear in blue.

<p class="étape_1">Stage 1</p>

Fully encode the CSS

Using CSS escapes is all right as long as it doesn't occur too often, because the code then becomes difficult to read and maintain. An alternative approach is to use the same encoding for the CSS stylesheet as for the HTML page. This allows you to use any character covered by the chosen encoding in the stylesheet. Any character outside the encoding must still be escaped as described already.

When using external stylesheets the browser assumes that the stylesheet has the same encoding as the HTML page. This leads to potential problems, since the encodings of the page and stylesheet must be kept synchronised. To avoid this the encoding of the stylesheet should be declared on the first line of the stylesheet. To do this use the @charset rule, for instance:

@charset "UTF-8";

This must be placed on the first line of the stylesheet with nothing at all before it.

While any encoding may be used the choice of UTF-8 allows any character to be used in the stylesheet and eliminates the possible need for escapes.

If the HTML file is encoded as UTF-8 and the stylesheet uses ASCII characters only, there is no need for the @charset declaration in the stylesheet since ASCII characters occupy the code positions expected.

Learn more at Declaring character encodings in CSS [Ref 25] and Using character escapes in markup and CSS [Ref 26].

URI encoding

I use the term URI rather than the more common URL though the difference in meaning is quite subtle.

Percentage encoding

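Table 5
Characters permissible in a URI
(most significant digit in the first column, least significant digit in the first row; rows 0x to 7x are ASCII, rows 8x to Fx are the Latin 1 additions)
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF
0x
1x
2x sp ! " # $ % & ' ( ) * + , - . /
3x 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4x @ A B C D E F G H I J K L M N O
5x P Q R S T U V W X Y Z [ \ ] ^ _
6x ` a b c d e f g h i j k l m n o
7x p q r s t u v w x y z { | } ~
8x
9x
Ax nbsp ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ shy ® ¯
Bx ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
Cx À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Dx Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
Ex à á â ã ä å æ ç è é ê ë ì í î ï
Fx ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
Unreserved in RFC 3986 (may be used freely): A to Z, a to z, 0 to 9 and - . _ ~
Reserved in RFC 3986: : / ? # [ ] @ ! $ & ' ( ) * + , ; =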

Special considerations apply to characters in a URI. URIs can occur as the value of some attributes of elements the most common being the 'href' and 'src'.

Any Latin 1 character may occur in a URI but only those shown against a green background in Table 5 may be used freely. This set (the 'unreserved' characters of RFC 3986) comprises the letters, the digits, hyphen, full stop, underscore and tilde. A number of other characters, which may have specific meanings, are reserved. These include the majority of the remaining ASCII punctuation characters. Such characters may be used to separate one part of the structure from another, e.g. the colon separates the protocol from the domain. These characters, from the specification RFC 3986 [Ref 14], are shown against an orange background. Whenever such a character is used other than for its reserved purpose it must be encoded to avoid confusion. Use of the remaining characters depends on the specifics of the URI or the part of the URI involved.

When encoding is required in a URI a new method referred to as 'percentage encoding' is used. Put simply, percent encoded characters consist of a percentage sign followed by two characters representing the hexadecimal position of the character in the Latin 1 set. Thus %20 represents a space.

Full details of URI encoding are covered in RFC 3986 [Ref 14]. Wikipedia provides a simpler explanation of Percent encoding [Ref 11].

Authors often note that the names of saved files appear with spaces replaced by %20. As explained this is quite safe and indeed some operating systems prohibit unencoded spaces in file names. It is always preferable to avoid spaces when naming files. Use the underscore as an alternative.

It is actually possible to use percentage encoding for any character in the Latin 1 set.
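Percent encoding can be tried out with Python's standard urllib.parse module. This is a sketch only; the file name and the word "café" are made-up samples. Note that quote defaults to UTF-8 rather than Latin 1, so the encoding is passed explicitly to match the description above.

```python
from urllib.parse import quote, unquote

space_encoded = quote("my file.html")               # space becomes %20
latin1_encoded = quote("café", encoding="latin-1")  # Latin 1 position of é
utf8_encoded = quote("café")                        # default: UTF-8 bytes of é
round_trip = unquote(space_encoded)

print(space_encoded)
print(latin1_encoded)
print(utf8_encoded)
print(round_trip)
```

The unreserved characters pass through unchanged, while the space and the accented letter are replaced by percent escapes. The same letter yields different escapes under Latin 1 and UTF-8, so both ends of a link need to agree on the encoding.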

Eric Meyer has provided a URL Decoder Encoder in his toolbox [Ref 20] which allows you to see the results of encoding.

Multilingual web addresses

What has been said above is actually a slight simplification. It has recently become possible to use non-Latin 1 characters in a URI, for which the correct term becomes IRI (Internationalized Resource Identifier). This makes possible the use of IRIs like http://ヒキワリ.ナットウ.ニホン

This subject is out of scope here but is discussed in the W3C article An Introduction to Multilingual Web Addresses [Ref 27].