The primary advantage of printable ASCII is that it is a fixed alphabet of limited length. Most keyboards have dedicated keys for each of the printable ASCII characters. They may also support other characters, but these will vary, depending on country and manufacturer.
Important properties of an alphabet:
- Fixed
- A relatively small and manageable set of characters
- Ordered
- Writable (with a pen and paper, or with a keyboard)
- Each glyph represents a single character.
The ASCII text encoding contains more characters, but these are non-printing control characters, which do not have individual visible glyphs.
Note: The horizontal tab (HT), the line feed (LF), and the space (SPA) characters are whitespace characters. HT and LF are non-printing control characters, and SPA has no visible glyph of its own, but all three are used to affect the visible layout of the printable text. Most keyboards have dedicated keys for them (usually Tab, Enter, and Space Bar, respectively). I have therefore included them in the "printable ASCII" alphabet.
There are several problems with Unicode:
- It is not fixed. Every so often new characters are added.
- It is not a relatively small and manageable set of characters. Its code space allows for 1,111,998 possible characters.
- It is not directly writable with a keyboard.
- Many glyphs look the same or are very similar, but represent different characters.
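As a rough illustration of two of the points above, here is a short Python sketch. The arithmetic assumes the usual breakdown of the Unicode code space (17 planes, minus the surrogates and the noncharacters). The look-alike pair (Latin A and Greek Alpha) is my own choice of example.

```python
import unicodedata

# Unicode code points run from U+0000 to U+10FFFF: 17 planes of 65,536 each.
code_points = 17 * 65536            # 1,114,112
surrogates = 0xDFFF - 0xD800 + 1    # 2,048 code points reserved for UTF-16
noncharacters = 66                  # permanently reserved noncharacters
possible_characters = code_points - surrogates - noncharacters
print(possible_characters)          # 1111998

# Two glyphs that usually look identical but are different characters.
latin_a = "A"
greek_alpha = "\u0391"
print(latin_a == greek_alpha)         # False
print(unicodedata.name(latin_a))      # LATIN CAPITAL LETTER A
print(unicodedata.name(greek_alpha))  # GREEK CAPITAL LETTER ALPHA
```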
The primary problem with Unicode is that editing Unicode text becomes difficult as soon as you stop using a dedicated subset of it. It is feasible to write French text using a dedicated French keyboard on a French-language computer system. It is a significant challenge to edit excerpts in, e.g., French or Greek within an article written in English, using an English-language keyboard and computer system. In practice, this is usually done by a) writing special ASCII sequences using an ASCII keyboard and/or b) using a mouse and a dedicated software interface with buttons or lists.
My preference is simply to accept that keyboard-editable Unicode will consist of special ASCII sequences, and then design and use ASCII sequences that are relatively readable and editable. These sequences can later, in a further layer of software, be rendered as specialised glyphs, for those who wish to write French text directly using a French keyboard and computer system. If there are problems (and there are always problems), the text degrades gracefully into ASCII sequences.
If you think about it, this is how ASCII works anyway. The actual alphabet in a computer is [0, 1]. When you type 'A', the computer actually stores this text: 01000001
01000001 is a sequence written using the letters 0 and 1 that encodes the ASCII character 'A'.
So, using the same approach, I can declare that the ASCII sequence {e'}, written using the letters {e} and {'}, encodes the Unicode letter {é}.
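The layering can be sketched in a few lines of Python. The one-entry table and the function name decode_pair are illustrative assumptions, not part of any finished tool.

```python
# Layer 1: the bit pattern that stores the ASCII character 'A'.
bits = format(ord("A"), "08b")
print(bits)  # 01000001

# Layer 2: a two-character ASCII sequence that names one Latin-1 character.
# This table holds a single illustrative entry, not the full encoding.
SEQUENCES = {"e'": "\u00e9"}  # e followed by an apostrophe -> é


def decode_pair(pair: str) -> str:
    """Look up the character that a two-character ASCII sequence encodes."""
    return SEQUENCES[pair]


print(decode_pair("e'"))  # é
```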
Important properties for an encoding:
- All characters in the encoding must have the same length (the same number of characters in the underlying alphabet). All ASCII characters are written using eight 0s and 1s. [0] If the length is not fixed, code that searches the text will be harder to write and slower to run.
- Ordered
Note: A new version of an encoding may change the characters and their order.
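To show why a fixed length helps with searching, here is a minimal Python sketch. It assumes a width of two ASCII characters per encoded character. The function names split_units and find_unit are hypothetical.

```python
def split_units(encoded: str, width: int = 2) -> list[str]:
    """Split fixed-width encoded text into its units."""
    if len(encoded) % width != 0:
        raise ValueError("length is not a multiple of the unit width")
    return [encoded[i:i + width] for i in range(0, len(encoded), width)]


def find_unit(encoded: str, unit: str, width: int = 2) -> int:
    """Return the character index of the first matching unit, or -1."""
    for index, candidate in enumerate(split_units(encoded, width)):
        if candidate == unit:
            return index
    return -1


text = "h_e'l_l_o_"
print(split_units(text))      # ['h_', "e'", 'l_', 'l_', 'o_']
print(find_unit(text, "e'"))  # 1
# A naive substring search can match across a unit boundary:
print("_e" in text)           # True
print(find_unit(text, "_e"))  # -1
```

Because every unit has the same width, the n-th character always starts at offset n * width, so the search never has to parse the text to find character boundaries.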
In the Unicode standard, the first Unicode block is identical to ASCII.
The next Unicode block is the Latin-1 Supplement. It includes the more common characters with diacritics that are used in various European languages. Some reading indicates that ASCII and Latin-1 together can handle the following languages: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish. Note: Some rare characters (e.g. the capital Y with an umlaut in French) are not handled.
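A quick way to check whether a piece of text stays within these two blocks is to try encoding it with Python's built-in "latin-1" codec, which covers code points U+0000 to U+00FF (including the control characters). The function name fits_latin1 is my own.

```python
def fits_latin1(text: str) -> bool:
    """True if every character lies in the range U+0000 to U+00FF."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1("d\u00e9j\u00e0 vu"))  # True: é and à are in Latin-1
print(fits_latin1("\u00ff"))             # True: small y with diaeresis
print(fits_latin1("\u0178"))             # False: capital Y with diaeresis
```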
I have developed an encoding for the Latin-1 Supplement (see the table below). Every character is written as a two-ASCII-character sequence. I have tried to choose ASCII character sequences that resemble the Latin-1 character or are related to it in some way. This will hopefully make the encoding relatively readable and editable in its raw ASCII form.
Today, Unicode is usually (although not always) rendered successfully by various programs, e.g. web browsers. So, my preference is to write and store European text in my encoding, and then to render it into Unicode when it is requested by a web browser (note: the rendering can be cached).
Within ASCII text, I enclose my encoded Latin-1 data within opening and closing tags, like so:
<latin-1>e'</latin-1>
This is a markup language approach, so I have called my encoding Edgecase Markup Language (EML).
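Locating the tagged spans inside ASCII text could look something like the following Python sketch. It assumes that the closing tag never appears inside the encoded data. The names TAG_PATTERN and find_encoded_spans are hypothetical.

```python
import re

# Non-greedy match, so that each opening tag pairs with the nearest closing tag.
TAG_PATTERN = re.compile(r"<latin-1>(.*?)</latin-1>", re.DOTALL)


def find_encoded_spans(text: str) -> list[str]:
    """Return the raw encoded data found between each pair of tags."""
    return TAG_PATTERN.findall(text)


sample = "The word <latin-1>c_a_f_e'</latin-1> is French."
print(find_encoded_spans(sample))  # ["c_a_f_e'"]
```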
Many words will include both ASCII characters and Latin-1 characters. I don't want to have to write the tags around every single letter. So, I have included all the printable ASCII characters within my Latin-1 encoding, mostly by adding an underscore to make each one two-characters-long but still readable (and distinguishable from Latin-1 characters).
Example:
<latin-1>h_e'l_l_o_</latin-1> will be rendered as: héllo
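A minimal decoder for the data between the tags might look like this in Python. It is a sketch, not a reference implementation: the table holds only a handful of entries taken from Table 1, and the names SEQUENCES and decode_eml are my own.

```python
# A small excerpt of Table 1. The full table would have one entry per row.
SEQUENCES = {
    "-\t": "\t",     # Horizontal Tab
    "+\n": "\n",     # Line Feed
    "  ": " ",       # Space
    "e'": "\u00e9",  # é
    "e`": "\u00e8",  # è
    "a`": "\u00e0",  # à
    "c,": "\u00e7",  # ç
}


def decode_eml(payload: str) -> str:
    """Decode two-character ASCII sequences into their characters."""
    if len(payload) % 2 != 0:
        raise ValueError("payload length must be even")
    out = []
    for i in range(0, len(payload), 2):
        pair = payload[i:i + 2]
        if pair in SEQUENCES:
            out.append(SEQUENCES[pair])
        elif pair[1] == "_" and "!" <= pair[0] <= "~":
            # Underscore rule: a printable ASCII character followed by "_".
            out.append(pair[0])
        else:
            raise ValueError(f"unknown sequence: {pair!r}")
    return "".join(out)


print(decode_eml("h_e'l_l_o_"))  # héllo
```

If a sequence is unknown, the sketch raises an error. A renderer could instead leave the raw ASCII in place, so that the text degrades gracefully.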
Table 1: Edgecase Markup Language (EML): Latin-1
Note: Each character is represented by two ASCII characters.
Index | Character | ASCII sequence | Name |
0 | HT | - followed by HT | Horizontal Tab |
1 | LF | + followed by LF | Line Feed |
2 | SPA | Two SPA characters | Space |
3 | ! | !_ | Exclamation Mark |
4 | " | "_ | Quotation Mark |
5 | # | #_ | Number Sign |
6 | $ | $_ | Dollar Sign |
7 | % | %_ | Percent Sign |
8 | & | &_ | Ampersand |
9 | ' | '_ | Apostrophe |
10 | ( | (_ | Left Parenthesis |
11 | ) | )_ | Right Parenthesis |
12 | * | *_ | Asterisk |
13 | + | +_ | Plus Sign |
14 | , | ,_ | Comma |
15 | - | -_ | Hyphen-Minus (also known as Hyphen or Minus Sign) |
16 | . | ._ | Full Stop (also known as Period or Dot) |
17 | / | /_ | Solidus (also known as Slash) |
18 | 0 | 0_ | Digit Zero |
19 | 1 | 1_ | Digit One |
20 | 2 | 2_ | Digit Two |
21 | 3 | 3_ | Digit Three |
22 | 4 | 4_ | Digit Four |
23 | 5 | 5_ | Digit Five |
24 | 6 | 6_ | Digit Six |
25 | 7 | 7_ | Digit Seven |
26 | 8 | 8_ | Digit Eight |
27 | 9 | 9_ | Digit Nine |
28 | : | :_ | Colon |
29 | ; | ;_ | Semicolon |
30 | < | <_ | Less-Than Sign |
31 | = | =_ | Equals Sign |
32 | > | >_ | Greater-Than Sign |
33 | ? | ?_ | Question Mark |
34 | @ | @_ | Commercial At (also known as At Sign) |
35 | A | A_ | Latin Capital Letter A |
36 | B | B_ | Latin Capital Letter B |
37 | C | C_ | Latin Capital Letter C |
38 | D | D_ | Latin Capital Letter D |
39 | E | E_ | Latin Capital Letter E |
40 | F | F_ | Latin Capital Letter F |
41 | G | G_ | Latin Capital Letter G |
42 | H | H_ | Latin Capital Letter H |
43 | I | I_ | Latin Capital Letter I |
44 | J | J_ | Latin Capital Letter J |
45 | K | K_ | Latin Capital Letter K |
46 | L | L_ | Latin Capital Letter L |
47 | M | M_ | Latin Capital Letter M |
48 | N | N_ | Latin Capital Letter N |
49 | O | O_ | Latin Capital Letter O |
50 | P | P_ | Latin Capital Letter P |
51 | Q | Q_ | Latin Capital Letter Q |
52 | R | R_ | Latin Capital Letter R |
53 | S | S_ | Latin Capital Letter S |
54 | T | T_ | Latin Capital Letter T |
55 | U | U_ | Latin Capital Letter U |
56 | V | V_ | Latin Capital Letter V |
57 | W | W_ | Latin Capital Letter W |
58 | X | X_ | Latin Capital Letter X |
59 | Y | Y_ | Latin Capital Letter Y |
60 | Z | Z_ | Latin Capital Letter Z |
61 | [ | [_ | Left Square Bracket |
62 | \ | \_ | Reverse Solidus (also known as Backslash) |
63 | ] | ]_ | Right Square Bracket |
64 | ^ | ^_ | Circumflex Accent |
65 | _ | __ | Low Line (also known as Underscore) |
66 | ` | `_ | Grave Accent (also known as Backtick) |
67 | a | a_ | Latin Small Letter A |
68 | b | b_ | Latin Small Letter B |
69 | c | c_ | Latin Small Letter C |
70 | d | d_ | Latin Small Letter D |
71 | e | e_ | Latin Small Letter E |
72 | f | f_ | Latin Small Letter F |
73 | g | g_ | Latin Small Letter G |
74 | h | h_ | Latin Small Letter H |
75 | i | i_ | Latin Small Letter I |
76 | j | j_ | Latin Small Letter J |
77 | k | k_ | Latin Small Letter K |
78 | l | l_ | Latin Small Letter L |
79 | m | m_ | Latin Small Letter M |
80 | n | n_ | Latin Small Letter N |
81 | o | o_ | Latin Small Letter O |
82 | p | p_ | Latin Small Letter P |
83 | q | q_ | Latin Small Letter Q |
84 | r | r_ | Latin Small Letter R |
85 | s | s_ | Latin Small Letter S |
86 | t | t_ | Latin Small Letter T |
87 | u | u_ | Latin Small Letter U |
88 | v | v_ | Latin Small Letter V |
89 | w | w_ | Latin Small Letter W |
90 | x | x_ | Latin Small Letter X |
91 | y | y_ | Latin Small Letter Y |
92 | z | z_ | Latin Small Letter Z |
93 | { | {_ | Left Curly Bracket (also known as Left Brace) |
94 | | | |_ | Vertical Line (also known as Vertical Bar) |
95 | } | }_ | Right Curly Bracket (also known as Right Brace) |
96 | ~ | ~_ | Tilde |
97 | | nb | No-Break Space |
98 | ¡ | !i | Inverted Exclamation Mark |
99 | ¢ | c/ | Cent Sign |
100 | £ | L- | Pound Sign |
101 | ¤ | ox | Currency Sign |
102 | ¥ | Y= | Yen Sign |
103 | ¦ | || | Broken Bar |
104 | § | se | Section Sign |
105 | ¨ | _: | Diaeresis |
106 | © | co | Copyright Sign |
107 | ª | ^a | Feminine Ordinal Indicator |
108 | « | << | Left-Pointing Double Angle Quotation Mark |
109 | ¬ | -| | Not Sign |
110 | | so | Soft Hyphen |
111 | ® | re | Registered Sign |
112 | ¯ | ^- | Macron |
113 | ° | ^O | Degree Sign |
114 | ± | +- | Plus-Minus Sign |
115 | ² | ^2 | Superscript Two |
116 | ³ | ^3 | Superscript Three |
117 | ´ | _' | Acute Accent |
118 | µ | mi | Micro Sign |
119 | ¶ | pi | Pilcrow Sign |
120 | · | m. | Middle Dot |
121 | ¸ | _, | Cedilla |
122 | ¹ | ^1 | Superscript One |
123 | º | ^o | Masculine Ordinal Indicator |
124 | » | >> | Right-Pointing Double Angle Quotation Mark |
125 | ¼ | 14 | Vulgar Fraction One Quarter |
126 | ½ | 12 | Vulgar Fraction One Half |
127 | ¾ | 34 | Vulgar Fraction Three Quarters |
128 | ¿ | ?i | Inverted Question Mark |
129 | À | A` | Latin Capital Letter A With Grave |
130 | Á | A' | Latin Capital Letter A With Acute |
131 | Â | A^ | Latin Capital Letter A With Circumflex |
132 | Ã | A~ | Latin Capital Letter A With Tilde |
133 | Ä | A: | Latin Capital Letter A With Diaeresis |
134 | Å | Ao | Latin Capital Letter A With Ring Above |
135 | Æ | AE | Latin Capital Letter Ae |
136 | Ç | C, | Latin Capital Letter C With Cedilla |
137 | È | E` | Latin Capital Letter E With Grave |
138 | É | E' | Latin Capital Letter E With Acute |
139 | Ê | E^ | Latin Capital Letter E With Circumflex |
140 | Ë | E: | Latin Capital Letter E With Diaeresis |
141 | Ì | I` | Latin Capital Letter I With Grave |
142 | Í | I' | Latin Capital Letter I With Acute |
143 | Î | I^ | Latin Capital Letter I With Circumflex |
144 | Ï | I: | Latin Capital Letter I With Diaeresis |
145 | Ð | -D | Latin Capital Letter Eth |
146 | Ñ | N~ | Latin Capital Letter N With Tilde |
147 | Ò | O` | Latin Capital Letter O With Grave |
148 | Ó | O' | Latin Capital Letter O With Acute |
149 | Ô | O^ | Latin Capital Letter O With Circumflex |
150 | Õ | O~ | Latin Capital Letter O With Tilde |
151 | Ö | O: | Latin Capital Letter O With Diaeresis |
152 | × | xx | Multiplication Sign |
153 | Ø | O/ | Latin Capital Letter O With Stroke |
154 | Ù | U` | Latin Capital Letter U With Grave |
155 | Ú | U' | Latin Capital Letter U With Acute |
156 | Û | U^ | Latin Capital Letter U With Circumflex |
157 | Ü | U: | Latin Capital Letter U With Diaeresis |
158 | Ý | Y' | Latin Capital Letter Y With Acute |
159 | Þ | Ip | Latin Capital Letter Thorn |
160 | ß | B. | Latin Small Letter Sharp S |
161 | à | a` | Latin Small Letter A With Grave |
162 | á | a' | Latin Small Letter A With Acute |
163 | â | a^ | Latin Small Letter A With Circumflex |
164 | ã | a~ | Latin Small Letter A With Tilde |
165 | ä | a: | Latin Small Letter A With Diaeresis |
166 | å | ao | Latin Small Letter A With Ring Above |
167 | æ | ae | Latin Small Letter Ae |
168 | ç | c, | Latin Small Letter C With Cedilla |
169 | è | e` | Latin Small Letter E With Grave |
170 | é | e' | Latin Small Letter E With Acute |
171 | ê | e^ | Latin Small Letter E With Circumflex |
172 | ë | e: | Latin Small Letter E With Diaeresis |
173 | ì | i` | Latin Small Letter I With Grave |
174 | í | i' | Latin Small Letter I With Acute |
175 | î | i^ | Latin Small Letter I With Circumflex |
176 | ï | i: | Latin Small Letter I With Diaeresis |
177 | ð | -d | Latin Small Letter Eth |
178 | ñ | n~ | Latin Small Letter N With Tilde |
179 | ò | o` | Latin Small Letter O With Grave |
180 | ó | o' | Latin Small Letter O With Acute |
181 | ô | o^ | Latin Small Letter O With Circumflex |
182 | õ | o~ | Latin Small Letter O With Tilde |
183 | ö | o: | Latin Small Letter O With Diaeresis |
184 | ÷ | -: | Division Sign |
185 | ø | o/ | Latin Small Letter O With Stroke |
186 | ù | u` | Latin Small Letter U With Grave |
187 | ú | u' | Latin Small Letter U With Acute |
188 | û | u^ | Latin Small Letter U With Circumflex |
189 | ü | u: | Latin Small Letter U With Diaeresis |
190 | ý | y' | Latin Small Letter Y With Acute |
191 | þ | ip | Latin Small Letter Thorn |
192 | ÿ | y: | Latin Small Letter Y With Diaeresis |
Notes:
- Characters 0-96 are printable ASCII. This includes the three whitespace characters HT (horizontal tab or "tab"), LF (line feed or "newline"), and SPA (space). For all the non-whitespace printable ASCII characters, I have simply appended an underscore. This makes them two ASCII characters long via a single, easily-remembered rule. They are also reasonably distinguishable from the Latin-1 characters.
- Characters 97-192 are printable Latin-1. Characters 0080-009f from the Latin-1 Supplement are non-printing control characters and are not included. Characters 00a0 (no-break space) and 00ad (soft hyphen) are included.
- Character HT is not two ASCII HT characters, because this would distort the layout of the text when it is viewed in its raw ASCII form. Similarly, Character LF is not two ASCII LF characters. I chose the - sign to be in front of the HT character in order to indicate continuity (visually, a Minus Sign is a smooth line from left to right) and the + sign to be in front of the LF character in order to indicate an addition (i.e. the new line). The - and + signs are also sufficiently distinct from letters that they should be clearly distinguishable from the text when the raw ASCII is viewed.
- Character SPA is two ASCII SPA characters, as this does not distort the layout of the text, but instead preserves it, when the text is viewed in its raw ASCII form.
- In the Latin-1 encoding, I have used the following approaches to choose two-ASCII-character sequences:
-- Where possible, use a shape-based analogy or combination. E.g. "c/" for a Cent Sign. If the two ASCII characters are overlaid, this produces a rough approximation of the Cent Sign. Similarly, when the two characters in "L-" are overlaid, this produces a rough approximation of the Pound Sign.
-- Sometimes, the shape analogy or combination is a little more abstract. Two vertical bars "||" overlaid does not produce a broken bar symbol, but the two separate bar pieces next to each other are, in a sense, a "broken bar".
-- Ideally, the second letter should be a modifier of the first.
-- ^ in the first position indicates "above, up, superscript". ^ in the second position indicates a circumflex that should be placed on top of the letter in the first position.
-- "," in the second position indicates a Cedilla modifier. If an underscore (i.e. a blank) is in the first position, this is the Cedilla by itself.
-- "i" in the second position indicates "inverted". E.g. in "?i", this indicates that the question mark should be turned upside down.
-- {`} in the second position indicates a grave accent modifier for the letter in the first position. Similarly, {'} is an acute accent modifier, {~} is a tilde modifier, and ":" is a diaeresis modifier.
-- When I couldn't see a good shape-based choice, I used a pair of letters based on the English name for the character. E.g. "co" for "Copyright Sign", "se" for "Section Sign".
-- "m." is a combination shape and word meaning. "m" stands for "middle" (much like ^ in first position stands for "above") and "." is a dot. The "m" indicates that the "." should be moved to a middle height.
-- The Currency Sign looks like a circle with small protrusions at the corners of the glyph rectangle. The sequence "ox" ("circle modified by a cross") was the best shape combination I could come up with.
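The accent-modifier rules in the notes above are regular enough that the accented letters could be derived rather than listed by hand. Here is a Python sketch that does this with the standard unicodedata module. It is an alternative construction that I am assuming for illustration, not how Table 1 was actually built, and the names MODIFIERS and compose are hypothetical. It does not handle the standalone forms that begin with an underscore (e.g. the Cedilla by itself) or the name-based pairs.

```python
import unicodedata

# Second-position modifier -> Unicode combining mark.
MODIFIERS = {
    "`": "\u0300",  # combining grave accent
    "'": "\u0301",  # combining acute accent
    "^": "\u0302",  # combining circumflex accent
    "~": "\u0303",  # combining tilde
    ":": "\u0308",  # combining diaeresis
    ",": "\u0327",  # combining cedilla
}


def compose(pair: str) -> str:
    """Turn a base letter plus modifier into one Latin-1 character."""
    if len(pair) != 2 or pair[1] not in MODIFIERS:
        raise ValueError(f"not a letter-plus-modifier pair: {pair!r}")
    base, modifier = pair
    composed = unicodedata.normalize("NFC", base + MODIFIERS[modifier])
    # Reject results that did not compose, or that fall outside Latin-1.
    if len(composed) != 1 or ord(composed) > 0xFF:
        raise ValueError(f"no Latin-1 character for {pair!r}")
    return composed


print(compose("e'"))  # é
print(compose("C,"))  # Ç
print(compose("n~"))  # ñ
```

A pair such as "Y:" is rejected, because the capital Y with a diaeresis lies outside the Latin-1 Supplement.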
[start of notes]
Sources of inspiration for my work on EML:
- My desire to be able to write and edit French and Greek text using my English-based keyboard and computer system.
- My desire to be able to digitally sign and timestamp all my work. The work must therefore be legible (else how do you know what you are signing?) and editable (because you'll want to improve it as much as possible before timestamping it and making it forever unalterable). Additionally, readability should be emphasised over convenience. Data should, as much as possible, be both human-readable (for verification) and machine-readable (for search). Unicode in its most common form (UTF-8) is often illegible for humans (when displayed in many text processing programs) and slow for machines to search.
- Stanislav Datskovskiy's article "No Formats, no Format Wars." Link:
www.loper-os.org/?p=309
-- This showed me that I could simply design a new data format and embed items in this format into ASCII data. The format would have to have opening and closing tags that would allow a machine or human to read the data, arrive at the tag, and determine whether or not the format was known. If known, use that format to interpret data until the closing tag is encountered. If unknown, stop and try to locate the relevant format definition.
- Mircea Popescu's comments on alphabets vs hieroglyphics in the #trilema chat channel log. Link:
btcbase.org/log-search?q=alphabet+from%3Amircea
-- These made me understand the importance of a fixed set of symbols (an alphabet) in which all text can be written. New symbols can be invented and used, but only as shorthand. The names and definitions of new symbols must be written or writable in the original alphabet. It must be possible, if necessary, to describe anything using only the original alphabet. Any relaxation of this rule will eventually produce a hieroglyphic system that is increasingly expensive to use.
Another source:
www.unicode.org/charts/PDF/U0080.pdf
Title: C1 Controls and Latin-1 Supplement
Author: Unknown.
This source supplied me with the Latin-1 Supplement glyphs, their order, and their names.
I consulted my own copy in my archives. I downloaded a copy from the link and used diff to compare the two files. diff reported that the files differ but did not give any detail (they are both binary files). I'm reasonably sure that the glyphs, their order, and their names are still the same, but I don't know for certain.
[end of notes]
[start of footnotes]
[0]
Note: Originally, ASCII was designed as a 7-bit code. Every character was written as a sequence of seven 0s and 1s. Today, an extra 0 is added onto the front of each character in order to make them all 8 bits long.
[end of footnotes]