edgecase
Author: StJohn Piano
Published: 2019-03-26
Datafeed Article 92
This article has been digitally signed by Edgecase Datafeed.
This article has been digitally signed by its author.
3604 words - 571 lines - 15 pages





The primary advantage of printable ASCII is that it is a fixed alphabet of limited length. Most keyboards have dedicated keys for each of the printable ASCII characters. They may also support other characters, but these will vary, depending on country and manufacturer.

Important properties of an alphabet:
- Fixed
- A relatively small and manageable set of characters
- Ordered
- Writable (with a pen and paper, or with a keyboard)
- Each glyph represents a single character.

The ASCII text encoding contains more characters, but these are non-printing control characters, which do not have individual visible glyphs.

Note: The horizontal tab (HT), the line feed (LF), and the space (SPA) characters are whitespace characters. They are non-printing characters (strictly, HT and LF are control characters and the space is not), but they are used to affect the visible layout of the printable text. Most keyboards have dedicated keys for them (usually Tab, Enter, and Space Bar, respectively). I have therefore included them in the "printable ASCII" alphabet.

There are several problems with Unicode:
- It is not fixed. Every so often new characters are added.
- It is not a relatively small and manageable set of characters. Some reading indicates that it has 1,111,998 possible characters.
- It is not directly writable with a keyboard.
- Many glyphs look the same or are very similar, but represent different characters.
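
The figure of 1,111,998 and the look-alike problem can both be checked in a few lines. This is a minimal Python sketch. It assumes the figure is derived by taking every code point and subtracting the surrogates and the noncharacters; that breakdown is an assumption, not something stated in the list above. The three letters chosen are only one example of look-alikes.

```python
import unicodedata

# Unicode code space: 17 planes of 65,536 code points each.
code_points = 17 * 65536                 # 1,114,112
# Assumption: the figure excludes surrogates and noncharacters.
surrogates = 0xDFFF - 0xD800 + 1         # 2,048 code points reserved for UTF-16
noncharacters = 32 + 2 * 17              # U+FDD0..U+FDEF, plus the last two of each plane
possible_characters = code_points - surrogates - noncharacters
print(possible_characters)               # 1111998

# Three capital letters that look alike in most fonts but are different characters.
lookalikes = ["\u0041", "\u0391", "\u0410"]
names = [unicodedata.name(c) for c in lookalikes]
for c, name in zip(lookalikes, names):
    print(f"U+{ord(c):04X} {c} {name}")
```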

The primary problem with Unicode is that editing Unicode text is difficult, as soon as you stop using a dedicated subset of it. It is feasible to write French text using a dedicated French keyboard on a French-language computer system. It is a significant challenge to edit excerpts in e.g. French or Greek within an article written in English, using an English-language keyboard and computer system. In practice, this is usually done by a) writing special ASCII sequences using an ASCII keyboard and/or b) using a mouse and a dedicated software interface with buttons or lists.

My preference is simply to accept that keyboard-editable Unicode will consist of special ASCII sequences, and then design and use ASCII sequences that are relatively readable and editable. These sequences can later, in a further layer of software, be rendered as specialised glyphs, for those who wish to write French text directly using a French keyboard and computer system. If there are problems (and there are always problems), the text degrades gracefully into ASCII sequences.

If you think about it, this is how ASCII works anyway. The actual alphabet in a computer is [0, 1]. When you type 'A', the computer actually stores this text: 01000001

01000001 is a sequence written using the letters 0 and 1 that encodes the ASCII character 'A'.

So, using the same approach, I can declare that the ASCII sequence {e'}, written using the letters {e} and {'}, encodes the Unicode letter {é}.
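
The two layers can be put side by side. This is a minimal Python sketch; the function names to_bits and from_bits and the dictionary EXAMPLE_SEQUENCES are hypothetical, chosen only for illustration.

```python
def to_bits(ch: str) -> str:
    """Write a single ASCII character using the letters 0 and 1 (eight of them)."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")


def from_bits(bits: str) -> str:
    """Inverse of to_bits: read eight 0s and 1s back into a character."""
    return chr(int(bits, 2))


print(to_bits("A"))            # 01000001
print(from_bits("01000001"))   # A

# The same idea one layer up: a sequence of ASCII letters encodes one character.
# Hypothetical one-entry mapping, taken from the example in the text.
EXAMPLE_SEQUENCES = {"e'": "\u00e9"}   # e' encodes é
print(EXAMPLE_SEQUENCES["e'"])
```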

Important properties for an encoding:
- All characters in the encoding must have the same length (the same number of characters in the underlying alphabet). All ASCII characters are written using eight 0s or 1s. [0] If the length is not fixed, code that searches the text will be harder to write and slower to run.
- Ordered
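
To illustrate why a fixed length helps with search, here is a minimal Python sketch. The function names and the width parameter are hypothetical. With a fixed width, character number i always starts at offset i * width, so a search only needs to look at aligned positions. The last lines show, for contrast, that UTF-8 uses a varying number of bytes per character.

```python
def find_character(encoded: str, target: str, width: int) -> int:
    """Return the position (counted in characters) of the first occurrence
    of target, or -1. Only aligned offsets are examined."""
    if len(target) != width or len(encoded) % width != 0:
        raise ValueError("input is not a whole number of fixed-width sequences")
    for offset in range(0, len(encoded), width):
        if encoded[offset:offset + width] == target:
            return offset // width
    return -1


def character_at(encoded: str, index: int, width: int) -> str:
    """Character number index always starts at offset index * width."""
    return encoded[index * width:(index + 1) * width]


# "CAB" written in the underlying alphabet [0, 1], eight letters per character.
encoded = "".join(format(ord(c), "08b") for c in "CAB")
print(find_character(encoded, "01000001", 8))   # 1
print(character_at(encoded, 2, 8))              # 01000010

# Contrast: the number of bytes per character in UTF-8 is not fixed.
utf8_lengths = [len(c.encode("utf-8")) for c in "e\u00e9\u20ac"]
print(utf8_lengths)                             # [1, 2, 3]
```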

Note: A new version of an encoding may change the characters and their order.

In the Unicode standard, the first Unicode block is identical to ASCII.

The next Unicode block is the Latin-1 Supplement. It includes the more common characters with diacritics that are used in various European languages. Some reading indicates that ASCII and Latin-1 together can handle the following languages: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish. Note: Some rare characters (e.g. the capital Y with an umlaut in French) are not handled.

I have developed an encoding for the Latin-1 Supplement (see the table below). Every character is written as a two-ASCII-character sequence. I have tried to choose ASCII character sequences that resemble the Latin-1 character or are related to it in some way. This will hopefully make the encoding relatively readable and editable in its raw ASCII form.

Today, Unicode is usually (although not always) successfully rendered by various programs e.g. web browsers. So, my preference is to write and store European text in my encoding, but to then render it into Unicode when it is requested by a web browser (note: the rendering can be cached).

Within ASCII text, I enclose my encoded Latin-1 data within opening and closing tags, like so:
<latin-1>e'</latin-1>

This is a markup language approach, so I have called my encoding Edgecase Markup Language (EML).

Many words will include both ASCII characters and Latin-1 characters. I don't want to have to write the tags around every single letter. So, I have included all the printable ASCII characters within my Latin-1 encoding, mostly by adding an underscore to make each one two-characters-long but still readable (and distinguishable from Latin-1 characters).

Example:
<latin-1>h_e'l_l_o_</latin-1> will be rendered as: héllo
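
The rendering step in this example can be sketched in Python. This is a minimal sketch under stated assumptions, not a reference implementation: LATIN_1_EXCERPT holds only a handful of rows from Table 1, and the names decode_pair, decode_latin_1 and render are hypothetical.

```python
import re

# A small excerpt of Table 1. A full implementation would list every row.
LATIN_1_EXCERPT = {
    "e'": "\u00e9",   # é
    "e`": "\u00e8",   # è
    "a`": "\u00e0",   # à
    "c,": "\u00e7",   # ç
    "  ": " ",        # SPA is written as two SPA characters
    "-\t": "\t",      # HT is written as - followed by HT
    "+\n": "\n",      # LF is written as + followed by LF
}


def decode_pair(pair: str) -> str:
    """Decode one two-ASCII-character sequence."""
    if pair in LATIN_1_EXCERPT:
        return LATIN_1_EXCERPT[pair]
    # Non-whitespace printable ASCII is the character followed by an underscore.
    if len(pair) == 2 and pair[1] == "_" and "!" <= pair[0] <= "~":
        return pair[0]
    raise ValueError(f"unknown sequence: {pair!r}")


def decode_latin_1(data: str) -> str:
    """Decode the data found between the opening and closing tags."""
    if len(data) % 2 != 0:
        raise ValueError("encoded data must have an even length")
    return "".join(decode_pair(data[i:i + 2]) for i in range(0, len(data), 2))


TAG = re.compile(r"<latin-1>(.*?)</latin-1>", re.DOTALL)


def render(text: str) -> str:
    """Replace every tagged region with its rendered form; leave the rest as-is."""
    return TAG.sub(lambda m: decode_latin_1(m.group(1)), text)


print(render("<latin-1>h_e'l_l_o_</latin-1>"))
```

A sequence that is not in the excerpt raises an error rather than being guessed at, so the raw ASCII is never silently altered.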





Table 1: Edgecase Markup Language (EML): Latin-1
Note: Each character is represented by two ASCII characters.
Index  Character  ASCII sequence      Name
0      HT         - followed by HT    Horizontal Tab
1      LF         + followed by LF    Line Feed
2      SPA        Two SPA characters  Space
3      !          !_                  Exclamation Mark
4      "          "_                  Quotation Mark
5      #          #_                  Number Sign
6      $          $_                  Dollar Sign
7      %          %_                  Percent Sign
8      &          &_                  Ampersand
9      '          '_                  Apostrophe
10     (          (_                  Left Parenthesis
11     )          )_                  Right Parenthesis
12     *          *_                  Asterisk
13     +          +_                  Plus Sign
14     ,          ,_                  Comma
15     -          -_                  Hyphen-Minus (also known as Hyphen or Minus Sign)
16     .          ._                  Full Stop (also known as Period or Dot)
17     /          /_                  Solidus (also known as Slash)
18     0          0_                  Digit Zero
19     1          1_                  Digit One
20     2          2_                  Digit Two
21     3          3_                  Digit Three
22     4          4_                  Digit Four
23     5          5_                  Digit Five
24     6          6_                  Digit Six
25     7          7_                  Digit Seven
26     8          8_                  Digit Eight
27     9          9_                  Digit Nine
28     :          :_                  Colon
29     ;          ;_                  Semicolon
30     <          <_                  Less-Than Sign
31     =          =_                  Equals Sign
32     >          >_                  Greater-Than Sign
33     ?          ?_                  Question Mark
34     @          @_                  Commercial At (also known as At Sign)
35     A          A_                  Latin Capital Letter A
36     B          B_                  Latin Capital Letter B
37     C          C_                  Latin Capital Letter C
38     D          D_                  Latin Capital Letter D
39     E          E_                  Latin Capital Letter E
40     F          F_                  Latin Capital Letter F
41     G          G_                  Latin Capital Letter G
42     H          H_                  Latin Capital Letter H
43     I          I_                  Latin Capital Letter I
44     J          J_                  Latin Capital Letter J
45     K          K_                  Latin Capital Letter K
46     L          L_                  Latin Capital Letter L
47     M          M_                  Latin Capital Letter M
48     N          N_                  Latin Capital Letter N
49     O          O_                  Latin Capital Letter O
50     P          P_                  Latin Capital Letter P
51     Q          Q_                  Latin Capital Letter Q
52     R          R_                  Latin Capital Letter R
53     S          S_                  Latin Capital Letter S
54     T          T_                  Latin Capital Letter T
55     U          U_                  Latin Capital Letter U
56     V          V_                  Latin Capital Letter V
57     W          W_                  Latin Capital Letter W
58     X          X_                  Latin Capital Letter X
59     Y          Y_                  Latin Capital Letter Y
60     Z          Z_                  Latin Capital Letter Z
61     [          [_                  Left Square Bracket
62     \          \_                  Reverse Solidus (also known as Backslash)
63     ]          ]_                  Right Square Bracket
64     ^          ^_                  Circumflex Accent
65     _          __                  Low Line (also known as Underscore)
66     `          `_                  Grave Accent (also known as Backtick)
67     a          a_                  Latin Small Letter A
68     b          b_                  Latin Small Letter B
69     c          c_                  Latin Small Letter C
70     d          d_                  Latin Small Letter D
71     e          e_                  Latin Small Letter E
72     f          f_                  Latin Small Letter F
73     g          g_                  Latin Small Letter G
74     h          h_                  Latin Small Letter H
75     i          i_                  Latin Small Letter I
76     j          j_                  Latin Small Letter J
77     k          k_                  Latin Small Letter K
78     l          l_                  Latin Small Letter L
79     m          m_                  Latin Small Letter M
80     n          n_                  Latin Small Letter N
81     o          o_                  Latin Small Letter O
82     p          p_                  Latin Small Letter P
83     q          q_                  Latin Small Letter Q
84     r          r_                  Latin Small Letter R
85     s          s_                  Latin Small Letter S
86     t          t_                  Latin Small Letter T
87     u          u_                  Latin Small Letter U
88     v          v_                  Latin Small Letter V
89     w          w_                  Latin Small Letter W
90     x          x_                  Latin Small Letter X
91     y          y_                  Latin Small Letter Y
92     z          z_                  Latin Small Letter Z
93     {          {_                  Left Curly Bracket (also known as Left Brace)
94     |          |_                  Vertical Line (also known as Vertical Bar)
95     }          }_                  Right Curly Bracket (also known as Right Brace)
96     ~          ~_                  Tilde
97                nb                  No-Break Space
98     ¡          !i                  Inverted Exclamation Mark
99     ¢          c/                  Cent Sign
100    £          L-                  Pound Sign
101    ¤          ox                  Currency Sign
102    ¥          Y=                  Yen Sign
103    ¦          ||                  Broken Bar
104    §          se                  Section Sign
105    ¨          _:                  Diaeresis
106    ©          co                  Copyright Sign
107    ª          ^a                  Feminine Ordinal Indicator
108    «          <<                  Left-Pointing Double Angle Quotation Mark
109    ¬          -|                  Not Sign
110    ­          so                  Soft Hyphen
111    ®          re                  Registered Sign
112    ¯          ^-                  Macron
113    °          ^O                  Degree Sign
114    ±          +-                  Plus-Minus Sign
115    ²          ^2                  Superscript Two
116    ³          ^3                  Superscript Three
117    ´          _'                  Acute Accent
118    µ          mi                  Micro Sign
119    ¶          pi                  Pilcrow Sign
120    ·          m.                  Middle Dot
121    ¸          _,                  Cedilla
122    ¹          ^1                  Superscript One
123    º          ^o                  Masculine Ordinal Indicator
124    »          >>                  Right-Pointing Double Angle Quotation Mark
125    ¼          14                  Vulgar Fraction One Quarter
126    ½          12                  Vulgar Fraction One Half
127    ¾          34                  Vulgar Fraction Three Quarters
128    ¿          ?i                  Inverted Question Mark
129    À          A`                  Latin Capital Letter A With Grave
130    Á          A'                  Latin Capital Letter A With Acute
131    Â          A^                  Latin Capital Letter A With Circumflex
132    Ã          A~                  Latin Capital Letter A With Tilde
133    Ä          A:                  Latin Capital Letter A With Diaeresis
134    Å          Ao                  Latin Capital Letter A With Ring Above
135    Æ          AE                  Latin Capital Letter Ae
136    Ç          C,                  Latin Capital Letter C With Cedilla
139    È          E`                  Latin Capital Letter E With Grave
140    É          E'                  Latin Capital Letter E With Acute
141    Ê          E^                  Latin Capital Letter E With Circumflex
142    Ë          E:                  Latin Capital Letter E With Diaeresis
143    Ì          I`                  Latin Capital Letter I With Grave
144    Í          I'                  Latin Capital Letter I With Acute
145    Î          I^                  Latin Capital Letter I With Circumflex
146    Ï          I:                  Latin Capital Letter I With Diaeresis
147    Ð          -D                  Latin Capital Letter Eth
148    Ñ          N~                  Latin Capital Letter N With Tilde
149    Ò          O`                  Latin Capital Letter O With Grave
150    Ó          O'                  Latin Capital Letter O With Acute
151    Ô          O^                  Latin Capital Letter O With Circumflex
152    Õ          O~                  Latin Capital Letter O With Tilde
153    Ö          O:                  Latin Capital Letter O With Diaeresis
154    ×          xx                  Multiplication Sign
155    Ø          O/                  Latin Capital Letter O With Stroke
156    Ù          U`                  Latin Capital Letter U With Grave
157    Ú          U'                  Latin Capital Letter U With Acute
158    Û          U^                  Latin Capital Letter U With Circumflex
159    Ü          U:                  Latin Capital Letter U With Diaeresis
160    Ý          Y'                  Latin Capital Letter Y With Acute
161    Þ          Ip                  Latin Capital Letter Thorn
162    ß          B.                  Latin Small Letter Sharp S
163    à          a`                  Latin Small Letter A With Grave
164    á          a'                  Latin Small Letter A With Acute
165    â          a^                  Latin Small Letter A With Circumflex
166    ã          a~                  Latin Small Letter A With Tilde
167    ä          a:                  Latin Small Letter A With Diaeresis
168    å          ao                  Latin Small Letter A With Ring Above
169    æ          ae                  Latin Small Letter Ae
170    ç          c,                  Latin Small Letter C With Cedilla
171    è          e`                  Latin Small Letter E With Grave
172    é          e'                  Latin Small Letter E With Acute
173    ê          e^                  Latin Small Letter E With Circumflex
174    ë          e:                  Latin Small Letter E With Diaeresis
175    ì          i`                  Latin Small Letter I With Grave
176    í          i'                  Latin Small Letter I With Acute
177    î          i^                  Latin Small Letter I With Circumflex
178    ï          i:                  Latin Small Letter I With Diaeresis
179    ð          -d                  Latin Small Letter Eth
180    ñ          n~                  Latin Small Letter N With Tilde
181    ò          o`                  Latin Small Letter O With Grave
182    ó          o'                  Latin Small Letter O With Acute
183    ô          o^                  Latin Small Letter O With Circumflex
184    õ          o~                  Latin Small Letter O With Tilde
185    ö          o:                  Latin Small Letter O With Diaeresis
186    ÷          -:                  Division Sign
187    ø          o/                  Latin Small Letter O With Stroke
188    ù          u`                  Latin Small Letter U With Grave
189    ú          u'                  Latin Small Letter U With Acute
190    û          u^                  Latin Small Letter U With Circumflex
191    ü          u:                  Latin Small Letter U With Diaeresis
192    ý          y'                  Latin Small Letter Y With Acute
193    þ          ip                  Latin Small Letter Thorn
194    ÿ          y:                  Latin Small Letter Y With Diaeresis





Notes:
- Characters 0-96 are printable ASCII. This includes the three whitespace characters HT (horizontal tab or "tab"), LF (line feed or "newline"), and SPA (space). For all the non-whitespace printable ASCII characters, I have simply appended an underscore. This makes them two ASCII characters long via a single, easily-remembered rule. They are also reasonably distinguishable from the Latin-1 characters.
- Characters 97-194 are printable Latin-1. Characters 0080-009f from the Latin-1 Supplement are non-printing control characters and are not included. Characters 00a0 (no-break space) and 00ad (soft hyphen) are included.
- Character HT is not two ASCII HT characters, because this would distort the layout of the text when it is viewed in its raw ASCII form. Similarly, Character LF is not two ASCII LF characters. I chose the - sign to be in front of the HT character in order to indicate continuity (visually, a Minus Sign is a smooth line from left to right) and the + sign to be in front of the LF character in order to indicate an addition (i.e. the new line). The - and + signs are also sufficiently distinct from letters that they should be clearly distinguishable from the text when the raw ASCII is viewed.
- Character SPA is two ASCII SPA characters, as this does not distort the layout of the text, but instead preserves it, when the text is viewed in its raw ASCII form.
- In the Latin-1 encoding, I have used the following approaches to choose two-ASCII-character sequences:
-- Where possible, use a shape-based analogy or combination. E.g. "c/" for a Cent Sign. If the two ASCII characters are overlaid, this produces a rough approximation of the Cent Sign. Similarly, when the two characters in "L-" are overlaid, this produces a rough approximation of the Pound Sign.
-- Sometimes, the shape analogy or combination is a little more abstract. Two vertical bars "||" overlaid does not produce a broken bar symbol, but the two separate bar pieces next to each other are, in a sense, a "broken bar".
-- Ideally, the second letter should be a modifier of the first.
-- ^ in the first position indicates "above, up, superscript". ^ in the second position indicates a circumflex that should be placed on top of the letter in the first position.
-- "," in the second position indicates a Cedilla modifier. If an underscore (i.e. a blank) is in the first position, this is the Cedilla by itself.
-- "i" in the second position indicates "inverted". E.g. in "?i", this indicates that the question mark should be turned upside down.
-- {`} in the second position indicates a grave accent modifier for the letter in the first position. Similarly, {'} is an acute accent modifier, {~} is a tilde modifier, and ":" is a diaeresis modifier.
-- When I couldn't see a good shape-based choice, I used a pair of letters based on the English name for the character. E.g. "co" for "Copyright Sign", "se" for "Section Sign".
-- "m." is a combination shape and word meaning. "m" stands for "middle" (much like ^ in first position stands for "above") and "." is a dot. The "m" indicates that the "." should be moved to a middle height.
-- The Currency Sign looks like a circle with small protrusions at the corners of the glyph rectangle. The sequence "ox" ("circle modified by a cross") was the best shape combination I could come up with.
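
The modifier rules in these notes are regular enough to apply mechanically. This is a minimal Python sketch of an encoder for the accented letters only, under stated assumptions. It relies on Unicode canonical decomposition (NFD) to split a letter such as é into a base letter and a combining mark. This is one possible technique, not necessarily how EML is implemented. The names MODIFIERS, WHITESPACE, encode_char and encode are hypothetical. Characters without a decomposition (e.g. the Cent Sign or the letter Eth) would need the full table and are rejected here.

```python
import unicodedata

# Combining mark -> ASCII modifier, following the rules in the notes.
MODIFIERS = {
    "\u0300": "`",   # grave
    "\u0301": "'",   # acute
    "\u0302": "^",   # circumflex
    "\u0303": "~",   # tilde
    "\u0308": ":",   # diaeresis
    "\u0327": ",",   # cedilla
    "\u030a": "o",   # ring above (Table 1 writes this as e.g. "ao")
}
WHITESPACE = {"\t": "-\t", "\n": "+\n", " ": "  "}


def encode_char(ch: str) -> str:
    """Encode one character as a two-ASCII-character sequence."""
    if ch in WHITESPACE:
        return WHITESPACE[ch]
    if "!" <= ch <= "~":
        return ch + "_"                      # printable ASCII: append an underscore
    if "\u00c0" <= ch <= "\u00ff":           # accented letters of the Latin-1 Supplement
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in MODIFIERS:
            return decomposed[0] + MODIFIERS[decomposed[1]]
    raise ValueError(f"not covered by this sketch: {ch!r}")


def encode(text: str) -> str:
    """Encode text that is already in composed (NFC) form."""
    return "".join(encode_char(ch) for ch in text)


print(encode("h\u00e9llo"))   # h_e'l_l_o_
```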














[start of notes]



Sources of inspiration for my work on EML:
- My desire to be able to write and edit French and Greek text using my English-based keyboard and computer system.
- My desire to be able to digitally sign and timestamp all my work. The work must therefore be legible (else how do you know what you are signing?) and editable (because you'll want to improve it as much as possible before timestamping it and making it forever unalterable). Additionally, readability should be emphasised over convenience. Data should, as much as possible, be both human-readable (for verification) and machine-readable (for search). Unicode in its most common form (UTF-8) is often illegible for humans (when displayed in many text processing programs) and slow for machines to search.
- Stanislav Datskovskiy's article "No Formats, no Format Wars.". Link:
www.loper-os.org/?p=309
-- This showed me that I could simply design a new data format and embed items in this format into ASCII data. The format would have to have opening and closing tags that would allow a machine or human to read the data, arrive at the tag, and determine whether or not the format was known. If known, use that format to interpret data until the closing tag is encountered. If unknown, stop and try to locate the relevant format definition.
- Mircea Popescu's comments on alphabets vs hieroglyphics in the #trilema chat channel log. Link:
btcbase.org/log-search?q=alphabet+from%3Amircea
-- These made me understand the importance of a fixed set of symbols (an alphabet) in which all text can be written. New symbols can be invented and used, but only as shorthand. The names and definitions of new symbols must be written or writable in the original alphabet. It must be possible, if necessary, to describe anything using only the original alphabet. Any relaxation of this rule will eventually produce a hieroglyphic system that is increasingly expensive to use.
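
The reading procedure described above for embedded formats (read the data, arrive at a tag, check whether the format is known, then either interpret up to the closing tag or stop) can be sketched as follows. This is a minimal Python sketch under stated assumptions. The names KNOWN_FORMATS, UnknownFormatError and read are hypothetical. The decoder covers only e' and the underscore rule. Any text of the form <name> is assumed to be an opening tag.

```python
import re


class UnknownFormatError(Exception):
    """Raised when a tag names a format that has no known definition."""


def decode_latin_1_excerpt(data: str) -> str:
    """Placeholder decoder: handles only e' and the underscore rule."""
    out = []
    for i in range(0, len(data), 2):
        pair = data[i:i + 2]
        if pair == "e'":
            out.append("\u00e9")
        elif len(pair) == 2 and pair[1] == "_":
            out.append(pair[0])
        else:
            raise ValueError(f"unknown sequence: {pair!r}")
    return "".join(out)


# Registry of format definitions that this reader knows about.
KNOWN_FORMATS = {"latin-1": decode_latin_1_excerpt}
OPEN_TAG = re.compile(r"<([a-z0-9-]+)>")


def read(text: str) -> str:
    out = []
    pos = 0
    while True:
        match = OPEN_TAG.search(text, pos)
        if match is None:                      # no more tags: the rest is plain ASCII
            out.append(text[pos:])
            return "".join(out)
        out.append(text[pos:match.start()])
        name = match.group(1)
        if name not in KNOWN_FORMATS:          # unknown format: stop here
            raise UnknownFormatError(name)
        closing = f"</{name}>"
        end = text.find(closing, match.end())
        if end == -1:
            raise ValueError(f"missing closing tag for {name}")
        out.append(KNOWN_FORMATS[name](text[match.end():end]))
        pos = end + len(closing)


print(read("say <latin-1>h_e'l_l_o_</latin-1>!"))
```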


Another source:
www.unicode.org/charts/PDF/U0080.pdf
Title: C1 Controls and Latin-1 Supplement
Author: Unknown.
This source supplied me with the Latin-1 Supplement glyphs, their order, and their names.
I consulted my own copy in my archives. I downloaded a copy from the link and used diff to compare the two files. diff reported that the files differ but did not give any detail (they are both binary files). I'm reasonably sure that the glyphs, their order, and their names are still the same, but I don't know for certain.


[end of notes]











[start of footnotes]


[0]
Note: Originally, ASCII was designed as a 7-bit code. Every character was written as a sequence of seven 0s and 1s. Today, an extra 0 is added onto the front of each character in order to make them all 8 bits long.


[end of footnotes]