Edgecase | Edgecase Markup Language (EML): Latin-1

The primary advantage of printable ASCII is that it is a fixed alphabet of limited length. Most keyboards have dedicated keys for each of the printable ASCII characters. They may also support other characters, but these will vary, depending on country and manufacturer.

Important properties of an alphabet:
- Fixed
- A relatively small and manageable set of characters
- Ordered
- Writable (with a pen and paper, or with a keyboard)
- Each glyph represents a single character.

The ASCII text encoding contains more characters, but these are non-printing control characters, which do not have individual visible glyphs.

Note: The horizontal tab (HT), the line feed (LF), and the space (SPA) characters are whitespace characters. They have no visible glyph of their own (HT and LF are control characters; SPA is a blank), but they are used to affect the visible layout of the printable text. Most keyboards have dedicated keys for them (usually Tab, Enter, and Space Bar, respectively). I have therefore included them in the "printable ASCII" alphabet.

There are several problems with Unicode:
- It is not fixed. Every so often new characters are added.
- It is not a relatively small and manageable set of characters. Some reading indicates that it has 1,111,998 possible characters.
- It is not directly writable with a keyboard.
- Many glyphs look the same or are very similar, but represent different characters.
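
To illustrate the last point, here is a small Python sketch. The three characters are my own choice of example (they are not taken from any list in this article); most fonts draw them almost identically.

```python
import unicodedata

# Three capital letters that usually look the same on screen:
# Latin A, Greek Alpha, Cyrillic A (illustrative choice).
lookalikes = ["\u0041", "\u0391", "\u0410"]

for ch in lookalikes:
    # Print the code point and the official Unicode name of each one.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Despite the shared appearance, these are three distinct characters.
distinct = len(set(lookalikes))
```

A search for the Latin letter will not find the other two, even though a reader cannot tell them apart.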

The primary problem with Unicode is that editing Unicode text becomes difficult as soon as you stop using a dedicated subset of it. It is feasible to write French text using a dedicated French keyboard on a French-language computer system. It is a significant challenge to edit excerpts in, e.g., French or Greek within an article written in English, using an English-language keyboard and computer system. In practice, this is usually done by a) writing special ASCII sequences using an ASCII keyboard and/or b) using a mouse and a dedicated software interface with buttons or lists.

My preference is simply to accept that keyboard-editable Unicode will consist of special ASCII sequences, and then design and use ASCII sequences that are relatively readable and editable. These sequences can later, in a further layer of software, be rendered as specialised glyphs, for those who wish to write French text directly using a French keyboard and computer system. If there are problems (and there are always problems), the text degrades gracefully into ASCII sequences.

If you think about it, this is how ASCII works anyway. The actual alphabet in a computer is [0, 1]. When you type 'A', the computer actually stores this text: 01000001

01000001 is a sequence written using the letters 0 and 1 that encodes the ASCII character 'A'.
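
This can be checked with a few lines of Python. The function names to_bits and from_bits are hypothetical helpers written for this illustration only.

```python
def to_bits(ch: str) -> str:
    """Return the 8-bit string of 0s and 1s for one ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    # "08b" means: binary digits, padded with leading zeros to width 8.
    return format(code, "08b")


def from_bits(bits: str) -> str:
    """Return the character encoded by a string of 0s and 1s."""
    return chr(int(bits, 2))


print(to_bits("A"))  # prints 01000001
```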

So, using the same approach, I can declare that the ASCII sequence {e'}, written using the letters {e} and {'}, encodes the Unicode letter {é}.

Important properties for an encoding:
- All characters in the encoding must have the same length (the same number of characters in the underlying alphabet). All ASCII characters are written using eight 0s or 1s. [0] If the length is not fixed, code that searches the text will be harder to write and slower to run.
- Ordered
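
A minimal sketch of why the fixed length matters, assuming 8-bit ASCII written out as a string of 0s and 1s. The helper names unit_at and aligned_find are hypothetical. With a fixed width, the position of the n-th character is simple arithmetic, and a search only has to test positions that are multiples of the width.

```python
WIDTH = 8  # assumed unit width: eight 0s and 1s per ASCII character


def unit_at(encoded: str, n: int, width: int = WIDTH) -> str:
    """Return the n-th encoded character (0-based) by arithmetic alone."""
    start = n * width
    if n < 0 or start + width > len(encoded):
        raise IndexError("character index out of range")
    return encoded[start:start + width]


def aligned_find(encoded: str, unit: str, width: int = WIDTH) -> int:
    """Return the character index of the first aligned match, or -1."""
    for start in range(0, len(encoded) - width + 1, width):
        if encoded[start:start + width] == unit:
            return start // width
    return -1


bits = "010010000110100100100001"  # "Hi!" as three 8-bit ASCII characters
second = chr(int(unit_at(bits, 1), 2))  # the character at index 1
```

A plain substring search on the same data can match across a character boundary: bits.find("10010000") reports a hit at offset 1, although no character in the text has that code. The aligned search avoids this.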

Note: A new version of an encoding may change the characters and their order.

In the Unicode standard, the first Unicode block is identical to ASCII.

The next Unicode block is the Latin-1 Supplement. It includes the more common characters with diacritics that are used in various European languages. Some reading indicates that ASCII and Latin-1 together can handle the following languages: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish. Note: Some rare characters (e.g. the capital Y with a diaeresis, Ÿ, in French) are not handled.
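
A rough way to test whether a given text stays inside this repertoire is sketched below. The function names are hypothetical, and the ranges are my reading of the alphabet described in this article: HT, LF, printable ASCII (0020 to 007e), and printable Latin-1 (00a0 to 00ff).

```python
def in_repertoire(ch: str) -> bool:
    """True if ch is HT, LF, printable ASCII, or printable Latin-1."""
    cp = ord(ch)
    return ch in "\t\n" or 0x20 <= cp <= 0x7E or 0xA0 <= cp <= 0xFF


def unsupported(text: str) -> list[str]:
    """Return the characters of text that fall outside the repertoire."""
    return [ch for ch in text if not in_repertoire(ch)]


print(unsupported("O\u00f9 est le caf\u00e9 ?"))  # prints []
print(unsupported("L'HA\u0178"))  # the capital Y with diaeresis is reported
```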

I have developed an encoding for the Latin-1 Supplement (see the table below). Every character is written as a two-ASCII-character sequence. I have tried to choose ASCII character sequences that resemble the Latin-1 character or are related to it in some way. This will hopefully make the encoding relatively readable and editable in its raw ASCII form.

Today, Unicode is usually (although not always) rendered successfully by various programs, e.g. web browsers. So, my preference is to write and store European text in my encoding, and then to render it into Unicode when it is requested by a web browser (note: the rendering can be cached).

Within ASCII text, I enclose my encoded Latin-1 data within opening and closing tags, like so:
<latin-1>e'</latin-1>

This is a markup language approach, so I have called my encoding Edgecase Markup Language (EML).
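
As a sketch of how software might locate the tagged data, the following Python uses a regular expression. The function name tagged_spans is hypothetical, and the pattern assumes that the closing tag never appears inside the encoded data itself.

```python
import re

# Non-greedy match of everything between one opening tag and the
# next closing tag. DOTALL lets the data span several lines.
TAG = re.compile(r"<latin-1>(.*?)</latin-1>", re.DOTALL)


def tagged_spans(text: str) -> list[str]:
    """Return the raw encoded data of every <latin-1> element in text."""
    return [match.group(1) for match in TAG.finditer(text)]


sample = "caf<latin-1>e'</latin-1> and na<latin-1>i:</latin-1>ve"
print(tagged_spans(sample))  # prints ["e'", 'i:']
```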

Many words will include both ASCII characters and Latin-1 characters. I don't want to have to write the tags around every single letter. So, I have included all the printable ASCII characters within my Latin-1 encoding, mostly by adding an underscore to make each one two characters long but still readable (and distinguishable from Latin-1 characters).

Example:
<latin-1>h_e'l_l_o_</latin-1> will be rendered as: héllo
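
A minimal decoder sketch for the data between the tags follows. It is an illustration under assumptions, not a reference implementation: the name decode_eml is hypothetical, and LATIN1_PAIRS holds only a handful of rows from Table 1, where a real decoder would hold all of them.

```python
# Partial lookup, for illustration only: a few Latin-1 rows from Table 1.
LATIN1_PAIRS = {
    "e'": "\u00e9",  # e with acute
    "e`": "\u00e8",  # e with grave
    "a`": "\u00e0",  # a with grave
    "c,": "\u00e7",  # c with cedilla
    "<<": "\u00ab",  # left-pointing double angle quotation mark
    ">>": "\u00bb",  # right-pointing double angle quotation mark
}

# The three whitespace characters have their own two-character forms.
WHITESPACE_PAIRS = {"-\t": "\t", "+\n": "\n", "  ": " "}


def decode_eml(data: str) -> str:
    """Decode two-character EML units into the characters they encode."""
    if len(data) % 2 != 0:
        raise ValueError("EML data must have an even length")
    out = []
    for i in range(0, len(data), 2):
        pair = data[i:i + 2]
        if pair in LATIN1_PAIRS:
            out.append(LATIN1_PAIRS[pair])
        elif pair in WHITESPACE_PAIRS:
            out.append(WHITESPACE_PAIRS[pair])
        elif pair[1] == "_" and "!" <= pair[0] <= "~":
            # Printable ASCII character followed by an underscore.
            out.append(pair[0])
        else:
            raise ValueError(f"unknown EML pair: {pair!r}")
    return "".join(out)


print(decode_eml("h_e'l_l_o_"))  # prints héllo
```

Unknown pairs raise an error here. A renderer that prefers graceful degradation could instead pass the raw pair through unchanged.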

Table 1: Edgecase Markup Language (EML): Latin-1
Note: Each character is represented by two ASCII characters.
Index	Character	ASCII sequence	Name
0	HT	- followed by HT	Horizontal Tab
1	LF	+ followed by LF	Line Feed
2	SPA	Two SPA characters	Space
3	!	!_	Exclamation Mark
4	"	"_	Quotation Mark
5	#	#_	Number Sign
6	$	$_	Dollar Sign
7	%	%_	Percent Sign
8	&	&_	Ampersand
9	'	'_	Apostrophe
10	(	(_	Left Parenthesis
11	)	)_	Right Parenthesis
12	*	*_	Asterisk
13	+	+_	Plus Sign
14	,	,_	Comma
15	-	-_	Hyphen-Minus (also known as Hyphen or Minus Sign)
16	.	._	Full Stop (also known as Period or Dot)
17	/	/_	Solidus (also known as Slash)
18	0	0_	Digit Zero
19	1	1_	Digit One
20	2	2_	Digit Two
21	3	3_	Digit Three
22	4	4_	Digit Four
23	5	5_	Digit Five
24	6	6_	Digit Six
25	7	7_	Digit Seven
26	8	8_	Digit Eight
27	9	9_	Digit Nine
28	:	:_	Colon
29	;	;_	Semicolon
30	<	<_	Less-Than Sign
31	=	=_	Equals Sign
32	>	>_	Greater-Than Sign
33	?	?_	Question Mark
34	@	@_	Commercial At (also known as At Sign)
35	A	A_	Latin Capital Letter A
36	B	B_	Latin Capital Letter B
37	C	C_	Latin Capital Letter C
38	D	D_	Latin Capital Letter D
39	E	E_	Latin Capital Letter E
40	F	F_	Latin Capital Letter F
41	G	G_	Latin Capital Letter G
42	H	H_	Latin Capital Letter H
43	I	I_	Latin Capital Letter I
44	J	J_	Latin Capital Letter J
45	K	K_	Latin Capital Letter K
46	L	L_	Latin Capital Letter L
47	M	M_	Latin Capital Letter M
48	N	N_	Latin Capital Letter N
49	O	O_	Latin Capital Letter O
50	P	P_	Latin Capital Letter P
51	Q	Q_	Latin Capital Letter Q
52	R	R_	Latin Capital Letter R
53	S	S_	Latin Capital Letter S
54	T	T_	Latin Capital Letter T
55	U	U_	Latin Capital Letter U
56	V	V_	Latin Capital Letter V
57	W	W_	Latin Capital Letter W
58	X	X_	Latin Capital Letter X
59	Y	Y_	Latin Capital Letter Y
60	Z	Z_	Latin Capital Letter Z
61	[	[_	Left Square Bracket
62	\	\_	Reverse Solidus (also known as Backslash)
63	]	]_	Right Square Bracket
64	^	^_	Circumflex Accent
65	_	__	Low Line (also known as Underscore)
66	`	`_	Grave Accent (also known as Backtick)
67	a	a_	Latin Small Letter A
68	b	b_	Latin Small Letter B
69	c	c_	Latin Small Letter C
70	d	d_	Latin Small Letter D
71	e	e_	Latin Small Letter E
72	f	f_	Latin Small Letter F
73	g	g_	Latin Small Letter G
74	h	h_	Latin Small Letter H
75	i	i_	Latin Small Letter I
76	j	j_	Latin Small Letter J
77	k	k_	Latin Small Letter K
78	l	l_	Latin Small Letter L
79	m	m_	Latin Small Letter M
80	n	n_	Latin Small Letter N
81	o	o_	Latin Small Letter O
82	p	p_	Latin Small Letter P
83	q	q_	Latin Small Letter Q
84	r	r_	Latin Small Letter R
85	s	s_	Latin Small Letter S
86	t	t_	Latin Small Letter T
87	u	u_	Latin Small Letter U
88	v	v_	Latin Small Letter V
89	w	w_	Latin Small Letter W
90	x	x_	Latin Small Letter X
91	y	y_	Latin Small Letter Y
92	z	z_	Latin Small Letter Z
93	{	{_	Left Curly Bracket (also known as Left Brace)
94	|	|_	Vertical Line (also known as Vertical Bar)
95	}	}_	Right Curly Bracket (also known as Right Brace)
96	~	~_	Tilde
97		nb	No-Break Space
98	¡	!i	Inverted Exclamation Mark
99	¢	c/	Cent Sign
100	£	L-	Pound Sign
101	¤	ox	Currency Sign
102	¥	Y=	Yen Sign
103	¦	||	Broken Bar
104	§	se	Section Sign
105	¨	_:	Diaeresis
106	©	co	Copyright Sign
107	ª	^a	Feminine Ordinal Indicator
108	«	<<	Left-Pointing Double Angle Quotation Mark
109	¬	-|	Not Sign
110		so	Soft Hyphen
111	®	re	Registered Sign
112	¯	^-	Macron
113	°	^O	Degree Sign
114	±	+-	Plus-Minus Sign
115	²	^2	Superscript Two
116	³	^3	Superscript Three
117	´	_'	Acute Accent
118	µ	mi	Micro Sign
119	¶	pi	Pilcrow Sign
120	·	m.	Middle Dot
121	¸	_,	Cedilla
122	¹	^1	Superscript One
123	º	^o	Masculine Ordinal Indicator
124	»	>>	Right-Pointing Double Angle Quotation Mark
125	¼	14	Vulgar Fraction One Quarter
126	½	12	Vulgar Fraction One Half
127	¾	34	Vulgar Fraction Three Quarters
128	¿	?i	Inverted Question Mark
129	À	A`	Latin Capital Letter A With Grave
130	Á	A'	Latin Capital Letter A With Acute
131	Â	A^	Latin Capital Letter A With Circumflex
132	Ã	A~	Latin Capital Letter A With Tilde
133	Ä	A:	Latin Capital Letter A With Diaeresis
134	Å	Ao	Latin Capital Letter A With Ring Above
135	Æ	AE	Latin Capital Letter Ae
136	Ç	C,	Latin Capital Letter C With Cedilla
137	È	E`	Latin Capital Letter E With Grave
138	É	E'	Latin Capital Letter E With Acute
139	Ê	E^	Latin Capital Letter E With Circumflex
140	Ë	E:	Latin Capital Letter E With Diaeresis
141	Ì	I`	Latin Capital Letter I With Grave
142	Í	I'	Latin Capital Letter I With Acute
143	Î	I^	Latin Capital Letter I With Circumflex
144	Ï	I:	Latin Capital Letter I With Diaeresis
145	Ð	-D	Latin Capital Letter Eth
146	Ñ	N~	Latin Capital Letter N With Tilde
147	Ò	O`	Latin Capital Letter O With Grave
148	Ó	O'	Latin Capital Letter O With Acute
149	Ô	O^	Latin Capital Letter O With Circumflex
150	Õ	O~	Latin Capital Letter O With Tilde
151	Ö	O:	Latin Capital Letter O With Diaeresis
152	×	xx	Multiplication Sign
153	Ø	O/	Latin Capital Letter O With Stroke
154	Ù	U`	Latin Capital Letter U With Grave
155	Ú	U'	Latin Capital Letter U With Acute
156	Û	U^	Latin Capital Letter U With Circumflex
157	Ü	U:	Latin Capital Letter U With Diaeresis
158	Ý	Y'	Latin Capital Letter Y With Acute
159	Þ	Ip	Latin Capital Letter Thorn
160	ß	B.	Latin Small Letter Sharp S
161	à	a`	Latin Small Letter A With Grave
162	á	a'	Latin Small Letter A With Acute
163	â	a^	Latin Small Letter A With Circumflex
164	ã	a~	Latin Small Letter A With Tilde
165	ä	a:	Latin Small Letter A With Diaeresis
166	å	ao	Latin Small Letter A With Ring Above
167	æ	ae	Latin Small Letter Ae
168	ç	c,	Latin Small Letter C With Cedilla
169	è	e`	Latin Small Letter E With Grave
170	é	e'	Latin Small Letter E With Acute
171	ê	e^	Latin Small Letter E With Circumflex
172	ë	e:	Latin Small Letter E With Diaeresis
173	ì	i`	Latin Small Letter I With Grave
174	í	i'	Latin Small Letter I With Acute
175	î	i^	Latin Small Letter I With Circumflex
176	ï	i:	Latin Small Letter I With Diaeresis
177	ð	-d	Latin Small Letter Eth
178	ñ	n~	Latin Small Letter N With Tilde
179	ò	o`	Latin Small Letter O With Grave
180	ó	o'	Latin Small Letter O With Acute
181	ô	o^	Latin Small Letter O With Circumflex
182	õ	o~	Latin Small Letter O With Tilde
183	ö	o:	Latin Small Letter O With Diaeresis
184	÷	-:	Division Sign
185	ø	o/	Latin Small Letter O With Stroke
186	ù	u`	Latin Small Letter U With Grave
187	ú	u'	Latin Small Letter U With Acute
188	û	u^	Latin Small Letter U With Circumflex
189	ü	u:	Latin Small Letter U With Diaeresis
190	ý	y'	Latin Small Letter Y With Acute
191	þ	ip	Latin Small Letter Thorn
192	ÿ	y:	Latin Small Letter Y With Diaeresis

Notes:
- Characters 0-96 are printable ASCII. This includes the three whitespace characters HT (horizontal tab or "tab"), LF (line feed or "newline"), and SPA (space). For all the non-whitespace printable ASCII characters, I have simply appended an underscore. This makes them two ASCII characters long via a single, easily-remembered rule. They are also reasonably distinguishable from the Latin-1 characters.
- Characters 97-192 are printable Latin-1 (the 96 characters 00a0-00ff). Characters 0080-009f from the Latin-1 Supplement are non-printing control characters and are not included. Characters 00a0 (no-break space) and 00ad (soft hyphen) are included.
- Character HT is not two ASCII HT characters, because this would distort the layout of the text when it is viewed in its raw ASCII form. Similarly, Character LF is not two ASCII LF characters. I chose the - sign to be in front of the HT character in order to indicate continuity (visually, a Minus Sign is a smooth line from left to right) and the + sign to be in front of the LF character in order to indicate an addition (i.e. the new line). The - and + signs are also sufficiently distinct from letters that they should be clearly distinguishable from the text when the raw ASCII is viewed.
- Character SPA is two ASCII SPA characters, as this does not distort the layout of the text, but instead preserves it, when the text is viewed in its raw ASCII form.
- In the Latin-1 encoding, I have used the following approaches to choose two-ASCII-character sequences:
-- Where possible, use a shape-based analogy or combination. E.g. "c/" for a Cent Sign. If the two ASCII characters are overlaid, this produces a rough approximation of the Cent Sign. Similarly, when the two characters in "L-" are overlaid, this produces a rough approximation of the Pound Sign.
-- Sometimes, the shape analogy or combination is a little more abstract. Two vertical bars "||" overlaid does not produce a broken bar symbol, but the two separate bar pieces next to each other are, in a sense, a "broken bar".
-- Ideally, the second letter should be a modifier of the first.
-- ^ in the first position indicates "above, up, superscript". ^ in the second position indicates a circumflex that should be placed on top of the letter in the first position.
-- "," in the second position indicates a Cedilla modifier. If an underscore (i.e. a blank) is in the first position, this is the Cedilla by itself.
-- "i" in the second position indicates "inverted". E.g. in "?i", this indicates that the question mark should be turned upside down.
-- {`} in the second position indicates a grave accent modifier for the letter in the first position. Similarly, {'} is an acute accent modifier, {~} is a tilde modifier, and ":" is a diaeresis modifier.
-- When I couldn't see a good shape-based choice, I used a pair of letters based on the English name for the character. E.g. "co" for "Copyright Sign", "se" for "Section Sign".
-- "m." is a combination shape and word meaning. "m" stands for "middle" (much like ^ in first position stands for "above") and "." is a dot. The "m" indicates that the "." should be moved to a middle height.
-- The Currency Sign looks like a circle with small protrusions at the corners of the glyph rectangle. The sequence "ox" ("circle modified by a cross") was the best shape combination I could come up with.
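
The modifier rules for accented letters are regular enough to be computed rather than looked up. The sketch below is one possible approach, not how the table was built: it maps each second-position modifier to the matching Unicode combining mark and lets NFC normalisation produce the single composed character. The names MODIFIERS and compose are hypothetical. Sequences that follow other rules (e.g. "co", "m.", "ox") would still need an explicit table.

```python
import unicodedata

# Second-position modifiers from the notes above, mapped to Unicode
# combining marks (the mapping itself is an assumption of this sketch).
MODIFIERS = {
    "`": "\u0300",  # combining grave accent
    "'": "\u0301",  # combining acute accent
    "^": "\u0302",  # combining circumflex accent
    "~": "\u0303",  # combining tilde
    ":": "\u0308",  # combining diaeresis
    ",": "\u0327",  # combining cedilla
}


def compose(pair: str) -> str:
    """Turn a letter-plus-modifier pair into one Latin-1 character."""
    if len(pair) != 2:
        raise ValueError("expected exactly two ASCII characters")
    base, mod = pair
    if not (base.isascii() and base.isalpha()) or mod not in MODIFIERS:
        raise ValueError(f"not a letter-plus-modifier pair: {pair!r}")
    # NFC merges the base letter and the combining mark when a single
    # precomposed character exists.
    composed = unicodedata.normalize("NFC", base + MODIFIERS[mod])
    if len(composed) != 1 or ord(composed) > 0xFF:
        raise ValueError(f"no single Latin-1 character for {pair!r}")
    return composed


print(compose("e'"), compose("C,"), compose("n~"))  # prints é Ç ñ
```

The range check rejects "Y:", because the capital Y with diaeresis lies outside Latin-1.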

[start of notes]

Sources of inspiration for my work on EML:
- My desire to be able to write and edit French and Greek text using my English-based keyboard and computer system.
- My desire to be able to digitally sign and timestamp all my work. The work must therefore be legible (else how do you know what you are signing?) and editable (because you'll want to improve it as much as possible before timestamping it and making it forever unalterable). Additionally, readability should be emphasised over convenience. Data should, as much as possible, be both human-readable (for verification) and machine-readable (for search). Unicode in its most common form (UTF-8) is often illegible for humans (when displayed in many text processing programs) and slow for machines to search.
- Stanislav Datskovskiy's article "No Formats, no Format Wars.". Link:
www.loper-os.org/?p=309
-- This showed me that I could simply design a new data format and embed items in this format into ASCII data. The format would have to have opening and closing tags that would allow a machine or human to read the data, arrive at the tag, and determine whether or not the format was known. If known, use that format to interpret data until the closing tag is encountered. If unknown, stop and try to locate the relevant format definition.
- Mircea Popescu's comments on alphabets vs hieroglyphics in the #trilema chat channel log. Link:
btcbase.org/log-search?q=alphabet+from%3Amircea
-- These made me understand the importance of a fixed set of symbols (an alphabet) in which all text can be written. New symbols can be invented and used, but only as shorthand. The names and definitions of new symbols must be written or writable in the original alphabet. It must be possible, if necessary, to describe anything using only the original alphabet. Any relaxation of this rule will eventually produce a hieroglyphic system that is increasingly expensive to use.

Another source:
www.unicode.org/charts/PDF/U0080.pdf
Title: C1 Controls and Latin-1 Supplement
Author: Unknown.
This source supplied me with the Latin-1 Supplement glyphs, their order, and their names.
I consulted my own copy in my archives. I downloaded a copy from the link and used

diff

to compare the two files.

diff

reported that the files differ but did not give any detail (they are both binary files). I'm reasonably sure that the glyphs, their order, and their names are still the same, but I don't know for certain.

[end of notes]

[start of footnotes]

[0]

Note: Originally, ASCII was designed as a 7-bit code. Every character was written as a sequence of seven 0s and 1s. Today, an extra 0 is added onto the front of each character in order to make them all 8 bits long.


[end of footnotes]