edgecase
Author: StJohn Piano
Published: 2019-03-24
Datafeed Article 90
This article has been digitally signed by Edgecase Datafeed.
This article has been digitally signed by its author.
2372 words - 535 lines - 14 pages





The ASCII text encoding is an alphabet. It contains 128 characters.

34 characters are non-printing control characters. 94 characters are printable (they have individual visible glyphs).

Notes:
- Many of the control characters are no longer used.
- The horizontal tab (HT), the line feed (LF), and the space (SPA) characters are whitespace characters. They are non-printing control characters but they are used to affect the visible layout of the printable text.
- The line feed character is often known as the "newline" character. The vertical tab has fallen out of use, so the horizontal tab character is often known simply as the "tab" character.
- The other whitespace characters are vertical tab (VT), carriage return (CR), and form feed (FF). They are rarely used. Exception: Windows uses CR+LF as its line termination sequence.


The first 33 characters (characters 0-32) and the last character (character 127) are the non-printing control characters. Characters 33-126 are the printable ASCII characters.
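
The split described above can be checked with a short Python sketch. The function name is_control is made up for this illustration, and the sketch follows this article's convention of counting the space character (32) among the non-printing characters.

```python
def is_control(code: int) -> bool:
    """Return True for the non-printing characters as counted in this article:
    codes 0-32 (which includes space) and code 127 (DEL)."""
    if not 0 <= code <= 127:
        raise ValueError("not an ASCII code: %d" % code)
    return code <= 32 or code == 127


# Partition all 128 ASCII codes into the two groups.
control = [c for c in range(128) if is_control(c)]
printable = [c for c in range(128) if not is_control(c)]

print(len(control), len(printable))  # 34 94
print("".join(chr(c) for c in printable))
```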






Table 1: ASCII text encoding
Note: The binary and hex values are 8-bit.
Decimal  Character  Binary     Hex
0        NUL        0000 0000  00
1        SOH        0000 0001  01
2        STX        0000 0010  02
3        ETX        0000 0011  03
4        EOT        0000 0100  04
5        ENQ        0000 0101  05
6        ACK        0000 0110  06
7        BEL        0000 0111  07
8        BS         0000 1000  08
9        HT         0000 1001  09
10       LF         0000 1010  0a
11       VT         0000 1011  0b
12       FF         0000 1100  0c
13       CR         0000 1101  0d
14       SO         0000 1110  0e
15       SI         0000 1111  0f
16       DLE        0001 0000  10
17       DC1        0001 0001  11
18       DC2        0001 0010  12
19       DC3        0001 0011  13
20       DC4        0001 0100  14
21       NAK        0001 0101  15
22       SYN        0001 0110  16
23       ETB        0001 0111  17
24       CAN        0001 1000  18
25       EM         0001 1001  19
26       SUB        0001 1010  1a
27       ESC        0001 1011  1b
28       FS         0001 1100  1c
29       GS         0001 1101  1d
30       RS         0001 1110  1e
31       US         0001 1111  1f
32       SPA        0010 0000  20
33       !          0010 0001  21
34       "          0010 0010  22
35       #          0010 0011  23
36       $          0010 0100  24
37       %          0010 0101  25
38       &          0010 0110  26
39       '          0010 0111  27
40       (          0010 1000  28
41       )          0010 1001  29
42       *          0010 1010  2a
43       +          0010 1011  2b
44       ,          0010 1100  2c
45       -          0010 1101  2d
46       .          0010 1110  2e
47       /          0010 1111  2f
48       0          0011 0000  30
49       1          0011 0001  31
50       2          0011 0010  32
51       3          0011 0011  33
52       4          0011 0100  34
53       5          0011 0101  35
54       6          0011 0110  36
55       7          0011 0111  37
56       8          0011 1000  38
57       9          0011 1001  39
58       :          0011 1010  3a
59       ;          0011 1011  3b
60       <          0011 1100  3c
61       =          0011 1101  3d
62       >          0011 1110  3e
63       ?          0011 1111  3f
64       @          0100 0000  40
65       A          0100 0001  41
66       B          0100 0010  42
67       C          0100 0011  43
68       D          0100 0100  44
69       E          0100 0101  45
70       F          0100 0110  46
71       G          0100 0111  47
72       H          0100 1000  48
73       I          0100 1001  49
74       J          0100 1010  4a
75       K          0100 1011  4b
76       L          0100 1100  4c
77       M          0100 1101  4d
78       N          0100 1110  4e
79       O          0100 1111  4f
80       P          0101 0000  50
81       Q          0101 0001  51
82       R          0101 0010  52
83       S          0101 0011  53
84       T          0101 0100  54
85       U          0101 0101  55
86       V          0101 0110  56
87       W          0101 0111  57
88       X          0101 1000  58
89       Y          0101 1001  59
90       Z          0101 1010  5a
91       [          0101 1011  5b
92       \          0101 1100  5c
93       ]          0101 1101  5d
94       ^          0101 1110  5e
95       _          0101 1111  5f
96       `          0110 0000  60
97       a          0110 0001  61
98       b          0110 0010  62
99       c          0110 0011  63
100      d          0110 0100  64
101      e          0110 0101  65
102      f          0110 0110  66
103      g          0110 0111  67
104      h          0110 1000  68
105      i          0110 1001  69
106      j          0110 1010  6a
107      k          0110 1011  6b
108      l          0110 1100  6c
109      m          0110 1101  6d
110      n          0110 1110  6e
111      o          0110 1111  6f
112      p          0111 0000  70
113      q          0111 0001  71
114      r          0111 0010  72
115      s          0111 0011  73
116      t          0111 0100  74
117      u          0111 0101  75
118      v          0111 0110  76
119      w          0111 0111  77
120      x          0111 1000  78
121      y          0111 1001  79
122      z          0111 1010  7a
123      {          0111 1011  7b
124      |          0111 1100  7c
125      }          0111 1101  7d
126      ~          0111 1110  7e
127      DEL        0111 1111  7f
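
The Binary and Hex columns of Table 1 follow mechanically from the Decimal column. As a rough illustration, they can be regenerated with the Python sketch below. The helper name binary_and_hex is hypothetical; the output layout (two 4-bit groups, lowercase 2-digit hex) simply mirrors the table.

```python
from typing import Tuple


def binary_and_hex(code: int) -> Tuple[str, str]:
    """Return the 8-bit binary form (as two nibbles) and the 2-digit hex form
    of an ASCII code, in the same style as Table 1."""
    if not 0 <= code <= 127:
        raise ValueError("not an ASCII code: %d" % code)
    bits = format(code, "08b")       # zero-padded to 8 binary digits
    hex_digits = format(code, "02x")  # zero-padded to 2 lowercase hex digits
    return bits[:4] + " " + bits[4:], hex_digits


for code in (0, 10, 65, 127):
    binary, hexadecimal = binary_and_hex(code)
    print(code, binary, hexadecimal)
```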









Table 2: Names of the non-printing control characters in ASCII
Decimal  Character  Name
0        NUL        Null character
1        SOH        Start of Heading
2        STX        Start of Text
3        ETX        End of Text
4        EOT        End of Transmission
5        ENQ        Enquiry
6        ACK        Acknowledgement
7        BEL        Audible Bell or Alarm
8        BS         Backspace
9        HT         Horizontal Tab
10       LF         Line Feed
11       VT         Vertical Tab
12       FF         Form Feed
13       CR         Carriage Return
14       SO         Shift Out
15       SI         Shift In
16       DLE        Data Link Escape
17       DC1        Device Control 1: Resume Transmission. Also known as XON.
18       DC2        Device Control 2
19       DC3        Device Control 3: Suspend Transmission. Also known as XOFF.
20       DC4        Device Control 4
21       NAK        Negative Acknowledgement
22       SYN        Synchronous Idle
23       ETB        End of Transmission Block
24       CAN        Cancel
25       EM         End of Medium
26       SUB        Substitute
27       ESC        Escape
28       FS         File Separator
29       GS         Group Separator
30       RS         Record Separator
31       US         Unit Separator
32       SPA        Space
127      DEL        Delete









Table 3: Names of the printable characters in ASCII
Decimal  Character  Name
33       !          Exclamation Mark
34       "          Quotation Mark
35       #          Number Sign
36       $          Dollar Sign
37       %          Percent Sign
38       &          Ampersand
39       '          Apostrophe
40       (          Left Parenthesis
41       )          Right Parenthesis
42       *          Asterisk
43       +          Plus Sign
44       ,          Comma
45       -          Hyphen-Minus (also known as Hyphen or Minus)
46       .          Full Stop (also known as Period or Dot)
47       /          Solidus (also known as Slash)
48       0          Digit Zero
49       1          Digit One
50       2          Digit Two
51       3          Digit Three
52       4          Digit Four
53       5          Digit Five
54       6          Digit Six
55       7          Digit Seven
56       8          Digit Eight
57       9          Digit Nine
58       :          Colon
59       ;          Semicolon
60       <          Less-Than Sign
61       =          Equals Sign
62       >          Greater-Than Sign
63       ?          Question Mark
64       @          Commercial At (also known as At Sign)
65       A          Latin Capital Letter A
66       B          Latin Capital Letter B
67       C          Latin Capital Letter C
68       D          Latin Capital Letter D
69       E          Latin Capital Letter E
70       F          Latin Capital Letter F
71       G          Latin Capital Letter G
72       H          Latin Capital Letter H
73       I          Latin Capital Letter I
74       J          Latin Capital Letter J
75       K          Latin Capital Letter K
76       L          Latin Capital Letter L
77       M          Latin Capital Letter M
78       N          Latin Capital Letter N
79       O          Latin Capital Letter O
80       P          Latin Capital Letter P
81       Q          Latin Capital Letter Q
82       R          Latin Capital Letter R
83       S          Latin Capital Letter S
84       T          Latin Capital Letter T
85       U          Latin Capital Letter U
86       V          Latin Capital Letter V
87       W          Latin Capital Letter W
88       X          Latin Capital Letter X
89       Y          Latin Capital Letter Y
90       Z          Latin Capital Letter Z
91       [          Left Square Bracket
92       \          Reverse Solidus (also known as Backslash)
93       ]          Right Square Bracket
94       ^          Circumflex Accent
95       _          Low Line (also known as Underscore)
96       `          Grave Accent (also known as Backtick)
97       a          Latin Small Letter A
98       b          Latin Small Letter B
99       c          Latin Small Letter C
100      d          Latin Small Letter D
101      e          Latin Small Letter E
102      f          Latin Small Letter F
103      g          Latin Small Letter G
104      h          Latin Small Letter H
105      i          Latin Small Letter I
106      j          Latin Small Letter J
107      k          Latin Small Letter K
108      l          Latin Small Letter L
109      m          Latin Small Letter M
110      n          Latin Small Letter N
111      o          Latin Small Letter O
112      p          Latin Small Letter P
113      q          Latin Small Letter Q
114      r          Latin Small Letter R
115      s          Latin Small Letter S
116      t          Latin Small Letter T
117      u          Latin Small Letter U
118      v          Latin Small Letter V
119      w          Latin Small Letter W
120      x          Latin Small Letter X
121      y          Latin Small Letter Y
122      z          Latin Small Letter Z
123      {          Left Curly Bracket (also known as Left Brace)
124      |          Vertical Line (also known as Vertical Bar)
125      }          Right Curly Bracket (also known as Right Brace)
126      ~          Tilde













[start of notes]



ASCII stands for "American Standard Code for Information Interchange".


Here are my notes on ASCII, taken from various sources.






Source 1:
rabbit.eng.miami.edu/info/ascii.html
Author: First name appears to be Stephen. Did not find a surname. Seems to be a professor (?) of computer science (?) at the University of Miami.

Excerpts:

ASCII is a seven-bit code; it only defines codes from 0 to 127. Codes outside this range are not part of ASCII, and vary in meaning considerably. The version in use today is more completely called ASCII-1967 (it was adopted in 1967), and there are two slightly different earlier versions documented below.

ASCII uses only seven bits. Although it was communicated in eight-bit bytes, normal communication channels were unreliable. The 8th bit was used for error checking (parity). Typically the 8th bit was set to ensure that there was always an odd number of 1's in each byte transmitted (e.g. '$' is binary 0100100 which has an even number of 1's so is transmitted as 10100100; 'F' is binary 1000110 with an odd number of 1's so is transmitted as 01000110), but even parity systems were also used. The receiving equipment would simply check the parity of each byte; any single-bit inversion would be detected, and large errors were very likely to be noticed.
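
Note (not part of the excerpt): the parity scheme described above can be sketched in Python as follows. The function names add_parity and parity_ok are hypothetical, and placing the parity bit in the most significant position is taken from the two worked examples in the excerpt.

```python
def add_parity(code: int, odd: bool = True) -> int:
    """Return the 8-bit value transmitted for a 7-bit ASCII code,
    with the parity bit placed in the most significant position."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit code: %d" % code)
    ones = bin(code).count("1")
    # Odd parity: set the 8th bit when the 7-bit code has an even number of 1s.
    # Even parity: set the 8th bit when the 7-bit code has an odd number of 1s.
    need_bit = (ones % 2 == 0) if odd else (ones % 2 == 1)
    return (code | 0x80) if need_bit else code


def parity_ok(byte: int, odd: bool = True) -> bool:
    """Check a received 8-bit value against the expected parity."""
    ones = bin(byte & 0xFF).count("1")
    return (ones % 2 == 1) if odd else (ones % 2 == 0)


print(format(add_parity(ord("$")), "08b"))  # 10100100
print(format(add_parity(ord("F")), "08b"))  # 01000110

# Flipping any single bit of a transmitted byte breaks the parity check.
sent = add_parity(ord("$"))
print(parity_ok(sent), parity_ok(sent ^ 0x04))  # True False
```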

Seven bit code was not considered strange; it is only comparatively recently that computers with eight-bit byte based memory became an accidental standard. The Dec-system-10 had 36-bit memory, the ICL-1900 had 24-bit memory, and the CDC-6600 had 60-bit memory, to name but three.

[...]

Older ASCII versions

ASCII-1963 was the same as the current (1967) version except:
- What is now usually rendered as a hat ^ was rendered as an arrow pointing up,
- What is now rendered as an underline was rendered as an arrow pointing left,
- The last 32 codes were not assigned: lower-case letters did not exist,
- The invisible control codes (1 to 31) had different official abbreviations.

ASCII-1965 was the same as the current (1967) version except:
- What is now the backwards-divide sign \ was the wiggle sign ~,
- What is now the vertical line | was the logical-not sign ¬,
- What is now the wiggle sign ~ was the vertical line |,
- The at-sign @ and the backwards-single-quote ` swapped places.


The symbols (e.g. NUL, SOH) and names for the non-printing control characters come from source 1.

From source 1, I also learned that in C / Unix:
- The NUL character is used to signify "end of string".
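- The EOT character is used to signify "end of file". (According to source 4, this applies only to input typed at a terminal; EOT is never stored in a file.)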
- The LF character is used to signify "end of line". (I already knew this, but included it here for the completeness of the pattern).

Additionally:
- The SUB character is used in the Unix C Shell to suspend the current process.






Source 2:
www.computinghistory.org.uk/det/5942/First-edition-of-the-ASCII-standard-was-published
Author: None listed.

Excerpt:

The American Standard Code for Information Interchange (acronym: ASCII) is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text. Most modern character-encoding schemes, which support many more characters than did the original, are based on ASCII.

US-ASCII is the IANA preferred charset name for ASCII.

Historically, ASCII developed from telegraphic codes. Its first commercial use was as a seven-bit teleprinter code promoted by Bell data services. Work on ASCII formally began October 6, 1960, with the first meeting of the American Standards Association's (ASA) X3.2 subcommittee. The first edition of the standard was published during 1963, a major revision during 1967, and the most recent update during 1986. Compared to earlier telegraph codes, the proposed Bell code and ASCII were both ordered for more convenient sorting (i.e., alphabetization) of lists, and added features for devices other than teleprinters.







Source 3:
nemesis.lonestar.org/reference/telecom/codes/ascii.html
Author: Frank Durda IV

Excerpts:

The United States of America Standard Code for Information Interchange (USASCII, later renamed American Standard Code for Information Interchange, or simply "ASCII") describes a communications system where 7-bit words represent printable symbols and control codes. The 1963 USASCII standard went through numerous revisions between 1963 and 1968, when it was formally adopted by the American National Standards Institute (ANSI).

[...]

ASCII Code Divisions and Categories

The ASCII code is divided into three main divisions and five categories as shown in this table:

[table altered to be the following list]

Division: Control
- ASCII Range (Decimal): 0 to 31, 127

Division: Basic Printable
- ASCII Range (Decimal): 32 to 95
-- Subcategory: Symbols and Punctuation (32-47, 58-64, 91-95)
-- Subcategory: Numbers (48-57)
-- Subcategory: Uppercase Letters (65-90)

Division: Extended Printable
- ASCII Range (Decimal): 96 to 126
-- Subcategory: Lowercase Letters (97-122)
-- Subcategory: Extended Symbols and Punctuation (96, 123-126)

The extended printable character set was deliberately arranged so that if a symbol was received in this range and could not be displayed due to limitations of the printing or display device, the symbol in the basic printable range exactly 32 (0x20) positions earlier could be substituted and would provide reasonable results. In such situations, "{" and "}" would be displayed or printed as "[" and "]", while lowercase letters would be displayed or printed in uppercase.
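
Note (not part of the excerpt): the substitution rule described above amounts to subtracting 32 from any code in the extended printable range. A minimal Python sketch, with the hypothetical function name basic_fallback:

```python
def basic_fallback(ch: str) -> str:
    """Map a character in the extended printable range (96-126) to the
    basic printable character exactly 32 (0x20) positions earlier.
    All other characters are returned unchanged."""
    code = ord(ch)
    if 96 <= code <= 126:
        return chr(code - 32)
    return ch


print("".join(basic_fallback(c) for c in "hello {world}"))  # HELLO [WORLD]
```

Under this rule the backtick (96) falls back to "@" (64) and the tilde (126) falls back to "^" (94).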

[...]

This design of ASCII was intentionally organized to allow simpler display devices to be produced that only had to print 62 of the 94 ASCII printable codes and could substitute something "close" when asked to display an ASCII character that the device was incapable of producing, such as using the uppercase letter when the lowercase letter could not be printed.

[...]

Early Uses of ASCII and alternate coding systems

One of the earliest 7-bit ASCII devices was an improved line of electro-mechanical printers made by the Teletype corporation. With an operational speed of up to 10 characters per second, these devices were used worldwide for message transmission by Western Union, various news wire services and the military. Later, these devices found new uses as input/output devices connected to computer systems that also communicated using the ASCII character set.

The most widely-manufactured Teletype model was number 33, which was sold under a variety of model names such as the KSR-33 and ASR-33. These devices could only print the basic printable character portion of the ASCII character set (64 characters). This limited these devices to uppercase letters, numbers and most punctuation characters as shown in the table above. Some early video terminals and computers (such as the Digital Equipment Corporation VT50 and the Radio Shack TRS-80 Model I) supported only the basic printable set of characters, despite being designed and manufactured years after the ASCII extended character set was adopted. Some manufacturers did offer upgrades that allowed for the display of all ASCII printable characters.

Prior to the introduction of the ASCII-based teletype printers, the Teletype corporation produced teleprinters that used Baudot or "5-Level" character codes, operating at speeds between 40 and 75 baud. These were widely used for over thirty years, but were largely removed from service by the mid 1960s.

IBM's earlier mainframe computers (notably the IBM 360 and 370 families) did not use ASCII. Instead, they used an alternate character coding system called EBCDIC which was devised by IBM as a way to ensure that any peripherals to be connected to IBM computers were also made by IBM. IBM eventually lost this battle and by the late 1970s, it was common to see IBM systems that used EBCDIC internally, but had external communication processors that translated transmissions between IBM's EBCDIC and what other equipment makers were using, which was ASCII.


From source 3, I also learned that:
- "Paper Advance" is another term for "Line Feed".
- LF = Paper Advance one line or move cursor down one line. If VDT is at the bottom of screen already, scroll screen one line or wrap to top, depending on settings.
- UNIX system display routines treat LF as though it received CR and LF in most situations. However, TCP communication software on UNIX systems running in the default "cooked" mode must use the proper CR/LF sequence to end a given line of ASCII text that is transmitted or received.
- Vertical Tabulation = Paper Advance by number of lines dictated by the control tape or similar mechanism.
- Form Feed = Paper Advance to next page, screen clear and/or position to top or bottom line on some VDTs.
- Carriage Return = Move print head or cursor to column 1.






Source 4:
www.cs.mcgill.ca/~rwest/wikispeedia/wpcd/wp/a/ASCII.htm
Author: The content found at the hyperlink www.cs.mcgill.ca/~rwest indicates that the site author is Robert West. However, this content originally comes from Wikipedia, and has been curated / hand-selected (by whom?).

Excerpts:

Like other character representation computer codes, ASCII specifies a correspondence between digital bit patterns and the symbols / glyphs of a written language, thus allowing digital devices to communicate with each other and to process, store, and communicate character-oriented information.

ASCII is, strictly, a seven-bit code, meaning that it uses the bit patterns representable with seven binary digits (a range of 0 to 127 decimal) to represent character information. At the time ASCII was introduced, many computers dealt with eight-bit groups (bytes or, more specifically, octets) as the smallest unit of information; the eighth bit was commonly used as a parity bit for error checking on communication lines or other device-specific functions. Machines which did not use parity typically set the eighth bit to zero, though some systems such as Prime machines running PRIMOS set the eighth bit of ASCII characters to one.

ASCII only defines a relationship between specific characters and bit sequences; aside from reserving a few control codes for line-oriented formatting, it does not define any mechanism for describing the structure or appearance of text within a document. Such concepts are within the realm of other systems such as the markup languages.

[...]

ASCII developed from telegraphic codes and first entered commercial use as a seven-bit teleprinter code promoted by Bell data services in 1963. The Bell System had previously planned to use a six-bit code, derived from Fieldata, that added punctuation and lower-case letters to the earlier five-bit Baudot teleprinter code, but was persuaded instead to join the ASA subcommittee that had started to develop ASCII. Baudot helped in the automation of sending and receiving telegraphic messages, and took many features from Morse code; however, unlike Morse code, Baudot used constant-length codes. Compared to earlier telegraph codes, the proposed Bell code and ASCII both underwent re-ordering for more convenient sorting (especially alphabetization) of lists, and added features for devices other than teleprinters. Bob Bemer introduced features such as the 'escape sequence'. His British colleague Hugh McGregor Ross helped to popularize this work, as Bemer said, "so much so that the code that was to become ASCII was first called the Bemer-Ross Code in Europe".

[...]

Many more of the control codes have taken on meanings quite different from their original ones. The "escape" character (code 27), for example, was originally intended to allow sending other control characters as literals instead of invoking their meaning. This is the same meaning of "escape" encountered in URL encodings, C language strings, and other systems where certain characters have a reserved meaning. Over time this meaning has been coopted and has eventually drifted. In modern use, an ESC sent to the terminal usually indicates the start of a command sequence, usually in the form of an ANSI escape code. An ESC sent from the terminal is most often used as an "out of band" character used to terminate an operation, as in the TECO and vi text editors.

The inherent ambiguity of many control characters, combined with their historical usage, has also created problems when transferring "plain text" files between systems. The clearest example of this is the newline problem on various operating systems. On printing terminals there is no question that you terminate a line of text with both "Carriage Return" and "Linefeed". The first returns the printing carriage to the beginning of the line and the second advances to the next line without moving the carriage. However, requiring two characters to mark the end of a line introduced unnecessary complexity and questions as to how to interpret each character when encountered alone. To simplify matters, plain text files on Unix systems use line feeds alone to separate lines. Similarly, older Macintosh systems, among others, use only carriage returns in plain text files. Various DEC operating systems used both characters to mark the end of a line, perhaps for compatibility with teletypes, and this de facto standard was copied in the CP/M operating system and then in MS-DOS and eventually Microsoft Windows. The DEC operating systems, along with CP/M, tracked file length only in units of disk blocks and used Control-Z (SUB) to mark the end of the actual text in the file (also done for CP/M compatibility in some cases in MS-DOS, though MS-DOS has always recorded exact file-lengths). Control-C (ETX, End of TeXt) might have made more sense, but was already in wide use as a program abort signal. UNIX's use of Control-D (EOT, End of Transmission) appears on its face similar, but is used only from the terminal and never stored in a file.
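
Note (not part of the excerpt): the three line-ending conventions described above (CR+LF, LF alone, CR alone) can be reconciled with a small Python sketch. The function name to_unix_newlines is hypothetical, and the sketch assumes the input never uses a lone CR for anything other than a line ending.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert CR+LF and lone CR line endings to a single LF."""
    # Replace the two-character sequence first, so that a CR+LF pair
    # is not turned into two line feeds by the second replacement.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


mixed = b"dos line\r\nold mac line\runix line\n"
print(to_unix_newlines(mixed))
# b'dos line\nold mac line\nunix line\n'
```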

While the codes mentioned above have retained some semblance of their original meanings, many of the codes originally intended for stream delimiters or for link control on a terminal have lost all meaning except their relation to a letter. Control-A is almost never used to mean "start of header" except on an ANSI magnetic tape. When connecting a terminal to a system, or asking the system to recognize that a logged-out terminal wants to log in, modern systems are much more likely to want a carriage return or an ESCape than Control-E (ENQuire, meaning "is there anybody out there?").
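
Note (not part of the excerpt): the "relation to a letter" mentioned above is that Control plus a letter sends the letter's code with its two high bits cleared, i.e. the uppercase letter's code minus 64. A minimal Python sketch, with the hypothetical function name control_code:

```python
def control_code(key: str) -> int:
    """Return the ASCII code sent by Control-<key>, for keys in the
    range '@' (64) to '_' (95). Lowercase letters are treated as uppercase."""
    code = ord(key.upper())
    if not 64 <= code <= 95:
        raise ValueError("no control code for %r" % key)
    return code & 0x1F  # keep only the low five bits


for key in "ACDEZ[":
    print("Control-%s -> %d" % (key, control_code(key)))
# Control-A -> 1   (SOH)
# Control-C -> 3   (ETX)
# Control-D -> 4   (EOT)
# Control-E -> 5   (ENQ)
# Control-Z -> 26  (SUB)
# Control-[ -> 27  (ESC)
```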

[...]

Structural features:
- The digits 0-9 are represented with their values in binary prefixed with 0011 (this means that converting BCD to ASCII is simply a matter of taking each BCD nibble separately and prefixing 0011 to it).
- Lowercase and uppercase letters differ in bit pattern by only a single bit, which simplifies case conversion to a range test (to avoid converting characters that are not letters) and a single bitwise operation. Fast case conversion is important because it is often used in case-ignoring search algorithms.
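
Note (not part of the excerpt): both structural features can be sketched in Python. The function names below are hypothetical. The differing bit between the cases is 0x20 (decimal 32), which can be read off Table 1 by comparing, for example, "A" (0100 0001) with "a" (0110 0001).

```python
def bcd_to_ascii(bcd_byte: int) -> str:
    """Convert one byte holding two packed BCD digits into two ASCII digits
    by prefixing each nibble with 0011 (i.e. OR-ing it with 0x30)."""
    high, low = (bcd_byte >> 4) & 0x0F, bcd_byte & 0x0F
    if high > 9 or low > 9:
        raise ValueError("not a valid BCD byte: 0x%02x" % bcd_byte)
    return chr(0x30 | high) + chr(0x30 | low)


def to_upper(ch: str) -> str:
    """Range test, then clear bit 0x20."""
    if "a" <= ch <= "z":
        return chr(ord(ch) & ~0x20)
    return ch


def to_lower(ch: str) -> str:
    """Range test, then set bit 0x20."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) | 0x20)
    return ch


print(bcd_to_ascii(0x42))                        # 42
print("".join(to_upper(c) for c in "ascii-67"))  # ASCII-67
print("".join(to_lower(c) for c in "ASCII-67"))  # ascii-67
```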

[...]

The blend word ASCIIbetical has evolved to describe the collation of data in ASCII-code order rather than "standard" alphabetical order.
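
Note (not part of the excerpt): the difference between the two orderings is easy to see in Python, whose default string comparison uses code point order (which coincides with ASCII order for ASCII-only strings). The word list is made up for this illustration.

```python
words = ["apple", "Banana", "cherry", "Date"]

# ASCIIbetical: every uppercase letter (65-90) sorts before every
# lowercase letter (97-122).
asciibetical = sorted(words)

# Case-insensitive ordering, closer to "standard" alphabetical order.
alphabetical = sorted(words, key=str.lower)

print(asciibetical)  # ['Banana', 'Date', 'apple', 'cherry']
print(alphabetical)  # ['apple', 'Banana', 'cherry', 'Date']
```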

The abbreviation ASCIIZ or ASCIZ refers to a null-terminated ASCII string.
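
Note (not part of the excerpt): a null-terminated ASCII string is simply the text bytes followed by a NUL byte (code 0). A minimal Python sketch, with the hypothetical function names to_asciiz and from_asciiz:

```python
def to_asciiz(text: str) -> bytes:
    """Encode text as ASCII and append the terminating NUL byte."""
    if "\x00" in text:
        raise ValueError("text must not contain NUL")
    return text.encode("ascii") + b"\x00"


def from_asciiz(buffer: bytes) -> str:
    """Read an ASCII string from the start of a buffer up to (not including)
    the first NUL byte. Anything after the terminator is ignored."""
    end = buffer.find(b"\x00")
    if end == -1:
        raise ValueError("no NUL terminator found")
    return buffer[:end].decode("ascii")


print(to_asciiz("hi"))                      # b'hi\x00'
print(from_asciiz(b"hello\x00leftover"))    # hello
```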

[...]

This reference article is mainly selected from the English Wikipedia with only minor checks and changes (see www.wikipedia.org for details of authors and sources) and is available under the GNU Free Documentation License. See also our Disclaimer.







Source 5:
www.aivosto.com/articles/charsets-7bit.html
Author: Unknown

Excerpts:

When computers were young in the early 1960s, it was decided that text should be represented with 7 bits for each character. Seven bits would be enough to represent 128 different characters, including letters, numbers, symbols and required control codes. 6 bits were too few. 8 bits were considered too much. The standard became 7.

ASCII (American Standard Code for Information Interchange) was the first 7-bit character set to be standardized. Over the years, several revisions of ASCII were published. ASCII-based character sets became immensely widespread. Most character sets in current use are based on ASCII in one way or another.

[...]

Revisions of ASCII

ASCII has undergone several revisions to become the character set we know today. The history of ASCII is not always fully understood. As an example, IANA lists ASCII as the same thing as ANSI_X3.4-1968 and ANSI_X3.4-1986. This is not entirely accurate. The 1968 revision was ambiguous. The ambiguities were fixed later, making the 1986 revision different from the 1968 revision.

[...]

ASCII-1963 (ASA standard X3.4-1963) was the initial release of ASCII. It was in many ways different from the ASCII in current use. ASCII-1963 didn't yet gain wide acceptance. One of the reasons is that IBM chose to use EBCDIC, an IBM proprietary character set, in its successful SYSTEM/360 series of computers released in 1964.

ASCII-1965 was an unpublished major revision. It looked a lot like the current ASCII, even though there were differences with certain characters. ASCII-1965 was accepted as a standard, but it went unpublished and unused.

ASCII-1967 (USAS X3.4-1967) was a major revision of the previous versions of ASCII. This was the version that eventually evolved to the ASCII we know today.

ASCII-1967 was not exactly what we currently think of as ASCII. The differences are as follows. ASCII-1967 offered some options for certain characters, and one character was totally ambiguous. The Number Sign (#) could be replaced by the symbol £. Two characters could be stylized. The Exclamation Point (!) could be stylized as a logical OR (|) and the Circumflex (^) could be stylized as a logical NOT (¬). Character 7C, even though called a Vertical Line, looked like a broken vertical bar (¦). It looked that way to avoid confusion with a solid vertical bar (|) used as a logical OR. In other words, since character 21 could sometimes look like (|), 7C had to look like (¦).

Character 7E was ambiguous. This character had three functions. It was 1) Overline when used as punctuation, 2) Tilde when used as a diacritic, and 3) General Accent, yet another diacritic which could be used for other accents not specifically provided. The character appeared in two shapes, upper tilde ([deleted]) and midline tilde (~), interchangeably. No explanation was provided as to which shape to use and when. The character did not look like an overline (¯), even when it was called Overline. As if they couldn't decide what this character really was for. The midline shape (~) may have been unintentional. The midline position conflicts with the intended use either as a diacritic or as an overline. Ambiguity regarding the shape seems to have originated in ASCII-1965, where it may have been a typographical error or restriction.

ASCII-1968 (USAS X3.4-1968) was a minor revision. It didn't change any of the graphic characters. The only change was to the "newline" function. LF could now be used alone as a newline. The previous versions required the use of CR LF (or LF CR). The 1968 standard also gave the code its name ASCII or USASCII.

ASCII-1977 (ANSI X3.4-1977) fixed some of the ambiguities of ASCII-1967 and ASCII-1968. The Number Sign (#) could no longer be replaced by the Pound (£). Character 7C was now a Vertical Line (|) that no longer looked like a broken vertical bar. One could no longer stylize the Exclamation Point (!) as a (|) or the Circumflex (^) as a logical NOT (¬). Overline was no longer present; it was simply a Tilde ([deleted], not ~). That character could no longer be used as a General Accent either. ASCII-1977 also changed the definitions of several control characters. The changes did not necessarily change the intended use of these characters. An essential change was with VT and FF: it was now possible to allow an "optional implicit CR" after VT and FF the same way it was already possible with LF. More changes can be found in Control characters in ASCII and Unicode [ www.aivosto.com/articles/control-characters.html ].

ASCII-1986 (ANSI X3.4-1986) did not change the character set nor the control characters.


- Question: "Overline was no longer present; it was simply a Tilde ([deleted], not ~)." - but today, in ASCII, the tilde is the midline tilde. Why?
-- Alternatively: Is this statement a mistake? Should it be "it was simply a Tilde (~, not [deleted])."?
- Note: I have replaced the upper tilde Unicode character with "[deleted]".






Source 6:
www.unicode.org/charts/PDF/U0000.pdf
Author: Unknown.

This source supplied the names of the printable characters.








Changes to the original text:
- I have not always preserved the format of any excerpts from webpages on other sites (e.g. not preserving the original bold/italic styles, changing the list structures, not preserving hyperlinks).
- I have not always preserved the spellings in excerpts from webpages on other sites (e.g. I may change "execpt" to "except").




[end of notes]