Edgecase | Building a parser for EML

Goal

Write a specification for Edgecase Markup Language (EML). Develop an algorithm for parsing it. Implement this algorithm in Python.

Contents

- Goal
- Contents
- Downloadable Assets
- Project Log

Downloadable Assets

Asset: This module supplies a class ("Element") that can parse a file object or a string into an EML element tree. This class can be imported and used elsewhere. It accepts an optional logger object.
eml_parser.py

Asset: A module containing a series of unit tests for eml_parser.py.
test_eml_parser.py

Project Log

System details:
- Name: Shovel
- Specifications: HP 6005 Pro SFF. 3 GHz x86_64 processor (AMD II x4 B95 Quad Core), 4 GB RAM, 1 TB hard drive. Running CentOS 7.6.1810 (Core).
- More information: New computer: Shovel
- Installed items: GCC 4.8.5, Make 3.82, Vim 7.4, Python 2.7.5, Python 3.3.2, Gnome 3.28.2, gedit 3.28.1, GPG 1.4.10.

EML samples:

<content>hello world</content>

<content>
hello <bold>world</bold>
</content>

<content>
<italic_lines>
Things have their shape in time, not space alone.
~ Dr Manhattan
</italic_lines>
</content>

Principles of EML:
- EML contains only Printable ASCII, which is the bedrock alphabet of our era. Any text editor that can handle printable ASCII can handle EML.
- EML is designed to be directly writable and readable, with some effort and practice, by a competent person who is not a computer programmer.
- EML is designed to be both human-readable and machine-readable. The EML format permits the writer to insert whitespace where convenient to make the document more human-readable.
- EML favours convenience of reading more than convenience of writing. Any good document will be read many more times than it is written.
- EML is readable in its raw form but can be rendered into a presentation that is more pleasant to read [0]
- The most flexible way to describe knowledge in a computer system is as a tree. [1] Borrowing from XML, EML is designed as a tree of elements. Elements begin with a start tag and end with an end tag. All tags contain the name of the element. [2] Anything in between the start tag and end tag is the data contained in that element.
- EML is designed to be digitally signed. Once written, text is a specific and unambiguous sequence of bytes. This means that whitespace is as important as the visible text.
- EML uses several characters as in-band signals. [3]. If you wish to include the data of another file without escaping signal characters, then this file should be stored alongside the EML document as a separate file, and included in the EML document as a link.
- It can be difficult to develop a algorithm for processing trees, but this is the price paid for the flexibility of being able to write in the tree format. Reality is difficult to describe, the best way to describe its aspects is not really knowable in advance, and the most effective way to deal with the future problems of description is to design the base data format to be as flexible as possible.

So, let's try to specify EML:
- Elements have names.
- Elements begin with a start tag, which is the element name enclosed in angle brackets. Example: "<content>"
- Elements end with an end tag, which is the element name prefixed with a forward slash and then enclosed in angle brackets. Example: "</content>"
- The element names in the start tag and the end tag must match exactly.
- An EML document is entirely contained within a single element: the root element. An EML document starts with the root element's start tag and ends with its end tag. Notably, an EML document does not end with a newline byte.
- Permitted characters in element names:

abcdefghijklmnopqrstuvwxyz_-.0123456789

-- Lowercase-only element names so as to avoid any reading difficulty due to case. Example of what this avoids: Element 1 is named "foo" and element 2 is named "Foo", and they mean different things.
-- The underscore (_) character is for multi-word names e.g. "<italic_lines>.
-- The hyphen (-), period (.), and digit (0123456789) characters are for element versioning e.g. "<latin-1>", "<greek-2.1>".
- Elements contain data. Data consists of 0 or more bytes. [4] In this data, there may be other elements, which are child elements of the first element (the parent element). So: Element data consists of 0 or more data sequences (which can contain escaped signal characters), interleaved with 0 or more child elements.
- Signal characters:

<>\

- The backslash (\) character is used to escape the angle brackets, and itself, within data.
- The forward slash (/) character is not escaped within data. This is because, within an end tag, it will only occur after the left angle bracket has already changed the context from "data" to "name".
- Permitted characters in data:
-- Printable ASCII:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

-- The three important whitespace characters:
HT (horizontal tab \t), LF (line feed a.k.a. newline \n), SPA (space

).

Further notes:
- EML must be processed byte by byte. No lengths are known ahead of time, so no jumping ahead is possible. This is the cost of the flexibility of the tree data structure. [5]
- Signed EML (which won't change) can be processed in order to produce a cache layer containing e.g. lengths, locations of elements and data sequences, in order to permit fast search.

Let's think about developing an algorithm for parsing EML from a byte sequence into a tree structure. Example of use: In Python, parse an article file written in EML into a tree of Python objects, and search down through the tree in order to extract the title and the author name from the relevant elements.

Well, how many contexts are there?

1) "<" - this begins a tag (start tag or end tag).
2) ">" - this ends a tag.
3) "\" - this means "escape a signal character or a backslash". if this is the second consecutive \, it means itself, the literal backslash character.
4) "/" - in a tag, at the beginning, this distinguishes an end tag from a start tag. in data, it means itself, the literal forward slash character.
5) there is the "name" context, and the "data" context. escaping is only used in the "data" context.
6) If a "<" or a ">" is preceded by a backslash, it means itself, the literal angle bracket character.

The problem with using escape sequences that contain the escaped character e.g. "\<" for "<" is that the "<" now means two different things, depending on where it occurs within the text. It's less ambiguous to use an escape sequence such as "\left", but this makes the raw EML less human-readable.

Example:

- Elements begin with a start tag, which is the element name enclosed in angle brackets. Example: "\leftcontent\left"

- Elements begin with a start tag, which is the element name enclosed in angle brackets. Example: "\<content\>"

I think there are two general contexts: "tag" and "data". Each has its own processing rules.

In an algorithm, this will mean two separate subroutines, one for each context. The algorithm begins at the start of the byte sequence, and jumps between one or the other subroutine, as indicated by the bytes encountered.

[studying, development, and testing occurs here]

Finished. I've written a module called eml_parser.py. It supplies a class ("Element") that can parse a file object or a string into an EML element tree. This class can be imported and used elsewhere.

In EML, bytes can have different meanings depending on their context. In order to interpret a byte, the parsing algorithm preserves some state ("context") from the previous byte. The hard part of this problem was thinking about the chains of context that occur across series of bytes, and choosing context values that cause these chains to be unambiguous. The algorithm runs in O(n) time. In a few places, it will rewind a byte or two, but in general it proceeds through the data one byte at a time, without re-analysing any section.

The class accepts an optional logger object, to which it will report information. The module contains two functions that demonstrate the creation of logger objects (one to the console, and the other to a log file).

I've also written a test file called test_eml_parser.py, which contains a series of unit tests.

The rest of this article contains the code.

eml_parser.py

	#!/usr/bin/python


	import os
	import logging
	import StringIO


	name_characters = "0123456789abcdefghijklmnopqrstuvwxyz_-."
	# printable ASCII:
	data_characters = "!#$%&'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{\|}~"
	data_characters += "\""
	escaped_characters = "<>\\"
	whitespace_characters = " \t\n"
	data_characters += escaped_characters + whitespace_characters

	# future: numerical values for different contexts
	# e.g. EMPTY = 10


	def main():

	help_message = """
	Edgecase Markup Language (EML) Parser Library:

	This library supplies a class that can parse a file object or a string into an EML element tree.

	Use:

	1)
	from eml_parser import Element
	with open(file_path, 'r') as f:
	root_element = Element.from_file_object(f)

	2)
	from eml_parser import Element
	s = "<content>hello <bold>world</bold></content>"
	root_element = Element.from_string(s)

	Both class methods accept an optional Python logger object, using the keyword argument "logger".

	Example:
	import logging
	from eml_parser import Element, create_console_logger
	logger = create_console_logger(logging.INFO)
	with open(file_path, 'r') as f:
	root_element = Element.from_file_object(f, logger)

	To display the EML element tree:
	print(root_element.tree())
	"""

	print help_message


	def run_test():

	# how to run this function from the command line:
	# $ python -c "import eml_parser; eml_parser.run_test()"

	#logger = create_console_logger(logging.DEBUG)
	logger = create_file_logger(logging.INFO)

	data = "<content>hello world this is dog</content>"
	#data = "<foo>x<bar>y</bar>b</foo>"
	#data = "<e1>foo<e2></e2>xxx<e3>bar</e3>hello</e1>"
	#data = "<e1>foo</e1>asdf"

	try:
	root_element = Element.from_string(data, logger=logger)
	logger.info("Built Element from string.")
	logger.info("\n" + root_element.tree() + "\n")
	except Exception as e:
	logger.exception("Error occurred when building an Element from file object")


	def create_console_logger(level):
	logger_name = __name__ + ':console'
	# create logger
	logger = logging.getLogger(logger_name)
	logger.setLevel(level)
	# create console handler
	ch = logging.StreamHandler()
	ch.setLevel(level)
	# create formatter
	formatter = logging.Formatter('[%(name)s] [%(levelname)s] [%(message)s]')
	# add formatter to ch
	ch.setFormatter(formatter)
	# add ch to logger
	logger.addHandler(ch)
	return logger


	def create_file_logger(level, filepath=None):
	if not filepath:
	filepath = __name__ + ".log"
	logger_name = filepath
	logger = logging.getLogger(logger_name)
	logger.setLevel(level)
	handler = logging.FileHandler(filepath, mode='w', delay=True)
	handler.setLevel(level)
	formatter = logging.Formatter('[%(asctime)s] [%(name)s] [%(levelname)s] [%(message)s]')
	handler.setFormatter(formatter)
	logger.addHandler(handler)
	return logger


	def create_null_logger():
	logger = logging.getLogger(__name__)
	handler = logging.NullHandler()
	logger.addHandler(handler)
	return logger




	class Element:

	def __init__(self):
	self.name = ""
	self.end_name = ""
	self.complete = False
	self.children = []
	self.parent = None
	self.logger = None

	def __str__(self):
	return "Element: [{}]".format(self.name)

	def tree(self):
	return "\n".join(self.tree_lines())

	def tree_lines(self):
	if self.parent == None:
	tree_lines = ["Tree for " + str(self)]
	else:
	tree_lines = [" " + str(self)]
	for child in self.children:
	child_lines = child.tree_lines()
	child_lines = [("-" + x) for x in child_lines]
	tree_lines.extend(child_lines)
	return tree_lines

	@classmethod
	def from_string(cls, string, parent=None, logger=None):
	f = StringIO.StringIO(string)
	element = cls.from_file_object(f, parent=parent, logger=logger)
	return element

	@classmethod
	def from_file_object(cls, f, parent=None, logger=None):
	if not logger:
	logger = create_null_logger()
	e = cls()
	e.parent = parent
	e.logger = logger
	e.process_file_object(f)
	if not e.parent:
	# this is the root Element.
	# did we reach the end of the file object?
	remaining_data = f.read()
	if len(remaining_data) > 0:
	message = "Finished building root element, but found this data at end of file: {}".format(repr(remaining_data))
	raise Exception(message)
	return e

	def process_file_object(self, f):
	c = "empty" # c = context
	handled = False # have we handled this combination of context and byte?
	message = "Element: context [{}], byte [{}]."
	while True:
	b = f.read(1) # b = byte
	if not b:
	if not self.complete:
	message = message.format(c, b)
	message += " No more data left, but Element is not complete."
	raise Exception(message)
	break
	self.logger.debug(message.format(c, b))
	if b == "<":
	if c == "empty":
	c = "start_tag_open"
	handled = True
	elif c == "start_tag_close":
	c = "tag_open"
	handled = True
	elif c == "data":
	c = "tag_open"
	handled = True
	elif b == ">":
	if c == "start_tag_name":
	c = "start_tag_close"
	handled = True
	elif c == "end_tag_name":
	c = "end_tag_close"
	# we're at the end. move 1 level back up in the decision hierarchy.
	if self.name != self.end_name:
	message = message.format(c, b)
	message += " Finished building Element, but end tag name ({}) is not the same as start tag name ({}).".format(self.end_name, self.name)
	raise Exception(message)
	# mark element as complete
	self.complete = True
	self.logger.debug("End Element. Move up 1 level.")
	break
	elif b == "/":
	if c == "tag_open":
	c = "end_tag_open"
	handled = True
	elif b in name_characters:
	if c == "start_tag_open":
	c = "start_tag_name"
	self.name += b
	handled = True
	elif c == "start_tag_name":
	self.name += b
	handled = True
	elif c == "start_tag_close":
	c = "data"
	self.logger.debug("switch to Datum.")
	# rewind file by 1 byte
	f.seek(-1, os.SEEK_CUR)
	# pass file object to Datum class
	datum = Datum.from_file_object(f, logger=self.logger)
	self.children.append(datum)
	handled = True
	elif c == "end_tag_open":
	c = "end_tag_name"
	self.end_name += b
	handled = True
	elif c == "end_tag_name":
	self.end_name += b
	handled = True
	elif c == "tag_open":
	# child Element
	c = "data"
	self.logger.debug("switch to Element.")
	# rewind file by 2 bytes
	f.seek(-2, os.SEEK_CUR)
	# pass file object to Element class
	child = Element.from_file_object(f, parent=self, logger=self.logger)
	self.children.append(child)
	handled = True
	elif c == "data":
	self.logger.debug("switch to Datum.")
	# rewind file by 1 byte
	f.seek(-1, os.SEEK_CUR)
	# pass file object to Datum class
	datum = Datum.from_file_object(f, logger=self.logger)
	self.children.append(datum)
	handled = True
	elif b in data_characters:
	if c == "start_tag_close":
	c = "data"
	self.logger.debug("switch to Datum.")
	# rewind file by 1 byte
	f.seek(-1, os.SEEK_CUR)
	# pass file object to Datum class
	datum = Datum.from_file_object(f, logger=self.logger)
	self.children.append(datum)
	handled = True

	if not handled:
	raise Exception(message.format(c, b))
	handled = False




	class Datum:

	def __init__(self):
	self.bytes = ""
	self.parent = None
	self.logger = None

	@classmethod
	def from_file_object(cls, f, parent=None, logger=None):
	d = cls()
	d.parent = parent
	d.logger = logger
	d.process_file_object(f)
	return d

	def __str__(self):
	return "Datum: [{} bytes]".format(len(self.bytes))

	def tree_lines(self):
	tree_line = " " + str(self)
	n = 5 # display this number of bytes from either end of the Datum.
	m = len(self.bytes)
	if m <= n*2:
	tree_line += ": [{}]".format(self.bytes)
	else:
	tree_line += ": [{} ... {}]".format(self.bytes[:n], self.bytes[-n:])
	return [tree_line]

	def process_file_object(self, f):
	c = "data" # c = context
	handled = False # have we handled this combination of context and byte?
	message = "Datum: context [{}], byte [{}]."
	while True:
	b = f.read(1) # b = byte
	if not b:
	break
	self.logger.debug(message.format(c, b))
	if b == "<":
	if c == "data":
	# this is a new Element
	self.logger.debug("End Datum. Move up 1 level.")
	# rewind file by 1 byte
	f.seek(-1, os.SEEK_CUR)
	break
	elif b == ">":
	if c == "data":
	# error
	raise Exception(message.format(c, b))
	elif b == "\\":
	if c == "data":
	c = "escape"
	handled = True
	elif c == "escape":
	self.bytes += b
	c = "data"
	handled = True
	elif b in data_characters:
	if c == "data":
	self.bytes += b
	handled = True

	if not handled:
	raise Exception(message.format(c, b))
	handled = False




	if __name__ == "__main__":
	main()

test_eml_parser.py

	#!/usr/bin/python


	from eml_parser import Element, Datum
	import unittest


	class TestElement(unittest.TestCase):

	def test_upper(self):
	self.assertEqual('foo'.upper(), 'FOO')

	def test_empty_element(self):
	s = "<content></content>"
	e = Element.from_string(s)
	self.assertEqual(e.name, "content")
	self.assertEqual(len(e.children), 0)

	def test_single_start_tag(self):
	s = "<content>"
	with self.assertRaises(Exception):
	Element.from_string(s)

	def test_some_garble(self):
	s = "uahslyfasdlhgfas"
	with self.assertRaises(Exception):
	Element.from_string(s)

	def test_single_end_tag(self):
	s = "</foo>"
	with self.assertRaises(Exception):
	Element.from_string(s)

	def test_element_containing_data(self):
	s = "<content>hello world this is dog</content>"
	e = Element.from_string(s)
	self.assertEqual(e.name, "content")
	self.assertEqual(len(e.children), 1)
	child = e.children[0]
	self.assertIsInstance(child, Datum)

	def test_element_containing_element(self):
	s = "<foo>x<bar>y</bar>b</foo>"
	e = Element.from_string(s)
	self.assertEqual(e.name, "foo")
	self.assertEqual(len(e.children), 3)
	child1 = e.children[0]
	self.assertIsInstance(child1, Datum)
	child2 = e.children[1]
	self.assertIsInstance(child2, Element)
	self.assertEqual(len(child2.children), 1)

	def test_element_followed_by_garble(self):
	s = "<e1>foo</e1>asdf"
	with self.assertRaises(Exception):
	Element.from_string(s)


	if __name__ == '__main__':
	unittest.main()

[spiano@localhost work]$ python --version

Python 2.7.5

[spiano@localhost work]$ python test_eml_parser.py

........
----------------------------------------------------------------------
Ran 8 tests in 0.005s

OK

[spiano@localhost work]$ python eml_parser.py

Edgecase Markup Language (EML) Parser Library:

This library supplies a class that can parse a file object or a string into an EML element tree.

Use:

1)
from eml_parser import Element
with open(file_path, 'r') as f:
root_element = Element.from_file_object(f)

2)
from eml_parser import Element
s = "<content>hello <bold>world</bold></content>"
root_element = Element.from_string(s)

Both class methods accept an optional Python logger object, using the keyword argument "logger".

Example:
import logging
from eml_parser import Element, create_console_logger
logger = create_console_logger(logging.INFO)
with open(file_path, 'r') as f:
root_element = Element.from_file_object(f, logger)

To display the EML element tree:
print(root_element.tree())

[start of notes]

Changes from the original text:
- I have not always preserved the format of any computer output (e.g. from running bash commands). Examples: Setting input lines in bold text, adding/removing newlines in order to make a sequence of commands easier to read, using hyphens for lists and sublists instead of indentation, breaking wide tables into consecutive sections.

[end of notes]

[start of footnotes]

[0]

e.g. an HTML page, as rendered by today's browsers.

[return to main text]

[1]

Reality is admittedly more tangled and involves multiple interacting trees, webs, and shrubs, but a tree is probably the best format that humans can process.

[return to main text]

[2]

This permits analysis of the tree to identify exactly which end tag is missing (if one is missing). This is not possible if a the end tag is always the same e.g. </end> or } or ) or something similar.

[return to main text]

[3]

When using a given character set to write an encoding, you will eventually want to refer to one of the signal characters in the normal text, and this will mean that you will have to use an escape sequence. Some escape sequences include the original signal character, some do not. I have included them in my escape sequences for EML, because I judge the result to be more human-readable in its raw form.

[return to main text]

[4]

There can be empty elements e.g. "<divider></divider>". I chose this format instead of element names that close themselves, as in HTML's "<br/>", because this way the encoding format is less complex.

[return to main text]

[5]

Although you could design a table format, where the initial section would have to be read byte-by-byte, but would specify the table row lengths and the number of columns, and then you'd know how to move quickly around the table without reading and processing all the data within it. With time, this might become a new database system (like MySQL, PostgreSQL, etc), where the entire database was not just machine-readable, but human-readable as well.

[return to main text]

[end of footnotes]