I was looking for another side project to keep myself busy when I’m not doing the day job, and specifically I am enjoying Rust but still don’t feel like I fully understand the borrow checker any why sometimes I code myself into a corner I can’t borrow my way out of. So, find something that forces you to deal with it, a structure with inherent references, something useful but complex enough to be fun. I was working on something else, which included an XML writer and which one day was going to need an XML reader also. So I wondered, what about an implementation of the DOM, I know it, have used it so many times from a bunch of languages, it is a tree structure but has certain links that would make references a pain, so why not?

Well, I was about to find out some really good answers to why not in short order.

Read The F-ing Standards (RTFS)

I have read a lot of industry standards in my time, I’ve even written and contributed to some, I have written software that implements standards too. But nothing quite prepared me for the complexity of the task ahead of me.

Obviously one starts with a quick search for “DOM API” and I found Document Object Model (DOM) Level 1 Specification which had the look of a good starting place. Except that, what you really want is Document Object Model (DOM) Level 2 Core Specification. It turns out that the writers of these things realized that they were dealing with (at least) two communities, those using the DOM as a helpful formalism over a horribly ill-defined and frequently ill-written format namely HTML. But, then there are those needing an API that is a faithful representation of a language more formally defined and in most cases written in a valid and reasonable manner namely XML. So, loosely speaking Level 1 (not version 1) is the set of interfaces required for HTML and Level 2 is a superset of these interfaces that covers XML. OK, So Level 2 Core it is.

But it doesn’t end there. To understand what is a valid name, what constitutes valid content, or implement Attribute-Value Normalization, you really have to dive into Extensible Markup Language (XML) 1.1 (Second Edition) (which supersedes Extensible Markup Language (XML) 1.0 (Fifth Edition) but which you really should keep handy to understand the differences). To understand the error conditions around NAMESPACE_ERR you really need to go read Namespaces in XML 1.1 (Second Edition) as well as The “xml” Namespace. To understand how to implement the DOM Document’s get_element_by_id method you’ll need xml:id Version 1.0. There are also additional specifications dealing with encoding names, language identifiers, and URIs used for namespaces.

Now if you decide to go really nuts and have your processor be schema aware, then you can start on the XML Schema family of specifications; Part 0: Primer Second Edition, Part 1: Structures Second Edition, Part 2: Datatypes Second Edition, W3C XML Schema Definition Language (XSD): Component Designators, and Processing XML 1.1 documents with XML Schema 1.0 processors.

My head hurts already.

A few ways that won’t work

“I have not failed. I’ve just found 10,000 ways that won’t work.” – Thomas Edison.

Getting to the right implementation model/approach took a few tries. My first was to develop traits that mapped 1:1 with the specification and a set of structs that could model nodes. This seemed reasonable but I ended up with myself wrapped up in lifetimes, trying to copy references around and basically making my own life miserable. The second approach was to have a struct for each interface, that also ended up in a mess trying to keep track of lifetimes as I had all kinds of &'a Node fields in these structs.

After a while I decided that honestly doing this from first principals wasn’t productive so what had other people done? So, digging around I found the amxml crate and inside it’s dom module a neat trick:

pub struct NodePtr {
    rc_node: RcNode,
}

type RcNode = Rc<Node>;

fn wrap_rc_clone(rc_node: &RcNode) -> NodePtr {
    return NodePtr{ rc_node: Rc::clone(rc_node) };
}

// ...

struct Node {
    node_type: NodeType,
    order: Cell<i64>,
    name: String,
    value: String,
    parent: Option<RefCell<Weak<Node>>>,
    children: RefCell<Vec<RcNode>>,
    attributes: RefCell<Vec<RcNode>>,
}

The NodePtr values can easily be created, and as their only member is an Rc<Node> (Rc) there can be a number of these structs sharing the same Node value. The vector of children in each Node allow us to retain ownership with Rc, but this doesn’t work well for parent (or for owning_document which isn’t present in amxml). However, a Weak<Node> (Weak) created by Rc::downgrade works well to capture these additional references. Additionally, RefCell is added to allow mutation of elements in subtle ways within the implementation.

The corresponding types in xml_dom are split between a generic rc_cell module:

pub struct RcRefCell<T: Sized> {
    inner: Rc<RefCell<T>>,
}

pub struct WeakRefCell<T: Sized> {
    inner: Weak<RefCell<T>>,
}

and the node_impl module (note that the WeakRefNode is not exposed to clients):

pub type RefNode = RcRefCell<NodeImpl>;

pub(crate) type WeakRefNode = WeakRefCell<NodeImpl>;

Taking an approach where there is a single structure for all node types, NodeImpl, the question is then how to store all the node-type specific details. Here is the final form of NodeImpl.

pub(crate) enum Extension {
    None,
    Document {
        i_xml_declaration: Option<XmlDecl>,
        i_document_element: Option<RefNode>,
        i_document_type: Option<RefNode>,
        i_id_map: HashMap<String, WeakRefNode>,
        i_options: ProcessingOptions,
    },
    DocumentType {
        i_entities: HashMap<Name, RefNode>,
        i_notations: HashMap<Name, RefNode>,
        i_public_id: Option<String>,
        i_system_id: Option<String>,
        i_internal_subset: Option<String>,
    },
    Element {
        i_attributes: HashMap<Name, RefNode>,
        i_namespaces: HashMap<Option<String>, String>,
    },
    Entity {
        i_public_id: Option<String>,
        i_system_id: Option<String>,
        i_notation_name: Option<String>,
    },
    Notation {
        i_public_id: Option<String>,
        i_system_id: Option<String>,
    },
}

pub struct NodeImpl {
    pub(crate) i_node_type: NodeType,
    pub(crate) i_name: Name,
    pub(crate) i_value: Option<String>,
    pub(crate) i_parent_node: Option<WeakRefNode>,
    pub(crate) i_owner_document: Option<WeakRefNode>,
    pub(crate) i_child_nodes: Vec<RefNode>,
    pub(crate) i_extension: Extension,
}

It’s Alive!!

I have by now checked in a couple of versions and created the crate xml_dom, and while not all things work as they should, some very basic things do.

use xml_dom::*;
use xml_dom::convert::*;

let implementation = get_implementation();
let mut document_node = implementation
    .create_document("http://www.w3.org/1999/xhtml", "html", None)
    .unwrap();
println!("document 1: {:#?}", document_node);

let document = as_document_mut(&mut document_node).unwrap();
let mut root_node = document.document_element().unwrap();

let root = as_element_mut(&mut root_node).unwrap();
root.set_attribute("lang", "en");
let _head = root.append_child(document.create_element("head").unwrap());
let _body = root.append_child(document.create_element("body").unwrap());

let xml = document_node.to_string();
println!("document 2: {}", xml);

Now, it’s not necessarily pretty, but in general DOM programming rarely is. There’s a lot of casting between node types and so on, but it works. Also note the need for a non-specification get_implementation function to deal with the following in §1.1.2. Memory Management:

The DOM Level 2 API does not define a standard way to create DOMImplementation objects; DOM implementations must provide some proprietary way of bootstrapping these DOM interfaces, and then all other objects can be built from there.

That got big, fast

After checking in code to add support for XML declarations (why isn’t that a part of the DOM?) and was considering writing this post, I actually wondered, how big had this gotten? Well, it turns out that scc seems to think I wrote ~7 months of work in the last two and a half weeks.

───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        26      8029      737      2695     4597        532
License                      1        21        4         0       17          0
Markdown                     1        84       21         0       63          0
TOML                         1        14        1         0       13          0
gitignore                    1         4        1         0        3          0
───────────────────────────────────────────────────────────────────────────────
Total                       30      8152      764      2695     4693        532
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $136,967
Estimated Schedule Effort 7.207851 months
Estimated People Required 2.250957
───────────────────────────────────────────────────────────────────────────────

What it doesn’t know is that those 2,695 lines of comments are mostly copied from the various W3C specifications either into the documentation produced by cargo doc or internally as reminders about various rules and validation requirements. But still, I didn’t know I typed that fast.

Also, where can I apply for the $137K?

And yet, so much more

The DOM doesn’t actually cover the XML declaration (Definition: XML 1.1 documents MUST begin with an XML declaration which specifies the version of XML being used.) which seems awkward if it’s a required part of XML 1.1.

Now, how do we parse text into the DOM? I’d rather not go further down this rabbit hole and build my own parser, there are some existing crates out there, but not all of them parse all the components required to construct a high-fidelity DOM.

Is the implementation of Display for RefNode enough to write XML out again?