A few weeks ago I set about creating a new markup language. I wanted to learn more about language parsing, grammars, and the various difficulties involved.

I also had a very specific idea of what I wanted to create: a dead simple alternative to HTML. I’d recently picked up SASS and tried to draw on its succinctness to inspire me. CSS itself is quite succinct in how it declares elements, IDs, classes and attributes. And SASS, drawing on its own inspiration, HAML, adds the elegance of tabbed nesting.

I’d done something similar a while ago, allowing you to get DOM structures from basic CSS selectors:

ul li:5 span[innerHTML="item"]

Using satisfy() this becomes:

<ul>
    <li><span>item</span></li>
    <li><span>item</span></li>
    <li><span>item</span></li>
    <li><span>item</span></li>
    <li><span>item</span></li>
</ul>

But I didn’t want to stop there; I wanted to create a way to define entire HTML documents with minimal syntax, i.e. allowing you to write stuff like:

html
  head
    title 'something'
  body
    h1 a[href=/] 'something'

Creating the parser

I began by looking into PEGjs, a really impressive parser generator for JavaScript. It allows you to specify the rules of your grammar like so:

Single
  = Attribute
  / Element
  / Text
  / Directive
 
//...
 
Attribute
  = name:AttributeName _ ":" _ value:Value (_ ";")? {
    // This bit is just regular JavaScript...
    return ['Attribute', [name, value]];
  }

The above specifies the grammar rule Single, which lists the various valid “Single” definitions, such as Attribute, which is also specified above. The Attribute rule references AttributeName:

AttributeName
  = name:[A-Za-z0-9-_]+ { return name.join(''); }
  / String

An AttributeName can be a string of characters matching the pattern [A-Za-z0-9-_]+ or a String (wrapped in quotes), which is also specified in the grammar.

It’s seemingly dead-simple, although there are gotchas like left recursion and poisonously inefficient backtracking. At one point it was taking my parser 700ms to parse this:

a {
  b {
    c {}
  }
}

I found that I was writing rules in a way that caused a lot of backtracking, i.e. when the parser tried a rule and failed on it, it would go back to the initial character and try the next alternative. In a nutshell, don’t do this:

SomeRule
  = [a-zA-Z]+ '::' [0-9]+ ';'
  / [a-zA-Z]+ '::' [0-9]+

Instead, just make the semi-colon optional:

SomeRule
  = [a-zA-Z]+ '::' [0-9]+ ';'?

This may seem trivial but it’s not always easy to spot in higher-level rules. Small optimisations like this matter.

I was able to get that ridiculous 700ms down to 5ms! And there are still improvements to be made.
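To make the cost of that backtracking concrete, here is a rough sketch using a few hand-rolled PEG-style combinators. This is not PEG.js and not how its generated parsers work internally; the helper names (lit, oneOrMore, seq, choice, optional, countSteps) are made up for illustration. It simply counts character-level match attempts for the two versions of SomeRule above.

```javascript
// Tiny hand-rolled PEG-style combinators (hypothetical, not PEG.js itself).
// Each parser takes (input, pos) and returns the new position, or -1 on failure.
let steps = 0; // counts every match attempt

const lit = (s) => (input, pos) => {
  steps += 1;
  return input.startsWith(s, pos) ? pos + s.length : -1;
};

const oneOrMore = (re) => (input, pos) => {
  let p = pos;
  while (p < input.length) {
    steps += 1;
    if (!re.test(input[p])) break;
    p += 1;
  }
  return p > pos ? p : -1;
};

const seq = (...parsers) => (input, pos) => {
  let p = pos;
  for (const parser of parsers) {
    p = parser(input, p);
    if (p === -1) return -1;
  }
  return p;
};

// Ordered choice: on failure, retry the next alternative from the same start.
const choice = (...parsers) => (input, pos) => {
  for (const parser of parsers) {
    const p = parser(input, pos);
    if (p !== -1) return p;
  }
  return -1;
};

const optional = (parser) => (input, pos) => {
  const p = parser(input, pos);
  return p === -1 ? pos : p;
};

const letters = oneOrMore(/[a-zA-Z]/);
const digits = oneOrMore(/[0-9]/);

// The "don't do this" version: two alternatives sharing a long prefix.
const withAlternatives = choice(
  seq(letters, lit('::'), digits, lit(';')),
  seq(letters, lit('::'), digits)
);

// The preferred version: one sequence with an optional semi-colon.
const withOptional = seq(letters, lit('::'), digits, optional(lit(';')));

function countSteps(parser, input) {
  steps = 0;
  const end = parser(input, 0);
  return { matched: end === input.length, steps };
}

console.log(countSteps(withAlternatives, 'abc::123')); // { matched: true, steps: 17 }
console.log(countSteps(withOptional, 'abc::123'));     // { matched: true, steps: 9 }
```

When the semi-colon is missing, the two-alternative rule scans the whole prefix twice. On a three-line rule that is nothing, but the same pattern in a rule near the top of a grammar repeats all the work of every rule beneath it.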

Creating the generator

The generator would have to be able to take output from the parser and generate HTML from it. From a string like a b c the parser outputs a structure like this:

[
   "Element",
   [
      [
         [
            "Tag",
            "a"
         ]
      ],
      [
         [
            "Element",
            [
               [
                  [
                     "Tag",
                     "b"
                  ]
               ],
               [
                  [
                     "Element",
                     [
                        [
                           [
                              "Tag",
                              "c"
                           ]
                        ]
                     ]
                  ]
               ]
            ]
         ]
      ]
   ]
]

The HTML generation was quite simple to do. Essentially, I treated every Element as an entity that can have children. An Element’s children could be other Elements, Attributes, Text or even custom directives. So, this:

label {
  label: foo;
  input#foo
}

Would parse to:

[
   "Element",
   [
      [
         [
            "Tag",
            "label"
         ]
      ],
      [
         [
            "IncGroup",
            [
               [
                  "Attribute",
                  [
                     "label",
                     "foo"
                  ]
               ],
               [
                  "Element",
                  [
                     [
                        [
                           "Tag",
                           "input"
                        ],
                        [
                           "Id",
                           "foo"
                        ]
                     ]
                  ]
               ]
            ]
         ]
      ]
   ]
]

Essentially, the hierarchy that you originally write is reflected in the tree output by the parser. The generator can then just recurse through this structure, creating HTML strings as it goes along.
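As a rough illustration of that recursion, here is a minimal sketch that walks the two trees shown above. It is not SIML’s actual generator: the names generateElement, collect and VOID_TAGS are hypothetical, the div fallback is an assumption, and values are assumed to be HTML-safe already, so no escaping is done.

```javascript
// Sketch only: a recursive walk over the tree shape the parser outputs.
// All names are hypothetical; values are assumed to be HTML-safe already.
const VOID_TAGS = new Set(['input', 'br', 'img', 'hr']);

// Adds one child node's contribution (an attribute or nested HTML) to `target`.
function collect(node, target) {
  const [type, body] = node;
  if (type === 'Attribute') {
    target.attrs.push(body[0] + '="' + body[1] + '"');
  } else if (type === 'IncGroup') {
    body.forEach((child) => collect(child, target));
  } else if (type === 'Element') {
    target.html += generateElement(node);
  }
}

function generateElement(node) {
  // An Element body is [selectorParts] or [selectorParts, children].
  const [selectorParts, children = []] = node[1];
  const target = { attrs: [], html: '' };
  let tag = 'div'; // assumed fallback when the selector has no Tag part
  for (const [kind, value] of selectorParts) {
    if (kind === 'Tag') tag = value;
    if (kind === 'Id') target.attrs.push('id="' + value + '"');
  }
  children.forEach((child) => collect(child, target));
  const open = '<' + [tag].concat(target.attrs).join(' ') + '>';
  return VOID_TAGS.has(tag) ? open : open + target.html + '</' + tag + '>';
}

// The tree for "a b c", as shown earlier.
const abcTree =
  ['Element', [
    [['Tag', 'a']],
    [
      ['Element', [
        [['Tag', 'b']],
        [
          ['Element', [
            [['Tag', 'c']]
          ]]
        ]
      ]]
    ]
  ]];

// The tree for the label example, as shown earlier.
const labelTree =
  ['Element', [
    [['Tag', 'label']],
    [
      ['IncGroup', [
        ['Attribute', ['label', 'foo']],
        ['Element', [
          [['Tag', 'input'], ['Id', 'foo']]
        ]]
      ]]
    ]
  ]];

console.log(generateElement(abcTree));   // <a><b><c></c></b></a>
console.log(generateElement(labelTree)); // <label label="foo"><input id="foo"></label>
```

Each Element gathers attributes and child HTML from its children before its own string is built, which is why an Attribute nested inside an IncGroup still ends up on the enclosing tag.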

For example, this is the default generator for HTML attributes:

//...
    _default: {
      type: 'ATTR',
      make: function(attrName, value) {
        if (value == null) {
          return attrName;
        }
        return attrName + '="' + escapeHTML(value) + '"';
      }
    },
  //...

This would make `for:foo;` output the HTML `for="foo"`.
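To try that default in isolation, something like the following should work. The escapeHTML helper isn’t shown in this post, so the version here is a hypothetical stand-in, and defaultAttribute is just a name chosen for the example.

```javascript
// Hypothetical stand-in for the escapeHTML helper the generator calls.
function escapeHTML(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Same logic as the default attribute generator above, pulled out on its own.
const defaultAttribute = {
  type: 'ATTR',
  make: function (attrName, value) {
    if (value == null) {
      return attrName; // valueless attribute, e.g. "disabled"
    }
    return attrName + '="' + escapeHTML(value) + '"';
  }
};

console.log(defaultAttribute.make('for', 'foo'));         // for="foo"
console.log(defaultAttribute.make('disabled'));           // disabled
console.log(defaultAttribute.make('title', 'a "b" & c')); // title="a &quot;b&quot; &amp; c"
```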

Fun feature: Exclusives

The fake power you feel when creating a language frequently manifests in strange features and syntax. That’s what happened here. Although I do genuinely feel that this particular one is useful.

I’m talking about “Exclusive Groups”. When writing your CSS-style selectors, you can specify alternates within parentheses, and these are then expanded so that the resulting HTML covers all the potential combinations. An example:

x (a/b) // expands to: "x a, x b"

That would give you:

<x>
  <a></a>
</x>
<x>
  <b></b>
</x>

A more complex example:

(a/b) (x/y)

That would give you:

<a><x></x></a>
<a><y></y></a>
<b><x></x></b>
<b><y></y></b>

The original selector (a/b) (x/y) expanded to a x, a y, b x, b y.

A little nifty, a little pointless.. perhaps. Although it can be useful:

ul li ('A'/'List'/'Of'/'Stuff')

(becomes)

<ul>
  <li>A</li>
  <li>List</li>
  <li>Of</li>
  <li>Stuff</li>
</ul>
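The expansion of exclusive groups described above is essentially a cartesian product over the parts of a selector. Here is a minimal sketch, assuming the selector has already been split into parts, with each part given as a list of alternatives. The function name expandExclusives is hypothetical and this is not SIML’s actual code.

```javascript
// Each part of the selector is a list of alternatives; a plain part has one entry.
// Returns every combination, in the order the alternatives were written.
function expandExclusives(parts) {
  return parts.reduce(
    (combos, alternatives) =>
      combos.flatMap((combo) => alternatives.map((alt) => combo.concat(alt))),
    [[]]
  );
}

// "(a/b) (x/y)"
const expanded = expandExclusives([['a', 'b'], ['x', 'y']]).map((c) => c.join(' '));
console.log(expanded); // [ 'a x', 'a y', 'b x', 'b y' ]

// "x (a/b)"
const simple = expandExclusives([['x'], ['a', 'b']]).map((c) => c.join(' '));
console.log(simple); // [ 'x a', 'x b' ]
```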

Indentation

I wanted there to be the option to use traditional CSS curlies to demarcate nestings. I.e.

div {
  ul {
    li {
      //...
    }
  }
}

But I also wanted auto-nesting via indentation, like in SASS:

div
  ul
    li
      //...

Stuff became tricky, quickly. The problem with auto-nesting is that the expected behaviour can become ambiguous:

section
    h1
        em
      span
    div
        p

Furthermore, you have to contend with spaces and tabs. Which one counts as a single level of indentation?

The solution I eventually settled on was simply letting the user mess stuff up themselves, if they wanted. The parser counts levels of indentation by how many whitespace characters you have. I’d like to add an error that’s thrown if the user’s silly enough to mix tabs and spaces. For now, though, they’ll have to suffer. There is an inherent ambiguity in this kind of magic. What should the parser do with this?

body
  div
    p {
    span
  em
    }

Right now, we assume, because the user has opted to use curlies on the p element, that the auto-nesting should be turned off until the curly closes. Another option would be to reset the indentation counter to zero and try to resolve children regularly. But the above code is still ambiguous. Should an error be thrown? Maybe “SyntaxError: What on earth are you doing?”
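For the curly-free case, counting whitespace characters can be sketched roughly like this. It is one possible policy rather than SIML’s actual implementation: a line becomes a child of the nearest preceding line with strictly less leading whitespace, and tabs and spaces each count as one character. The names nestByIndentation and outline are hypothetical.

```javascript
// Builds a tree from indentation alone. Each line is attached to the nearest
// earlier line that has strictly fewer leading whitespace characters.
function nestByIndentation(source) {
  const root = { name: '(root)', indent: -1, children: [] };
  const stack = [root];
  for (const line of source.split('\n')) {
    if (line.trim() === '') continue; // skip blank lines
    const indent = line.length - line.trimStart().length; // tabs and spaces count alike
    const node = { name: line.trim(), indent, children: [] };
    while (stack[stack.length - 1].indent >= indent) stack.pop();
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

// Prints the resolved tree with a uniform two-space indent per level.
function outline(nodes, depth = 0) {
  return nodes
    .map((n) => '  '.repeat(depth) + n.name + '\n' + outline(n.children, depth + 1))
    .join('');
}

// The ambiguous example from above, where span sits between two levels.
const source = [
  'section',
  '    h1',
  '        em',
  '      span',
  '    div',
  '        p'
].join('\n');

console.log(outline(nestByIndentation(source)));
// section
//   h1
//     em
//     span
//   div
//     p
```

Under this policy the oddly indented span ends up as a sibling of em inside h1. That is a defensible reading, but not the only one, which is exactly the ambiguity being described.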

Is it done? What is it?

Yeh, it’s done, more or less.

Technically, it’s an HTML preprocessor. It’s not a templating engine. It doesn’t do that. Reasons are as follows:

  1. Feature bloat
  2. People still write plain ol’ HTML
  3. Pure DOM templates are on the rise. See AngularJS or Knockout.

Also: client-side templating is a minefield of different approaches. I’ll stay out if I can.

SIML can cater to the DOM template style quite gracefully. This is using SIML’s Angular generator:

ul#todo-list > li
  @repeat( todo in todos | filter:statusFilter )
  @class({
    completed: todo.completed,
    editing: todo == editedTodo
  })

That produces:

<ul id="todo-list">
  <li
    ng-repeat="todo in todos | filter:statusFilter"
    ng-class="{ completed: todo.completed, editing: todo == editedTodo }"
  ></li>
</ul>

The @foo things you see above are directives. You can create your own in a new generator, if you so wish. The Angular generator, by default, will create ng- HTML attributes from undefined pseudo-classes and directives. So I could do:

div:cloak
  @show(items.length)

And that would generate:

<div ng-cloak ng-show="items.length"></div>
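The fallback behaviour described above boils down to prefixing the name with ng- and attaching the value, if there is one. A minimal sketch of that idea follows, with the helper name toNgAttribute made up for illustration; the real Angular generator is not shown in this post and may well differ.

```javascript
// Hypothetical helper: turns an otherwise-undefined directive or pseudo-class
// name (plus optional value) into an ng- attribute string.
function toNgAttribute(name, value) {
  const attr = 'ng-' + name;
  if (value == null) {
    return attr; // e.g. :cloak becomes ng-cloak
  }
  const escaped = String(value).replace(/&/g, '&amp;').replace(/"/g, '&quot;');
  return attr + '="' + escaped + '"';
}

const attrs = [toNgAttribute('cloak'), toNgAttribute('show', 'items.length')];
const html = '<div ' + attrs.join(' ') + '></div>';
console.log(html); // <div ng-cloak ng-show="items.length"></div>
```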

Ideas and paths

It’s early days and I’m not even sure if SIML provides enough value as-is, but I do think it could serve devs quite well for the following use-cases:

  • Creating boilerplate HTML code quickly
  • Creating cleaner AngularJS/Knockout markup
  • Creating bespoke directives/pseudo-classes/attributes to serve your needs

The last point is quite powerful, I think. Imagine having a bunch of pre-defined directives that would allow you to do stuff like:

#sidebar
  input
    @datepicker({
      start: [2013,01,01]
    })

Closing remarks

As a learning exercise it was very valuable. I hope, as a happy accident, I’ve created something potentially useful to others.