Building SIML: A new markup language

A few weeks ago I set about creating a new markup language. I wanted to learn more about language parsing, grammars, and the various difficulties involved.

I also had a very specific idea of what I wanted to create: a dead simple alternative to HTML. I’d recently picked up SASS and tried to draw on its succinctness to inspire me. CSS itself is quite succinct in how it declares elements, IDs, classes and attributes. And SASS, drawing on its own inspiration, HAML, adds the elegance of tabbed nesting.

I’d done something similar a while ago, allowing you to get DOM structures from basic CSS selectors:

ul li:5 span[innerHTML="item"]

Using satisfy() this becomes:

<ul>
    <li><span>item</span></li>
    <li><span>item</span></li>
    <li><span>item</span></li>
    <li><span>item</span></li>
    <li><span>item</span></li>
</ul>

But I didn’t want to stop there; I wanted a way to define entire HTML documents with minimal syntax, allowing you to write stuff like:

html
  head
    title 'something'
  body
    h1 a[href=/] 'something'

Creating the parser

I began by looking into PEGjs, a really impressive parser generator for JavaScript. It allows you to specify the rules of your grammar like so:

Single
  = Attribute
  / Element
  / Text
  / Directive
 
//...
 
Attribute
  = name:AttributeName _ ":" _ value:Value (_ ";")? {
    // This bit is just regular JavaScript...
    return ['Attribute', [name, value]];
  }

The above specifies the grammar rule Single, which lists the valid “Single” definitions, such as Attribute, which is also specified above. The Attribute rule references AttributeName:

AttributeName
  = name:[A-Za-z0-9-_]+ { return name.join(''); }
  / String

An AttributeName can be a string of characters matching the pattern [A-Za-z0-9-_]+ or a String (wrapped in quotes), which is also specified in the grammar.
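To see why the rule's action calls `name.join('')`: in PEG.js a `[...]+` expression yields an array of the matched characters rather than a single string. Here is a rough hand-written equivalent of the first alternative in plain JavaScript. The function name and return shape are illustrative assumptions, not something PEG.js generates.

```javascript
// Hand-written sketch of AttributeName's first alternative (illustrative only).
// It collects matching characters one at a time, like `[A-Za-z0-9-_]+` does,
// then joins them, mirroring the `name.join('')` action in the grammar.
function parseAttributeName(input, pos = 0) {
  const chars = [];
  while (pos < input.length && /[A-Za-z0-9_-]/.test(input[pos])) {
    chars.push(input[pos]);
    pos++;
  }
  // No characters matched: this alternative fails, and the grammar
  // would move on to try the String alternative instead.
  if (chars.length === 0) return null;
  return { value: chars.join(''), end: pos };
}

console.log(parseAttributeName('data-id: 5')); // { value: 'data-id', end: 7 }
console.log(parseAttributeName(': 5'));        // null
```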

It’s seemingly dead-simple, although there are gotchas like left recursion and poisonously inefficient backtracking. At one point it was taking my parser 700ms to parse this:

a {
  b {
    c {}
  }
}

I found that I was writing rules in a way that caused a lot of backtracking: when the parser tried a rule and failed on it, it would go back to the initial character and try the next alternative. In a nutshell, don’t do this:

SomeRule
  = [a-zA-Z]+ '::' [0-9]+ ';'
  / [a-zA-Z]+ '::' [0-9]+

Instead, just make the semi-colon optional:

SomeRule
  = [a-zA-Z]+ '::' [0-9]+ ';'?

This may seem trivial, but it’s not always easy to spot in higher-level rules. Small optimisations like this matter.

I was able to get that ridiculous 700ms down to 5ms! And there are still improvements to be made.
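To make the cost of that backtracking concrete, here is a small hand-rolled sketch (not PEG.js output) that counts how many characters each formulation of SomeRule examines. Every name in it (`makeScanner`, `withAlternatives`, `withOptional`) is made up for illustration.

```javascript
// Illustrative sketch: count character reads for both versions of SomeRule.
function makeScanner(input) {
  let reads = 0;
  return {
    get reads() { return reads; },
    // Match one or more characters satisfying `test`, starting at `pos`.
    many(test, pos) {
      let i = pos;
      while (i < input.length && (reads++, test(input[i]))) i++;
      return i > pos ? i : -1;
    },
    // Match the literal `text` at `pos`.
    literal(text, pos) {
      for (let k = 0; k < text.length; k++) {
        reads++;
        if (input[pos + k] !== text[k]) return -1;
      }
      return pos + text.length;
    }
  };
}

const isLetter = c => /[a-zA-Z]/.test(c);
const isDigit = c => /[0-9]/.test(c);

// The shared prefix: [a-zA-Z]+ '::' [0-9]+
function prefix(s, pos) {
  let p = s.many(isLetter, pos);
  if (p < 0) return -1;
  p = s.literal('::', p);
  if (p < 0) return -1;
  return s.many(isDigit, p);
}

// Two alternatives: when the first fails, start again from the first character.
function withAlternatives(input) {
  const s = makeScanner(input);
  let p = prefix(s, 0);
  if (p >= 0) p = s.literal(';', p);
  if (p < 0) p = prefix(s, 0); // backtrack and rescan the whole prefix
  return { matched: p === input.length, reads: s.reads };
}

// One rule with an optional ';': the prefix is scanned only once.
function withOptional(input) {
  const s = makeScanner(input);
  let p = prefix(s, 0);
  if (p >= 0) {
    const q = s.literal(';', p);
    if (q >= 0) p = q;
  }
  return { matched: p === input.length, reads: s.reads };
}

console.log(withAlternatives('abc::123')); // { matched: true, reads: 19 }
console.log(withOptional('abc::123'));     // { matched: true, reads: 10 }
```

On a single flat rule the rescan roughly doubles the work. When the same pattern appears in rules that nest inside each other, each level can repeat the rescan of everything beneath it, which is how a tiny input can end up taking hundreds of milliseconds.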

Creating the generator

The generator would have to be able to take output from the parser and generate HTML from it. From a string like a b c the parser outputs a structure like this:

The HTML generation was quite simple to do. Essentially, I treated every Element as an entity that can have children. An Element’s children could be other Elements, Attributes, Text or even custom directives. So, this:

label {
  label: foo;
  input#foo
}

Would parse to:

[
   "Element",
   [
      [
         [
            "Tag",
            "label"
         ]
      ],
      [
         [
            "IncGroup",
            [
               [
                  "Attribute",
                  [
                     "label",
                     "foo"
                  ]
               ],
               [
                  "Element",
                  [
                     [
                        [
                           "Tag",
                           "input"
                        ],
                        [
                           "Id",
                           "foo"
                        ]
                     ]
                  ]
               ]
            ]
         ]
      ]
   ]
]

Essentially, the hierarchy that you originally write is reflected in the tree output by the parser. The generator can then just recurse through this structure, creating HTML strings as it goes along.
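As a rough illustration of that recursion, here is a minimal sketch that walks the tree shape shown above. It is not SIML's actual generator: it only handles the node types that appear in the example (Element, Tag, Id, IncGroup, Attribute), and the `div` default, the void-element list and the escaping helper are my own assumptions.

```javascript
// Minimal recursive generator for the example tree (illustrative only).
const VOID_ELEMENTS = new Set(['input', 'br', 'hr', 'img', 'meta', 'link']);

function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

function generate(node) {
  const [type, payload] = node;
  if (type !== 'Element') {
    throw new Error('Unsupported node type: ' + type);
  }
  // An Element's payload is [selectorParts, children]; children may be absent.
  const [selector, children = []] = payload;
  let tag = 'div'; // assumed default when no Tag part is present
  const attrs = [];
  let inner = '';

  for (const [partType, value] of selector) {
    if (partType === 'Tag') tag = value;
    else if (partType === 'Id') attrs.push('id="' + escapeAttr(value) + '"');
  }

  for (const [groupType, groupChildren] of children) {
    if (groupType !== 'IncGroup') continue; // only the group type shown above
    for (const child of groupChildren) {
      if (child[0] === 'Attribute') {
        const [name, value] = child[1];
        attrs.push(name + '="' + escapeAttr(value) + '"');
      } else if (child[0] === 'Element') {
        inner += generate(child); // recurse into nested elements
      }
    }
  }

  const open = '<' + [tag].concat(attrs).join(' ') + '>';
  return VOID_ELEMENTS.has(tag) ? open : open + inner + '</' + tag + '>';
}

// The tree from the example above, written compactly.
const labelTree = ['Element', [
  [['Tag', 'label']],
  [['IncGroup', [
    ['Attribute', ['label', 'foo']],
    ['Element', [[['Tag', 'input'], ['Id', 'foo']]]]
  ]]]
]];

console.log(generate(labelTree));
// <label label="foo"><input id="foo"></label>
```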

For example, this is the default generator for HTML attributes:

//...
    _default: {
      type: 'ATTR',
      make: function(attrName, value) {
        if (value == null) {
          return attrName;
        }
        return attrName + '="' + escapeHTML(value) + '"';
      }
    },
  //...

This would make `for:foo;` output the HTML `for="foo"`.
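The fragment relies on an `escapeHTML` helper that isn’t shown in the post. Below is a self-contained version you can run, where the `escapeHTML` body and the `defaultAttribute` name are assumptions made for illustration. It also shows what the `value == null` check is for: an attribute with no value comes out bare.

```javascript
// Assumed helper: the post calls escapeHTML() but does not show its body.
function escapeHTML(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

const defaultAttribute = {
  type: 'ATTR',
  make: function(attrName, value) {
    // `== null` is true for both null and undefined, so a missing value
    // yields a bare boolean attribute such as `disabled`.
    if (value == null) {
      return attrName;
    }
    return attrName + '="' + escapeHTML(value) + '"';
  }
};

console.log(defaultAttribute.make('for', 'foo'));   // for="foo"
console.log(defaultAttribute.make('disabled'));     // disabled
console.log(defaultAttribute.make('title', 'say "hi" & <wave>'));
// title="say &quot;hi&quot; &amp; &lt;wave&gt;"
```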

Fun feature: Exclusives

The fake power you feel when creating a language frequently manifests in strange features and syntax. That’s what happened here. Although I do genuinely feel that this particular one is useful.

I’m talking about “Exclusive Groups”. When writing your CSS-style selectors, you can specify alternates within parentheses, and these are then expanded so that the resulting HTML covers all the potential combinations. An example:

x (a/b) // expands to: "x a, x b"

That would give you:

<x>
  <a></a>
</x>
<x>
  <b></b>
</x>

A more complex example:

(a/b) (x/y)

That would give you:

<a><x></x></a>
<a><y></y></a>
<b><x></x></b>
<b><y></y></b>

The original selector (a/b) (x/y) expanded to a x, a y, b x, b y.
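The expansion amounts to a Cartesian product over the groups. Here is a minimal sketch, assuming the selector has already been split into parts. The `expandExclusives` name and the array-of-alternatives input are illustrative, not SIML’s internals.

```javascript
// Sketch: expand exclusive groups into every combination.
// Each part is an array of alternatives; a plain part has a single alternative.
function expandExclusives(parts) {
  return parts.reduce(
    (combos, alternatives) =>
      // Extend every combination built so far with each alternative in turn.
      combos.flatMap(combo => alternatives.map(alt => combo.concat(alt))),
    [[]]
  );
}

const expanded = expandExclusives([['a', 'b'], ['x', 'y']])
  .map(combo => combo.join(' '));

console.log(expanded.join(', ')); // a x, a y, b x, b y
```

Because the first group varies slowest, the combinations come out in the same order as the HTML listed above.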

A little nifty, a little pointless, perhaps. Although it can be useful:

ul li ('A'/'List'/'Of'/'Stuff')

(becomes)

<ul>
  <li>A</li>
  <li>List</li>
  <li>Of</li>
  <li>Stuff</li>
</ul>

Indentation

I wanted there to be the option to use traditional CSS curlies to demarcate nestings. I.e.

div {
  ul {
    li {
      //...
    }
  }
}

But I also wanted auto-nesting via indentation, like in SASS:

div
  ul
    li
      //...

Stuff became tricky, quickly. The problem with auto-nesting is that the expected behaviour can become ambiguous:

section
    h1
        em
      span
    div
        p

Furthermore, you have to contend with spaces and tabs. Which one counts as a single level of indentation?

The solution I eventually settled on was simply letting users mess stuff up themselves, if they want. The parser counts levels of indentation by how many whitespace characters you have. I’d like to add an error that’s thrown if the user’s silly enough to mix tabs and spaces. For now, though, they’ll have to suffer. There is an inherent ambiguity in this kind of magic. What should the parser do with this?

body
  div
    p {
    span
  em
    }

Right now, we assume, because the user has opted to use curlies on the p element, that the auto-nesting should be turned off until the curly closes. Another option would be to reset the indentation counter to zero and try to resolve children regularly. But the above code is still ambiguous. Should an error be thrown? Maybe “SyntaxError: What on earth are you doing?”
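For the plain indentation case (no curlies), the whitespace-counting approach can be sketched with a stack. This is an illustration of the general technique, not SIML’s parser, and `nestByIndentation` and `outline` are made-up names. Each leading space or tab counts as one unit, and a line becomes a child of the nearest earlier line with strictly less indentation.

```javascript
// Sketch: nest lines by counting leading whitespace characters.
function nestByIndentation(source) {
  const root = { name: null, indent: -1, children: [] };
  const stack = [root];

  for (const line of source.split('\n')) {
    if (line.trim() === '') continue; // ignore blank lines
    const indent = line.match(/^[ \t]*/)[0].length;
    const node = { name: line.trim(), indent, children: [] };

    // Pop until the top of the stack is shallower than this line.
    while (stack[stack.length - 1].indent >= indent) stack.pop();

    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

// Print the resolved tree with normalised two-space indentation.
function outline(nodes, depth = 0) {
  return nodes
    .map(n => '  '.repeat(depth) + n.name + '\n' + outline(n.children, depth + 1))
    .join('');
}

// The ambiguous example from earlier in this section.
const source = [
  'section',
  '    h1',
  '        em',
  '      span',
  '    div',
  '        p'
].join('\n');

const tree = nestByIndentation(source);
console.log(outline(tree));
// section
//   h1
//     em
//     span
//   div
//     p
```

Under this rule the oddly indented span ends up as a sibling of em inside h1. That is one defensible reading, but not the only one, which is exactly the ambiguity described above.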

Is it done? What is it?

Yeh, it’s done, more or less.

It’s called SIML.
You can try it here!

Technically, it’s an HTML preprocessor. It’s not a templating engine. It doesn’t do that. Reasons are as follows:

Feature bloat
People still write plain ol’ HTML
Pure DOM templates are on the rise. See AngularJS or Knockout.

Also: client-side templating is a minefield of different approaches. I’ll stay out if I can.

SIML can cater to the DOM template style quite gracefully. This is using SIML’s Angular generator:

ul#todo-list > li
  @repeat( todo in todos | filter:statusFilter )
  @class({
    completed: todo.completed,
    editing: todo == editedTodo
  })

That produces:

<ul id="todo-list">
  <li
    ng-repeat="todo in todos | filter:statusFilter"
    ng-class="{ completed: todo.completed, editing: todo == editedTodo }"
  ></li>
</ul>

The @foo things you see above are directives. You can create your own in a new generator, if you so wish. The Angular generator, by default, will create ng- HTML attributes from undefined pseudo-classes and directives. So I could do:

div:cloak
  @show(items.length)

And that would generate:

<div ng-cloak ng-show="items.length"></div>

Ideas and paths

It’s early days and I’m not even sure if SIML provides enough value as-is, but I do think it could serve devs quite well for the following use-cases:

Creating boilerplate HTML code quickly
Creating cleaner AngularJS/Knockout markup (Example)
Creating bespoke directives/pseudo-classes/attributes to serve your needs

The last point is quite powerful, I think. Imagine having a bunch of pre-defined directives that would allow you to do stuff like:

#sidebar
  input
    @datepicker({
      start: [2013,01,01]
    })

Closing remarks

As a learning exercise it was very valuable. I hope, as a happy accident, I’ve created something potentially useful to others.

Thanks for reading! Please share your thoughts with me on Twitter. Have a great day!

Lars March 17th, 2013 at 11:45 pm

Looks promising! Currently I’m using jade (http://jade-lang.com/) for that purpose.. Do you know it?

James March 17th, 2013 at 11:50 pm

@Lars, Yeh I found Jade a few days after starting work on this — and I do like it. It’s concise and expressive. I did find some of the notation to be quite alien to me though (although thankfully less alien than HAML). What I originally wanted for SIML was something that anyone familiar with CSS (and maybe SASS) could pick up quite quickly.

Ricardo Rodrigues March 18th, 2013 at 12:13 am

It’s a quite entertaining/educational experience, even though there are some good options out there, like the already mentioned jade, and also zen coding, which has been out there for a while and now it’s even available within Visual Studio 2012 with its latest official update. The latter allows stuff like this:

div#myId>select>option[value=someValue]*5>lorem

generate:

Lorem ipsum dolor sit amet, consectetur adipiscing elit fusce vel sapien elit in malesuada semper mi, id sollicitudin urna fermentum ut fusce varius nisl ac ipsum gravida vel pretium tellus.
Tincidunt integer eu augue augue nunc elit dolor, luctus placerat scelerisque euismod, iaculis eu lacus nunc mi elit, vehicula ut laoreet ac, aliquam sit amet justo nunc tempor, metus vel.
Placerat suscipit, orci nisl iaculis eros, a tincidunt nisi odio eget lorem nulla condimentum tempor mattis ut vitae feugiat augue cras ut metus a risus iaculis scelerisque eu ac ante.
Fusce non varius purus aenean nec magna felis fusce vestibulum velit mollis odio sollicitudin lacinia aliquam posuere, sapien elementum lobortis tincidunt, turpis dui ornare nisl, sollicitudin interdum turpis nunc eget.
Sem nulla eu ultricies orci praesent id augue nec lorem pretium congue sit amet ac nunc fusce iaculis lorem eu diam hendrerit at mattis purus dignissim vivamus mauris tellus, fringilla.

And this is just the tip of the iceberg

James March 18th, 2013 at 9:01 am

@Ricardo, Yup, Zen coding’s pretty cool. Isn’t it now called Emmet though? – http://docs.emmet.io/

Ricardo Rodrigues March 18th, 2013 at 12:06 pm

The HTML I put in my comment was obliterated and I couldn’t edit my comment, two good features you could add here 🙂
No, in this case it’s really called zen coding, at least in Visual Studio that’s what they’re calling it, and since on that one I don’t see VS in the supported list, I don’t think it’s the same, even though it looks very similar!

Peter van der Zee March 18th, 2013 at 2:11 pm

Welcome to the wonderful world of parsing. Hope you have a pleasant stay 😉 Next stop: writing the parser yourself.

😀

James March 18th, 2013 at 2:19 pm

@peter — I hope I’m brave enough! I think that’s definitely the next challenge.

Tom Kenny April 15th, 2013 at 1:54 pm

This is brilliant and I can’t wait for it to wholesale replace the manual writing of HTML.