For the recently developed debug.js (view) I had to come up with a way to remove all comments from any piece of JavaScript code.

I originally thought that this would be a piece of cake; a simple regex takes care of everything!

code.replace(/\/\*.+?\*\/|\/\/.*(?=[\n\r])/g, '');

This regular expression would have worked in 90% of situations but, unfortunately, I had to build something that would work in every single situation.

It’s worth mentioning exactly when the above regular expression would fail:

  • When comment notation exists in a string, e.g.
    var str = " /* not a real comment */ ";
  • When comment notation exists in a literal regular expression, e.g.
    var regex = /\/*.*/;
  • When conditional compilation (supported in IE > 4) exists in the code, e.g.
    /*@cc_on @*/
    /*@if (@_jscript_version == 4)
    alert("JavaScript version 4");
    @else @*/
    alert("Blah blah blah");
    /*@end @*/
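
To make the first two failure cases concrete, here is a small sketch that runs the simple regex over a few inputs. The helper name stripNaive and the URL input are my own illustrations, not part of debug.js.

```javascript
// The simple regex from earlier, wrapped in a hypothetical helper.
function stripNaive(code) {
    return code.replace(/\/\*.+?\*\/|\/\/.*(?=[\n\r])/g, '');
}

// Comment notation inside a string is removed as if it were a real comment.
var brokenString = stripNaive('var str = " /* not a real comment */ ";\n');
// brokenString is 'var str = "  ";\n'

// Part of a regex literal is mistaken for a block comment.
var brokenRegex = stripNaive('var regex = /\\/*.*/;\n');
// brokenRegex is 'var regex = /\\;\n' (a dangling backslash is left behind)

// A URL inside a string looks like a line comment (illustrative input).
var brokenUrl = stripNaive('var url = "http://example.com";\n');
// brokenUrl is 'var url = "http:\n'
```

In all three cases the output is no longer valid JavaScript, which is why the regex alone was not good enough.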

While the likelihood of any of the above happening is low, it’s certainly worth catering to all potential situations, just in case one of them arises!

So, after a bit of googling and messing around, it turns out that the only way of doing this properly is to loop through the code, character by character, checking for certain delimiters and then enabling/disabling modes as the loop progresses:

/* 
    This function is loosely based on the one found here:
    http://www.weanswer.it/blog/optimize-css-javascript-remove-comments-php/
*/
function removeComments(str) {
    str = ('__' + str + '__').split('');
    var mode = {
        singleQuote: false,
        doubleQuote: false,
        regex: false,
        blockComment: false,
        lineComment: false,
        condComp: false 
    };
    for (var i = 0, l = str.length; i < l; i++) {
 
        if (mode.regex) {
            if (str[i] === '/' && str[i-1] !== '\\') {
                mode.regex = false;
            }
            continue;
        }
 
        if (mode.singleQuote) {
            if (str[i] === "'" && str[i-1] !== '\\') {
                mode.singleQuote = false;
            }
            continue;
        }
 
        if (mode.doubleQuote) {
            if (str[i] === '"' && str[i-1] !== '\\') {
                mode.doubleQuote = false;
            }
            continue;
        }
 
        if (mode.blockComment) {
            if (str[i] === '*' && str[i+1] === '/') {
                str[i+1] = '';
                mode.blockComment = false;
            }
            str[i] = '';
            continue;
        }
 
        if (mode.lineComment) {
            if (str[i+1] === '\n' || str[i+1] === '\r') {
                mode.lineComment = false;
            }
            str[i] = '';
            continue;
        }
 
        if (mode.condComp) {
            if (str[i-2] === '@' && str[i-1] === '*' && str[i] === '/') {
                mode.condComp = false;
            }
            continue;
        }
 
        mode.doubleQuote = str[i] === '"';
        mode.singleQuote = str[i] === "'";
 
        if (str[i] === '/') {
 
            if (str[i+1] === '*' && str[i+2] === '@') {
                mode.condComp = true;
                continue;
            }
            if (str[i+1] === '*') {
                str[i] = '';
                mode.blockComment = true;
                continue;
            }
            if (str[i+1] === '/') {
                str[i] = '';
                mode.lineComment = true;
                continue;
            }
            mode.regex = true;
 
        }
 
    }
    return str.join('').slice(2, -2);
}

The best way to wrap your head around the above code is to take it step by step. There are six modes, and at most one of them is set to true at any time during iteration. The active mode represents the construct that is currently being looped through (a string, a regular expression, a comment, etc.). The modes are:

  • mode.singleQuote: Single-quote delimited string ('string').
  • mode.doubleQuote: Double-quote delimited string ("string").
  • mode.regex: Literal regular expression (/regex/).
  • mode.blockComment: Block comment (/*...*/).
  • mode.lineComment: Line comment (//...).
  • mode.condComp: Conditional compilation (/*@...@*/).
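
Since at most one mode is active at a time, the six flags could also be folded into a single variable. The following is only a sketch of that idea, not the function used in debug.js; the names stripWithSingleState and CLOSERS are made up for illustration. It also steps over escaped characters in pairs, which is a small change from the function above, and it keeps the same assumption that any other slash starts a regular expression.

```javascript
// Hypothetical variant: one `state` string instead of six boolean flags.
// Like the original, it assumes any "/" that does not start a comment
// starts a regex literal (so division is not handled).
var CLOSERS = { single: "'", double: '"', regex: '/' };

function stripWithSingleState(src) {
    var out = '';
    var state = 'code';
    for (var i = 0; i < src.length; i++) {
        var c = src.charAt(i);
        var next = src.charAt(i + 1);
        if (state === 'single' || state === 'double' || state === 'regex') {
            out += c;
            if (c === '\\') {
                out += next; // copy the escaped character without inspecting it
                i++;
            } else if (c === CLOSERS[state]) {
                state = 'code';
            }
        } else if (state === 'block') {
            if (c === '*' && next === '/') {
                i++; // skip the closing slash as well
                state = 'code';
            }
        } else if (state === 'line') {
            if (next === '\n' || next === '\r') {
                state = 'code'; // the line break itself is kept
            }
        } else if (state === 'cond') {
            out += c; // conditional compilation is copied through untouched
            if (c === '/' && src.charAt(i - 1) === '*' && src.charAt(i - 2) === '@') {
                state = 'code';
            }
        } else if (c === '"') {
            state = 'double';
            out += c;
        } else if (c === "'") {
            state = 'single';
            out += c;
        } else if (c === '/' && next === '*' && src.charAt(i + 2) === '@') {
            state = 'cond';
            out += c;
        } else if (c === '/' && next === '*') {
            state = 'block';
            i++; // skip the asterisk so "/*/" is not read as a whole comment
        } else if (c === '/' && next === '/') {
            state = 'line';
        } else if (c === '/') {
            state = 'regex';
            out += c;
        } else {
            out += c;
        }
    }
    return out;
}
```

Whether this reads better than six separate flags is a matter of taste; the point is only that the modes are mutually exclusive.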

Here’s an example walk through the loop:

Using string ->   "a\"" /*Boo!*/
 
01. Double quote; *mode.doubleQuote* activated.
02. Letter 'a'; loop continues.
03. Character '\'; loop continues.
04. Double quote; ignored because the previous character is an escaper.
05. Double quote; previous character is not '\', so *mode.doubleQuote* de-activated
06. Space; loop continues.
07. Character '/'; Next character is asterisk; *mode.blockComment* activated
    - character replaced with an empty string
08. Letter 'B'; loop continues.
    - character replaced with an empty string
09. Letter 'o'; loop continues.
    - character replaced with an empty string
10. Letter 'o'; loop continues.
    - character replaced with an empty string
11. Character '!'; loop continues.
    - character replaced with an empty string
12. Character '*' followed by '/'; *mode.blockComment* de-activated
    - both characters replaced with an empty string
 
Result ->   "a\""
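
The same walk can be reproduced in code. This is a cut-down sketch with only the double-quote and block-comment modes, written just to log when a mode switches on or off; traceModes is a hypothetical helper, and its log uses zero-based character positions rather than the step numbers above.

```javascript
// Reduced, hypothetical tracer: only two of the six modes are modelled.
function traceModes(src) {
    var chars = src.split('');
    var mode = null; // null, 'doubleQuote' or 'blockComment'
    var log = [];
    for (var i = 0; i < chars.length; i++) {
        var c = chars[i];
        if (mode === 'doubleQuote') {
            // A quote only closes the string if it is not escaped.
            if (c === '"' && chars[i - 1] !== '\\') {
                mode = null;
                log.push(i + ': doubleQuote off');
            }
            continue;
        }
        if (mode === 'blockComment') {
            if (c === '*' && chars[i + 1] === '/') {
                chars[i + 1] = '';
                mode = null;
                log.push(i + ': blockComment off');
            }
            chars[i] = ''; // every character inside the comment is blanked
            continue;
        }
        if (c === '"') {
            mode = 'doubleQuote';
            log.push(i + ': doubleQuote on');
        } else if (c === '/' && chars[i + 1] === '*') {
            chars[i] = '';
            mode = 'blockComment';
            log.push(i + ': blockComment on');
        }
    }
    return { result: chars.join(''), log: log };
}

var trace = traceModes('"a\\"" /*Boo!*/');
// trace.log holds four entries: doubleQuote on at 0 and off at 4,
// blockComment on at 6 and off at 12.
```

The escaped quote at position 3 produces no log entry, which matches step 04 of the walk.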

There’s quite a lot of looking forward and backward involved. That’s why a couple of arbitrary characters are added to either end of the string before the loop: to make sure something is there when str[i-2] or str[i+2] is queried.
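
As a small illustration of that padding, the two steps can be pulled out into a pair of helpers. The names pad and unpad are hypothetical; removeComments does the same thing inline on its first and last lines.

```javascript
// Hypothetical helpers mirroring the first and last lines of removeComments.
function pad(src) {
    return ('__' + src + '__').split('');
}

function unpad(chars) {
    return chars.join('').slice(2, -2);
}

var chars = pad('/');
// chars is ['_', '_', '/', '_', '_'], so at i = 2 both chars[i - 2]
// and chars[i + 2] are real characters.

var roundTrip = unpad(pad('x = 1;'));
// roundTrip is 'x = 1;' again: the padding never reaches the output.
```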

Note: the code I used in the removeComments function could be shortened. In fact, the entire function could probably be squeezed into 20 lines, but that would only slow it down. Terseness does not always equal speed, and in this situation a somewhat repetitive stream of if statements really is the only way to produce acceptable performance.

I’d love to be proven wrong here, so if anyone can come up with an easier way of doing this, I’d love to hear it! Especially if you think you can solve this with regular expressions alone!
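
For what it’s worth, here is one partial attempt at the regex route, offered as a sketch rather than an answer. The idea is to match strings and conditional compilation blocks first and hand them back unchanged from a replace callback, so that only real comments are dropped. The names TOKEN and stripWithRegex are made up, and this version does not recognise literal regular expressions, so it still fails on the /\/*.*/ case.

```javascript
// Hypothetical partial approach. Group 1 captures things to keep
// (double-quoted strings, single-quoted strings, /*@ ... @*/ blocks);
// the remaining alternatives match block and line comments.
var TOKEN = /("(?:\\[\s\S]|[^"\\])*"|'(?:\\[\s\S]|[^'\\])*'|\/\*@[\s\S]*?@\*\/)|\/\*[\s\S]*?\*\/|\/\/[^\n\r]*/g;

function stripWithRegex(code) {
    return code.replace(TOKEN, function (match, keep) {
        // `keep` is only set when group 1 matched; comments become ''.
        return keep || '';
    });
}

var cleaned = stripWithRegex('var s = " /* not a real comment */ "; /* gone */ // also gone\nvar t = 1;');
// cleaned is 'var s = " /* not a real comment */ ";  \nvar t = 1;'
```

Because the alternation is tried left to right at each position, a string that contains comment notation is consumed whole before the comment patterns get a chance to match inside it.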
