Posts Tagged ‘Regular Expressions’

JavaScript comment removal – revisited

Posted in 'Code Snippets, JavaScript' by James on September 11th, 2009

A while ago I posted a method I had been using at the time to remove comments from JavaScript code. It was pretty decent – instead of using a regular expression it steps through each character and removes comments where it finds them.

At the time I thought stepping through a string character-by-character was the only reliable way to solve the “comments problem”, but after giving it another attempt I found that it was possible with only a few regular expressions and a fairly moderate dose of JavaScript’s replace() function.

Here it is:

function removeComments(str) {
 
    var uid = '_' + +new Date(),
        primatives = [],
        primIndex = 0;
 
    return (
        str
        /* Remove strings */
        .replace(/(['"])(\\\1|.)+?\1/g, function(match){
            primatives[primIndex] = match;
            return (uid + '') + primIndex++;
        })
 
        /* Remove Regexes */
        .replace(/([^\/])(\/(?!\*|\/)(\\\/|.)+?\/[gim]{0,3})/g, function(match, $1, $2){
            primatives[primIndex] = $2;
            return $1 + (uid + '') + primIndex++;
        })
 
        /*
        - Remove single-line comments that contain would-be multi-line delimiters
            E.g. // Comment /* < --
        - Remove multi-line comments that contain would be single-line delimiters
            E.g. /* // <-- 
       */
        .replace(/\/\/.*?\/?\*.+?(?=\n|\r|$)|\/\*[\s\S]*?\/\/[\s\S]*?\*\//g, '')
 
        /*
        Remove single and multi-line comments,
        no consideration of inner-contents
       */
        .replace(/\/\/.+?(?=\n|\r|$)|\/\*[\s\S]+?\*\//g, '')
 
        /*
        Remove multi-line comments that have a replaced ending (string/regex)
        Greedy, so no inner strings/regexes will stop it.
       */
        .replace(RegExp('\\/\\*[\\s\\S]+' + uid + '\\d+', 'g'), '')
 
        /* Bring back strings & regexes */
        .replace(RegExp(uid + '(\\d+)', 'g'), function(match, n){
            return primatives[n];
        })
    );
 
}

Theoretically this should work perfectly in almost all situations. Don’t bother even trying it with E4X as that definitely won’t work! E.g.

var someE4X = <box>// this is NOT a comment</box>;

It's impossible to cater to E4X with regular expressions because XML is a recursive structure. I'm not bothered though as E4X isn't exactly a widely used extension. It also doesn't play well with conditional compilation but frankly, conditional compilation shouldn't exist anyway.

Anyway, back to the solution. It takes a pretty conventional approach of removing all strings and regular expressions first and then moving on to the comments. Unfortunately comments are not as simple as \/\*.+?\*\/ - there are nested comments within strings, nested comments within literal-regular-expressions and nested comments within other comments.
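
The stash-and-restore idea can be seen in isolation in the following minimal sketch. It only deals with quoted strings and single-line comments, and the names stripLineComments, token and stash are hypothetical ones chosen for illustration; they are not part of the function above.

```javascript
// Minimal sketch: hide string literals behind placeholders, strip // comments,
// then put the strings back. Regex literals and /* */ comments are ignored here.
function stripLineComments(source) {

    var token = '_' + (+new Date()) + '_', // assumed not to occur in the source
        stash = [];

    return source
        // 1. Swap every string literal for a numbered placeholder
        .replace(/(['"])(?:\\.|(?!\1)[^\\\n])*\1/g, function(match) {
            stash.push(match);
            return token + (stash.length - 1) + '_';
        })
        // 2. With strings out of the way, any remaining // starts a comment
        .replace(/\/\/.*/g, '')
        // 3. Restore the original string literals
        .replace(RegExp(token + '(\\d+)_', 'g'), function(match, n) {
            return stash[n];
        });

}

var cleaned = stripLineComments('var url = "http://example.com"; // homepage');
// cleaned === 'var url = "http://example.com"; '
```

The // inside the URL survives because the whole string is already a placeholder by the time the comment pattern runs.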

String.prototype.extract

Posted in 'Code Snippets, JavaScript' by James on July 14th, 2009

I recently released some useful additions to String.prototype over on Github (linky). One of the more useful methods in there is extract. I’m sure something similar to this already exists but I’ve yet to find it.

It’s useful when you need to extract a particular group of every match from a global regular expression. Normally you’d have to use the RegExp.exec method along with a while loop to extract that info, and that is exactly what’s going on behind the scenes. Have a look:

The code:

String.prototype.extract = function( regex, n ) {
 
    n = n === undefined ? 0 : n;
 
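    if ( !regex.global ) {
        // match() returns null when nothing matches, so guard before indexing
        var single = this.match(regex);
        return (single && single[n]) || '';
    }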
 
    var match,
        extracted = [];
 
    while ( (match = regex.exec(this)) ) {
        extracted[extracted.length] = match[n] || '';
    }
 
    return extracted;
 
};

Example:

('hi @rob and @adam, oh and @bob').extract(/@(\w+)/g, 1);
    // => ['rob', 'adam', 'bob']

(With the above example, you could achieve it with a simple match() if JavaScript supported look-behinds…)
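
Engines that implement ES2018 do support look-behind assertions, so on those the same result can come from match() alone. A minimal sketch, assuming such an engine (older ones reject the pattern as a syntax error):

```javascript
// Requires look-behind support (ES2018 or later).
// (?<=@) asserts that an @ sits just before the match without consuming it.
var names = ('hi @rob and @adam, oh and @bob').match(/(?<=@)\w+/g);
// names => ['rob', 'adam', 'bob']
```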

Regular Expressions in JavaScript, part 2

Posted in 'JavaScript' by James on March 23rd, 2009

A while ago, when I was just getting used to this insanely complicated stuff, I posted a brief introduction to the world of regular expressions. I’m glad to say that, since then, I have learnt a bunch more about them and how you can make use of them within JavaScript. So, here goes:

In JavaScript, there are four string operations that will accept a regular expression as an argument:

  • String.match() – this method takes a regular expression as its only argument (anything else is converted to one). It’s usually used to extract specific parts of a string or to test whether a string matches a regular expression.
  • String.replace() – this method accepts either a string or a regular expression as its first argument, and either another string or a function as its second argument. It’s usually used to find and replace certain parts of a string.
  • String.split() – this method accepts either a string or a regular expression as its first argument; the second argument is used (rarely) to set a limit for the split operation. It’s used to split a string into an array based on the regular expression or string passed as the first parameter.
  • String.search() – this method accepts a regular expression as its first and only argument. It’s used to find the index of the first regex match within a string.
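
Here is a minimal sketch of the four methods side by side. The sample sentence and variable names are made up for illustration:

```javascript
var sentence = 'The year 1999 came before 2009';

// match(): without the g flag, returns the first match (plus any groups)
var firstYear = sentence.match(/\d{4}/)[0];      // '1999'

// replace(): the second argument can be a string or a function
var masked = sentence.replace(/\d{4}/g, 'YYYY'); // 'The year YYYY came before YYYY'

// split(): the separator can be a regular expression
var words = sentence.split(/\s+/);               // ['The', 'year', '1999', 'came', 'before', '2009']

// search(): returns the index of the first match, or -1 if there is none
var position = sentence.search(/\d/);            // 9
```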

The RegExp object has its own methods:

  • RegExp.exec() – this method works much like String.match(), except that you pass the string as the argument and the method is run as a member of the regular expression that you’re using to search the string.
  • RegExp.test() – this method is similar to the above exec, but instead of returning the match found it returns either true or false, depending on whether or not it has found a match.

Correction: Luke pointed out in the comments that String.match and RegExp.exec differ when the global flag is used: exec returns one match per call, together with its capture groups, while match returns every full match but none of the capture groups.
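
A minimal sketch of that difference, using a made-up sample string:

```javascript
var text = 'a1 b2 c3';

// match() with the g flag: full matches only, capture groups are dropped
var viaMatch = text.match(/([a-z])(\d)/g); // ['a1', 'b2', 'c3']

// exec() with the g flag: one match per call, capture groups included;
// the regex remembers its position in lastIndex between calls
var pattern = /([a-z])(\d)/g,
    viaExec = [],
    result;

while ( (result = pattern.exec(text)) ) {
    viaExec.push([result[0], result[1], result[2]]);
}
// viaExec => [['a1', 'a', '1'], ['b2', 'b', '2'], ['c3', 'c', '3']]
```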

Because I know no better way to begin, let’s start with a basic example:

Validating user input

One of the most common uses for regular expressions on the client-side is validating user input. Let’s say we need to validate a product ID… We’ve had to leave it up to the user to type it in because there are over 5000 products. All product IDs start with either the letter ‘M’ or ‘D’, followed by 4 or 5 digits and then a single trailing letter to signify upgrades and variations. Validating such an input would be perfectly possible without using a single regular expression, as shown here:

var usersProductID = 'M5060i';
 
function isLetter(character) {
    return ('abcdefghijklmnopqrstuvwxyz').indexOf(character.toLowerCase()) > -1;
}
 
function isValidKey(character) {
    return ('md').indexOf(character.toLowerCase()) > -1;
}
 
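function isDigits(characters) {
    // True only if every character is 0-9 (isNaN would let '12.5' or '1e3' through)
    for (var i = 0; i < characters.length; i++) {
        if (('0123456789').indexOf(characters.charAt(i)) === -1) {
            return false;
        }
    }
    return characters.length > 0;
}
 
// Everything between the leading key and the trailing letter
var middle = usersProductID.slice(1, -1);
 
var isValidProductID = (
        isValidKey(usersProductID.charAt(0))
        && (middle.length === 4 || middle.length === 5)
        && isDigits(middle)
        && isLetter(usersProductID.charAt(usersProductID.length - 1))
    );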
 
alert (isValidProductID); // Boolean, true or false...

Now, with a regular expression:

var usersProductID = 'M5060i';
 
var isValidProductID = /^[md][0-9]{4,5}[a-z]$/i.test(usersProductID);

Hopefully the above example has demonstrated the necessity and importance of regular expressions in JavaScript (if you weren’t already convinced). Here’s a commented version of our regular expression:

^        - Matches the start of a string
[md]     - Character class that matches 'm' or 'd'
[0-9]    - Character class that matches any digit 
{4,5}    - Repeats the preceding token ([0-9]) 4 or 5 times
[a-z]    - Character class that matches any letter from a to z
$        - Matches the end of a string

In JavaScript there are two ways of defining a regular expression, using its constructor, or literally:

// Constructor:
var myRegexp = new RegExp('^[md][0-9]{4,5}[a-z]$', 'i');
// Literal:
var myRegexp = /^[md][0-9]{4,5}[a-z]$/i;

The only situation in which you’d want to use the constructor is when you need to add varying data to the regular expression. If it’s constant and does not change then stick with the RegExp literal (/regex goes here/).

The ‘i’ that you see is a flag. Flags are either passed as the second argument to the constructor or, if you’re using the literal syntax, they’re specified beyond the right-hand delimiter (forward slash) of the expression. The ‘i’ flag in particular means ‘ignore case’, so an ‘a’ in the regular expression will match both ‘a’ and ‘A’ in the string that’s being tested. The available flags include:

  • i – “ignore case” – the case (uppercase/lowercase) of all letters within the string will be ignored during testing.
  • g – “global search” – the search is carried out across the entire string, regardless of whether a match has already been found.
  • m – “multiline search” – ^ and $ will match at the start and end of every line in the string, not just at the start and end of the whole string.
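
A minimal sketch of what each flag changes, using a made-up two-line string:

```javascript
var lines = 'first line\nsecond line';

// i: case is ignored
var hasFirst = /FIRST/i.test(lines);   // true

// g: every match is found, not just the first
var allLines = lines.match(/line/g);   // ['line', 'line']

// m: ^ and $ also match at the start and end of each line
var withoutM = lines.match(/^\w+/g);   // ['first']
var withM = lines.match(/^\w+/gm);     // ['first', 'second']
```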

String extraction

I couldn’t come up with a good name; “string extraction” seems suitable, although it sounds a bit dodgy if not in the context of programming. Anyway, back to the point: regular expressions are not only useful in validation; you can extract very precise pieces of information from string data. Let’s say, for example, we have to extract all numbers from a massive string and produce an array from them:
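
One way to approach that, as a minimal sketch: a global match() with \d+ returns every run of digits as a string, and a loop converts each one to a number. The sample text and variable names here are hypothetical.

```javascript
var text = 'Order 66 shipped 3 boxes weighing 120 kg';

// match() with the g flag returns every run of digits, or null if there are none
var found = text.match(/\d+/g) || [],
    numbers = [];

for (var i = 0; i < found.length; i++) {
    numbers[i] = parseInt(found[i], 10);
}
// numbers => [66, 3, 120]
```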

Regular Expressions in JavaScript

Posted in 'General, JavaScript' by James on January 4th, 2009

About a month ago I decided to begin on the long and tiresome journey of learning regular expressions. I even bought the book! So, in this post I’m going to share some of the awesome things I’ve learnt so far on my "journey".

The first thing to note is that I’m nowhere near the end of my journey and I’m still very much a novice in this area, so if an expert happens to stumble across this post I would very much appreciate some light-hearted critique! If you’re a novice like me I hope you can gain something from my ramblings.

Unrelated: I’d like to attribute the top speech bubble in the image to the right to XKCD. It’s from this comic!

Defining Regular Expressions in JavaScript

Most modern programming languages have support for regex (regular expressions) but I’ll be focusing on JavaScript’s implementation because that’s what I’m best at! It doesn’t really matter though because the typical regex notation varies little between implementations (As far as I know).

As with most things in JavaScript, you can create a regex pattern either by using literal notation or by calling the constructor function of the ‘RegExp’ object:

// Using a regex literal:
var myRegexPattern = /Regular expression goes here.../;
// Calling upon the object:
var myRegexPattern = new RegExp('Regular expression goes here...');

More info on the two methods can be found here: developer.mozilla.org/en/Core_JavaScript_1.5_Guide/Regular_Expressions

The first method (using a regex literal) is faster than the second, but it compiles your pattern when the script is evaluated, whereas the second method (calling the object constructor) compiles your pattern at runtime. So, if you want to include some variable data (perhaps from user input) in the pattern, then using the object constructor is your only option, e.g.:
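
A minimal sketch of building a pattern from variable data. The escapeRegExp helper and the sample input are hypothetical; the helper is there so that characters such as the dot are treated literally rather than as pattern syntax.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a pattern
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

var userInput = 'a.b', // stands in for a value typed by the user
    userPattern = new RegExp('^' + escapeRegExp(userInput) + '$', 'i');

userPattern.test('A.B'); // true
userPattern.test('axb'); // false, because the dot was escaped
```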