2012-02-17

Transforming HTML with Node.js and jQuery

The npm module jsdom enables you to use jQuery to examine and transform HTML on Node.js. This post explains how.

The basics

As a tool for processing HTML, Node.js offers an important foundation: it can download or upload data, and it can read from and write to disk [1]. What it lacks is the ability to parse and transform HTML. Luckily, the jQuery library is ideally suited for this task. The jsdom module implements the HTML DOM on top of Node.js, which is everything that jQuery needs to run on that platform. To install it, use npm, the Node.js package manager:
    npm install jsdom
jsdom is very easy to use:
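    var fs = require("fs");

    // call_jsdom() and documentToSource() are custom helpers from the jsdom_demo project
    var htmlSource = fs.readFileSync("dummy.html", "utf8");
    call_jsdom(htmlSource, function (window) {
        var $ = window.$;

        // Copy the document title into the (empty) heading
        var title = $("title").text();
        $("h1").text(title);

        // Print the transformed document
        console.log(documentToSource(window.document));
    });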
Above, we first read the HTML source from disk into a string, then we invoke jsdom with that source. Once everything has been loaded, jsdom calls us back with a window object. To make things easier to understand, we have used the custom function call_jsdom(), which hides a few details that are not relevant here and loads jQuery “into” the window. Hence, we only need to access window.$ and can work with jQuery as we would in a browser: the document does not yet have a heading, so we read the title and put it into the empty h1 tag. Finally, we log the transformed HTML to the console.
To try it out, you can download the complete project jsdom_demo and run transform.js from the shell, either directly or via Node.js. The input is:
    <!doctype html>
    <html>
        <head>
            <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
            <title>My document</title>
        </head>
        <body>
            <h1></h1>
        </body>
    </html>
The output is:
    <!doctype html>
    <html>
        <head>
            <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
            <title>My document</title>
        </head>
        <body>
            <h1>My document</h1>
        </body>
    <script src="jquery-1.7.1.min.js"></script></html>
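The helper documentToSource() used in the first listing is not shown in this post. The following is only a minimal sketch of what such a function could look like, not the implementation from jsdom_demo. It assumes that the DOM implementation provides the standard properties doctype and documentElement.outerHTML, and it handles only the simple doctype form without public or system identifiers. fakeDocument is a hypothetical stand-in so that the sketch runs without jsdom.

```javascript
// Hypothetical sketch of documentToSource(); the post does not show its body.
// Assumes `doc` offers `doctype` (or null) and `documentElement.outerHTML`.
function documentToSource(doc) {
    var doctype = "";
    if (doc.doctype) {
        // Only the simple <!doctype name> form is handled here
        doctype = "<!doctype " + doc.doctype.name + ">\n";
    }
    return doctype + doc.documentElement.outerHTML;
}

// Minimal stand-in for a document, so the sketch runs without jsdom
var fakeDocument = {
    doctype: { name: "html" },
    documentElement: { outerHTML: "<html><head></head><body></body></html>" }
};
console.log(documentToSource(fakeDocument));
```

With a real jsdom window, the call would be documentToSource(window.document), as in the first listing.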

Caveats

Preserving the structure of the source code. The original source code will be changed in several ways: missing closing tags are added (e.g. for an unclosed <p> tag), and loading jQuery causes a script tag to be added (see the output above). A possible workaround for transforming HTML (as opposed to extracting data) is to not work with a complete document. Instead, one can use $() to work with an HTML fragment that is separate from the document:
    var fragment = $("<ul><li>item</li></ul>");
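Building on that line, the fragment approach could be sketched as follows. renderList() is a hypothetical helper that is not part of the post's project. It assumes that $ is the jQuery function taken from a jsdom window (window.$) and uses only the jQuery methods append(), text() and html().

```javascript
// Hypothetical helper: builds a list as a detached fragment and returns its HTML.
// `$` must be a jQuery function, e.g. window.$ from a jsdom window.
function renderList($, items) {
    var list = $("<ul></ul>"); // detached, never inserted into the document
    items.forEach(function (item) {
        // text() escapes the content, so items are treated as plain text
        list.append($("<li></li>").text(item));
    });
    // html() returns only the children of an element, so wrap the list
    // in a temporary <div> to get the <ul> itself
    return $("<div></div>").append(list).html();
}

// Intended use inside the call_jsdom() callback:
//     console.log(renderList(window.$, ["first", "second"]));
```

Because the fragment never becomes part of the document, neither the added script tag nor any other change to the document shows up in the returned HTML.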
Seeing thrown exceptions. jsdom catches all exceptions, and unfortunately that includes exceptions thrown in its callbacks. As an example, the following is the function call_jsdom() that we used earlier:
    var jsdom = require("jsdom");

    function call_jsdom(source, callback) {
        jsdom.env(
            source,
            [ 'jquery-1.7.1.min.js' ],  // (*)
            function(errors, window) {  // (**)
                // Defer the real work so that it runs outside of jsdom's exception handling
                process.nextTick(
                    function () {
                        if (errors) {
                            throw new Error("There were errors: "+errors);
                        }
                        callback(window);
                    }
                );
            }
        );
    }
jsdom swallows all exceptions thrown inside the callback at (**), including those thrown in any functions that it calls. To escape that effect, you can use process.nextTick() to add a function to the event loop queue. That function is executed after the current code has finished, outside of jsdom's exception handling, so exceptions thrown in it are no longer swallowed.
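The effect can be reproduced without jsdom. In the following sketch, swallowingLibrary() is a hypothetical stand-in for a library that catches everything its callback throws; it is not jsdom's actual code. The uncaughtException handler exists only so that the demo can report the escaped error instead of terminating.

```javascript
// Hypothetical stand-in for a library that swallows every exception
// thrown by the callback it invokes.
var swallowed = [];
function swallowingLibrary(callback) {
    try {
        callback();
    } catch (e) {
        swallowed.push(e.message); // the caller never sees this error
    }
}

// Record errors that reach the top level instead of crashing the demo.
// once(): the handler removes itself after the first error.
var escaped = [];
process.once("uncaughtException", function (e) {
    escaped.push(e.message);
});

// Case 1: thrown directly inside the callback, the library's catch eats it.
swallowingLibrary(function () {
    throw new Error("thrown directly");
});

// Case 2: the callback only schedules the work. The function passed to
// process.nextTick() runs after swallowingLibrary() has returned, so the
// library's try/catch is no longer on the call stack.
swallowingLibrary(function () {
    process.nextTick(function () {
        throw new Error("thrown via nextTick");
    });
});

// Report once the nextTick callback has run
setImmediate(function () {
    console.log("swallowed:", swallowed);
    console.log("escaped:", escaped);
});
```

When run, the first error ends up in swallowed, while the second one reaches the top level and is recorded in escaped. Without the uncaughtException handler, the second error would terminate the process with a stack trace, which is exactly the visibility that call_jsdom() is after.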

Loading jQuery from a file. The examples in the jsdom readme load jQuery from a URL, causing internet traffic each time the code is run. A solution is to put a copy of jQuery next to the script and specify a file path instead of a URL, as seen above at (*).

Using jQuery multiple times. Do you have to invoke call_jsdom (or jsdom.env) every time you want to use jQuery? No, you can store window somewhere and use it again later. The initial startup is only callback-based to accommodate asynchronous script loading.
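One way to store the window is a small cache around the loading function. The following is a sketch under assumptions: createWindowCache() and its loadWindow parameter are hypothetical names, and loadWindow is expected to behave like call_jsdom() with the HTML source already filled in. fakeLoad() is a stand-in so that the sketch runs without jsdom.

```javascript
// Sketch: load a window once and hand the same object to every later caller.
// `loadWindow` is a hypothetical parameter, a function that takes a callback
// and eventually calls it with a window.
function createWindowCache(loadWindow) {
    var cachedWindow = null;
    var pending = null; // callbacks waiting for the first load to finish

    return function withWindow(callback) {
        if (cachedWindow) {
            callback(cachedWindow);
            return;
        }
        if (pending) {
            pending.push(callback); // a load is already in progress
            return;
        }
        pending = [callback];
        loadWindow(function (window) {
            cachedWindow = window;
            var waiting = pending;
            pending = null;
            waiting.forEach(function (cb) { cb(window); });
        });
    };
}

// Stand-in loader so the sketch runs without jsdom; counts its invocations
var loadCount = 0;
function fakeLoad(callback) {
    loadCount++;
    callback({ name: "fake window" });
}

var withWindow = createWindowCache(fakeLoad);
var seen = [];
withWindow(function (window) { seen.push(window); });
withWindow(function (window) { seen.push(window); });
console.log(loadCount, seen[0] === seen[1]);
```

With the real helper, the loader could be written as function (cb) { call_jsdom(htmlSource, cb); }. Note that all callers then share one window, so changes made to its document by one caller are visible to the others.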

Conclusion: What is this good for?

When you are faced with having to parse or transform HTML, you realize just how great a transformation language jQuery is. Its documentation is also very well done, which makes it a good fit for casual users. The solution described above is ideal for extracting information from HTML. Changing existing HTML requires more care.

References

  1. Write your shell scripts in JavaScript, via Node.js
