2012-01-22

JavaScript myth: JavaScript needs a standard bytecode

The idea is obvious: Why not standardize the bytecode of the virtual machines (VMs) that JavaScript runs on? That would mean that JavaScript programs could be delivered as bytecode and thus would be smaller and start more quickly (after having been loaded). Additionally, it would seem to be easier to port other languages to web browsers, by targeting that bytecode. This post argues against that idea in two steps: First, it shows that bytecode has several disadvantages. Second, it explains that source code is not as bad a solution as it seems.

The disadvantages of bytecode

Bytecode is a very specialized mechanism:
  • There is no single bytecode to “rule all languages”: A good bytecode is intimately tied to the language that is most frequently compiled to it. It is thus impossible to define a bytecode that works well with all languages, especially if you want to support both dynamic and static languages.
  • There is no common ground between browsers: The previous point applies even to competing implementations of the same language, JavaScript. They are too different for a common bytecode to be found: Firefox, Safari and Internet Explorer each use a different bytecode, and Google’s V8 initially compiles directly to machine code. But wouldn’t it be possible to work towards the goal of a common bytecode or to adopt a single implementation in the long run? Doing so would indeed have some advantages. But having several implementations of the same language is also useful, because different approaches can be tried. Competition between engines has so far been very good for the JavaScript ecosystem. V8 started a race that has not ended yet and that has brought tremendous speed gains to JavaScript.
  • Bytecode is inflexible: it ties you to the current version of the language and to implementation details such as how data is encoded. Especially with regard to language versions, you need to be flexible on the web where you have many combinations of
    language version(s) sent by the server × language versions supported by the browser
    Quoting Brendan Eich [1]:
    “Now, of course, you could say ‘Let’s version the bytecode’, and then you’re in version hell. The web really doesn’t like to have that kind of versioning. There’s a saying in the WhatWG that versioning is an Anti-Pattern and I agree we should avoid brittle a priori versioning, or heavy-handed versioning. If you look at Flash, it’s gotten into a situation where it has to support versions going back to Flash 4. They have to ship ActionScript 2 as a separate interpreter along with Tamarin. This is the hard row you hoe when you do make detailed choices in a lower-level bytecode, I think, and when you simply have an installed base that can’t be upgraded or doesn’t use a common source language.”

Source code is not that bad – it’s meta-bytecode

At first glance, it seems like a suboptimal solution to use source code to deliver programs. At second glance, it has the benefit of flexibility and, with a little work, it can obtain much of the efficiency of bytecode.
  • Source code abstracts over different language implementations: JavaScript is remarkable in how many closely compatible virtual machines exist for its source code (browser incompatibilities are another issue!). That is due to several factors: First, with ECMA-262 (“ECMAScript”), JavaScript has a very well written language specification (especially compared to Dart’s, which – to be fair – is evolving). Whenever you have a doubt about a language feature, you can turn to ECMA-262 and get a clear answer. Second, JavaScript engine vendors work closely together to evolve the language. Third, there is a test suite called test262 that checks the conformance of a JavaScript implementation. Hence, you can consider JavaScript source code to be meta-bytecode – a data format that unifies the different bytecode formats and V8’s machine code.
  • Source code abstracts over different language versions: Keeping the delivery format of a new language version backward compatible is easier with source code than it is with bytecode.
  • Parsing source code is fast: JavaScript engines have become very efficient at parsing JavaScript source code. Coupled with increased CPU speed, the overhead caused by parsing is becoming less and less important.
  • Source can be quite compact: There are two ways of making source code more compact. First, minification – a transformation of source code that maintains the semantics while decreasing the size. For example, minification removes comments and changes variable names to be shorter. Second, compression. After minification, one can apply a compression algorithm such as gzip to achieve further reductions in size.
  • Already a good compilation target: Because JavaScript source code has such a high level of abstraction, it is relatively easy to compile to. Furthermore, being a good compilation target is a consideration in JavaScript’s evolution. Examples of features that are partially motivated by that consideration are: typed arrays (supported by many modern browsers, proposed for a future ECMAScript version) and SIMD (which might be part of ECMAScript 8 [2]). Lastly, JavaScript engines increasingly support this use case. For example, via source maps [3]: If a file A is compiled to a JavaScript file B, then B can be delivered with a source map. Whenever a source code location is reported for B (e.g. in an error message), it can be traced back to A via the source map. In the future, source maps will even allow one to debug JavaScript code in the original language.
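To make the minification point from the list above concrete, here is a minimal sketch. The function sumOfNumbers, the helper load and the hand-minified string are all made up for illustration; a real minifier would produce its own (probably similar) output. The sketch only checks that the shorter text behaves the same as the readable one.

```javascript
// Original, readable version (illustrative example, not from a real project).
const originalSource = `
// Computes the sum of all numbers in an array.
function sumOfNumbers(numbers) {
    let total = 0;
    for (const number of numbers) {
        total += number;
    }
    return total;
}
`;

// Hand-minified version: comment removed, whitespace dropped,
// parameter and local variable names shortened.
const minifiedSource =
    "function sumOfNumbers(n){let t=0;for(const x of n)t+=x;return t}";

// Hypothetical helper: turns a source string into a callable function
// so that both versions can be compared at runtime.
function load(source) {
    return new Function(source + "; return sumOfNumbers;")();
}

const original = load(originalSource);
const minified = load(minifiedSource);

const input = [1, 2, 3, 4];
console.log(original(input) === minified(input)); // true
console.log(minifiedSource.length < originalSource.length); // true
```

A compression step such as gzip would then be applied to the minified text, typically by the web server, which is why it does not appear in the sketch.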

The remaining bytecode advantage

The main remaining bytecode advantage is that (static and dynamic) analyses can be performed ahead of time and delivered alongside the bytecode. The closest to bytecode one can get without losing the advantages of source code is to use the abstract syntax tree (AST) produced by a parser. The research project JSZap [4] does just that:
“In this paper, we consider reducing the JavaScript source code to a compressed abstract syntax tree (AST) and transmitting the code in this format.”
The AST is complemented by the result of several analyses. Such a format could become a standard JavaScript storage format. The advantages of the JSZap approach are:
  • Faster parsing and well-formedness checking (including security checks).
  • Reduced program size (by approximately 10% compared to minification plus gzip compression).
  • Some JavaScript code is currently loaded synchronously via script tags embedded in HTML. With JSZap, the HTML parser can load such code asynchronously whenever the JSZap data indicates that it doesn’t interact with the DOM. The main examples are libraries. This is mainly an optimization for older JavaScript applications. Modern applications load all library code asynchronously.
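The idea of shipping an AST instead of source text can be sketched as follows. This is only an illustration under loud assumptions: the node shapes (type, operator, left, right, value) are made up for this example, and JSON stands in for the wire format. JSZap's actual compressed encoding is not reproduced here.

```javascript
// Hypothetical AST for the expression `1 + 2 * 3`.
// The parser has already resolved operator precedence,
// so the receiver does not need to parse text again.
const ast = {
    type: "BinaryExpression",
    operator: "+",
    left: { type: "Literal", value: 1 },
    right: {
        type: "BinaryExpression",
        operator: "*",
        left: { type: "Literal", value: 2 },
        right: { type: "Literal", value: 3 },
    },
};

// Minimal tree-walking evaluator for the two node types used above.
function evaluate(node) {
    switch (node.type) {
        case "Literal":
            return node.value;
        case "BinaryExpression": {
            const left = evaluate(node.left);
            const right = evaluate(node.right);
            switch (node.operator) {
                case "+":
                    return left + right;
                case "*":
                    return left * right;
                default:
                    throw new Error("Unsupported operator: " + node.operator);
            }
        }
        default:
            throw new Error("Unsupported node type: " + node.type);
    }
}

// "Transmit" the tree: serialize on the sending side,
// deserialize on the receiving side (JSON is an assumption here).
const wire = JSON.stringify(ast);
const received = JSON.parse(wire);

console.log(evaluate(received)); // 7
```

Because the tree is already well-formed by construction, the receiving side can skip tokenizing and most syntax checking, which is the kind of saving the first advantage in the list refers to.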
Dart’s snapshots are an extreme kind of ahead-of-time analysis that probably cannot be duplicated by a cross-VM format. They improve application startup time. Quoting “The Essence of Google Dart: Building Applications, Snapshots, Isolates” by Werner Schuster for InfoQ:
“... the heap snapshot feature ... is similar to Smalltalk’s image system. An application’s heap is walked and all objects are written to a file. At the moment, the Dart distribution ships with a tool that fires up a Dart VM, loads an application’s code, and just before calling main, it takes a snapshot of the heap. The Dart VM can use such a snapshot file to quickly load an application.”

Conclusion

I hope that this post has convinced you that delivering JavaScript programs as source code is not as different as it seems from delivering them as bytecode, especially when size reduction techniques are used. Moreover, while source code takes up more space and loads more slowly, it is also more flexible than bytecode – a trait that is very valuable on the web.

Related reading

  1. “Bytecode Standard In Browsers” – A Minute With Brendan Eich
  2. A first look at what might be in ECMAScript 7 and 8
  3. SourceMap on Firefox: source debugging for languages compiled to JavaScript [update: WebKit, too]
  4. “JSZap: Compressing JavaScript Code”, by Martin Burtscher, Benjamin Livshits, Gaurav Sinha, Benjamin G. Zorn. Microsoft Research, 2010.
