Have tracing JIT compilers won?

I've been watching the interpreter and JIT compiler competitions a bit in the JavaScript and Lua worlds. I haven't collected much organized data but the impression I'm getting is that tracing JIT's are turning up as the winners: sometimes even beating programs statically compiled with GCC. Is there a growing body of evidence that tracing JIT compilers are the horse to bet on? Will a more static style JIT like Google's V8 be able to keep up?

Thanks,
Peter

[I promoted this item to the front page, since the discussion is highly interesting & informative. -- Ehud]

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Betting on JIT

There is no clear 'winner', especially given the variety of issues other than performance that one should consider (runtime code distribution (modules, extensions); upgrade of code at runtime; security and safety analyses; memory overhead; realtime considerations; etc.). I doubt there will ever be a clear winner.

But I would consider runtime compilation to be a good bet for many languages. Its performance is competitive. If combined with persistence and automated code-distribution features, allowing reuse of code over long periods of time, one might even benefit from supporting computation-intensive optimizations in the JIT compiler.

It is unclear to me why you'd bet on just one thing, though. I would want to compile based on in which circumstances the code is acquired and used.

mozilla

Mozilla's best guess is apparently that a combination of the two techniques ultimately wins:

http://hacks.mozilla.org/2010/03/a-quick-note-on-javascript-engine-components/

which, if I understand it correctly, announces their plan for a very coarse-grained JIT for Javascript (directly generating native instead byte code), combined with a tracing JIT that looks after hot-spots in the resulting code. Which kinda makes some sense: the overhead to native code shouldn't be much greater than to bytecode in the initial translation - yet you should get some good payback. And then tracing magic lazily optimizes some paths based on execution feedback.

Native Code Tracing?

In Andreas Gal's dissertation on tracing he describes tracing inside a bytecode interpreting VM. That seems reasonable as the VM is software and so can record/trace the bytecode instructions it executes.

How can native code execution be similarly monitored to record execution traces?

Profiling

One compiles native code with a few extra instructions (inside functions, branches, etc.) to track usage and trigger a more intensive optimization of code regions that see a lot of use. That is, you don't trace Native CPU instructions directly, but rather trace code regions via the profiling instructions. You optimize from the original source, or from an intermediate representation. This also works for bytecode; one can compile source to profiled bytecode that traces itself and triggers more intensive optimization... this avoids mucking with the VM's basic interpretation loop.

I remember reading about this somewhere... can't recall where, though.

Profiling

The idea of inserting profiling instructions into the native code makes sense. I imagine that to get the exact traces that Gal's tracing VM achieves it would be necessary to insert a profiling instruction at the beginning of each "basic block" of code as he calls them.

If you do happen to remember where you read about self-tracing code please mention it here. I'll be looking for similar information.

Depends on domain

Maybe for general purpose systems tracing JITs are fast enough. Until someone shows impressive numerical results using a tracing JIT there will still be at least one domain where offline computation is the norm. What I think you see here is that the amount of computation you can usefully throw at optimising "integer" code hits a point of diminishing returns quite quickly, so that tracing JIT is effective here. This isn't the case with numeric programs AFAIK where you can really grind to optimise memory access patterns and concurrency.

That said, I think the idea of adaptive optimisation is very promising. The old Dynamo project (which became Transmeta IIRC) showed that online optimisation could get equal performance to GCC's -O4 (profile directed) optimisation level. This was working on raw machine code -- so only low level optimisations were feasible. The MILEPOST compiler shows what you can do with more feedback in the offline setting. The next step for JIT IMO is to save information to for offline / idle processing. I'm not aware of any work that does this but it should be an easy win.

Dynamo and MILEPOST

The old Dynamo project (which became Transmeta IIRC) showed that online optimisation could get equal performance to GCC's -O4 (profile directed) optimisation level. This was working on raw machine code -- so only low level optimisations were feasible.

Just to be clear, the more general idea here is that field programmable gate arrays can be used to efficiently implement in hardware Turing's connectionist networks.

The MILEPOST compiler shows what you can do with more feedback in the offline setting. The next step for JIT IMO is to save information to for offline / idle processing.

The next step is actually moving optimizations into a version control scheme*, and distributing optimizations. Think WAY bigger than you are thinking right now. There were glimpses of this idea in the 1980's when some researchers were looking at ways to harness parallel computing machines (before they were becoming widespread), and one of these research projects focused on using neural networks to do optimizations. If you abstract out the neural net and focus on conflict-serializability of optimizations, then you've got a killer app that a company like Google should be building. But Google thinks small and pragmatic (e.g., Go, GWT, Gentoo for ChromeOS, etc.), not big and potentially revolutionary.

* Technically, hotspot optimizing compilers already implement a form of revision control (literally, revising a string). They simply do not implement branching versions (morphing a string and creating simultaneously active but different strings). Also, schemes with hardware change control management is typically known as Product Data Management (PDM), whereas software-only change control management schemes are called Software Configuration Management (SCM).

Revision control for optimizations

The next step is actually moving optimizations into a version control scheme*, and distributing optimizations.

You may be interested in this paper: Equality saturation: a new approach to optimization.

As Thomas Lord mentions, this is about modeling compiler optimizations as an actual search/optimization problem, instead of applying optimization passes sequentially in a substantially ad hoc way.

Seen it

Equality saturation should only be a part of the picture. Equality saturation is about integrity.

I also disagree with you that Thomas was suggesting modeling compiler optimizations as an actual search/optimization problem. From what I've read of Thomas's ideas in the past, such as A Case for Gestures/Visualizations and Against Concrete Syntax. In this example, Thomas actually discusses the idea (in a round-about way*) of making optimizations a service for the programmer to plug into. Thomas gives the example of choosing between a number of implementations. Usually, though, a compiler does not know which one is preferable. It is up for the programmer to decide. Due to the current medium of expressing changes to code (file edits), we cannot really capture these design trade-offs.

The problem is in change control management, and as I indirectly noted above, it exists for both hardware and software folks. I figured Thomas was zero-ing in on this because of his background in SCM and version control as the creator of GNU Arch (he also started that aforementioned thread after a discussion on LtU about how much subversion pretty much blows).

The programmer should be able to automate solving scenarios like, "I've got 16Kb of memory I have to stuff these requirements into a program, for a real-time system. How do I do it?" Well, if that's the case, it doesn't really make sense to choose quicksort necessarily. So what algorithm do you choose? Well, today, you probably don't even choose an algorithm. You implement it yourself. Then you have to go through the tedium of seeing how much memory your compiler allocates for that sorting routine, and how well it performs. Do you expect equality saturation to handle that? I don't. I expect the programmer to know he has a specific problem to deal with that the human brain needs to get immediately involved with to solve it effectively. Current compilers wait way too long to let human brains into the picture. If you know what your requirements are, then you shouldn't let a compiler architecture dictate to you your development process!

*I would argue Thomas's model example for trade-offs is solving the problem at the wrong level of abstraction. But he has the right idea.

"to see and be seen"

Right, Z-Bo. That's kind of neat, for me, because you synthesized various things I've remarked on in a way that leaves me saying: "Yup, that's about how I see it."

I think that projects like JaegerMonkey, extrapolated, go in the direction of the AI-search kind of translation system. And, I can't prove that that's intractable but I can point to reasons why it's implausible. I'm not sure Mr. Eich is right that it's a long game of chess. I think it might be an infinite game of "construct a rational number lower-bound approximation of .7 that is less than .7, within the time limits of your turn, and that is greater than any approximation your opponent constructed in earlier turns. The game ends when a player succeeds under those rules at constructing .99999" Yeah, that competition is converging on the AI-search approach and, yeah, I doubt that that's a winning strategy.

And, I agree that the alternative to AI-search using a toolbox of translation tools is human-directed application of the tools. Search has a place as a tool in the toolbox (e.g., "super-optimizers") but it takes the situational awareness of people to decide which of these imperfectly understood tools to apply to what, and when.

The C example might be solving the problem at the wrong level of abstraction, sure. All I mean to highlight about early C is that here was a language translation tool that under nominal operating conditions gave highly, easily forseeable, easily reproducible results. It shouldn't be as hard as it is made by existing interop standards to write an efficient FFT that is portable across browsers, for example - and fancifying Javascript implementation techniques along current lines won't fix problems like that well.

Right, Z-Bo. That's kind of

Right, Z-Bo. That's kind of neat, for me, because you synthesized various things I've remarked on in a way that leaves me saying: "Yup, that's about how I see it."

For what it is worth, that is pretty much how myself and David Barbour see it, too. See the c2 wiki RethinkingCompilerDesign and CompileTimeResolution. The only difference between you and I vs. David is that David is heavily devoted to the wiki concept, because he also wants tight integration with collaborative editing. But that difference isn't important for this discussion.

In the same vein

In the same vein you might be interested in a paper I did a couple of years ago called Program Interpolation. The system that we describe in that paper is a very simple sketch of an idea, but it show what happens if you make a type of strong equality guarantee between the pieces of program that you are trying to optimise. The large sets of enumerated programs would then be your candidate optimisations within some VCS (we never looked at it that way) and the paper shows how to build explicit search problems from the optimisation. For people without pay-wall access there is another version here although I'm afraid I can't remember if there are any mistakes in that draft.

Never seen your paper

Never seen your paper before, thanks for sharing. I'll have a look. I should also dig up the reference to neural network compilers from the '80s so people realize the machine learning ideas in MILEPOST are nothing fundamentally novel, just stuffed into an ancient, non-modular code base (gcc).

The large sets of enumerated programs would then be your candidate optimisations within some VCS (we never looked at it that way)

I think a good question would be: How few lines of code can you implement Git's core module in using something like VPRI's Pepsi/Cola/Ometa.

Compiler tuning

You might be interested in this paper then too: COLE: Compiler Oprimisation Level Exploration - CGO 2008. Disclaimer: they guys are my colleagues :-). We did a follow-up paper for this years' CGO titled Automated JIT compiler tuning. You can find the abstract and a preprint version here.

Different types of optimization gains

This isn't the case with numeric programs AFAIK where you can really grind to optimise memory access patterns and concurrency.

When I see results that sometimes LuaJIT is beating programs compiled with GCC it makes me think there must be optimizations that can be made in a tracing JIT that cannot be made by static analysis.

different types of ....

It seems to me that:

In a simple static compiler, the compiler would have to make good guesses about which paths to optimize tracing-style (or to generate unrealistically large amounts of code and conditional branches that cache the dynamically most likely prediction).

A profiling static compiler can make better guesses about what paths to optimize. It can essentially answer the question: what would a native code with tracing jit's state be after this profile?

Real tracing jit can beat all of that in the case that in most runs there are dominant paths, but there are many possible paths, and which paths will be dominant varies widely across runs.

Conversely, tracing jit can't well handle hot loops which have many branches, none of them dominant - while static or method-jit methods can do a better job. This is apparently one the experience of Mozilla and one of their motivations for this new work:

http://blog.mozilla.com/dmandelin/2010/02/26/starting-jagermonkey/

It's also worth noting that in some cases: doing no optimization whatsoever produces the most desirable performance. For example, for very simple code, naive interpretation can win for speed. In general, naive interpretation can win for latency (amount of delay before execution commences) which is sometimes desirable (e.g., in the presence of EVAL).

So it seems like ultimately, at least conceptually, you have a discreet spectrum of composable code transformations available, and a real-time search problem. At one end, you simply parse the code and directly implement a syntax tree (or s-exp if you're into that kind of thing). You have a wealth of possible compiler passes that can optimize trees and also take them other forms (against which further optimization passes can be applied). You have various interpretation strategies for many of the intermediate forms along the way. You have various ways of generating native code (or FPGA configurations, etc.) from some of these intermediate forms.

If we think of execution as comprising everything from the submission of source code through an actual "run" of the program, and *if* (big if) a system operator can characterize what counts for this execution as optimal performance, then you have a big search problem - deciding which allowable code transformation passes, interpretive strategies, or native code generation to apply when in order to maximize the performance score along the imposed metric. Necessarily, in many cases, you want the system to play that search game during the run of the program, using feedback about which of the potentially "further optimizable" parts of the code are most harming the performance score in a given run.

It would, I think it's safe to say, be very difficult to engineer that search directly in a clean and elegant way. Perhaps someday people will have worked out enough of the math and done enough of the puzzle solving that you can just declare all of these passes in nice clear ways, let the system help identify the stochastic performance impacts of various passes, and let a meta-translator turn all of these nice clean declarations into static front-ends and gnarly code-generating run-time systems. It seems like that, if it ever happens, is still a long way away.

And an arguable weakness of the entire approach is that it depends on (a) there being a good way for the system operator (or context otherwise) to characterize the desired performance goals; (b) a search algorithm that scores reliably high on achieving those performance goals. It might never be possible. We might instead, permanently, have "touchy" systems where you rearrange your code in seemingly innocent, minor ways - and the resulting performance scores vary wildly in baffling ways.

The very opposite of the entire approach is exemplified by languages like C in the early days: don't worry about generating *optimal* code (that's why human-friendly assemblers exist) -- worry about generating pretty good but above all *predictable* code.

With a language like that, at least you can build something robust and have an easier time convincing yourself and others that it is robust in just the ways you say. No surprises.

I think that the C-style approach fell out of favor for two reasons: (1) is that modern VM systems and processor architectures undermine the very ability for a compiler to generate code of predictable performance; (2) the economic history of the big systems that drive much of the visible PL research lowered the bar, relative to C's early days, of what kind of robustness programmers were supposed to aim for.

Concerning that economic history, here's an example:

http://vimeo.com/8525101

Here we have a sweet, impressive demo showing what Firefox's Javascript is capable of in terms of running an FFT algorithm and a graphics front end. Nicely, nicely done.

But think about what is going on there: We're going through many, many layers of highly complicated Javascript implemenation just to get that performance. And it's flakey: combine it with something that stresses the GC or try and run it on another browser (assuming that other browser also exposes audio buffer raw data to Javascript) and the results won't be as impressive. Indeed, in the next round of that experiment, they've moved the FFT into a C++ patch to the browser:

http://vocamus.net/dave/?p=974

Why don't browsers have, perhaps in addition to a kitchen-sink language like Javascript, several other high-level, tiny, safe languages for particular domains - and ways to combine these in a single application? It would be relatively easy to make a safe, high-level, tiny language suitable for such things as computing FFT in a portable way but we didn't. And we didn't because of the history of how Javascript gained its position, and the history about how advanced browser application development became a largely empirical affair (each year, a zillion new apps are implemented - these are then "tested" on lots of browsers - some of them seem to be fairly robust, performance-wise and others don't. Lather, rinse, repeat.)

The C guys had the better idea. And the harder sales problem.

JägerMonkey

Wow! If Mozilla is implementing a method-based JIT compiler after so much energy invested in TraceMonkey and it's trace-based JIT compiler then that is quite an admission that the trace-based approach is not sufficient. It would be interesting if the V8 group adds a trace-based JIT next.

Long Chess Game

V8 may not do tracing soon, but Android (Dalvik) is doing a straight-on trace-trees JIT based on Andreas Gal's thesis.

Trace-JITting has greater headroom than less specialized JITting on numerically-intensive, type-stable loops. Tracing makes variables constant on trace, with trivial dominance computation for all the traditional loop optimizations with none of the full SSA construction overhead. But Tracing has its drawbacks, in spite of the greater headroom. We live in a world of trade-offs.

So beware "Wow!" reactions. We JS implementors are in the market for the long haul. While the best techniques will rise to the top and win out, this process will take time. There won't be a quick victory for only one approach. Many dynamic and even static analyses are conceivable, and more are feasible over time. It's early days still.

/be

spinning your questions around

Why don't browsers have, perhaps in addition to a kitchen-sink language like Javascript, several other high-level, tiny, safe languages for particular domains - and ways to combine these in a single application?

Good question. VPRI's Ian Piumarta argues all you really need are 3 objects and 5 methods - not 'several other high-level, tiny, safe languages for particular domains'.

Agreed, but

I agree. A tracing JIT has much more information available, but it has much less time to do anything with that information.

By "really grind" I'm think of the kind of optimisations Single Assignment C (http://www.sac-home.org/) does on numeric programs. This is stuff like solving linear programs to work out sets of array indices, and then rearranging access patterns to make use of cache locality. Very few static compilers do this kind of optimisation. I don't know of any tracing compiler that does anything approaching this.

Maybe

Maybe they're winning. But the competition is not over, yet.

You won't win if you just slap a trace compiler onto an existing VM (as exemplified by TraceMonkey).

Creating a good trace compiler requires a lot of careful engineering and research, a fast interpreter, a specially crafted IR, the right set of compiler optimizations and lots of patience/money.

So be warned: it's not the easiest path you could take. But IMHO it's currently the best bet, if you want a dynamic language to be competitive with static languages.

There are two weak points of this approach:

  • The needed heuristics quickly grow in complexity, need a lot of tuning and create some unwanted surprises.
  • Branchy code is troublesome for a trace compiler. I'll eventually add hyperblock scheduling to solve this issue.

Lua's interpreter

Mike,

Thanks again for your response. (It was me you answer on Reddit.)

You have implemented a bytecode interpreter in LuaJIT. Did you consider the idea presented in some of the comments here about generating self-tracing machine code rather than bytecode. It would save the whole interpreter and as you say the LuaJIT interpreter is written in Assembly then generating machine code would not break any current platform independence of the interpreter. (I could imagine a design where the interpreter is completely platform independent but the trace compiler would kick in when it's machine-specific back ends were supported. But this doesn't seem to be what you've done.)

Better use an interpreter

It's much easier to record what an interpreter is doing. Just patch its dispatch table and intercept every instruction.

The small gains of a simplistic compiler over a carefully hand-optimized interpreter are just not worth the trouble (the LJ1 JIT compiler is not much faster than the LJ2 interpreter, sometimes it's worse). Better focus on improving the trace compiler and stay 'on-trace' as much as possible.

Another subtle point, often forgotten in the reports about Mozilla's plan with JägerMonkey: the method compiler only triggers trace recording, but they still need the interpreter to actually record the traces! So now they gotta keep three engines in sync (interpreter, method compiler, trace compiler). I'll leave it to your judgement whether that's a smart move or pure desperation.

We're too old to be desperate

We survived Netscape, remember? What we at Mozilla are is old (my hair color confirms this :-|), with long-lived APIs and platform components built on top of our JS engine. So our changes involve incremental rewrites and breakage.

To be clear, we intend to retire our old bytecode interpreter at some point. We do not want an interpreter and two JITs if we can use just JaegerMonkey and tracing. The bytecode (rather an improved evolution of it) may remain as a first-stage compiler output. But for the short term we will keep the bytecode interpreter, since it does many things (simple debugging support, ECMA-357 aka "E4X" support) that the JITs do not yet do.

I've talked to Ollie Hunt of Apple about tracing for JSC (WebKit). It may happen, as JS is hard to optimize fully without some kind of aggressive type inference and speculation, more aggressive than polymorphic inline caching.

There is no "one best way" to compile dynamic languages, but for JS, it seems clear that both inline/call(context)-threading and polymorphic inline caching are important. Firefox had a PIC for its bytecode interpreter before we did TraceMonkey.

Tracing is not an alternative to a well-tuned PIC: trace trees branch up to a fairly low bound to cut off potentially exponential growth, and due to aggressive specialization they do not easily recompile to more generic "megamorphic" code.

Since fast generic-code JITs, PICs, and tracing JITs seem to be complementary, we are doing all three -- we already have the latter two developed and in shipping products. But we will lose the old bytecode interpreter (I first wrote version 1 of it in 1995!) in due course.

/be

Tracing without the bytecode interpreter

Brendan, I'm glad you found this thread. :-)

If you retire the bytecode interpreter completely then what is the high level plan to record traces? Will you method JIT first with some extra profiling instructions that map code blocks back to the intermediate representation and then compile optimized code from a combination of profiling info and the intermediate representation?

Looks like it

For notes on their plan, see a quick note on JavaScript engine components.

Maybe I'm missing it but

Maybe I'm missing it but does that post really answer my question about how native code will be "traced" and/or mapped back to a higher level representation using some profiling instructions?

Code is code

It's not that hard. With either a bytecode interpreter or an inline/call-threaded JIT, you have to count hot loops.

With a bytecode interpreter, once a loop header is hot enough, you can swap indirect-threading computed-goto jump tables, or if computed gotos are not supported (MSVC), use hideous macrology to double your switch-in-a-loop's case count to interpose recording call-outs.

With inline/call-threading, you quickly regenerate the code to include recording-hook calls, and switch to the instrumented code.

You write "native code" but JaegerMonkey, like V8 and JSC, does not emit machine code with arbitrary structure. Instead you get blocks of more type-generic code, either inlined and specialized a bit using caches and hottest-case-first self-organizing structures, or big enough to be consolidated into a stub or helper that is called for slow paths and chunky opcodes.

These native code blocks still correspond to higher-level operations in the source language. The recorder can therefore re-lower to trace-optimized machine code.

/be

How to be a millionaire

Old Steve Martin joke...

So after you get the million dollars, you have to do all those things Mike said -- his analysis is spot on.

One thing Mike didn't highlight: get a simpler language. Lua is much simpler than JS. This means you can make a simple interpreter that runs fast enough to be balanced with respect to the trace-JITted code.

IIRC, LuaJIT2 does not nest trace-trees (loops) via call instructions, as the UC Irvine researchers have done, and as we do in TraceMonkey. It simply runs the outer loop in the fast-enough interpreter. In TraceMonkey nowadays, we not only nest trees, we trace tail and direct recursion.

JS is surely a beast compared to Lua, but it's the one language interoperably implemented across browsers. This interoperation condition is hard to achieve (C, Scheme, etc. don't try -- compiler writers guard the standards' underspecification to enable optimization experiments; deployed code is firewalled generally, or only ported slowly), and harder still to justify supplementing with yet more (generally unsafe-code) language implementations.

So people are targeting JS with compilers, more and more from compilers running in-browser. JS is the X86 of the Web.

To quote the Wicked Witch of the West (mid-melt): "What a world!" ;-)

/be

This interoperation

This interoperation condition is hard to achieve (C, Scheme, etc. don't try -- compiler writers guard the standards' underspecification to enable optimization experiments; deployed code is firewalled generally, or only ported slowly), and harder still to justify supplementing with yet more (generally unsafe-code) language implementations.

Sorry, I don't understand this. Even JavaScript has portability issues. It is just the case that it has 100% market share and tightly coupled to HTML thanks to ridiculously bad ideas (the DOM and CSS) aimed at adding interactive behaviors on top of static documents as progressive enhancements.

C is an idiotic language because it doesn't even do array bounds checking. The web would have failed years ago if C was used to manipulate HTML. We'd all be drinking Brawndo by now if that was the case. So I don't understand how "underspecification" of C matters at all.

As for Scheme, I don't understand how it is underspecified. Quite the contrary, Scheme demands and expects compiler writers to provide Proper Tail Call optimization. By making this trade-off in specificity, Scheme is a much simpler language to specify (<100 pages, an order of magnitude smaller than Common Lisp), but requires more advanced knowledge for implementors of the Scheme spec. Back in 1995, this was a big issue, since most of the world had never heard of "open source software" and Linus Torvalds was just starting Linux, and, my gosh, slackware was an active linux distribution! The world has changed, and your expectations of how to solve problems in it today must change if you want to radically innovate. As an example, look at how the current web standards process for HTML rendering has now basically become simply pointing to webkit and saying "That's what HTML looks like!"

JS is the X86 of the Web.

Agreed, but this does not mean we cannot improve trustworthiness on the Web. CSS was arguably 'fixed' (sort of) the same way by refining it into levels.

Edit: Fixed escaping <

under-specification

As for Scheme, I don't understand how it is underspecified.

I suspect Brendan was referring to argument evaluation order.

Scheme

As for Scheme, I don't understand how it is underspecified

http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-8.html

5.9 Unspecified behavior

If an expression is said to “return unspecified values”, then the expression must evaluate without raising an exception, but the values returned depend on the implementation; this report explicitly does not say how many or what values should be returned. Programmers should not rely on a specific number of return values or the specific values themselves.

http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-10.html

The effect of returning twice to the continuation of the last body expression is unspecified.

http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-12.html

When a procedure call is evaluated, the operator and operand expressions are evaluated (in an unspecified order) and the resulting procedure is passed the resulting arguments.

http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-14.html

If test yields #f and no alternate is specified, then the result of the expression is unspecified.

The result of the set! expression is unspecified.

etc...

The goal was, like C, to allow implementers to do whatever was most efficient given their target platform and over-all approach. Just as with C a great deal of the art of writing portable Scheme is to avoid unspecified behavior.

I get it

Thanks for your help James!

I just searched the ECMA-262 5th ed spec and the word "unspecified" only comes up once, and it is a note stating that an unspecified behavior must now be specifically handled by ignoring extra tokens in the token stream for an argument list.

I had no idea JavaScript was so detailed.

However, my points were mainly about how with JavaScript people still avoid certain JavaScript idioms because they leak memory on certain platforms or perform poorly on others. That's why I gave the example of PTC. My viewpoint is changed now thanks to you and Dave Herman.

Edit: I just had a separate thought. RFC2119 uses a different vocabulary. So I searched the ECMA standard for "should not" and found some search results.

Host objects may define additional constraints upon [[Put]] operations. If possible, host objects should not allow [[Put]] operations in situations where this definition of [[CanPut]] returns false.

The implementation of ECMAScript should not try to determine whether the exact time was subject to daylight saving time, but just whether daylight saving time would have been in effect if the current daylight saving time algorithm had been used at the time. This avoids complications such as taking into account the years that the locale observed daylight saving time year round.

An implementation shall not treat other kinds of errors as early errors even if the compiler can prove that a construct cannot execute without error under any circumstances. An implementation may issue an early warning in such a case, but it should not report the error until the relevant construct is actually executed.

Host objects are from the Devil

Good finds. Those underspecified counter-examples are botches, big sore points with web developers. Mainly they concern the IE DOM and its COM objects that do not behave the same as JS objects. Until ES5, many host objects could not be mimicked by self-hosted implementations (some misfeatures require future meta-programming features).

Of course historical DST would be a giant pain and source of bugs. Indeed ECMA-262 has always used "extrapolated Gregorian" to go back into the past, never mind the lost week in the middle ages, or previous calendar standards and shifts.

ES5 has tightened up error reporting norms a bit. This is actually good for interoperation. One of those "host object" quirks, a taint from VBScript, is callable lvalues:

d.item(i) = j;
This is found on the web, but in IE-only branches of JS control flow. It must parse in all browsers without an early error.


/be

Not "what Brand X does"

As an example, look at how the current web standards process for HTML rendering has now basically become simply pointing to webkit and saying "That's what HTML looks like!"

Sorry, no. Mozilla is not rolling over, we have more market share than WebKit browsers, and we co-founded the WHAT-WG, where we continue to work on HTML5 as well as in the W3C.

In truth much in HTML5 is based on Hixie's synthesis of what IE, Gecko, WebKit, and Presto (Opera's engine) do. WebKit may indeed "win" more by being (a) open source (this helps Gecko too); (b) younger in its bones, so benefiting from more experience gained by older efforts; but that's fine and fitting.

HTML5 is not a matter of standardizing "what WebKit does", any more than JS future standards will bind themselves to one implementation. This is one reason the ECMA-262 standard is so detailed: the vendors competing and cooperating to create it will not agree to use one source base. They will follow de-facto standards where those standards are good enough, however.

Given this reality, and the web-developer mandate for interoperation, we have no choice but to specify harder. We're looking at a testable spec (definitional interpreter) for the next major edition.

John Mitchell, Sergio Maffeis, and Ankur Taly have been working on an operational semantics for JS, but this is a long road, and it would need mechanizing. Others have done subset semantics, but ECMA-262 needs to be as complete in future editions as it has been in past ones, if not moreso. We can always drop into prose, but we lose testability if not specificity.

Apart from parsing, HTML5 has an easier time using prose, since pixel exact layout is not promised unless you use CSS to achieve it, and other effects need less precision -- but at the limit HTML5 wants to be as precisely specified as JS does. That definitely excludes "what WebKit does", or "what [brand X] does", as a fiat spec.


/be

Observation

Most people like me develop in one browser and then figure out how to make it work on other browsers. [Edit: this is true even if I use a tool like Expression to allow simultaneous previews of multiple renderers, and also have VMs in place. The reason is that I pick a default browser, and right now the one with good looks and great speed is my favorite, because it is simply fast.] Sometimes we have the hacks in our head as we create the design.

My observation is that I anticipate more and more developers will be using WebKit as that default. This will in turn mean WebKit is the defacto standard, because it is what us in the real world building HTML - what goes across the wire - are using for testing.

Right now, it seems to me like Firefox's greatest advantage is that it has FireBug.

Just being pragmatic, even if fiat!

Off topic, but web developers target users, not themselves

IE still has dominant market share on "top sites", and web developers do not fail to make content that works in IE (all of them -- IE7 and IE8 at least, and they differ! IE6 is hopeless outside intranets, and web developers are starting to shun it).

Firefox has more market share than all WebKit-based browsers, and while we aim to keep web developers happily using us as "browser plus debugger in which to develop" and "first target to test", there's no guarantee we will prevail. Competition is what we wanted to restore to the market when we launched Mozilla and then Firefox. We succeeded.

But since market share shifts only slowly, developers cannot deploy and then tell their bosses "works in WebKit" to excuse a failure to work well in Firefox -- or in IE.

The game theory on the web is such that being "number 2" can beat being numbers 1 or of course number 3 or higher. Example: when Microsoft rolled out Live Maps (first version, using "Atlas", their answer to AJAX or Ajax), they assumed the IE DOM by browser-sniffing for IE: if not IE, they then assumed Firefox's JS dialect, with its getters and setters.

This caused rapid work at Apple and Opera to support getters and setters. The asymmetry in most web content whereby browser-sniffing tests

if (document.all) {
  // IE only code here
} else {
  // Mozilla, WebKit, Opera, etc. here
}

is pervasive and it won't go away quickly.

I agree that the competitive fight is over second-place status, and it is being fought first in the web developer segment of the market. Nevertheless, web developers cannot in general move users to new browsers by giving them broken pages or apps. This is the risky and rightly hated "viewed best in Netscape" vs. "viewed best in IE" stratagem from the '90s.

Market share in my view shifts based on keeping pages working while adding better features, performance, and intangible goods to-do with brand (privacy, trustworthiness).

Lets get back to JITs, shall we?

/be

A little more appealing in

A little more appealing in terms of semantics to the LtU crowd might be Arjun Guha's more standard approach to semantics (including executable PLT redexes I think!) for JavaScript.

Clearing up some misconceptions

LuaJIT does not treat traces as trees. It treats them as a graph structure, more like Dynamo. There are no 'interpreter-only' gaps in the control-flow (other than for NYI stuff).

Of course nested loops run fully native. It needs one root trace for the innermost loop and a side trace for each outer loop which connects back to the root trace.

Why don't you try LuaJIT before speculating? It takes only a few seconds to compile. Running a triple-nested loop:

$ cat test.lua
for i=1,100 do
  for j=1,100 do
    for k=1,100 do
    end
  end
end
$ luajit -jv test.lua
[TRACE   1 test.lua:3]
[TRACE   2 (1/3) test.lua:2 -> 1]
[TRACE   3 (2/1) test.lua:1 -> 1]

And the example of two inner loops, Andreas likes to cite:

$ cat test.lua
for i=1,100 do 
  for j=1,100 do end 
  for j=1,100 do end 
end
$ luajit -jv test.lua
[TRACE   1 test.lua:2]
[TRACE   2 test.lua:3]
[TRACE   3 (1/3) test.lua:3 -> 2]
[TRACE   4 (2/3) test.lua:1 -> 1]

Runs fully native of course. Draw the graph, and you'll see. Or run with -jdump and look at the traced bytecode, IR and machine code.

Oh, and yes, LuaJIT traces tail-, up- and down-recursion, too. The generated graphs are ... umm ... a bit odd and not what you'd expect. Certainly not what a method-at-a-time compiler would ever generate. But it runs fast.

IMHO trace trees and especially nested ones just drive up complexity without a real gain. In TraceMonkey you're not compiling a whole tree together (which is where the only pay-off might be), since this would lead to excessive recompilation whenever a trace is attached to the tree. So this approach is a pure loss.

LuaJIT reduces trace-transition costs with cross-trace register coalescing. This is much simpler and helps all edges of the graph. Not just the side traces that happen to loop back to their parent root trace.

Yes sure, JS does have some weird semantics. But no show stoppers. And most JS programs don't use all of the weirdness at once. It does take more of an effort to write a JIT compiler than for Lua, of course. With proper design and engineering, there's no reason a JS JIT compiler couldn't compete with LuaJIT.

Why Lua?

With proper design and engineering, there's no reason a JS JIT compiler couldn't compete with LuaJIT.

Slightly tangential but why did you choose to write a Lua JIT rather than a JavaScript JIT? Where there any technical reasons?

Trading speculations

Mike, you've done great work -- when last we talked, it seemed trace trees without nesting were enough, but I do remember you were pursuing other structures.

Your point seems to cut both ways, however: if you've kept up with TraceMonkey, you know it does up, down, and tail recursion now. We also stitch together differently-typed trees anchored at common points using fragment traces. With Intel, we are investigating using spare cores for background whole-tree recompilation, which gives us more cycles for optimizing harder.

Trace trees have the virtue of returning control to the same program point (loop header, recursive call), so in the case of loops at least, one can inline method calls without constructing activation records. Less structured traces may have to push and pop activations, which always hurts.

On the differences between JS and Lua, you can say it's all a matter of proper design and engineering (what isn't?), but intrinsic complexity differences in degree still cost. You can push the hard cases off the hot paths, certainly, but they take their toll. JS has more and harder hard cases than Lua.

One example: Lua (without explicit metatable usage) has nothing like JS's prototype object chain. See this bug comment for some startling statistics on how deep these chains can get in gmail.

/be

How about the Pypy approach?

Mike, I'm an admirer of your work and your oppinions are always very interesting an spot on. I'd like to know what's your oppinion on pypy's approach to tracing. AFAIK, it's very differet than that of Luajit or Tracemonkey, since it works "one level down" (tracing the interpreter loop rather than the user program).

It's also a much broader project, not exclusively aimed to speed up python, but to become a general framework for implementing dynamic languages. So I guess that, perhaps, the approach taken for getting performance involves other trade offs not present in Lua or js.

I wonder if the idea of having a slow interpreter automagically sped up by tracing is a good strategy. It seems this didn't worked for Mozila and, obviously, it's not the path you've taken, but considering the special characteristics of this project (interpreter writen in a subset of python, tracing one level down, etc...and the latest benchmarks), what do you think?

I would also like to know the oppinions of the Mozila guys as well :-)

TANSTAAFL

I think there are three major problems with the PyPy approach:

1. All those layers add up. Removing the overhead is costly and you rarely get down to zero overhead.

2. Many complex language constructs are based on sequences of simple decisions. The order in which the interpreter performs these is not necessarily the optimal order for compiled code.

E.g. the LuaJIT interpreter needs to do checks for metamethods in a certain order, but the JIT compiler reorders them to minimize the number of guards and to improve hoisting opportunities.

This is a bit like what some C compilers do with an (expr1 && expr2 && expr3) construct. They are able to reorder, combine or parallelize the individual expression evaluations if it's profitable and the semantics allow for that.

3. PyPy is tracing low-level constructs. This drops all high-level semantic information that was present in the original language constructs. Maybe they've found a way to keep these (I haven't checked). Otherwise they're going to lose big time on alias analysis (if/when they add that). Any store could alias any load in that model.

Generating good code for higher-level languages crucially depends on having high-level alias analysis. E.g. have a look at this Lua code:

local t={}; for i=1,100 do t[i] = math.sqrt(i) end

The loop contains two hash table loads (for _G["math"] and math["sin"]) and a store to the array part of a table. LuaJIT is able to hoist all hash table lookups out of the loop, because it can prove they cannot alias the stores to the array part. That's because it preserves the high level information in the IR (HLOAD cannot alias ASTORE).

So I'm not sure they can make PyPy fly, but at least it holds more promise than Unladen Swallow.

These are valid points

Hi mike.

This are all valid points (maybe except 3 which is not completely true) and we pay for this. The reason why we decided to do it in the first place is that python is an insanely complex language. Not only that, but certain details are not well defined, but instead they're based on CPython behaviors. I don't think we would be able to write a JIT for full Python, including builtin modules within a finite timeframe if not tracing on the meta meta level.

Regarding point 3 we do preserve certain high level constructs, so the JIT, even though it's tracing on low level, knows a bit about language semantics.

The approach is in a way middleground. It does not provide as many opportunities for micro optimizing as LuaJIT, on the other end it yields a promise of a working JIT with not that much effort.

Probably you can extend the approach to have python-level optimizations on top of meta-tracing, but we're clearly not there yet.

On the other end, we're not trying to compete with other languages, simply because Python semantics don't allow certain optimizations to happen, but instead we simply try to make python faster.

Cheers,
fijal

This thread

I would just like to say that this is an awesome thread, and I'd like to see more like this. It's great to see some of this discussed (a) in one place, and (b) in the context of general design tradeoffs rather than in the context of one specific system (e.g., Firefox). Most of the time, it seems that keeping up with this topic involves following a bunch of mailing lists and blog posts, on which most of the content is irrelevant.

My thanks to all the contributors to this discussion.

This thread

I agree the comments have been fantastic. Certainly more then I'd hoped for. Can anyone lure the V8 guys over this way?

Keeping up with the state-of-the-art

Most of the time, it seems that keeping up with this topic involves following a bunch of mailing lists and blog posts, on which most of the content is irrelevant.

That's why I asked a while ago about a course on JIT compiler design. I found that few compiler courses mention the topic at all, let alone delve into any kind of detail on the anatomy of cutting edge dynamic language implementations.

Definitely

I strongly agreed with you at the time, but I didn't post, because I had nothing to add. ;)

PyPy on tracing

The PyPy team posted benchmarks on their status blog last week and from the comments, it appears they are betting on tracing too.

Unstated here, when

Unstated here, when discussing tracing, an elephant in the room is tracing calls into the browser, which is a good chunk of the time in most sites (though I realize one can make a chicken-and-egg argument about why that is).

On a somewhat related note, there was a fun talk about layering tracing JITs at DLS this year that LtUers might enjoy.

What do you mean?

Hi Leo,

I'm not sure what you mean by "tracing calls into the browser" -- can you elaborate? Naively, I would guess that you mean calls into native code triggered by DOM updates, but that code is already compiled. Do you mean JS callbacks invoked by the native methods, so that the control flow is obscured by the intervening native calls?

Naively, I would guess that

Naively, I would guess that you mean calls into native code triggered by DOM updates, but that code is already compiled.

I mean exactly that: if you control the x86 compiler... ;-) There are many different ways for this to work: there are projects to write browser components in managed languages or (what I'm doing) code generate them, projects to do hardware or hyperviser-level tracing, and I'd expect instrumented C++ compilers (e.g., a LLVM extension), etc.

The IE8 guys published some funny numbers showing only 4% of the CPU time on average pages (top 250?) was in JS, with JS heavy sites (e.g., gmail) jumping up to 15-20%. My numbers have come out similar for WebKit on a smaller sample. While a lot of this discussion is great, it's important to step back and understand the browser domain and where it fits in.

Don't assume equilibrium

Intensive use of JS, e.g. for rendering to canvas, or for self-hosted DOMs, is the wave of the future. Don't let IE's trailing edge fool you. We want memory safety and control flow integrity, neither forthcoming in large C++ DOM codebases.

Many DOM methods memoize or cache. Self-hosting their "front sides" that try to hit the cache lets the tracing JIT inline the hit-case code. Wins on both safety and speed, compared to C++ implementations.

/be

Intensive use of JS The

Intensive use of JS

The question is whether it is isolated or uses native components. My chicken-and-egg statement was that perhaps we don't see a CPU intense JS because that'd be prohibitively expensive and we're already maxed out just doing native GUI stuff.

, e.g. for rendering to canvas, or for self-hosted DOMs, is the wave of the future

This is why I brought up managed browsers -- which is what this encodes. I disagree in part; I actually don't believe self-hosted DOMs are the way of the future if you mean by user code (a la Caja). In particular, the Proxy/mirror proposal is too geared to membranes, which are fragile (security-wise): I am on the side of *not* distinguishing between proxy and normal objects in terms of such controls, so any taming of the DOM can be more directly handled (see my papers on ConScript and Object Views: I see the latter principal-sensitive advice primitives implemented using interpreter support as in the former).

I have less developed guesses about the role of canvas (in terms of accessibility and performance) colored from my time @ Adobe and our current investigations into GPUs, but can't make as educated of a comment about it yet.

Don't let IE's trailing edge fool you.

Yep, that's why I said I benchmarked WebKit ;-) (OTOH, all browsers except maybe Dillo are too slow for my domain of mobile devices!)

We want memory safety and control flow integrity, neither forthcoming in large C++ DOM codebases.

Amen on the former. Unfortunately, not sure what you mean by the latter (e.g., stuff like Trevor Jim's intrusion detection systems of CFGs of code?).

DOMs, membranes, CFI, oh my

As noted, self-hosting is not all or nothing: the hot path may be the cache hit case, and the slow path goes into native code. We're going to try this in Mozilla, but after we fry other fish.

Membranes are fragile if you (the web developer) have to remember to make them yourself instead of leaking a capability (is this what you meant by fragile?), but pervasive mediation using compilers and VMs is the better way. We already create wrappers in Firefox under the hood, from a central place, to solve several problems.

This is not ocap religion, just good separation of costs and concerns. The same-origin/same-global JS access paths must be fast, so are unmediated.

"Control Flow Integrity" is Abadi, Budiu, Erlingsson, and Ligatti; the follow-on work has culminated (so far) in Google Native Client. We don't plan to compile Gecko with NaCl, but we are already mitigating attacks on vptrs in C++ objects allocated and potentially prematurely freed from type-classifying arena pools.

Virtual method forgery via JS reallocation of prematurely deleted C++ object bugs is a problem in all major browsers on many OSes.

/be

Membranes are fragile if you

Membranes are fragile if you (the web developer) have to remember to make them yourself instead of leaking a capability (is this what you meant by fragile?), but pervasive mediation using compilers and VMs is the better way. We already create wrappers in Firefox under the hood, from a central place, to solve several problems.

One small leak, whether from an incorrectly implemented membrane, incorrect usage, or an incorrect policy controlling it, can expose everything. In security terms, privilege escalation is excessive under them and the attack surface is wide. I think we're agreeing language support (e.g., dependent types a la Fine or what I'm attempting by adding origins to JS) fixes that, though I'm unclear as to the actual technique you're suggesting with compilers and VMs (in my examples, dep types is an example of mostly static analysis in compiler, a case of VM support is what we're now doing with origins in JS). I'm for the abstraction of a membrane -- it's close to notions of boundaries being examined for contracts at NEU PLT -- just not the fragile implementation as a wrapper without VM or type support. I don't know of a reasonable solution for C++-ish code, but, again, I'm pursuing a story for JavaScript.

"Control Flow Integrity"

Thanks for the ref! Trevor Jim's papers were around the same time and surprisingly similar, hadn't seen this one.

Yes, but

Well, yeah, but that stuff is worth optimizing, too. If you mark your native calls as pure or impure, then you can, say, move some pure function calls out of a loop.

downside: asm

Perhaps unrelated to tracing vs non-tracing jits, but a downside of writing a JIT is that you usually can't implement it in your language of interest -- for example, Mike has implemented LuaJIT in assembly and C instead of Lua.

A tracing JIT implemented by the traced language would probably have unacceptable warmup costs. I would rather write a compiler in Scheme than in C, C++, or assembler.

Dog food

A tracing JIT implemented by the traced language [...]

Despite the "dog food" aspect, not all languages are well suited for writing compilers (but still would benefit from JIT compilation).

Case in point: I would not want to write a compiler in a concurrency-oriented language like NESL, Erlang, etc., when I have the choice of using Haskell, or Lisp, or one of the ML dialects.

PyPy

Tell that to the PyPy guys :) They've developed a language called RPython that is a statically typed susbset of Python which they can automatically translate to C, JVM bytecode, CLR bytecode, and probably some other stuff. This lets them right the VM, stdlib, GCs, and JIT all in Python.

Jikes RVM

Well, most, if not all of the Jikes RVM JIT is written in Java, so they eat their own dog food. Mostly.

Is "byte code" dead ?

It seems to me that compiling to "byte code" will sooner or later become extinct. It seems much more reasonable to keep the syntax tree, possibly post-processed for stuff like names -> de-bruijn and so on. Then from this syntax tree, much better optimization decisions should be possible. Also I think while JavaScript is interesting for obvious reasons, from a purely technical view it is more interesting to start with designing a dynamic language from scratch (like I do with Babel-17 :-)) and look at how to interpret / compile this new language. For example, I decided that Babel-17 will have no side-effects. My working hypothesis is that we first need to understand how to deal with side-effect free dynamic languages, before looking beyond.

Is "byte code" dead ?

It seems that Mike Pall is saying bytecode is not dead but V8 and eventually Spidermonkey won't have bytecode. Byte code seems to be a way to make a fast, platform-independent (e.g. ANSI C) implementation of a language. When the implementation needs to go faster than that a line has to be crossed into the platform-dependent world so why not go all the way if there enough interest/money to write really fast compilers supporting the various platforms?

I agree. I am just wondering

I agree. I am just wondering if it is worth the effort to produce a bytecoded interpreter at all nowadays. For example, my current situation is that I am developing this language, and I am writing an interpreter for it which accompanies the spec for the language.
This interpreter is coded in Java, so it is sort of platform-independent already. The representation of the program is not bytecode, but a high-level java data structure. This data structure would be a great starting point for a compiler, too, so it seems if there is a need for speed, then it might be better to directly work on compilers instead of spending time on bytecode.

Web annotations