As of the previous post, we still had a number of problems with optimised builds. Specifically, Java "long" type variables were handled incorrectly at optimisation levels beyond "-O". We have now identified and fixed all currently known problems, ranging from missing inline-assembly clobber flags to unsafe 64-bit variable manipulation. While working on this we wrote a couple of Java "long" type tests, which, after adding all missing long bytecode trap handlers, now all pass. JamVM can now be successfully built and used with the "-O", "-O2", "-O3", and "-Os" optimisation flags. Interestingly, we have not been able to detect any performance improvement when switching from "-O" to any deeper optimisation. All relevant changes have been committed upstream.
Added a TABLESWITCH bytecode implementation, fixed a couple more object references, and fixed two bugs that caused segfaults when built with the "-O" gcc optimisation flag. This brings a nice speed-up of 0.6 seconds (around 33%) on "Hello, World!". The software version benefits from the same fix too, though, and still keeps us just above 10% behind on this test. All fixes have been committed to the git repository.
We managed to further improve performance of our hardware-accelerated JVM by using some of the "quick" bytecodes supported by JEM, optimizing memory allocation for code segments, and eliminating multiple redundant calls to pthread_getspecific(). This brings us down to 1.83s on our "Hello, World!" example, which is now "just" 17% slower than the pure software version. One further optimization we have not committed yet is using (MAP_PRIVATE | MAP_POPULATE) instead of (MAP_SHARED), which saves another 40ms on the same test (15% slower than the software version), but we are not yet certain about the implications of using a private mapping.
Our first threaded example works! Having implemented kernel support for JEM context saving and restoring (see earlier posts in this category), we have now fixed a few more cases of object references on the heap and a couple of other buglets, and the test successfully runs through.
As usual, the code is available for download from our git repository and is marked with the "jem-0.2" tag.
Until now we only used JamVM in single-threaded mode: firstly, saving and restoring of the Java context was not supported in the kernel, and secondly, the kernel oopsed reproducibly, though at varying places, when multiple jamvm threads were running. I finally found time to study this problem and have "recognized its nature." It turned out to be the same old problem with "rotated" registers.

JEM implements the Java operand stack in the first 8 general-purpose registers and pushes and pops them using an internal rotation. When such a rotation is in effect, single-register and multiple-register accesses actually hit different registers. For example, a "load multiple" (LDM) from registers r0-r7 returns values v0-v7, but reading r0 using MOV or another single-register instruction can actually deliver v1, or any other value, depending on the number of elements on the Java stack. This rotation is active as long as the R bit in the Status Register is set.

It has been known since the early days of JEM development that this bit does not get automatically cleared when leaving JEM mode: a thread that has once entered JEM has the R bit set for its entire lifetime. Only when switching into supervisor (kernel) mode or to a different thread is the Status Register replaced and the R bit cleared. This is the reason we have to wind and unwind the Java operand stack on every entry to and exit from JEM mode.

It now turns out that other threads in the JamVM process get the R bit set too! And not only once, at their creation time: clearing the R bit at fork time is not enough, it gets set again and again. Unfortunately, it is still absolutely unclear _how_ and where it gets set. As a workaround, we now clear the R bit on every context switch if the thread we are switching to is not going to run in JEM mode. This eliminated the oopses for now and allows us to continue working on multithreading support in JEM.
On a simple "Hello, World" test we are still about 1 second behind the pure software JamVM version, which makes performance optimization one of our primary goals. We started by counting all occurring traps and measuring the time spent in them. This provided some interesting and valuable information, which we presented at FOSDEM this year. Next we want to find out exactly which code fragments take the most time, so we turned to oprofile. For this we had to upgrade to the current version 2.3.0 of the AVR32 buildroot. We now have some preliminary results, which seem to point at UTF-8 string hashing and at memory allocation in the kernel.