We have managed to further improve the performance of our hardware-accelerated JVM by using some of the "quick" bytecodes supported by JEM, optimizing memory allocations for code segments, and eliminating multiple redundant calls to pthread_getspecific(). This brings our "Hello, World!" example down to 1.83s, which is now "just" 17% slower than the pure software version. One further optimization we have not committed yet is using (MAP_PRIVATE | MAP_POPULATE) instead of (MAP_SHARED). That saves another 40ms on the same test, bringing us to 15% slower than the software version, but we are not yet quite certain about the implications of using a private mapping.
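The main semantic difference between the two flag sets can be shown with a small, hypothetical C sketch. It is Linux-only, because MAP_POPULATE is a Linux extension. The helper name `first_byte_after_write` and the scratch file are illustrations and not part of our JVM. With MAP_SHARED, stores through the mapping reach the underlying file. MAP_PRIVATE gives the process copy-on-write pages, and changes to those pages are never written back. MAP_POPULATE only pre-faults the pages when mmap() is called, which may account for at least part of the time saved.

```c
#define _GNU_SOURCE            /* exposes MAP_POPULATE (Linux-specific) */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: creates a scratch file containing "hello", maps it
 * with `flags`, stores 'X' through the mapping, unmaps it, and returns the
 * first byte the file itself holds afterwards (or -1 on any error). */
int first_byte_after_write(int flags)
{
    /* exactly one of MAP_PRIVATE / MAP_SHARED must be chosen */
    assert(((flags & MAP_PRIVATE) != 0) != ((flags & MAP_SHARED) != 0));

    char path[] = "/tmp/jem-map-demo-XXXXXX";
    const char data[] = "hello";
    const size_t len = sizeof data - 1;
    int result = -1;
    char *p;
    char c;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);              /* scratch file vanishes once fd is closed */

    if (write(fd, data, len) != (ssize_t)len)
        goto out;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (p == MAP_FAILED)
        goto out;

    p[0] = 'X';                /* private: lands in a copy-on-write page */
    msync(p, len, MS_SYNC);    /* writes back shared changes only */
    munmap(p, len);

    /* read the file directly, bypassing any mapping */
    if (pread(fd, &c, 1, 0) == 1)
        result = (unsigned char)c;
out:
    close(fd);
    return result;
}
```

With MAP_PRIVATE | MAP_POPULATE, the helper should return 'h' because the file is untouched. With MAP_SHARED, it should return 'X'. A private mapping should therefore only be safe if nothing relies on writes through the mapping becoming visible in the file or to other processes.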
Our first threaded example works! Having implemented kernel support for saving and restoring the JEM context (see earlier posts in this category), we have now fixed a few more cases involving object references on the heap, plus a couple of other buglets, and the test now runs through successfully.
As usual, the code is available for download from our git repository, tagged "jem-0.2".