This write-up discusses an aspect of the JVM's just-in-time (JIT) compilation: how it optimizes dynamic method dispatch.
Following are the results from a run of 5 measurement iterations of JMH (the Java Microbenchmark Harness), showing that the JVM could perform 69 million operations in the bi-morphic case versus just 62 million operations in the mega-morphic case. No matter how many times the benchmark is run, the results remain consistent.
This benchmark was executed against a Java 8 runtime; similar results appear if we run the same code against the latest update of Java 10.
The hierarchy of classes involved in this example is shown here:
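Since the original diagram may not render everywhere, here is a minimal sketch of what such a hierarchy could look like. The interface name Beast and the method's return type are assumptions made for illustration; the three concrete types and the method name come from the example itself.

```java
// A minimal sketch of the class hierarchy (the interface name "Beast" and
// the int return type are assumptions, not the article's original code).
interface Beast {
    int showSomeAwesomeness();
}

class Beowulf implements Beast {
    @Override public int showSomeAwesomeness() { return 1; }
}

class Centaur implements Beast {
    @Override public int showSomeAwesomeness() { return 2; }
}

class Polyphemus implements Beast {
    @Override public int showSomeAwesomeness() { return 3; }
}
```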
The above results come from a run of the following benchmarking code, and we can see there is literally no difference in the work done by the two methods being benchmarked. So what just happened?
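If the original benchmark source doesn't render here, a JMH benchmark of the following shape, reusing the hypothetical Beast hierarchy sketched above, would exercise the same effect. This is an illustrative sketch, not the author's exact code, and its line numbers do not match the line numbers 40, 43 and 57 referenced later in the text:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class DispatchBenchmark {

    private final Beast[] beasts = { new Beowulf(), new Centaur(), new Polyphemus() };
    private int i;

    @Benchmark
    public int biMorphicCallSite() {
        i = (i + 1) % 3;
        Beast beast = beasts[i];
        if (beast instanceof Beowulf) {
            // The instanceof check "peels off" Beowulf: only one concrete
            // type ever reaches this call site, so it is monomorphic.
            return beast.showSomeAwesomeness();
        } else {
            // Only Centaur or Polyphemus can reach this branch, so the JIT
            // can cache two dispatch targets: a bi-morphic call site.
            return beast.showSomeAwesomeness();
        }
    }

    @Benchmark
    public int megaMorphicCallSite() {
        i = (i + 1) % 3;
        // All three concrete types flow through this single call site,
        // so the JIT falls back to full virtual dispatch: mega-morphic.
        return beasts[i].showSomeAwesomeness();
    }
}
```

Both methods do the same total work; the only difference is the type check that splits the call site in two.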
In general, the JVM learns (through profiling performed as part of just-in-time compilation) how many concrete types actually show up for a polymorphic reference. With this awareness, the JIT compiler can make some powerful optimizations.
If the JIT compiler sees that the application only ever assigns one concrete type (e.g. only Beowulf, or only Centaur) to the reference variable 'beast', then invocations of beast.showSomeAwesomeness() don't need a virtual function table dispatch.
A virtual function table dispatch (exactly what happens in C++, and don't be surprised to learn that the JVM in its running state is actually a C++ process) is responsible for invoking the correct method implementation on the correct concrete instance in polymorphic scenarios. Virtual lookup and dispatch have a cost that is generally negligible but becomes visible in precisely benchmarked scenarios.
When the JIT compiler sees only one possible concrete type, it can cache the pointer used for correct method dispatch right at the call site. This is called monomorphic dispatch. Once cached, invocation gets super fast because no lookup is needed on future calls.
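For instance, in a loop like the following sketch (again assuming the hypothetical Beast hierarchy above), the call site is monomorphic:

```java
// Only Beowulf ever reaches this call site, so after warm-up the JIT can
// cache a single dispatch target (and typically inline the method body),
// eliminating the vtable lookup entirely.
Beast beast = new Beowulf();
long total = 0;
for (int n = 0; n < 1_000_000; n++) {
    total += beast.showSomeAwesomeness(); // monomorphic: one cached target
}
```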
If the JIT compiler sees only two possible concrete types at a call site, it applies another optimization called bi-morphic dispatch, which works like the monomorphic case but caches two pointers at the call site.
This is the case at line number 43 in the example above. Since the JIT compiler knew that the only concrete types in use would be Centaur and Polyphemus, it cached the method dispatch accordingly. This was possible because the code had already revealed Beowulf as one concrete type at line number 40, so only two possible types remained, and the JIT compiler was able to perform the bi-morphic optimization at the call site in the else block. This can be seen as peeling off (or revealing) some of the possible types, leaving only two behind.
This kind of trick is only possible when the number of possible types is known to be limited. If more types can appear at a call site, mega-morphic dispatch happens, which is the slowest case because the JIT compiler does no call-site caching at all.
Moreover, the caching the JIT compiler performs is not everlasting; there is an undo mechanism. For example, what if the reference variable beast is not method-local but comes from outside, where some other thread could change the assignment just before the invocations at lines 40, 43 or 57? Using the cached pointers would then invoke the wrong code. Therefore the runtime places a guard just before the invocation to validate the receiver's type. If the guard fails, the runtime un-caches (deoptimizes) the call site, and the next invocation falls back to a virtual table lookup.
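Conceptually, the guarded bi-morphic call site behaves like the hand-written sketch below. The real guard is emitted by the JIT in machine code and checks the object's class word; the getClass() comparisons here are only an illustration of the idea:

```java
static int guardedDispatch(Beast beast) {
    // Guard: compare the receiver's actual class against the two cached types.
    if (beast.getClass() == Centaur.class) {
        return ((Centaur) beast).showSomeAwesomeness();    // cached target #1
    } else if (beast.getClass() == Polyphemus.class) {
        return ((Polyphemus) beast).showSomeAwesomeness(); // cached target #2
    }
    // Guard failed: the JIT deoptimizes the call site; subsequent calls
    // fall back to a full virtual table lookup.
    return beast.showSomeAwesomeness();
}
```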
This is only a small glimpse of the many profile-guided optimizations HotSpot VMs perform at runtime; others include escape analysis, inlining, lock elision, loop unrolling, etc. This kind of runtime self-awareness is not possible for statically compiled language runtimes, and it is one of the strongest arguments in favor of Java and other JVM languages (Kotlin, Scala, Groovy ... many others).
---------------------
Regarding the beasts used in this code, here is a portrait of Polyphemus that I colored myself (I only colored it, I did not draw it). More details about Polyphemus can be found here.