The Java Instruction Set, Assemblers, and Disassemblers

By 01

Sunday, October 05, 2008
Note: this is another article that I wrote a while ago using a Java Disassembler that no longer seems to be available--d-Java.  There are numerous Java Disassemblers on the Internet.  Two more that could be used are mentioned in the reference section: Jasper & Kimera.  They can produce the Jasmin syntax that is needed for this article.
 
I spend most of my time working with Java and JVMs--performance tuning, troubleshooting, etc.  What many Java Developers don't realize--and may not want to realize--is that a JVM, any JVM, from any vendor, implements an Instruction Set defined by Sun.
 
Now, if you're not familiar with what an Instruction Set is, I'll define it as the most basic operations that a Microprocessor can perform.  A JVM, being a "Virtual Machine", implements an Instruction Set that doesn't have a physical manifestation; rather, it is an abstraction that defines a large chunk of what the Java Virtual Machine is.
Most Instruction Sets today operate on a set of on-chip memory units called registers.  An older way of doing things, which Sun revived in the nineties with Java, is the stack-based Instruction Set.  This basically means that instead of moving values in and out of registers and manipulating these values, the instructions push operands onto a stack, perform an operation on the top values of the stack, then place the result of the operation back onto the stack.  This is not the most efficient way of doing things, but it does make it relatively easy to follow along with what each instruction is doing--for example, there is no pipelining or out-of-order execution that can happen here.  Though, I think exploration of optimizations at the bytecode level has been left alone in lieu of what the JIT compiler can do at the machine instruction level--another Instruction Set, by the way.
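To make the push/operate/push-result cycle concrete, here is a minimal sketch in Java of a toy stack evaluator.  The class name StackSketch and its helper methods are made up for illustration; this is not how a real JVM is implemented, only a model of the idea.  It evaluates the same expression the sample program below uses, i + (i * i + k / 2).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy evaluator, for illustration only.
class StackSketch {
    // Evaluates i + (i * i + k / 2) the way a stack machine would:
    // operands are pushed, and each operation pops two values and pushes one.
    static long evaluate(long i, long k) {
        Deque<Long> stack = new ArrayDeque<>();
        stack.push(i);      // push i (kept for the final add)
        stack.push(i);      // push i
        stack.push(i);      // push i
        mul(stack);         // i * i
        stack.push(k);      // push k
        stack.push(2L);     // push 2
        div(stack);         // k / 2
        add(stack);         // i*i + k/2
        add(stack);         // i + (i*i + k/2)
        return stack.pop(); // the single remaining value is the result
    }

    // Each helper pops the right operand first, then the left one.
    static void add(Deque<Long> s) { long b = s.pop(), a = s.pop(); s.push(a + b); }
    static void mul(Deque<Long> s) { long b = s.pop(), a = s.pop(); s.push(a * b); }
    static void div(Deque<Long> s) { long b = s.pop(), a = s.pop(); s.push(a / b); }

    public static void main(String[] args) {
        System.out.println(evaluate(3, 10)); // prints 17
    }
}
```

Notice that no instruction names a register or a variable for its inputs; the position on the stack is all that matters.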
 
Now, I'll be completely honest, I'm on the outside of a murky fish bowl, looking in--there are assumptions I'm making here that may or may not be correct.  I look at starting this blog as an opportunity to test some of these assumptions.  Perhaps, someone who knows a lot more about the subject than I do will read it and point out any errors in my previous description.
I took a Compiler Design course as a part of my Masters Program at Wash U.  All of the lab exercises were done in Java.  The final project involved creating a compiler that could take a substantial subset of the Java language and convert it into corresponding Java Assembly instructions.  The tools I'm using here are the ones that I used in that class.
What we are going to do here is take a simple Java program:
 
  • quickly walk through its logic
  • compile it
  • disassemble it
  • walk through the assembly logic
  • make a minor change
  • assemble the modified class
  • and run the resulting class file
 
Here is the Java code:
public class Test
{
  public static void main(String args[])
  {
    long i = 1;
    long j = 0;
    long k = 0;
    long t1 = System.currentTimeMillis();
    for(i = 0; i < 1000000; i = i + 1)
    {
      j = i + (i * i + k/2);
    }
    long t2 = System.currentTimeMillis();
    System.out.println(t2 - t1);
  }
}
This code needs to be in a file called Test.java.
I'm using Sun's Java 1.5.0_08 to do this example on Windows XP; I'm also using Cygwin, if anyone cares.
This code can be compiled with the following command:
javac Test.java
I use  d-Java to disassemble the Java Class file.  The following command will disassemble the Test class:
 d-Java.exe -o jasmin Test.class > Test.j
d-Java.exe writes output to standard out.  The '-o' option requests that the output be specified in Jasmin format.  Jasmin is a popular Open Source Java Assembler.  Sun has not released a standard format for Java Assembly instructions.  Somewhere in the past decade of Java's existence, Jasmin became a de facto standard for Java Assembly.  It's also what I used in the Compiler Design course I mentioned.  So, I am using it here.
The file that was produced should look like the following:
;
; Output created by D-Java (mailto:umsilve1@cc.umanitoba.ca)
;
;Classfile version:
;    Major: 49
;    Minor: 0
.source Test.java
.class  public synchronized Test
.super  java/lang/Object
; >> METHOD 1 <<
.method public <init>()V
    .limit stack 1
    .limit locals 1
.line 1
    aload_0
    invokenonvirtual java/lang/Object/<init>()V
    return
.end method
; >> METHOD 2 <<
.method public static main([Ljava/lang/String;)V
    .limit stack 8
    .limit locals 11
.line 5
    lconst_1
    lstore_1
.line 6
    lconst_0
    lstore_3
.line 7
    lconst_0
    lstore 5
.line 8
    invokestatic java/lang/System/currentTimeMillis()J
    lstore 7
.line 9
    lconst_0
    lstore_1
Label1:
    lload_1
    ldc2_w 1000000
    lcmp
    ifge Label2
.line 11
    lload_1
    lload_1
    lload_1
    lmul
    lload 5
    ldc2_w 2
    ldiv
    ladd
    ladd
    lstore_3
.line 9
    lload_1
    lconst_1
    ladd
    lstore_1
    goto Label1
.line 13
Label2:
    invokestatic java/lang/System/currentTimeMillis()J
    lstore 9
.line 14
    getstatic java/lang/System/out Ljava/io/PrintStream;
    lload 9
    lload 7
    lsub
    invokevirtual java/io/PrintStream/println(J)V
.line 15
    return
.end method
Luckily, d-Java provides the original line numbers, which will make explaining all of this much easier.
 
So, what does all of this mean?
 
First, it prints out, in comments, the Class version Major & Minor numbers.
;Classfile version:
;    Major: 49
;    Minor: 0
 
 
Sun changes this every now and then, usually with a major Java release.  Major version 49 corresponds to Java 5.  The numbering started at 45 with the first JDK, so there have been far fewer than 48 revisions of the Class file format.
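The version numbers sit right at the front of every class file: a four-byte magic number (0xCAFEBABE), then the two-byte minor version, then the two-byte major version.  Here is a minimal sketch that decodes them.  The class name ClassVersionSketch and the hand-written header bytes are assumptions for illustration; a real tool would read the first eight bytes of Test.class instead.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, for illustration only.
class ClassVersionSketch {
    // Returns {major, minor} decoded from the first eight bytes of a class file.
    static int[] readVersion(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header);   // big-endian by default, like the class file format
        if (buf.getInt() != 0xCAFEBABE) {           // every class file starts with this magic number
            throw new IllegalArgumentException("not a class file");
        }
        int minor = buf.getShort() & 0xFFFF;        // minor_version is stored first
        int major = buf.getShort() & 0xFFFF;        // major_version follows it
        return new int[] { major, minor };
    }

    public static void main(String[] args) {
        // Hand-written header matching the listing above: minor 0, major 49.
        byte[] header = { (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 49 };
        int[] version = readVersion(header);
        System.out.println("Major: " + version[0] + ", Minor: " + version[1]); // prints Major: 49, Minor: 0
    }
}
```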
 
Next, a couple of directives are given that tell us a little about what this class does (and where it came from):
.source Test.java
.class  public synchronized Test
.super  java/lang/Object
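The first line tells us where the class came from.  The second corresponds to the "public class Test" line in the original Test.java file.  The "synchronized" keyword is not in the source; d-Java appears to print it because the class flag ACC_SUPER shares the bit value 0x0020 with the method flag ACC_SYNCHRONIZED.  The third declares that the Super Class of Test is java.lang.Object.  This is understood by default in the Java Language, but it must be explicitly specified in the Java Assembly file.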
 
Next, we have the very first method defined in the assembly file:
; >> METHOD 1 <<
.method public <init>()V
    .limit stack 1
    .limit locals 1
.line 1
    aload_0
    invokenonvirtual java/lang/Object/<init>()V
    return
.end method
Note that every method defined in a Jasmin Java assembly file must begin with ".method ..." and end with ".end method".  This method is the default constructor.  In Java code, the default constructor doesn't have to have the call to its super class default constructor explicitly listed, but in Java Assembly, it is required.  Even if you don't specify a default constructor in a Java class, the compiler creates one for you.  It looks more-or-less like the one listed above.  "<init>" is the name every constructor gets in the class file--this will always be the case.
 
The full first line of this method is ".method public <init>()V".  So, this is a public method whose name is <init>.  The <init> method takes zero arguments, note the "()", and has a return type of void, note the trailing V.
 
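Types are represented in the following way:
 
  • B: byte
  • C: char
  • D: double
  • F: float
  • I: int
  • J: long
  • S: short
  • Z: boolean
  • V: void (return types only)
  • L<class name>; : an object reference, for example Ljava/lang/String;
  • [<type> : an array of the given type, for example [I for int[]
 
Remember, there is a huge difference between the primitive type int and the Integer class: int is written I, while Integer is written Ljava/lang/Integer;.  For a method defined with the following signature: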
public void method1(int i, int j, byte k, String l);
 
the following Assembly directive would mark the beginning of the method:
.method public method1(IIBLjava/lang/String;)V
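If you want to check a descriptor without writing it out by hand, the JDK can produce one.  The sketch below assumes Java 7 or later (newer than the 1.5 used in this article), because it relies on java.lang.invoke.MethodType; the class name DescriptorSketch and the helper descriptorOf are made up for illustration.

```java
import java.lang.invoke.MethodType;

// Hypothetical helper, for illustration only.  Requires Java 7 or later.
class DescriptorSketch {
    // Builds the descriptor string for a method with the given return and parameter types.
    static String descriptorOf(Class<?> returnType, Class<?>... parameterTypes) {
        return MethodType.methodType(returnType, parameterTypes).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // public void method1(int i, int j, byte k, String l)
        System.out.println(descriptorOf(void.class, int.class, int.class, byte.class, String.class));
        // prints (IIBLjava/lang/String;)V

        // public static void main(String args[])
        System.out.println(descriptorOf(void.class, String[].class));
        // prints ([Ljava/lang/String;)V

        // public static long currentTimeMillis()
        System.out.println(descriptorOf(long.class));
        // prints ()J
    }
}
```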
The next two lines define some important traits of the method.
 
    .limit stack 1
    .limit locals 1
The first line defines the depth of the Operand Stack.  Every method has an operand stack associated with it.  This is the stack that all Assembly Language instructions operate on.  The Java Compiler calculates the maximum depth needed for this stack based upon the instructions present in the method body.  The depth is counted in 32-bit units, so a long or a double counts as two.  If the code could ever push the stack deeper than this number, the class is invalid; the bytecode verifier is meant to catch this when the class is loaded, normally by rejecting the class with a java.lang.VerifyError.  This is a part of Java's Security model.  The second line defines how much room should be made in this method's stack frame for local variables.  In this case, there is one local variable--the "this" reference.  "this" is the first local variable in non-static methods.  This brings us to the first set of instructions that actually do anything.
.line 1
    aload_0
    invokenonvirtual java/lang/Object/<init>()V
    return
 
This is the bytecode added by the compiler to call the Super Class constructor.  The first instruction, "aload_0", pushes the "this" reference onto the stack.  The second instruction calls the Default Constructor for java.lang.Object.  The "invokenonvirtual" instruction (an older name for what the JVM specification now calls "invokespecial") calls exactly the method named, without looking for an override at run time--constructors are always invoked this way.  The final instruction, "return", causes the method to return--which entails copying the return value (if any), dismantling the Stack Frame, and preparing to execute the instruction following the one that called this method.
 
So, that takes care of the Default Constructor.  We went on numerous tangents that will make the next method easier.
 
Next, we have the declaration of the main() method:
; >> METHOD 2 <<
.method public static main([Ljava/lang/String;)V
    .limit stack 8
    .limit locals 11
From what we have previously learned, this is a public, static method called main.  It takes an array of Strings as an argument--we know that it is an array because of the leading "[" before "Ljava/lang/String;".  The main() method has a return type of void.  The Operand Stack has a maximum depth of eight.  It appears there are eleven local variables, but that isn't correct; there are only six.  Because main() is static, there is no "this" reference; slot 0 holds the args array reference instead, and it takes one slot.  The other five variables are 64-bit longs, and each takes two slots.  So, the local variables have the following positions in the stack frame's local variable slots--note that a slot is sized to hold a 32-bit value, and that this numbering is fixed by the Class file format, so it is the same on 32-bit and 64-bit JVMs:
 
  • args: slot 0
  • i: slots 1-2
  • j: slots 3-4
  • k: slots 5-6
  • t1: slots 7-8
  • t2: slots 9-10
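The slot numbering can be reproduced with a few lines of Java.  This is only a sketch of the rule (long and double take two slots, everything else takes one); the class name SlotLayoutSketch and the "(total)" entry are made up for illustration.  For a static method like main(), slot 0 goes to the first argument, here the args array.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
class SlotLayoutSketch {
    // Assigns each variable its first slot, in declaration order.
    // long and double occupy two slots; every other type occupies one.
    static Map<String, Integer> layout(String[] names, Class<?>[] types) {
        Map<String, Integer> slots = new LinkedHashMap<>();
        int next = 0;
        for (int n = 0; n < names.length; n++) {
            slots.put(names[n], next);
            next += (types[n] == long.class || types[n] == double.class) ? 2 : 1;
        }
        slots.put("(total)", next); // should match the ".limit locals" value
        return slots;
    }

    public static void main(String[] args) {
        String[] names = { "args", "i", "j", "k", "t1", "t2" };
        Class<?>[] types = { String[].class, long.class, long.class, long.class, long.class, long.class };
        System.out.println(layout(names, types));
        // prints {args=0, i=1, j=3, k=5, t1=7, t2=9, (total)=11}
    }
}
```

The starting slots match the operands of the lstore and lload instructions in the listing, and the total matches ".limit locals 11".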
 
 
The first thing that happens in the main method is the variables i, j, and k are initialized to one, zero, and zero, respectively.  That happens in the following assembly instructions:
.line 5
    lconst_1
    lstore_1
.line 6
    lconst_0
    lstore_3
.line 7
    lconst_0
    lstore 5
The lconst_1 instruction pushes (long)1 onto the stack.  The lstore_1 instruction pops the (long)1 off the stack and places it in the local variable 1 slot.  The same thing happens for j and k--local variable 3 and local variable 5, respectively.  The slot numbers skip by two because each long occupies two slots.
 
Next, the t1 variable is initialized to the current time in milliseconds, counted from midnight, January 1, 1970 UTC:
.line 8
    invokestatic java/lang/System/currentTimeMillis()J
    lstore 7
Since java.lang.System.currentTimeMillis() is a static method, we use the assembly instruction "invokestatic" to invoke it.  This puts a long value on the stack, which is the number of milliseconds since midnight, January 1, 1970 UTC.  The next line pops the value off the stack and stores it in local variable slot 7.
 
This brings us to what is probably the most complex part of this program from an assembly language perspective, the for-loop.  I have added comments in line to make this easier.
.line 9
    lconst_0                ;Push a (long)0 on the stack
    lstore_1                 ;Pop the zero and assign it to i, in the local variable 1 slot.
Label1:
    lload_1                  ;Push the value stored in i on the stack.
    ldc2_w 1000000      ;Push the long value 1000000 on the stack
    lcmp                      ;Compare the two longs: push -1, 0, or 1 if i is less than, equal to, or greater than 1000000.
    ifge Label2              ;if (i >= 1000000) then goto Label2.  This breaks us out of the loop eventually.
.line 11
    lload_1                  ;Push the value of i on the stack
    lload_1                  ;Push the value of i on the stack
    lload_1                  ;Push the value of i on the stack
    lmul                      ;Multiply the top two i's on the stack.  Push the result on the stack.
    lload 5                   ;Push the value of k on the stack.
    ldc2_w 2                ;Push a (long)2 on the stack.
    ldiv                       ;Divide k by 2.  Push the result on the stack.
    ladd                       ;Add the top two longs on the stack (i*i + k/2).  Put the result on the stack.
    ladd                       ;Add the top two longs on the stack (the original i and the sum we just computed).
    lstore_3                  ;The result of i + (i*i + k/2) is popped off the stack and stored in j.
.line 9
    lload_1                  ;Push the value of i on the stack.
    lconst_1                 ;Push a (long)1 on the stack.
    ladd                      ;Add these two values together. Push the sum on the stack.
    lstore_1                 ;Assign this value to i.
    goto Label1           ;Jump to label1.
.line 13
After we break out of the loop, we are at Label2, which is essentially the exact same thing that we saw prior to the loop.
Label2:
    invokestatic java/lang/System/currentTimeMillis()J
    lstore 9
This assigns the current time in milliseconds to the variable t2.  Of course, the variable name, t2, has long since been lost; the value is actually being stored in a local variable slot.  But, the result is the same.
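Looking back at the loop body (.line 11), we can check the ".limit stack 8" directive by hand.  The sketch below tracks the stack depth in 32-bit units, where each long counts as two; the class name StackDepthSketch and the array of per-instruction changes are written out by me for illustration, they are not produced by any tool.

```java
// Hypothetical helper, for illustration only.
class StackDepthSketch {
    // Walks a list of per-instruction stack changes and returns the deepest point reached.
    static int maxDepth(int[] deltas) {
        int depth = 0;
        int max = 0;
        for (int d : deltas) {
            depth += d;
            max = Math.max(max, depth);
        }
        return max;
    }

    public static void main(String[] args) {
        // Stack effect of each instruction in j = i + (i * i + k/2), in 32-bit units.
        int[] loopBody = {
            +2, // lload_1   push i
            +2, // lload_1   push i
            +2, // lload_1   push i
            -2, // lmul      pop two longs, push one
            +2, // lload 5   push k
            +2, // ldc2_w 2  push (long)2
            -2, // ldiv      pop two longs, push one
            -2, // ladd      pop two longs, push one
            -2, // ladd      pop two longs, push one
            -2  // lstore_3  pop the result into j
        };
        System.out.println(maxDepth(loopBody)); // prints 8
    }
}
```

The deepest point comes right after "ldc2_w 2", when four longs are on the stack at once.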
 
This brings us to the println statement:
getstatic java/lang/System/out Ljava/io/PrintStream;
lload 9
lload 7
lsub
invokevirtual java/io/PrintStream/println(J)V
The first line pushes the static field System.out, a java.io.PrintStream object that represents standard out, onto the stack.  The second pushes the value of t2 on the stack.  The third line pushes the value of t1 on the stack.  The fourth line subtracts these two values to produce the elapsed number of milliseconds it took the for-loop to execute--the result is pushed on the stack.  We now have the object we are invoking java.io.PrintStream.println() on and the method's one argument on the stack.  We can now invoke the virtual method java.io.PrintStream.println() via the instruction "invokevirtual".
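The same ingredients (a receiver object, one long argument, and the descriptor "(J)V") show up if you look the method up at run time.  This is a rough sketch, assuming Java 7 or later for java.lang.invoke; the class name InvokeVirtualSketch is made up, and it prints into an in-memory buffer rather than System.out so the result can be inspected.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical helper, for illustration only.  Requires Java 7 or later.
class InvokeVirtualSketch {
    // Looks up PrintStream.println(long) and calls it with a receiver and one argument.
    static String printElapsed(long t1, long t2) {
        try {
            // (J)V: takes one long, returns void
            MethodType type = MethodType.methodType(void.class, long.class);
            MethodHandle println =
                MethodHandles.lookup().findVirtual(PrintStream.class, "println", type);

            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buffer);  // stands in for System.out
            println.invoke(out, t2 - t1);               // receiver first, then the argument
            out.flush();
            return type.toMethodDescriptorString() + " -> " + buffer.toString().trim();
        } catch (Throwable e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(printElapsed(2, 7)); // prints (J)V -> 5
    }
}
```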
 
The final piece is the return statement.  This does the same things as described for the Default Constructor.
 
Now, as a final exercise, let's change the program to print the value of i instead of the value of t2-t1.  We can do this by replacing
lload 9
lload 7
lsub
with
lload_1
We could just as easily use "lload 1".
 
Assemble the new class file with the following command:
java -jar jasmin.jar Test.j
Run the program:
java Test
It should print out the value: 1000000
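For comparison, here is roughly what the modified class does, written back as Java source.  This is my own reconstruction, not decompiler output; the class name ModifiedTestSketch is made up, and it assumes Java 7 or later for Long.compare, which I use to mimic the lcmp/ifge pair.  The timing calls are left out since the modified program no longer prints them.

```java
// Hypothetical reconstruction, for illustration only.  Requires Java 7 or later.
class ModifiedTestSketch {
    // Runs the same loop as Test.main() and returns the final value of i.
    static long finalValueOfI() {
        long i = 1;
        long j = 0;
        long k = 0;
        i = 0;                                      // lconst_0, lstore_1
        while (true) {                              // Label1:
            int cmp = Long.compare(i, 1000000L);    // lload_1, ldc2_w 1000000, lcmp
            if (cmp >= 0) {                         // ifge Label2
                break;
            }
            j = i + (i * i + k / 2);                // the .line 11 block
            i = i + 1;                              // lload_1, lconst_1, ladd, lstore_1
        }                                           // goto Label1
        return i;                                   // Label2: the value now handed to println
    }

    public static void main(String[] args) {
        System.out.println(finalValueOfI()); // prints 1000000
    }
}
```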
 
I have found this kind of thing to be useful when Java decompilers fail.  Java Compilers tend to put in extra pieces of code when things like nested try-catch blocks and synchronization are used, and decompilers can stumble over them.  Working at the assembly level can help get around such problems, but it can be tedious and slow.
 
References:
 
[1] Jasmin
[2] d-Java
[3] Jasper
[4] Kimera
[5] JVM Spec & Instruction Set
[6] Sun's Java Homepage
[7] jad 

 

©2008 www.thinkmiddleware.com

All copyrights & trademarks belong to their respective owners.

The comments and opinions herein are that of the author.

Please direct all comments to 01.

While the information presented on this web site is believed to be correct, the author is not responsible for any damage, loss of data, or other issues that may arise from using the information posted here.

Last Modified: Sunday, 09-Nov-2008 10:48:37 MST