; last updated - 13 minutes read

It's funny how this article came into being. I only wanted to write about such a simple thing as how CPU caches influence Java programs. But then, I couldn't help but notice the topic isn't quite as simple. Simple to me, who has written a lot of assembler code (granted: decades ago). But many Java developers - actually, most Java programmers I know - don't know much about things like CPUs, registers, let alone caches and out-of-order execution, which play a crucial role in my article. So I decided to write an article explaining assembler programming to Java programmers. But the most simple approach to Java programmers is the HotSpot compiler, which in turn deals a lot with Java bytecode. That became article number three, the article you're reading.

So I ended up with three articles. You're lucky: three article for the price of one!

Four articles, actually. I started to investigate how to write bytecode and wrote an article about it. Another interesting topic is what the JVM does to optimize your code on the fly, and how you can watch it. But this article has yet to be written.

Today I'd like to start the series with a short and cursory instruction to Java bytecode. I'll tell you what bytecode is, how to see it in your editor and how to read it. But it's not a textbook: I'll try to bring you to speed fast, which means the article is exhausting, not exhaustive. You can read the textbooks later. This article is about getting the gist of it.

What happens when Java source code is compiled?

Chances are you already know this bit: The Java compiler compiles the human-readable source code - the stuff who've written - to machine-readable byte code. This bytecode has been designed without a particular hardware in mind. Instead, it runs on a virtual machine. In other words, it runs on a computer that doesn't exist as hardware but has been implemented in software.

In theory, it's possible to create a CPU that executes Java bytecode natively. Such CPU have been produced in the past; by the look of it, a few of them sell even today (see Wikipedia). An interesting example is Jazelle, which was part of many ARM processors. However, implementing Java in Hardware never became a major commercial success. The dominance of Windows and Unix operating system prevented this. Java CPUs usually were secondary co-processors, adding to the cost of the hardware, but adding little value to most users. Plus, Java bytecode seems to be surprisingly complicated. As far as I know, no Java CPU ever implemented the entire instruction set.

A short introduction to Java byte code

Java byte code is an intermediate language between machine code and Java. It follows the stack-oriented paradigm, which make is particularly easy to implement a virtual machine. Did you ever own an HP calculator using reverse polish notation? Then you know the idea. Forth is a programming language that follows the same idea. Instead of writing

println(3+4);

you write

3 4 + println

Intuitively, the idea may be clear: there's a "3", and a "4". The "plus" operator works on the two numbers before it. The result is "7", which is fed into the last term println, which consumes the "7" and prints it on the screen.

Every term is an instruction. Writing every instruction on a line of its own exhibits the structure of the program more clearly:

3 // push 3 on the stack 4 // push 4 on the stack + // consume 3 and 4 and push the result on the stack println // consume the result and print it

As it happens, that's almost Java bytecode. Simplifying things a bit, the corresponding Java bytecode looks like so:

iconst_3 iconst_4 iadd println // push the method pointer on the stack invokestatic // consume the method on the stack and invoke it

Introducing stacks

The idea is to use the data structure CPUs support best: stacks. Every CPU needs a stack in order to support function calls. Did you ever wonder how a function call works? Your program is executed until a function is called. The program execution continues with the code of the function. After finishing the function, the program continues where it was interrupted by the function call.

That sounds simple, but it isn't: How does the program know where it came from?

In fact, early CPUs didn't know the concept of a function. It took engineers a while to figure out how to implement subroutines. They tried various approaches, including self-modifying code. In retrospective, the idea is simple: before heading off to the function put the current line of your code on a stack. When it's time to return from the function, the stack tells you where to continue.

The nice thing about this approach is functions can call functions themselves. The last-in-first-out stacks make it possible.

As it turns out, stacks are incredibly versatile. Most programming language store local variables on the subroutine stack. Forth and Java bytecode go one step further by storing everything on the stack.

I'd like to illustrate the idea with a step-by-step walkthrough. Watch the stack grow and shrink during the execution of 3 4 + println:

The diagrams shows how the stack (blue) changes after executing each statement of the program.

  • We start with an empty stack.
  • iconst_3 puts the integer value 3 on the first free slot of the stack.
  • The next instruction is iconst_4. It puts a 4 on the first free slot of the stack. That's the slot above the 3. [1]
  • The add instruction take the two upper-most values from the stack and adds them. After that, it puts the result on the stack again.
  • The println() method takes the upper-most value from the stack and prints it to the system console. After that, the stack is empty again.
    • That's the general idea of a stack machine: every instruction either puts something on a stack or removes something from the stack. So the instruction set is extraordinary simple: every instruction takes zero or one parameters. Everything else is on the stack. Even better: everything you need is on top of the stack. You never have to reach three or four levels down. The top-most stack element is everything you ever need.

      These traits make stack machines very attractive to compiler designers. A simple stack machine is implemented quickly. Plus, the stack approach makes it simple to implement an optimizer (i.e. a program that analyzes your code and tries to convert it into a more efficient version).

      There's another nice property of stack machines. You don't have to care about operator precedence. Consider the statement

      println(3+4*5);

      Is it equivalent to println(3+(4*5)); or println((3+4)*5);? It requires some ingenuity to implement operator precedence properly. It's not really difficult, but stack machine avoid this problem altogether. To modify the precedence on a stack machine, you have to change the order the values are pushed onto the stack:

      So far, we followed a rather theoretical approach. It's time to examine a real-world example. You'll see a small Java program and the bytecode it compiles to.

      Watching byte codes in your application

      Your Java applications are compiled to byte code. In other words, the .class files on your hard disks are made up of byte code (plus some meta data). To read the byte code, you have (at least) two options:

      • Use the Windows explorer to browse to the class file. Drag it and drop it into an Eclipse editor window.

      • Use the javap program to disassemble the class. Open the console, cd to the folder of the class and enter javap -c CacheLine.class.

      Here's the Java method to the byte code on the right hand side:

      private static void loop1() { int length = array.length; for (int i = 0; i < length; i++) array[i] --; }

      I guess you can read the bytecode with a little help by now. Like I said above, I don't want to write an exhaustive course on Java bytecode. I'd rather give you a quick introduction that may be too fast at parts, but gives you an impression what it's all about, so you can decide yourself whether it's worth to dig more deeply or not. Prepare for a tour de force!

      The numbers preceding the bytecode instructions is the position of the instruction in bytes. The first line is the zeroth byte within the method, the next line starts at the third byte, the third line starts at the fourth byte. In other words: the first instruction takes three bytes, the second instruction uses only one byte.

      • The first line puts the reference to a static array on the stack.

      • The next line remove the reference from the stack again and replaces it with the length of the array.

      • The next instruction, istore, store the top-most element of the stack in the zeroth variable.

      That's an important difference between bytecode and Java source code: bytecode doesn't need a name for a local variable. It simply calls them by number - much the way ancient Romans called their children. Ottavio is still a popular name in Italy. Originally, it simply means "son number eight".

      Funny thing is class names, method names and package names are still there. Most of the time developers compile their libraries with the standard settings of the Java compiler, so the class files contain a variable table which allows the debugger to convert numbers back to names. However, every once in a while you stumble over a library without debug information. Debugging such a library is a pain (which is the reason why it's done so rarely, despite the performance penalty). Every variable name is converted to something like "arg_0" and "arg_1".

      The next two statements implement the Java assignment int i=0:

      • We've already encountered iconst_0. This instruction pushes a zero on the stack. iconst is a very short opcode[2], consuming merely a byte. There are only very few one-byte iconst opcodes: -1, 0, 1, 2, 3, 4 and 5. Larger numbers are pushed on the stack by bipush x, x being a 16-byte short signed integer. Even larger numbers have to be added to the constant pool of the class. Once they're there, they can be pushed on the stack with ldc index_within_constant_pool. This trick allows the JVM to generate very compact byte code: Apart from the extra memory needed to allocate the constant pool, ldc requires only two or three bytes.
      • istore_1 removes the zero from the stack and stores it in variable #1. This would be our index i.

      And then there's a - GOTO?

      Yes. It's true they've taught us to never, never use a GOTO statement. But that applies to higher languages. Byte code certainly isn't a higher language. There's no such thing as a while or for loop. In low-level languages these loops have to be emulated by GOTO statements and conditional jumps. That's precisely what's happening here. for loop can be executed zero times, so we start the loop by checking the terminating condition. The check happens to be at the end of the methods, bytes 22-24:

      • iload_1 pushes variable #1 on the stack. That would be our index i.
      • iload_0 pushes variable #0 on the stack. That would be the length of the array.
      • if_icmplt 10 is a comparison, merged with a conditional branch. The two topmost values are pulled from the stack, compared, and depending on the result a the program continue at a different line. In the case, the program jumps back to byte 10 if i is less than length.

      Most branches deal with integers, so it makes sense to fuse the conditional branch with the compare instruction. Longs and floating point values are treated with two instructions: cmp, which compares the two topmost values and pushes and integer on the stack, followed by iflt, which pulls the integer from the stack and performs the conditional jump if the integer is smaller than zero. There's also ifeq (jump if equal, i.e. the topmost stack element is zero), ifge (jump if greater or equal), ifgt (greater than), ifle (less or equal) and ifne (not equal).

      If suppose by now you've got the gist of it. Let's browse through the remaining instructions quickly:

      • Byte 10: getstatic pushes the pointer to the array on the stack.
      • iload_1 pushes the content of the index variable on the stack.
      • We need both variables two times, so we use dup2 copies the two topmost stack elements.
      • iaload takes the array pointer and the array index from the stack, reads the corresponding array element (e.g. array[i]) and pushes its value on the stack.
      • iconst_1, isub decrement the topmost stack element and puts the decremented number on the stack.
      • iastore pulls the pointer to the array, the index in the array and the values from the stack and store the value into the corresponding array element (i.e. array[i]). That's why we duplicated the array pointer and the array index at byte #14.
      • The last instruction, iinc 1 1, increments the variable #1 by 1. In other words: it's performs an i++.

      That's all I'd like to tell you about byte codes today. For those among you who want to dig deeper: Wikipedia is a good starting point. Read more details about byte code on Wikipedia. The complete list of byte codes is here.

      Wrapping it up and Sneak Preview

      So far we've learned who to disassemble a Java class. The next article of this series does it the other way round: we'll use the ASM library to write Java bytecode. After these preparations, we'll dive even deeper: how do CPUs work? How does Java translate to machine code? What is a cache? The final article is going to close the circle: I'll show you that the effect of a cache sometimes shows up in real-world programs - even if they are written in such a high-level language like Java that doesn't compile to machine code directly.

      Stay tuned!


      Dig deeper

      Wikipedia on Java bytecode

      complete list of byte codes.

      Andrey Loskutov's Byte Code Outline Eclipse plugin

      A Java Programmer’s Guide to Byte Code

      A quick guide on writing byte code with ASM

      A Java Programmer's Guide to Assembler Language

      A Child’s Garden of Cache Effects

      JVM internal principles (6) (a very detailed introduction to Java byte code. I you don't speak Chinese (like me): Google Translate does an excellent job).


      1. Due to historical reasons, most people let their stacks grow downwards.↩
      2. opcode, short for "operation code", is another word for instruction, chiefly used by assembler programmers.↩

Comments