- 23 minutes read

Imagine a programming language without variables. A language without function parameters and return values. A language without floating point numbers, maybe even without multiplication and division. A language without type casting. Without types, actually. Not to mention classes or objects. Join me in a journey to the world of assembler languages.

Granted, modern CPU have evolved a lot. Programming in an assembler language feels almost like programming in a high-level language, compared to the old days. Nowadays floating point numbers, multiplications and divisions are part of the standard toolkit. Thing is, you don't need them. The 6502 instruction set lists a remarkably small instruction set of 54 mnemonics, many of them redundant, and each instruction dealing with 8 bits. That's all you need to write any program you want. The CPU doesn't need to implement multiplications in hardware because you can implement multiplication in software. It's just a bit slower, that's all.

Who cares?

Did you expect to learn about Assembler in a Java blog? Java programmers hardly ever care about machine code. It's there, that's all they have to know. But knowing how your program works under that hood - well, I don't know if it makes you a better programmer, but it sure gives you a broader perspective, which may help you every once in a while. Besides, it's fun. And the nice thing about Java is machine code is only a mouse click away. You can see it. That's what we'll do in this article. And sometimes Java programmers can even see the effect of the CPU architecture.

In fact, the HotSpot compiler does such a good job we can even observe the effect of individual cache lines in a real-world Java program. That's the topic of the next article of this series.

A word on didactics (or the lack thereof)

By the way, this is not a textbook. I won't start with the basics, boring you to death along the way. I'll do it the other way round. Let's start with a Java program, learn how to display the assembler code the HotSpot compiler generates and try to decipher it piece by piece. The disadvantage of this approach is that I can't explain everything you see immediately. Maybe I'm too fast at times. Be patient if you're puzzled by mysterious Assembler instructions. At the end of the article everything will fall into place. If you prefer the bottom-up approach, Wikipedia has a good introduction for you.

How to show the assembler code of a method

To see the assembler code the JVM executes, you have to do two things: first, start the program with a couple of JVM arguments:

javaw.exe -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Adder

Second, you have to run the method of interest so often that it "runs hot". By default, the JVM doesn't bother with compiling your Java program to native machine code. In theory, native machine code runs a lot faster than interpreted byte code, but the reality looks different. Modern CPU architectures - in particular, CPU cache architectures - make running native machine code unattractive. It only pays if a method is run very often. The default setting of Oracles JVM is 10000 repetitions in server mode, give or take a few. Short methods seem to be on the "give" side. I had to run it 80.000 times to convince the JVM to unveil the Assembler code:[1]

public class Adder { public static void main(String[] args) { for (int i = 0; i < 80000;i++) { additions(); } } private static void additions() { int a = 1; int b = 2; int c = 3; a = b + c; b = a + c; c = a + b; } }

I chose 80.000 repetitions because that was the minimum iteration count showing the method reliably. I could have chosen a higher figure, but that causes the JVM to compile and print the main() method, too. Of course, the exact numbers are specific to the JVM implementation and configuration, so they may be different on your machine. There even seems to be a random element. Be that as it may, let's have a look at the code and look for the additions. Regard the central column, where the three-letter instruction keywords are. I expect to see something like ADD two or three times, or possibly an INC statement if the JVM is clever enough to figure out that a equals 1, so adding a actually is incrementing a number.

# {method} 'additions' '()V' in 'Adder' # [sp+0x20] (sp of caller) 0x00000000026e2480: sub rsp,18h 0x00000000026e2487: mov qword ptr [rsp+10h],rbp ;*synchronization entry ; - Adder::additions@-1 (line 11) 0x00000000026e248c: add rsp,10h 0x00000000026e2490: pop rbp 0x00000000026e2491: test dword ptr [120000h],eax ; {poll_return} 0x00000000026e2497: ret 0x00000000026e2498: hlt 0x00000000026e2499: hlt 0x00000000026e249a: hlt 0x00000000026e249b: hlt 0x00000000026e249c: hlt 0x00000000026e249d: hlt 0x00000000026e249e: hlt 0x00000000026e249f: hlt [Exception Handler] [Stub Code] 0x00000000026e24a0: jmp 26df220h ; {no_reloc} [Deopt Handler Code] 0x00000000026e24a5: call 26e24aah 0x00000000026e24aa: sub qword ptr [rsp],5h 0x00000000026e24af: jmp 26b9000h ; {runtime_call} 0x00000000026e24b4: hlt 0x00000000026e24b5: hlt 0x00000000026e24b6: hlt 0x00000000026e24b7: hlt

Ouch - whatever that may be, this code hasn't much to do with our program. Neither of the constants 1, 2 or 3 appears in the assembler code. There's an ADD 10h, but we don't add 16 in the Java code, so this has to be something else. My bet is the JVM notices our program does effectively nothing at all, and so it optimizes our additions away. It can do so safely, because the sums are never used. The remaining lines are rather mysterious. We'll come to that later. At the moment, accept them as boilerplate code and ignore them. First let's try something different. Can we talk the JVM into performing the additions by adding global variables instead of local ones?

public class Adder { static int a = 1; static int b = 2; static int c = 3; public static void main(String[] args) { for (int i = 0; i < 80000; i++) { additions(); } } private static void additions() { a = b + 1; b = c + 2; c = a + 3; } }

This time the resulting code at least resembles our original Java code. We still observe the magic of the optimizer, but this time the code is similar enough to discuss it:

# {method} 'additions' '()V' in 'Adder' # [sp+0x20] (sp of caller) 0x0000000002602480: sub rsp,18h 0x0000000002602487: mov qword ptr [rsp+10h],rbp ;*synchronization entry ; - Adder::additions@-1 (line 13) 0x000000000260248c: mov r11d,2h 0x0000000002602492: mov r10,7d56d18b0h ; {oop(a 'java/lang/Class' = 'Adder')} 0x000000000260249c: add r11d,dword ptr [r10+78h] 0x00000000026024a0: mov r9d,dword ptr [r10+74h] ;*getstatic b ; - Adder::additions@0 (line 13) 0x00000000026024a4: mov dword ptr [r10+74h],r11d ;*putstatic b ; - Adder::additions@13 (line 14) 0x00000000026024a8: mov r11d,r9d 0x00000000026024ab: inc r11d 0x00000000026024ae: mov dword ptr [r10+70h],r11d ;*putstatic a ; - Adder::additions@5 (line 13) 0x00000000026024b2: add r9d,4h 0x00000000026024b6: mov dword ptr [r10+78h],r9d ;*putstatic c ; - Adder::additions@21 (line 15) 0x00000000026024ba: add rsp,10h 0x00000000026024be: pop rbp 0x00000000026024bf: test dword ptr [120000h],eax ; {poll_return} 0x00000000026024c5: ret 0x00000000026024c6: hlt 0x00000000026024c7: hlt 0x00000000026024c8: hlt 0x00000000026024c9: hlt 0x00000000026024ca: hlt 0x00000000026024cb: hlt 0x00000000026024cc: hlt 0x00000000026024cd: hlt 0x00000000026024ce: hlt 0x00000000026024cf: hlt 0x00000000026024d0: hlt 0x00000000026024d1: hlt 0x00000000026024d2: hlt 0x00000000026024d3: hlt 0x00000000026024d4: hlt 0x00000000026024d5: hlt 0x00000000026024d6: hlt 0x00000000026024d7: hlt 0x00000000026024d8: hlt 0x00000000026024d9: hlt 0x00000000026024da: hlt 0x00000000026024db: hlt 0x00000000026024dc: hlt 0x00000000026024dd: hlt 0x00000000026024de: hlt 0x00000000026024df: hlt [Exception Handler] [Stub Code] 0x00000000026024e0: jmp 25ff220h ; {no_reloc} [Deopt Handler Code] 0x00000000026024e5: call 26024eah 0x00000000026024ea: sub qword ptr [rsp],5h 0x00000000026024ef: jmp 25d9000h ; {runtime_call} 0x00000000026024f4: hlt 0x00000000026024f5: hlt 0x00000000026024f6: hlt 0x00000000026024f7: hlt

Before explaining the actual code, I'd like to explain the general structure of the output. It consists of four columns: "line numbers", the mnemonics of the instructions, parameters of the instructions and comments. The code also consists of several blocks, as you can see in the sketch:

I ran the test on a 64-bit JVM, so the line numbers at the beginning of each line are extremely long. In fact, they aren't really line numbers, but the address of the instruction in memory. 0x00000000026024f7 means the instruction starts at the 26024f7's byte - which is actually byte 39855351 if you prefer to calculate in the decimal system. In low level programming, counting hexadecimal is easier, so every number is given as a hexadecimal number. To avoid confusion, a "0x" prefix or the "h" postfix is added to the number.

The code snippet shows three different types of comment. Comments begin with a ";" or a "#". The comments generated by the JVM tell us which method we're looking at, and it even tells us the line number of the Java code. Other comments, such as "putstatic", refer to the Java byte code you've seen in my previous articles. The JVM also adds a couple of lines surrounded by brackets. These are also comments, explaining the role of the next lines.

There's an astonishing number of lines contain the word HLT. This instruction stops the CPU. Obviously, that's not what we want, so it's dead code which is never called. In other words, it's a waste of memory. These one-byte instructions aren't really part of the program. Instead, they are an important performance optimization: the CPU works much faster on "even" addresses than on odd addresses, so it pays to waste memory. Modern CPUs read 8 (or 16) bytes at a time[2], so "even" means "can be divided by 8".

We aren't interested in boilerplate code, so let's reprint the interesting part of the code. In Java, that's three lines:

a = b + 1; b = c + 2; c = a + 3;

They are compiled to ten lines of machine code:

mov r11d,2h ; r11d = 2; (line 2) (sic!) mov r10,7d56d18b0h ; r10 = pointer to the variable table; add r11d,dword ptr [r10+78h] ; r11d += c; (line 2) mov r9d,dword ptr [r10+74h] ; r9d = b; (line 1) mov dword ptr [r10+74h],r11d ; b = r11d; (line 2) mov r11d,r9d ; r11d = r9d;(line 1) inc r11d ; r11d ++; (line 1) mov dword ptr [r10+70h],r11d ; a = r11d; (line 1) add r9d,4h ; r9d += 4; (line 3 + line 1) (sic!) mov dword ptr [r10+78h],r9d ; c = r9d; (line 3)

The code starts with a surprise: first it executes line 2, then it executes line 1, and at last it executes a line that isn't part of the Java program at all. That's the magic of the optimizing JIT compiler.

At the beginning of each line there are three- or four-letter-words (called "mnemonics"). Almost every mnemonic is followed by one or two parameters. The mnemonics are very simple commands: ADD is an addition, SUB is a subtraction, INC increments a number and MOV reads data from on memory cell and writes it to another memory cell. In other words, it's an assignment. For instance, mov r11d,2h could be written in Java as d=2;.

The Intel x86 flavor of Assembler instructions supports two parameters at most. One of them can be a real memory cell. The other parameter is almost always a "register" - a high-performance memory cell within the CPU. With most CPUs, registers are a scarce resource. It's expensive to implement such a high-performance memory cell in silicon. The 6502 CPU knew five registers, two of which had special purposes: the 16-bit wide instruction counter and the 8-bit stack pointer. Programs had to work with remaining three registers, each of which was 8 bits wide. Even so, not every mnemonic supported every register. Early 8086 processors suffered from similar constraints. Today, the situation is more comfortable. Register width went from 8 bits over 16 and 32 bits to really versatile 64 bits. 64-bit processors have 6 16-bit registers, 8 32-bit registers and 16 64-bit registers. There are even 16 128-bit registers, 16 256-bit registers and 32 512-bit registers. However, Java only has 64-bit bit types (long, so these extra registers don't play a role in the generated code (yet?). In case you're interested, Wikipedia has a lot of details on x86 registers.

Most of the time, Java code works with the eight registers r8-r15. The disassembled JVM code also calls them r8d-r15d (probably to show that it's a decimal number, as opposed to a hexadecimal number).

The registers are used as temporary variables. Many Java byte code instructions involve two memory accesses. These byte codes are converted to two or three machine code instructions. Each instruction can access only a single memory cell, so you have to store the content of the first cell temporarily in a register. In our example, b=c+2 is compiled to three lines of assembly code:

mov r11d,2h ; r11d = 2; add r11d,dword ptr [r10+78h] ; r11d += c; (line 2) mov dword ptr [r10+74h],r11d ; b = r11d; (line 2)

The first line, mov r11,2h, moves the value "2" to register 11. Notice that the first parameter in Intel's assembler language notation is always the target, and the second is the source.

The next line, add r11d,dword ptr [r10+78h], adds a 32-bit value (a "dword") which is read from a memory cell. The pointer to the memory cell is the value of the r10 register, which has been initialized with 7d56d18b0h, plus an offset of 78h bytes. The resulting memory address is 7D56D1928h, and that in turn is our variable "c". The pattern repeats time and again in the code: variable accesses are always done using the r10 register, plus an offset that denotes the variable.

In theory, we could just as well address the variable directly:

add r11d,dword ptr [7D56D1928h] ; r11d += c; (line 2)

However, the clumsy calculation [r10+78h] is executed much more efficiently. Among other things, 78h fits into a byte. So add [r10+78h] can be expressed as a two-byte instruction: one byte for the mnemonic, one byte for the parameter. add [7D56D1928h] requires at least six bytes. In reality, CPU builders love even numbers - i.e. 2, 4 or 8 bytes. So the minimum size of add [7D56D1928h] is 9 bytes. Reading and deciphering 9 bytes takes much more time than reading and decoding a compact two-byte code. If you're a CPU, that is - human beings think differently.

By the way, a brief look in the first version of the disassembled code reveals that add [r10+78h] actually takes four bytes. Still, that's much better than nine bytes.

Excursion: why offsets have been invented

The other reason why offsets are so popular in the x86 world is history. When computers outgrew the 64 KB memory barrier that was imposed by 16-bit addressing, there were two different strategies to do so. The first was direct memory mapping. CPUs following this strategy had to implement new addressing modes using more than 16 bits. As far as I remember, many CPU manufacturers didn't jump immediately to 32 bits. 20 and 24 bits were also common. Intel decided to stick with 16 bits. They implemented an addressing mode that took an address, multiplied it by 16 and added an register to access individual bytes within the 16-byte chunk. Over time, it turned out that this annoying deviation is useful in many situations. For instance, you never know where the operating system puts your program. Atari's TOS went through the entire program and corrected every pointer, every branch, every jump and every function call before starting the program. The comment {no_reloc} in our disassembly refers to that relocation process. Relative addressing is a simple alternative to that. Load a register with the address of the first byte of the program, and you can safely use constant offsets afterwards. Like so often, Wikipedia has a nice article on position-independent code.

Local variables and method parameters

That's another reason why the stack we've seen my previous article is so useful: local variables can be accessed relatively to the stack pointer. We've seen in the previous articles of this series that local variables and parameters are put on the stack, and when the code is compiled to native machine code, the CPU stack is used to store local variables and pass parameters.

The JVM's optimizing compiler

I suppose, you can read the assembler code by now. As you can see, the static variables a, b and c are laid out linearly in memory: [r10+70h]=a, [r10+74h]=b and [r10+78h]=2. That's a common pattern which helps a lot to read the code.

However, I already mentioned the code isn't quite what we expect. What we wrote was

a = b + 1; b = c + 2; c = a + 3;

But what's executed is similar to

int temp = b; a = b; a++; b = 2 + c; c = temp + 4;

The JVM recognizes that c = a + 3 can be rewritten as c = b + 4 because a = b + 1. It also considers a=b; a++; to be more efficient than a = b + 1;. By the look of it, the Just-in-Time-compiler of Java 8 not only compiles your code, it also optimizes it. If it recognizes superfluous lines, they are omitted altogether (remember the first attempt at getting the disassembled code). It re-orders line if it seems fit, and much more. That's one of the reasons why it's not recommended to try to optimize the Java source code at a low level: probably the JVM performs the same optimization itself, and sometimes your optimization is confusing the JIT compiler, resulting in a slower program.

By the way, that's only the beginning of the optimizations performed by the JIT. Just to name a few, there's method inlining (on bytecode level), loop unrolling, lock coarsening and eliding, dead code elimination (we've already seen that!), escape analysis (to remove synchronized block if they're superfluous) and a couple of others.

Out of order execution and reordering instructions

I'm not sure why the JIT reorders the order of the instructions, but it may be because of cache latency considerations. Modern CPU don't process one instruction at a time. Instead, they implement a pipeline that consists of roughly a dozen steps (depending on the CPU model). Memory accesses are always a bottleneck. So CPU can bypass the offending instruction and work on the following instruction until the data arrives from the memory. That's Out-of-Order-Execution. As you can imagine, Out-of-Order-Execution is a highly sophisticated technique requiring a lot of silicon. Among other things, the CPU must be able to detect which instructions can safely be executed because they don't rely on the result of the stalled instruction.

Maybe - just maybe - JIT decided to rearrange the order of the instruction because it recognized a potential memory access bottleneck.

Boilerplate code

At the beginning and the end of the disassembled code there's some boilerplate code:

# {method} 'additions' '()V' in 'Adder' # [sp+0x20] (sp of caller) 0x0000000002602480: sub rsp,18h 0x0000000002602487: mov qword ptr [rsp+10h],rbp ;*synchronization entry ... 0x00000000026024be: pop rbp 0x00000000026024bf: test dword ptr [120000h],eax ; {poll_return} 0x00000000026024c5: ret

That's highly sophisticated code that enables synchronization of multiple threads. In many cases, the JVM detects there's nothing to synchronize and omits the synchronization code. In this case, we access static variables, so there's no way around adding code to synchronize multiple threads. The JVM can't know there's only one thread. I'll explain synchronization in more detail in the next article of this series. For now, suffice it to say static variables can reduce your application's performance.

The remaining boilerplate code consists of an exception handler and a deoptimizer:

[Exception Handler] [Stub Code] 0x00000000026024e0: jmp 25ff220h ; {no_reloc} [Deopt Handler Code] 0x00000000026024e5: call 26024eah 0x00000000026024ea: sub qword ptr [rsp],5h 0x00000000026024ef: jmp 25d9000h ; {runtime_call}

We know exception can't occur, but the JVM adds an exception handler nonetheless.

More interesting is the deopt handler code. A part of the compilation and optimization is speculative. It works if certain assumptions are fulfilled. If the situation changes, the JVM can revert the optimization. That's allows the JVM to optimize much more aggressive. Charles Nutter has a few slides on deoptimization (see slide 46ff).

Implementing branches and loops: if, while and for

So far, machine code was a bit tedious, but nothing completely unexpected. That changes when we start to compile an if statement or a while loop. Machine code doesn't know such things.

Instead, it knows the famous GOTO statement. Loops are code that ends with a JMP mnemonic jumping to the first line of the code.

GOTO is fine to write infinite loops, but that's not what we need. Luckily, there are also conditional branches, which are very primitive if statements. A conditional branch typically consists of a comparison instruction and the branch mnemonic itself:

0x00000000027f0f6e: inc r13d 0x00000000027f0f71: cmp r13d,r9d 0x00000000027f0f74: jl 27f0f6eh

That's the equivalent of

do { r13d++; } while(r13d < r9d);

The cmp mnemonics takes two parameters and compares them. The result of the comparison is stored in a special CPU register, the flags. There are quite a few flags that can be evaluated by conditional branches. The most important flags are

  • the zero flag, which is set if both parameters are equal.
  • the carry flag. It's a gross simplification, but in our context it indicates that the target parameter is greater than the source parameter.

The two flags are evaluated by conditional jumps like jl (jump if lower) or jne (jump if not equal).

Sometimes you can also use conditional branches without the comparison. The flags are also set by arithmetic operations, such as add, sub or inc. That's why

r13d=r9d; do { r13d--; } while(r13d>0);

is more efficent than the incrementing loop. It can be compiled to a shorthand version:

mov r13d, r9d loop_start: dec r13d jl loop_start

This is the first time I show you a symbolic constant ("loop_start") instead of the numerical memory address of the instruction. You won't ever see this in the JVM's disassembled code, but as you can imagine that symbolic constants and labels are much more useful when you write assembler code.

In the next article of this series, we will examine loops much more closely.

Cache architecture

The reason why I started to write the article is to provide you with some background to understand my next article, which shows the effect of the cache architecture in a real-world Java program.

Do you remember the HLT statements we encountered above? They were added to align the address so that the CPU can access them faster. The CPU always reads chunks of - say - 16 bytes (depending on the CPU) at once. These chunks always start at even addresses (i.e. addresses which are divisible by 16). Reading two bytes at odd addresses can result in two memory accesses, which take it's time.

Cache architecture makes the problem much worse. It's possible to implement fast memory, but that's expensive. It takes a lot of transistors, heat and energy. That's why there are so few registers: that's the most efficient memory available. Nowadays CPUs are supported by a fast, but small level 1 cache (the "L1 cache"). Reading data from such a cache is many times slower than reading a register, but it's still a lot faster than reading the main memory. Most PCs also contain a level 2 cache and even a level 3 cache, each of which is slower than the previous cache but faster than main memory.

The art of efficient programming is to write small programs fitting into the cache. In the best case, the program and its data fit into the L1 cache. Cache architecture brings a couple of surprising consequences with it. For instance, I'm told Java programs often run faster than equivalent C++ programs. C++ programs are always compiled, so they need much more memory which fills the cache. Java byte code is much more compact, exploiting the cache better.

To me, that's a bold claim, given the fact that this approach works best when the Java code interpreted instead of being compiled.

Java's memory layout doesn't help, either. A real-world program usually deals with objects. C++ programmers can make sure that arrays of objects are laid out linearly in memory. In Java, collections of objects are almost always scattered among the memory (because the array merely stores a pointer to the object, and each object is allocated individually). As you can imagine, caching such a scattered array is quite a challenge. That's one of the reasons why Brian Goetz and his team started to implement value types. Arrays of value types are laid out linearly in memory, allowing for very efficient processing.

Wrapping it up

We've started from a Java program, talk the JVM into compiling it and we've read and analyzed the disassembled machine code. Along the way, we've learned something about the optimizing compiler of Java, and about cache architecture.

Now you're well prepared for the last article of this series. We'll apply what we've learned to write a real-world Java program that makes the effect of your CPU's cache architecture visible. We'll write two programs that take the same time - but one of the programs does 16 times more work than the other. Enjoy!

Dig deeper

my previous article

Charles Nutter on what the JIT does for you when you aren't looking

Wikipedia's introduction to Assembler programming

Wikipedia on x86 registers

Wikipedia on relocation and position-independent code

The 6502 instruction set. 6502 was a popular CPU back in the early eighties.

a list of x86 CPU instructions, their throughput and latency

A Java Programmer’s Guide to Byte Code

A quick guide on writing byte code with ASM

A Java Programmer's Guide to Assembler Language

A Child’s Garden of Cache Effects

  1. It's possible there are other reasons why I had to run it 80.000 times instead of 10.000 times. Maybe the result is printed with a certain delay, or the compiler is started after 10.000 iterations, but runs in a separate thread.
  2. the precise number depending on the CPU model and the cache implementation