Stage 1

Stage 1, SPO600 Project – Identifying and benchmarking a CPU-intensive function or method in an open source software project

I am looking to find an open source software package that includes a CPU-intensive function or method that is implemented in a language that compiles to machine code (such as C, C++, or Assembler). This could be a hash function, compression, a codec, or any other compute-intensive task.

After searching, was selected from looking through github.


Following instructions from the readme to compile test applications
> ./configure
> make

2. Benchmark the performance of the current implementation on AArch64 systems. Be sure to isolate the performance of the function of interest from the performance of all other functions. Prove the repeatability of your observations.

3. Clearly identify the strategy you will use in Stage 2 to attempt to optimize the hash function for better performance on AArch64 systems. This might include some combination of:

  • Altered build options
  • Code changes to permit better optimization by the compiler
  • Algorithm improvements
  • In-line assembler

4. Report on your results to this point.


Lab 5 – Algorithm Selection

For this lab a Basic Sound Scale Program was tested on the Aarch64 archie.

Basic Sound Scale Program

    • Creates 500,000 random “sound samples” in an input array (the number of samples is set in the vol.h file).
    • Scales those samples by the volume factor 0.75 and stores them in an output array.
    • Sums the output array and prints the sum.

[Added a timer to track how long it takes for the scaling section to finish for this program and the alternative approaches]

After building and testing,

The result of the program does not change after being compiled and run, no matter when it is run. Run-times also being very similar (but not always the same). Different samples increase or decrease the run-time but the result remains constant with the same sample.

Alternate Approaches

Alternate approaches to scaling the sound samples by modifying copies of going to be tested.

1. Pre-calculate a lookup table (array) of all possible sample values multiplied by the volume factor, and look up each sample in that table to get the scaled values.

The time spent on precalculating a lookup table and then looking up the table looks to be longer than having a function scale it.

2. Convert the volume factor 0.75 to a fix-point integer by multiplying by a binary number representing a fixed-point value “1”. For example, you could use 0b100000000 (= 256 in decimal) to represent 1.00. Shift the result to the right the required number of bits after the multiplication (>>8 if you’re using 256 as the multiplier).

Much faster than approach #1, slightly faster than the basic program.

[vol1 = basic program, vol2 = approach #1, vol3 = approach#2]



Analysing Results

The higher the optimization level, the slower the compile time. code size increases,  execution speed increases.

Approach #2 edges out approach #1 in speed.

Source code:



Lab 4 – Vectorization

This lab is designed to explore single instruction/multiple data (SIMD) vectorization, and the auto-vectorization capabilities of the GCC compiler.

A short program is written that creates two 1000-element integer arrays and fills them with random numbers in the range -1000 to +1000, then sums those two arrays element-by-element to a third array, and finally sums the third array and prints the result.
The following section of code in the short program should be vectorizable as demonstrated in the earlier examples listed here:

for (i=0; i<1000; i++){
arraySum[i] = arrayOne[i] + arrayTwo[i];

The program is then compiled with the following command [note that O3 is the highest level of optimization]:

gcc -g -O3 -fno-builtin -o lab4 lab4.c

The dissassembly listing was obtained with the following command:

objdump -d lab4


0000000000400560 <main>:
400560: d285e410 mov x16, #0x2f20 // #12064
400564: cb3063ff sub sp, sp, x16
400568: d2800000 mov x0, #0x0 // #0
40056c: a9007bfd stp x29, x30, [sp]
400570: 910003fd mov x29, sp
400574: a90153f3 stp x19, x20, [sp, #16]
400578: 529a9c74 mov w20, #0xd4e3 // #54499
40057c: a9025bf5 stp x21, x22, [sp, #32]
400580: 72a83014 movk w20, #0x4180, lsl #16
400584: f9001bf7 str x23, [sp, #48]
400588: 910103b6 add x22, x29, #0x40
40058c: 913f83b5 add x21, x29, #0xfe0
400590: 5280fa33 mov w19, #0x7d1 // #2001
400594: d2800017 mov x23, #0x0 // #0
400598: 97ffffd6 bl 4004f0 <time@plt>
40059c: 97ffffe9 bl 400540 <srand@plt>
4005a0: 97ffffdc bl 400510 <rand@plt>
4005a4: 9b347c01 smull x1, w0, w20
4005a8: 9369fc21 asr x1, x1, #41
4005ac: 4b807c21 sub w1, w1, w0, asr #31
4005b0: 1b138020 msub w0, w1, w19, w0
4005b4: 510fa000 sub w0, w0, #0x3e8
4005b8: b8376ac0 str w0, [x22, x23]
4005bc: 97ffffd5 bl 400510 <rand@plt>
4005c0: 9b347c01 smull x1, w0, w20
4005c4: 9369fc21 asr x1, x1, #41
4005c8: 4b807c21 sub w1, w1, w0, asr #31
4005cc: 1b138020 msub w0, w1, w19, w0
4005d0: 510fa000 sub w0, w0, #0x3e8
4005d4: b8376aa0 str w0, [x21, x23]
4005d8: 910012f7 add x23, x23, #0x4
4005dc: f13e82ff cmp x23, #0xfa0
4005e0: 54fffe01 4005a0 <main+0x40> // b.any
4005e4: d283f002 mov x2, #0x1f80 // #8064
4005e8: 8b0203a1 add x1, x29, x2
4005ec: d2800000 mov x0, #0x0 // #0
4005f0: 3ce06ac0 ldr q0, [x22, x0]
4005f4: 3ce06aa1 ldr q1, [x21, x0]
4005f8: 4ea18400 add v0.4s, v0.4s, v1.4s //vector addition
4005fc: 3ca06820 str q0, [x1, x0]
400600: 91004000 add x0, x0, #0x10
400604: f13e801f cmp x0, #0xfa0
400608: 54ffff41 4005f0 <main+0x90> // b.any
40060c: 4f000400 movi v0.4s, #0x0 //SIMD vector element
400610: aa0103e0 mov x0, x1
400614: d285e401 mov x1, #0x2f20 // #12064
400618: 8b0103a1 add x1, x29, x1
40061c: 3cc10401 ldr q1, [x0], #16
400620: 4ea18400 add v0.4s, v0.4s, v1.4s // vector addition
400624: eb01001f cmp x0, x1
400628: 54ffffa1 40061c <main+0xbc> // b.any
40062c: 4eb1b800 addv s0, v0.4s //addv=add across vector, SIMD vector instruction 
400630: 90000000 adrp x0, 400000 <_init-0x4b8>
400634: 91208000 add x0, x0, #0x820
400638: 0e043c01 mov w1, v0.s[0]
40063c: 97ffffc5 bl 400550 <printf@plt> //print out sum result
400640: f9401bf7 ldr x23, [sp, #48]
400644: a94153f3 ldp x19, x20, [sp, #16]
400648: 52800000 mov w0, #0x0 // #0
40064c: a9425bf5 ldp x21, x22, [sp, #32]
400650: d285e410 mov x16, #0x2f20 // #12064
400654: a9407bfd ldp x29, x30, [sp]
400658: 8b3063ff add sp, sp, x16
40065c: d65f03c0 ret

Going through assembly instructions is quite difficult, having to constantly check what each instruction means step by step
source code:

Lab 3 – Assembler Lab

For this lab, we wrote simple assembly programs. The program given to us looped 10 times without doing anything. We first changed it to print the message “loop” 10 times. Then we changed it to print numbers in ascending order, then finally changed it to loop 30 times instead of 10.

Assembly is a lot more anal than higher level programming languages (C, C++, Java, etc). You must tell it exactly what to do (gives you higher level of control but responsibility is all on you), how to handle memory etc. I am more comfortable working with higher level languages as I am already experienced with writing and debugging such programs, compared to this.

The x86_64 version of this lab was run on Xerxes, and the Aarch64 version of this lab was run on Betty.

Assembly  is interesting for programmers looking into actual machine code, but I had difficulty writing and debugging the code.

source code:

Lab 2 – Compiled C Lab

This lab will investigate the differences in compiler options by compiling a simple “Hello World” program. The changes for each GCC compiler option can affect the size, lines of code, the execution of code, the use of libraries, and so on.


(1) Add the compiler option -static.
Note and explain the change in size, section headers, and the function call.

The –static option links a program statically and prevents the use of using shared libraries, making the file much larger than without the option. The other change is that <_IO_printf> is called instead, again also due to the -static option.

(2) Remove the compiler option -fno-builtin.  Note and explain the change in the function call.

The difference with removing this compiler option is that it calls put() instead of printf(). According to , the -fno-builtin “results in code is often both smaller and faster, but since the function calls no longer appear as such, you cannot set a breakpoint on those calls, nor can you change the behavior of the functions by linking with a different library.” Without it, it seems that put() is called by default here.

(3) Remove the compiler option -g. Note and explain the change in size, section headers, and disassembly output.

Removing the compiler option -g results in a smaller sized file, which makes sense since the debugging information is removed.

(4) Add additional arguments to the printf() function in your program. Note which register each argument is placed in. (Tip: Use sequential integer arguments after the first string argument. Go up to 10 arguments and note the pattern).

The arguments put into the printf function add some data stored into registers. It seems like some arguments have their own register and others have their own.

(5) Move the printf() call to a separate function named output(), and call that function from main(). Explain the changes in the object code.

The difference here is that the <main> calls <output>, which in turn calls the printf in the code.

(6) Remove -O0 and add -O3 to the gcc options. Note and explain the difference in the compiled code.

Changing from -O0 to –O3 is supposed to have a higher level of optimization – meaning the program should run faster, according to The mov $0x0,%eax is changed to xor %eax,%eax.


Lab 1 – Code Review Lab

This lab is designed to explore the code review processes used in open source projects/communities.

(1) GCC

Testing patches for submission: 

changes to backend or c/c++ frontend – complete build of GCC and runtime libraries on at least one target. complete bootstrap on all default languages and run all testsuites. (make bootstrap; make -k check; will accomplish this)

changes to front end – perform complete bootstrap.

test exactly the change that is intended for submission

Submitting patches:

  • Description of problem/bug and how patch addresses this
  • Testcases
  • ChangeLog
  • Bootstrapping and testing
  • the patch itself

Bundled into mail message and sent to appropriate mailing list(s)

GCC 6 Release Series (6.2) Patch

Participants and role

Problem Report and users involved listed here ->

issues discussed (and resolved)

aside from problem report above, specific targeted issues:

  • Support for –with-cpu-32 and –with-cpu-64 configure options have been added on bi-architecture platforms
  • Support for the SPARC M7(Niagara 7) processor has been added.
  • Support for the VIS 4.0 instruction set has been added

(2) VirtualBox

Testing/Submitting patches for submission:

The technical documentation can be accessed to get source code and submit patches via its own hosted website. More information detailing how to contribute can be found directly here: .

However, one of two things must be done before contributing to VirtualBox

  1. Fill out the Contributor’s Agreement (CA) and send it to Oracle. With the CA, you give Oracle permission to use your contribution under licenses other than the main VirtualBox license. This is a once-in-a-lifetime event: once we have your CA, we will never again ask you to undergo any bureaucratic procedures, and all future contributions are covered.
  2. If you don’t want to sign such an agreement, you can alternatively submit your contribution under the MIT license. This is a liberal, wide-spread Open Source license that allows Oracle (and anyone else) to use your contribution in both open-source and closed-source projects.