Link to lab: https://wiki.cdot.senecacollege.ca/wiki/SPO600_Inline_Assembler_Lab
In this lab we are looking at how inline assembler affects performance.
Inline assembly is assembly code embedded directly in a C program, which allows for performance increases in targeted spots.
First we had to copy code that used inline assembly, and it happened to do the same thing as the previous lab.
The results with 500000 samples indicate that inline assembly gives better performance than algorithm changes.
Time: 0m.028s
And upping the sample count to 5000000 shows similar results.
Time: 0m.261s
Also running on BBetty with 500000 gives these results:
Time: 0m0.0033s
And with 5000000 samples.
Time: 0m0.321s
BBetty results were a little slower, but that’s because BBetty has a weaker CPU.
The results fall in line with what happened in lab 5. Interestingly, inline assembly achieves roughly the same performance as switching to the second alternate method in lab 5.
Opening up vol_simd.c reveals a couple points of interest in this code.
One thing to notice is this:
register int16_t* in_cursor asm("r20"); // input cursor
register int16_t* out_cursor asm("r21"); // output cursor
register int16_t vol_int asm("r22"); // volume as int16_t
These lines bind each variable to a specific register.
The alternative is to leave the variables unbound and let the compiler pick the registers, but then you lose the ability to tie a variable to a particular register.
Another interesting thing about this code is this:
vol_int = (int16_t) (0.75 * 32767.0);
This line sets the fixed-point volume value. It is similar to what we did in lab 5 in the second alternate approach. The number 32767 is chosen because it is the largest value a signed 16-bit integer can hold.
Additionally, this inline assembly statement has some interesting things going on:
__asm__ ("dup v1.8h,%w0"::"r"(vol_int));
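According to the documentation, dup takes the value in %w0 (the 32-bit view of the register holding vol_int) and duplicates it into every lane of v1.8h. v1.8h refers to vector register 1 treated as 8 lanes of 16 bits each (the h stands for halfword). In effect, this fills vector register 1 with 8 copies of vol_int.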
After that line the code goes into a loop that does an asm with these parameters:
"ldr q0, [%[in]],#16 \n\t"
"sqdmulh v0.8h, v0.8h, v1.8h \n\t"
"str q0, [%[out]],#16 \n\t"
: [in]"+r"(in_cursor)
: "0"(in_cursor),[out]"r"(out_cursor)
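If you were to remove the last two lines, the program won’t compile. The last two lines are the output and input operand lists, which are optional in general. However, this block requires them because the template refers to %[in] and %[out], and those names are only defined in the operand lists. The “+r” constraint marks in_cursor as a register operand that the asm code both reads and writes, which matches the post-indexed ldr that advances the pointer by 16 bytes each time.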
Finally, the code prints out the result:
printf("Result: %d\n", ttl);
The result is usable, but it still suffers from the same problem as the previous lab: the result is the same on every run unless you recompile. This can be fixed by seeding the random number generator with time(NULL) so that new values are generated each time.
In the next part I’m going to research an open source package.
I chose Sooperlooper, which is a live looping sampler.
Sooperlooper is only available for Linux and Mac systems.
Website: http://essej.net/sooperlooper/download.html
Source Code: https://github.com/essej/sooperlooper
Looking through the source code, there is some assembly code but not a lot.
All of the assembly is inline and there is no separate assembly file.
Interestingly, the assembly is specific to x86 and PowerPC.
For example, look at this piece of code:
static __inline__ int CountLeadingZeroes(int arg) {
#if TARGET_CPU_PPC || TARGET_CPU_PPC64
__asm__ volatile("cntlzw %0, %1" : "=r" (arg) : "r" (arg));
return arg;
#elif TARGET_CPU_X86 || TARGET_CPU_X86_64
__asm__ volatile(
"bsrl %0, %0\n\t"
"movl $63, %%ecx\n\t"
"cmove %%ecx, %0\n\t"
"xorl $31, %0"
: "=r" (arg)
: "0" (arg) : "%ecx"
);
return arg;
#else
if (arg == 0) return 32;
return __builtin_clz(arg);
#endif
}
The function’s purpose is to count leading zeros. It first checks whether the target is PowerPC, then whether it is x86, and finally falls through to an else branch that catches any other architecture. On anything other than PowerPC or x86 it uses __builtin_clz, a compiler builtin that counts leading zeros.
I personally find it odd that it supports PowerPC, but that may be because Macs ran on PowerPC until 2006 and then switched to x86. It is probably there for backwards compatibility, and it is interesting to see PowerPC supported while AArch64 is not. While this adds complexity to the code, it improves performance for the large number of PCs running this program.