NEON Assembly Programming on the iPad/iPhone 3GS 

To give players endless gameplay in Trudy’s Mechanicals we chose to generate maps procedurally. This means, however, that we cannot pre-bake any of the lighting offline. To get the best graphics possible in Trudy, we generate our lighting on the fly, so it has to be as fast as possible. Luckily, these kinds of calculations are easily parallelised using the NEON unit in the iPhone 3GS/4 and iPad!

The Architecture

For the sake of this article I’m going to assume that you have some grasp of the ARM architecture and assembly programming in general. NEON is a super-set of the existing VFP floating point units found in older iDevices and features sixteen 128-bit quad-word SIMD registers.

Each 128-bit register can be thought of as two 64-bit registers or four 32-bit registers. These registers are named as follows:

s0 d0 q0
s1
s2 d1
s3
s4 d2 q1
s5
s6 d3
s7

.. d30 q15
..
.. d31
..

NEON instructions operate on “qX” and “dX” registers.  VFP instructions operate on the first 16 “dX” and the first 32 “sX” registers.  For more information see the official ARM docs.
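The aliasing shown in the table above boils down to simple index arithmetic. Here is a minimal sketch in plain C. The helper names are made up for illustration and are not part of any ARM or Apple API. It assumes what the table shows: qN overlaps d(2N) and d(2N+1), and only q0-q7 overlap the 32 “sX” registers.

```c
#include <stdbool.h>

/* Hypothetical helpers, for illustration only. */

/* Index of the low 64-bit "d" register inside quad register q. */
static int q_low_d(int q)  { return 2 * q; }

/* Index of the high 64-bit "d" register inside quad register q. */
static int q_high_d(int q) { return 2 * q + 1; }

/* Only q0-q7 (d0-d15) overlap the 32 single-precision "s" registers. */
static bool q_has_s_alias(int q) { return q >= 0 && q < 8; }

/* Index of the first of the four "s" registers inside q (valid for q0-q7). */
static int q_first_s(int q) { return 4 * q; }
```

For example, q_low_d(1) and q_high_d(1) give 2 and 3, matching the “d2 q1” and “d3” rows of the table.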

Configuring your project in XCode

Any assembly you write will only work on your iDevice, because Apple chose to create a simulator rather than an emulator for debugging code on your Mac. Including “TargetConditionals.h” imports several defines that let us figure out what sort of compilation is occurring. The most important is TARGET_IPHONE_SIMULATOR, which is set when compiling for the simulator.

#include "TargetConditionals.h"
#if  TARGET_IPHONE_SIMULATOR
// put your non-optimized C/Objective-C code here
#else
//  put your assembly here
#endif

XCode itself is geared towards creating binaries for all iPhone/iPad devices.  In order to compile for the iPhone 3GS/4 or iPad we must change some build settings:

1) Un-check “Compile for Thumb”

2) Set “Valid Architectures” to “armv7” (yes, this doesn’t say Cortex-A8, but it’s fine) and check “Build Active Architecture Only”.

3) Under the Build drop-down, un-check armv6 if it is checked.

An Example

Most of the time the compiler will do a much better job of creating assembly code than you will; however, transposing a 4x4 matrix is a great use case for switching to assembly. Traditionally you’d be forced to interchange the various elements of the matrix with several swaps. Using the NEON unit we can load the entire matrix into registers, use some trickery to swap the elements, and write them back.

#include  "TargetConditionals.h"
void mat4_transpose(float mat[4][4])
{
#if TARGET_IPHONE_SIMULATOR
float tmp;

tmp = mat[1][0];
mat[1][0] = mat[0][1];
mat[0][1] = tmp;

tmp = mat[2][0];
mat[2][0] = mat[0][2];
mat[0][2] = tmp;

tmp = mat[2][1];
mat[2][1] = mat[1][2];
mat[1][2] = tmp;

tmp = mat[3][0];
mat[3][0] = mat[0][3];
mat[0][3] = tmp;

tmp = mat[3][1];
mat[3][1] = mat[1][3];
mat[1][3] = tmp;

tmp = mat[3][2];
mat[3][2] = mat[2][3];
mat[2][3] = tmp;

#else

__asm__ volatile (
// load the four rows of the matrix into q0,q1,q2,q3
"vldmia        %0, {d0,d1,d2,d3,d4,d5,d6,d7}       \n\t"
// interleave rows 0/2 and rows 1/3 ...
"vzip.32   q0,q2        \n\t"
"vzip.32   q1,q3        \n\t"
// ... then interleave the results, leaving one column in each of q0-q3
"vzip.32   q0,q1        \n\t"
"vzip.32   q2,q3        \n\t"
// write the transposed matrix back over the original
"vstmia        %0, {d0,d1,d2,d3,d4,d5,d6,d7} \n\t"
// no output registers
:

// input registers
: "r"(mat)       // %0  the address of the first element of mat

// modified / used register list (only q0-q3 are touched)
: "q0","q1","q2","q3","memory");
#endif
}

You’ll notice that each assembly statement ends with \n\t. This places a newline and a tab at the end of the line so that it fits correctly into the assembly generated by gcc. If you have an iPhone 3GS or iPad connected, the assembly code should compile successfully.
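To see why four vzip.32 instructions are enough to transpose the matrix, it can help to model the instruction in plain C. The sketch below is not the NEON code itself; vzip32_model and mat4_transpose_model are hypothetical names used only for illustration. Each row of the matrix stands in for one “qX” register, and vzip.32 interleaves the 32-bit lanes of its two operands.

```c
#include <string.h>

/* Plain-C model of "vzip.32 qA, qB": interleave the lanes of a and b.
   Afterwards a = {a0, b0, a1, b1} and b = {a2, b2, a3, b3}. */
static void vzip32_model(float a[4], float b[4])
{
    float lo[4] = { a[0], b[0], a[1], b[1] };
    float hi[4] = { a[2], b[2], a[3], b[3] };
    memcpy(a, lo, sizeof lo);
    memcpy(b, hi, sizeof hi);
}

/* Same four-step sequence as the assembly, one matrix row per "register". */
static void mat4_transpose_model(float mat[4][4])
{
    vzip32_model(mat[0], mat[2]);   /* vzip.32 q0,q2 */
    vzip32_model(mat[1], mat[3]);   /* vzip.32 q1,q3 */
    vzip32_model(mat[0], mat[1]);   /* vzip.32 q0,q1 */
    vzip32_model(mat[2], mat[3]);   /* vzip.32 q2,q3 */
}
```

After the first two zips, row 0 holds {a0, c0, a1, c1} and row 1 holds {b0, d0, b1, d1}; zipping those two together yields {a0, b0, c0, d0}, which is the first column of the original matrix.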

GCC Inline Assembly Format

If anyone knows a good reference for ARM assembly in GCC, let me know; the most comprehensive reference I found was for x86.

Each assembly statement has four colon-separated parts.

__asm__ volatile ( asm code : output registers : input registers : modified list )

Output registers

Output registers are registers that will be set with values when your assembly code completes. This allows you to pass values out of your assembly. Each is preceded by an “=” and then the type of register.

“=r” (output variable)

Input registers

Input registers are registers that are provided to your code from the C/Objective-C environment. If a register is used for both input and output it should appear once in the output register list and once in the input register list.

“r”(input variable)

Alternatively, listing it once in the output list as “+r”(variable) tells the compiler that your code both reads and modifies the contents of the register.
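As a small illustration of the output and input lists, here is a hedged sketch with two hypothetical helpers, add_ints and accumulate. Following the same idea as the simulator fallback earlier, the assembly is only used when the compiler defines __arm__ (32-bit ARM); everywhere else the plain C branch runs.

```c
/* Returns a + b. Hypothetical helper for illustration. */
static int add_ints(int a, int b)
{
#if defined(__arm__)
    int sum;
    __asm__ volatile (
        "add %0, %1, %2 \n\t"
        : "=r"(sum)          /* %0: output, written by the asm */
        : "r"(a), "r"(b)     /* %1, %2: inputs, only read */
    );
    return sum;
#else
    return a + b;            /* non-ARM fallback */
#endif
}

/* Adds b to *acc, using "+r" for a register that is read and written. */
static void accumulate(int *acc, int b)
{
#if defined(__arm__)
    int value = *acc;
    __asm__ volatile (
        "add %0, %0, %1 \n\t"
        : "+r"(value)        /* %0: read-write operand, in the output list */
        : "r"(b)             /* %1: input */
    );
    *acc = value;
#else
    *acc += b;               /* non-ARM fallback */
#endif
}
```

Neither helper touches memory or extra registers, so the modified list can be left out entirely.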

Modified registers/fields

These fields inform the compiler which registers and data you are overwriting in code.  This allows the compiler to generate instructions to save these values before calling your assembly.

rX – an ARM register (r0-r15)
sX – a single-precision floating point register (s0-s31), used by the VFP unit
dX – a double-precision / 64-bit VFP/NEON register
qX – a quad / 128-bit NEON register
memory – you’ve modified memory
cc – you’ve modified the condition codes by making a compare or using an opcode that modifies them

- Since q0 is composed of d0 and d1, you only need to specify q0 in the list even if you used d0 and d1.
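The following sketch shows a modified list in practice. double4 is a hypothetical helper, not part of the game’s code: it loads four floats into q0 through d0 and d1, doubles them, and stores them back, so the list names “q0” and “memory”. The NEON branch is only compiled when the compiler defines both __arm__ and __ARM_NEON__; otherwise a plain C loop is used.

```c
/* Doubles four floats in place. Hypothetical helper for illustration. */
static void double4(float v[4])
{
#if defined(__arm__) && defined(__ARM_NEON__)
    __asm__ volatile (
        "vld1.32   {d0,d1}, [%0]   \n\t"  /* q0 = v[0..3] */
        "vadd.f32  q0, q0, q0      \n\t"  /* q0 = q0 + q0 */
        "vst1.32   {d0,d1}, [%0]   \n\t"  /* v[0..3] = q0 */
        :                       /* no output registers */
        : "r"(v)                /* %0: address of the four floats */
        : "q0", "memory"        /* q0 covers d0 and d1; memory was written */
    );
#else
    for (int i = 0; i < 4; ++i) {
        v[i] += v[i];           /* non-NEON fallback */
    }
#endif
}
```

Leaving “memory” out of the list would let the compiler assume v is unchanged and reuse stale values it had cached in registers.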

Some Notes

Here are some helpful tips to keep you from pulling out your hair... as I’ve already done.

  • Use very short label names. I found that a label name longer than 4 characters could cause unexpected errors.
  • Labels at the very end of your assembly output will cause similar errors. If ever in a bind, try appending a NOP such as “add r0,r0,#0”.
  • When using output and input registers, GCC will often give the same register number to an output and an input parameter. You should read all input registers into temporary registers before you start writing to any output registers.
  • gcc assembly is case sensitive, i.e. typing ADD is not the same as typing add.
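For the shared-register problem in the third tip, copying inputs into temporaries is one fix. Another option, which is not mentioned above, is GCC’s early-clobber modifier “&”: writing “=&r” asks the compiler not to place that output in the same register as any input. The sketch below uses a hypothetical helper, twice_plus, and only compiles the assembly when __arm__ is defined.

```c
/* Returns a + a + b. Hypothetical helper for illustration. */
static int twice_plus(int a, int b)
{
#if defined(__arm__)
    int out;
    __asm__ volatile (
        "add %0, %1, %1 \n\t"   /* out = a + a (output written early) */
        "add %0, %0, %2 \n\t"   /* out = out + b (b is read afterwards) */
        : "=&r"(out)            /* "&": never share a register with an input */
        : "r"(a), "r"(b)
    );
    return out;
#else
    return a + a + b;           /* non-ARM fallback */
#endif
}
```

Without the “&”, the compiler would be free to put out and b in the same register, and the first add would overwrite b before the second add reads it.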

Good Luck!

Posted by Tristan Campbell
@igtristan, programmer at Incubator Games