Programming high-performance applications on the Cell BE processor, Part 3: Meet the synergistic processing unit
Continue looking in depth at the Cell Broadband Engine™ (Cell BE) processor's synergistic processor elements (SPEs) and how they work at the lowest level. This installment explores storage alignment issues and the communication facilities of the SPEs.
Because the synergistic processing unit (SPU) is focused on vector, not scalar, processing, it can only load and store 16 bytes at a time (the size of a register), and only at local store locations that are aligned on 16-byte boundaries. Therefore, you cannot just load a word from, say, memory location 12. To get that word, you would need to load the quadword at memory location 0, and then shift the bits so that the value you want is in the preferred slot. Storing a word there is harder still: the original quadword must be loaded, the new value inserted at the right location in the quadword, and then the result stored back. Because of these issues, it is usually advisable to store all data aligned to 16 bytes. Loading a value which crosses a 16-byte boundary is even more difficult, as you would have to load it into two registers, shift them, and then mask and combine them. Storing such values is more difficult yet, so it is best to never use values that cross 16-byte boundaries.
While it allows you to use data that is not aligned to 16-byte boundaries, the loading and storing technique I will discuss requires that the data be naturally aligned to prevent it from crossing the 16-byte boundary. That means that words will be 4-byte aligned, halfwords will be 2-byte aligned, and bytes don't have to be aligned at all.
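The natural-alignment rule comes down to a little address arithmetic. The following is a minimal sketch in portable C (a model for reasoning about addresses, not SPU code); the function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if an object of `size` bytes starting at local store address `addr`
   touches two different 16-byte quadwords. */
bool crosses_quadword(uint32_t addr, uint32_t size)
{
    return (addr >> 4) != ((addr + size - 1) >> 4);
}

/* True if `addr` is a multiple of `size` (size must be a power of two). */
bool naturally_aligned(uint32_t addr, uint32_t size)
{
    return (addr & (size - 1)) == 0;
}
```

A word at address 12 is naturally aligned and stays inside one quadword, while a word at address 14 is neither, which is exactly the case the technique in this section does not handle.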
Doing an unaligned load requires two or three instructions, depending on the size of the data. The reason for this is that if you are loading a single value, you probably want it in the preferred slot of the register. The first instruction does the load and the second instruction rotates the value so that the requested address is at the beginning of the register. Then, if the data is smaller than a word, a shift is needed to move it away from the beginning into the preferred slot (if it is a word or a doubleword, the beginning of the register is the preferred slot). Here is the code for a byte load, which takes an address in the preferred slot of register 3 and uses it to load a byte into the preferred slot of register 4:
Listing 3. Load from non-aligned memory
###Load byte unaligned address $3 into preferred slot of register $4###
Remember, the lqd
instruction only loads from 16-byte boundaries. It will therefore ignore the four least significant bits during the load, and just load an aligned quadword from memory. Therefore, for arbitrary addresses, we have no idea where in the loaded quadword the value we wanted is. The rotqby
instruction, "rotate (left) quadword by bytes," uses the address you loaded from to indicate how far to rotate the register. It only uses the four least significant bits of the address in the register (the ones ignored by the load) to determine how far to rotate. This will always be the number of bytes it needs to rotate left to move the addressed byte to the beginning of the register. Finally, for bytes, the preferred slot is not at the beginning of the register, but three bytes to the right. So the instruction rotqbyi
rotates the register by an immediate-mode byte count to move the byte there. Word- and doubleword-sized transfers do not need this last instruction, because their preferred slot is at the beginning of the register anyway. At the end of this, register 4 has the final value, with the byte in the preferred slot.
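To make the data movement concrete, here is a small portable C model of the three steps described above (aligned load, rotate by the low address bits, rotate into the byte preferred slot). It is only a sketch of the behaviour, not SPU code: the `quadword` type and the `model_*` names are hypothetical, and the local store is just a byte array passed in by the caller.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } quadword;   /* b[0] is the leftmost byte */

/* Aligned quadword load: the four low address bits are ignored. */
quadword model_lqd(const uint8_t *local_store, uint32_t addr)
{
    quadword q;
    memcpy(q.b, &local_store[addr & ~15u], 16);
    return q;
}

/* Rotate left by (count mod 16) bytes; a negative count rotates right. */
quadword model_rotate_left_bytes(quadword q, int count)
{
    quadword r;
    unsigned n = (unsigned)count & 15u;
    for (unsigned i = 0; i < 16; i++)
        r.b[i] = q.b[(i + n) & 15u];
    return r;
}

/* Byte load: on return the requested byte sits in byte 3,
   the preferred slot for bytes. */
quadword load_byte_unaligned(const uint8_t *local_store, uint32_t addr)
{
    quadword q = model_lqd(local_store, addr);
    q = model_rotate_left_bytes(q, (int)addr);  /* wanted byte -> byte 0 */
    q = model_rotate_left_bytes(q, -3);         /* byte 0 -> byte 3      */
    return q;
}
```

For a word load, the final rotate would simply be dropped, because the word preferred slot already starts at byte 0.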
Storing is more difficult. Here is the code to store a byte that is in the preferred slot of register $4 into the address specified by register $3:
Listing 4. Store to non-aligned address
###Store preferred byte slot $4 into unaligned address $3
To understand this cryptic-looking sequence, again keep in mind that the SPU only loads and stores a quadword at a time, on quadword-aligned addresses. Therefore, if you tried to store a single byte directly to an unaligned address, it would both go into the wrong location and clobber the remaining bytes in the quadword. To avoid this, you need to first load the quadword from memory, insert the value into the appropriate byte in the quadword, and then store it back. The hard part is inserting it into the proper location based only on the address. Thankfully, two instructions help out, cbd
("generate control for byte insertion") and shufb
("shuffle bytes"). The cbd
instruction takes an address and generates a control word that can be used by shufb
to insert a byte at the proper location in the quadword for that address. cbd $6, 0($3)
uses the address in register 3 to generate the control quadword, and then stores it in register 6. The instruction shufb $7, $4, $5, $6
uses the control quadword in register 6 to build a new value consisting of the original quadword from memory (now in register 5) with the byte from the preferred slot of register 4 inserted at the right position, and places the result in register 7. Once the byte is shuffled in, the value is stored back into memory.
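The read-modify-write sequence can be modelled in portable C as follows. This is a simplified sketch, not SPU code: the names are hypothetical, and the shuffle model only covers control bytes that select a source byte (the special constant-generating control values of the real instruction are left out).

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } quadword;   /* b[0] is the leftmost byte */

/* Control for byte insertion: every slot keeps the byte from the second
   shuffle source (0x10..0x1F), except the slot named by the low four
   address bits, which takes byte 3 of the first source. */
quadword model_cbd(uint32_t addr)
{
    quadword c;
    for (unsigned i = 0; i < 16; i++)
        c.b[i] = (uint8_t)(0x10u + i);
    c.b[addr & 15u] = 0x03;
    return c;
}

/* Simplified shuffle: control 0x00-0x0F picks from a, 0x10-0x1F from b. */
quadword model_shufb(quadword a, quadword b, quadword control)
{
    quadword r;
    for (unsigned i = 0; i < 16; i++) {
        uint8_t sel = control.b[i] & 0x1Fu;
        r.b[i] = (sel & 0x10u) ? b.b[sel & 15u] : a.b[sel];
    }
    return r;
}

/* Store the byte held in value.b[3] at an arbitrary address. */
void store_byte_unaligned(uint8_t *local_store, uint32_t addr, quadword value)
{
    uint8_t *line = &local_store[addr & ~15u];
    quadword old, merged;
    memcpy(old.b, line, 16);                            /* load    */
    merged = model_shufb(value, old, model_cbd(addr));  /* insert  */
    memcpy(line, merged.b, 16);                         /* store   */
}
```

Only the addressed byte changes; the other fifteen bytes of the quadword are written back with their old values.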
To illustrate the technique, I'll write a function that takes the address of an ASCII character, loads it, converts it to uppercase, and stores it back. I'll put the function convert_to_upper
in a separate file from the main
function so that I can reuse it in another program later on. Here is the code for the main
function (save it as convert_main.s
):
Listing 5. Uppercase conversion program start
.data
Now enter the function that actually does the uppercase conversion (enter as convert_to_upper.s
):
Listing 6. Function to convert to uppercase
.text
To compile and run, perform the following commands:
spu-gcc convert_main.s convert_to_upper.s -o convert
The main
function doesn't work much differently than before, so I won't discuss it here. Note, however, that it is passing the address of the letter to convert_to_upper
, not the letter itself.
The convert_to_upper
function takes the address of an arbitrary character, converts it to uppercase, and then stores it back and returns nothing. It never calls another function, so it doesn't need a stack frame.
The first thing the function does is an unaligned load as described previously into register 4. It then checks to see if the byte is in the range a
through z
. It does that by comparing if it is greater than 'a' - 1
, and then checking whether it is greater than 'z'.
I did not do a "less than" comparison, because it isn't available on the SPU! SPUs only have comparisons for "greater than" and "equal to." Therefore, if you want to do a "less than or equal to" comparison, you must do a "greater than" comparison and then do a "not" on it, which is performed using the nand
instruction with both source arguments being the same register. You then combine the comparisons using the and
instruction (note that you could have combined all the logical instructions into one with an xor
, but the code would have been much less clear). Finally, because the branch instructions only operate on halfword or word values, you have to mask out the non-relevant portions of the register. (I didn't have to do that in the factorial example because I was dealing with a full word).
If the bits in the preferred slot of register 8 are all set to false, you skip to the end of the function. If they are true, you perform the conversion. The only byte-oriented arithmetic function on the SPU is absdb
, "absolute difference of bytes," which gives the absolute value of the difference between two operands. You use that, combined with the difference between the lowercase and uppercase values, to perform the conversion. Finally, you perform an unaligned store. Since you did not call any functions or use any local storage, you did not need a stack frame at all, so you can now just exit through the link register.
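The comparison logic is easier to follow in a higher-level form. The sketch below is a portable C model of the same steps, using only "greater than", a NOT built from NAND, AND, and an absolute difference. The helper names are hypothetical, and the comparisons here are unsigned, which is an assumption that makes no difference for ASCII input.

```c
#include <stdint.h>

/* All-ones if a > b, else zero, in the style of a compare instruction. */
uint8_t mask_gt(uint8_t a, uint8_t b) { return (a > b) ? 0xFF : 0x00; }

/* NOT expressed as NAND with both inputs the same. */
uint8_t mask_not(uint8_t m) { return (uint8_t)~(m & m); }

/* Absolute difference of two bytes. */
uint8_t abs_diff(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : b - a);
}

uint8_t to_upper_model(uint8_t c)
{
    uint8_t ge_a = mask_gt(c, 'a' - 1);        /* c > 'a' - 1 means c >= 'a' */
    uint8_t le_z = mask_not(mask_gt(c, 'z'));  /* !(c > 'z') means c <= 'z'  */
    uint8_t is_lower = ge_a & le_z;

    if (!is_lower)
        return c;                     /* skip the conversion */
    return abs_diff(c, 'a' - 'A');    /* c - 32, since c is at least 'a' */
}
```

Because the character is known to be at least 'a' when the conversion runs, the absolute difference with 32 is the same as subtracting 32.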
So far I have concentrated on SPE-only programs. Now I will look into PPE-controlled programs, and for that, I need to know how to get the PPE and the SPE to communicate.
Remember that SPEs have a memory that is separate from the processor's main memory, called the local store. The SPE cannot read main memory directly, but instead must import and export data between the local store and main memory using DMA commands to a unit called the memory flow controller, or MFC. The local store address space is limited to 32 bits, but it is usually much smaller (in the Sony® PLAYSTATION® 3, for instance, it is only 18 bits). The reason for keeping SPE memory separate is so that memory accesses by SPE code can be deterministic. Main memory can get swapped out, moved around, cached, uncached, or memory mapped. Therefore, the amount of time required for any particular main memory access is unpredictable (if the memory is swapped out, who knows how long it will take). By separating out the SPE memory into a local store, the SPE can have a deterministic access time for any memory it accesses, and schedule the MFC to asynchronously move data in and out of main memory as needed. Addresses within an SPE's local store are called local store addresses (LSAs), while addresses within the main memory are called effective addresses (EAs). This will be important as you learn how to use the memory flow controller's DMA facilities.
SPEs communicate with the outside world by using channels. A channel is a 32-bit area which can be written to or read from (but not both -- they are unidirectional) using special instructions. A channel can also have a depth, or channel count. The channel count is the amount of data waiting to be read (for read channels), or the amount of data which can still be written (for write channels). Channels are used for all SPE input and output. They are used for issuing DMA commands to the memory flow controller, handling SPE events, and reading and writing messages to and from the PPE. The next program I'll show you utilizes the MFC and the channel interface to do character conversions on data specified by the PPE.
Creating and running SPE tasks
So far, the main
function has not been using any parameters. However, when it is run from a PPE program, it actually receives three 64-bit parameters -- the SPE task identifier in register 3, a pointer to application parameters in register 4, and a pointer to runtime environment information in register 5. The contents of the areas pointed to by application and environment pointers are actually user-defined. However, remember that they point to memory in the main storage of the application (an effective address), not to the SPE's local store. Therefore, they cannot be accessed directly, but must be moved in through DMA.
SPE tasks are created with the function speid_t spe_create_thread(spe_gid_t spe_gid, spe_program_handle_t *spe_program_handle, void *argp, void *envp, unsigned long mask, int flags)
. The parameters work as follows:
- spe_gid
This is the SPE thread group to assign this task to. It can simply be set to zero.
- spe_program_handle
This is a pointer to a structure which holds the data about the SPE program itself. This data is normally defined either automatically by embedding an SPU application within a PPU executable (this will be shown later), by using dlopen()/dlsym() on a library containing an SPU application, or by using spe_open_image() to directly load an SPU application.
- argp
This is a pointer to application-specific data for program initialization. Set to null if it is not going to be used.
- envp
This is a pointer to environment data for the program. Set to null if it is not going to be used.
- mask
This is the processor affinity mask. Set it to -1 to assign the process to any available SPE. Otherwise, it is a bitmask with one bit per available processor: 1 means that the processor should be used, 0 means that it should not. Most applications set this to -1.
- flags
This is a set of bit flags which modify how the SPE is set up. These are all outside the scope of this article.
As an example of DMA communication, I will write a program where the PPE takes a string, and invokes an SPE program which copies over the string, converts it to uppercase, and copies it back into main storage. All of the data transfers will use the MFC's DMA facilities, controlled through SPE channels.
The main SPE program will receive an effective address pointer to a struct containing the size and pointer of a string in main memory. It will then copy it into its buffer, perform the conversion, and copy it back. Here is the SPE code (enter as convert_dma_main.s
):
Listing 7. SPU code to perform uppercase conversion for PPU program
.data
This code relies on some utility functions for handling DMA commands. Enter those functions as dma_utils.s
:
Listing 8. DMA transferring utilities
##UTILITY FUNCTION TO PERFORM DMA OPS##
Now, not only do you need to compile this program, you need to prepare it for embedding in a PPE application. Assuming you still have the convert_to_upper.s
from your last program in the current directory, here are the commands to compile the code and prepare it for embedding:
spu-gcc convert_dma_main.s dma_utils.s convert_to_upper.s -o spe_convert
This produces what is called a CESOF Linkable, which allows an object file for the SPE to be embedded in a PPE application and loaded as needed.
Here is the PPU code to make use of the SPU code (enter as ppu_dma_main.c
):
Listing 9. PPU code to utilize SPU application
#include
To build and execute the program, enter the following commands:
gcc -m64 spe_convert_csf.o ppu_dma_main.c -lspe -o dma_convert
A lot of things are going on in this code, and my goal is to introduce all of the necessary foundational material so that we don't get bogged down in it when learning optimization secrets in the next article. (Stay with me, and you'll be on your way to expert SPU programming in no time!) Now, I'll explain what the code is doing. I'll start with the PPU code, since it's a little easier.
The first interesting part of the PPU code is the inclusion of the libspe.h
header file, which contains all of the function declarations for running programs on the SPE. It then references a handle called convert_to_upper_handle
. This is only an extern
reference, not the declaration itself. This is because convert_to_upper_handle
is defined in spe_convert_csf.o
. The name of the variable was set on the command line of the embedspu
command. That variable is the handle to the program code, which will be used to create your SPE tasks.
Next, you define the structure that will be used as the parameter to your SPE program. You need the length of the string and the pointer to the string itself. Both need to be quadword aligned, so that you can copy the structure into your SPE program and use the values with DMA transfers. Note that the pointer is declared as an unsigned long long
rather than just a pointer. This is so that the address transfer is stored the same way whether it is compiled in 32-bit mode or 64-bit mode. With a pointer, if it were compiled in 32-bit mode, the pointer would be aligned differently within the structure. You also have to use the memalign
function and strcpy
to copy the data into an area of appropriate alignment. Here's a pointer from long nights of trial and error with this stuff: If you are continually receiving a "bus error," you are probably doing a DMA transfer that is either not 16-byte aligned or is not a multiple of 16 bytes.
In the main program, you declare your variables. Note that all of the declared variables which will be copied using DMA are aligned on quadword boundaries and are multiples of quadwords. That's because DMA transfers, with a few exceptions for small transfers, must be quadword aligned in both the source and destination addresses (the program will get even better performance if both source and destination are 128-byte aligned). Next, the SPE task is created with spe_create_thread
, passing in your parameter structure. Now you can just wait for the SPE task to complete using spe_wait
, and then print out the final value. As you may have guessed, most of the interesting parts of the program are taking place on the SPE, including all of the DMA transfers. DMA transfers are almost always done by the SPEs rather than by the PPE because they can handle much more data and many more active DMA operations than the PPE.
Before getting into the details of the main program, I'll explore the DMA utility functions. The first function is perform_dma
, which, not surprisingly, performs DMA commands. The Cell BE Handbook defines the sequence of channel operations needed to perform a DMA transfer on pages 450-456 (see Resources). The first thing the function does is convert the 64-bit effective address in register 4 into two 32-bit components -- a high- and a low-order component (remember, the channels are only 32 bits wide). Because channels are written using a register's preferred word-sized slot, the 64-bit address already has the high-order bits in the preferred slot. Therefore, you just shift the contents to the left by four bytes into a new register to get the low-order bits in the preferred slot. You then write the local store address, the high-order bits of the effective address, the low-order bits of the effective address, the size of the transfer, the "tag" of the DMA command, and then the command itself to their appropriate channels using the wrch
instruction. When the command is written, the DMA request is enqueued into the MFC provided it has available slots -- yours certainly does as you are not doing any other concurrent DMA requests. The "tag" is a number which can be assigned to one or many DMA commands. All DMA commands issued with the same tag are considered a single group, and status updates and sequencing operations apply to the group as a whole. In this application, you will only have one DMA command active at a time, so all of your operations will use 0 as the DMA tag. The DMA command should be either MFC_GET_CMD
or MFC_PUT_CMD
. There are others, but we aren't concerned with them here. MFC commands are all done from the perspective of the SPE, whether or not it is actually the SPE issuing the command. So MFC_GET_CMD
moves data from main memory to the local store, and MFC_PUT_CMD
goes the other way.
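The parameters that get written to the channels, and the direction convention, can be modelled in portable C. This is a toy sketch rather than real MFC programming: `dma_request`, `make_dma_request`, and `model_dma` are hypothetical names, the command values are stand-ins rather than real opcodes, and "main memory" is just a byte array indexed by the effective address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum model_cmd { MODEL_GET, MODEL_PUT };  /* stand-ins, not real MFC opcodes */

typedef struct {
    uint32_t lsa;            /* local store address                   */
    uint32_t ea_high;        /* upper 32 bits of the effective address */
    uint32_t ea_low;         /* lower 32 bits of the effective address */
    uint32_t size;           /* transfer size in bytes                */
    uint32_t tag;            /* tag group                             */
    enum model_cmd command;  /* direction, from the SPE's perspective */
} dma_request;

dma_request make_dma_request(uint32_t lsa, uint64_t ea, uint32_t size,
                             uint32_t tag, enum model_cmd command)
{
    dma_request r;
    assert(tag < 32);                       /* tags are assumed to be 0..31 */
    r.lsa = lsa;
    r.ea_high = (uint32_t)(ea >> 32);       /* each channel is 32 bits wide */
    r.ea_low = (uint32_t)(ea & 0xFFFFFFFFu);
    r.size = size;
    r.tag = tag;
    r.command = command;
    return r;
}

/* GET copies main memory -> local store; PUT copies the other way. */
void model_dma(uint8_t *local_store, uint8_t *main_memory, const dma_request *r)
{
    uint64_t ea = ((uint64_t)r->ea_high << 32) | r->ea_low;
    if (r->command == MODEL_GET)
        memcpy(local_store + r->lsa, main_memory + ea, r->size);
    else
        memcpy(main_memory + ea, local_store + r->lsa, r->size);
}
```

The real transfer is asynchronous; this model completes immediately, which is why the article's program has to wait on the tag before touching the data.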
Because DMA commands are asynchronous, it is useful to be able to wait for one to complete. The function wait_for_dma_completion
does precisely that. It takes a tag as its only parameter, converts it to a tag mask, requests a DMA status, and then reads the status. So how does this wait for the DMA operation to complete? When writing to the $MFC_WrTagUpdate
channel with a value of 2, it causes the $MFC_RdTagStat
to not have a value until the operation is completed. Thus, when you try to read the channel using rdch
, it will block until the status is available, at which point the transfer will be complete.
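The tag-to-mask conversion is a one-bit shift. The sketch below models it in portable C together with the completion condition; the function names are hypothetical, and the reading of update value 2 as "wait for all tag groups in the mask" is stated here as an assumption about the MFC rather than something this article spells out.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per tag group: tag 0 -> bit 0, tag 5 -> bit 5. */
uint32_t tag_to_mask(unsigned tag)
{
    return 1u << (tag & 31u);
}

/* `pending` has a bit set for every tag group that still has DMA commands
   in flight. The wait is over once none of the masked groups is pending. */
bool masked_tags_complete(uint32_t pending, uint32_t mask)
{
    return (pending & mask) == 0;
}
```

In the article's program only tag 0 is used, so the mask is always 1.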
Now, moving on to the actual program itself. The first thing our SPE program does is reserve space for the application's parameter data. This is also aligned to quadword boundaries (.align 4
in assembly language works the same as __attribute__((aligned(16)))
in C because 2^4 = 16). .octa
reserves quadword values (the mnemonic is a holdover from 16-bit days). You then define a constant CONVERSION_STRUCT_SIZE
for the size of the whole structure.
After this, you go to the .bss
section, which is like the .data
section, except that the executable itself does not contain the values, it just notes how much space should be reserved for them. This section is for uninitialized data. .lcomm conversion_buffer, 16384
reserves 16K of space, with the starting address defined in the symbol conversion_buffer
. It is defined for holding 16K because that is the maximum size of an MFC DMA transfer. Therefore, if any string is longer than that, the PPE will have to invoke the program multiple times (a better program would simply break up the request into chunks on the SPE side).
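Breaking a long request into pieces is mostly bookkeeping. Here is a minimal sketch in portable C of how the pieces could be planned; `dma_chunk` and `plan_dma_chunks` are hypothetical names, and the total length is assumed to already be a multiple of 16 bytes so that every piece is a legal transfer size.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_DMA_SIZE 16384u   /* largest single transfer, per the text above */

typedef struct {
    uint64_t ea;     /* effective address where this piece starts */
    uint32_t size;   /* bytes in this piece, at most MAX_DMA_SIZE */
} dma_chunk;

/* Split `total` bytes starting at `ea` into pieces of at most 16K.
   Writes up to `max_chunks` entries to `out` and returns how many
   entries were written. */
size_t plan_dma_chunks(uint64_t ea, size_t total,
                       dma_chunk *out, size_t max_chunks)
{
    size_t n = 0;
    while (total > 0 && n < max_chunks) {
        uint32_t size = total < MAX_DMA_SIZE ? (uint32_t)total : MAX_DMA_SIZE;
        out[n].ea = ea;
        out[n].size = size;
        ea += size;
        total -= size;
        n++;
    }
    return n;
}
```

Each piece could then go through the same get, convert, put cycle, reusing the single 16K buffer in the local store.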
The main
function has the main meat of the program. It starts by setting up a stack frame. It then saves three non-volatile registers that will be used for the main control of the program. Next, it performs a DMA transfer to copy in the parameter structure from the PPE. Remember, the first parameter to the function is the 64-bit address that was passed in from the PPE. You then use a DMA command to fetch the full structure, and wait for the DMA to complete. After the transfer, you use the data in that structure to copy the string itself into your buffer in the local store using another DMA transfer, and wait for it to complete. Note that you used the ila
instruction ("immediate load address") to load the address of the buffer. The ila
instruction maxes out at 18 bits, which works for the PLAYSTATION 3. However, if a Cell BE processor has a larger local store size, you would load it instead with the following two instructions:
ilhu $3, conversion_buffer@h #load high-order 16 bits of conversion_buffer
Then the target effective address, the length of the string, the DMA tag, and an MFC_GET_CMD
DMA command are all passed to perform_dma
. The program then waits for the operation to complete.
At this point, all of the data is loaded in and you just need to convert it. You then use register 127 as your loop counter and register 126 as your base pointer, and perform convert_to_upper
on each value until you get to the end of the buffer.
At loop_end
, all of the data is converted, and you need only to copy it back. You use the same DMA parameters as for the last transfer, but this time it is an MFC_PUT_CMD
command. Once the DMA is completed, your function is done. You load register 3 with the return value and perform the function epilogue to restore the stack frame and return.
SPE/PPE communication using mailboxes
While DMA transfers are an excellent way of moving bulk data between the SPE and the PPE, mailboxes, which I will discuss briefly, are a simpler method for smaller transfers. For the SPE, a mailbox is simply a set of channels (a read channel and a write channel) for exchanging 32-bit values with the PPE.
To demonstrate the concept, I will write a very simple SPE server which waits for an unsigned integer number in the mailbox and then writes back the square of that number. Here is the code (enter as square_server.s
):
Listing 10. SPU squaring server
.text
That's all! This will just sit around and wait for requests and process them. It simply quits when the parent program quits. And, if there is no value available in the inbox, the rdch
instruction simply stalls until there is one.
The PPE side isn't much harder (enter as square_client.c
):
Listing 11. PPE squaring client
#include
To compile and run this program, issue the following commands:
spu-gcc square_server.s -o square_server
The mailboxes, even for the PPE, are named according to the perspective of the SPE. So you write to the inbox and read from the outbox if you are the PPE. Unlike the SPE, the PPE does not stall and wait for a value when it reads or writes. Instead, the program must use spe_stat_out_mbox
to wait for a value, and spe_stat_in_mbox
to see if there are slots left for writing to the mailbox. You don't use the latter as you only have one value in play at a time.
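The polling behaviour on the PPE side can be illustrated with a toy mailbox in portable C. This is only a model of the idea: it does not use libspe, every name in it is hypothetical, and the queue depth of 4 is an assumption chosen for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define MBOX_CAPACITY 4u   /* assumed depth for this toy model */

typedef struct {
    uint32_t slot[MBOX_CAPACITY];
    unsigned head;    /* index of the oldest entry   */
    unsigned count;   /* number of entries waiting   */
} mailbox;

/* How many values are waiting to be read. */
unsigned mbox_stat(const mailbox *m) { return m->count; }

/* How many more values can be written. */
unsigned mbox_space(const mailbox *m) { return MBOX_CAPACITY - m->count; }

/* Non-blocking write: fails instead of stalling when the mailbox is full. */
bool mbox_write(mailbox *m, uint32_t value)
{
    if (m->count == MBOX_CAPACITY)
        return false;
    m->slot[(m->head + m->count) % MBOX_CAPACITY] = value;
    m->count++;
    return true;
}

/* Non-blocking read: fails instead of stalling when the mailbox is empty. */
bool mbox_read(mailbox *m, uint32_t *value)
{
    if (m->count == 0)
        return false;
    *value = m->slot[m->head];
    m->head = (m->head + 1) % MBOX_CAPACITY;
    m->count--;
    return true;
}

/* One pass of the squaring server: take a request, post the reply. */
bool square_server_step(mailbox *inbox, mailbox *outbox)
{
    uint32_t n;
    if (mbox_space(outbox) == 0 || !mbox_read(inbox, &n))
        return false;
    return mbox_write(outbox, n * n);   /* wraps modulo 2^32 */
}
```

A PPE-style client in this model writes to the inbox, loops until `mbox_stat` on the outbox is nonzero, and only then reads the reply, which mirrors the wait-then-read pattern described above.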
The real power of mailboxes comes when a program combines the mailbox and the DMA approach. For example, an SPE task can be created which listens for buffer addresses on its mailbox, and then uses that address to pull in all of the data to be processed through DMA.
Thus far, this series has covered the main concepts of assembly language programming on the Cell BE processor of the PLAYSTATION 3 under Linux®. Topics covered include the basic architecture, the syntax of the SPU assembly language, and the primary modes of communication between the SPE and the PPE. The next article looks at how to pump every ounce of performance out of the Cell BE processor SPEs that you can. And later articles will apply this knowledge to SPE programming in C, to make your life just a little bit easier.