As you may know, the PSP has a coprocessor called the VFPU (Vector Floating Point Unit). As its name indicates, it's an FPU that can compute vector operations such as cross products, vector-matrix transforms, matrix multiplies, etc.
The VFPU has 128 registers, grouped as 8 matrices of 4x4 floats. Operations can be done on single (.s), pair (.p), triplet (.t) and quad (.q) registers. The same applies to matrices (they can also be operated on as 2x2 and 3x3), but with some restrictions. You can have a look at the excellent VFPU guide linked at the end of this post.
So it's like a common MIPS coprocessor: you can move values between CPU and VFPU registers with mtv and mfv (the VFPU counterparts of the FPU's mtc1 and mfc1), and it also has some memory access instructions.
lv.s Reg, Mem Loads one float from the memory address and stores it into the register.
sv.s Reg, Mem Saves one register to the memory address.
lv.q Reg, Mem Loads 4 IEEE 754 floats from the memory address and stores them into the register, which must be a row or a column. The memory address must be 16-byte aligned (128-bit alignment).
sv.q Reg, Mem The same as above, but saving the register to memory.
ulv.q Reg, Mem Loads 4 IEEE 754 floats from the memory address and stores them into the register, which must be a row or a column. The memory address must be 8-byte aligned (64-bit alignment).
usv.q Reg, Mem Exactly the same as ulv.q, but for saving.
In case you don't know, ulv.q and usv.q are just aliases or macros for two instructions each (that's why they eat two cycles). I don't remember the names of the real opcodes, and I don't know whether they can be used safely.
The fact is that on the FAT PSP (PSP 1000) the ulv.q instruction is broken. By broken I mean that it doesn't behave the way it's expected to: the instruction corrupts FPU registers. This represents a performance hit in games, since they also have to stay backwards compatible with the PSP 1000.
Which FPU register gets corrupted by each ulv.q is shown in this table:
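| Column load | Row load | Corrupted FPU register |
|---|---|---|
| ulv.q C000, Addr | ulv.q R000, Addr | $f0 |
| ulv.q C010, Addr | ulv.q R001, Addr | $f1 |
| ulv.q C020, Addr | ulv.q R002, Addr | $f2 |
| ... | ... | ... |
| ulv.q C730, Addr | ulv.q R703, Addr | $f31 |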
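The rows listed in the table are consistent with a simple rule: the corrupted FPU register index is the matrix number times 4 plus the column (or row) number. Here is a minimal C sketch of that rule, assuming the rows not listed follow the same pattern. The function name `ulvq_clobbered_fpr` is hypothetical, made up for this illustration and not part of any SDK.

```c
#include <assert.h>

/*
 * Hypothetical helper: index N of the FPU register $fN that a
 * "ulv.q Cm<line>0" (column load) or "ulv.q Rm0<line>" (row load)
 * corrupts on a PSP 1000.
 *
 * matrix: VFPU matrix number, 0..7
 * line:   column number (C form) or row number (R form), 0..3
 *
 * The formula is inferred from the listed table rows only:
 * C000 -> $f0, C010 -> $f1, C020 -> $f2, C730 -> $f31.
 */
int ulvq_clobbered_fpr(int matrix, int line)
{
    assert(matrix >= 0 && matrix <= 7);
    assert(line >= 0 && line <= 3);
    return matrix * 4 + line;
}
```

For example, `ulvq_clobbered_fpr(7, 3)` gives 31, matching the last row of the table. If you really must use ulv.q on a FAT PSP, this tells you which FPU register to avoid keeping live across the load.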
Every ulv.q instruction corrupts an FPU register. This forces us to either use lv.s, or align the data and use lv.q. Warning! Don't assume alignment on the stack. If you use attributes like packed or aligned (in GCC, of course), the alignment should be correct for data allocated with new (that's my experience, although the GCC documentation contradicts it), but data allocated with malloc or placed on the stack won't be aligned!
Just use memalign (it's not completely portable, but it can be used on the PSP) instead of malloc, and alloca instead of declaring data on the stack. You can align data on the stack by using alloca or a normal C buffer: create a buffer that is big enough (with 16 extra bytes), then take a pointer to the buffer and round it up to get an aligned address inside the buffer with the needed size (that's why the 16 extra bytes).
This can also be done by using alignment attributes and forcing the stack to maintain 16-byte alignment. In that case calls across functions will maintain the 16-byte alignment, at the cost of some wasted space, and will allow us to align the floats. But (there's always a but) take care when using libraries. The SDK libraries won't preserve the alignment, so if you have callbacks (functions in your code which are called by a function in a library) the stack alignment will be compromised. ODE or SDL callbacks are an example of that. You'll have to recompile the libs or write a wrapper which fixes the stack.
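The two tricks (memalign for the heap, an oversized buffer plus a rounded-up pointer for the stack) could look roughly like this in C. It's only a sketch: the helper names (`align16_up`, `is_aligned16`, `alloc_vec4_array`, `sum_vec4_on_stack`) are made up for the example, and it assumes your toolchain declares memalign in `<malloc.h>`. The stack buffer is declared as floats with 4 extra elements, which is the same 16 extra bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <malloc.h>   /* memalign: non-standard, assumed to be available */

#define VFPU_ALIGN 16u   /* lv.q / sv.q need 16-byte aligned addresses */

/* Round a pointer up to the next 16-byte boundary (no-op if already aligned). */
void *align16_up(void *p)
{
    uintptr_t addr = (uintptr_t)p;
    addr = (addr + (VFPU_ALIGN - 1)) & ~(uintptr_t)(VFPU_ALIGN - 1);
    return (void *)addr;
}

/* Non-zero if p is 16-byte aligned. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & (VFPU_ALIGN - 1)) == 0;
}

/* Heap: allocate `count` 4-float vectors on a 16-byte boundary.
 * Returns NULL on failure. */
float *alloc_vec4_array(size_t count)
{
    return memalign(VFPU_ALIGN, count * 4 * sizeof(float));
}

/* Stack: over-allocate by 16 bytes (4 floats), then round the pointer up.
 * Since floats are 4 bytes, the rounded pointer still points at a whole
 * element and the 4 floats we use always fit inside the buffer. */
float sum_vec4_on_stack(void)
{
    float raw[4 + 4];              /* 4 floats needed + 16 extra bytes */
    float *v = align16_up(raw);    /* aligned address inside raw */

    assert(is_aligned16(v));
    v[0] = 1.0f; v[1] = 2.0f; v[2] = 3.0f; v[3] = 4.0f;
    /* On the PSP this is where v could be handed to lv.q safely. */
    return v[0] + v[1] + v[2] + v[3];
}
```

An `assert(is_aligned16(ptr))` right before any lv.q/sv.q in a debug build is a cheap way to catch a pointer that came from a plain malloc or from a library callback with a misaligned stack.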
As I was facing problems with the FAT PSPs (which I don't have), I wrote a test app which reveals which registers get corrupted by ulv.q. I made this test because I wasn't able to find more info about the bug. On ps2dev some folks just said that one or two registers were affected, which is not true. I'll expand the program to test other instructions, just to be safe about the VFPU, which is capricious and buggy (well done Sony!).
http://mrmrice.fx-world.org/vfpu.html The VFPU guide. Very graphical.
http://wiki.fx-world.org/doku.php?id=general:cycles Very important description of every instruction and the cycles it takes (critical! keep in mind those numbers).
http://forums.ps2dev.org/viewtopic.php?t=6929 The original post by the people who added support for VFPU assembler. In the past VFPU assembler was done with macros and not all the opcodes were exposed. Thanks to all!