How to give a performance measure
It's better to give a timing in milliseconds together with the platform. A frame-rate won't tell much about performance, as it measures the whole code/engine rather than one function or one module, and because frame-rate improvements are non-linear with respect to the milliseconds saved.
Also, giving an algorithm complexity in terms of O(N) alone is meaningless, as it forgets the constant in front: depending on N, sometimes k1·O(N*N) will be faster than k2·O(N). I've also seen code splitting data over a tree to reduce the amount of work... good idea. Except each single node was allocated individually... one cache miss per tree node, slow performance in the end.
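As a concrete illustration of "a timing in ms plus the platform", here is a minimal sketch using std::chrono; the measured function and the reported platform string are placeholders, and on the consoles themselves one would use the platform's high-resolution timer instead:
#include <chrono>
#include <cstdio>

// Hypothetical piece of work being measured.
static void UpdateParticles()
{
    volatile float sink = 0.0f;
    for( int i = 0; i < 100000; ++i )
        sink += (float)i * 0.001f;
}

int main()
{
    using namespace std::chrono;
    const steady_clock::time_point start = steady_clock::now();
    UpdateParticles();
    const steady_clock::time_point end = steady_clock::now();
    const double ms = duration<double, std::milli>( end - start ).count();
    // Report the timing in milliseconds together with the platform it was measured on.
    printf( "UpdateParticles: %.3f ms (platform: Xenon)\n", ms );
    return 0;
}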
- When possible, don't call size() in each iteration of a loop. For example, write for( int i=0, end=container.size(); i < end; ++i ) instead of for( int i=0; i < container.size(); ++i )
- Use iterators instead of checking a counter against size().
- Careful with std::map, std::set and the likes: they store data in a binary tree whose nodes are micro-allocated, meaning lots of time is spent on memory management and lots of cache misses will occur. A linear vector is often faster than a small map (see the sketch after this list).
- Using custom allocators, pool allocators or similar tools with the STL is a very good idea; there's nothing wrong with the STL once the polluting micro-allocations are gone.
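To illustrate the last two points, a sorted std::vector of key/value pairs can often replace a small std::map: the data lives in one contiguous allocation instead of one micro-allocation per tree node. A minimal sketch, with hypothetical key/value types:
#include <vector>
#include <algorithm>
#include <utility>

// One contiguous block of key/value pairs, kept sorted by key.
typedef std::pair<int, float> Entry;
typedef std::vector<Entry>    FlatMap;

inline bool LessByKey( const Entry &a, const Entry &b ) { return a.first < b.first; }

float *Find( FlatMap &m, int key )
{
    // Binary search over contiguous memory: few cache lines touched, no pointer chasing.
    FlatMap::iterator it = std::lower_bound( m.begin(), m.end(), Entry( key, 0.0f ), LessByKey );
    return ( it != m.end() && it->first == key ) ? &it->second : 0;
}

void Insert( FlatMap &m, int key, float value )
{
    FlatMap::iterator it = std::lower_bound( m.begin(), m.end(), Entry( key, 0.0f ), LessByKey );
    if( it != m.end() && it->first == key )
        it->second = value;
    else
        m.insert( it, Entry( key, value ) );   // single contiguous buffer, no per-node allocation
}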
This document targets the PPC found in the Xenon and Cell CPUs; these are low-level optimizations, which should always come after architecture-level optimization.
The PPC processor is a very weak architecture compared to the classic x86 PC CPU, especially when misused: no branch prediction, weak cache prediction (so developers must handle prefetching themselves), slow branches and loops, weak 32-bit integer and float mul and div - 5 instructions...
The compilers won't help either: they're unable to resolve memory aliasing on their own, and this has a bigger impact on performance than on other platforms.
See the remarks by Jon Olick, Charles Bloom and others here
When accessing a memory variable, the compiler is unable to decide whether, in the logic of your program, another pointer or another thread might alter that variable, or whether it is safe to cache it. So lots of optimizations are disabled every time memory is accessed; this is true even for structure or class members.
The best solution is to manually cache such variables in local variables on the stack; the compiler can then fully optimize them using registers. (On PPC those are 64-bit general-purpose registers, which force compilers to generate quite intricate code to clear the upper part every time a 32-bit memory address is handled...)
- Example - this will run slowly, as the program keeps re-reading m_NbrLoops from memory:
for( int i=0; i < m_NbrLoops; ++i ) { stuff on m_Array[i] ... }
- Example - this will run fast:
Foo *ptr = m_Array;
const int nbrLoops = m_NbrLoops;
for( int i=0; i < nbrLoops; ++i, ++ptr ) { stuff on *ptr ... }
Another solution is the __restrict keyword, which tells the compiler you're sure nothing else aliases the pointer, so it can enable all its optimizations on it. But it's dangerous:
- Works only on pointers, not on references:
void Foo( Stuff * __restrict myPtr ) { ... }
- Very bug-prone if mis-used: if two __restrict-ed pointers actually access the same memory, alteration through one might not be seen through the other, due to optimization (values may be cached in registers for example).
- Tedious to apply: it must be set on most function parameters:
void Foo( Stuff0 * __restrict myPtr0, Stuff1 * __restrict myPtr1 ) { ... }
Very slow, avoid when possible!
Virtual methods have a huge cost on PPC. They're a double indirection (object->VMT->code), and it's even worse on Cell.
They're fine in high-level/inexpensive code, but they should be forbidden in expensive loops (object rendering, particles...).
It's best to batch several virtual calls into a single one, to use templated code or to duplicate the calling code.
Bad:
void ParticleEmitter::Update()
{
    for( int i=0; i<nbrParticles; ++i )
    {
        aParticle[i].VirtualUpdateMethod();
    }
}
Good, even if not pretty:
void ParticleEmitter::Update()
{
    switch( m_ParticleType )
    {
    case Type0:
        {
            ParticleType0 *particle = (ParticleType0*) m_ParticleBuffer;
            for( int i=0; i<nbrParticles; ++i )
            {
                particle[i].UpdateMethod();
            }
            break;
        }
    case Type1:
        {
            ParticleType1 *particle = (ParticleType1*) m_ParticleBuffer;
            for( int i=0; i<nbrParticles; ++i )
            {
                particle[i].UpdateMethod();
            }
            break;
        }
    default:
        assert( false && "Invalid particle type" );
    }
}
Good:
void ParticleEmitterType0::Update() // high-level virtual call
{
    ParticleType0 *particle = (ParticleType0*) m_ParticleBuffer;
    for( int i=0; i<nbrParticles; ++i )
    {
        particle[i].NotVirtualUpdateMethod();
    }
}
void ParticleEmitterType1::Update() // high-level virtual call
{
    ParticleType1 *particle = (ParticleType1*) m_ParticleBuffer;
    for( int i=0; i<nbrParticles; ++i )
    {
        particle[i].NotVirtualUpdateMethod();
    }
}
On a side note: the previous example is not advising using a big update method on each particle, nor having a big particle structure. See the point about structures of arrays further down.
Why virtuals can get very slow, detailed explanation
I'll just link a very good explanation by "RYG" on CBloom's blog (2nd comment):
A very important element missing from your coarse cost model is cache footprint, both I$ and D$. Let's go with the Xenon/PS3 again for a minute. There's 32k of I$ and 32k of D$, with 128-byte cache lines - i.e. 256 cache lines each (and suddenly it feels quite a bit smaller). It's not unrealistic to assume that no two different vtables end up sharing the same cache line (that's virtually guaranteed if the two types are implemented in different source files). So in a loop that iterates over objects of different classes, each class costs you one cache line (or more, depending on what you do). Even with a low number of distinct classes and careful coding style, you can quickly lose 15-16 cache lines to this: about 6% of your L1 cache, to buffer less than a cache line's worth of data you're actually interested in right now! It's not a big hit, but a little bit of friction everywhere.
I$ is worse. 1k of code = 256 instructions = not very much. How many unique instructions do you need per object in a typical update loop? It's very easy to run out of I$. Not all-out thrashing, but a bit of extra work on every call.
These two things have a notable effect on performance in typical virtual-heavy code. And it's often trivial to fix, without removing the virtuals or any big redesign efforts: if possible, just keep your entity list sorted by type. Easy fix as long as there's no particular order that things have to happen in.
The big shift in perspective is not in changing the mechanism of dynamic dispatch, but the granularity. Instead of dispatching per object, you group everything into coherent batches and do one dispatch per batch (that's an incremental change from the "sort by type" version). In fact, you can do it all incrementally:
- Sort by type (reduce D$, I$ misses)
- Group into batches, dispatch per batch (reduce function call overhead, both virtual and non-virtual)
- Allocate as batches, not individual objects (D$ again)
- Turn inheritance into composition and segregate storage. Adds extra ptrs and indirections, but transforms code from "one loop that does everything" into "10 loops that each do 1/10th of the work". D$, I$ benefits. Code that only cares about base class fields never needs to touch the cache lines holding the rest.
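A minimal sketch of the "sort by type, then dispatch per batch" idea described above (the Entity/m_TypeId names and the UpdateBatch method are hypothetical, not taken from the quoted code):
#include <vector>
#include <algorithm>
#include <cstddef>

struct Entity
{
    int m_TypeId;                               // used as the sort key
    virtual ~Entity() {}
    virtual void Update() {}
    // One virtual dispatch handles a whole run of same-type entities;
    // derived classes override this with a non-virtual inner loop over their concrete type.
    virtual void UpdateBatch( Entity **batch, size_t count )
    {
        for( size_t i = 0; i < count; ++i )
            batch[i]->Update();
    }
};

inline bool LessByType( const Entity *a, const Entity *b ) { return a->m_TypeId < b->m_TypeId; }

void UpdateAll( std::vector<Entity*> &entities )
{
    // 1) Sort by type: consecutive entities share vtable and update code (fewer I$/D$ misses).
    std::sort( entities.begin(), entities.end(), LessByType );

    // 2) Dispatch per batch: one virtual call per run of identical types.
    size_t i = 0;
    while( i < entities.size() )
    {
        size_t end = i + 1;
        while( end < entities.size() && entities[end]->m_TypeId == entities[i]->m_TypeId )
            ++end;
        entities[i]->UpdateBatch( &entities[i], end - i );
        i = end;
    }
}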
Cache, structure of arrays vs array of structures
An L2 cache miss costs several hundred cycles of waiting for memory (TODO: check whether the actual value is under NDA or not). It's by far the number one source of performance problems on modern CPUs.
Let's take the example of a modern particle emitter, with lots of options for artists to play with (which means several features' parameters stored for each particle).
One particle is often larger than one cache line → one cache miss per touched particle → updating and displaying particles hurts the cache a lot, causing a huge slowdown.
Modern CPUs can load several cache lines at the same time: having several cache misses resolved simultaneously from time to time is cheaper than having a single cache miss more often.
Using arrays of elementary types reduces cache misses:
- elements of several particles get loaded into the cache at the same time.
- we can process several elements in the same cache line, so we have more time to prefetch the following cache lines, and may thus get rid of all L2 cache misses.
For example, if you have a loop over all your particles doing a little computation on their position, an array of structs will give you a cache miss every one or two particles, killing your performance.
In such a case, particle positions should be stored in a separate array, each item being a vector of 12 bytes (3D vector) or 16 bytes (SIMD 4D vector). Up to ten positions then fit in each cache line, cutting the number of cache misses by a ratio of roughly 8:1; for a loop that is bound by those misses, the speedup can be nearly proportional.
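A minimal sketch of the array-of-structures vs structure-of-arrays layouts discussed above (the field and type names are illustrative, not from a specific engine):
// Array of structures: one big Particle per element; a position-only loop
// still drags every other field through the cache.
struct ParticleAoS
{
    float pos[3];
    float vel[3];
    float color[4];
    float size, rotation, age, lifetime;
    // ... more per-particle parameters: easily more than one 128-byte cache line ...
};

// Structure of arrays: each field family lives in its own contiguous array,
// so a loop touching only positions loads only position cache lines.
struct Vec3 { float x, y, z; };     // 12 bytes: about 10 fit in a 128-byte cache line

struct ParticleSystemSoA
{
    Vec3  *m_Positions;
    Vec3  *m_Velocities;
    float *m_Ages;
    int    m_NbrParticles;
};

void IntegratePositions( ParticleSystemSoA &ps, float dt )
{
    Vec3       *pos = ps.m_Positions;
    const Vec3 *vel = ps.m_Velocities;
    const int   nbr = ps.m_NbrParticles;
    for( int i = 0; i < nbr; ++i )
    {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}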
What's true here for the cache is also true for DMA-ing data to/from the SPUs on PS3: managing the L2 cache and managing SPU local memory are essentially the same problem.
A LHS (Load-Hit-Store) is a penalty of several tens of cycles that occurs whenever one reads a value that was just written to memory and pushed out of the L1 cache, so an L2-to-L1 load penalty has to be paid.
It occurs much more easily than expected, as the integer, float and SIMD execution blocks of the CPU cannot communicate directly: any type conversion between int/float/SIMD forces the CPU to wait until the data reaches the L2 cache. Also, code compiled with the optimize-for-size flag often doesn't cache common values in registers or local variables but keeps updating and reloading them from memory, which can easily generate lots of LHS.
When a loop creates lots of LHS, either rewrite it or split it in two, with a small buffer on the stack in between; for example, a first loop does some SIMD computation and stores the results into an array on the stack, then a second loop reads that array as floats or ints and uses them.
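A minimal sketch of that two-pass split, shown here with SSE intrinsics purely for illustration (the original target is VMX128, whose intrinsics follow the same pattern):
#include <xmmintrin.h>   // SSE used only to illustrate the pattern

void ScaleAndQuantize( const float *src, unsigned char *dst, int count, float scale )
{
    // Small stack buffer between the two passes: SIMD results are written once,
    // then read back as floats much later, instead of converting SIMD -> int
    // inside the same loop iteration (which would cause an LHS per element).
    float tmp[64];

    const __m128 vScale = _mm_set1_ps( scale );

    for( int base = 0; base < count; base += 64 )
    {
        const int n = ( count - base < 64 ) ? ( count - base ) : 64;

        // Pass 1: SIMD computation, results stored into the stack buffer.
        int i = 0;
        for( ; i + 4 <= n; i += 4 )
            _mm_storeu_ps( &tmp[i], _mm_mul_ps( _mm_loadu_ps( &src[base + i] ), vScale ) );
        for( ; i < n; ++i )
            tmp[i] = src[base + i] * scale;        // scalar tail

        // Pass 2: the integer work reads the buffer well after it was written,
        // so the loads no longer hit stores that are still in flight.
        for( int j = 0; j < n; ++j )
            dst[base + j] = (unsigned char) tmp[j];
    }
}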
Int/Float conversions
Integer to/from float conversions are very slow (LHS). Avoid them at all costs!
For example to interpolate integer colors, use fixed point computation instead.
Bad:
for( int i=0; i<nbr; ++i )
{
    // Multiple LHS here on every loop iteration
    aDest[i].r = (unsigned char) lerp( (float)aSrc0[i].r, (float)aSrc1[i].r, ratio );
    ...
}
Good:
int iRatio = int( ratio*65536.0f ); // One LHS here, outside the loop
for( int i=0; i<nbr; ++i )
{
    aDest[i].r = (unsigned char)( lerp( aSrc0[i].r, aSrc1[i].r, iRatio ) >> 16 );
    ...
}
SIMD instructions (SSE, VMX128 and the likes) are an easy way to get 3x or 4x speedups. Most of the maths and most low-level loops found in a 3D engine can be implemented using SIMD.
VMX128 instructions.
When using SIMD on PPC, access individual elements of a vector4 with float computation as little as possible: that tends to generate lots of LHS.
For example, instead of scaling several vectors by a float scalar in a loop, create before the loop a vector4 containing the float value in all components, then in the loop use a vector4 * vector4 multiplication.
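A minimal sketch of that splat-then-multiply pattern, written with SSE intrinsics for illustration (the original targets VMX128, which has equivalent splat and multiply instructions):
#include <xmmintrin.h>

// Scale an array of SIMD 4D vectors by a float scalar without touching
// individual vector elements inside the loop.
void ScaleVectors( __m128 *vectors, int count, float scale )
{
    // Splat the scalar into all four lanes once, before the loop: this is the only
    // float -> SIMD transfer in the whole function.
    const __m128 vScale = _mm_set1_ps( scale );

    for( int i = 0; i < count; ++i )
        vectors[i] = _mm_mul_ps( vectors[i], vScale );   // pure vector4 x vector4 work
}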
Often loops are not so simple to vectorize: they mix different kinds of numbers (ints, floats), or each iteration accesses non-contiguous memory locations... Still, most of the time one can split the loop in two: one part vectorized, computing all indices/offsets/weights/..., the other doing the non-SIMD work.
Simpler example: scaling a vector costs 4 float multiplications when done as float[4]*scale (plus some LHS if your vector is SIMD), but a single operation when done as vector*scalingVector.
Make sure the ring buffers are neither too large nor too small for your needs, using SetRingBufferParameters: a buffer that is too large can make the GPU starve for commands if you have GPU-CPU sync points.