1 Create mask -1 or 0 or +1 from 2 integers difference


2.0.1 Debugging under Windows

3 How to debug danglings pointers (destructed pointers pointing to freed memory)

4 Convert a short int to a float between -1 et 1 using only int computations no to do Load-Hit-Store

5 Don't do int/float conversions

5.1 Bit operations on floats shall be forbidden

6 How to use a STL custom allocator

7 Spherical Harmonics

8 Reflexion in C++

9 Multiplying two 32 bits integers into a 64 bits integer

10 Radix Sorting

11 My favorite IDE: Visual Studio + Vim + ...

12 Nant

13 DS

14 About coding habits

14.1 General:

14.2 Readable code

15 Quick inversion of a 4x3 matrix

16 Vector and matrix classes, math operators

17 GCC and the PreCompiled Headers (PCH)

18 How to cleanly use dependencies with GCC/Makefile

19 How to correctly indent using tabs

20 For those who love bits tricks

Some coding tricks

Here are a few tricks, links or advices found here and there on the web.

Permanent link Create mask -1 or 0 or +1 from 2 integers difference   

From Sean Barrett
return (a<b) ? -1 : (a>b);

Permanent link STL   

Bad as it keeps on reallocating the vector:
for ( int i = 0; i<Nbr; ++i )
    vec.push_bask( stuff );
for ( int i = 0; i<Nbr; ++i )
    vec.push_bask( stuff );

Bad as it creates new useless nodes in the map's tree:
if( hashmap[idx].second != NULL )
std::map::const_iterator it = hashmap.find(idx);
if( it != hashmap.end() )

Debugging under Windows

Content of a vector : m_vector._Myfirst

Permanent link How to debug danglings pointers (destructed pointers pointing to freed memory)   

Two strategies:

Permanent link Convert a short int to a float between -1 et 1 using only int computations no to do Load-Hit-Store   

static inline void DecompressS16ToFloat( short value, float &f ) // = float(value) / 32768.0f with exact precision
	enum { NbrFractionBits=23, LastFractionBitS16=14, ExponentZero = 127 };
int absValue = abs(value); int log2; { int v = absValue; int shift=0; log2 = (v > 0xFFFF) << 4; v >>= log2; shift = (v > 0xFF ) << 3; v >>= shift; log2 |= shift; shift = (v > 0xF ) << 2; v >>= shift; log2 |= shift; shift = (v > 0x3 ) << 1; v >>= shift; log2 |= shift; log2 |= (v >> 1); }
int maskHigherBit = 1 << log2; int maskFraction = maskHigherBit - 1; int fraction = absValue & maskFraction;
long i = fraction << (NbrFractionBits - log2 - LastFractionBitS16 - 18); // Mantisse i |= ( (ExponentZero - 15 + log2) << NbrFractionBits) & ((0==value) - 1) ; // Exponent
i |= (value & (1<<15)) << 16; // Sign bit
(long&) f = i; }
Precision :
Integer float(i)/32768.0f DecompressS16ToFloat Error
0 0.000000 0.000000 0.000000
32767 0.999969 0.999969 0.000000
16383 0.499969 0.499969 0.000000
8191 0.249969 0.249969 0.000000
-8191 -0.249969 -0.249969 0.000000
-16383 -0.499969 -0.499969 0.000000
-32768 -1.000000 -1.000000 0.000000
25735 aka π/4 0.785370 0.785370 0.000000
static inline void DecompressU8ToFloat( unsigned char value, float &f ) // = float(value) / 256.0f with exact precision
    enum { NbrFractionBits=23, ExponentZero = 127 };
int absValue = value; int log2; { int v = absValue; int shift=0; log2 = (v > 0xF ) << 2; v >>= log2; shift = (v > 0x3 ) << 1; v >>= shift; log2 |= shift; log2 |= (v >> 1); }
int maskHigherBit = 1 << log2; int maskFraction = maskHigherBit - 1; int fraction = absValue & maskFraction;
long i = fraction << (NbrFractionBits - log2); // Mantisse i |= ( (ExponentZero - 8 + log2) << NbrFractionBits); // Exponent i &= ((0==value) - 1);
(long&) f = i; }
Precision :
Integer float(i)/32768.0f DecompressU8ToFloat Error
0 0.000000 0.000000 0.000000
63 0.246094 0.246094 0.000000
64 0.250000 0.250000 0.000000
127 0.496094 0.496094 0.000000
128 0.500000 0.500000 0.000000
200 0.781250 0.781250 0.000000
255 0.996094 0.996094 0.000000

Permanent link Don't do int/float conversions   

On modern CPUs, such conversions are extremely slow. The reason being that the int computation and the FPU are completely separated CPU blocs that cannot communicate directly. Exchange must go through the memory (or at least the memory cache).

On Power processors, this cause a cache flush before reading directly in memory: a Load-Hit-Store ; that's the second highest performance eater in local optimisation after the L2 cache miss.

Seen in the engine we use at work (simplified example):
for( uint i=0; i<GetNbrVertexes(); ++i )
    u8 &color = aCppArray[i];
    float colorF = (float)color;
    colorF *= ratio;
    color = (u8) colorf;

I rewrote the previous code in the following way, making it 20 times faster:
const uint nbrVertexes = GetNbrVertexes();
const u32 ratioInt = u32(ratio*65536.0f);
u8 *aColors = &*aCppArray.Begin();
for( uint i=0; i<nbrVertexes; ++i ) { u8 &color = aColors[i]; u32 clr = color; clr = (clr*ratioInt) >> 16; color = (u8)clr; }

Bit operations on floats shall be forbidden

The following code samples are forbidden in modern C++:
uint floatInt = (uint) &floatValue;
union { float f; uint i; } conv; conv.f = floatValue; return conv.i
The returned value is completely random with some compilers, and this will be more and more the case. The int computation bloc and the FPU each have their own view on the memory through their cache. One may think his cache is up to date when the other one will have updated the value.

This phenomen is named memory aliasing.

Permanent link How to use a STL custom allocator   

	namespace pool_alloc {
		inline void destruct(char*) {}
		inline void destruct(wchar_t*) {}
		template <typename T>
		inline void destruct(T* t) { (void)t; t->~T(); }
template<typename T, typename MemMgr> class Yac3DeStlCustomAllocator { static const unsigned long MaxAllocation = 102410241024;
public: typedef size_t size_type; typedef ptrdiff_t difference_type; typedef T* pointer; typedef const T* const_pointer; typedef T& reference; typedef const T& const_reference; typedef T value_type; template <class U> struct rebind { typedef Yac3DeStlCustomAllocator other; };
Yac3DeStlCustomAllocator() throw() {} Yac3DeStlCustomAllocator(const Yac3DeStlCustomAllocator&) throw() {} template <class U> Yac3DeStlCustomAllocator(const Yac3DeStlCustomAllocator&) throw() {} ~Yac3DeStlCustomAllocator() throw() {}
pointer address(reference x) const { return &x; } const_pointer address(const_reference x) const { return &x; }
pointer allocate(size_type size, const_pointer = 0) { if( size == 0 ) return NULL; else { //T *p = (T*) malloc( size*sizeof(T) ); T *p = (T*) MemMgr::Alloc( size*sizeof(T) ); return p; } } void deallocate(pointer p, size_type /n/) { //free(p); MemMgr::Free(p); } size_type max_size() const throw() { return MaxAllocation; }
void construct(pointer p, const T& val) { new(static_cast(p)) T(val); }
void destroy(pointer p) { pool_alloc::destruct(p); }
bool operator ==(const Yac3DeStlCustomAllocator<T,MemMgr> & /other/) { return true; } };
Use example:
	template<typename T, typename MemMgr>
	class Vector : public std::vector<T, Yac3DeStlCustomAllocator<T,MemMgr>>
		Vector() {}
		~Vector() {}
template<typename T, typename MemMgr> class List : public std::list<T, Yac3DeStlCustomAllocator<T,MemMgr>> { public: List() {} ~List() {} };

Permanent link Spherical Harmonics   

Light to SH:
fConst1 = 4/17
fConst2 = 8/17
fConst3 = 15/17
fConst4 = 5/68
fConst5 = 15/68
void AddLightToSH ( Colour *sharm, Colour &colour, Vector &dirn ) { sharm[0] += colour * fConst1; sharm[1] += colour * fConst2 * dirn.x; sharm[2] += colour * fConst2 * dirn.y; sharm[3] += colour * fConst2 * dirn.z; sharm[4] += colour * fConst3 * (dirn.x * dirn.z); sharm[5] += colour * fConst3 * (dirn.z * dirn.y); sharm[6] += colour * fConst3 * (dirn.y * dirn.x); sharm[7] += colour * fConst4 * (3.0f * dirn.z * dirn.z - 1.0f); sharm[8] += colour * fConst5 * (dirn.x * dirn.x - dirn.y * dirn.y); }
Sum the sharm array for all lights involved for your object or sample.
Then for each pixel:
Colour LightNormal ( Vector &dirn )
    Colour colour = sharm[0];
    colour += sharm[1] * dirn.x;
    colour += sharm[2] * dirn.y;
    colour += sharm[3] * dirn.z;
    colour += sharm[4] * (dirn.x * dirn.z);
    colour += sharm[5] * (dirn.z * dirn.y);
    colour += sharm[6] * (dirn.y * dirn.x);
    colour += sharm[7] * (3.0f * dirn.z * dirn.z - 1.0f);
    colour += sharm[8] * (dirn.x * dirn.x - dirn.y * dirn.y);
    return colour;
Source : Spherical Harmonics GDCE PPT97 - Tom Forsyth and its addendum.

Permanent link Reflexion in C++   

Jon Watte explains how to implement C++ reflection quite simply.
His source code.

Permanent link Multiplying two 32 bits integers into a 64 bits integer   

From Charles Bloom's rants: VC++ seems quite stupid at optimizing a U32 * U32 multiplication... unless one uses the Int32x32To64 macro which does exactly that the same way! Sounds really weird.
#define Int32x32To64(a, b) ((LONGLONG)((LONG)(a)) * (LONGLONG)((LONG)(b)))

Also he found out in OpenSSL sources that the Windows stdlib has very quick functions for byte rotations in integers, _lrotl and _lrotr.

Permanent link Radix Sorting   

A very quick sorting algorithm, in 0(kN); meaning it's faster than conventional algorithms in O(NlogN) like qsort, bubble-sort...: the Radix Sort, a quite simple idea ; here adapted by Pierre Terdiman and Chris Hecker to handle any kind of data, even floats and negative floats.

Permanent link My favorite IDE: Visual Studio + Vim + ...   

January 2008, the 24th
My favorite work environment is made up of Visual C++ 2005 or more recent for its debugger, Visual Assist for its colorization and a few refactoring tools, and the ViEmu add on to be able to use inside VC++ my favorite editor, Vim.

Permanent link Nant   

January 2008, the 24th
Having worked with Nant, I really start disliking this build system.
First it's incredibly slow, mostly for incremental builds. It cannot be compared to a Makefile system. Rewriting in a C# dll some dependency checking between compiled data and source animations in the game I'm working on reduced the character compilation by 5 and the total data compilation by half.
Secondly it's set up through several incredibly verbose XML file. And thirdly it doesn't support multithreading/parallel building, while it comes out of the box with Make.

Permanent link DS   

January 2008, the 24th
When programming on DS, at first I thought this hardware was really crappy.

But after a while I actually started liking it: the amount of features is limited, but they're well implemented and easy to use. The whole thing is actually quite well balanced, as ie: the low screen resolution compensate the lack of 3d gpu power.

A big advantage over the PSP: the use of flash/rom memory makes loading very quick. Say bye-bye to loading screen, the DS allows streaming of level parts while the player is moving, compensating the very small number of drawable triangles in a frame.

There are annoying limitation, like the 1 bit stencil buffer. That prevents casting shadows of concave objects like character, well at least up to now (and I mean up to now).

Also one thing I dislike on the PSP is the lack of hardware clipping of polygons: polygons going too far outside of the view are just thrown away, making big holes in the rendering. Either one split polygons, hoping they'll be small enough to fit in the limits of the frustum (which is a bit wider than the camera view), or one implements a software clipping on the CPU, which will slow down the rendering. :( Also the PSP LCD screen has an incredibly high persistance compared to the DS screens.

Permanent link About coding habits   


Readable code

Permanent link Quick inversion of a 4x3 matrix   

Let's M = T * R, a 4x3 matrix.
Usually one would use a heavy method like Gauss to invert it, but watch this:
M-1=(T*R) -1=R -1*T -1=Rt*T -1

So you transpose the rotation part, you create T-1 which is a translation matrix which pos vector is the opposite of the original one, you multiply the 2 together and you're done! Pseudocode:
r.rot = transpose(M.rot);
r.pos = Vector(0,0,0);
t.pos = -M.pos;
Minv = r*t;

Permanent link Vector and matrix classes, math operators   

Very often you'll read that writing a class dumbly with a Vector operator +(const Vector&, const Vector& ) will create a temporary object and that it will slow down your code.

Well that is... true.
But if you define this operator as inline const Vector operator +( const Vector &a, const Vector &b ) not only most compilers (tested on PC and on PSP) will not create a temporary object, but they will also create faster assembly code (10% faster than using float arrays or temp objects and sage operators like += in my tests). Notice than the return type has to be consted. Otherwise it could be altered in another operator so the compiler won't optimize it.
Some beginner people return a reference to prevent the copy, but that's using an object after it has been destroyed, which is plain wrong and which will not always work (in a perfect world it would always crash).

Permanent link GCC and the PreCompiled Headers (PCH)   

And that's it, compilation will get at least twice faster!

Permanent link How to cleanly use dependencies with GCC/Makefile   

I've seen too many projects having 2 compilation passes: make dep then make. If one forget to relaunch make dep from time to time, dependencies can get plain wrong. The good solution: add -MD in your CFLAGS or CXXFLAGS. Recompiling a file will update its dependency file. Then add -include *.d in the end of your Makefile.
The make dep is now required only when deleting a header.

Permanent link How to correctly indent using tabs   

Indent using both tabs and spaces.

Permanent link For those who love bits tricks   

Source: Steve's 'Cute Code' collection and Sean Eron's Bit Twiddling Hacks
Main page Back

email : Sly