22 Performance

22.010 What do I need to know about performance?

First, read chapters 11 through 14 of the book OpenGL on Silicon Graphics Systems. Although some of the information is SGI machine specific, most of the information applies to OpenGL programming on any platform. It's invaluable reading for the performance-minded OpenGL programmer.

Consider a performance tuning analogy: A database application spends 5 percent of its time looking up records and 95 percent of its time transmitting data over a network. The database developer decides to tune the performance. He sits down and looks at the code for looking up records and sees that with a few simple changes he can reduce the time it’ll take to look up records by more than 50 percent. He makes the changes, compiles the database, and runs it. To his dismay, there's little or no noticeable performance increase!

What happened? The developer didn't identify the bottleneck before he began tuning. The most important thing you can do when attempting to boost your OpenGL program’s performance is to identify where the bottleneck is.

Graphics applications can be bound in several places. Generally speaking, bottlenecks fall into three broad categories: CPU limited, geometry limited, and fill limited.

CPU limited is a general term. Specifically, it means performance is limited by the speed of the CPU. Your application may also be bus limited, in which the bus bandwidth prevents better performance. Cache size and amount of RAM can also play a role in performance. For a true CPU-limited application, performance will increase with a faster CPU. Another way to increase performance is to reduce your application’s demand on CPU resources.

A geometry limited application is bound by how fast the computer or graphics hardware can perform vertex computations, such as transformation, clipping, lighting, culling, vertex fog, and other OpenGL operations performed on a per vertex basis. For many very low-end graphics devices, this processing is performed in the CPU. In this case, the line between CPU limited and geometry limited becomes fuzzy. In general, CPU limited implies that the bottleneck is CPU processing unrelated to graphics.

In a fill-limited application, the rate you can render is limited by how fast your graphics hardware can fill pixels. To go faster, you'll need to find a way to either fill fewer pixels, or simplify how pixels are filled, so they can be filled at a faster rate.

It’s usually quite simple to discern whether your application is fill limited. Shrink the window size, and see if rendering speeds up. If it does, you're fill limited.

If you're not fill limited, then you're either CPU limited or geometry limited. One way to test for a CPU limitation is to change your code, so it repeatedly renders a static, precalculated scene. If the performance is significantly faster, you're dealing with a CPU limitation. The part of your code that calculates the scene or does other application-specific processing is causing your performance hit. You need to focus on tuning this part of your code.

If it's not fill limited and not CPU limited, congratulations! It's geometry limited. The per vertex features you’ve enabled or the shear volume of vertices you're rendering is causing your performance hit. You need to reduce the geometry processing either by reducing the number of vertices or reducing the calculations OpenGL must use to process each vertex.

22.020 How can I measure my application's performance?

To measure an application's performance, note the system time, do some rendering, then note the system time again. The difference between the two system times tells you how long the application took to render. Benchmarking graphics is no different from benchmarking any other operations in a computer system.

Many graphics programmers often want to measure frames per second (FPS). A simple method is to note the system time, render a frame, and note the system time again. FPS is then calculated as (1.0 / elapsed_time). You can obtain a more accurate measurement by timing multiple frames. For example if you render 10 frames, FPS would be (10.0 / elapsed_time).

To obtain primitives or triangles per second, add a counter to your code for incrementing as each primitive is submitted for rendering. This counter needs to be reset to zero when the system time is initially obtained. If you already have a complex application that is nearly complete, adding this benchmarking feature as an afterthought might be difficult. When you intend to measure primitives per second, it's best to design your application with benchmarking in mind.

Calculating pixels per second is a little tougher. The easiest way to calculate pixels per second is to write a small benchmark program that renders primitives of a known pixel size.

GLUT 3.7 comes with a benchmark called progs/bucciarelli/gltest that measures OpenGL rendering performance and is free to download. You can also visit the Standard Performance Evaluation Corporation, which has many benchmarks you can download free, as well as the latest performance results from several OpenGL hardware vendors.

22.030 Which primitive type is the fastest?

GL_TRIANGLE_STRIP is generally recognized as the most optimal OpenGL primitive type. Be aware that the primitive type might not make a difference unless you're geometry limited.

22.040 What's the cost of redundant calls?

While some OpenGL implementations make redundant calls as cheap as possible, making redundant calls generally is considered bad practice. Certainly you shouldn't count on redundant calls as being cheap. Good application developers avoid them when possible.

22.050 I have (n) lights on, and when I turned on (n+1), suddenly performance dramatically drops. What happened?

Your graphics device supports (n) lights in hardware, but because you turned on more lights than what's supported, you were kicked off the hardware and are now rendering in the software. The only solution to this problem, except to use less lights, is to buy better hardware.

22.060 I'm using (n) different texture maps and when I started using (n+1) instead, performance drastically drops. What happened?

Your graphics device has a limited amount of dedicated texture map memory. Your (n) textures fit well in the texture memory, but there wasn't room left for any more texture maps. When you started using (n+1) textures, suddenly the device couldn't store all the textures it needed for a frame, and it had to swap them in from the computer’s system memory. The additional bus bandwidth required to download these textures in each frame killed your performance.

You might consider using smaller texture maps at the expense of image quality.

22.070 Why are glDrawPixels() and glReadPixels() so slow?

While performance of the OpenGL 2D path (as its called) is acceptable on many higher-end UNIX workstation-class devices, some implementations (especially low-end inexpensive consumer-level graphics cards) never have had good 2D path performance. One can only expect that corners were cut on these devices or in the device driver to bring their cost down and decrease their time to market. When this was written (early 2000), if you purchase a graphics device for under $500, chances are the OpenGL 2D path performance will be unacceptably slow.

If your graphics system should have decent performance but doesn’t, there are some steps you can take to boost the performance.

First, all glPixelTransfer() state should be set to their default values. Also, glPixelStore() should be set to its default value, with the exception of GL_PACK_ALIGNMENT and GL_UNPACK_ALIGNMENT (whichever is relevant), which should be set to 8. Your data pointer will need to be correspondingly double- word aligned.

Second, examine the parameters to glDrawPixels() or glReadPixels(). Do they correspond to the framebuffer layout? Think about how the framebuffer is configured for your application. For example, if you know you're rendering into a 24-bit framebuffer with eight bits of destination alpha, your type parameter should be GL_RGBA, and your format parameter should be GL_UNSIGNED_BYTE. If your type and format parameters don't correspond to the framebuffer configuration, it's likely you'll suffer a performance hit due to the per pixel processing that's required to translate your data between your parameter specification and the framebuffer format.

Finally, make sure you don't have unrealistic expectations. Know your system bus and memory bandwidth limitations.

22.080 Is it faster to use absolute coordinates or to use relative coordinates?

By using absolute (or “world”) coordinates, your application doesn't have to change the ModelView matrix as often. By using relative (or “object”) coordinates, you can cut down on data storage of redundant primitives or geometry.

A good analogy is an architectural software package that models a hotel. The hotel model has hundreds of thousands of rooms, most of which are identical. Certain features are identical in each room, and maybe each room has the same lamp or the same light switch or doorknob. The application might choose to keep only one doorknob model and change the ModelView matrix as needed to render the doorknob for each hotel room door. The advantage of this method is that data storage is minimized. The disadvantage is that several calls are made to change the ModelView matrix, which can reduce performance. Alternatively, the application could instead choose to keep hundreds of copies of the doorknob in memory, each with its own set of absolute coordinates. These doorknobs all could be rendered with no change to the ModelView matrix. The advantage is the possibility of increased performance due to less matrix changes. The disadvantage is additional memory overhead. If memory overhead gets out of hand, paging can become an issue, which certainly will be a performance hit.

There is no clear answer to this question. It's model- and application-specific. You'll need to benchmark to determine which method is best for your model or application.

22.090 Are display lists or vertex arrays faster?

Which is faster varies from system to system.

If your application isn't geometry limited, you might not see a performance difference at all between display lists, vertex arrays, or even immediate mode.

22.100 How do I make triangle strips out of triangles?

As mentioned in 22.030, GL_TRIANGLE_STRIP is generally recognized as the most optimal primitive. If your geometry consists of several separate triangles that share vertices and edges, you might want to convert your data to triangle strips to improve performance.

To create triangle strips from separate triangles, you need to implement an algorithm to find and join adjacent triangles.

Code for doing this is available free on the Web. The Stripe package is one solution.