August 2020: Ported to a ZX Spectrum 48K (Hackaday-ed).
August 2017: Ported to an ATmega328p with an OLED screen (Hackaday-ed).
April 2012: Cleaned up triangle setup code via X-Macros.
August 2011: Added Intel's Morphological Anti-aliasing (MLAA).
January 2011: Added raytracing mode: reflections, refractions, shadows, ambient occlusion and anti-aliasing.
December 2010: CUDA "port" of raycaster mode posted to Reddit.
July 2010: JavaScript "port" of points-only mode posted to Reddit.
May 2010: Added to the Phoronix Test Suite.
April 2010: Reddit-ed!
My priorities, ever since I started this, have been simple: make the code as clear and concise as possible, while using good algorithms to improve rendering speed. In plain words, my primary concern is the clarity of the code - closely followed by the renderer's speed.
Conciseness and clarity are mostly accomplished via C++ templates that unify the incremental calculations for the rasterizers and the ray intersections for the raytracers. As for speed, we are now firmly in the age of multi-core CPUs - so software rasterizing can (finally) do per-pixel lighting and soft shadows in real-time, while raytracing can generate beautiful images in a matter of seconds.
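To give a feel for the template-based unification, here is an illustrative sketch (the names are my own, for illustration - this is not the renderer's actual code): the same incremental stepping logic serves any interpolated attribute - depth, colors, normals - as long as the type supports a few operators.

// Illustrative sketch: one templated incremental interpolator, reused for
// every per-pixel attribute instead of near-identical hand-written loops.
template <typename Attribute>
struct Interpolator {
    Attribute current, step;

    // Pre-compute the per-pixel increment between two edge values.
    Interpolator(const Attribute& from, const Attribute& to, int pixels)
        : current(from),
          step((to - from) * (1.0f / (pixels > 0 ? pixels : 1))) {}

    void advance() { current += step; }   // move one pixel to the right
};

struct Vec3 {
    float x, y, z;
    Vec3  operator-(const Vec3& o) const { return {x - o.x, y - o.y, z - o.z}; }
    Vec3  operator*(float s)       const { return {x * s, y * s, z * s}; }
    Vec3& operator+=(const Vec3& o)      { x += o.x; y += o.y; z += o.z; return *this; }
};

// The same template drives a float (e.g. 1/Z) and a Vec3 (e.g. a normal):
void scanline(float zLeft, float zRight, Vec3 nLeft, Vec3 nRight, int width) {
    Interpolator<float> z(zLeft, zRight, width);
    Interpolator<Vec3>  n(nLeft, nRight, width);
    for (int x = 0; x < width; ++x) {
        // ... shade pixel x using z.current and n.current ...
        z.advance();
        n.advance();
    }
}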
This is a (more or less) clean implementation of the basic algorithms in polygon-based 3D graphics. The code includes...
The supported 3D formats are:
Implementation-wise, the code...
This is a software-only renderer, so don't expect hardware-class (OpenGL) speeds. Then again, speed is a relative thing: the train object (available inside the source package, in the "3D-Objects" folder) was rendered (in soft-shadows mode) at a meager 6 fps on an Athlon XP, back in 2003. Around 2005, a Pentium 4 desktop at work took this up to 11 fps. As of 2007, by way of Intel's Threading Building Blocks (or OpenMP), the code uses both cores of a Core2Duo to run at 23 fps... And since it uses TBB/OpenMP, it will automatically make use of any additional cores... so give the CPUs a few more years... :‑)
Update, November 2009: On a 4-core AMD Phenom at 3.2GHz, the train now spins at 80 frames per second... Give me more cores! :‑)
Update, September 2017: On a 16-core Intel Core i9 7960X... 718 frames per second!
Update, June 2018: Phoronix shows the evolution of my renderer's speed across 28 CPUs...
Update, November 2020: The 1000 frames-per-second barrier is broken, by a Ryzen 9 5950X...
The code also runs 20-25% faster if compiled under 64-bit environments.
[Screenshots: Points, Ambient occlusion, Per-pixel Phong, Shadow maps]
Normally, shadow maps generate sharp, "pixelated" shadow edges, because of the sampling of the shadow map. To improve on this, instead of sampling only one "shadow pixel", the renderer can also use a weighted average of its neighbours, and thus provide nice-looking soft shadows in real-time:
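A rough sketch of the neighbour-averaging idea - essentially a simple percentage-closer filter. This is illustrative code, not the renderer's actual implementation:

float softShadowFactor(const float* shadowMap, int mapWidth, int mapHeight,
                       int sx, int sy, float fragmentDepth, float bias)
{
    // Average the binary "lit or not" test over the 3x3 neighbourhood of the
    // sampled shadow-map texel; values between 0 and 1 become the soft penumbra.
    float lit = 0.0f;
    int samples = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int x = sx + dx, y = sy + dy;
            if (x < 0 || y < 0 || x >= mapWidth || y >= mapHeight)
                continue;
            // 1.0 if this neighbour says "lit", 0.0 if it says "in shadow".
            lit += (fragmentDepth - bias <= shadowMap[y * mapWidth + x]) ? 1.0f : 0.0f;
            ++samples;
        }
    }
    return samples ? lit / samples : 1.0f;
}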
Fast though it is, shadow-mapping has an issue when you zoom in: the artifacts of the shadow-map sampling become annoying... In "deep" zooms, the renderer can be switched (at runtime) to raytracing mode, to create the correct shadows:
Two weeks later, I removed this mode in favour of a full raytracer - it was slower than the rasterizer modes anyway, and a full raytracer offers far better quality. It still exists in the CUDA port, if you are interested.
For the rasterizers, it linearly interpolates (per-pixel) the ambient occlusion coefficient, which must be pre-calculated per vertex and stored in the model (see below, "Creating more 3D objects on your own").
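To picture what that per-pixel interpolation means, here is an illustration only (the type and function names are mine; the renderer itself does the equivalent incrementally along scanlines): the occlusion at a pixel is simply the barycentric blend of the three per-vertex values.

// Illustrative sketch of interpolating a per-vertex ambient occlusion coefficient.
struct AOVertex {
    float x, y;        // screen-space position
    float ambientOcc;  // pre-calculated occlusion factor stored in the model
};

float interpolatedAO(const AOVertex& a, const AOVertex& b, const AOVertex& c,
                     float px, float py)
{
    // Barycentric weights of pixel (px, py) inside triangle (a, b, c).
    float denom = (b.y - c.y) * (a.x - c.x) + (c.x - b.x) * (a.y - c.y);
    float wa = ((b.y - c.y) * (px - c.x) + (c.x - b.x) * (py - c.y)) / denom;
    float wb = ((c.y - a.y) * (px - c.x) + (a.x - c.x) * (py - c.y)) / denom;
    float wc = 1.0f - wa - wb;
    return wa * a.ambientOcc + wb * b.ambientOcc + wc * c.ambientOcc;
}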
For the raytracer, by uncommenting the #define AMBIENT_OCCLUSION, you will enable a stochastic ambient occlusion calculation for each raytraced pixel: When a triangle is intersected by a primary ray, AMBIENT_SAMPLES rays will be spawned from the intersection point, and they will be used to calculate the ratio of ambient light at that point.
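A simplified sketch of that stochastic calculation (the helper names here are assumptions for illustration, not the renderer's actual interfaces):

#include <cmath>
#include <cstdlib>

struct Vec3f { float x, y, z; };

// Assumed hook into the scene: does a ray from 'origin' along 'dir' hit anything?
bool sceneIntersects(const Vec3f& origin, const Vec3f& dir);

// Rejection-sample a random unit direction in the hemisphere around the normal.
static Vec3f randomHemisphereDirection(const Vec3f& n)
{
    while (true) {
        float x = 2.0f * rand() / RAND_MAX - 1.0f;
        float y = 2.0f * rand() / RAND_MAX - 1.0f;
        float z = 2.0f * rand() / RAND_MAX - 1.0f;
        float len2 = x * x + y * y + z * z;
        if (len2 < 1e-6f || len2 > 1.0f)
            continue;                              // outside the unit sphere, retry
        if (x * n.x + y * n.y + z * n.z < 0.0f) {  // flip into the normal's hemisphere
            x = -x; y = -y; z = -z;
        }
        float inv = 1.0f / std::sqrt(len2);
        return { x * inv, y * inv, z * inv };
    }
}

// Ratio of ambient light reaching 'point': 1.0 means fully open, 0.0 fully occluded.
float ambientOcclusion(const Vec3f& point, const Vec3f& normal, int ambientSamples)
{
    int unoccluded = 0;
    for (int i = 0; i < ambientSamples; ++i)
        if (!sceneIntersects(point, randomHemisphereDirection(normal)))
            ++unoccluded;
    return float(unoccluded) / float(ambientSamples);
}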
The difference is very clear:
If your CPU uses hyper-threading and/or has *many* cores, performance may actually go down as you add threads, instead of up. You can control the number of threads used during rendering via the OMP_NUM_THREADS environment variable - and you may well have to, to avoid losing performance to memory-bandwidth saturation.
As examples from both sides of the spectrum: on an Atom 330 (2 real cores, each appearing as two "virtual" ones), the "virtual" cores help a lot - running with four threads, the raytracer is 1.3x faster than running with two. But on a dual-CPU Intel Xeon W5580 machine (a total of 8 real cores, appearing as 16 "virtual" ones), the speed increases almost linearly as threads are added, until we reach 8 - and then it nose-dives, with the 16-thread version being 63 times slower (!).
So make sure you check the runtime performance of the renderer by exercising direct control over the number of threads (via OMP_NUM_THREADS).
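For example, a quick way to see where a given machine peaks - this is illustrative benchmarking code only, not part of the renderer (with the renderer itself you would simply vary OMP_NUM_THREADS between runs):

#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
    // Time the same parallel loop with different thread counts.
    const int N = 1 << 22;
    std::vector<double> data(N, 1.0);

    for (int threads = 1; threads <= omp_get_max_threads(); threads *= 2) {
        omp_set_num_threads(threads);
        double start = omp_get_wtime();

        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; ++i)
            sum += data[i] * data[i];

        printf("%2d thread(s): %.3f ms (sum=%g)\n",
               threads, 1000.0 * (omp_get_wtime() - start), sum);
    }
    return 0;
}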
For Windows/MSVC users:
Just open the project solution (under VisualC/) and compile for Release mode. It is configured by default to use Intel TBB for multithreading, since Microsoft decided to omit OpenMP support from the free version of its compiler (the Visual C++ Express Edition). All dependencies (include files and libraries for SDL and TBB) are pre-packaged under VisualC/, so compilation is as easy as it can get.
When the binary is built, right-click on "Renderer-2.x" in the Solution explorer, and select "Properties". Click on "Configuration Properties/Debugging", and enter ..\..\3D-Objects\chessboard.tri inside the "Command Arguments" text box. Click on OK, hit Ctrl-F5, and you should be seeing the chessboard spinning. Use the controls described below to fly around the object.
The default compilation options are set for maximum optimization, using SSE2 instructions.
If you have the commercial version of the compiler (which supports OpenMP) you can switch from TBB to OpenMP:
For everybody else (Linux, BSDs, Mac OS/X, etc)
Compilation follows the well-known procedure:

bash$ ./configure
bash$ make

The source package includes a copy of the sources for lib3ds 1.3.0, and the build process will automatically build lib3ds first.
SSE, SSE2 and SSSE3 x86 SIMD optimizations will be detected by configure; if you have a non-Intel CPU, pass your own CXXFLAGS, e.g.

bash$ CXXFLAGS="-maltivec" ./configure
bash$ make

Compiling under 64-bit environments (e.g. AMD64 or Intel EM64T) further improves speed; compiled with the same options, the code runs 25% faster under my 64-bit Debian.
A note for Mac OS/X and FreeBSD developers: The default FreeBSD and Mac OS/X environments (XCode) include an old version of GCC (4.2.x). This version is known to have issues with OpenMP, so if you do use it, your only available option on multicore machines is Intel TBB (which works fine). You can, however, download the latest GCC from ports, if you use FreeBSD, or from High Performance Computing for Mac OS/X - they both offer the latest GCC series. Results are much better this way: OpenMP works fine, and support for the SSE-based -mrecip option boosts the speed by more than 30%.
bash$ cd 3D-Objects
bash$ ../src/renderer/renderer chessboard.tri
Command line parameters
Usage: renderer [OPTIONS] [FILENAME]
  -h          this help
  -r          print FPS reports to stdout (every 5 seconds)
  -b          benchmark rendering of N frames (default: 100)
  -n N        set number of benchmarking frames
  -w          use two lights
  -m <mode>   rendering mode:
      1 : point mode
      2 : points based on triangles (culling, color)
      3 : triangles, wireframe anti-aliased
      4 : triangles, ambient colors
      5 : triangles, Gouraud shading, ZBuffer
      6 : triangles, per-pixel Phong, ZBuffer
      7 : triangles, per-pixel Phong, ZBuffer, Shadowmaps
      8 : triangles, per-pixel Phong, ZBuffer, Soft shadowmaps
      9 : triangles, per-pixel Phong, ZBuffer, raycasted shadows
      0 : raytracing, with shadows, reflections and anti-aliasing
Creating more 3D objects on your own
The rasterizer output looks much better if the model carries pre-calculated, per-vertex ambient occlusion information. To do this:
Well... I've always loved coding real-time 3D graphics. Experimenting with new algorithms, trying to make things run faster, look better... And as a side effect, I became a better coder :‑)
Anyway, these sources are my "reference" implementations. At some point around 2003, I decided that it was time to clean up the code I'd been hacking on over the years and focus on code clarity - ignoring execution speed. To that end, floating point is used almost everywhere (fixed-point begone!), and since this is Phong shading, the complete lighting equation is calculated per pixel. I basically created a "clean" implementation of everything I had ever learned about polygon-related graphics. The clarity of the code also paved the way for the OpenGL and CUDA versions...
Rant 2: Tales of Multicore
This code was single threaded until late 2007. At that point, I heard about OpenMP, and decided to try it out. I was amazed at how easy it was to make the code "OpenMP-aware": I simply added a couple of pragmas in the for-loops that drew the triangles and the shadow buffers, and ...presto!
The only things I had to change were the static variables, which had to be moved to stack space: threaded code can't tolerate shared global/static data, and race conditions appeared immediately when more than one thread worked on them.
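The pattern was no more complicated than this sketch (illustrative code with my own names, not the actual renderer source): a pragma on the outer loop, and per-thread locals instead of statics.

#include <vector>

struct Pixel { unsigned char r, g, b; };

void drawRows(std::vector<Pixel>& frame, int width, int height)
{
    // One pragma parallelizes the loop over scanlines.
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        // Previously a static "scratch" variable - now a local, so each thread
        // (and each row) gets its own copy and no race condition can occur.
        float rowScratch = 0.0f;
        for (int x = 0; x < width; ++x) {
            rowScratch += 0.001f;   // stand-in for the real per-pixel shading work
            frame[y * width + x] = Pixel{ (unsigned char)(x & 255),
                                          (unsigned char)(y & 255),
                                          (unsigned char)((int)rowScratch & 255) };
        }
    }
}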
Only two compilers truly supported OpenMP at the time: Intel's compiler (version 8.1) and Microsoft's CL. GCC, unfortunately, died with an 'internal compiler error'. I reported this to the GCC forums, found out I was not the only one who had noticed, and was told (by the forum guys) to wait.
While waiting for GCC to catch up, I kept researching multicore technologies. Functional languages seem particularly well suited to SMP, and I've put them next in line on my R&D agenda (OCaml and F# in particular). Before leaving C++ behind, though, I heard about Intel Threading Building Blocks (TBB) and decided to put it to the test. TBB is a portable set of C++ templates that makes writing threading code a lot easier than legacy APIs (CreateThread, _beginthread, pthread_create, etc). TBB is also open-source, so it was easy to work with it and figure out its internals. Truth be told, it also required more changes in my code (OpenMP required almost none). Still, it is a vast improvement over conventional threading APIs.
I must also confess that I have not invested a lot of effort in using these technologies; I only enhanced two of my main rendering loops to make them SMP aware. Still, this was enough to boost the speed (on a Core2Duo) by 80%! Judging by the gain/effort ratio, this is one of the best bargains I've ever found...
As of now (October 2008), GCC 4.3.2 is up to speed and compiles OpenMP code just fine. TBB is of course running perfectly (since it is simply a C++ template library), so choose freely between the two, and easily achieve portable multithreading.
When I say portable, I mean it: these are the tests I did...
Talk about portable code!
If you're still in the... dark ages and use legacy APIs (CreateThread, _beginthread, pthread_create, etc), you are really missing out: under both OpenMP and Intel TBB, I increased the rendering frame rate of the train object by more than 40%, simply by replacing...
#pragma omp parallel for

with

#pragma omp parallel for schedule(dynamic,100)

(similar change for TBB, in the code inside Scene.cc).
Why? Because these modern threading APIs allow us to easily adapt to different loads per thread, by using dynamic thread scheduling.
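For the curious, here is roughly what the analogous TBB change could look like - an illustrative sketch under my own naming, not the actual code inside Scene.cc: giving the blocked_range an explicit grain size lets TBB's work-stealing scheduler hand out smaller chunks, so threads that finish their "cheap" rows early pick up more work.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>

void drawAllRows(int height)
{
    tbb::parallel_for(
        tbb::blocked_range<int>(0, height, 100),      // begin, end, grain size
        [](const tbb::blocked_range<int>& rows) {
            for (int y = rows.begin(); y != rows.end(); ++y) {
                // ... rasterize scanline y ...
            }
        },
        tbb::simple_partitioner());                   // honour the grain size exactly
}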