On November 8, 2006, Nvidia officially launched its first unified shader architecture and first DirectX 10-compatible GPU, the G80. The new chip debuted in two new cards, the $599 GeForce 8800 GTX and the $449 GeForce 8800 GTS. Today, the 8800 GTX’s specs seem modest, even low-end, with 128 shader cores, 32 texture mapping units, and 24 render output units (ROPs), backed by 768MB of RAM. But back in 2006, the G80 was a titan. It swept both Nvidia’s previous GeForce 7-series generation and ATI’s Radeon X19xx series completely off the table, even in games where Team Red had previously enjoyed a significant performance advantage.
But the G80 didn’t just rewrite performance headlines — it redefined what GPUs were, and what they were capable of.
For this retrospective, we spoke with two Nvidia engineers who did a great deal of work on G80: Jonah Alben, Senior VP of GPU Engineering, and John Danskin, VP of GPU Architecture. Before we dive in, however, we want to give a bit of context on what made G80 so different from what came before. Beginning with the GeForce 3 and Radeon 8500 in 2001, both ATI and Nvidia cards could execute small programs via specialized, programmable vertex and pixel shaders. Nvidia’s last desktop architecture to use this approach was the G71, released on March 9, 2006. It looked like this:
In this diagram, the vertex shaders are the eight dedicated blocks at the top, above the “Cull / Clip / Setup” section. The 24 pixel shaders are the large group of six blocks in the middle of the diagram, where each block corresponds to four pixel pipelines. If you aren’t familiar with how pre-unified shader GPUs were built, this diagram probably looks a bit odd. G80, in contrast, is rather more familiar:
Nvidia’s GeForce 8800 cards were the first consumer graphics cards to swap dedicated pixel and vertex shaders for a wide array of simpler stream processors (SPs, later referred to as CUDA cores). While previous GPUs were vector processors that could operate concurrently on the red, green, blue, and alpha color components of a single pixel, Nvidia designed the G80 as a scalar processor, in which each stream processor handled one color component. At a high level, Nvidia had switched from a GPU architecture with dedicated hardware for specific types of shader programs to an array of relatively simple cores that could be programmed to perform whatever types of shader calculations the application required at that particular moment.
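To make the dedicated-versus-unified trade-off concrete, here is a toy model in Python. It is only a sketch under loud assumptions: the 8 vertex and 24 pixel units match the G71 layout described above, but the unified pool is given the same 32 units (not G80's 128 stream processors) so the comparison isolates load balancing alone. The "work" numbers are hypothetical unit-cycles, and clock speeds, scalar versus vector execution, and real scheduling are all ignored.

```python
import math


def cycles_dedicated(vertex_work, pixel_work, vertex_units=8, pixel_units=24):
    """Cycles needed when each kind of work can only run on its own units.

    Unit counts default to the G71-style split (assumption: one unit
    retires one unit of work per cycle).
    """
    vertex_cycles = math.ceil(vertex_work / vertex_units)
    pixel_cycles = math.ceil(pixel_work / pixel_units)
    # The two pools run side by side, so the slower pool sets the pace
    # while the other pool sits partly idle.
    return max(vertex_cycles, pixel_cycles)


def cycles_unified(vertex_work, pixel_work, units=32):
    """Cycles needed when any unit can run either kind of work.

    The pool size of 32 is a hypothetical value chosen to equal 8 + 24.
    """
    return math.ceil((vertex_work + pixel_work) / units)


if __name__ == "__main__":
    # Hypothetical workloads: (label, vertex work, pixel work).
    workloads = [
        ("pixel-heavy", 80, 2400),
        ("geometry-heavy", 1600, 480),
        ("matches the fixed split", 800, 2400),
    ]
    for label, vertex_work, pixel_work in workloads:
        dedicated = cycles_dedicated(vertex_work, pixel_work)
        unified = cycles_unified(vertex_work, pixel_work)
        print(f"{label}: dedicated={dedicated} cycles, unified={unified} cycles")
```

In this model the two designs tie only when the workload happens to match the fixed 8:24 split. Any other mix leaves one dedicated pool waiting on the other, which is the inefficiency a unified array avoids.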
The simpler cores could also be clocked much faster. The GeForce 7900 GTX was built on a 90nm process and hit 650MHz, while the GeForce 8800 GTX, also built on a 90nm process, ran its shader cores at 1.35GHz. But as with any brand-new architecture, there were significant risks involved.
Our interview has been lightly edited for clarity.
ET: G80 debuted more-or-less simultaneously with DirectX 10 and was the first fully programmable GPU to debut for PCs. It was also much larger than previous Nvidia chips (the GeForce 7900 GTX had 278 million transistors; G80 was a 681 million transistor design). What were some of the challenges associated with making this leap, either in terms of managing the design or choosing which features to include and support?
Jonah Alben: I think that one of the biggest challenges with G80 was the creation of the brand new “SM” processor design at the core of the GPU. We pretty much threw out the entire shader architecture from NV30/NV40 and made a new one from scratch with a new general processor architecture (SIMT), that also introduced new processor design methodologies.
ET: Were there any features or capabilities of G80 that represented a risk for Nvidia, in terms of die cost / difficulty, but that you included because you felt the risk was worth it?
Jonah Alben: We definitely felt that compute was a risk. Both in terms of area – we were adding area that all of our gaming products would have to carry even though it wouldn’t be used for gaming – and in terms of complexity of taking on compute as a parallel operating mode for the chip.
John Danskin: This was carefully metered. We gave John Nickolls’ compute team fixed area and engineering budgets. Inside their budgets, they did an incredible job. G80 was designed to run much more complicated pixel shaders with more branching, dependencies, and resource requirements than previous chips. Our poster-child shader was called oil slick and it was close to 1,000 instructions long. These complex shaders didn’t exist in games at the time, but we saw that programmable graphics was just beginning.
ET: Did all of these bets pay off? Were there capabilities that weren’t heavily adopted, or any features that were more successful than you expected?
Jonah Alben: Geometry shaders [introduced in G80 and DirectX 10] didn’t end up being very heavily adopted at the time. But they were a first step towards other investments in programmable geometry (tessellation in Fermi, multi-projection in Pascal) that have proven to be extremely useful. Compute ended up being even more important than I thought it would be at the time – especially, it has been exciting to see it become important to gaming.
John Danskin: In the short run, we overshot on the flexibility of our shader cores. For most of the games G80 ran, something simpler might have been more efficient. In the long run, G80 encouraged the development of more realistic, more exciting content, which was a win. We had high hopes [for GPU computing] but it was similar to a startup. Most startups fail. Some startups change the world. GPU computing started out exciting but small. Now it’s the foundation of deep learning. Few could have foreseen that.
ET: How much of G80’s lineage still remains in modern Nvidia cards? Have there been any follow-up designs (Tesla – Pascal) that you would say represent an even larger generational shift than G80 was compared to G71?
Jonah Alben: While we’ve definitely made major architectural changes since then (Fermi was a major system architecture change and Maxwell was another large change to the processor design), the basic structure that we introduced in G80 is still very much there today.
ET: The G80 debut predates the official launch of CUDA by roughly eight months. How much of the GPU’s design was driven by what Nvidia wanted to accomplish with CUDA? Was this a case of starting with a programmable GPU and realizing you could accomplish much more, or did NV start out from the beginning with a plan to offer a GPU that could offer both excellent game performance and superior compute performance as well?
John Danskin: We were primarily driven by games, but we saw that gaming and computing performance were complementary. We made the most programmable graphics engine we could design, and then we made sure that it could do compute well, too. John Nickolls’ vision was that we would address general High Performance Computing problems.
ET: Looking back with the benefit of hindsight, did the launch of G80 and NV’s subsequent efforts take you where you thought they would?
Jonah Alben: It has taken us well beyond what I expected at the time. In particular CUDA has proven to be a great success. The fundamental design of CUDA that John Nickolls, John Danskin and others defined at the very beginning is still there today, both in CUDA and in similar programming languages (DirectX Compute, OpenCL, etc.). It turned out that we were right to believe that the world needed a new programming model that was designed thoughtfully for parallel programming – both in terms of how to express the workload and in terms of how to constrain the programmer so that their code would be structured in a way that was likely to perform well.
The speedups that people got with CUDA on GPUs were downright amazing.
It was the right call to put compute support in all of our GPUs. We built a huge product base and made GPU computing accessible to anyone with an idea that needed more performance than traditional CPUs could handle. Recent developments like the explosion of deep learning I think are directly connected to that decision.
True revolution only happens on occasion in the PC industry. Most products are iterative and evolutionary, rather than wholesale reinventions or radical performance leaps. The debut of G80, however, was arguably one of these moments, and 10 years on, we tip our hat to the GPU that launched Nvidia’s HPC ambitions and kickstarted much of the GPGPU business. Today, more TOP500 supercomputer systems use Fermi- or Kepler-derived accelerators than use AMD Radeon and Intel Xeon Phi hardware combined (66 systems vs. 26 Xeon Phi or Radeon-equipped supercomputers). That’s a testament to Nvidia’s work on CUDA and its overall support of GPGPU computing.
Note: Technically, ATI’s Xenos GPU in the Xbox 360 was the first unified shader GPU in consumer hardware, but Xenos wasn’t a DirectX 10-capable GPU and it was never deployed in ATI’s PC business. ATI’s first unified shader architecture for the PC was the R600, which arrived on May 14, 2007.