The Volta Architecture: In Brief

Before we dive into our benchmark results, I want to spend a bit of time discussing the Volta architecture and the GV100 GPU in particular. Understanding what Volta brings to the table is critical for understanding the performance possibilities of the card, and understanding the GV100 GPU is similarly important for understanding the practical performance of the card.

Volta is a brand new architecture for NVIDIA in almost every sense of the word. While the logical organization is the same much of the time, it's not Pascal at 12nm with Tensor Cores. Rather it's a significantly different architecture in terms of thread execution, thread scheduling, core layout, memory controllers, ISA, and more. And these are just the things NVIDIA is willing to talk about right now, never mind the ample secrets they still keep.

Of the many architectural changes that Volta makes, there are four items in particular that I feel really set it apart from Pascal and help shape its capabilities.

  1. New tensor cores
  2. Removing the second warp scheduler dispatch unit & eliminating superscalar execution
  3. Separating the Integer cores
  4. Finer-grained thread scheduling

NVIDIA's Big Bet: Tensor Cores

The big story here, of course, is the new tensor cores. While the Volta architecture has all the makings of a strong HPC architecture even in a traditional context, the massive performance numbers that NVIDIA quotes for the Titan V and other Volta cards come from the use of these tensor cores.

Tensor cores are a new type of core for Volta that can, at a high level, be thought of as a more rigid, less flexible (but still programmable) core geared specifically for tensor math operations. These cores are essentially a mass collection of ALUs for performing 4x4 matrix operations; specifically a fused multiply-add (A*B+C), multiplying two 4x4 FP16 matrices together, and then adding that result to an FP16 or FP32 4x4 matrix to generate a final 4x4 FP16 or FP32 matrix. Tensor operations are common in certain types of workloads, particularly neural network training and execution (inferencing).
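To make the operation concrete, here is a minimal Python sketch of the FP32-accumulate case: two 4x4 inputs rounded to FP16, multiplied, and added to an accumulator matrix. The function names (`to_fp16`, `to_fp32`, `tensor_core_fma`) are my own, and the rounding after every accumulation step is an assumption made for illustration; NVIDIA has not detailed the internal rounding behavior of the hardware here.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def tensor_core_fma(a, b, c):
    """Compute D = A*B + C for 4x4 matrices given as lists of lists.

    A and B are rounded to FP16 first; the running sum is kept in FP32.
    Rounding after every step is an illustrative assumption, not a
    description of the real hardware. Returns (D, number of FMAs).
    """
    n = 4
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [[0.0] * n for _ in range(n)]
    fma_count = 0
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # One fused multiply-add per (i, j, k) triple.
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
                fma_count += 1
            d[i][j] = acc
    return d, fma_count
```

The triple loop runs 4 x 4 x 4 = 64 times, which is where the 64 FMA operations per tensor core come from.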

The significance of these cores is that by performing a massive matrix multiplication operation in one unit, NVIDIA can achieve a much higher number of FLOPS for this one operation. A single tensor core performs the equivalent of 64 FMA operations per clock (for 128 FLOPS total), and with GV100 packing 8 such cores per SM, this results in 1024 FLOPS per clock per SM. By comparison, even with pure FP16 operations, the standard CUDA cores in a GV100 SM only generate 256 FLOPS per clock. So in scenarios where these cores can be used, we’re looking at the ability to hit 4x the performance versus Pascal.
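The arithmetic behind those figures is simple enough to write down. The sketch below uses only the per-core and per-SM numbers quoted above; `peak_tflops` is a hypothetical helper of my own, and the SM count and clock speed it takes are left as parameters rather than filled in with Titan V specifications.

```python
FLOPS_PER_FMA = 2  # one multiply plus one add

# Figure quoted in the text for FP16 on the standard CUDA cores.
CUDA_FP16_FLOPS_PER_CLOCK_PER_SM = 256


def tensor_flops_per_clock_per_sm(fma_per_core=64, cores_per_sm=8):
    """FLOPS per clock for one SM's tensor cores."""
    return fma_per_core * FLOPS_PER_FMA * cores_per_sm


def peak_tflops(flops_per_clock_per_sm, sm_count, clock_ghz):
    """Theoretical peak in TFLOPS (hypothetical helper).

    sm_count and clock_ghz are supplied by the caller; nothing here
    encodes a real product's configuration.
    """
    return flops_per_clock_per_sm * sm_count * clock_ghz / 1000.0


if __name__ == "__main__":
    tensor = tensor_flops_per_clock_per_sm()
    print(tensor)                                     # 1024
    print(tensor / CUDA_FP16_FLOPS_PER_CLOCK_PER_SM)  # 4.0
```

Multiplying the per-SM figure by an SM count and a clock speed gives a theoretical peak only; sustained throughput depends on keeping the tensor cores fed.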

The flip side to all of this is that the tensor cores are relatively rigid cores. They really aren’t very good for anything except tensor operations, so they are only applicable to certain classes of compute tasks, and right now I don’t know of any graphics tasks that would really benefit from the cores. The benefit to NVIDIA of doing this is that this allows the tensor cores to be very performance dense, both in total performance and in actual die space usage; by lumping together so many ALUs within a single core and without duplicating their control logic or other supporting hardware, the percentage of transistors in a core dedicated to ALUs is higher than on a standard CUDA core. The cost is flexibility, as the hardware to enable flexibility takes up space. So this is a very conscious tradeoff on NVIDIA’s part between flexibility and total throughput.

Meanwhile because tensor cores are brand-new to NVIDIA’s GPUs and because they need to be explicitly called upon, NVIDIA has a bit of a chicken & egg problem here. To make the most of the new hardware, NVIDIA needs developers to write software that taps into the new tensor cores, and in a roundabout way this is one of several roles the Titan V is designed to fill. The $3000 card is still expensive, but it’s the first workstation-class card incorporating these cores, vastly increasing the accessibility of the hardware to software developers. So whereas the first wave of Volta-optimized software has been specific big-ticket items like NVIDIA’s libraries and deep learning frameworks like Caffe2, Titan V will help developers put together the second wave of software.

Scheduling, ILP, & Integers

The second big change that Volta brings to the table is that, at least for GV100, the second warp scheduler dispatch port has been eliminated. Ever since GF104 in 2010, NVIDIA’s architectures have featured two dispatch ports per warp scheduler, allowing for superscalar execution. In other words, their architecture has relied on a degree of instruction level parallelism, requiring the ability to execute a second, non-dependent instruction from a thread in order to get the most out of the hardware.

Volta/GV100, by contrast, is no longer superscalar. Each partition within an SM is now fed by a warp scheduler with a single dispatch unit, with no opportunity to extract ILP. This means that Volta is a pure thread level parallelism (TLP) design: max utilization comes from maximizing the number of threads active at any given time.

ILP versus TLP is a constant balance, and it’s not unusual to see NVIDIA shifting between the two, especially for a compute-centric GPU like GV100. ILP is nice to have, but extracting it can be difficult. On the other hand, while GPUs are meant for embarrassingly parallel tasks, it’s not always easy to generate more threads. So there’s a very real question over whether the performance gains from adding the hardware for ILP justify the power and complexity costs of doing so.
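A toy model helps show why a single-dispatch design leans so heavily on thread count. In the Python sketch below, each warp runs a chain of dependent instructions and the scheduler can issue at most one instruction per clock. The function name, the fixed `latency`, and the lowest-index-first pick order are all assumptions of mine; this is not a description of Volta's actual scheduler.

```python
def simulate_single_issue(num_warps, latency, instructions_per_warp):
    """Fraction of cycles in which the lone dispatch port issued.

    Toy model: every instruction in a warp depends on the previous
    one, so a warp cannot issue again until `latency` cycles after
    its last issue. The scheduler picks the lowest-numbered ready warp.
    """
    ready_at = [0] * num_warps
    remaining = [instructions_per_warp] * num_warps
    total = num_warps * instructions_per_warp
    cycle = 0
    issued = 0
    while issued < total:
        for w in range(num_warps):
            if remaining[w] > 0 and ready_at[w] <= cycle:
                ready_at[w] = cycle + latency
                remaining[w] -= 1
                issued += 1
                break  # only one dispatch per clock
        cycle += 1
    return issued / cycle


if __name__ == "__main__":
    for warps in (1, 2, 4, 8):
        print(warps, simulate_single_issue(warps, latency=4,
                                           instructions_per_warp=8))
```

In this model the dispatch port only stays busy every clock once there are at least as many ready warps as the instruction latency in cycles. With no second dispatch port to pull an independent instruction from the same warp, more warps is the only lever.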

Meanwhile NVIDIA has also made an interesting change to the Volta architecture with respect to how the integer ALUs are organized. Though not typically a subject of conversation in NVIDIA architecture design, the integer ALUs have traditionally been paired with the FP32 ALUs to make up a single CUDA core. So a block of CUDA cores could either execute integer or floating point operations, but not both at the same time.

Volta in turn separates these ALUs. The integer units have now graduated to their own set of dedicated cores within the GPU design, meaning that they can be used alongside the FP32 cores much more freely. The specifics of this arrangement get a bit hairy in light of the fact that Volta isn’t superscalar – so you technically can’t issue INT and FP32 instructions at the same time regardless – but given the fact that most GPU operations take multiple clocks to execute, this allows for much more flexibility than before. NVIDIA notes that it’s especially useful for address generation and in FMA performance, the latter of which we’re taking a look at a bit later.
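The interaction between single issue and multi-clock execution can be sketched with another toy model. Here each instruction occupies its pipe for `occupancy` clocks, and the scheduler issues at most one instruction per clock, in order. The function name and the 2-clock occupancy used in the example are illustrative assumptions, and the instructions are assumed to be independent of one another (for instance, coming from different warps).

```python
def cycles_to_drain(stream, occupancy, split_pipes):
    """Clocks needed to finish an in-order stream of 'FP32'/'INT' ops.

    Toy model: at most one instruction issues per clock, and each
    instruction keeps its pipe busy for `occupancy` clocks. With
    split_pipes=False, INT and FP32 share one pipe (the older,
    paired arrangement); with True, each type has its own pipe.
    """
    pipe_free_at = {"FP32": 0, "INT": 0}
    cycle = 0
    for op in stream:
        pipe = op if split_pipes else "FP32"
        # Stall the in-order issue until the target pipe is free.
        cycle = max(cycle, pipe_free_at[pipe])
        pipe_free_at[pipe] = cycle + occupancy
        cycle += 1  # the next instruction issues no earlier than next clock
    return max(pipe_free_at.values())


if __name__ == "__main__":
    mixed = ["FP32", "INT"] * 4
    print(cycles_to_drain(mixed, occupancy=2, split_pipes=False))  # 16
    print(cycles_to_drain(mixed, occupancy=2, split_pipes=True))   # 9
```

Even though only one instruction issues per clock, the alternating stream finishes in a little over half the time once the pipes are split, because one pipe is still working while the other accepts the next instruction. A stream made up of a single instruction type sees no benefit in this model.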

Finally, and admittedly getting into the even more esoteric aspects of GPU design, NVIDIA has reworked how SIMT works for Volta. The individual threads within a 32-thread warp now have a limited degree of autonomy; threads can now be synchronized at a fine-grained level, and while the SIMT paradigm is still alive and well, it means greater overall efficiency. Importantly, individual threads can now yield, and then be rescheduled together. This also means that a limited amount of scheduling hardware is back in NVIDIA’s GPUs.

This generally doesn’t mean anything for existing software. But for developers who really know their GPUs and threading, it gives them ways to extract performance that couldn’t be done under Pascal’s more rigid SIMT model.
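As a rough illustration of why the ability to yield matters, consider two threads of the same warp that have diverged: one spins waiting on a flag, the other sets it. The Python sketch below is a cartoon using generators as stand-ins for GPU threads. The names (`consumer`, `producer`, `run_serialized`, `run_interleaved`) are hypothetical, the serialized runner reflects my assumption that divergent paths in the older model run one after another, and neither runner models real hardware scheduling.

```python
def consumer(shared):
    """Spin until the flag is set, then use the produced value."""
    while not shared["flag"]:
        yield  # a point where the thread could be descheduled
    shared["result"] = shared["value"] * 2


def producer(shared):
    """Publish a value, then raise the flag."""
    shared["value"] = 21
    yield
    shared["flag"] = True


def run_serialized(threads, max_steps):
    """Run each divergent path to completion before starting the next.

    Returns False if the step budget runs out (no forward progress).
    """
    steps = 0
    for t in threads:
        for _ in t:
            steps += 1
            if steps >= max_steps:
                return False
    return True


def run_interleaved(threads, max_steps):
    """Round-robin between threads, so a spinning thread can yield."""
    pending = list(threads)
    steps = 0
    while pending:
        if steps >= max_steps:
            return False
        for t in list(pending):
            try:
                next(t)
            except StopIteration:
                pending.remove(t)
            steps += 1
    return True
```

If the spinning path happens to run first, the serialized runner never gets to the thread that would release it, while the interleaved runner finishes in a handful of steps. That is the class of fine-grained synchronization pattern that becomes practical when threads within a warp can yield.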


112 Comments


  • mode_13h - Sunday, December 31, 2017 - link

    True that Cuda seems to dominate HPC. I think Nvidia did a good job of cultivating the market for it.

    The trick for them now is that most deep learning users use frameworks which aren't tied to any Nvidia-specific APIs. I know they're pushing TensorRT, but it's certainly not dominant in the way Cuda dominates HPC.
  • tuxRoller - Monday, January 1, 2018 - link

    The problem is that even the gpu accelerated nn frameworks are still largely built first using cuda. torch, caffe and tensorflow offer varying levels of ocl support (generally between some and none).
    Why is this still a problem? Well, where are the ocl 2.1+ drivers? Even 2.0 is super patchy (mainly due to nvidia not officially supporting anything beyond 1.2). Add to this their most recent announcements about merging ocl into vulkan and you have yourself an explanation for why cuda continues to dominate.
    My hope is that khronos announce vulkan 2.0, with ocl being subsumed, very soon. Doing that means vendors only have to maintain a single driver (with everything consuming spirv) and nvidia would, basically, be forced to offer opencl-next. Bottom-line: if they can bring the ocl functionality into vulkan without massively increasing the driver complexity, I'd expect far more interest from the community.
  • mode_13h - Friday, January 5, 2018 - link

    Your mistake is focusing on OpenCL support as a proxy for AMD support. Their solution was actually developing MIOpen as a substitute for Nvidia's cuDNN. They have forks of all the popular frameworks to support it - hopefully they'll get merged in, once ROCm support exists in the mainline Linux kernel.

    Of course, until AMD can answer the V100 on at least power-efficiency grounds, they're going to remain an also-ran, in the market for training. I think they're a bit more competitive for inferencing workloads, however.
  • CiccioB - Thursday, December 21, 2017 - link

    What are you suggesting?
    GPU are a very customized piece of silicon and you have to code for them with optimization for each single architecture if you want to exploit them at the maximum.
    If you think that people buy $10.000 cards to be put in $100.000 racks for a multiple $1.000.000 server just to use open source not optimized not supported not guarantee code in order to make AMD fanboys happy, well, not, it's not like the industry works.
    Grow up.
  • mode_13h - Wednesday, December 27, 2017 - link

    I don't know if you've heard of OpenCL, but there's no reason why a GPU needs to be programmed in a proprietary language.

    It's true that OpenCL has some minor issues with performance portability, but the main problem is Nvidia's stubborn refusal to support anything past version 1.2.

    Anyway, lots of businesses know about vendor lock-in and would rather avoid it, so it sounds like you have some growing up to do if you don't understand that.
  • CiccioB - Monday, January 1, 2018 - link

    Grow up.
    I repeat. None is wasting millions in using not certified, supported libraries. Let's avoid talking about entire frameworks.
    If you think that researchers with budgets of millions are nerds working in a garage with avoiding lock-in strategies as their first thought in the morning, well, grow up kid.
    Nvidia provides the resources to allow them to exploit their expensive HW at the most of its potential reducing time and other associated costs. Also when upgrading the HW with a better one. That's what counts when investing millions for a job.
    For you kid's home made AI joke, you can use whatever alpha library with zero support and certification. Others have already grown up.
  • mode_13h - Friday, January 5, 2018 - link

    No kid here. I've shipped deep-learning based products to paying customers for a major corporation.

    I've no doubt you're some sort of Nvidia shill. Employee? Maybe you bought a bunch of their stock? Certainly sounds like you've drunk their kool aid.

    Your line of reasoning reminds me of how people used to say businesses would never adopt Linux. Now, it overwhelmingly dominates cloud, embedded, and underpins the Android OS running on most of the world's handsets. Not to mention it's what most "researchers with budgets of millions" use.
  • tuxRoller - Wednesday, December 20, 2017 - link

    "The integer units have now graduated their own set of dedicates cores within the GPU design, meaning that they can be used alongside the FP32 cores much more freely."

    Yay! Nvidia caught up to gcn 1.0!
    Seriously, this goes to show how good the gcn arch was. It was probably too ambitious for its time: those old gpus have aged really well, and it took a long time for games to catch up.
  • CiccioB - Thursday, December 21, 2017 - link

    "Nvidia caught up to gcn 1.0!"
    Yeah! It is known to the entire universe that it is nvidia that trails AMD performances.
    Luckily they managed to get this Volta out in time before the bankruptcy.
  • tuxRoller - Wednesday, December 27, 2017 - link

    I'm speaking about architecture not performance.
