At this year’s GPU Technology Conference, Nvidia’s premier conference for technical computing with graphics processors, the company reserved the top keynote for its CEO, Jensen Huang. Over the years, GTC has grown from a segment in a larger, mostly gaming-oriented and somewhat scattershot conference called “nVision” into one of the key conferences mixing academic and commercial high-performance computing.
Jensen’s message was that GPU-accelerated machine learning is growing to touch every aspect of computing. While it’s becoming easier to use neural nets, the technology still has a way to go to reach a broader audience. It’s a hard problem, but Nvidia likes to tackle hard problems.
The Nvidia strategy is to disperse machine learning into every market. To accomplish this, the company is investing in its Deep Learning Institute, a training program to spread the deep learning neural net programming model to a new class of developers.
Much as Sun promoted Java with an extensive series of courses, Nvidia wants to get all programmers to understand neural net programming. With deep neural networks (DNNs) promulgated into many segments, and with cloud support from all major cloud service suppliers, deep learning (DL) can be everywhere — accessible any way you want it, and integrated into every framework.
DL also will come to the Edge; IoT will be so ubiquitous that we will need software writing software, Jensen predicted. The future of artificial intelligence is about the automation of automation.
Nvidia Drives AI Into Toyota
Bringing DL to a wider market also drove Nvidia to build a new computer for autonomous driving. The Xavier processor is the next generation of processor powering the company’s Drive PX platform.
This new platform was chosen by Toyota as the basis for production of autonomous cars in the future. Nvidia couldn’t reveal any details of when we’ll see Toyota cars using Xavier on the road, but there will be various levels of autonomy, including copiloting for commuting and “guardian angel” accident avoidance.
Unique to the Xavier processor is the DLA, a deep learning accelerator that offers 10 tera operations per second (TOPS) of performance. The custom DLA will improve power efficiency and speed for specialized functions such as computer vision.
To spread the DLA’s impact, Nvidia will open source the instruction set and RTL for any third party to integrate. In addition to the DLA, the Xavier system on chip will have Nvidia’s custom 64-bit ARM core and the Volta GPU.
Nvidia continues to execute on its high-performance computing roadmap and is starting to make major changes to its chip architectures to support deep learning. With Volta, Nvidia has made the most flexible and robust platform for deep learning, and it will become the standard against which all other deep learning platforms are judged.
Deep Learning’s Need for More Performance
Nvidia’s conference is all about building a pervasive ecosystem around its GPU architectures. The ecosystem influences the next GPU iteration as well. With early GPUs for high-performance computing and supercomputers, the market demanded more precise computation in the form of double precision floating-point format processing, and Nvidia was the first to add an fp64 unit to its GPUs.
GPUs are the predominant accelerator for machine learning training, but they also can be used to accelerate the inference (decision) execution process. Inference doesn’t require as much precision, but it needs fast throughput. For that need, Nvidia’s Pascal architecture can perform fast, 16-bit floating-point math (fp16).
Nvidia is addressing the need for faster neural net processing by incorporating a dedicated processing unit for DNN tensors in its newest architecture, Volta. The Volta GPU already has more cores and processing power than the fastest Pascal GPU, and the tensor core pushes DNN performance even further. The first Volta chip, the V100, is designed for the highest performance.
The V100 packs a massive 21 billion transistors and is built in TSMC’s 12nm FFN high-performance manufacturing process. The 12nm process, a shrink of the 16nm FF process, allows the reuse of models from 16nm, which reduces the design time.
Even with the shrink, at 815 mm² Nvidia pushed the size of the V100 die to the very limits of the optical reticle.
The V100 builds on Nvidia’s work with the high-performance Pascal P100 GPU, including the same mechanical layout, the same electrical connections, and the same power requirements. This makes the V100 an easy upgrade from the P100 in rack servers.
For traditional GPU processing, the V100 has 5,120 CUDA (compute unified device architecture) cores. The chip is capable of 7.5 teraflops of fp64 math and 15 teraflops of fp32 math.
Feeding data to the cores requires an enormous amount of memory bandwidth. The V100 uses second-generation high-bandwidth memory (HBM2) technology to deliver 900 GB/sec of bandwidth to the chip from its 16 GB of memory.
While the V100 supports the traditional PCIe interface, the chip expands the capability by delivering 300 GB/sec over six NVLink interfaces for GPU-to-GPU connections or GPU-to-CPU connections (presently, only IBM’s POWER8 supports Nvidia’s NVLink wire-based communications protocol).
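To put those bandwidth figures in perspective, here is a back-of-the-envelope sketch using only the numbers quoted above. It assumes peak rates are sustained and that NVLink bandwidth is split evenly across the six links; both are simplifying assumptions, not Nvidia specifications, and the variable names are illustrative.

```python
# Figures quoted in the article
HBM2_BANDWIDTH_GB_PER_SEC = 900
HBM2_CAPACITY_GB = 16
NVLINK_TOTAL_GB_PER_SEC = 300
NVLINK_LINKS = 6

# Time to stream the entire 16 GB once at peak rate (assumption: peak is sustained)
full_sweep_ms = HBM2_CAPACITY_GB / HBM2_BANDWIDTH_GB_PER_SEC * 1000

# Bandwidth per NVLink interface (assumption: an even split across links)
per_link_gb_per_sec = NVLINK_TOTAL_GB_PER_SEC / NVLINK_LINKS

print(f"Full memory sweep: {full_sweep_ms:.1f} ms")
print(f"Per NVLink interface: {per_link_gb_per_sec:.0f} GB/sec")
```

In other words, at peak rate the chip could read its whole memory in under 20 milliseconds, and each NVLink interface would carry roughly 50 GB/sec.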
However, the real change in Volta is the addition of the tensor math unit. With this new unit, it’s possible to perform a 4x4x4 matrix operation in one clock cycle. The tensor unit takes in 16-bit floating-point values, multiplies two 4x4 matrices, and accumulates the result into a third matrix, all in one clock cycle.
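The operation can be pictured as D = A x B + C on 4x4 matrices. The following is a minimal software sketch of that arithmetic, not Nvidia's hardware or API: the helper names (`to_fp16`, `to_fp32`, `tensor_mma`) are hypothetical, inputs are rounded to fp16 with Python's `struct` module, and the running sum is rounded to fp32 after each step as an assumed model of the accumulator.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def tensor_mma(a, b, c):
    """Return (D, fma_count) where D = A x B + C for 4x4 lists of rows.

    A and B are rounded to fp16 on input; the accumulator is kept in
    fp32 (an assumption of this sketch, emulated by rounding each step).
    """
    n = 4
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [[0.0] * n for _ in range(n)]
    fma_count = 0
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # One multiply-add; the hardware does all 64 in one clock
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
                fma_count += 1
            d[i][j] = acc
    return d, fma_count

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
d, fmas = tensor_mma(identity, b, zeros)
print(fmas)  # 64 multiply-adds for one 4x4x4 operation
```

The loop nest makes the "4x4x4" label concrete: 4 rows times 4 columns times 4 products per output element is 64 multiply-adds, which software executes one at a time and the tensor unit completes in a single clock.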
Internal computations in the tensor unit are performed with fp32 precision to ensure accuracy over many calculations. The V100 can perform 120 Tera FLOPS of tensor math using 640 tensor cores. This will make Volta very fast for deep neural net training and inference.
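As a sanity check on the 120-teraflop figure, the arithmetic can be worked backward from the numbers quoted here. This sketch assumes each multiply-add counts as two floating-point operations (the usual convention, but an assumption on our part), and the clock speed it prints is only what the quoted numbers imply, not a published Nvidia specification.

```python
# Figures quoted in the article
TENSOR_CORES = 640
FMAS_PER_CORE_PER_CLOCK = 4 * 4 * 4   # one 4x4x4 matrix operation per clock
QUOTED_TENSOR_TFLOPS = 120

# Assumption: one fused multiply-add is counted as two FLOPs
FLOPS_PER_FMA = 2

flops_per_clock = TENSOR_CORES * FMAS_PER_CORE_PER_CLOCK * FLOPS_PER_FMA
implied_clock_ghz = QUOTED_TENSOR_TFLOPS * 1e12 / flops_per_clock / 1e9

print(flops_per_clock)              # 81920
print(round(implied_clock_ghz, 2))  # 1.46
```

Under those assumptions, the chip would need to run its tensor cores at a little under 1.5 GHz to reach the quoted peak, which is a plausible clock speed for a GPU of this class.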
Because Nvidia already has built an extensive DNN framework with its cuDNN libraries, software will be able to use the new tensor units right out of the gate with a new set of libraries.
Nvidia will extend its support for DNN inference with TensorRT, which takes trained neural nets and compiles the models for real-time execution. The V100 already has a home waiting for it in Oak Ridge National Laboratory’s Summit supercomputer.