Systalyze

S systalyze.com ↗

▲ 128 points • 28 comments • by ManyaGhobadi • 2mo ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is primarily human-written, with some AI-generated content detected

14 %

AI likelihood · overall

Mixed

87% human-written 13% AI-generated

SEGMENTS · HUMAN 4 of 6

SEGMENTS · AI 2 of 6

WORD COUNT 1,737

PEAK AI % 100% · §4

Analyzed

Apr 27

backend: pangram/v3.3

Segments scanned

6 windows

avg 290 words each


Article text · 1,737 words · 6 segments analyzed


§1 Human · 9%

"The gap isn't awareness: engineers who write CUDA kernels know what accurate utilization looks like. The gap is tooling. There has never been a way to see true GPU efficiency continuously, in production, without slowing down the workload."
- Manya Ghobadi, MIT Professor & CEO, Systalyze

Figure: nvtop (top row) reads 100% on all three workloads regardless of the size of the matrix multiplications. Utilyze (bottom row) tracks actual compute throughput, showing dramatic utilization variation for different matrix sizes.

As shown in the figure above, nvtop is invariant to workload intensity: all three matrix multiplication sizes show 100% in nvtop (top row, cyan line pinned at the ceiling). Utilyze (bottom row) shows compute throughput scaling with matrix size, from 2.6% at N=256 to 32% at N=1024 and 88% at N=4096.

To validate the correctness of Utilyze, let’s calculate the true compute utilization directly. A matrix multiplication of two N×N matrices at TF32 precision performs 2·N³ floating-point operations. We scale the H200 peak TF32 rate of 494.5 TFLOPS by the GPU's clock speed and calculate utilization against that peak. At N=256, 2·256³ ≈ 0.034 GFLOP per multiplication × 155,349 iterations/sec = 5.2 TFLOPS, or 1% utilization. The same calculation yields 32% utilization at N=1024 and 86% at N=4096. These ground-truth numbers are within 2 percentage points of the numbers Utilyze reported.

While this direct calculation is tractable for a simple operation like matrix multiplication, it becomes intractable for real-world AI workloads. Modern training, fine-tuning, and inference pipelines consist of heterogeneous operators (attention, normalization, communication, sparsity, control flow), dynamic shapes, and complex scheduling effects across the GPU. In such settings, deriving true utilization analytically from first principles is not practical.

What is needed instead is a method that measures utilization directly at the hardware level. Utilyze provides exactly this capability: direct measurement of true compute utilization via GPU hardware performance counters.
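The first-principles calculation for the N=256 case can be reproduced in a few lines. This is a minimal sketch, not part of Utilyze: the function names are made up for illustration, and it divides by the unscaled 494.5 TFLOPS peak because the article does not state the clock-scaling factor it applied.

```python
def matmul_flops(n: int) -> int:
    """Floating-point operations for one N x N by N x N matrix multiplication,
    counting each multiply-add as two operations (2 * N^3)."""
    return 2 * n**3


def compute_utilization(n: int, iters_per_sec: float, peak_tflops: float):
    """Return (achieved TFLOPS, fraction of peak) for a repeated matmul.

    peak_tflops is an assumption supplied by the caller; the article scales
    the H200 TF32 peak by clock speed, which is not reproduced here.
    """
    achieved_tflops = matmul_flops(n) * iters_per_sec / 1e12
    return achieved_tflops, achieved_tflops / peak_tflops


# Numbers quoted in the article for N=256 on an H200.
achieved, util = compute_utilization(n=256, iters_per_sec=155_349, peak_tflops=494.5)
print(f"achieved: {achieved:.1f} TFLOPS, utilization: {util:.1%}")
```

The same function applies to the N=1024 and N=4096 cases once their measured iteration rates are plugged in; those rates are not given in the article, so they are left out here.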

§2 Human · 3%

Utilyze arrives at nearly identical values (within 2 percentage points) from the other direction. Instead of deriving utilization from FLOP counts, it samples hardware counters on the GPU directly. The two methods agree because they measure the same physical thing from different angles: arithmetic work done against arithmetic capacity available. This cross-validation confirms Utilyze’s hardware-counter approach is accurate. No other tool today delivers this level of accuracy in real time without incurring meaningful overhead.

“Cloud providers and hardware vendors surface this same misleading metric on their dashboards. When that number reads 100%, the natural conclusion is that you need more hardware. The incentives to correct this misimpression are, to put it diplomatically, complicated.”
- Manya Ghobadi, MIT Professor & CEO, Systalyze

DCGM-based Counters Aren’t Much Better

Prior articles have pointed out this gap and suggested alternative metrics through NVIDIA's Data Center GPU Manager (DCGM), a toolkit that exposes richer GPU counters than nvidia-smi (see here and here). The most common proxy for GPU utilization is DCGM’s “SM Active.” It measures the ratio of SMs with at least one warp scheduled to the total number of SMs. This metric is an improvement over nvidia-smi because it at least considers some compute activity inside the GPU rather than treating the whole chip as a single on/off switch. But SM Active, and other DCGM metrics, have the same kind of problem one level down: a warp being resident on an SM does not mean that SM is doing arithmetic. The warp could be moving data, waiting for data to arrive from memory, or running bookkeeping instructions the entire time, and SM Active would still read 100%.

Utilyze is specifically built to answer the true GPU utilization question: what fraction of peak arithmetic throughput is the GPU actually delivering? No off-the-shelf tool, including DCGM, provides this continuously in production.

To see this in practice, we ran a memory-bound workload on an H200, similar in shape to a decode-heavy LLM inference step, with nvtop, DCGM, and Utilyze.
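The difference between "a warp is resident" and "the SM is doing arithmetic" can be illustrated with a toy model. This sketch does not read any real counters; the `SMSample` type, its fields, and the sample values are hypothetical and exist only to show why the two ratios diverge on a memory-bound workload.

```python
from dataclasses import dataclass


@dataclass
class SMSample:
    """Hypothetical per-SM snapshot, not a real DCGM or Utilyze structure."""
    resident_warps: int         # warps scheduled on this SM during the sample
    math_issue_fraction: float  # share of cycles spent issuing arithmetic (0..1)


def sm_active(samples: list[SMSample]) -> float:
    """Fraction of SMs with at least one resident warp (the SM Active idea)."""
    return sum(s.resident_warps > 0 for s in samples) / len(samples)


def arithmetic_utilization(samples: list[SMSample]) -> float:
    """Average share of cycles in which the SMs actually issued arithmetic."""
    return sum(s.math_issue_fraction for s in samples) / len(samples)


# Illustrative memory-bound snapshot: every SM holds warps, but those warps
# spend most cycles waiting on memory, so little arithmetic is issued.
samples = [SMSample(resident_warps=8, math_issue_fraction=0.06) for _ in range(132)]

print(f"SM Active:              {sm_active(samples):.0%}")
print(f"arithmetic utilization: {arithmetic_utilization(samples):.0%}")
```

In this toy snapshot the occupancy-style ratio is saturated while the arithmetic ratio stays low, which is the pattern the article describes for decode-heavy inference.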

§3 Human · 9%

Under this workload, the actual arithmetic throughput is around 6% of the ceiling:

nvtop: 100% (ground truth: 6%)
SM Active (DCGM): 99% (ground truth: 6%)
Utilyze: 6% (ground truth: 6%)

Only Utilyze gets it right. nvtop is wrong for the reason we already covered. SM Active reports a whopping 99% utilization. The SMs really do have warps resident the whole time, but those warps are waiting on memory rather than doing math, and SM Active cannot distinguish between a warp that is computing and a warp that is sitting idle waiting for data. Relying on SM Active to monitor GPU utilization gives the illusion that the GPU is fully saturated while its math units are actually sitting mostly idle.

DCGM reports other metrics, such as SM issue (how often instructions are being issued), SM occupancy (how full the SMs are with warps), and tensor core throughput. None of these metrics, independently or combined, shows the full picture that Utilyze provides.

Introducing Utilyze, Open-Sourced by Systalyze

We built Utilyze as an open-source GPU monitoring tool to report true GPU compute and GPU memory bandwidth utilization as a percentage of the hardware’s theoretical limit. Beyond raw utilization, Utilyze also estimates the portion of the theoretical limit that is practically attainable under the current hardware, software stack, and AI workload. Utilyze operates in real time with near-zero overhead, making it suitable for production environments where continuous observability is required without perturbing performance. At Systalyze, we use it to monitor, benchmark, and validate our performance optimization techniques, and we think everyone should use it.

Give Utilyze a try today

Before describing how Utilyze works, let’s unpack why accurate GPU utilization is a technically difficult measurement problem. GPUs have two fundamentally different types of compute resources: CUDA cores for general floating-point math, and Tensor Cores that perform matrix multiplications. They also have multiple levels of memory: HBM (high bandwidth memory) sitting off-chip, L2 cache, shared memory inside each SM, and registers local to each thread. Each of these resources can be a bottleneck independently. A workload can be using its Tensor Cores at full capacity while memory bandwidth sits nearly idle, or vice versa. A single percentage cannot represent this two-dimensional reality.

§4 AI · 100%

As a result, every AI operation on a GPU is constrained by two physical limits: how fast the math units can execute arithmetic (compute throughput), and how fast data can move between memory and the math units (memory bandwidth). Every kernel hits one of these limits first, and that determines its maximum possible performance.

This brings us to the framework that actually captures GPU utilization accurately: the Speed-of-Light (SOL) model. It measures how close a kernel gets to the GPU's theoretical hardware ceiling, reporting two key numbers: Compute SOL % (= achieved FLOPS ÷ peak FLOPS) and Memory SOL % (= achieved bandwidth ÷ peak bandwidth). It derives from the roofline model, where every kernel is bounded by either compute or memory, and the higher of the two SOL percentages identifies the binding constraint.

Utilyze reports exactly these two headline numbers, Compute SOL % and Memory SOL %, and both are shown live. The numerator comes from direct measurement of each compute engine (e.g., Tensor Cores, FP32/FP64/INT32 pipelines) and each memory subsystem (e.g., HBM bandwidth, L2, L1), where NVIDIA exposes each as a percentage of that hardware unit's theoretical maximum. The denominator is the SOL itself, the hardware peak. Together, these give you an accurate, live picture of GPU utilization that no other tool provides. If the compute number is dominant, your workload is compute-bound.
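The two SOL ratios and the "higher one is the binding constraint" rule can be written down directly. This is a minimal sketch of the definitions above, not Utilyze's code: the function name and the achieved-throughput inputs are hypothetical, and the peaks used in the example (494.5 TFLOPS TF32 and 4.8 TB/s for an H200) are assumptions passed in by the caller.

```python
def sol_percentages(achieved_tflops: float, peak_tflops: float,
                    achieved_tbps: float, peak_tbps: float):
    """Return (Compute SOL %, Memory SOL %, binding constraint).

    Compute SOL % = achieved FLOPS / peak FLOPS
    Memory SOL %  = achieved bandwidth / peak bandwidth
    The higher percentage is treated as the binding constraint.
    """
    compute_sol = 100.0 * achieved_tflops / peak_tflops
    memory_sol = 100.0 * achieved_tbps / peak_tbps
    bound = "compute-bound" if compute_sol >= memory_sol else "memory-bound"
    return compute_sol, memory_sol, bound


# Illustrative inputs only: a workload moving a lot of data but doing little math.
compute_sol, memory_sol, bound = sol_percentages(
    achieved_tflops=30.0, peak_tflops=494.5,  # assumed H200 TF32 peak
    achieved_tbps=3.6, peak_tbps=4.8,         # assumed H200 HBM bandwidth peak
)
print(f"Compute SOL: {compute_sol:.1f}%  Memory SOL: {memory_sol:.1f}%  -> {bound}")
```

A real tool would obtain the achieved values from hardware counters per engine and per memory level; here they are plain arguments so the arithmetic stays visible.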

§5 AI · 100%

If the memory number is dominant, you're memory-bound, and optimizations should target data movement first.

But raw SOL % doesn’t tell the whole story on its own: 100% is not a realistic target.

The theoretical hardware peak (2,000 TFLOPS of compute and 3.4 TB/s of memory bandwidth on an H100) is a physical limit that no real AI workload can reach. Kernel launches have overhead. Data moves between levels of the memory hierarchy. Thread synchronization takes cycles. In multi-GPU setups, communication between GPUs consumes time that could otherwise be spent on computation. For Mixture-of-Experts models, routing tokens to different experts creates irregular memory access patterns that reduce effective throughput. None of these are signs of poor optimization; they're structural properties of real deployments.

Every deployment has a natural ceiling below 100% that reflects the specific combination of model architecture, hardware, parallelism strategy, and batch size. We call this ceiling the Attainable Compute SOL %, hereafter referred to as Attainable SOL %. The gap between your current SOL % and the Attainable SOL % is your optimization budget. The gap between the Attainable SOL % and 100% is the physics of your deployment; you can't close it by tuning.
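The two gaps defined above are simple differences, sketched here with illustrative values. The function name and return keys are hypothetical; how Utilyze estimates the Attainable SOL % itself is not described in this section, so it is taken as an input.

```python
def sol_gaps(current_sol: float, attainable_sol: float) -> dict[str, float]:
    """Split the distance to 100% SOL into a tunable part and a structural part.

    optimization_budget: percentage points recoverable by tuning
    structural_gap:      percentage points that tuning cannot close
    """
    if not 0.0 <= current_sol <= attainable_sol <= 100.0:
        raise ValueError("expected 0 <= current_sol <= attainable_sol <= 100")
    return {
        "optimization_budget": attainable_sol - current_sol,
        "structural_gap": 100.0 - attainable_sol,
    }


# Illustrative values: the same current SOL % against two different ceilings.
near_ceiling = sol_gaps(current_sol=30.0, attainable_sol=35.0)
far_from_ceiling = sol_gaps(current_sol=30.0, attainable_sol=65.0)
print(near_ceiling)
print(far_from_ceiling)
```

The same current SOL % leads to opposite conclusions depending on the ceiling: a small budget suggests the deployment is already near its limit, a large one suggests tuning before buying hardware.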

§6 Human · 4%

For instance, if you're running a 120B-parameter inference setup at 30% Compute SOL % and the Attainable SOL % for that model on that hardware is 35%, you're close to the limit. If the Attainable SOL % is 65% and you're at 30%, you have 35 percentage points of recoverable performance, and the right move is optimization, not procurement.

Why Is Utilyze Different?

Performance engineers often rely on two main tools to debug performance problems in AI workloads. The first is Nsight Compute (ncu), a kernel-level profiler that reports detailed compute and memory throughput metrics, such as what fraction of the Tensor Cores' theoretical throughput was actually achieved, what fraction of the memory bus was saturated, and where the bottleneck lies. The second is Nsight Systems (nsys), a timeline tool that records when kernels ran and how they interacted. Both tools are built for offline analysis rather than for a real-time dashboard.

ncu gets its detail by "replaying" each kernel, running it many times with different counters selected, then stitching the results together. The result is valuable, but the overhead makes workloads run 10× to 100× slower than normal, which rules it out for live traffic. nsys avoids the slowdown but doesn't report throughput metrics at all; it answers "what happened" rather than "how efficiently." The practical consequence: seasoned engineers who regularly reach for ncu (or its AMD equivalent, Omniperf) use these tools for offline, per-kernel debugging, not to watch live traffic.

To address this challenge, Utilyze cycles through GPU performance counters across time windows using NVIDIA's Nsight Perf SDK. Rather than replaying kernels, Utilyze takes a rolling sample across multiple windows and aggregates the result. As a result, the overhead is negligible and the measurement is continuous.
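The idea of cycling through counter groups across time windows, instead of replaying kernels, can be sketched as a small simulation. This is not Utilyze's implementation and makes no Nsight Perf SDK calls: the class, the counter names, and the `fake_read` function are all hypothetical stand-ins for a real hardware read.

```python
from collections import deque
from itertools import cycle

# Hypothetical counter values; a real sampler would read these from the GPU.
FAKE_COUNTERS = {"tensor_pct": 40.0, "fp32_pct": 10.0, "hbm_pct": 75.0}


def fake_read(group):
    """Stand-in for a hardware read of one counter group in one time window."""
    return {name: FAKE_COUNTERS[name] for name in group}


class MultiplexedSampler:
    """Reads one counter group per window and keeps a rolling history per counter."""

    def __init__(self, counter_groups, read_group, history=4):
        self._groups = cycle(counter_groups)  # round-robin over groups
        self._read_group = read_group
        self._history = {
            name: deque(maxlen=history) for group in counter_groups for name in group
        }

    def sample_window(self):
        """Advance one time window: read only the next group of counters."""
        group = next(self._groups)
        for name, value in self._read_group(group).items():
            self._history[name].append(value)

    def snapshot(self):
        """Rolling average of every counter that has been sampled at least once."""
        return {n: sum(v) / len(v) for n, v in self._history.items() if v}


sampler = MultiplexedSampler([("tensor_pct", "fp32_pct"), ("hbm_pct",)], fake_read)
for _ in range(4):  # four windows: each group is visited twice
    sampler.sample_window()
print(sampler.snapshot())
```

The design trade-off is that each counter is only observed in some windows, so the aggregate lags fast changes slightly, but the workload itself runs once and is never replayed.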
You can run Utilyze alongside any production AI workload and get meaningful data in real time.

Benchmarking Utilyze

The following are a few examples demonstrating how to leverage Utilyze to identify performance bottlenecks in real AI workloads.