SASS King reverse-engineers NVIDIA's SASS, the low-level instruction set found in compiled CUDA binaries. Focused initially on the SM120 and SM120a architectures in consumer Blackwell GPUs, the project builds a dictionary of instructions, audits kernels, and identifies patterns across GPU generations. It addresses a documentation gap: the last major public effort, by Jia et al., covered Volta and Turing in 2018, while newer architectures like Ampere, Hopper, and Blackwell introduce async copy paths, tensor-core variants, matrix load/store ops, sparse MMA forms, and uniform-register flows that lack detailed public breakdowns.

The project combines controlled micro-kernels, direct SASS inspection, runtime probes, and audits of production code. Its practical aim is to equip kernel engineers to parse SASS dumps, spot compiler decisions, and trace performance issues back to source code. With 215 GitHub stars and written primarily in CUDA, the repository serves as a living knowledge base rather than a standalone tool. Future plans include a disassembler, an audit CLI, and a Ghidra plugin.

Core Components

SASS King organizes its findings into distinct areas, each backed by evidence from kernels and dumps.

  • Teaching kernels: Complete sets from corpus/basics/01_vector_add/ through corpus/math_and_spills/12_register_spill/. These isolate variables like data types or operations to demonstrate SASS emission.
  • Tensor-core studies: Covers kernels up to 25 in corpus/tensor_cores/, with a dedicated README detailing matrix-multiply-accumulate (MMA) patterns and scaled variants.
  • Instruction glossary: Active for SM120/SM120a in knowledge/SASS_INSTRUCTIONS_SM120.md, listing ops with evidence.
  • Global findings: Tracks observations in knowledge/FINDINGS.md.
  • Encoding notes: Early work on instructions like LDSM, STSM, and QMMA in knowledge/encoding/.

Badges on the README mark it as a research knowledge base under Apache-2.0 license, with architecture pinned to SM120/SM120a. A status table summarizes progress:

| Area | Status | Location |
|---|---|---|
| SM120 teaching kernels | Complete through kernels 01-12 | corpus/basics/01_vector_add/ to corpus/math_and_spills/12_register_spill/ |
| Tensor-core studies | Complete through kernel 25 | corpus/tensor_cores/ |
| Global findings | Active source of truth | knowledge/FINDINGS.md |
| SM120 instruction glossary | Active, evidence-backed | knowledge/SASS_INSTRUCTIONS_SM120.md |
| Encoding pilots | Started with LDSM, STSM, QMMA | knowledge/encoding/ |
| denvdis cross-validation | Initial pass complete | knowledge/DENVDIS_INTEGRATION.md |
| Pattern library | Next phase | patterns/ |
| Production audits | Planned | production/ |

Cross-validation uses denvdis, NVIDIA's disassembler, though gaps persist in control code. Methodology relies on controlled variation: kernels differ by one factor, such as data type or operand, to map SASS outputs precisely.
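The controlled-variation idea can be illustrated with a hypothetical kernel pair (a sketch in the spirit of the corpus, not code taken from the repo): two kernels identical except for the element type, so any difference between their SASS dumps traces back to that single factor.

```cuda
// Hypothetical controlled-variation pair: same body, different element type.
// Diffing the SASS of these two isolates instructions tied to the type
// (e.g., single- vs. double-precision adds, 32- vs. 64-bit memory ops).
__global__ void add_f32(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

__global__ void add_f64(const double* a, const double* b, double* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
```

Compiling both in one translation unit and dumping SASS with cuobjdump puts the two variants side by side in a single listing, which is the repo's single-factor method in miniature.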

Getting Started

No compiled binaries or package managers appear in the repo; it's documentation-driven. Clone the repository to explore:

git clone https://github.com/florianmattana/sass-king.git
cd sass-king

Follow the paths in "Start Here":

  • Read docs/START_HERE.md for an overview.
  • Check knowledge/README.md for the full index.
  • Dive into knowledge/SASS_INSTRUCTIONS_SM120.md for the instruction map.
  • Review knowledge/FINDINGS.md for raw observations.
  • Explore tensor cores via corpus/tensor_cores/README.md.

Public articles on the author's site expand on this material.

To contribute SASS dumps or fixes, see CONTRIBUTING.md. Release notes in RELEASE_NOTES.md mark the v0.1 scope. Users need NVIDIA hardware (Blackwell preferred), the CUDA toolkit for compiling the corpus kernels, and tools like cuobjdump to extract SASS from compiled binaries. For example, compile a kernel from corpus/basics/01_vector_add/ with nvcc, then dump the SASS:

nvcc -arch=sm_120 kernel.cu -o kernel
cuobjdump -sass kernel | tee kernel.sass

Compare the output against project glossaries.
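Beyond eyeballing, simple text filters help when cross-checking a dump against the glossary. A minimal sketch, assuming the usual cuobjdump layout (the SASS lines below are illustrative placeholders, not real SM120 output):

```shell
# Stand-in for a `cuobjdump -sass` dump: the address comment is field 1
# and the opcode is field 2, so awk can build a quick opcode histogram.
cat > kernel.sass <<'EOF'
/*0010*/ LDG.E R2, [R4] ;
/*0020*/ FADD R6, R2, R5 ;
/*0030*/ STG.E [R8], R6 ;
EOF

# Count how often each opcode appears in the dump.
awk '{print $2}' kernel.sass | sort | uniq -c | sort -rn
```

On a real dump, the same one-liner gives a first-pass opcode profile to look up, instruction by instruction, in knowledge/SASS_INSTRUCTIONS_SM120.md.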

Who This Serves

Kernel engineers tuning CUDA code for high-performance computing or AI workloads benefit most. If you audit SASS to debug register spills, tensor-core scheduling, or async ops, the controlled kernels and glossaries provide traceable examples. Production teams optimizing Blackwell GPUs—think sparse MMA or new matrix stores—gain from the pattern library (forthcoming in patterns/). Reverse engineers mapping ISA evolution across architectures will reference the encoding pilots and findings.

It's less useful for application developers not touching low-level GPU code. Casual CUDA users sticking to libraries like cuBLAS skip this entirely.

Comparisons and Context

SASS King updates efforts like Jia et al.'s 2018 Volta/Turing disassembly, which predates Ampere's tensor expansions. NVIDIA's denvdis offers basic disassembly but lacks the project's pedagogical kernels or architecture-specific glossaries—SASS King integrates it for validation (see knowledge/DENVDIS_INTEGRATION.md). Commercial tools like Nsight Compute visualize SASS indirectly; this repo emphasizes raw reading and compiler tracing.

Open alternatives include scattered GitHub repos on Hopper SASS or academic papers, but none match the structured corpus through kernel 25. For broader ISA work, check NVIDIA's CUDA documentation, though it omits full SASS details.

The project remains in its research phase, heavy on documentation rather than executables. Star or watch the repo at https://github.com/florianmattana/sass-king for updates; more is at https://florianmattana.com/posts/.