SASS King reverse-engineers NVIDIA's SASS, the low-level instruction set found in compiled CUDA binaries. Focused initially on the SM120 and SM120a architectures in consumer Blackwell GPUs, the project builds a dictionary of instructions, audits kernels, and identifies patterns across GPU generations. It addresses a documentation gap since the last major public effort by Jia et al. on Volta and Turing in 2018. Newer architectures like Ampere, Hopper, and Blackwell introduce async copy paths, tensor-core variants, matrix load/store ops, sparse MMA forms, and uniform-register flows, which lack detailed public breakdowns.
The project combines controlled micro-kernels, direct SASS inspection, runtime probes, and audits of production code. Its practical aim: equip kernel engineers to parse SASS dumps, spot compiler decisions, and trace performance issues back to source code. With 215 GitHub stars and written primarily in CUDA, the repository serves as a living knowledge base rather than a standalone tool. Future plans include a disassembler, an audit CLI, and a Ghidra plugin.
Core Components
SASS King organizes its findings into distinct areas, each backed by evidence from kernels and dumps.
- Teaching kernels: Complete sets from corpus/basics/01_vector_add/ through corpus/math_and_spills/12_register_spill/. These isolate variables like data types or operations to demonstrate SASS emission.
- Tensor-core studies: Covers kernels up to 25 in corpus/tensor_cores/, with a dedicated README detailing matrix-multiply-accumulate (MMA) patterns and scaled variants.
- Instruction glossary: Active for SM120/SM120a in knowledge/SASS_INSTRUCTIONS_SM120.md, listing ops with evidence.
- Global findings: Tracks observations in knowledge/FINDINGS.md.
- Encoding notes: Early work on instructions like LDSM, STSM, and QMMA in knowledge/encoding/.
Badges on the README mark it as a research knowledge base under Apache-2.0 license, with architecture pinned to SM120/SM120a. A status table summarizes progress:
| Area | Status | Location |
|---|---|---|
| SM120 teaching kernels | Complete through kernels 01-12 | corpus/basics/01_vector_add/ to corpus/math_and_spills/12_register_spill/ |
| Tensor-core studies | Complete through Kernel 25 | corpus/tensor_cores/ |
| Global findings | Active source of truth | knowledge/FINDINGS.md |
| SM120 instruction glossary | Active, evidence-backed | knowledge/SASS_INSTRUCTIONS_SM120.md |
| Encoding pilots | Started with LDSM, STSM, QMMA | knowledge/encoding/ |
| denvdis cross-validation | Initial pass complete | knowledge/DENVDIS_INTEGRATION.md |
| Pattern library | Next phase | patterns/ |
| Production audits | Planned | production/ |
Cross-validation uses the denvdis disassembler, though gaps persist in control-code coverage. Methodology relies on controlled variation: kernels differ by exactly one factor, such as data type or operand, so that any change in the emitted SASS can be attributed to that factor.
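The controlled-variation idea can be sketched with a diff of two SASS listings. The listings below are hand-written illustrations of what cuobjdump might emit for a float versus a double vector-add (the mnemonics are real SASS, but the exact output varies by toolkit and architecture); in practice both sides would come from actual dumps.

```python
import difflib

# Illustrative SASS for a float vector-add kernel (not a real dump).
sass_float = """\
LDG.E R2, [R4.64]
LDG.E R3, [R6.64]
FADD R2, R2, R3
STG.E [R8.64], R2
"""

# The same kernel with one factor changed: float -> double.
sass_double = """\
LDG.E.64 R2, [R4.64]
LDG.E.64 R4, [R6.64]
DADD R2, R2, R4
STG.E.64 [R8.64], R2
"""

# A unified diff isolates exactly the instructions that changed,
# which is the signal the controlled-variation methodology is after.
diff = list(difflib.unified_diff(
    sass_float.splitlines(), sass_double.splitlines(),
    fromfile="vector_add_f32.sass", tofile="vector_add_f64.sass",
    lineterm="",
))
for line in diff:
    print(line)
```

Because only the data type changed, the diff surfaces the float-to-double instruction substitutions (FADD to DADD, 32-bit to 64-bit loads and stores) with no noise from unrelated codegen differences.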
Getting Started
No compiled binaries or package managers appear in the repo; it's documentation-driven. Clone the repository to explore:
```shell
git clone https://github.com/florianmattana/sass-king.git
cd sass-king
```
Follow the paths in "Start Here":
- Read docs/START_HERE.md for an overview.
- Check knowledge/README.md for the full index.
- Dive into knowledge/SASS_INSTRUCTIONS_SM120.md for the instruction map.
- Review knowledge/FINDINGS.md for raw observations.
- Explore tensor cores via corpus/tensor_cores/README.md.
Public articles by the author expand on this material.
To contribute SASS dumps or fixes, see CONTRIBUTING.md. Release notes in RELEASE_NOTES.md mark the v0.1 scope. Users need NVIDIA hardware (Blackwell preferred), the CUDA toolkit for compiling the corpus kernels, and tools like cuobjdump to extract SASS from compiled cubin or fatbin files. For example, compile a kernel from corpus/basics/01_vector_add/ with nvcc, then dump SASS:
```shell
nvcc -arch=sm_120 kernel.cu -o kernel
cuobjdump -sass kernel | tee kernel.sass
```
Compare the output against project glossaries.
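A first pass over a dump is often an opcode histogram: which instructions appear, and how often, before cross-referencing them against the glossary. The sketch below assumes the usual cuobjdump layout of a hex address comment followed by an optional predicate and the opcode; the regex is a simplification that may need adjusting per toolkit version, and the embedded sample is illustrative, not a real dump.

```python
import re
from collections import Counter

def opcode_histogram(sass_text: str) -> Counter:
    """Tally SASS opcodes from a cuobjdump -sass style dump.

    cuobjdump prints instructions roughly as
    `/*addr*/  [@[!]Pn] OPCODE operands ;`, so this grabs the first
    uppercase token after the address comment, skipping a predicate
    if one is present. Simplified; adjust for your toolkit version.
    """
    counts = Counter()
    pattern = re.compile(
        r"/\*[0-9a-fA-F]+\*/\s+(?:@!?P[T0-9]+\s+)?([A-Z][A-Z0-9.]*)"
    )
    for line in sass_text.splitlines():
        m = pattern.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Illustrative dump fragment (hand-written, not from a real kernel).
sample = """\
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;
        /*0010*/                   S2R R4, SR_CTAID.X ;
        /*0020*/              @!P0 EXIT ;
        /*0030*/                   FADD R2, R2, R3 ;
"""
hist = opcode_histogram(sample)
print(hist.most_common())
```

Each opcode in the histogram can then be looked up in knowledge/SASS_INSTRUCTIONS_SM120.md to check whether the project has already documented it.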
Who This Serves
Kernel engineers tuning CUDA code for high-performance computing or AI workloads benefit most. If you audit SASS to debug register spills, tensor-core scheduling, or async ops, the controlled kernels and glossaries provide traceable examples. Production teams optimizing Blackwell GPUs—think sparse MMA or new matrix stores—gain from the pattern library (forthcoming in patterns/). Reverse engineers mapping ISA evolution across architectures will reference the encoding pilots and findings.
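For the register-spill case specifically, a quick heuristic is to scan a dump for local-memory traffic: LDL and STL are real SASS mnemonics for local loads and stores, and their presence is the classic signature of spilled registers. The filter below is a rough sketch (the sample lines are hand-written), not a replacement for reading the dump against the project's kernel 12 study.

```python
import re

# LDL/STL (local load/store) are the usual SASS markers of register
# spills; treat matches as candidates to investigate, not proof.
SPILL_OPS = re.compile(r"\b(LDL|STL)\b")

def spill_lines(sass_text: str) -> list[str]:
    """Return SASS lines that touch local memory (possible spills)."""
    return [ln.strip() for ln in sass_text.splitlines() if SPILL_OPS.search(ln)]

# Illustrative fragment, not a real dump.
sample = """\
        /*0120*/ STL [R1+0x8], R12 ;
        /*0130*/ FFMA R10, R4, R5, R10 ;
        /*0260*/ LDL R12, [R1+0x8] ;
"""
hits = spill_lines(sample)
for line in hits:
    print(line)
```

A nonempty result is a cue to revisit register pressure in the source, for example by comparing against the spill-free baseline kernels in corpus/math_and_spills/.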
It's less useful for application developers not touching low-level GPU code. Casual CUDA users sticking to libraries like cuBLAS skip this entirely.
Comparisons and Context
SASS King updates efforts like Jia et al.'s 2018 Volta/Turing disassembly, which predates Ampere's tensor expansions. The denvdis disassembler offers raw disassembly but lacks the project's pedagogical kernels and architecture-specific glossaries; SASS King integrates it for validation (see knowledge/DENVDIS_INTEGRATION.md). Commercial tools like Nsight Compute visualize SASS indirectly; this repo emphasizes raw reading and compiler tracing.
Open alternatives include scattered GitHub repos on Hopper SASS or academic papers, but none match the structured corpus through kernel 25. For broader ISA work, check NVIDIA's CUDA documentation, though it omits full SASS details.
The project remains in the research phase, heavy on docs over executables. Star or watch the repo at https://github.com/florianmattana/sass-king for updates; more at https://florianmattana.com/posts/.