FZGPUModules 2.0
GPU-accelerated modular compression pipelines
Header: modules/coders/ans/ans_stage.h
Class: fz::ANSStage
Category: Coder (lossless)
Common instantiation:
Entropy-encodes a byte stream using GPU-accelerated rANS (range Asymmetric Numeral Systems) coding via the vendored dietGPU kernel library. Forward pass produces a self-describing ANS bitstream (ANSCoalescedHeader + rANS data); inverse pass reconstructs the original bytes exactly.
| Direction | Input | Output |
|---|---|---|
| Forward | uint8_t[] | ANS bitstream (ANSCoalescedHeader + rANS data) |
| Inverse | ANS bitstream | uint8_t[] (exact reconstruction) |

The stage is byte-transparent: its input and output DataType are both UNKNOWN, so the pipeline does not enforce type compatibility at the connection point. This allows ANSStage to accept any upstream output as raw bytes, most commonly the uint16_t quantization codes produced by LorenzoQuantStage or QuantizerStage.
| Setting | Type | Default | Purpose |
|---|---|---|---|
| setProbBits(n) | uint8_t | 10 | ANS probability resolution (table size = 2^n) |
Only prob_bits=10 is supported in this build. The dietGPU rANS kernels are compiled as explicit template instantiations for kANSDefaultProbBits=10 in modules/coders/ans/dietgpu/GpuANS.cu. Calling setProbBits(n) with n != 10 and then calling execute() throws std::runtime_error. The setter is reserved for a future build that adds instantiations for prob_bits 9 and 11.
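The deferred failure described above (the setter accepts any value, and the error only surfaces at execute()) can be pictured with a minimal sketch. ProbBitsConfigSketch and its members are hypothetical names invented for illustration; this is not the ANSStage implementation, only a model of the documented behavior.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the documented behavior; NOT the real fz::ANSStage.
class ProbBitsConfigSketch {
public:
    // The only value the document says is compiled into this build.
    static constexpr std::uint8_t kSupportedProbBits = 10;

    // The setter stores the value without validating it.
    void setProbBits(std::uint8_t n) { prob_bits_ = n; }

    // Table size implied by the setting: 2^prob_bits entries.
    std::size_t tableEntries() const { return std::size_t{1} << prob_bits_; }

    // Validation happens here, mirroring "throws at execute()".
    void execute() const {
        if (prob_bits_ != kSupportedProbBits) {
            throw std::runtime_error(
                "prob_bits=" + std::to_string(prob_bits_) +
                " is not supported in this build; only 10 is available");
        }
        // The real stage would launch the rANS kernels at this point.
    }

private:
    std::uint8_t prob_bits_ = kSupportedProbBits;
};
```

In practice this means a misconfigured pipeline can be built successfully and fail only on its first run, so it is safest to leave the setting at its default of 10.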
Why zigzag_codes=true: Raw signed-delta codes from LorenzoQuantStage span a sparse subset of the full uint16 range. Zigzag remaps them to a compact [0, 2*radius-2] range, concentrating the probability mass and improving rANS compression ratio. Without zigzag, most of the 256-bucket histogram will be near zero, wasting table entries.
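To make the remapping concrete, here is a minimal sketch of the standard zigzag mapping on 16-bit values. It assumes the quantization codes can be viewed as signed deltas in [-(radius-1), radius-1]; the function names are illustrative, and the exact mapping used inside LorenzoQuantStage may differ.

```cpp
#include <cstdint>

// Standard zigzag mapping: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
// Small magnitudes of either sign land on small unsigned codes.
inline std::uint16_t zigzag_encode16(std::int16_t v) {
    const std::uint16_t u = static_cast<std::uint16_t>(v);
    // All ones for negative inputs, zero otherwise.
    const std::uint16_t sign_mask = (v < 0) ? 0xFFFFu : 0x0000u;
    return static_cast<std::uint16_t>((u << 1) ^ sign_mask);
}

// Inverse mapping: the low bit carries the sign, the remaining bits the magnitude.
inline std::int16_t zigzag_decode16(std::uint16_t z) {
    const std::uint16_t magnitude = static_cast<std::uint16_t>(z >> 1);
    const std::uint16_t sign_mask = (z & 1u) ? 0xFFFFu : 0x0000u;
    return static_cast<std::int16_t>(magnitude ^ sign_mask);
}
```

Under this mapping, with radius = 512 the largest delta 511 encodes to 1022 (that is, 2*radius-2) and the smallest delta -511 encodes to 1021, so every code falls inside [0, 2*radius-2].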
Consequence: one host barrier per compress call, one per decompress call — the stage is not CUDA Graph compatible. isGraphCompatible() returns false.
ANSStage pre-allocates 7 persistent device buffers from the pipeline MemoryPool at finalize() time (PREALLOCATE mode) or on the first execute() call (MINIMAL mode). All buffers are grow-only — if a subsequent call arrives with a larger input, a reallocation occurs; smaller inputs reuse the existing capacity.
| Buffer | Size formula |
|---|---|
| d_temp_histogram_ | 256 × 4 = 1 KiB |
| d_table_ | 256 × 16 = 4 KiB (uint4 per symbol) |
| d_compressed_blocks_ | max_blocks × 5248 B (5120 B data + 128 B ANSWarpState) |
| d_compressed_words_ | max_blocks × 4 B |
| d_comp_words_prefix_ | max_blocks × 4 B |
| d_temp_prefix_sum_ | CUB temp storage (negligible for ≤512 blocks) |
| d_decode_table_ | 1024 × 4 = 4 KiB (2^prob_bits entries) |
where max_blocks = ceil(inlen / 4096).
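The size formulas in the table can be evaluated on the host with a small helper. This is a sketch for estimating the scratch footprint, not code from the library: the struct and function names are hypothetical, and the CUB temporary storage is left out because its size is determined by CUB at run time.

```cpp
#include <cstddef>

// Hypothetical helper mirroring the documented size formulas.
struct AnsScratchSizes {
    std::size_t max_blocks;
    std::size_t histogram_bytes;          // d_temp_histogram_
    std::size_t table_bytes;              // d_table_
    std::size_t compressed_blocks_bytes;  // d_compressed_blocks_
    std::size_t compressed_words_bytes;   // d_compressed_words_
    std::size_t comp_words_prefix_bytes;  // d_comp_words_prefix_
    std::size_t decode_table_bytes;       // d_decode_table_
};

inline AnsScratchSizes ans_scratch_sizes(std::size_t inlen, unsigned prob_bits = 10) {
    constexpr std::size_t kBlockInputBytes = 4096;          // input bytes per block
    constexpr std::size_t kBlockStorageBytes = 5120 + 128;  // data + ANSWarpState

    AnsScratchSizes s{};
    // Integer form of ceil(inlen / 4096).
    s.max_blocks = (inlen + kBlockInputBytes - 1) / kBlockInputBytes;
    s.histogram_bytes = 256 * 4;    // 256 counters of 4 bytes
    s.table_bytes = 256 * 16;       // one uint4 per symbol
    s.compressed_blocks_bytes = s.max_blocks * kBlockStorageBytes;
    s.compressed_words_bytes = s.max_blocks * 4;
    s.comp_words_prefix_bytes = s.max_blocks * 4;
    s.decode_table_bytes = (std::size_t{1} << prob_bits) * 4;  // 2^prob_bits entries
    return s;
}
```

For a 1 MiB input this gives max_blocks = 256 and 256 × 5248 = 1,343,488 bytes for d_compressed_blocks_, which dominates the total; the fixed-size tables add only about 9 KiB.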
The histogram launch parameters (grid size, block size, shared memory, elements per block) are precomputed at scratch-allocation time by GPU_histogram_generic_optimizer_on_initialization and reused every call — there is no per-call optimizer overhead.
The FZM stage header is 12 bytes:
original_bytes_ is stored so that estimateOutputSizes() can return the exact decompressed size for output buffer allocation without a device-to-host peek at the ANS header.
Only prob_bits=10 is supported. See Stage settings.
Not CUDA Graph compatible. One device-to-host synchronization point exists in every forward call (to read the ANSCoalescedHeader for the compressed size) and one in every inverse call (to read the header before decoding).
Byte-level encoding only. ANSStage operates on uint8_t symbols (256-entry alphabet). For multi-byte integer streams (e.g., uint16_t quantization codes), pair it with ADMStage, which remaps the wide symbol space into the 8-bit domain before ANS coding, or use MANSStage for the fused path.
Compression ratio depends on upstream symbol compactness. ANS achieves its theoretical Shannon entropy bound only when the symbol distribution is known at encode time. Highly uniform or sparse input byte streams compress poorly. When connecting to LorenzoQuantStage, setZigzagCodes(true) is strongly recommended.
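One way to judge in advance whether a byte stream is worth routing through an entropy coder is to measure its order-0 Shannon entropy on the host. The helper below is an illustrative sketch (the function name is hypothetical and it is not part of FZGPUModules); it returns the entropy in bits per byte, where values near 8 indicate little room for a byte-level coder to compress.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Order-0 Shannon entropy of a byte stream, in bits per byte (range 0 to 8).
inline double byte_entropy_bits(const std::vector<std::uint8_t>& data) {
    if (data.empty()) return 0.0;

    // 256-bucket histogram, matching the uint8_t alphabet of the coder.
    std::array<std::size_t, 256> hist{};
    for (std::uint8_t b : data) ++hist[b];

    const double n = static_cast<double>(data.size());
    double entropy = 0.0;
    for (std::size_t count : hist) {
        if (count == 0) continue;  // empty buckets contribute nothing
        const double p = static_cast<double>(count) / n;
        entropy -= p * std::log2(p);
    }
    return entropy;
}
```

Dividing 8 by the returned value gives a rough upper bound on the compression ratio an order-0 byte coder could reach on that data, ignoring header and table overhead.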
ANSStage incorporates rANS kernel headers from dietGPU (Meta Platforms, Inc., and affiliates) vendored under modules/coders/ans/dietgpu/. Copyright notices are retained verbatim in each vendored file.
Meta Platforms, Inc. and affiliates. dietGPU: GPU-based lossless compression for numerical data. https://github.com/facebookresearch/dietgpu
See THIRD_PARTY.md for the full license text.