FZGPUModules 2.0
GPU-accelerated modular compression pipelines
Header: transforms/adm/adm_stage.h
Class: fz::ADMStage
Category: Transform (lossless)
Common instantiation:
Remaps a uint16_t[] or uint32_t[] integer stream into a compact 8-bit symbol domain, dramatically improving the compression ratio of a downstream entropy coder (typically ANSStage).
uint16_t[] or uint32_t[] → opaque ADM payload (uint8_t[])
Algorithm: ADM partitions the input into 512-element warp blocks and computes a per-block center (mean). Each element is then encoded as a *(code, unary-signal)* pair relative to the center. The resulting byte codes have a highly skewed, low-entropy distribution ideal for GPU rANS (ANSStage) or Huffman coding.
The output is an opaque ADM payload — getOutputDataType() returns UNKNOWN to opt out of downstream type checking.
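The block partitioning and center computation described above can be illustrated with a small host-side sketch. This is not the ADM kernel: the function name blockCenters is hypothetical, the mean is assumed to be a truncating integer mean, and a trailing partial block is assumed to be averaged over its actual length. The (code, unary-signal) encoding itself is not reproduced here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Block size taken from the text: 512 elements per warp block.
constexpr std::size_t kBlock = 512;

// Hypothetical CPU reference: one center (integer mean) per 512-element block.
// Assumption: truncating division; the last block may be shorter than kBlock.
std::vector<std::uint16_t> blockCenters(const std::vector<std::uint16_t>& in) {
    std::vector<std::uint16_t> centers;
    for (std::size_t start = 0; start < in.size(); start += kBlock) {
        const std::size_t end = std::min(start + kBlock, in.size());
        std::uint64_t sum = 0;  // wide accumulator so 512 x 65535 cannot overflow
        for (std::size_t i = start; i < end; ++i) {
            sum += in[i];
        }
        centers.push_back(static_cast<std::uint16_t>(sum / (end - start)));
    }
    return centers;
}
```

Each element's deviation from its block center is what the encoder then splits into a byte code and a signal; tightly clustered blocks therefore produce small deviations and a skewed code distribution.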
| Setting | Type | Default | Purpose |
|---|---|---|---|
| setDtype(ADMDtype) | ADMDtype | U16 | Input element type: U16 or U32 |
ADMDtype::U16 expects uint16_t input; ADMDtype::U32 expects uint32_t input. The dtype must match the upstream stage's output element type.
Why ADM before ANS: ANSStage operates on 8-bit symbols (256-entry alphabet). Raw uint16_t quantization codes have a 65536-symbol alphabet, which ANS cannot encode directly. ADM acts as a symbol-domain adapter — it remaps the wide integer stream into the 8-bit domain while preserving all information losslessly.
ADM uses one of two prefix-sum strategies depending on the number of warp blocks (gsize = ⌈N / 512⌉):
Decoupled look-back path (gsize ≤ 1024):
Thrust fallback path (gsize > 1024, ~512 K+ elements for uint16_t):
Consequence: one host barrier per compress call — the stage is not CUDA Graph compatible. isGraphCompatible() returns false.
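The path selection above follows directly from the element count. A minimal sketch, assuming only the two thresholds stated in the text (512 elements per block, look-back for gsize ≤ 1024); the enum and function names are hypothetical and do not come from the library:

```cpp
#include <cstddef>

constexpr std::size_t kBlockElems = 512;          // elements per warp block (from the text)
constexpr std::size_t kLookbackMaxBlocks = 1024;  // largest gsize on the look-back path

// Hypothetical names for the two strategies described above.
enum class PrefixSumPath { DecoupledLookback, ThrustFallback };

// gsize = ceil(N / 512)
constexpr std::size_t warpBlockCount(std::size_t n) {
    return (n + kBlockElems - 1) / kBlockElems;
}

// Look-back for gsize <= 1024, Thrust fallback otherwise.
constexpr PrefixSumPath selectPath(std::size_t n) {
    return warpBlockCount(n) <= kLookbackMaxBlocks
               ? PrefixSumPath::DecoupledLookback
               : PrefixSumPath::ThrustFallback;
}
```

Under these thresholds the switch happens at N = 524,288 elements: that input still yields gsize = 1024 and stays on the look-back path, while one more element moves it to the fallback.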
ADMStage allocates 10 persistent device buffers from the pipeline MemoryPool, either at finalize() time (PREALLOCATE mode) or on the first execute() call (MINIMAL mode). All buffers are grow-only.
Let gsize = ⌈N / 512⌉ and sig = kMaxSignalBytes (2 for U16, 4 for U32):
| Buffer | Size formula |
|---|---|
| d_signal_length_ | gsize × 4 B |
| d_output_lengths_ | (gsize + 1) × 4 B |
| d_centers_ | gsize × sizeof(T) (2 or 4 B per element) |
| d_block_flags_ | ⌈gsize × 512 / 32⌉ × 4 B |
| d_codes_ | N B |
| d_concat_signals_ | N × sig B |
| d_bit_signals_ | N × sig B (Thrust fallback path) |
| d_loc_offset_ | (gsize + 1) × 4 B |
| d_prefix_state_ | (gsize + 1) × 4 B |
| d_overflow_flag_ | 4 B (debug overflow sentinel; always allocated) |
For 4096 uint16_t elements: ~68 KiB total device footprint. For 1M uint16_t elements: ~10 MiB total device footprint.
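For capacity planning, the table formulas can be summed on the host. The sketch below is an illustration only: admScratchBytes is a hypothetical helper, sig is assumed equal to the element size (2 for U16, 4 for U32, as stated above), and nothing about MemoryPool alignment, granularity, or other allocator overhead is modeled. Its result is therefore best read as a lower bound on the raw buffer bytes, not as a reproduction of the totals quoted above.

```cpp
#include <cstddef>

constexpr std::size_t kBlockElems = 512;  // elements per warp block (from the text)

constexpr std::size_t ceilDiv(std::size_t a, std::size_t b) {
    return (a + b - 1) / b;
}

// Hypothetical helper: raw sum of the ten buffer formulas in the table.
// elemBytes is 2 for U16 or 4 for U32; sig is assumed equal to elemBytes.
std::size_t admScratchBytes(std::size_t n, std::size_t elemBytes) {
    const std::size_t gsize = ceilDiv(n, kBlockElems);
    const std::size_t sig = elemBytes;
    std::size_t total = 0;
    total += gsize * 4;                              // d_signal_length_
    total += (gsize + 1) * 4;                        // d_output_lengths_
    total += gsize * elemBytes;                      // d_centers_
    total += ceilDiv(gsize * kBlockElems, 32) * 4;   // d_block_flags_
    total += n;                                      // d_codes_
    total += n * sig;                                // d_concat_signals_
    total += n * sig;                                // d_bit_signals_
    total += (gsize + 1) * 4;                        // d_loc_offset_
    total += (gsize + 1) * 4;                        // d_prefix_state_
    total += 4;                                      // d_overflow_flag_
    return total;
}
```

The three N-proportional buffers (d_codes_, d_concat_signals_, d_bit_signals_) dominate the sum; the per-block buffers contribute only a few bytes per 512 elements.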
The FZM stage header is 12 bytes:
num_elements is stored so that estimateOutputSizes() can return the exact decompressed byte count for inverse-pass output buffer allocation.
Not CUDA Graph compatible. A device-to-host synchronization occurs in every forward call to read the actual payload size. isGraphCompatible() returns false.
Bounded-diff constraint. Each GPU thread encodes kChunk = 16 elements into a fixed kChunk × kMaxSignalBytes byte buffer (32 B for U16, 64 B for U32). The number of signal bits required per element is ⌈diff / 126⌉, where diff is the element's absolute deviation from the warp-block center. If the total per-thread signal bits exceed the buffer size, the kernel writes out-of-bounds. In debug builds (NDEBUG not defined) an atomicOr sentinel detects this condition and compress_u16/compress_u32 throw std::runtime_error rather than silently corrupting data. ADM is designed for bounded quantization codes (output of LorenzoQuantStage or QuantizerStage) — arbitrary integer arrays with large value ranges are not supported.
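The bounded-diff constraint can be checked on the host before handing data to the stage. The following is a sketch under stated assumptions, not library code: signalBits and chunkFits are hypothetical names, the block center is supplied by the caller, kMaxSignalBytes is assumed equal to sizeof(T), and the buffer capacity is assumed to be kChunk × kMaxSignalBytes × 8 bits (signal bits packed eight per byte).

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kChunk = 16;           // elements per GPU thread (from the text)
constexpr std::uint64_t kDiffPerBit = 126;   // divisor in ceil(diff / 126)

// Signal bits for one element: ceil(|value - center| / 126).
inline std::uint64_t signalBits(std::uint64_t value, std::uint64_t center) {
    const std::uint64_t diff = value > center ? value - center : center - value;
    return (diff + kDiffPerBit - 1) / kDiffPerBit;
}

// Returns true if the signal bits of one thread's chunk (at most kChunk elements)
// fit in the assumed capacity of kChunk * sizeof(T) * 8 bits.
template <typename T>
bool chunkFits(const T* chunk, std::size_t count, std::uint64_t center) {
    std::uint64_t bits = 0;
    for (std::size_t i = 0; i < count && i < kChunk; ++i) {
        bits += signalBits(chunk[i], center);
    }
    return bits <= kChunk * sizeof(T) * 8;
}
```

Under these assumptions a uint16_t chunk has room for 256 signal bits, so an average deviation of more than 16 × 126 = 2016 per element would not fit. Running such a check over every chunk is one way to catch unsuitable inputs in release builds, where the debug sentinel is not available.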
Opaque output. The ADM payload is not a self-describing format — it cannot be decoded without the FZM header (which stores num_elements). The payload must always be wrapped in a Pipeline with writeToFile/decompressFromFile or paired with the pipeline decompress() that restores the header.
Dtype must be set before finalize(). setDtype() must be called before pipeline.finalize() so the correct element width is used for scratch sizing and type checking.
ADMStage is a direct port of the ADM encode/decode kernels in nv/adm/ of the MANS repository (Wenjing Huang, Jinwu Yang, JingKai Huang, Haoquan Long, Dingwen Tao, Guangming Tan). Kernel logic is unchanged; changes from the original are documented at the top of modules/transforms/adm/mapping_uint16.cu and mapping_uint32.cu.
Wenjing Huang, Jinwu Yang, JingKai Huang, Haoquan Long, Dingwen Tao, Guangming Tan. MANS: Multidimensional Adaptive Numerical Compressor for Scientific Data. https://github.com/hpdps-group/MANS
See THIRD_PARTY.md for the full BSD-3-Clause license text.