FZGPUModules 2.0
GPU-accelerated modular compression pipelines
Header: modules/coders/adaptive_bitpack/adaptive_bitpack_stage.h
Class: fz::AdaptiveBitpackStage<T>
Category: Coder (lossless)
Per-block adaptive fixed-rate bit-plane coder — the cuSZp lossless back-end's "plain" mode, as a modular stage. It partitions the signed input into fixed-size blocks and, for each block, stores only as many bit-planes as the largest magnitude in that block requires:
- Rate byte r: the bit width of the largest |value| in the block (0 if the block is all zeros).
- r > 0: a sign bitmap (ceil(block_size/8) bytes) followed by r bit-planes of ceil(block_size/8) bytes each. Bit j of byte k of plane p is bit p of |element 8k+j|.

Per-block payloads are concatenated using a device-wide exclusive scan of the per-block byte costs, preceded by the array of rate bytes. The archive carries no internal header of its own; block_size and num_elements live in the FZM stage header.
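The per-block layout described above can be illustrated with a host-side reference sketch. It is not the stage's CUDA kernel: the function names are hypothetical, the element type is fixed to int32_t, and it assumes the sign bitmap uses the same bit j of byte k indexing as the bit-planes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Magnitude in 64 bits, so that INT32_MIN (|v| = 2^31) is representable.
inline uint64_t magnitude(int32_t v) {
    const int64_t w = v;
    return w < 0 ? uint64_t(-w) : uint64_t(w);
}

// Rate r: bit width of the largest magnitude in the block (0 if all zeros).
inline uint8_t block_rate(const std::vector<int32_t>& block) {
    uint64_t max_mag = 0;
    for (int32_t v : block)
        if (magnitude(v) > max_mag) max_mag = magnitude(v);
    uint8_t r = 0;
    while (max_mag) { ++r; max_mag >>= 1; }
    return r;
}

// Payload for one block: sign bitmap, then r bit-planes of ceil(n/8) bytes each.
// An all-zero block (r == 0) has an empty payload.
inline std::vector<uint8_t> encode_block(const std::vector<int32_t>& block, uint8_t r) {
    if (r == 0) return {};
    const size_t plane_bytes = (block.size() + 7) / 8;
    std::vector<uint8_t> out((1 + size_t(r)) * plane_bytes, 0);
    for (size_t i = 0; i < block.size(); ++i) {
        const size_t k = i / 8, j = i % 8;           // element 8k+j
        const uint64_t mag = magnitude(block[i]);
        if (block[i] < 0) out[k] |= uint8_t(1u << j); // sign bitmap (assumed indexing)
        for (uint8_t p = 0; p < r; ++p)               // plane p holds bit p of |element|
            if ((mag >> p) & 1u) out[(1 + size_t(p)) * plane_bytes + k] |= uint8_t(1u << j);
    }
    return out;
}

// Inverse of encode_block for a block of n elements.
inline std::vector<int32_t> decode_block(const std::vector<uint8_t>& payload, uint8_t r, size_t n) {
    std::vector<int32_t> block(n, 0);
    if (r == 0) return block;
    const size_t plane_bytes = (n + 7) / 8;
    for (size_t i = 0; i < n; ++i) {
        const size_t k = i / 8, j = i % 8;
        uint64_t mag = 0;
        for (uint8_t p = 0; p < r; ++p)
            mag |= uint64_t((payload[(1 + size_t(p)) * plane_bytes + k] >> j) & 1u) << p;
        const bool negative = (payload[k] >> j) & 1u;
        block[i] = int32_t(negative ? -int64_t(mag) : int64_t(mag));
    }
    return block;
}
```

For an 8-element block {5, -3, 0, 1, 0, 0, 0, 0} this gives r = 3 and a 4-byte payload: one sign byte followed by three one-byte planes.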
Unlike cuSZp's fused single kernel, the offset scan here is an ordinary CUB DeviceScan rather than the cuSZp decoupled look-back scan: fusing predictor + quantizer + coder into one kernel (and lowering the scan to a single pass) is a job for the downstream compiler, not the stage.
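What the offset scan computes can be shown with a sequential host-side sketch. The stage itself uses a CUB DeviceScan on the GPU; the loop below is only a serial stand-in for an exclusive scan, and the names `block_cost`, `ArchiveLayout`, and `plan_layout` are hypothetical. It assumes plain mode, with one rate byte per block.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Payload bytes for one block: nothing for r == 0, otherwise a sign bitmap
// plus r bit-planes of ceil(block_size/8) bytes each.
inline size_t block_cost(uint8_t r, size_t block_size) {
    return r == 0 ? 0 : (1 + size_t(r)) * ((block_size + 7) / 8);
}

struct ArchiveLayout {
    std::vector<size_t> offsets;  // payload start of each block, relative to the payload area
    size_t total_bytes;           // rate-byte array plus all payloads
};

// Exclusive scan of the per-block costs: offsets[b] = sum of costs of blocks 0..b-1.
inline ArchiveLayout plan_layout(const std::vector<uint8_t>& rates, size_t block_size) {
    ArchiveLayout layout;
    layout.offsets.resize(rates.size());
    size_t running = 0;
    for (size_t b = 0; b < rates.size(); ++b) {
        layout.offsets[b] = running;
        running += block_cost(rates[b], block_size);
    }
    layout.total_bytes = rates.size() + running;  // one rate byte per block (plain mode)
    return layout;
}
```

The final running sum is the data-dependent payload size, which is why the total archive length is only known after the scan completes.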
T — signed element type: int16_t or int32_t (quantizer codes or block deltas).
| Setting | Purpose | Notes |
|---|---|---|
| setBlockSize(n) | Elements per block (fixed-rate granularity) | [1, 1024]; default 32 (cuSZp) |
| setOutlierSelection(b) | cuSZp2 per-block plain/outlier selection | off by default; see below |
With setOutlierSelection(true), each block independently chooses the cheaper of two encodings:
sizeof(T)-byte magnitude and pack only elements 1..n-1.

This targets non-sparse, high-smoothness data: with a block-local predictor the first element of each block is a delta-vs-0 (a full magnitude) that would inflate the whole block's bit width. Per-block metadata grows from 1 to 2 bytes ([rate][sel], where sel bit 0 = is-outlier and bits 1-2 = outlier byte count − 1). The mode is recorded in the FZM header, so a cold decompress selects the right path automatically.
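To make the trade-off concrete, here is a minimal sketch of a per-block selection under an assumed cost model. The model is an illustration, not the stage's exact accounting: it assumes the outlier encoding stores |element 0| in the fewest whole bytes (at least one) and then a sign bitmap plus bit-planes sized for elements 1..n-1. The sel byte follows the layout given above; `choose_encoding` and `Choice` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Number of bits needed to represent a magnitude (0 for mag == 0).
inline unsigned bit_width_of(uint64_t mag) {
    unsigned r = 0;
    while (mag) { ++r; mag >>= 1; }
    return r;
}

struct Choice {
    bool    outlier;  // true if the outlier encoding is cheaper
    uint8_t sel;      // bit 0 = is-outlier, bits 1-2 = outlier byte count - 1
    size_t  cost;     // payload bytes under the ASSUMED cost model
};

// first_mag: |element 0|; rest_mag: largest |element i| for i in 1..n-1.
inline Choice choose_encoding(uint64_t first_mag, uint64_t rest_mag, size_t block_size) {
    const size_t plane_bytes = (block_size + 7) / 8;
    const unsigned r_all  = bit_width_of(first_mag > rest_mag ? first_mag : rest_mag);
    const unsigned r_rest = bit_width_of(rest_mag);
    // Plain: sign bitmap + r_all planes (empty payload for an all-zero block).
    const size_t plain = r_all == 0 ? 0 : (1 + size_t(r_all)) * plane_bytes;
    // Outlier (assumed): bytes for |element 0|, then sign bitmap + r_rest planes.
    const unsigned outlier_bytes = first_mag == 0 ? 1u : (bit_width_of(first_mag) + 7u) / 8u;
    const size_t with_outlier = outlier_bytes + (1 + size_t(r_rest)) * plane_bytes;
    if (with_outlier < plain)
        return {true, uint8_t(1u | ((outlier_bytes - 1u) << 1)), with_outlier};
    return {false, 0, plain};
}
```

With block_size = 32, a first element of magnitude 1000 and remaining magnitudes up to 3, the plain encoding needs 44 bytes under this model while the outlier encoding needs 14, so the outlier path is chosen. If the first element has magnitude 2 instead, plain (12 bytes) beats outlier (13 bytes).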
cuSZp2 packs the per-block outlier metadata into a single byte — bit 7 = outlier flag, bits 5-6 = outlier byte count − 1, bits 0-4 = rate — which caps the per-block bit width at 31 (its decoder masks the rate with 0x1f). That is safe for cuSZp's own f32→quant→Lorenzo pipeline, but this stage is a general signed-integer coder: an int32 block can legitimately need a bit width of 32 (e.g. a magnitude of 2^31, as from INT32_MIN), which does not fit in 5 bits. To stay correct for arbitrary int32 input we use two bytes per block — a full 8-bit rate plus a separate selection byte — instead of cuSZp2's packed byte.
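The overflow can be seen in a few lines. The sketch below implements the two bit layouts exactly as described in the text; it is not code from cuSZp2 or from this stage, and all function and type names are hypothetical.

```cpp
#include <cstdint>

// One-byte layout: bit 7 = outlier flag, bits 5-6 = outlier byte count - 1,
// bits 0-4 = rate. Only 5 bits are available for the rate.
inline uint8_t pack_one_byte(unsigned rate, bool outlier, unsigned outlier_bytes) {
    const unsigned count_bits = outlier ? ((outlier_bytes - 1u) & 0x3u) : 0u;
    return uint8_t((outlier ? 0x80u : 0u) | (count_bits << 5) | (rate & 0x1fu));
}

// Decoder side of the one-byte layout: the rate is masked with 0x1f.
inline unsigned rate_from_one_byte(uint8_t meta) { return meta & 0x1fu; }

// Two-byte layout: a full 8-bit rate plus a separate selection byte
// (bit 0 = is-outlier, bits 1-2 = outlier byte count - 1).
struct TwoByteMeta { uint8_t rate; uint8_t sel; };

inline TwoByteMeta pack_two_bytes(unsigned rate, bool outlier, unsigned outlier_bytes) {
    const unsigned sel = outlier ? (1u | (((outlier_bytes - 1u) & 0x3u) << 1)) : 0u;
    return {uint8_t(rate), uint8_t(sel)};
}
```

A rate of 31 survives the one-byte round trip, but a rate of 32 (binary 100000) is masked to 0, so a block containing INT32_MIN would be decoded as all zeros. The two-byte layout stores 32 unchanged.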
The cost is one extra metadata byte per block, which slightly lowers the compression ratio versus the reference on outlier-heavy data (measured: CLDHGH at abs eb=1e-3 → 8.49x here vs 9.09x for reference cuSZp2; plain mode matches the reference, 3.88x vs 3.88x). The error bound is respected identically. If exact cuSZp2 ratio parity is needed, the packed 1-byte layout can be restored for int16 (rate ≤ 16 always fits) or for int32 with a rate-overflow sentinel.
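As a rough plausibility check on those figures, the effect of one extra metadata byte per block can be estimated from the reference ratio alone. The helper below is a hypothetical back-of-the-envelope calculation; it assumes 4-byte input elements, the default block size of 32 (so 128 raw bytes per block), and that everything else about the two archives is identical.

```cpp
// Compression ratio after adding extra_bytes of metadata to every block,
// given the ratio without them. raw_block_bytes is the uncompressed size of
// one block (ASSUMED: 32 elements x 4 bytes = 128 in the usage note below).
inline double ratio_with_extra_metadata(double base_ratio,
                                        double raw_block_bytes,
                                        double extra_bytes) {
    const double compressed_per_block = raw_block_bytes / base_ratio;
    return raw_block_bytes / (compressed_per_block + extra_bytes);
}
```

Under those assumptions, `ratio_with_extra_metadata(9.09, 128.0, 1.0)` comes out at roughly 8.49, which is in line with the measured drop.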
Single input → single output.
| Direction | Port | Type |
|---|---|---|
| Forward in / inverse out | "output" | T[n] (signed codes) |
| Forward out / inverse in | "output" | uint8_t[] (archive) |
isGraphCompatible() is true for the forward (compress) path, false for the inverse. The archive length is still data-dependent (cuSZp's cmpSize), but the host readback of the scanned total payload is deferred to postStreamSync() — which the pipeline calls after the launch and a full stream sync, outside any capture window — so execute() enqueues only stream-ordered device work. The per-block cost/offset scratch is kept persistent (grown lazily, freed in the destructor) so the readback can happen post-sync and no allocation occurs inside a captured graph replay. This mirrors RZEStage's forward path. The inverse keeps a per-execute layout and is left out of graph capture.
This means the full cuSZp pipelines (Quantizer(linear) → Lorenzo(block)/TiledLorenzo → AdaptiveBitpack) can be captured and replayed as a CUDA graph on the compress side. See examples/cuszp_variants.cpp for a PREALLOCATE-vs-GRAPH benchmark.
block_size on the coder need not match the Lorenzo block, but for faithful cuSZp both are 32.
The per-block fixed-rate ("fixed-length") bit-plane scheme originates in cuSZp (Yafan Huang et al., SC'23); the per-block plain/outlier selection is the cuSZp2 contribution (SC'24). This stage is a direct port of the cuSZp fixed-length encode/decode kernel logic, re-expressed one-thread-per-block with a byte-granular layout and a CUB DeviceScan for per-block offsets (in place of cuSZp's fused decoupled look-back scan); MemoryPool integration and FZM scaffolding are FZGPUModules code. cuSZp is BSD-3-Clause (copyright reproduced verbatim in THIRD_PARTY.md). Repo: https://github.com/szcompressor/cuSZp. See THIRD_PARTY.md and memory/cuszp_stages.md.