FZGPUModules 2.0
GPU-accelerated modular compression pipelines
AdaptiveBitpackStage

Header: modules/coders/adaptive_bitpack/adaptive_bitpack_stage.h
Class: fz::AdaptiveBitpackStage<T>
Category: Coder (lossless)


What it does

Per-block adaptive fixed-rate bit-plane coder — the cuSZp lossless back-end's "plain" mode, as a modular stage. It partitions the signed input into fixed-size blocks and, for each block, stores only as many bit-planes as the largest magnitude in that block requires:

  • rate byte r — the bit width of the largest |value| in the block (0 if the block is all zeros).
  • when r > 0: a sign bitmap (ceil(block_size/8) bytes) followed by r bit-planes of ceil(block_size/8) bytes each. Bit j of byte k of plane p is bit p of |element 8k+j|.
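The block layout above can be sketched on the host as follows. This is a minimal CPU reference, not the stage's CUDA kernel: the names `magnitude`, `block_rate` and `encode_block_payload` are hypothetical, T is fixed to int32_t, and two details the text does not state are assumed here, namely that the sign bitmap uses the same bit order as the planes (bit j of byte k is the sign of element 8k+j) and that planes are stored lowest bit first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// |v| computed in unsigned arithmetic, so INT32_MIN maps to 2^31 without overflow.
inline uint32_t magnitude(int32_t v) {
    const uint32_t u = static_cast<uint32_t>(v);
    return v < 0 ? 0u - u : u;
}

// Rate byte: bit width of the largest |value| in the block (0 if all zeros).
inline uint8_t block_rate(const int32_t* v, std::size_t n) {
    uint32_t max_mag = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const uint32_t m = magnitude(v[i]);
        if (m > max_mag) max_mag = m;
    }
    uint8_t r = 0;
    while (max_mag != 0) { ++r; max_mag >>= 1; }
    return r;
}

// Payload for one block: empty when r == 0, otherwise a sign bitmap followed
// by r bit-planes, each ceil(n/8) bytes.
// ASSUMPTION: sign bitmap bit order and plane order (plane 0 first) are guesses.
inline std::vector<uint8_t> encode_block_payload(const int32_t* v, std::size_t n,
                                                 uint8_t r) {
    std::vector<uint8_t> out;
    if (r == 0) return out;
    const std::size_t plane_bytes = (n + 7) / 8;
    out.assign(plane_bytes * (1 + static_cast<std::size_t>(r)), 0);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t k = i / 8;   // byte index within a plane
        const std::size_t j = i % 8;   // bit index within that byte
        const uint8_t bit = static_cast<uint8_t>(1u << j);
        if (v[i] < 0) out[k] |= bit;   // sign bitmap occupies the first plane_bytes
        const uint32_t m = magnitude(v[i]);
        for (uint8_t p = 0; p < r; ++p) {
            // Bit j of byte k of plane p is bit p of |element 8k+j|.
            if ((m >> p) & 1u) out[plane_bytes * (1 + p) + k] |= bit;
        }
    }
    return out;
}
```

Under those assumptions, the 4-element block {3, -1, 0, 4} has rate 3 and a 4-byte payload 0x02 0x03 0x01 0x08 (the sign bitmap, then planes 0 to 2).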

Per-block payloads are concatenated using a device-wide exclusive scan of the per-block byte costs, preceded by the array of rate bytes. The archive carries no internal header of its own — block_size and num_elements live in the FZM stage header.
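The offset computation can be illustrated with a host-side running sum standing in for the device-wide exclusive scan. This is a sketch only: `block_cost`, `ArchiveLayout` and `plan_archive` are hypothetical names, plain mode (one rate byte per block) is assumed, and so is the absence of any padding between the rate array and the first payload.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Payload bytes for one block in plain mode: nothing for rate 0,
// otherwise one sign bitmap plus `rate` bit-planes of ceil(block_size/8) bytes.
inline std::size_t block_cost(uint8_t rate, std::size_t block_size) {
    const std::size_t plane_bytes = (block_size + 7) / 8;
    return rate == 0 ? 0 : plane_bytes * (1 + static_cast<std::size_t>(rate));
}

struct ArchiveLayout {
    std::vector<std::size_t> payload_offset;  // byte offset of each block's payload
    std::size_t total_bytes = 0;              // rate array + all payloads
};

// Exclusive scan of per-block costs, seeded with the size of the rate array
// that precedes the payloads (ASSUMPTION: no padding or alignment in between).
inline ArchiveLayout plan_archive(const std::vector<uint8_t>& rates,
                                  std::size_t block_size) {
    ArchiveLayout layout;
    layout.payload_offset.resize(rates.size());
    std::size_t running = rates.size();  // payloads start after the rate bytes
    for (std::size_t b = 0; b < rates.size(); ++b) {
        layout.payload_offset[b] = running;        // sum of costs of blocks < b
        running += block_cost(rates[b], block_size);
    }
    layout.total_bytes = running;
    return layout;
}
```

With rates {3, 0, 1} and a block size of 32, the per-block costs are 16, 0 and 8 bytes, so the payloads start at offsets 3, 19 and 19 and the archive is 27 bytes long. An all-zero block costs nothing beyond its rate byte, which is why two blocks can share an offset.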

cuSZp fuses the whole pipeline into a single kernel and computes per-block offsets with a decoupled look-back scan. This stage instead uses an ordinary CUB DeviceScan: fusing predictor + quantizer + coder into one kernel (and lowering the scan to a single pass) is a job for the downstream compiler, not the stage.


Template parameter

T — signed element type: int16_t or int32_t (quantizer codes or block deltas).


Stage settings

| Setting | Purpose | Notes |
|---|---|---|
| setBlockSize(n) | Elements per block (fixed-rate granularity) | [1, 1024]; default 32 (cuSZp) |
| setOutlierSelection(b) | cuSZp2 per-block plain/outlier selection | off by default; see below |

auto* ab = p.addStage<AdaptiveBitpackStage<int32_t>>();
ab->setBlockSize(32);
ab->setOutlierSelection(true); // cuSZp2 mode (optional)

Outlier selection (cuSZp2)

With setOutlierSelection(true), each block independently chooses the cheaper of two encodings:

  • plain — pack all elements (as above), or
  • outlier — store element 0 separately as a raw 1..sizeof(T)-byte magnitude and pack only elements 1..n-1.

This targets non-sparse, high-smoothness data: with a block-local predictor the first element of each block is a delta-vs-0 (a full magnitude) that would inflate the whole block's bit width. Per-block metadata grows from 1 to 2 bytes ([rate][sel], where sel bit 0 = is-outlier and bits 1-2 = outlier byte count − 1). The mode is recorded in the FZM header, so a cold decompress selects the right path automatically.
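The selection byte described above can be sketched as a pair of pack/unpack helpers. The function names are hypothetical, and two points are assumptions rather than documented behaviour: the outlier is stored in the smallest number of bytes that holds its magnitude (at least 1), and a plain block writes sel = 0.

```cpp
#include <cassert>
#include <cstdint>

// Smallest byte count in 1..4 that holds the magnitude (ASSUMPTION: the stage
// picks the minimum; the text only says "1..sizeof(T) bytes").
inline uint8_t outlier_byte_count(uint32_t mag) {
    uint8_t n = 1;
    while (n < 4 && (mag >> (8u * n)) != 0) ++n;
    return n;
}

// sel bit 0 = is-outlier, bits 1-2 = outlier byte count - 1.
// ASSUMPTION: plain blocks leave every bit clear.
inline uint8_t pack_sel(bool is_outlier, uint8_t outlier_bytes) {
    if (!is_outlier) return 0;
    return static_cast<uint8_t>(1u | ((outlier_bytes - 1u) << 1));
}

inline bool sel_is_outlier(uint8_t sel) { return (sel & 1u) != 0; }

inline uint8_t sel_outlier_bytes(uint8_t sel) {
    return static_cast<uint8_t>(((sel >> 1) & 3u) + 1u);
}
```

For example, a first element with magnitude 0x1234 needs 2 bytes, so an outlier block would carry sel = 0x03; the rate byte next to it then describes only elements 1..n-1.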

Metadata difference from cuSZp2 (intentional)

cuSZp2 packs the per-block outlier metadata into a single byte — bit 7 = outlier flag, bits 5-6 = outlier byte count − 1, bits 0-4 = rate — which caps the per-block bit width at 31 (its decoder masks the rate with 0x1f). That is safe for cuSZp's own f32→quant→Lorenzo pipeline, but this stage is a general signed-integer coder: an int32 block can legitimately need a bit width of 32 (e.g. a magnitude of 2^31, as from INT32_MIN), which does not fit in 5 bits. To stay correct for arbitrary int32 input we use two bytes per block — a full 8-bit rate plus a separate selection byte — instead of cuSZp2's packed byte.
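The overflow is easy to reproduce on the host. The sketch below follows the bit layout as described in the paragraph above, not the cuSZp2 source, and the names `pack_cuszp2_style`, `unpack_rate_cuszp2_style` and `BlockMeta` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Single-byte layout as described above:
// bit 7 = outlier flag, bits 5-6 = outlier byte count - 1, bits 0-4 = rate.
inline uint8_t pack_cuszp2_style(bool outlier, uint8_t outlier_bytes, uint8_t rate) {
    return static_cast<uint8_t>((outlier ? 0x80u : 0u) |
                                (((outlier_bytes - 1u) & 3u) << 5) |
                                (rate & 0x1fu));
}

// The decoder masks the rate with 0x1f, so only 0..31 survive a round trip.
inline uint8_t unpack_rate_cuszp2_style(uint8_t meta) {
    return static_cast<uint8_t>(meta & 0x1fu);
}

// Two-byte layout used by this stage: a full 8-bit rate plus a selection byte.
struct BlockMeta {
    uint8_t rate;  // 0..32 for int32 input
    uint8_t sel;   // bit 0 = is-outlier, bits 1-2 = outlier byte count - 1
};
```

A rate of 31 round-trips through the packed byte, but a rate of 32 (binary 100000) comes back as 0, which would decode a block containing INT32_MIN as all zeros. Storing the rate in its own byte avoids that case.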

The cost is one extra metadata byte per block, which slightly lowers the compression ratio versus the reference on outlier-heavy data (measured: CLDHGH at abs eb=1e-3 → 8.49x here vs 9.09x for reference cuSZp2; plain mode matches the reference, 3.88x vs 3.88x). The error bound is respected identically. If exact cuSZp2 ratio parity is needed, the packed 1-byte layout can be restored for int16 (rate ≤ 16 always fits) or for int32 with a rate-overflow sentinel.
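As a rough consistency check on the outlier-mode figures, the gap is about what one extra byte per block would predict. The arithmetic below assumes a block size of 32, 4-byte input elements, and that the extra metadata byte is the only difference between the two archives; none of these is stated for the measurement itself.

```latex
\begin{aligned}
\text{raw bytes per block} &= 32 \times 4 = 128 \\
\text{reference compressed bytes per block} &\approx \frac{128}{9.09} \approx 14.08 \\
\text{with one extra metadata byte} &\approx 14.08 + 1 = 15.08 \\
\text{predicted ratio} &\approx \frac{128}{15.08} \approx 8.49
\end{aligned}
```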


Ports

Single input → single output.

| Direction | Port | Type |
|---|---|---|
| Forward in / inverse out | "output" | T[n] (signed codes) |
| Forward out / inverse in | "output" | uint8_t[] (archive) |

Graph compatibility

isGraphCompatible() is true for the forward (compress) path and false for the inverse. The archive length is still data-dependent (cuSZp's cmpSize), but the host readback of the scanned total payload size is deferred to postStreamSync(), which the pipeline calls after the launch and a full stream sync, outside any capture window. As a result, execute() enqueues only stream-ordered device work. The per-block cost/offset scratch is persistent (grown lazily, freed in the destructor), so the readback can happen post-sync and no allocation occurs inside a captured graph replay. This mirrors RZEStage's forward path. The inverse keeps a per-execute layout and is left out of graph capture.

This means the full cuSZp pipelines (Quantizer(linear) → Lorenzo(block)/TiledLorenzo → AdaptiveBitpack) can be captured and replayed as a CUDA graph on the compress side. See examples/cuszp_variants.cpp for a PREALLOCATE-vs-GRAPH benchmark.


Typical pipeline (cuSZp-style)

auto* quant = p.addStage<QuantizerStage<float, uint32_t>>();
quant->setErrorBound(1e-3f);
quant->setErrorBoundMode(ErrorBoundMode::ABS);
quant->setLinearMode(true); // signed INT32 codes, no outliers
auto* lrz = p.addStage<LorenzoStage<int32_t>>();
lrz->setBlockSize(32); // block-local 1-D delta
auto* ab = p.addStage<AdaptiveBitpackStage<int32_t>>();
ab->setBlockSize(32);
p.connect(lrz, quant, "codes");
p.connect(ab, lrz);
p.finalize();

block_size on the coder need not match the Lorenzo block size, but a faithful cuSZp reproduction sets both to 32.


TOML

[[stage]]
type = "AdaptiveBitpack"
input_type = "int32" # or "int16"
block_size = 32
outlier_selection = false # true = cuSZp2 per-block plain/outlier selection

Acknowledgements

The per-block fixed-rate ("fixed-length") bit-plane scheme originates in cuSZp (Yafan Huang et al., SC'23); the per-block plain/outlier selection is the cuSZp2 contribution (SC'24). This stage is a direct port of the cuSZp fixed-length encode/decode kernel logic, re-expressed one-thread-per-block with a byte-granular layout and a CUB DeviceScan for per-block offsets (in place of cuSZp's fused decoupled look-back scan); MemoryPool integration and FZM scaffolding are FZGPUModules code. cuSZp is BSD-3-Clause (copyright reproduced verbatim in THIRD_PARTY.md). Repo: https://github.com/szcompressor/cuSZp. See THIRD_PARTY.md and memory/cuszp_stages.md.