FZGPUModules 2.0
GPU-accelerated modular compression pipelines
BitplaneRZEStage ports the lossless back-end of the FZ-GPU compressor (Zhang et al., HPDC '23) as a single fused pipeline stage: a bitplane transpose plus zero-group elimination over uint16_t quantizer codes.
Header: bitplane_rze_stage.h · Type id: StageType::BITPLANE_RZE = 23
In one shared-memory pass per 4096-byte chunk (a 32×32 grid of uint32_t words), the forward kernel:
- __ballot_sync transposes the 32×32 bit matrix of each chunk. This is a bitshuffle at 4-byte element width: bit i of all 32 words in a row is gathered into one output word.
- atomicAdd reserves each block's slice of the output bitstream.

Input is interpreted as uint16_t symbols packed two per word, matching FZ-GPU's quantizer codes. The transpose is built in, so a separate BitshuffleStage is not placed in front of it.
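The transpose step can be illustrated on the CPU. The sketch below handles a single row of 32 words and is a plain reference loop, not the fused kernel: the function name `transpose_bits_32x32` is hypothetical, and the assumption that lane j holds input word j (so bit j of each output word comes from word j) is an illustrative choice consistent with how a warp-wide ballot packs its result.

```cpp
#include <array>
#include <cstdint>

// CPU reference for one 32x32 bit-matrix transpose (hypothetical helper).
// out[i] gathers bit i of every input word: bit j of out[i] == bit i of in[j].
// A warp-wide ballot on bit i, with lane j holding in[j], yields the same word.
std::array<std::uint32_t, 32> transpose_bits_32x32(
    const std::array<std::uint32_t, 32>& in) {
    std::array<std::uint32_t, 32> out{};
    for (unsigned i = 0; i < 32; ++i) {          // bit position = output word index
        std::uint32_t gathered = 0;
        for (unsigned j = 0; j < 32; ++j) {      // input word index ("lane")
            gathered |= ((in[j] >> i) & 1u) << j;
        }
        out[i] = gathered;
    }
    return out;
}
```

Applying the function twice returns the original words, which is a convenient sanity check. It also shows why the transpose helps: if only the low bits of the input words are ever set, most output words are entirely zero and can be dropped by the zero-elimination step.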
The output is a self-describing byte archive (uint8_t).
This stage is functionally close to BitshuffleStage(element_width=4) → RZEStage(levels=1), but it is not equivalent, and it adds no new functionality the existing stages lack. It earns its place on two grounds — fidelity (it reproduces FZ-GPU's exact codec and archive) and throughput:
| | BitplaneRZEStage (fused) | Bitshuffle(ew=4) → RZE(levels=1) |
|---|---|---|
| Memory traffic | Transpose result stays in shared memory; only the compacted bitstream reaches DRAM. One pass. | Full transposed buffer is materialized in global memory, read back, then zero-eliminated. ~2× the global-memory traffic for this segment. |
| Zero-elim granularity | Compacts 4-byte groups | RZE compacts single bytes, and can recurse (levels 2–4). Different CR on identical input. |
| Archive format | FZ-GPU's exact self-describing archive (per-block atomic start positions + 128-byte header) | The generic Bitshuffle + RZE stream layout |
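To make the "zero-elim granularity" row concrete, here is a minimal host-side sketch of zero elimination at 4-byte granularity. It is an illustration only: the `CompactedWords` struct, both function names, and the bitmap-plus-payload layout are assumptions chosen for clarity, and they do not reproduce FZ-GPU's archive format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: one presence bit per input word plus the kept words.
struct CompactedWords {
    std::vector<std::uint8_t> flags;     // bit (i % 8) of flags[i / 8] is 1 if word i is nonzero
    std::vector<std::uint32_t> payload;  // nonzero words, in original order
    std::size_t count = 0;               // number of input words
};

// Drop all-zero 4-byte groups, remembering their positions in a bitmap.
CompactedWords eliminate_zero_words(const std::vector<std::uint32_t>& words) {
    CompactedWords c;
    c.count = words.size();
    c.flags.assign((words.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (words[i] != 0) {
            c.flags[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
            c.payload.push_back(words[i]);
        }
    }
    return c;
}

// Inverse: re-insert zeros wherever the bitmap has a 0 bit.
std::vector<std::uint32_t> restore_zero_words(const CompactedWords& c) {
    std::vector<std::uint32_t> words(c.count, 0);
    std::size_t next = 0;
    for (std::size_t i = 0; i < c.count; ++i) {
        if (c.flags[i / 8] & (1u << (i % 8))) words[i] = c.payload[next++];
    }
    return words;
}
```

A byte-granular scheme would instead test each of the four bytes of a word separately, so a word such as 0x00000005 keeps one byte rather than four. That difference is why the two approaches yield different compression ratios on identical input.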
None. The stage is hard-wired to uint16_t input, matching the reference kernels (using E = uint16_t). uint8_t / uint32_t support would require new kernels.
None. The chunk layout (pad_len, chunk_size, grid_x) is fully derived from the input length. Inputs shorter than a 4096-byte multiple are zero-padded inside the stage; the original length is recorded so decode trims back to it.
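The padding rule above amounts to rounding the input size up to the next 4096-byte multiple. The following sketch shows that arithmetic; the `ChunkLayout` struct, its field names, and `derive_layout` are hypothetical, and how they map onto the stage's own pad_len, chunk_size, and grid_x is an assumption rather than something taken from the source.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kChunkBytes = 4096;  // one chunk = 32 x 32 uint32_t words

// Hypothetical summary of the derived layout.
struct ChunkLayout {
    std::size_t original_len;  // input length in uint16_t symbols (pre-pad)
    std::size_t padded_bytes;  // input size rounded up to a 4096-byte multiple
    std::size_t num_chunks;    // number of 4096-byte chunks to process
};

// Derive the layout from the symbol count alone; no user parameters involved.
inline ChunkLayout derive_layout(std::size_t num_symbols) {
    const std::size_t bytes = num_symbols * sizeof(std::uint16_t);
    const std::size_t chunks = (bytes + kChunkBytes - 1) / kChunkBytes;  // ceiling division
    return ChunkLayout{num_symbols, chunks * kChunkBytes, chunks};
}
```

For example, 2049 symbols occupy 4098 bytes, which rounds up to two chunks (8192 bytes); decode would then trim the output back to the recorded 2049 symbols.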
Not graph-compatible in either direction. The forward pass does a blocking cudaStreamSynchronize + 4-byte D2H to read the compacted bitstream length before writing the archive header; the inverse pass does a 128-byte D2H to read the archive header before launching the decode kernel.
1 input (uint16_t codes) → 1 output (uint8_t archive); the inverse restores the uint16_t codes.

Wire it downstream of a quantizer's codes port: p.connect(bprze, quant, "codes").
The faithful FZ-GPU pipeline:
No tunable keys — the stage has no parameters.
Self-describing, with a 128-byte header at byte 0:
The FZM stage header stores only original_len (8 bytes, the pre-pad uint16 symbol count) so a cold decompress can size its output buffer.
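As a rough illustration of how an 8-byte original_len field lets a decoder size its output before touching the payload, consider the sketch below. The helper names are hypothetical, and storing the value as a `uint64_t` in host byte order is an assumption; the actual FZM header encoding is not specified here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Pack the pre-pad symbol count into an 8-byte field (host byte order assumed).
std::vector<std::uint8_t> pack_original_len(std::uint64_t original_len) {
    std::vector<std::uint8_t> field(sizeof(original_len));
    std::memcpy(field.data(), &original_len, sizeof(original_len));
    return field;
}

// Read the field back and return the bytes needed for the decoded uint16_t output.
std::size_t decoded_output_bytes(const std::vector<std::uint8_t>& field) {
    std::uint64_t original_len = 0;
    std::memcpy(&original_len, field.data(), sizeof(original_len));
    return static_cast<std::size_t>(original_len) * sizeof(std::uint16_t);
}
```

Because the count is in symbols rather than bytes, the decoder multiplies by `sizeof(std::uint16_t)` to obtain the buffer size.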
The encode/decode kernels are the FZ-GPU lossless codec (Boyuan Zhang, Jiannan Tian, et al., "FZ-GPU", HPDC '23), BSD-3-Clause, as vendored in the cuSZ repository. The stage wrapper, memory-pool integration, and padded-input handling are FZGPUModules code. See THIRD_PARTY.md.