FZGPUModules 2.0
GPU-accelerated modular compression pipelines
Header: modules/coders/rre/rre_stage.h
Class: fz::RREStage — no template parameters
Category: Coder (lossless)
Common instantiation:
Repetition-Reduction Encoding — the RRE lossless component of the LC framework, used by cuSZ-Hi's LC pipelines (the speed chain TCMS1 → BIT1 → RRE1 and the compression-ratio chain RRE4 → TCMS8 → RZE1).
Operates on a raw byte stream treated as word_size-byte words. Each chunk is processed in shared memory by one CUDA block:
[...] 0); emit a 1-bit-per-word bitmap.
[...] RZEStage's levels 2–4), so long runs of a repeated value collapse to a few bytes.
RRE is the repetition-eliminating sibling of RZEStage (which eliminates zeros). Use RRE where the stream contains long runs of an arbitrary repeated value, not just zeros.
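The bitmap idea described above can be illustrated with a small CPU reference. This is a minimal sketch, not the fz::RREStage implementation or its API: the names RreLevel, rre_mark_repeats, and rre_expand are hypothetical. It also assumes that a set bit marks a word equal to its predecessor, that the first word is compared against an all-zero word, and that bits are packed LSB-first. It covers only one level and omits the further compression of the bitmap itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// One level of repetition elimination over a chunk (hypothetical layout).
struct RreLevel {
    std::vector<std::uint8_t> bitmap;  // 1 bit per word, LSB-first in each byte
    std::vector<std::uint8_t> kept;    // bytes of the words that are not repeats
};

// Assumptions: bit = 1 means "same as the previous word"; the predecessor of
// word 0 is all zeros. Preconditions: word_size > 0 and n_bytes is a multiple
// of word_size.
inline RreLevel rre_mark_repeats(const std::uint8_t* data, std::size_t n_bytes,
                                 std::size_t word_size) {
    const std::size_t n_words = n_bytes / word_size;
    RreLevel out;
    out.bitmap.assign((n_words + 7) / 8, 0);
    std::vector<std::uint8_t> prev(word_size, 0);
    for (std::size_t i = 0; i < n_words; ++i) {
        const std::uint8_t* w = data + i * word_size;
        if (std::memcmp(w, prev.data(), word_size) == 0) {
            // Repeat: record a bit, store no payload.
            out.bitmap[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
        } else {
            // New value: keep the word itself.
            out.kept.insert(out.kept.end(), w, w + word_size);
        }
        std::memcpy(prev.data(), w, word_size);
    }
    return out;
}

// Inverse of rre_mark_repeats under the same assumptions.
inline std::vector<std::uint8_t> rre_expand(const RreLevel& lvl,
                                            std::size_t n_words,
                                            std::size_t word_size) {
    std::vector<std::uint8_t> out(n_words * word_size);
    std::vector<std::uint8_t> prev(word_size, 0);
    std::size_t k = 0;  // read offset into lvl.kept
    for (std::size_t i = 0; i < n_words; ++i) {
        const bool repeat = ((lvl.bitmap[i / 8] >> (i % 8)) & 1u) != 0;
        if (!repeat) {
            std::memcpy(prev.data(), lvl.kept.data() + k, word_size);
            k += word_size;
        }
        std::memcpy(out.data() + i * word_size, prev.data(), word_size);
    }
    return out;
}
```

With word_size = 1 and the input bytes 7, 7, 7, 0, 0, 5, the sketch keeps only 7, 0, 5 and sets bits 1, 2, and 4 of the bitmap. A long run therefore costs one kept word plus one bit per repeated word, before the bitmap is compressed further.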
word_size selects the LC RRE_1 / RRE_2 / RRE_4 / RRE_8 variant. The cuSZ-Hi chains use RRE1 (speed, on quant codes), RRE2 (anchor/outlier), and RRE4 (compression-ratio).
Requires input to be a multiple of chunk_size (16384) bytes. The pipeline pads automatically when an upstream byte-oriented stage uses a matching block size.
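When RREStage is fed directly rather than through a pipeline that pads for it, the caller has to round the buffer size up to the next chunk boundary. A minimal sketch of that arithmetic follows; kChunkSize and padded_size are hypothetical helper names, not part of the library, and the fill value for the padding bytes is not specified here.

```cpp
#include <cstddef>

// chunk_size as stated in the documentation above.
constexpr std::size_t kChunkSize = 16384;

// Round n_bytes up to the next multiple of chunk (hypothetical helper).
// The remainder form avoids overflow in an intermediate n_bytes + chunk - 1.
constexpr std::size_t padded_size(std::size_t n_bytes,
                                  std::size_t chunk = kChunkSize) {
    return (n_bytes % chunk == 0) ? n_bytes
                                  : n_bytes + (chunk - n_bytes % chunk);
}
```

For example, a 16385-byte input needs a 32768-byte buffer, while an input that is already 16384 bytes long needs no padding.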
Forward (compress) is CUDA-graph capturable. The inverse (decompress) path is not — it reads the stream header (original size, per-chunk sizes) with blocking device-to-host copies before launching the decode kernel. This mirrors RZEStage; decompress-only graph capture has no practical use case.
A chunk is stored verbatim (high-bit flag) when RRE fails to shrink it. A constant (all-repeating) chunk collapses to a 2-byte size tag.
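The fallback rule above can be sketched as a per-chunk tag decision. The exact tag layout is not given in this section, so the sketch assumes a 16-bit tag whose top bit is the verbatim flag and whose low 15 bits hold the stored payload size. The names kChunkBytes, kVerbatimFlag, ChunkTag, make_chunk_tag, and parse_chunk_tag are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kChunkBytes = 16384;       // chunk_size from the text
constexpr std::uint16_t kVerbatimFlag = 0x8000;  // assumed: high bit of the tag

struct ChunkTag {
    bool verbatim;                // true: payload is the raw chunk
    std::uint16_t payload_bytes;  // bytes stored after the tag
};

// Pick the tag for a chunk whose RRE-encoded payload takes encoded_bytes.
// If encoding does not shrink the chunk, fall back to verbatim storage.
constexpr std::uint16_t make_chunk_tag(std::size_t encoded_bytes) {
    return encoded_bytes < kChunkBytes
               ? static_cast<std::uint16_t>(encoded_bytes)
               : static_cast<std::uint16_t>(kVerbatimFlag | kChunkBytes);
}

// Split a tag back into its flag and size fields (same assumed layout).
constexpr ChunkTag parse_chunk_tag(std::uint16_t tag) {
    return ChunkTag{(tag & kVerbatimFlag) != 0,
                    static_cast<std::uint16_t>(tag & 0x7FFF)};
}
```

Under these assumptions a constant chunk has an empty payload, so only its 2-byte tag is stored, and an incompressible chunk costs the 16384 raw bytes plus the tag.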
The GPU kernels in RREStage (modules/coders/lc_common/lc_chunk_components.cuh) are a faithful port of d_RRE.h, d_repetition_elimination.h, and prefix_sum.h from the LC framework (Burtscher et al., Texas State University, BSD-3-Clause).
Noushin Azami, Alex Fallin, Brandon Burtchell, Andrew Rodriguez, Benila Jerald, Yiqian Liu, Anju Mongandampulath Akathoott, and Martin Burtscher. LC framework for synthesizing high-speed parallel lossless and error-bounded lossy data compression and decompression algorithms for CPUs and GPUs. https://github.com/burtscher/LC-framework
See THIRD_PARTY.md for the full license text.