FZGPUModules 2.0
GPU-accelerated modular compression pipelines
A GPU-accelerated, graph-based builder for composable compression pipelines in analytical workflows.
FZGPUModules is a CUDA library for building composable, high-throughput compression pipelines. Each pipeline is a directed acyclic graph (DAG) of stages - coders, predictors, quantizers, shufflers, transforms, fused stages, and external stages - connected and executed entirely on the GPU with stream-ordered memory management.
Key properties:
| Requirement | Minimum | Notes |
|---|---|---|
| CUDA Toolkit | 11.2+ | Stream-ordered allocator required |
| Host Compiler | GCC 7+ or Clang 5+ | Upper bound set by CUDA version — see NVIDIA release notes; NVHPC 23.11 tested in CI |
| C++ Standard | C++17 | |
| CMake | 3.24+ | |
| Host byte order | Little-endian | |
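The little-endian requirement in the table above can be checked on the host at startup. The following is a minimal sketch that makes no claim about how the library itself performs this check; `host_is_little_endian` is a hypothetical helper name. It avoids `std::endian`, which is only available from C++20.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of FZGPUModules): returns true when the host
// stores the least significant byte of a multi-byte integer first.
inline bool host_is_little_endian() {
    const std::uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];
    std::memcpy(bytes, &probe, sizeof probe);  // inspect the in-memory layout
    return bytes[0] == 0x04;                   // low byte first => little-endian
}
```

An application could call this once before constructing any pipeline and refuse to continue on a big-endian host.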
Note: on a vGPU, CUDA memory pool creation fails and the library automatically falls back to allocating with cudaMalloc. Results are still correct, but the performance benefits of the stream-ordered allocator are lost. For performance-critical workloads, avoid vGPU setups. Without stream-ordered allocator support, CUDA Graph capture is also unavailable in those environments.
For full build options (presets, examples/tests, install), see the Building from Source page.
See examples/ for more usage patterns including multi-branch pipelines, CUDA Graph capture, and the low-level DAG API.
For detailed per-stage documentation — constraints, behavioral rules, and extended usage notes — see the Stage Reference.
| Stage | Header | Description |
|---|---|---|
| LorenzoQuantStage<TInput, TCode> | modules/fused/lorenzo_quant/lorenzo_quant.h | Fused float predictor + quantizer (lossy) |
| LorenzoStage<T> | modules/predictors/lorenzo/lorenzo_stage.h | Plain integer Lorenzo predictor (lossless) |
| QuantizerStage<TInput, TCode> | modules/quantizers/quantizer/quantizer.h | Direct-value quantizer (ABS/REL/NOA) |
| RLEStage<T> | modules/coders/rle/rle.h | Run-length encoding |
| DifferenceStage<T, TOut> | modules/predictors/diff/diff.h | First-order difference / cumulative-sum coding |
| BitshuffleStage | modules/shufflers/bitshuffle/bitshuffle_stage.h | Bit-matrix transpose |
| RZEStage | modules/coders/rze/rze_stage.h | Recursive zero-byte elimination |
| ZigzagStage<TIn, TOut> | modules/transforms/zigzag/zigzag_stage.h | Zigzag encode/decode |
| NegabinaryStage<TIn, TOut> | modules/transforms/negabinary/negabinary_stage.h | Negabinary encode/decode |
| BitpackStage<T> | modules/coders/bitpack/bitpack_stage.h | Pack/unpack power-of-two value streams |
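To make two of the transforms in the table concrete, here is a host-side reference sketch of first-order difference coding followed by zigzag encoding. It only illustrates the general techniques on the CPU and is not the library's GPU implementation; the function names are hypothetical, and the exact bit layout used by DifferenceStage and ZigzagStage is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Zigzag maps signed values to unsigned so that small magnitudes get small codes:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
inline std::uint32_t zigzag_encode(std::int32_t v) {
    const std::uint32_t sign_mask = (v < 0) ? 0xFFFFFFFFu : 0u;
    return (static_cast<std::uint32_t>(v) << 1) ^ sign_mask;
}

inline std::int32_t zigzag_decode(std::uint32_t u) {
    return static_cast<std::int32_t>((u >> 1) ^ (0u - (u & 1u)));
}

// First-order difference: out[i] = in[i] - in[i-1], with in[-1] taken as 0.
// Unsigned arithmetic is used so that wrap-around is well defined.
inline std::vector<std::int32_t> diff_encode(const std::vector<std::int32_t>& in) {
    std::vector<std::int32_t> out(in.size());
    std::uint32_t prev = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::uint32_t cur = static_cast<std::uint32_t>(in[i]);
        out[i] = static_cast<std::int32_t>(cur - prev);
        prev = cur;
    }
    return out;
}

// Inverse of diff_encode: a running (cumulative) sum restores the input.
inline std::vector<std::int32_t> diff_decode(const std::vector<std::int32_t>& deltas) {
    std::vector<std::int32_t> out(deltas.size());
    std::uint32_t acc = 0;
    for (std::size_t i = 0; i < deltas.size(); ++i) {
        acc += static_cast<std::uint32_t>(deltas[i]);
        out[i] = static_cast<std::int32_t>(acc);
    }
    return out;
}
```

For a slowly varying input such as {100, 102, 101, 101, 105}, the differences are {100, 2, -1, 0, 4} and their zigzag codes are {200, 4, 1, 0, 8}: mostly small unsigned values, which is the kind of stream a later packing or zero-elimination coder can shrink.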
| Strategy | Description |
|---|---|
| MINIMAL | Allocate on demand, free at last consumer. Lowest peak GPU memory. |
| PREALLOCATE | Allocate everything at finalize(). Required for CUDA Graph capture. Enables buffer coloring for efficient buffer reuse. |
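The trade-off between the two strategies can be illustrated with a small host-side model. This is a conceptual sketch only, not the library's allocator: `BufferLifetime`, `peak_live_bytes`, and `total_bytes` are hypothetical names, and the model assumes stages run in a fixed linear order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one intermediate buffer in a linearized stage order.
struct BufferLifetime {
    std::size_t bytes;       // buffer size in bytes
    std::size_t first_step;  // step at which the producing stage runs
    std::size_t last_step;   // step at which the last consumer runs (inclusive)
};

// Peak bytes if each buffer is allocated at first_step and freed after last_step.
inline std::size_t peak_live_bytes(const std::vector<BufferLifetime>& bufs) {
    std::size_t steps = 0;
    for (const auto& b : bufs) {
        assert(b.first_step <= b.last_step);
        steps = std::max(steps, b.last_step + 1);
    }
    std::size_t peak = 0;
    for (std::size_t s = 0; s < steps; ++s) {
        std::size_t live = 0;
        for (const auto& b : bufs)
            if (b.first_step <= s && s <= b.last_step) live += b.bytes;
        peak = std::max(peak, live);
    }
    return peak;
}

// Bytes needed if every buffer gets its own up-front allocation, with no reuse.
inline std::size_t total_bytes(const std::vector<BufferLifetime>& bufs) {
    std::size_t sum = 0;
    for (const auto& b : bufs) sum += b.bytes;
    return sum;
}
```

For a three-stage chain with buffers of 100, 60, and 40 bytes whose lifetimes are steps 0-1, 1-2, and 2-3, `peak_live_bytes` gives 160 while `total_bytes` gives 200. The first and third buffers are never live at the same time, so a reuse scheme such as buffer coloring could let them share storage and bring an up-front allocation closer to the on-demand peak.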
If you want full memory control, use the caller-allocated overloads. This mirrors nvcomp-style APIs: you pre-allocate an output buffer and pass its capacity; the API returns the actual size.
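The capacity-in, size-out convention described above can be sketched with a toy host-side function. This is not the FZGPUModules API: `toy_rle_compress` and `toy_rle_max_size` are hypothetical names, and the byte-pair run-length format is chosen only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Worst case for this toy format: every input byte becomes a (count, value) pair.
inline std::size_t toy_rle_max_size(std::size_t in_size) { return 2 * in_size; }

// Hypothetical stand-in for a caller-allocated compress call.
// The caller owns `out` and passes its capacity; on success the function
// reports the number of bytes actually written through `out_size`.
// Returns false if `out_capacity` is too small for the encoded data.
inline bool toy_rle_compress(const std::uint8_t* in, std::size_t in_size,
                             std::uint8_t* out, std::size_t out_capacity,
                             std::size_t* out_size) {
    assert(out_size != nullptr);
    std::size_t written = 0;
    std::size_t i = 0;
    while (i < in_size) {
        const std::uint8_t value = in[i];
        std::size_t run = 1;
        while (i + run < in_size && in[i + run] == value && run < 255) ++run;
        if (written + 2 > out_capacity) return false;  // caller's buffer is too small
        out[written++] = static_cast<std::uint8_t>(run);
        out[written++] = value;
        i += run;
    }
    *out_size = written;
    return true;
}
```

A caller sizes the output with `toy_rle_max_size(in_size)`, calls `toy_rle_compress`, and then uses only the first `*out_size` bytes. The same shape applies when the buffers live in device memory: the capacity is an upper bound supplied by the caller, and the actual size is reported back.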
For decompression, size the output from the original input or from the FZM header:
See examples/ownership_example.cpp for a minimal end-to-end example.
For throughput-critical workloads, enable CUDA Graph capture to eliminate CPU-side kernel launch overhead on repeated compress calls:
Call compress() only after captureGraph(); use the same stream for capture and replay.
For complex pipelines, you can also load the stage graph from a TOML config file:
From C++, the Pipeline::loadFromConfig() API loads a config file. The config schema supports arbitrary DAGs.
See examples/presets/ for pre-built reference pipeline configurations, and the Config File Reference for the full config schema.
FZM files embed the full stage configuration and compressed payload with CRC32 checksums. See the FZM File Format page for the full specification.
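As an illustration of how a CRC32 checksum guards a payload, here is a minimal bitwise sketch of the common CRC-32 variant used by zlib and PNG (reflected polynomial 0xEDB88320). Whether FZM uses this exact variant, and which byte ranges it covers, is an assumption; the FZM File Format page is the authority. `crc32_ieee` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32: reflected polynomial 0xEDB88320, initial value and final XOR
// of 0xFFFFFFFF. Slow but compact; table-driven versions trade memory for speed.
inline std::uint32_t crc32_ieee(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            const std::uint32_t mask = 0u - (crc & 1u);  // all ones if the low bit is set
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for this variant is 0xCBF43926 for the nine ASCII bytes "123456789". A reader would recompute the checksum over the bytes it loaded and compare it with the stored value, rejecting the file on a mismatch.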
Each Pipeline must be used from a single host thread. There is no internal locking.
Safe — run one independent pipeline per thread:
Not safe — two threads sharing one pipeline:
The library has no global mutable state. The FZ_LOG logger singleton is set once at startup; do not change log level or callback while pipelines are running on other threads.
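The one-pipeline-per-thread rule can be sketched with standard C++ threads. `ToyPipeline` below is a hypothetical stand-in that only models an object with unsynchronized internal state; it is not the FZGPUModules Pipeline class and its `compress` method does no real work.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for an object with no internal locking.
struct ToyPipeline {
    std::size_t calls = 0;  // mutated on every call without synchronization
    std::size_t compress(std::size_t input_bytes) {
        ++calls;
        return input_bytes / 2;  // placeholder result
    }
};

// Safe pattern: each worker constructs and uses its own ToyPipeline,
// so no pipeline state is ever touched by two threads.
inline std::vector<std::size_t> run_one_pipeline_per_thread(std::size_t num_threads,
                                                            std::size_t calls_per_thread) {
    std::vector<std::size_t> call_counts(num_threads, 0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([t, calls_per_thread, &call_counts] {
            ToyPipeline pipeline;  // owned by this thread only
            for (std::size_t i = 0; i < calls_per_thread; ++i) pipeline.compress(1024);
            call_counts[t] = pipeline.calls;  // each thread writes only its own slot
        });
    }
    for (auto& w : workers) w.join();
    return call_counts;
}
```

If a single `ToyPipeline` were instead captured by reference in every worker, the unsynchronized `++calls` would be a data race. That is the shape of the unsafe pattern: sharing one pipeline between threads requires external synchronization by the caller, since the object provides none.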
If you reference this work, please cite:
Note: this paper describes the 1.0 release of the library; the 2.0 API and documentation may differ.
[DRBSD-11] FZModules: A Heterogeneous Computing Framework for Customizable Scientific Data Compression Pipelines