GPU-accelerated, graph-based, composable compression pipeline builder for analytical workflows.
Overview
FZGPUModules is a CUDA library for building composable, high-throughput compression pipelines. Each pipeline is a directed acyclic graph (DAG) of stages - coders, predictors, quantizers, shufflers, transforms, fused stages, and external stages - connected and executed entirely on the GPU with stream-ordered memory management.
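To make the DAG model concrete, here is a minimal host-side sketch of how stages in a dependency graph can be grouped into levels, where stages on the same level do not depend on each other and could run in parallel. The function name `compute_levels` and the edge-list representation are assumptions for illustration only; they are not part of the FZGPUModules API, and the library's own scheduler may work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assigns each stage of a DAG to a level: a stage's level is one more than
// the deepest level among its producers. Stages sharing a level have no
// dependency on each other. Returns an empty vector if the edges form a cycle.
// Hypothetical helper for illustration; not part of the FZGPUModules API.
std::vector<int> compute_levels(
    std::size_t stage_count,
    const std::vector<std::pair<std::size_t, std::size_t>>& edges) {
    std::vector<std::vector<std::size_t>> consumers(stage_count);
    std::vector<std::size_t> pending(stage_count, 0);  // unresolved producers
    for (const auto& e : edges) {
        assert(e.first < stage_count && e.second < stage_count);
        consumers[e.first].push_back(e.second);
        ++pending[e.second];
    }

    std::vector<int> level(stage_count, 0);
    std::vector<std::size_t> ready;
    for (std::size_t s = 0; s < stage_count; ++s)
        if (pending[s] == 0) ready.push_back(s);

    std::size_t visited = 0;
    while (!ready.empty()) {
        const std::size_t s = ready.back();
        ready.pop_back();
        ++visited;
        for (std::size_t c : consumers[s]) {
            if (level[s] + 1 > level[c]) level[c] = level[s] + 1;
            if (--pending[c] == 0) ready.push_back(c);
        }
    }
    if (visited != stage_count) return {};  // cycle: not a valid pipeline graph
    return level;
}
```

For a diamond-shaped graph (one source feeding two branches that join in one sink), the two branch stages land on the same level.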
Key properties:
- Modular — mix and match stages (Lorenzo, G-Interp, Quantizer, ADM, RLE, RZE, RRE, Bitshuffle, Huffman, ANS, …)
- High throughput — parallel level execution, persistent scratch, CUDA Graph support
- Memory-efficient — MINIMAL and PREALLOCATE strategies; buffer coloring to alias non-overlapping allocations
- File format — FZM format with CRC32 checksums and full stage config serialization
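The buffer coloring idea mentioned above can be sketched on the host as a greedy interval assignment: buffers whose lifetimes do not overlap receive the same color and can share one allocation. The types `BufferLifetime` and `ColoringResult` and the function `color_buffers` are hypothetical names for this sketch; the library's actual coloring algorithm is not described here and may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Lifetime of one intermediate buffer in execution-level indices (inclusive).
struct BufferLifetime {
    int first_use;
    int last_use;
    std::size_t bytes;
};

struct ColoringResult {
    std::vector<int> color;               // color[i] = shared slot of buffer i
    std::vector<std::size_t> slot_bytes;  // size each shared slot must have
};

// Greedy coloring: visit buffers by first use and reuse the first slot whose
// current occupant is already dead. Hypothetical sketch, not the library code.
ColoringResult color_buffers(const std::vector<BufferLifetime>& buffers) {
    std::vector<std::size_t> order(buffers.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return buffers[a].first_use < buffers[b].first_use;
                     });

    ColoringResult result;
    result.color.assign(buffers.size(), -1);
    std::vector<int> busy_until;  // last_use of each slot's current occupant

    for (std::size_t idx : order) {
        const BufferLifetime& buf = buffers[idx];
        assert(buf.first_use <= buf.last_use);
        std::size_t slot = 0;
        while (slot < busy_until.size() && busy_until[slot] >= buf.first_use)
            ++slot;
        if (slot == busy_until.size()) {  // no free slot: open a new one
            busy_until.push_back(buf.last_use);
            result.slot_bytes.push_back(buf.bytes);
        } else {                          // alias onto a dead buffer's slot
            busy_until[slot] = buf.last_use;
            result.slot_bytes[slot] = std::max(result.slot_bytes[slot], buf.bytes);
        }
        result.color[idx] = static_cast<int>(slot);
    }
    return result;
}
```

With four chained buffers of 100, 50, 80, and 200 bytes where each one is only live across two adjacent levels, this sketch needs two slots totalling 300 bytes instead of 430.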
Requirements
| Requirement | Minimum | Notes |
| CUDA Toolkit | 11.2+ | Stream-ordered allocator required |
| Host Compiler | GCC 7+ or Clang 5+ | Upper bound set by CUDA version — see NVIDIA release notes; NVHPC 23.11 tested in CI |
| C++ Standard | C++17 | |
| CMake | 3.24+ | |
| Host byte order | Little-endian | |
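If you want to verify the host byte order requirement at startup, a small runtime probe is enough. This is a generic C++17 sketch; the function names are illustrative and not provided by FZGPUModules.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns true when the least significant byte of an integer is stored first.
bool host_is_little_endian() {
    const std::uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];
    std::memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x04;
}

// Debug-build guard an application could call once before using the library.
void require_little_endian_host() {
    assert(host_is_little_endian() && "little-endian host required");
}
```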
Note: on a vGPU, CUDA memory pool creation fails and the library automatically falls back to cudaMalloc. Results are still correct, but the performance benefits of the stream-ordered allocator are lost. Without stream-ordered allocator support, CUDA Graph capture is also unavailable on vGPUs. Avoid vGPU setups for performance-critical workloads.
Quick Start
Building from Source
For full build options (presets, examples/tests, install), see the Building from Source page.
git clone https://github.com/szcompressor/FZGPUModules.git
cd FZGPUModules
git submodule update --init --recursive
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
C++ API Usage
// Assumes the FZGPUModules main API header is included (it gives access to
// the full library) and that `pipeline` and the stages `lrz` and `rle`
// were created earlier.
pipeline.connect(rle, lrz, "codes");
pipeline.finalize();

// Compress: the library allocates the output buffer and returns its size.
void* d_compressed = nullptr;
size_t compressed_size = 0;
pipeline.compress(d_input, n * sizeof(float), &d_compressed, &compressed_size, stream);

// Decompress back to a device buffer on the same stream.
void* d_output = nullptr;
size_t output_size = 0;
pipeline.decompress(d_compressed, compressed_size, &d_output, &output_size, stream);
cudaStreamSynchronize(stream);
See examples/ for more usage patterns including multi-branch pipelines, CUDA Graph capture, and the low-level DAG API.
Available Stages
For detailed per-stage documentation — constraints, behavioral rules, and extended usage notes — see the Stage Reference.
| Stage | Header | Description |
| LorenzoQuantStage<TInput, TCode> | modules/fused/lorenzo_quant/lorenzo_quant.h | Fused float predictor + quantizer (lossy) |
| LorenzoStage<T> | modules/predictors/lorenzo/lorenzo_stage.h | Plain integer Lorenzo predictor (lossless) |
| TiledLorenzoStage<T> | modules/predictors/tiled_lorenzo/tiled_lorenzo_stage.h | Dimension-aware (tiled separable) Lorenzo predictor (lossless, 2D/3D, cuSZp3 delta) |
| GInterpStage<TInput, TCode> | modules/fused/ginterp/ginterp_stage.h | Multi-level spline interpolation predictor + quantizer (lossy, 3D, cuSZ-Hi port) |
| QuantizerStage<TInput, TCode> | modules/quantizers/quantizer/quantizer.h | Direct-value quantizer (ABS/REL/NOA) |
| RLEStage<T> | modules/coders/rle/rle.h | Run-length encoding |
| DifferenceStage<T, TOut> | modules/predictors/diff/diff.h | First-order difference / cumulative-sum coding |
| ADMStage | modules/transforms/adm/adm_stage.h | Adaptive data mapping — uint16/32 → 8-bit symbol domain (MANS port) |
| BitshuffleStage | modules/shufflers/bitshuffle/bitshuffle_stage.h | Bit-matrix transpose |
| RZEStage | modules/coders/rze/rze_stage.h | Recursive zero-byte elimination |
| RREStage | modules/coders/rre/rre_stage.h | Repetition-reduction encoding (LC component) |
| ZigzagStage<TIn, TOut> | modules/transforms/zigzag/zigzag_stage.h | Zigzag encode/decode |
| NegabinaryStage<TIn, TOut> | modules/transforms/negabinary/negabinary_stage.h | Negabinary encode/decode |
| BitpackStage<T> | modules/coders/bitpack/bitpack_stage.h | Pack/unpack power-of-two value streams |
| AdaptiveBitpackStage<T> | modules/coders/adaptive_bitpack/adaptive_bitpack_stage.h | Per-block adaptive fixed-rate bit-plane coding (cuSZp/cuSZp2 port) |
| HuffmanStage<T> | modules/coders/huffman/huffman_stage.h | GPU Huffman entropy coding (PHF, cuSZ port) |
| ANSStage | modules/coders/ans/ans_stage.h | GPU rANS entropy coding (dietGPU port) |
| BitplaneRZEStage | modules/fused/bitplane_rze/bitplane_rze_stage.h | Fused bitplane transpose + zero-group RZE lossless encoder (FZ-GPU port) |
| MergeStage | modules/structural/merge/merge_stage.h | Concatenate N producer ports into one buffer / split back (structural) |
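As a CPU reference for two of the simpler transforms in the table, the sketch below chains first-order difference coding with zigzag mapping and then inverts both. It only illustrates the general techniques; the function names are hypothetical, and the GPU stages in the library may differ in data layout, types, and edge-case handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Zigzag maps signed values to unsigned so small magnitudes get small codes:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
std::uint32_t zigzag_encode(std::int32_t v) {
    return (static_cast<std::uint32_t>(v) << 1) ^ (v < 0 ? 0xFFFFFFFFu : 0u);
}

std::int32_t zigzag_decode(std::uint32_t u) {
    return static_cast<std::int32_t>((u >> 1) ^ (0u - (u & 1u)));
}

// First-order difference followed by zigzag. Differences are computed in
// unsigned arithmetic so they wrap modulo 2^32 instead of overflowing.
std::vector<std::uint32_t> delta_zigzag_encode(const std::vector<std::int32_t>& values) {
    std::vector<std::uint32_t> codes(values.size());
    std::uint32_t prev = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        const std::uint32_t cur = static_cast<std::uint32_t>(values[i]);
        const std::int32_t delta = static_cast<std::int32_t>(cur - prev);
        codes[i] = zigzag_encode(delta);
        assert(zigzag_decode(codes[i]) == delta);  // debug-only sanity check
        prev = cur;
    }
    return codes;
}

// Inverse: undo zigzag, then rebuild the values with a cumulative sum.
std::vector<std::int32_t> delta_zigzag_decode(const std::vector<std::uint32_t>& codes) {
    std::vector<std::int32_t> values(codes.size());
    std::uint32_t acc = 0;
    for (std::size_t i = 0; i < codes.size(); ++i) {
        acc += static_cast<std::uint32_t>(zigzag_decode(codes[i]));
        values[i] = static_cast<std::int32_t>(acc);
    }
    return values;
}
```

Slowly varying input such as 100, 101, 103, 102, 102 turns into mostly small codes, which is what makes a downstream entropy or bit-packing coder effective.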
Memory Strategies
| Strategy | Description |
| MINIMAL | Allocate on demand, free at last consumer. Lowest peak GPU memory. |
| PREALLOCATE | Allocate everything at finalize(). Required for CUDA Graph capture. Enables buffer coloring for efficient buffer reuse. |
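The trade-off between the two strategies can be estimated on the host before committing to one. The sketch below compares the peak footprint when each buffer lives only from its allocation to its last consumer against the footprint when every buffer is allocated upfront with no reuse. `LiveRange` and both function names are assumptions made for this example and do not correspond to library types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One intermediate buffer: the level where it is first needed, the level of
// its last consumer (inclusive), and its size. Hypothetical type.
struct LiveRange {
    int alloc_level;
    int last_consumer_level;
    std::size_t bytes;
};

// Peak footprint when buffers are allocated on demand and freed at their
// last consumer (the behavior described for MINIMAL).
std::size_t peak_bytes_on_demand(const std::vector<LiveRange>& ranges) {
    int last_level = -1;
    for (const LiveRange& r : ranges) {
        assert(0 <= r.alloc_level && r.alloc_level <= r.last_consumer_level);
        last_level = std::max(last_level, r.last_consumer_level);
    }
    std::size_t peak = 0;
    for (int level = 0; level <= last_level; ++level) {
        std::size_t live = 0;
        for (const LiveRange& r : ranges)
            if (r.alloc_level <= level && level <= r.last_consumer_level)
                live += r.bytes;
        peak = std::max(peak, live);
    }
    return peak;
}

// Footprint when every buffer is allocated upfront and nothing is aliased.
std::size_t total_bytes_upfront(const std::vector<LiveRange>& ranges) {
    std::size_t total = 0;
    for (const LiveRange& r : ranges) total += r.bytes;
    return total;
}
```

For buffers of 100, 50, 80, and 200 bytes that are each live across two adjacent levels, the on-demand peak is 280 bytes against 430 bytes upfront; aliasing non-overlapping buffers is what narrows that gap under upfront allocation.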
Caller-Allocated Output
If you want full memory control, use the caller-allocated overloads. This mirrors nvcomp-style APIs: you pre-allocate an output buffer and pass its capacity; the API returns the actual size.
size_t comp_capacity = pipeline.getMaxCompressedSize(input_bytes);
void* d_comp_user = nullptr;
cudaMalloc(&d_comp_user, comp_capacity);
size_t comp_size = 0;
pipeline.compress(d_input, input_bytes,
                  d_comp_user, comp_capacity,
                  &comp_size, stream);
For decompression, size the output from the original input or from the FZM header:
// `header` is the FZMFileHeader returned by the static readHeader(filename) function.
size_t decomp_capacity = header.core.uncompressed_size;
void* d_decomp_user = nullptr;
cudaMalloc(&d_decomp_user, decomp_capacity);
size_t decomp_size = 0;
pipeline.decompress(d_comp_user, comp_size,
                    d_decomp_user, decomp_capacity,
                    &decomp_size, stream);
See examples/ownership_example.cpp for a minimal end-to-end example.
CUDA Graph Support
For throughput-critical workloads, enable CUDA Graph capture to eliminate CPU-side kernel launch overhead on repeated compress calls:
// Graph mode requires the PREALLOCATE memory strategy, which allocates
// everything upfront at finalize().
pipeline.enableGraphMode(true);
pipeline.finalize();
pipeline.warmup(stream);
pipeline.captureGraph(stream);
pipeline.compress(d_input, input_bytes, &d_compressed, &compressed_sz, stream);
Call compress() only after captureGraph(); use the same stream for capture and replay.
Compressor Config File
For complex pipelines, you can also load the stage graph from a TOML config file:
fzgmod-cli -z -i data.f32 -c examples/presets/pfpl.toml -o compressed.fzm --report
You can also use the Pipeline::loadFromConfig() API to load a config file from C++. The config schema supports arbitrary DAGs.
See examples/presets/ for pre-built reference pipeline configurations, and the Config File Reference for the full config schema.
File I/O
pipeline.writeToFile("output.fzm", stream);
void* d_out = nullptr;
size_t out_size = 0;
// decompressFromFile is a static function with the signature
//   void decompressFromFile(const std::string& filename, void** d_output,
//                           size_t* output_size, cudaStream_t stream = 0, ...);
// The Pipeline:: qualifier used here is an assumption.
Pipeline::decompressFromFile("output.fzm", &d_out, &out_size, stream);
cudaStreamSynchronize(stream);
cudaFree(d_out);
FZM files embed the full stage configuration and compressed payload with CRC32 checksums. See the FZM File Format page for the full specification.
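For readers unfamiliar with CRC32, the sketch below shows the widely used reflected variant (polynomial 0xEDB88320, as in zlib and PNG). Whether the FZM format uses exactly this variant and which byte ranges it covers is an assumption here; the FZM File Format page is the authority. The function name `crc32_ieee` is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (reflected, polynomial 0xEDB88320, initial value and final
// XOR of 0xFFFFFFFF). Slow but table-free; intended as a reference only.
std::uint32_t crc32_ieee(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // mask is all ones when the low bit is set, else zero
            const std::uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
```

A reader can recompute the checksum over the bytes it loaded and compare it with the stored value to detect corruption before launching any decompression work on the GPU.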
Thread Safety
Each Pipeline must be used from a single host thread. There is no internal locking.
Safe — run one independent pipeline per thread:
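std::thread t1([&] {
    // build and use a pipeline owned only by this thread
});
std::thread t2([&] {
    // build and use a second, independent pipeline here
});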
t1.join(); t2.join();
Not safe — two threads sharing one pipeline:
std::thread t1([&] { shared.compress(...); });
std::thread t2([&] { shared.compress(...); });
The library has no global mutable state. The FZ_LOG logger singleton is set once at startup; do not change log level or callback while pipelines are running on other threads.
Citation
If you reference this work, please cite:
Note: this paper describes the 1.0 release of the library; the 2.0 API and documentation may differ.
[DRBSD-11] FZModules: A Heterogeneous Computing Framework for Customizable Scientific Data Compression Pipelines
@inproceedings{ruiter2025fzmodules,
author = {Ruiter, Skyler and Tian, Jiannan and Song, Fengguang},
title = {FZModules: A Heterogeneous Computing Framework for Customizable Scientific Data Compression Pipelines},
year = {2025},
url = {https://doi.org/10.1145/3731599.3767376},
booktitle = {Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis},
pages = {332-338},
series = {SC Workshops '25}
}