FZGPUModules 2.0
GPU-accelerated modular compression pipelines
Status: Implemented (v2.0)
A pipeline config is a human-readable TOML file that fully describes a compression pipeline: the DAG (topology), stage types and parameters, and pipeline-level settings. A config file can reconstruct an identical pipeline without writing any C++ code.
Load a config and compress data:
For best results – especially when using memory_strategy = "PREALLOCATE" – pass the input size to the constructor before calling loadConfig(). This lets finalize() size buffers correctly rather than relying on a 1-byte placeholder.
Alternatively, the single-argument constructor can be used when MINIMAL strategy is sufficient and pool sizing from the .toml is acceptable:
[!IMPORTANT] When using memory_strategy = "PREALLOCATE" (required for CUDA Graph capture), always use the constructor + loadConfig() pattern so the pipeline receives the real input_bytes before finalize() runs preallocations.
Build programmatically, then save for later reuse:
Load an existing config and update a parameter before reuse: Not supported – loadConfig() calls finalize() internally, and finalized pipelines are immutable. Edit the .toml file directly to change parameters.
A config file has one [pipeline] table and one or more [[stage]] entries (an array of tables).
All keys are optional. Absent keys use the pipeline constructor defaults.
| Key | Type | Default | Description |
|---|---|---|---|
| input_size | integer | 0 | Input buffer size hint in bytes. Used for pool sizing at finalize(). |
| dims | array of 3 integers | [0, 1, 1] | Spatial dimensions [x, y, z]. x=0 means infer from input_size. Used by LorenzoND kernels. |
| memory_strategy | string | "MINIMAL" | "MINIMAL" or "PREALLOCATE". |
| pool_multiplier | float | 3.0 | Pool capacity = input_size × pool_multiplier. Relevant for PREALLOCATE. |
| num_streams | integer | 1 | Number of CUDA streams for multi-stream execution. |
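
As a rough illustration of the pool_multiplier row above, the sketch below computes the pool capacity implied by a config. The helper name pool_capacity_bytes and the round-up behavior are assumptions made for illustration; they are not part of the FZGPUModules API, and the library may round differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper, not a library function: pool capacity implied by
// [pipeline].input_size and [pipeline].pool_multiplier (documented default 3.0).
inline std::uint64_t pool_capacity_bytes(std::uint64_t input_size,
                                         double pool_multiplier = 3.0) {
    assert(pool_multiplier > 0.0);
    // Rounding up is an assumption, chosen so that a fractional multiplier
    // never yields a pool smaller than input_size * pool_multiplier.
    return static_cast<std::uint64_t>(
        std::ceil(static_cast<double>(input_size) * pool_multiplier));
}
```

With the default multiplier, a 256 MiB input corresponds to a 768 MiB pool under this formula.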
Stages are processed in file order. Each [[stage]] table describes one node in the pipeline DAG.
Required keys (all stages):
| Key | Type | Description |
|---|---|---|
| name | string | A unique local identifier used in inputs[].from references. |
| type | string | Stage class to instantiate (see Stage Types below). |
Optional key (non-source stages):
| Key | Type | Description |
|---|---|---|
| inputs | array of inline tables | Upstream connections. Each element is { from = "<name>" } or { from = "<name>", port = "<output_name>" }. Stages with no inputs key are pipeline sources. |
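If port is omitted, it defaults to "output", the port name of every single-output stage. Multi-output stages expose named ports instead: Lorenzo uses "codes", "outlier_errors", "outlier_indices", and "outlier_count", and the direct-value quantizer uses "codes", "outlier_vals", "outlier_idxs", and "outlier_count".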
Error-bounded prediction and quantization. Dimensionality is encoded in the type string; runtime spatial dimensions come from [pipeline].dims.
| Key | Type | Default | Description |
|---|---|---|---|
| input_type | string | "float32" | Input element type. "float32" or "float64". |
| code_type | string | "uint16" | Quantization code type. "uint8", "uint16", or "uint32". |
| error_bound | float | 1e-3 | Error bound value. Interpretation depends on error_bound_mode. |
| error_bound_mode | string | "ABS" | "ABS" (absolute), "REL" (point-wise relative), or "NOA" (value-range relative). |
| quant_radius | integer | 32768 | Quantization radius. Must match the range of code_type (e.g. 32768 for uint16). |
| outlier_capacity | float | 0.2 | Fraction of elements reserved as outlier capacity (0.0-1.0). |
| zigzag_codes | boolean | false | Zigzag-encode codes before output to improve downstream compressibility. |
Output ports: "codes", "outlier_errors", "outlier_indices", "outlier_count". Ports not referenced in any downstream inputs become pipeline outputs and are stored in the .fzm file.
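
To make the prediction-plus-quantization idea concrete, here is a minimal 1-D CPU reference, assuming a pre-quantize-then-difference formulation with an absolute error bound. This is one common way to build an error-bounded Lorenzo stage, not a description of the library's GPU kernels. The struct, function names, and the use of code 0 as an outlier marker are all assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Field names mirror the documented output ports; the layout is an assumption.
struct LorenzoOutput {
    std::vector<std::uint16_t> codes;
    std::vector<std::int64_t> outlier_errors;
    std::vector<std::size_t> outlier_indices;
};

// 1-D predict-and-quantize with absolute error bound eb (CPU reference sketch).
LorenzoOutput lorenzo1d_encode(const std::vector<double>& x, double eb,
                               std::int64_t radius) {
    assert(eb > 0.0 && radius > 0 && radius <= 32768);
    LorenzoOutput out;
    out.codes.resize(x.size());
    std::int64_t prev = 0;  // previous pre-quantized value (the prediction)
    for (std::size_t i = 0; i < x.size(); ++i) {
        const std::int64_t q = std::llround(x[i] / (2.0 * eb));  // pre-quantize
        const std::int64_t delta = q - prev;                      // residual
        prev = q;
        if (delta > -radius && delta < radius) {
            // In-range residuals map to [1, 2*radius - 1].
            out.codes[i] = static_cast<std::uint16_t>(delta + radius);
        } else {
            out.codes[i] = 0;  // 0 marks an outlier in this sketch
            out.outlier_errors.push_back(delta);
            out.outlier_indices.push_back(i);
        }
    }
    return out;
}

// Inverse: rebuild the pre-quantized values, then scale back by 2*eb.
std::vector<double> lorenzo1d_decode(const LorenzoOutput& in, double eb,
                                     std::int64_t radius) {
    std::vector<double> x(in.codes.size());
    std::int64_t prev = 0;
    std::size_t k = 0;  // next outlier to consume (outliers are stored in order)
    for (std::size_t i = 0; i < x.size(); ++i) {
        const std::int64_t delta =
            (in.codes[i] == 0) ? in.outlier_errors[k++]
                               : static_cast<std::int64_t>(in.codes[i]) - radius;
        prev += delta;
        x[i] = static_cast<double>(prev) * 2.0 * eb;
    }
    return x;
}
```

In this sketch a radius of 32768 uses the full uint16 code range, which is consistent with the documented pairing of quant_radius = 32768 with code_type = "uint16".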
GPU bit-matrix transpose. Size-preserving; improves entropy coder performance on integer data.
| Key | Type | Default | Description |
|---|---|---|---|
| block_size | integer | 16384 | Chunk size in bytes. Must be a positive multiple of 1024 × element_width. |
| element_width | integer | 4 | Element width in bytes: 1, 2, 4, or 8. |
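
The two constraints in the table above can be checked before a config is written. The following is a small sketch; both function names are hypothetical and are not part of the library, which performs its own validation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical checks mirroring the documented Bitshuffle constraints.
constexpr bool valid_element_width(std::uint32_t width) {
    return width == 1 || width == 2 || width == 4 || width == 8;
}

constexpr bool valid_bitshuffle_block(std::uint64_t block_size,
                                      std::uint32_t element_width) {
    // block_size must be a positive multiple of 1024 * element_width.
    return valid_element_width(element_width) && block_size > 0 &&
           block_size % (1024ull * element_width) == 0;
}

// The documented defaults (16384 bytes, 4-byte elements) satisfy the rule.
static_assert(valid_bitshuffle_block(16384, 4), "defaults should be valid");
```

Note that the smallest legal block_size grows with element_width: 8-byte elements need at least 8192 bytes.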
Recursive Zero-byte Elimination – lossless byte-stream compressor operating on Bitshuffle output.
| Key | Type | Default | Description |
|---|---|---|---|
| chunk_size | integer | 16384 | Chunk size in bytes. Must be a positive multiple of 4096. |
| levels | integer | 4 | Recursion depth 1-4. Level 1 = ZE only; levels 2-4 add RE passes. |
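
For intuition about the zero-elimination (ZE) step at level 1, here is a minimal CPU sketch of the general technique: a bitmap records which bytes are nonzero, and only those bytes are kept. The struct layout and function names are assumptions for illustration. The actual RZE chunking, on-disk layout, and the RE passes added at levels 2-4 are not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout for illustration only.
struct ZeBlock {
    std::vector<std::uint8_t> bitmap;   // 1 bit per input byte, set if nonzero
    std::vector<std::uint8_t> nonzero;  // surviving nonzero bytes, in order
    std::size_t size = 0;               // original byte count
};

ZeBlock ze_encode(const std::vector<std::uint8_t>& in) {
    ZeBlock out;
    out.size = in.size();
    out.bitmap.assign((in.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != 0) {
            out.bitmap[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
            out.nonzero.push_back(in[i]);
        }
    }
    return out;
}

std::vector<std::uint8_t> ze_decode(const ZeBlock& block) {
    std::vector<std::uint8_t> out(block.size, 0);
    std::size_t k = 0;  // index of the next stored nonzero byte
    for (std::size_t i = 0; i < block.size; ++i) {
        if ((block.bitmap[i / 8] >> (i % 8)) & 1u) out[i] = block.nonzero[k++];
    }
    return out;
}
```

The scheme pays off when most bytes are zero, which is the situation Bitshuffle tends to create for small-magnitude integer codes.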
Run-Length Encoding. Effective on quantization code streams with long runs of identical values.
| Key | Type | Default | Description |
|---|---|---|---|
| data_type | string | "uint16" | Element type. One of "uint8", "uint16", "uint32", "int32". |
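
As a reference point for what run-length encoding does to a code stream, the sketch below is a plain CPU implementation. The (value, run length) pair representation is an assumption chosen for readability; the stage's actual output format is not specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// CPU reference sketch: collapse consecutive equal values into (value, count).
template <typename T>
std::vector<std::pair<T, std::uint32_t>> rle_encode(const std::vector<T>& in) {
    std::vector<std::pair<T, std::uint32_t>> runs;
    for (const T& v : in) {
        if (!runs.empty() && runs.back().first == v) {
            ++runs.back().second;  // extend the current run
        } else {
            runs.emplace_back(v, 1u);  // start a new run
        }
    }
    return runs;
}

// Inverse: expand each (value, count) pair back into count copies of value.
template <typename T>
std::vector<T> rle_decode(const std::vector<std::pair<T, std::uint32_t>>& runs) {
    std::vector<T> out;
    for (const auto& run : runs) {
        out.insert(out.end(), static_cast<std::size_t>(run.second), run.first);
    }
    return out;
}
```

Streams without long runs grow under this representation, since every isolated value still costs one pair.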
First-order difference coding with optional negabinary fusion.
| Key | Type | Default | Description |
|---|---|---|---|
| input_type | string | "float32" | Input element type. |
| output_type | string | (same as input_type) | Output element type. When output_type is the unsigned counterpart of a signed input_type, negabinary encoding is fused into the forward pass. |
| chunk_size | integer | 0 | Chunk size in bytes (0 = no chunking, process whole array as one context). When > 0, differences reset at each chunk boundary, enabling parallel decompression. |
Negabinary-fused instantiations (when input_type != output_type):
| input_type | output_type |
|---|---|
| "int8" | "uint8" |
| "int16" | "uint16" |
| "int32" | "uint32" |
| "int64" | "uint64" |
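
The effect of chunk_size on first-order differencing can be sketched on the CPU as follows. Two assumptions are made for illustration: the chunk length is given in elements rather than bytes, and the first element of each chunk is stored verbatim. The function names are hypothetical, and the negabinary fusion is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// True when element i starts a new differencing context.
inline bool starts_chunk(std::size_t i, std::size_t chunk_elems) {
    return i == 0 || (chunk_elems > 0 && i % chunk_elems == 0);
}

// Forward pass: out[i] = in[i] - in[i-1], restarting at each chunk boundary.
std::vector<std::int32_t> diff_forward(const std::vector<std::int32_t>& in,
                                       std::size_t chunk_elems) {
    std::vector<std::int32_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        // Unsigned arithmetic keeps wraparound well defined.
        out[i] = starts_chunk(i, chunk_elems)
                     ? in[i]
                     : static_cast<std::int32_t>(
                           static_cast<std::uint32_t>(in[i]) -
                           static_cast<std::uint32_t>(in[i - 1]));
    }
    return out;
}

// Inverse pass: a running sum that also restarts at each chunk boundary.
std::vector<std::int32_t> diff_inverse(const std::vector<std::int32_t>& d,
                                       std::size_t chunk_elems) {
    std::vector<std::int32_t> out(d.size());
    for (std::size_t i = 0; i < d.size(); ++i) {
        out[i] = starts_chunk(i, chunk_elems)
                     ? d[i]
                     : static_cast<std::int32_t>(
                           static_cast<std::uint32_t>(out[i - 1]) +
                           static_cast<std::uint32_t>(d[i]));
    }
    return out;
}
```

Because each chunk's running sum depends only on values inside that chunk, chunks can be decoded independently, which is what makes parallel decompression possible.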
Element-wise zigzag encode/decode (signed integer -> unsigned integer of same width).
| Key | Type | Description |
|---|---|---|
| input_type | string | Signed integer type: "int8", "int16", "int32", "int64". |
| output_type | string | Corresponding unsigned type: "uint8", "uint16", "uint32", "uint64". |
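
For readers unfamiliar with zigzag coding, the sketch below shows the standard 32-bit mapping, which interleaves negative and non-negative values so that small magnitudes become small unsigned codes. The function names are illustrative, and the library's kernels may be written differently.

```cpp
#include <cassert>
#include <cstdint>

// Standard zigzag mapping: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
constexpr std::uint32_t zigzag_encode(std::int32_t v) {
    return (static_cast<std::uint32_t>(v) << 1) ^ (v < 0 ? 0xFFFFFFFFu : 0u);
}

// Inverse mapping: the low bit selects the sign, the rest is the magnitude.
constexpr std::int32_t zigzag_decode(std::uint32_t u) {
    return static_cast<std::int32_t>((u >> 1) ^ (0u - (u & 1u)));
}

static_assert(zigzag_encode(0) == 0 && zigzag_encode(-1) == 1 &&
                  zigzag_encode(1) == 2,
              "small magnitudes map to small codes");
```

The same pattern applies to the other widths by swapping in the matching signed and unsigned types and an all-ones mask of that width.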
Direct-value error-bounded quantizer with lossless outlier fallback. Unlike LorenzoND, this stage quantizes input values directly (no prediction step) and supports ABS, NOA, and REL (log-space) error bound modes.
| Key | Type | Default | Description |
|---|---|---|---|
| input_type | string | "float32" | Input element type. "float32" or "float64". |
| code_type | string | "uint32" | Quantization code type. "uint16" or "uint32". |
| error_bound | float | 1e-3 | Error bound value. Interpretation depends on error_bound_mode. |
| error_bound_mode | string | "REL" | "ABS" (absolute), "REL" (pointwise relative log-space), or "NOA" (value-range relative). |
| quant_radius | integer | 32768 | Quantization radius. |
| outlier_capacity | float | 0.05 | Fraction of elements reserved as outlier capacity (0.0-1.0). |
| zigzag_codes | boolean | true | Zigzag-encode codes before output to improve downstream compressibility. No effect in REL mode. |
| outlier_threshold | float | inf | ABS/NOA: values with \|x\| >= threshold are forced to lossless outlier regardless of bin. Omit (default) to disable. |
| inplace_outliers | boolean | false | ABS/NOA: encode outlier raw bits in-place in the codes array (no scatter buffers). Cannot be used with REL mode. |
Output ports: "codes", "outlier_vals", "outlier_idxs", "outlier_count". In inplace-outlier mode only "codes" is produced; the other three outputs are omitted.
[!NOTE] REL mode requires a 4-byte code type ("uint32") because it stores sign + log-bin packed into 32 bits. Using "uint16" in REL mode will raise a runtime error if the bin magnitude overflows 15 bits (rare in practice for eb >= 0.01).
Element-wise negabinary encode/decode (same signed/unsigned pairing as Zigzag).
| Key | Type | Description |
|---|---|---|
| input_type | string | Signed integer type. |
| output_type | string | Corresponding unsigned type. |
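
Negabinary (base -2) conversion is commonly implemented with an alternating-bit mask, as in the 32-bit sketch below. This is the well-known mask formulation and is shown only for orientation; the names are illustrative, and the library's kernels are not guaranteed to use this exact form.

```cpp
#include <cassert>
#include <cstdint>

// Alternating-bit mask: the odd bit positions carry negative weights in base -2.
constexpr std::uint32_t kNegabinaryMask = 0xAAAAAAAAu;

// Two's complement -> negabinary bit pattern.
constexpr std::uint32_t negabinary_encode(std::int32_t v) {
    return (static_cast<std::uint32_t>(v) + kNegabinaryMask) ^ kNegabinaryMask;
}

// Negabinary bit pattern -> two's complement.
constexpr std::int32_t negabinary_decode(std::uint32_t u) {
    return static_cast<std::int32_t>((u ^ kNegabinaryMask) - kNegabinaryMask);
}
```

For example, -1 encodes to binary 11, since 1·(-2) + 1·1 = -1. As with zigzag, values near zero of either sign get codes with few significant bits, and the round trip holds for every input because both steps are invertible modulo 2^32.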
Packs N-bit unsigned integers into a dense byte stream. Output is ceil(n * nbits / 8) bytes – smaller than the input when nbits < 8*sizeof(T). nbits must be a power of two.
[!NOTE] nbits must fit the actual code range. If codes span more bits than nbits, the upper bits are silently truncated and decompression will produce wrong values. The combination Lorenzo (small quant_radius, zigzag_codes=true) -> Bitpack works well because zigzag residuals cluster near zero. Adding a Difference stage between Lorenzo and Bitpack does not help: unsigned difference deltas wrap across the full uint16 range even when source values are small, so nbits=16 (identity) is required to round-trip correctly through a Difference stage.
| Key | Type | Default | Description |
|---|---|---|---|
| input_type | string | "uint16" | Element type of the input codes. One of "uint8", "uint16", "uint32". |
| nbits | integer | 16 | Bits per element. Must be a power of two no larger than the element width: 1, 2, 4, or 8 for uint8; 1, 2, 4, 8, or 16 for uint16; 1, 2, 4, 8, 16, or 32 for uint32. |
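
The size formula and the truncation caveat for this stage can be checked with a few lines of host code. The helpers below are hypothetical and not part of the library. They compute the packed size, test whether a code stream fits in nbits, and show why an unsigned 16-bit delta can need all 16 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packed output size in bytes: ceil(n * nbits / 8).
constexpr std::size_t packed_bytes(std::size_t n, unsigned nbits) {
    return (n * nbits + 7) / 8;
}

// True if every code is representable in nbits, i.e. packing loses nothing.
template <typename T>
bool fits_in_nbits(const std::vector<T>& codes, unsigned nbits) {
    for (const T code : codes) {
        if ((static_cast<std::uint64_t>(code) >> nbits) != 0) return false;
    }
    return true;
}

// Unsigned 16-bit differences wrap modulo 65536, so a delta of -1 becomes 65535.
constexpr std::uint16_t wrapped_delta_u16(std::uint16_t cur, std::uint16_t prev) {
    return static_cast<std::uint16_t>(cur - prev);
}
```

Running a check like fits_in_nbits on representative data before choosing nbits is one way to avoid silent truncation.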
Lorenzo predictor with zigzag codes feeding into Bitshuffle and RZE.
The PFPL (Predictor-Free Pipeline) preset – direct-value quantizer with relative error bound, followed by Difference -> Bitshuffle -> RZE. This is the examples/presets/pfpl.toml configuration.
Load it via the CLI: