FZGPUModules 2.0
GPU-accelerated modular compression pipelines
Header: modules/predictors/tiled_lorenzo/tiled_lorenzo_stage.h
Class: fz::TiledLorenzoStage<T>
Category: Predictor (lossless)
The cuSZp3 dimension-aware (tiled separable) delta predictor, as a modular stage. The input (signed quantizer codes) is partitioned into fixed-size tiles — 8×8 in 2-D, 4×4×4 in 3-D by default — and within each tile a separable integer Lorenzo is applied:
- interior elements (x > 0) predict along x, from their x-1 neighbor in the same tile,
- the leading x edge (x == 0) predicts down y (and the leading y/z edge predicts down z),
- the tile origin (0,0,0) is stored as a delta-vs-0 (its own full magnitude).

This is the prediction used by the cuSZp3 plain and outlier kernels. It is a lossless integer delta on already-quantized codes, so the inverse is an exact separable prefix-sum.
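The per-tile rule above can be sketched on the CPU for the 2-D case. This is a minimal reference sketch, not the stage's kernel: the function names `tile_delta_2d` and `tile_undelta_2d` are hypothetical, the tile is assumed to be stored row-major with x fastest, and interior elements are assumed to predict from their x-1 neighbor. The 3-D case would add one more fallback (x == 0 and y == 0 predicting from the z-1 neighbor).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Forward separable delta on one tx-by-ty tile (row-major, x fastest).
// Assumed rule: x > 0 predicts from the x-1 neighbor, the x == 0 column
// predicts from the y-1 neighbor, and the origin is predicted from 0.
std::vector<int32_t> tile_delta_2d(const std::vector<int32_t>& v, int tx, int ty) {
    assert(static_cast<int>(v.size()) == tx * ty);
    std::vector<int32_t> d(v.size());
    for (int y = 0; y < ty; ++y) {
        for (int x = 0; x < tx; ++x) {
            int32_t pred = 0;                        // origin: delta vs 0
            if (x > 0)      pred = v[y * tx + (x - 1)];  // along x
            else if (y > 0) pred = v[(y - 1) * tx];      // leading x edge: down y
            d[y * tx + x] = v[y * tx + x] - pred;
        }
    }
    return d;
}

// Inverse: prefix-sum down the x == 0 column, then along each row.
std::vector<int32_t> tile_undelta_2d(const std::vector<int32_t>& d, int tx, int ty) {
    assert(static_cast<int>(d.size()) == tx * ty);
    std::vector<int32_t> v(d.size());
    for (int y = 0; y < ty; ++y) {
        const int32_t above = (y > 0) ? v[(y - 1) * tx] : 0;
        v[y * tx] = d[y * tx] + above;               // rebuild the leading column
        for (int x = 1; x < tx; ++x)
            v[y * tx + x] = d[y * tx + x] + v[y * tx + (x - 1)];
    }
    return v;
}
```

Because both passes use only integer addition and subtraction, the round trip reproduces the input codes exactly, which is the lossless property the text relies on.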
Unlike LorenzoStage (which is size- and layout-preserving), the forward pass emits its output in tile-major order: each tile's tile_elems = tx*ty*tz elements are written contiguously, and edge tiles are zero-padded to a full tile. The forward output therefore has num_tiles * tile_elems elements (≥ the input count when the dims aren't exact multiples of the tile).
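The padded output size follows directly from the dims and the tile shape. The sketch below shows the arithmetic only; `Extent3`, `num_tiles`, and `padded_count` are hypothetical helper names, not part of the FZGPUModules API, and unused dimensions are assumed to be passed as 1.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: extents along x, y, z (use 1 for unused dims).
struct Extent3 { std::size_t x, y, z; };

inline std::size_t ceil_div(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Number of tiles needed to cover dims, counting partial edge tiles as full.
std::size_t num_tiles(Extent3 dims, Extent3 tile) {
    assert(tile.x > 0 && tile.y > 0 && tile.z > 0);
    return ceil_div(dims.x, tile.x) * ceil_div(dims.y, tile.y) * ceil_div(dims.z, tile.z);
}

// Forward output element count: num_tiles * tile_elems.
std::size_t padded_count(Extent3 dims, Extent3 tile) {
    return num_tiles(dims, tile) * (tile.x * tile.y * tile.z);
}
```

For example, a 10×10 field with 8×8 tiles needs 2×2 tiles, so 100 input elements become 256 output elements; a 16×8 field tiles exactly and stays at 128.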
The point of tile-major output is to line the tiles up with a downstream fixed-rate coder: pair this stage with AdaptiveBitpackStage(block_size = tile_elems) so each tile becomes exactly one coder block — reproducing cuSZp3's per-tile fixed-rate layout with the coder unchanged. The inverse un-tiles back to the original natural (row-major) order.
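To make the reshape concrete, here is a small CPU sketch of a natural-order to tile-major reordering and its inverse for the 2-D case. It shows the layout only, without the delta. The function names are hypothetical, and two details are assumptions rather than facts from the stage: tiles are numbered row-major (x fastest), and elements within a tile are also stored x fastest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Offset of natural-order element (x, y) in the tile-major buffer.
// Assumption: tiles are ordered x fastest, and so are elements within a tile.
std::size_t tile_major_index(std::size_t x, std::size_t y, std::size_t tiles_x,
                             std::size_t tx, std::size_t ty) {
    const std::size_t tile  = (y / ty) * tiles_x + (x / tx);
    const std::size_t local = (y % ty) * tx + (x % tx);
    return tile * tx * ty + local;
}

// Row-major nx-by-ny field -> tile-major buffer, edge tiles zero-padded.
std::vector<int32_t> to_tile_major_2d(const std::vector<int32_t>& in, std::size_t nx,
                                      std::size_t ny, std::size_t tx, std::size_t ty) {
    assert(in.size() == nx * ny && tx > 0 && ty > 0);
    const std::size_t tiles_x = (nx + tx - 1) / tx;
    const std::size_t tiles_y = (ny + ty - 1) / ty;
    std::vector<int32_t> out(tiles_x * tiles_y * tx * ty, 0);  // padding stays 0
    for (std::size_t y = 0; y < ny; ++y)
        for (std::size_t x = 0; x < nx; ++x)
            out[tile_major_index(x, y, tiles_x, tx, ty)] = in[y * nx + x];
    return out;
}

// Tile-major buffer -> row-major field; padding elements are dropped.
std::vector<int32_t> from_tile_major_2d(const std::vector<int32_t>& tiled, std::size_t nx,
                                        std::size_t ny, std::size_t tx, std::size_t ty) {
    assert(tx > 0 && ty > 0);
    const std::size_t tiles_x = (nx + tx - 1) / tx;
    const std::size_t tiles_y = (ny + ty - 1) / ty;
    assert(tiled.size() == tiles_x * tiles_y * tx * ty);
    std::vector<int32_t> out(nx * ny);
    for (std::size_t y = 0; y < ny; ++y)
        for (std::size_t x = 0; x < nx; ++x)
            out[y * nx + x] = tiled[tile_major_index(x, y, tiles_x, tx, ty)];
    return out;
}
```

With this layout every run of tx*ty consecutive elements belongs to exactly one tile, which is what lets a block coder with block size tile_elems treat each tile as one block.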
This stage is a separate stage rather than a LorenzoStage mode precisely because the tile-major reshape + padding break LorenzoStage's "same size, same
layout" contract, and the separable formula differs from LorenzoStage's N-D inclusion-exclusion delta. (Same reasoning as AdaptiveBitpackStage vs BitpackStage.)
T — signed element type: int16_t or int32_t (linear quantizer codes).
| Setting | Purpose | Notes |
|---|---|---|
| setDims(x, y, z) | Spatial dimensions (call before addStage, or use Pipeline::setDims) | ndim inferred from non-unit dims |
| setTileShape(tx, ty, tz) | Tile extents | each ≤ 255, product tx*ty*tz ∈ [1,1024]; default 2-D 8×8, 3-D 4×4×4, 1-D 64 |
The downstream AdaptiveBitpackStage block size should equal getTileElems() (64 for both 8×8 and 4×4×4) so coder blocks align with tiles.
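The tile-shape limits in the table can be checked with a few lines of arithmetic. The sketch below is illustrative only: `valid_tile_shape` and `tile_elems` are hypothetical helpers that mirror the documented constraints, not the stage's own validation code.

```cpp
#include <cassert>

// Mirrors the documented limits: each extent <= 255, product in [1, 1024].
bool valid_tile_shape(unsigned tx, unsigned ty, unsigned tz) {
    if (tx == 0 || ty == 0 || tz == 0) return false;     // product must be >= 1
    if (tx > 255 || ty > 255 || tz > 255) return false;  // per-extent limit
    return tx * ty * tz <= 1024;                         // tile_elems limit
}

// Elements per tile; this is the value the coder block size should match.
unsigned tile_elems(unsigned tx, unsigned ty, unsigned tz) {
    assert(valid_tile_shape(tx, ty, tz));
    return tx * ty * tz;
}
```

Both default shapes, 8×8×1 and 4×4×4, give 64 elements per tile, so a single coder block size of 64 covers the 2-D and 3-D defaults.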
Single input → single output (both signed T). The forward output is the padded tile-major delta stream; the inverse output is the natural-order reconstruction.
| Direction | Port | Type |
|---|---|---|
| Forward in / inverse out | "output" | T[n] (natural order) |
| Forward out / inverse in | "output" | T[num_tiles*tile_elems] (tile-major, padded) |
isGraphCompatible() is true — pure kernels with deterministic output sizes (the padded size is known from dims at finalize; no host-blocking D2H).
The three cuSZp3 modes decompose as:
- … TiledLorenzo (Quantizer(linear) → AdaptiveBitpack).
- … ab->setOutlierSelection(true).

For 1-D data, LorenzoStage::setBlockSize(32) already provides the cuSZp/cuSZp3 1-D block-local delta; TiledLorenzoStage targets the 2-D/3-D tiled case.
Dimensions come from the pipeline (setDims / the CLI --dims), not the stage block.
The dimension-aware separable delta is the cuSZp3 / VGC design (Yafan Huang, Sheng Di, Guanpeng Li, Franck Cappello, "GPU Lossy Compression for HPC Can Be
Versatile and Ultra-Fast", SC'25, https://doi.org/10.1145/3712285.3759817). This stage is a direct port of the cuSZp3 tiled separable delta kernel logic (from cuSZp_kernels_{2D,3D}_f32.cu), re-expressed as a standalone integer predictor with a tile-major output reshape; the tile-major decomposition, FZM header, and MemoryPool integration are FZGPUModules code. cuSZp3 is BSD-3-Clause (copyright reproduced verbatim in THIRD_PARTY.md); its memory-efficient compression and selective decompression features are not ported. Repo: https://github.com/szcompressor/cuSZp. See THIRD_PARTY.md and memory/cuszp_stages.md (Part 8).