FZGPUModules 2.0
GPU-accelerated modular compression pipelines
Header: modules/structural/merge/merge_stage.h
Class: fz::MergeStage — no template parameters
Category: Structural
Common instantiation:
MergeStage concatenates N input buffers into one contiguous buffer (forward) and splits that buffer back into the original N segments (inverse). It is the structural mirror of the predictors' port asymmetry (e.g. LorenzoQuantStage is 1 → 3 outputs forward / 3 → 1 inputs inverse); MergeStage is N → 1 forward / 1 → N inverse.
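The N → 1 / 1 → N round trip can be modelled on the host with plain byte vectors. This is a minimal sketch only, not the FZGPUModules implementation: `Bytes`, `Merged`, `merge_segments` and `split_segments` are hypothetical names introduced for illustration, and it assumes the per-segment byte counts are carried alongside the merged blob so the inverse knows where to cut.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <stdexcept>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Result of the forward (N -> 1) direction: the merged blob plus the
// per-segment byte counts that the inverse direction needs.
// Hypothetical type, not part of the library.
struct Merged {
    Bytes blob;
    std::vector<std::size_t> sizes;
};

// Forward: concatenate the segments in the order given.
Merged merge_segments(const std::vector<Bytes>& segments) {
    Merged out;
    for (const Bytes& seg : segments) {
        out.sizes.push_back(seg.size());
        out.blob.insert(out.blob.end(), seg.begin(), seg.end());
    }
    return out;
}

// Inverse: cut the blob back into segments using the recorded sizes.
std::vector<Bytes> split_segments(const Merged& m) {
    const std::size_t total =
        std::accumulate(m.sizes.begin(), m.sizes.end(), std::size_t{0});
    if (total != m.blob.size()) {
        throw std::invalid_argument("recorded sizes do not cover the blob");
    }
    std::vector<Bytes> out;
    std::size_t offset = 0;
    for (std::size_t n : m.sizes) {
        const auto first = m.blob.begin() + static_cast<std::ptrdiff_t>(offset);
        out.emplace_back(first, first + static_cast<std::ptrdiff_t>(n));
        offset += n;
    }
    return out;
}
```

Splitting the output of `merge_segments` returns the original segments in the same order, including empty ones. Because the segments are identified only by position, the order used at merge time is the order recovered at split time.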
FZGM's DAG runs each stage on its own buffer, and the final .fzm archive is assembled from the pipeline's leaf outputs after all stages have run. That is the right model when each port is compressed independently — but some codecs are defined to run a single lossless pass over the concatenation of several producer outputs. The motivating case is cuSZ-Hi's LC back-end:
[anchor | outliers]; [Huffman(codes) | anchor | outliers].
There is no way to express "compress the concatenation of these ports as one stream" by wiring ports straight into a coder: a coder takes a single input buffer, and the archive-level concatenation happens too late (post-compression) to feed a downstream chain. MergeStage materialises that concatenation inside the pipeline so a downstream stage compresses it as one stream, and on decompress it splits the reconstructed blob back into the original segments to feed the inverse producers. Without it, a 1-to-1 port of cuSZ-Hi's merged-blob layout would be impossible.
It is also a general-purpose building block, useful any time you need several producer ports compressed together as one stream.
setSegmentNames() must be called before connect() so that getNumInputs() reports the right count. Connect the producers in segment order.
Maximum 16 segments.
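The ordering contract above (declare segments first, then connect producers in segment order, at most 16 segments) can be illustrated with a toy stand-in. This is a sketch under stated assumptions, not the real fz::MergeStage: `ToyMergeStage`, `connectNext` and `wiring` are hypothetical, and the signatures of `setSegmentNames` and `getNumInputs` are guessed for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for the ordering contract; NOT the real fz::MergeStage.
class ToyMergeStage {
public:
    static constexpr std::size_t kMaxSegments = 16;  // limit stated in the text

    // Declares the segments; the input-port count is derived from this list.
    void setSegmentNames(std::vector<std::string> names) {
        if (names.size() > kMaxSegments) {
            throw std::invalid_argument("at most 16 segments");
        }
        names_ = std::move(names);
        connected_.clear();
    }

    // Zero until setSegmentNames has been called.
    std::size_t getNumInputs() const { return names_.size(); }

    // Hypothetical helper: attach the next producer, in segment order.
    void connectNext(const std::string& producer) {
        if (connected_.size() >= getNumInputs()) {
            throw std::logic_error("no free input port; declare segments first");
        }
        connected_.push_back(producer);
    }

    // Pairs each connected producer with the segment name at the same index.
    std::vector<std::pair<std::string, std::string>> wiring() const {
        std::vector<std::pair<std::string, std::string>> out;
        for (std::size_t i = 0; i < connected_.size(); ++i) {
            out.emplace_back(names_[i], connected_[i]);
        }
        return out;
    }

private:
    std::vector<std::string> names_;
    std::vector<std::string> connected_;
};
```

In this model, connecting before any segment names are declared fails because the stage reports zero inputs, and the i-th connected producer is always bound to the i-th segment name, which is why the connection order matters.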
finalize() type checking on every port: segments are opaque bytes once merged, so any mix of upstream element types concatenates without friction.
RZEStage's cached_orig_bytes), which the inverse path uses to split.
cudaMemcpyAsync with no host synchronisation.
See examples/lc_lossless_pipeline.cpp for the full tp and cr pipelines.
Input order must match the segment order declared with setSegmentNames().