FZGPUModules 2.0
GPU-accelerated modular compression pipelines
MergeStage: concatenate N input buffers into one, split back to N.
#include "stage/stage.h"
#include "fzm_format.h"
#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>
#include <numeric>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>
Classes
class fz::MergeStage
Namespaces
namespace fz
FZGM's DAG runs each stage on its own buffer, and the final .fzm archive is assembled from the pipeline's leaf outputs after all stages have run. That is the right model when each port is compressed independently. Some codecs, however, are defined to run a single lossless pass over the concatenation of several producer outputs. The main example is cuSZ-Hi's LC back-end: in its compression-ratio mode it Huffman-encodes the quant codes and then runs one LC chain over the contiguous blob [Huffman | anchor | outliers]; in its throughput mode it runs one LC chain over [anchor | outliers].
There is no way to express "compress the concatenation of these ports" by wiring ports straight into a coder: a coder takes one input buffer, and the archive-level concatenation happens too late (post-compression) to feed a downstream chain. MergeStage fills this gap. During the pipeline it materialises the concatenation as a real buffer so that a downstream stage can compress it as one stream, and on decompression it splits the reconstructed blob back into the original N segments to feed the inverse producers.
It is the structural mirror of the predictors' port asymmetry (e.g. LorenzoQuant is 1 → 3 outputs forward / 3 → 1 inputs inverse); MergeStage is N → 1 forward / 1 → N inverse. The existing inverse-DAG builder wires both shapes by input-position index, so MergeStage needs no new DAG support.
The forward output is a pure concatenation of the inputs in connection order, with no in-stream header (matching cuSZ-Hi's byte layout). The per-segment sizes are data-dependent (outlier counts vary), so they are captured at compress time and carried in the FZM stage config header (like RZEStage's cached_orig_bytes); the inverse path reads them to split the blob.
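The concatenate-then-split behaviour described above can be illustrated with a small host-side sketch. This is not the MergeStage implementation: it works on std::vector rather than device buffers, and the names Bytes, MergedBlob, concat_segments and split_segments are hypothetical, introduced only for this example. It assumes each segment fits in a uint32 size, as the config layout suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <stdexcept>
#include <vector>

using Bytes = std::vector<std::uint8_t>;  // hypothetical alias for one buffer

// Hypothetical forward result: the merged blob plus the per-segment sizes
// that would travel in the stage config header, not in the stream itself.
struct MergedBlob {
    Bytes data;
    std::vector<std::uint32_t> segment_sizes;
};

// Forward (N -> 1): append the inputs in connection order, no in-stream header.
MergedBlob concat_segments(const std::vector<Bytes>& inputs) {
    MergedBlob out;
    for (const Bytes& in : inputs) {
        if (in.size() > std::numeric_limits<std::uint32_t>::max())
            throw std::length_error("segment too large for a uint32 size");
        out.segment_sizes.push_back(static_cast<std::uint32_t>(in.size()));
        out.data.insert(out.data.end(), in.begin(), in.end());
    }
    return out;
}

// Inverse (1 -> N): cut the blob at the sizes recorded at compress time.
std::vector<Bytes> split_segments(const Bytes& blob,
                                  const std::vector<std::uint32_t>& sizes) {
    const std::size_t total =
        std::accumulate(sizes.begin(), sizes.end(), std::size_t{0});
    if (total != blob.size())
        throw std::runtime_error("segment sizes do not sum to blob size");

    std::vector<Bytes> out;
    std::size_t offset = 0;
    for (std::uint32_t n : sizes) {
        out.emplace_back(blob.data() + offset, blob.data() + offset + n);
        offset += n;
    }
    assert(offset == blob.size());  // every byte belongs to exactly one segment
    return out;
}
```

Because the sizes live outside the blob, an empty segment round-trips correctly: it contributes zero bytes to the stream but still occupies one slot in the size list.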
[0]          uint8   num_segments (N)
[1 .. 4N]    uint32  segment_size × N (LE, in connection order)
[4N+1 ..]    name table: per segment [uint8 len][len bytes]
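As a rough illustration of the byte layout above, the following sketch writes and reads such a config blob on the host. The MergeConfig struct and the serialize_config / parse_config functions are hypothetical names for this example and are not taken from the library; the only details assumed from the source are the field order, the little-endian uint32 sizes, and the length-prefixed names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory form of the config: one size and one name per segment.
struct MergeConfig {
    std::vector<std::uint32_t> sizes;
    std::vector<std::string> names;
};

std::vector<std::uint8_t> serialize_config(const MergeConfig& cfg) {
    if (cfg.sizes.size() != cfg.names.size() || cfg.sizes.size() > 255)
        throw std::invalid_argument("need one name per segment, at most 255 segments");

    std::vector<std::uint8_t> out;
    out.push_back(static_cast<std::uint8_t>(cfg.sizes.size()));      // [0] N
    for (std::uint32_t s : cfg.sizes)                                // [1 .. 4N]
        for (int shift = 0; shift < 32; shift += 8)                  // LSB first (LE)
            out.push_back(static_cast<std::uint8_t>((s >> shift) & 0xFFu));
    for (const std::string& name : cfg.names) {                      // [4N+1 ..]
        if (name.size() > 255)
            throw std::invalid_argument("name longer than 255 bytes");
        out.push_back(static_cast<std::uint8_t>(name.size()));
        out.insert(out.end(), name.begin(), name.end());
    }
    return out;
}

MergeConfig parse_config(const std::vector<std::uint8_t>& buf) {
    std::size_t pos = 0;
    // Bounds check before every read so a truncated header fails cleanly.
    auto need = [&](std::size_t n) {
        assert(pos <= buf.size());
        if (buf.size() - pos < n) throw std::runtime_error("truncated merge config");
    };

    need(1);
    const std::size_t n = buf[pos++];

    MergeConfig cfg;
    need(4 * n);
    for (std::size_t i = 0; i < n; ++i, pos += 4) {
        cfg.sizes.push_back(std::uint32_t{buf[pos]} |
                            (std::uint32_t{buf[pos + 1]} << 8) |
                            (std::uint32_t{buf[pos + 2]} << 16) |
                            (std::uint32_t{buf[pos + 3]} << 24));
    }
    for (std::size_t i = 0; i < n; ++i) {
        need(1);
        const std::size_t len = buf[pos++];
        need(len);
        cfg.names.emplace_back(reinterpret_cast<const char*>(buf.data() + pos), len);
        pos += len;
    }
    return cfg;
}
```

Writing the sizes byte by byte keeps the encoding little-endian regardless of host byte order, which a plain memcpy of the uint32 values would not guarantee.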