FZGPUModules 2.0
GPU-accelerated modular compression pipelines
#include <rze_stage.h>
Inherits fz::Stage.
Public Member Functions
| void | setInverse (bool inv) override |
| bool | isGraphCompatible () const override |
| size_t | getRequiredInputAlignment () const override |
| void | execute (cudaStream_t stream, MemoryPool *pool, const std::vector< void * > &inputs, const std::vector< void * > &outputs, const std::vector< size_t > &sizes) override |
| void | postStreamSync (cudaStream_t stream) override |
| std::string | getName () const override |
| std::vector< size_t > | estimateOutputSizes (const std::vector< size_t > &input_sizes) const override |
| std::unordered_map< std::string, size_t > | getActualOutputSizesByName () const override |
| size_t | getActualOutputSize (int index) const override |
| size_t | estimateScratchBytes (const std::vector< size_t > &input_sizes) const override |
| uint16_t | getStageTypeId () const override |
| uint8_t | getOutputDataType (size_t) const override |
| size_t | serializeHeader (size_t output_index, uint8_t *buf, size_t max_size) const override |
| void | deserializeHeader (const uint8_t *buf, size_t size) override |
| size_t | getMaxHeaderSize (size_t) const override |
| void | saveState () override |
Public Member Functions inherited from fz::Stage
| virtual std::vector< std::string > | getOutputNames () const |
| int | getOutputIndex (const std::string &name) const |
| virtual uint8_t | getInputDataType (size_t) const |
| virtual void | setDims (const std::array< size_t, 3 > &dims) |
| virtual void | onFinalize (size_t, MemoryPool *) |
| virtual size_t | estimateDeviceFootprintBytes (size_t) const |
| virtual size_t | estimatePinnedFootprintBytes (size_t) const |
Recursive Zero-byte Elimination stage.
setChunkSize(bytes): chunk size in bytes (default 16384; must be a multiple of 4096).
setLevels(n): recursion depth, 1 to 4 (default 4).
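The two documented constraints can be checked before a stage is configured. The following is a minimal sketch of such a check; `RzeConfig` and `isValidRzeConfig` are hypothetical names used for illustration and are not part of FZGPUModules. Rejecting a chunk size of zero is an added assumption.

```cpp
#include <cstddef>

// Hypothetical mirror of the two documented RZEStage parameters.
struct RzeConfig {
    std::size_t chunk_bytes = 16384;  // documented default chunk size
    int levels = 4;                   // documented default recursion depth
};

// Returns true when cfg satisfies the documented constraints:
// chunk size is a multiple of 4096 and recursion depth is in [1, 4].
// Assumption: a chunk size of 0 is also rejected.
inline bool isValidRzeConfig(const RzeConfig& cfg) {
    const bool chunk_ok = cfg.chunk_bytes != 0 && cfg.chunk_bytes % 4096 == 0;
    const bool levels_ok = cfg.levels >= 1 && cfg.levels <= 4;
    return chunk_ok && levels_ok;
}
```

A caller could run this check on user-supplied values before passing them to setChunkSize() and setLevels().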
zero_elim.h, repeated_elim.h, and rze.h from the LC framework (Burtscher et al., BSD-3-Clause). See THIRD_PARTY.md.
void fz::RZEStage::setInverse (bool inv) [inline, override, virtual]
Switch between forward (compression) and inverse (decompression) mode. Affects getNumInputs()/getNumOutputs() for stages with asymmetric port counts.
Reimplemented from fz::Stage.
bool fz::RZEStage::isGraphCompatible () const [inline, override, virtual]
CUDA Graph capture is supported for compression (forward pass) only.
The inverse path reads the stream header (orig_bytes, per-chunk sizes) with two blocking D2H cudaMemcpy calls before it can compute per-chunk decode offsets and launch the decode kernel. These calls prevent the inverse path from being recorded into a CUDA Graph.
This is by design rather than a limitation waiting for a fix: graph-compatible decompression would only help a workflow that repeatedly decompresses the same compressed buffer, which has no practical use case. Graph capture provides real value on the compression path, where every iteration processes new data.
Reimplemented from fz::Stage.
size_t fz::RZEStage::getRequiredInputAlignment () const [inline, override, virtual]
Minimum input size alignment in bytes. Chunked stages return their chunk size; the pipeline uses the LCM of all stage alignments at finalize() to transparently zero-pad the input. Default: 1 (no alignment requirement).
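As an illustration of the padding rule described above, the sketch below computes the least common multiple of a set of stage alignments and rounds an input size up to it. The helper names `combinedAlignment` and `paddedSize` are hypothetical; this is not the pipeline's actual finalize() code, only the arithmetic it is described as performing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greatest common divisor via Euclid's algorithm.
inline std::size_t gcdOf(std::size_t a, std::size_t b) {
    while (b != 0) {
        const std::size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Hypothetical helper: LCM of all stage alignments (1 for an empty list).
// Assumption: every alignment is at least 1.
inline std::size_t combinedAlignment(const std::vector<std::size_t>& alignments) {
    std::size_t result = 1;
    for (std::size_t a : alignments) {
        assert(a >= 1);
        result = result / gcdOf(result, a) * a;
    }
    return result;
}

// Hypothetical helper: round input_bytes up to the next multiple of alignment.
// The difference between the result and input_bytes is the zero-padding length.
inline std::size_t paddedSize(std::size_t input_bytes, std::size_t alignment) {
    assert(alignment >= 1);
    return (input_bytes + alignment - 1) / alignment * alignment;
}
```

For example, with alignments of 16384, 4096, and 1 the combined alignment is 16384, so a 100000-byte input would be padded to 114688 bytes.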
Reimplemented from fz::Stage.
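void fz::RZEStage::execute (cudaStream_t stream, MemoryPool *pool, const std::vector< void * > &inputs, const std::vector< void * > &outputs, const std::vector< size_t > &sizes) [override, virtual]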
Execute the stage. inputs and outputs hold device pointers; sizes holds sizes in bytes.
Stages may call cudaStreamSynchronize(stream) or issue blocking D2H copies when the algorithm requires it (e.g. Huffman histogram readback for codebook construction, ANS renormalization tables). Such stages must return false from isGraphCompatible() and must document the sync points.
Note: the DAG dispatches sibling nodes (same topological level) via a sequential CPU loop, each enqueuing to its own stream. A sync inside execute() blocks the CPU from dispatching subsequent siblings until the synced stream is idle — this delays parallel branches in wide DAGs. In a linear pipeline there are no siblings and no extra cost.
Implements fz::Stage.
void fz::RZEStage::postStreamSync (cudaStream_t stream) [override, virtual]
std::string fz::RZEStage::getName () const [inline, override, virtual]
Human-readable name used in error messages and debug output.
Implements fz::Stage.
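std::vector< size_t > fz::RZEStage::estimateOutputSizes (const std::vector< size_t > &input_sizes) const [inline, override, virtual]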
Estimate output buffer sizes given input sizes. Used for buffer allocation planning in PREALLOCATE mode — must be a safe upper bound; under-estimation causes buffer overruns.
Implements fz::Stage.
std::unordered_map< std::string, size_t > fz::RZEStage::getActualOutputSizesByName () const [override, virtual]
size_t fz::RZEStage::getActualOutputSize (int index) const [override, virtual]
Actual size of a single output by index after execute(). Avoids constructing the map for the common single-output case. Default delegates to getActualOutputSizesByName(); override to return directly from an internal field.
Reimplemented from fz::Stage.
size_t fz::RZEStage::estimateScratchBytes (const std::vector< size_t > &input_sizes) const [inline, override, virtual]
Forward pass allocates four persistent pool arrays proportional to n_chunks = ceil(input_bytes / chunk_size_):
  d_scratch_     : n_chunks * chunk_size_ (per-chunk worst-case output)
  d_sizes_dev_   : n_chunks * 4 (raw compressed sizes)
  d_clean_dev_   : n_chunks * 4 (flag-stripped sizes)
  d_dst_off_dev_ : n_chunks * 4 (exclusive prefix-sum offsets)
Inverse path scratch is transient (allocated and freed within execute), so it is not reported here.
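The forward-pass figures above can be summed as in the following sketch. The function names are hypothetical, and the real estimateScratchBytes() may add alignment padding or other overhead that this arithmetic does not model.

```cpp
#include <cassert>
#include <cstddef>

// n_chunks = ceil(input_bytes / chunk_bytes), using integer arithmetic.
inline std::size_t numChunks(std::size_t input_bytes, std::size_t chunk_bytes) {
    assert(chunk_bytes > 0);
    return (input_bytes + chunk_bytes - 1) / chunk_bytes;
}

// Hypothetical helper: sum of the four documented forward-pass arrays.
inline std::size_t forwardScratchBytes(std::size_t input_bytes, std::size_t chunk_bytes) {
    const std::size_t n = numChunks(input_bytes, chunk_bytes);
    const std::size_t scratch = n * chunk_bytes;  // d_scratch_
    const std::size_t sizes = n * 4;              // d_sizes_dev_
    const std::size_t clean = n * 4;              // d_clean_dev_
    const std::size_t offsets = n * 4;            // d_dst_off_dev_
    return scratch + sizes + clean + offsets;
}
```

With the default 16384-byte chunk size, a 100000-byte input spans 7 chunks, giving 7 * 16384 + 3 * 28 = 114772 bytes under these assumptions.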
Reimplemented from fz::Stage.
uint16_t fz::RZEStage::getStageTypeId () const [inline, override, virtual]
uint8_t fz::RZEStage::getOutputDataType (size_t) const [inline, override, virtual]
DataType enum of the given output port.
Implements fz::Stage.
size_t fz::RZEStage::serializeHeader (size_t output_index, uint8_t *buf, size_t max_size) const [inline, override, virtual]
Serialize stage config into the header buffer buf (max 128 bytes) for the FZM file. Return the number of bytes written, or 0 if the stage has no config.
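The serialize/deserialize contract can be illustrated with a small round trip. The layout below (a 32-bit chunk size followed by an 8-bit level count, in host byte order) is purely hypothetical and is not the actual RZEStage header format; `DemoConfig` and both functions are illustrative names. Returning 0 when the buffer is too small is also an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical config and layout, for illustration only.
struct DemoConfig {
    std::uint32_t chunk_bytes;
    std::uint8_t levels;
};

constexpr std::size_t kDemoHeaderBytes = sizeof(std::uint32_t) + sizeof(std::uint8_t);

// Writes cfg into buf and returns the number of bytes written.
// Assumption: returns 0 when max_size is too small to hold the header.
inline std::size_t serializeDemoHeader(const DemoConfig& cfg, std::uint8_t* buf,
                                       std::size_t max_size) {
    if (max_size < kDemoHeaderBytes) return 0;
    std::memcpy(buf, &cfg.chunk_bytes, sizeof(cfg.chunk_bytes));  // host byte order
    buf[sizeof(cfg.chunk_bytes)] = cfg.levels;
    return kDemoHeaderBytes;
}

// Reads the config back from buf; returns false if size is too small.
inline bool deserializeDemoHeader(const std::uint8_t* buf, std::size_t size,
                                  DemoConfig& out) {
    if (size < kDemoHeaderBytes) return false;
    std::memcpy(&out.chunk_bytes, buf, sizeof(out.chunk_bytes));
    out.levels = buf[sizeof(out.chunk_bytes)];
    return true;
}
```

Copying fields with std::memcpy avoids unaligned access and struct padding issues that a direct cast of the buffer would risk.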
Reimplemented from fz::Stage.
void fz::RZEStage::deserializeHeader (const uint8_t *buf, size_t size) [inline, override, virtual]
Restore stage config from the header buffer buf during decompression.
Reimplemented from fz::Stage.
size_t fz::RZEStage::getMaxHeaderSize (size_t) const [inline, override, virtual]
Maximum bytes this stage writes into its per-output FZM header slot.
Reimplemented from fz::Stage.
void fz::RZEStage::saveState () [inline, override, virtual]
Save/restore config state around a decompression pass. deserializeHeader() overwrites the stage's forward-pass config; saveState() is called before and restoreState() after so the stage returns to its original configuration.
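One way a caller could pair the save and restore calls is a scope guard, sketched below. `DemoStage` is a hypothetical stand-in with a single config field, and `StateGuard` is not part of FZGPUModules; the source only states that saveState() is called before and restoreState() after the pass, not how the pipeline arranges this.

```cpp
// Hypothetical stand-in for a stage with saveState()/restoreState().
struct DemoStage {
    int chunk_bytes = 16384;    // forward-pass config
    int saved_chunk_bytes = 0;  // snapshot taken by saveState()

    void saveState() { saved_chunk_bytes = chunk_bytes; }
    void restoreState() { chunk_bytes = saved_chunk_bytes; }
    // Simulates a header overwriting the forward-pass config.
    void deserializeHeader(int chunk_from_file) { chunk_bytes = chunk_from_file; }
};

// Scope guard: saves on construction, restores on destruction,
// so the config is restored even on an early return or exception.
class StateGuard {
public:
    explicit StateGuard(DemoStage& stage) : stage_(stage) { stage_.saveState(); }
    ~StateGuard() { stage_.restoreState(); }
    StateGuard(const StateGuard&) = delete;
    StateGuard& operator=(const StateGuard&) = delete;

private:
    DemoStage& stage_;
};

// Simulated decompression pass; returns the chunk size in effect during it.
inline int decompressWith(DemoStage& stage, int chunk_from_file) {
    StateGuard guard(stage);
    stage.deserializeHeader(chunk_from_file);
    return stage.chunk_bytes;
}
```

After decompressWith() returns, the stage holds its original forward-pass chunk size again, while the pass itself ran with the value read from the header.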
Reimplemented from fz::Stage.