FZGPUModules 1.0
GPU-accelerated modular compression pipeline
fz::RZEStage Class Reference

#include <rze_stage.h>


Public Member Functions

void setInverse (bool inv) override
 
bool isGraphCompatible () const override
 
size_t getRequiredInputAlignment () const override
 
void execute (cudaStream_t stream, MemoryPool *pool, const std::vector< void * > &inputs, const std::vector< void * > &outputs, const std::vector< size_t > &sizes) override
 
void postStreamSync (cudaStream_t stream) override
 
std::string getName () const override
 
std::vector< size_t > estimateOutputSizes (const std::vector< size_t > &input_sizes) const override
 
std::unordered_map< std::string, size_t > getActualOutputSizesByName () const override
 
size_t getActualOutputSize (int index) const override
 
size_t estimateScratchBytes (const std::vector< size_t > &input_sizes) const override
 
uint16_t getStageTypeId () const override
 
uint8_t getOutputDataType (size_t) const override
 
size_t serializeHeader (size_t output_index, uint8_t *buf, size_t max_size) const override
 
void deserializeHeader (const uint8_t *buf, size_t size) override
 
size_t getMaxHeaderSize (size_t) const override
 
void saveState () override
 
Public Member Functions inherited from fz::Stage
virtual std::vector< std::string > getOutputNames () const
 
int getOutputIndex (const std::string &name) const
 
virtual uint8_t getInputDataType (size_t) const
 
virtual void setDims (const std::array< size_t, 3 > &dims)
 

Detailed Description

Recursive Zero-byte Elimination stage.

setChunkSize(bytes): chunk size in bytes (default 16384; must be a multiple of 4096).
setLevels(n): recursion depth, 1–4 (default 4).
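The documented constraints on the two setters can be sketched as a small validation helper. `RzeConfigSketch` is a hypothetical stand-in, not the library's class; whether the real setters throw, clamp, or assert on invalid values is not stated here, so the exceptions below (and the rejection of a zero chunk size) are assumptions.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for the RZEStage configuration; not the library's class.
struct RzeConfigSketch {
    std::size_t chunk_size = 16384;  // documented default
    int levels = 4;                  // documented default

    // Accept only non-zero multiples of 4096 (throwing is an assumption).
    void setChunkSize(std::size_t bytes) {
        if (bytes == 0 || bytes % 4096 != 0)
            throw std::invalid_argument("chunk size must be a non-zero multiple of 4096");
        chunk_size = bytes;
    }

    // Accept only recursion depths in the documented range 1..4.
    void setLevels(int n) {
        if (n < 1 || n > 4)
            throw std::invalid_argument("levels must be in [1, 4]");
        levels = n;
    }
};
```

With this sketch, `setChunkSize(8192)` is accepted while `setChunkSize(5000)` and `setLevels(5)` are rejected.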

Note
CUDA Graph capture is supported for compression only. The inverse path requires two blocking D2H copies to read the stream header before the decode kernel can be launched.

Member Function Documentation

◆ setInverse()

void fz::RZEStage::setInverse ( bool  inverse)
inline override virtual

Switch between forward (compression) and inverse (decompression) mode. Affects getNumInputs()/getNumOutputs() for stages with asymmetric port counts.

Reimplemented from fz::Stage.

◆ isGraphCompatible()

bool fz::RZEStage::isGraphCompatible ( ) const
inline override virtual

CUDA Graph capture is supported for compression (forward pass) only.

The inverse path reads the stream header (orig_bytes, per-chunk sizes) with two blocking D2H cudaMemcpy calls before it can compute per-chunk decode offsets and launch the decode kernel. These calls prevent the inverse path from being recorded into a CUDA Graph.

This is by design rather than a limitation waiting to be fixed: graph-compatible decompression would only help a "repeatedly decompress the same compressed buffer" workflow, which has no practical use case. The compression path (new data every iteration) is where graph capture provides real value.
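A minimal sketch of how a stage might report this direction-dependent compatibility. `GraphMockStage` and `GraphMockRzeStage` are hypothetical mocks, not the real fz::Stage hierarchy, and the base-class default of `true` is an assumption.

```cpp
// Hypothetical mock of a stage base class; not the real fz::Stage.
class GraphMockStage {
public:
    virtual ~GraphMockStage() = default;
    virtual void setInverse(bool inv) { inverse_ = inv; }
    // Assumed default: stages are capturable unless they say otherwise.
    virtual bool isGraphCompatible() const { return true; }
protected:
    bool inverse_ = false;
};

// Mock RZE-like stage: capturable in forward mode only, because the
// inverse path would need blocking device-to-host copies.
class GraphMockRzeStage : public GraphMockStage {
public:
    bool isGraphCompatible() const override { return !inverse_; }
};
```

A caller could check `isGraphCompatible()` after `setInverse()` to decide whether to capture the stage into a graph or launch it directly on the stream.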

Reimplemented from fz::Stage.

◆ getRequiredInputAlignment()

size_t fz::RZEStage::getRequiredInputAlignment ( ) const
inline override virtual

Minimum input size alignment in bytes. Chunked stages return their chunk size; the pipeline uses the LCM of all stage alignments at finalize() to transparently zero-pad the input. Default: 1 (no alignment requirement).
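The alignment rule described above can be sketched as two helpers: one combines per-stage alignments with a least common multiple, the other rounds an input size up to that alignment. The function names are hypothetical and the pipeline's actual finalize() logic may differ; the sketch assumes C++17 (for `std::lcm`) and non-zero alignments.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// LCM of all stage alignments; an empty list yields 1 (no requirement).
// Assumes every alignment is non-zero.
std::size_t combinedAlignment(const std::vector<std::size_t>& alignments) {
    std::size_t result = 1;
    for (std::size_t a : alignments) result = std::lcm(result, a);
    return result;
}

// Round input_bytes up to the next multiple of alignment; the difference
// is the number of zero bytes that padding would append.
std::size_t paddedSize(std::size_t input_bytes, std::size_t alignment) {
    return (input_bytes + alignment - 1) / alignment * alignment;
}
```

For example, stage alignments of 1, 16384, and 4096 combine to 16384, and a 100000-byte input would then be padded to 114688 bytes.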

Reimplemented from fz::Stage.

◆ execute()

void fz::RZEStage::execute ( cudaStream_t  stream,
MemoryPool *  pool,
const std::vector< void * > &  inputs,
const std::vector< void * > &  outputs,
const std::vector< size_t > &  sizes 
)
override virtual

Execute the stage. inputs and outputs are device pointers; sizes are given in bytes.

Implements fz::Stage.

◆ postStreamSync()

void fz::RZEStage::postStreamSync ( cudaStream_t  stream)
override virtual

Called after dag->execute() and stream sync, before compress() returns. Use for D2H transfers that must not block mid-pipeline (e.g. Lorenzo's outlier count readback). The stream is already idle so a plain cudaMemcpy is safe here.

Reimplemented from fz::Stage.

◆ getName()

std::string fz::RZEStage::getName ( ) const
inline override virtual

Human-readable name used in error messages and debug output.

Implements fz::Stage.

◆ estimateOutputSizes()

std::vector< size_t > fz::RZEStage::estimateOutputSizes ( const std::vector< size_t > &  input_sizes) const
inline override virtual

Estimate output buffer sizes given input sizes. Used for buffer allocation planning in PREALLOCATE mode — must be a safe upper bound; under-estimation causes buffer overruns.

Implements fz::Stage.

◆ getActualOutputSizesByName()

std::unordered_map< std::string, size_t > fz::RZEStage::getActualOutputSizesByName ( ) const
override virtual

Actual output sizes after execute(), keyed by output port name.

Implements fz::Stage.

◆ getActualOutputSize()

size_t fz::RZEStage::getActualOutputSize ( int  index) const
override virtual

Actual size of a single output by index after execute(). Avoids constructing the map for the common single-output case. Default delegates to getActualOutputSizesByName(); override to return directly from an internal field.
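The delegate-then-override pattern described above might look roughly like the following. `SizeMockStage` and `SizeMockSingleOutputStage` are hypothetical mocks, and the out-of-range behavior (returning 0) is an assumption rather than documented library behavior.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical mock base; not the real fz::Stage.
class SizeMockStage {
public:
    virtual ~SizeMockStage() = default;
    virtual std::vector<std::string> getOutputNames() const = 0;
    virtual std::unordered_map<std::string, std::size_t>
    getActualOutputSizesByName() const = 0;

    // Default path: build the map, then look the port up by name.
    virtual std::size_t getActualOutputSize(int index) const {
        const auto names = getOutputNames();
        if (index < 0 || static_cast<std::size_t>(index) >= names.size()) return 0;
        const auto sizes = getActualOutputSizesByName();
        const auto it = sizes.find(names[static_cast<std::size_t>(index)]);
        return it == sizes.end() ? 0 : it->second;
    }
};

// Single-output mock: the override skips the map and reads a cached field.
class SizeMockSingleOutputStage : public SizeMockStage {
public:
    explicit SizeMockSingleOutputStage(std::size_t bytes) : actual_bytes_(bytes) {}
    std::vector<std::string> getOutputNames() const override { return {"output"}; }
    std::unordered_map<std::string, std::size_t>
    getActualOutputSizesByName() const override {
        return {{"output", actual_bytes_}};
    }
    std::size_t getActualOutputSize(int index) const override {
        return index == 0 ? actual_bytes_ : 0;
    }
private:
    std::size_t actual_bytes_;
};
```

Both paths return the same value; the override only avoids allocating the map and the name vector on each call.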

Reimplemented from fz::Stage.

◆ estimateScratchBytes()

size_t fz::RZEStage::estimateScratchBytes ( const std::vector< size_t > &  input_sizes) const
inline override virtual

Forward pass allocates four persistent pool arrays proportional to n_chunks = ceil(input_bytes / chunk_size_):
d_scratch_ : n_chunks * chunk_size_ (per-chunk worst-case output)
d_sizes_dev_ : n_chunks * 4 (raw compressed sizes)
d_clean_dev_ : n_chunks * 4 (flag-stripped sizes)
d_dst_off_dev_: n_chunks * 4 (exclusive prefix-sum offsets)

Inverse path scratch is transient (allocated and freed within execute), so it is not reported here.
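As a worked example of the forward-pass arithmetic above, the four array sizes can be summed as follows. `forwardScratchBytes` is a hypothetical helper; the real estimateScratchBytes() may add alignment padding or other overhead that is not described here.

```cpp
#include <cstddef>

// Sum of the four forward-pass arrays listed above, in bytes.
// Assumes chunk_size is non-zero.
std::size_t forwardScratchBytes(std::size_t input_bytes, std::size_t chunk_size) {
    // n_chunks = ceil(input_bytes / chunk_size), in integer arithmetic.
    const std::size_t n_chunks = (input_bytes + chunk_size - 1) / chunk_size;
    const std::size_t scratch = n_chunks * chunk_size;  // d_scratch_
    const std::size_t per_chunk = 3 * n_chunks * 4;     // sizes, clean sizes, offsets
    return scratch + per_chunk;
}
```

For a 100000-byte input with the default 16384-byte chunk size, n_chunks is 7, giving 7 * 16384 + 3 * 7 * 4 = 114772 bytes.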

Reimplemented from fz::Stage.

◆ getStageTypeId()

uint16_t fz::RZEStage::getStageTypeId ( ) const
inline override virtual

Stage type identifier written into the FZM file header.

Implements fz::Stage.

◆ getOutputDataType()

uint8_t fz::RZEStage::getOutputDataType ( size_t  output_index) const
inline override virtual

DataType enum of the given output port.

Implements fz::Stage.

◆ serializeHeader()

size_t fz::RZEStage::serializeHeader ( size_t  output_index,
uint8_t *  header_buffer,
size_t  max_size 
) const
inline override virtual

Serialize stage config into header_buffer (max 128 bytes) for the FZM file. Return the number of bytes written, or 0 if the stage has no config.

Reimplemented from fz::Stage.

◆ deserializeHeader()

void fz::RZEStage::deserializeHeader ( const uint8_t *  header_buffer,
size_t  size 
)
inline override virtual

Restore stage config from header_buffer during decompression.
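A header round trip between serializeHeader() and deserializeHeader() could be sketched as below. The layout (a 4-byte chunk size followed by a 1-byte level count, in native byte order) is purely illustrative; the actual RZE header format is not documented here. Returning 0 when the buffer is too small is likewise an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical config layout; not the real RZE header format.
struct MockRzeHeader {
    std::uint32_t chunk_size;
    std::uint8_t levels;
};

constexpr std::size_t kMockHeaderBytes = 5;  // 4 bytes chunk size + 1 byte levels

// Write the config into buf; returns bytes written, or 0 if it does not fit.
std::size_t serializeMockHeader(const MockRzeHeader& h, std::uint8_t* buf,
                                std::size_t max_size) {
    if (buf == nullptr || max_size < kMockHeaderBytes) return 0;
    std::memcpy(buf, &h.chunk_size, sizeof(h.chunk_size));  // native byte order
    buf[4] = h.levels;
    return kMockHeaderBytes;
}

// Read the config back; returns false if the buffer is too short.
bool deserializeMockHeader(const std::uint8_t* buf, std::size_t size,
                           MockRzeHeader& out) {
    if (buf == nullptr || size < kMockHeaderBytes) return false;
    std::memcpy(&out.chunk_size, buf, sizeof(out.chunk_size));
    out.levels = buf[4];
    return true;
}
```

A portable file format would normally fix the byte order explicitly; the sketch copies native-order bytes only to stay short.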

Reimplemented from fz::Stage.

◆ getMaxHeaderSize()

size_t fz::RZEStage::getMaxHeaderSize ( size_t  output_index) const
inline override virtual

Maximum bytes this stage writes into its per-output FZM header slot.

Reimplemented from fz::Stage.

◆ saveState()

void fz::RZEStage::saveState ( )
inline override virtual

Save/restore config state around a decompression pass. deserializeHeader() overwrites the stage's forward-pass config; saveState() is called before and restoreState() after so the stage returns to its original configuration.
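The save/overwrite/restore sequence described above can be illustrated with a small mock. `StateMockStage`, `loadFromHeader`, and `decompressWith` are hypothetical names; only saveState() and restoreState() come from the documentation, and the real stage may snapshot more fields than a chunk size.

```cpp
#include <cstddef>

// Hypothetical mock stage holding a single config field.
class StateMockStage {
public:
    void setChunkSize(std::size_t bytes) { chunk_size_ = bytes; }
    std::size_t chunkSize() const { return chunk_size_; }

    void saveState() { saved_chunk_size_ = chunk_size_; }
    void restoreState() { chunk_size_ = saved_chunk_size_; }

    // Stand-in for deserializeHeader(): overwrites the forward-pass config.
    void loadFromHeader(std::size_t header_chunk_size) { chunk_size_ = header_chunk_size; }

private:
    std::size_t chunk_size_ = 16384;
    std::size_t saved_chunk_size_ = 16384;
};

// Hypothetical decompression wrapper: save, overwrite from the header,
// decode (omitted), then restore. Returns the chunk size used for decoding.
std::size_t decompressWith(StateMockStage& stage, std::size_t header_chunk_size) {
    stage.saveState();
    stage.loadFromHeader(header_chunk_size);
    const std::size_t used = stage.chunkSize();  // the decode would run here
    stage.restoreState();
    return used;
}
```

After `decompressWith` returns, the stage holds the chunk size it had before the call, so a subsequent compression pass is unaffected by the header that was just read.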

Reimplemented from fz::Stage.