FZGPUModules 2.0
GPU-accelerated modular compression pipelines
fz::Stage Class Reference (abstract)

#include <stage.h>

Public Member Functions

virtual void execute (cudaStream_t stream, MemoryPool *pool, const std::vector< void * > &inputs, const std::vector< void * > &outputs, const std::vector< size_t > &sizes)=0
 
virtual std::string getName () const =0
 
virtual size_t getRequiredInputAlignment () const
 
virtual std::vector< std::string > getOutputNames () const
 
int getOutputIndex (const std::string &name) const
 
virtual std::vector< size_t > estimateOutputSizes (const std::vector< size_t > &input_sizes) const =0
 
virtual std::unordered_map< std::string, size_t > getActualOutputSizesByName () const =0
 
virtual size_t getActualOutputSize (int index) const
 
virtual void setInverse (bool inverse)
 
virtual uint16_t getStageTypeId () const =0
 
virtual uint8_t getOutputDataType (size_t output_index) const =0
 
virtual uint8_t getInputDataType (size_t) const
 
virtual size_t serializeHeader (size_t output_index, uint8_t *header_buffer, size_t max_size) const
 
virtual void deserializeHeader (const uint8_t *header_buffer, size_t size)
 
virtual void saveState ()
 
virtual void setDims (const std::array< size_t, 3 > &dims)
 
virtual void onFinalize (size_t, MemoryPool *)
 
virtual size_t estimateDeviceFootprintBytes (size_t) const
 
virtual size_t estimatePinnedFootprintBytes (size_t) const
 
virtual void postStreamSync (cudaStream_t stream)
 
virtual size_t getMaxHeaderSize (size_t output_index) const
 
virtual bool isGraphCompatible () const
 
virtual size_t estimateScratchBytes (const std::vector< size_t > &input_sizes) const
 

Detailed Description

Base class for all compression/decompression stages.

A stage is a single transformation in the pipeline (e.g. Lorenzo predictor, RLE encoder, bitshuffle). The pipeline interacts with stages exclusively through this interface — no downcasting or type-name branching anywhere in the pipeline or DAG code.

Member Function Documentation

◆ execute()

virtual void fz::Stage::execute ( cudaStream_t  stream,
MemoryPool *  pool,
const std::vector< void * > &  inputs,
const std::vector< void * > &  outputs,
const std::vector< size_t > &  sizes 
)
pure virtual

Execute the stage. The entries of inputs and outputs are device pointers; the entries of sizes are byte counts.

Stages may call cudaStreamSynchronize(stream) or issue blocking D2H copies when the algorithm requires it (e.g. Huffman histogram readback for codebook construction, ANS renormalization tables). Such stages must return false from isGraphCompatible() and must document the sync points.

Note: the DAG dispatches sibling nodes (same topological level) via a sequential CPU loop, each enqueuing to its own stream. A sync inside execute() blocks the CPU from dispatching subsequent siblings until the synced stream is idle — this delays parallel branches in wide DAGs. In a linear pipeline there are no siblings and no extra cost.

Implemented in fz::BitpackStage< T >, fz::HuffmanStage< T >, fz::RLEStage< T >, fz::RZEStage, fz::LorenzoQuantStage< TInput, TCode >, fz::DifferenceStage< T, TOut >, fz::LorenzoStage< T >, fz::QuantizerStage< TInput, TCode >, fz::BitshuffleStage, fz::NegabinaryStage< TIn, TOut >, and fz::ZigzagStage< TIn, TOut >.

◆ getName()

◆ getRequiredInputAlignment()

virtual size_t fz::Stage::getRequiredInputAlignment ( ) const
inline virtual

Minimum input size alignment in bytes. Chunked stages return their chunk size; the pipeline uses the LCM of all stage alignments at finalize() to transparently zero-pad the input. Default: 1 (no alignment requirement).

Reimplemented in fz::RZEStage, fz::DifferenceStage< T, TOut >, and fz::BitshuffleStage.
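The LCM-and-pad arithmetic described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: combinedAlignment and paddedSize are hypothetical helper names, and the vector of alignments stands in for the values each stage would return from getRequiredInputAlignment().

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>  // std::lcm (C++17)
#include <vector>

// Hypothetical helper: fold per-stage alignments into one pipeline alignment.
// Each entry stands in for one stage's getRequiredInputAlignment() result.
inline std::size_t combinedAlignment(const std::vector<std::size_t>& alignments) {
    std::size_t result = 1;  // 1 means "no alignment requirement"
    for (std::size_t a : alignments) {
        assert(a > 0 && "an alignment of 0 is meaningless");
        result = std::lcm(result, a);
    }
    return result;
}

// Hypothetical helper: round input_bytes up to the next multiple of alignment.
// The bytes between input_bytes and the result are what would be zero-padded.
inline std::size_t paddedSize(std::size_t input_bytes, std::size_t alignment) {
    return (input_bytes + alignment - 1) / alignment * alignment;
}
```

For example, stages reporting 1, 4096, and 16384 combine to 16384, and a 10000-byte input padded to a 4096-byte alignment becomes 12288 bytes.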

◆ getOutputNames()

virtual std::vector< std::string > fz::Stage::getOutputNames ( ) const
inline virtual

Output port names in order. Default: single port named "output". Multi-output stages (e.g. Lorenzo: "codes", "outliers") override this.

Reimplemented in fz::LorenzoQuantStage< TInput, TCode >, and fz::QuantizerStage< TInput, TCode >.

◆ getOutputIndex()

int fz::Stage::getOutputIndex ( const std::string &  name) const
inline

Returns the index of a named output port, or -1 if not found.
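The lookup contract (index on success, -1 on failure) might look like the sketch below. It is written as a free function over a list of names so that it compiles on its own; outputIndexOf is a hypothetical name, and the real member presumably searches the result of getOutputNames().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical free-function version of the port lookup.
// names stands in for the vector returned by getOutputNames().
inline int outputIndexOf(const std::vector<std::string>& names, const std::string& name) {
    for (std::size_t i = 0; i < names.size(); ++i) {
        if (names[i] == name) return static_cast<int>(i);
    }
    return -1;  // no port with that name
}
```

Callers should test for -1 before using the result as an index.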

◆ estimateOutputSizes()

virtual std::vector< size_t > fz::Stage::estimateOutputSizes ( const std::vector< size_t > &  input_sizes) const
pure virtual

Estimate output buffer sizes given input sizes. Used for buffer allocation planning in PREALLOCATE mode — must be a safe upper bound; under-estimation causes buffer overruns.

Implemented in fz::BitpackStage< T >, fz::HuffmanStage< T >, fz::RLEStage< T >, fz::RZEStage, fz::LorenzoQuantStage< TInput, TCode >, fz::DifferenceStage< T, TOut >, fz::LorenzoStage< T >, fz::QuantizerStage< TInput, TCode >, fz::BitshuffleStage, fz::NegabinaryStage< TIn, TOut >, and fz::ZigzagStage< TIn, TOut >.
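To illustrate what a safe upper bound means, the sketch below uses a toy run-length format that is an assumption made for this example and is not the format used by fz::RLEStage: every run is stored as one value byte plus one count byte, so input without any repeats doubles in size. estimateToyRleOutputSizes is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy format (assumed for illustration only): one value byte + one count byte per run.
// Worst case is input with no repeated bytes, where every byte becomes its own run.
inline std::vector<std::size_t> estimateToyRleOutputSizes(const std::vector<std::size_t>& input_sizes) {
    const std::size_t in_bytes = input_sizes.empty() ? 0 : input_sizes[0];
    return {2 * in_bytes};  // single output port, sized for the worst case
}
```

Returning the typical compressed size instead of the worst case would be an under-estimate, and incompressible input would then overrun the preallocated buffer.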

◆ getActualOutputSizesByName()

virtual std::unordered_map< std::string, size_t > fz::Stage::getActualOutputSizesByName ( ) const
pure virtual

◆ getActualOutputSize()

virtual size_t fz::Stage::getActualOutputSize ( int  index) const
inline virtual

Actual size of a single output by index after execute(). Avoids constructing the map for the common single-output case. Default delegates to getActualOutputSizesByName(); override to return directly from an internal field.

Reimplemented in fz::BitpackStage< T >, fz::HuffmanStage< T >, fz::RLEStage< T >, fz::RZEStage, fz::LorenzoQuantStage< TInput, TCode >, fz::DifferenceStage< T, TOut >, fz::LorenzoStage< T >, fz::QuantizerStage< TInput, TCode >, fz::BitshuffleStage, fz::NegabinaryStage< TIn, TOut >, and fz::ZigzagStage< TIn, TOut >.
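The delegation and the direct-return override can be sketched with a reduced stand-in for the interface. MiniStage, MapOnlyStage, and CachedStage are hypothetical types; how the real default resolves an index (here: index to port name to map entry, with 0 for an invalid index) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Reduced stand-in for fz::Stage with only the members needed here.
struct MiniStage {
    virtual ~MiniStage() = default;
    virtual std::vector<std::string> getOutputNames() const { return {"output"}; }
    virtual std::unordered_map<std::string, std::size_t> getActualOutputSizesByName() const = 0;

    // Assumed default: resolve index -> port name -> size through the map.
    virtual std::size_t getActualOutputSize(int index) const {
        const auto names = getOutputNames();
        if (index < 0 || static_cast<std::size_t>(index) >= names.size()) return 0;
        const auto sizes = getActualOutputSizesByName();
        const auto it = sizes.find(names[static_cast<std::size_t>(index)]);
        return it == sizes.end() ? 0 : it->second;
    }
};

// Two-output stage that relies on the default (map-building) path.
struct MapOnlyStage : MiniStage {
    std::vector<std::string> getOutputNames() const override { return {"codes", "outliers"}; }
    std::unordered_map<std::string, std::size_t> getActualOutputSizesByName() const override {
        return {{"codes", 1024}, {"outliers", 64}};
    }
};

// Single-output stage that skips the map and returns a cached field directly.
struct CachedStage : MiniStage {
    std::size_t actual_bytes = 0;  // would be updated at the end of execute()
    std::unordered_map<std::string, std::size_t> getActualOutputSizesByName() const override {
        return {{"output", actual_bytes}};
    }
    std::size_t getActualOutputSize(int) const override { return actual_bytes; }
};
```

The override in CachedStage avoids allocating a map and hashing a string on every query, which is the motivation given above for the single-output case.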

◆ setInverse()

virtual void fz::Stage::setInverse ( bool  inverse)
inline virtual

Switch between forward (compression) and inverse (decompression) mode. Affects getNumInputs()/getNumOutputs() for stages with asymmetric port counts.

Reimplemented in fz::BitpackStage< T >, fz::HuffmanStage< T >, fz::RZEStage, fz::LorenzoStage< T >, fz::BitshuffleStage, fz::NegabinaryStage< TIn, TOut >, fz::ZigzagStage< TIn, TOut >, fz::LorenzoQuantStage< TInput, TCode >, fz::RLEStage< T >, fz::DifferenceStage< T, TOut >, and fz::QuantizerStage< TInput, TCode >.

◆ getStageTypeId()

◆ getOutputDataType()

◆ getInputDataType()

virtual uint8_t fz::Stage::getInputDataType ( size_t  ) const
inline virtual

Expected DataType of the given input port.

Used by Pipeline::finalize() to detect type mismatches between connected stages before any execution. Return DataType::UNKNOWN to opt out of checking — byte-transparent stages (Bitshuffle, RZE) and mock stages must return UNKNOWN; finalize() skips any connection where either side is UNKNOWN.

Reimplemented in fz::BitpackStage< T >, fz::HuffmanStage< T >, fz::RLEStage< T >, fz::LorenzoQuantStage< TInput, TCode >, fz::DifferenceStage< T, TOut >, fz::LorenzoStage< T >, fz::QuantizerStage< TInput, TCode >, fz::NegabinaryStage< TIn, TOut >, and fz::ZigzagStage< TIn, TOut >.
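The skip-on-UNKNOWN rule reduces to a small predicate. In the sketch below the enumerators other than UNKNOWN and all numeric values are placeholders, since the real DataType definition is not shown on this page; connectionTypesOk is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder type tags; the real DataType enumerators and values may differ.
enum class DataType : std::uint8_t { UNKNOWN = 0, UINT8, UINT16, FLOAT32 };

// A connection passes if either side opted out of checking or both tags match.
constexpr bool connectionTypesOk(DataType producer_out, DataType consumer_in) {
    if (producer_out == DataType::UNKNOWN || consumer_in == DataType::UNKNOWN) {
        return true;  // byte-transparent or mock stage: nothing to check
    }
    return producer_out == consumer_in;
}

static_assert(connectionTypesOk(DataType::UINT16, DataType::UNKNOWN), "opt-out side skips the check");
static_assert(!connectionTypesOk(DataType::FLOAT32, DataType::UINT16), "mismatch is rejected");
```

Because UNKNOWN on either side disables the check, a byte-transparent stage placed between two typed stages hides any mismatch between them from this test.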

◆ serializeHeader()

virtual size_t fz::Stage::serializeHeader ( size_t  output_index,
uint8_t *  header_buffer,
size_t  max_size 
) const
inline virtual

Serialize stage config into header_buffer (max 128 bytes) for the FZM file. Return the number of bytes written, or 0 if the stage has no config.

Reimplemented in fz::RZEStage, fz::DifferenceStage< T, TOut >, fz::QuantizerStage< TInput, TCode >, fz::BitshuffleStage, fz::ZigzagStage< TIn, TOut >, fz::RLEStage< T >, fz::LorenzoQuantStage< TInput, TCode >, fz::BitpackStage< T >, fz::HuffmanStage< T >, fz::LorenzoStage< T >, and fz::NegabinaryStage< TIn, TOut >.
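A header round trip that respects max_size could be sketched as below. DemoConfig and both function names are hypothetical, and returning 0 when the buffer is too small is an assumption of this sketch; a real stage might report that case differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical fixed-size config for an imaginary quantizer-like stage.
struct DemoConfig {
    float error_bound;
    std::uint32_t radius;
};

// Copies the config into header_buffer.
// Returns the number of bytes written, or 0 if the buffer cannot hold it.
inline std::size_t serializeDemoHeader(const DemoConfig& cfg, std::uint8_t* header_buffer,
                                       std::size_t max_size) {
    if (header_buffer == nullptr || max_size < sizeof(DemoConfig)) return 0;
    std::memcpy(header_buffer, &cfg, sizeof(DemoConfig));
    return sizeof(DemoConfig);
}

// Reads the config back; returns false if the header is too short.
inline bool deserializeDemoHeader(const std::uint8_t* header_buffer, std::size_t size,
                                  DemoConfig& out) {
    if (header_buffer == nullptr || size < sizeof(DemoConfig)) return false;
    std::memcpy(&out, header_buffer, sizeof(DemoConfig));
    return true;
}
```

Copying the struct with memcpy ties the header to the writer's byte order and struct layout; a file format meant to be portable would write each field in a fixed byte order instead.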

◆ deserializeHeader()

virtual void fz::Stage::deserializeHeader ( const uint8_t *  header_buffer,
size_t  size 
)
inline virtual

◆ saveState()

virtual void fz::Stage::saveState ( )
inline virtual

Save/restore config state around a decompression pass. deserializeHeader() overwrites the stage's forward-pass config; saveState() is called before and restoreState() after so the stage returns to its original configuration.

Reimplemented in fz::BitpackStage< T >, fz::HuffmanStage< T >, fz::RZEStage, fz::LorenzoQuantStage< TInput, TCode >, fz::DifferenceStage< T, TOut >, fz::QuantizerStage< TInput, TCode >, and fz::BitshuffleStage.
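The save, deserialize, restore sequence can be sketched with a stand-in stage. DemoStage, its chunk_size field, and decompressOnce are hypothetical; only the call order is taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in stage with a single hypothetical config field.
struct DemoStage {
    std::uint32_t chunk_size = 16384;   // forward-pass setting
    std::uint32_t saved_chunk_size = 0; // snapshot taken by saveState()

    void saveState() { saved_chunk_size = chunk_size; }
    void restoreState() { chunk_size = saved_chunk_size; }

    // Overwrites the live config with the value recorded in the file header.
    void deserializeHeader(const std::uint8_t* header_buffer, std::size_t size) {
        if (header_buffer != nullptr && size >= sizeof(chunk_size)) {
            std::memcpy(&chunk_size, header_buffer, sizeof(chunk_size));
        }
    }
};

// One decompression pass: snapshot, load header config, run, restore.
// Returns the chunk size the inverse pass would have run with.
inline std::uint32_t decompressOnce(DemoStage& stage, const std::uint8_t* header, std::size_t size) {
    stage.saveState();
    stage.deserializeHeader(header, size);
    const std::uint32_t used = stage.chunk_size;  // inverse execute() would happen here
    stage.restoreState();
    return used;
}
```

After decompressOnce returns, the stage is back to its forward-pass configuration, so the same pipeline object can compress again without being reconfigured.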

◆ setDims()

virtual void fz::Stage::setDims ( const std::array< size_t, 3 > &  dims)
inline virtual

Called once by Pipeline::finalize() so stages can react to the dataset dimensions set via Pipeline::setDims() after construction.

Parameters
dims    {x, y, z} extents (z == 1 → 2-D; y == z == 1 → 1-D)

Reimplemented in fz::LorenzoQuantStage< TInput, TCode >, and fz::LorenzoStage< T >.
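The extent convention in the parameter description maps to a rank as sketched below. dimensionality and elementCount are hypothetical helper names, not members of the library.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Rank implied by {x, y, z}: z == 1 means 2-D, y == z == 1 means 1-D.
inline int dimensionality(const std::array<std::size_t, 3>& dims) {
    if (dims[2] > 1) return 3;
    if (dims[1] > 1) return 2;
    return 1;
}

// Total number of elements described by the extents.
inline std::size_t elementCount(const std::array<std::size_t, 3>& dims) {
    return dims[0] * dims[1] * dims[2];
}
```

A stage overriding setDims() could use a check of this kind to select a 1-D, 2-D, or 3-D kernel variant.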

◆ onFinalize()

virtual void fz::Stage::onFinalize ( size_t  ,
MemoryPool *  
)
inline virtual

Called once by Pipeline::finalize() after buffer-size propagation, with this stage's estimated input size (bytes) and the pipeline pool.

Implement this to pre-allocate persistent stage-internal scratch (e.g. Huffman codebook/histogram buffers) via pool->allocatePersistentDevice and pool->allocatePersistentPinned rather than via cudaMalloc directly. Pre-allocating here makes PREALLOCATE mode semantically correct (all memory committed at finalize time) and makes the stage footprint visible via pool->getPersistentDeviceBytes() / getPersistentPinnedBytes().

Stages that also allow lazy allocation (e.g. for capacity-growth realloc in execute()) should check whether pool was already used to allocate here and skip the lazy path if so.

Default: no-op.

Reimplemented in fz::HuffmanStage< T >.

◆ estimateDeviceFootprintBytes()

virtual size_t fz::Stage::estimateDeviceFootprintBytes ( size_t  ) const
inline virtual

Estimated persistent device memory this stage allocates outside the pool (via pool->allocatePersistentDevice). Used for total footprint reporting. Default: 0.

Reimplemented in fz::HuffmanStage< T >.

◆ estimatePinnedFootprintBytes()

virtual size_t fz::Stage::estimatePinnedFootprintBytes ( size_t  ) const
inline virtual

Estimated persistent pinned-host memory this stage allocates outside the pool (via pool->allocatePersistentPinned). Used for total footprint reporting. Default: 0.

Reimplemented in fz::HuffmanStage< T >.

◆ postStreamSync()

virtual void fz::Stage::postStreamSync ( cudaStream_t  stream)
inline virtual

Called after dag->execute() and stream sync, before compress() returns. Use for D2H transfers that must not block mid-pipeline (e.g. Lorenzo's outlier count readback). The stream is already idle so a plain cudaMemcpy is safe here.

Reimplemented in fz::RLEStage< T >, fz::RZEStage, fz::LorenzoQuantStage< TInput, TCode >, and fz::QuantizerStage< TInput, TCode >.

◆ getMaxHeaderSize()

virtual size_t fz::Stage::getMaxHeaderSize ( size_t  output_index) const
inline virtual

◆ isGraphCompatible()

virtual bool fz::Stage::isGraphCompatible ( ) const
inline virtual

Whether this stage is safe inside a CUDA Graph capture.

A stage is graph-compatible if execute() enqueues only device-side work (kernel launches, cudaMemcpyAsync D2D/H2D) and makes no host-synchronous calls. Override and return false if execute() contains D2H copies or dynamic decisions based on device data — the DAG will throw at setCaptureMode(true) time rather than producing a broken graph.

Default: true. Inverse-mode stages that do D2H reads (e.g. RZE inverse) must return false.

Reimplemented in fz::BitpackStage< T >, fz::HuffmanStage< T >, and fz::RZEStage.
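The fail-early behaviour described above can be sketched as a check run before capture starts. Everything here is a stand-in: GraphStage, SyncFreeStage, ReadbackStage, and checkCaptureAllowed are hypothetical, and the exception type the real DAG throws is not stated on this page.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Reduced stand-in for fz::Stage with only the members needed here.
struct GraphStage {
    virtual ~GraphStage() = default;
    virtual std::string getName() const = 0;
    virtual bool isGraphCompatible() const { return true; }  // default: compatible
};

// Enqueues only device-side work, so it keeps the default.
struct SyncFreeStage : GraphStage {
    std::string getName() const override { return "sync_free"; }
};

// Reads a value back to the host inside execute(), so it must opt out.
struct ReadbackStage : GraphStage {
    std::string getName() const override { return "readback"; }
    bool isGraphCompatible() const override { return false; }
};

// Hypothetical pre-capture check: reject the whole set if any stage opted out.
inline void checkCaptureAllowed(const std::vector<const GraphStage*>& stages) {
    for (const GraphStage* s : stages) {
        if (!s->isGraphCompatible()) {
            throw std::runtime_error("stage '" + s->getName() + "' is not graph-compatible");
        }
    }
}
```

Rejecting the configuration up front is preferable to capturing anyway, because a host-synchronous call during stream capture invalidates the capture.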

◆ estimateScratchBytes()

virtual size_t fz::Stage::estimateScratchBytes ( const std::vector< size_t > &  input_sizes) const
inline virtual

Peak persistent scratch bytes this stage holds in the MemoryPool.

Only count allocations that are drawn from the pool and kept alive across execute() calls. Transient scratch freed within execute() is already captured by the pool's high-water mark and must not be included. Used by CompressionDAG::computeTopoPoolSize() to size the release threshold.

Reimplemented in fz::RLEStage< T >, and fz::RZEStage.