#include <adaptive_bitpack_stage.h>

Inherits fz::Stage.

Public Member Functions
void	setInverse (bool inv) override

bool	isGraphCompatible () const override

void	setBlockSize (uint32_t n)

void	setOutlierSelection (bool enable)

void	execute (cudaStream_t stream, MemoryPool *pool, const std::vector< void * > &inputs, const std::vector< void * > &outputs, const std::vector< size_t > &sizes) override

void	postStreamSync (cudaStream_t stream) override

std::string	getName () const override

std::vector< size_t >	estimateOutputSizes (const std::vector< size_t > &input_sizes) const override

size_t	estimateScratchBytes (const std::vector< size_t > &input_sizes) const override

std::unordered_map< std::string, size_t >	getActualOutputSizesByName () const override

size_t	getActualOutputSize (int index) const override

uint16_t	getStageTypeId () const override

uint8_t	getOutputDataType (size_t) const override

uint8_t	getInputDataType (size_t) const override

size_t	serializeHeader (size_t, uint8_t *buf, size_t max_size) const override

void	deserializeHeader (const uint8_t *buf, size_t size) override

size_t	getMaxHeaderSize (size_t) const override

void	saveState () override

Public Member Functions inherited from fz::Stage
virtual size_t	getRequiredInputAlignment () const

virtual std::vector< std::string >	getOutputNames () const

int	getOutputIndex (const std::string &name) const

virtual void	setDims (const std::array< size_t, 3 > &dims)

virtual void	onFinalize (size_t, MemoryPool *)

virtual size_t	estimateDeviceFootprintBytes (size_t) const

virtual size_t	estimatePinnedFootprintBytes (size_t) const

Detailed Description

template<typename T>
class fz::AdaptiveBitpackStage< T >

Per-block adaptive fixed-rate bit-plane coder — the cuSZp lossless back-end's "plain" mode, as a modular stage.

For each block of block_size signed elements it emits one rate byte (the bit width of the largest magnitude in the block) followed, when that rate is non-zero, by a per-block sign bitmap and that many bit-planes. Per-block byte offsets are resolved with a plain device-wide exclusive scan (CUB) rather than the cuSZp decoupled look-back scan, which is a fusion-time optimization left to the downstream compiler.
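The per-block rate, byte cost, and offset scan described above can be illustrated with a host-side reference sketch. This is not the stage's kernel code: the helper names (`blockRate`, `blockCostBytes`, `blockOffsets`) are hypothetical, the element type is fixed to `int32_t` for brevity, and the cost formula assumes the rate byte is counted together with a `ceil(n / 8)`-byte sign bitmap and equally sized bit-planes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bit width of the largest magnitude in one block (0 if the block is all zeros).
inline uint8_t blockRate(const int32_t* block, size_t n) {
    uint32_t max_mag = 0;
    for (size_t i = 0; i < n; ++i) {
        // Widen before negating so INT32_MIN does not overflow.
        const int64_t v = block[i];
        const uint32_t mag = static_cast<uint32_t>(v < 0 ? -v : v);
        if (mag > max_mag) max_mag = mag;
    }
    uint8_t rate = 0;
    while (max_mag != 0) { ++rate; max_mag >>= 1; }
    return rate;
}

// Assumed byte-granular cost of one block: 1 rate byte, plus (only when the
// rate is non-zero) one sign bitmap and `rate` bit-planes of ceil(n / 8) bytes.
inline uint32_t blockCostBytes(uint8_t rate, size_t n) {
    assert(n >= 1 && n <= 1024);
    const uint32_t plane_bytes = static_cast<uint32_t>((n + 7) / 8);
    return rate == 0 ? 1u : 1u + plane_bytes * (1u + rate);
}

// Exclusive scan of the per-block costs: offsets[b] is where block b starts.
inline std::vector<uint32_t> blockOffsets(const std::vector<uint32_t>& costs) {
    std::vector<uint32_t> offsets(costs.size());
    uint32_t running = 0;
    for (size_t b = 0; b < costs.size(); ++b) {
        offsets[b] = running;
        running += costs[b];
    }
    return offsets;
}
```

On the device, the same prefix sum would be produced by the CUB exclusive scan mentioned above; the loop here only shows what the offsets mean.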

Pair with QuantizerStage(linear) → LorenzoStage(setBlockSize) for the cuSZp pipeline; block_size typically matches the Lorenzo block (32) but need not.

Note: Forward is graph-compatible; inverse is not. The forward path's data-dependent compressed-size readback (cuSZp's cmpSize D2H) is deferred to postStreamSync(), which runs after the launch and outside any capture window, so execute() enqueues only stream-ordered device work. Per-block cost/offset scratch is kept persistent so the readback can happen post-sync and no allocation occurs during graph replay (mirrors RZEStage's forward). The inverse keeps a per-execute layout and stays out of capture.

Prior work: the per-block fixed-rate bit-plane encoding is the cuSZp lossless scheme (Yafan Huang et al., SC'23/SC'24, BSD-3-Clause). This stage is a direct port of the cuSZp encode/decode kernel logic; the byte-granular layout, CUB offset scan, and FZM/MemoryPool scaffolding are FZGPUModules code. The cuSZp BSD-3-Clause copyright is reproduced in THIRD_PARTY.md. See also memory/cuszp_stages.md.

Template Parameters

T	Signed element type: `int16_t` or `int32_t`.

Member Function Documentation

◆ setInverse()

template<typename T >

void fz::AdaptiveBitpackStage< T >::setInverse ( bool inverse )

inline override virtual

Switch between forward (compression) and inverse (decompression) mode. Affects getNumInputs()/getNumOutputs() for stages with asymmetric port counts.

Reimplemented from fz::Stage.

◆ isGraphCompatible()

template<typename T >

bool fz::AdaptiveBitpackStage< T >::isGraphCompatible ( ) const

inline override virtual

Whether this stage is safe inside a CUDA Graph capture.

A stage is graph-compatible if execute() enqueues only device-side work (kernel launches, cudaMemcpyAsync D2D/H2D) and makes no host-synchronous calls. Override and return false if execute() contains D2H copies or dynamic decisions based on device data — the DAG will throw at setCaptureMode(true) time rather than producing a broken graph.

Default: true. Inverse-mode stages that do D2H reads (e.g. RZE inverse) must return false.

Reimplemented from fz::Stage.

◆ setBlockSize()

template<typename T >

void fz::AdaptiveBitpackStage< T >::setBlockSize ( uint32_t n )

inline

Elements per logical block (the fixed-rate granularity). Must be in [1, 1024]. cuSZp uses 32.

◆ setOutlierSelection()

template<typename T >

void fz::AdaptiveBitpackStage< T >::setOutlierSelection ( bool enable )

inline

Enable cuSZp2 per-block plain/outlier selection: each block may instead store element 0 as a raw 1..sizeof(T)-byte outlier and pack only the rest, whichever is smaller. Helps non-sparse, high-smoothness data (the first element of each block is delta-vs-0, i.e. a full magnitude). Off by default.
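The "whichever is smaller" choice can be sketched as a per-block cost comparison. The code below is an illustrative host-side model only: `chooseBlockMode`, `BlockChoice`, and the exact byte counts are assumptions, the element type is fixed to `int32_t`, and details such as where the outlier's sign is stored are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bits needed to represent magnitude m (0 for m == 0).
inline uint32_t bitWidth(uint32_t m) {
    uint32_t bits = 0;
    while (m != 0) { ++bits; m >>= 1; }
    return bits;
}

// Largest bit width over block[first .. n).
inline uint32_t rateOf(const int32_t* block, size_t first, size_t n) {
    uint32_t max_mag = 0;
    for (size_t i = first; i < n; ++i) {
        const int64_t v = block[i];  // widen so INT32_MIN negates safely
        const uint32_t mag = static_cast<uint32_t>(v < 0 ? -v : v);
        if (mag > max_mag) max_mag = mag;
    }
    return bitWidth(max_mag);
}

// Assumed payload: sign bitmap plus `rate` bit-planes, nothing when rate == 0.
inline uint32_t packedBytes(uint32_t rate, size_t n) {
    const uint32_t plane_bytes = static_cast<uint32_t>((n + 7) / 8);
    return rate == 0 ? 0u : plane_bytes * (1u + rate);
}

struct BlockChoice {
    bool use_outlier;  // true if storing element 0 raw is strictly cheaper
    uint32_t bytes;    // modelled cost of the chosen encoding
};

inline BlockChoice chooseBlockMode(const int32_t* block, size_t n) {
    assert(n >= 1);
    // Plain mode: one rate byte, then all n elements packed at a common rate.
    const uint32_t plain = 1u + packedBytes(rateOf(block, 0, n), n);
    // Outlier mode: element 0 stored raw in 1..4 bytes, the rest packed at
    // the (usually smaller) rate of elements 1..n-1.
    const uint32_t width0 = rateOf(block, 0, 1);
    const uint32_t outlier_bytes = width0 == 0 ? 1u : (width0 + 7u) / 8u;
    const uint32_t outlier = 1u + outlier_bytes + packedBytes(rateOf(block, 1, n), n);
    return outlier < plain ? BlockChoice{true, outlier} : BlockChoice{false, plain};
}
```

Under this model a block whose first element is much larger than the rest (the delta-vs-0 case) favours outlier mode, while a block with uniformly small values stays in plain mode.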

◆ execute()

template<typename T >

void fz::AdaptiveBitpackStage< T >::execute	(	cudaStream_t	stream,
		MemoryPool *	pool,
		const std::vector< void * > &	inputs,
		const std::vector< void * > &	outputs,
		const std::vector< size_t > &	sizes
	)

override virtual

Execute the stage. inputs and outputs hold device pointers; sizes holds the corresponding sizes in bytes.

Stages may call cudaStreamSynchronize(stream) or issue blocking D2H copies when the algorithm requires it (e.g. Huffman histogram readback for codebook construction, ANS renormalization tables). Such stages must return false from isGraphCompatible() and must document the sync points.

Note: the DAG dispatches sibling nodes (same topological level) via a sequential CPU loop, each enqueuing to its own stream. A sync inside execute() blocks the CPU from dispatching subsequent siblings until the synced stream is idle — this delays parallel branches in wide DAGs. In a linear pipeline there are no siblings and no extra cost.

Implements fz::Stage.

◆ postStreamSync()

template<typename T >

void fz::AdaptiveBitpackStage< T >::postStreamSync ( cudaStream_t stream )

override virtual

Forward only: read the scanned payload length (D2H of two tail words from the persistent cost/offset scratch) once the stream is idle, and finalize actual_output_size_. Kept out of execute() so the forward path stays graph-capturable (no host sync inside the capture window).
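Why two tail words suffice: with an exclusive scan, the last offset excludes the last block's own cost, so the total payload length is presumably the sum of the two. The host-side sketch below shows that arithmetic; `exclusiveScan` and `totalPayloadBytes` are hypothetical names, and the real stage would obtain the two words through a D2H copy after the stream is idle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host stand-in for the device-wide exclusive scan over per-block costs.
inline std::vector<uint32_t> exclusiveScan(const std::vector<uint32_t>& costs) {
    std::vector<uint32_t> offsets(costs.size());
    uint32_t running = 0;
    for (size_t b = 0; b < costs.size(); ++b) {
        offsets[b] = running;
        running += costs[b];
    }
    return offsets;
}

// Total payload bytes from the two tail words: the last block's start offset
// plus the last block's own cost.
inline size_t totalPayloadBytes(uint32_t last_offset, uint32_t last_cost) {
    return static_cast<size_t>(last_offset) + last_cost;
}
```

Reading only these two values keeps the readback size constant regardless of the number of blocks.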

Reimplemented from fz::Stage.

◆ getName()

template<typename T >

std::string fz::AdaptiveBitpackStage< T >::getName ( ) const

inline override virtual

Human-readable name used in error messages and debug output.

Implements fz::Stage.

◆ estimateOutputSizes()

template<typename T >

std::vector< size_t > fz::AdaptiveBitpackStage< T >::estimateOutputSizes ( const std::vector< size_t > & input_sizes ) const

override virtual

Estimate output buffer sizes given input sizes. Used for buffer allocation planning in PREALLOCATE mode — must be a safe upper bound; under-estimation causes buffer overruns.
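One way to obtain a safe upper bound is to assume every block hits the maximum rate. The sketch below is an assumption about how such a bound could be derived, not the stage's actual formula: `worstCaseOutputBytes` is a hypothetical helper, and the per-block layout (one rate byte, one sign bitmap, up to `8 * sizeof(T)` bit-planes) follows the class description. It does not account for any extra bytes the outlier mode might need.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Worst-case compressed size for `input_bytes` of T, assuming every block
// needs the full bit width of T.
template <typename T>
size_t worstCaseOutputBytes(size_t input_bytes, uint32_t block_size) {
    assert(block_size >= 1 && block_size <= 1024);
    const size_t n = input_bytes / sizeof(T);
    const size_t blocks = (n + block_size - 1) / block_size;  // last block may be partial
    const size_t plane_bytes = (static_cast<size_t>(block_size) + 7) / 8;
    const size_t max_rate = 8 * sizeof(T);
    // Per block: 1 rate byte + sign bitmap + max_rate bit-planes.
    return blocks * (1 + plane_bytes * (1 + max_rate));
}
```

For example, under these assumptions a single 32-element `int32_t` block is bounded by 1 + 4 × 33 = 133 bytes, slightly more than the 128-byte input.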

Implements fz::Stage.

◆ estimateScratchBytes()

template<typename T >

size_t fz::AdaptiveBitpackStage< T >::estimateScratchBytes ( const std::vector< size_t > & input_sizes ) const

override virtual

Transient pool scratch held during execute() — two uint32 per-block arrays (cost + offset) plus the CUB exclusive-scan temp storage. Reported so the pipeline sizes the pool (PREALLOCATE) / peak (MINIMAL) for it and per-stage memory accounting reflects this stage. Defined in the .cu (needs CUB).

Reimplemented from fz::Stage.

◆ getActualOutputSizesByName()

template<typename T >

std::unordered_map< std::string, size_t > fz::AdaptiveBitpackStage< T >::getActualOutputSizesByName ( ) const

inline override virtual

Actual output sizes after execute(), keyed by output port name.

Implements fz::Stage.

◆ getActualOutputSize()

template<typename T >

size_t fz::AdaptiveBitpackStage< T >::getActualOutputSize ( int index ) const

inline override virtual

Actual size of a single output by index after execute(). Avoids constructing the map for the common single-output case. Default delegates to getActualOutputSizesByName(); override to return directly from an internal field.

Reimplemented from fz::Stage.

◆ getStageTypeId()

template<typename T >

uint16_t fz::AdaptiveBitpackStage< T >::getStageTypeId ( ) const

inline override virtual

Stage type identifier written into the FZM file header.

Implements fz::Stage.

◆ getOutputDataType()

template<typename T >

uint8_t fz::AdaptiveBitpackStage< T >::getOutputDataType ( size_t output_index ) const

inline override virtual

DataType enum of the given output port.

Implements fz::Stage.

◆ getInputDataType()

template<typename T >

uint8_t fz::AdaptiveBitpackStage< T >::getInputDataType ( size_t ) const

inline override virtual

Expected DataType of the given input port.

Used by Pipeline::finalize() to detect type mismatches between connected stages before any execution. Return DataType::UNKNOWN to opt out of checking — byte-transparent stages (Bitshuffle, RZE, RRE) and mock stages must return UNKNOWN; finalize() skips any connection where either side is UNKNOWN.

Reimplemented from fz::Stage.

◆ serializeHeader()

template<typename T >

size_t fz::AdaptiveBitpackStage< T >::serializeHeader	(	size_t	output_index,
		uint8_t *	header_buffer,
		size_t	max_size
	)		const

inline override virtual

Serialize stage config into header_buffer (max 128 bytes) for the FZM file. Return the number of bytes written, or 0 if the stage has no config.

Reimplemented from fz::Stage.

◆ deserializeHeader()

template<typename T >

void fz::AdaptiveBitpackStage< T >::deserializeHeader	(	const uint8_t *	header_buffer,
		size_t	size
	)

inline override virtual

Restore stage config from header_buffer during decompression.
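A header round trip for this stage might look like the sketch below. The on-disk layout is not documented here, so everything about it is an assumption: the `BitpackConfig` struct, the free functions, and the 5-byte layout (a `uint32_t` block size in host byte order followed by a one-byte outlier flag) are hypothetical and only illustrate the serialize/deserialize contract.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical config carried in the header slot.
struct BitpackConfig {
    uint32_t block_size = 32;
    bool outlier_selection = false;
};

constexpr size_t kConfigBytes = sizeof(uint32_t) + 1;

// Writes the config into buf; returns bytes written, or 0 if it does not fit.
inline size_t serializeConfig(const BitpackConfig& cfg, uint8_t* buf, size_t max_size) {
    if (buf == nullptr || max_size < kConfigBytes) return 0;
    std::memcpy(buf, &cfg.block_size, sizeof(uint32_t));  // host byte order (assumption)
    buf[sizeof(uint32_t)] = cfg.outlier_selection ? 1 : 0;
    return kConfigBytes;
}

// Restores the config from buf; returns false on a short or invalid header.
inline bool deserializeConfig(BitpackConfig& cfg, const uint8_t* buf, size_t size) {
    if (buf == nullptr || size < kConfigBytes) return false;
    uint32_t block_size = 0;
    std::memcpy(&block_size, buf, sizeof(uint32_t));
    if (block_size < 1 || block_size > 1024) return false;  // documented setBlockSize range
    cfg.block_size = block_size;
    cfg.outlier_selection = buf[sizeof(uint32_t)] != 0;
    return true;
}
```

Validating the block size on read keeps a corrupted header from configuring the stage outside the [1, 1024] range that setBlockSize() documents.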

Reimplemented from fz::Stage.

◆ getMaxHeaderSize()

template<typename T >

size_t fz::AdaptiveBitpackStage< T >::getMaxHeaderSize ( size_t output_index ) const

inline override virtual

Maximum bytes this stage writes into its per-output FZM header slot.

Reimplemented from fz::Stage.

◆ saveState()

template<typename T >

void fz::AdaptiveBitpackStage< T >::saveState ( )

inline override virtual

Save/restore config state around a decompression pass. deserializeHeader() overwrites the stage's forward-pass config; saveState() is called before and restoreState() after so the stage returns to its original configuration.
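The save/restore bracket can be pictured with a small stand-alone model. `StageSketch`, `Config`, and `applyHeader()` are hypothetical stand-ins (the last one plays the role of deserializeHeader()); the real stage's members and storage are not shown in this reference.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical forward-pass configuration.
struct Config {
    uint32_t block_size = 32;
    bool outlier_selection = false;
};

class StageSketch {
public:
    void setBlockSize(uint32_t n) { cfg_.block_size = n; }
    const Config& config() const { return cfg_; }

    // Called before a decompression pass: remember the forward-pass config.
    void saveState() { saved_ = cfg_; has_saved_ = true; }

    // Stand-in for deserializeHeader(): overwrites the live config.
    void applyHeader(const Config& from_file) { cfg_ = from_file; }

    // Called after the pass: return to the remembered config, if any.
    void restoreState() {
        if (has_saved_) {
            cfg_ = saved_;
            has_saved_ = false;
        }
    }

private:
    Config cfg_;
    Config saved_;
    bool has_saved_ = false;
};
```

With this bracket, a stage configured for compression with one block size can decompress a file written with another and still compress with its original settings afterwards.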

Reimplemented from fz::Stage.
