#include <ginterp_stage.h>

Inheritance diagram for fz::GInterpStage< TInput, TCode >:

Public Member Functions
void	execute (cudaStream_t stream, MemoryPool pool, const std::vector< void > &inputs, const std::vector< void * > &outputs, const std::vector< size_t > &sizes) override

void	postStreamSync (cudaStream_t stream) override

void	onFinalize (size_t estimated_inlen, MemoryPool *pool) override

size_t	estimateDeviceFootprintBytes (size_t) const override

size_t	estimatePinnedFootprintBytes (size_t) const override

std::string	getName () const override

std::vector< std::string >	getOutputNames () const override

std::vector< size_t >	estimateOutputSizes (const std::vector< size_t > &input_sizes) const override

std::unordered_map< std::string, size_t >	getActualOutputSizesByName () const override

size_t	getActualOutputSize (int index) const override

void	saveState () override

void	setErrorBoundMode (ErrorBoundMode m)

void	setAutoTuning (uint8_t mode)

void	setManualAlphaBeta (double alpha, double beta)

void	setDims (const std::array< size_t, 3 > &dims) override

void	setInverse (bool inv) override

int	ndim () const

uint16_t	getStageTypeId () const override

uint8_t	getOutputDataType (size_t output_index) const override

uint8_t	getInputDataType (size_t) const override

size_t	serializeHeader (size_t output_index, uint8_t *buf, size_t max_size) const override

void	deserializeHeader (const uint8_t *buf, size_t size) override

size_t	getMaxHeaderSize (size_t) const override

bool	isGraphCompatible () const override

Public Member Functions inherited from fz::Stage
virtual size_t	getRequiredInputAlignment () const

int	getOutputIndex (const std::string &name) const

virtual size_t	estimateScratchBytes (const std::vector< size_t > &input_sizes) const

Detailed Description

template<typename TInput = float, typename TCode = uint16_t>
class fz::GInterpStage< TInput, TCode >

G-Interp predictor with error-bounded quantization (3-D, MVP).

Note: Prior work: the underlying spline kernels are adapted from the cuSZ-Hi compressor (Indiana University, Argonne National Laboratory), BSD-3-Clause. The host-side wrapper, memory-pool integration, and outlier-fusion contract are FZGPUModules code. See THIRD_PARTY.md.

Forward outputs (compression):

[0] codes — quantization codes (TCode, full N elements)
[1] anchor — corner anchor values (TInput, ~N/4096 elements)
[2] outlier_vals — out-of-range residuals (TInput)
[3] outlier_idxs — outlier element indices (uint32_t)

The outlier count is not a DAG output port — it lives in a stage-private 4-byte device scratch (allocated via pool->allocatePersistentDevice in onFinalize()), is D2H'd in postStreamSync(), and is serialized in the FZM header. The inverse path consumes it as a uint32_t kernel-launch argument (read from the deserialized header), so the inverse kernel never has to dereference a device pointer to know its loop bound.

Inverse: takes the four forward outputs, produces the reconstructed TInput volume.

Error bound and limitations

The error bound eb is a target, not a hard guarantee. The multi-level interpolation tree predicts finer-level values from already-lossy coarser- level reconstructions, so prediction errors accumulate across the four levels. In practice the maximum element-wise error is:

typically <= 1.1 * eb on smooth data
up to ~2 * eb on data with many outliers (large spikes that the spline can't predict — these are stored exactly via the outlier triplet, but their neighbours still see compounded interpolation error).

Error-bound modes (REL is NOT exact PFPL per-element)

setErrorBoundMode() accepts ABS, REL, and NOA, but — like LorenzoQuantStage and unlike QuantizerStage — this stage resolves all modes to a single absolute bound before quantizing:

ABS — abs_eb = eb directly.
REL — global-approximate point-wise relative: abs_eb = eb * max(|data|) (one min/max scan, then treated as ABS). This is not the exact per-element PFPL relative bound |error| / |x| <= eb.
NOA — value-range relative: abs_eb = eb * (max(data) - min(data)).

The interpolation tree predicts each element against a fixed absolute tolerance, so a per-element varying bound cannot be threaded through it. For an exact point-wise relative bound use QuantizerStage with ErrorBoundMode::REL (log-space encoding); see quantizer.h.

Other limitations to be aware of:

2-D and 3-D only; setDims() throws for 1-D input.
3-D path: best results when each dim is a multiple of 16 (the 3-D anchor tile size). 2-D path: anchor tile is 32×8, so best results when dim_x is a multiple of 32 and dim_y of 8. Ragged dims still work but edge elements see slightly worse prediction.
INTERPOLATION_PARAMS auto-tuning (setAutoTuning(1) / (3)) is wired for the 3-D path only. In 2-D the stage logs a warning and falls back to the deterministic baseline (alpha=1.75, beta=4.0, use_md={t,t,f,f,f,f}). 2-D auto-tune (cuSZ-Hi auto_tuning_mode == 2) is a follow-up.

Precision (<tt>float</tt> / <tt>double</tt>) and shared memory

Both float and double inputs are supported, and both run the exact same encode/decode kernels — c_spline_infprecis_data / x_spline_infprecis_data are single templates parameterised on the data type, so there is no separate "float path" and "double path".

Those two kernels stage two working tiles (data + ectrl) in dynamic shared memory (extern __shared__, sized by the launcher) rather than static __shared__. This is unconditional — it is not a double-only branch:

The 3-D tile is (16+1)³ = 4913 elements × 2 buffers. In double that is ~77 KB, over the 48 KB static __shared__ cap, which is why dynamic shared memory is used. float is ~38.5 KB — comfortably under the cap, but it goes through the same dynamic-shmem code path.
The launcher (ginterpRaiseSmemIfNeeded) only calls cudaFuncSetAttribute(..., cudaFuncAttributeMaxDynamicSharedMemorySize, …) when the tile exceeds 48 KB. So in practice the opt-in fires only for the 3-D double path; float (and all 2-D) use the default dynamic region and never touch the attribute.
Performance impact on float is expected to be negligible: the requested shared-memory size, occupancy, and in-kernel access pattern are unchanged versus the previous static-__shared__ version; only the tile base address is now a launch-time value.
The 3-D double path therefore needs a GPU whose opt-in max dynamic shared memory is ≥ ~77 KB (Volta and newer). On older GPUs capped at 48 KB the cudaFuncSetAttribute call fails and the launch surfaces the error; 2-D double and all float configs are unaffected.
Auto-tuning (profiling modes 1-4) is **float-only**; double inputs with a profiling mode set fall back to the deterministic baseline with a warning (mode 5 manual alpha/beta is still honored for double).

Radius auto-tune (default behaviour)

setQuantRadius(0) (the default) means "auto": on first execute(), the stage scans the input min/max and picks the largest radius that fits the data range, capped at the TCode bit-width's maximum. This minimises outlier count for unknown data ranges and is the recommended setting.

For CUDA graph capture or strict determinism, set the radius explicitly to any positive value to skip the scan (e.g. setQuantRadius(512) for climate-style data where the user wants extremes routed to the outlier triplet for downstream handling).

Template Parameters

TInput	Floating-point input type (`float` or `double`). Both use the same dynamic-shared-memory kernels; see "Precision and shared memory" above. The 3-D `double` path needs a GPU whose opt-in max dynamic shared memory is ≥ ~77 KB (Volta+).
TCode	Quantization code type (`uint8_t`, `uint16_t`, or `uint32_t`).

Member Function Documentation

◆ execute()

template<typename TInput = float, typename TCode = uint16_t>

void fz::GInterpStage< TInput, TCode >::execute	(	cudaStream_t	stream,
		MemoryPool *	pool,
		const std::vector< void * > &	inputs,
		const std::vector< void * > &	outputs,
		const std::vector< size_t > &	sizes
	)

overridevirtual

Execute the stage. Inputs, outputs, and sizes are device pointers/bytes.

Stages may call cudaStreamSynchronize(stream) or issue blocking D2H copies when the algorithm requires it (e.g. Huffman histogram readback for codebook construction, ANS renormalization tables). Such stages must return false from isGraphCompatible() and must document the sync points.

Note: the DAG dispatches sibling nodes (same topological level) via a sequential CPU loop, each enqueuing to its own stream. A sync inside execute() blocks the CPU from dispatching subsequent siblings until the synced stream is idle — this delays parallel branches in wide DAGs. In a linear pipeline there are no siblings and no extra cost.

Implements fz::Stage.

◆ postStreamSync()

template<typename TInput = float, typename TCode = uint16_t>

void fz::GInterpStage< TInput, TCode >::postStreamSync ( cudaStream_t stream )

overridevirtual

Called after dag->execute() and stream sync, before compress() returns. Use for D2H transfers that must not block mid-pipeline (e.g. Lorenzo's outlier count readback). The stream is already idle so a plain cudaMemcpy is safe here.

Reimplemented from fz::Stage.

◆ onFinalize()

template<typename TInput = float, typename TCode = uint16_t>

void fz::GInterpStage< TInput, TCode >::onFinalize	(	size_t	estimated_inlen,
		MemoryPool *	pool
	)

overridevirtual

In PREALLOCATE mode + auto-tuning > 0, pre-allocate the persistent profiling-errors scratch (36 floats device + pinned host) via the pool's persistent allocators so they aren't on the per-call stream-ordered path. In MINIMAL mode this is deferred to first execute(). Auto-tune off is a no-op.

Reimplemented from fz::Stage.

◆ estimateDeviceFootprintBytes()

template<typename TInput = float, typename TCode = uint16_t>

size_t fz::GInterpStage< TInput, TCode >::estimateDeviceFootprintBytes ( size_t ) const

inlineoverridevirtual

Estimated persistent device memory this stage allocates outside the pool (via pool->allocatePersistentDevice). Used for total footprint reporting. Default: 0.

Reimplemented from fz::Stage.

◆ estimatePinnedFootprintBytes()

template<typename TInput = float, typename TCode = uint16_t>

size_t fz::GInterpStage< TInput, TCode >::estimatePinnedFootprintBytes ( size_t ) const

inlineoverridevirtual

Estimated persistent pinned-host memory this stage allocates outside the pool (via pool->allocatePersistentPinned). Used for total footprint reporting. Default: 0.

Reimplemented from fz::Stage.

◆ getName()

template<typename TInput = float, typename TCode = uint16_t>

std::string fz::GInterpStage< TInput, TCode >::getName ( ) const

inlineoverridevirtual

Human-readable name used in error messages and debug output.

Implements fz::Stage.

◆ getOutputNames()

template<typename TInput = float, typename TCode = uint16_t>

std::vector< std::string > fz::GInterpStage< TInput, TCode >::getOutputNames ( ) const

inlineoverridevirtual

Output port names in order. Default: single port named "output". Multi-output stages (e.g. Lorenzo: "codes", "outliers") override this.

Reimplemented from fz::Stage.

◆ estimateOutputSizes()

template<typename TInput = float, typename TCode = uint16_t>

std::vector< size_t > fz::GInterpStage< TInput, TCode >::estimateOutputSizes ( const std::vector< size_t > & input_sizes ) const

overridevirtual

Estimate output buffer sizes given input sizes. Used for buffer allocation planning in PREALLOCATE mode — must be a safe upper bound; under-estimation causes buffer overruns.

Implements fz::Stage.

◆ getActualOutputSizesByName()

template<typename TInput = float, typename TCode = uint16_t>

std::unordered_map< std::string, size_t > fz::GInterpStage< TInput, TCode >::getActualOutputSizesByName ( ) const

inlineoverridevirtual

Actual output sizes after execute(), keyed by output port name.

Implements fz::Stage.

◆ getActualOutputSize()

template<typename TInput = float, typename TCode = uint16_t>

size_t fz::GInterpStage< TInput, TCode >::getActualOutputSize ( int index ) const

inlineoverridevirtual

Actual size of a single output by index after execute(). Avoids constructing the map for the common single-output case. Default delegates to getActualOutputSizesByName(); override to return directly from an internal field.

Reimplemented from fz::Stage.

◆ saveState()

template<typename TInput = float, typename TCode = uint16_t>

void fz::GInterpStage< TInput, TCode >::saveState ( )

inlineoverridevirtual

Save/restore config state around a decompression pass. deserializeHeader() overwrites the stage's forward-pass config; saveState() is called before and restoreState() after so the stage returns to its original configuration.

Reimplemented from fz::Stage.

◆ setErrorBoundMode()

template<typename TInput = float, typename TCode = uint16_t>

void fz::GInterpStage< TInput, TCode >::setErrorBoundMode ( ErrorBoundMode m )

inline

REL here is global-approximate (abs_eb = eb * max(|data|)), NOT the exact per-element PFPL bound — use QuantizerStage REL for that. See the "Error-bound modes" section in the class doc.

◆ setAutoTuning()

template<typename TInput = float, typename TCode = uint16_t>

void fz::GInterpStage< TInput, TCode >::setAutoTuning ( uint8_t mode )

inline

Enable cuSZ-Hi's INTERPOLATION_PARAMS auto-tuning. See Config::auto_tuning_mode for mode semantics. Default is 0 (off). Profiling modes (1/3/4) disable CUDA graph capture for this stage; mode 5 (manual override) is graph-safe.

◆ setManualAlphaBeta()

template<typename TInput = float, typename TCode = uint16_t>

void fz::GInterpStage< TInput, TCode >::setManualAlphaBeta	(	double	alpha,
		double	beta
	)

inline

Set the manual alpha/beta override pair used when auto_tuning_mode == 5. Both must be > 0 to take effect; passing 0 for either field defers to the cuSZ-Hi piecewise-linear schedule (alpha from rel_eb) or the upstream default (beta = 4.0).

◆ setDims()

template<typename TInput = float, typename TCode = uint16_t>

void fz::GInterpStage< TInput, TCode >::setDims ( const std::array< size_t, 3 > & dims )

overridevirtual

Called once by Pipeline::finalize() so stages can react to the dataset dimensions set via Pipeline::setDims() after construction.

Parameters

dims	{x, y, z} extents (z==1 → 2-D; y==z==1 → 1-D)

Reimplemented from fz::Stage.

◆ setInverse()

template<typename TInput = float, typename TCode = uint16_t>

void fz::GInterpStage< TInput, TCode >::setInverse ( bool inverse )

inlineoverridevirtual

Switch between forward (compression) and inverse (decompression) mode. Affects getNumInputs()/getNumOutputs() for stages with asymmetric port counts.

Reimplemented from fz::Stage.

◆ ndim()

template<typename TInput = float, typename TCode = uint16_t>

int fz::GInterpStage< TInput, TCode >::ndim ( ) const

inline

Returns the effective spatial dimensionality (2 or 3). 1-D inputs are rejected by setDims() so this method never returns 1 once the stage has been configured.

◆ getStageTypeId()

template<typename TInput = float, typename TCode = uint16_t>

uint16_t fz::GInterpStage< TInput, TCode >::getStageTypeId ( ) const

inlineoverridevirtual

Stage type identifier written into the FZM file header.

Implements fz::Stage.

◆ getOutputDataType()

template<typename TInput = float, typename TCode = uint16_t>

uint8_t fz::GInterpStage< TInput, TCode >::getOutputDataType ( size_t output_index ) const

inlineoverridevirtual

DataType enum of the given output port.

Implements fz::Stage.

◆ getInputDataType()

template<typename TInput = float, typename TCode = uint16_t>

uint8_t fz::GInterpStage< TInput, TCode >::getInputDataType ( size_t ) const

inlineoverridevirtual

Expected DataType of the given input port.

Used by Pipeline::finalize() to detect type mismatches between connected stages before any execution. Return DataType::UNKNOWN to opt out of checking — byte-transparent stages (Bitshuffle, RZE, RRE) and mock stages must return UNKNOWN; finalize() skips any connection where either side is UNKNOWN.

Reimplemented from fz::Stage.

◆ serializeHeader()

template<typename TInput = float, typename TCode = uint16_t>

size_t fz::GInterpStage< TInput, TCode >::serializeHeader	(	size_t	output_index,
		uint8_t *	header_buffer,
		size_t	max_size
	)		const

overridevirtual

Serialize stage config into header_buffer (max 128 bytes) for the FZM file. Return the number of bytes written, or 0 if the stage has no config.

Reimplemented from fz::Stage.

◆ deserializeHeader()

template<typename TInput = float, typename TCode = uint16_t>

void fz::GInterpStage< TInput, TCode >::deserializeHeader	(	const uint8_t *	header_buffer,
		size_t	size
	)

overridevirtual

Restore stage config from header_buffer during decompression.

Reimplemented from fz::Stage.

◆ getMaxHeaderSize()

template<typename TInput = float, typename TCode = uint16_t>

size_t fz::GInterpStage< TInput, TCode >::getMaxHeaderSize ( size_t output_index ) const

inlineoverridevirtual

Maximum bytes this stage writes into its per-output FZM header slot.

Reimplemented from fz::Stage.

◆ isGraphCompatible()

template<typename TInput = float, typename TCode = uint16_t>

bool fz::GInterpStage< TInput, TCode >::isGraphCompatible ( ) const

inlineoverridevirtual

Graph-compatible iff execute() does no host-blocking work. Four scans inside execute() can break capture, each conditional:

computeValueBase() (D2H min/max scan) — runs for REL/NOA when precomputed_value_base <= 0. Caller must setValueBase(...) to skip this.
Radius auto-tune — runs when quant_radius == 0. Caller must setQuantRadius(...) to a positive value to skip.
Auto-tune profile kernels (modes 1/2/3/4) each end with a D2H + cudaStreamSynchronize of the error array. Only modes 0 (baseline) and 5 (manual α/β override) skip the profile entirely.
Mode 5 with manual_alpha <= 0 still computes rel_eb to pick α from the piecewise-linear schedule, which (for ABS) goes through the same min/max scan. Caller must setManualAlphaBeta(α>0, β) to skip that. (Mode 0 never touches rel_eb.) postStreamSync() runs outside the captured region (same pattern as LorenzoQuantStage) so the outlier-count D2H doesn't gate this.

Reimplemented from fz::Stage.

Public Member Functions

Detailed Description

Error bound and limitations

Error-bound modes (REL is NOT exact PFPL per-element)

Precision (<tt>float</tt> / <tt>double</tt>) and shared memory

Radius auto-tune (default behaviour)

Member Function Documentation

◆ execute()

◆ postStreamSync()

◆ onFinalize()

◆ estimateDeviceFootprintBytes()

◆ estimatePinnedFootprintBytes()

◆ getName()

◆ getOutputNames()

◆ estimateOutputSizes()

◆ getActualOutputSizesByName()

◆ getActualOutputSize()

◆ saveState()

◆ setErrorBoundMode()

◆ setAutoTuning()

◆ setManualAlphaBeta()

◆ setDims()

◆ setInverse()

◆ ndim()

◆ getStageTypeId()

◆ getOutputDataType()

◆ getInputDataType()

◆ serializeHeader()

◆ deserializeHeader()

◆ getMaxHeaderSize()

◆ isGraphCompatible()