FZGPUModules 2.0
GPU-accelerated modular compression pipelines
Host-callable launchers for the G-Interp encode/decode kernels.
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
Namespaces | |
| namespace | fz |
Functions | |
| dim3 | fz::ginterp::ginterpAnchorLen3 (size_t nx, size_t ny, size_t nz) |
| dim3 | fz::ginterp::ginterpAnchorLen2 (size_t nx, size_t ny) |
| template<typename TInput , typename TCode > | |
| void | fz::ginterp::launchGInterpForward3D (const TInput *d_data, dim3 data_len3, TCode *d_ectrl, TInput *d_anchor, dim3 anchor_len3, TInput *d_outlier_vals, uint32_t *d_outlier_idxs, uint32_t *d_outlier_count_scratch, double eb_r, double ebx2, int radius, const INTERPOLATION_PARAMS &intp_param, cudaStream_t stream) |
| template<typename TInput , typename TCode > | |
| void | fz::ginterp::launchGInterpInverse3D (const TCode *d_ectrl, dim3 data_len3, const TInput *d_anchor, dim3 anchor_len3, TInput *d_outlier_tmp, TInput *d_out, double eb_r, double ebx2, int radius, const INTERPOLATION_PARAMS &intp_param, cudaStream_t stream) |
| template<typename TInput , typename TCode > | |
| void | fz::ginterp::launchGInterpForward2D (const TInput *d_data, dim3 data_len3, TCode *d_ectrl, TInput *d_anchor, dim3 anchor_len3, TInput *d_outlier_vals, uint32_t *d_outlier_idxs, uint32_t *d_outlier_count_scratch, double eb_r, double ebx2, int radius, const INTERPOLATION_PARAMS &intp_param, cudaStream_t stream) |
| template<typename TInput , typename TCode > | |
| void | fz::ginterp::launchGInterpInverse2D (const TCode *d_ectrl, dim3 data_len3, const TInput *d_anchor, dim3 anchor_len3, TInput *d_outlier_tmp, TInput *d_out, double eb_r, double ebx2, int radius, const INTERPOLATION_PARAMS &intp_param, cudaStream_t stream) |
| void | fz::ginterp::launchGInterpResetErrors (float *d_errors, cudaStream_t stream) |
| template<typename TInput > | |
| void | fz::ginterp::launchGInterpProfileMode1 (const TInput *d_data, dim3 data_len3, float *d_errors, cudaStream_t stream) |
| template<typename TInput > | |
| void | fz::ginterp::launchGInterpProfileMode2 (const TInput *d_data, dim3 data_len3, int dim, float *d_errors, cudaStream_t stream) |
| template<typename TInput > | |
| void | fz::ginterp::launchGInterpProfileMode3 (const TInput *d_data, dim3 data_len3, int dim, dim3 sample_starts, dim3 sample_block_grid_sizes, dim3 sample_strides, float eb_r, float ebx2, const INTERPOLATION_PARAMS &intp_param, float *d_errors, bool workflow, cudaStream_t stream) |
| template<typename TInput > | |
| void | fz::ginterp::launchScatterOutliers (const TInput *d_outlier_vals, const uint32_t *d_outlier_idxs, uint32_t n, TInput *d_outlier_tmp, cudaStream_t stream) |
This is an internal interface — only ginterp_stage.cu should include it. The actual template instantiations live in ginterp_kernels.cu, which includes the 3071-line ginterp_md.inl privately so callers do not pay the compile-time cost.
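The split described above is the usual explicit-instantiation pattern. The following is a minimal single-file sketch of it, not the actual layout of ginterp_kernels.cu: the namespace `demo`, the function `launchDemo`, and its placeholder body are hypothetical stand-ins for the real launchers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// --- What a header like this one would expose: a declaration only. ---
namespace demo {
template <typename TInput, typename TCode>
void launchDemo(const TInput* in, TCode* out, std::size_t n);
}  // namespace demo

// --- What the single heavy translation unit would contain. ---
namespace demo {
template <typename TInput, typename TCode>
void launchDemo(const TInput* in, TCode* out, std::size_t n) {
    assert(in != nullptr && out != nullptr);
    // Placeholder body (hypothetical). The real file would include the
    // large .inl here, out of sight of every other translation unit.
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<TCode>(in[i]);
    }
}

// Explicit instantiations: only these type pairs are compiled and linkable.
template void launchDemo<float, std::uint16_t>(const float*, std::uint16_t*, std::size_t);
template void launchDemo<double, std::uint16_t>(const double*, std::uint16_t*, std::size_t);
}  // namespace demo
```

With this arrangement a caller that includes only the declaration compiles quickly, and a type pair that was never instantiated fails at link time rather than silently pulling the heavy code into the caller.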
dim3 fz::ginterp::ginterpAnchorLen3 (size_t nx, size_t ny, size_t nz)
Compute the anchor grid extent for an input volume of size (nx, ny, nz). The 3D kernel uses a 16³ anchor stride, so the anchor volume is roughly 1/4096 of the input.
dim3 fz::ginterp::ginterpAnchorLen2 (size_t nx, size_t ny)
Compute the anchor grid extent for a 2-D input of size (nx, ny). The 2-D tile configuration mirrors the 3-D path with the z axis flattened: AnchorBlockSize{X,Y,Z}={16,16,1} × numAnchorBlock{X,Y,Z}={1,1,1}. Each grid block covers 16×16 input elements and emits one corner anchor, so the anchor extent is (ceil(nx/16), ceil(ny/16), 1) — roughly 1/256 of the input.
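The extent arithmetic can be sketched on the host as follows. The 2-D formula is the one stated above. The 3-D version assumes the same per-axis ceiling division by 16, which the text implies but does not spell out. `Extent3`, `anchorLen2Sketch`, `anchorLen3Sketch`, and `anchorCount` are hypothetical names used so the sketch builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for CUDA's dim3.
struct Extent3 {
    unsigned x, y, z;
};

constexpr std::size_t kAnchorStride = 16;  // per-axis stride stated in the text

// Ceiling division for non-negative sizes.
constexpr std::size_t ceilDiv(std::size_t n, std::size_t d) {
    return (n + d - 1) / d;
}

// 2-D extent as documented: (ceil(nx/16), ceil(ny/16), 1).
inline Extent3 anchorLen2Sketch(std::size_t nx, std::size_t ny) {
    assert(nx > 0 && ny > 0);
    return {static_cast<unsigned>(ceilDiv(nx, kAnchorStride)),
            static_cast<unsigned>(ceilDiv(ny, kAnchorStride)), 1u};
}

// 3-D extent. ASSUMPTION: the same ceiling division applies on every axis.
inline Extent3 anchorLen3Sketch(std::size_t nx, std::size_t ny, std::size_t nz) {
    assert(nx > 0 && ny > 0 && nz > 0);
    return {static_cast<unsigned>(ceilDiv(nx, kAnchorStride)),
            static_cast<unsigned>(ceilDiv(ny, kAnchorStride)),
            static_cast<unsigned>(ceilDiv(nz, kAnchorStride))};
}

// Number of anchor elements; multiply by sizeof(TInput) to size d_anchor.
inline std::size_t anchorCount(const Extent3& e) {
    return static_cast<std::size_t>(e.x) * e.y * e.z;
}
```

Under that assumption a 512×512×512 volume gives a 32×32×32 anchor grid, which is exactly 1/4096 of the input; sizes that are not multiples of 16 round up, so the ratio is only approximate.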
void fz::ginterp::launchGInterpForward3D (const TInput *d_data, dim3 data_len3, TCode *d_ectrl, TInput *d_anchor, dim3 anchor_len3, TInput *d_outlier_vals, uint32_t *d_outlier_idxs, uint32_t *d_outlier_count_scratch, double eb_r, double ebx2, int radius, const INTERPOLATION_PARAMS &intp_param, cudaStream_t stream)
Forward (compress) launcher: predicts via spline interpolation, quantizes residuals into d_ectrl, writes anchor corners to d_anchor, and routes out-of-range residuals into the outlier pair (d_outlier_vals, d_outlier_idxs). d_outlier_count_scratch is a stage-private 4-byte device counter that the kernel increments atomically; it is not a DAG output port. The caller copies it device-to-host during postStreamSync() and stores the result in the FZM stage header.
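As a rough host-side illustration of how eb_r, ebx2, and radius interact, here is a sketch of error-bounded quantization for a single element. The formulas for eb_r and ebx2 are the ones given in this launcher's pre-conditions. Everything else is an assumption in the style of cuSZ-family quantizers and is not taken from the kernel source: the rounding rule, the `code = q + radius` encoding, the use of code 0 for outliers, and storing the original value in the outlier list. All names (`ErrorBound`, `OutlierSink`, `quantizeOne`, `dequantizeOne`) are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct ErrorBound {
    double eb_r;  // 1 / (2 * abs_eb), as in the pre-conditions
    double ebx2;  // 2 * abs_eb
};

inline ErrorBound makeErrorBound(double abs_eb) {
    assert(abs_eb > 0.0);
    return {1.0 / (2.0 * abs_eb), 2.0 * abs_eb};
}

// Host analogue of the outlier pair plus the atomically incremented counter.
struct OutlierSink {
    std::vector<float> vals;
    std::vector<std::uint32_t> idxs;
    std::atomic<std::uint32_t> count{0};
    explicit OutlierSink(std::size_t capacity) : vals(capacity), idxs(capacity) {}
};

// Quantize one element against its prediction.
// ASSUMPTION: in-range residuals are stored as q + radius, outliers as code 0.
inline std::uint16_t quantizeOne(float value, float predicted, std::uint32_t index,
                                 const ErrorBound& eb, int radius, OutlierSink& sink) {
    const long q = std::lround((static_cast<double>(value) - predicted) * eb.eb_r);
    if (std::labs(q) < radius) {
        return static_cast<std::uint16_t>(q + radius);
    }
    const std::uint32_t slot = sink.count.fetch_add(1);  // reserve one outlier slot
    assert(slot < sink.vals.size());
    sink.vals[slot] = value;  // ASSUMPTION: the original value is what gets stored
    sink.idxs[slot] = index;
    return 0;
}

// Reconstruct an in-range element from its code.
inline float dequantizeOne(std::uint16_t code, float predicted,
                           const ErrorBound& eb, int radius) {
    return static_cast<float>(predicted + (static_cast<int>(code) - radius) * eb.ebx2);
}
```

Because the residual is rounded to the nearest multiple of ebx2, an in-range element reconstructs to within abs_eb of its original value, which is why the two scale factors must be derived from the same abs_eb on both the encode and decode sides.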
Pre-conditions:
- d_ectrl is sized nx * ny * nz * sizeof(TCode)
- d_anchor is sized prod(ginterpAnchorLen3(nx,ny,nz)) * sizeof(TInput)
- d_outlier_count_scratch has been cudaMemsetAsync(0, …) on the same stream
- eb_r = 1 / (2 * abs_eb), ebx2 = 2 * abs_eb
- data_len3.z >= 2 (3-D path only in MVP)
- intp_param is the resolved cuSZ-Hi interpolation bundle. Phase-1 callers pass a default-constructed struct (deterministic baseline); phase-2 callers pass the auto-tuned result.
void fz::ginterp::launchGInterpInverse3D (const TCode *d_ectrl, dim3 data_len3, const TInput *d_anchor, dim3 anchor_len3, TInput *d_outlier_tmp, TInput *d_out, double eb_r, double ebx2, int radius, const INTERPOLATION_PARAMS &intp_param, cudaStream_t stream)
Inverse (decompress) launcher — reads ectrl + anchor + scattered outliers (pre-merged into d_outlier_tmp by launchScatterOutliers) and produces the reconstructed volume in d_out. intp_param MUST match the value used during compression — both encoder and decoder kernels are parameterised by it.
d_outlier_tmp must be a full-N buffer with outlier values written at outlier indices and zero elsewhere — the kernel reads it via global2shmem_fuse during shmem load.
void fz::ginterp::launchGInterpForward2D (const TInput *d_data, dim3 data_len3, TCode *d_ectrl, TInput *d_anchor, dim3 anchor_len3, TInput *d_outlier_vals, uint32_t *d_outlier_idxs, uint32_t *d_outlier_count_scratch, double eb_r, double ebx2, int radius, const INTERPOLATION_PARAMS &intp_param, cudaStream_t stream)
Forward (compress) launcher for 2-D input. Identical contract to the 3-D variant — data_len3.z is assumed to be 1. Internally instantiates the spline kernels with SPLINE_DIM=2, AnchorBlockSize={16,16,1}, numAnchorBlock={1,1,1} (3-D-like tile, z flattened).
Pre-conditions:
- data_len3.z == 1
- d_anchor sized prod(ginterpAnchorLen2(nx,ny)) * sizeof(TInput)
void fz::ginterp::launchGInterpInverse2D (const TCode *d_ectrl, dim3 data_len3, const TInput *d_anchor, dim3 anchor_len3, TInput *d_outlier_tmp, TInput *d_out, double eb_r, double ebx2, int radius, const INTERPOLATION_PARAMS &intp_param, cudaStream_t stream)
Inverse (decompress) launcher for 2-D input. Mirrors the 3-D variant; the caller must have pre-scattered outliers into d_outlier_tmp already. intp_param must match the value used during compression.
void fz::ginterp::launchGInterpResetErrors (float *d_errors, cudaStream_t stream)
Reset the 36-float profiling-errors scratch to zero. One-block, one-thread kernel — used between profiling passes when reusing the same scratch buffer.
void fz::ginterp::launchGInterpProfileMode1 (const TInput *d_data, dim3 data_len3, float *d_errors, cudaStream_t stream)
Profiling mode 1 — runs the cheap c_spline_profiling_data kernel that estimates per-axis residual variance from a tiny shared-mem sample. Writes 2 floats: errors[0] (forward order), errors[1] (reverse order). Used to pick intp_param.reverse[0..3] (single global bool replicated to all levels).
Single-block launch — auto_tuning_grid_dim = dim3(1,1,1).
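How the two mode-1 floats could be turned into the reverse flags is sketched below. This is an illustration under assumptions, not the stage's actual selection code: it assumes the lower error estimate wins and that ties keep forward order. `InterpParamsSketch` is a hypothetical stand-in that exposes only the reverse field named above, and `applyProfileMode1` is likewise hypothetical.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for INTERPOLATION_PARAMS with only the field
// discussed here: one reverse flag per interpolation level (indices 0..3).
struct InterpParamsSketch {
    std::array<bool, 4> reverse{};
};

// errors[0] is the forward-order estimate, errors[1] the reverse-order one.
// ASSUMPTION: the smaller estimate wins; a tie keeps forward order.
inline void applyProfileMode1(const float* errors, InterpParamsSketch& params) {
    assert(errors != nullptr);
    const bool useReverse = errors[1] < errors[0];
    params.reverse.fill(useReverse);  // single global bool replicated to all levels
}
```

The host would call something like this after copying the two floats back from d_errors and synchronizing the stream.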
void fz::ginterp::launchGInterpProfileMode2 (const TInput *d_data, dim3 data_len3, int dim, float *d_errors, cudaStream_t stream)
Profiling mode 2 — runs the alternate cheap c_spline_profiling_data_2 kernel. Writes 6 floats to d_errors[0..5] covering forward/reverse × cubic and natural splines on a tiny shared-mem sample. Used to pick a single use_natural × reverse pair replicated across all levels (and clears use_md). Cheaper than mode 3 and works on both 3-D and 2-D inputs.
dim is 3 for 3-D inputs and 2 for 2-D inputs (data_len3.z == 1). Single-block launch — auto_tuning_grid_dim = dim3(1,1,1).
void fz::ginterp::launchGInterpProfileMode3 (const TInput *d_data, dim3 data_len3, int dim, dim3 sample_starts, dim3 sample_block_grid_sizes, dim3 sample_strides, float eb_r, float ebx2, const INTERPOLATION_PARAMS &intp_param, float *d_errors, bool workflow, cudaStream_t stream)
Profiling mode 3: runs the structural pa_spline_infprecis_data kernel (cuSZ-Hi auto_tuning >= 3) that probes a grid of sample blocks. The caller must call launchGInterpResetErrors first.
dim selects the spline-kernel branch (3 → SPLINE_DIM=3, 2 → SPLINE_DIM=2).
Outputs depend on dim:
- errors+15+BIY to errors+16+BIY (see adapter-changes block at top of ginterp_md.inl).
sample_starts, sample_block_grid_sizes, sample_strides are derived from data_len3 (see cuSZ-Hi spline3.cu calc_start_size for the recipe; S_STRIDE = 8 * 16 in 3-D, 20 * AnchorBlockSize in 2-D).
workflow selects the probe family:
- true → structural (mode 3): grid.y=9 (3-D) / 11 (2-D)
- false → alpha/beta sweep (mode 4): grid.y=11, errors[0..10] one per (alpha, beta) combo enumerated by pre_compute_att (SPLINE3_AB_ATT).
void fz::ginterp::launchScatterOutliers (const TInput *d_outlier_vals, const uint32_t *d_outlier_idxs, uint32_t n, TInput *d_outlier_tmp, cudaStream_t stream)
Scatter outlier-pair entries into a full-N temp buffer. The count n is supplied by the host (read from the deserialized FZM header) and passed as a register-resident kernel argument — the kernel never has to load it from device memory.
Caller must cudaMemsetAsync(d_outlier_tmp, 0, N*sizeof(TInput), stream) before invoking. n == 0 is a fast no-op.
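The contract above can be captured by a small CPU reference, useful for checking the device result in a unit test. This is a sketch of the documented semantics only (zero-filled full-N buffer, values written at their indices, n == 0 leaves the buffer untouched); `scatterOutliersReference` is a hypothetical helper, not part of this interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for the scatter: returns a buffer of `total` elements that is
// zero everywhere except at the first n outlier indices.
template <typename T>
std::vector<T> scatterOutliersReference(const std::vector<T>& vals,
                                        const std::vector<std::uint32_t>& idxs,
                                        std::uint32_t n, std::size_t total) {
    std::vector<T> tmp(total, T{0});  // mirrors the required zero-fill
    if (n == 0) {
        return tmp;  // nothing to scatter
    }
    assert(n <= vals.size() && n <= idxs.size());
    for (std::uint32_t i = 0; i < n; ++i) {
        assert(idxs[i] < total);  // an index past N would corrupt memory on device
        tmp[idxs[i]] = vals[i];
    }
    return tmp;
}
```

Comparing this vector against a device-to-host copy of d_outlier_tmp after the stream has synchronized gives a direct check of both the memset and the scatter.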