FZGPUModules 2.0
GPU-accelerated modular compression pipelines
Header: modules/coders/bitpack/bitpack_stage.h
Class: fz::BitpackStage<T>
Category: Coder (lossless)
Packs each element using only its low nbits bits into a dense byte stream.
- Forward: T[] → uint8_t[], producing ceil(n × nbits / 8) bytes.
- Inverse: uint8_t[] → T[], unpacking elements and zero-extending each to full width.

Most useful after a Lorenzo predictor stage, where small delta values need only a few bits of representation.
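To make the size formula and the zero-extension concrete, here is a minimal host-side sketch of the two mappings. It is a CPU reference, not the library's GPU kernel: the names pack_bits and unpack_bits are hypothetical, and the bit order within each byte (least-significant bit first) is an assumption, since this page does not specify the packed layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for the forward mapping T[] -> uint8_t[].
// Assumption: bits are written least-significant-bit first within each byte.
template <typename T>
std::vector<std::uint8_t> pack_bits(const std::vector<T>& in, unsigned nbits) {
    assert(nbits >= 1 && nbits <= 8 * sizeof(T));
    const std::size_t total_bits = in.size() * nbits;
    std::vector<std::uint8_t> out((total_bits + 7) / 8, 0);  // ceil(n * nbits / 8)
    std::size_t bit = 0;  // running bit position in the output stream
    for (T v : in) {
        // Only the low nbits bits of each element are kept.
        for (unsigned b = 0; b < nbits; ++b, ++bit) {
            if ((v >> b) & 1u) {
                out[bit / 8] |= static_cast<std::uint8_t>(1u << (bit % 8));
            }
        }
    }
    return out;
}

// Hypothetical CPU reference for the inverse mapping uint8_t[] -> T[].
// Bits above nbits stay zero, which is the zero-extension described above.
template <typename T>
std::vector<T> unpack_bits(const std::vector<std::uint8_t>& in,
                           std::size_t n, unsigned nbits) {
    assert(nbits >= 1 && nbits <= 8 * sizeof(T));
    assert(in.size() * 8 >= n * nbits);
    std::vector<T> out(n, 0);
    std::size_t bit = 0;
    for (std::size_t i = 0; i < n; ++i) {
        for (unsigned b = 0; b < nbits; ++b, ++bit) {
            if ((in[bit / 8] >> (bit % 8)) & 1u) {
                out[i] |= static_cast<T>(T{1} << b);
            }
        }
    }
    return out;
}
```

Under these assumptions, six uint32_t values packed with nbits = 4 occupy 3 bytes instead of 24, and any value wider than nbits (for example 0x1F) loses its high bits on the round trip, which is why nbits must cover the largest input value.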
| Parameter | Constraint |
|---|---|
| T | Unsigned integer type (see available instantiations below) |
Only these types are compiled and linked:
- BitpackStage<uint8_t>
- BitpackStage<uint16_t>
- BitpackStage<uint32_t>

Using any other type will result in a linker error. Most common: BitpackStage<uint32_t> (to match typical quantizer code width).
| Setting | Purpose | Notes |
|---|---|---|
| setNBits(nbits) | Bits per element | Power of two in [1, 8 × sizeof(T)]; ignored when auto-detect is on |
| setAutoDetect(bool) | GPU scan to pick nbits automatically | Disables CUDA Graph compatibility while active |
nbits must be a power of two in [1, 8 × sizeof(T)]. The default is 8 × sizeof(T), which stores every element at full width (identity, no compression).
| T | Allowed nbits |
|---|---|
| uint8_t | 1, 2, 4, 8 |
| uint16_t | 1, 2, 4, 8, 16 |
| uint32_t | 1, 2, 4, 8, 16, 32 |
Passing any other value to setNBits() throws std::invalid_argument immediately, at the call site rather than later during execution.
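The constraint can be expressed as a short predicate. The sketch below is an illustration of the documented rule, not the library's source: is_valid_nbits and check_nbits are hypothetical names, and the exception message is made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical predicate: true if nbits is a power of two in [1, 8 * sizeof(T)].
template <typename T>
constexpr bool is_valid_nbits(unsigned nbits) {
    // (x & (x - 1)) == 0 holds exactly for powers of two when x != 0.
    return nbits != 0 && (nbits & (nbits - 1)) == 0 && nbits <= 8 * sizeof(T);
}

// Hypothetical setter-style check that rejects invalid widths up front,
// mirroring the documented std::invalid_argument behaviour.
template <typename T>
void check_nbits(unsigned nbits) {
    if (!is_valid_nbits<T>(nbits)) {
        throw std::invalid_argument("nbits must be a power of two in [1, " +
                                    std::to_string(8 * sizeof(T)) + "], got " +
                                    std::to_string(nbits));
    }
}
```

For example, 16 is accepted for uint16_t but rejected for uint8_t, and 3 is rejected for every type because it is not a power of two.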
When auto-detect is enabled, the forward pass scans the input for its maximum value using cub::DeviceReduce::Max and selects the smallest valid power-of-two nbits that can represent it. The chosen nbits is written into the compressed header, so the inverse pass unpacks correctly without any out-of-band configuration.
After compress(), getNBits() reflects the detected value.
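The selection rule can be sketched on the host as follows. This is an assumption-laden illustration rather than the library's code: smallest_pow2_nbits and detect_nbits are hypothetical names, std::max_element stands in for the GPU reduction, and returning 1 for all-zero or empty input is a choice made for this sketch only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: smallest power-of-two bit width that can hold max_value,
// capped at 8 * sizeof(T).
template <typename T>
constexpr unsigned smallest_pow2_nbits(T max_value) {
    unsigned nbits = 1;
    // Double the width until no set bits remain above it.
    while (nbits < 8 * sizeof(T) &&
           (static_cast<std::uint64_t>(max_value) >> nbits) != 0) {
        nbits *= 2;
    }
    return nbits;
}

// Host stand-in for the device-side max scan followed by width selection.
// Assumption: empty input is treated like an all-zero input.
template <typename T>
unsigned detect_nbits(const std::vector<T>& data) {
    const T max_value =
        data.empty() ? T{0} : *std::max_element(data.begin(), data.end());
    return smallest_pow2_nbits<T>(max_value);
}
```

Because only power-of-two widths are allowed, a maximum of 9 selects nbits = 4 and a maximum of 16 already requires nbits = 8.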
Scratch buffers for the scan are allocated through the pipeline's memory pool (with a transparent cudaMalloc fallback in vGPU / pool-fallback mode), so all device memory remains tracked by the pipeline.
CUDA Graph incompatibility: auto-detect requires a device-to-host transfer and stream synchronization to read the max value, making it incompatible with CUDA Graph capture. isGraphCompatible() returns false while auto-detect is on. If you know the bit-width ahead of time, use setNBits() instead to keep graph capture available.
Output size estimate: estimateOutputSizes() returns the worst-case (full input size) when auto-detect is enabled, so PREALLOCATE mode reserves sufficient space regardless of the detected nbits.
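As a worked illustration of why the worst case is the full input size, the sketch below compares the packed payload for a known nbits with the bound for an unknown one. The function names are hypothetical, and any header bytes the stage may add are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Packed payload size in bytes for n elements: ceil(n * nbits / 8).
// Hypothetical helper; header bytes, if any, are not included.
constexpr std::size_t packed_bytes(std::size_t n, unsigned nbits) {
    return (n * nbits + 7) / 8;
}

// Bound when nbits is only known at run time: the widest allowed setting,
// nbits = 8 * sizeof(T), which equals the full input size n * sizeof(T).
template <typename T>
constexpr std::size_t worst_case_bytes(std::size_t n) {
    return packed_bytes(n, static_cast<unsigned>(8 * sizeof(T)));
}
```

For 1000 uint32_t elements, a fixed nbits = 4 gives a 500-byte payload, while the auto-detect bound is 4000 bytes, so a preallocated buffer sized for the worst case can be up to 8× larger than what is actually written in that case.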