What it does

GPU bit-matrix transpose over fixed-size chunks. Given a chunk of N elements, each W bits wide, the forward pass gathers bit k of all N values into bit-plane k, producing W bit-planes of N bits each. Bits that rarely change between neighboring values, such as sign and exponent bits, end up in long contiguous runs, which typically makes floating-point or integer data more compressible for downstream byte-oriented coders such as RZEStage.

Output is the same byte size as input (size-preserving transform).
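To make the bit-plane layout concrete, here is a minimal CPU reference sketch of the transform and its inverse. It is not the GPU kernel used by BitshuffleStage. The function names bitshuffle_ref and bitunshuffle_ref are hypothetical, and the ordering (plane 0 holds the least significant bit, bits packed LSB-first within each byte) is an assumption chosen for illustration; the actual on-device layout may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference: gather bit k of every element into plane k.
// Plane k occupies bits [k*n, (k+1)*n) of the output (assumed layout).
template <typename T>
std::vector<std::uint8_t> bitshuffle_ref(const std::vector<T>& in) {
    const std::size_t n = in.size();
    const std::size_t w = sizeof(T) * 8;              // element width in bits
    std::vector<std::uint8_t> out(n * sizeof(T), 0);  // same byte size as input
    for (std::size_t k = 0; k < w; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            const std::uint8_t bit = static_cast<std::uint8_t>((in[i] >> k) & 1u);
            const std::size_t pos = k * n + i;        // bit index in the output
            out[pos / 8] |= static_cast<std::uint8_t>(bit << (pos % 8));
        }
    }
    return out;
}

// Inverse: scatter each plane bit back into bit k of element i.
template <typename T>
std::vector<T> bitunshuffle_ref(const std::vector<std::uint8_t>& in) {
    const std::size_t n = in.size() / sizeof(T);
    const std::size_t w = sizeof(T) * 8;
    std::vector<T> out(n, 0);
    for (std::size_t k = 0; k < w; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t pos = k * n + i;
            const T bit = static_cast<T>((in[pos / 8] >> (pos % 8)) & 1u);
            out[i] = static_cast<T>(out[i] | static_cast<T>(bit << k));
        }
    }
    return out;
}
```

With unsigned integer elements, the output of bitshuffle_ref has exactly in.size() * sizeof(T) bytes, and bitunshuffle_ref restores the original values, which illustrates why the transform is size-preserving and lossless.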

Stage settings

bshuf->setBlockSize(16384); // chunk size in bytes (default 16384)

bshuf->setElementWidth(sizeof(T)); // element width in bytes: 1, 2, 4, or 8 (default 4)

Constraints:

block_size must be a positive multiple of 1024 × element_width. The default of 16384 satisfies this for all supported element widths.
element_width must be 1, 2, 4, or 8.

Both constraints are enforced at execute() time.
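Because the constraints are only checked when the stage executes, it can be useful to validate settings earlier. The following is a sketch of such a check; bitshuffle_settings_valid is a hypothetical helper that mirrors the two documented rules and is not part of the library.

```cpp
#include <cstddef>

// Hypothetical helper (not a library function): returns true when the
// settings satisfy both documented constraints.
constexpr bool bitshuffle_settings_valid(std::size_t block_size,
                                         std::size_t element_width) {
    return (element_width == 1 || element_width == 2 ||
            element_width == 4 || element_width == 8) &&
           block_size > 0 &&
           block_size % (1024 * element_width) == 0;
}

// The default block size works for every supported element width.
static_assert(bitshuffle_settings_valid(16384, 1), "default must be valid");
static_assert(bitshuffle_settings_valid(16384, 8), "default must be valid");
```

For example, a block size of 4096 with 8-byte elements fails the check, since 4096 is not a multiple of 1024 × 8 = 8192.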

Alignment requirement

BitshuffleStage requires its input size to be a multiple of block_size bytes. The pipeline pads automatically when connected to a chunked upstream stage (DifferenceStage with matching chunk_size, or RZEStage).
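When the input does not come from a chunked upstream stage, the caller needs a buffer whose size is a multiple of block_size. As an illustration of the arithmetic only, the hypothetical helper below rounds a byte count up to the next multiple; it is not a library function, and how the padding bytes should be filled is not specified here.

```cpp
#include <cstddef>

// Hypothetical helper: round n_bytes up to the next multiple of block_size.
// Assumes block_size > 0.
constexpr std::size_t padded_size(std::size_t n_bytes, std::size_t block_size) {
    return (n_bytes + block_size - 1) / block_size * block_size;
}
```

With the default block size of 16384, an input of 40000 bytes would need a 49152-byte buffer (three blocks), while an input of exactly 16384 bytes needs no padding.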

Typical pipeline

auto* bshuf = p.addStage<BitshuffleStage>();
auto* rze   = p.addStage<RZEStage>();
 
bshuf->setElementWidth(sizeof(uint16_t));   // match upstream code type
 
p.connect(bshuf, upstream_stage);
p.connect(rze,   bshuf);
p.finalize();

Set element_width to match the element type flowing in from upstream:

Codes from LorenzoQuantStage<float, uint16_t> → setElementWidth(2)
Codes from QuantizerStage<float, uint32_t> → setElementWidth(4)
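One way to keep the width in sync with the upstream code type is to derive it from the type rather than hard-coding it. The sketch below assumes a hypothetical helper named element_width_for; it is not part of the library and simply wraps sizeof with compile-time checks for the supported widths.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: element width in bytes for an integer code type,
// rejected at compile time if the width is not 1, 2, 4, or 8.
template <typename Code>
constexpr std::size_t element_width_for() {
    static_assert(std::is_integral<Code>::value,
                  "code type must be an integer type");
    static_assert(sizeof(Code) == 1 || sizeof(Code) == 2 ||
                  sizeof(Code) == 4 || sizeof(Code) == 8,
                  "element width must be 1, 2, 4, or 8 bytes");
    return sizeof(Code);
}
```

The result can then be passed to the setter shown earlier, for example bshuf->setElementWidth(element_width_for<uint16_t>()) for codes produced as uint16_t.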

Acknowledgements

The 4- and 8-byte butterfly shuffle kernels in BitshuffleStage are adapted from d_BIT_4 / d_BIT_8 in the LC framework (Burtscher et al., Texas State University, BSD-3-Clause). The 1- and 2-byte paths use a standard __ballot_sync approach.

Noushin Azami, Alex Fallin, Brandon Burtchell, Andrew Rodriguez, Benila Jerald, Yiqian Liu, Anju Mongandampulath Akathoott, and Martin Burtscher. LC framework for synthesizing high-speed parallel lossless and error-bounded lossy data compression and decompression algorithms for CPUs and GPUs. https://github.com/burtscher/LC-framework

See THIRD_PARTY.md for the full license text.