GPU bit-matrix transpose stage (W × N bit shuffle over fixed-size chunks).

#include "stage/stage.h"
#include "fzm_format.h"
#include <cuda_runtime.h>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

Classes
class	fz::BitshuffleStage

Namespaces
namespace	fz

Detailed Description

GPU bit-matrix transpose stage (W × N bit shuffle over fixed-size chunks).

Given a chunk of N elements, each W bits wide, the forward pass produces W bit-planes, where plane k holds bit k of each of the N elements (a W × N bit-matrix transpose). The output has the same byte size as the input.

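Output layout: MSB-first. Bit-plane W-1 is stored at plane index 0 and bit-plane 0 at plane index W-1. Each plane holds N_chunk bits, so plane p occupies 32-bit words p*(N_chunk/32) through (p+1)*(N_chunk/32)-1, where N_chunk = block_size / element_width is the number of elements per chunk.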
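
The transpose and plane layout described above can be illustrated with a small CPU reference. This is only a sketch under stated assumptions, not the kernel used by fz::BitshuffleStage: the function names are hypothetical, W is fixed at 32 (uint32_t elements), N must be a multiple of 32, and the mapping of element i to bit (i % 32) of word (i / 32) inside a plane is an assumption, since the page does not specify the bit order within a word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for one chunk; not part of fz::BitshuffleStage.
// Assumes W = 32 and that the element count is a multiple of 32.
std::vector<std::uint32_t> bitshuffle_reference(const std::vector<std::uint32_t>& in) {
    constexpr unsigned W = 32;
    const std::size_t n = in.size();
    assert(n % 32 == 0);
    const std::size_t words_per_plane = n / 32;  // N_chunk / 32
    std::vector<std::uint32_t> out(W * words_per_plane, 0u);
    for (std::size_t i = 0; i < n; ++i) {
        for (unsigned k = 0; k < W; ++k) {
            const std::uint32_t bit = (in[i] >> k) & 1u;
            const std::size_t plane = W - 1 - k;  // MSB-first: bit W-1 goes to plane 0
            // Assumed in-word order: element i lands in bit (i % 32) of word (i / 32).
            out[plane * words_per_plane + i / 32] |= bit << (i % 32);
        }
    }
    return out;
}

// Inverse of the sketch above, under the same assumptions.
std::vector<std::uint32_t> bitunshuffle_reference(const std::vector<std::uint32_t>& in) {
    constexpr unsigned W = 32;
    assert(in.size() % W == 0);
    const std::size_t words_per_plane = in.size() / W;
    const std::size_t n = words_per_plane * 32;
    std::vector<std::uint32_t> out(n, 0u);
    for (std::size_t i = 0; i < n; ++i) {
        for (unsigned k = 0; k < W; ++k) {
            const std::size_t plane = W - 1 - k;
            const std::uint32_t bit = (in[plane * words_per_plane + i / 32] >> (i % 32)) & 1u;
            out[i] |= bit << k;
        }
    }
    return out;
}
```

With W = 32 the output holds 32 * (N / 32) = N words, matching the statement that output and input have the same byte size. A reference of this kind is mainly useful for checking a GPU result on small chunks.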

Serialized header (5 bytes): [0..3] block_size (uint32_t LE), [4] element_width (uint8_t).
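
The 5-byte header layout above could be written and read as in the following sketch. The struct and function names are hypothetical and are not taken from the library; only the byte layout (block_size as a little-endian uint32_t in bytes 0..3, element_width as a uint8_t in byte 4) comes from the description. Bytes are assembled explicitly so the result does not depend on host endianness.

```cpp
#include <array>
#include <cstdint>

// Hypothetical in-memory view of the header; names are illustrative only.
struct BitshuffleHeader {
    std::uint32_t block_size;
    std::uint8_t element_width;
};

// Encode: bytes [0..3] = block_size (little-endian), byte [4] = element_width.
std::array<std::uint8_t, 5> encode_header(const BitshuffleHeader& h) {
    std::array<std::uint8_t, 5> out{};
    out[0] = static_cast<std::uint8_t>(h.block_size & 0xFFu);
    out[1] = static_cast<std::uint8_t>((h.block_size >> 8) & 0xFFu);
    out[2] = static_cast<std::uint8_t>((h.block_size >> 16) & 0xFFu);
    out[3] = static_cast<std::uint8_t>((h.block_size >> 24) & 0xFFu);
    out[4] = h.element_width;
    return out;
}

// Decode the same 5 bytes back into the struct.
BitshuffleHeader decode_header(const std::array<std::uint8_t, 5>& b) {
    BitshuffleHeader h{};
    h.block_size = static_cast<std::uint32_t>(b[0]) |
                   (static_cast<std::uint32_t>(b[1]) << 8) |
                   (static_cast<std::uint32_t>(b[2]) << 16) |
                   (static_cast<std::uint32_t>(b[3]) << 24);
    h.element_width = b[4];
    return h;
}
```

For example, a block_size of 0x00010000 encodes to the bytes 00 00 01 00, followed by the element_width byte.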
