The challenge: You are given a binary tree (dense, balanced, stored flat in memory) and N "walkers", each with a 32-bit internal state and a position in the tree. The program runs for a fixed number of rounds. Each round, every walker hash-mixes its internal state with the tree value at its current position, then goes left or right based on the least-significant bit of its updated state. Correctness is judged by the contents of the walker-state region of main memory at the end of the program.
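To make the round structure concrete, here is a sketch of one round for a single walker, using ops documented in the engine sections below, one bundle per line so every read sees a value written in an earlier cycle. The register assignments (r0 = constant 1, r1 = constant 2, r2 = tree base address, r3 = walker node index, r4 = walker state), the plain xor standing in for the hash-mix, and the heap-style child layout (children of index i at 2i+1 and 2i+2) are illustrative assumptions, not the actual problem layout or hash.
load const r0 1; load const r1 2
alu + r5 r2 r3
load load r6 r5
alu ^ r4 r4 r6
alu & r7 r4 r0; alu + r8 r3 r3
alu + r9 r8 r0; alu + r10 r8 r1
flow select r3 r7 r10 r9
The final select picks the right child (r10) when the low bit is set and the left child (r9) otherwise; swap the last two operands if the problem defines the bit the other way around.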
Click Load Sample to load an unoptimized
reference kernel for a smaller version of the problem, or
Load Fibonacci to load a templated Fibonacci
example.
For more information, see the original GitHub repository.
Execution model: One line = one VLIW bundle
(one cycle). Slots are separated by ;. Each
bundle can issue multiple slots across engines in parallel.
Registers: rN is a scalar
scratch address. vN is a vector base (8 lanes
at N..N+7). Note: the distinction between rN and
vN is mostly convention; they are treated the same by the
parser. Hex register addresses are allowed (e.g.,
r0x1a). All values are 32-bit and wrap mod
2^32.
Slot limits per cycle: alu=12,
valu=6, load=2,
store=2, flow=1.
Hazards: All writes land at end-of-cycle. A slot in the same line that reads a just-written scratch address will see the old value.
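A minimal two-bundle illustration of the rule (register numbers and values are arbitrary): assume r1 currently holds 5.
load const r1 7; alu + r2 r1 r1
alu + r3 r1 r1
In the first bundle the alu slot still reads the old r1, so r2 becomes 10; the new value 7 only lands at the end of that cycle, so the second bundle computes r3 = 14.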
Scalar 32-bit arithmetic, bitwise, shifts, and comparisons. Results wrap mod 2^32.
alu + rD rA rB — addition
alu - rD rA rB — subtraction
alu * rD rA rB — multiplication
alu // rD rA rB — floor division
alu % rD rA rB — modulus
alu ^ rD rA rB — xor
alu & rD rA rB — and
alu | rD rA rB — or
alu << rD rA rB — left shift
alu >> rD rA rB — right shift
alu < rD rA rB — less-than (writes 1 if rA < rB, else 0)
alu == rD rA rB — equality (writes 1 if rA == rB, else 0)
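Chained across bundles, these scalar ops are enough for a shift-xor-add style mix. A sketch, assuming r1 holds a walker state, r2 a tree value, and r3 the constant 5 (the real kernel's hash may differ):
alu ^ r4 r1 r2
alu >> r5 r4 r3
alu + r1 r4 r5
Each dependent op sits on its own line so it reads the result written at the end of the previous cycle.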
Move data from main memory (mem) or immediates into scratch (register file). Use these sparingly — only 2 loads per cycle.
Immediate load: materialize a constant.
load const r0 123
Before: r0 is undefined/old value.
After: r0 = 0x0000007B.
Scalar load: copy one word from mem.
load load r1 rAddr
Assume rAddr = 0x0100, mem[0x0100] = 0xDEADBEEF.
After: r1 = 0xDEADBEEF.
Vector load (vload): copy 8 contiguous words from mem into a vector in scratch.
load vload v10 rBase
Assume rBase = 0x0100, mem[0x0100..0x0107] = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80].
After: v10..v17 = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80].
Offset load (load_offset): copy one word at a computed address; often used with a vector base.
load load_offset v20 rBase 5
Assume rBase = 0x0100 and scratch[0x0100..0x0107] holds the addresses [0x0100, 0x0101, 0x0102, ...].
Reads from mem[scratch[rBase + 5]] = mem[0x0105].
After: scratch[v20 + 5] (lane 5 of v20) = that value.
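Because each walker can sit at a different tree node, a per-walker gather is a natural use of load_offset. A sketch, assuming scratch v30..v37 already hold one mem address per walker and the fetched values should land in v40..v47; two load slots per cycle is the ceiling, so the gather takes four bundles:
load load_offset v40 v30 0; load load_offset v40 v30 1
load load_offset v40 v30 2; load load_offset v40 v30 3
load load_offset v40 v30 4; load load_offset v40 v30 5
load load_offset v40 v30 6; load load_offset v40 v30 7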
Move data from scratch back into main memory. Again, only 2 stores per cycle.
Scalar store: write one word.
store store rAddr rS
Assume rAddr = 0x0100, rS = 0xABCDEF01.
After: mem[0x0100] = 0xABCDEF01.
Vector store (vstore): write 8 contiguous words.
store vstore rAddr rVec
Assume rAddr = 0x0100, rVec..rVec+7 = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80].
After: mem[0x0100..0x0107] = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80].
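Since the result is judged by what ends up in the walker-state region of mem, vstore is the cheap way to write results back. A sketch, assuming r2 and r3 hold the base addresses of two 8-word output blocks and v40..v47 / v48..v55 hold the updated states; two store slots fit in one bundle:
store vstore r2 v40; store vstore r3 v48
Writing the same 16 words with scalar store would take eight bundles at two stores per cycle.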
This engine covers control flow (jumps, conditional jumps), selects, and misc ops. Only 1 flow slot per cycle, so branching/selects can be a bottleneck.
Selects: Used to implement conditional moves (like C ternary) without branching. Handy for choosing constants based on parity, for example.
Scalar select: choose between two values based on a condition.
flow select rD rC rA rB
Assume rC = 1 (truthy), rA = 0xDEADBEEF, rB = 0xABCDEF01.
After: rD = rA = 0xDEADBEEF (because rC != 0).
If rC = 0: rD = rB = 0xABCDEF01.
Vector select (vselect): lane-wise conditional.
flow vselect vD vC vA vB
For each lane i (0..7): if vC[i] != 0 then vD[i] = vA[i] else vD[i] = vB[i].
Use case: In the kernel, we branch left vs right based on value parity. Instead of a jump, use select with a parity mask to keep execution going straight.
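A sketch of that pattern, assuming r1 holds the walker state, r0 the constant 1, and r4 and r5 the precomputed left and right child indices:
alu & r2 r1 r0
flow select r6 r2 r5 r4
r6 receives r5 when the low bit of the state is set and r4 otherwise; which child corresponds to a set bit is up to the problem definition, so swap the last two operands to flip it.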
PC and jumps: The VM has a program counter (pc) that usually increments. Jump ops can change it. Since this problem has only one core, unconditional jumps and core id are less useful.
flow jump PC — set pc = PC (immediate constant)
flow cond_jump rC PC — if rC != 0 then set pc = PC
flow cond_jump_rel rC OFF — if rC != 0 then pc += OFF (relative jump)
flow jump_indirect rAddr — set pc = scratch[rAddr]
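A counted-loop skeleton built from these, assuming bundles are numbered from 0 and a jump target is such a bundle index (adjust the target if the harness numbers pc differently). r1 is the loop counter, r2 holds the constant 1, and r3 is an accumulator standing in for real loop work:
load const r1 10; load const r2 1
load const r3 0
alu + r3 r3 r2
alu - r1 r1 r2
flow cond_jump r1 2
Bundles 2 through 4 repeat until r1 reaches 0, at which point the cond_jump falls through; the decrement sits one bundle before the jump so the jump reads the updated counter.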
Core and misc: These exist for multi-core configs but this problem uses one core.
flow coreid rD — write the current core id into rD (useful if you were using multiple cores)
flow add_imm rD rA IMM — rD = rA + IMM (adds an immediate without using an ALU slot)
flow pause — pause execution for debugging (used in local tests; ignored in the submission harness)
flow halt — stop the core
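add_imm pairs nicely with the end-of-cycle write rule: a load can read a pointer in the same bundle that add_imm bumps it, giving a free post-increment. A sketch, assuming r10 holds a mem address and v20 is the destination vector:
load vload v20 r10; flow add_imm r10 r10 8
The vload sees the old r10 and reads 8 words from the current base; the bumped address lands at the end of the cycle, ready for the next bundle.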
Lane-wise operations on 8-element vectors (addresses vN..vN+7 in scratch). This is where SIMD helps — process multiple walkers or contiguous memory in parallel.
Vector broadcast: replicate a scalar into all lanes of a vector.
valu vbroadcast v10 rS
Assume rS = 0x12345678.
After: v10..v17 = [0x12345678, 0x12345678, ..., 0x12345678] (8 copies).
Vector ALU ops: lane-wise arithmetic/logic.
valu + rD rA rB — rD[i] = rA[i] + rB[i] for all lanes i
valu multiply_add rD rA rB rC — lane-wise fused multiply-add: rD[i] = rA[i] * rB[i] + rC[i]
valu ^ rD rA rB — lane-wise xor
valu << rD rA rB — lane-wise left shift
valu >> rD rA rB — lane-wise right shift
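The scalar mixing sketch from the ALU section vectorizes directly. Assuming 8 walker states sit in v0..v7 and their 8 fetched tree values in v8..v15, one lane per walker, a shift-xor-add mix (again an illustrative stand-in for the real hash) could look like this; r0 and the v16/v24/v32 temporaries are arbitrary choices:
load const r0 5
valu vbroadcast v16 r0
valu ^ v24 v0 v8
valu >> v32 v24 v16
valu + v0 v24 v32
The broadcast is one-time setup; the per-round cost is three valu slots instead of 24 scalar alu slots, leaving the alu engine free for other work.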
One line (bundle) can have multiple slots from different engines running in parallel:
alu + r2 r0 r1; alu ^ r3 r4 r5
This cycle does an addition and an xor at the same time, as long as they don't depend on each other (no RAW hazard).
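Slots from all five engines can share a bundle as long as the per-engine limits hold and no slot needs a result written in that same cycle. A sketch with arbitrary register numbers:
load load r1 r0; alu + r4 r2 r3; valu ^ v16 v8 v24; store store r5 r6; flow add_imm r0 r0 1
The load reads the old r0 while add_imm advances it, the alu and valu slots work on unrelated registers, and the store writes a value computed in an earlier cycle.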