The challenge: You are given a binary tree (dense, balanced, stored flat in memory) and N "walkers", each with a 32-bit internal state and a position in the tree. The program runs for a fixed number of rounds. Each round, every walker hash-mixes its internal state with the tree value at its current position, then goes left or right based on the least-significant bit of its updated state. Correctness is judged by the contents of the walker-state region of main memory at the end of the program.
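To make the round structure concrete, here is a sketch of one round for a single walker, using ops documented in the engine sections below, one bundle per line so every read sees a value written in an earlier cycle. The register assignments (r0 = constant 1, r1 = constant 2, r2 = tree base address, r3 = walker node index, r4 = walker state), the plain xor standing in for the hash-mix, and the heap-style child layout (children of index i at 2i+1 and 2i+2) are illustrative assumptions, not the actual problem layout or hash.
load const r0 1; load const r1 2
alu + r5 r2 r3
load load r6 r5
alu ^ r4 r4 r6
alu & r7 r4 r0; alu + r8 r3 r3
alu + r9 r8 r0; alu + r10 r8 r1
flow select r3 r7 r10 r9
The final select picks the right child (r10) when the low bit is set and the left child (r9) otherwise; swap the last two operands if the problem defines the bit the other way around.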
Click Load Sample to load an unoptimized
reference kernel for a smaller version of the problem, or
Load Fibonacci to load a templated Fibonacci
example.
For more information, see the original GitHub repository.
Execution model: One line = one VLIW bundle
(one cycle). Slots are separated by ;. Each
bundle can issue multiple slots across engines in parallel.
Registers: rN is a scalar
scratch address. vN is a vector base (8 lanes
at N..N+7). Note: the distinction between rN and
vN is mostly convention; they are treated the same by the
parser. Hex register addresses are allowed (e.g.,
r0x1a). All values are 32-bit and wrap mod
2^32.
Slot limits per cycle: alu=12,
valu=6, load=2,
store=2, flow=1.
Hazards: All writes land at end-of-cycle. A slot in the same line that reads a just-written scratch address will see the old value.
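A minimal two-bundle illustration of the rule (register numbers and values are arbitrary): assume r1 currently holds 5.
load const r1 7; alu + r2 r1 r1
alu + r3 r1 r1
In the first bundle the alu slot still reads the old r1, so r2 becomes 10; the new value 7 only lands at the end of that cycle, so the second bundle computes r3 = 14.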
Scalar 32-bit arithmetic, bitwise, shifts, and comparisons. Results wrap mod 2^32.
alu + rD rA rB — addition
alu - rD rA rB — subtraction
alu * rD rA rB — multiplication
alu // rD rA rB — floor division
alu % rD rA rB — modulus
alu ^ rD rA rB — xor
alu & rD rA rB — and
alu | rD rA rB — or
alu << rD rA rB — left shift
alu >> rD rA rB — right shift
alu < rD rA rB — less-than (writes 1 if rA < rB, else 0)
alu == rD rA rB — equality (writes 1 if rA == rB, else 0)
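Chained across bundles, these scalar ops are enough for a shift-xor-add style mix. A sketch, assuming r1 holds a walker state, r2 a tree value, and r3 the constant 5 (the real kernel's hash may differ):
alu ^ r4 r1 r2
alu >> r5 r4 r3
alu + r1 r4 r5
Each dependent op sits on its own line so it reads the result written at the end of the previous cycle.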
Move data from main memory (mem) or immediates into scratch (register file). Use these sparingly — only 2 loads per cycle.
Immediate load: materialize a constant.
load const r0 123
Before: r0 is undefined/old value.
After: r0 = 0x0000007B.
Scalar load: copy one word from mem.
load load r1 rAddr
Assume rAddr = 0x0100, mem[0x0100] = 0xDEADBEEF.
After: r1 = 0xDEADBEEF.
Vector load (vload): copy 8 contiguous words from mem into a vector in scratch.
load vload v10 rBase
Assume rBase = 0x0100, mem[0x0100..0x0107] = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80].
After: v10..v17 = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80].
Offset load (load_offset): copy one word at a computed address; often used with a vector base.
load load_offset v20 rBase 5
Assume rBase = 0x0100 and scratch[0x0100..0x0107] holds the addresses [0x0100, 0x0101, 0x0102, ...].
Reads from mem[scratch[rBase + 5]] = mem[0x0105].
After: scratch[v20 + 5] (lane 5 of v20) = that value.
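Because each walker can sit at a different tree node, a per-walker gather is a natural use of load_offset. A sketch, assuming scratch v30..v37 already hold one mem address per walker and the fetched values should land in v40..v47; two load slots per cycle is the ceiling, so the gather takes four bundles:
load load_offset v40 v30 0; load load_offset v40 v30 1
load load_offset v40 v30 2; load load_offset v40 v30 3
load load_offset v40 v30 4; load load_offset v40 v30 5
load load_offset v40 v30 6; load load_offset v40 v30 7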
Move data from scratch back into main memory. Again, only 2 stores per cycle.
Scalar store: write one word.
store store rAddr rS
Assume rAddr = 0x0100, rS = 0xABCDEF01.
After: mem[0x0100] = 0xABCDEF01.
Vector store (vstore): write 8 contiguous words.
store vstore rAddr rVec
Assume rAddr = 0x0100, rVec..rVec+7 = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80].
After: mem[0x0100..0x0107] = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80].
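Since the result is judged by what ends up in the walker-state region of mem, vstore is the cheap way to write results back. A sketch, assuming r2 and r3 hold the base addresses of two 8-word output blocks and v40..v47 / v48..v55 hold the updated states; two store slots fit in one bundle:
store vstore r2 v40; store vstore r3 v48
Writing the same 16 words with scalar store would take eight bundles at two stores per cycle.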
This engine covers control flow (jumps, conditional jumps), selects, and misc ops. Only 1 flow slot per cycle, so branching/selects can be a bottleneck.
Selects: Used to implement conditional moves (like C ternary) without branching. Handy for choosing constants based on parity, for example.
Scalar select: choose between two values based on a condition.
flow select rD rC rA rB
Assume rC = 1 (truthy), rA = 0xDEADBEEF, rB = 0xABCDEF01.
After: rD = rA = 0xDEADBEEF (because rC != 0).
If rC = 0: rD = rB = 0xABCDEF01.
Vector select (vselect): lane-wise conditional.
flow vselect vD vC vA vB
For each lane i (0..7): if vC[i] != 0 then vD[i] = vA[i] else vD[i] = vB[i].
Use case: In the kernel, we branch left vs right based on value parity. Instead of a jump, use select with a parity mask to keep execution going straight.
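A sketch of that pattern, assuming r1 holds the walker state, r0 the constant 1, and r4 and r5 the precomputed left and right child indices:
alu & r2 r1 r0
flow select r6 r2 r5 r4
r6 receives r5 when the low bit of the state is set and r4 otherwise; which child corresponds to a set bit is up to the problem definition, so swap the last two operands to flip it.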
PC and jumps: The VM has a program counter (pc) that usually increments. Jump ops can change it. Since this problem has only one core, unconditional jumps and core id are less useful.
flow jump PC — set pc = PC (immediate constant)
flow cond_jump rC PC — if rC != 0 then set pc = PC
flow cond_jump_rel rC OFF — if rC != 0 then pc += OFF (relative jump)
flow jump_indirect rAddr — set pc = scratch[rAddr]
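A counted-loop skeleton built from these, assuming bundles are numbered from 0 and a jump target is such a bundle index (adjust the target if the harness numbers pc differently). r1 is the loop counter, r2 holds the constant 1, and r3 is an accumulator standing in for real loop work:
load const r1 10; load const r2 1
load const r3 0
alu + r3 r3 r2
alu - r1 r1 r2
flow cond_jump r1 2
Bundles 2 through 4 repeat until r1 reaches 0, at which point the cond_jump falls through; the decrement sits one bundle before the jump so the jump reads the updated counter.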
Core and misc: These exist for multi-core configs but this problem uses one core.
flow coreid rD — write the current core id into rD (useful if you were using multiple cores)
flow add_imm rD rA IMM — rD = rA + IMM (adds an immediate without using an ALU slot)
flow pause — pause execution for debugging (used in local tests; ignored in the submission harness)
flow halt — stop the core
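add_imm pairs nicely with the end-of-cycle write rule: a load can read a pointer in the same bundle that add_imm bumps it, giving a free post-increment. A sketch, assuming r10 holds a mem address and v20 is the destination vector:
load vload v20 r10; flow add_imm r10 r10 8
The vload sees the old r10 and reads 8 words from the current base; the bumped address lands at the end of the cycle, ready for the next bundle.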
Lane-wise operations on 8-element vectors (addresses vN..vN+7 in scratch). This is where SIMD helps — process multiple walkers or contiguous memory in parallel.
Vector broadcast: replicate a scalar into all lanes of a vector.
valu vbroadcast v10 rS
Assume rS = 0x12345678.
After: v10..v17 = [0x12345678, 0x12345678, ..., 0x12345678] (8 copies).
Vector ALU ops: lane-wise arithmetic/logic.
valu + rD rA rB — rD[i] = rA[i] + rB[i] for all lanes i
valu multiply_add rD rA rB rC — lane-wise fused multiply-add: rD[i] = rA[i] * rB[i] + rC[i]
valu ^ rD rA rB — lane-wise xor
valu << rD rA rB — lane-wise left shift
valu >> rD rA rB — lane-wise right shift
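The scalar mixing sketch from the ALU section vectorizes directly. Assuming 8 walker states sit in v0..v7 and their 8 fetched tree values in v8..v15, one lane per walker, a shift-xor-add mix (again an illustrative stand-in for the real hash) could look like this; r0 and the v16/v24/v32 temporaries are arbitrary choices:
load const r0 5
valu vbroadcast v16 r0
valu ^ v24 v0 v8
valu >> v32 v24 v16
valu + v0 v24 v32
The broadcast is one-time setup; the per-round cost is three valu slots instead of 24 scalar alu slots, leaving the alu engine free for other work.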
One line (bundle) can have multiple slots from different engines running in parallel:
alu + r2 r0 r1; alu ^ r3 r4 r5
This cycle does an addition and an xor at the same time, as long as they don't depend on each other (no RAW hazard).
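Slots from all five engines can share a bundle as long as the per-engine limits hold and no slot needs a result written in that same cycle. A sketch with arbitrary register numbers:
load load r1 r0; alu + r4 r2 r3; valu ^ v16 v8 v24; store store r5 r6; flow add_imm r0 r0 1
The load reads the old r0 while add_imm advances it, the alu and valu slots work on unrelated registers, and the store writes a value computed in an earlier cycle.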