CPU-Free
Hardware Intelligence

BATENCORE Research investigates compute architectures in which neural inference executes directly in programmable logic — without processor instructions, without an operating system, and without the von Neumann bottleneck on the critical compute path.

Patent pending  ·  PCT/IB2026/053450

Research Focus

⚙️

AXI4 Master Engines

Hand-written RTL finite-state machines that drive AXI4 transactions directly — every handshake, address, and beat explicit and auditable at the register level.

🔢

Fixed-Point Arithmetic

Q16.16 signed multiply-accumulate on DSP Slices. No IEEE 754 floating-point hardware required. Precision validated against software reference on the same dataset.

🧠

On-Chip Neural Inference

Multi-layer feed-forward networks with ReLU activation executing entirely in programmable logic, reading weights from DDR3 via PS7 High-Performance ports.

🔬

Silicon Validation

Every result verified by direct JTAG register readback (Xilinx XSCT), not simulation. Ground truth is the silicon, not the waveform.

Current Project — FLVH Architecture

Overview

FLVH is a progressive hardware architecture for neural network inference on a Zynq-7020 FPGA (Arty Z7-20 development board, XC7Z020CLG400-1). The architecture chains five AXI4 Master engines — DDR3 write, Q16.16 ALU, single-neuron MAC, multi-neuron layer propagation (SPREAD), and multi-layer propagation with ReLU (DEEP) — into a complete inference pipeline validated on silicon across six development phases (N through S).

Verilog RTL  ·  AXI4  ·  Q16.16  ·  Zynq-7020  ·  DDR3  ·  Silicon validated

Key Result — XOR Inference

A 2-2-1 network (2 inputs, 2 hidden neurons with ReLU, 1 linear output) trained offline on XOR classifies all four input cases correctly on silicon. Hardware compute time: approximately 1.3 µs at 125 MHz. ReLU activation required 5 additional lines of Verilog and zero new GPIO registers. Scores land exactly at 0.0 or 1.0 in Q16.16.

4/4 XOR cases correct  ·  ~1.3 µs compute  ·  <4 of 220 DSP Slices  ·  <800 of 53,200 LUTs

Architecture Milestone Sequence

Each phase validated by direct XSCT register readback before the next begins. No simulation-only results.

Phase N — AXI4 Master write engine, DDR3 access via PS7 HP1
Phase O — Q16.16 fixed-point ALU on DSP Slice (FMUL, FDIV, signed)
Phase P — Single-neuron MAC engine, AXI4 Master Read via HP0
Phase Q — SPREAD: full layer without ARM intervention between neurons
Phase R — DEEP: multi-layer chaining without ARM intervention between layers
Phase S — ReLU in 5 lines of Verilog; XOR inference on silicon
Phase T — MNIST inference (planned) — same engine, larger weight arrays

Publications

FLVH: CPU-Free Multi-Layer Neural Network Inference on FPGA via AXI4 Master Engines — From Q16.16 MAC to ReLU-Activated XOR Classification on Silicon
Mohammed-Hounaïne EL HAMIANI-KHATAT
Preprint  ·  cs.AR (Hardware Architecture)  ·  2026
We present FLVH, a hardware architecture for neural network inference on a Zynq-7020 FPGA that executes entirely without CPU intervention during computation. Five AXI4 Master engines validated on silicon implement a complete inference pipeline from DDR3 weight fetch to Q16.16 MAC to ReLU activation. A 2-2-1 network classifies all four XOR cases correctly in approximately 1.3 µs of hardware compute time, using fewer than 4 DSP Slices out of 220 available.
Preprint — submission in preparation

Intellectual Property

Patent Number: PCT/IB2026/053450
Type: International PCT Application
Status: Pending
Subject Matter: CPU-Free Neural Inference Hardware Architecture

The patent covers the FLVH architecture, including the AXI4 Master engine design pattern, the Q16.16 fixed-point multiply-accumulate pipeline, and the multi-layer propagation protocol in which the ARM processor is removed from the neural compute path. Licensing inquiries: hounaine.hamiani@batencore.com

Team


Mohammed-Hounaïne EL HAMIANI-KHATAT

Chief Scientist — BATENCORE Research

Hardware architect and inventor of the FLVH architecture. Specialises in FPGA-based neural inference, AXI4 interconnect design, and fixed-point arithmetic in programmable logic. Designed, implemented, and validated the complete FLVH pipeline from first principles in hand-written Verilog RTL on silicon, from individual engine validation to multi-layer XOR inference without CPU intervention on the compute path.

Open Research Questions

Automatic inter-layer chaining

The current implementation requires one ARM trigger per network layer. An extended FSM can chain layers automatically by using the previous layer's output buffer address as the next layer's input address — eliminating ARM involvement from multi-layer inference entirely. (Phase T)

Parallel neuron computation

The DEEP engine processes neurons sequentially. Instantiating M engines in parallel behind a multi-master SmartConnect configuration would achieve an M× speedup at the cost of additional DSP and LUT resources, with 218 DSP Slices still available on the XC7Z020. (Phase U)

Energy characterisation

Inference energy per MAC (pJ/MAC) has not yet been measured. The Zynq-7020 XADC provides on-chip power sensors accessible from bare-metal C, enabling direct comparison with published GPU and dedicated AI accelerator figures.