The GenSoC Execution Substrate: Why Physical AI’s Next Phase Is About Time, Not Tokens

A class-D amplifier feeding a tweeter expects a new sample every 20.8 microseconds. Miss one, by even a few hundred nanoseconds, and the speaker emits an audible click. The system did not crash. The arithmetic was correct. The output simply arrived late, and lateness, in this domain, is wrongness. Replace the audio sample with a PWM update on a brushless motor controller, a CAN frame acknowledgement on a vehicle bus, or a torque command to a humanoid’s ankle joint mid-stride, and the failure mode escalates from annoying to dangerous. These are not throughput problems. They are timing problems, and they are why hard real-time systems demand a different way of thinking about processor architecture.

They are also why physical AI (robots, drones, autonomous vehicles, and any embodied system that closes a control loop with the real world) cannot be built on the same architectural assumptions as cloud-based inference. A large model that decides what to do is useless if the actuator command arrives a millisecond after the foot has already hit the ground.

A hard real-time system is one in which a result produced after its deadline is not merely degraded — it is incorrect. Soft real-time tolerates the occasional miss; “fast” optimises the average. Hard real-time requires that every instance of an operation completes within a bounded, statically known time. The metric that matters is not throughput or average latency but worst-case execution time (WCET), together with jitter — the variation between successive executions of the same task. A system whose WCET cannot be bounded analytically cannot be certified as hard real-time, regardless of how fast it runs on a benchmark.
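The distinction between an average, an observed maximum, and an analytical bound can be made concrete with a small sketch. The timing samples and the 1,000 ns deadline below are hypothetical values chosen for illustration, and the function names are not taken from any tool.

```python
def jitter_ns(samples_ns):
    """Peak-to-peak variation between executions of the same task."""
    return max(samples_ns) - min(samples_ns)


def meets_hard_deadline(wcet_bound_ns, deadline_ns):
    """Hard real-time test: the worst-case bound itself must fit the deadline."""
    return wcet_bound_ns <= deadline_ns


# Hypothetical measurements of one task, in nanoseconds (assumed values).
observed_ns = [410, 415, 412, 409, 2300, 411]
deadline_ns = 1000  # assumed deadline

average_ns = sum(observed_ns) / len(observed_ns)

# The average sits comfortably inside the deadline...
looks_fast = average_ns < deadline_ns
# ...but a single late execution is enough to fail the hard real-time test.
# The observed maximum is only a lower bound on the true WCET: an analytical
# bound can never be smaller than a value that was actually measured.
hard_ok = meets_hard_deadline(max(observed_ns), deadline_ns)
```

Here the task looks fast on average yet fails the hard real-time test, and measurement alone cannot show that 2,300 ns is the worst it will ever do.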

Almost every technique that modern processors use to raise single-thread performance does so by exploiting statistical regularities in program behaviour. Each one trades worst-case predictability for average-case speed.

Multi-level caches are the clearest example. A cache hit might take a single cycle; a miss to DRAM, several hundred. Whether a given load hits or misses depends on the access history of the entire program, including code that ran arbitrarily far in the past. Per-instruction timing becomes data-dependent and history-dependent, and static WCET analysis must assume the miss every time, which negates the cache’s benefit on the very metric that matters.
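A rough sketch of that arithmetic, under stated assumptions: the 1-cycle hit and 300-cycle miss latencies are illustrative stand-ins for "a single cycle" and "several hundred", and the miss-every-time rule follows the conservative assumption described above rather than any particular analysis tool.

```python
HIT_CYCLES = 1     # assumed hit latency
MISS_CYCLES = 300  # assumed miss-to-DRAM latency


def typical_cycles(n_loads, n_hits):
    """Cycles spent on loads when n_hits of them hit the cache."""
    return n_hits * HIT_CYCLES + (n_loads - n_hits) * MISS_CYCLES


def wcet_cycles_cached(n_loads):
    """Conservative static bound: every load is charged as a miss."""
    return n_loads * MISS_CYCLES


def wcet_cycles_scratchpad(n_loads, access_cycles=1):
    """Fixed-latency memory: the bound is the same as the typical case."""
    return n_loads * access_cycles


loads = 100
typical = typical_cycles(loads, n_hits=95)   # 95 hits, 5 misses
cached_bound = wcet_cycles_cached(loads)     # what the analysis must budget for
scratch_bound = wcet_cycles_scratchpad(loads)
```

With these numbers the cached core typically spends 1,595 cycles on its loads, but the bound that must be budgeted is 30,000 cycles, against a fixed 100 cycles for the scratchpad case.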

Deep pipelines, branch prediction, and out-of-order issue compound the problem. A mispredicted branch flushes the pipeline; a speculative load that aliases a pending store causes a rollback. The cost of any individual instruction depends on what came before it and what the predictor happened to learn. Speculation makes the average case faster and the worst case both slower and harder to bound.

Interrupts inject arbitrary latency into whichever thread is running. Even with priority-based handling, lower-priority interrupts can be blocked by higher-priority ones, masked critical sections delay service, and shared state between handlers and tasks introduces priority inversion. Shared buses and DMA engines add another layer: the latency of a memory access depends on what every other bus master is doing at that instant. An RTOS scheduler running on top of all this adds context-switch jitter and tick quantisation. Each of these mechanisms is defensible on its own terms. Together they make WCET a statistical estimate rather than a guarantee.

It is tempting to assume that a sufficiently fast processor will absorb timing variability, that if the average is fast enough, the worst case will also be acceptable. The asymmetry runs the other way. A 3 GHz superscalar core with multi-level caches and aggressive speculation may average tens of picoseconds per instruction, but its worst-case latency for a hard deadline can be hundreds of microseconds once cache misses, pipeline flushes, interrupt service, and bus contention are accounted for. A 100 MHz core with no caches, no speculation, and a fixed instruction issue rate has a worst case that is essentially equal to its average — and may comfortably meet a deadline that the faster core cannot guarantee. Determinism and peak speed are independent axes, and they often pull against each other.
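The asymmetry can be illustrated with a back-of-envelope comparison against the 20.8 microsecond deadline from the opening example. Every number below other than that deadline and the 100 MHz clock is a hypothetical placeholder (the task length, the fast core's average time, and each interference penalty), so this is a sketch of the reasoning, not a measurement of any real processor.

```python
NS_PER_S = 1_000_000_000


def cycle_budget(deadline_ns, clock_hz):
    """Whole clock cycles available before the deadline."""
    return deadline_ns * clock_hz // NS_PER_S


def deterministic_wcet_ns(instructions, cycles_per_instruction, clock_hz):
    """Fixed issue rate and no caches: the worst case is the only case."""
    cycles = instructions * cycles_per_instruction
    return cycles * NS_PER_S // clock_hz


def speculative_wcet_ns(average_ns, penalties_ns):
    """Average-case time plus every interference term that cannot be ruled out."""
    return average_ns + sum(penalties_ns.values())


DEADLINE_NS = 20_800        # one sample period from the opening example
TASK_INSTRUCTIONS = 1_500   # hypothetical task length

slow_core_wcet = deterministic_wcet_ns(TASK_INSTRUCTIONS, 1, 100_000_000)
fast_core_wcet = speculative_wcet_ns(
    average_ns=150,  # assumed average for the same task on the fast core
    penalties_ns={   # assumed worst-case interference terms
        "cache_misses": 20_000,
        "interrupt_service": 100_000,
        "bus_contention": 10_000,
    },
)

slow_core_ok = slow_core_wcet <= DEADLINE_NS
fast_core_ok = fast_core_wcet <= DEADLINE_NS
```

Under these assumptions the 100 MHz core has a budget of 2,080 cycles and a bound of 15,000 ns, which fits. The faster core averages 150 ns but can only be bounded at 130,150 ns, which does not.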

If a single thread of execution cannot escape the interference of caches, interrupts, and shared resources, the structural answer is to give each real-time task its own execution resource: its own register file, its own deterministic instruction issue, and its own predictable path to the memory and I/O it needs. Tasks that do not share hardware cannot interfere with each other in the time domain. WCET for one task becomes independent of what every other task is doing, which is precisely the property that static analysis requires. This is parallelism used as a correctness mechanism, not a performance optimisation. The point is not that two cores run twice as fast as one; the point is that two cores running independently cannot perturb each other’s timing.

The argument above collapses immediately if the parallel elements are themselves non-deterministic. A multi-core SoC built from conventional application cores does not deliver hard real-time guarantees. Each core remains internally unpredictable for all the reasons already discussed, and the cores typically share the very resources whose contention is hardest to bound: last-level caches, memory controllers, interconnect fabrics, interrupt distributors. Replicating non-determinism produces several non-deterministic systems running side by side, with new contention paths between them. The worst case gets worse, not better.

Parallelism is the structural fix only when it is parallelism of deterministic elements. Each element must execute predictably in isolation, and the connections between elements must not reintroduce the contention that isolation was meant to eliminate.

Determinism at the element level dictates the rest of the architecture. Memory must be tightly coupled and privately addressable, with fixed access latency — scratchpad rather than cached DRAM — so that load and store timing is a property of the instruction, not of the access history. Inter-element communication must use hardware channels or FIFOs with bounded behaviour, not shared memory with locks, because lock contention is itself a source of unbounded waiting. I/O must be handled at the element level, with cycle-accurate event response, rather than funnelled through a shared interrupt controller that serialises and reorders requests and a contested interconnect. These are not optional refinements. They are direct consequences of the requirement that every element remain analysable in isolation.
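As a software model of what "bounded behaviour" means for a channel, the sketch below shows a fixed-capacity FIFO whose send and receive operations never wait: each either completes in a constant number of steps or reports failure immediately. The class and method names are hypothetical, and a Python object is only an analogy for a hardware channel, not a description of one.

```python
class BoundedFifo:
    """Fixed-capacity single-producer, single-consumer ring buffer."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._slots = [None] * capacity
        self._head = 0    # index of the oldest item
        self._count = 0   # number of items currently stored

    def try_send(self, item):
        """Store item if there is room; never blocks. Returns True on success."""
        if self._count == len(self._slots):
            return False  # full: the caller decides what to do, with no waiting
        tail = (self._head + self._count) % len(self._slots)
        self._slots[tail] = item
        self._count += 1
        return True

    def try_recv(self):
        """Return (True, item) if an item is available, else (False, None)."""
        if self._count == 0:
            return False, None
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return True, item


fifo = BoundedFifo(capacity=2)
results = [fifo.try_send(1), fifo.try_send(2), fifo.try_send(3)]
first = fifo.try_recv()
```

The third send is refused rather than queued or blocked, so the time spent in either operation does not depend on what the other side is doing. That is the property a lock-protected shared buffer cannot offer.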

The XMOS XCORE® architecture instantiates these principles concretely. Each tile contains multiple hardware threads, and the instruction issue schedule guarantees each active thread a known fraction of the pipeline. Per-thread instruction timing is therefore independent of what the other threads on the tile are doing — there is no shared front-end whose contention has to be modelled. The architecture omits caches entirely; threads execute against tightly coupled memory with fixed access latency, so load and store timing is statically known. I/O is handled through ports directly mapped into the instruction set, rather than across a shared bus and interrupt controller: a thread can wait on a pin transition or a timed event with cycle-accurate response, and the wait does not perturb other threads. Inter-thread and inter-tile communication uses hardware channels with defined behaviour rather than shared memory protected by locks, removing contention from the inter-element path. The combined effect is that WCET for a given task can be computed at design time from the program text and the architectural rules, rather than measured statistically and hoped to hold.
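The idea of a guaranteed share of the pipeline can be illustrated with a deliberately simplified model: a fixed round-robin schedule in which each hardware thread owns one issue slot per rotation. The slot count, the function name, and the workloads are assumptions made for the sketch; it is not a cycle-accurate model of XCORE and does not reproduce the actual issue rules.

```python
def simulate_issue(n_slots, workloads):
    """Simulate a fixed round-robin issue schedule.

    workloads maps a thread's slot index to the number of instructions it
    wants to run. Returns a dict mapping each slot index to the list of
    cycles on which that thread issued an instruction.
    """
    remaining = dict(workloads)
    issued = {thread: [] for thread in workloads}
    cycle = 0
    while any(remaining.values()):
        slot = cycle % n_slots  # the slot owner is fixed by the cycle number
        if remaining.get(slot, 0) > 0:
            issued[slot].append(cycle)
            remaining[slot] -= 1
        cycle += 1              # an unused slot is simply left idle
    return issued


# Thread 0 running alone on a hypothetical 4-slot tile.
alone = simulate_issue(4, {0: 3})[0]

# Thread 0 running alongside three busy neighbours.
loaded = simulate_issue(4, {0: 3, 1: 10, 2: 10, 3: 10})[0]

timing_unchanged = alone == loaded
```

In this model thread 0 issues on cycles 0, 4, and 8 whether its neighbours are idle or saturated, which is the property that lets its timing be computed from its own program text alone.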

Hard real-time is a discipline of bounds, not averages. The architectural choices that maximise average throughput — speculation, caching, shared interconnect, interrupt-driven concurrency — are precisely the choices that make worst-case timing unanalysable. Parallelism offers a way out, but only when each parallel element is itself deterministic and the fabric connecting them preserves that determinism end to end. The trade is real: some peak performance is given up in exchange for tight, provable bounds. For systems where lateness is wrongness, that trade is the whole point.

The cloud-based AI systems that have become part of our everyday lives reward the architectures that the last decade of AI infrastructure has optimised: GPUs, accelerators, high-bandwidth memory, batched execution. Closed-loop physical AI systems are a different problem. Sensor fusion at the model layer, planning, policy inference, actuator commands, safety interlocks, bus arbitration, and the inner loops of sensor acquisition all become bounded-time problems whose correctness is defined by when the result appears, not by how quickly an average one can be produced. The two regimes have different correctness criteria, different failure modes, and different architectural requirements. Treating them as one problem, to be solved by a single class of silicon, is a category error in either direction.

The industry is currently making that error in one specific direction. Inference performance is taken seriously; deterministic execution is treated as a downstream integration concern, handled by whatever microcontroller happens to be on the board, with timing budgets validated empirically rather than analytically. That works for demos. It does not survive contact with certification, with adversarial operating conditions, or with the scale at which a fleet of embodied systems encounters every edge case the design implicitly assumed away. The reason physical AI products keep stalling between impressive demo and shippable product is not that the models are not good enough. It is that the substrate underneath them cannot make the guarantees that a regulator, an insurer, a safety case, or even a discerning consumer actually requires.

The architecture that follows is an integration, not a replacement. Every aspect of the physical AI control loop runs on silicon whose worst-case timing can be computed at design time rather than measured and hoped for. The interface between the two regimes is explicit, bounded, and analysable. Getting the timing right is what separates systems that work in the lab from systems that can be sold, certified, and deployed.

Discover GenSoC: The Future of Hardware Programmability

The XCORE® architecture is already deployed in more than 38 million devices worldwide and has long possessed the core characteristics needed to enable a generative design flow. Building on this foundation, XMOS is defining a new category: the Generative System-on-Chip (GenSoC). GenSoC enables developers to describe system behaviour using natural language, while guaranteeing timing accuracy and real-time functional performance.
