Physical AI: First timing, then AI

A hearable suppresses the noise of a passing train while preserving the voice of the person across the table. A delivery robot navigates a crowded sidewalk and stops cleanly half a metre from a pedestrian who stepped into its path. A drone holds station against a gust of wind it sensed three milliseconds ago.

Each of these is a Physical AI achievement. None of them is, fundamentally, a model achievement.

What separates a working product from a compelling demo is not the sophistication of the intelligence on board. It is the discipline with which that intelligence is delivered to the physical world on time, every time, within a fixed energy budget. Timing is the load-bearing constraint of Physical AI – the property that, when it fails, no model accuracy can recover. Put simply, Physical AI is a timing problem first, then an AI problem.

“Physical AI” usefully describes a family of products that share a common engineering reality: intelligence must reach the physical world inside a deadline it cannot negotiate. The category is also genuinely broad. A drone obstacle-avoidance system may tolerate 50 ms of end-to-end latency. A hearable doing real-time voice enhancement targets less than 10 ms. Active noise cancellation in the same hearable operates in microseconds. Latency budgets across the category vary by three orders of magnitude.

This variance matters. The timing argument is not a single number; it is a discipline applied to a specific product’s specific deadline. To make that concrete, the rest of this piece walks through one worked example – a pair of true-wireless earbuds doing on-device voice enhancement – and uses it to derive principles that generalise.

A pair of true-wireless earbuds with ML-based noise suppression for voice pickup has a defined end-to-end latency target: roughly 10 ms from microphone capture to enhanced audio out. This is dictated by perception – any longer and conversational audio begins to sound unnatural to the wearer – and by the Bluetooth LE Audio frame structure beneath the application.

That 10 ms budget is consumed by a chain of stages, each of which must complete deterministically:

| Stage | Worst-case budget |
| --- | --- |
| ADC capture and DMA | ~200 μs |
| Dual-mic beamforming and pre-processing | ~300 μs |
| Model inference (noise suppression) | 3–4 ms |
| Post-processing (residual suppression, AGC) | ~200 μs |
| DAC output via DMA | ~200 μs |
| BLE stack interaction, margin | remainder |

The model has 3–4 ms of worst-case budget. A model that runs in 2 ms on average but 6 ms in the worst case cannot ship — not because it is inaccurate, but because it will miss the deadline often enough to be perceptible. Model selection is constrained by the timing budget before accuracy is evaluated.
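The budget arithmetic can be written down directly. The sketch below sums the fixed worst-case stage figures quoted in the table and reports what is left of the 10 ms target for BLE stack interaction and margin, at both ends of the 3–4 ms inference range. The constant and function names are illustrative only.

```python
# Worst-case stage budgets in microseconds, copied from the table above.
# The inference slot is kept separate because the text gives it as a range.
END_TO_END_BUDGET_US = 10_000

FIXED_STAGES_US = {
    "adc_capture_dma": 200,
    "beamforming_preprocessing": 300,
    "post_processing": 200,
    "dac_output_dma": 200,
}


def remaining_budget_us(inference_us, fixed_stages=FIXED_STAGES_US,
                        total_us=END_TO_END_BUDGET_US):
    """Time left for BLE stack interaction and margin, in microseconds."""
    return total_us - inference_us - sum(fixed_stages.values())


if __name__ == "__main__":
    for inference_us in (3_000, 4_000):
        print(inference_us, remaining_budget_us(inference_us))
```

Under these figures, roughly 5 to 6 ms remains for the radio stack and margin.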

The earbud also has roughly 60 mAh of battery per bud and a target life of five to eight hours. Average power across all on-device work – audio pipeline, BLE radio, sensors, system overhead – must sit on the order of 10 mA. The model must not only fit the timing budget; it must fit inside an energy envelope that leaves room for everything else.
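As a rough check on the energy envelope, the sketch below derives the average current budget from the battery capacity and target life quoted above, then estimates how much of it a duty-cycled inference stage would consume. The 20 mA active current and the 4 ms in every 10 ms duty cycle are hypothetical values chosen for illustration, not measurements.

```python
def average_current_budget_ma(capacity_mah, target_hours):
    """Average current (mA) that empties the battery in exactly target_hours."""
    return capacity_mah / target_hours


def duty_cycled_current_ma(active_ma, active_ms, period_ms):
    """Average current of a stage that draws active_ma for active_ms of
    every period_ms and is assumed to draw nothing otherwise."""
    return active_ma * active_ms / period_ms


if __name__ == "__main__":
    # 60 mAh per bud, five to eight hours of life (figures from the text).
    for hours in (5, 8):
        print(hours, average_current_budget_ma(60, hours))
    # Hypothetical inference stage: 20 mA while active, 4 ms every 10 ms.
    print(duty_cycled_current_ma(20, 4, 10))
```

With these assumed numbers, inference alone would take most of the budget, which is why the energy envelope has to be settled alongside the timing budget rather than after it.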

This is what timing-first product engineering looks like in practice: numbers on a contract, derived from the physical and human realities the product must meet, decided before model architecture is finalised.

The worked example surfaces three properties that generalise across Physical AI products.

Worst-case latency, not average. The 3–4 ms inference budget is a worst-case figure. An accelerator with 1.5 ms average inference but 8 ms tail latency does not meet this budget, however strong its benchmark headline. Hard real-time systems are judged by whether they meet the deadline every time, not on average or most of the time – and Physical AI products are hard real-time systems whether their designers acknowledge it or not.
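A minimal sketch of what this means when evaluating measured latencies: the acceptance test looks at the worst observation, not the mean. The sample data is hypothetical, and a measured maximum is only a lower bound on the true worst case, so passing this check is necessary rather than sufficient.

```python
from statistics import mean


def summarise(latencies_ms, budget_ms):
    """Summarise measured latencies against a worst-case budget."""
    worst = max(latencies_ms)
    return {
        "mean_ms": mean(latencies_ms),
        "worst_ms": worst,
        "misses": sum(1 for x in latencies_ms if x > budget_ms),
        # Pass only if every single observation is within budget.
        "passes": worst <= budget_ms,
    }


if __name__ == "__main__":
    # Hypothetical trace: 99 runs at 1.5 ms and one 8 ms outlier.
    trace = [1.5] * 99 + [8.0]
    print(summarise(trace, budget_ms=4.0))
```

The mean of this trace sits well inside the 4 ms budget, yet the single outlier is enough to fail it.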

Jitter, not just latency. Closed-loop processes — the BLE codec frame, the ANC path, the control loop in a drone — require predictable sample-to-sample timing. A pipeline with 5 ms average latency and 3 ms of jitter is functionally worse than one with 7 ms average and sub-microsecond jitter. The same architectural mechanisms that improve average throughput in general-purpose processors — caches, branch prediction, out-of-order execution — erode worst-case predictability, which is the property that matters here.
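Jitter can be quantified in several ways. One common choice, used in the sketch below, is the peak-to-peak variation in the interval between consecutive outputs. The timestamps are hypothetical and would in practice come from a hardware timer or a logic analyser capture.

```python
def intervals_us(timestamps_us):
    """Gaps between consecutive output timestamps, in microseconds."""
    return [later - earlier
            for earlier, later in zip(timestamps_us, timestamps_us[1:])]


def peak_to_peak_jitter_us(timestamps_us):
    """Difference between the longest and shortest output interval."""
    gaps = intervals_us(timestamps_us)
    return max(gaps) - min(gaps)


if __name__ == "__main__":
    steady = [0, 1_000, 2_000, 3_000]    # perfectly periodic output
    wobbly = [0, 1_000, 2_300, 3_000]    # one late output, next one catches up
    print(peak_to_peak_jitter_us(steady))
    print(peak_to_peak_jitter_us(wobbly))
```

Both traces deliver the same number of outputs over the same 3 ms window, so their average rate is identical; only the interval spread tells them apart.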

Temporal alignment across modalities. The earbud has two microphones; a wearable adds an IMU; a robot adds vision. These streams must align to within sample-level tolerance for fusion to operate on a coherent view of the world. Misaligned inputs do not produce slightly worse outputs; they produce confident, well-formed answers to the wrong question.
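One way to picture sample-level alignment is to map every sensor timestamp onto the audio sample grid and check that paired readings land within one sample period of each other. The sketch below assumes all timestamps share a common timebase and uses a 16 kHz sample rate; both are illustrative assumptions rather than properties of any particular device.

```python
US_PER_SECOND = 1_000_000


def nearest_sample_index(timestamp_us, sample_rate_hz):
    """Index of the audio sample closest to timestamp_us (integer rounding)."""
    return (timestamp_us * sample_rate_hz + US_PER_SECOND // 2) // US_PER_SECOND


def within_one_sample(t_a_us, t_b_us, sample_rate_hz):
    """True if two timestamps differ by at most one sample period."""
    return abs(t_a_us - t_b_us) * sample_rate_hz <= US_PER_SECOND


if __name__ == "__main__":
    rate_hz = 16_000  # assumed sample rate; one period is 62.5 microseconds
    print(nearest_sample_index(1_040, rate_hz))
    print(within_one_sample(1_000, 1_062, rate_hz))
    print(within_one_sample(1_000, 1_063, rate_hz))
```

Integer arithmetic is used throughout so that the comparison does not depend on floating-point rounding at the boundary.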

If the only timing problem were “how long does the model take to run”, the engineering would be relatively well-bounded. The harder reality, and the one most often underestimated, is timing across system boundaries.

In the earbud, the audio pipeline is not the only thing running. The BLE radio stack must service link-layer events at a fixed cadence; if these are handled as preemptive interrupts on the same processor as the inference, they will inject jitter into the audio pipeline. Capacitive touch, battery monitoring, and thermal management each demand their own timing guarantees.

The binding constraint is rarely inference time itself. It is the coordination of inference time with everything else the system is doing concurrently. On general-purpose architectures, this is managed through priority schemes and best-effort scheduling — which means worst-case behaviour depends on what the rest of the system happens to be doing at any given moment. On architectures designed for deterministic execution — dedicated cores per subsystem, hardware-level isolation, predictable I/O — inter-subsystem interference is bounded by design rather than by software discipline.
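The effect of sharing a processor can be illustrated with classic fixed-priority response-time analysis, in which a task's worst-case response is its own execution time plus the time stolen by every higher-priority task released while it runs. This is a standard textbook technique swapped in here for illustration, not a description of any specific product's scheduler, and the task parameters are hypothetical.

```python
def worst_case_response_us(exec_us, higher_priority, deadline_us):
    """Iterate R = C + sum(ceil(R / T_j) * C_j) to a fixed point.

    higher_priority is a list of (period_us, exec_us) pairs for tasks that
    can preempt this one. Returns the worst-case response time, or None if
    it grows past the deadline.
    """
    response = exec_us
    while response <= deadline_us:
        # -(-a // b) is integer ceiling division.
        interference = sum(-(-response // period) * cost
                           for period, cost in higher_priority)
        updated = exec_us + interference
        if updated == response:
            return response
        response = updated
    return None


if __name__ == "__main__":
    inference_us, deadline_us = 3_500, 4_000  # hypothetical inference task
    # Isolated core: nothing can preempt the inference.
    print(worst_case_response_us(inference_us, [], deadline_us))
    # Shared core: a hypothetical radio handler, 150 us every 1250 us.
    print(worst_case_response_us(inference_us, [(1_250, 150)], deadline_us))
```

With these assumed numbers the inference meets its deadline when isolated and misses it when sharing a core with the radio handler, even though its own execution time has not changed.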

This is where Physical AI departs most sharply from cloud AI. The cloud has no equivalent of a radio stack preempting your inference at an inconvenient moment. The challenge is not faster inference; it is isolated inference that runs at a known rate regardless of what else the system is doing.

Headline performance numbers (TOPS, FLOPS, peak throughput) describe what a processor can do in bursts under ideal conditions. They are necessary but insufficient for battery-powered Physical AI, because real products do not operate under ideal conditions. They operate under realistic interference, sustained load, and thermal constraints.

The metric that actually predicts product behaviour is sustained operations per joule under realistic worst-case interference. A processor delivering 4 TOPS/W at peak that collapses to 1.2 TOPS/W under sustained inference with radio active and peripherals firing has effectively delivered around 30% of its advertised efficiency in the regime that matters. A processor delivering 1.5 TOPS/W at peak but 1.4 TOPS/W under the same conditions retains about 93% of its headline figure: its datasheet is closer to the truth, and in the product it is the more efficient part.
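The comparison reduces to a ratio of sustained to peak efficiency. The sketch below applies it to the two example processors from the paragraph above; the labels "A" and "B" and the function name are illustrative.

```python
def retained_fraction(peak_tops_per_w, sustained_tops_per_w):
    """Share of advertised efficiency that survives realistic load."""
    return sustained_tops_per_w / peak_tops_per_w


if __name__ == "__main__":
    # (peak, sustained) efficiency in TOPS/W, figures from the text.
    processors = {"A": (4.0, 1.2), "B": (1.5, 1.4)}
    for name, (peak, sustained) in processors.items():
        print(name, f"{retained_fraction(peak, sustained):.0%}", sustained)
    # Rank by the sustained figure, which is the one the battery sees.
    best = max(processors, key=lambda name: processors[name][1])
    print("more efficient under load:", best)
```

Ranking by the peak figure would pick processor A; ranking by the sustained figure picks B.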

For the earbud, this is the metric that determines battery life. Two products built around comparable models and comparable nominal silicon budgets can deliver materially different real-world battery life – and the variable is almost always how closely sustained efficiency tracks peak efficiency under realistic load. Battery life, in this class of product, is a timing artifact: every microsecond of inefficient or interrupted compute is paid for in milliampere-hours.

This metric is not yet routinely reported. It should be. Product teams making silicon decisions for battery-powered Physical AI need it more than they need another TOPS or GOPS comparison.

Timing budget, execution architecture, and model selection are not strictly serial decisions. In practice they are jointly constrained from day one — no team picks a timing budget in complete ignorance of what models exist, and no team picks a model in complete ignorance of what hardware will run it. The honest way to describe the discipline is as a prioritisation heuristic, not a workflow.

The heuristic is: when timing and model considerations conflict, timing wins. The model must fit the budget; the architecture must guarantee the budget; the budget itself is derived from the physical and human realities the product must meet, and those realities do not negotiate.

None of this is new to embedded engineers; they have always known it. The value of stating it plainly is that it gives them clear language to defend it against the parts of the organisation that determine what gets built: product roadmaps anchored on model leaderboards, and ML teams measured in accuracy points rather than latency characteristics.

The question that most determines whether a Physical AI product succeeds is asked before any model is finalised: what is the timing budget — worst-case latency, jitter, energy per inference, inter-subsystem interference — and what execution architecture can guarantee it? The model selection becomes tractable, and useful, once that foundation is in place.

Physical AI is one of the more consequential engineering frontiers of the decade. The products that define it will be the ones built on a clear understanding that intelligence in the physical world is, first and foremost, intelligence delivered on time, every time, within a budget that does not move. Timing is the discipline that makes Physical AI possible. The AI follows.
