The xcore micro-architecture integrates a 256-bit vector processing unit that delivers a peak of over 100 GOPS of int8 AI inferencing performance, rivalling dedicated NPU accelerators without the heterogeneous communication overhead. Uniquely, the xcore natively supports binarized networks, extending peak performance to over 800 GOPS.
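The peak figures above can be reproduced with a back-of-the-envelope calculation. The sketch below assumes an 800 MHz clock, two tiles, one vector MAC issue per tile per cycle, and a MAC counted as two operations; these assumptions are illustrative and are not taken from the text above.

```python
# Back-of-the-envelope peak throughput for a 256-bit vector unit.
# ASSUMPTIONS (illustrative, not from the datasheet): 800 MHz clock,
# two tiles, one vector MAC per tile per cycle, MAC counted as 2 ops.

VECTOR_BITS = 256
CLOCK_HZ = 800e6   # assumed core clock
TILES = 2
OPS_PER_MAC = 2    # multiply + accumulate

def peak_gops(element_bits: int) -> float:
    """Peak throughput in GOPS for a given element width."""
    lanes = VECTOR_BITS // element_bits   # SIMD lanes per vector
    return lanes * OPS_PER_MAC * TILES * CLOCK_HZ / 1e9

print(peak_gops(8))  # int8:   102.4 GOPS -> "over 100 GOPS"
print(peak_gops(1))  # binary: 819.2 GOPS -> "over 800 GOPS"
```

Under these assumptions the int8 and binarized peaks land at 102.4 and 819.2 GOPS, consistent with the "over 100" and "over 800" GOPS figures.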
With 1 MB of high-bandwidth, low-latency on-chip SRAM, the xcore can support AI models with a tensor arena requirement of up to ~800 kB, with weights stored in an external memory device such as the QSPI boot flash.
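The ~800 kB arena figure follows from budgeting the 1 MB SRAM between the tensor arena and everything else that must live on chip. A minimal sketch, with an assumed ~200 kB allowance for code, stack, and IO buffers (an illustrative figure, not a specification):

```python
# Hypothetical on-chip memory budget check for an xcore AI application.
# The 200 kB code/data allowance is an assumed illustrative figure.

SRAM_BYTES = 1024 * 1024          # 1 MB on-chip SRAM
CODE_AND_DATA = 200 * 1024        # assumed: code, stacks, IO buffers

def fits_on_chip(tensor_arena_bytes: int) -> bool:
    """True if the arena plus the rest of the application fits in SRAM."""
    return tensor_arena_bytes + CODE_AND_DATA <= SRAM_BYTES

print(fits_on_chip(800 * 1024))   # ~800 kB arena -> fits
print(fits_on_chip(900 * 1024))   # too large -> does not fit
```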
The unique, multi-threaded xcore architecture delivers 16 hardware threads across two tiles that can be used independently or in collaboration to implement a diverse range of functions, including AI inferencing, DSP signal conditioning, IO protocols and control algorithms.
A powerful inter-processor communication infrastructure provides high-speed communication between hardware threads within a tile, between tiles on a chip, and between chips. This allows performance and memory to scale, so that high-performance AI inferencing can run concurrently with the other DSP, IO and control functions required by a diverse range of embedded applications.
EASE OF USE
AI models are typically developed in widely used, Python-based environments such as PyTorch and TensorFlow, before translation and optimisation, including pruning and quantization, to TensorFlow Lite for embedded applications. AI Tools supports this design flow by optimising the TensorFlow Lite flatbuffer to exploit the capabilities of the xcore, through high-performance implementations of heavily used AI operators and by minimising the required tensor arena size. The ML engineer can select the number of threads and the proportion of weights to offload to flash memory before generating a C-code embedded software object for integration into the embedded system.
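The quantization step in this flow follows the standard TensorFlow Lite affine int8 scheme, where a real value is represented as `scale * (q - zero_point)` with `q` in [-128, 127]. The pure-Python sketch below illustrates the arithmetic only; in practice the TFLite converter and AI Tools perform this transformation.

```python
# Affine int8 quantization as used by TensorFlow Lite:
#   real_value = scale * (q - zero_point),  q in [-128, 127]
# Pure-Python illustration of the underlying arithmetic.

def quant_params(x_min: float, x_max: float):
    """Scale and zero point mapping [x_min, x_max] onto int8 [-128, 127]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # range must contain 0
    scale = (x_max - x_min) / 255.0
    zero_point = round(-128 - x_min / scale)
    return scale, zero_point

def quantize(x: float, scale: float, zero_point: int) -> int:
    """Quantize one real value to a clamped int8 code."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

scale, zp = quant_params(-1.0, 1.0)  # symmetric example range
print(quantize(0.0, scale, zp))      # zero maps to the zero point
print(quantize(3.0, scale, zp))      # out-of-range values clamp to 127
```

The zero point guarantees that the real value 0.0 is represented exactly, which matters for operations such as zero padding.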
Embedded AI inference on the xcore readily supports image-based applications; with support for time-based models and a range of datatypes, it also supports audio-based applications.
The large, local memory within each tile is key to maximising the performance of AI model inference. To achieve this, the tensor arena is located in local memory with the weights stored in QSPI flash memory; for smaller TinyML models, everything can be stored on chip. The optimisations performed by AI Tools minimise the size of the tensor arena, enabling high-performance execution of larger AI models. Most embedded AI applications quantize the model to 8-bit to minimise the tensor arena. The flexibility of the xcore also supports binarized networks, further reducing memory size and maximising performance, while still allowing parts of the network to use larger datatypes, including floating point, where necessary.
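The memory savings from quantization and binarization can be made concrete with a small worked example. The layer size below is hypothetical, chosen only to show the relative footprints of float32, int8, and binary weights:

```python
# Illustrative weight-storage comparison for one fully connected layer.
# The 256x256 layer size is a hypothetical example, not a real model.

IN_FEATURES, OUT_FEATURES = 256, 256
n_weights = IN_FEATURES * OUT_FEATURES   # number of weight elements

bytes_float32 = n_weights * 4    # 4 bytes per weight
bytes_int8 = n_weights * 1       # 1 byte per weight:  4x smaller
bytes_binary = n_weights // 8    # 1 bit per weight:  32x smaller

print(bytes_float32, bytes_int8, bytes_binary)
```

For this layer, int8 quantization cuts the weight storage from 256 kB to 64 kB, and binarization cuts it to 8 kB, which is why binarized networks let larger models fit in on-chip memory.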