What magic will the Setem team bring to XMOS voice interfaces?

Posted: 26 September 2017

Voice interfaces are still in the early days of an evolution that will radically change how people interact with technology. As product designers add voice interfaces to products and look for opportunities for new categories of no-interface products, they’re learning more about the complexities of voice capture, particularly in the far-field (3-5m from the microphone).

The Problem

One of the main challenges is selectively extracting specific voice sources from competing conversations, reverberation and point noise sources like TVs and sound systems – often referred to as the "cocktail party problem".

While traditional beamformer technologies deliver robust performance in many environments, particularly when coupled with other technologies that allow the microphone to differentiate between human and synthesized voice, they generally default to the loudest sound source and can struggle to identify specific sound sources in noisier environments, like a house full of kids.
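To make the "defaults to the loudest sound source" behaviour concrete, here is a minimal sketch of a two-microphone delay-and-sum beamformer that steers toward whichever direction gives the most output power. It is a textbook illustration only, not XMOS or Setem code: the signals are random noise standing in for a TV and a talker, and the delays, amplitudes and function names are all assumptions made for the example.

```python
import random

def delayed(signal, k):
    """Return the signal delayed by k samples (zero-padded at the start)."""
    return [0.0] * k + signal[:len(signal) - k]

def delay_and_sum(mics, steer):
    """Advance each channel by its steering delay (in samples), then average."""
    n = len(mics[0]) - max(steer)
    return [sum(ch[i + d] for ch, d in zip(mics, steer)) / len(mics)
            for i in range(n)]

def power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)

rng = random.Random(0)
N = 4000
tv = [rng.uniform(-1, 1) for _ in range(N)]      # loud point noise source
talker = [rng.uniform(-1, 1) for _ in range(N)]  # quieter stand-in for the wanted voice

# Assumed geometry: the TV reaches mic 2 three samples after mic 1,
# while the talker reaches both microphones at the same time.
mic1 = [1.0 * a + 0.3 * b for a, b in zip(tv, talker)]
mic2 = [1.0 * a + 0.3 * b for a, b in zip(delayed(tv, 3), talker)]

# Scan candidate steering delays and keep the one with the highest output power.
candidates = [(0, d) for d in range(6)]
powers = {steer: power(delay_and_sum([mic1, mic2], steer)) for steer in candidates}
best = max(powers, key=powers.get)
print(best, round(powers[best], 3), round(powers[(0, 0)], 3))
```

In this toy setup the power-maximising steering delay lines up with the TV rather than the talker, because nothing in the beamformer knows which source is the one the listener cares about.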

Our Solution

In July, XMOS acquired Setem Technologies, pioneers in a new type of Blind Source Separation technology. Kevin Short, founder and CTO of Setem Technologies and now Chief Scientist of XMOS, and the team at the XMOS Boston office are developing a solution to this challenge that uses a revolutionary combination of signal processing and machine learning techniques based on voice biometrics. By dissecting and reconstructing the sound field in the time, frequency and spatial domains in real time, microphones can extract individual voices from background noise and eliminate reverberation, producing much greater clarity than existing source separation technologies.

[Figure: Setem technology graph]

A key feature of the modelling process is its ability to identify all sound sources in a space, pick out the voices among them, and focus on one or more of those sources, which makes the technology flexible and powerful.

A typical scenario is extracting the speech of the driver or any passenger in a car as a clear audio stream optimized for automatic speech recognition (ASR) systems, by separating out and rejecting road noise, engine noise, the audio system and general conversation within the vehicle.

Kevin and the team have been working with xCORE technology for more than 18 months, since before the acquisition, and have concluded that the architecture is the perfect match for the patented algorithms because of its high performance, integrated I/O and DSP capabilities.

The combination will lead to highly integrated, very high-quality voice interface controllers that solve the "cocktail party problem" while making it faster and easier to bring voice-enabled products to market. We call it VocalSorcery.

