Blog

The power of voice: A new era for TV enthusiasts everywhere

Today we launched the next generation of voice technology that’s designed to drive the voice-enabled TV market forward and transform how we search and discover content.

Futuresource Consulting forecasts that nearly 700 million new smart speaker, smart TV, set-top box and smart home devices will ship in 2019, and voice assistants will be built into an increasing number of them. Built-in voice interfaces have attracted a lot of attention from manufacturers, but cost and integration complexity concerns pushed the early voice implementations towards ‘Alexa compatible’ and ‘Push to Talk’ solutions. However, people are switching on to the power of voice and the trend is set to rise over the coming years – which is bringing manufacturers back to the ‘built-in’ solution, because it delivers a more relaxed, intuitive experience.

XMOS’ new technology presents a real opportunity for manufacturers to bring a compelling voice control experience to the masses, effectively and economically.

When the TV in your living room offers far-field voice capture that works with close-range precision (i.e. you can just tell it what you’d like to watch from anywhere in the room), the remote control and push-to-talk voice experience starts to feel outdated. Voice control unshackles us from hierarchical menus and never-ending pages of content, freeing us from push-button and touch-based interfaces. Whilst we may still need the remote control to switch between different hardware, the ability to just say “Alexa, play Friends series two, episode three” or even “Alexa, find the pivot episode in Friends” provides a much richer search experience. And it’s easy to see how voice control is perfect for TVs and set-top boxes – and how far-field makes that a more natural, conversational interaction.

The XMOS solution

Our new XVF3510 two-mic voice processor is designed with our modern living spaces in mind. Our new VocalFusion algorithms work intelligently to analyse the acoustic environment, detecting and isolating a voice command from every other sound in the room (including any media streaming through the device itself or other devices nearby), making it ideal for smart home devices and integration into smart TVs and set-top boxes. And with a price tag of just $0.99, it hurdles the cost barrier nicely.

Simon Bryant, Research Director at Futuresource Consulting said: “The strong adoption in multimedia and entertainment is expected to continue and this new device from XMOS addresses what manufacturers who want to add far-field voice control to their product lines will be looking for.”

The XVF3510 offers a clear incentive for manufacturers to move beyond the world of push-to-talk, ‘Alexa compatible’ and touch-based control interfaces. With that, we’ll start to see the real power of voice emerge.

For further technical details, the full product brief can be found here.

The XVF3510 will enter general distribution in August 2019. Developers can request a dev kit from XMOS here.

Embedding voice – DSP chip or DSP algorithms on the Applications Processor?

Why smart developers choose a DSP chip rather than run DSP algorithms on the Applications Processor …

In our increasingly connected, intelligent world, voice-control opens the door for a more natural, engaging conversation with technology. Reliable, accurate voice capture relies on advanced digital signal processing (DSP) algorithms and good acoustic design to ‘hear’ the wake-word and pick up the voice command – even in a noisy environment. Some of the key algorithms include:

  • Acoustic-echo cancellation: When you give a voice command to a TV, the microphones will capture both your command and the audio track coming from the TV speakers. That captured audio track – the acoustic echo – needs to be cancelled from the captured signal so the device ‘hears’ the wake-word first time, every time, and captures a clean voice command to send to the speech recognition service (e.g. Alexa). This is what makes ‘barge-in’ possible: interrupting the device with a voice command while it is playing audio.
  • Beamforming: This detects and tracks where the voice is coming from, so the command is captured accurately, even if you’re walking across the living room.
  • Interference canceller: This ‘scans’ the soundscape of the room and ignores (cancels out) the point noise sources, i.e. anything in the surrounding space that’s not the voice of interest. The improved voice signal can then be sent to the speech recognition service.
  • Noise suppression: Noise suppression algorithms target diffuse noise sources such as air conditioning and road noise. They remove the stationary and non-stationary background sounds to enable accurate, reliable voice detection.
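
To make the beamforming idea above more concrete, here is a minimal Python sketch of a fixed delay-and-sum beamformer for two microphones. It is a textbook simplification and not the XMOS implementation: the function name, the integer steering delay and the test signals are all assumptions made for illustration, and a real far-field product would estimate and track the delay continuously rather than being given it.

```python
import random

def delay_and_sum(mic_a, mic_b, steer_delay):
    """Fixed two-microphone delay-and-sum beamformer (illustrative only).

    steer_delay is the number of samples by which the talker's voice
    reaches mic_b later than mic_a (assumed known and non-negative).
    """
    usable = len(mic_a) - steer_delay
    # Advance mic_b so the voice lines up in both channels, then average.
    return [0.5 * (mic_a[n] + mic_b[n + steer_delay]) for n in range(usable)]

# Hypothetical scene: the voice reaches mic_b two samples after mic_a,
# while an interfering tone reaches both microphones at the same time.
rng = random.Random(1)
voice = [rng.uniform(-1.0, 1.0) for _ in range(400)]
tone = [(0.0, 1.0, 0.0, -1.0)[n % 4] for n in range(400)]

mic_a = [voice[n] + tone[n] for n in range(400)]
mic_b = [(voice[n - 2] if n >= 2 else 0.0) + tone[n] for n in range(400)]

beam = delay_and_sum(mic_a, mic_b, steer_delay=2)
# Largest difference between the beamformer output and the clean voice.
worst_error = max(abs(b - v) for b, v in zip(beam, voice))
```

In this contrived scene the tone cancels completely only because its period was chosen so that the two copies end up in anti-phase after steering. In general, sounds arriving from other directions are attenuated rather than removed.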

As voice starts to move beyond smart speakers and into the living room, developers are having to figure out how best to build a voice interface into a smart TV or set-top box. One of the most common questions we hear is whether to embed the DSP on a separate voice processor (chip) or run DSP algorithms on the Applications Processor.

Should you run DSP algorithms on the Applications Processor?

Most consumer electronics devices are built around an Applications Processor. Put simply, the more powerful the processor, the quicker your programmes, apps, games and features will appear. As a developer, you may choose to simply execute the DSP algorithms on the Applications Processor (host processor). At first glance, this seems cost effective and easy to integrate – primarily because there’s no additional chip to purchase and integrate. However, there are some significant downsides to this approach that developers need to consider.

  • Adverse impact on capacity: Because the host processor handles the core system processes, it’s one of the most expensive elements of the electrical design. The more powerful the host processor, the more tasks it can handle – but in turn, it’ll cost more, consume more power and require more space. As a developer, you’ll want the cheapest processor that’s capable of running all the core functions, with minimal power. Therefore, adding DSP algorithms to it imposes additional processing that burdens the chip and takes up capacity that could otherwise be used for core functions.
  • Bill of Materials (BoM): This will be pushed up beyond original estimates as additional components will be required to support the integration (e.g. a microphone aggregator).
  • Performance risk: The DSP algorithms will be constrained by the capacity that’s available on the host processor and performance may be compromised.
  • Integration complexity: Adding algorithms onto the host processor puts all of the integration demands onto the software team and can rapidly increase the cost of development. It can also create challenges in delivering within the real-time constraints needed to produce a glitch-free audio stream, without increasing the latency of the system. Further challenges may arise in the future around in-field updates and whether there’s sufficient capacity to run the update on the host processor.

How does that compare with running DSP algorithms on a separate chip?

A standalone DSP chip solution offers some compelling advantages over licensing DSP algorithms and integrating them into the host processor.

  • Transfers work away from host processor: Running the DSP on a separate chip keeps the host processor free for core functions – and avoids impacting the software team.
  • Easy to integrate: A ringfenced solution needs to be planned into the electrical design, but using an external DSP allows you to use standard hardware interfaces (such as I2S or USB for connectivity), which simplifies the integration task significantly. A separate chip ensures there are no dependencies between the code on the DSP chip and that on the host processor; there’s simply an API to deliver processed voice samples in an uninterrupted stream.
  • Future-proof solution: You benefit from the latest developments in voice technology; plus, in-field software releases are delivered easily via firmware update.
  • Accelerated time to market: A DSP chip offers a plug and play solution which separates the voice-capture solution from the rest of the TV electronic design, enabling developers to deliver a built-in voice interface rapidly.

Choosing the right far-field voice interface for your TV or set-top box is a key decision for your company. A separate voice processor such as XMOS’ VocalFusion often provides a more flexible and cost-effective solution over the complete lifecycle of a TV or set-top box. It reduces project risk, minimises dependencies between software functions and avoids burdening the host processor.

XMOS solutions are cost-effective and offer the flexibility to remove additional costs from your system design. Find out more about our voice solutions here. Or get in touch with one of our sales team here.

We’re here to help you transform the way people find and enjoy content through your products.

In his own words: Esher, our intern talks about his time at XMOS

XMOS continues to enlighten and surprise me.

I first worked at XMOS during the summer of 2017, after finishing my second-year university exams.  On finishing university, it was a pleasure to be able to return for a second work placement. XMOS is filled with such intelligent, dedicated and friendly people, and working here, I’ve seen how effectively a team can drive to reach shared goals and grow the business to where it deserves to be. Throughout this blog, I’d like to highlight my thoughts on the work I’ve been involved with, the culture that XMOS promotes and the vision I see for XMOS as a company moving forward. It has been an absolute honour to work alongside such talented people and I’ll watch XMOS with interest in the years to come.

During this work placement, I gained a fresh perspective on the company’s vision. I’d worked with the Finance team previously, so sitting with Marketing meant new faces, new ideas and new direction. I was immediately welcomed with my own desk, my own space and a sense of belonging – which really helped me relax into the role. My time was split between the Marketing and Finance teams, so I was never short of work! It was rewarding to feel I was adding real value to the business, helping in areas where help was clearly needed. I admit that some tasks were more exciting than others, but regardless of what I was doing, I certainly honed and further developed some very important skills: the use of Excel, my attention to detail, my creativity, my communication and my ability to break down the task at hand logically and systematically. The work over the last 3 months has been full of variety. I’ve thoroughly enjoyed working across different departments, gaining a great insight into XMOS and this fast-moving technology industry, but also working alongside and befriending some fascinating people.

I’d describe XMOS as open, natural and hardworking. The sense of community is really felt here and I believe it all stems from the employees. Everyone welcomed me as an equal, which made it easy for me to learn and grow as an individual. I felt a great willingness to help from everyone, regardless of their position in the company. The lack of hierarchy at XMOS is something I truly admire. The company’s flexibility and understanding of people’s needs are also outstanding. It’s also a company that’s serious about having fun. The lead-up to Christmas was like no other, full of festivities and good vibes. The people I’ve been surrounded by for the last 3 months have brought me nothing but smiles and joy.

Make no mistake though, hard work and ambition are instilled in the backbone of XMOS employees. Their dedication to get the job done on time, to out-perform the competition and still look to what they can do to improve – that’s a real recipe for success and will help them scale new heights. An Indian sage once said that in a finite world, we are all looking for infinite satiation. In today’s never-decelerating world of innovation, where competition is the status quo, it is crucially important to dedicate time to understanding people and their behaviour. I believe that with some finer tuning and a deeper understanding of the needs and wants of XMOS employees, they’re on their way to the perfect culture.

XMOS has a bright future. Innovative ideas, great people, hard work, ambition… it’s all in the DNA of XMOS. I speak from a young generation when I say that technology is hot and exciting. The possibilities are endless. But let’s not forget the fierce competition in this field: it’s important for XMOS to compete fiercely and to expand sustainably and intelligently in this growing market. I believe innovation and adaptability are key. Providing high-quality innovative products will put XMOS on the radar of new clients. Being adaptable will help XMOS build its brand as an intelligent company that’s aware of the current issues at hand. How can one promote innovation? For me, the core lies in being able to freely express ideas and solutions. Emphasis on the word ‘freely’. Consider XMOS Ideas Sessions where, once a month, small, diverse groups have complete freedom to share ideas on potential XMOS improvements (be they internal or external); this could help shape the working environment, evolve the strategy or even identify a new product opportunity. Innovation stems from having many different perspectives and these Ideas Sessions may help surface even more pioneering ideas from the team.

When I think of what is to come for XMOS, I am filled with excitement and intrigue. I am grateful to have played a small part in XMOS’ journey and am excited for what the future holds. For me though, now comes a time of relaxation. A time of enjoyment, peace and exploration. I am travelling to South America, with my girlfriend, to explore and understand new cultures, new situations and new people. It is going to be a beautiful journey which I have complete trust in. During my time away, I will also be ‘chasing the wind’ – following my passion of kitesurfing. I am excited to finally dedicate myself to this sport; it hasn’t been feasible to do so until now. I’m going to see many wonders during my time in Argentina, Chile, Bolivia, Peru, Ecuador, Costa Rica and Cuba, but will never forget my time here at XMOS. I’ll be returning to England in summer 2019 and am looking forward to starting the next stage of my career at KPMG as a Technology Consultant. For now, I wish nothing but success and happiness for team XMOS. We’ll meet again!

by Esher Pegrum

XMOS is hiring!

If you’re interested in finding out more about what it’s like working for a leading company in the voice capture space, with partners such as Amazon Alexa and Infineon, you can browse our jobs or send us your CV with a note about why you’d like to work here to work@xmos.com. We’d love to hear from you.

Girl Geeks? For sure. And we’re loud and proud about that.

At our Girl Geek Dinner on 24th October, we explored the marvel of the human voice.  In techie terms, each one of us is a piece of analogue kit and we all have a unique pitch, tone and frequency. Male and female voices have a fundamentally different pitch. Unfortunately, most people in our industry today operate at a low vocal pitch or to put it another way – we have a gender diversity problem.

We teamed up with the Bristol Girl Geek dinners network to host a Girl Geek event at XMOS HQ in Bristol. Over 45 people joined us to learn about the science of voice, see a little deeper into the magic of voice capture technology and find out how a virtual assistant can ‘hear’ your voice across a crowded room and execute your command.

It was a strong showcase of the talent in the South West, with female-identifying techies (and a few men) from across deep tech, finance, marketing, consulting and start-ups. The room was full of courage, easy camaraderie and razor-sharp minds, with representatives from the worlds of cyber-security, robotics, AI and imaging, together with others who are starting a business, freelancing, starting a new life, studying or have gone back to studying. With so much energy and creativity across this space, it’s staggering that there’s still so little diversity.

Separating voice from noise

At the Girl Geek evening, we discussed how the success of voice-enabled technology depends on its ability to identify when a human is talking and then isolate that voice signal from other noise such as room echo, other people talking, music and background or outside noise. This is achieved with a series of algorithms, housed on our silicon, which capture speech from across the room, clean it up and send the digital command to a speech recognition service – such as Amazon Alexa.

Alex Craciun, our Algorithm Engineer, took us through the science of speech and explained that, if we know how speech is produced, we can extract features which model speech-like structures and from there create a speech detection algorithm to identify a speech source or a non-speech source accurately. Gwen Edwards, Director of Product Marketing, was then able to demonstrate the different techniques we use to extract human speech and clean up the digital signal – with the help of a talking lamp and an XMOS development kit that showed how far-field microphones and algorithms capture voice and strip out a cacophony of music to have a clear ‘Alexa ready’ voice command.

From understanding how this works, we then had a glimpse into the future at how other sensors (such as radar and imaging) will augment voice to make our interactions with technology more human. Once a voice assistant is able to sense our identity, mood, routine and personal preferences properly, it can start to evolve into something more meaningful – more like a trusted ‘digital twin’ or augmented extension of ourselves, than a tool to take orders.

Why are events such as Girl Geeks important?

Serrie-Justine Chapman, Founder of Women’s Tech Jobs says: “Bristol Girl Geek Dinners is a great network of over 850 members of brilliant, intelligent women. We welcome all women (or identifying as such) to the group whether already working in tech or simply tech-curious and wanting to see if there’s opportunity that might suit. I spent the majority of my career in that side of the industry, it’s dynamic, cutting edge and all about problem solving – great fun! I’m excited to watch the industry move forward in such leaps and bounds and to see a Bristol-based engineering company like XMOS being at the forefront of it all. Even more importantly, they’ve recognised the importance of encouraging women into the industry and are involving themselves in the change that’s needed. Our thanks to XMOS for a fabulous Girl Geek Dinner – everyone in the group is still buzzing from a great and welcoming evening!”

So, if we want to move the needle on gender diversity in engineering, where do we start? Well, we know that having designated spaces and events for shared learning and networking is proving important for female-identifying individuals in male-dominated industries. Promoting effective forms of skills sharing, particularly through networking events, lets women absorb experience from others and share experiences with others who are treading a similar path. We’re often at our most aspirational when bouncing ideas and questions off like-minded people. This can be a daunting task for many when faced with a diversity chasm in the room – as is all too often the case with technology events.

If we take a closer look at the issue, simple maths tells us that there are more men in the engineering industry today – so of course it’s going to feel imbalanced. Figures from the Office for National Statistics from Q4 2017 and other reports suggest the UK has the lowest percentage of female engineering professionals in Europe, at just 11 per cent.

To correct that, we have to look at industry representation. Last year, the Joint Council for Qualifications revealed there is now very little gender difference in take-up of and achievement in core STEM GCSE subjects. Which sounds like very good news. Unfortunately, the IET also confirmed last year that a mere 15.1 per cent of engineering undergraduates in the UK in 2017 were women. So, how can this drop-off be explained? Why aren’t women pursuing engineering in Higher Education? Why is the industry failing to attract women?

Businesses must step up to address the challenge

In short, this industry needs to get smart about making itself an attractive workplace for women – with clear career prospects and policies that reflect that. And this needs to happen fast. Everyone across the industry has a role to play in making sure we don’t lose some of the brightest talent available. The UK has a real opportunity to take the lead on breaking down the barriers to entry for women in voice-enabled technology and highlight the endless possibilities for career progression that this and the wider engineering industry have to offer.

And as for us girl geeks? Let’s call time on holding back from opportunities. Seize the day and get your confidence on – the world’s changing. Technology is moving at incredible speed. And it’s time we show just how much we can achieve.

Talkin’ bout the Alexa Generation – by Mark Lippett

“Alexa. Who or what is the Alexa Generation?”

When staying with friends last weekend, I asked their 6-year-old if she had a favourite song. ‘Alexa. Play Taylor Swift Shake it Off’, she immediately commanded their Echo device. As someone who lives and breathes voice interfaces, there was something about this casual, matter-of-fact interaction that really struck me – and it got me thinking.

The Alexa Generation is here – children are now growing up with voice-enabled technologies around them as the everyday familiar. By the time these children are teenagers, they’ll expect devices to be voice-enabled and their lifestyles will be inflected by digital assistants in a way unlike any generation before them.

Our Alexa Children will interact fluently, efficiently and naturally with their voice-enabled devices. They will search using voice, not text. Already, one in four 16 to 24 year olds use voice search on mobile … that’s a quarter of that age group already searching by voice. Text-based commands will decline as our Alexa Children form increasingly human relationships with technology. They’ll take their virtual assistants with them wherever they go. They’ll navigate websites by voice. And a remote control will look as strange to them as a cassette tape does to millennials.

Alexa children show a different pattern of usage

According to research from Childwise, 42 per cent of children aged between nine and 16 use voice recognition gadgets at home. And unlike many adults, children are using it for more than music streaming. They’re turning to these virtual assistants for help with homework, to look up facts and check spelling.

Some worry that the growth of voice technology could have an adverse effect on how children learn to communicate. However, for every skeptic there is an evangelist; Solace Shen, a researcher at Cornell who studied interactions between children and technology, says she sees opportunities for educational and entertainment content on voice-enabled devices that doesn’t “suck kids in” in the same way that a smartphone does.

Indeed, universal voice commands, such as those used to instruct the Echo, come as naturally to the Alexa Generation as Windows and Mac operating systems came to their generational predecessors. And perhaps voice will bring us a behavioural shift – or at least a better balance between screen-time and time spent engaged in active listening.

We’re moving towards a more natural conversation with technology

Over 39 million Americans now own an Echo, and yet many of those are using a fraction of Alexa’s capabilities. Why? Well, it turns out that adults feel very self-conscious addressing technology directly. According to KDDI’s 2017 voice search study of the Japanese population, 70 per cent of users said it’s “embarrassing to voice search in front of others.” And when asked how they used voice-enabled devices at home, 40 per cent of respondents said, “I only do so with nobody in the house.”

The Alexa Generation vaults clear of this barrier. There is a fluidity with which the Alexa Children interact with voice-enabled technologies that’s free of inhibition. This, coupled with advancements in human-machine interface technology, means that over time, the friction in human-machine interactions will erode away, until a conversation with the Internet of Things feels as natural as a conversation with a parent, friend or sibling.

Digital assistants will simply blend into the fabric of our lives

For the Alexa Generation, voice control seems as natural as the English rain. The culture and lifestyle of those growing up with voice-enabled technologies will be worlds apart from that of those born in the ‘80s and ‘90s. It’ll be defined by a cooperative co-existence with AI and technology, whereby voice-activated digital assistants become an integral, trusted part of everyday life.

Today, a demonstration of Alexa’s talents brings about surprise and fuels dinner-table discussion, but tomorrow, digital assistants will simply blend into the fabric of the lives of the Alexa Children. That the novelty will wear off is something to be celebrated, because it’ll mark the point when voice-assisted devices start to fulfill their full potential as tools (not toys) and herald the passing of the torch from the Millennials to the Alexa Generation.

Giving voice to the elderly

Voice-enabled technologies will transform the health and happiness of the elderly.

The UN predicts a 56% rise in the number of people aged 60 years or over, taking us from over 900 million in 2015 to nearly 1.5 billion in 2030. The world’s population is changing. Our demographic is aging. And this could well be the defining issue of our time. An aging population creates a burden on health systems and individual households. Family members, clinicians, and assisted care providers will need a new generation of technology platforms to help them stay informed, coordinated, and most importantly, connected.

The social care system is facing a mountain of challenges and it can’t cope with a sustained upswing in the number of senior people and adults living with chronic illnesses.

Whether you live at home or in an assisted facility, help may come from an unexpected source – technology. Speech recognition and voice-enabled devices make technology accessible to all. There’s no need to tap a keyboard or figure out how to work the remote control; you simply talk to the device from across the room. A voice-controlled device can empower a formerly ‘dis-empowered’ user. It can ease pressure on caregivers, becoming a companion and digital assistant. Of course it’s not a replacement for human interaction, but rather a meaningful addition.

How can voice-enabled assistants make a difference?

Voice-enabled assistants such as Amazon’s Alexa, Google’s Assistant, Apple’s Siri, and Microsoft’s Cortana are at the forefront of society’s screenless future. Thanks to rapid advancements in voice technology and natural language processing (NLP), these virtual assistants are far better equipped to understand human speech and respond accurately (and in real-time). These virtual-assistants can perform all sorts of tasks – including playing music on demand, calling friends or prompting you to take your medication at a certain time.

This can make a big difference to those living with a chronic illness, anyone who has limited mobility, and the elderly. It can make tasks easier and create a sense of companionship. More importantly, it helps people regain control, from simple actions such as lighting the room and adjusting the temperature, to things that are critical to our wellbeing, such as controlling access to our home and calling for emergency assistance when needed.

How does this work in the real world?

The team at Pillohealth have come up with a ground-breaking, in-home 24/7 companion robot – Pillo – which combines voice control, facial recognition and artificial intelligence to provide personalised digital healthcare. The technology can also provide 24-hour care and companionship, entertainment on the go, reminders of when the patient needs to eat, sleep or move and the ability for the person to live independently.

The device acts as a secure pill dispenser, offers video check-ins with caregivers, can quickly and reliably identify valuable healthcare insights and can send data to healthcare professionals. But it offers much more than that to the user. They don’t need to learn how to use Pillo; they can just talk to it – and the companion robot is on hand to tune into their favourite radio station, answer a question, manage their calendar and give them handy reminders.

It all adds together to help the user enjoy a more independent life and could help to ease pressure on a stretched healthcare system. (see pilloheath.com)

Companionship is equally important

Happiness comes when the ‘assistant’ becomes something more akin to a ‘companion’. A study by US company Brookdale Senior Living explores technologies that can help the older generation stay independent for longer.

A team set out to determine whether reciting Shakespeare with a robot could increase engagement and lessen symptoms of depression. A two-foot tall robot called Nao recited the first 12 lines of Shakespeare’s Sonnet 18, “Shall I compare thee to a summer’s day?” and then prompted seniors to recite the last two lines.

Those who interacted with Nao experienced significant decreases in depression and significant increases in engagement over time. This shows the tangible capability of voice-enabled assistants to be more than just a virtual caregiver: at times, they can be a companion who is available 24/7.

Voice can provide both practical and emotional benefits

Over time, advancements in artificial intelligence will improve voice-enabled assistants. Learning more about the user with each interaction, they will move from reactive responses to a more relevant, engaging conversation. They will become an integral part of the consumer ecosystem, seamlessly integrating across all devices and platforms to become a natural, digital companion.

Crucially, if technology is controlled by voice, it becomes accessible to everyone. There’s no need to learn how to use it, you just talk to it. A number of studies have shown that talking makes us feel happier, so it’s easy to see how voice enabled technology could transform life for the elderly in ways that are both practical and emotional. And given our aging demographic, this feels like a very good thing.

Want to develop a voice enabled device that can hear across the room?

You’ll need the right acoustic echo cancellation (AEC) solution.

If you’re designing a voice-enabled product for the smart home that includes a loudspeaker, you’ll need to remove the acoustic echo it generates so you can interrupt the audio stream – barge-in – and give a voice command, such as adjusting the volume, while the device is playing.

Mono or stereo?

For products such as security solutions or kitchen appliances, and many smart speakers, mono-AEC is usually the right tool for the job. But if you’re designing products that output true stereo audio, for example TVs, soundbars and media streamers, then you’ll need stereo-AEC to secure the best performance available. Here’s why …

Acoustic echo cancellation explained

Acoustic echo cancellation is a digital signal processing technique for removing echo that originates from a loudspeaker. Within a device, there’s a direct path between the loudspeaker and microphones. There’s also an indirect path between the two, because the audio signal reflects off the walls and other surfaces before it reaches the microphone. Put simply, you’ll get a reflection off the ceiling, floor, each wall and every solid object in the room. These reflections are known as indirect acoustic echo and they’re picked up at different times by the microphone, depending on the length of path from the loudspeaker to the microphone.
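As a rough illustration of the direct and indirect paths described above, the sketch below models the microphone signal as a sum of delayed, attenuated copies of the loudspeaker signal. The sample rate, path lengths and gains are made-up values for a hypothetical room, not measurements from any product.

```python
SPEED_OF_SOUND_M_S = 343.0   # approximate speed of sound in air
SAMPLE_RATE_HZ = 16000       # assumed capture rate for this sketch


def delay_in_samples(path_length_m, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples sound needs to travel path_length_m metres."""
    return round(path_length_m / SPEED_OF_SOUND_M_S * sample_rate_hz)


def echo_at_microphone(loudspeaker, paths):
    """Sum delayed, attenuated copies of the loudspeaker signal.

    paths is a list of (path_length_m, gain) pairs: the direct path
    plus one entry per reflection.
    """
    longest = max(delay_in_samples(length) for length, _ in paths)
    mic = [0.0] * (len(loudspeaker) + longest)
    for length, gain in paths:
        d = delay_in_samples(length)
        for n, sample in enumerate(loudspeaker):
            mic[n + d] += gain * sample
    return mic


# Hypothetical room: a short direct path and two longer reflected paths.
paths = [(0.1, 1.0), (3.43, 0.4), (6.86, 0.2)]
click = [1.0]  # a single-sample click played by the loudspeaker
mic = echo_at_microphone(click, paths)

# Each path shows up at a different sample index, weaker as it gets longer.
arrivals = [(n, v) for n, v in enumerate(mic) if v != 0.0]
print(arrivals)
```

The longer the path, the later and quieter the copy, which is why the echo appears as a tail after the original sound.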

If we look at the soundwave generated by a noise from the loudspeaker, the original sound can usually be identified at the beginning, and the soundwave then tails off as the energy of the reflections falls.

To support barge-in and capture a clear voice stream to send to your automatic speech recognition service (ASR), you need to remove as much echo from the captured microphone signal as possible.

It’s not possible to remove 100% of the echo because the time needed to capture the signal and separate out all of the echo would lead to a delayed response, and the user experience demands that this all happens in real time. So in practice, you’re looking to target an “acceptable” level of echo cancellation that allows the ASR to respond accurately.

Types of acoustic echo cancellers

Echo cancellers are categorised by the number of loudspeaker reference channels supported. Common configurations are either: mono – 1-channel, or true stereo – 2-channel. Another configuration – pseudo-stereo – behaves in a very similar way to mono, but has some significant performance issues when challenged with true stereo audio output.

Mono-AEC

Mono-AEC uses a single reference signal based on the audio input and applies it to the output, which can be one or more loudspeakers.

The Digital Signal Processor uses the reference signal to calculate indirect echo based on the time it takes the reflections to reach the microphone.
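To make the idea concrete, here is a minimal sketch of a single-reference echo canceller built on a normalised least-mean-squares (NLMS) adaptive filter. NLMS is a common textbook technique and is used here purely as an assumption; the article does not say which algorithm any particular product uses. The echo path, filter length and step size are all illustrative values.

```python
import random


def nlms_echo_canceller(reference, microphone, taps=8, step=0.5, eps=1e-8):
    """Return the microphone signal with the estimated echo subtracted."""
    weights = [0.0] * taps   # running estimate of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    cleaned = []
    for ref_sample, mic_sample in zip(reference, microphone):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate   # what remains after cancellation
        power = sum(x * x for x in history) + eps
        weights = [w + step * error * x / power
                   for w, x in zip(weights, history)]
        cleaned.append(error)
    return cleaned


def mean_power(signal):
    return sum(v * v for v in signal) / len(signal)


random.seed(0)
reference = [random.uniform(-1.0, 1.0) for _ in range(4000)]

# Hypothetical echo path: direct sound plus two weaker, later reflections.
echo_path = [0.0, 0.8, 0.0, 0.3, 0.0, 0.0, 0.1, 0.0]
microphone = [
    sum(g * reference[n - k] for k, g in enumerate(echo_path) if n - k >= 0)
    for n in range(len(reference))
]

cleaned = nlms_echo_canceller(reference, microphone)
print("echo power before:", mean_power(microphone[-500:]))
print("echo power after: ", mean_power(cleaned[-500:]))
```

In this idealised setup the microphone hears only echo, so the filter can learn the echo path almost exactly. In a real room there is also speech and background noise, and the echo tail is far longer than eight samples.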

Where signal processing has been used to give the impression of a stereo system from a mono signal (e.g. by adjusting the signal pan and volume and outputting to two or more speakers), the calculation remains based on the reference signal and the position of the loudspeakers relative to the microphone.

True Stereo-AEC

True stereo-AEC uses two separate reference signals based on the two-channel input.

Each reference signal is used to cancel the echo from its corresponding loudspeaker output.

True stereo-AEC requires almost twice the computational resources of a mono solution, and it requires very low latency within the system to keep all the echo cancellation synchronized within the required thresholds.
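The sketch below shows one way two reference signals could be used together, again assuming an NLMS adaptive filter as an illustrative technique rather than a description of any product. Each channel gets its own filter and both are updated from the same residual. The left and right signals are independent random noise, which is a simplification: real stereo content is often partly correlated between channels, which makes the problem harder.

```python
import random


def stereo_nlms(ref_left, ref_right, microphone, taps=8, step=0.5, eps=1e-8):
    """Cancel echo from two loudspeakers using one filter per reference."""
    w_left, w_right = [0.0] * taps, [0.0] * taps
    h_left, h_right = [0.0] * taps, [0.0] * taps
    cleaned = []
    for left, right, mic in zip(ref_left, ref_right, microphone):
        h_left = [left] + h_left[:-1]
        h_right = [right] + h_right[:-1]
        estimate = (sum(w * x for w, x in zip(w_left, h_left))
                    + sum(w * x for w, x in zip(w_right, h_right)))
        error = mic - estimate
        power = (sum(x * x for x in h_left)
                 + sum(x * x for x in h_right) + eps)
        gain = step * error / power
        w_left = [w + gain * x for w, x in zip(w_left, h_left)]
        w_right = [w + gain * x for w, x in zip(w_right, h_right)]
        cleaned.append(error)
    return cleaned


def echo(signal, path):
    """Apply a hypothetical echo path (list of tap gains) to a signal."""
    return [sum(g * signal[n - k] for k, g in enumerate(path) if n - k >= 0)
            for n in range(len(signal))]


def mean_power(signal):
    return sum(v * v for v in signal) / len(signal)


random.seed(1)
left = [random.uniform(-1.0, 1.0) for _ in range(4000)]
right = [random.uniform(-1.0, 1.0) for _ in range(4000)]

# Made-up echo paths: each loudspeaker reaches the microphone differently.
path_left = [0.0, 0.8, 0.0, 0.3]
path_right = [0.0, 0.0, 0.6, 0.0, 0.2]
microphone = [a + b for a, b in
              zip(echo(left, path_left), echo(right, path_right))]

cleaned = stereo_nlms(left, right, microphone)
print("echo power before:", mean_power(microphone[-500:]))
print("echo power after: ", mean_power(cleaned[-500:]))
```

Every sample now needs two filter evaluations and two updates instead of one, which is consistent with the roughly doubled computational cost mentioned above.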

Pseudo-Stereo-AEC

A pseudo-stereo solution is similar to a mono-AEC configuration; it outputs the two audio streams to separate speakers but uses a single reference signal that is a mix of the two inputs.

The mixed reference signal is then applied to each loudspeaker output.

Problems arise when the mixed signal differs significantly from the two output channels, for example with a loud track on one loudspeaker and a quiet one on the other, so that the mixed reference signal is not representative of either input signal.

In the example above the amplitude of the reference signal is significantly larger than the output for Input A. This causes the signal to be drowned out, leading to a very low signal-to-noise ratio for the voice capture process. With Input B, there is not enough AEC when the input is loud, which will cause increased artefacts in the captured voice stream and a higher likelihood of inaccurate word recognition.
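The mismatch can be seen with a few lines of arithmetic. The sketch below assumes the mix is a simple average of the two channels (the exact mixing rule is an assumption) and uses made-up amplitudes for a quiet Input A and a loud Input B. It only compares signal levels; it does not run an echo canceller.

```python
import math


def peak(signal):
    """Largest absolute sample value in the signal."""
    return max(abs(v) for v in signal)


# Made-up test tones: Input A is quiet, Input B is loud.
input_a = [0.1 * math.sin(2 * math.pi * n / 32) for n in range(64)]
input_b = [0.9 * math.sin(2 * math.pi * n / 20) for n in range(64)]

# Assumed pseudo-stereo reference: the average of the two channels.
mixed_reference = [(a + b) / 2 for a, b in zip(input_a, input_b)]

print("peak of Input A:        ", peak(input_a))
print("peak of Input B:        ", peak(input_b))
print("peak of mixed reference:", peak(mixed_reference))
```

The mixed reference sits between the two channels: much larger than Input A and noticeably smaller than Input B, so it represents neither loudspeaker output well.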

Choosing the right acoustic echo cancellation solution

The starting point is to decide which acoustic echo canceller you need for your microphone array and audio subsystem.

Using a mono-AEC algorithm with a true stereo device will only work if both channels are very similar. If your stereo product uses the full capabilities of stereo audio with spatial soundscape and dramatic volume changes, then the only solution is one that supports true stereo-AEC.

For devices like smart speakers, where the required range of output is more limited, pseudo-stereo AEC may provide a good solution. And for things like kitchen appliances, where high-quality audio isn’t required, mono-AEC is ideal.

XMOS has a range of solutions to fit whatever product you’re developing. Our XVF3000 series with mono-AEC is ideal for smart panel and smart speaker developers, while our XVF3500 series with two-channel stereo-AEC delivers outstanding performance for smart TVs, soundbars and other products that play back true stereo output.

by Huw Geddes