MWC19: are we ushering in an ‘era of intelligent connectivity’?

February | Barcelona

It’s a bold claim to make, but after last year’s MWC Europe was described by some as a “damp squib,” this year’s theme proposed to “usher in the era of intelligent connectivity,” thrusting high-speed 5G networks, IoT, AI and big data to the forefront of discussions.

Organisers were keen to highlight that ‘intelligent connectivity’ will open up a number of opportunities for highly personalised experiences. Much of the discussion was driven by developments in the connectivity infrastructure – 5G, full fibre or otherwise – advances in artificial intelligence and machine learning, and the proliferation of smart devices in the IoT ecosystem, many of which were on the tradeshow floor. To our relief, MWC 2019 also moved past the gimmicky phase of IoT – who can forget the “record everything your kids say” wearable device of 2016, or the infamous laggy smart fridge which, in low lighting, could have passed for an iPad sellotaped to a regular fridge. Swish.

The conversation has moved on to how humans are going to engage in a simple, holistic way with their smart environments. And the biggest players in the world of tech are sitting up and taking notice. The success of voice services like Alexa, Google and Siri means that consumers are questioning why they need separate remote controls or mobile apps for each piece of technology in their home when they can ask a digital assistant to close the curtains, dim the lights and cue up the film on the TV.

There is no question that public consciousness has woken up to the power of voice. Smart speakers have fuelled rapid growth in the voice space, and other product categories are now coming to market. Latest estimates for home consumer electronics with ‘built-in’ voice assistants in 2017–18 have been revised upwards to 75.6 million units, based on increased shipments in the smart TV and media streamer categories (Futuresource).

The power of voice

XMOS was in the Department for International Trade’s exhibition space to showcase why voice is the natural interface between human and machine, with demonstrable use-cases. Despite the challenges of running voice-enabled devices in a trade-show environment (noise levels akin to a zoo), the commercial products on display passed the ultimate test: far-field voice capture with up-close accuracy.

XMOS engineering has spurred a wave of third-party product implementations that are transforming the way humans interact with their digital assistants. From a healthcare companion robot to set-top boxes, our technology is at the heart of it. Here are some of the products we had on show:

  • Freebox Delta: the brainchild of innovative French telecoms operator Free, Freebox Delta is a high-performance fusion of technologies – a set-top box, media streamer, soundbar, WiFi server and smart home hub that responds intelligently to both touch and voice. XMOS provided the far-field voice capability for its two on-board personal assistants, Alexa and OK Freebox.
  • Hello2: created by the California-based communications company Solaborate, Hello2 transforms your TV into a powerful communications device that responds intelligently to touch and voice. With far-field voice capture, a 4K HDR video sensor and a six-element lens, Hello2 plugs into your TV with an HDMI cable to integrate audio, video and collaboration, so you can enjoy everything from digital whiteboarding to screen sharing, media streaming and gaming.
  • Pillo: created by Boston-based Pillo Health, this companion robot redefines the role of voice-enabled assistants in the homes of the elderly, sick and vulnerable, as well as transforming the way healthcare is administered. The Pillo robot is more than just a voice-enabled pill dispenser. By harnessing artificial intelligence, the device is capable of carrying out two-way interactions with its patients, offering genuine companionship as well as assistance.

Of the many exciting new technologies on show, we were wowed by BMW’s Natural Interaction prototype, which uses voice recognition, gesture control and gaze tracking to understand what you want to do in your car without pressing a button. Our CEO Mark Lippett spoke to analyst firm Bloomberg NEF on why cars represent such an exciting opportunity for voice technology. The combination of voice, sensors and AI will deliver contextual awareness, which opens up the potential for improved safety, reduced distractions and the ability for us to connect seamlessly with our voice assistant or ‘digital twin’.

Ushering in the era of intuitive intelligence

Perhaps a more fitting theme for this year’s show would be the era of intuitive intelligence. To look at connectivity in isolation, without articulating the benefits it’ll bring to consumers, can be frustrating – particularly as big-ticket advancements such as 5G are still on the horizon for many. Today, voice interfaces are being adopted at double the rate of mobile phones when they first appeared. They are firmly positioned for general adoption at the intersection of humans and an augmented environment – which will be supercharged by next-generation networks. The ultimate goal has to be a move away from touch-based commands towards ambient computing – where our interactions with the technology around us feel intuitive and easy, and we don’t need to learn how to use it, because the intelligence is already embedded in the device.

Talkin’ bout the Alexa Generation – by Mark Lippett

“Alexa. Who or what is the Alexa Generation?”

When staying with friends last weekend, I asked their 6-year-old if she had a favourite song. ‘Alexa. Play Taylor Swift Shake it Off’, she immediately commanded their Echo device. As someone who lives and breathes voice interfaces, there was something about this casual, matter-of-fact interaction that really struck me – and it got me thinking.

The Alexa Generation is here – children are now growing up with voice-enabled technologies around them as the everyday familiar. By the time these children are teenagers, they’ll expect devices to be voice-enabled and their lifestyles will be inflected by digital assistants in a way unlike any generation before them.

Our Alexa Children will interact fluently, efficiently and naturally with their voice-enabled devices. They will search using voice, not text. Already, one in four 16 to 24 year olds use voice search on mobile. Text-based commands will decline as our Alexa Children form increasingly human relationships with technology. They’ll take their virtual assistants with them wherever they go. They’ll navigate websites by voice. And a remote control will look as strange to them as a cassette tape does to millennials.

Alexa children show a different pattern of usage

According to research from Childwise, 42 per cent of children aged between nine and 16 use voice recognition gadgets at home. And unlike many adults, children are using it for more than music streaming. They’re turning to these virtual assistants for help with homework, to look up facts and check spelling.

Some worry that the growth of voice technology could have an adverse effect on how children learn to communicate. However, for every sceptic there is an evangelist: Solace Shen, a researcher at Cornell who has studied interactions between children and technology, says she sees opportunities for educational and entertainment content on voice-enabled devices that doesn’t “suck kids in” the same way a smartphone does.

Indeed, universal voice commands, such as those used to instruct the Echo, come as naturally to the Alexa Generation as Windows and Mac operating systems came to their generational predecessors. And perhaps voice will bring us a behavioural shift – or at least a better balance between screen-time and time spent engaged in active listening.

We’re moving towards a more natural conversation with technology

Over 39 million Americans now own an Echo, and yet many of those are using a fraction of Alexa’s capabilities. Why? Well, it turns out that adults feel very self-conscious addressing technology directly. According to KDDI’s 2017 voice search study of the Japanese population, 70 per cent of users said it’s “embarrassing to voice search in front of others.” And when asked how they used voice-enabled devices at home, 40 per cent of respondents said, “I only do so with nobody in the house.”

The Alexa Generation vaults clear of this barrier. There is a fluidity with which the Alexa Children interact with voice-enabled technologies that’s free of inhibition. This, coupled with advancements in human-machine interface technology, means that over time the friction in human-machine interactions will erode away, until a conversation with the Internet of Things feels as natural as a conversation with a parent, friend or sibling.

Digital assistants will simply blend into the fabric of our lives

For the Alexa Generation, voice control seems as natural as the English rain. The culture and lifestyle of those growing up with voice-enabled technologies will be worlds apart from that of those born in the ‘80s and ‘90s. It’ll be defined by a cooperative co-existence with AI and technology, whereby voice-activated digital assistants become an integral, trusted part of everyday life.

Today, a demonstration of Alexa’s talents brings about surprise and fuels dinner-table discussion, but tomorrow, digital assistants will simply blend into the fabric of the lives of the Alexa Children. That the novelty will wear off is something to be celebrated, because it’ll mark the point when voice-assisted devices start to fulfil their potential as tools (not toys), and herald the passing of the torch from the Millennials to the Alexa Generation.

Giving voice to the elderly

Voice-enabled technologies will transform the health and happiness of the elderly.

The UN predicts a 56% rise in the number of people aged 60 years or over, taking us from over 900 million in 2015 to nearly 1.5 billion in 2030. The world’s population is changing. Our demographic is ageing. And this could well be the defining issue of our time. An ageing population creates a burden on health systems and individual households. Family members, clinicians, and assisted care providers will need a new generation of technology platforms to help them stay informed, coordinated, and most importantly, connected.

The social care system is facing a mountain of challenges and it can’t cope with a sustained upswing in the number of senior people and adults living with chronic illnesses.

Whether living at home or in an assisted facility, help may come from an unexpected source – technology. Speech recognition and voice-enabled devices make technology accessible to all. There’s no need to tap a keyboard or figure out how to work the remote control, you simply talk to the device from across the room. A voice-controlled device can empower a formerly ‘dis-empowered’ user. It can ease pressure on caregivers, becoming a companion and digital assistant. Of course it’s not a replacement for human interaction, but rather a meaningful addition.

How can voice-enabled assistants make a difference?

Voice-enabled assistants such as Amazon’s Alexa, Google’s Assistant, Apple’s Siri, and Microsoft’s Cortana are at the forefront of society’s screenless future. Thanks to rapid advancements in voice technology and natural language processing (NLP), these virtual assistants are far better equipped to understand human speech and respond accurately (and in real time). These virtual assistants can perform all sorts of tasks – including playing music on demand, calling friends or prompting you to take your medication at a certain time.

This can make a big difference to those living with a chronic illness, anyone who has limited mobility, and the elderly. It can make tasks easier and create a sense of companionship. More importantly, it helps people regain control. From simple actions such as lighting the room and adjusting the temperature, to things that are critical to our wellbeing, such as controlling access to our home and calling for emergency assistance when needed.

How does this work in the real world?

The team at Pillo Health have come up with a ground-breaking, in-home 24/7 companion robot – Pillo – which combines voice control, facial recognition and artificial intelligence to provide personalised digital healthcare. The technology can also provide round-the-clock care and companionship, entertainment, reminders of when the patient needs to eat, sleep or move, and the ability for the person to live independently.

The device acts as a secure pill dispenser, offers video check-ins with caregivers, can quickly and reliably identify valuable healthcare insights and can send data to healthcare professionals. But it offers much more than that to the user. They don’t need to learn how to use Pillo, they can just talk to it – and the companion robot is on hand to tune into their favourite radio station, answer a question, manage their calendar and give them handy reminders.

It all adds up to help the user enjoy a more independent life, and could help to ease pressure on a stretched healthcare system (see pillohealth.com).

Companionship is equally important

Happiness comes when the ‘assistant’ becomes something more akin to a ‘companion’. A study by US company Brookdale Senior Living explores technologies that can help the older generation stay independent for longer.

A team set out to determine whether reciting Shakespeare with a robot could increase engagement and lessen symptoms of depression. A two-foot tall robot called Nao recited the first 12 lines of Shakespeare’s Sonnet 18, “Shall I compare thee to a summer’s day?” and then prompted seniors to recite the last two lines.

Those who interacted with Nao experienced significant decreases in depression and significant increases in engagement over time – showing that voice-enabled assistants can be more than just a virtual caregiver; at times they can be a companion who is available 24/7.

Voice can provide both practical and emotional benefits

Over time, advancements in artificial intelligence will improve voice-enabled assistants. Learning more about the user with each interaction, they will move from reactive responses to a more relevant, engaging conversation. They will become an integral part of the consumer ecosystem, seamlessly integrating across all devices and platforms to become a natural, digital companion.

Crucially, if technology is controlled by voice, it becomes accessible to everyone. There’s no need to learn how to use it, you just talk to it. A number of studies have shown that talking makes us feel happier, so it’s easy to see how voice-enabled technology could transform life for the elderly in ways that are both practical and emotional. And given our ageing demographic, this feels like a very good thing.

Want to develop a voice enabled device that can hear across the room?

You’ll need the right acoustic echo cancellation (AEC) solution.

If you’re designing a voice-enabled product for the smart home that includes a loudspeaker, you’ll need to remove the acoustic echo the loudspeaker generates, so that a user can interrupt the audio stream – barge in – and give a voice command, such as adjusting the volume, while the device is playing.

Mono or stereo?

For products such as security solutions or kitchen appliances, and many smart speakers, mono-AEC is usually the right tool for the job. But if you’re designing products that output true stereo audio, for example TVs, soundbars and media streamers, then you’ll need stereo-AEC to secure the best performance available. Here’s why …

Acoustic echo cancellation explained

Acoustic echo cancellation is a digital signal processing technique for removing echo that originates from a loudspeaker. Within a device, there’s a direct path between the loudspeaker and microphones. There’s also an indirect path between the two, because the audio signal reflects off the walls and other surfaces before it reaches the microphone. Put simply, you’ll get a reflection off the ceiling, floor, each wall and every solid object in the room. These reflections are known as indirect acoustic echo and they’re picked up at different times by the microphone, depending on the length of path from the loudspeaker to the microphone.

If we look at the soundwave generated by a noise from the loudspeaker, the original sound can usually be identified at the beginning, and the soundwave then tails off as the energy in the reflections decays.

To support barge-in and capture a clear voice stream to send to your automatic speech recognition service (ASR), you need to remove as much echo from the captured microphone signal as possible.

It’s not possible to remove 100% of the echo because the time needed to capture the signal and separate out all of the echo would lead to a delayed response, and the user experience demands that this all happens in real time. So in practice, you’re looking to target an “acceptable” level of echo cancellation that allows the ASR to respond accurately.
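The underlying principle can be sketched with an adaptive filter: the canceller learns an estimate of the loudspeaker-to-microphone echo path and subtracts the estimated echo from the captured signal. Below is a minimal single-reference sketch in Python/NumPy using the widely used normalised LMS (NLMS) algorithm – an illustrative stand-in, not the algorithm in any particular XMOS product, and all names here are hypothetical:

```python
import numpy as np

def nlms_aec(reference, mic, taps=16, mu=0.5, eps=1e-6):
    """Subtract an adaptively estimated loudspeaker echo from the mic signal."""
    w = np.zeros(taps)                       # running estimate of the echo path
    x = np.concatenate([np.zeros(taps - 1), reference])
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        xv = x[n : n + taps][::-1]           # most recent reference samples, newest first
        e = mic[n] - w @ xv                  # residual after subtracting estimated echo
        w += mu * e * xv / (xv @ xv + eps)   # NLMS update towards the true echo path
        out[n] = e
    return out

rng = np.random.default_rng(1)
ref = rng.standard_normal(2000)              # what the loudspeaker plays
echo_path = np.array([0.5, 0.3, 0.1])        # toy direct path plus two reflections
mic = np.convolve(ref, echo_path)[:2000]     # in this toy case the mic hears only echo
residual = nlms_aec(ref, mic)
```

In this toy case the microphone hears only echo, so the residual shrinks towards zero as the filter converges; in a real device the residual would be the user’s speech plus whatever echo remains uncancelled.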

Types of acoustic echo cancellers

Echo cancellers are categorised by the number of loudspeaker reference channels they support. Common configurations are mono (one channel) and true stereo (two channels). Another configuration – pseudo-stereo – behaves very like mono, but has significant performance issues when challenged with true stereo audio output.

Mono-AEC

Mono-AEC uses a single reference signal based on the audio input and applies it to the output, which can be one or more loudspeakers.

The Digital Signal Processor uses the reference signal to calculate indirect echo based on the time it takes the reflections to reach the microphone.

Where signal processing has been used to give the impression of a stereo system from a mono signal (e.g. by adjusting the signal pan and volume and outputting to two or more speakers), the calculation remains based on the single reference signal and the position of the loudspeakers relative to the microphone.

True Stereo-AEC

True stereo-AEC uses two separate reference signals based on the two-channel input.

Each reference signal is used to cancel the echo from its corresponding loudspeaker output.

True stereo-AEC requires almost twice the computational resources of a mono solution, and it requires very low latency within the system to keep all the echo cancellation synchronised within the required thresholds.

Pseudo-Stereo-AEC

A pseudo-stereo solution is similar to a mono-AEC configuration; it outputs the two audio streams to separate speakers but uses a single reference signal that is a mix of the two inputs.

The mixed reference signal is then applied to each loudspeaker output.

Problems arise when the mixed signal differs significantly from the two output channels – for example, a loud track on one loudspeaker and a quiet one on the other – so the mixed reference signal is not representative of either input signal.

In the example above, the amplitude of the reference signal is significantly larger than the output for Input A. This causes the signal to be drowned out, leading to a very low signal-to-noise ratio for the voice capture process. With Input B, there is not enough echo cancellation when the input is loud, which causes increased artefacts in the captured voice stream and a higher likelihood of inaccurate word recognition.
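The mixed-reference weakness can be shown numerically. This sketch (Python/NumPy, illustrative names, NLMS as a stand-in adaptive algorithm) cancels a toy stereo echo once with a single mixed reference (pseudo-stereo) and once with one reference per channel (true stereo):

```python
import numpy as np

def nlms(refs, mic, taps=32, mu=0.5, eps=1e-6):
    """NLMS echo canceller with one adaptive filter per reference channel."""
    w = np.zeros(len(refs) * taps)
    padded = [np.concatenate([np.zeros(taps - 1), r]) for r in refs]
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        xv = np.concatenate([p[n : n + taps][::-1] for p in padded])
        e = mic[n] - w @ xv                  # residual after cancellation
        w += mu * e * xv / (xv @ xv + eps)   # joint NLMS update
        out[n] = e
    return out

rng = np.random.default_rng(0)
N = 4000
left, right = rng.standard_normal(N), rng.standard_normal(N)  # uncorrelated stereo content
hL, hR = np.array([0.6, 0.3]), np.array([0.2, 0.5])           # toy echo paths to the mic
mic = np.convolve(left, hL)[:N] + np.convolve(right, hR)[:N]  # mic hears both echoes

pseudo = nlms([(left + right) / 2], mic)   # pseudo-stereo: one mixed reference
stereo = nlms([left, right], mic)          # true stereo: one reference per channel

def erle(residual):
    """Echo-return-loss enhancement in dB, measured after initial convergence."""
    return 10 * np.log10(np.mean(mic[N // 2:] ** 2) / np.mean(residual[N // 2:] ** 2))
```

When the two channels carry different content, the single mixed-reference filter can only cancel the component common to both, so its echo-return-loss enhancement plateaus, while the two-reference canceller keeps reducing the residual.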

Choosing the right acoustic echo cancellation solution

The starting point is to decide which acoustic echo canceller you need for your microphone array and audio subsystem.

Using a mono-AEC algorithm with a true stereo device will only work if both channels are very similar. If your stereo product uses the full capabilities of stereo audio with spatial soundscape and dramatic volume changes, then the only solution is one that supports true stereo-AEC.

For devices like smart speakers, where the required range of output is more limited, pseudo-stereo may provide a good solution. And for things like kitchen appliances, where high-quality audio isn’t required, mono-AEC is ideal.

XMOS has a range of solutions to fit whatever product you’re developing. Our XVF3000 series with mono-AEC is ideal for smart panels and smart speaker developers, while our XVF3500 series with two-channel stereo-AEC delivers outstanding performance for smart TVs, soundbars and other products that play back true stereo output.

by Huw Geddes