Espressif’s New Voice Assistant, ESP-Skainet, Released

Macnica UK

Editorial contact:

Name: Stefan Tauschek
Telephone: +49 (0)841 88198-102

With ESP-Skainet, users can easily build applications performing wake-word detection and processing speech-recognition commands.

Shanghai, China, Aug 30, 2019

ESP-Skainet is a new voice-interaction development framework based on Espressif’s flagship chip, ESP32. Built on Espressif’s WakeNet and MultiNet, the framework supports voice wake-up and multiple offline speech-recognition commands.

WakeNet is Espressif’s voice wake-up engine. It achieves a low memory usage of approximately 20 KB, as well as a high calculation speed. WakeNet has been specifically designed for low-power embedded MCUs, providing users with excellent near- and far-field performance. Taking LyraT-Mini, a small yet versatile and powerful development board, as an example, the currently achievable wake-up rate in a quiet environment is 97% within one meter of the user and 95% within three meters of the user, as shown in the table below.

Distance  Quiet environment  Stationary environment  Noise environment  AEC Wake Up
1 m       97%                90%                     88%                89% (-5 dB ~ -10 dB)
3 m       95%                85%                     75%                73% (-5 dB ~ -15 dB)


Currently, Espressif’s voice wake-up engine supports up to five wake-up words. Espressif provides its customers with the official wake-up word “嗨乐鑫” free of charge. In English this means “Hello Espressif”; it is pronounced “haɪ ləʌˈʃɪən” and is transcribed in pinyin as “Hi Lexin”. Espressif also supports custom wake-up words. For details on the customization process, please refer to “ESP_Wake_Words_Customization”.

MultiNet is a lightweight model which allows ESP32 to perform offline speech recognition of multiple commands. MultiNet’s design draws on Convolutional Recurrent Neural Networks (CRNN) and Connectionist Temporal Classification (CTC). MultiNet takes an audio clip’s Mel-Frequency Cepstral Coefficients (MFCC) as input and produces the phonemes of that audio clip, which could be either in Chinese or in English, as output. By comparing the output phonemes with those of the known commands, MultiNet can identify the relevant Chinese or English command. For the time being, up to 100 spoken commands in Chinese, including customized ones, are supported.
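The phoneme-comparison step can be illustrated with a short Python sketch. It is not Espressif’s implementation: the blank symbol, the command table, the use of pinyin syllables in place of phonemes, and the edit-distance threshold are all assumptions made for illustration. The sketch collapses a frame-level CTC output (merging repeats and dropping blanks) and then picks the command whose reference sequence is closest to the result.

```python
from itertools import groupby

BLANK = "_"  # hypothetical CTC blank symbol (an assumption, not from the release)

def ctc_collapse(frames):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks."""
    return [label for label, _ in groupby(frames) if label != BLANK]

def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical command table; pinyin syllables stand in for phonemes.
COMMANDS = {
    "turn_on_light": ["da", "kai", "dian", "deng"],
    "turn_off_light": ["guan", "bi", "dian", "deng"],
}

def match_command(phonemes, commands=COMMANDS, max_distance=1):
    """Return the closest command id, or None if nothing is close enough."""
    best_id, best_dist = None, max_distance + 1
    for cmd_id, reference in commands.items():
        dist = edit_distance(phonemes, reference)
        if dist < best_dist:
            best_id, best_dist = cmd_id, dist
    return best_id

# Frame-level output as a CTC-trained model might emit it (made-up data).
frames = ["_", "da", "da", "_", "kai", "_", "dian", "dian", "deng", "_"]
decoded = ctc_collapse(frames)
command = match_command(decoded)
```

In this sketch, supporting a new command only means adding another entry to the hypothetical `COMMANDS` table; the decoding and matching code stays unchanged.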

What’s more, users can easily add their own voice commands without having to train the model again. No network connection is required. Commands can be implemented quickly while the security of user information is ensured. Commands in English will be supported in the next edition of MultiNet.

Pricing and Availability
Pricing and availability information is available via email:

About Espressif Systems
Espressif Systems (Shanghai) Pte. Ltd. is a fabless semiconductor company, with headquarters in Shanghai Zhangjiang High-Tech Park, providing low-power Wi-Fi and Bluetooth SoCs and wireless solutions for the Internet of Things (IoT). The company builds the widely popular ESP8266 and ESP32 chips with an innovative team of chip-design specialists, software and firmware developers and marketers. Espressif is committed to providing the best IoT devices and software platforms in the industry.

The company also helps its customers build their own solutions and connect with other partners in the IoT ecosystem. Its passion lies in creating state-of-the-art chipsets and enabling partners to deliver great products. Espressif’s products are widely deployed in the tablet, OTT box, camera, and Internet of Things markets.

For more information, please visit