Mekhatronika, Avtomatizatsiya, Upravlenie

Audiovisual Voice Activity Detector Based on Deep Convolutional Neural Network and Generalized Cross-Correlation

Abstract

This paper presents a voice activity detector (VAD) that uses data from a compact linear microphone array and a video camera, which makes the developed VAD robust to external noise. It is able to ignore non-speech sound sources and speaking persons located outside the area of interest. A deep convolutional neural network, trained with the Max-Margin Object Detection loss, processes images from the video camera to find the face and lips of the speaking person. The pixel coordinates of the detected lips are converted into directions in the camera coordinate system using an optical camera model. The sound from the microphone array is processed with the weighted GCC-PHAT algorithm and Kalman filtering. The VAD searches for speaking lips in the video and is activated only if the video camera finds lips and the microphone array confirms that there is a sound source in that direction. A prototype based on a linear microphone array with 30 mm spacing between microphones and a video camera was developed, manufactured on a 3D printer, and tested under laboratory conditions. Its accuracy was compared with that of the open-source VAD from the WebRTC project (developed by Google), which uses only audio features extracted from the same microphone array. The developed VAD showed high robustness to external noise: it ignored noise from non-target directions during 100 % of the testing time, whereas the WebRTC VAD produced false positive activations during 88 % of the testing time.
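As an illustrative sketch (not code from the paper), the two core computations described in the abstract can be outlined in Python with NumPy: PHAT-weighted generalized cross-correlation to estimate the time difference of arrival (TDOA) between a microphone pair, and a pinhole-model conversion of a lip pixel coordinate into a horizontal bearing. All function names and parameter values here are assumptions for illustration.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the TDOA (seconds) between two microphone signals using
    generalized cross-correlation with the phase transform (PHAT)."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15          # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:         # optionally restrict the search range
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

def doa_from_tdoa(tau, mic_spacing, c=343.0):
    """Direction of arrival (radians from broadside) for a microphone
    pair with the given spacing, from the measured TDOA."""
    return np.arcsin(np.clip(tau * c / mic_spacing, -1.0, 1.0))

def pixel_to_bearing(u, fx, cx):
    """Horizontal bearing (radians) of an image point at column u, using
    a pinhole camera model with focal length fx and principal point cx."""
    return np.arctan2(u - cx, fx)
```

For the prototype's 30 mm microphone spacing, physically possible TDOAs are bounded by ±0.03/343 ≈ ±87 µs, so `max_tau` can be used to exclude spurious correlation peaks outside that range.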

About the Authors

D. A. Suvorov
Skolkovo Institute of Science and Technology
Russian Federation


R. A. Zhukov
Skolkovo Institute of Science and Technology
Russian Federation


D. O. Tsetserukov
Skolkovo Institute of Science and Technology
Russian Federation


S. L. Zenkevich
Bauman Moscow State Technical University
Russian Federation


References

1. Ramírez J., Gorriz J. M., Segura J. C. Voice activity detection. Fundamentals and speech recognition system robustness // Robust Speech Recognition and Understanding. Vienna: I-TECH Education and Publishing. 2007. P. 1-22.

2. Woo K., Yang T., Park K., Lee C. Robust voice activity detection algorithm for estimating noise spectrum // Electronics Letters. 2000. Vol. 36, N. 2. P. 180-181.

3. Mousazadeh S., Cohen I. Voice activity detection in presence of transient noise using spectral clustering // IEEE Trans. Audio, Speech, Language Process. 2013. Vol. 21, N. 6. P. 1261-1271.

4. Obuchi Y. Framewise speech-nonspeech classification by neural networks for voice activity detection with statistical noise suppression // IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, 2016, March. P. 5715-5719.

5. Montazzolli S., Jung C. R., Gelb D. Audiovisual voice activity detection using off-the-shelf cameras // IEEE International Conference on Image Processing. Quebec, 2015, September. P. 3886-3890.

6. Ying D., Yan Y., Dang J., Soong F. K. Voice activity detection based on an unsupervised learning framework // IEEE Trans. Audio, Speech, Language Process. 2011. Vol. 19, N. 8. P. 2624-2633.

7. Popovic B., Pakoci E., Pekar D. Advanced Voice Activity Detection on Mobile Phones by Using Microphone Array and Phoneme-Specific Gaussian Mixture Models // SISY. Subotica, 2016, August. P. 45-50.

8. Grondin F., Michaud F. Noise Mask for TDOA Sound Source Localization of Speech on Mobile Robots in Noisy Environments // IEEE International Conference Robotics and Automation. Stockholm, 2016, May.

9. Tashev I., Mirsamadi S. DNN-based Causal Voice Activity Detector // Information Theory and Applications Workshop. San Diego, 2016, February.

10. Julier S., Uhlmann J. A new extension of the Kalman filter to nonlinear systems // 11th International Symposium on Aerospace/Defense Sensing, Simulation and Controls. Vol. Multi-Sensor Fusion, Tracking and Resource Management II. Orlando, 1997.

11. King D. E. Max-Margin Object Detection // Cornell University Library. 31.12.15. URL: https://arxiv.org/pdf/1502.00046.pdf (accessed: 18.08.2017).

12. Kazemi V., Sullivan J. One Millisecond Face Alignment with an Ensemble of Regression Trees // IEEE Conference on Computer Vision and Pattern Recognition. Columbus, 2014, June.

13. Bradski G., Kaehler A. Learning OpenCV. Computer Vision with the OpenCV Library. Sebastopol: O'Reilly Media, 2008. P. 580.

14. Tashev I. Sound Capture and Processing. Practical Approaches. The City of New York: John Wiley & Sons, 2009. P. 365.

15. Suvorov D. A., Zhukov R. A. Device for synchronous data acquisition from an array of MEMS microphones with a PDM interface. Russian Patent No. 172596. 2017. Bulletin No. 20. (In Russ.)


For citations:


Suvorov D.A., Zhukov R.A., Tsetserukov D.O., Zenkevich S.L. Audiovisual Voice Activity Detector Based on Deep Convolutional Neural Network and Generalized Cross-Correlation. Mekhatronika, Avtomatizatsiya, Upravlenie. 2018;19(1):53-57. (In Russ.)


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1684-6427 (Print)
ISSN 2619-1253 (Online)