Speech processing

Speech processing is the study of speech signals and of the methods used to process them. Because the signals are usually handled in a digital representation, speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Typical speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement and speaker recognition.[1]

History

Early attempts at speech processing and recognition focused primarily on understanding a handful of simple phonetic elements such as vowels. Pioneering work on speech recognition based on analysis of the speech spectrum was reported as early as the 1940s.[3] In 1952, three researchers at Bell Labs, Stephen Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker.[2]

Linear predictive coding (LPC), a speech processing algorithm, was first proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT) in 1966.[4] Further developments in LPC technology were made by Bishnu S. Atal and Manfred R. Schroeder at Bell Labs during the 1970s.[4] LPC was the basis for voice-over-IP (VoIP) technology,[4] as well as speech synthesizer chips, such as the Texas Instruments LPC Speech Chips used in the Speak & Spell toys from 1978.[5]
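As an illustration of the technique itself, the following is a minimal sketch of LPC analysis in Python using the autocorrelation method and the Levinson-Durbin recursion; the frame length, model order and function names are illustrative and not taken from any of the historical implementations above.

<syntaxhighlight lang="python">
import numpy as np

def lpc(frame, order):
    """Estimate coefficients a[1..order] of an all-pole model
    s[n] ~ -sum_k a[k] * s[n-k] via the Levinson-Durbin recursion."""
    # Autocorrelation of the windowed speech frame, lags 0..N-1
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection (PARCOR) coefficient for this model order
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k  # residual prediction error shrinks each step
    return a, err

# Example: a 30 ms frame of a synthetic vowel-like signal at 8 kHz
fs = 8000
t = np.arange(240) / fs
frame = np.sin(2 * np.pi * 200 * t) * np.hamming(240)
coeffs, residual = lpc(frame, order=10)
</syntaxhighlight>

The resulting coefficients describe an all-pole filter modelling the vocal tract; transmitting a handful of such coefficients per frame instead of the raw waveform is what makes LPC attractive for low-bit-rate speech coding.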

One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by Lawrence Rabiner and others at Bell Labs was used by AT&T in their Voice Recognition Call Processing service to route calls without a human operator. By this point, the vocabulary of these systems was larger than the average human vocabulary.[6]

By the early 2000s, the dominant speech processing strategy started to shift away from hidden Markov models towards more modern neural networks and deep learning.[citation needed]

Techniques

Dynamic time warping
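Dynamic time warping (DTW) measures the similarity of two sequences that may differ in speed, and was widely used in early isolated-word recognition to align a spoken utterance against stored templates. Below is a minimal sketch in Python; the feature sequences and the local distance are illustrative, not drawn from any particular recognizer.

<syntaxhighlight lang="python">
import numpy as np

def dtw_distance(x, y):
    """Classic DTW between two 1-D feature sequences."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])        # local distance
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# Two "utterances" of the same pattern spoken at different speeds
ref = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
test = np.array([1.0, 1.5, 2.0, 3.0, 3.0, 2.0, 1.0])
print(dtw_distance(ref, test))
</syntaxhighlight>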

Hidden Markov models

Artificial neural networks

Phase-aware processing

The phase is usually assumed to be a uniformly distributed random variable and therefore useless. This is due to phase wrapping:[7] the result of the arctangent function is not continuous, exhibiting periodic jumps of [math]\displaystyle{ 2 \pi }[/math]. After phase unwrapping (see [8], Chapter 2.3: Instantaneous phase and frequency), the phase can be expressed as[7][9] [math]\displaystyle{ \phi(h,l) = \phi_{lin}(h,l) + \Psi(h,l) }[/math], where [math]\displaystyle{ \phi_{lin}(h,l) = \omega_0(l') \Delta t }[/math] is the linear phase ([math]\displaystyle{ \Delta t }[/math] is the temporal shift at each analysis frame) and [math]\displaystyle{ \Psi(h,l) }[/math] is the phase contribution of the vocal tract and the excitation source.[9] The resulting phase estimates can be used for noise reduction: temporal smoothing of the instantaneous phase[10] and of its derivatives with respect to time (instantaneous frequency) and frequency (group delay),[11] as well as smoothing of the phase across frequency.[11] Joint amplitude and phase estimators can recover speech more accurately based on the assumption that the phase follows a von Mises distribution.[9]
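To make the wrapping problem concrete, here is a minimal sketch of phase unwrapping on an STFT, assuming NumPy and SciPy; the test signal and STFT parameters are illustrative rather than taken from the cited works.

<syntaxhighlight lang="python">
import numpy as np
from scipy.signal import stft

fs = 16000                            # sample rate in Hz
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)       # stand-in for a speech signal

# The STFT phase is wrapped to (-pi, pi] by the arctangent
f, frames, Z = stft(x, fs=fs, nperseg=512, noverlap=384)
wrapped = np.angle(Z)

# Unwrapping removes the 2*pi jumps along the time axis, recovering a
# continuous phase whose time derivative (the instantaneous frequency)
# can then be smoothed for phase-aware noise reduction
unwrapped = np.unwrap(wrapped, axis=1)
hop = (512 - 384) / fs                # frame shift in seconds
inst_freq = np.diff(unwrapped, axis=1) / (2 * np.pi * hop)
</syntaxhighlight>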

Applications

See also

References

  1. Sahidullah, Md; Patino, Jose; Cornell, Samuele; Yin, Ruiqing; Sivasankaran, Sunit; Bredin, Herve; Korshunov, Pavel; Brutti, Alessio; Serizel, Romain; Vincent, Emmanuel; Evans, Nicholas; Marcel, Sebastien; Squartini, Stefano; Barras, Claude (2019-11-06). "The Speed Submission to DIHARD II: Contributions & Lessons Learned". arXiv:1911.02388 [eess.AS].
  2. Juang, B.-H.; Rabiner, L.R. (2006), Speech Recognition, Automatic: History, Elsevier, pp. 806–819, doi:10.1016/b0-08-044854-2/00906-8, ISBN 9780080448541 
  3. Myasnikov, L. L.; Myasnikova, Ye. N. (1970). Automatic recognition of sound patterns (in Russian). Leningrad: Energiya.
  4. Gray, Robert M. (2010). "A History of Realtime Digital Speech on Packet Networks: Part II of Linear Predictive Coding and the Internet Protocol". Found. Trends Signal Process. 3 (4): 203–303. doi:10.1561/2000000036. ISSN 1932-8346. https://ee.stanford.edu/~gray/lpcip.pdf.
  5. "VC&G - VC&G Interview: 30 Years Later, Richard Wiggins Talks Speak & Spell Development". http://www.vintagecomputing.com/index.php/archives/528. 
  6. Huang, Xuedong; Baker, James; Reddy, Raj (2014-01-01). "A historical perspective of speech recognition". Communications of the ACM 57 (1): 94–103. doi:10.1145/2500887. ISSN 0001-0782. 
  7. Mowlaee, Pejman; Kulmer, Josef (August 2015). "Phase Estimation in Single-Channel Speech Enhancement: Limits-Potential". IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (8): 1283–1294. doi:10.1109/TASLP.2015.2430820. ISSN 2329-9290. https://ieeexplore.ieee.org/document/7103305. Retrieved 2017-12-03.
  8. Mowlaee, Pejman; Kulmer, Josef; Stahl, Johannes; Mayer, Florian (2017). Single channel phase-aware signal processing in speech communication: theory and practice. Chichester: Wiley. ISBN 978-1-119-23882-9. 
  9. Kulmer, Josef; Mowlaee, Pejman (April 2015). "Harmonic phase estimation in single-channel speech enhancement using von Mises distribution and prior SNR". 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 5063–5067.
  10. Kulmer, Josef; Mowlaee, Pejman (May 2015). "Phase Estimation in Single Channel Speech Enhancement Using Phase Decomposition". IEEE Signal Processing Letters 22 (5): 598–602. doi:10.1109/LSP.2014.2365040. ISSN 1070-9908. https://ieeexplore.ieee.org/document/6936313. Retrieved 2017-12-03. 
  11. Mowlaee, Pejman; Saeidi, Rahim; Stylianou, Yannis (July 2016). "Advances in phase-aware signal processing in speech communication". Speech Communication 81: 1–29. doi:10.1016/j.specom.2016.04.002. ISSN 0167-6393. http://linkinghub.elsevier.com/retrieve/pii/S0167639316300784. Retrieved 2017-12-03.