Log-spectral distance

From HandWiki

The log-spectral distance (LSD), also referred to as log-spectral distortion or root mean square log-spectral distance, is a distance measure between two spectra.[1] The log-spectral distance between spectra [math]\displaystyle{ P\left(\omega\right) }[/math] and [math]\displaystyle{ \hat{P}\left(\omega\right) }[/math] is defined as a p-norm of the difference between their logarithms:

[math]\displaystyle{ D_{LS}={\left\{ \frac{1}{2\pi} \int_{-\pi}^\pi \left| \log P(\omega) - \log \hat{P}(\omega) \right|^p \,d\omega \right\} }^{1/p}, }[/math] where [math]\displaystyle{ P\left(\omega\right) }[/math] and [math]\displaystyle{ \hat{P}\left(\omega\right) }[/math] are power spectra. The absolute value keeps the integrand non-negative for any p; for p = 2 the expression is the root mean square of the log-spectral difference.

Unlike the Itakura–Saito distance, the log-spectral distance is symmetric.[2]

In speech coding, the log-spectral distortion for a given frame is defined as the root mean square difference between the original LPC log power spectrum and the quantized or interpolated LPC log power spectrum. The spectral distortion is usually averaged over a large number of frames, and this average is used as the performance measure for quantization or interpolation.
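The frame-wise measure described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the FFT size, and the sign convention A(z) = 1 + a[0] z^-1 + ... for the LPC polynomial are assumptions made for the example, and the distortion is reported in dB using the L2 norm.

```python
import numpy as np

def lpc_power_spectrum(a, gain=1.0, n_fft=512):
    """LPC power spectrum gain^2 / |A(e^{jw})|^2 on n_fft//2 + 1 bins in [0, pi].

    Assumed convention: A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ...
    """
    coeffs = np.concatenate(([1.0], np.asarray(a, dtype=float)))
    A = np.fft.rfft(coeffs, n_fft)  # zero-padded DFT samples A(e^{jw})
    return gain ** 2 / np.abs(A) ** 2

def spectral_distortion_db(a_ref, a_test, n_fft=512):
    """RMS difference (in dB) between two LPC log power spectra for one frame."""
    P = lpc_power_spectrum(a_ref, n_fft=n_fft)
    P_hat = lpc_power_spectrum(a_test, n_fft=n_fft)
    diff_db = 10.0 * np.log10(P / P_hat)
    # Averaging over [0, pi] approximates the integral over [-pi, pi],
    # because the spectra of real-coefficient filters are symmetric.
    return float(np.sqrt(np.mean(diff_db ** 2)))

def mean_spectral_distortion_db(frames_ref, frames_test, n_fft=512):
    """Average the per-frame spectral distortion over a sequence of frames."""
    per_frame = [spectral_distortion_db(r, t, n_fft)
                 for r, t in zip(frames_ref, frames_test)]
    return float(np.mean(per_frame))

# Hypothetical first-order LPC frames: original vs. quantized coefficients.
original = [[-0.90], [-0.50]]
quantized = [[-0.85], [-0.55]]
average_sd = mean_spectral_distortion_db(original, quantized)
```

Here `average_sd` is the single figure that would be reported for the quantizer under these assumptions; identical coefficient sets give a distortion of zero.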

Meaning

When measuring the distortion between signals, the scale and the temporal or spatial structure of the signals can matter to different degrees. To give each aspect its proper weight in the distortion measure, the signals can be transformed into a different domain.

When the signals are transformed into the spectral domain, for example with the Fourier transform or the DCT, a spectral distance is used to compare the transformed signals. LSD compares the power spectra on a logarithmic scale, so it is most effective when the task that consumes the power spectrum also responds logarithmically, e.g. human perception of the loudness of a sound signal.

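Moreover, for p = 2, LSD equals the cepstral distance, i.e. the Euclidean distance between the cepstra of the two signals. This follows from Parseval's theorem applied to the log power spectra: [math]\displaystyle{ D_{LS}^2 = \sum_{n=-\infty}^{\infty} \left( c_n - \hat{c}_n \right)^2 , }[/math] where [math]\displaystyle{ c_n }[/math] and [math]\displaystyle{ \hat{c}_n }[/math] are the cepstral coefficients, i.e. the Fourier coefficients of [math]\displaystyle{ \log P(\omega) }[/math] and [math]\displaystyle{ \log \hat{P}(\omega) }[/math].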
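The relation between LSD and the cepstral distance can be checked numerically for the p = 2 case. The sketch below assumes spectra sampled on N uniform DFT bins and takes the cepstrum as the inverse DFT of the natural-log power spectrum; the test signals are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 512

# Hypothetical power spectra of two random real signals
# (a small offset keeps them strictly positive).
P = np.abs(np.fft.fft(rng.normal(size=N))) ** 2 + 1e-12
P_hat = np.abs(np.fft.fft(rng.normal(size=N))) ** 2 + 1e-12

# Squared L2 log-spectral distance: mean squared log difference over the bins.
log_diff = np.log(P) - np.log(P_hat)
lsd_sq = np.mean(log_diff ** 2)

# Cepstra: inverse DFT of the log power spectra (real for real signals).
c = np.fft.ifft(np.log(P)).real
c_hat = np.fft.ifft(np.log(P_hat)).real

# Squared Euclidean distance between the cepstra.
cep_sq = np.sum((c - c_hat) ** 2)

# Discrete Parseval: the two quantities agree up to rounding error.
print(np.isclose(lsd_sq, cep_sq))
```

With a finite number of bins the equality is the discrete form of Parseval's theorem, so the sum runs over N cepstral coefficients rather than infinitely many.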

Other representations

Because LSD is a p-norm, it can be written with different values of p and with different logarithmic scales.

For instance, when it is expressed in dB with the L2 norm, it is defined as: [math]\displaystyle{ D_{LS}=\sqrt{\frac{1}{2\pi} \int_{-\pi}^\pi \left[ 10\log_{10} \frac{P(\omega)}{\hat{P}(\omega)} \right]^2 \,d\omega } }[/math].

When the spectra are sampled at N discrete frequencies, it is defined as: [math]\displaystyle{ D_{LS}={\left\{ \frac{1}{N} \sum_{n=1}^N \left| \log P(n) - \log \hat{P}(n) \right|^p \right\} }^{1/p} , }[/math] where [math]\displaystyle{ P\left(n\right) }[/math] and [math]\displaystyle{ \hat{P}\left(n\right) }[/math] are the sampled power spectra.
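The discrete form maps directly onto array code. The following is a minimal sketch, assuming both spectra are strictly positive and sampled on the same frequency grid; the function name and the `p` and `in_db` parameters are illustrative choices, not part of any standard library.

```python
import numpy as np

def log_spectral_distance(P, P_hat, p=2, in_db=False):
    """Discrete log-spectral distance between two sampled power spectra.

    p      -- order of the norm (p=2 gives the RMS log-spectral distance)
    in_db  -- if True, use 10*log10 (dB); otherwise the natural logarithm
    """
    P = np.asarray(P, dtype=float)
    P_hat = np.asarray(P_hat, dtype=float)
    if P.shape != P_hat.shape:
        raise ValueError("spectra must have the same shape")
    if np.any(P <= 0) or np.any(P_hat <= 0):
        raise ValueError("power spectra must be strictly positive")

    if in_db:
        diff = 10.0 * np.log10(P / P_hat)
    else:
        diff = np.log(P) - np.log(P_hat)

    # Mean of |diff|^p over the N frequency samples, then the p-th root.
    return float(np.mean(np.abs(diff) ** p) ** (1.0 / p))

# A spectrum that is uniformly 10 times stronger differs by 10 dB at every bin.
d_db = log_spectral_distance([10.0, 10.0, 10.0], [1.0, 1.0, 1.0], in_db=True)
```

Swapping the two arguments leaves the result unchanged, which reflects the symmetry of the measure.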

References

  1. Rabiner, Lawrence R; Juang, Biing-Hwang (1993). Fundamentals of speech recognition. PTR Prentice Hall. http://www.citeulike.org/group/10577/article/308923. 
  2. Enqvist, Per; Karlsson, Johan (2008). "Minimal Itakura-Saito distance and covariance interpolation". 2008 47th IEEE Conference on Decision and Control. pp. 137–142. doi:10.1109/CDC.2008.4739312. ISBN 978-1-4244-3123-6.