Audio inpainting

Audio inpainting (also known as audio interpolation) is an audio restoration task which deals with the reconstruction of missing or corrupted portions of a digital audio signal.[1] Inpainting techniques are employed when parts of the audio have been lost due to various factors such as transmission errors, data corruption or errors during recording.[2] The goal of audio inpainting is to fill in the gaps (i.e., the missing portions) in the audio signal seamlessly, making the reconstructed portions indistinguishable from the original content and avoiding the introduction of audible distortions or alterations.[3]

Many techniques have been proposed to solve the audio inpainting problem. Most of them analyze the temporal[1][4][5] and spectral[3][2] information surrounding each missing portion of the audio signal.

Corrupted spectrogram (top) and its reconstruction after performing audio inpainting (bottom)

Classic methods employ statistical models or digital signal processing algorithms[1][4][5] to predict and synthesize the missing or damaged sections. Recent solutions instead rely on deep learning models, following the growing trend of data-driven methods in audio restoration.[3][2][6]

Depending on the extent of the lost information, the inpainting task can be divided into three categories. Short inpainting refers to the reconstruction of a few milliseconds (approximately less than 10) of missing signal, which occurs in the case of short distortions such as clicks or clipping.[7] In this case, the goal of the reconstruction is to recover the lost information exactly. In long inpainting, with gaps on the order of hundreds of milliseconds or even seconds, this goal becomes unrealistic, since restoration techniques cannot rely on local information.[8] Therefore, besides providing a coherent reconstruction, the algorithms need to generate new information that is semantically compatible with the surrounding context (i.e., the audio signal surrounding the gaps).[3] The case of medium-duration gaps lies between short and long inpainting. It refers to the reconstruction of tens of milliseconds of missing data, a scale at which the non-stationary character of audio already becomes important.[9]
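
The three categories can be illustrated with a minimal sketch that locates the gaps in a binary mask and labels each one by duration. The function names and the 100 ms boundary between medium and long gaps are assumptions made for this example; the text above only gives approximate ranges.

```python
import numpy as np

def find_gaps(mask):
    """Return (start, length) pairs for each run of zeros in a binary mask."""
    missing = np.asarray(mask) == 0
    # Pad with False so that gaps touching either end are detected too.
    padded = np.concatenate(([False], missing, [False]))
    changes = np.diff(padded.astype(int))
    starts = np.flatnonzero(changes == 1)
    ends = np.flatnonzero(changes == -1)
    return [(int(s), int(e - s)) for s, e in zip(starts, ends)]

def classify_gap(length_samples, sample_rate, short_ms=10.0, long_ms=100.0):
    """Label a gap as short, medium or long (thresholds are illustrative)."""
    duration_ms = 1000.0 * length_samples / sample_rate
    if duration_ms < short_ms:
        return "short"
    if duration_ms < long_ms:
        return "medium"
    return "long"

sample_rate = 16000
mask = np.ones(sample_rate, dtype=int)
mask[1000:1080] = 0    # 5 ms gap
mask[4000:4800] = 0    # 50 ms gap
mask[8000:12000] = 0   # 250 ms gap

gaps = find_gaps(mask)
labels = [classify_gap(length, sample_rate) for _, length in gaps]
```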

Definition

Consider a digital audio signal [math]\displaystyle{ \mathbf{x} }[/math]. A corrupted version of [math]\displaystyle{ \mathbf{x} }[/math], that is, the audio signal containing the gaps to be reconstructed, can be defined as [math]\displaystyle{ \mathbf{\tilde{x}} = \mathbf{m} \circ \mathbf{x} }[/math], where [math]\displaystyle{ \mathbf{m} }[/math] is a binary mask that marks each sample of [math]\displaystyle{ \mathbf{x} }[/math] as reliable or missing, and [math]\displaystyle{ \circ }[/math] represents the element-wise product.[2] Audio inpainting aims at finding [math]\displaystyle{ \mathbf{\hat{x}} }[/math] (i.e., the reconstruction), which is an estimate of [math]\displaystyle{ \mathbf{x} }[/math]. This is an ill-posed inverse problem, characterized by a non-unique set of solutions.[2] For this reason, similarly to the formulation used for the inpainting problem in other domains,[10][11][12] the reconstructed audio signal can be found through an optimization problem that is formally expressed as
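
As a minimal sketch of the corruption model, the following NumPy snippet builds a mask and applies it to a test tone. The sample rate, the tone and the gap position are arbitrary choices made for illustration, and the mask uses 1 for reliable samples and 0 for missing ones, as implied by the element-wise product.

```python
import numpy as np

sample_rate = 16000                       # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate
x = np.sin(2 * np.pi * 440.0 * t)         # clean signal x: 1 s, 440 Hz tone

m = np.ones_like(x)                       # binary mask: 1 = reliable sample
gap = slice(8000, 8160)                   # hypothetical 10 ms gap
m[gap] = 0.0                              # 0 = missing sample

x_tilde = m * x                           # corrupted signal (element-wise product)
```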

[math]\displaystyle{ \mathbf{\hat{x}}^* = \underset{\mathbf{\hat{x}}}{\text{argmin}} ~ L(\mathbf{m} \circ\mathbf{\hat{x}}, \mathbf{\tilde{x}}) + R(\mathbf{\hat{x}}) }[/math].

In particular, [math]\displaystyle{ \mathbf{\hat{x}}^* }[/math] is the optimal reconstructed audio signal and [math]\displaystyle{ L }[/math] is a distance measure term that computes the reconstruction accuracy between the corrupted audio signal and the estimated one.[10] For example, this term can be expressed with a mean squared error or similar metrics.

Since [math]\displaystyle{ L }[/math] is computed only on the reliable samples, there are many solutions that can minimize [math]\displaystyle{ L(\mathbf{m} \circ\mathbf{\hat{x}}, \mathbf{\tilde{x}}) }[/math]. It is thus necessary to add a constraint to the minimization, in order to restrict the results to the valid solutions.[12][11] This is expressed through the regularization term [math]\displaystyle{ R }[/math], which is computed on the reconstructed audio signal [math]\displaystyle{ \mathbf{\hat{x}} }[/math]. This term encodes some kind of a priori information on the audio data. For example, [math]\displaystyle{ R }[/math] can express assumptions on the stationarity of the signal or on the sparsity of its representation, or it can be learned from data.[12][11]
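
The role of the two terms can be sketched numerically. In the example below, the data term is a mean squared error on the reliable samples and the regularizer is the L1 norm of the DFT coefficients, one possible sparsity prior. Both choices, the weight and all function names are assumptions for illustration, not a formulation taken from the cited works. Two candidates that agree with the observations have the same data term, and only the regularizer tells them apart.

```python
import numpy as np

def data_term(x_hat, x_tilde, m):
    """Mean squared error L, evaluated on the reliable samples only."""
    reliable = m > 0
    return float(np.mean((m * x_hat - x_tilde)[reliable] ** 2))

def sparsity_regularizer(x_hat):
    """Example R: L1 norm of the DFT coefficients, normalized by length."""
    return float(np.sum(np.abs(np.fft.rfft(x_hat))) / len(x_hat))

def objective(x_hat, x_tilde, m, weight=0.1):
    """L + weight * R; the weight is an illustrative trade-off parameter."""
    return data_term(x_hat, x_tilde, m) + weight * sparsity_regularizer(x_hat)

n = np.arange(512)
x = np.cos(2 * np.pi * 8 * n / 512)       # clean signal, sparse in the DFT domain
m = np.ones(512)
m[200:232] = 0.0                          # 32 missing samples
x_tilde = m * x

rng = np.random.default_rng(0)
candidate_true = x.copy()                 # gap filled with the true samples
candidate_noise = x.copy()
candidate_noise[200:232] = rng.standard_normal(32)   # gap filled with noise

loss_true = data_term(candidate_true, x_tilde, m)
loss_noise = data_term(candidate_noise, x_tilde, m)
score_true = objective(candidate_true, x_tilde, m)
score_noise = objective(candidate_noise, x_tilde, m)
```

Both candidates fit the reliable samples perfectly, so the data term alone cannot rank them; the regularized objective prefers the candidate with the sparser spectrum.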

Techniques

Various techniques exist to perform audio inpainting. They can differ significantly depending on factors such as the specific application requirements, the length of the gaps and the available data.[3] In the literature, these techniques are broadly divided into model-based techniques (sometimes also referred to as signal processing techniques)[3] and data-driven techniques.[2]

Model-based techniques

Model-based techniques involve the exploitation of mathematical models or assumptions about the underlying structure of the audio signal. These models can be based on prior knowledge of the audio content or statistical properties observed in the data. By leveraging these models, missing or corrupted portions of the audio signal can be inferred or estimated.[1]

Autoregressive models are one example of model-based techniques.[5][13] These methods interpolate or extrapolate the missing samples based on the neighboring values, using mathematical functions to approximate the missing data. In particular, in autoregressive models the missing samples are filled in through linear prediction.[14] The autoregressive coefficients necessary for this prediction are learned from the surrounding audio data, specifically from the data adjacent to each gap.[5][13]
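
A minimal sketch of this idea is given below. It fits the autoregressive coefficients by least squares on the context before and after the gap, predicts the gap forward and backward, and crossfades the two predictions. The function names, the least-squares fit and the linear crossfade are simplifications assumed for this example and do not reproduce the algorithms of the cited works. The model order of 4 is chosen only because the toy signal is a sum of two sinusoids.

```python
import numpy as np

def fit_ar(segment, order):
    """Least-squares fit of a such that s[n] ~ sum_k a[k] * s[n-1-k]."""
    rows = [segment[i:i + order][::-1] for i in range(len(segment) - order)]
    targets = segment[order:]
    coeffs, *_ = np.linalg.lstsq(np.array(rows), targets, rcond=None)
    return coeffs

def extrapolate(history, coeffs, count):
    """Predict `count` samples following `history` by linear prediction."""
    order = len(coeffs)
    buf = np.array(history[-order:], dtype=float)   # oldest ... newest
    out = np.empty(count)
    for i in range(count):
        out[i] = np.dot(coeffs, buf[::-1])          # newest sample first
        buf = np.append(buf[1:], out[i])
    return out

def ar_inpaint(signal, start, length, order=4, context=256):
    """Fill signal[start:start+length] from the context on both sides."""
    end = start + length
    left = signal[max(0, start - context):start]
    right = signal[end:end + context]
    forward = extrapolate(left, fit_ar(left, order), length)
    # Backward prediction: apply the same procedure to the time-reversed
    # right-hand context, then reverse the result.
    right_rev = right[::-1]
    backward = extrapolate(right_rev, fit_ar(right_rev, order), length)[::-1]
    fade = np.linspace(1.0, 0.0, length)            # linear crossfade weights
    restored = signal.copy()
    restored[start:end] = fade * forward + (1.0 - fade) * backward
    return restored

sample_rate = 16000
t = np.arange(8000) / sample_rate
x = np.sin(2 * np.pi * 1000 * t) + 0.5 * np.sin(2 * np.pi * 2500 * t)
x_tilde = x.copy()
x_tilde[4000:4080] = 0.0                            # 5 ms gap
x_hat = ar_inpaint(x_tilde, start=4000, length=80, order=4)
max_error = float(np.max(np.abs(x_hat - x)))
```

On this noise-free stationary signal the reconstruction is essentially exact; on real audio the prediction degrades as the gap grows and the signal stops being stationary.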

Some more recent techniques approach audio inpainting by representing audio signals as sparse linear combinations of a limited number of basis functions (for example, those of the short-time Fourier transform).[1][15] In this context, the aim is to find the sparse representation of the missing section of the signal that most accurately matches the surrounding, unaffected signal.[1]
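
The sparsity idea can be sketched with a much simpler scheme than those in the cited works: iterative hard thresholding in the DFT domain, alternating between keeping the largest coefficients and restoring the reliable samples. This is a stand-in chosen for brevity (a single DFT over the whole signal instead of a short-time transform), and the function name and parameters are assumptions made for this example.

```python
import numpy as np

def sparse_dft_inpaint(x_tilde, m, num_coeffs, iterations=100):
    """Alternate between a sparse DFT approximation and data consistency."""
    reliable = m > 0
    x_hat = x_tilde.copy()
    for _ in range(iterations):
        spectrum = np.fft.fft(x_hat)
        # Hard thresholding: keep only the largest-magnitude coefficients.
        keep = np.argsort(np.abs(spectrum))[-num_coeffs:]
        sparse = np.zeros_like(spectrum)
        sparse[keep] = spectrum[keep]
        x_hat = np.real(np.fft.ifft(sparse))
        # Data consistency: the reliable samples are never modified.
        x_hat[reliable] = x_tilde[reliable]
    return x_hat

n = np.arange(256)
# Two sinusoids on exact DFT bins, i.e. four non-zero DFT coefficients.
x = np.cos(2 * np.pi * 5 * n / 256) + 0.8 * np.sin(2 * np.pi * 12 * n / 256)
m = np.ones(256)
m[100:116] = 0.0                          # 16 missing samples
x_tilde = m * x

x_hat = sparse_dft_inpaint(x_tilde, m, num_coeffs=4)
max_error = float(np.max(np.abs(x_hat - x)))
```

The toy signal is exactly sparse in the DFT basis, which is the favorable case; real audio is only approximately sparse and usually needs a time-frequency representation.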

The aforementioned methods perform best when filling in relatively short gaps, lasting only a few tens of milliseconds, and thus fall within the scope of short inpainting. However, these signal-processing techniques tend to struggle when dealing with longer gaps.[2] This limitation stems from the violation of the stationarity condition: the signal often changes significantly across the gap, so that the signal following the gap is substantially different from the signal preceding it.[2]

To overcome these limitations, some approaches also make strong assumptions about the fundamental structure of the gap itself, exploiting sinusoidal modeling[16] or similarity graphs[8] to inpaint longer missing portions of audio signals.

Data-driven techniques

Data-driven techniques rely on the analysis and exploitation of the available audio data. These techniques often employ deep learning algorithms that learn patterns and relationships directly from the provided data. They involve training models on large datasets of audio examples, allowing them to capture the statistical regularities present in the audio signals. Once trained, these models can be used to generate missing portions of the audio signal based on the learned representations, without being restricted by stationarity assumptions.[3] Data-driven techniques also offer the advantage of adaptability and flexibility, as they can learn from diverse audio datasets and potentially handle complex inpainting scenarios.[3]

As of today, such techniques constitute the state of the art in audio inpainting, being able to reconstruct gaps of hundreds of milliseconds or even seconds. This performance is made possible by generative models, which can generate novel content to fill in the missing portions. For example, generative adversarial networks (GANs), which are the state of the art among generative models in many areas, rely on two competing neural networks trained simultaneously in a two-player minimax game: the generator produces new data from samples of a random variable, while the discriminator attempts to distinguish between generated and real data.[17] During training, the generator's objective is to fool the discriminator, while the discriminator learns to better classify real and fake data.[17]
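
The two-player game described above is commonly written as the following value function, given here as a sketch following the original formulation of generative adversarial networks.[17] The symbols [math]\displaystyle{ G }[/math] (generator), [math]\displaystyle{ D }[/math] (discriminator), [math]\displaystyle{ p_{\text{data}} }[/math] (distribution of the real data) and [math]\displaystyle{ p_{\mathbf{z}} }[/math] (distribution of the random input) are notation introduced for this example:

[math]\displaystyle{ \min_G \max_D ~ \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log(1 - D(G(\mathbf{z})))] }[/math].

The discriminator maximizes both terms by assigning high scores to real data and low scores to generated data, while the generator minimizes the second term by producing samples that the discriminator scores as real.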

In GAN-based inpainting methods, the generator acts as a context encoder and produces a plausible completion for the gap given only the available information surrounding it.[3] The discriminator is used to train the generator and tests the consistency of the produced inpainted audio.[3]

More recently, diffusion models have also established themselves as the state of the art among generative models in many fields, often outperforming GAN-based solutions. For this reason, they have also been applied to the audio inpainting problem, with satisfactory results.[2] These models generate new data instances by inverting the diffusion process, in which data samples are progressively transformed into Gaussian noise.[2]
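
The forward (noising) half of this process can be sketched in a few lines. The example assumes a variance-preserving formulation with a linear noise schedule, a common textbook choice; the schedule, the number of steps and the function name are assumptions and need not match the parametrization used in the cited work. The learned reverse process, which performs the actual generation, is not shown.

```python
import numpy as np

def forward_diffusion(x0, step, alpha_bar, rng):
    """Sample the noisy signal at `step` directly from the clean signal x0."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[step]) * x0 + np.sqrt(1.0 - alpha_bar[step]) * noise

num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)   # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)          # fraction of signal power kept

rng = np.random.default_rng(0)
x0 = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)

x_early = forward_diffusion(x0, 0, alpha_bar, rng)              # almost clean
x_late = forward_diffusion(x0, num_steps - 1, alpha_bar, rng)   # almost pure noise

signal_gain_last = float(np.sqrt(alpha_bar[-1]))
```

At the first step the signal is barely perturbed, while at the last step almost nothing of it remains; a diffusion model is trained to undo this corruption step by step.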

One drawback of generative models is that they typically need a huge amount of training data. This is necessary for the network to generalize well and to produce coherent audio that also exhibits some degree of structural complexity.[6] Nonetheless, some works have demonstrated that capturing the essence of an audio signal is also possible using only a few tens of seconds from a single training sample.[6][18][19] This is done by overfitting a generative neural network to a single training audio signal. In this way, researchers were able to perform audio inpainting without exploiting large datasets.[6][19]

Applications

Audio inpainting finds applications in a wide range of fields, including audio restoration and audio forensics, among others. In these fields, audio inpainting can be used to eliminate noise, glitches, or undesired distortions from an audio recording, thus enhancing its quality and intelligibility. It can also be employed to recover deteriorated old recordings that have been affected by local modifications or have missing audio samples due to scratches on CDs.[2]

Audio inpainting is also closely related to packet loss concealment (PLC). In the PLC problem, it is necessary to compensate for the loss of audio packets in communication networks. While both problems aim at filling gaps in an audio signal, PLC has stricter computation-time constraints, and only the packets preceding a gap are considered reliable (the process is said to be causal).[20][2]
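
The causality constraint can be illustrated with a deliberately simple baseline, packet repetition, in which a lost packet is replaced by the most recent packet that was received. This baseline is not taken from the cited works and the function name is hypothetical; the point is only that, unlike the two-sided methods above, nothing after the gap is used.

```python
import numpy as np

def conceal_by_repetition(packets, received):
    """Causal concealment: replace each lost packet with the last good one."""
    output = []
    last_good = None
    for packet, ok in zip(packets, received):
        if ok:
            last_good = packet
            output.append(packet)
        elif last_good is not None:
            output.append(last_good.copy())      # repeat the previous packet
        else:
            output.append(np.zeros_like(packet)) # nothing received yet: silence
    return np.concatenate(output)

# Four packets of four samples each; packets 2 and 4 are lost in transit.
packets = [np.full(4, float(i)) for i in range(1, 5)]
received = [True, False, True, False]
concealed = conceal_by_repetition(packets, received)
```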

References

  1. Mokrý, Ondřej; Rajmic, Pavel (2020). "Audio Inpainting: Revisited and Reweighted". IEEE/ACM Transactions on Audio, Speech, and Language Processing 28: 2906–2918. doi:10.1109/TASLP.2020.3030486.
  2. Moliner, Eloi (2023). "Diffusion-Based Audio Inpainting". arXiv:2305.15266 [eess.AS].
  3. Marafioti, Andres; Majdak, Piotr; Holighaus, Nicki; Perraudin, Nathanael (January 2021). "GACELA: A Generative Adversarial Context Encoder for Long Audio Inpainting of Music". IEEE Journal of Selected Topics in Signal Processing 15 (1): 120–131. doi:10.1109/JSTSP.2020.3037506. Bibcode: 2021ISTSP..15..120M.
  4. Adler, Amir; Emiya, Valentin; Jafari, Maria G.; Elad, Michael; Gribonval, Rémi; Plumbley, Mark D. (March 2012). "Audio Inpainting". IEEE Transactions on Audio, Speech, and Language Processing 20 (3): 922–932. doi:10.1109/TASL.2011.2168211. https://openresearch.surrey.ac.uk/view/delivery/44SUR_INST/12139149600002346/13140693880002346.
  5. Janssen, A.; Veldhuis, R.; Vries, L. (April 1986). "Adaptive interpolation of discrete-time signals that can be modeled as autoregressive processes". IEEE Transactions on Acoustics, Speech, and Signal Processing 34 (2): 317–330. doi:10.1109/TASSP.1986.1164824. https://pure.tue.nl/ws/files/3077308/Metis235417.pdf.
  6. Greshler, Gal; Shaham, Tamar; Michaeli, Tomer (2021). "Catch-A-Waveform: Learning to Generate Audio from a Single Short Example". Advances in Neural Information Processing Systems (Curran Associates, Inc.) 34: 20916–20928. https://proceedings.neurips.cc/paper/2021/hash/af21d0c97db2e27e13572cbf59eb343d-Abstract.html.
  7. Applications of digital signal processing to audio and acoustics (6. Pr ed.). Boston, Mass.: Kluwer. 2003. pp. 133–194. ISBN 978-0-7923-8130-3.
  8. Perraudin, Nathanael; Holighaus, Nicki; Majdak, Piotr; Balazs, Peter (June 2018). "Inpainting of Long Audio Segments With Similarity Graphs". IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (6): 1083–1094. doi:10.1109/TASLP.2018.2809864.
  9. Marafioti, Andres; Perraudin, Nathanael; Holighaus, Nicki; Majdak, Piotr (December 2019). "A Context Encoder For Audio Inpainting". IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12): 2362–2372. doi:10.1109/TASLP.2019.2947232.
  10. Ulyanov, Dmitry; Vedaldi, Andrea; Lempitsky, Victor (1 July 2020). "Deep Image Prior". International Journal of Computer Vision 128 (7): 1867–1888. doi:10.1007/s11263-020-01303-4.
  11. Pezzoli, Mirco; Perini, Davide; Bernardini, Alberto; Borra, Federico; Antonacci, Fabio; Sarti, Augusto (January 2022). "Deep Prior Approach for Room Impulse Response Reconstruction". Sensors 22 (7): 2710. doi:10.3390/s22072710. PMID 35408325. Bibcode: 2022Senso..22.2710P.
  12. Kong, Fantong; Picetti, Francesco; Lipari, Vincenzo; Bestagini, Paolo; Tang, Xiaoming; Tubaro, Stefano (2022). "Deep Prior-Based Unsupervised Reconstruction of Irregularly Sampled Seismic Data". IEEE Geoscience and Remote Sensing Letters 19: 1–5. doi:10.1109/LGRS.2020.3044455. Bibcode: 2022IGRSL..1944455K.
  13. Etter, W. (May 1996). "Restoration of a discrete-time signal segment by interpolation based on the left-sided and right-sided autoregressive parameters". IEEE Transactions on Signal Processing 44 (5): 1124–1135. doi:10.1109/78.502326. Bibcode: 1996ITSP...44.1124E.
  14. O'Shaughnessy, D. (February 1988). "Linear predictive coding". IEEE Potentials 7 (1): 29–32. doi:10.1109/45.1890.
  15. Mokry, Ondrej; Zaviska, Pavel; Rajmic, Pavel; Vesely, Vitezslav (September 2019). "Introducing SPAIN (SParse Audio INpainter)". 2019 27th European Signal Processing Conference (EUSIPCO). pp. 1–5. doi:10.23919/EUSIPCO.2019.8902560. ISBN 978-9-0827-9703-9.
  16. Lagrange, Mathieu; Marchand, Sylvain; Rault, Jean-Bernard (15 October 2005). "Long Interpolation of Audio Signals Using Linear Prediction in Sinusoidal Modeling". Journal of the Audio Engineering Society 53 (10): 891–905. http://www.aes.org/e-lib/browse.cfm?elib=13390.
  17. Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). "Generative Adversarial Nets". Advances in Neural Information Processing Systems (Curran Associates, Inc.) 27. https://proceedings.neurips.cc/paper_files/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html.
  18. Tian, Yapeng; Xu, Chenliang; Li, Dingzeyu (2019). "Deep Audio Prior". arXiv:1912.10292 [cs.SD].
  19. Turetzky, Arnon; Michelson, Tzvi; Adi, Yossi; Peleg, Shmuel (18 September 2022). "Deep Audio Waveform Prior". Interspeech 2022: 2938–2942. doi:10.21437/Interspeech.2022-10735.
  20. Diener, Lorenz; Sootla, Sten; Branets, Solomiya; Saabas, Ando; Aichner, Robert; Cutler, Ross (18 September 2022). "INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge". Interspeech 2022. pp. 580–584. doi:10.21437/Interspeech.2022-10829.