SpeechRefiner: Towards Perceptual Quality Refinement for Front-End Algorithms
Abstract:
Speech pre-processing techniques, such as denoising, de-reverberation, and speech separation, are widely adopted as enabling technologies or processing front-ends for downstream tasks. However, the processing can be insufficient, leaving residual noise or introducing new artifacts. These issues are often not reflected in metrics such as SI-SNRi, yet they are prominently perceived by human listeners. In this work, we propose SpeechRefiner, a post-processor that leverages Conditional Flow Matching (CFM) to enhance speech perceptual quality. SpeechRefiner can be integrated with different front-end algorithms. We first compare our method with several recently proposed task-specific refinement baselines, and then validate its effectiveness on an internal processing pipeline that incorporates multiple front-end algorithms. Experimental results demonstrate that SpeechRefiner generalizes well across diverse impairment sources and significantly improves speech perceptual quality.
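The paper's architecture, features, and conditioning scheme are not reproduced on this page. Purely as an illustration of the CFM objective mentioned above, a minimal training-step sketch on mel-spectrogram features might look as follows; the network, feature shapes, and all names here are assumptions for this sketch, not the SpeechRefiner implementation.

```python
# Minimal sketch of a Conditional Flow Matching (CFM) training step.
# Everything below (network, shapes, names) is assumed for illustration only.
import torch
import torch.nn as nn

class TinyVectorField(nn.Module):
    """Toy stand-in for a refiner network v_theta(x_t, t, cond)."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        # Input: noisy sample x_t, conditioning mel (front-end output), and time t.
        self.net = nn.Sequential(
            nn.Linear(2 * n_mels + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, x_t, t, cond):
        # x_t, cond: (batch, frames, n_mels); t: (batch,)
        t = t[:, None, None].expand(x_t.size(0), x_t.size(1), 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_loss(model, x1_clean, cond):
    """CFM objective with a straight-line probability path:
    x_t = (1 - t) * x0 + t * x1, target velocity u = x1 - x0."""
    x0 = torch.randn_like(x1_clean)        # noise sample
    t = torch.rand(x1_clean.size(0))       # uniform time in [0, 1]
    tt = t[:, None, None]
    x_t = (1 - tt) * x0 + tt * x1_clean
    target = x1_clean - x0
    pred = model(x_t, t, cond)
    return ((pred - target) ** 2).mean()

# Usage: clean mels are the targets, front-end outputs are the condition.
model = TinyVectorField()
x1 = torch.randn(4, 120, 80)    # clean-speech mel spectrograms (dummy data)
cond = torch.randn(4, 120, 80)  # front-end processed mels (dummy data)
loss = cfm_loss(model, x1, cond)
loss.backward()
```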
Comparison on various speech front-end algorithms:
This section presents audio samples from different systems across various speech front-end algorithms. Specifically, "SpeechRefiner (Specific)", corresponding to Table 2 of the paper, evaluates SpeechRefiner against state-of-the-art refiners designed for specific front-end tasks, while "SpeechRefiner (General)", corresponding to Table 3, is trained on the internal dataset to further validate the effectiveness of the proposed approach.
Enhancement & Dereverberation: Internal
System                   | Audio No.1 | Audio No.2 | Audio No.3 | Audio No.4
Distorted Speech         | (audio)    | (audio)    | (audio)    | (audio)
SpeechRefiner (General)  | (audio)    | (audio)    | (audio)    | (audio)
Enhancement: CDiffuSE [1]
System                   | Audio No.5 | Audio No.6 | Audio No.7 | Audio No.8
Distorted Speech         | (audio)    | (audio)    | (audio)    | (audio)
VoiceFixer [2]           | (audio)    | (audio)    | (audio)    | (audio)
Diffiner+ [3]            | (audio)    | (audio)    | (audio)    | (audio)
SpeechRefiner (Specific) | (audio)    | (audio)    | (audio)    | (audio)
Dereverberation: SGMSE+ [4]
System                   | Audio No.9 | Audio No.10 | Audio No.11 | Audio No.12
Distorted Speech         | (audio)    | (audio)     | (audio)     | (audio)
VoiceFixer [2]           | (audio)    | (audio)     | (audio)     | (audio)
StoRM [5]                | (audio)    | (audio)     | (audio)     | (audio)
SpeechRefiner (Specific) | (audio)    | (audio)     | (audio)     | (audio)
Separation: SepFormer [6]
System                   | Audio No.13 | Audio No.14 | Audio No.15 | Audio No.16
Distorted Speech         | (audio)     | (audio)     | (audio)     | (audio)
VoiceFixer [2]           | (audio)     | (audio)     | (audio)     | (audio)
Fast-GeCo [7]            | (audio)     | (audio)     | (audio)     | (audio)
SpeechRefiner (Specific) | (audio)     | (audio)     | (audio)     | (audio)
Audio-TSE: SpEx+ [8]
System                   | Audio No.17 | Audio No.18 | Audio No.19 | Audio No.20
Distorted Speech         | (audio)     | (audio)     | (audio)     | (audio)
SpeechRefiner (General)  | (audio)     | (audio)     | (audio)     | (audio)
AV-TSE: MuSE [9]
System                   | Audio No.21 | Audio No.22 | Audio No.23 | Audio No.24
Distorted Speech         | (audio)     | (audio)     | (audio)     | (audio)
SpeechRefiner (General)  | (audio)     | (audio)     | (audio)     | (audio)
References:
[1] Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, "Conditional diffusion probabilistic model for speech enhancement," in Proc. IEEE ICASSP, 2022, pp. 7402–7406.
[2] H. Liu, Q. Kong, Q. Tian, Y. Zhao, D. Wang, C. Huang, and Y. Wang, "VoiceFixer: Toward general speech restoration with neural vocoder," arXiv preprint arXiv:2109.13731, 2021.
[3] R. Sawata, N. Murata, Y. Takida, T. Uesaka, T. Shibuya, S. Takahashi, and Y. Mitsufuji, "Diffiner: A versatile diffusion-based generative refiner for speech enhancement," arXiv preprint arXiv:2210.17287, 2022.
[4] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, "Speech enhancement and dereverberation with diffusion-based generative models," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023.
[5] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, "StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
[6] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, "Attention is all you need in speech separation," in Proc. IEEE ICASSP, 2021, pp. 21–25.
[7] H. Wang, J. Villalba, L. Moro-Velazquez, J. Hai, T. Thebaud, and N. Dehak, "Noise-robust speech separation with fast generative correction," arXiv preprint arXiv:2406.07461, 2024.
[8] M. Ge, C. Xu, L. Wang, E. S. Chng, J. Dang, and H. Li, "SpEx+: A complete time domain speaker extraction network," arXiv preprint arXiv:2005.04686, 2020.
[9] Z. Pan, R. Tao, C. Xu, and H. Li, "MuSE: Multi-modal target speaker extraction with visual cues," in Proc. IEEE ICASSP, 2021, pp. 6678–6682.