Audio synthesizer inversion in symmetric parameter spaces with approximately equivariant flow matching

Centre for Digital Music, Queen Mary University of London
ISMIR 2025

Abstract

Many audio synthesizers can produce the same signal given different parameter configurations, meaning the inversion from sound to parameters is an inherently ill-posed problem. We show that this is largely due to intrinsic symmetries of the synthesizer, and focus in particular on permutation invariance. First, we demonstrate on a synthetic task that regressing point estimates under permutation symmetry degrades performance, even when using a permutation-invariant loss function or symmetry-breaking heuristics. Then, viewing equivalent solutions as modes of a probability distribution, we show that a conditional generative model substantially improves performance. Acknowledging the invariance of the implicit parameter distribution, we find that a permutation-equivariant continuous normalizing flow improves performance further. To accommodate intricate symmetries in real synthesizers, we also propose a relaxed equivariance strategy that adaptively discovers relevant symmetries from data. Applying our method to Surge XT, a full-featured open-source synthesizer used in real-world audio production, we find that it outperforms regression and generative baselines across audio reconstruction metrics.
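To make the permutation symmetry concrete, here is a minimal sketch in Python. It is not the synthetic task from the paper: `toy_synth` is a hypothetical two-oscillator synthesizer, and `perm_invariant_mse` is one simple way to write a permutation-invariant loss, both introduced only for illustration. The sketch shows that swapping the oscillator parameters leaves the audio unchanged, while the average of the two equivalent parameter settings produces a different sound.

```python
from itertools import permutations

import numpy as np


def toy_synth(freqs_hz, sr=16000, dur=0.1):
    """Hypothetical toy synthesizer: an equal-weight sum of sinusoids."""
    t = np.arange(int(sr * dur)) / sr
    return np.sin(2 * np.pi * np.outer(freqs_hz, t)).sum(axis=0)


def perm_invariant_mse(pred, target):
    """Illustrative loss: smallest MSE over all orderings of the prediction."""
    return min(
        float(np.mean((pred[list(p)] - target) ** 2))
        for p in permutations(range(len(pred)))
    )


params_a = np.array([220.0, 550.0])
params_b = params_a[::-1]  # same oscillators, listed in the other order

audio_a = toy_synth(params_a)
audio_b = toy_synth(params_b)
same_signal = np.allclose(audio_a, audio_b)  # both orderings sound identical

# A naive parameter loss penalizes the swapped (but equivalent) solution,
# whereas the permutation-invariant loss does not.
naive_loss = float(np.mean((params_b - params_a) ** 2))
invariant_loss = perm_invariant_mse(params_b, params_a)

# Averaging the two equivalent solutions gives [385, 385], which is not a
# valid solution: the resulting audio differs from the target.
params_mean = (params_a + params_b) / 2
audio_mean = toy_synth(params_mean)
mean_error = float(np.mean((audio_mean - audio_a) ** 2))

print(same_signal, naive_loss, invariant_loss, mean_error)
```

With more oscillators the number of equivalent orderings grows factorially, which is one way to read the motivation for treating equivalent solutions as modes of a distribution rather than as a single point estimate.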

Audio Examples

In these examples, we conduct synthesizer inversion with audio generated by Surge XT.
Surge XT is a full-featured open-source synthesizer used in real-world audio production. It is highly configurable, with multiple methods for sound generation and several sources of symmetry and uncertainty. Clicking any play button opens a row of waveform and spectrogram displays for easier comparison. Press "Fetch more samples" to load a new set of audio examples.

We have peak-normalized all audio to -1.0 dBFS. However, these are still synthesized sounds, some of which may be unpleasant at high volume, particularly those produced by the collapsed VAE + RealNVP model. We thus advise you to start at a low volume level.
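For reference, peak normalization to a target level in dBFS can be sketched as follows. This is an assumed NumPy implementation for illustration, not the code used to prepare these examples; the function name `peak_normalize` and the `eps` guard are hypothetical.

```python
import numpy as np


def peak_normalize(audio, target_dbfs=-1.0, eps=1e-12):
    """Scale audio so its largest absolute sample sits at target_dbfs."""
    peak = np.max(np.abs(audio))
    if peak < eps:
        # Silent input: nothing to scale, return an unchanged copy.
        return audio.copy()
    # Convert the dBFS target to linear amplitude (0 dBFS = 1.0).
    target_amp = 10.0 ** (target_dbfs / 20.0)
    return audio * (target_amp / peak)


# Usage: a quiet 440 Hz tone at 16 kHz, brought up to -1.0 dBFS.
t = np.arange(16000) / 16000.0
quiet = 0.25 * np.sin(2 * np.pi * 440.0 * t)
normalized = peak_normalize(quiet)
peak_dbfs = 20.0 * np.log10(np.max(np.abs(normalized)))
```

Peak normalization only fixes the maximum sample value, so perceived loudness can still vary considerably between examples.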

Out-of-distribution Audio Examples

In the following examples, the target audio did not come from the synthesizer.
NSynth is a dataset of synthesized musical instrument sounds. FSD50K consists of a variety of non-musical and musical sounds taken from Freesound. These models were not trained with generalization in mind, and these results are presented with no pretense of out-of-distribution (OOD) robustness. Nonetheless, we encourage listening with an open mind and paying attention to which aspects of these sounds the model appears to respond to.
