SeFa failure mode for Water - fill-level
For the Water dataset, SeFa finds the second dimension (see Table 3 in the paper) to be the one associated
with fill-level.
The most prominent dimension, i.e., the dimension with the highest singular value (dimension 0), is associated
with "perpetually filling" the water container, without ever reaching fill-level=10 (a full container).
The other dimension (dimension 2) is associated with "perpetually unfilling" the water container, never
leaving fill-level=10 or never reaching fill-level=0.
Whimsical, sure, but not controllable. This makes one wonder, especially for dimension 0, which has
the highest singular value (from the eigendecomposition of the weights): if not fill-level, then what
other factor of variation did SeFa find?
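As a reference for how these singular values arise, below is a minimal sketch of SeFa-style direction extraction, i.e., taking the right singular vectors of the weight matrix that projects the latent code (equivalently, the eigenvectors of A^T A). The weight shapes and variable names are illustrative and not taken from our code.

```python
import numpy as np

def sefa_directions(style_weight: np.ndarray, k: int = 3):
    """SeFa-style sketch: directions are the top right singular vectors of the
    weight matrix A (equivalently, eigenvectors of A^T A).

    style_weight: (out_features, latent_dim) weight matrix (placeholder shape).
    Returns the top-k unit-norm latent directions and their singular values."""
    _, singular_values, vt = np.linalg.svd(style_weight, full_matrices=False)
    directions = vt[:k]  # each row is one latent-space direction
    return directions, singular_values[:k]

# Hypothetical usage: dimension 0 has the largest singular value,
# matching the ordering discussed above.
A = np.random.randn(512, 128)  # placeholder for the actual generator weights
dirs, svals = sefa_directions(A, k=3)
print(svals)
```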
Sample 1 (Dim 0: perpetually filling, no ending.)
Sample 2 (Dim 2: perpetually unfilling, no starting.)
Ablation studies on the number of examples needed for guidance - Gaver Sound Synthesis
In this section, we evaluate the effect of the number of Gaver samples (N) used to find the directional
vectors for editing the attributes of Brightness and Impact Type.
The first column shows the number of samples used across clusters.
The Brightness or Impact Type changes from left to right. As N increases, the directional-vector edits
become more effective.
The edits also preserve the other, un-edited attributes better at higher N.
For instance, for Brightness, the Rate is not preserved for any N < 10, and for N < 6 the Impact Type is
only partially preserved (some hits become scratches).
Brightness (reduces from left to right)
Impact Type (sounds become scratchier from left to right)
Ablation studies - Effect of the value of the scalar $\alpha$
In this section, we evaluate the effect of the scalar $\alpha$ in equation (4) in the paper:
$$ \mathbf{w}_{\text{edited}} = \mathbf{w} + \alpha \, \mathbf{d} $$
where $\mathbf{d}$ is the direction vector between the two prototypes. For all examples on this
Supplementary webpage, we use
$$ 0 < \alpha < 1. $$
Here we show the effect of using values of $\alpha$ greater than $1$. We see that as $\alpha$ moves
further beyond $1$ (i.e., the edit moves further than one full step of $\mathbf{d}$ away from
$\mathbf{w}$), the samples tend to go out of distribution.
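A minimal sketch of the edit in equation (4), sweeping $\alpha$ past 1 to illustrate the extrapolation regime discussed above; the generator call and the prototype vectors are placeholders, not code from our repository.

```python
import numpy as np

def edit_latent(w: np.ndarray, d: np.ndarray, alpha: float) -> np.ndarray:
    """Apply equation (4): w_edited = w + alpha * d."""
    return w + alpha * d

# Hypothetical setup: w is an encoded W-vector, and d is the difference
# between two attribute prototypes (e.g., bright minus dull cluster means).
latent_dim = 128
w = np.random.randn(latent_dim)
prototype_a = np.random.randn(latent_dim)
prototype_b = np.random.randn(latent_dim)
d = prototype_b - prototype_a

# Values in (0, 1) interpolate towards the target prototype, while alpha > 1
# extrapolates and tends to push samples out of distribution.
for alpha in [0.25, 0.5, 0.75, 1.0, 1.5, 2.0]:
    w_edited = edit_latent(w, d, alpha)
    # spectrogram = generator.synthesis(w_edited)  # placeholder generator call
```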
sample 1
(Editing Brightness)
sample 2
(Editing Brightness)
sample 3
(Editing Impact Type)
sample 4
(Editing Impact Type)
sample 7
(Editing Water
Fill-Level)
StyleGAN2 Training Details
We set the Z and W space dimensions both to 128 for all our experiments across the two datasets. We also
use only 4 mapping layers (compared to 8 in the original StyleGAN2 paper).
Further, we train StyleGAN2 on log-magnitude spectrogram representations generated using a Gabor
transform (n_frames=256, stft_channels=512, hop_size=128), i.e., a Short-Time Fourier Transform (STFT)
with a Gaussian window, and use Phase Gradient Heap Integration (PGHI) for high-fidelity inversion of the
texture spectrograms back to audio. For the Greatest Hits dataset, we train the models for 2800 kimgs with
a batch size of 16, taking ~20 hours on a single RTX 2080 Ti GPU with 11 GB memory. For the Water Filling
dataset, we train for 1400 kimgs with the same batch size, taking ~14 hours on the same GPU. The quality
of the generated sounds in terms of Fréchet Audio Distance (FAD), along with the StyleGAN2 code adapted
for audio textures, can be found below.
Table: StyleGAN2 Fréchet Audio Distance (FAD)

| Dataset | w-dim & z-dim | Number of kimgs (iterations) | FAD Score |
| --- | --- | --- | --- |
| Greatest Hits Dataset | 128 | 2800 | 0.6 |
| Water Filling Dataset | 128 | 1400 | 1.17 |
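As an illustration of the spectrogram representation described above, here is a minimal sketch using librosa's Gaussian-window STFT with stft_channels=512 and hop_size=128; the sample rate and the Gaussian window width are assumptions, and the actual PGHI pipeline lives in a dedicated implementation (e.g., the tifresi library).

```python
import numpy as np
import librosa

def log_magnitude_spectrogram(audio_path: str) -> np.ndarray:
    """Gaussian-window STFT roughly matching stft_channels=512, hop_size=128.
    The sample rate and Gaussian std (0.1 * n_fft) are illustrative choices,
    not the exact training configuration."""
    y, sr = librosa.load(audio_path, sr=16000)  # assumed sample rate
    stft = librosa.stft(
        y,
        n_fft=512,
        hop_length=128,
        window=("gaussian", 0.1 * 512),  # Gaussian window -> Gabor-like transform
    )
    log_mag = np.log(np.abs(stft) + 1e-6)  # log-magnitude representation
    return log_mag[:, :256]                # keep ~256 frames per example

# Inversion back to audio uses PGHI (phase reconstruction from the
# log-magnitude spectrogram), handled by a separate library.
```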
Encoder/GAN Inversion Training Details
We use a ResNet-34 backbone as the architecture for our GAN inversion
network. For both datasets, we train the Encoder for 1500 iterations with a batch size of 8 and
choose the checkpoint with the lowest validation loss for inference. As in the original GAN
Encoder paper, we use an Adam optimizer with a learning rate of 0.0001.
Further, for the Water dataset, we apply thresholding at -17 dB, i.e., we mask the frequency
components with magnitude below -17 dB. For both datasets, training took ~25 hours to
complete on a single GPU.
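A minimal sketch of the -17 dB thresholding step, assuming magnitudes are expressed in dB relative to the spectrogram maximum (the reference level is an assumption for illustration).

```python
import numpy as np

def threshold_spectrogram(mag: np.ndarray, threshold_db: float = -17.0) -> np.ndarray:
    """Mask frequency components whose magnitude falls below threshold_db.

    mag: linear-magnitude spectrogram (freq x time). The dB reference
    (the spectrogram maximum) is an assumption for illustration."""
    mag_db = 20.0 * np.log10(mag / (mag.max() + 1e-12) + 1e-12)
    mask = mag_db >= threshold_db
    return mag * mask

# Hypothetical usage on a random spectrogram-shaped array.
spec = np.abs(np.random.randn(257, 256))
spec_thresholded = threshold_spectrogram(spec, threshold_db=-17.0)
```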
Table: GAN Inversion/Encoder (netE) Fréchet Audio Distance (FAD) (w/ Thresholding)

| Dataset | Number of kimgs (iterations) | Gaver Sounds FAD Score | Real-World Sounds FAD Score |
| --- | --- | --- | --- |
| Greatest Hits Dataset | 1500 | 4.16 | 2.83 |
| Water Filling Dataset | 1500 | 7.92 | 1.42 |
Gaver Sound Synthesis Details
To model impact sounds, such as those in the Greatest Hits dataset, we use a combination of
sounds synthesized using methods 1 and 2. For method 1, we choose a damping constant
δn=0.001 for hard surfaces and δn=0.5 for soft surfaces. We provide variation
in the generated sounds by using different impact surface sizes, φ, and n (the number of
partials). We vary the first partial of ω between 60-240 Hz for large impact surfaces and
between 250-660 Hz for smaller surfaces. For method 2, we vary the impulse width of each impact
between 0.4-1.0 seconds to model scratches and between 0.1-0.4 seconds to model sharp
hits. Further, we model dull sounds by configuring low frequency bands roughly between
10 Hz-1.5 kHz and bright sounds using frequency bands above 4 kHz. Water filling Gaver sounds are
modelled as a sequence of concatenated impulses, each modelled as an individual water drop.
We generate each drop using method 1 with an impulse width of 0.05 seconds.
Each fill-level is controlled by linearly increasing or decreasing ω and its partials
across the sound sample.
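For concreteness, below is a minimal sketch of a method-1 style impact sound as a sum of exponentially damped sinusoidal partials; the harmonic partial spacing, amplitude roll-off, and envelope form are assumptions for illustration, while the damping constants and frequency ranges follow the values quoted above.

```python
import numpy as np

def gaver_impact(f0: float, damping: float, n_partials: int = 5,
                 duration: float = 0.4, sr: int = 16000) -> np.ndarray:
    """Method-1 style impact: a sum of exponentially damped sinusoidal partials.

    f0: first partial (e.g., 60-240 Hz for large surfaces, 250-660 Hz for small).
    damping: e.g., 0.001 for hard surfaces, 0.5 for soft surfaces.
    Harmonic partial spacing, 1/k amplitude roll-off, and the exact envelope
    form are assumptions, not the paper's exact formulation."""
    t = np.arange(int(duration * sr)) / sr
    sound = np.zeros_like(t)
    for k in range(1, n_partials + 1):
        freq = f0 * k                                   # assumed harmonic partials
        envelope = np.exp(-2 * np.pi * damping * freq * t)
        sound += (1.0 / k) * envelope * np.sin(2 * np.pi * freq * t)
    return sound / np.max(np.abs(sound))

# A water "drop" can reuse the same routine with a short impulse width, and a
# filling sound concatenates drops while sweeping f0 with the fill level.
drop = gaver_impact(f0=400.0, damping=0.001, duration=0.05)
```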
In all our experiments, we use 10 synthetic Gaver examples (5 per semantic attribute cluster) to
generate the guidance vectors for controllable generation. Please see the ablation studies section of
this webpage for guidance vectors generated using different numbers of Gaver samples. Our
code repository has all the Gaver configurations we used in our experiments.
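A minimal sketch of how a guidance direction can be formed from the encoded Gaver examples (the mean W-vector of each attribute cluster, differenced); the encoder outputs are stubbed with random arrays here, and the exact implementation is in our code repository.

```python
import numpy as np

def guidance_direction(w_cluster_a: np.ndarray, w_cluster_b: np.ndarray) -> np.ndarray:
    """Direction vector d between two attribute prototypes.

    w_cluster_a / w_cluster_b: (n_samples, latent_dim) W-vectors of the encoded
    Gaver examples for the two ends of a semantic attribute
    (e.g., 5 dull and 5 bright examples)."""
    prototype_a = w_cluster_a.mean(axis=0)  # cluster prototype = mean W-vector
    prototype_b = w_cluster_b.mean(axis=0)
    return prototype_b - prototype_a

# Hypothetical usage with 5 encoded examples per cluster and a 128-d W space.
w_dull = np.random.randn(5, 128)    # placeholder for encoder outputs
w_bright = np.random.randn(5, 128)
d = guidance_direction(w_dull, w_bright)
```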
UMAP Clusters for Real and Synthetic Sounds based on semantic attributes
This section shows the UMAP clusters for the Greatest Hits dataset for the attributes of
Brightness, Impact Type and Rate. The clusters are generated for both
real sounds (from the training dataset) and Gaver sounds.
We find the W-vector for each sound using the Encoder, and then run the UMAP algorithm (a
dimensionality-reduction technique similar to t-SNE) on the W-vectors to find the clusters.
Notice how separable the clusters for each attribute are, and how similar the clusters for the
synthetic sounds are to those for the real sounds.
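A minimal sketch of the clustering step, assuming the W-vectors have already been obtained from the Encoder; the umap-learn parameters shown are library defaults, not necessarily those used for the figures.

```python
import numpy as np
import umap  # pip install umap-learn

def umap_embed(w_vectors: np.ndarray) -> np.ndarray:
    """Project (n_sounds, latent_dim) W-vectors down to 2-D for plotting."""
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
    return reducer.fit_transform(w_vectors)

# Hypothetical usage: real and Gaver W-vectors projected into the same space.
w_real = np.random.randn(500, 128)   # placeholder encoder outputs
w_gaver = np.random.randn(50, 128)
embedding = umap_embed(np.concatenate([w_real, w_gaver], axis=0))
```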
Re-scoring Classifier Details
We use a classifier based on this paper.
For the classifier, we use a DenseNet-based network (with pre-training), where the last layer is
modified depending on the number of classes we need. For binary re-scoring analysis, the number of
classes is 2 (to indicate the presence or absence of the semantic attribute).
The input to the classifier is 3 mel-spectrograms stacked along the channel axis. As in the
original paper, we follow this method to capture information at 3 different time scales, i.e., we
compute the mel-spectrogram of a signal using window sizes and hop lengths of [25 ms,
10 ms], [50 ms, 25 ms], and [100 ms, 50 ms] for the three channels respectively. The different window
sizes and hop lengths ensure the network receives different levels of frequency- and time-domain
information on each channel.
We use an Adam optimizer with a learning rate of 0.0001 and a weight decay of 0.001, and train for
100 epochs.
The codebase for the classifier can be found here:
Audio-Classification Github Fork
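The multi-scale input described above can be sketched as follows with librosa; the number of mel bins and the fixed number of output frames are illustrative assumptions, while the window/hop pairs follow the text.

```python
import numpy as np
import librosa

def multi_scale_melspec(y: np.ndarray, sr: int = 16000, n_mels: int = 128) -> np.ndarray:
    """Stack mel-spectrograms at 3 time scales along the channel axis.

    Window/hop pairs follow the text: [25ms, 10ms], [50ms, 25ms], [100ms, 50ms].
    n_mels and the fixed number of output frames are illustrative choices."""
    scales_ms = [(25, 10), (50, 25), (100, 50)]
    channels = []
    for win_ms, hop_ms in scales_ms:
        win = int(sr * win_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=win, win_length=win, hop_length=hop, n_mels=n_mels
        )
        mel_db = librosa.power_to_db(mel, ref=np.max)
        # Pad/trim each scale to a common number of frames so they can be stacked.
        mel_db = librosa.util.fix_length(mel_db, size=200, axis=1)
        channels.append(mel_db)
    return np.stack(channels, axis=0)  # shape: (3, n_mels, frames)

# Hypothetical usage on a 2-second clip.
clip = np.random.randn(2 * 16000)
x = multi_scale_melspec(clip)
```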
Curated Dataset for Training Re-Scoring Classifier
To train the attribute re-scoring classifier, we manually curate and label a small dataset from
both the Greatest Hits and Water Filling datasets.
We use this classifier to quantitatively evaluate the
effectiveness of our method in comparison with SeFa. For this, we manually curate approximately
250 two-second sound samples for each semantic attribute from both datasets. This manual curation
involved visually analysing the video and auditioning the associated sound files to detect the semantic
attribute being curated. For Brightness, we manually curated a set
of bright sounds from hits made on dense material surfaces such as glass, tile, ceramic or metal
to indicate the presence of the brightness attribute, and a set of dull or dark sounds made by
impacts on soft materials such as cloth, carpet or paper to indicate its absence.
For Rate, we curated sound samples with just 1-2 impact
sounds per sample to indicate low rate, and labelled all other samples as high rate. For
Impact Type, we curated a set of sound samples where the drumstick sharply hit
the surface and another set where the drumstick scratched the surface. For
Fill-Level for Water, we curated the sounds by sampling the first and last ~3
seconds of each file from the original dataset (of ~30-second-long files) to indicate
an empty bucket and a full bucket, respectively. These datasets are used to train attribute
change or re-scoring classifiers during evaluation.
Limitations & Future Work: Approaching semantic edits using text-to-audio models (Paper Section VI)
Although we are unable to perform a systematic comparison of our method with text-to-audio
models, in this section we demonstrate results from some text prompts that could assist in
achieving the semantic editing goals of our framework using text-to-audio models such as
AudioGen and AudioLDM.
For both impact sounds and water filling, we devise a 'starting' prompt (shown in the
first column in the examples below).
We then edit the prompt to change different semantic attributes using textual
descriptions (as demonstrated in the subsequent columns). The prompt edits are
highlighted in different colors.
In the examples below, we begin with one prompt and continuously edit it with additional
textual descriptions of how we want to modify the sounds.
For impact sounds, observe how adding the regularity and rate ("very fast") edits to the prompt
unintentionally also changes other attributes such as brightness. Further, modifying the prompt with
"very fast" negates the effect of "long sustain".
Note: These examples are not intended to be a systematic study or comparison of our
method to text-to-audio. We only show these examples here
to highlight some differences between these results and ours, and thus pave the way for our
future work.
Prompt: "stick hitting a very hard metal surface"
Prompt: "stick hitting a very hard metal surface with resonance and long sustain"
Prompt: "stick hitting a very hard metal surface with resonance and long sustain and regularity"
Prompt: "stick hitting a very hard metal surface with resonance and long sustain and regularity and very fast"
For water filling, as we are interested in continuously editing the fill level, we edit the
prompts using a start fill level and an end fill level.
For the prompts used below, AudioGen performs comparatively better than AudioLDM in terms of the
quality of the generated sounds.
Prompt: "water filling a metal container that is empty until it is quarter full"
Prompt: "water filling a metal container that is quarter full until it is half full"
Prompt: "water filling a metal container that is half full until it is three-fourths full"
Prompt: "water filling a metal container that is three-fourths full until it is completely full"