MetaAudio: A Benchmark Breakdown


‘MetaAudio: A Few-Shot Classification Benchmark’ was released in early April. It contains a variety of benchmark results for researchers to beat in the future. This blog post aims to be a more easily digestible breakdown of the work. All of the code for MetaAudio can be found here.


Aims of the Work

As it currently stands, the majority of work focusing on few-shot learning exists within computer vision. MetaAudio aims to be an accompanying benchmark to those that already exist, hopefully reducing algorithmic bias toward the image domain. To provide this, we investigate a variety of experimental settings, drawing parallels where possible to existing benchmarks.

Few-Shot Audio Classification

At test time, few-shot classification performance is evaluated using N-way k-shot tasks, where N is the number of novel classes (never seen by the model before) and k is the number of labelled examples provided for each of those classes. For instance, each of the tasks T1 -> T3 represented in Figure 1 is a 5-way 1-shot task. Each task is composed of two parts: the support set, which the model learns from in some way at test time (anything from explicit gradient descent to building prototypes in an embedding space), and the query set, which is used for evaluation. Parameterising these tasks with N and k also allows for more in-depth analysis of sample complexity.
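To make the N-way k-shot setup concrete, here is a minimal sketch of how a single task could be sampled. The function name `sample_task`, the `n_query` parameter and the toy class names are illustrative assumptions and are not taken from the MetaAudio code base.

```python
import random


def sample_task(class_to_samples, n_way, k_shot, n_query, rng=None):
    """Sample one N-way k-shot task as (support, query) lists of (sample, label)."""
    rng = rng if rng is not None else random.Random()
    # Pick N distinct classes for this task.
    classes = rng.sample(sorted(class_to_samples), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw k support and n_query query samples with no overlap between them.
        chosen = rng.sample(class_to_samples[cls], k_shot + n_query)
        support += [(s, label) for s in chosen[:k_shot]]
        query += [(s, label) for s in chosen[k_shot:]]
    return support, query


# Toy data (hypothetical): 10 classes with 20 placeholder clip names each.
toy_classes = {f"class_{c}": [f"c{c}_clip{i}" for i in range(20)] for c in range(10)}
support, query = sample_task(toy_classes, n_way=5, k_shot=1, n_query=3,
                             rng=random.Random(0))
```

Labels are re-indexed from 0 to N-1 inside each task, so the model never relies on a global class identity.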

Figure 1: Example tasks for few-shot audio classification with spectrograms. Tasks contain both a support which we train on and a query which we evaluate on.

As well as testing like this, most of the algorithms considered also train with these few-shot tasks (called episodic training). The only exception to this is SimpleShot, which trains with more traditional batching over the training classes, after which its linear head is stripped and the remainder of the model is used as a feature extractor.

Data, Algorithms & Setup


Over the full benchmark we experiment with 7 unique datasets, 5 of which we split up class-wise for both training and evaluation, and 2 of which we set aside for cross-dataset testing. Of these 7, 3 are fixed length (all samples same size e.g. 5 seconds) and 4 are variable length (sample length varies over the full dataset).

Table 1: High level details of all datasets considered in MetaAudio

| Name | Setting | $N^o$ Classes | $N^o$ Samples | Format | Sample Length | Use |
|---|---|---|---|---|---|---|
| FSDKaggle18 | Mixed | 41 | 11,073 | Variable | 0.3s - 30s | Meta-train/test |
| VoxCeleb1 | Voice | 1251 | 153,516 | Variable | 3s - 180s | Meta-train/test |
| BirdClef2020 | Bird Song | 960 | 72,305 | Variable | 3s - 30m | Meta-train/test |
| BirdCLEF 2020 (Pruned) | Bird Song | 715 | 63,364 | Variable | 3s - 180s | Meta-train/test |
| Watkins Marine Mammal Sound Database | Marine Mammals | 32 | 1698 | Variable | 0.1s - 150s | Meta-test |
| SpeechCommands V2 | Spoken Word | 35 | 105,829 | Fixed | 1s | Meta-test |

The extra dataset included is a pruned version of the BirdClef set. As the samples in the full set span from 3 seconds to 30 minutes, hardware requirements were significantly higher, to the extent that we considered them a barrier to entry for training and using networks on the set. To overcome this we propose a pruned version in which samples longer than 180 seconds are removed, along with classes that contain fewer than 50 examples.
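The pruning rule can be sketched in a few lines. This is only an illustration: the `(class_name, duration_seconds)` representation and the function name are assumptions, as is the order of the two steps (length filter first, then class-size filter), which the description above does not pin down.

```python
from collections import defaultdict


def prune_dataset(samples, max_seconds=180.0, min_per_class=50):
    """Prune a list of (class_name, duration_seconds) pairs.

    Assumed order: drop over-long samples first, then drop small classes.
    """
    # Step 1: remove samples longer than the length cap.
    kept = [(cls, dur) for cls, dur in samples if dur <= max_seconds]
    # Step 2: count what remains per class and remove under-populated classes.
    counts = defaultdict(int)
    for cls, _ in kept:
        counts[cls] += 1
    return [(cls, dur) for cls, dur in kept if counts[cls] >= min_per_class]


# Toy example: class "a" survives; class "b" falls below 50 once its long clips go.
toy_samples = [("a", 10.0)] * 60 + [("b", 10.0)] * 45 + [("b", 200.0)] * 10
pruned = prune_dataset(toy_samples)
```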

Each dataset used for both meta-training and meta-testing is randomly split class-wise into training, validation and test sets with a ratio of 7/1/2. This random class-wise splitting with no fold validation is not ideal, but due to the expense of meta and few-shot learning it is commonplace within the community.
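As a rough illustration of a class-wise 7/1/2 split, the sketch below shuffles class names and cuts the list in three. The function name, the fixed seed and the rounding rule are assumptions; for actual comparisons the published split files should be used instead of regenerating splits.

```python
import random


def classwise_split(class_names, ratios=(7, 1, 2), seed=0):
    """Randomly partition classes (not samples) into train/val/test sets."""
    classes = sorted(class_names)
    random.Random(seed).shuffle(classes)
    total = sum(ratios)
    # Rounding rule is an assumption; any leftover classes go to the test set.
    n_train = round(len(classes) * ratios[0] / total)
    n_val = round(len(classes) * ratios[1] / total)
    train = classes[:n_train]
    val = classes[n_train:n_train + n_val]
    test = classes[n_train + n_val:]
    return train, val, test


# A 50-class dataset splits into 35 / 5 / 10 classes under this rule.
train_cls, val_cls, test_cls = classwise_split([f"class_{i}" for i in range(50)])
```

Because the split is over classes, every class seen at meta-test time is entirely absent from meta-training.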

Processing & Input

Pre-processing was kept simple with only a few major steps. The first of these was z-normalisation on each raw audio sample individually. We then converted all raw time-series into log-mel spectrograms using identical parameters across both samples and datasets.
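The first step, per-sample z-normalisation of the raw waveform, could look like the following sketch. It uses only the Python standard library for clarity; the function name and the small `eps` guard against silent clips are assumptions, and the log-mel conversion is not shown.

```python
import statistics


def z_normalise(signal, eps=1e-8):
    """Shift a raw waveform to zero mean and scale it to unit variance."""
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    # eps avoids division by zero for constant (e.g. silent) clips.
    return [(x - mean) / (std + eps) for x in signal]


normalised = z_normalise([0.0, 1.0, 2.0, 3.0])
```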

Figure 2: Pre-processing pipeline for data samples.

For the final normalisation on the sampled spectrograms we performed some prior experimentation, looking specifically at per-sample, channel-wise and global normalisation. In general both channel-wise and global performed similarly, with each taking the edge in some cases. For simplicity we opted to use global normalisation across all samples and experiments in the work.

Variable Length Samples

For variable length samples we first split the raw audio clip into L-second sub-clips before converting each to a log-mel spectrogram. All of these sub-clips are then stacked and stored as a single file, which can later be sampled from. If any sub-clip is shorter than L seconds, we repeat it and trim the result to the required length. The exact value of L is left as a hyperparameter whose effects we investigate; however, for the majority of the experiments in this version of MetaAudio it is set to 5 seconds, primarily to aid joint training and cross-dataset experimentation. All of this processing is done entirely offline, similarly to the fixed length setting.
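The splitting and repeat-padding step can be sketched as below. This is a simplified illustration that works on a plain list of raw samples; the function name and arguments are assumptions, and the spectrogram conversion that follows is omitted.

```python
def split_into_subclips(audio, sample_rate, clip_seconds=5.0):
    """Split a raw waveform into fixed-length sub-clips of clip_seconds each."""
    clip_len = int(sample_rate * clip_seconds)
    if clip_len <= 0 or len(audio) == 0:
        raise ValueError("need a non-empty signal and a positive clip length")
    subclips = []
    for start in range(0, len(audio), clip_len):
        chunk = list(audio[start:start + clip_len])
        if len(chunk) < clip_len:
            # Repeat the short chunk, then trim to exactly clip_len samples.
            repeats = -(-clip_len // len(chunk))  # ceiling division
            chunk = (chunk * repeats)[:clip_len]
        subclips.append(chunk)
    return subclips


# A 12-sample "clip" at 1 Hz with L = 5 s gives three sub-clips of 5 samples.
clips = split_into_subclips(list(range(12)), sample_rate=1, clip_seconds=5.0)
```

Only the final sub-clip can ever need padding, since every earlier one is cut at exactly `clip_len` samples.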

Figure 3: Variable length clips are first split into sub-clips of L seconds. From here the sub-clips are individually converted to spectrograms. If any clips fall short, they are repeated until they reach the required length in the spectrogram space.

There are a few reasons we chose this specific variable length pipeline. Firstly, splitting the clip up and then converting to spectrograms prevents data leakage between sub-clips, compared to the alternative where the full spectrogram is created and then split up. Additionally, stacking the sub-clips in one file allows us to use the same file whether the sample is selected as a support or a query. If a variable length sample is chosen as a support, one of its sub-clips is randomly chosen for use. If it is selected as a query, predictions are made for all of its sub-clips, with a majority vote deciding the final assigned class.
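The query-side majority vote can be illustrated with a short sketch. The function name is hypothetical, and the tie-breaking rule (the tied label predicted earliest wins) is an assumption, as the text does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(subclip_predictions):
    """Return the class predicted most often across a query's sub-clips."""
    if not subclip_predictions:
        raise ValueError("at least one sub-clip prediction is required")
    counts = Counter(subclip_predictions)
    top = max(counts.values())
    # Assumed tie-break: the tied label that appears first in the sequence.
    for label in subclip_predictions:
        if counts[label] == top:
            return label


final_label = majority_vote([2, 0, 2, 1, 2])
```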


Due to the large amount of algorithmic literature for meta-learning, MetaAudio is not exhaustive in the algorithms tested. To compensate as best as possible, we chose a representative few, spanning baselines, metric learners and gradient-based approaches. The exact list considered so far is as follows:

  • FO-MAML
  • FO-Meta-Curvature
  • ProtoNets
  • SimpleShot (CL2N)
  • Meta-Baseline

First-order methods are used for the gradient-based learners because, during initial experimentation, using second-order gradients yielded either similar or lower performance.


Throughout all of the experiments, the training, validation and testing subsets defined by random class-wise splitting are kept constant. This is not only an important step for this baseline work but also for future researchers tackling the problems set out here.


Evaluation Method & Optimiser

Keeping with current and past few-shot works, we evaluate a single end-to-end trained model, obtaining average classification accuracies and a 95% confidence interval over 10,000 randomly sampled tasks from the meta-test set. Although not as informative as something like k-fold validation in more traditional problems, this has become the main metric simply due to the computational and time expense of meta-learning training. For training, we used the Adam optimiser with a non-adaptive learning rate.
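For reference, the reported metric can be computed along the lines of the sketch below. It assumes the common few-shot convention of a normal approximation (1.96 times the standard error of the mean over tasks); the text does not state which interval estimator was used, and the function name is illustrative.

```python
import math
import statistics


def mean_and_ci95(task_accuracies):
    """Mean accuracy and 95% confidence half-width over per-task accuracies."""
    n = len(task_accuracies)
    mean = statistics.fmean(task_accuracies)
    # Normal approximation: half-width = 1.96 * sample std / sqrt(n).
    half_width = 1.96 * statistics.stdev(task_accuracies) / math.sqrt(n)
    return mean, half_width


# Toy example with 100 tasks; the benchmark itself uses 10,000 tasks.
mean_acc, ci95 = mean_and_ci95([0.0, 1.0] * 50)
```

A result would then be reported as `mean_acc ± ci95`, matching the format of the tables in this post.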


Motivated by the increasing performance gaps between basic CNNs and sequentially informed models in traditional acoustic classification, as well as some local verification of potential performance gains, we opted to use a lightweight CRNN model as our backbone architecture. Specifically, the CRNN contains a 4-block convolutional backbone (1-64-64-64) with an attached 1-layer non-bidirectional RNN containing 64 hidden units. The number of outputs in the final linear layer is either of size N-way or, in the case of metric learning and baseline methods, 64.

Within Dataset Evaluation

The first set of experiments carried out looked at within dataset meta-training, validation and testing. How this class-wise split is formatted is demonstrated in Figure 4. This type of pipeline is the most specific and computationally heavy, as each dataset has a model trained for it and it alone. Generally this goes against the goals of generalised representation learners, but it provides us with a strong baseline of performance for each of the datasets and algorithms.

Figure 4: An example of meta-train, validation and testing, using a bird song dataset like BirdClef2020. Icons are used to more visually demonstrate the possible classes and how they might relate to one another.

In general we found that gradient-based (GBML) learners like MAML and Meta-Curvature outperformed both the baseline models and metric learners.

Table 2: Baseline Within Dataset Results

| Dataset | FO-MAML | FO-Meta-Curvature | ProtoNets | SimpleShot CL2N | Meta-Baseline |
|---|---|---|---|---|---|
| ESC-50 | 74.66 ± 0.42 | 76.17 ± 0.41 | 68.83 ± 0.38 | 68.82 ± 0.39 | 71.72 ± 0.38 |
| NSynth | 93.85 ± 0.24 | 96.47 ± 0.19 | 95.23 ± 0.19 | 90.04 ± 0.27 | 90.74 ± 0.25 |
| FSDKaggle18 | 43.45 ± 0.46 | 43.18 ± 0.45 | 39.44 ± 0.44 | 42.03 ± 0.42 | 40.27 ± 0.44 |
| VoxCeleb1 | 60.89 ± 0.45 | 63.85 ± 0.44 | 59.64 ± 0.44 | 48.50 ± 0.42 | 55.54 ± 0.42 |
| BirdCLEF 2020 (Pruned) | 56.26 ± 0.45 | 61.34 ± 0.46 | 56.11 ± 0.46 | 57.66 ± 0.43 | 57.28 ± 0.41 |
Avg Algorithm Rank2.

This is immediately in contrast with the performance comparisons shown in the SimpleShot work with images, where the simple baseline was able to beat out a variety of GBML approaches. Specifically, we observe Meta-Curvature performing strongest on 4/5 datasets, with MAML taking the final 1/5. We propose that this is due to the GBML methods’ adaptation mechanism, which updates the feature representation at each meta-test episode, making them particularly useful for tasks with high inter-class/episode variance. Meanwhile, the others must rely on a fixed feature extractor that cannot adapt to each unique episode.
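One way to summarise a results table like Table 2 is an average rank per algorithm. The sketch below recomputes this from the mean accuracies in Table 2; the ranking convention (rank 1 is the highest mean accuracy on a dataset, confidence intervals ignored, no tie handling) is an assumption and may differ from the one used for the benchmark.

```python
ALGORITHMS = ["FO-MAML", "FO-Meta-Curvature", "ProtoNets",
              "SimpleShot CL2N", "Meta-Baseline"]

# Mean accuracies copied from Table 2, in the column order above.
TABLE_2 = {
    "ESC-50": [74.66, 76.17, 68.83, 68.82, 71.72],
    "NSynth": [93.85, 96.47, 95.23, 90.04, 90.74],
    "FSDKaggle18": [43.45, 43.18, 39.44, 42.03, 40.27],
    "VoxCeleb1": [60.89, 63.85, 59.64, 48.50, 55.54],
    "BirdCLEF 2020 (Pruned)": [56.26, 61.34, 56.11, 57.66, 57.28],
}


def average_ranks(table, names):
    """Average per-dataset rank of each algorithm (1 = best accuracy)."""
    totals = [0] * len(names)
    for accuracies in table.values():
        # Indices of algorithms sorted from highest to lowest accuracy.
        order = sorted(range(len(accuracies)),
                       key=lambda i: accuracies[i], reverse=True)
        for rank, idx in enumerate(order, start=1):
            totals[idx] += rank
    return {name: total / len(table) for name, total in zip(names, totals)}


avg_rank = average_ranks(TABLE_2, ALGORITHMS)
best_algorithm = min(avg_rank, key=avg_rank.get)
```

Under this convention the lowest average rank goes to FO-Meta-Curvature, in line with it winning on 4/5 datasets.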

Joint Training

The general idea for our joint training experiments is to train concurrently on all of our available datasets, hopefully leading to some implicit data-driven regularisation of the network. After training a network this way, we apply it to the individual test splits of the datasets used. This can be seen in Figures 5 and 6, where although training is mixed in some way, testing still occurs on each dataset's meta-test split. We also apply these models to two held-out sets, both of which we use only for meta-testing.

Figure 5: Joint training scenario with within dataset sampling. Individual meta-train tasks are still confined to containing classes from one dataset. Meta-test for each set is the same meta-test found in within dataset generalisation.

We identified two distinct ways to train with all datasets simultaneously: one where any individual task can only contain supports and queries from one of the included datasets (which we call within dataset sampling), and one in which the samples contained within a task are unconstrained (free dataset sampling).
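The difference between the two routines comes down to how the classes of a task are drawn. The following sketch illustrates this; the function names, the toy class names and the uniform choice over datasets (or over the pooled classes) are all assumptions, since the sampling probabilities are not specified here.

```python
import random


def sample_classes_within(datasets, n_way, rng):
    """Within dataset sampling: one dataset per task, then N of its classes."""
    name = rng.choice(sorted(datasets))
    return [(name, cls) for cls in rng.sample(datasets[name], n_way)]


def sample_classes_free(datasets, n_way, rng):
    """Free dataset sampling: N classes drawn from the pool of all datasets."""
    pool = [(name, cls) for name in sorted(datasets) for cls in datasets[name]]
    return rng.sample(pool, n_way)


# Hypothetical class lists standing in for two meta-train splits.
toy_datasets = {
    "ESC-50": [f"esc_class_{i}" for i in range(10)],
    "NSynth": [f"nsynth_class_{i}" for i in range(10)],
}
rng = random.Random(0)
within_task = sample_classes_within(toy_datasets, 5, rng)
free_task = sample_classes_free(toy_datasets, 5, rng)
```

Supports and queries for each chosen class would then be drawn exactly as in any other N-way k-shot task.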

Figure 6: Joint training scenario with free dataset sampling. Meta-train tasks can now contain classes from more than one dataset and are unconfined. Meta-test for each set is the same meta-test found in within dataset generalisation and joint training with within dataset sampling.

Comparing the performance of both sampling techniques against within-dataset evaluation, we see a general degradation of performance, with only ESC-50 and Kaggle18 improving (both from free dataset sampling). The magnitude of the difference varies heavily across both the datasets and the sampling routines.

Table 3: Joint Training (Within Dataset Sampling)

| Dataset | FO-MAML | FO-Meta-Curvature | ProtoNets | SimpleShot CL2N | Meta-Baseline |
|---|---|---|---|---|---|
| ESC-50 | 68.68 ± 0.45 | 72.43 ± 0.44 | 61.49 ± 0.41 | 59.31 ± 0.40 | 62.79 ± 0.40 |
| NSynth | 81.54 ± 0.39 | 82.22 ± 0.38 | 78.63 ± 0.36 | 89.66 ± 0.41 | 85.17 ± 0.31 |
| FSDKaggle18 | 39.51 ± 0.44 | 41.22 ± 0.45 | 36.22 ± 0.40 | 37.80 ± 0.40 | 34.04 ± 0.40 |
| VoxCeleb1 | 51.41 ± 0.43 | 51.37 ± 0.44 | 50.74 ± 0.41 | 40.14 ± 0.41 | 39.18 ± 0.39 |
| BirdCLEF 2020 (Pruned) | 47.69 ± 0.45 | 47.39 ± 0.46 | 46.49 ± 0.43 | 35.69 ± 0.40 | 37.40 ± 0.40 |
| Watkins | 57.75 ± 0.47 | 57.76 ± 0.47 | 49.16 ± 0.43 | 52.73 ± 0.43 | 52.09 ± 0.43 |
| SpeechCommands V1 | 25.09 ± 0.40 | 26.33 ± 0.41 | 24.31 ± 0.36 | 24.99 ± 0.35 | 24.18 ± 0.36 |
Avg Algorithm Rank2.

The generally mixed results here mirror other studies (full references in paper) and reflect the tradeoff between increasing the amount of training data available and the increased difficulty of learning a single model capable of simultaneous high performance on diverse data domains. This is some evidence that MetaAudio complements existing works in providing a challenging benchmark to test future meta-learners’ ability to fit diverse audio types, as well as enabling few-shot recognition of new categories.

Table 4: Joint Training (Free Dataset Sampling)

| Dataset | FO-MAML | FO-Meta-Curvature | ProtoNets | SimpleShot CL2N | Meta-Baseline |
|---|---|---|---|---|---|
| ESC-50 | 76.24 ± 0.42 | 75.72 ± 0.42 | 68.63 ± 0.39 | 59.04 ± 0.41 | 61.53 ± 0.40 |
| NSynth | 77.71 ± 0.41 | 83.51 ± 0.37 | 79.06 ± 0.36 | 90.02 ± 0.27 | 85.04 ± 0.31 |
| FSDKaggle18 | 44.85 ± 0.45 | 45.46 ± 0.45 | 41.76 ± 0.41 | 38.12 ± 0.40 | 35.90 ± 0.38 |
| VoxCeleb1 | 39.52 ± 0.42 | 39.83 ± 0.43 | 40.74 ± 0.39 | 42.66 ± 0.41 | 36.63 ± 0.38 |
| BirdCLEF 2020 (Pruned) | 46.76 ± 0.45 | 46.41 ± 0.46 | 44.70 ± 0.42 | 37.96 ± 0.40 | 32.29 ± 0.38 |
| Watkins | 60.27 ± 0.47 | 58.19 ± 0.47 | 48.56 ± 0.42 | 54.34 ± 0.43 | 53.23 ± 0.43 |
| SpeechCommands V1 | 27.29 ± 0.42 | 26.56 ± 0.42 | 24.30 ± 0.35 | 24.74 ± 0.35 | 23.88 ± 0.35 |
Avg Algorithm Rank2.

Additionally, we contrast how the joint training episode sampling routines compare. For our main datasets, we observe 3/5 of the top results were obtained using the free sampling method, with the 2 outliers belonging to VoxCeleb and BirdClef - evidence that their tasks require significantly different and specific model parameterisation, as the within dataset task sampling would allow more opportunity to learn these more specialised features.

Joint Training to Cross-Dataset

For the held-out cross-dataset tasks (Watkins, SpeechCommands V1), we also see the strongest performance coming from the free sampling routine, where it outperforms its within dataset counterpart by ∼2% in both held-out sets. As for the absolute performances obtained on the held-out sets, we see that our joint training transfers somewhat effectively, with the model attaining a respectable 50-60% in one case and accuracies only 5% above random in the other.

Massive Pre-Train

A full meta-learning pipeline for a specific dataset can be expensive. Transferring some pre-trained feature extractor and using a cheap linear classifier for each task could be cheaper, as the cost is spread over multiple downstream tasks. In this direction we employed Audio Spectrogram Transformers from this work, which were trained on the large ImageNet and AudioSet datasets. On top of the features that we obtained from these models, we applied both nearest centroid and linear SVM classification.
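The nearest centroid readout on frozen features can be sketched as follows. The CL2N transform (centre each feature using a mean vector, then L2-normalise) follows the SimpleShot paper; everything else here, including the function names, the toy 2-D features and where `train_mean` comes from, is an assumption made for illustration.

```python
import math


def cl2n(feature, train_mean):
    """Centre a feature vector with a mean vector, then L2-normalise it."""
    centred = [x - m for x, m in zip(feature, train_mean)]
    norm = math.sqrt(sum(x * x for x in centred)) or 1.0  # guard zero vectors
    return [x / norm for x in centred]


def nearest_centroid(support, query_feature, train_mean):
    """Classify a query by the closest class centroid of transformed supports."""
    sums, counts = {}, {}
    for feature, label in support:
        z = cl2n(feature, train_mean)
        if label not in sums:
            sums[label], counts[label] = [0.0] * len(z), 0
        sums[label] = [a + b for a, b in zip(sums[label], z)]
        counts[label] += 1
    centroids = {lab: [v / counts[lab] for v in vec] for lab, vec in sums.items()}
    zq = cl2n(query_feature, train_mean)
    # Euclidean distance to each centroid; the smallest distance wins.
    return min(centroids, key=lambda lab: math.dist(zq, centroids[lab]))


# Toy 2-D "features" standing in for pre-trained embeddings.
toy_support = [([1.0, 0.1], "a"), ([0.9, 0.0], "a"), ([0.0, 1.0], "b")]
prediction = nearest_centroid(toy_support, [2.0, 0.3], train_mean=[0.0, 0.0])
```

No parameters are trained at test time here, which is what makes this kind of readout cheap to apply across many downstream tasks.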

Table 5: Massive Pre-training to linear classifier

| Dataset | AST ImageNet: SVM | AST ImageNet: SimpleShot CL2N | AST ImageNet & AudioSet: SVM | AST ImageNet & AudioSet: SimpleShot CL2N | From Table 2: SimpleShot (CL2N) |
|---|---|---|---|---|---|
| ESC-50 | 61.12 ± 0.41 | 60.41 ± 0.41 | 61.61 ± 0.41 | 64.48 ± 0.41 | 68.82 ± 0.39 |
| NSynth | 64.26 ± 0.41 | 66.68 ± 0.41 | 62.62 ± 0.42 | 63.78 ± 0.42 | 90.04 ± 0.27 |
| FSDKaggle18 | 34.01 ± 0.40 | 33.52 ± 0.39 | 38.38 ± 0.41 | 38.76 ± 0.41 | 42.03 ± 0.42 |
| VoxCeleb1 | 27.26 ± 0.36 | 28.09 ± 0.37 | 27.45 ± 0.36 | 28.79 ± 0.38 | 48.50 ± 0.42 |
| BirdCLEF 2020 (Pruned) | 30.84 ± 0.37 | 33.04 ± 0.41 | 33.17 ± 0.38 | 36.41 ± 0.42 | 57.66 ± 0.43 |
Avg Rank4.
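| Dataset | AST ImageNet: SVM | AST ImageNet: SimpleShot CL2N | AST ImageNet & AudioSet: SVM | AST ImageNet & AudioSet: SimpleShot CL2N | From Table 2: SimpleShot (CL2N) |
|---|---|---|---|---|---|
| Watkins | 55.91 ± 0.42 | 55.40 ± 0.42 | 51.46 ± 0.42 | 51.81 ± 0.42 | N/A |
| SpeechCommands V1 | 26.24 ± 0.36 | 26.46 ± 0.37 | 30.69 ± 0.38 | 30.24 ± 0.38 | N/A |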
Avg Rank2.

These results reveal a few interesting insights. Firstly, and perhaps unsurprisingly, we observe that the features pre-trained on both AudioSet and ImageNet generally outperform those trained on ImageNet alone, with NSynth being the exception among our main datasets. The small margin between them, however, is perhaps surprising, suggesting that image-derived features provide most of the information needed to interpret spectrograms.

Comparing these results to the in-domain training presented in Table 2, we observe substantial performance drops across the board, with the possible exceptions of ESC-50 and Kaggle18. In their best cases, NSynth, VoxCeleb and BirdClef all take drops in performance of ∼20% due to dataset shift between general purpose pre-training and our specific tasks, such as musical instrument, speech or bird song recognition. While the performance hit due to domain shift is expected, these results are surprising, as AudioSet is a much larger dataset and the AST is a much larger architecture than the CRNN used in Table 2. Within the image domain, comparable experimental settings show a clear win from simply applying larger pre-training datasets and models along with simple readouts, compared to conducting within-domain meta-learning. This confirms the value of MetaAudio as an important benchmark for assessing meta-learning contributions that cannot easily be replicated by larger architectures and more data. Performance on our held-out sets shows a more mixed set of results, with ImageNet-only pre-training favouring Watkins, and ImageNet + AudioSet pre-training setting a new SOTA for SpeechCommands.

Reproduction & Use

This work aims to be a benchmark upon which people can build and improve. With that in mind, we outline how best to use MetaAudio. There are two main ways this benchmark can be approached: one in which new ideas are actively being tested, and one in which reproduction and/or immediate extensions to MetaAudio are the goal. Both start with the code repo available for MetaAudio.

Testing New Ideas

If simply testing new ideas against this benchmark, then following this format should work:

  • Obtain the datasets of interest (sources can be found here) - We recommend testing and reporting with all of the datasets to avoid claims of selection bias
  • Go through the dataset preprocessing pipelines (.wav -> .npy raw -> .npy spec). All of the code for this as well as detailed descriptions of the scripts can be found here. Note that some datasets like BirdClef & Watkins require an additional step of cleaning
  • Obtain the .npy class split files from here. Unless doing specific experiments with sample length and meta-data, we strongly recommend using the so-called ‘Baseline Splits’
  • If you already have some few-shot sampler for classification tasks that is set to sample from a folder-of-class-folders structure, then this should be all that is required from the MetaAudio repo. If this sampler is missing however, custom built classes can be found here
  • The evaluation metric used in MetaAudio is the average and 95% confidence interval taken over 10,000 randomly sampled tasks from the meta-test set. For fair and easy comparison, we recommend this to other researchers

Reproduction & Immediate Extensions

Generally, reproduction and immediate extensions will require the use of more of the code base than just benchmarking. To start, the steps outlined in ‘Testing New Ideas’ should be followed. This will end with all of the datasets properly processed and set up for experimental work. On top of this, these steps may be helpful in starting off:

  • Environment replication. Within the main README file here, the environment file and instructions on how to load it into conda can be found
  • Examples of some experiments can be found here. These include MAML for ESC-50 and ProtoNets for Kaggle18
  • The code included in the example experiments should be sufficient for both reproduction and code add-ons/extra experiments


MetaAudio both frames and benchmarks a variety of interesting few-shot audio classification problems. These span datasets from a variety of sound domains, from environmental sounds to bird song, and settings from within-domain meta-learning to massive pre-training for features. Of the findings presented, the most important is that few-shot audio classification behaves significantly differently from the much better-studied image domain.

From here, our hope is that the community makes use of MetaAudio as a tool to more extensively round out the evaluation of novel meta and few-shot learning techniques alike.

Thank you very much for reading!