Released by Google in 2017, the AudioSet dataset covers 527 hierarchically structured audio event classes across 2,084,320 10-second samples.
Although AudioSet has been the subject of many research articles on audio classification and tagging, there is currently no easy way to access the raw audio samples. From my own experience and discussions with others, much of the research seems to be performed either on the available pre-extracted features (discussed later) or on a full raw copy that an individual research group obtained through some means and cannot share. Both scenarios pose problems. For example, when only some groups have access to the full raw set, research capability ends up confined to those groups and the playing field is no longer level.
These samples are all derived from YouTube videos.
The set itself is fairly well documented with metrics like quality estimate and
Unfortunately, there is currently no easy way to access the raw audio samples themselves.