An investigation has uncovered four large-scale music datasets being shared within the AI-development community, with the largest containing 12 million tracks and another holding 9 million songs. The findings shed new light on the scope of copyrighted material potentially used to train generative AI music models without licensing agreements.
Alex Reisner, who examined the datasets, highlighted the gap between industry claims and the reality of accessible data.
“Companies often claim to use only content that is freely available online, but the datasets reveal the quantity of downloadable music that developers can access even though it is not supposed to be free,” Reisner wrote.
The revelation follows a recent bid by Universal Music Group and Sony Music to add more than 61,000 recordings to their copyright infringement lawsuit against AI music service Suno, a move Suno is opposing. The datasets have been downloaded thousands of times, but it remains unclear which companies have used them for training. As part of the investigation, the collections have been made searchable, allowing the public to see which songs are included.