Jamendo Corpus for singing detection


The Jamendo database was built and published for the experiment described in:

  • M. Ramona, G. Richard, and B. David, "Vocal detection in music with Support Vector Machines," in Proc. ICASSP '08, 2008, pp. 1885-1888.

      @inproceedings{ramona2008vocal,
        author    = {Mathieu Ramona and Ga{\"e}l Richard and Bertrand David},
        title     = {Vocal detection in music with Support Vector Machines},
        booktitle = {Proc. {ICASSP} '08},
        year      = {2008},
        month     = {March 31 -- April 4},
        pages     = {1885--1888}
      }

It consists of 93 songs (about 6 hours of audio) released under Creative Commons licenses, retrieved from the Jamendo free music sharing website. This corpus is designed for the evaluation of singing voice detection in musical tracks.

The files were selected randomly by an automated bot script. The songs are from different artists and represent various genres from mainstream commercial music. Most of the audio files are provided here in the original format in which they were retrieved in 2008, i.e. Ogg Vorbis at 44.1 kHz stereo with a 112 kbit/s bitrate.

However, 18 of the original files were lost; they were later recovered from the Jamendo website, but with a different encoding: MP3 (MPEG-1 Layer III) at 44.1 kHz stereo with a 128 kbit/s bitrate.

Note, however, that our experimental protocol is based exclusively on Lo-Fi versions of the tracks: downsampled to 16-bit, 16 kHz mono, in WAV format.
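As a minimal sketch, such a conversion can be reproduced with ffmpeg; the tool choice and the helper name below are our assumptions, since the corpus authors do not specify how the Lo-Fi versions were produced:

```python
import subprocess

def lofi_command(src, dst):
    """Build an ffmpeg command converting an audio file to the Lo-Fi
    protocol format: 16-bit PCM, 16 kHz, mono WAV.
    (ffmpeg is an assumed tool choice; any resampler would do.)"""
    return [
        "ffmpeg", "-i", src,   # input file (OGG or MP3)
        "-ac", "1",            # mix down to mono
        "-ar", "16000",        # resample to 16 kHz
        "-sample_fmt", "s16",  # 16-bit samples
        dst,                   # output WAV path
    ]

# Example (not executed here):
# subprocess.run(lofi_command("song.ogg", "song.wav"), check=True)
```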

Corpus sets

The corpus is divided into three non-overlapping sets:

  • The train set contains 61 files and is dedicated to the training process of the algorithm.
    Download: List of files - Audio files
  • The validation set contains 16 files and is dedicated to the tuning of the algorithm parameters.
    Download: List of files - Audio files
  • The test set contains 16 files and is dedicated to the evaluation of the algorithm. It was used to evaluate our contribution in the aforementioned paper.
    Download: List of files - Audio files

Ground-truth annotation

Each file has been manually annotated by the same person, with segment boundaries accurate to about 0.1 second. The files are segmented into non-overlapping segments, each assigned to one of the following two classes:

  • sing: segments containing singing voice or spoken voice (generally over an instrumental background)
  • nosing: pure instrumental (or silence) segments with no voice.

All the annotations are provided as text files in the simple LAB format: each line holds three space-separated fields giving the start time and end time of a segment (in seconds), followed by its class. Here is an example:

0 1.512 sing
1.512 2.546 nosing
2.546 5.423 sing

Download: Annotations