Audio temporal alignment

This page supplements the following article:

  • M. Ramona and G. Peeters, "Automatic alignment of audio occurrences: application to the verification and synchronization of audio fingerprinting annotation," in Proc. DAFX '11, 2011, pp. 429-436.
    @inproceedings{bibA11,
      author = {Mathieu Ramona and Geoffroy Peeters},
      title = {Automatic alignment of audio occurrences: application to the verification and synchronization of audio fingerprinting annotation},
      booktitle = {Proc. {DAFX} '11},
      month = {September},
      year = {2011},
      pages = {429--436}
    }

This article proposes a new method for the automatic temporal alignment of an original audio sample (the item) and an altered version of it (the occurrence).

The process aims at estimating, with high precision, the temporal alterations of the occurrence (i.e. those that affect the synchronization of the two signals), in order to correct them and obtain a perfect temporal alignment between the item and the occurrence.
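As a simple illustration of what "estimating a temporal alteration" means (this is a sketch, not the method described in the article), a constant time shift between two signals can be found by brute-force cross-correlation over a range of candidate lags:

```python
# Illustrative sketch only: estimating a constant time shift (in samples)
# between an item and an occurrence by maximizing their cross-correlation.
# The function name and the toy signals below are invented for this example.

def estimate_shift(item, occurrence, max_lag):
    """Return the lag in [-max_lag, max_lag] maximizing the correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(item):
            j = i + lag
            if 0 <= j < len(occurrence):
                score += x * occurrence[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Toy example: the occurrence is the item delayed by 3 samples.
item = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0, 0.0, 0.0]
occurrence = [0.0, 0.0, 0.0, 0.0, 1.0, 0.5, -0.5, -1.0]
print(estimate_shift(item, occurrence, max_lag=5))  # 3
```

A real system would of course work on features rather than raw samples, and handle the non-constant distortions listed below.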

Please note that all the sound examples in this page should be played with headphones.

Examples of temporal degradations

Our method is based on the estimation of the following temporal distortions. For each distortion, a sample is given for both the original and the altered signal. In the third sample, both are played together: the original on the left channel and the altered version on the right.

  • Shifting:
    Original Altered Stereo mix
  • Scaling:
    Original Altered Stereo mix
  • Cropping:
    Original Altered Stereo mix
  • Insertions:
    Original Altered Stereo mix
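To make the four degradations concrete, here is a hedged sketch (helper names and values are invented for this page) of how each one can be applied to a discrete signal:

```python
# Illustrative implementations of the four temporal degradations.
# These helpers are made up for this example; the article does not
# prescribe any particular implementation.

def shift(x, n):
    """Delay x by n samples (pad the start with silence)."""
    return [0.0] * n + x

def scale(x, factor):
    """Time-scale x by `factor` using linear interpolation."""
    out = []
    for i in range(int(len(x) * factor)):
        t = i / factor
        k = int(t)
        frac = t - k
        a = x[k] if k < len(x) else 0.0
        b = x[k + 1] if k + 1 < len(x) else a
        out.append(a * (1 - frac) + b * frac)
    return out

def crop(x, start, end):
    """Keep only the samples in [start, end)."""
    return x[start:end]

def insert(x, pos, segment):
    """Splice a foreign segment into x at sample index pos."""
    return x[:pos] + segment + x[pos:]

x = [1.0, 2.0, 3.0, 4.0]
print(shift(x, 2))          # [0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
print(scale(x, 2.0))        # [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.0]
print(crop(x, 1, 3))        # [2.0, 3.0]
print(insert(x, 2, [9.0]))  # [1.0, 2.0, 9.0, 3.0, 4.0]
```

Note that shifting and scaling preserve all of the original content, whereas cropping removes samples and insertion adds foreign ones, which is why the latter two are harder to invert exactly.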

Experimental validation

The article describes an experiment in which the result of the proposed method is manually verified and corrected on 100 examples of occurrences. The evaluated parameter is the itemTime, which summarizes the time shift between the item and the occurrence.
The following appendix page shows examples of slight temporal shifts that corroborate our conclusions on their perception: Perception of time shifting.

The table below gives a few examples of the results of the proposed process. Both the original item and the altered occurrence can be heard. Note that the occurrence originally lies within a stream and is not aligned with the item; we only provide here the part corresponding exactly to the item, but without compensation of the degradations (e.g. time scaling). Static distortions can also be heard on the occurrence. The last column provides the stereo mix of the original item and the occurrence after the temporal alterations are corrected.
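Once the temporal parameters have been estimated, the compensation itself amounts to a resampling of the occurrence. The sketch below (an assumption about the simplest case, not the article's implementation) realigns an occurrence affected by a constant scaling factor `alpha` and a shift `tau`, so that sample i of the output reads the occurrence at time alpha * i + tau:

```python
# Hedged sketch: compensating an estimated constant time scaling `alpha`
# and time shift `tau` (both in samples) by linear-interpolation resampling.
# The function name and the toy data are invented for this example.

def realign(occurrence, alpha, tau, out_len):
    """Return out_len samples with out[i] ~= occurrence[alpha * i + tau]."""
    out = []
    for i in range(out_len):
        t = alpha * i + tau
        k = int(t)
        frac = t - k
        a = occurrence[k] if 0 <= k < len(occurrence) else 0.0
        b = occurrence[k + 1] if 0 <= k + 1 < len(occurrence) else a
        out.append(a * (1 - frac) + b * frac)
    return out

item = [1.0, 2.0, 3.0, 4.0]
# Occurrence: the item stretched by a factor of 2 and delayed by 1 sample.
occurrence = [0.0, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.0]
print(realign(occurrence, alpha=2.0, tau=1, out_len=4))  # [1.0, 2.0, 3.0, 4.0]
```

Cropping and insertions cannot be undone this way: there the alignment must be handled piecewise, segment by segment.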

Item Occurrence Aligned mix
1.
2.
3.
4.

The first mix shows an almost perfect synchronization between the two signals (the sound is perceived slightly to the right), while the separate signals are clearly different (i.e. altered by static distortions). The second and third examples are also well aligned, but a slight "split" of the channels can be perceived on sharp onsets. The fourth and last example illustrates a more complicated case, where the temporal scaling between the item and the occurrence is coupled with a pitch shift. The alignment is still correct, but the two sounds do not clearly "fuse" when heard together.