Aswin Sivaraman

Deep Autotuner

Last Updated: Feb 18, 2019 2 min read
Tags: research


We describe a machine-learning approach to pitch-correcting a solo singing performance in a karaoke setting, where the solo voice and accompaniment are available on separate tracks. The proposed framework addresses the situation where no musical score exists for either the vocals or the accompaniment. It predicts an amount of pitch correction based solely on the relationship between the spectral contents of the vocal and accompaniment tracks. The pitch shift suggested by the model can therefore be used to make the voice sound in tune with the accompaniment. This approach differs from commercially used automatic pitch correction systems, in which notes in the vocal track are shifted to be centered around notes in a user-defined score or mapped to the closest pitch among the twelve equal-tempered scale degrees. Our model, a Convolutional Gated Recurrent Unit (CGRU) network, is trained on a dataset of 4,702 amateur karaoke performances selected for good intonation. The described model may be extended into unsupervised pitch correction of a vocal performance, popularly referred to as autotuning.
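To make the idea concrete, here is a minimal PyTorch sketch of a CGRU-style model: a convolutional front end summarizes local time-frequency structure of a spectrogram excerpt, a GRU aggregates the frame-wise features over time, and a linear head emits one scalar pitch-shift prediction per excerpt. All layer sizes, names, and the input shape are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class CGRUPitchCorrector(nn.Module):
    """Sketch of a Convolutional GRU pitch-shift predictor.

    Input: spectrogram excerpts shaped (batch, 1, freq_bins, time_frames).
    Output: one scalar pitch-shift prediction per excerpt.
    Hyperparameters are placeholders, not the paper's values.
    """

    def __init__(self, freq_bins=128, hidden_size=64):
        super().__init__()
        # Convolutional front end: local time-frequency patterns.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),   # pool along frequency only
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        conv_features = 32 * (freq_bins // 4)
        # GRU aggregates frame-wise features across time.
        self.gru = nn.GRU(conv_features, hidden_size, batch_first=True)
        # Single scalar: the suggested pitch shift for the excerpt.
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, spec):
        h = self.conv(spec)                              # (B, 32, F//4, T)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (B, T, features)
        _, last = self.gru(h)                            # (1, B, hidden)
        return self.fc(last.squeeze(0))                  # (B, 1)

model = CGRUPitchCorrector()
dummy = torch.randn(2, 1, 128, 50)   # two excerpts, 128 bins, 50 frames
shift = model(dummy)
print(shift.shape)   # torch.Size([2, 1])
```

In practice the vocal and accompaniment spectrograms would be presented jointly (e.g. as stacked input channels) so the network can learn their relationship, but the single-channel version above keeps the shape bookkeeping easy to follow.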

Built with: Python PyTorch

The research has caught the eye of multiple media outlets, including The Times, The Daily Mail, New Scientist magazine, BBC radio, and others.

Fig 1. Comparison of the manual pitch shifts (in cents) against the raw pitch information from the audio track. The network-predicted shifts are compared against the ground-truth shifts on a validation performance (top). Each straight line segment represents a single scalar prediction per note, where a note spans several time-frequency frames. The raw pitch comparison shows that the network-predicted (autotuned) track is much closer in raw pitch to the original in-tune recording than the programmatically detuned input (bottom).
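The figure reports shifts in cents, the standard logarithmic unit of pitch deviation: 1200 cents per octave, so 100 cents per equal-tempered semitone. The conversion from a frequency ratio to cents can be sketched as follows (the example frequencies are hypothetical, chosen only to illustrate the scale of a small intonation error):

```python
import math

def cents(f_ref, f_obs):
    """Deviation of an observed frequency from a reference, in cents.

    Positive means sharp, negative means flat.
    1200 cents per octave; 100 cents per semitone.
    """
    return 1200.0 * math.log2(f_obs / f_ref)

# An A4 sung at 446.64 Hz instead of 440 Hz is roughly 26 cents sharp,
# i.e. about a quarter of a semitone.
print(round(cents(440.0, 446.64), 1))
```

A pitch-correction model predicting a shift of, say, -26 cents for that note would bring it back onto the reference pitch.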