RESEARCH
Introducing Zero-Shot Singing Voice Conversion
December 9th, 2024
by Anastasiia Herus
In our mission to provide the most powerful tools for music creators, the Kits.AI Research team has developed one of the world’s first Zero-Shot Singing Voice Conversion (ZS-SVC) models. This model converts audio to a target singer’s voice using only a reference recording, with no per-singer training required.
Audio examples: Input, Target Singer Reference, Output
Architecture and Data
The zero-shot model architecture inherits several core components of the KVC architecture, including content encoding, pitch encoding and retrieval. The key addition is a singer encoder module, which computes a singer embedding from the reference file. The singer embedding is a disentangled representation of the target singer's vocals that can then be used for conversion.
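To make the role of the singer embedding concrete, here is a minimal sketch of how a fixed-size embedding could be computed from a reference recording and combined with per-frame content and pitch features. It is an illustration under stated assumptions, not the Kits implementation: the mean-pooling encoder, the concatenation-based conditioning, and the names `singer_embedding` and `build_decoder_inputs` are all hypothetical, and a real singer encoder would be a trained neural network.

```python
import math
from typing import List

Vector = List[float]


def singer_embedding(reference_frames: List[Vector]) -> Vector:
    """Toy stand-in for a singer encoder (assumption, not the real module).

    Mean-pools frame-level features over time and L2-normalizes the result,
    so a reference of any length maps to one fixed-size vector.
    """
    if not reference_frames:
        raise ValueError("reference_frames must not be empty")
    dim = len(reference_frames[0])
    count = len(reference_frames)
    mean = [sum(frame[i] for frame in reference_frames) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # avoid division by zero
    return [x / norm for x in mean]


def build_decoder_inputs(
    content: List[Vector], f0_hz: List[float], singer: Vector
) -> List[Vector]:
    """Concatenate content, log-pitch, and the singer embedding per frame.

    The same singer vector is repeated on every frame; unvoiced frames
    (f0 <= 0) get a log-pitch of 0.0 in this sketch.
    """
    if len(content) != len(f0_hz):
        raise ValueError("content and f0_hz must have the same number of frames")
    inputs = []
    for frame, f0 in zip(content, f0_hz):
        log_f0 = math.log(f0) if f0 > 0 else 0.0
        inputs.append(frame + [log_f0] + singer)
    return inputs
```

Because the embedding is computed once from the reference and reused on every frame, swapping in a different reference changes the target voice without retraining anything, which is the property that makes the approach zero-shot.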
Phonemic Retrieval for Accent Preservation
Beyond preserving the timbral qualities of the target singer, the ZS-SVC model also employs a phonemic retrieval system. As with retrieval in KVC, this helps preserve the target singer's accent without overcorrecting, which would otherwise introduce pronunciation errors.
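The post does not describe how phonemic retrieval is implemented, so the following is only a generic sketch of feature retrieval: each source frame is matched to its nearest neighbor in a bank of target-singer features and blended toward it. The function names, the cosine-similarity matching, and the `ratio` parameter are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_and_blend(frame: Vector, index: List[Vector], ratio: float = 0.5) -> Vector:
    """Blend a source frame toward its nearest neighbor in the target index.

    ratio = 0.0 keeps the source frame unchanged; ratio = 1.0 replaces it
    with the retrieved target frame. (Hypothetical parameter.)
    """
    if not index:
        raise ValueError("index must not be empty")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    nearest = max(index, key=lambda candidate: cosine_similarity(frame, candidate))
    return [(1.0 - ratio) * x + ratio * y for x, y in zip(frame, nearest)]
```

In a scheme like this, an intermediate `ratio` is one way to trade off the two concerns named above: higher values pull pronunciation toward the target singer's accent, while lower values keep more of the source phonetic content and reduce the risk of overcorrection.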
Data
Prioritizing data quality over quantity has a large effect on zero-shot singing results. The ZS-SVC model was trained on Kits' licensed, studio-recorded vocal dataset. All data is licensed directly from artists and preprocessed by hand by audio engineers to achieve release-level quality.
Looking ahead
ZS-SVC powers our new Instant Voice Cloning (IVC) feature, currently available for Kits beta users. More features using ZS-SVC will become available to the broader Kits community over time.
We are excited to see how music creators use this new model to power their creative process!


