September 17, 2024
How To Optimize Training an AI Voice Model
By Sam Kearney
Though it may seem counterintuitive, a great-sounding AI voice model doesn’t require singers with perfect pitch. One of the most common mistakes I encounter when reviewing submissions for our Community Voices program is datasets heavily altered with auto-tune. From the outside, it’s understandable to assume that pitch-perfect datasets produce pitch-perfect models. In this post, we’ll explore why pitch correction can actually harm the quality of your AI voice model, along with other tips for training a more natural, realistic model.
The More, the Better!
AI vocal models thrive on diverse data. If you upload a three-and-a-half-minute song in a low vocal range, the model might sound great for that particular song, but it will lack the versatility of a real-life singer’s full range. For optimal results, aim for at least 30 minutes of vocal material that spans a wide range of pitches, dynamics, and delivery styles.
Incorporate everything from soft, delicate notes to full-energy belts, covering the broad spectrum of a singer’s abilities. This diversity ensures your model sounds natural and versatile, capable of performing across a wide array of material without being constrained by a limited dataset.
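Before uploading, it’s worth verifying that your dataset actually reaches the 30-minute mark. As a rough sketch using only Python’s standard-library `wave` module (the `training_data` folder name and the `total_duration_seconds` helper are just illustrative, not part of any Kits tooling):

```python
import wave
from pathlib import Path

def total_duration_seconds(folder):
    """Sum the playing time of every .wav file in a folder."""
    total = 0.0
    for path in Path(folder).glob("*.wav"):
        with wave.open(str(path), "rb") as wav:
            # duration = frame count / frames per second
            total += wav.getnframes() / wav.getframerate()
    return total

if __name__ == "__main__":
    minutes = total_duration_seconds("training_data") / 60
    print(f"Dataset length: {minutes:.1f} minutes")
    if minutes < 30:
        print("Consider recording more material before training.")
```

This only counts raw runtime, of course; it can’t tell you whether those minutes span a wide range of pitches and dynamics, which matters just as much.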
Bounce to True Mono!
A common oversight is uploading stereo audio instead of true mono when training a voice model. Kits currently allows a maximum of 200 MB of training data, so bouncing tracks to stereo, even if they were recorded with a single microphone, doubles your file size for no benefit and cuts into the amount of usable training data you can fit.
By bouncing your vocals to true mono, you maximize the amount of training material you can upload before hitting the size limit. Stereo has its place in a modern production, but an AI voice model only needs a mono capture of the voice.
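If your DAW has already printed a stereo file, you can fold it down to mono yourself. Here’s a minimal sketch using only the standard-library `wave` and `array` modules; it assumes 16-bit PCM WAV files on a little-endian platform, and the `bounce_to_mono` name is my own, not a Kits or DAW function:

```python
import wave
from array import array

def bounce_to_mono(stereo_path, mono_path):
    """Average the left and right channels of a 16-bit stereo WAV into mono."""
    with wave.open(stereo_path, "rb") as src:
        assert src.getnchannels() == 2, "input must be stereo"
        assert src.getsampwidth() == 2, "this sketch assumes 16-bit PCM"
        framerate = src.getframerate()
        # "h" = signed 16-bit; WAV PCM is little-endian, so this assumes
        # a little-endian machine (true of virtually all desktops today).
        samples = array("h")
        samples.frombytes(src.readframes(src.getnframes()))

    # Samples are interleaved L/R pairs; take the mean of each pair.
    mono = array("h", ((samples[i] + samples[i + 1]) // 2
                       for i in range(0, len(samples), 2)))

    with wave.open(mono_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(framerate)
        dst.writeframes(mono.tobytes())
```

Averaging the channels halves the file size while keeping the level of a centered vocal intact. If your stereo bounce is a dual-mono print (identical left and right), the result is bit-for-bit the same voice at half the size.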
Autotune and Pitch Correction Aren’t Necessary!
As I mentioned earlier, pitch-perfect vocals aren’t required for training data. Every singer, even one with exceptional pitch, has natural variations in their voice. A hard-tuned Antares AutoTune sound might suit your production style, but training on tuned vocals can result in a robotic, unnatural-sounding AI model.
The key is to save pitch correction for post-production. Training your AI voice model with natural, unprocessed vocals will yield a more realistic sound and prevent your model from being locked into one specific, overly processed style.
Save the Effects For Post!
Effects like reverb, delay, and modulation are great for enhancing vocal performances, but they should be avoided when creating training data. These effects can interfere with the machine learning process, which focuses on capturing the natural essence of the human voice. Including them in your dataset can result in models filled with digital artifacts, making them sound less lifelike.
Instead, focus on capturing dry, clean vocals. You can always add effects later. If room reflections are an issue, try recording in a small space like a closet, or use a reflection filter like the sE RF-X to minimize reverb and ensure a cleaner dataset.
Prioritize Sonic Consistency
While diversity in vocal delivery can enhance your AI model, consistency in recording quality is crucial. Background noise from fans, air conditioners, or other household items can negatively affect the outcome of your model. Take note of preamp levels and any distortion caused by clipping the mic or interface. Keep an ear out for any inconsistencies and ensure a clean, distortion-free capture.
Slight vocal variations due to daily changes in the singer’s voice can actually add depth to your model, but make sure the technical side of your recording remains consistent to maintain high-quality results.
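Clipping is easy to miss by ear on a dense take, but trivial to catch programmatically. A quick sketch with the standard-library `wave` module, assuming 16-bit PCM files (the `count_clipped_samples` helper and its threshold are my own illustration, not part of any Kits workflow):

```python
import wave
from array import array

def count_clipped_samples(path, threshold=32767):
    """Count 16-bit samples at or beyond full scale - a telltale sign of clipping."""
    with wave.open(path, "rb") as wav:
        assert wav.getsampwidth() == 2, "this sketch assumes 16-bit PCM"
        samples = array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    # Full scale for signed 16-bit audio is -32768..32767.
    return sum(1 for s in samples if abs(s) >= threshold)
```

A handful of flagged samples may just be a hot peak; long runs of them mean the take was clipped at the mic or interface and is worth re-recording rather than including in the dataset.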
Conclusion
When building an AI voice model, it’s easy to assume traditional vocal production techniques will improve the result. However, by using natural, diverse data, maintaining technical consistency, and saving effects for post-production, you’ll create a more realistic, versatile voice model. Kits AI can unlock incredible creative possibilities, and with the right approach, you can get the most out of your AI voice models. For additional recording guidelines, follow this link for Kits’ recommendations for capturing high-quality datasets.
-SK
Sam Kearney is a producer, composer and sound designer based in Evergreen, CO.