My Talkloid Tuning Style; or Making a Robot Talk Good
Kasane Teto has been an obsession of mine since I listened to Jamie Paige's "Constant Companions" album a couple of years back on the recommendation of my partner and a huge swath of people on Cohost - to the point that she's now a frequent co-pilot of the Hoot_OS system and one of three main characters of DOLLSHAPE, the multimedia vocaloid project I'm working on. There's just something about a voicebank that started out as an April Fools' joke and grew into one of the most popular synthesized singing voices around, popular enough to rival Hatsune Miku. That vaguely anti-establishment origin story is what made her appealing to me in the first place.
So, in the summer of 2025, it should come as no surprise that I ended up finding a way to legally purchase a license to SynthV and Teto's SV1 voicebank.
I've played around with her here and there, mostly as experimentation to learn the software and see what tools I have at my disposal to colour the voicebank's performances. A riff with a simple hook here, a funny voiceline there. Simple stuff. Then came the idea to create a spotter voicepack mod for iRacing, and all hell broke loose.
iRacing is a hardcore simulation racing platform that I've been subscribed to for almost ten years at this point. It's one of the best competitive multiplayer simulation racers in the industry thanks to the presence of esports pros and real-life racing drivers in the top tiers of the service - you might find yourself racing against somebody like NASCAR legend Dale Earnhardt Jr., multi-time Australian Supercars champion and current NASCAR Cup Series driver Shane van Gisbergen, or Formula 1 legend Rubens Barrichello. Those kinds of names drive people to the service, but it can't run on name recognition alone; the server infrastructure and the quality of the content on the service are second to none.
In real motorsports, you often have someone in the pit lane or elsewhere around the track feeding you vital information as you race. In NASCAR, that person tells you when you have a car beside you so you can stay focused on what's out your front windshield. They can also tell you if you've received a penalty that needs to be served, warn you when you're running low on fuel, or give you a live weather report. This person is usually called a "spotter" or "engineer." iRacing has a built-in spotter feature that gives you all the information you could possibly need via captioned voice lines, and it allows you to add your own voice lines to replace the default spotter voice entirely with a new one.
While iRacing's intended purpose for the ability to add third-party spotter voicepacks is for translations, many iRacers like myself have replaced the default spotter with the voice of another individual - in my case, it was often multi-time NASCAR champion Jimmie Johnson's crew. Recordings were taken from real radio clips of Chad Knaus and Earl Barban talking to Jimmie during races, clipped and re-exported as WAV files, then stacked into a folder architecture within iRacing's installation path. This would replace the default iRacing spotter with real clips of a professional crew chief and spotter giving you vital racing information in a VERY realistic format - and that was because the stakes of communicating with your driver to achieve multiple NASCAR championships were pretty damn high!
This, of course, has now given me a mission: Make Kasane Teto the first ever Talkloid spotter mod for iRacing, somehow beating Hatsune Miku to the punch since no one else has ever taken on a project like this.
In the process of making this spotter mod, I've taken great interest in making Teto's speaking voice sound natural. I've taken notes from watching other Talkloid creations, paying attention to how the producers tune their Talkloids and thinking about how those performances might be improved. Being autistic has been a bit of a help with this, since I'm quite familiar with having to study English speech patterns to try and understand how to communicate with other English-speakers effectively.
In English (and most other languages, of course), pitch and timing are VERY important to communicating an idea. Getting the pitch or timing wrong might result in misunderstandings. A good example of this is the difference between sarcasm, deadpan humour and honesty. The phrase "good job, sport" can come across as either authentic encouragement or scathing sarcasm depending on how it's said. Specifically, a monotonous delivery of that particular phrase is commonly understood as an inauthentic presentation of the information being conveyed, and it'd be read as sarcasm. Meanwhile, an excitable, upbeat variation in pitch can feel much more emotionally connected, and thus be understood as a genuine statement of encouragement. Go too far, however, and it loops back around to inauthenticity - a conveyance of condescension.
Given this, a naturalistic Talkloid voiceline needs to consider the timing and pitch of phonemes, words and phrases both as a whole AND on a microscopic level. Making each word sound natural usually means re-timing its phonemes, because SynthV automatically times phonemes for singing. You can't produce an accurate or readable pitch with consonants like "ssss" or "thhh," so they're often shortened more than they would be in speech; pitch conveyance is much more important when singing. In speech, however, pitch and rhythm seem to carry a more equal share of the information. This means pretty much every word needs its phoneme lengths re-timed to sound like regular, casual speech. Furthermore, pitch in casual speech is flatter on average across multiple sentences while being quite a lot more 'peaky' in certain portions of a word or sentence. For example, if you ask a question, you usually raise your pitch toward the end of the sentence. When making a statement, the opposite happens - the tail of the sentence dips downward. A common mistake I caught myself making out of the gate - and one I've noticed other Talkloid producers struggle with - is underestimating just how far the pitch can rise or fall in a short time. In my tuning, the ends of sentences are the most extreme portion of a sentence for pitch modulation; pitch re-drawings often end up looking like cliffsides, dropping by half a dozen semitones in some cases to give a statement a feeling of finality and completion.
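To make the "cliffside" idea concrete, here's a tiny Python sketch of what that sentence-final fall does to a pitch contour. This is purely my own illustration of the shape - SynthV has no such function, and the numbers (a 6-semitone drop over the last 20% of the contour) are just the kind of values I end up drawing by hand:

```python
def statement_tail(contour, drop_semitones=6.0, tail_fraction=0.2):
    """Apply a 'cliffside' fall to the end of a pitch contour.

    contour: list of pitch values in semitone offsets over time.
    The final tail_fraction of points ramps linearly downward by
    drop_semitones, mimicking the falling pitch that signals a
    finished statement in English speech.
    """
    n = len(contour)
    tail = max(1, int(n * tail_fraction))
    out = list(contour)
    for i in range(tail):
        out[n - tail + i] -= drop_semitones * (i + 1) / tail
    return out

# A perfectly flat delivery gains a statement-final drop:
flat = [0.0] * 10
print(statement_tail(flat))  # last two points fall to -3.0, then -6.0
```

A question contour would be the mirror image: add to the tail instead of subtracting.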
A technique that has come in handy is performing the voiceline myself the way I'd like Teto to perform it, then matching the piano roll to my recording. (Kind of a funny sidebar: the Vocaloid community is full to the brim with covers and remixes of popular songs, so it's fitting that the best way to get a natural Talkloid performance is to just do a cover of a real line delivery lol.) Once the notes are placed down on the piano roll, it's time to manually redraw the pitch of the entire damn voiceline to mimic my own performance. Raw pitch accuracy isn't really important here - only the pitch of each word relative to the others in the sentence. Singing requires raw accuracy, but with speech we have the freedom to explore between the notes of the 12-tone octave that has become standard in Western music. In that way, Talkloid production is almost like microtonal music production! Kinda neat!
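Since only relative pitch matters, the underlying math is just the distance between two frequencies measured in semitones. A quick sketch (my own illustration, not part of any real tool) of why speech ends up "microtonal":

```python
import math

def semitones_between(freq_hz: float, ref_hz: float) -> float:
    """Signed distance from ref_hz to freq_hz in fractional semitones.

    Spoken pitch isn't quantized to the 12-tone grid, so the result
    is a float: +3.5 semitones is a perfectly valid spoken interval
    even though it falls between two piano keys.
    """
    return 12.0 * math.log2(freq_hz / ref_hz)

# An octave up is exactly +12 semitones:
print(semitones_between(440.0, 220.0))  # 12.0
```

When matching Teto's contour to my own recording, only these relative distances between words need to survive - the absolute frequencies can land wherever her voice sounds best.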
Another thing I've noticed while producing Talkloid voicelines is just how important phoneme timing is. As mentioned earlier, phoneme timings in SynthV (and most other singing synthesizers) are optimized for singing, which means shorter consonants and longer vowels. When you hear someone sing, they're rarely holding a note on phonemes like "zzzzz" or "vvvvvv"; the pitch is barely recognizable or controllable there. By default, this makes Talkloid speech sound strangely timed, like stop-and-start traffic during rush hour. This tends to be the more tedious part of the process; while pitch accuracy isn't a big deal in speech compared to singing, rhythm becomes a serious focus for legibility. When someone sings a word like "scarlet," they won't just sing the word like they'd say it; they shorten the "sc" sound, elongate the "arr" sound, shorten the "ull" sound, lengthen the "ehh" sound, then shorten the "tuh" sound. So "scarlet" becomes "scaaarrrleeeeet." Since SynthV optimizes for singing rather than speech, you have to retime those phonemes to make the rhythm of the word natural again: more time on the consonants, less time on the vowels. This is also word-dependent. In a word like "disability," the rhythm fluctuates throughout: "DIS-uh-BILL-uh-TEE," five syllables. When spoken, English speakers hang on the "sss" sound in "DIS," rush through the "uhh," rush through the "B" to get to the "ILL" in "BILL," rush through the "uhh" again, then hang on the "TEE." Once the microscopic adjustments are done, you then have to think about the word as a whole; the hangs on the "sss" in "DIS," the "ILL" in "BILL," and the "TEE" are all different lengths, with the "ILL" being the shortest of the three.
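The consonant-stretch/vowel-trim move can be sketched as a simple scaling pass. To be clear, the phoneme symbols, scale factors, and durations below are all my own illustration (loosely ARPABET-flavoured) - they are not SynthV's actual phoneme set or timing model, and in practice every word needs hand adjustment rather than a uniform scale:

```python
# Toy vowel inventory for the sketch; real phoneme sets are much larger.
VOWELS = {"aa", "ae", "ah", "ax", "eh", "ih", "iy", "ow", "uw"}

def retime_for_speech(phonemes, consonant_scale=1.5, vowel_scale=0.6):
    """Take (symbol, seconds) pairs timed for singing and rebalance
    them toward speech: consonants get longer, vowels get shorter."""
    out = []
    for symbol, dur in phonemes:
        scale = vowel_scale if symbol in VOWELS else consonant_scale
        out.append((symbol, round(dur * scale, 4)))
    return out

# "scarlet" timed for singing: clipped consonants, long vowels.
sung = [("s", 0.05), ("k", 0.04), ("aa", 0.40), ("r", 0.06),
        ("l", 0.05), ("eh", 0.35), ("t", 0.05)]
spoken = retime_for_speech(sung)  # "aa" shrinks, "s" and "t" grow
```

A uniform pass like this only gets you to "roughly speech-shaped"; the word-level hangs and rushes described above still have to be drawn in per phoneme.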
The timing of phonemes, like the pitch of words, isn't locked to a strict 4/4 grid - once again, the raw accuracy of the rhythm isn't important; what matters is the rhythm of each phoneme and word relative to the others. Fascinating shit!
Another thing I've noticed is the difference between my Kasane Teto speech tuning and other Talkloid producers' tunings. The most popular method seems to be putting her speech above C4. That's somewhere in the middle of her optimal A#3-to-E#5 range, and I suspect most producers put her speech there because it's where Teto's singing voice is most distinctive. But while people can speak anywhere within their own optimal pitch ranges, my personal 'home base' is toward the low end of my range, not the middle. For that reason, I decided to tune Teto's speech between E3 and C4, only going above C4 for things like emphasis, surprise or enthusiasm. This makes Teto's voice sound incredibly natural and casual, and it also seems to make pitch tuning easier: the higher a pitch is, the easier it becomes to notice inaccuracies or strange artifacting. Dragging her speaking pitch down about half an octave hides some of the errors and artifacts that can occur with manual pitch adjustments, and provides more forgiveness for relative pitch accuracy.

I'm not necessarily saying other Talkloid producers are making Teto speak wrong; their focus may instead be on presenting Teto as viewers recognize her most, and since most of her popular songs put her voice above C4, that's where those producers put her speaking voice. My focus is on a natural speech performance dedicated to conveying important information rather than on making Teto's voice recognizable, so my priorities lean toward pitches that make for natural, legible performances and away from her star power. "Teto's here to do a job, and it's not to please an audience - it's to win a race." This isn't a criticism, just a pattern I noticed and came to understand. (Thanks Fawnin for your input when I noticed the difference between my tuning and other producers' tunings - that was pretty insightful!)
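For anyone who thinks in Hertz rather than note names, the range I settled on can be pinned down with the standard equal-temperament MIDI conversion (A4 = 440 Hz). This is just the textbook formula, not anything SynthV-specific:

```python
def note_to_hz(midi_note: int) -> float:
    """Standard equal-temperament conversion, A4 (MIDI 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12)

E3, C4 = 52, 60  # MIDI note numbers
print(round(note_to_hz(E3), 1))  # 164.8 Hz - the floor of my speech range
print(round(note_to_hz(C4), 1))  # 261.6 Hz - where most producers start
```

So my "home base" for Teto's speech sits roughly in the 165-262 Hz band, spilling above only for emphasis.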
This project is a pretty large undertaking, with tedious work on each and every voiceline. But every new voiceline is an opportunity to learn and improve my Talkloid tuning, in addition to bringing me one step closer to my goal: hearing Teto berate me for crashing the car and cutting corners.