I Taught an AI to Make a Cartoon Child Dance to Trap Music.
I need you to understand something before we begin.
There is a real database. In that database, there is a real MongoDB document. That document describes a real AI-generated video of a real animated character named Ranger Rae, who is wearing a shark hoodie, doing the robot, to a real song, in a real Cloudinary bucket, that cost me real money.
This is not a fever dream. I have the receipts. The receipts are also in a database.
How We Got Here
SoniQute is a music platform. QUTIE Chums are 50 collectible characters. There is lore. There are squads. There is a character named Pancake Stack whose canonical description I wrote at 2am and refuse to revisit. The platform involves collectibles, music discovery, and a philosophical stance on not growing up that I cannot fully explain without a whiteboard and at least one beverage.
At some point I decided the logical next step was to let users make their characters dance.
That's it. That was the whole thought. No further analysis was performed.
The Part Where I Underestimated Everything
The plan was: AI generates 5-second dance clips, user picks some music, clips get stitched together, character lip-syncs to the track, everyone posts to TikTok and collects points.
Simple. Clean. A completely normal thing for one person to build.
The actual API chain ended up being: Wavespeed Seedance for video generation, ffmpeg for last-frame extraction, Cloudinary for temporary frame storage with immediate self-destruction, Shotstack for BPM-aware timeline editing with music mixing, Wavespeed Kling for lip sync, Cloudinary again for the final output, MongoDB for everything, and Google Cloud Run for holding all of this together while I yelled at it from a laptop.
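For anyone trying to picture the plumbing, here is a rough sketch of the chain as one async function. It is not the actual codebase; every helper below is a hypothetical stand-in for one service in the list above.

    // A rough sketch of the chain, not the real code. Each declared helper
    // is a hypothetical stand-in for one service in the list above.
    declare function seedanceGenerate(imageUrl: string, prompt: string): Promise<string>;
    declare function extractLastFrame(clipUrl: string): Promise<string>; // ffmpeg -sseof -1 -i clip.mp4 -update 1 last.jpg
    declare function uploadTempFrame(path: string): Promise<string>;     // Cloudinary, self-destructs after use
    declare function saveScene(doc: object): Promise<void>;              // MongoDB

    async function runScenePipeline(characterImageUrl: string, vibePrompt: string) {
      const clipUrl = await seedanceGenerate(characterImageUrl, vibePrompt); // ~5-second dance clip
      const framePath = await extractLastFrame(clipUrl);  // so the next clip can start where this one ended
      const frameUrl = await uploadTempFrame(framePath);  // temporary frame storage
      await saveScene({ clipUrl, frameUrl, vibePrompt }); // the receipts go in the database
      return clipUrl;
    }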
I also wrote a BPM calculator. For beat-synced cuts. Because I needed the edits to land exactly on the beat. At 128 BPM that is 0.469 seconds. I know this number by heart now. I dream about this number. 0.469 seconds. That is how long a beat is at 128 BPM. I am telling you this at a professional networking event.
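For the curious, the math is small enough to fit in your nightmares. A sketch, with function names that are mine and not SoniQute's:

    // One beat lasts 60 / BPM seconds. At 128 BPM: 60 / 128 = 0.46875s.
    function beatDuration(bpm: number): number {
      return 60 / bpm;
    }

    // Snap a cut point to the nearest beat so every edit lands on the grid.
    function snapToBeat(timeSec: number, bpm: number): number {
      const beat = beatDuration(bpm);
      return Math.round(timeSec / beat) * beat;
    }

    beatDuration(128);    // 0.46875 -> the number I now recite at networking events
    snapToBeat(5.0, 128); // 5.15625 -> the cut moves to the 11th beat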
The Lip Sync Situation
I tried four different AI models in one day to make a cartoon character's mouth move.
Model one (Wavespeed Kling): incredible lip sync, slightly cursed fingers at the end of videos, alarming to look at, ultimately fine.
Model two (Sync LipSync 2 Pro): I described the output as "horrifying and all types of glitchy nonsense" in my testing notes and I stand by that description. I cannot elaborate further as I have chosen to heal.
Model three (PiAPI Kling): I built the entire integration, deployed it, ran a test, and received an error message informing me that free plan users cannot create lip sync tasks. I read that error message several times. I closed my laptop. I opened it again. The message was still there.
Model four: back to Wavespeed Kling. Timeout extended to 10 minutes. Fingers still slightly cursed. We are at peace with the fingers now. The fingers are part of the brand.
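That 10-minute timeout lives in an ordinary polling loop. Roughly this, with a hypothetical status helper standing in for the real endpoint:

    type LipSyncJob = { status: "processing" | "completed" | "failed"; outputUrl?: string; error?: string };
    declare function fetchLipSyncStatus(jobId: string): Promise<LipSyncJob>; // hypothetical stand-in

    async function waitForLipSync(jobId: string, timeoutMs = 10 * 60 * 1000): Promise<string> {
      const deadline = Date.now() + timeoutMs;
      while (Date.now() < deadline) {
        const job = await fetchLipSyncStatus(jobId);
        if (job.status === "completed") return job.outputUrl!;
        if (job.status === "failed") throw new Error(job.error);
        await new Promise((resolve) => setTimeout(resolve, 5000)); // check every 5 seconds
      }
      throw new Error("Lip sync timed out. The fingers won this round.");
    }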
The CSS Incident
I need you to understand that overflow-y-auto inside a flex chain is not a technical problem. It is a spiritual one.
I applied min-h-0 to the container. I applied min-h-0 to the parent. I applied min-h-0 to the grandparent. I applied min-h-0 to elements that did not need min-h-0 and watched nothing happen. I applied h-full. I applied flex-1. I applied them together. I applied them in a different order. I applied them while audibly questioning my life choices.
Six attempts. The solution was absolute inset-0 overflow-y-auto, which bypasses the flex chain entirely by refusing to participate in it, which is honestly the correct response to a lot of situations.
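If you are currently six attempts deep into the same fight, the shape of the escape hatch looks roughly like this. The component is illustrative; the classes are the ones that actually worked:

    import type { ReactNode } from "react";

    // The scrollable area is absolutely positioned inside a relative parent,
    // so inset-0 sizes it to that parent and the flex chain that was eating
    // the overflow never gets a vote.
    function ScrollPane({ children }: { children: ReactNode }) {
      return (
        <div className="relative flex-1">
          <div className="absolute inset-0 overflow-y-auto">{children}</div>
        </div>
      );
    }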
The Shirt Problem
Many QUTIE Chums have "QUTIE CHUMS" written on their shirt or hoodie.
After AI video generation, this text becomes whatever the model feels like. QUTIE CHUNK was a common outcome. CUTIE CHUMS appeared several times. One video produced something that I can only describe as the letters QUTIE CHUMS after they had been through something. Something difficult.
Seedance does not support negative prompts. So I added "preserve all text and graphics on clothing exactly as shown, keep all lettering sharp and unchanged throughout" to every single vibe prompt.
The letters are preserved approximately 70% of the time. The other 30% is a creative interpretation. The model is an artist. We respect that.
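With no negative prompt field, the workaround is string concatenation with feelings. Something like this, where the suffix text is the real one and the function is illustrative:

    // Seedance takes one positive prompt, so the plea rides along on every vibe.
    const PRESERVE_TEXT_SUFFIX =
      "preserve all text and graphics on clothing exactly as shown, " +
      "keep all lettering sharp and unchanged throughout";

    function buildVibePrompt(vibe: string): string {
      return `${vibe}, ${PRESERVE_TEXT_SUFFIX}`;
    }

    buildVibePrompt("hype dance, bouncing on the beat");
    // 70% compliance. The other 30% is jazz.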
A Timeline Of One Scene Generation At 1080p
12:07:40 AM: User clicks Generate Scene.
12:07:41 AM: Backend starts. Wavespeed receives request.
12:07:41 AM to 12:09:55 AM: Processing.
12:09:55 AM: Completed.
12:09:55 AM: Frontend says "Connection error. Please try again."
The backend was fine. The video was done. It was saved to Cloudinary. It was saved to MongoDB. It was sitting there, perfect and complete, waiting to be displayed.
The frontend fetch had no timeout. The browser had killed the connection at 60 seconds. The upgrade to 1080p had pushed generation to 134 seconds. The fix was signal: AbortSignal.timeout(300000).
One line of code.
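In context, it looks something like this. The endpoint and payload are illustrative; AbortSignal.timeout is the real web API:

    async function requestScene(characterId: string, vibe: string) {
      const res = await fetch("/api/generate-scene", { // illustrative endpoint
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ characterId, vibe }),
        signal: AbortSignal.timeout(300000), // the one line: abort after 5 minutes, not 60 seconds
      });
      return res.json();
    }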
I found this at approximately 1am and sat with it for a while.
What It Actually Is
Somewhere in between the cursed fingers and the BPM math and the CSS therapy, a thing got built.
Users open a pack. Three characters fall out. They go to the Dance Studio. They pick Hype or Chill or Robotic or Silly. A video appears. Their specific character, doing their specific vibe, in 1080p, in about two minutes.
They drag their best clips into three act slots. They pick a song. They hit Stitch. Shotstack arranges the clips on a BPM timeline, fades the transitions, mixes the audio. Kling animates the mouth. The whole thing lands in a Render Library as a complete 14-second vertical video ready for TikTok.
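A Shotstack edit is, at heart, a JSON timeline. Here is a trimmed sketch of a three-act stitch, with invented URLs and the beat math setting the act lengths; the shape follows Shotstack's edit schema as I understand it:

    const bpm = 103;
    const beat = 60 / bpm;        // ~0.5825s per beat
    const actLength = 8 * beat;   // 8 beats per act: ~4.66s, so three acts land near 14s

    const edit = {
      timeline: {
        soundtrack: { src: "https://example.com/song.mp3", effect: "fadeOut" },
        tracks: [{
          clips: [0, 1, 2].map((i) => ({
            asset: { type: "video", src: `https://example.com/act${i + 1}.mp4` },
            start: i * actLength,   // each act starts exactly on a beat boundary
            length: actLength,
            transition: { in: "fade", out: "fade" },
          })),
        }],
      },
      output: { format: "mp4", size: { width: 1080, height: 1920 } }, // vertical, TikTok-shaped
    };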
They didn't shoot a video. They didn't hire anyone. They didn't use After Effects. They unboxed a pack and made content.
That's the thing.
Where This Goes
TikTok integration, engagement tracking, SONIQ points for likes and shares, a leaderboard of the most viral Chums, and eventually character blending where two users combine their characters and generate a duet.
I have not started the duet feature. I am mentioning it here so that it exists somewhere and I have to build it now.
The Actual Takeaway
Every AI model has a failure mode. Kling has a hand problem. Seedance has a shirt opinion. Shotstack will render whatever you give it with complete confidence, including garbage.
The AI is fine. The plumbing is where you earn your money. The CSS is where you question everything.
And somewhere in a Cloudinary bucket there is a 1080p video of Ranger Rae doing the robot to a 103 BPM beat with slightly cursed fingers. It cost about $1.45 and two weeks of my life to produce.
I think it's pretty good.
I am going to go lie down.
-Jeremiah