Talking to Users
How OptimizerAI iterated to solve the game sound effect cost problem
Oct 06, 2023
Text-to-Sound Product
Upon founding OptimizerAI, we talked to every game studio we could reach at the time. During those early interviews, we learned that game designers and sound designers communicate through a “design document”: game designers deliver the description of a sound effect they want as text or as a reference sound, and the two sides go back and forth until they agree on the result. When the communication was between a game studio and an outsourcing company, the problem was even worse because the feedback loop was much longer.
They told us this process is much faster with reference sounds, since sound designers can quickly grasp what kind of sound the game designer wants. However, for abstract sounds like character skill sounds (e.g., a mage casting a wide-range lightning spell), game designers struggled to find appropriate references.
We hypothesized that generating reference sound effects from the text descriptions game designers already had could help them come up with appropriate references faster and reduce the communication cost between game designers and sound designers. We built our alpha text-to-sound product in two weeks and went out to talk to users with it.

At this point, the a16z speedrun program was accepting applications. We applied with this text-to-sound product and wrote in the application that we were solving the “communication cost” problem between game designers and sound designers (which turned out to be so wrong).
Talking to Users
From this point, we asked people to use our product and gathered feedback. We basically tried to meet every single game studio we could, went to game conferences, and asked the people we met to introduce us to other studios. We met a wide range of game studios, from solo game developers to large enterprise studios like KRAFTON. One day, after getting home at midnight, we realized there was a game conference in Busan the very next day. Without hesitation, we got on the first train from Seoul to Busan to attend the conference and talk to users. At that conference, we met 50+ game studios in a single day.
As we met more and more studios, we learned that our first text-to-sound product had missed the target by a wide margin.
Lessons Learned
Through talking with many people, we gained two major insights.
First, game sound effects are created from gameplay, not from text descriptions.
In most game studios, game designers deeply trusted their sound designers or outsourcing partners. Sound was the very last phase of game development, and sound designers created sound effects from actual gameplay and in-game context, not from instructions handed down by game designers (the same was true for outsourcing). Most game designers didn’t care much about sound effects during development. We realized that the game designers we had met in the early phase of user interviews were people who were unusually enthusiastic about sound. (That’s why we were able to interview them when we had nothing: because they were sound enthusiasts, they were curious about a random startup solving the game sound effect problem.)
Second, not only was making sound effects difficult, plugging them into game interactions was difficult too.
A solo game developer we met at the game conference in Busan was building his own internal tool to plug sound effects into game interactions, and it was taking him months. Imagine building footsteps for an RPG with 10 maps, 10 shoes, and 10 characters. That’s already 1,000 sound effects.
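To make the combinatorics concrete, here is a minimal Python sketch (the asset names are made up for illustration) of how the number of pre-authored footstep variants multiplies:

```python
from itertools import product

# Hypothetical asset matrix for an RPG: each surface/footwear/character combination
# traditionally needs its own pre-authored footstep sound effect.
surfaces = [f"surface_{i}" for i in range(10)]      # grass, gravel, wood, ...
footwear = [f"footwear_{i}" for i in range(10)]     # boots, sandals, barefoot, ...
characters = [f"character_{i}" for i in range(10)]  # each character's weight and gait

variants = list(product(surfaces, footwear, characters))
print(len(variants))  # 10 * 10 * 10 = 1000 distinct footstep sounds to author and wire up
```

Every new map, shoe, or character multiplies the list again, which is why wiring sounds into interactions by hand can take months.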
Real-time Game Sound Effect SDK
From these two insights, we came up with the idea of real-time sound effect generation conditioned on gameplay. Our ideal product is an SDK where developers only need to wrap a single piece of code around their asset code. The SDK then interprets the game scene as video, and the underlying video-to-sound model generates the corresponding sound effect in real time. With this solution, game studios wouldn’t have to create any sound effects during development.
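To illustrate the developer experience we are aiming for, here is a rough, engine-agnostic sketch in Python. It is not our actual Unity API; every name in it (RealtimeSfx, attach, on_frame) is invented for this example:

```python
class RealtimeSfx:
    """Hypothetical wrapper: turns rendered game frames into sound effects in real time."""

    def __init__(self, video_to_sound_model):
        # The underlying video-to-sound model that generates audio from pixels.
        self.model = video_to_sound_model
        self.target = None

    def attach(self, game_object):
        # The single "wrap" step around an existing asset/object; after this,
        # the developer writes no sound code for that object.
        self.target = game_object
        return self

    def on_frame(self, frame_pixels, audio_out):
        # Every rendered frame is passed to the model, and the generated waveform
        # chunk is pushed straight to the game's audio output.
        waveform = self.model.generate(frame_pixels)
        audio_out.play(waveform)
```

The point of the sketch is the shape of the integration: one wrapper call at asset time, then sound generation driven entirely by what is on screen.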
To validate that this product is possible, we built a first version of the video-to-sound model. It is a large model trained on license-free data, and we are going to use it to synthesize category-specific sound effects for further training smaller models specialized for particular sound effect categories.
Demo Video: https://vimeo.com/869232078?share=copy
We are starting with a real-time footstep generation SDK for Unity, and we are validating two hypotheses:
- Whether game studios are willing to put real-time, generative-AI-based sound effects into their games
- Whether this real-time game sound effect SDK can reduce the cost of game sound effects for game studios.