From reading an earlier article, it sounds like there are at least two things being done:
- Statically generating text to cut down how long it takes a writer to produce text. This is presumably the “AI copilot” mode.
- Dynamically generating text at playtime for characters. This is the “AI NPCs that react to questions from a player” bit.
I can believe that #1 is technically-doable today, but I’m not clear whether there’s a need for it.
And #2 I can believe a need for, but I’m not clear how technically-doable it is.
For #1, people already do use LLMs to generate text. But…it’s generally not very high quality text.
In most video games I see, the limiting factor isn’t the amount of text to produce – I generally do very little reading in video games. Rather, it’s how much text people are willing to read. There are maybe some niche video games that have a lot of text, like the stuff that Choice of Games put out, where one is basically doing a choose-your-own-adventure type book. But for those, the game lives and dies on writing quality – which moving to an LLM does not, today, seem likely to improve – and I’m not sure whether the cost is a big deal; my question is really more whether I’d enjoy the time I put into playing the game. I mean, the games are pretty inexpensive.
Choice of Robots, one popular example, is $6.99.
Choice of Robots is an epic 300,000-word interactive sci-fi novel by Kevin Gold, where your choices control the story.
https://keytowriting.com/guides/ideal-length-different-books-novels-non-fiction-short-stories/375/
Novels are typically the longest type of book, with the average length ranging from 80,000 to 100,000 words.
So, already, you’re talking about the equivalent of three novels there, if you play through it enough times to hit all the content. (checks) The Lord of the Rings trilogy is only about 500k words. That’s a lot of player time for a few dollars.
But, okay. Let’s say that we aim at having an LLM reduce the writer’s workload by 50%. To do that, it’s going to have to generate more than 50% of the text, because the writer is probably going to have to do a pass over the text and touch it up. The vendor is probably getting something like 30% of the price. I don’t know what the rest of the breakdown is, but let’s assume that the writer gets all of the remaining 70%, that the LLM does half of all the work, and that all of the savings go to the buyer, all of which seem like pretty generous assumptions. Half of the writer’s 70% is 35% of the price, or about $2.45, so then we’re talking about maybe selling the game for $4.54 instead of $6.99. I don’t know how much that’s going to impact my purchasing decision; there are a lot of Choice of Games games that I don’t really think are all that worth playing, and a bigger factor is whether it’s worth my time to read the text.
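For anyone who wants to poke at the numbers, here is a rough sketch of that back-of-the-envelope estimate. The 30% vendor cut and the 50% LLM share are the guesses from the paragraph above, not real figures, and the function name is just something made up for illustration.

```python
# Back-of-the-envelope price estimate; every input is an assumption.
PRICE = 6.99           # Choice of Robots list price quoted above
VENDOR_CUT = 0.30      # guessed storefront share
LLM_WORK_SHARE = 0.50  # guessed fraction of the writer's work an LLM takes over


def discounted_price(price: float, vendor_cut: float, llm_share: float) -> float:
    """Price if the writer's share shrinks by llm_share and all savings reach the buyer."""
    writer_share = price * (1.0 - vendor_cut)  # what is left after the vendor's cut
    savings = writer_share * llm_share         # portion of that share no longer paid for
    return round(price - savings, 2)


new_price = discounted_price(PRICE, VENDOR_CUT, LLM_WORK_SHARE)
print(f"${new_price:.2f}")
```

Changing `LLM_WORK_SHARE` shows how little the price moves even under optimistic guesses, since the savings can never exceed the writer's share.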
The inputs that I do care about are how appealing the plot and writing are, and whether it’s worth my time. Today, LLMs are not competitive with humans at that, and in any event, as long as they’re just trained on human works and aim to replicate that style, they’re not super-likely to exceed human quality. I can very much believe that it’s possible to create software that writes more-engaging stuff than a human does, but I don’t think that doing that is near-term commercially practical.
So it’s kind of hard for me to see how compelling this is gonna be as an aid to writers.
Maybe it’d be possible to help a human ghostwriter imitate a writing style; image generators seem to be pretty impressive at imitating graphical styles of human artists. But I don’t think that human writers have as much of a recognizable style as human graphical artists.
Okay, how about #2?
There, I completely believe that there’s value in access to dynamically-generated conversation. It would be great if a game could dynamically interact with the player and deal with many permutations of their actions.
But from what I’ve seen, looking at KoboldAI (or TavernAI or similar), I’m skeptical that the performance or writing quality is there. For those who haven’t poked at those, there are several problems:
- They can presently retain very little state about the conversation. So, one can generate text in a given style, but not typically text that does a great job of taking into account context earlier in the conversation. There might be ways to mitigate this, so I’m not going to call it a fundamental limitation. But it’s definitely not a drop-in for written dialog in a typical game today.
- They are not fast. Okay, if you require the game to be always-online, then you can throw an unlimited amount of hardware at it, though then someone has to be paying for that. If you’re using local hardware, the current cadence of conversation really isn’t at the rate that I normally flip through conversations in video games.
- They are quite VRAM-intensive. Maybe for some games, things that are basically all text, like the Choice of Games games, you could soak up almost all of that memory for your game. But for anything that’s using the memory for other things, it’s going to limit resources. If this is going to run locally, that seems likely to be a constraint.
- They tend to get into repetitive loops. I strongly suspect there are fixes for this one that aren’t that technically difficult, using something like the equivalent of negative prompts or, outside of the LLM, detecting repeated text and restarting generation when it shows up.
- I don’t know about the training set availability. The reason that existing LLMs can generate text is that they have a large training set available, and all that is freely available. If you want, say, an LLM to generate text that reads like a newspaper article, you have countless newspaper articles to train it on. But…if you want to have an LLM generate text that replicates, say, a medieval character talking in your particular fantasy video game world…how likely is it that you have lots of examples of that? If you want to have a character speak like they’re a Solarian Knight’s squire on a little island off the shore of Dragon’s Reach, a high fantasy world, how many examples is your LLM going to have to work with?
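To make the “detect repeated text and restart generation” idea from the list above a bit more concrete, here’s a minimal sketch of doing it outside the model. This is just one guess at how it could work, not how KoboldAI or TavernAI actually handle it; `generate` is a hypothetical stand-in for whatever call produces a reply, and the n-gram size and thresholds are arbitrary.

```python
from typing import Callable


def has_repeated_ngram(text: str, n: int = 4, max_repeats: int = 2) -> bool:
    """Return True if any run of n consecutive words occurs more than max_repeats times."""
    words = text.lower().split()
    counts: dict[tuple[str, ...], int] = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False


def generate_without_loops(
    generate: Callable[[str], str],  # hypothetical model call: prompt in, reply out
    prompt: str,
    max_attempts: int = 3,
) -> str:
    """Ask for a new reply whenever the previous one looks like a loop."""
    reply = ""
    for _ in range(max_attempts):
        reply = generate(prompt)
        if not has_repeated_ngram(reply):
            return reply
    return reply  # every attempt looped; fall back to the last one
```

Retrying only helps if `generate` samples with some randomness, and each retry costs another full generation, which makes the speed problem above worse.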
Oh, and a #3 – in this article, they have a reference to doing voice synthesis. I can definitely see voice synth being used to speak arbitrary text. Stuff like Tortoise TTS is pretty good, can very easily generate a voice from samples or combinations of samples, and its short output snippet constraint is no big deal for the type of text that comes up in many video games. It’s a great fit for static generation of speech for mods, where the original voice actors aren’t available and maybe one doesn’t require the absolute best acting. But…again, it’s slow and VRAM-hungry, which I’d think would limit it for dynamic, in-game use. And speech synthesizers aren’t new; there’s nothing really specific to gaming or characters there, unless maybe one wants ease of generating new voices. So, due to the runtime resource limitations, I don’t know how practical using it for speech synth at runtime is. If one uses it to generate static speech snippets, sure, that could be done…but then, what’s game-specific here? You’re just using an ordinary old speech synthesis engine.
Maybe combining animated lip movements with static, generated speech? There are tools for that, but maybe it’s possible to do a better job of automatically generating facial and hand gestures to fit with speech.
I just don’t see how we’re at a point today where it’s really possible to take a lot of commercial advantage of this in video games.
I can definitely believe that there are non-writing applications, right now, for generative AI in video games, like in generating character art and animations. But it’s less clear to me where the large opportunities are when it comes to writing and speech.
-
This is the best summary I could come up with:
The multiyear partnership will include an “AI design copilot” system that Xbox developers can use to create detailed scripts, dialogue trees, quest lines, and more.
“This partnership will bring together: Inworld’s expertise in working with generative AI models for character development, Microsoft’s cutting-edge cloud-based AI solutions including Azure OpenAI Service, Microsoft Research’s technical insights into the future of play, and Team Xbox’s strengths in revolutionizing accessible and responsible creator tools for all developers.”
Inworld has been working on AI NPCs that react to questions from a player, much like how ChatGPT or Bing Chat responds to natural language queries.
These AI NPCs can respond in unique voices and can include complex dialogue trees or personalized dynamic storylines within a game.
The Finals developer Embark Studios recently had to defend its use of AI-generated voices, arguing that “making games without actors isn’t an end goal,” in a statement to IGN.
“We want to help make it easier for developers to realize their visions, try new things, push the boundaries of gaming today and experiment to improve gameplay, player connection and more,” says Zhang.
The original article contains 484 words, the summary contains 183 words. Saved 62%.