Why AI-generated videos feel hypnotic, fluid, and uncanny
The strengths and weaknesses of being an impartial number cruncher
. . .
Watching them is like watching a fractal: entrancing both in itself and as a style of presentation. Or perhaps it’s like hearing a story without a climax. As soon as you think you’ve got the setup figured out, it subtly shifts and reveals a different story, which now anticipates its own payoff, and so on, like a run-on sentence. The videos are always on the cusp of a revelation, but they never cross over, and you don’t expect them to either. They are liminal, and prefer to stay that way.

The hypnotic effect comes from the sense that the video is going somewhere: at every turn it makes a new promise, holding you in suspense. The viewer follows along, feeling they can’t let go until they see something delivered. Eventually they realize it never will be, and settle for enjoying the experience in the moment. And so the video meanders from goal to goal, as undirected as scrolling on social media.
To be clear, the videos do have an underlying idea, but it’s revealed at the start and maintained consistently throughout. There is no arc of setup and delivery, just a constant pressure. This is why the content of the videos is chosen to be visually captivating, and is often presented in slow motion. Cute puppies playing in snow, boats in a coffee cup, mammoths striding down a plateau. As with the stock footage the models are trained on, the videos’ visual appeal is their whole point. The message of the clips matches the medium.
Perhaps the best way to understand this phenomenon is to ask: what is the clip trying to say? What is its intent, its theme, its thesis? Why did the creator want us to see this, and specifically this? When you watch any video, you expect to quickly pick up the intent or purpose behind its creation. For AI-generated videos the answer shouldn’t be difficult to discover, since once a generative model has been trained, all specified content comes from the text prompt the video is generated from. So whatever unique message a video has, compared to other videos from the same model, must come exclusively from that prompt. The text prompt should, ultimately, encapsulate the thesis… but it doesn’t. The prompt represents the content, but it doesn’t represent the purpose.
To understand what I mean, remember that any prompt you give an AI doesn’t come out of nowhere. You, as its author, have a history of motives and feelings, goals and intents, all trying to find expression through that prompt. They are your hidden purpose, not the concrete content you are literally asking for. None of that backstory makes it to the AI. The AI is being shortchanged, since it is not being given all the facts.

It is difficult to tell an AI something like “I want the viewer to feel the urge to help the poor” or “I want them to feel excited about exploring the world” or “I want them to be full of child-like wonder”. Often you yourself aren’t fully aware of what you are really trying to say, at least not enough to put it into English words.
So whenever an AI’s output misses the mark it is because, like an unintentionally evil genie, it is delivering what you asked for, but not what you wanted. The mistakes the AI makes reveal its alienation from the user’s hidden intent. It is frustratingly literal: it doesn’t suss out when it should emphasize or downplay some part of the prompt, nor when it should add something missing that rightly belongs there.

As a user you subsequently have to edit the prompt, trying to shift the output closer to what you had intended but didn’t actually say. All the while, the AI is being pushed towards your vision from behind, prompt by prompt, rather than being pulled forward by a driving motive. And like a sandcastle that you have to keep patting back into shape, any part of the result that you neglect slowly crumbles.
. . .

When you merge multiple viewpoints together, the result won’t say anything specific, nor can it make a “point”. An AI-generated video is like the combined sound of a dozen human voices speaking in unison through the cultural artifacts we have created. There is no single value or motive uniting the result. AIs are exceptionally good at merging content but not intent, simply because intent shouldn’t be merged at all. “Intent” is always biased in favour of what it wants; it brooks no compromises. The only way to correct the blending of intents in AI art is for the AI to first learn the effect its choices have on the viewer, and then use that knowledge to drive the art in the direction it wants.
For now, generative AI is a blank slate of sorts, an impartial data cruncher. This has its appeal, to be sure. It lacks many of the annoyances of working with humans, particularly their stubbornness and partiality. It is a compliant and obsequious mimic of social artifacts. But this also means its failures can be confusing, almost as if the software were “mocking” you. It imitates your representations without showing that it understands why you care about the message or the subject matter. You want the AI ‘artist’ to agree with your message, and to express that agreement through the resulting creation. But it can’t agree with you; it can only ape what it has seen. So it ends up generating “mock” human artifacts: facsimiles that lack the driving voice of their originals.