Excerpts from an interesting article.
John
_______________________________________________________
Why AI-generated videos feel hypnotic, fluid, and uncanny
The strengths and weaknesses of being an impartial number cruncher
https://ykulbashian.medium.com/why-ai-generated-videos-feel-hypnoti…
. . .
Watching them is like watching a fractal: entrancing both in itself, and as a style of
presentation. Or perhaps it’s like hearing a story without a climax. As soon as you think
you’ve got the setup figured out, it subtly shifts and reveals a different story, which
now anticipates its own payoff, and so on; like a run-on sentence. The videos are always
on the cusp of a revelation, but they never cross over — and you don’t expect them to
either. They are liminal, and prefer to stay that way.
The hypnotic effect comes from the sense that the video is going somewhere, and at every
turn it makes a new promise, holding you in suspense. The viewer follows along, and feels
they can’t let go until they see something delivered. Ultimately they realize it never
will be and just enjoy the experience in the moment. And so it meanders from goal to goal,
as undirected as scrolling on social media.
To be clear, the videos do have an underlying idea, but it’s revealed at the start and
maintained consistently throughout. There is no arc of setup and delivery, just a constant
pressure. This is why the content of the videos is chosen to be visually captivating, and
is often presented in slow motion. Cute puppies playing in snow, boats in a coffee cup,
mammoths striding down a plateau. Like the stock footage they are trained on, their visual
appeal is their whole point. The message of the clips matches the medium.
Perhaps the best way to understand this phenomenon is to ask: what is the clip trying to
say? What is its intent, its theme, its thesis? Why did the creator want us to see this,
and specifically this? When you watch any video, you expect to quickly pick up the intent
or purpose behind its creation. For AI-generated videos the answer shouldn’t be difficult
to discover, since once a generative model has been trained, all specified content comes
from the text prompt it is conditioned on. So whatever unique message a video has
compared to other videos from the same model must come exclusively from that prompt. The
text prompt should, ultimately, encapsulate the thesis… but it doesn’t. The prompt
represents the content, but it doesn’t represent the purpose.
To understand what I mean, remember that any prompt you give an AI doesn’t come out of
nowhere. You, as its author, have a history of motives and feelings, goals and intents,
all trying to find expression through that prompt. They are your hidden purpose, not the
concrete content you are literally asking for. None of that backstory makes it to the AI.
The AI is being shortchanged, since it is not being given all the facts.
It is difficult to tell an AI something like “I want the viewer to feel the urge to help
the poor” or “I want them to feel excited about exploring the world” or “I want them to be
full of child-like wonder”. Often you yourself aren’t fully aware of what you are really
trying to say — at least not enough to put it into English words.
So whenever an AI’s output misses the mark it is because, like an unintentionally evil
genie, it is delivering what you asked for, but not what you wanted. The mistakes the AI
makes reveal its alienation from the user’s hidden intent. It is frustratingly literal,
and doesn’t suss out when it should emphasize or downplay some part of the prompt, nor even
when to add something missing that should rightly be there.
As a user you subsequently have to edit the prompt to try to shift it closer to what you
had intended, but didn’t actually say. All the while, the AI is being pushed towards your
vision from behind, prompt by prompt, rather than being pulled by a driving motive. And
like a sandcastle that you have to keep patting back into shape, any part of it that you
neglect slowly crumbles.
. . .
When you merge multiple viewpoints together, the result won’t say anything specific, nor
can it make a “point”. An AI generated video is like the combined sound of a dozen human
voices speaking in unison through the cultural artifacts we have created. There is no
single value or motive uniting the result. AIs are exceptionally good at merging content
but not intent, simply because intent shouldn’t be merged at all. “Intent” is always
biased in favour of what it wants — it brooks no compromises. The only way to correct the
blending of intents in AI art is for the AI to first learn the effect its choices have on
the viewer, then use that to drive the art in the direction it wants.
For now, generative AI is a blank slate of sorts, an impartial data cruncher. This has its
appeal, to be sure. It lacks many of the annoyances of working with humans, particularly
their stubbornness and partiality. It is a compliant and obsequious mimic of social
artifacts. This also means its failures can be confusing, almost like the software is
“mocking”¹ you. It imitates your representations without showing that it understands why
you care about the message or the subject matter. You want the AI ‘artist’ to agree with
your message, and to express that through the resulting creation. But it can’t agree with
you, it can only ape what it’s seen. So it ends up generating “mock” human artifacts — that
is, facsimiles that lack the driving voice of their originals. [End]