Example-Based Synthesis of Stylized Facial Animations (results)

In November of 2016, we spoke with Adobe about Project StyLit, a research project that used artificial intelligence to apply an artistic style to a three-dimensional model. It was an intriguing technology, but it had many limitations and an apparently limited usefulness. Building off of what the team learned with Project StyLit, Adobe Research has now taken a big step forward with a new project that, for now, is known simply as “Stylized Facial Animation.” It can apply an input style (or “style exemplar”) to video with very natural results, doing a much better job of imitating hand-drawn animation than other techniques. While it is currently limited to faces, it shows promise for how animators will one day be able to automate the rotoscoping process, or at least aspects of it.

The technology will be publicly demonstrated on August 3 at SIGGRAPH, but Digital Trends had the chance to speak with Adobe’s David Simons about the project ahead of the presentation. Simons holds the role of Senior Principal Scientist within Adobe’s Advanced Product Development Group and is a 25-year veteran of After Effects, Adobe’s special effects program for video. After Effects seems a likely candidate for where the technology demonstrated in Stylized Facial Animation could one day end up. At this time, Adobe has not announced any plans for a commercial release of the tech, but it is the mission of Simons and the Advanced Product Development Group to turn research projects into marketable products, so it may not be that far off from finding its way into a future Creative Cloud update (it’s highly likely that it will be demonstrated at Adobe’s upcoming MAX conference).

Neural networks to the rescue

As detailed in a research paper, Stylized Facial Animation is based on previous research on neural network computing, conducted by both Adobe and others. One of the earliest and most popular consumer applications of neural network technology was the Prisma app, which applied an artistic style to an image and, later, to video. An important distinction between the likes of Prisma and simple image filters is that apps based on neural networks actually redraw the image from scratch, rather than overlay the style on top of the photograph. But while Prisma could do this very quickly, it did a relatively poor job of maintaining details present in the source style.

While it still struggles in some specific areas, like forehead wrinkles, Adobe’s approach significantly increases the amount of artistic detail that makes it through to the final output. It accomplishes this by analyzing both the style exemplar and the target video, creating “guiding channels” that ensure the style is appropriately applied to the target. One channel is used to align the position of the exemplar and the target, another maintains the shading present in the target subject, and a third matches semantically similar regions (eyes to eyes, hair to hair, etc.).

The underlying technology is actually based on the Content-Aware Fill feature in Adobe Photoshop. “It looks at nearby areas and takes patches that fit together. It’s like trying to fill in a jigsaw puzzle with pieces from the rest of the area that match up,” Simons said.

Example-Based Synthesis of Stylized Facial Animations

The result is, essentially, just how you would expect such a technology to work, but it is considerably different from what we’ve seen in apps like Prisma as it maintains so much more detail from the source style. But when it comes to applying the effect to multiple frames for an animation, things get even trickier. With Project StyLit, animations could only be created by applying a style one frame at a time, then manually connecting the frames in Adobe Premiere Pro or another video-editing application.

“A lot of it is about avoiding the uncanny valley. You want a level of abstraction.”

One of the key components of Stylized Facial Animation is that it can apply a style to an entire video at once, but this can lead to the issue of temporal coherence. This is somewhat ironic, as temporal coherence is typically what such programs aim for — that is to say, where each frame logically matches up with the previous frame, leading to smooth motion — but this actually has the effect of making the resulting animation look less like it was done by hand. Hand drawn animations have a natural amount of decoherence between frames, as the pencil lines or brush strokes won’t perfectly match from one frame to the next. Existing methods inherently achieved very high temporal coherence, so Adobe researchers knew they had to develop their own way to reduce coherence in Stylized Facial Animation in order to get more natural results.

“We want it to look redrawn on every frame,” Simons explained, “But not so flickery that you can’t see what’s going on. A lot of it is about avoiding the uncanny valley. You want a level of abstraction.”

The solution came in allowing control over “temporal noise” by using a separate guiding channel to determine coherence. As each frame is synthesized from the preceding one, controlling the amount of blur in a frame’s temporal guiding channel will raise or lower coherence in the next frame, making it possible to find a balance between an animation that is too choppy and too smooth.

Applying the technology

While Simons was quick to reaffirm that the project is definitely “still research at this time,” there are two possible applications for it in the future. One would be for animating live-action footage, as demonstrated in the research paper, and the other would be for creating puppets to use in Adobe’s Character Animator, which automatically animates a character based on video of an actor.

Any portrait — a sketch, painting, or even a photograph — can be used to style the video.

Compared to Project StyLit, another advantage with Stylized Facial Animation is that any portrait can be used as a style exemplar. This doesn’t just give artists greater control over creating their own styles compared to a basic sphere study, but it also means non-artists can grab source images from virtually anywhere and use them as style exemplars. The sources need not even be drawings or paintings; photographs (even of statues) will also work.

Image used with permission by copyright holder

Going forward, one of the main goals will be getting to the point where the program can run in real time, or at least close to it. Currently, it takes about 30 seconds just to compute the guiding channels and three minutes to perform full synthesis of a single, 1-megapixel frame using a quad-core processor running at 3 gigahertz, according to the research paper. However, the team has already demonstrated vast performance improvements possible by using a graphics processing unit (GPU), which reduces the time required for single-frame synthesis down to just 5 seconds. Given that most video runs at 24 or 30 frames per second, this is still a ways off from achieving real time synthesis, but the project is also still in very early development.

Whatever the eventual outcome of Stylized Facial Animation, it is clear that Adobe is very interested in continuing to develop the technology. The paper’s lead author, Jakub Fiser, started out as an intern at Adobe. This project was his Ph.D. thesis, and was enough for Adobe to bring him on full time. Fiser joins an experienced group of individuals, of whom Simons spoke highly, even while keeping modest with regard to his own role. “Everyone else is a Ph.D.,” he said. “They’re the real researchers.”