Philip Rosedale just sent me this new video demo of an avatar singing Christina Aguilera's "Beautiful" in High Fidelity, his Oculus Rift-compatible virtual world, and if you know all the art and technology behind it, you'll think it's pretty cool:
The singer is actually High Fidelity's Emily Donald (who has a lovely voice), and the avatar is imitating her actual face and lip movements in near real time, as tracked by a PC camera pointed at her. The avatar herself sort of looks like a character in a Pixar movie, and that's no surprise: the facial animations were created by High Fidelity's Ozan Serim, who was a longtime manager at Pixar before joining Philip's company. (Serim worked on Monsters University, Cars 2, Brave, and Toy Story 3 there.) The facial animations are more than enough to convey emotion, and the lip sync is just about perfect. (Bad lip sync remains a horrible problem in Second Life, not to mention other MMOs/machinima platforms.) As it happens, Philip and I were just e-mailing about how live music performance can be a compelling experience in virtual reality, so this video is a case study of exactly that.
How was this shot, and what's the latency between her face movements and the avatar animations? Philip explains:
"This was done by Emily and Ozan (who is playing guitar) in our little back office here. Emily is in front of a Primesense Carmine camera, using Faceshift to detect animation." Philip adds: "Ozan has been doing amazing work designing avatar facial movements that map well to that sort of live camera, and this is the result."
As for this performance tracking in near real time: "The latency of both audio and avatar motion ends up being about the same at roughly 100 milliseconds, which is why you can't see any difference between the two. Technically speaking it comes from different reasons (the Primesense camera imposes about 85 milliseconds of delay between photons hitting the screen and us getting the tracking data, but sends/receives at 60 frames per second, whereas the audio has about 40 milliseconds of delay on each of microphone and playback), but they end up getting to you at about the same time." Philip has argued before that 100 milliseconds is the threshold at which magic happens in VR, and that magic seems to be happening here.
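To make the arithmetic concrete, here's a rough back-of-the-envelope tally in Python using only the numbers Philip cites; the small allowances for transport and rendering are my own guesses, not High Fidelity figures.

```python
# Rough latency tally for the demo, using Philip's figures.
# The "EXTRA" terms are illustrative guesses, not numbers from High Fidelity.

TRACKING_DELAY_MS = 85    # Primesense: photons in -> tracking data out
MIC_DELAY_MS = 40         # audio capture delay
PLAYBACK_DELAY_MS = 40    # audio playback delay

EXTRA_VIDEO_MS = 15       # assumed: transport + rendering of the avatar
EXTRA_AUDIO_MS = 20       # assumed: transport of the voice stream

video_path = TRACKING_DELAY_MS + EXTRA_VIDEO_MS          # ~100 ms
audio_path = MIC_DELAY_MS + PLAYBACK_DELAY_MS + EXTRA_AUDIO_MS  # ~100 ms

print(f"face->avatar path: ~{video_path:.0f} ms")
print(f"voice path:        ~{audio_path:.0f} ms")
print(f"audio/visual skew: ~{abs(video_path - audio_path):.0f} ms")
```

Both paths land at roughly the 100-millisecond mark Philip keeps citing, which is why the lips and the voice appear to arrive together.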
"Next," Rosedale adds, "we are working on getting her whole body moving!"
Color me impressed.
Posted by: Metacam Oh | Tuesday, August 05, 2014 at 04:41 PM
It looks pretty rough right now. Her head is tilted back a bit too far, so we don't see her face full on. The eye movement was disconcerting too. But overall, it was impressive compared to SL. Given time, this will look awesome.
Posted by: GoSpeed Racer | Tuesday, August 05, 2014 at 04:53 PM
Still looks awful. Doesn't matter how advanced it is. Awful avatar. Awful expression capture with goggle eyes and a slack jaw.
Posted by: melponeme_k | Tuesday, August 05, 2014 at 05:04 PM
Looks a tad funny, especially the eyes.
However, bloody marvellous voice.
Posted by: Ciaran Laval | Tuesday, August 05, 2014 at 05:16 PM
I can make that body move!
Posted by: Stroker Serpentine | Tuesday, August 05, 2014 at 06:40 PM
I was more impressed with the avatar singing Queen.
Posted by: Jo Yardley | Tuesday, August 05, 2014 at 07:49 PM
Was the avatar singing Bohemian Rhapsody?
The avatar is in the rough stages (although Stroker seems quite attracted to it *cough*) but it's pretty impressive nonetheless.
I think Linden Labs has their work cut out for them.
Posted by: Tracy RedAngel | Tuesday, August 05, 2014 at 09:33 PM
As a vocalist, I think this technology would be amazing if the real time were truly real time. Even the slightest animation delay makes it feel more like watching an avatar sing karaoke than an avatar singing in sync with the music. If you want it to be immersive, you're going to have to eliminate that delay altogether for a worthwhile, believable experience. We're a long way from that at the moment, but give it two years. Yay Oculus teams!
Posted by: ColeMarie Soleil | Tuesday, August 05, 2014 at 10:42 PM
Impressive technology.
But is it what people really want from their virtual world experience?
High Fidelity and Oculus Rift both seem very demanding of your attention and very invasive of your real-life existence.
Right now, we're still at the stage of wearing handicapping contraptions and animating ugly cartoons.
Missing the "wow" factor.
Posted by: A.J. | Tuesday, August 05, 2014 at 11:17 PM
From "crank the car, remove the crank, get in, and hope for the best" to "press a button on your key, the car starts, get in, and drive away."
Technology is rough; then smooth. The folks at HiFi are visionaries with an attainable goal: Create an immersive and low latency world that raises the bar from what's available now. Better yet, from my standpoint, it will still allow users to create.
Posted by: DrFran | Wednesday, August 06, 2014 at 01:52 AM
If HF avis will all look like this, I will surely not join in. Where is the FIDELITY???
Posted by: Carlos Loff | Wednesday, August 06, 2014 at 03:51 AM
chuckle
Posted by: joe | Wednesday, August 06, 2014 at 05:23 AM
I really don't mind the cartoony, stylized approach... as proof of concept. Get that same latency with a realistic, detailed contemporary avatar, and my interest will be more than academic.
On a side note, EverQuest II did facial capture a few years back with the SOEmote feature. I never played with it much, so I don't know how well they did with the lip-sync. Put a voice morpher in the loop, and I would imagine that the timing gets a little dicey. I would trade a little more latency for a rock-solid sync.
Posted by: Arcadia Codesmith | Wednesday, August 06, 2014 at 06:34 AM
great voice, the avi is about two steps below IMVU
Posted by: 2014 | Wednesday, August 06, 2014 at 06:36 AM
Here's a thought, why couldn't HiFi delay the audio stream by a fraction of a second or two to compensate for the minor delay in lip sync?
Posted by: Metacam Oh | Wednesday, August 06, 2014 at 06:42 AM
I hear all the naysaying here. But what you're really looking at is advanced tech. The only reason the avatar looks this way is because this is what the creator chose for it. Think liquid mesh, full materials, great lighting, true gravity. For a platform still in alpha, I've seen far worse.
It's not what you see - because that's restricted to the people involved. It's what's possible (for both good and ill unfortunately) - like Second Life, where the good go to create & build, the bad follow to steal & destroy.
Posted by: Debs Regent | Wednesday, August 06, 2014 at 10:22 AM
Metacam, I can't say for certain, but I think that'd only work if you could precisely measure the gap between audio and video processing... which is probably not consistent from second to second (hence Philip throwing a lot of qualifiers like "roughly" and "about").
I can think of approaches to analyze and match the mouth shape with the phonemes from the audio stream, but I don't know how much processing overhead that would add when you're measuring in milliseconds. "Good enough" might be as good as it gets.
Posted by: Arcadia Codesmith | Wednesday, August 06, 2014 at 10:50 AM
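For what it's worth, the kind of adaptive audio hold-back Metacam suggests and Arcadia is skeptical about might look something like this minimal sketch; the class, names, and numbers are all hypothetical illustrations, not anything High Fidelity has said it does.

```python
from collections import deque

class AudioDelayBuffer:
    """Hypothetical sketch: hold audio frames back by the currently
    estimated video-minus-audio gap, re-estimated continuously because
    (as Arcadia notes) the gap drifts from second to second."""

    def __init__(self, frame_ms=10):
        self.frame_ms = frame_ms   # duration of one audio frame
        self.queue = deque()
        self.delay_ms = 0          # how long to hold audio back

    def update_gap(self, video_latency_ms, audio_latency_ms):
        # Never delay by a negative amount; audio can only be held back.
        self.delay_ms = max(0, video_latency_ms - audio_latency_ms)

    def push(self, audio_frame):
        self.queue.append(audio_frame)

    def pop_ready(self):
        # Release frames only once enough have accumulated to cover
        # the current delay target.
        frames_to_hold = int(self.delay_ms / self.frame_ms)
        out = []
        while len(self.queue) > frames_to_hold:
            out.append(self.queue.popleft())
        return out

# Example: hold audio back ~20 ms if video is lagging it by that much.
buf = AudioDelayBuffer()
buf.update_gap(video_latency_ms=100, audio_latency_ms=80)
```

The hard part, as the thread suggests, is the measurement: the buffer is trivial, but it's only as good as the per-frame latency estimates you feed it.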
Clearly you can get some good expression translation with good lighting conditions and some input smoothing from the raw camera data. Previous live demos suffered from lighting issues and unbuffered output making everyone want to barf into the uncanny valley.
As far as the comparison with SL lip sync goes, there isn't one. The Vivox software SL uses for voice only hands off a very basic "energy" variable via the ParticipantPropertiesEvent. The best you are ever going to get with that is a basic indication of how much you should morph the mouth open.
I know people go on about phoneme detection and puppeting, and when you can't see the lips, it's pretty good. But for a true performance you want as accurate a translation of the face as possible. For an alpha, this is looking really good.
Posted by: Roblem Hogarth | Wednesday, August 06, 2014 at 02:27 PM
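To illustrate Roblem's point about how little you can do with a single loudness value, here's a minimal sketch of an energy-driven lip flap; the function and its parameters are illustrative assumptions, not Second Life's or Vivox's actual code.

```python
def lip_flap_weight(energy, prev_weight, attack=0.6, release=0.2):
    """One loudness value is all you have, so the best you can do is
    open a jaw morph in proportion to it, with a little smoothing so
    the mouth doesn't flutter. Parameter values are illustrative."""
    energy = max(0.0, min(1.0, energy))              # clamp the raw energy value
    rate = attack if energy > prev_weight else release
    return prev_weight + rate * (energy - prev_weight)

# Example: a burst of speech followed by silence.
weight = 0.0
for e in [0.1, 0.7, 0.9, 0.4, 0.0, 0.0]:
    weight = lip_flap_weight(e, weight)
    print(round(weight, 2))
```

With only an energy signal there is nothing to drive lip shapes, which is why camera-based tracking like Faceshift reads so differently.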
Something about it is vaguely creepy. I think it does skirt the edges of the uncanny valley a bit. I'd like to see a side-by-side comparison with her actual face, to see how well it's being tracked. We can see a bit of this in Philip Rosedale's demonstration at SVVR9: http://youtu.be/gaWacrQuEcI?t=42m40s It seems like the avatar has a tendency to smile too much, a lot more than the source, which I think might also be going on here.
Also, I don't really understand how this is supposed to work with the Rift. Will it primarily just track the mouth?
Posted by: Compcat | Wednesday, August 06, 2014 at 04:21 PM
Real-time facial animation was presented for the Unity game engine a year ago; the exact same tech, only more advanced. It's shown in this video at minute 18:30: https://www.youtube.com/watch?v=vjgSbX28Qz0
What is not there is the streaming part, and that could be added by two or three student coders from a tech university in about a month or two. Streaming voice or streaming the captured animation data is really not that hard: you just send the packets with the data from the server to the client. The Unity presentation is worth watching, as the tech shown there gives you a deeper insight.
Looking a bit deeper this did show up: https://www.youtube.com/watch?v=NFBv_ypyhiA
Live motion capture data streaming in Second Life, back in 2010, and nobody blinked when it was possible; nobody used it or saw potential in it.
There are really a lot of impressive gadgets and tech around these days, particularly in Unity, because Unity is free and all the students experiment with it.
Posted by: Partytime | Wednesday, August 06, 2014 at 04:22 PM
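As a rough illustration of Partytime's "just send the packets" point, here's a minimal sketch of streaming per-frame blendshape weights over UDP; the packet layout, port, and function names are all invented for the example and aren't from any of the systems mentioned above.

```python
import socket
import struct
import time

def pack_frame(timestamp_ms, weights):
    # Layout: uint64 timestamp, uint16 count, then count float32 weights.
    return struct.pack(f"<QH{len(weights)}f", timestamp_ms, len(weights), *weights)

def stream_frames(frames, addr=("127.0.0.1", 9000), fps=60):
    # Fire one datagram per tracked frame; the receiver applies the
    # weights to its avatar's morph targets as they arrive.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for weights in frames:
        sock.sendto(pack_frame(int(time.time() * 1000), weights), addr)
        time.sleep(1 / fps)

if __name__ == "__main__":
    # e.g. three frames of five blendshape channels each
    stream_frames([[0.1, 0.0, 0.3, 0.2, 0.0]] * 3)
```

The transport really is the easy part; the hard parts are the capture quality, smoothing, and latency budget discussed earlier in the thread.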
Very nice. What about Sign Language support? Text-to-speech? The deaf/HOH/mute community could be well-served by this technology.
Posted by: Uccie Poultry | Thursday, August 07, 2014 at 11:06 AM