Introduction to Creating Photorealistic Virtual Agents
The term ‘virtual human’ has been on the rise. But what exactly is it, and how do you create one?
If you have interacted with ChatGPT (or another conversational system), you will know that, when prompted, it can take turns to hold a conversation with you, similar to how humans naturally communicate. Unlike ChatGPT, however, humans also communicate using their face, hands, arms, and torso. ChatGPT does not have a body, but virtual agents do.
What is a virtual agent?
A virtual agent is a software entity with a graphical body, which can be realistic or cartoon-like. The body allows it to interact with humans in the manner most natural to us, that is, using speech and non-verbal cues such as eye gaze, facial expressions, gestures, and posture. ‘Virtual agent’ is not the only name for this technology; you may also come across terms such as Embodied Conversational Agent, Intelligent Virtual Agent, Socially Interactive Agent, Virtual Reality Agent, and, most recently, Virtual Human and Digital Human.
When presented in 2D, virtual agents can appear on a smartphone, tablet, or computer screen, but it is also possible to ‘meet’ virtual agents in virtual reality or as life-sized projections in a CAVE. Research and development of virtual agents began in the mid-90s and early 2000s. However, many technological advancements in computer graphics, 3D rendering, and machine learning have occurred in the last 15 years, allowing teams to create virtual agents that are photorealistic in their outward appearance. If you are a gamer and have come across avatars or non-player characters in video games, you might now be scratching your head, asking what the difference between an agent and an avatar is. The difference is that agents are meant to be autonomous, whereas avatars are driven by users. Other defining features of a virtual agent include pro-activeness (the ability to take initiative), reactivity (the ability to respond to the environment in a timely fashion), and social intelligence (the ability to interpret and exhibit social cues for interaction).
In what follows, I will provide a brief explanation of how photorealistic virtual agents can be created, and describe some approaches to behavior generation for these agents. Please note that the approaches described below are just a few examples of the different techniques used in constructing virtual agents, and that the field is constantly evolving. If you are only interested in learning about the possible applications of virtual agents, you can skip ahead to the end of the blog post.
How is photorealistic appearance constructed?
A common approach to constructing the photorealistic appearance of a virtual agent is to create a 3D model of a real human using photogrammetry, a 3D scanning technique. 3D scanning permits creating a 3D model of a real object or person, in this case the person’s face and body, whereas deep learning can be used to create or enhance specific features that are difficult to scan, for instance a person’s hair or teeth. From a technical perspective, photogrammetry entails taking measurements from 2D images to construct a 3D model. In fact, this technique has applications that go beyond the virtual agent domain, including archaeology, architecture, urban planning, digital twinning, animated films, and video games.

In practice, creating a 3D model involves a series of steps. First, multiple cameras are used to capture a large set of images of the person’s face and/or body from different angles. Depending on the capture set-up, the number of cameras can range from a dozen to hundreds; for example, Mugsy, created at Meta Reality Labs, uses 171 cameras to capture the face. Secondly, the obtained images are processed using computer vision algorithms that extract a point cloud, from which a mesh, i.e., a surface representation of the object in 3D, is reconstructed. Thirdly, once the 3D model has been created, texture and detail need to be added to it. This is done by taking the texture from 2D photographs of the object and applying it to the model. Other techniques, such as bump maps, normal maps, or displacement maps, can be used to add further detail to the surface, which is important for creating realism.

In fact, my own research with human and computer-generated faces has shown that humans are susceptible to even the tiniest details in the face. For instance, the rougher (vs. smoother) the skin texture is, the more realistic people perceive the face to be. Moreover, I showed that it is possible to extract skin texture roughness vs. smoothness from facial images and that these computational metrics predict people’s subjective perceptions of face realism. Although I will refrain here from discussing whether it is important that every virtual agent be (photo)realistic (a vast discussion that merits a separate text and has generated a lot of academic research), it should be mentioned that the technology to make high-end photorealistic virtual agents is there.

Another important point is that there is a difference between computer-generated media, created by the family of techniques colloquially known as deep fakes (lip-sync deep fakes, face swaps, etc.), and a 3D model of a virtual agent in a game engine such as Unity or Unreal. At least currently, deep fakes are not as flexible as 3D models: while the former allow for transferring one entity’s behavior onto another (speech and facial expressions, as in this deep fake video of Barack Obama), a 3D model can be autonomously driven using models built with machine learning techniques. For instance, take the face of a virtual agent: every tiny movement of the face (a twitch, a squint) can be controlled independently, which is not the case with a deep fake. This is where we come to behavior generation for virtual agents.
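To make that last point concrete, here is a minimal sketch in Python of what ‘independent control’ of a face could look like; the rig representation and the control names are hypothetical illustrations, not the API of any particular game engine.

```python
from dataclasses import dataclass, field

# Hypothetical face rig: each control corresponds to one independently drivable
# facial movement, loosely analogous to blendshape weights in a game engine.
@dataclass
class FaceRig:
    controls: dict[str, float] = field(default_factory=lambda: {
        "brow_raise_left": 0.0,
        "brow_raise_right": 0.0,
        "eye_squint_left": 0.0,
        "jaw_open": 0.0,
        "lip_corner_pull": 0.0,   # smiling
    })

    def set_control(self, name: str, value: float) -> None:
        """Drive a single facial movement without touching any of the others."""
        self.controls[name] = max(0.0, min(1.0, value))

rig = FaceRig()
rig.set_control("eye_squint_left", 0.4)   # a subtle squint
rig.set_control("lip_corner_pull", 0.7)   # a clear smile
print(rig.controls)
```

In an actual engine such as Unity or Unreal, controls like these would typically correspond to blendshape weights or rig parameters on the character’s face mesh, and it is exactly these parameters that a behavior-generation model drives.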
How can behavior be generated?
Behavior can be verbal (speech) and non-verbal, which, as mentioned, consists of eye gaze, facial expressions, gestures, etc. In the past, the behavior of virtual agents would be pre-scripted by manually defining if-then rules, so that a given set of conditions would trigger a certain behavior, for instance nodding. Driving the behavior of virtual agents with fixed rules presents difficulties: one needs to define a set of desired behaviors and also find a way to vary them, otherwise the virtual agent will come across as repetitive. One also needs to ensure that different kinds of behavior match up, for instance that the virtual agent’s speech is coherent with its gestures. For comparison, the repertoire of human behaviors is incredibly rich, from different types of gestures to the thousands of facial expressions that the human face is able to make (although the scientific literature debates what proportion of all possible facial expressions humans actually make across different contexts). In human communication, many of our behaviors (speech, eye gaze, gestures) are correlated over time, which in turn is difficult to account for when relying on if-then rules to drive the behavior of virtual agents.
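As an illustration of how limited such rules can be, here is a toy sketch of a rule-based behavior trigger in Python; the events, behavior names, and rule set are all made up for illustration and do not come from any particular agent framework.

```python
import random
from typing import Optional

# Toy if-then rules: each observed dialogue event triggers one of a few
# hand-picked behaviors enumerated by the designer.
RULES = {
    "user_finished_sentence": ["nod", "small_nod"],
    "user_asked_question": ["raise_eyebrows", "lean_forward"],
    "long_silence": ["smile", "shift_posture"],
}

def select_behavior(event: str) -> Optional[str]:
    """Pick one behavior for an event; random choice gives only limited variety."""
    candidates = RULES.get(event)
    return random.choice(candidates) if candidates else None

print(select_behavior("user_asked_question"))  # e.g. "lean_forward"
```

Even in this tiny example, the variety of behaviors is capped by what the designer enumerates, and nothing guarantees that a chosen gesture lines up with what the agent is saying at that moment.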
Fortunately, these days the behavior of virtual agents is increasingly generated using machine learning, thanks to the advances of the past decade in deep learning techniques. This implies relying on pre-existing human data, for example a database of videos containing facial expressions of humans. Such databases typically come with human annotations or labels. For instance, if the database is composed of images depicting human facial expressions, the labels may indicate whether an expression shows sadness or surprise, or provide more fine-grained information, such as numerical values scoring facial muscle activations, known in the scientific literature as Action Units. There also exist approaches that bypass the need for manual human annotations and automatically extract features from the data. What a model based on deep learning techniques does is abstract behavioral patterns and find correspondences between different behaviors. The advantage over if-then rules is that these models allow for creating behaviors that are more natural and more variable. However, such models are typically data-hungry, especially the types of architectures (where ‘architecture’ refers to the organization and structure of the model) that have recently shown great progress in generating language, namely the transformers at the core of large language models.
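To give a flavor of the data-driven alternative, below is a minimal sketch, assuming PyTorch, of a model that maps a short window of audio features to Action Unit activations; the feature dimensions, the architecture, and the number of Action Units are illustrative assumptions rather than a description of any specific published system.

```python
import torch
import torch.nn as nn

class SpeechToAU(nn.Module):
    """Toy regressor: a window of speech features -> facial Action Unit activations."""

    def __init__(self, n_audio_features: int = 40, n_action_units: int = 17):
        super().__init__()
        # A recurrent encoder captures how the audio evolves over the window.
        self.encoder = nn.GRU(input_size=n_audio_features, hidden_size=128, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, n_action_units),
            nn.Sigmoid(),  # activation strength of each AU, scaled to [0, 1]
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, time, n_audio_features)
        _, hidden = self.encoder(audio)
        return self.head(hidden[-1])  # one AU vector per input window

model = SpeechToAU()
dummy_audio = torch.randn(2, 30, 40)   # two 30-frame windows of 40-dim audio features
print(model(dummy_audio).shape)        # torch.Size([2, 17])
```

In practice, such a model would be trained on a large annotated (or automatically labelled) database of the kind described above, which is exactly where the data-hungriness mentioned earlier comes in.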
Where can virtual agents be applied?
It must be acknowledged that for quite some time virtual agents were mainly the domain of scientific research, and deploying them in real-world applications was rather difficult. However, with the current advances in computer graphics and machine learning, including large language models, possibilities are emerging for virtual agents to assist humans in different fields. While I would like to highlight education and healthcare as two important application domains, virtual agents can in general also be applied to e-commerce, customer service, games, the movie and visual effects industry, and virtual tourism, among other areas.
Regarding education, one important use case is virtual agents as personalized tutors. Such tutors could deliver learning materials and guide the learner in one-on-one sessions. An embodied agent is beneficial for focusing the learner’s attention and supporting motivation, especially if the virtual tutor is modelled after someone whom the learner admires. Recently, there has been research with AI-generated characters (based on generative AI tools that take 2D images and text as input and generate videos as output) studying whether having a virtual instructor that one admires increases motivation. The research showed that having a lecture delivered by someone whom people admired boosted positive feelings and motivation; however, there was no difference between the group who learned from an online lecture delivered by an admired virtual tutor and a control group. Both virtual agents and AI-generated characters will allow learners to acquire new knowledge through interacting with tutors modelled after famous scientists, renowned artists, and historical figures.
Regarding healthcare, the benefits of applying virtual agents lie in having them serve as healthcare assistants that can interact with patients and/or consumers in applications aimed at, for instance, healthcare education and behavior change. When equipped with relevant and verified medical knowledge, virtual agents could assist healthcare professionals in providing healthcare information and guidance to patients. For instance, if a person is considering undergoing a medical procedure or surgery, a virtual agent can support them by answering the most frequently asked questions and other queries the person may have about what the procedure entails, how best to prepare for it, and/or what the recovery may look like. There already exists research showing that virtual agents can promote physical activity and medication adherence in patients. Virtual agents are thus not meant to replace human professionals but to support them in delivering accessible and high-quality healthcare. On the practitioner side, virtual agents can be used as virtual patients that simulate face-to-face conversation, enabling medical practitioners to improve their communication skills. This aspect is important since, as research has shown, more effective communication leads to better clinical outcomes. Virtual agents can also be used in medical simulations whose goal is to recreate a medical emergency, enabling more immersive training. Physical mannequins are commonly employed for medical training; however, they present limitations when the training scenario requires dialogue to assess symptoms in a medical emergency.
In sum, there are reasons to be optimistic about the future of virtual agents. As technology advances, virtual agents are expected to become more proficient in their communicative abilities, making them more applicable in real-world scenarios to assist and support humans.