Abstract: This paper presents an approach for generating photorealistic video sequences of dynamically varying facial expressions in human-agent interactions. To this end, we study human-human interactions to model the relationship and influence of one individual’s facial expressions in the reaction of the other. We introduce a two level optimization of generative adversarial models, wherein the first stage generates a dynamically varying sequence of the agent’s face sketch conditioned on facial expression features derived from the interacting human partner. This serves as an intermediate representation, which is used to condition a second stage generative model to synthesize high-quality video of the agent face. Our approach uses a novel L1 regularization term computed from layer features of the discriminator, which are integrated with the generator objective in the GAN model. Session constraints are also imposed on video frame generation to ensure appearance consistency between consecutive frames. We demonstrated that our model is effective at generating visually compelling facial expressions. Moreover, we quantitatively showed that agent facial expressions in the generated video clips reflect valid emotional reactions to behavior of the human partner.