Microsoft has introduced its latest AI model, VASA-1, which pairs a portrait photo with an audio file and generates a video in which the photo appears to “speak and sing in a realistic manner.” Microsoft has shared demonstration videos, including one of the Mona Lisa rapping. Users of the model can adjust parameters such as head movement and gaze direction. In offline mode, VASA-1 can generate 512×512-pixel video at 45fps, while the online version supports up to 40fps. Microsoft states that “VASA-1 can produce lip movements perfectly synchronized with the audio. It can also capture a wide range of subtle facial expressions and natural head movements, enhancing the perception of authenticity and liveliness.” The VASA-1 model is intended primarily for the design of virtual characters. Microsoft emphasizes that, because of concerns that the model could be used to create deepfakes, it has no plans to bring VASA-1 to market.