Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing
Abstract: This paper studies an efficient multimodal data communication scheme for video conferencing. In the considered system, a speaker gives a talk to an audience, with talking-head video and audio being transmitted. Since the speaker rarely changes posture while the audio (speech and music) must be delivered with high fidelity, the visual stream contains redundancy that can be removed by generating the video from the audio. To this end, we propose a wave-to-video (Wav2Vid) system, an efficient video transmission framework that reduces the transmitted data by generating the talking-head video from the audio. In particular, the full-duration audio and a short-duration video clip are synchronously transmitted over a wireless channel, with neural networks (NNs) extracting and encoding the audio and video semantics. The receiver then combines the decoded audio and video data and uses a generative adversarial network (GAN) based model to generate the speaker's lip-movement video. Simulation results show that the proposed Wav2Vid system reduces the amount of transmitted data by up to 83% while maintaining the perceptual quality of the generated conferencing video.
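To make the receiver-side generation step concrete, the sketch below shows how decoded audio semantics and one reference frame from the short video clip could be fused by a GAN generator to synthesize a lip-synced frame. This is a minimal PyTorch sketch, not the authors' implementation: the module names (AudioEncoder, FaceEncoder, FrameGenerator), layer sizes, the 80-bin mel-spectrogram input, and the 96x96 output resolution are all illustrative assumptions, and the discriminator used during GAN training is omitted since only inference runs at the receiver.

```python
# Hypothetical receiver-side generation sketch for a Wav2Vid-style system.
# Module names, shapes, and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Maps a decoded mel-spectrogram chunk to a semantic audio embedding."""
    def __init__(self, mel_bins=80, steps=16, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                        # (B, 1, 80, 16) -> (B, 1280)
            nn.Linear(mel_bins * steps, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )
    def forward(self, mel):
        return self.net(mel)

class FaceEncoder(nn.Module):
    """Encodes a reference talking-head frame into an identity embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 96x96 -> 48x48
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 48x48 -> 24x24
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                    # -> (B, 64, 1, 1)
            nn.Flatten(),
            nn.Linear(64, dim),
        )
    def forward(self, frame):
        return self.net(frame)

class FrameGenerator(nn.Module):
    """GAN generator: fuses audio and identity embeddings into one frame."""
    def __init__(self, dim=256):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 64 * 12 * 12)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 12 -> 24
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),  # 24 -> 48
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),   # 48 -> 96
            nn.Tanh(),
        )
    def forward(self, audio_emb, id_emb):
        z = self.fc(torch.cat([audio_emb, id_emb], dim=-1))
        return self.up(z.view(-1, 64, 12, 12))          # (B, 3, 96, 96)

# Example inference over one audio chunk with dummy tensors.
audio_enc, face_enc, gen = AudioEncoder(), FaceEncoder(), FrameGenerator()
mel = torch.randn(1, 1, 80, 16)   # decoded audio semantics for one chunk
ref = torch.randn(1, 3, 96, 96)   # one frame from the short video clip
frame = gen(audio_enc(mel), face_enc(ref))  # generated lip-synced frame
print(frame.shape)                # torch.Size([1, 3, 96, 96])
```

Running the generator over successive audio chunks would then yield the frame sequence that stands in for the untransmitted portion of the video, which is how the redundant visual data can be dropped at the transmitter.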