Multimodal active speaker detection and virtual cinematography for video conferencing (2002.03977v3)

Published 10 Feb 2020 in eess.AS, cs.LG, cs.MM, and stat.ML

Abstract: Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting, and zooming a video conferencing camera: users subjectively rate an expert video cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC system that performs within 0.3 mean opinion score (MOS) points of an expert cinematographer, based on subjective ratings on a 1-5 scale. This system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD using an AdaBoost machine learning system that is very efficient and runs in real time. A VC is similarly trained using machine learning to optimize the subjective quality of the overall experience. To avoid distracting the room participants and to reduce switching latency, the system has no moving parts: the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned and evaluated using extensive crowdsourcing techniques on a dataset of N=100 meetings, each 2-5 minutes in length.
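
The abstract's "no moving parts" design, where pan, tilt, and zoom are emulated by cropping the 4K wide-FOV stream, can be illustrated with a small geometric sketch. This is not the paper's trained VC: the names `Rect` and `crop_for_speaker`, the `margin` padding, the 16:9 output aspect ratio, and the 3840x2160 frame size are assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper)."""
    x: int
    y: int
    w: int
    h: int


def crop_for_speaker(speaker: Rect, frame_w: int = 3840, frame_h: int = 2160,
                     margin: float = 0.5, aspect: float = 16 / 9) -> Rect:
    """Return a crop window around a detected speaker.

    Assumed behavior (not taken from the paper): pad the speaker box by
    `margin` on every side, grow it to the output aspect ratio, then keep
    it inside the frame.
    """
    # Pad the speaker box on all four sides.
    w = speaker.w * (1 + 2 * margin)
    h = speaker.h * (1 + 2 * margin)

    # Grow the shorter dimension so the window matches the output aspect.
    if w / h < aspect:
        w = h * aspect
    else:
        h = w / aspect

    # If the window is larger than the frame, shrink it uniformly.
    scale = min(1.0, frame_w / w, frame_h / h)
    w, h = w * scale, h * scale

    # Center on the speaker, then clamp so the window stays in the frame.
    cx = speaker.x + speaker.w / 2
    cy = speaker.y + speaker.h / 2
    x = min(max(cx - w / 2, 0), frame_w - w)
    y = min(max(cy - h / 2, 0), frame_h - h)
    return Rect(round(x), round(y), round(w), round(h))


if __name__ == "__main__":
    # A speaker near the middle of a 4K frame.
    print(crop_for_speaker(Rect(1800, 900, 240, 360)))
```

Because the crop is computed per frame from a detection box, switching between speakers only changes the window coordinates, which is why a crop-based approach can avoid the latency of a mechanical pan-tilt-zoom camera. A real system would also smooth the window over time; that step is omitted here.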

Authors (7)
  1. Ross Cutler (54 papers)
  2. Ramin Mehran (4 papers)
  3. Sam Johnson (7 papers)
  4. Cha Zhang (23 papers)
  5. Adam Kirk (1 paper)
  6. Oliver Whyte (1 paper)
  7. Adarsh Kowdle (7 papers)
Citations (7)