
Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping (2309.10667v1)

Published 19 Sep 2023 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: We focus on the task of soundscape mapping, which involves predicting the most probable sounds that could be perceived at a particular geographic location. We utilise recent state-of-the-art models to encode geotagged audio, a textual description of the audio, and an overhead image of its capture location using contrastive pre-training. The end result is a shared embedding space for the three modalities, which enables the construction of soundscape maps for any geographic region from textual or audio queries. Using the SoundingEarth dataset, we find that our approach significantly outperforms the existing SOTA, with an improvement of image-to-audio Recall@100 from 0.256 to 0.450. Our code is available at https://github.com/mvrl/geoclap.
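The shared embedding space and the Recall@100 metric described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes a CLIP-style symmetric contrastive loss applied to each pair of modalities and averaged, a temperature of 0.07, and a batch in which row i of every modality comes from the same sample. The function names (`symmetric_contrastive_loss`, `trimodal_loss`, `recall_at_k`) are hypothetical and chosen for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(a, b, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss; row i of `a` is paired with row i of `b`.

    The temperature value is an assumption, not taken from the paper.
    """
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


def trimodal_loss(image, audio, text, temperature=0.07):
    """Average the pairwise losses over the three modality pairs (assumed weighting)."""
    pairs = [(image, audio), (image, text), (audio, text)]
    return sum(symmetric_contrastive_loss(a, b, temperature) for a, b in pairs) / 3.0


def recall_at_k(queries, gallery, k):
    """Fraction of queries whose true match (same row index) ranks in the top k."""
    sims = l2_normalize(queries) @ l2_normalize(gallery).T
    idx = np.arange(len(queries))
    true_scores = sims[idx, idx][:, None]
    # Rank = number of gallery items scoring strictly higher than the true match.
    ranks = (sims > true_scores).sum(axis=1)
    return float((ranks < k).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.normal(size=(8, 16))
    audio = image + 0.1 * rng.normal(size=(8, 16))  # toy "aligned" audio embeddings
    text = image + 0.1 * rng.normal(size=(8, 16))   # toy "aligned" text embeddings
    print("tri-modal loss:", trimodal_loss(image, audio, text))
    print("image-to-audio Recall@1:", recall_at_k(image, audio, k=1))
```

In this setting, image-to-audio Recall@100 corresponds to calling `recall_at_k` with overhead-image embeddings as queries, audio embeddings as the gallery, and `k=100`.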
