Compressing and Interpreting Word Embeddings with Latent Space Regularization and Interactive Semantics Probing (2403.16815v1)

Published 25 Mar 2024 in cs.HC

Abstract: Word embedding, a high-dimensional (HD) numerical representation of words generated by machine learning models, has been used for various natural language processing tasks, e.g., translation between two languages. Recently, there has been an increasing trend of transforming HD embeddings into a latent space (e.g., via autoencoders) for further tasks, exploiting the merits that latent representations can bring. To preserve the embeddings' quality, these works often map the embeddings into an even higher-dimensional latent space, making the already complicated embeddings even less interpretable and consuming more storage. In this work, we borrow the idea of $\beta$-VAE to regularize the HD latent space. Our regularization implicitly condenses information from the HD latent space into a much lower-dimensional space, thus compressing the embeddings. We also show that each dimension of our regularized latent space is more semantically salient, and validate this assertion by interactively probing the encoding level of user-proposed semantics across the dimensions. To this end, we design a visual analytics system to monitor the regularization process, explore the HD latent space, and interpret the latent dimensions' semantics. We validate the effectiveness of our embedding regularization and interpretation approach through both quantitative and qualitative evaluations.
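The β-VAE idea the abstract borrows augments the usual autoencoder reconstruction objective with a KL-divergence term weighted by β > 1, which pressures the encoder to pack information into fewer latent dimensions. A minimal, dependency-free sketch of that objective (function names are ours for illustration, not from the paper; the closed-form KL for a diagonal Gaussian against a standard normal prior is standard VAE machinery):

```python
import math

def kl_diag_gaussian(mu, log_var):
    """Closed-form KL(q || N(0, I)) for a diagonal Gaussian q with
    per-dimension means `mu` and log-variances `log_var`, summed
    over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, log_var)
    )

def beta_vae_loss(recon_error, mu, log_var, beta=4.0):
    """beta-VAE objective: reconstruction error plus a beta-weighted
    KL term. With beta > 1, dimensions carrying little information
    are pushed toward the prior (mu ~ 0, var ~ 1), which is the
    implicit compression effect described in the abstract."""
    return recon_error + beta * kl_diag_gaussian(mu, log_var)
```

A latent dimension matching the prior (mu = 0, log-variance = 0) contributes zero KL, so only dimensions that actually encode information pay the β-weighted penalty.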
