Conjecture: Limited Genome Vocabulary Hampers Discriminative Codebook Learning
Establish whether the four-nucleotide vocabulary of genomes (A, T, C, G) fundamentally limits discriminative codebook learning in vector-quantized (VQ) tokenizers for genomic sequences, and whether this limitation causes the learned representations in the VQDNA framework to lose fine-grained detail.
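One intuition behind the question can be sketched with a toy example. It is not VQDNA's architecture: it assumes one-hot nucleotide inputs, a randomly initialised (untrained) codebook, and plain nearest-neighbour assignment. The names `one_hot`, `quantize`, and `kmer_windows` and the codebook size of 512 are hypothetical choices made for illustration. With no context, the quantizer sees only four distinct input vectors, so it can use at most four codes however large the codebook is. With a window of k nucleotides the ceiling rises to 4^k.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Map a DNA string to an (L, 4) one-hot array."""
    idx = np.array([NUCLEOTIDES.index(ch) for ch in seq])
    return np.eye(4)[idx]

def quantize(x, codebook):
    """Assign each row of x to its nearest codebook row (squared L2 distance)."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def kmer_windows(x, k):
    """Flatten every length-k window of an (L, 4) array into a (L-k+1, 4k) array."""
    return np.stack([x[i:i + k].ravel() for i in range(len(x) - k + 1)])

rng = np.random.default_rng(0)
seq = "".join(rng.choice(list(NUCLEOTIDES), size=1000))
x = one_hot(seq)

# Assumption: a 512-entry codebook, randomly initialised and never trained.
codebook_1 = rng.normal(size=(512, 4))
codes_1 = quantize(x, codebook_1)
used_1 = len(np.unique(codes_1))   # bounded by 4 distinct inputs

# Same codebook size, but each input is a 3-nucleotide window (12 dimensions).
k = 3
codebook_k = rng.normal(size=(512, 4 * k))
codes_k = quantize(kmer_windows(x, k), codebook_k)
used_k = len(np.unique(codes_k))   # bounded by 4**k = 64 distinct inputs

print(f"codes used without context: {used_1} of 512")
print(f"codes used with {k}-mer windows: {used_k} of 512")
```

The sketch only bounds codebook utilisation in an untrained setting. Whether a learned encoder with wider context overcomes this limit, or whether fine-grained detail is still lost, is the open question stated above.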
References
Built upon this concept, we further conjecture that the limited original vocabulary of genomes may conceivably hamper discriminative codebook learning, resulting in the loss of fine-grained details trapped in the four nucleotides.
— VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling
(2405.10812 - Li et al., 13 May 2024) in Section 1 (Introduction)