TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction (2405.16847v1)

Published 27 May 2024 in cs.CV and cs.AI

Abstract: Autoregressive next-token prediction is a standard pretraining method for large-scale LLMs, but its application to vision tasks is hindered by the non-sequential nature of image data, which leads to cumulative errors. Most vision models instead employ masked autoencoder (MAE) based pretraining, which faces scalability issues. To address these challenges, we introduce TokenUnify, a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction. We provide theoretical evidence that TokenUnify mitigates cumulative errors in visual autoregression. To complement TokenUnify, we have assembled a large-scale, ultra-high-resolution electron microscopy (EM) image dataset that is ideal for creating spatially correlated long sequences. The dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date and providing a unified benchmark for experimental validation. Leveraging the Mamba network, which is inherently suited to long-sequence modeling, on this dataset, TokenUnify not only reduces computational complexity but also achieves a significant 45% improvement in segmentation performance on downstream EM neuron segmentation tasks compared to existing methods. Furthermore, TokenUnify demonstrates superior scalability over MAE and traditional autoregressive methods, effectively bridging the gap between pretraining strategies for language and vision models. Code is available at https://github.com/ydchen0806/TokenUnify.
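
The abstract's core idea, mixing three token-prediction objectives over the same visual token sequence, can be made concrete with a short sketch. The following is a hypothetical PyTorch rendering, not the authors' implementation (see the linked repository for that): the model API (the `causal` flag, `mask_token_id`, and `predict_suffix`), the prefix-based reading of "next-all token prediction", and the equal loss weights are all illustrative assumptions.

```python
# Hypothetical sketch of a mixed token-prediction objective.
# This is NOT the authors' code; see https://github.com/ydchen0806/TokenUnify.
import torch
import torch.nn.functional as F


def mixture_token_loss(model, tokens, mask_ratio=0.15, weights=(1.0, 1.0, 1.0)):
    """tokens: (batch, seq_len) tensor of discrete visual token ids."""
    B, L = tokens.shape

    # 1) Random token prediction (BERT/MAE-style): mask a random subset of
    #    positions and predict them from bidirectional context.
    #    `model.mask_token_id` is an assumed attribute of the model.
    mask = torch.rand(B, L, device=tokens.device) < mask_ratio
    masked_input = tokens.masked_fill(mask, model.mask_token_id)
    logits_rand = model(masked_input, causal=False)  # assumed API
    loss_rand = F.cross_entropy(logits_rand[mask], tokens[mask])

    # 2) Next-token prediction (GPT-style): causal, left-to-right,
    #    each position predicts the token that follows it.
    logits_next = model(tokens[:, :-1], causal=True)  # assumed API
    loss_next = F.cross_entropy(
        logits_next.reshape(-1, logits_next.size(-1)),
        tokens[:, 1:].reshape(-1),
    )

    # 3) Next-all token prediction: from a prefix, score the entire
    #    remaining suffix at once (one plausible reading of "next-all").
    #    `model.predict_suffix` is an assumed helper returning
    #    (batch, suffix_len, vocab) logits.
    cut = L // 2
    logits_all = model.predict_suffix(tokens[:, :cut], L - cut)
    loss_all = F.cross_entropy(
        logits_all.reshape(-1, logits_all.size(-1)),
        tokens[:, cut:].reshape(-1),
    )

    w_rand, w_next, w_all = weights
    return w_rand * loss_rand + w_next * loss_next + w_all * loss_all
```

Intuitively, the masked term supplies bidirectional context as in MAE, the causal term preserves the scalability of autoregressive training, and the next-all term penalizes errors over the whole remaining sequence at once, which is one way the combination could damp the cumulative errors the abstract describes.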
