DNAHLM -- DNA sequence and Human Language mixed large language Model (2410.16917v2)
Abstract: There are already many DNA LLMs, but most of them still follow traditional uses, such as extracting sequence features for classification tasks. More innovative applications of LLMs, such as prompt engineering, RAG, and zero-shot or few-shot prediction, remain challenging for DNA-based models. The key issue is that DNA models and natural-language LLMs are trained entirely separately, while techniques such as prompt engineering require natural language, which significantly limits the application of DNA LLMs. This paper introduces a model pre-trained on the GPT-2 architecture over a combination of DNA sequences and English text, using a unified BPE tokenization method. We then convert classification and other downstream tasks into Alpaca-format instruction data and perform instruction fine-tuning on this pre-trained model, producing a fine-tuned model capable of handling multiple tasks. The model demonstrates its effectiveness in DNA-related zero-shot prediction and multi-task applications. This research provides a highly promising direction for building a unified framework for DNA sequence tasks.
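The abstract describes two concrete steps: training a single BPE tokenizer over mixed DNA and English text, and converting downstream tasks into Alpaca-format instruction data. The sketches below illustrate both steps under stated assumptions; the corpus files, vocabulary size, special tokens, and prompt wording are illustrative placeholders, not details taken from the paper.

A minimal sketch of training a unified BPE tokenizer on mixed corpora with the Hugging Face `tokenizers` library (the file names and vocabulary size are assumptions):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# One shared byte-level BPE tokenizer for both English text and raw DNA sequences.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,  # assumed; the paper's actual vocabulary size is not given here
    special_tokens=["<unk>", "<s>", "</s>"],
)

# english_corpus.txt and dna_sequences.txt are hypothetical corpus files.
tokenizer.train(["english_corpus.txt", "dna_sequences.txt"], trainer)

# Mixed natural-language / DNA input tokenized with the same vocabulary.
print(tokenizer.encode("The promoter sequence is TATAAAAGGCGCGT").tokens)
```

A minimal sketch of wrapping a DNA classification example as an Alpaca-format instruction record (the task, prompt text, and labels are hypothetical; only the instruction/input/output schema follows the standard Alpaca format):

```python
import json

def to_alpaca_record(dna_sequence: str, label: str) -> dict:
    """Convert one labeled DNA example into an Alpaca-style instruction record."""
    return {
        "instruction": (
            "Determine whether the following DNA sequence contains a promoter "
            "region. Answer 'yes' or 'no'."
        ),
        "input": dna_sequence,
        "output": label,
    }

examples = [
    ("TATAAAAGGCGCGTACGATCGATCGTAGCTAGCTAGGCTA", "yes"),
    ("GGCCTTAGCATCGATCGGATCCGGAATTCCGGAAGGTTAA", "no"),
]

records = [to_alpaca_record(seq, lab) for seq, lab in examples]
with open("dna_alpaca.json", "w") as f:
    json.dump(records, f, indent=2)
```

Records in this form can then be fed to a standard instruction-tuning pipeline on top of the mixed-corpus pre-trained model, as the abstract describes.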