Fine-Tuning and Evaluating Open-Source Large Language Models for the Army Domain (2410.20297v1)
Abstract: In recent years, the widespread adoption of LLMs has sparked interest in their potential for application within the military domain. However, the current generation of LLMs demonstrates suboptimal performance on Army use cases due to the prevalence of domain-specific vocabulary and jargon. To fully leverage LLMs in-domain, many organizations have turned to fine-tuning to circumvent the prohibitive costs of training new LLMs from scratch. In light of this trend, we explore the viability of adapting open-source LLMs for use in the Army domain to address their existing lack of domain specificity. Our investigations have resulted in three distinct generations of TRACLM, a family of LLMs fine-tuned by The Research and Analysis Center (TRAC), Army Futures Command (AFC). Through continuous refinement of our training pipeline, each successive iteration of TRACLM displayed improved capabilities when applied to Army tasks and use cases. Furthermore, throughout our fine-tuning experiments, we recognized the need for an evaluation framework that objectively quantifies the Army domain-specific knowledge of LLMs. To address this, we developed MilBench, an extensible software framework that efficiently evaluates the Army knowledge of a given LLM using tasks derived from doctrine and assessments. We share preliminary results, models, methods, and recommendations from the creation of TRACLM and MilBench. Our work significantly informs the development of LLM technology across the DoD and augments senior leader decisions with respect to artificial intelligence integration.
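As a rough illustration of the evaluation approach described above, the sketch below scores a hypothetical doctrine-derived multiple-choice item by comparing the length-normalized log-likelihood a model assigns to each answer option, in the style of standard LLM evaluation harnesses. This is not the authors' MilBench implementation; the model name, the example question, and the scoring details are assumptions made purely for illustration.

```python
# Minimal sketch of multiple-choice scoring for a doctrine-style question.
# Assumes a Hugging Face causal LM; model name and question are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder open-source base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Average log-probability the model assigns to `option` given `question`."""
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Shift so position i predicts token i+1, then keep only the option tokens.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    target_ids = full_ids[:, 1:]
    token_scores = log_probs.gather(2, target_ids.unsqueeze(-1)).squeeze(-1)
    # Approximate option length in tokens (boundary tokenization effects ignored).
    option_len = full_ids.shape[1] - prompt_len
    return token_scores[0, -option_len:].mean().item()

# Hypothetical MilBench-style item derived from Army doctrine (illustrative only).
question = "According to FM 4-0, which of the following is an element of sustainment?"
options = ["Logistics", "Military deception", "Air and missile defense", "Cyberspace operations"]
scores = [option_logprob(question, opt) for opt in options]
print("Predicted answer:", options[scores.index(max(scores))])
```

Averaging per-token log-probabilities, rather than summing them, is one common way to avoid biasing the comparison toward shorter answer options; MilBench's actual task formats and scoring rules may differ.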