Institutional Platform for Secure Self-Service Large Language Model Exploration (2402.00913v2)
Abstract: This paper introduces a user-friendly platform developed by the University of Kentucky Center for Applied AI, designed to make customized large language models (LLMs) more accessible. By capitalizing on recent advancements in multi-LoRA inference, the system efficiently serves custom adapters for a diverse range of users and projects. The paper outlines the system's architecture and key features, encompassing dataset curation, model training, secure inference, and text-based feature extraction. We describe how agent-based methods establish a tenant-aware computational network that securely combines isolated islands of resources into a unified system. The platform strives to deliver secure LLM services, emphasizing process and data isolation, end-to-end encryption, and role-based resource authentication. This contribution supports the overarching goal of simplifying access to cutting-edge AI models and technology in support of scientific discovery.
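As a minimal sketch of the multi-LoRA idea the abstract describes, the example below attaches several per-tenant LoRA adapters to one shared base model and routes each request to the requesting tenant's adapter. It uses Hugging Face Transformers and PEFT; the base model identifier, adapter repositories, and tenant names are placeholders, and this is not the platform's actual implementation. A production multi-LoRA server would batch requests across adapters rather than switching them sequentially as shown here.

```python
# Illustrative sketch only: per-tenant LoRA adapters on one shared base model.
# All model/adapter identifiers and tenant names are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-2-7b-hf"      # shared base weights (assumed)
TENANT_ADAPTERS = {                           # hypothetical tenant -> adapter mapping
    "lab-genomics": "org/genomics-lora",
    "lab-clinical": "org/clinical-notes-lora",
}

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")

# Attach the first adapter, then register the rest. Each adapter holds only
# small low-rank weight deltas, so many projects can share one base model.
tenants = list(TENANT_ADAPTERS)
model = PeftModel.from_pretrained(
    base, TENANT_ADAPTERS[tenants[0]], adapter_name=tenants[0]
)
for name in tenants[1:]:
    model.load_adapter(TENANT_ADAPTERS[name], adapter_name=name)

def generate_for_tenant(tenant: str, prompt: str) -> str:
    """Activate the tenant's adapter, then generate a completion."""
    model.set_adapter(tenant)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(generate_for_tenant("lab-genomics", "Summarize the variant report:"))
```

Because only the low-rank adapter weights differ between tenants, the expensive base weights are loaded once and shared, which is what makes per-user and per-project customization tractable at the scale the platform targets.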