The PLLuM Instruction Corpus (2511.17161v1)

Published 21 Nov 2025 in cs.CL and cs.AI

Abstract: This paper describes the instruction dataset used to fine-tune a set of transformer-based LLMs developed in the PLLuM (Polish LLM) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.
