Optimal column serialization for LLMs on tabular data

Determine effective strategies for serializing tabular column content into textual inputs for large language models in schema matching, so that the resulting representations allow an LLM-based reranking pipeline to accurately assess column correspondences.

Background

Within the Magneto framework, LLMs are used to rerank candidate column matches. Because LLMs operate on text, tabular columns must be serialized into textual inputs before scoring. The authors note that prompt length, context window limits, and the need to retain relevant semantics make this design choice critical.
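To make the constraint concrete, one budget-aware approach is to sample distinct cell values into the serialized string until a length limit is hit. This is a minimal sketch under assumptions of my own; the function name, the character-based budget, and the deduplication step are illustrative choices, not Magneto's actual implementation.

```python
# Hypothetical sketch: keep a serialized column within a prompt budget by
# appending distinct values until a character limit would be exceeded.
# Names and the character-count proxy for tokens are assumptions.

def serialize_within_budget(header, values, max_chars=200):
    """Serialize header + distinct sampled values, staying under max_chars."""
    parts, text = [], header + ":"
    for v in dict.fromkeys(map(str, values)):  # distinct, order-preserving
        candidate = text + (" " if not parts else ", ") + v
        if len(candidate) > max_chars:
            break  # adding this value would exceed the budget
        parts.append(v)
        text = candidate
    return text
```

In practice a token counter for the target model would replace the character count, but the structure (greedy fill against a hard limit) is the same.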

They state explicitly that choosing how to serialize column content remains an open research problem in the broader LLM-for-tables literature. Although the paper empirically explores several serialization options (default, verbose, and repeated header variants) and evaluates their impact, it does not resolve the general question of what serialization strategies are best across tasks and domains.
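The three variants the paper explores can be sketched as simple text templates. The formats below are illustrative guesses at what "default", "verbose", and "repeated header" styles might look like; they are not the paper's exact prompt templates.

```python
# Hypothetical serialization variants for a single column, written as
# plausible interpretations of the default / verbose / repeated-header
# options; the exact wording used by Magneto may differ.

def serialize_column(header, values, style="default", max_values=5):
    """Turn a column (header + sampled cell values) into a text snippet."""
    sample = [str(v) for v in values[:max_values]]
    if style == "default":
        # Header followed by a comma-separated value sample.
        return f"{header}: {', '.join(sample)}"
    if style == "verbose":
        # Spell out the structure in natural language.
        return (f"Column named '{header}' containing values such as "
                f"{', '.join(sample)}.")
    if style == "repeat":
        # Repeat the header before every value to reinforce the association.
        return "; ".join(f"{header}: {v}" for v in sample)
    raise ValueError(f"unknown style: {style}")
```

For example, `serialize_column("price", [9.99, 12.5], style="repeat")` yields `"price: 9.99; price: 12.5"`; the repeated-header form trades prompt length for a stronger header-value association in the model's input.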

References

Selecting the right serialization strategy is still an open research problem that has attracted substantial attention.

Magneto: Combining Small and Large Language Models for Schema Matching (2412.08194 - Liu et al., 11 Dec 2024) in Section 1, Introduction (Our Approach)