Intention behind Tulu-3 learning the phrase “I hope it is correct”

Ascertain whether the phrase “I hope it is correct”, which appears in assistant responses in the tulu-3-sft-mixture supervised fine-tuning dataset, was intended to be learned by the finetuned model Tulu-3 (finetuned from Llama-3.1-8B), specifically in contexts where prompts contain mathematics, lists, or LaTeX formatting. In other words, determine whether the observed prompt–response correlation reflects an intended training objective or an unintended artifact of dataset construction.

Background

The authors use sparse autoencoder embeddings to audit the tulu-3-sft-mixture dataset used to finetune Tulu-3 from Llama-3.1-8B. They identify a prompt–response correlation: math- and list-structured prompts (often with LaTeX) co-occur with assistant responses containing the phrase “I hope it is correct.”
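
The paper's audit is based on sparse autoencoder embeddings; as a rough sanity check of the reported correlation, the phrase can also be counted directly against simple textual cues for math-, list-, or LaTeX-style prompts. The sketch below is a minimal string-level approximation, assuming the mixture is published as allenai/tulu-3-sft-mixture on the Hugging Face Hub with a "messages" column of role/content dicts; the marker list is a crude stand-in for the SAE features used in the paper, not the authors' method.

```python
from datasets import load_dataset

PHRASE = "I hope it is correct"
# Crude textual cues for math/LaTeX/list-structured prompts (an assumption,
# not the paper's SAE-based detection).
MATH_MARKERS = ("\\frac", "\\begin{", "$", "\n1.", "\n- ")

# Assumed dataset ID and schema: a "messages" list of {"role", "content"} dicts.
ds = load_dataset("allenai/tulu-3-sft-mixture", split="train")

counts = {
    "math_prompt": 0,
    "math_prompt_with_phrase": 0,
    "other_prompt": 0,
    "other_prompt_with_phrase": 0,
}

for example in ds:
    messages = example["messages"]
    prompt = " ".join(m["content"] for m in messages if m["role"] == "user")
    response = " ".join(m["content"] for m in messages if m["role"] == "assistant")

    key = "math_prompt" if any(mk in prompt for mk in MATH_MARKERS) else "other_prompt"
    counts[key] += 1
    if PHRASE in response:
        counts[key + "_with_phrase"] += 1

# Compare phrase rates between math/list-style prompts and all other prompts.
for key in ("math_prompt", "other_prompt"):
    total, hits = counts[key], counts[key + "_with_phrase"]
    rate = hits / total if total else 0.0
    print(f"{key}: {hits}/{total} responses contain the phrase ({rate:.2%})")
```

A large gap between the two rates would reproduce, at the string level, the prompt–response correlation the authors surface with their embedding-based toolkit.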

Upon examining the dataset construction process, they note that the phrase was a formatting instruction given to the data-generating model. However, they explicitly state uncertainty about whether the dataset creators intended the downstream finetuned model to learn and reproduce this phrase. Clarifying this intention is important for understanding whether the behavior is a desired alignment target or a spurious artifact to be mitigated.

References

Examination of the original dataset construction paper confirms that the phrase was indeed a formatting instruction given to the data-generating model, although whether Tulu-3 was intended to learn this behavior remains unclear.

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit (Jiang et al., arXiv:2512.10092, 10 Dec 2025), Section 6.2, Case Studies: Debugging Tulu-3’s post-training dataset