Contrastive Representations for Temporal Reasoning (2508.13113v1)
Abstract: In classical AI, perception relies on learning state-based representations, while planning, which can be thought of as temporal reasoning over action sequences, is typically achieved through search. We study whether such reasoning can instead emerge from representations that capture both perceptual and temporal structure. We show that standard temporal contrastive learning, despite its popularity, often fails to capture temporal structure due to its reliance on spurious features. To address this, we introduce Combinatorial Representations for Temporal Reasoning (CRTR), a method that uses a negative sampling scheme to provably remove these spurious features and facilitate temporal reasoning. CRTR achieves strong results on domains with complex temporal structure, such as Sokoban and Rubik's Cube. In particular, for the Rubik's Cube, CRTR learns representations that generalize across all initial states and allow it to solve the puzzle using fewer search steps than BestFS, though with longer solutions. To our knowledge, this is the first method that efficiently solves arbitrary Cube states using only learned representations, without relying on an external search algorithm.
Explain it Like I'm 14
Overview
This paper asks a simple but powerful question: can a computer learn a way of “seeing” puzzles so that it can plan what to do without doing a lot of trial-and-error search? The authors propose a new learning method, called CRTR (Contrastive Representations for Temporal Reasoning), that helps an AI focus on what changes over time in a puzzle (the steps of progress), instead of getting distracted by things that stay the same (like the background layout). They show this makes planning much easier in tricky, step-by-step puzzles such as Sokoban and the Rubik’s Cube.
What questions did the paper ask?
- Can we learn compact “maps” (representations) of puzzle states that capture how to move toward the goal over time?
- Why do popular learning methods sometimes fail on puzzles, and can we fix that?
- If we learn good representations, do we still need heavy search, or can we often solve puzzles with little or no search?
How did they try to solve it? (Methods in simple terms)
Here’s the idea in everyday language:
- A “representation” is like a compressed map of a puzzle state—numbers that tell the AI what matters.
- “Temporal reasoning” means understanding the order of steps: which states are closer to the goal and which are farther.
- “Contrastive learning” is like showing the AI pairs of pictures and saying, “These two belong together” (positive pair) and “These two don’t” (negative pair), so it learns what makes states similar or different.
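To make the standard setup concrete, here is a minimal PyTorch sketch of a temporal contrastive (InfoNCE-style) loss, the kind of objective the paper critiques. The names (`encoder`, `anchors`, `positives`) are illustrative, not from the paper: positives are states a few steps apart in the same episode, and the rest of the batch (drawn from other episodes) serves as negatives.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(encoder, anchors, positives, temperature=0.1):
    """InfoNCE-style temporal contrastive loss (illustrative sketch).

    anchors[i] and positives[i] are states a few steps apart in the SAME
    episode (the positive pair); every other state in the batch, drawn
    from OTHER episodes, acts as a negative for anchors[i].
    """
    z_a = F.normalize(encoder(anchors), dim=-1)    # (B, d) anchor embeddings
    z_p = F.normalize(encoder(positives), dim=-1)  # (B, d) positive embeddings
    logits = z_a @ z_p.T / temperature             # (B, B) pairwise similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)         # diagonal = positive pairs
```

Note that with this batch construction, anything that separates episodes, including a static context like wall layout, is enough to tell positives from negatives, which is exactly the shortcut described next.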
The problem with standard contrastive learning
In many puzzles, each level has a constant “context,” like the fixed walls in a Sokoban maze or the background of a scene. Standard contrastive learning often learns to tell states apart by memorizing this context. That’s a shortcut: it helps separate examples in training, but it doesn’t teach the AI how the puzzle actually evolves over time (which is what you need for planning). The result: the AI groups states by the maze’s wallpaper instead of by “how close they are to solving the puzzle.”
The CRTR idea
CRTR changes how the training pairs are picked to force the AI to pay attention to time, not wallpaper.
- Instead of only comparing states from different episodes/levels, CRTR also compares states from the same episode but far apart in time.
- Analogy: imagine reading the same story and being asked to tell apart an early page and a late page—even though the page border (context) looks the same, the plot (temporal progress) is different. To succeed, you must focus on what changed in the story, not the static decorations.
- This “in-trajectory negatives” trick removes the usefulness of static context as a shortcut. The model learns embeddings (representations) that reflect how states progress toward the goal.
Technically, this corresponds to training on negatives that share the same context, which pushes the model to ignore context and capture temporal structure. You don’t need to know which pixels are “context” ahead of time—the sampling scheme makes the model figure it out.
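Here is a minimal sketch of that sampling idea, not the authors' exact scheme; `trajectory`, `pos_gap`, and `min_neg_gap` are hypothetical names. The key point is that the negative comes from the same trajectory as the anchor, so static context can no longer distinguish them.

```python
import random

def sample_in_trajectory_triplet(trajectory, pos_gap=1, min_neg_gap=20):
    """Sample (anchor, positive, negative) from a SINGLE trajectory (sketch).

    Because the negative comes from the same episode, it shares the
    anchor's static context (walls, background); the only remaining
    signal separating positive from negative is temporal progress.
    """
    t = random.randrange(len(trajectory) - pos_gap)
    anchor = trajectory[t]
    positive = trajectory[t + pos_gap]  # temporally close: belongs with anchor
    # Negative: a state far away in time within the SAME trajectory.
    far_indices = [i for i in range(len(trajectory)) if abs(i - t) >= min_neg_gap]
    neg_idx = random.choice(far_indices) if far_indices else len(trajectory) - 1
    return anchor, positive, trajectory[neg_idx]
```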
Training and evaluation
- The method is trained on recorded sequences of puzzle-solving steps (they don’t need to be perfect solutions).
- The learned representation defines a distance: states closer to the goal should be closer in this learned space.
- Planning uses this distance in two ways (see the sketch after this list):
  - Without search: greedily pick the legal move whose resulting state is closest to the goal in the learned space.
  - With simple search (Best-First Search): expand the most promising states first, using the learned distance as the guide.
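The sketch below shows both planning modes; it is not the authors' code. It assumes a hypothetical `encoder` that maps one state to a NumPy vector, a `successors(state)` function that lists legal next states, and states that are hashable and comparable for equality.

```python
import heapq
import numpy as np

def goal_distance(encoder, state, goal):
    """Distance in the learned space; smaller = closer to solving."""
    return float(np.linalg.norm(encoder(state) - encoder(goal)))

def greedy_plan(encoder, start, goal, successors, max_steps=500):
    """Search-free planner: always step to the successor whose
    embedding is closest to the goal's."""
    state = start
    for _ in range(max_steps):
        if state == goal:
            return True
        state = min(successors(state),
                    key=lambda s: goal_distance(encoder, s, goal))
    return False

def best_first_search(encoder, start, goal, successors, budget=10_000):
    """Best-First Search guided by the learned distance as a heuristic."""
    frontier = [(goal_distance(encoder, start, goal), 0, start)]
    seen, tie = {start}, 0
    while frontier and budget > 0:
        _, _, state = heapq.heappop(frontier)
        budget -= 1  # count node expansions against the search budget
        if state == goal:
            return True
        for s in successors(state):
            if s not in seen:
                seen.add(s)
                tie += 1  # tiebreaker keeps heap comparisons numeric
                heapq.heappush(frontier,
                               (goal_distance(encoder, s, goal), tie, s))
    return False
```

The greedy planner never backtracks, so it only works when the learned distance decreases fairly consistently along solution paths; Best-First Search adds a frontier as a safety net when it does not.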
They tested CRTR on five challenging puzzles:
- Sokoban (push boxes to targets in a maze),
- Rubik’s Cube,
- 15-puzzle (sliding tiles),
- Lights Out (toggle lights to all off),
- Digit Jumper (grid jumps constrained by numbers).
What did they find? (Main results)
- CRTR learns representations that line up with time-to-goal. States that are a similar number of steps from the goal cluster together, even across different levels. Standard methods cluster by background/context instead, which is less useful for planning.
- CRTR often solves puzzles with less search. In four out of five tasks, the greedy “no-search” planner (just stepping to whichever neighbor is closest to the goal in the learned space) solves nearly all test cases within a move budget.
- On the Rubik’s Cube:
  - The learned representation alone (no search) solved all test scrambles within the allowed move budget. Solutions run to hundreds of moves, so they are far from optimal, but they are found reliably.
  - With a simple search, CRTR needed fewer node expansions than a strong baseline search (BestFS), though its final solutions used more moves.
  - This is, to the authors’ knowledge, the first time arbitrary Cube states have been efficiently solved using only learned representations, without relying on an external search algorithm.
- Across Sokoban, the Rubik’s Cube, the 15-puzzle, Lights Out, and Digit Jumper, CRTR matched or beat strong baselines (including supervised methods and a known search+learning system) on success rate within a fixed search budget.
Why this matters:
- If your representation captures temporal structure, planning becomes a lot easier—sometimes you barely need search at all.
- It shows a path toward faster, simpler solvers that learn from data without hand-crafted heuristics.
Why it matters and what could come next
This work suggests that a big chunk of “reasoning” can be done by learning the right representation of states—one that encodes progress toward the goal and ignores distracting background details. That could:
- Reduce or even remove the need for heavy search in many problems.
- Make AI planning faster and more reliable, especially in large, complex state spaces.
- Open the door to applying similar ideas in other structured tasks, like designing chemical reactions or robot assembly, where understanding “what changes over time” is crucial.
In short, CRTR teaches models to track the story of a puzzle—not just remember the wallpaper—and that makes a major difference when you need to plan your next move.