Visual Perceptual to Conceptual First-Order Rule Learning Networks

Published 9 Apr 2026 in cs.AI and cs.LG | (2604.07897v1)

Abstract: Learning rules plays a crucial role in deep learning, particularly in explainable artificial intelligence and enhancing the reasoning capabilities of LLMs. While existing rule learning methods are primarily designed for symbolic data, learning rules from image data without supporting image labels and automatically inventing predicates remains a challenge. In this paper, we tackle these inductive rule learning problems from images with a framework called γILP, which provides a fully differentiable pipeline from image constant substitution to rule structure induction. Extensive experiments demonstrate that γILP achieves strong performance not only on classical symbolic relational datasets but also on relational image data and pure image datasets, such as Kandinsky patterns.


Authors (4)

Summary

The paper introduces γILP, a fully differentiable ILP pipeline that grounds visual data into symbolic logic using clustering and LLM-based predicate invention.
The paper reports precision and recall of 1 on relational image tasks requiring up to three variables, outperforming state-of-the-art multimodal LLMs when relation semantics are obfuscated.
The paper highlights advances in neuro-symbolic reasoning by enabling predicate invention from pure image data and outlining future directions for multimodal inputs.

Inductive First-Order Rule Learning from Visual Perceptual Data: The γILP Framework

Motivation and Problem Statement

Inductive rule learning is vital for interpretable and explainable AI: it underpins trustworthy automated systems and strengthens reasoning in multimodal contexts. Traditional inductive logic programming (ILP) frameworks are designed for symbolic, relational data, but in real-world settings image data increasingly serves as the constants of knowledge graphs and relational structures. A critical challenge in this domain is symbol grounding: mapping visual representations to symbolic variables for logical reasoning without explicit labels, thus avoiding label leakage. Furthermore, when relational descriptions are absent, predicate invention (generating novel relations to characterize image-based concepts) is required.

Methodological Framework: γILP

The paper introduces γILP, a fully differentiable ILP pipeline that operates on image constants, supporting both cases where relations are predefined (relational image datasets) and where they are absent (pure image datasets, e.g., Kandinsky patterns). The methodological advances are threefold:

Latent Space Generalization via Differentiable Clustering: Image constants are embedded by pretrained encoders (ViT, VAE), with clustering serving as the generalization function that maps specific constants to cluster centroids corresponding to logical variables. Clustering is implemented through a differentiable approach, optimizing a soft assignment objective on GPU.
Differentiable Substitution and Rule Induction: The substitution mechanism is also fully differentiable and operates at batch scale, so it can be expressed as tensor operations that use the GPU efficiently. For predefined relations, substitution connects cluster centroids according to positive and negative examples, introducing language bias for forward-chaining when necessary. For undefined relations, variable constraints tie the number of logic variables directly to the number of clusters, and body atoms are grounded using cluster centroids.
Predicate Invention and Semantic Translation via LLMs: γILP invents predicates when relational structure is absent, representing them with placeholders in first-order logic rules. These predicates are interpreted by querying LLMs, which translate the visual semantics underlying the cluster variables into natural language.
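
As a rough illustration of the first component, the sketch below shows one common way to make cluster assignment differentiable: a softmax over negative squared distances to the centroids, with the expected squared distance as the objective. This is a generic soft k-means formulation chosen for illustration, not the paper's confirmed objective; the function names, the `temperature` parameter, and the toy 2-D points standing in for encoder embeddings are all assumptions.

```python
import math

def soft_assign(points, centroids, temperature=1.0):
    """Return, for each point, a softmax distribution over centroids."""
    assignments = []
    for p in points:
        # Negative squared Euclidean distance serves as the logit per cluster.
        logits = [-sum((pi - ci) ** 2 for pi, ci in zip(p, c)) / temperature
                  for c in centroids]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        assignments.append([e / total for e in exps])
    return assignments

def soft_clustering_loss(points, centroids, temperature=1.0):
    """Expected squared distance under the soft assignments, averaged over points."""
    probs = soft_assign(points, centroids, temperature)
    loss = 0.0
    for p, row in zip(points, probs):
        for c, w in zip(centroids, row):
            loss += w * sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return loss / len(points)

def update_centroids(points, centroids, temperature=1.0):
    """One soft k-means step: move each centroid to its weighted mean."""
    probs = soft_assign(points, centroids, temperature)
    dim = len(points[0])
    updated = []
    for k in range(len(centroids)):
        weight = sum(row[k] for row in probs)
        updated.append([sum(row[k] * p[d] for row, p in zip(probs, points)) / weight
                        for d in range(dim)])
    return updated

# Toy "embeddings" (hypothetical): two well-separated groups in 2-D.
points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
centroids = [[1.0, 1.0], [4.0, 4.0]]
for _ in range(10):
    centroids = update_centroids(points, centroids)

# Hard cluster index per point, read off the soft assignments.
hard = [max(range(len(row)), key=row.__getitem__)
        for row in soft_assign(points, centroids)]
```

In this reading, each centroid plays the role of a logical variable, and the soft assignment is what lets gradients flow from the rule-learning loss back to the clustering. A gradient-based optimizer could replace the closed-form update used here.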

Experimental Evaluation

Relational Image Datasets

γILP was evaluated on classical ILP datasets adapted to relational image settings, with constants replaced by MNIST images and relations encoded as text or random strings to prevent label leakage. The encoders were a VAE and a ViT. γILP achieves precision and recall of 1 on standard tasks requiring up to three variables, outperforming state-of-the-art multimodal LLMs (Gemini 2.5 Pro, GPT-5), which failed to induce complete rules in certain tasks when the relation semantics were obfuscated.
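
To make the precision and recall figures concrete, the following sketch scores an induced rule by comparing the facts it derives against the positive examples. The grandparent rule, the constants, and the helper names are hypothetical and are not taken from the paper's datasets; the paper's exact metric definitions may differ.

```python
def apply_rule(parent_facts):
    """Derive grandparent(X, Z) :- parent(X, Y), parent(Y, Z) by one join."""
    return {(x, z)
            for (x, y1) in parent_facts
            for (y2, z) in parent_facts
            if y1 == y2}

def precision_recall(derived, positives):
    """Precision and recall of the derived head atoms against positive examples."""
    true_pos = len(derived & positives)
    precision = true_pos / len(derived) if derived else 0.0
    recall = true_pos / len(positives) if positives else 0.0
    return precision, recall

# Hypothetical background knowledge and positive examples.
parent = {("ann", "bob"), ("bob", "cat"), ("bob", "dan"), ("cat", "eve")}
positives = {("ann", "cat"), ("ann", "dan"), ("bob", "eve")}

derived = apply_rule(parent)
full_scores = precision_recall(derived, positives)

# An incomplete rule that covers only one positive keeps precision but loses recall.
partial_scores = precision_recall({("ann", "cat")}, positives)
```

A rule that derives exactly the positive examples scores 1 on both measures, which is the sense in which a rule is "complete" above; a rule that covers only some positives keeps its precision but drops in recall.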

For tasks such as Fizz and Buzz, which require rules with four or six variables, γILP's search space proved intractable under the time constraints, resulting in incomplete rule induction.
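
A back-of-the-envelope count suggests why adding variables hurts. The sketch below assumes binary predicates only, one candidate atom per ordered pair of variables, and rule bodies of at most `max_body_atoms` atoms; these counting assumptions are illustrative and are not how the paper defines its search space.

```python
from math import comb

def candidate_atoms(num_predicates, num_variables):
    """Binary atoms p(Vi, Vj) over all ordered variable pairs (assumed language)."""
    return num_predicates * num_variables ** 2

def candidate_bodies(num_predicates, num_variables, max_body_atoms):
    """Number of non-empty atom sets of size up to max_body_atoms."""
    n = candidate_atoms(num_predicates, num_variables)
    return sum(comb(n, k) for k in range(1, max_body_atoms + 1))

# Compare three variables against six, with two predicates and bodies of up to three atoms.
small = candidate_bodies(num_predicates=2, num_variables=3, max_body_atoms=3)
large = candidate_bodies(num_predicates=2, num_variables=6, max_body_atoms=3)
growth = large / small
```

Under these assumptions, doubling the number of variables quadruples the number of candidate atoms, and the number of candidate bodies grows much faster than that.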

Pure Image Data: Kandinsky Patterns and Predicate Invention

γILP demonstrates predicate invention on Kandinsky pattern datasets, where relations among image constants are undefined and must be invented. Classification accuracy surpasses vision-only and propositional rule learners, reaching up to 1.0 for the one-red and one-triangle tasks; performance on the two-pair pattern is slightly lower due to increased combinatorial complexity. By using LLMs as translators, γILP recovers and interprets predicate semantics: "same shape, different color" for two-pair patterns, "color in red" for one-red, and "shape in triangle" for one-triangle.
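
Once an invented predicate has been given a natural-language reading, it can be checked against symbolic descriptions of the objects in an image. The sketch below encodes the three translated predicates quoted above and tests whether some object, or some pair of objects, satisfies them. The dictionary representation of objects, the existential reading of each rule, and all function names are assumptions; the last predicate checks only "same shape, different color" and not the full two-pair pattern.

```python
from itertools import combinations

def color_in_red(obj):
    return obj["color"] == "red"

def shape_in_triangle(obj):
    return obj["shape"] == "triangle"

def same_shape_different_color(a, b):
    return a["shape"] == b["shape"] and a["color"] != b["color"]

def holds_unary(pred, image):
    """True if at least one object in the image satisfies the predicate."""
    return any(pred(obj) for obj in image)

def holds_binary(pred, image):
    """True if at least one unordered pair of objects satisfies the predicate."""
    return any(pred(a, b) for a, b in combinations(image, 2))

# Hypothetical symbolic descriptions of two images.
image_a = [{"shape": "triangle", "color": "red"},
           {"shape": "circle", "color": "blue"}]
image_b = [{"shape": "square", "color": "yellow"},
           {"shape": "square", "color": "blue"}]

results = {
    "a_one_red": holds_unary(color_in_red, image_a),
    "b_one_red": holds_unary(color_in_red, image_b),
    "a_one_triangle": holds_unary(shape_in_triangle, image_a),
    "b_pair": holds_binary(same_shape_different_color, image_b),
    "a_pair": holds_binary(same_shape_different_color, image_a),
}
```

In the pipeline described above, the inputs would be cluster centroids rather than hand-written attribute dictionaries; the symbolic version is only meant to show what the translated predicates assert.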

Comparative Analysis

γILP outperforms baselines such as RIPPER-ViT and C4.5-ViT, which generate propositional rules that offer limited interpretability for invented predicates. Reasoning-capable LLMs perform well on simpler tasks but struggle with rules involving more complex relational structure, such as two-pair patterns. γILP also demonstrates stable performance across hyperparameter configurations and training seeds.

Practical and Theoretical Implications

The fully differentiable architecture of γILP enables end-to-end training and rule extraction from neural networks, scaling ILP to high-dimensional visual domains. By decoupling rule learning from symbolic labels and supporting unsupervised predicate invention, γILP lays a foundation for explainable neuro-symbolic reasoning in domains where explicit annotation is unavailable or impractical. Using LLMs to translate predicate semantics bridges perceptual and conceptual reasoning and makes the learned rules easier to interpret in practice.

Theoretically, γILP offers a general template for multimodal ILP, attesting to the expressiveness and scalability of differentiable logic programming in settings where symbol grounding and predicate invention are paramount. The approach also exposes current limitations in both rule length induction and the complexity of invented relations, especially as relational arity increases.

Future Directions

Future work should address reasoning over spatial information in images, extend γILP to multimodal inputs (e.g., text-image combinations), and introduce tailored language bias or logical templates to enable induction of longer or more complex rules. Dynamic clustering approaches that accommodate concepts evolving over time in image-rich domains also warrant further research.

Conclusion

γILP represents a significant step in neuro-symbolic rule learning from visual perceptual data, combining differentiable clustering, substitution, and predicate invention. It achieves high precision and recall across diverse datasets, notably extending ILP to domains lacking symbolic labels. The framework’s scalability and semantic interpretability via LLMs make it a solid foundation for future advances in explainable AI and multimodal reasoning (2604.07897).
