
QG-PPR: Personalized PageRank for Logic

Updated 20 January 2026
  • The paper introduces QG-PPR as a scalable framework that leverages personalized PageRank for efficient, query-guided inference in first-order logic.
  • It constructs a localized proof graph using restart edges and probabilistic transitions to bias the search toward short, high-probability proofs.
  • Empirical results show improved mean average precision and AUC over MLNs with significant gains in inference speed and scalability.

Question-Guided Personalized PageRank (QG-PPR) is a framework for efficient probabilistic inference in first-order logic representations, formulated to enable scalable, locally groundable reasoning over large databases. QG-PPR, as implemented in ProPPR, interprets query answering as a personalized PageRank process over a query-induced proof graph, leveraging local search and restart mechanisms to bias inference toward short proofs and high-probability answers. The approach supports efficient, parallelizable inference and learning, with empirical performance advantages over Markov Logic Networks (MLNs) on entity resolution tasks (Wang et al., 2013).

1. Formal Foundations and Semantics

QG-PPR is built atop a definite-clause logic program $LP = \{c_1, \ldots, c_n\}$ and a database $DB$ of unit facts. A query $Q$ is represented as a conjunction of literals $R_1 \land \ldots \land R_k$. The proof state at any step is encoded as $u = (Q_{\text{transformed}}, \text{subgoal list})$, where $Q_{\text{transformed}}$ is the query with the substitutions applied so far, and the subgoal list records the remaining goals to prove.

The initial or start node $v_0$ is $(Q, Q)$, while a solution node has an empty subgoal list and is denoted by the symbol $\Box$. The SLD proof graph $G'$, which is potentially infinite, captures the space of all proofs of $Q$ using $LP$ and $DB$. QG-PPR extends $G'$ by adding restart edges to create $G_{Q,LP}$.

Inference is defined as a random walk with restarts over $G_{Q,LP}$, seeded at $v_0$. Personalized PageRank computes a probability distribution over solution nodes (ground answers $Q\theta$), structurally favoring nodes close to $v_0$ through the restart mechanism.

2. Query-Induced Grounding Graph Construction

Each node $u$ in $G_{Q,LP}$ is a proof state of the form $(Q\theta, (R_1, \ldots, R_k))$. For each $u$ and each clause $c: R' \leftarrow S'_1, \ldots, S'_\ell$ in $LP$, if the leftmost subgoal $R_1$ unifies with $R'$ via most general unifier $\theta$, a proof edge is created:

  • $u \xrightarrow{\varphi_c(\theta)} v$, where $v = (Q\theta, (S'_1, \ldots, S'_\ell, R_2, \ldots, R_k)\theta)$,
  • each edge is annotated with a feature vector $\varphi_c(\theta)$, reflecting user-defined feature literals instantiated under $\theta$.

Additionally, each node $u$ receives a restart edge back to $v_0$ with feature annotation $\varphi_{\text{restart}}(R_1\theta)$, biasing the walk toward short proofs. Database facts (unit clauses) act as degenerate clauses with feature $\varphi = \{\text{db}\}$.

This query-guided construction ensures that only nodes reachable from $v_0$, i.e., those relevant to the query, are included in the grounding, promoting scalability.
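The construction can be illustrated on a toy ground (propositional) program, which sidesteps unification entirely; the clauses, feature names, and helper `ground` below are hypothetical illustrations, not ProPPR's actual implementation:

```python
from collections import deque

# Toy ground program: each clause is (head, body, feature); facts have
# empty bodies and carry the degenerate "db" feature.
CLAUSES = [
    ("grandparent", ("parent1", "parent2"), "rule_gp"),
    ("parent1", (), "db"),
    ("parent2", (), "db"),
]

def ground(query):
    """Build the query-induced proof graph.

    A node is the tuple of remaining subgoals; () is the solution node.
    Every non-solution node also gets a restart edge back to v0."""
    v0 = (query,)
    nodes, edges = {v0}, []
    frontier = deque([v0])
    while frontier:
        u = frontier.popleft()
        if not u:                       # solution node: empty subgoal list
            continue
        head, rest = u[0], u[1:]
        for h, body, feat in CLAUSES:
            if h == head:               # leftmost subgoal matches the head
                v = tuple(body) + rest  # replace it by the clause body
                edges.append((u, v, feat))
                if v not in nodes:
                    nodes.add(v)
                    frontier.append(v)
        edges.append((u, v0, "restart"))
    return nodes, edges

nodes, edges = ground("grandparent")
```

Only states reachable from $v_0$ are ever materialized, which is the sense in which the grounding is "query-induced".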

3. Personalized PageRank on Proof Graphs

Transitions within $G_{Q,LP}$ are governed by a row-stochastic matrix $W$, with transitions parameterized as:

  • $\Pr_w(v \mid u) \propto f(w, \varphi[u \to v])$, typically with $f(w, \varphi) = \exp(w \cdot \varphi)$,
  • transition probabilities over the neighbors $v \in N(u)$ are normalized so that $\sum_{v \in N(u)} \Pr_w(v \mid u) = 1$,
  • the restart edge from $u$ to $v_0$ is assigned probability $\alpha$, with the remaining mass $1 - \alpha$ distributed over proof edges.
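The feature-based transition model can be sketched for a single node's outgoing edges; the feature vectors and weights below are illustrative placeholders, not values from the paper:

```python
import math

def transition_probs(proof_edges, w, alpha=0.2):
    """proof_edges: list of (target, feature_list) pairs for one node u.
    Proof edges share mass 1 - alpha proportionally to exp(w . phi);
    the restart edge keeps probability alpha."""
    scores = [math.exp(sum(w[f] for f in phi)) for _, phi in proof_edges]
    z = sum(scores)
    probs = {v: (1 - alpha) * s / z for (v, _), s in zip(proof_edges, scores)}
    probs["__restart__"] = alpha        # restart edge back to v0
    return probs

# Hypothetical weights over two features.
w = {"db": 1.0, "rule_gp": 0.5}
p = transition_probs([("v1", ["db"]), ("v2", ["rule_gp"])], w, alpha=0.2)
```

Since $\exp(1.0) > \exp(0.5)$, the edge carrying the `db` feature receives the larger share of the non-restart mass.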

The personalized PageRank vector $\pi$ is defined as the stationary distribution:

$$\pi = \alpha\, e_{v_0} + (1 - \alpha)\, W^\top \pi$$

where $e_{v_0}$ is the unit vector concentrated at the start node. In practice, $\pi$ is computed by power iteration until convergence:

$$\pi^{t} = \alpha\, e_{v_0} + (1 - \alpha)\, W^\top \pi^{t-1}$$
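The power iteration can be sketched directly from this update rule; the three-node chain below (start node, intermediate state, and an absorbing solution node with a self-loop) is a toy example, not a real proof graph:

```python
def ppr_power_iteration(W, seed, alpha=0.2, iters=200):
    """Iterate pi <- alpha * e_seed + (1 - alpha) * W^T pi.
    W is a row-stochastic matrix given as nested lists."""
    n = len(W)
    e = [1.0 if i == seed else 0.0 for i in range(n)]
    pi = e[:]
    for _ in range(iters):
        pi = [alpha * e[j]
              + (1 - alpha) * sum(W[i][j] * pi[i] for i in range(n))
              for j in range(n)]
    return pi

# Chain 0 -> 1 -> 2, where the "solution" node 2 self-loops.
W = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]
pi = ppr_power_iteration(W, seed=0)   # pi ≈ [0.2, 0.16, 0.64]
```

The restart term keeps probability mass near the seed, while the self-loop accumulates the mass that reaches the solution node.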

4. Local Inference via PageRank-Nibble-Prove

To achieve localized, query-specific inference, ProPPR employs the Andersen–Chung–Lang “PageRank-Nibble” method. This procedure simultaneously approximates $\pi$ for the seed $v_0$ and enumerates a compact subgraph $\hat{G}$ sufficient for inference within controlled error.

A high-level pseudocode for PageRank-Nibble-Prove is as follows:

define PageRank-Nibble-Prove(Q, α', ε):
    let v0 = (Q, Q)
    initialize residual r[v0] = 1, estimate p[v] = 0 for all v, Ĝ ← ∅
    while ∃u with r[u]/|N(u)| > ε do
        push(u)
    return (p, Ĝ)

define push(u):
    p[u] ← p[u] + α'·r[u]
    δ ← (1 − α')·r[u]
    r[u] ← 0
    for each neighbor v ∈ N(u):
        Ĝ.addEdge(u→v)
        r[v] ← r[v] + Pr(v|u)·δ

Here, $\alpha'$ is a lower bound on the restart probability (typically set to $\alpha$), and $\epsilon$ specifies the error tolerance. The algorithm maintains the invariant that the estimate $p$, plus the personalized PageRank contributed by the residual $r$, equals the exact PPR vector for $v_0$; when the loop terminates, $p$ approximates $\pi$ with per-node error at most $\epsilon|N(u)|$. The constructed subgraph $\hat{G}$ contains only the visited edges, providing a “local grounding” for $Q$.
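A runnable sketch of the push procedure, using uniform transition probabilities in place of the feature-weighted $\Pr_w(v \mid u)$ (an assumption made here for brevity):

```python
def approx_ppr(neighbors, v0, alpha=0.2, eps=1e-4):
    """Push-based approximate PPR. Returns (p, touched), where p
    approximates the PPR vector for seed v0 and touched is the set of
    visited edges (the "local grounding")."""
    p, r = {}, {v0: 1.0}                 # estimate and residual
    touched = set()
    while True:
        u = next((x for x in r if r[x] / len(neighbors[x]) > eps), None)
        if u is None:
            break
        ru = r.pop(u)                    # push(u)
        p[u] = p.get(u, 0.0) + alpha * ru
        delta = (1 - alpha) * ru
        for v in neighbors[u]:
            touched.add((u, v))
            r[v] = r.get(v, 0.0) + delta / len(neighbors[u])
    return p, touched

# Toy chain 0 -> 1 -> 2 with a self-loop at the solution node 2; restarts
# are handled implicitly by the alpha term inside push.
p, touched = approx_ppr({0: [1], 1: [2], 2: [2]}, v0=0)
```

On this graph the estimates land close to the exact PPR values ($\pi \approx [0.2, 0.16, 0.64]$), and the number of touched edges trivially respects the $1/(\alpha' \epsilon)$ bound.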

5. Theoretical Properties

The Andersen–Chung–Lang theorem asserts that if $u_1, u_2, \ldots$ are the nodes successively pushed in PageRank-Nibble-Prove, then:

$$\sum_{i} |N(u_i)| < \frac{1}{\alpha' \epsilon}$$

Hence, the number of edges in $\hat{G}$ is at most $1/(\alpha' \epsilon)$. Both inference time and grounding size are thus $O(1/(\alpha' \epsilon))$, independent of the database size $|DB|$ or the size of the full proof graph. This establishes rigorous scalability guarantees.

6. Weight Learning and Parallelization

Supervised learning is supported using triples $(Q^k, P^k, N^k)$, where $P^k$ and $N^k$ are the sets of correct and incorrect answers for $Q^k$. After running PageRank-Nibble-Prove to obtain $(p^k, \hat{G}^k)$, pairwise learning examples are collected so as to impose $p^k[u_+] \geq p^k[u_-]$ for all pairs $(u_+, u_-) \in P^k \times N^k$.

The pairwise squared-hinge loss is:

$$\ell(v_0, u_+, u_-) = \max(0,\, p[u_-] - p[u_+])^2$$

with total objective

$$L(w) = \sum_k \sum_{u_+ \in P^k} \sum_{u_- \in N^k} \ell(v_0^k, u_+, u_-) + \mu \|w\|^2$$

using $L_2$ regularization with parameter $\mu$. Gradients with respect to $w$ are computed by backpropagating through the power iteration, in the style of Backstrom & Leskovec. Stochastic gradient descent is applied, with learning rate $\beta = \eta/\text{epoch}^2$.
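The pairwise loss and its gradient with respect to the PPR scores can be sketched as follows; the full method additionally backpropagates these gradients through the power iteration to reach $w$, which is omitted here, and the scores below are toy values:

```python
def pairwise_loss_and_grad(p, pos, neg):
    """Squared-hinge loss over violated (u+, u-) pairs.
    Returns (loss, grad) with grad[u] = d loss / d p[u]."""
    loss, grad = 0.0, {u: 0.0 for u in p}
    for up in pos:
        for un in neg:
            margin = p[un] - p[up]
            if margin > 0:               # pair is violated: p[u-] > p[u+]
                loss += margin ** 2
                grad[un] += 2 * margin   # raising p[u-] raises the loss
                grad[up] -= 2 * margin   # raising p[u+] lowers it
    return loss, grad

# Toy PPR scores: the correct answer "b" is (wrongly) ranked below "c".
p = {"a": 0.5, "b": 0.3, "c": 0.4}
loss, grad = pairwise_loss_and_grad(p, pos={"b"}, neg={"c"})
```

An SGD step then moves $w$ opposite these gradients, as propagated through the PPR computation.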

Parallelization is realized by running independent threads over separate queries QkQ^k, grounding and updating asynchronously ("Hogwild!" style). Since each local grounding is small, the per-thread computational cost is low, and wall-clock speedup is nearly linear in thread count.
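The asynchronous structure can be sketched with a thread pool and an unsynchronized shared weight vector; the per-query gradient here is a dummy stand-in for the real grounding-plus-backprop step, and in CPython the GIL means this illustrates the pattern rather than real speedup:

```python
from concurrent.futures import ThreadPoolExecutor

w = [0.0] * 4                            # shared weights, no lock

def sgd_for_query(k, lr=0.1):
    """Stand-in for one query's work: ground Q_k, compute its gradient,
    and apply an unsynchronized ("Hogwild!") update to the shared w."""
    w[k % 4] -= lr                       # dummy gradient step

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(sgd_for_query, range(16)))
```

Because each query's grounding is small and the per-query work dominates, occasional lost updates from the lock-free writes are tolerated, as in the original Hogwild! analysis.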

7. Empirical Results and Comparison

On the CORA citation entity-resolution task (1,295 citations, 132 ground-truth papers), queries assess the predicate $\text{samebib}(BC1, BC2)$. The applied ProPPR program employs approximately 14 clauses over four predicates (author, title, venue, and transitive closures) with feature annotations.

Performance metrics include:

  • Mean average precision (MAP): untrained ProPPR achieves $\approx 0.54$ compared to MLN's $\approx 0.53$, with ProPPR demonstrating roughly 8× faster inference.
  • After learning, AUCs for matching cite/author/venue/title attributes improve from $\{\approx 0.68, 0.84, 0.86, 0.91\}$ (untrained) to $\{\approx 0.80, 0.84, 0.87, 0.90\}$, outperforming MLN's range of $\{0.52\text{–}0.63\}$.
  • Inference time for ProPPR remains essentially constant as $|DB|$ increases, while MLN inference time increases substantially.
  • Learning scales nearly linearly with the number of threads; speedups of $\approx 14$–$15\times$ are observed with 16 cores.

All aspects of QG-PPR (graph construction, inference, and learning) are query-guided, ensuring that the computation remains focused on the portions of the logic program and database relevant to $Q$, with a strong theoretical guarantee that the resulting computational cost is independent of database size (Wang et al., 2013).
