
PedX-LLM: Pedestrian Crossing Analysis

Updated 9 January 2026
  • PedX-LLM is a multimodal large language model framework that fuses visual data, domain knowledge, and textual reasoning to forecast driver yielding and pedestrian crossing events.
  • It employs prompt-driven inference and few-shot learning, using chain-of-thought guidance to generate context-aware and generalizable predictions.
  • The integration of LoRA-based fine-tuning with modular adapters enhances accuracy and computational efficiency, achieving over 90% balanced accuracy in driver yielding scenarios.

Pedestrian Crossing LLM (PedX-LLM) refers to a class of multimodal LLM frameworks specifically engineered for interpreting and predicting human and driver behavior in the context of pedestrian crossings within urban environments. Through the integration of vision, structured domain knowledge, and textual reasoning, PedX-LLM addresses the limitations of traditional, site-specific classifiers by enabling context-aware and generalizable inference for tasks such as driver yielding prediction and pedestrian crossing behavior inference (Yang et al., 24 Sep 2025, Pu et al., 2 Jan 2026).

1. Multimodal and Knowledge-Enhanced LLM Foundations

PedX-LLM synthesizes multimodal perception and domain knowledge injection with advanced natural language reasoning. The framework typically fuses:

  • Visual features: Extracted from urban imagery (satellite, intersection photo, or traffic video) using pretrained vision encoders such as CLIP ViT-L/14 or LLaVA.
  • Tabular/textual attributes: Encoded descriptions of event-specific and environmental variables (e.g., vehicle speed, road geometry, pedestrian demographics, land use).
  • Domain knowledge: Structured prompts or natural language priors capturing causal relationships known from transportation studies (e.g., "rain increases mid-block crossing utility"; "seniors prefer intersections") (Pu et al., 2 Jan 2026).

The architecture orchestrates prompt-based LLM inference, where inputs may include numeric features, visual embeddings (via adapter layers or natural language conversion), and hierarchical bullet-pointed knowledge blocks to guide step-wise or chain-of-thought reasoning.

2. Model Architectures and Inference Pipelines

2.1 Prompt-Driven Methodology

In PedX-LLM frameworks, all modalities are unified into a single prompt string delivered to the LLM backbone (e.g., GPT-4o, Deepseek-V3, or LLaMA-2-7B):

  1. Task definition: An explicit instruction such as “Predict whether the driver yields (Yes/No) given the following context.”
  2. Domain knowledge: Inserted as natural language statements (object-level effects, empirical odds ratios) or system prompt context.
  3. Multimodal placeholders: Numeric and categorical context, textual event logs, and image- or vision-derived descriptors.
  4. Reasoning scaffold: "Thinking guidance" or prescribed multi-step evaluation (e.g., vehicle dynamics → infrastructure → pedestrian interaction → decision).
  5. Few-shot exemplars: Instances with full context, reasoning trace, and correct label, to encourage chain-of-thought output (Yang et al., 24 Sep 2025).
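
The prompt assembly described in steps 1–5 can be sketched as follows; all field names and example values are illustrative placeholders rather than the papers' actual templates:

```python
# Assemble the five prompt components (task, knowledge, context,
# reasoning scaffold, few-shot exemplars) into a single prompt string.
# All field names and example values are illustrative placeholders.

def build_yielding_prompt(context: dict, knowledge: list, exemplars: list) -> str:
    parts = [
        # 1. Task definition
        "Predict whether the driver yields (Yes/No) given the following context.",
        # 2. Domain knowledge as natural-language priors
        "Known behavioral findings:\n" + "\n".join(f"- {k}" for k in knowledge),
        # 3. Multimodal placeholders: numeric/categorical context
        "Scene context:\n" + "\n".join(f"- {k}: {v}" for k, v in context.items()),
        # 4. Reasoning scaffold (chain-of-thought guidance)
        "Reason step by step: vehicle dynamics -> infrastructure -> "
        "pedestrian interaction -> decision.",
    ]
    # 5. Few-shot exemplars with full context, reasoning trace, and label
    for ex in exemplars:
        parts.append(
            f"Example:\n{ex['context']}\n"
            f"Reasoning: {ex['reasoning']}\nAnswer: {ex['label']}"
        )
    return "\n\n".join(parts)

prompt = build_yielding_prompt(
    context={"vehicle_speed_mph": 28, "crosswalk_marked": "yes"},
    knowledge=["Higher approach speed reduces yielding probability."],
    exemplars=[{"context": "speed 15 mph, marked crosswalk",
                "reasoning": "Low speed and a marked crossing favor yielding.",
                "label": "Yes"}],
)
```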

2.2 Fine-Tuning via Parameter-Efficient Adaptation

For site-generalizable pedestrian behavior inference, PedX-LLM incorporates Low-Rank Adaptation (LoRA) modules into transformer backbones (e.g., LLaMA-2-7B), enabling domain adaptation without full weight updates. The LoRA formulation:

\Delta W = AB, \qquad h = W_0 x + \alpha \cdot (ABx)

where $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times d}$, $r \ll d$, and only $A$ and $B$ are trained, preserving the frozen pretrained $W_0$ (Pu et al., 2 Jan 2026).
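
A minimal NumPy sketch of this forward pass (dimensions chosen for illustration, not taken from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 0.5              # hidden dim, LoRA rank (r << d), scaling

W0 = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01    # trainable low-rank factor A
B = np.zeros((r, d))                  # trainable factor B, zero-initialized

x = rng.normal(size=d)

# h = W0 x + alpha * (A B x): only A and B receive gradient updates
h = W0 @ x + alpha * (A @ (B @ x))

# With B initialized to zero, the adapted model starts identical to the base
assert np.allclose(h, W0 @ x)
```

Because $\Delta W = AB$ adds only $2dr$ trainable parameters per layer instead of $d^2$, adapters for new sites can be trained and swapped cheaply.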

2.3 Multimodal Integration

Visual features (e.g., 1024-D CLIP or LLaVA embedding) are projected (e.g., to 4096-D) and incorporated via:

  • Adapter modules (linear/MLP layers) for direct embedding into token space (driver yield prediction) (Yang et al., 24 Sep 2025)
  • Generation of natural language descriptions included at the sequence start (site-generalizable crossing inference) (Pu et al., 2 Jan 2026)

Tabular and text descriptors are normalized, then concatenated with knowledge and visual streams for LLM input.
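
The adapter route can be sketched as a single linear projection from a 1024-D vision embedding to the 4096-D token space; the weights here are random placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(42)

vision_dim, token_dim = 1024, 4096    # CLIP/LLaVA feature dim -> LLM hidden dim

# Linear adapter: in practice W_proj and b_proj are trained with the backbone
W_proj = rng.normal(scale=0.02, size=(token_dim, vision_dim))
b_proj = np.zeros(token_dim)

clip_embedding = rng.normal(size=vision_dim)   # e.g., a CLIP ViT-L/14 feature

# Project into token space so the vector can be prepended as a "visual token"
visual_token = W_proj @ clip_embedding + b_proj
assert visual_token.shape == (token_dim,)
```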

3. Mathematical Objectives and Training Strategies

3.1 Prediction and Loss Functions

For binary yielding or crossing tasks, the LLM outputs a distribution over decision tokens:

  • Yielding: $P(y = 1 \mid x) = \sigma(f_\theta(x))$
  • Crossing location: LLM likelihood over <INTERSECTION>/<MIDBLOCK> tokens

Losses are typically cross-entropy objectives over the decision tokens: binary cross-entropy on the yielding probability, and token-level negative log-likelihood over the location tokens for crossing-location inference.
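
Assuming a standard binary cross-entropy objective on the yielding head (a common choice; the papers' exact losses may differ), the computation is:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(logit: float, label: int, eps: float = 1e-12) -> float:
    """BCE on P(y=1|x) = sigmoid(f_theta(x))."""
    p = sigmoid(logit)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

# A confident correct prediction incurs low loss...
low = binary_cross_entropy(logit=4.0, label=1)
# ...while a confident wrong prediction is penalized heavily.
high = binary_cross_entropy(logit=4.0, label=0)
assert low < high
```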

3.2 Performance Metrics

Across benchmarks, PedX-LLM evaluations use:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}

F_1 = 2 \cdot \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

\mathrm{Balanced\ Accuracy} = \frac{TPR + TNR}{2}

where TPR and TNR are the true positive and true negative rates.
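
These metrics follow directly from confusion-matrix counts:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the evaluation metrics above from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # true positive rate (TPR)
    tnr = tn / (tn + fp)                 # true negative rate (TNR)
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + tnr) / 2
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "balanced_accuracy": balanced_accuracy}

# Example with illustrative counts (not taken from the benchmarks)
m = classification_metrics(tp=80, tn=60, fp=10, fn=20)
```

Balanced accuracy is the headline metric here because yielding events are class-imbalanced: plain accuracy can look high while one class is mostly misclassified.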

4. Empirical Results and Comparative Analysis

4.1 Driver Yielding Prediction (Yang et al., 24 Sep 2025)

Benchmarks on 18 sites using video, tabular, and image data reveal that:

  • GPT-4o achieves the top balanced accuracy (91.06%), recall (92.41%), and F₁-score (90.11%)
  • Deepseek-V3 yields the highest precision (92.06%) but with lower recall (73.42%)
  • Logistic Regression, though interpretable, underperforms modern LLMs in precision and overall balance
| Model | Accuracy (%) | Recall (%) | Precision (%) | F₁ (%) | Cost ($) | Latency (s) |
|---|---|---|---|---|---|---|
| Logistic Reg. | 88.08 | 90.72 | 83.66 | 87.06 | – | 0.01 |
| GPT-4o-mini | 70.20 | 37.97 | 87.38 | 52.58 | 1.30 | 6 |
| GPT-4o | 91.06 | 92.41 | 87.95 | 90.11 | 5.31 | 7 |
| Deepseek-V3 | 85.47 | 73.42 | 92.06 | 81.10 | 0.21 | 9 |
| Deepseek-R1 | 85.29 | 90.72 | 79.04 | 84.47 | 2.59 | 210 |

Interpretation: High-recall models are preferable for safety-critical warnings, while high precision is favored to reduce false alarms and alert fatigue (Yang et al., 24 Sep 2025).

4.2 Generalizable Pedestrian Crossing Inference (Pu et al., 2 Jan 2026)

Random-split results and cross-site tests demonstrate:

  • Vision augmentation (LLaVA) provides a 2.9% gain over text-only PedX-LLM, and domain knowledge injection yields a further 4.1% increase, achieving 82.0% ± 1.4 balanced accuracy
  • In zero-shot generalization to five unseen urban sites, PedX-LLM attains 66.9% balanced accuracy, outperforming the best data-driven baseline (CatBoost: 48.3%) by at least 18 percentage points; few-shot augmentation boosts this to 72.2%
  • All improvements are statistically significant (p < 0.01)
| Model | Random-Split BA (%) | Cross-Site BA (%) |
|---|---|---|
| Hierarchical LogReg | 74.1 ± 2.2 | – |
| CatBoost | 79.0 ± 2.0 | 48.3 ± 3.8 |
| PedX-LLM text-only | 75.0 ± 1.8 | – |
| PedX-LLM vision-only | 77.9 ± 1.7 | – |
| PedX-LLM vision+knowledge | 82.0 ± 1.4 | 66.9 ± 5.0 (zero-shot) |
| PedX-LLM few-shot | – | 72.2 ± 3.8 |

Interpretation: The integration of vision and knowledge enables enhanced generalization to new geographies and scenarios, overcoming overfitting to local data patterns.

5. Design Principles and Prompt Engineering Strategies

PedX-LLM’s efficacy is grounded in carefully constructed prompts and modular data integration:

  • Chain-of-thought guidance: Explicit multi-step reasoning requirements ensure interpretable vehicle and pedestrian decision logic, mirroring expert analysis protocols
  • Natural language knowledge: Empirical behavioral findings and causal relationships are converted into text, leveraging LLMs’ capacity to apply general principles beyond direct training data
  • Few-shot scaffolding: Strategic inclusion of labeled exemplars facilitates both in-context learning and zero-shot transfer

Prompt templates are structured into:

  1. Contextual field data (tabular or textual descriptions)
  2. Vision-derived descriptors (urban environment salient features)
  3. Knowledge priors (age, environment, infrastructure, etc.)
  4. Explicit prediction instruction (“Based on the above, predict crossing location: <INTERSECTION> or <MIDBLOCK>”)
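
A filled template for the crossing-location task might look like the following; all concrete field values are illustrative, not drawn from the papers:

```python
# Template mirroring the four prompt sections listed above;
# every concrete value below is an illustrative placeholder.
TEMPLATE = """\
Context:
{context}

Scene description:
{scene}

Known priors:
{priors}

Based on the above, predict crossing location: <INTERSECTION> or <MIDBLOCK>."""

prompt = TEMPLATE.format(
    context="pedestrian age group: senior; weather: rain; signal present: yes",
    scene="four-lane arterial with a signalized intersection 40 m away",
    priors=("- seniors prefer intersections\n"
            "- rain increases mid-block crossing utility"),
)
```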

6. Computational Considerations and Deployment Guidelines

PedX-LLM enables practical deployment for field applications:

  • Resource efficiency: LoRA adaptation and quantization (e.g., 4-bit NormalFloat) permit fine-tuning and inference with moderate GPU/CPU memory (7–16 GB)
  • Latency and cost trade-offs: Lightweight models (e.g., GPT-4o-mini) achieve sub-10 second latency at reduced cost, suitable for edge or real-time systems; deeper models (Deepseek-R1) achieve higher recall but at much higher latency and computational expense (Yang et al., 24 Sep 2025)
  • Interpretability: Structured rationales and knowledge-centric prompts support engineer review and downstream policy or infrastructure interventions
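
The memory argument behind 4-bit quantization can be illustrated with a simplified absmax scheme; actual NF4 uses a normal-distribution codebook rather than the uniform grid shown here:

```python
import numpy as np

def quantize_absmax_4bit(w: np.ndarray):
    """Uniform symmetric 4-bit quantization (simplified; NF4 differs)."""
    scale = np.abs(w).max() / 7.0                       # map +/-max to +/-7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(size=4096).astype(np.float32)            # one weight block

q, scale = quantize_absmax_4bit(w)
w_hat = dequantize(q, scale)

# Reconstruction error stays bounded by half a quantization step,
# while storage drops from 32 to 4 bits per weight (plus one scale per block)
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```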

7. Future Directions and Limitations

  • Vision modality limitations: Current implementations use static imagery (satellite, pre-captured photos) without dynamic traffic state—real-time video-based vision integration remains an open area (Pu et al., 2 Jan 2026)
  • Geographic generalizability: Existing benchmarks are concentrated in specific regions (e.g., Hampton Roads, Virginia), necessitating broader cross-city and international evaluation
  • Knowledge representations: Inference relies on unstructured natural language knowledge; incorporation of ontologies or knowledge graphs could offer further gains in reasoning fidelity and explainability
  • Continual adaptation: Lifelong/continual learning frameworks may be needed to account for evolving traffic patterns and urban infrastructure

PedX-LLM demonstrates a paradigm shift from pattern fitting toward semantically grounded, context-dependent behavioral reasoning in pedestrian safety and mobility applications. Through vision-and-knowledge-enhanced LLMs, it sets a new benchmark for both predictive accuracy and generalizability in transportation research (Yang et al., 24 Sep 2025, Pu et al., 2 Jan 2026).
