
PedX-LLM: Pedestrian Crossing Analysis

Updated 9 January 2026
  • PedX-LLM is a multimodal large language model framework that fuses visual data, domain knowledge, and textual reasoning to forecast driver yielding and pedestrian crossing events.
  • It employs prompt-driven inference and few-shot learning, using chain-of-thought guidance to generate context-aware and generalizable predictions.
  • The integration of LoRA-based fine-tuning with modular adapters enhances accuracy and computational efficiency, achieving over 90% balanced accuracy in driver yielding scenarios.

Pedestrian Crossing LLM (PedX-LLM) refers to a class of multimodal LLM frameworks specifically engineered for interpreting and predicting human and driver behavior in the context of pedestrian crossings within urban environments. Through the integration of vision, structured domain knowledge, and textual reasoning, PedX-LLM addresses the limitations of traditional, site-specific classifiers by enabling context-aware and generalizable inference for tasks such as driver yielding prediction and pedestrian crossing behavior inference (Yang et al., 24 Sep 2025, Pu et al., 2 Jan 2026).

1. Multimodal and Knowledge-Enhanced LLM Foundations

PedX-LLM synthesizes multimodal perception and domain knowledge injection with advanced natural language reasoning. The framework typically fuses:

  • Visual features: Extracted from urban imagery (satellite, intersection photo, or traffic video) using pretrained vision encoders such as CLIP ViT-L/14 or LLaVA.
  • Tabular/textual attributes: Encoded descriptions of event-specific and environmental variables (e.g., vehicle speed, road geometry, pedestrian demographics, land use).
  • Domain knowledge: Structured prompts or natural language priors capturing causal relationships known from transportation studies (e.g., "rain increases mid-block crossing utility"; "seniors prefer intersections") (Pu et al., 2 Jan 2026).

The architecture orchestrates prompt-based LLM inference, where inputs may include numeric features, visual embeddings (via adapter layers or natural language conversion), and hierarchical bullet-pointed knowledge blocks to guide step-wise or chain-of-thought reasoning.

2. Model Architectures and Inference Pipelines

2.1 Prompt-Driven Methodology

In PedX-LLM frameworks, all modalities are unified into a single prompt string delivered to the LLM backbone (e.g., GPT-4o, Deepseek-V3, or LLaMA-2-7B):

  1. Task definition: An explicit instruction such as “Predict whether the driver yields (Yes/No) given the following context.”
  2. Domain knowledge: Inserted as natural language statements (object-level effects, empirical odds ratios) or system prompt context.
  3. Multimodal placeholders: Numeric and categorical context, textual event logs, and image- or vision-derived descriptors.
  4. Reasoning scaffold: "Thinking guidance" or prescribed multi-step evaluation (e.g., vehicle dynamics → infrastructure → pedestrian interaction → decision).
  5. Few-shot exemplars: Instances with full context, reasoning trace, and correct label, to encourage chain-of-thought output (Yang et al., 24 Sep 2025).
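
The prompt assembly described in steps 1–5 can be sketched as follows; all field names and example values are illustrative placeholders rather than the papers' actual templates:

```python
# Assemble the five prompt components (task, knowledge, context,
# reasoning scaffold, few-shot exemplars) into a single prompt string.
# All field names and example values are illustrative placeholders.

def build_yielding_prompt(context: dict, knowledge: list, exemplars: list) -> str:
    parts = [
        # 1. Task definition
        "Predict whether the driver yields (Yes/No) given the following context.",
        # 2. Domain knowledge as natural-language priors
        "Known behavioral findings:\n" + "\n".join(f"- {k}" for k in knowledge),
        # 3. Multimodal placeholders: numeric/categorical context
        "Scene context:\n" + "\n".join(f"- {k}: {v}" for k, v in context.items()),
        # 4. Reasoning scaffold (chain-of-thought guidance)
        "Reason step by step: vehicle dynamics -> infrastructure -> "
        "pedestrian interaction -> decision.",
    ]
    # 5. Few-shot exemplars with full context, reasoning trace, and label
    for ex in exemplars:
        parts.append(
            f"Example:\n{ex['context']}\n"
            f"Reasoning: {ex['reasoning']}\nAnswer: {ex['label']}"
        )
    return "\n\n".join(parts)

prompt = build_yielding_prompt(
    context={"vehicle_speed_mph": 28, "crosswalk_marked": "yes"},
    knowledge=["Higher approach speed reduces yielding probability."],
    exemplars=[{"context": "speed 15 mph, marked crosswalk",
                "reasoning": "Low speed and a marked crossing favor yielding.",
                "label": "Yes"}],
)
```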

2.2 Fine-Tuning via Parameter-Efficient Adaptation

For site-generalizable pedestrian behavior inference, PedX-LLM incorporates Low-Rank Adaptation (LoRA) modules into transformer backbones (e.g., LLaMA-2-7B), enabling domain adaptation without full weight updates. The LoRA formulation:

\Delta W = AB, \qquad h = W_0 x + \alpha \cdot (ABx)

where $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times d}$, $r \ll d$, and only $A$ and $B$ are trained, preserving the frozen pretrained $W_0$ (Pu et al., 2 Jan 2026).
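
A minimal NumPy sketch of this forward pass (dimensions chosen for illustration, not taken from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 0.5              # hidden dim, LoRA rank (r << d), scaling

W0 = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01    # trainable low-rank factor A
B = np.zeros((r, d))                  # trainable factor B, zero-initialized

x = rng.normal(size=d)

# h = W0 x + alpha * (A B x): only A and B receive gradient updates
h = W0 @ x + alpha * (A @ (B @ x))

# With B initialized to zero, the adapted model starts identical to the base
assert np.allclose(h, W0 @ x)
```

Because $\Delta W = AB$ adds only $2dr$ trainable parameters per layer instead of $d^2$, adapters for new sites can be trained and swapped cheaply.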

2.3 Multimodal Integration

Visual features (e.g., 1024-D CLIP or LLaVA embedding) are projected (e.g., to 4096-D) and incorporated via:

  • Adapter modules (linear/MLP layers) for direct embedding into token space (driver yield prediction) (Yang et al., 24 Sep 2025)
  • Generation of natural language descriptions included at the sequence start (site-generalizable crossing inference) (Pu et al., 2 Jan 2026)

Tabular and text descriptors are normalized, then concatenated with knowledge and visual streams for LLM input.
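
The adapter route can be sketched as a single linear projection from a 1024-D vision embedding to the 4096-D token space; the weights here are random placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(42)

vision_dim, token_dim = 1024, 4096    # CLIP/LLaVA feature dim -> LLM hidden dim

# Linear adapter: in practice W_proj and b_proj are trained with the backbone
W_proj = rng.normal(scale=0.02, size=(token_dim, vision_dim))
b_proj = np.zeros(token_dim)

clip_embedding = rng.normal(size=vision_dim)   # e.g., a CLIP ViT-L/14 feature

# Project into token space so the vector can be prepended as a "visual token"
visual_token = W_proj @ clip_embedding + b_proj
assert visual_token.shape == (token_dim,)
```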

3. Mathematical Objectives and Training Strategies

3.1 Prediction and Loss Functions

For binary yielding or crossing tasks, the LLM outputs a distribution over decision tokens:

  • Yielding: $P(y = 1 \mid x) = \sigma(f_\theta(x))$
  • Crossing location: LLM likelihood over <INTERSECTION>/<MIDBLOCK> tokens

Losses are typically cross-entropy objectives over the decision tokens: binary cross-entropy on the yielding probability, and token-level negative log-likelihood over the location tokens for crossing-location inference.
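
Assuming a standard binary cross-entropy objective on the yielding head (a common choice; the papers' exact losses may differ), the computation is:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(logit: float, label: int, eps: float = 1e-12) -> float:
    """BCE on P(y=1|x) = sigmoid(f_theta(x))."""
    p = sigmoid(logit)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

# A confident correct prediction incurs low loss...
low = binary_cross_entropy(logit=4.0, label=1)
# ...while a confident wrong prediction is penalized heavily.
high = binary_cross_entropy(logit=4.0, label=0)
assert low < high
```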

3.2 Performance Metrics

Across benchmarks, PedX-LLM evaluations use:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}

F_1 = 2 \cdot \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

\mathrm{Balanced\ Accuracy} = \frac{TPR + TNR}{2}

where TPR and TNR are the true positive and true negative rates.
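
These metrics follow directly from confusion-matrix counts:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the evaluation metrics above from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # true positive rate (TPR)
    tnr = tn / (tn + fp)                 # true negative rate (TNR)
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + tnr) / 2
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "balanced_accuracy": balanced_accuracy}

# Example with illustrative counts (not taken from the benchmarks)
m = classification_metrics(tp=80, tn=60, fp=10, fn=20)
```

Balanced accuracy is the headline metric here because yielding events are class-imbalanced: plain accuracy can look high while one class is mostly misclassified.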

4. Empirical Results and Comparative Analysis

4.1 Driver Yielding Prediction (Yang et al., 24 Sep 2025)

Benchmarks on 18 sites using video, tabular, and image data reveal that:

  • GPT-4o achieves the top balanced accuracy (91.06%), recall (92.41%), and F₁-score (90.11%)
  • Deepseek-V3 yields the highest precision (92.06%) but with lower recall (73.42%)
  • Logistic Regression, though interpretable, underperforms modern LLMs in precision and overall balance
| Model | Accuracy (%) | Recall (%) | Precision (%) | F₁ (%) | Cost ($) | Latency (s) |
|---|---|---|---|---|---|---|
| Logistic Reg. | 88.08 | 90.72 | 83.66 | 87.06 | – | 0.01 |
| GPT-4o-mini | 70.20 | 37.97 | 87.38 | 52.58 | 1.30 | 6 |
| GPT-4o | 91.06 | 92.41 | 87.95 | 90.11 | 5.31 | 7 |
| Deepseek-V3 | 85.47 | 73.42 | 92.06 | 81.10 | 0.21 | 9 |
| Deepseek-R1 | 85.29 | 90.72 | 79.04 | 84.47 | 2.59 | 210 |

Interpretation: High-recall models are preferable for safety-critical warnings, while high precision is favored to reduce false alarms and alert fatigue (Yang et al., 24 Sep 2025).

4.2 Generalizable Pedestrian Crossing Inference (Pu et al., 2 Jan 2026)

Random-split results and cross-site tests demonstrate:

  • Vision augmentation (LLaVA) provides a 2.9% gain over text-only PedX-LLM, and domain knowledge injection yields a further 4.1% increase, achieving 82.0% ± 1.4 balanced accuracy
  • In zero-shot generalization to five unseen urban sites, PedX-LLM attains 66.9% balanced accuracy, outperforming the best data-driven baseline (CatBoost: 48.3%) by at least 18 percentage points; few-shot augmentation boosts this to 72.2%
  • All improvements are statistically significant (p < 0.01)
| Model | Random-Split BA (%) | Cross-Site BA (%) |
|---|---|---|
| Hierarchical LogReg | 74.1 ± 2.2 | – |
| CatBoost | 79.0 ± 2.0 | 48.3 ± 3.8 |
| PedX-LLM text-only | 75.0 ± 1.8 | – |
| PedX-LLM vision-only | 77.9 ± 1.7 | – |
| PedX-LLM vision+knowledge | 82.0 ± 1.4 | 66.9 ± 5.0 (zero-shot) |
| PedX-LLM few-shot | – | 72.2 ± 3.8 |

Interpretation: The integration of vision and knowledge enables enhanced generalization to new geographies and scenarios, overcoming overfitting to local data patterns.

5. Design Principles and Prompt Engineering Strategies

PedX-LLM’s efficacy is grounded in carefully constructed prompts and modular data integration:

  • Chain-of-thought guidance: Explicit multi-step reasoning requirements ensure interpretable vehicle and pedestrian decision logic, mirroring expert analysis protocols
  • Natural language knowledge: Empirical behavioral findings and causal relationships are converted into text, leveraging LLMs’ capacity to apply general principles beyond direct training data
  • Few-shot scaffolding: Strategic inclusion of labeled exemplars facilitates both in-context learning and zero-shot transfer

Prompt templates are structured into:

  1. Contextual field data (tabular or textual descriptions)
  2. Vision-derived descriptors (urban environment salient features)
  3. Knowledge priors (age, environment, infrastructure, etc.)
  4. Explicit prediction instruction (“Based on the above, predict crossing location: <INTERSECTION> or <MIDBLOCK>”)
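
A filled template for the crossing-location task might look like the following; all concrete field values are illustrative, not drawn from the papers:

```python
# Template mirroring the four prompt sections listed above;
# every concrete value below is an illustrative placeholder.
TEMPLATE = """\
Context:
{context}

Scene description:
{scene}

Known priors:
{priors}

Based on the above, predict crossing location: <INTERSECTION> or <MIDBLOCK>."""

prompt = TEMPLATE.format(
    context="pedestrian age group: senior; weather: rain; signal present: yes",
    scene="four-lane arterial with a signalized intersection 40 m away",
    priors=("- seniors prefer intersections\n"
            "- rain increases mid-block crossing utility"),
)
```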

6. Computational Considerations and Deployment Guidelines

PedX-LLM enables practical deployment for field applications:

  • Resource efficiency: LoRA adaptation and quantization (e.g., 4-bit NormalFloat) permit fine-tuning and inference with moderate GPU/CPU memory (7–16 GB)
  • Latency and cost trade-offs: Lightweight models (e.g., GPT-4o-mini) achieve sub-10 second latency at reduced cost, suitable for edge or real-time systems; deeper models (Deepseek-R1) achieve higher recall but at much higher latency and computational expense (Yang et al., 24 Sep 2025)
  • Interpretability: Structured rationales and knowledge-centric prompts support engineer review and downstream policy or infrastructure interventions
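
The memory argument behind 4-bit quantization can be illustrated with a simplified absmax scheme; actual NF4 uses a normal-distribution codebook rather than the uniform grid shown here:

```python
import numpy as np

def quantize_absmax_4bit(w: np.ndarray):
    """Uniform symmetric 4-bit quantization (simplified; NF4 differs)."""
    scale = np.abs(w).max() / 7.0                       # map +/-max to +/-7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(size=4096).astype(np.float32)            # one weight block

q, scale = quantize_absmax_4bit(w)
w_hat = dequantize(q, scale)

# Reconstruction error stays bounded by half a quantization step,
# while storage drops from 32 to 4 bits per weight (plus one scale per block)
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```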

7. Future Directions and Limitations

  • Vision modality limitations: Current implementations use static imagery (satellite, pre-captured photos) without dynamic traffic state—real-time video-based vision integration remains an open area (Pu et al., 2 Jan 2026)
  • Geographic generalizability: Existing benchmarks are concentrated in specific regions (e.g., Hampton Roads, Virginia), necessitating broader cross-city and international evaluation
  • Knowledge representations: Inference relies on unstructured natural language knowledge; incorporation of ontologies or knowledge graphs could offer further gains in reasoning fidelity and explainability
  • Continual adaptation: Lifelong/continual learning frameworks may be needed to account for evolving traffic patterns and urban infrastructure

PedX-LLM demonstrates a paradigm shift from pattern fitting toward semantically grounded, context-dependent behavioral reasoning in pedestrian safety and mobility applications. Through vision-and-knowledge-enhanced LLMs, it sets a new benchmark for both predictive accuracy and generalizability in transportation research (Yang et al., 24 Sep 2025, Pu et al., 2 Jan 2026).
