Beyond Proximity: A Keypoint-Trajectory Framework for Classifying Affiliative and Agonistic Social Networks in Dairy Cattle

Published 17 Dec 2025 in cs.CV and cs.AI | (2512.14998v1)

Abstract: Precision livestock farming requires objective assessment of social behavior to support herd welfare monitoring, yet most existing approaches infer interactions using static proximity thresholds that cannot distinguish affiliative from agonistic behaviors in complex barn environments. This limitation constrains the interpretability of automated social network analysis in commercial settings. We present a pose-based computational framework for interaction classification that moves beyond proximity heuristics by modeling the spatiotemporal geometry of anatomical keypoints. Rather than relying on pixel-level appearance or simple distance measures, the proposed method encodes interaction-specific motion signatures from keypoint trajectories, enabling differentiation of social interaction valence. The framework is implemented as an end-to-end computer vision pipeline integrating YOLOv11 for object detection (mAP@0.5: 96.24%), supervised individual identification (98.24% accuracy), ByteTrack for multi-object tracking (81.96% accuracy), ZebraPose for 27-point anatomical keypoint estimation, and a support vector machine classifier trained on pose-derived distance dynamics. On annotated interaction clips collected from a commercial dairy barn, the classifier achieved 77.51% accuracy in distinguishing affiliative and agonistic behaviors using pose information alone. Comparative evaluation against a proximity-only baseline shows substantial gains in behavioral discrimination, particularly for affiliative interactions. The results establish a proof-of-concept for automated, vision-based inference of social interactions suitable for constructing interaction-aware social networks, with near-real-time performance on commodity hardware.

Summary

  • The paper introduces a novel vision-based framework using keypoint trajectories to differentiate social interactions in dairy cattle.
  • It employs YOLOv11 detection, ByteTrack tracking, and ZebraPose for pose estimation, achieving 77.51% overall classification accuracy.
  • The system enables noninvasive, real-time behavior monitoring for improved welfare assessment and herd management in precision livestock farming.

Keypoint-Trajectory-Based Social Interaction Inference in Dairy Cattle: A Detailed Analysis

Introduction and Context

The paper "Beyond Proximity: A Keypoint-Trajectory Framework for Classifying Affiliative and Agonistic Social Networks in Dairy Cattle" (2512.14998) addresses a critical deficiency in automated livestock behavior analysis—differentiating the valence of social interactions (affiliative vs. agonistic) objectively, noninvasively, and at scale. Traditional approaches predominantly employ static proximity thresholds using contact sensors or location-based heuristics, inherently conflating mere spatial co-occurrence with substantive social engagement. This work proposes and validates a vision-based framework, leveraging spatiotemporal modeling of anatomical keypoints, to discern interaction categories in an operational dairy barn context.

System Architecture and Methodological Advancements

The proposed system is a modular, end-to-end pipeline integrating the following principal components: YOLOv11-based cow detection and re-identification, ByteTrack for multi-object tracking, ZebraPose for 27-point anatomical keypoint estimation, and a custom support vector machine (SVM) that synthesizes pose-derived temporal dynamics to classify interactions. The pipeline operates on standard barn video footage and produces structured, time-stamped records of cow-cow interactions, serving as the substrate for constructing weighted, directed social networks.

A central methodological innovation is the departure from pixel-level video interpretation, which is both computationally intensive and sensitive to occlusion—a recurrent issue in barn environments. Instead, the system models dyadic behavior through structured analysis of keypoint trajectories. The classifier is trained on handcrafted features encapsulating pairwise keypoint distances and their temporal evolution, explicitly targeting features such as mean separation, variance, rate of spatial change, and collision proxies (second-order derivatives). This abstraction enables efficient computation and robust performance across visually complex, multi-cow scenes.

Importantly, the system employs a proximity filter for computational efficiency, but the ultimate inference step—classification of interaction valence—relies strictly on temporal geometry extracted from pose data, not on static spatial thresholds.
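The two stages described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: the exact feature definitions, the frame rate, and the body-length parametrisation of the proximity gate are assumptions (the 0.35 proximity factor is taken from the sensitivity analysis reported later).

```python
import numpy as np

def dyad_features(dist_series, fps=25.0):
    """Trajectory features for one dyad from a pairwise keypoint-distance
    time series: mean separation, variance, mean rate of spatial change,
    and a second-order 'collision proxy' (peak acceleration magnitude)."""
    d = np.asarray(dist_series, dtype=float)
    vel = np.gradient(d) * fps      # first derivative: approach/retreat speed
    acc = np.gradient(vel) * fps    # second derivative: collision proxy
    return np.array([d.mean(), d.var(), np.abs(vel).mean(), np.abs(acc).max()])

def within_proximity(dist_series, body_length_px, factor=0.35):
    """Proximity pre-filter: only score dyads whose minimum separation
    drops below factor * body length (parametrisation is illustrative)."""
    return float(np.min(dist_series)) < factor * body_length_px
```

Only dyads passing `within_proximity` would be handed to the feature extractor and classifier; the gate saves computation but plays no part in the valence decision itself.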

Experimental Design and Quantitative Results

Empirical evaluation is conducted on a dataset acquired from a commercial barn (Sussex, New Brunswick, Canada), consisting of approximately seven hours of multi-camera, high-resolution video. The annotation effort is nontrivial: 1,956 cow instances labeled with 27 keypoints each and 160 interaction clips manually categorized into affiliative (licking/grooming) and agonistic (headbutting, displacement) classes. The annotation pipeline is rigorously validated via dual labeling and consensus curation (Cohen’s Kappa = 0.88).
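The reported agreement statistic can be reproduced directly from the two annotators' label sequences. A standard Cohen's kappa computation (a generic sketch, not the authors' annotation tooling) looks like:

```python
import numpy as np

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences.
    Assumes at least two classes are used (otherwise p_e == 1)."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    classes = np.union1d(a, b)
    p_o = np.mean(a == b)                                           # observed agreement
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in classes)   # chance agreement
    return float((p_o - p_e) / (1.0 - p_e))
```

A kappa of 0.88, as reported, indicates near-perfect agreement under the usual Landis–Koch interpretation, supporting the consensus-curated labels.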

Detection, Tracking, and Pose Estimation

  • YOLOv11x achieves mAP@0.5 of 96.24% on barn-specific detection
  • Individual identification yields 98.24% accuracy (YOLOv11-cls)
  • ByteTrack delivers 81.96% tracking accuracy in crowded scenes
  • ZebraPose outperforms YOLO-pose for pose estimation under barn conditions, with AP = 0.809, AR = 0.832, and mAP@0.5 = 0.977

Interaction Classification and Network Construction

The SVM-based classifier, trained on keypoint-trajectory features, attains an overall accuracy of 77.51%, with marked gains in affiliative interaction identification over the proximity-only baseline (precision rising from 32% to 78%, a 146% relative increase). Macro-F1 rises from 0.54 (baseline) to 0.71. Precision for agonistic interactions (0.65) lags that for affiliative ones, reflecting both the kinematic sparsity of agonistic bouts and class imbalance.
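A hedged sketch of this classification stage in scikit-learn follows. The kernel choice, the absence of hyperparameter tuning, and the synthetic stand-in data are all assumptions for illustration; `class_weight="balanced"` mirrors the class-imbalance issue noted above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in data: 160 interaction clips x 4 trajectory features,
# labels 0 = affiliative, 1 = agonistic (synthetic, for illustration).
X = rng.normal(size=(160, 4))
y = (X[:, 2] + 0.25 * rng.normal(size=160) > 0).astype(int)

# Standardize features, then fit an RBF-kernel SVM with class reweighting.
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", class_weight="balanced"))
clf.fit(X, y)
train_acc = clf.score(X, y)
```

In practice each row of `X` would come from the trajectory-feature extractor, and evaluation would use held-out clips rather than training accuracy.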

Ablation studies confirm that both temporal and collision-detection features are essential—removing either results in a 6–8% absolute drop in performance, and using mean proximity alone collapses accuracy to that of the naive baseline. Sensitivity analyses on spatial and temporal thresholds reveal robust performance, with optimal macro-F1 achieved at proximity factor 0.35 and a 4-second dwell time.

Through automated aggregation, the pipeline constructs multi-class, weighted social networks differentiating interactions by valence. This enables subsequent analysis of group cohesion, dominance, isolation, and potentially welfare-related network metrics.
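The aggregation step can be sketched as folding the pipeline's classified records into per-valence edge maps. The `(actor, receiver, label)` record schema below is an assumption about the pipeline's output format:

```python
from collections import defaultdict

def build_networks(interaction_records):
    """Fold classified dyadic interaction records into weighted,
    directed edge maps, one network per interaction valence."""
    nets = {"affiliative": defaultdict(float), "agonistic": defaultdict(float)}
    for actor, receiver, label in interaction_records:
        nets[label][(actor, receiver)] += 1.0   # edge weight = interaction count
    return {valence: dict(edges) for valence, edges in nets.items()}
```

Edge weights here are raw interaction counts; duration- or confidence-weighted edges would simply replace the `+= 1.0` increment. Standard network metrics (degree centrality, clustering) can then be computed per valence.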

Computational Performance

On commodity hardware (Intel i7-9700, RTX 2060), the system operates at an end-to-end latency of 73 ms per frame (13.7 fps). The interaction classification contributes only 4% of total computational cost, highlighting the efficiency of pose-based methods over conventional pixel-level deep learning architectures. While full real-time operation (60 fps) is not achieved, the framework enables practical asynchronous farm monitoring within minutes of data capture.

Implications, Limitations, and Theoretical Considerations

The results decisively demonstrate that proximity-based methods are fundamentally inadequate for behavior valence inference—affiliative and agonistic behaviors exhibit overlapping spatial signatures but distinct pose/motion dynamics. Keypoint-trajectory analysis circumvents this confound, supporting empirical construction of behavioral social networks from unstructured video.

Practical Implications

  • Welfare Monitoring: The system enables continuous, objective, and noninvasive welfare assessment, removing the need for labor-intensive or intrusive sensor systems.
  • Management Integration: Automated interaction networks can be correlated with health, productivity, and reproductive records, providing actionable metrics for intervention and herd management.
  • Scalability and Edge Deployment: The pipeline’s low computational demand facilitates deployment on standard farm hardware, supporting real-world adoption.

Limitations

  • Supervised Learning Constraints: Current methods rely on labor-intensive annotation; robustness under occlusion and generalization across herds/farms remains contingent on supervised re-training and transfer learning.
  • Behavioral Taxonomy: Only three interaction classes are modeled; further generalization (mounting, sniffing, mixed behaviors) will require additional labeled data and possibly advanced multi-class frameworks.
  • Temporal Granularity vs. Longitudinal Stability: Short-term network snapshots reflect transient dynamics; stable social structure assessment demands multi-day, longitudinal validation.
  • Domain Adaptation: Supervised modules (identification, pose estimation) are currently herd/barn-specific; methods such as meta-learning and self-supervised adaptation represent future directions.

Theoretical and Future Developments

Graph Neural Networks (GNNs) and Transformer-based models could be employed to further capture the temporal and relational complexity of multi-animal interactions, potentially improving F1 by 10–15%. Integrating additional modalities (audio, thermal infrared, physiological metrics) could enhance robustness and behavioral specificity. The modular architecture is amenable to cross-species adaptation with retraining, suggesting pathways toward general precision livestock analytics.

Advanced temporal modeling (LSTM/attention-based denoising, Kalman filtering at keypoint level) could stabilize pose outputs under severe occlusion, enhancing sensitivity to rapid, interaction-specific motions in dense environments.

User interface and integration with farm management platforms are critical for adoption; metrics need to be translated into actionable visualizations, alerts, and predictive tools for farm staff rather than presented as abstract network statistics.

Conclusion

This work establishes a rigorous, modular, vision-based framework for automated, valence-aware inference of dyadic social interactions in commercial dairy herds, validated at near-realistic scale. Through pose-trajectory modeling and efficient classification, the approach transcends the structural limitations of proximity-based and sensor-driven methodologies, achieving a 16% absolute improvement in interaction classification accuracy and supporting the construction of interaction-aware social networks.

These developments substantiate a methodological shift in precision livestock farming from contact/proximity monitoring toward behavioral context analysis—unlocking unprecedented resolution in welfare assessment, management optimization, and ethological research. Future advances in domain adaptation, multi-modal integration, and graph-based sequence modeling will further extend capability and impact, advancing scalable, welfare-centric, data-driven animal agriculture.
