Foot Contact Estimation (FECO)

Updated 4 July 2026

FECO is the estimation of foot contact with the environment, quantifying contact location and stability using modalities such as motion capture, plantar sensing, vision, and IMUs.
The approaches range from binary contact labels and continuous force fields to dense per-vertex contact maps, each tailored for specific applications in human motion synthesis and legged robotics.
FECO systems integrate sensor fusion and contact-mechanics constraints to improve motion correction, state estimation, and odometry in complex, real-world scenarios.

FEet COntact estimation (FECO) denotes the estimation of whether a foot or foot effector is in contact with the environment, where that contact occurs, and, in several formulations, how strongly or how stably it occurs. In the cited literature, FECO spans human motion processing, wearable and plantar sensing, monocular and keypoint-based vision, and proprioceptive state estimation for legged robots. Its outputs range from binary heel and toe labels, per-cell vertical ground reaction forces (vGRF), and per-region contact maps to six-degree-of-freedom contact probabilities, foothold landmarks, and dense per-vertex sole contact fields (Mourot et al., 2022, Maravgakis et al., 2023, Jung et al., 27 Nov 2025).

1. Problem scope and representational choices

FECO is not a single estimator class but a family of estimation problems defined by modality, contact granularity, and downstream use. In human-motion synthesis, FECO is often introduced because footskating is a frequent and disturbing artefact, and accurate contact labels are needed for cleanup and physically plausible editing. In legged robotics, FECO is coupled to floating-base state estimation, contact-aided odometry, and controller logic, where missed or spurious contacts directly affect base pose drift, slip rejection, and contact-mode inference (Mourot et al., 2022, Menner et al., 2024).

Domain	Inputs	Outputs
Human motion from mocap	3D joint positions, pressure insoles	vGRF, heel/toe contact labels
Plantar sensing	Foot pressure distribution	posture and joint angles
Vision-based human FECO	2D/3D keypoints or RGB image	contact probabilities, contact maps, dense foot contact
Legged robotics	IMU, encoders, force sensors, motor torques, binary contact signals	stable-contact probability, contact mode, foothold constraints

A central representational distinction is between binary contact labels and continuous contact variables. UnderPressure defines two contact points per foot, heel and toe, from smoothed pressure-derived forces via thresholded functions $H(t)$ and $T(t)$, followed by removal of any contact phase shorter than $0.1\,\mathrm{s}$ (Mourot et al., 2022). By contrast, FootFormer predicts a binary contact map $C_t \in \{0,1\}^n$ over discrete foot regions from visual input, while the dense single-image FECO framework predicts per-vertex contact probabilities on a 265-vertex SMPL-X foot mesh (Kraiger et al., 22 Oct 2025, Jung et al., 27 Nov 2025). In robotics, the representation may be probabilistic rather than binary: FECO has been defined as $P(c=1\mid a_t)$ from foot IMUs, as per-DoF fuzzy-membership contact probabilities, or as a belief over left-only, right-only, and dual-support modes (Maravgakis et al., 2023, Rotella et al., 2017, Payne et al., 2024).

This breadth has methodological consequences. A plausible implication is that FECO should be understood less as a single classifier and more as an interface variable between sensing and mechanics: it may appear as a label, a force field, a covariance weight, a temporary landmark, or a mode variable, depending on the estimator’s role.

2. Motion-driven FECO in human animation and motion editing

The most detailed motion-centric FECO pipeline in the cited literature is “UnderPressure” (Mourot et al., 2022). The paper publicly releases a motion-capture database in which ten healthy adult volunteers performed forward/backward walking at slow/normal/fast paces, running, hopping, stair-climbing, obstacle stepping/jumping, sitting, and crouching. Motion capture was recorded at $240\,\mathrm{Hz}$ with an Xsens MVN Link suit producing 23 segment poses, while Moticon OpenGo pressure insoles sampled at $100\,\mathrm{Hz}$ provided 16 cells per foot plus a 6-axis IMU. The total recorded mocap duration is approximately $5.6\,\mathrm{h}$.

UnderPressure constructs ground-truth vGRF from insole-cell pressure $p_i(t)$ using

$f_i(t)=p_i(t)\cdot A_i,$

with all $f_i(t)$ normalized to body-weight units. Heel and toe contact labels are then defined from

$T(t)$ 1

after a short Gaussian smoothing. The contact-label function $T(t)$ 2 is

$T(t)$ 3

$T(t)$ 4

followed by removal of any contact phase shorter than $0.1\,\mathrm{s}$. This makes FECO explicitly force-derived rather than purely kinematic.
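The labeling steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the grouping of insole cells into a heel or toe region (`cells`), the threshold value, and the smoothing width `sigma` are hypothetical placeholders, while the per-cell force $f_i(t)=p_i(t)\cdot A_i$, the body-weight normalization, and the $0.1\,\mathrm{s}$ minimum contact duration come from the text.

```python
import numpy as np

def remove_short_phases(contact, min_len):
    """Clear every run of True values shorter than min_len samples."""
    out = np.asarray(contact, dtype=bool).copy()
    padded = np.concatenate(([False], out, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])  # alternating run starts and ends
    for start, stop in zip(edges[::2], edges[1::2]):
        if stop - start < min_len:
            out[start:stop] = False
    return out

def contact_labels(pressure, areas, body_weight, cells, threshold, fs,
                   min_dur=0.1, sigma=0.02):
    """Binary contact label for one foot region.

    pressure: (T, n_cells) cell pressures; areas: (n_cells,) cell areas;
    body_weight: subject weight in the same force unit as pressure * area;
    cells: indices of the cells assumed to form the region (hypothetical);
    threshold: force threshold in body-weight units (hypothetical);
    fs: sampling rate in Hz; sigma: Gaussian smoothing width in seconds (assumed).
    """
    forces = pressure * areas / body_weight      # f_i(t) = p_i(t) * A_i, in body weights
    region = forces[:, cells].sum(axis=1)        # assumed: region force = sum over its cells
    half = max(1, int(round(3 * sigma * fs)))
    t = np.arange(-half, half + 1) / fs
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    smooth = np.convolve(region, kernel, mode="same")  # short Gaussian smoothing
    return remove_short_phases(smooth > threshold, int(round(min_dur * fs)))
```

Only contact phases are filtered by duration here; short gaps between contacts are left untouched, which matches the wording "removal of any contact phase shorter than $0.1\,\mathrm{s}$".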

The estimator input is a window of $T(t)$ 6 consecutive frames of 3D joint positions $T(t)$ 7, with $T(t)$ 8. The output is a predicted vGRF tensor $T(t)$ 9, corresponding to left/right foot and 16 insole-cell forces per foot. The architecture comprises four temporal 1D-convolutional layers with kernel size $0.1\,\mathrm{s}$ 0 and channel widths $0.1\,\mathrm{s}$ 1, ELU activations, then three fully connected layers of 256 neurons each with dropout $0.1\,\mathrm{s}$ 2, and a final softplus projection to ensure nonnegative outputs. The model has approximately $0.1\,\mathrm{s}$ 3 parameters.

Training uses random data augmentations that preserve vGRF, including horizontal rotations, horizontal translations, uniform scaling, left-right mirroring, and random skeletal-morphology perturbations via a low-rank SVD basis. The core regression loss is the mean squared logarithmic error

$\mathcal{L}_{\mathrm{MSLE}}=\frac{1}{N}\sum_{n=1}^{N}\bigl(\log(1+\hat{f}_n)-\log(1+f_n)\bigr)^2,$

with Adam at learning rate $0.1\,\mathrm{s}$ 5, $0.1\,\mathrm{s}$ 6, $0.1\,\mathrm{s}$ 7, batch size $0.1\,\mathrm{s}$ 8, approximately 2500 epochs, and early stopping via a held-out validation subject set comprising $0.1\,\mathrm{s}$ 9 of the data.
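For reference, the mean squared logarithmic error in its standard form can be written in a few lines. This sketch assumes the usual $\log(1+x)$ convention and plain averaging over all entries; the function names are illustrative, and the softplus is included only because the text states that the final projection keeps outputs nonnegative, which is what makes the logarithm well defined.

```python
import numpy as np

def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); output is always >= 0."""
    x = np.asarray(x, dtype=float)
    return np.logaddexp(0.0, x)

def msle(pred, target):
    """Standard mean squared logarithmic error over all entries."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((np.log1p(pred) - np.log1p(target)) ** 2))
```

Because the error is measured on a log scale, a given relative error on a small force costs about as much as the same relative error on a large one, which is one plausible reason to prefer it over a plain squared error for forces that span a wide range.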

At inference time, the network predicts 16 per-cell forces per foot, and FECO labels are recovered by applying the same $C_t \in \{0,1\}^n$ 0 used for ground truth: Gaussian-smooth the predicted pressures, compute $C_t \in \{0,1\}^n$ 1 and $C_t \in \{0,1\}^n$ 2, threshold with $C_t \in \{0,1\}^n$ 3 and $C_t \in \{0,1\}^n$ 4, and discard contact intervals shorter than $0.1\,\mathrm{s}$. On foot contact detection, the learned model (labeled “Ours” in the paper) reaches overall $C_t \in \{0,1\}^n$ 6 versus the optimal-threshold (OT) heuristic $C_t \in \{0,1\}^n$ 7. With a $C_t \in \{0,1\}^n$ 8 tolerance, $C_t \in \{0,1\}^n$ 9. On vGRF estimation, the overall RMSE is approximately $P(c=1\mid a_t)$ 0 body weight and the CoP median absolute deviation is approximately $P(c=1\mid a_t)$ 1. Robustness tests with Gaussian noise added to joint positions, autoencoder distortions, and motion-blend-induced footskate show that the learned model degrades gracefully, whereas the OT heuristic collapses quickly.

UnderPressure also closes the loop from FECO to motion correction. Given contacts $P(c=1\mid a_t)$ 2 and reference vGRF $P(c=1\mid a_t)$ 3, a skated sequence $P(c=1\mid a_t)$ 4 is cleaned up by minimizing

$P(c=1\mid a_t)$ 5

with typical weights $P(c=1\mid a_t)$ 6, $P(c=1\mid a_t)$ 7, $P(c=1\mid a_t)$ 8, $P(c=1\mid a_t)$ 9, and $240\,\mathrm{Hz}$ 0 gradient iterations. The stated purpose is to ensure that the edited motion respects the originally estimated ground-contact forces and eliminates foot-skate. In this formulation, FECO is not only a detection problem but also a constraint generator for inverse kinematics and dynamics-preserving editing.

3. Plantar-pressure FECO, morphology, and physical reservoir interpretations

A distinct FECO line estimates whole-body variables from plantar pressure alone. Kobayashi and Nakashima propose a contact and wearable sensing system in which only the plantar region is instrumented, using a flexible, sheet-type, electromagnetic-induction pressure-distribution sensor with $240\,\mathrm{Hz}$ 1 taxels of $240\,\mathrm{Hz}$ 2, mounted flat on the floor (Kobayashi et al., 2024). Participants stand or squat with bare feet, or with an interposed material, directly atop the sheet.

The signal-processing pipeline linearly interpolates irregular LL480×480 sampling to a fixed $240\,\mathrm{Hz}$ 3 timestep, resamples OptiTrack motion capture to the same timestep, discards the first $240\,\mathrm{Hz}$ 4 of each trial, and retains $240\,\mathrm{Hz}$ 5 per run. Feature selection is univariate: Pearson $240\,\mathrm{Hz}$ 6 is computed between each taxel’s time series and each target angle, and only taxels with $240\,\mathrm{Hz}$ 7 are retained. The selected taxels and each target angle are Z-score standardized over the training set.

Each scalar target is estimated from the pressure state $240\,\mathrm{Hz}$ 8 via ridge regression: $240\,\mathrm{Hz}$ 9 with the training objective

$100\,\mathrm{Hz}$ 0

and $100\,\mathrm{Hz}$ 1 chosen by grid-search. The experiments involve 7 healthy males performing natural squats paced by metronome, with 11 runs each across three conditions: direct contact, a $100\,\mathrm{Hz}$ 2 silicone-rubber sheet, and a $100\,\mathrm{Hz}$ 3 PLA plastic sheet.
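The plantar-pressure pipeline (univariate taxel selection, Z-scoring, ridge readout, grid search over the regularization weight) can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions: the selection rule is taken to be an absolute Pearson correlation above a hypothetical cutoff `r_min`, the ridge solution uses the standard closed form, and the grid search scores candidates on a held-out split. None of the function names or default values come from the paper.

```python
import numpy as np

def select_taxels(X, y, r_min):
    """Indices of columns of X whose |Pearson r| with y exceeds r_min (assumed rule)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    return np.flatnonzero(np.abs(r) > r_min)

def standardize(train, data):
    """Z-score data using statistics of the training set only."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    sd = np.where(sd > 0, sd, 1.0)  # guard against constant columns
    return (data - mu) / sd

def fit_ridge(X, y, lam):
    """Closed-form ridge weights: (X^T X + lam * I)^{-1} X^T y."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def choose_lambda(X_tr, y_tr, X_val, y_val, grid):
    """Grid search: pick the candidate with the lowest validation MSE."""
    errors = [np.mean((X_val @ fit_ridge(X_tr, y_tr, lam) - y_val) ** 2) for lam in grid]
    return grid[int(np.argmin(errors))]
```

A typical call sequence would select taxels on the training runs, standardize both splits with the training statistics, choose the regularization weight on a validation run, and refit on all training data before evaluating.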

Under direct contact, the reported accuracy is $100\,\mathrm{Hz}$ 4 for ankle, $100\,\mathrm{Hz}$ 5 for knee, $100\,\mathrm{Hz}$ 6 for hip, and $100\,\mathrm{Hz}$ 7 for upper body, with corresponding RMSE values of $100\,\mathrm{Hz}$ 8, $100\,\mathrm{Hz}$ 9, $5.6\,\mathrm{h}$ 0, and $5.6\,\mathrm{h}$ 1 (Kobayashi et al., 2024). In the silicone-rubber condition, $5.6\,\mathrm{h}$ 2 drops by approximately $5.6\,\mathrm{h}$ 3– $5.6\,\mathrm{h}$ 4 across joints with $5.6\,\mathrm{h}$ 5, and in the plastic condition the drop is larger with $5.6\,\mathrm{h}$ 6.

The paper interprets this degradation as evidence that the morphology of the plantar region contributes to estimation and frames the foot as a physical reservoir. Specifically, the foot’s compliant structure and viscoelastic dynamics are said to map motor commands and body posture into a rich, nonlinear pressure-distribution trajectory, while a simple linear readout performs the final decoding. Taxel-weight maps qualitatively align with mechanoreceptor-dense regions. This suggests that, in FECO formulations based on plantar sensing, contact is not merely a support-state label; it can also be the high-dimensional interface through which distal morphology encodes proximal kinematics.

The authors explicitly note limitations: the work is restricted to 2D sagittal-plane squat movements, and extension to 3D gait would require phase-dependent models or continuous gait-phase estimation, robustness to foot lift-off and varying shoe interiors, and possibly nonlinear or sparse-coding readouts.

4. Visual FECO: from temporal contact timing to dense single-image contact

Visual FECO methods differ mainly in output granularity and the degree to which contact is embedded in a downstream physical model. “Contact and Human Dynamics from Monocular Video” estimates contact timings with a prediction network trained without hand-labeled data (Rempe et al., 2020). Its input is a temporal window of $9$ consecutive frames of 2D OpenPose detections for 13 lower-body joints, each with $(x, y, \text{confidence})$, for a total input dimension of $9 \times 13 \times 3 = 351$. The network is a 5-layer MLP with widths

$5.6\,\mathrm{h}$ 9

with ReLU and batch normalization. It outputs contact probabilities for 4 foot joints across a 5-frame prediction window. Training labels are generated synthetically from mocap by declaring contact when the foot joint is nearly stationary and within $p_i(t)$ 0 of the floor. On the reported test sets, the FECO MLP attains synthetic $p_i(t)$ 1 and real $p_i(t)$ 2. Those contact labels are then injected into a physics-based trajectory optimization with COM, foot positions, and contact forces as variables, together with dynamics, friction, and contact complementarity constraints.
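The synthetic labeling rule (a foot joint is in contact when it is nearly stationary and close to the floor) can be sketched as below. The speed and height thresholds, the z-up convention, and the finite-difference velocity are assumptions made for illustration; the source does not state them in recoverable form.

```python
import numpy as np

def synthetic_contacts(foot_pos, fps, max_speed, max_height, floor_height=0.0):
    """Heuristic contact labels from mocap foot-joint trajectories.

    foot_pos: (T, J, 3) positions in meters with z up (assumed convention);
    fps: frame rate in Hz;
    max_speed: speed below which a joint counts as stationary (hypothetical);
    max_height: height above the floor below which a joint counts as near it (hypothetical).
    Returns a boolean array of shape (T, J).
    """
    pos = np.asarray(foot_pos, dtype=float)
    # Backward finite difference; the first frame gets zero velocity.
    vel = np.diff(pos, axis=0, prepend=pos[:1]) * fps
    speed = np.linalg.norm(vel, axis=-1)
    height = pos[..., 2] - floor_height
    return (speed < max_speed) & (height < max_height)
```

Labels of this kind are cheap to generate from existing mocap, which is what allows the contact network to be trained without hand annotation, but they inherit the failure modes of any zero-velocity heuristic, such as sliding contacts.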

FootFormer moves FECO toward joint prediction of pressure, contact, and center of mass directly from visual input (Kraiger et al., 22 Oct 2025). The architecture is encoder–transformer–decoder: a per-frame GCN pose encoder with learned adjacency, an 8-layer spatio-temporal transformer with 16 heads and local temporal masking, attention pooling, and a contact head producing

$p_i(t)$ 3

The overall multi-task loss is

$p_i(t)$ 4

with all loss weights set to 1. On UnderPressure, the reported contact results are precision $p_i(t)$ 6, recall $p_i(t)$ 7, $F_1$ $p_i(t)$ 9, and IoU $f_i(t)=p_i(t)\cdot A_i,$ 0, exceeding the UnderPressure (UP) baseline across all four metrics with statistical significance; on MMVP, FootFormer improves precision to $f_i(t)=p_i(t)\cdot A_i,$ 1 while achieving recall $f_i(t)=p_i(t)\cdot A_i,$ 2, $F_1$ $f_i(t)=p_i(t)\cdot A_i,$ 4, and IoU $f_i(t)=p_i(t)\cdot A_i,$ 5 (Kraiger et al., 22 Oct 2025).
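The four contact metrics quoted here have standard definitions over binary maps, sketched below. The handling of empty denominators (returning 0.0) is a convention chosen for this sketch, not something taken from the cited papers.

```python
import numpy as np

def contact_metrics(pred, truth):
    """Precision, recall, F1, and IoU for binary contact maps of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))    # contact predicted and present
    fp = int(np.sum(pred & ~truth))   # contact predicted but absent
    fn = int(np.sum(~pred & truth))   # contact missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
```

For example, a prediction `[1, 1, 0, 0]` against ground truth `[1, 0, 1, 0]` has one true positive, one false positive, and one false negative, giving precision, recall, and F1 of 0.5 and an IoU of 1/3.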

“Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation” extends FECO from region-level maps to dense contact fields from a single RGB image (Jung et al., 27 Nov 2025). The input is $f_i(t)=p_i(t)\cdot A_i,$ 6, and the output is $f_i(t)=p_i(t)\cdot A_i,$ 7 on a 265-vertex SMPL-X foot mesh, with intermediate supervision at coarser levels $f_i(t)=p_i(t)\cdot A_i,$ 8. The pipeline combines low-level style randomization, a shared ViT-Huge backbone, shoe style–content randomization with an adversarial branch, a ground feature extractor, spatial-attention fusion, and a 6-layer transformer decoder. The total loss is

$f_i(t)=p_i(t)\cdot A_i,$ 9

with all $T(t)$ 00 except $T(t)$ 01. On MMVP at vertex level, FECO reports precision $T(t)$ 02, recall $T(t)$ 03, and $F_1$-score $T(t)$ 05, outperforming POSA, BSTRO, and DECO. On the COFE joint-level benchmark, the single-image FECO model reports precision $T(t)$ 06, recall $T(t)$ 07, and $F_1$-score $T(t)$ 09.

Across these visual methods, the role of FECO shifts from timing estimation to structure estimation. A plausible implication is that the field is moving from “is the heel or toe down?” toward “which support patches exist, at what density, and under what scene and shoe conditions?” The dense formulation makes explicit a limitation of earlier zero-velocity approximations noted in the paper: joint-level contact cannot capture rich spatial support patterns.

5. Proprioceptive FECO in legged robotics

In legged robotics, FECO is tightly coupled to state estimation. A foundational formulation appears in the humanoid EKF of “State Estimation for a Humanoid Robot,” where contact indicators are not internal states; instead, during known contact periods each foot contributes position and orientation measurements derived from kinematics, and during swing those measurements are dropped by inflating the associated process noise (Rotella et al., 2014). The state includes IMU position, velocity, quaternion, foot positions $T(t)$ 10, IMU biases, and foot quaternions $T(t)$ 11. The flat-foot rotational constraint improves observability relative to point contact: beyond globally unobservable $T(t)$ 12, and yaw, the maximum unobservable subspace dimension drops significantly, and a single flat-foot contact fully constrains all base rotational degrees under generic motion.

“Legged Robot State-Estimation Through Combined Forward Kinematic and Preintegrated Contact Factors” recasts FECO into a factor-graph setting (Hartley et al., 2017). The forward-kinematic factor relates base pose to contact frame through noisy encoders, while the preintegrated contact factor enforces that a rigid contact frame has near-zero motion, up to slip noise $T(t)$ 13 and $T(t)$ 14. Variables include base pose, base velocity, contact pose, and IMU biases, optimized incrementally with iSAM2. In simulation, IMU+FECO reduces drift by approximately $T(t)$ 15 over IMU alone; in real-world Cassie walking, IMU+FK+contact yields approximately $T(t)$ 16 end-to-end drift over a 100 s loop.

Other robotics FECO methods estimate contact quality directly. Rotella et al. learn six-dimensional humanoid contact probabilities using fuzzy C-means clustering on histories of foot wrench and IMU channels (Rotella et al., 2017). For each DoF $T(t)$ 17, a feature vector over $T(t)$ 18 steps is normalized, absolute-valued, and clustered with $T(t)$ 19, $T(t)$ 20. The resulting memberships are interpreted as $T(t)$ 21 and are inserted into an EKF via a probability-weighted measurement covariance

$T(t)$ 22

On rough-terrain SL simulation, the clustering approach reduces base-position RMSE by approximately $T(t)$ 23– $T(t)$ 24 and yaw RMSE by approximately $T(t)$ 25 relative to a $T(t)$ 26 normal-force threshold baseline.

“Probabilistic Contact State Estimation for Legged Robots using Inertial Information” goes further by using only end-effector IMUs (Maravgakis et al., 2023). Stable contact is modeled as near-zero acceleration and angular velocity per axis, with axis-wise 1D KDE over a sliding window of recent measurements and a Bayesian posterior

$T(t)$ 27

The online loop runs at $T(t)$ 28, with typical thresholds $T(t)$ 29 and $T(t)$ 30. On the ATLAS dataset with 22k samples and five footsteps with three slipping, FECO reports RMSE $T(t)$ 31, versus $T(t)$ 32 for the FCM-based baseline. The same method is reported to remain robust on soft foam mattresses and oily floors where force-based logic misclassifies contact quality.

Contact mode can also be estimated jointly with the robot state. “Simultaneous State Estimation and Contact Detection for Legged Robots by Multiple-Model Kalman Filtering” models each foot-contact configuration as a mode of a switched linear system and maintains $T(t)$ 33 parallel Kalman filters plus mode probabilities $T(t)$ 34 (Menner et al., 2024). The most likely mode is $T(t)$ 35, and leg-contact probabilities are obtained by summing mode probabilities against binary mode-contact flags. In Gazebo simulation of 1 min trotting at $T(t)$ 36, the IMM–KF yields full state RMSE $T(t)$ 37 versus $T(t)$ 38 for the baseline estimator, vertical CoM position RMSE $T(t)$ 39 versus $T(t)$ 40, and CoM velocity RMSE $T(t)$ 41 versus $T(t)$ 42. Hardware experiments on Unitree A1 run at $T(t)$ 43 with mean computation time $T(t)$ 44.
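The step from mode probabilities to per-leg contact probabilities is a weighted sum against binary contact flags, as sketched below. The sketch assumes that the mode set enumerates every stance/swing combination of the legs; the cited paper may use a different or reduced mode set, and the function names are illustrative.

```python
import itertools
import numpy as np

def contact_modes(n_legs):
    """All binary contact configurations, shape (2**n_legs, n_legs); 1 means stance (assumed mode set)."""
    return np.array(list(itertools.product([0, 1], repeat=n_legs)), dtype=float)

def leg_contact_probabilities(mode_probs, mode_flags):
    """P(leg i in contact) = sum over modes of P(mode) * flag[mode, i]."""
    mu = np.asarray(mode_probs, dtype=float)
    mu = mu / mu.sum()  # renormalize defensively
    return mu @ mode_flags

def most_likely_mode(mode_probs):
    """Index of the mode with the highest probability."""
    return int(np.argmax(mode_probs))
```

With two legs and mode probabilities `[0.1, 0.2, 0.3, 0.4]` over the configurations `(0,0), (0,1), (1,0), (1,1)`, the first leg is in contact with probability 0.7 and the second with probability 0.6, while the most likely single mode is dual contact.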

A related model-based direction is the momentum-observer formulation for bipedal contact-mode estimation (Payne et al., 2024). Separate constrained-dynamics observers are run for left-stance and right-stance hypotheses; the norms of the estimated external torques $T(t)$ 45 and the relative foot velocity feed a Markov-style fusion over the states $T(t)$ 46. The reported mode-detection accuracy is up to $T(t)$ 47 in low-noise simulation and $T(t)$ 48 on Sarcos Guardian XO data, compared with $T(t)$ 49 for open-loop planned contacts.
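A Markov-style fusion over a small set of contact modes can be illustrated with a generic discrete Bayes filter step. This is a textbook recursion offered as a sketch, not the observer-based method of the cited paper: the three-state ordering (left-only, right-only, dual support), the transition matrix, and the per-mode likelihoods are hypothetical, and how the external-torque norms and foot velocity are turned into likelihoods is not reproduced here.

```python
import numpy as np

def mode_filter_step(belief, transition, likelihood):
    """One predict-and-update step of a discrete Bayes filter.

    belief: (K,) prior mode probabilities;
    transition: (K, K) row-stochastic matrix, transition[i, j] = P(next = j | current = i);
    likelihood: (K,) likelihood of the current measurement under each mode (hypothetical input).
    """
    predicted = np.asarray(transition, dtype=float).T @ np.asarray(belief, dtype=float)
    posterior = predicted * np.asarray(likelihood, dtype=float)
    total = posterior.sum()
    if total <= 0.0:
        return predicted / predicted.sum()  # uninformative measurement: keep the prediction
    return posterior / total
```

A sticky transition matrix, with large diagonal entries, suppresses chattering between modes when the measurement likelihoods are noisy, at the cost of a short delay at true mode switches.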

Finally, OCELOT defines FECO as fused contact detection plus uncertainty quantification for slip-aware odometry (Girgin et al., 21 May 2026). Each foot runs two detectors in parallel: a debounced, force-based GMM-guided FSM on $T(t)$ 50, and a kinematic GLRT on world-frame foot velocity over a short window $T(t)$ 51. The continuous scores

$T(t)$ 52

are fused by multiplication,

$T(t)$ 53

and converted into an adaptive measurement covariance $T(t)$ 54. The paper reports that the fused model yields the most consistent ATE across concrete, tile, grass, pebble, and rock, with rock ATE $T(t)$ 55 versus $T(t)$ 56 for FSM-only and $T(t)$ 57 for GLRT-only. The associated ESEKF runs at $T(t)$ 58.
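The idea of fusing two detector scores by multiplication and mapping the result to a measurement covariance can be sketched as follows. The product fusion follows the text; the specific mapping from fused score to covariance (base covariance divided by the score, with a floor) is an assumption chosen for illustration and is not the formula used by OCELOT.

```python
import numpy as np

def fuse_scores(force_score, kinematic_score):
    """Multiply two contact scores after clipping each to [0, 1]."""
    return float(np.clip(force_score, 0.0, 1.0) * np.clip(kinematic_score, 0.0, 1.0))

def adaptive_covariance(base_cov, fused_score, floor=1e-3):
    """Inflate the contact-measurement covariance as confidence drops (assumed mapping).

    base_cov: covariance used when contact is fully trusted;
    floor: lower bound on the score so the covariance stays finite.
    """
    return np.asarray(base_cov, dtype=float) / max(fused_score, floor)
```

Multiplicative fusion acts like a soft AND: the contact measurement is trusted only when both the force-based and the kinematic detector agree, so a foot that is loaded but sliding receives a large covariance and contributes little to the filter update.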

The recent “Four Simple Proprioceptive Estimators for Legged Robots” places these ideas in a unified progression from contact-aided invariant EKF to graph-update filters and fixed-lag smoothers with contact-episode footholds and evolving IMU bias, all implemented in GTSAM and a ROS2-compatible package (Dellaert et al., 21 May 2026). FECO in that paper acts as an event-triggering front end, initializing and maintaining temporary foothold landmarks rather than only producing hard stance/swing flags.

6. Recurring themes, misconceptions, and open directions

Several misconceptions are contradicted directly by the cited work. First, FECO is not equivalent to thresholding foot height, foot velocity, or normal force. UnderPressure reports that a deep vGRF-based method outperforms the optimal-threshold heuristic $T(t)$ 59, while robotics papers show that vertical-force thresholding can fail when the foot is slipping, when the terrain is soft foam, or when all-four-feet contact yields ambiguous force patterns (Mourot et al., 2022, Maravgakis et al., 2023, Menner et al., 2024). Second, FECO is not necessarily binary. The literature includes per-cell force fields, per-region contact maps, per-DoF fuzzy contact probabilities, contact-mode beliefs, and dense per-vertex sole contact (Rotella et al., 2017, Jung et al., 27 Nov 2025). Third, FECO is not restricted to sensing at the target joint or foot alone: posture and joint angles can be estimated from plantar pressure distribution, and vision models can infer contact from 2D or 3D keypoints without direct force measurements (Kobayashi et al., 2024, Kraiger et al., 22 Oct 2025).

A recurring structural pattern is that FECO becomes most useful when inserted into a larger estimator or optimizer. In human motion, contact labels feed inverse kinematics and physics-based trajectory optimization; in robotics, contact probabilities modulate measurement covariances, instantiate foothold landmarks, or determine mode probabilities; in dense image FECO, ground-aware features and style-invariant representations are learned jointly with contact outputs (Rempe et al., 2020, Girgin et al., 21 May 2026, Jung et al., 27 Nov 2025). This suggests that FECO is often better viewed as a latent mechanical interface than as an isolated classification endpoint.

The limitations are equally domain-specific. UnderPressure uses ten healthy adult volunteers and focuses on mocap-plus-insole data; the plantar-pressure study is limited to 2D sagittal-plane squat movements; the multi-momentum observer does not yet discriminate unexpected contacts and does not include heel-strike, toe-off, or flight modes; the IMM–KF relies on rigid, non-slipping contact assumptions; dense single-image FECO remains challenged by extreme occlusion; and OCELOT validates contact estimation implicitly via odometry improvement rather than explicit ROC-style contact metrics (Mourot et al., 2022, Kobayashi et al., 2024, Payne et al., 2024, Menner et al., 2024, Jung et al., 27 Nov 2025, Girgin et al., 21 May 2026).

The future directions named in the cited works are consistent. They include extension from squat to 3D gait with phase-dependent models or continuous gait-phase estimation, outdoor experiments on unstructured terrain, tighter integration into whole-body controllers and SLAM, augmentation of contact-mode sets to include slip or flight, video-based dense FECO, joint dense-body and foot-contact modeling, and physiological validation of plantar sensitivity maps (Kobayashi et al., 2024, Maravgakis et al., 2023, Payne et al., 2024, Jung et al., 27 Nov 2025). Taken together, these directions indicate a convergence toward FECO systems that are multimodal, uncertainty-aware, and explicitly grounded in contact mechanics rather than only in kinematic heuristics.