SO-101 Task Set: Multi-Domain Overview

Updated 4 July 2026

SO-101 Task Set is a multi-domain construct that denotes beginner-level SOP tasks, real-world robotic benchmarks, user-defined tracking, and social norm corpora.
In the SOP context, it evaluates terminology precision, action sequencing, and conditional reasoning across three progressive stages with defined metrics.
For robotics and tracking, it standardizes evaluation through execution success, recovery rates, and compositional sequence construction rules.

The expression SO-101 Task Set is context-dependent rather than universally standardized. In current arXiv usage, it denotes at least three distinct technical constructions: an introductory Standard Operating Procedure (SOP) task set aligned with the staged training logic of FM SO.P; a real-world manipulation benchmark built on the low-cost SO-101 robot for evaluating Vision-Language-Action and imitation-learning policies; and a possible 101-sequence instantiation inside the SOTVerse user-defined task space for single-object tracking. A related but separate source of nomenclatural overlap is SOCIAL-CHEM-101, where “SO-101” refers to Social Chemistry 101 rather than a robotic, tracking, or SOP benchmark (Huang et al., 10 Feb 2026, Yu et al., 7 Jun 2026, Hu et al., 2022, Forbes et al., 2020).

1. Terminological scope and disambiguation

Across the relevant literature, the phrase does not identify a single benchmark family with shared executors, modalities, or metrics. In FM SO.P, SO-101 is an entry-level SOP task set designed around a three-stage progression from terminology precision to sequencing and then to conditional graph reasoning. In low-cost robotics, the SO-101 Task Set is a real-world, execution-centric benchmark on a physical tabletop manipulator. In SOTVerse, “SO-101 Task Set” is not an official paper term at all; it is a paper-grounded instantiation of a 101-sequence user-defined task space derived from SOTVerse construction rules. In Social Chemistry, the closely related label refers to SOCIAL-CHEM-101, a corpus of rules-of-thumb and social judgments rather than a benchmark commonly described as an SO-101 task set (Huang et al., 10 Feb 2026, Yu et al., 7 Jun 2026, Hu et al., 2022, Forbes et al., 2020).

Usage	Core object	Source
FM SO.P SO-101	Introductory SOP task set	(Huang et al., 10 Feb 2026)
Robotic SO-101 Task Set	Real-world manipulation benchmark on SO-101 robot	(Yu et al., 7 Jun 2026)
SOTVerse SO-101 instantiation	101-sequence user-defined tracking task space	(Hu et al., 2022)
SOCIAL-CHEM-101	Social and moral norm corpus	(Forbes et al., 2020)

This multiplicity creates a recurrent interpretive hazard: identical surface naming can conceal incompatible definitions of task, state, supervision, and evaluation. The most reliable way to interpret the phrase is therefore by domain and source paper, not by the string “SO-101” alone.

2. SO-101 as an introductory SOP task set in FM SO.P

In the FM SO.P framework, SO-101 is defined as an introductory SOP task set that reflects the progression logic of SOP understanding: terminology precision $\rightarrow$ sequencing $\rightarrow$ conditional graph reasoning. The general task formulation is a mapping $f_\theta : (C, Q) \rightarrow A$, where $C$ is contextual information, $Q$ is a query or scenario, and $A$ is the space of valid procedural responses. The task set is explicitly aligned with three capability stages: Concept Disambiguation, Action Sequence Understanding, and Scenario-Aware Graph Reasoning (Huang et al., 10 Feb 2026).

Stage	Task type	Beginner composition
1	Concept Disambiguation	8–12 contrastive QA pairs per scenario
2	Action Sequence Understanding	Type A: 1 positive + 4 negatives per pattern; Type B: positive workflow QAs plus negatives violating DAG constraints
3	Scenario-Aware Graph Reasoning	One or two small DAG scenarios per domain with negative-only workflow QAs

The first stage targets terminology precision through contrastive QA pairs grounded in SOP constraints and system state. Its formal setup distinguishes semantically similar but contextually distinct terms, with outputs consisting of a correct answer $a^+$ and negatives $\{a^-_j\}_{j=1}^{m}$. The training signal is a contrastive loss,

$L_1(\theta) = - \mathbb{E}_{(q,c,a^+,\{a^-_j\})} \left[ \log \frac{e^{s^+/\tau}}{e^{s^+/\tau} + \sum_{j=1}^{m} e^{s^-_j/\tau}} \right],$

where $s^+$ and $s^-_j$ are the model's scores for the correct answer and the $j$-th negative, and $\tau$ is a temperature.
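As a minimal sketch of the Stage 1 loss for a single example, the function below evaluates the bracketed term for one positive score and a list of negative scores. The function name, the default temperature of 1.0, and the example scores are assumptions made for illustration; the expectation in the formula would be approximated by averaging this quantity over a batch.

```python
import math

def contrastive_loss(s_pos: float, s_negs: list[float], tau: float = 1.0) -> float:
    """Per-example loss: -log(e^{s+/tau} / (e^{s+/tau} + sum_j e^{s-_j/tau}))."""
    logits = [s_pos / tau] + [s / tau for s in s_negs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]

# One positive and four equally scored negatives: the model is effectively
# guessing among five candidates, so the loss equals log(5).
uniform = contrastive_loss(0.0, [0.0, 0.0, 0.0, 0.0])

# A clearly higher positive score (hypothetical values) drives the loss toward 0.
confident = contrastive_loss(5.0, [0.0, 0.0, 0.0, 0.0], tau=0.5)
```

Lowering `tau` sharpens the softmax, so the same score gap produces a smaller loss.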

The second stage teaches procedural correctness over workflows $W=\langle a_1,a_2,\dots,a_m\rangle$ using preconditions and effects, with negatives generated by reorder, omit, and insert operators. The third stage extends to conditional reasoning over directed action graphs and introduces violations such as cycles, missing preconditions, and invalid edges. At the curriculum level, FM SO.P trains on cumulative data, so each stage also sees the data of the stages before it, and optimizes a single unified objective across the three stages; the formal definitions of the graph, the cumulative dataset, and the objective are given in Huang et al. (10 Feb 2026).
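The Stage 2 idea of deriving negative workflows from a valid one can be sketched as follows. This is an illustrative reading, not the paper's generator: the `Action` class, the three operator functions, and the bank-style actions are hypothetical names, and validity is checked by simulating preconditions and effects over a set of facts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset = frozenset()
    effects: frozenset = frozenset()

def is_valid(workflow: list) -> bool:
    """A workflow is valid if each action's preconditions hold when it runs."""
    state = set()
    for action in workflow:
        if not action.preconditions <= state:
            return False
        state |= action.effects
    return True

def reorder(workflow: list, i: int, j: int) -> list:
    out = list(workflow)
    out[i], out[j] = out[j], out[i]
    return out

def omit(workflow: list, i: int) -> list:
    return workflow[:i] + workflow[i + 1:]

def insert(workflow: list, i: int, action: Action) -> list:
    return workflow[:i] + [action] + workflow[i:]

# Hypothetical three-step workflow.
auth = Action("authenticate", frozenset(), frozenset({"authenticated"}))
check = Action("check_balance", frozenset({"authenticated"}), frozenset({"balance_known"}))
transfer = Action("transfer", frozenset({"authenticated", "balance_known"}),
                  frozenset({"transferred"}))

positive = [auth, check, transfer]
candidates = [reorder(positive, 0, 1), omit(positive, 1), insert(positive, 0, transfer)]
# An operator does not always break a workflow, so keep only the invalid results.
negatives = [w for w in candidates if not is_valid(w)]
```

Filtering with `is_valid` matters because some edits, such as swapping two independent steps, leave the workflow correct and would otherwise be mislabeled as negatives.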

The paper specifies beginner-oriented templates for SO-101. Stage 1 uses 8–12 contrastive QA pairs per scenario, with emphasis on authentication or authorization logic, threshold-based constraints, and domain-specific term disambiguation. Stage 2 mixes Type A constraint understanding and Type B workflow sequencing. Type A uses 4–5 scenario patterns with 1 positive and 4 negative QAs; Type B uses 2–3 scenario patterns with positive workflow QAs and several negative workflow QAs that violate order, omit steps, or use wrong prerequisites. Stage 3 introduces one or two small DAG scenarios per domain and focuses on negative-only workflow QAs involving skipped authentication, missing OR-satisfaction, wrong order, or missing parameters.
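A small DAG scenario of the kind described for Stage 3 can be checked mechanically. The sketch below is an assumption-laden illustration: the scenario, the `all_of`/`any_of` encoding of AND and OR prerequisites, and the violation labels are invented here, and only the standard-library `graphlib` module is used for cycle detection.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical scenario: each step lists prerequisites that must all hold
# ("all_of") and prerequisites of which at least one must hold ("any_of").
SCENARIO = {
    "authenticate": {"all_of": [], "any_of": []},
    "show_passport": {"all_of": ["authenticate"], "any_of": []},
    "show_birth_certificate": {"all_of": ["authenticate"], "any_of": []},
    "renew_license": {
        "all_of": ["authenticate"],
        "any_of": ["show_passport", "show_birth_certificate"],
    },
}

def is_dag(scenario: dict) -> bool:
    """True if the prerequisite graph has no cycle."""
    graph = {step: set(spec["all_of"]) | set(spec["any_of"])
             for step, spec in scenario.items()}
    try:
        TopologicalSorter(graph).prepare()
        return True
    except CycleError:
        return False

def violations(scenario: dict, path: list) -> list:
    """List (step, kind, prerequisites) for every unmet requirement on a path."""
    done, found = set(), []
    for step in path:
        spec = scenario[step]
        missing = [p for p in spec["all_of"] if p not in done]
        if missing:
            found.append((step, "missing prerequisite", missing))
        if spec["any_of"] and not any(p in done for p in spec["any_of"]):
            found.append((step, "missing OR-satisfaction", list(spec["any_of"])))
        done.add(step)
    return found

valid_path = ["authenticate", "show_passport", "renew_license"]
skipped_auth = ["show_passport", "renew_license"]
no_document = ["authenticate", "renew_license"]
```

Here `skipped_auth` corresponds to a skipped-authentication negative and `no_document` to a missing OR-satisfaction negative; wrong-order cases surface as missing prerequisites at the step that ran too early.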

Evaluation is domain-adaptive rather than fixed. FM SO.P uses a three-agent system: an Adaptive Rubric Generator that produces rubrics with normalized weights, a Stratified Test Set Builder that constructs a test set stratified across complexity levels and question types, and a Rubric Scorer that aggregates per-dimension scores into an overall score. Pass rate is formalized in the same framework; the exact aggregation and pass-rate formulas are given in Huang et al. (10 Feb 2026).
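One plausible way to combine rubric weights, per-dimension scores, and a pass rate is sketched below. The exact formulas are not reproduced in this text, so everything here is an assumption: a weighted sum with weights normalized to one, scores in the range 0 to 1, a pass threshold of 0.6, and the rubric dimension names.

```python
def normalize(weights: dict[str, float]) -> dict[str, float]:
    """Rescale rubric weights so they sum to 1."""
    total = sum(weights.values())
    return {dim: w / total for dim, w in weights.items()}

def rubric_score(dim_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (assumed aggregation rule)."""
    normalized = normalize(weights)
    return sum(normalized[dim] * dim_scores[dim] for dim in normalized)

def pass_rate(scores: list[float], threshold: float) -> float:
    """Fraction of test items whose aggregate score reaches the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)

# Hypothetical rubric with three dimensions and one scored response.
weights = {"terminology": 2.0, "ordering": 1.0, "conditions": 1.0}
response = {"terminology": 1.0, "ordering": 0.5, "conditions": 0.0}
score = rubric_score(response, weights)              # 0.5*1.0 + 0.25*0.5 + 0.25*0.0

# Hypothetical aggregate scores for a four-item test set.
rate = pass_rate([score, 0.4, 0.9, 0.2], threshold=0.6)
```

Normalizing inside `rubric_score` keeps the aggregate on the same 0 to 1 scale regardless of how the raw weights were specified.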

On SOPBench across Bank, DMV, Healthcare, Market, University, Library, and Hotel, FM SO.P reports a 34.33% overall pass rate for the Qwen-2.5-7B-Instruct variant, 39.67% for Qwen-2.5-14B-Instruct, and 48.30% for Qwen-2.5-32B-Instruct. The 7B variant roughly matches Qwen-2.5-72B-Instruct (34.44%) with approximately 10× fewer parameters, and stage-wise ablations rise from 11.30% for the 7B base model to 17.11%, 24.50%, and 34.33% after Stages 1, 2, and 3. Training is reported with AdamW, batch size 256, 12 epochs, mixed-precision bfloat16, and 8× NVIDIA H100-80GB GPUs; the learning rate and contrastive temperature are also specified in the paper.
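The stage-wise ablation figures can be turned into per-stage gains with a few lines of arithmetic. The sketch below uses only the pass rates quoted above; the dictionary keys are illustrative labels, and the parameter ratio assumes the nominal sizes in the model names (7 and 72 billion).

```python
# Overall SOPBench pass rates (%) for the 7B model, as quoted in the text.
STAGE_PASS_RATES = {"base": 11.30, "stage_1": 17.11, "stage_2": 24.50, "stage_3": 34.33}

def stage_gains(rates: dict[str, float]) -> dict[str, float]:
    """Percentage-point gain contributed by each stage over the previous one."""
    names = list(rates)
    values = list(rates.values())
    return {name: round(after - before, 2)
            for name, before, after in zip(names[1:], values, values[1:])}

gains = stage_gains(STAGE_PASS_RATES)
total_gain = round(STAGE_PASS_RATES["stage_3"] - STAGE_PASS_RATES["base"], 2)

# Assumption: parameter counts taken at face value from the model names.
param_ratio = 72 / 7
```

On these numbers the gains are 5.81, 7.39, and 9.83 points, so each stage adds more than the one before it, and the 72B-to-7B ratio of about 10.3 is consistent with the "approximately 10×" description.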

3. SO-101 as a low-cost real-world robotic benchmark

In embodied AI, the SO-101 Task Set is a standardized real-world benchmark defined on the low-cost, open-source SO-101 tabletop manipulator. Its emphasis is execution-centric: the benchmark is designed to expose robustness gaps that emerge on affordable hardware under embodiment uncertainty, including limited actuator precision, reduced control stability, lower payload capacity, joint backlash, control latency, trajectory jitter, limited positional repeatability, and imperfect camera–arm calibration (Yu et al., 7 Jun 2026).

Task	Goal or instruction	Capability stressed
Pen Transfer	Move a pen from a left-side start region to a right-side target region	Control fidelity
Selective Color Sorting	“Put the pink block in the plate and remove the others”	Grounding robustness
Multi-Object Packing	“Place all snacks into the cardboard box”	Temporal consistency and long-horizon stability
Precision Pen Placement	Pick up a pen and insert it into a pen holder	Embodiment-noise robustness

The benchmark standardizes task execution, data collection, and adaptation directly on the physical platform. It uses 100 human teleoperated trajectories per task, collected with end-effector control and synchronized RGB, instruction, and low-level commands, for a total of 400 demonstrations across the four tasks. Object poses and layouts are randomized within predefined workspace constraints during both data collection and evaluation. Each model–task pair is evaluated in 20 independent, real-hardware episodes, yielding 320 episodes total for four models across four tasks.

Four policies are evaluated: ACT, SmolVLA, Wall-X, and a fourth policy whose name is not legible in this text. Adaptation is constrained to interface-level alignment between model observations or actions and SO-101; no changes are made to architectures, losses, or optimizers. The benchmark augments binary success with a structured failure taxonomy: Grasp Instability, Repetition Loop, State Mismatch, and Precision Misalignment. For aggregate analysis, State Mismatch is treated as semantic-level failure, while execution-level failure is the average of Grasp Instability and Repetition Loop. Precision Misalignment is excluded from that aggregate because it is task-specific.
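The aggregation rule for the failure taxonomy is simple enough to state in code. In this sketch the dictionary keys and the example rates are hypothetical; only the rule itself (semantic equals State Mismatch, execution equals the mean of Grasp Instability and Repetition Loop, Precision Misalignment left out) comes from the description above.

```python
def aggregate_failures(rates: dict[str, float]) -> dict[str, float]:
    """Collapse per-code failure rates into semantic and execution levels."""
    return {
        # Semantic-level failure is State Mismatch on its own.
        "semantic": rates["state_mismatch"],
        # Execution-level failure averages the two execution codes.
        "execution": (rates["grasp_instability"] + rates["repetition_loop"]) / 2,
        # "precision_misalignment" is intentionally ignored: it is task-specific.
    }

# Hypothetical per-code failure rates for one policy.
example = {
    "grasp_instability": 0.5,
    "repetition_loop": 0.25,
    "state_mismatch": 0.125,
    "precision_misalignment": 0.75,
}
levels = aggregate_failures(example)
```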

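The metric definitions are deliberately minimal and event-based. Success rate is the fraction of evaluation episodes that end in task success,

$\mathrm{SR} = \frac{N_{\text{success}}}{N_{\text{episodes}}},$

and recovery rate is the fraction of failure events from which the policy recovers,

$\mathrm{RR} = \frac{N_{\text{recovered}}}{N_{\text{failure}}}.$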

A successful recovery is credited when the policy restores a valid task state and resumes progress without external help.
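Both metrics can be computed from a simple per-episode log. The sketch below assumes that success rate is counted over episodes and that recovery rate is counted over failure events; the `Episode` fields and the example log are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Episode:
    success: bool
    failure_events: int = 0  # failures observed during the episode
    recoveries: int = 0      # failures the policy recovered from without help

def success_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes that ended in task success."""
    return sum(e.success for e in episodes) / len(episodes)

def recovery_rate(episodes: list[Episode]) -> Optional[float]:
    """Fraction of failure events followed by an unaided recovery."""
    failures = sum(e.failure_events for e in episodes)
    if failures == 0:
        return None  # undefined when nothing failed
    return sum(e.recoveries for e in episodes) / failures

# Hypothetical log: one clean success, one success after a recovery, two failures.
log = [
    Episode(success=True),
    Episode(success=True, failure_events=1, recoveries=1),
    Episode(success=False, failure_events=2),
    Episode(success=False, failure_events=1),
]
sr = success_rate(log)
rr = recovery_rate(log)
```

Counting recoveries per failure event rather than per episode lets an episode that fails twice and recovers once contribute to both the numerator and the denominator.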

Reported success rates are sharply task-dependent. On Pen Transfer, ACT reaches 75%, SmolVLA 70%, Wall-X 95%, and the fourth policy 95%. On Selective Color Sorting, the same models reach 0%, 5%, 0%, and 10%. On Multi-Object Packing, they achieve 10%, 10%, 30%, and 55%. On Precision Pen Placement, they achieve 50%, 45%, 80%, and 65%. Average success is 33.75% for ACT, 32.5% for SmolVLA, 51.25% for Wall-X, and 56.25% for the fourth policy. Recovery rates further separate the policies: 30.77% for the fourth policy, 20.51% for Wall-X, 6.45% for ACT, and 3.23% for SmolVLA. The dominant empirical pattern is that execution instability remains high across all models, while semantic failure varies more strongly by architecture.
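The reported averages follow directly from the per-task success rates, which the short check below reproduces. The numbers are those quoted above; the key `fourth_policy` is a placeholder for the policy whose name is not legible in this text, and the 20-episodes-per-pair figure is used to convert percentages into episode counts.

```python
TASKS = ["Pen Transfer", "Selective Color Sorting",
         "Multi-Object Packing", "Precision Pen Placement"]

# Success rates (%) per task, in the order of TASKS.
SUCCESS = {
    "ACT": [75, 0, 10, 50],
    "SmolVLA": [70, 5, 10, 45],
    "Wall-X": [95, 0, 30, 80],
    "fourth_policy": [95, 10, 55, 65],  # placeholder key, name not recoverable
}

EPISODES_PER_PAIR = 20

# Unweighted mean over the four tasks.
averages = {model: sum(rates) / len(rates) for model, rates in SUCCESS.items()}

# Number of successful episodes behind each percentage.
successes = {model: [rate * EPISODES_PER_PAIR // 100 for rate in rates]
             for model, rates in SUCCESS.items()}

# Hardest task by mean success across models.
task_means = {task: sum(SUCCESS[m][i] for m in SUCCESS) / len(SUCCESS)
              for i, task in enumerate(TASKS)}
hardest = min(task_means, key=task_means.get)
```

With 20 episodes per model and task, each 5-point step in a success rate corresponds to a single episode, which is worth keeping in mind when comparing close results.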

4. SO-101 as a SOTVerse user-defined tracking task space

Within the SOTVerse literature, “SO-101 Task Set” is not an official benchmark name, and the term does not appear anywhere in the paper. The paper nevertheless provides a precise construction by which an SO-101 Task Set can be defined: a user-defined task space comprising 101 sub-sequences selected from SOTVerse according to the 3E paradigm, challenging-factor filters, and space construction rules (Hu et al., 2022).

SOTVerse gives formal definitions of a subtask and of the full task space assembled from subtasks (Hu et al., 2022). The environment is built from eight representative benchmarks (OTB2015, VOT2016, VOT2018, VOT2019, GOT-10k, VOTLT2019, LaSOT, and VideoCube), organized into a normal space totaling 12.56 million frames. SOTVerse automatically labels challenging factors per frame and allows user-defined task generation through explicit construction rules.

For an SO-101 instantiation, the environment consists of 101 sequences constructed from those benchmarks using per-frame challenging labels and inclusion rules. A sequence is regarded as challenging for an attribute if more than half of its frames are challenging for that attribute. Construction also requires a minimum sub-sequence length of 100 frames, start-point screening to exclude tiny or blurred targets and frames near target absence, and deduplication that discards overlapping sub-sequences according to an overlap criterion specified in the paper. A suggested design is 10 sequences for each of the ten challenging factors plus one normal sequence from the short-term, long-term, or GIT normal spaces, which gives 10 × 10 + 1 = 101 sequences.
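The inclusion rules and the 101-sequence count can be sketched as follows. The function names are hypothetical, the per-frame labels are assumed to arrive as a list of booleans, and start-point screening and deduplication are left out because their exact criteria are not given in this text.

```python
MIN_LENGTH = 100  # minimum sub-sequence length in frames

# The ten challenging factors named in the article (static, then dynamic).
FACTORS = [
    "ratio", "relative scale", "illumination", "blur bbox",
    "delta ratio", "delta relative scale", "delta illumination",
    "delta blur bbox", "fast motion", "low corrcoef",
]

def is_challenging(frame_flags: list[bool]) -> bool:
    """True if more than half of the frames carry the challenging label."""
    return 2 * sum(frame_flags) > len(frame_flags)

def is_eligible(frame_flags: list[bool]) -> bool:
    """Apply the length rule and the majority rule to one sub-sequence."""
    return len(frame_flags) >= MIN_LENGTH and is_challenging(frame_flags)

def task_set_size(per_factor: int = 10, normal_sequences: int = 1) -> int:
    """Sequences per factor times number of factors, plus normal sequences."""
    return per_factor * len(FACTORS) + normal_sequences

size = task_set_size()
```

Note that exactly half challenging frames does not qualify, since the rule asks for more than half.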

The challenging factors are divided into static and dynamic attributes. Static attributes include ratio, relative scale, illumination, and blur bbox; dynamic attributes include delta ratio, delta relative scale, delta illumination, delta blur bbox, fast motion, and low corrcoef. The paper provides explicit thresholds for these factors, fast motion and low corrcoef among them. Evaluation uses OPE and optionally R-OPE; in R-OPE, 10 consecutive failures trigger re-initialization at the next start point. Reported indicators include precision, normalized precision, success, challenging plot, attribute plot, and robust plot.
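The R-OPE restart rule can be illustrated with a small simulation. This is a sketch under stated assumptions: `frame_failed` is a hypothetical callback that reports whether tracking failed on a frame for a tracker initialized at a given start point, and what counts as a per-frame failure is left to that callback.

```python
import bisect
from typing import Callable

def run_r_ope(num_frames: int, start_points: list[int],
              frame_failed: Callable[[int, int], bool],
              max_consecutive: int = 10) -> list[int]:
    """Return the start points at which the tracker was re-initialized.

    start_points must be sorted; evaluation begins at the first one.
    """
    restarts: list[int] = []
    init = frame = start_points[0]
    streak = 0
    while frame < num_frames:
        streak = streak + 1 if frame_failed(frame, init) else 0
        if streak == max_consecutive:
            # Jump to the first start point strictly after the current frame.
            nxt = bisect.bisect_right(start_points, frame)
            if nxt == len(start_points):
                break  # no later start point is available
            init = frame = start_points[nxt]
            restarts.append(init)
            streak = 0
        else:
            frame += 1
    return restarts

# Hypothetical tracker that drifts from frame 50 onward when started at frame 0.
drifting = run_r_ope(300, [0, 100, 200], lambda frame, init: init == 0 and frame >= 50)

# Hypothetical tracker that fails on every frame.
hopeless = run_r_ope(300, [0, 100, 200], lambda frame, init: True)
```

The number of restarts is itself informative: a robust tracker yields an empty list, while a tracker that fails everywhere exhausts the available start points.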

The significance of this construction is methodological rather than nominal. SOTVerse treats the task space as compositional and user-defined, so the SO-101 label functions here as a specific selection protocol over a much larger metaverse of single-object tracking environments, not as a canonical benchmark released under that exact name.

5. SO-101 as Social Chemistry 101 (SOCIAL-CHEM-101)

A separate line of work uses “SO-101” to refer to Social Chemistry 101, formally SOCIAL-CHEM-101. This resource is part of the broader Social Chemistry formalism for reasoning about social and moral norms in natural-language situations. It is not a robotic or SOP benchmark, and its canonical object is a corpus consisting of 104k real-life situations, 292k rules-of-thumb, 365k structured “breakdown” annotations, and over 4.5M categorical and free-text annotations (Forbes et al., 2020).

The resource is built around four variables: the situation, the rule-of-thumb (RoT), the action transcription, and the attribute sets. It defines 12 judgment dimensions spanning RoT-level attributes and action-level attributes. These include anticipated agreement, RoT category, moral foundations, RoT targeting, action transcription, agency, social judgment, legality, cultural pressure, action candidate, and taking action. Label spaces are explicit: for example, social judgment takes values {Very bad, Bad, Expected/OK, Good, Very good}, legality uses {Illegal, Depends/Tolerated, Legal}, and cultural pressure uses {Strong pressure against, Pressure against, Discretionary, Pressure for, Strong pressure for}.
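Because the label spaces are explicit and ordered, they map naturally onto a small lookup structure. The sketch below encodes the three label spaces listed above; the dictionary keys and the `encode` helper are illustrative names, and treating list position as an ordinal code is an assumption rather than the corpus's own encoding.

```python
# Label spaces as listed in the text, in their stated order.
LABEL_SPACES = {
    "social_judgment": ["Very bad", "Bad", "Expected/OK", "Good", "Very good"],
    "legality": ["Illegal", "Depends/Tolerated", "Legal"],
    "cultural_pressure": [
        "Strong pressure against", "Pressure against", "Discretionary",
        "Pressure for", "Strong pressure for",
    ],
}

def encode(attribute: str, label: str) -> int:
    """Map a label to its position in the attribute's label space.

    Raises KeyError for an unknown attribute and ValueError for an unknown label.
    """
    return LABEL_SPACES[attribute].index(label)

def decode(attribute: str, code: int) -> str:
    """Inverse of encode."""
    return LABEL_SPACES[attribute][code]

neutral_judgment = encode("social_judgment", "Expected/OK")
legal_code = encode("legality", "Legal")
```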

The task family is correspondingly language-centric. It includes attribute classification, moral foundation classification, legality and cultural pressure prediction, controlled conditional generation, model-choice generation, retrieval or matching, and free-text explanation generation. Representative objectives are controlled generation, model-choice generation, and attribute-only prediction, each of which the paper formalizes as a conditional language-modeling objective (Forbes et al., 2020).

The paper’s Neural Norm Transformer instantiates these objectives with decoder-only and encoder–decoder LMs, including GPT, GPT-2, BART-Large, and T5-Large, using an 80/10/10 train/dev/test split by situations.
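Splitting by situation means that every annotation attached to a situation lands in the same partition, which avoids leakage between train and test. The sketch below is one way to do that, assuming records are dictionaries with a `situation` field; the function name, the seed, and the toy records are hypothetical, and the paper's own splitting code may differ.

```python
import random

def split_by_situation(records: list[dict], seed: int = 0) -> dict[str, list[dict]]:
    """Assign whole situations to train/dev/test in roughly 80/10/10 proportions."""
    situations = sorted({r["situation"] for r in records})
    random.Random(seed).shuffle(situations)
    n = len(situations)
    cut_train, cut_dev = n * 8 // 10, n * 9 // 10
    assignment = {}
    for i, situation in enumerate(situations):
        if i < cut_train:
            assignment[situation] = "train"
        elif i < cut_dev:
            assignment[situation] = "dev"
        else:
            assignment[situation] = "test"
    splits: dict[str, list[dict]] = {"train": [], "dev": [], "test": []}
    for record in records:
        splits[assignment[record["situation"]]].append(record)
    return splits

# Toy data: 10 situations with 3 rules-of-thumb each.
records = [{"situation": f"situation {i}", "rot": f"rule {j}"}
           for i in range(10) for j in range(3)]
splits = split_by_situation(records)
```

The proportions hold over situations, not records, so partitions can deviate from 80/10/10 at the record level when situations carry different numbers of annotations.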

This usage matters because it explains why “SO-101” may be encountered outside SOPs, tracking, or robotics. In this setting, however, the object is a structured norm-reasoning corpus rather than a task set centered on physical execution, graph constraints, or benchmark episodes.

6. Comparative evaluation regimes and recurring misconceptions

The main senses of SO-101 differ most sharply in what they count as an environment, an executor, and a successful outcome. FM SO.P evaluates LLMs against domain-adaptive SOP rubrics and reports PassRate, rubric-weighted scores, and Borda counts. The robotic benchmark evaluates VLA and imitation-learning policies on physical tasks and reports success, recovery, and failure-code distributions. SOTVerse instantiates an SO-101 space over tracking sequences and uses precision, normalized precision, success, challenging plot, attribute plot, and robust plot. SOCIAL-CHEM-101 evaluates social-norm models through classification losses, human relevance judgments, BLEU-4, and attribute adherence such as micro-F1 or Attr. pF1 (Huang et al., 10 Feb 2026, Yu et al., 7 Jun 2026, Hu et al., 2022, Forbes et al., 2020).

Context	Executor	Primary evaluation
FM SO.P SOP SO-101	LLM	Rubric-weighted score, PassRate, Borda count
Robotic SO-101	VLA or imitation-learning policy	Success rate, recovery rate, failure taxonomy
SOTVerse SO-101 instantiation	Tracker	Precision, normalized precision, success, robust plot
SOCIAL-CHEM-101	LLM	Classification metrics, BLEU-4, human relevance, micro-F1

Several misconceptions follow from the name collision. First, the robotic SO-101 benchmark and the FM SO.P SO-101 task set are not related by modality or supervision: one is a real-hardware benchmark with teleoperated demonstrations, while the other is a staged SOP understanding curriculum with contrastive QA pairs, workflow negatives, and DAG reasoning. Second, the SOTVerse usage is a constructible instantiation, not an official benchmark title in the paper. Third, SOCIAL-CHEM-101 is a corpus of social and moral norms, not a physical or procedural benchmark. A plausible implication is that “SO-101 Task Set” functions less as a standardized benchmark identifier than as a local naming convention whose meaning must be recovered from the surrounding research program.

Under that interpretation, the most stable encyclopedia-level characterization is not a single definition but a typology: SO-101 names beginner or diagnostic task configurations in multiple subfields, yet each version is anchored to a different formalism—procedural reasoning over SOPs, execution robustness on low-cost robots, challenge-aware subsequence selection in tracking, or norm-conditioned text modeling.