Gradient-Bridged Co-Training

Updated 4 July 2026

The central idea is a framework in which gradient-derived signals, rather than traditional pseudo-label exchanges, drive inter-task cooperation.
Gradient-Bridged Co-Training encompasses methods that use gradient flow as a coordination medium, including cross-task loss reduction and adversarial gradient alignment.
Empirical results from CoGrad, CoopNet, RCT, and GREAT demonstrate improved performance and stability in multi-task and cross-model learning scenarios.

Gradient-Bridged Co-Training is an umbrella description for learning procedures in which models or tasks are coupled through gradient-mediated signals rather than only through shared parameters, pseudo-label exchange, or output-level consistency. In the papers most directly associated with this perspective, the “bridge” is implemented in several distinct ways: by explicitly maximizing cross-task loss reduction induced by another task’s gradient in multi-task learning, by routing pixel-wise photometric gradients according to an online disagreement distribution, by converting the outputs of a non-differentiable model into reinforcement-learning rewards for a differentiable model, or by adversarially reshaping gradients so that their origin becomes statistically indistinguishable (Yang et al., 2023, Hariat et al., 8 May 2026, Tian et al., 24 Mar 2026, Sinha et al., 2018). The common thread is that gradient flow itself becomes the object of coordination.

1. Conceptual scope and relation to classical co-training

In this usage, Gradient-Bridged Co-Training is not a single canonical algorithm. It is a family resemblance across methods that treat gradients, gradient-induced loss changes, or gradient exposure as the medium through which coupled learners influence one another. Some methods operate within ordinary shared-parameter multi-task learning, some within self-supervised monocular geometry, some across differentiable and non-differentiable model families, and some through auxiliary adversarial critics on gradient tensors (Yang et al., 2023, Hariat et al., 8 May 2026, Tian et al., 24 Mar 2026, Sinha et al., 2018).

This broad usage differs from classical co-training in the Blum-Mitchell sense. The relevant papers explicitly distinguish their mechanisms from view-based pseudo-label exchange. CoGrad is described instead as “co-optimization through transfer-aware gradient coupling,” not classical semi-supervised co-training (Yang et al., 2023). CoopNet similarly does not exchange pseudo-labels or features across branches; its interaction occurs through a disagreement statistic that controls where each branch receives gradients (Hariat et al., 8 May 2026). RCT is reciprocal in the sense of alternating bidirectional adaptation, but supervision is exchanged as embeddings and reward signals rather than as labels (Tian et al., 24 Mar 2026). GREAT is closer to adversarial alignment or gradient-space distillation than to co-training, since the auxiliary model is a critic over gradient tensors rather than a peer predictor (Sinha et al., 2018).

A useful organizing distinction is between four bridge types present in the literature:

Method	Bridge variable	Coupling mechanism
CoGrad	$\Delta^k \bm{L}_{i \to j}$ and $\bm H_j \bm g_i$	Transfer-aware gradient modification
CoopNet	$A(p)$ and central/tail quantiles	Pixel-wise gradient routing and weighting
RCT	$Q_\phi(x,a)$	Reward-mediated policy-gradient update
GREAT	Gradient tensors	Adversarial gradient indistinguishability

This suggests that the unifying object is not necessarily a raw gradient vector. In some cases the bridge is an explicit second-order derivative term, in others a distribution over disagreement values that determines where gradients are permitted, and in others a scalar evaluative signal that is converted into gradients only inside one component of the system.

2. Core mathematical patterns

A recurrent pattern is that one learner’s update is evaluated by the effect it has on another learner’s objective. CoGrad formalizes this most directly. For tasks $i$ and $j$ , with shared parameters $\theta$ , per-task loss $\bm L_t(\theta)$ , and gradient $\bm g_t(\theta)=\nabla_\theta \bm L_t(\theta)$ , it defines a virtual update

$\theta^{k+\tau_i}=\theta^k-\gamma_i \bm{g}_i(\theta^k)$

and then defines transfer from task $i$ to task $j$ as

$\Delta^k \bm{L}_{i \to j} = \bm L_j(\theta^k) - \bm L_j(\theta^{k+\tau_i})$

A first-order Taylor expansion yields

$\Delta^k \bm{L}_{i \to j} \approx \gamma_i\, \bm g_i(\theta^k)^\top \bm g_j(\theta^k)$

so the gradient inner product is interpreted as a first-order surrogate for actual cross-task loss reduction rather than as an end in itself (Yang et al., 2023).
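To make the surrogate concrete, the following minimal sketch compares the actual drop in task $j$'s loss after a virtual step on task $i$ with the first-order estimate $\gamma_i \bm g_i^\top \bm g_j$. The two quadratic toy losses, the step size, and all function names are assumptions introduced for illustration; none of them come from Yang et al.

```python
# Toy check: first-order inner product vs. actual cross-task loss reduction.
# The quadratic losses and the step size are illustrative assumptions.

def loss_i(theta):
    # Hypothetical task-i loss: 0.5 * ||theta - (1, 0)||^2
    return 0.5 * ((theta[0] - 1.0) ** 2 + theta[1] ** 2)

def grad_i(theta):
    return [theta[0] - 1.0, theta[1]]

def loss_j(theta):
    # Hypothetical task-j loss: 0.5 * ||theta - (1, 1)||^2
    return 0.5 * ((theta[0] - 1.0) ** 2 + (theta[1] - 1.0) ** 2)

def grad_j(theta):
    return [theta[0] - 1.0, theta[1] - 1.0]

def virtual_step(theta, grad, gamma):
    # theta^{k+tau_i} = theta^k - gamma_i * g_i(theta^k)
    return [t - gamma * g for t, g in zip(theta, grad)]

def actual_transfer(theta, gamma):
    # Drop in task j's loss caused by a virtual step on task i.
    theta_next = virtual_step(theta, grad_i(theta), gamma)
    return loss_j(theta) - loss_j(theta_next)

def first_order_transfer(theta, gamma):
    # gamma_i * <g_i, g_j>, the first-order Taylor surrogate.
    gi, gj = grad_i(theta), grad_j(theta)
    return gamma * sum(a * b for a, b in zip(gi, gj))

theta0 = [0.0, 0.0]
gamma = 0.01
exact = actual_transfer(theta0, gamma)
approx = first_order_transfer(theta0, gamma)
print(exact, approx)
```

For small step sizes the two quantities agree up to a second-order term in $\gamma_i$, which is why the inner product can stand in for the measured loss reduction only locally.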

A second pattern is that the bridge may be constructed from output disagreement and then used to allocate gradients. In CoopNet, the depth-plus-pose branch and the optical-flow branch produce two reconstructions of a target frame, and their per-pixel reconstruction-error difference is

$\bm H_j \bm g_i$ 4

Central quantiles of the empirical distribution of this difference, $A(p)$, define a central region, and that region determines where the rigid branch receives gradients, while the flow branch is trained on all pixels with larger weight on the tails (Hariat et al., 8 May 2026). Here the bridge is not a direct cross-gradient term; it is a shared statistic derived from the two branches’ competing explanations of the same data.
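The routing idea can be sketched on a flat list of per-pixel disagreement values. This is only an illustration under stated assumptions: the quantile levels, the tail weight, the nearest-rank quantile rule, and the helper names are hypothetical and are not the values or the streaming estimator used by Hariat et al.

```python
# Quantile-based gradient routing on per-pixel disagreement values.
# Quantile levels and the tail weight are hypothetical, not the paper's values.

def empirical_quantile(values, q):
    # Nearest-rank style quantile on a sorted copy (no interpolation).
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, int(q * (len(ordered) - 1))))
    return ordered[index]

def routing_masks(disagreement, q_low=0.25, q_high=0.75, tail_weight=2.0):
    low = empirical_quantile(disagreement, q_low)
    high = empirical_quantile(disagreement, q_high)
    # Rigid (depth-plus-pose) branch: gradients only inside the central interval.
    rigid_mask = [1.0 if low <= a <= high else 0.0 for a in disagreement]
    # Flow branch: all pixels, with larger weight outside the central interval.
    flow_weight = [1.0 if m == 1.0 else tail_weight for m in rigid_mask]
    return rigid_mask, flow_weight

def weighted_mean(per_pixel_loss, weights):
    # Pixels with zero weight contribute nothing, hence receive no gradient.
    norm = sum(weights)
    total = sum(w * l for w, l in zip(weights, per_pixel_loss))
    return total / norm if norm > 0 else 0.0

disagreement = [-0.9, -0.1, 0.0, 0.05, 0.1, 0.8, -0.05, 0.02]
rigid_mask, flow_weight = routing_masks(disagreement)
print(rigid_mask, flow_weight)
```

In an automatic-differentiation framework the same masks would multiply the per-pixel photometric loss before reduction, so that masked pixels contribute zero gradient to the rigid branch.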

A third pattern is surrogate bridging across a non-differentiable boundary. In RCT, an LLM defines a stochastic policy $\bm H_j \bm g_i$ 7 over binary actions, while a Random Forest supplies an action-conditioned evaluative score

$\bm H_j \bm g_i$ 8

This enters the hybrid reward

$\bm H_j \bm g_i$ 9

with $A(p)$ 0, and PPO converts that scalar signal into parameter updates for the LLM (Tian et al., 24 Mar 2026). The RF never participates in backpropagation, but it still shapes the LLM’s gradient trajectory.
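A minimal sketch of reward-mediated bridging follows. It deliberately swaps in plain REINFORCE for PPO, a two-weight logistic policy for the LLM, and a fixed threshold rule for the Random Forest; the reward is the evaluator's score alone rather than the paper's hybrid reward. All names and constants are assumptions made for illustration.

```python
import math
import random

def policy_prob(w, x):
    # Probability of action 1 under a logistic policy (the differentiable model).
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def black_box_score(x, action):
    # Stand-in for a non-differentiable evaluator: a fixed threshold rule
    # that returns a confidence-like score for the sampled action.
    p_positive = 0.9 if x[0] > 0.0 else 0.1
    return p_positive if action == 1 else 1.0 - p_positive

def reinforce_step(w, x, lr, rng):
    p = policy_prob(w, x)
    action = 1 if rng.random() < p else 0
    reward = black_box_score(x, action)  # scalar signal; nothing backpropagates through it
    # For a Bernoulli-logistic policy, d/dw log pi(action | x) = (action - p) * x.
    return [wi + lr * reward * (action - p) * xi for wi, xi in zip(w, x)]

rng = random.Random(0)
w = [0.0, 0.0]
data = [[1.0, 1.0], [-1.0, 1.0]]
for _ in range(2000):
    w = reinforce_step(w, rng.choice(data), lr=0.1, rng=rng)

print(policy_prob(w, data[0]), policy_prob(w, data[1]))
```

The evaluator is only ever called as a function, yet the policy drifts toward the actions it scores highly, which is the sense in which a non-differentiable model can shape a differentiable model's updates.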

A fourth pattern is adversarial coupling in gradient space. GREAT assumes that gradient tensors contain task-, class-, or model-specific statistical information, and uses an auxiliary classifier to predict the origin of a gradient tensor while the main model is trained to make that prediction difficult (Sinha et al., 2018). The bridge is therefore an auxiliary min-max game over gradients themselves.

3. Transfer-aware coordination in multi-task learning: CoGrad

CoGrad addresses multi-task learning with tasks $A(p)$ 1, shared parameters $A(p)$ 2, task-specific parameters $A(p)$ 3, and shared updates

$A(p)$ 4

The motivating application is recommendation and advertising, with tasks such as CTR, CVR, and page-view prediction sharing a backbone and having separate heads (Yang et al., 2023).

The paper’s main criticism of prior gradient methods is conceptual. PCGrad and GradVac modify gradient directions; GradNorm and MetaBalance homogenize magnitudes; MGDA and CAGrad treat the problem as a multi-objective tradeoff on direction or magnitude. CoGrad argues that such approaches focus on alignment itself rather than on the actual transfer effect. Because shared capacity contains both general/shared knowledge and task-specific knowledge, too much alignment can crowd out task-specific knowledge, whereas too much specialization can reduce cross-task generalization (Yang et al., 2023).

From the transfer quantity $\Delta^k \bm{L}_{i \to j}$, CoGrad derives

$A(p)$ 6

where $\bm H_j$ is the Hessian of task $j$’s loss with respect to shared parameters. The modified gradient for task $i$ becomes

$Q_\phi(x,a)$ 0

and in the general multi-task case

$Q_\phi(x,a)$ 1

The shared update then uses the weighted aggregation of these modified gradients, while task-specific parameters are updated normally (Yang et al., 2023).

The practical version replaces explicit Hessian computation with

$Q_\phi(x,a)$ 2

with $Q_\phi(x,a)$ 3, giving

$Q_\phi(x,a)$ 4

The paper states that the exact Hessian is too expensive in storage and computation, whereas the approximation makes CoGrad computationally efficient and simple to implement, adding only negligible computational overhead (Yang et al., 2023).

Empirically, CoGrad is evaluated on Ali-CCP and Ecomm, and in a 15-day online A/B test on a real advertising system. On Ecomm with Shared Bottom, CoGrad reports CTR GAUC $Q_\phi(x,a)$ 6 versus STL $Q_\phi(x,a)$ 7 and CVR GAUC $Q_\phi(x,a)$ 8; on Ali-CCP with Shared Bottom, CTR AUC is $Q_\phi(x,a)$ 9 and CVR AUC is $i$ 0 (Yang et al., 2023). In the online test, it yields CTR $i$ 1, CVR $i$ 2, CPC $i$ 3, and CPA $i$ 4 (Yang et al., 2023). The paper also reports that PCGrad raises gradient similarity the most, whereas CoGrad achieves better overall performance with only a moderate increase in similarity, supporting the claim that maximizing alignment alone is not the right objective (Yang et al., 2023).

In the context of Gradient-Bridged Co-Training, the significance of CoGrad is that the bridge is explicitly defined as loss reduction from one task to another. The tasks are not merely prevented from conflicting; they are coupled through a differentiable transference objective.

4. Distribution-aware gradient routing in self-supervised geometry: CoopNet

CoopNet studies self-supervised monocular video learning with three networks: a depth network $i$ 5, a pose network $i$ 6, and an optical flow network $i$ 7 (Hariat et al., 8 May 2026). The standard photometric loss is

$i$ 8

and reconstructions of a target frame $i$ 9 are obtained either through rigid reprojection using depth and pose or through a dense flow field (Hariat et al., 8 May 2026).

The paper’s central observation is that naive joint self-supervision is biased because the optical flow branch predicts an unconstrained 2D displacement field directly, whereas the depth-plus-pose branch must satisfy projective geometry and camera-motion consistency. The flow branch is therefore described as intrinsically better at minimizing photometric error. A sign-based split, such as assigning depth-plus-pose only the pixels where its reconstruction error is the lower of the two,

$j$ 2

creates competition for supervision rather than cooperation, because the stronger flow network tends to win more pixels and can starve the rigid branch (Hariat et al., 8 May 2026).

CoopNet replaces winner-take-all routing with a quantile-based distribution model. Let $j$ 3 denote the $j$ 4-quantile of the density of $j$ 5, and define

$j$ 6

Then the depth-plus-pose branch receives gradients only from pixels in the central interval $j$ 7, where the two branches approximately agree, while the flow branch is trained on all pixels with larger weight on the tails (Hariat et al., 8 May 2026). The split losses are

$j$ 8

$j$ 9

and

$\theta$ 0

The quantiles are computed on the fly every epoch using the $\theta$ 1 streaming quantile algorithm, and the neighborhood used in the current epoch is determined from the previous epoch’s quantile values (Hariat et al., 8 May 2026).

A further regularization prior is defined from normalized vector-flow mismatch:

$\theta$ 2

and the refined rigid-valid set is

$\theta$ 3

This is intended to reduce contamination from moving objects in low-texture or homogeneous regions (Hariat et al., 8 May 2026).

The full objective is

$\theta$ 4

with $\theta$ 5, $\theta$ 6, $\theta$ 7, $\theta$ 8, and $\theta$ 9 (Hariat et al., 8 May 2026). Training uses PyTorch, Adam, $\bm L_t(\theta)$ 0, $\bm L_t(\theta)$ 1, $\bm L_t(\theta)$ 2 epochs, batch size $\bm L_t(\theta)$ 3, learning rate $\bm L_t(\theta)$ 4 reduced to $\bm L_t(\theta)$ 5 after $\bm L_t(\theta)$ 6 epochs, a burn-in of $\bm L_t(\theta)$ 7 epochs during which depth-plus-pose are trained with Monodepth2, and hyperparameters $\bm L_t(\theta)$ 8 and $\bm L_t(\theta)$ 9 (Hariat et al., 8 May 2026).

The paper’s most relevant empirical claim is that the major gains come from $\bm g_t(\theta)=\nabla_\theta \bm L_t(\theta)$ 0 itself and that the subsidiary loss benefits are marginal as compared to $\bm g_t(\theta)=\nabla_\theta \bm L_t(\theta)$ 1 (Hariat et al., 8 May 2026). On KITTI $\bm g_t(\theta)=\nabla_\theta \bm L_t(\theta)$ 2, example depth results include Monodepth2 with Abs Rel $\bm g_t(\theta)=\nabla_\theta \bm L_t(\theta)$ 3, SGDepth with $\bm g_t(\theta)=\nabla_\theta \bm L_t(\theta)$ 4, and CoopNet with $\bm g_t(\theta)=\nabla_\theta \bm L_t(\theta)$ 5 (Hariat et al., 8 May 2026). Qualitative depth maps are reported to show better handling of thin structures, high-texture regions, and moving objects (Hariat et al., 8 May 2026).

For Gradient-Bridged Co-Training, CoopNet illustrates a form of bridge that is indirect but still optimization-level: outputs of one branch affect where the other branch is allowed to receive gradients. This suggests a broader notion of gradient bridging in which the key operation is adaptive control over gradient exposure rather than direct gradient arithmetic.

5. Reward-mediated bridging across incompatible model families: Reciprocal Co-Training

RCT couples a differentiable ClinicalBERT classifier with a non-differentiable Random Forest classifier for binary prediction from tabular clinical or biomedical data (Tian et al., 24 Mar 2026). The differentiable component consumes a deterministic textual serialization of each tabular record in a standardized patient-card format, while the RF continues to operate on structured variables (Tian et al., 24 Mar 2026).

The LLM is denoted $\bm g_t(\theta)=\nabla_\theta \bm L_t(\theta)$ 6 and defines a stochastic binary policy

$\bm g_t(\theta)=\nabla_\theta \bm L_t(\theta)$ 7

where $\bm g_t(\theta)=\nabla_\theta \bm L_t(\theta)$ 8 is the textualized record (Tian et al., 24 Mar 2026). The backbone is frozen; only LoRA adapters inserted into attention layers, the classification head, and the value head are updated during PPO training (Tian et al., 24 Mar 2026). From the final hidden layer [CLS] token, the model produces

$\bm g_t(\theta)=\nabla_\theta \bm L_t(\theta)$ 9

which is reduced by PCA to $\theta^{k+\tau_i}=\theta^k-\gamma_i \bm{g}_i(\theta^k)$ 0 principal components,

$\theta^{k+\tau_i}=\theta^k-\gamma_i \bm{g}_i(\theta^k)$ 1

and appended to the original RF input to form

$\theta^{k+\tau_i}=\theta^k-\gamma_i \bm{g}_i(\theta^k)$ 2

This is the forward transfer path from LLM to RF (Tian et al., 24 Mar 2026).

The feedback path runs in the opposite direction. The RF, denoted $\theta^{k+\tau_i}=\theta^k-\gamma_i \bm{g}_i(\theta^k)$ 3, outputs a probability estimate $\theta^{k+\tau_i}=\theta^k-\gamma_i \bm{g}_i(\theta^k)$ 4 and evaluates each sampled LLM action through

$\theta^{k+\tau_i}=\theta^k-\gamma_i \bm{g}_i(\theta^k)$ 5

This enters the hybrid reward

$\theta^{k+\tau_i}=\theta^k-\gamma_i \bm{g}_i(\theta^k)$ 6

with

$\theta^{k+\tau_i}=\theta^k-\gamma_i \bm{g}_i(\theta^k)$ 7

The LLM objective is

$\theta^{k+\tau_i}=\theta^k-\gamma_i \bm{g}_i(\theta^k)$ 8

and PPO uses the clipped surrogate

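$L^{\mathrm{CLIP}}(\theta)=\mathbb{E}\left[\min\left(r(\theta)\,\hat{A},\ \operatorname{clip}\left(r(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}\right)\right]$

where

$r(\theta)=\dfrac{\pi_\theta(a \mid x)}{\pi_{\theta_{\mathrm{old}}}(a \mid x)}$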

The paper includes these PPO equations but does not specify how the advantage estimate is computed, whether generalized advantage estimation is used, or the values of the associated hyperparameters (Tian et al., 24 Mar 2026).
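For orientation, the standard PPO clipped surrogate can be sketched in a few lines. The clipping range `epsilon=0.2` is a commonly used default and is an assumption here, since the paper does not report its value; the function name and the list-based interface are likewise hypothetical.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    # Mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    # where r = exp(logp_new - logp_old) is the probability ratio.
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# A ratio of 1.5 with positive advantage is capped at 1 + epsilon.
capped = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
# A ratio of 0.5 with negative advantage is evaluated at 1 - epsilon.
floored = clipped_surrogate([math.log(0.5)], [0.0], [-1.0])
print(capped, floored)
```

The objective is maximized, so the `min` removes any incentive to push the ratio outside the clipping range in the direction the advantage favors.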

Training is alternating and iterative:

$\bm H_j \bm g_i$ 05

At each outer iteration, the RF is held fixed while PPO updates the LLM; then the updated LLM produces new embeddings and the RF is retrained on the augmented features (Tian et al., 24 Mar 2026). Early stopping uses patience $\bm H_j \bm g_i$ 06 on validation ROC-AUC (Tian et al., 24 Mar 2026).
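The alternating schedule with early stopping can be outlined as below. The three callables, the iteration cap, and the default patience are hypothetical placeholders; the sketch shows only the control flow described above, not the authors' implementation.

```python
def reciprocal_training(update_llm, retrain_rf, validate, max_iters=10, patience=2):
    # update_llm(): one PPO phase with the RF held fixed (hypothetical callable).
    # retrain_rf(): refit the RF on features augmented with fresh embeddings.
    # validate(): returns validation ROC-AUC after the current outer iteration.
    best_score, best_iter, history = float("-inf"), -1, []
    for it in range(max_iters):
        update_llm()
        retrain_rf()
        score = validate()
        history.append(score)
        if score > best_score:
            best_score, best_iter = score, it
        elif it - best_iter >= patience:
            break  # no improvement for `patience` consecutive iterations
    return best_score, best_iter, history

# Usage with stub callables and a scripted validation curve.
scores = iter([0.70, 0.75, 0.74, 0.73, 0.80])
calls = []
best, best_iter, history = reciprocal_training(
    lambda: calls.append("llm"),
    lambda: calls.append("rf"),
    lambda: next(scores),
    patience=2,
)
print(best, best_iter, history)
```

In the scripted run the loop stops after two non-improving iterations, so the final value in the scripted curve is never reached.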

On the MS dataset, the paper reports RF ROC-AUC $\bm H_j \bm g_i$ 07 and LLM ROC-AUC $\bm H_j \bm g_i$ 08, with RF PR-AUC $\bm H_j \bm g_i$ 09 and LLM PR-AUC $\bm H_j \bm g_i$ 10 (Tian et al., 24 Mar 2026). On Breast Cancer, RF ROC-AUC is reported as $\bm H_j \bm g_i$ 11 and LLM ROC-AUC as $\bm H_j \bm g_i$ 12; on Diabetes, RF ROC-AUC is $\bm H_j \bm g_i$ 13 and LLM ROC-AUC is $\bm H_j \bm g_i$ 14 (Tian et al., 24 Mar 2026). The ablation table further reports that iterative refinement improves both models on MS, with LLM ROC-AUC $\bm H_j \bm g_i$ 15 full versus $\bm H_j \bm g_i$ 16 single-pass and RF ROC-AUC $\bm H_j \bm g_i$ 17 full versus $\bm H_j \bm g_i$ 18 single-pass (Tian et al., 24 Mar 2026).

Within a Gradient-Bridged Co-Training perspective, RCT is important because it shows that the bridge need not be an actual derivative through the full system. The differentiable model receives a scalar evaluative signal from the non-differentiable model, and policy gradients convert that signal into an update. This is a surrogate, reward-mediated bridge rather than end-to-end backpropagation.

6. Adversarial alignment in gradient space: GREAT

GREAT, “GRadiEnt Adversarial Training,” starts from the premise that gradient tensors contain latent information about whatever tasks are being trained and can therefore be used as objects of supervision (Sinha et al., 2018). The backpropagation relation

$\bm H_j \bm g_i$ 19

with

$\bm H_j \bm g_i$ 20

is used to motivate the claim that a layer’s gradient depends on both the loss and succeeding weights (Sinha et al., 2018). The method introduces an auxiliary network that classifies the origin of a gradient tensor and a sign-reversed adversarial signal

$\bm H_j \bm g_i$ 21

so that the main network is trained both for its primary task and to fool the auxiliary gradient classifier (Sinha et al., 2018).
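The sign reversal can be illustrated with a hand-written layer that is the identity in the forward pass and negates gradients in the backward pass. This is a generic gradient-reversal sketch with manual forward and backward methods; the class name, the strength parameter `lam`, and the list-based tensors are assumptions and do not reproduce the paper's code.

```python
class GradientReversal:
    # Identity in the forward pass; multiplies incoming gradients by -lam
    # in the backward pass. `lam` is a hypothetical strength parameter.
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return list(x)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]

layer = GradientReversal(lam=1.0)
features = [0.3, -1.2]
passed = layer.forward(features)              # the critic sees the values unchanged
critic_grad = [0.5, -2.0]                     # gradient of the critic loss w.r.t. its input
reversed_grad = layer.backward(critic_grad)   # what the main network receives
print(passed, reversed_grad)
```

Because the critic descends its own loss while the main network receives the negated signal, a single backward pass trains the two against each other.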

In adversarial robustness, GREAT aims to make gradients class-agnostic. The paper writes the intended condition as

$\bm H_j \bm g_i$ 22

The defense objective is

$\bm H_j \bm g_i$ 23

and GREAT is combined with GREACE, which modifies the backward gradient of cross-entropy as

$\bm H_j \bm g_i$ 24

On CIFAR-10 under non-targeted FGSM, the reported accuracies are baseline $\bm H_j \bm g_i$ 25, adversarial training $\bm H_j \bm g_i$ 26, GREACE $\bm H_j \bm g_i$ 27, GREAT $\bm H_j \bm g_i$ 28, and GRE(AT+CE) $\bm H_j \bm g_i$ 29; under non-targeted iFGSM they are baseline $\bm H_j \bm g_i$ 30, adversarial training $\bm H_j \bm g_i$ 31, GREACE $\bm H_j \bm g_i$ 32, GREAT $\bm H_j \bm g_i$ 33, and GRE(AT+CE) $\bm H_j \bm g_i$ 34 (Sinha et al., 2018). The paper explicitly notes that GREAT alone is not enough against strong iterative attacks and that the large gains come from GREAT plus GREACE (Sinha et al., 2018).

In knowledge distillation, the gradients compared are teacher and student input gradients,

$\bm H_j \bm g_i$ 35

and the binary discriminator objective is

$\bm H_j \bm g_i$ 36

with

$\bm H_j \bm g_i$ 37

On CIFAR-10 with CNN-5 student and ResNet-18 teacher, GREAT reports $\bm H_j \bm g_i$ 38 in dense and $\bm H_j \bm g_i$ 39 sparse regimes, compared with baseline $\bm H_j \bm g_i$ 40 and distillation $\bm H_j \bm g_i$ 41 (Sinha et al., 2018). On CIFAR-10 with ResNet-18 student and ResNeXt teacher, GREAT reports $\bm H_j \bm g_i$ 42, while distillation reports $\bm H_j \bm g_i$ 43 (Sinha et al., 2018). The paper emphasizes that GREAT is especially useful in the sparse-data regime and is less hyperparameter-sensitive than temperature-based distillation (Sinha et al., 2018).

In multi-task learning, GREAT introduces Gradient Alignment Layers (GALs), one per task, inserted between a shared encoder and each task decoder, active only during the backward pass and dropped at inference (Sinha et al., 2018). The task-specific gradients with respect to the last shared encoder feature tensor are

$\bm H_j \bm g_i$ 44

and GALs scale them elementwise as $\bm H_j \bm g_i$ 45 (Sinha et al., 2018). The multitask objective is written as

$\bm H_j \bm g_i$ 46

On NYUv2, GREAT reports Depth RMSE $\bm H_j \bm g_i$ 47, Normal loss $\bm H_j \bm g_i$ 48, and Keypoint RMSE $\bm H_j \bm g_i$ 49, compared with Equal weighting $\bm H_j \bm g_i$ 50, Uncertainty weighting $\bm H_j \bm g_i$ 51, and GradNorm $\bm H_j \bm g_i$ 52 (Sinha et al., 2018).

For Gradient-Bridged Co-Training, GREAT provides a methodological template in which gradients are treated as cross-model or cross-task messages and an auxiliary discriminator becomes the mechanism that regularizes those messages.

7. Comparative interpretation, misconceptions, and limitations

Across these papers, several distinct meanings of “bridge” appear. CoGrad uses explicit cross-task transfer and a cross-task Hessian-gradient product (Yang et al., 2023). CoopNet uses a shared disagreement statistic and quantile-based masking (Hariat et al., 8 May 2026). RCT uses a black-box evaluator that returns scalar action scores and relies on PPO to propagate their effect into the LLM (Tian et al., 24 Mar 2026). GREAT uses an auxiliary adversary over gradient tensors and gradient reversal (Sinha et al., 2018). A plausible implication is that Gradient-Bridged Co-Training is best understood as a design principle rather than a single optimization family.

A frequent misconception is to equate all such methods with gradient alignment. The surveyed papers repeatedly distinguish more specific objectives. CoGrad argues that maximizing cosine similarity alone can over-privilege general/shared knowledge and crowd out task-specific knowledge (Yang et al., 2023). CoopNet is not GradNorm-style balancing of task losses by gradient magnitudes and not PCGrad-style conflict projection; it is a data-dependent partition of training signal (Hariat et al., 8 May 2026). RCT does not differentiate through the RF and therefore should not be described as end-to-end joint training; its bridge is policy-gradient mediated (Tian et al., 24 Mar 2026). GREAT is not inherently cooperative in the ordinary sense, because its main criterion is adversarial indistinguishability in gradient space (Sinha et al., 2018).

The assumptions and failure modes are correspondingly heterogeneous. CoGrad relies on local first-order Taylor approximations and a heuristic Hessian approximation $\bm H_j \bm g_i$ 53 (Yang et al., 2023). CoopNet assumes that rigid pixels cluster near the center of the disagreement distribution and that moving pixels populate the tails, while also acknowledging that photometric loss remains weak in homogeneous regions and introducing $\bm H_j \bm g_i$ 54 as a corrective prior (Hariat et al., 8 May 2026). RCT depends on meaningful scalar confidence estimates from the RF, stable alternating updates, and useful LLM embeddings for the RF feature space; the paper also notes RL instability, reward sensitivity, lack of theoretical convergence guarantees, and RF oscillations across iterations in the MS dataset (Tian et al., 24 Mar 2026). GREAT depends on the hypothesis that gradients encode transferable information and, in robustness settings, can devolve into gradient obfuscation unless paired with a stronger primary loss such as GREACE (Sinha et al., 2018).

Taken together, these works support a technically specific interpretation of Gradient-Bridged Co-Training: learning systems can be coupled through gradient-derived quantities even when they do not share the same architecture, hypothesis class, or optimization regime. The bridge may be direct, as in transfer-maximizing gradient modification; indirect, as in disagreement-controlled gradient routing; surrogate, as in reward-mediated policy gradients; or adversarial, as in gradient-space indistinguishability. What remains constant is that cooperation is formulated not only at the level of predictions or features, but at the level of how updates are generated, filtered, or redirected (Yang et al., 2023, Hariat et al., 8 May 2026, Tian et al., 24 Mar 2026, Sinha et al., 2018).