Forge: Multifaceted Research Systems

Updated 4 July 2026

Forge is a term denoting a range of systems that construct, refine, and validate artifacts, from proofs to simulation outputs.
It encompasses frameworks for symbolic analysis, representational learning, and robust pipeline construction in diverse scientific disciplines.
Its applications range from asymptotic inequality proofs and molecular editing to geospatial raster manipulation and robotic assembly.

In recent technical literature, Forge and FORGE denote a diverse set of research artifacts rather than a single doctrine or platform. The name appears in symbolic mathematics, mechanistic interpretability, molecular optimization, agent memory, vulnerability assessment, software-provenance analysis, multimodal generation, geospatial tooling, robotic assembly, seismological monitoring, and cosmological emulation. Across these usages, the term is attached to tools that construct, refine, or validate objects such as proofs, feature vocabularies, fragment edits, semantic identifiers, project families, and simulation outputs (Khaitan et al., 14 Oct 2025, Draye et al., 22 Mar 2026, Zhang et al., 11 May 2026, Kukreja et al., 22 Jun 2026, Mockus, 28 Jun 2026, Dunnell et al., 2024, Fu et al., 25 Sep 2025).

1. Naming patterns and conceptual range

Several papers explicitly motivate the name in terms of making, shaping, or forming. "Form Forge" presents the term as apt because users actively shape architectural artifacts from a latent substrate rather than merely browse outputs (Dunnell et al., 2024). The industrial generative-retrieval benchmark "FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets" states that the system “forges” better semantic IDs from multimodal industrial data rather than assigning arbitrary item IDs (Fu et al., 25 Sep 2025). "Forge-and-Quench" uses Forge for the stage in which an MLLM creates an enhanced instruction and a virtual visual representation, termed the Bridge Feature, before the image generator is conditioned on that signal (Zeng et al., 8 Jan 2026).

The label also appears in more conventional acronymic form. O-Forge is an LLM + CAS framework for asymptotic analysis (Khaitan et al., 14 Oct 2025). CLT-Forge is a scalable library for Cross-Layer Transcoders (Draye et al., 22 Mar 2026). In chemistry, FORGE expands to Fragment-Oriented Ranking and GEneration (Zhang et al., 11 May 2026). In agent learning it denotes Failure-Optimized Reflective Graduation and Evolution (Bogdanov et al., 15 May 2026). In LLM training it denotes Fused On-Register Gradient Elimination (Kukreja et al., 22 Jun 2026).

A distinct usage appears in software-ecosystem research, where forge retains its meaning as a hosting platform such as GitHub or GitLab. "Deforking the World of Code" studies cross-forge fork families that platform graphs cannot represent and releases a forge-agnostic provenance map (Mockus, 28 Jun 2026). By contrast, FORGE in geoscience can denote the Frontier Observatory for Research in Geothermal Energy, the Utah field laboratory that anchors a downhole DAS study (Lellouch et al., 2020), or the F-of-R Gravity Emulator project in cosmology (Arnold et al., 2021). Taken together, these works suggest that the term functions less as a disciplinary keyword than as a recurrent title for systems concerned with construction, refinement, and validation.

2. Asymptotic analysis and symbolic proof

O-Forge is a proposed LLM + CAS framework for asymptotic analysis designed for difficult asymptotic inequalities of the form $f \ll g$, equivalently $f = O(g)$, especially when the proof hinges on decomposing a domain or a series into the “right” regimes (Khaitan et al., 14 Oct 2025). The framework formalizes a two-part structure: first find the right decomposition, then verify a simple inequality on each regime. Its central mechanism is an In-Context Symbolic Feedback loop in which a frontier LLM proposes a finite partition such as $D=\bigcup_{i=1}^k D_i$, while Mathematica’s Resolve verifies each piece by quantifier elimination over the reals. The system takes a conjectured asymptotic inequality in LaTeX or natural language and returns a decomposition together with a proof status; a status such as True/Proved is issued only when the CAS has rigorously certified every subproblem.

The workflow is explicitly regime-driven. The LLM suggests splits based on dominant terms, monotonicity regimes, dyadic thresholds, or orderings of variables; on each subdomain, the system simplifies expressions to leading behavior and enforces positivity assumptions to avoid invalid bounds or singular regions; the CAS then proves statements of the form $\forall x\in D_i:\ f(x)\le Cg(x)$, with $C$ searched over a finite grid such as $\{1,\dots,10^4\}$ (Khaitan et al., 14 Oct 2025). The paper’s weak Fenchel–Young-style example,

$xy \ll x\log x + e^y,\qquad x\ge 1,\ y\ge 0,$

is handled by splitting at $y\le 2\log x$ and $y>2\log x$, after which each region becomes routine. A second example, a MathOverflow-style series estimate, uses the breakpoints $\{\lceil h\rceil,\ \lceil hm\rceil\}$ so that the summand has a different leading-order behavior in each of the three successive ranges.
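To see why the split makes each region routine, the two cases can be worked by hand. The derivation below is an illustrative sketch added here, not the CAS output reported in the paper, and the constant it finds is only one admissible choice.

On the region $x\ge 1,\ 0\le y\le 2\log x$, the factor $y$ is bounded directly:

$xy \le 2x\log x \le 2\,(x\log x + e^y).$

On the region $y>2\log x$, one has $x<e^{y/2}$, and $y\,e^{-y/2}$ attains its maximum $2/e$ at $y=2$, so

$xy < y\,e^{y/2} = \left(y\,e^{-y/2}\right)e^{y} \le \tfrac{2}{e}\,e^{y} \le x\log x + e^y,$

where the last step uses $x\log x\ge 0$ for $x\ge 1$. Taking $C=2$ therefore gives $xy\le C\,(x\log x+e^y)$ on the whole domain.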

The paper frames O-Forge as a direct response to Terence Tao’s question about whether LLMs plus verifiers can help with intricate asymptotic inequalities (Khaitan et al., 14 Oct 2025). Its emphasis is not on free-form proof generation but on the division of labor between heuristic decomposition and rigorous symbolic checking. The stated limitations are equally central: Resolve returns a truth value rather than a formal proof object; the system relies on closed-source Mathematica; the simplification strategy is mainly leading-order extraction; and the tool is not designed to produce Lean-style formally checked proofs.

3. Representation learning, interpretability, and retrieval

CLT-Forge is an open-source unified end-to-end toolkit for Cross-Layer Transcoders (CLTs) in mechanistic interpretability (Draye et al., 22 Mar 2026). CLTs share features across layers while preserving layer-specific decoding, with the stated aim of making attribution graphs more compact and less redundant. The library integrates feature-sharded training, compressed activation caching, automated interpretability, attribution graph computation via Circuit-Tracer, and a Python-based Dash interface. Its systems contribution is substantial because the CLT parameter count grows quadratically with the number of layers, so models can become extremely large: the paper gives the example that a LLaMA 3.2 1B CLT can reach about 27.4B parameters. Feature-wise GPU sharding enables training such models with 8 GPUs, bf16 precision, expansion factor 48, about 1.5 million features, and micro-batch size 512. Compressed activation caching reduces a LLaMA 1B, 300M-token activation cache from about 20 TB to around 4 TB with only a small loss in reconstruction quality.
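The quadratic growth can be made concrete with a short parameter count. The sketch below rests on several assumptions that are not stated in the text above: features read at layer i decode into layer i and every later layer (the usual CLT layout), the base model has 16 layers with hidden size 2048 (the published Llama 3.2 1B configuration), the expansion factor is 48, and biases are ignored. The function name `clt_parameter_counts` is hypothetical.

```python
def clt_parameter_counts(n_layers: int, d_model: int, expansion_factor: int) -> dict:
    """Count encoder and decoder weights of a cross-layer transcoder (biases ignored)."""
    features_per_layer = expansion_factor * d_model
    # One encoder matrix (d_model x features_per_layer) per layer.
    encoder = n_layers * d_model * features_per_layer
    # Assumed layout: features from layer i decode into layers i..n_layers-1,
    # which gives n(n+1)/2 decoder matrices, hence quadratic growth in depth.
    decoder_blocks = n_layers * (n_layers + 1) // 2
    decoder = decoder_blocks * features_per_layer * d_model
    return {
        "total_features": n_layers * features_per_layer,
        "encoder": encoder,
        "decoder": decoder,
    }


counts = clt_parameter_counts(n_layers=16, d_model=2048, expansion_factor=48)
print(counts["total_features"])            # 1572864, about 1.5 million features
print(round(counts["decoder"] / 1e9, 1))   # 27.4 (billions of decoder weights)
```

Under these assumptions the decoder weights alone come to roughly 27.4 billion, which is in line with the figure quoted above; whether the paper counts parameters in exactly this way is not confirmed here.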

A different representational use of the name appears in "FORGE: Foundational Optimization Representations from Graph Embeddings," which pre-trains a vector-quantized graph autoencoder on mixed-integer programming instances without using optimal solutions (Shafi et al., 28 Aug 2025). MIP instances are represented as bipartite graphs between constraints and variables; node embeddings are quantized into a codebook of size 5000; and an instance is summarized by the frequency distribution of discrete codes. This design is intended to preserve global structure better than a plain pooled GNN embedding. On the unsupervised clustering study over unseen D-MIPLIB instances, the reported normalized mutual information is 0.843 for Forge, versus 0.790 for label propagation and 0.087 for mean readout. In supervised settings, the same pre-trained backbone is fine-tuned for integrality-gap prediction and warm-start variable prediction, and the resulting predictions improve a commercial solver, specifically Gurobi (Shafi et al., 28 Aug 2025).
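The step from quantized node embeddings to an instance-level summary can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the codebook and node embeddings are random stand-ins for a trained model, the embedding dimension 16 is arbitrary, and the function name `code_histogram` is hypothetical. Only the codebook size of 5000 comes from the text.

```python
import numpy as np


def code_histogram(node_embeddings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Assign each node to its nearest code and return normalized code frequencies."""
    # Squared Euclidean distances via ||a||^2 - 2 a.b + ||b||^2, shape (nodes, codes).
    d2 = (
        (node_embeddings ** 2).sum(axis=1, keepdims=True)
        - 2.0 * node_embeddings @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    codes = d2.argmin(axis=1)
    counts = np.bincount(codes, minlength=codebook.shape[0]).astype(float)
    return counts / counts.sum()


rng = np.random.default_rng(0)
codebook = rng.normal(size=(5000, 16))   # stand-in for a learned codebook
nodes = rng.normal(size=(300, 16))       # stand-in for GNN embeddings of one MIP instance
instance_vector = code_histogram(nodes, codebook)
print(instance_vector.shape)             # (5000,)
```

Because the histogram has a fixed length regardless of how many constraints and variables an instance has, instances of different sizes become directly comparable, which is what a clustering study needs.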

In industrial recommendation, FORGE is a benchmark and methodological framework for forming semantic identifiers (SIDs) in generative retrieval (Fu et al., 25 Sep 2025). The Taobao dataset reported in the paper contains 131M users, 251M items, and 14B interactions collected over 10 consecutive days, with ID, text, and image modalities. SID construction uses multimodal feature extraction, collaborative semantic alignment via an InfoNCE-style contrastive loss, and RQ-VAE for residual quantization. The paper highlights two direct SID-quality metrics—Embedding HitRate and Gini Coefficient—as positively correlated with downstream retrieval performance, and it introduces an offline pretraining schema, From UserAction, that reduces online convergence by half. In a 7-day online experiment on Taobao’s “Guess You Like” section, the reported gains are PVR +8.93%, Hitrate +10.02%, and Transaction Count +0.35% (Fu et al., 25 Sep 2025). A plausible implication is that, within this line of work, “forge” names not only a model but a full pipeline for constructing the representational substrate on which downstream generative systems depend.
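The residual-quantization step that turns an item embedding into a multi-level semantic ID can be illustrated as follows. The sketch shows only the greedy codebook lookup with fixed, hand-written codebooks; an RQ-VAE learns its encoder and codebooks jointly, which is not reproduced here. The function name `residual_quantize` and the toy codebooks are hypothetical.

```python
import numpy as np


def residual_quantize(x: np.ndarray, codebooks: list) -> tuple:
    """Greedy residual quantization: one code index per level, plus the reconstruction."""
    residual = x.astype(float).copy()
    reconstruction = np.zeros_like(residual)
    codes = []
    for codebook in codebooks:
        # Pick the codeword closest to what is still unexplained.
        distances = ((codebook - residual) ** 2).sum(axis=1)
        index = int(distances.argmin())
        codes.append(index)
        reconstruction += codebook[index]
        residual -= codebook[index]
    return tuple(codes), reconstruction


coarse = np.array([[0.0, 0.0], [10.0, 10.0]])   # level 1: coarse position
fine = np.array([[0.0, 0.0], [1.0, -1.0]])      # level 2: correction to the residual
sid, approx = residual_quantize(np.array([11.0, 9.0]), [coarse, fine])
print(sid)      # (1, 1)
print(approx)   # [11.  9.]
```

Each level refines the previous one, so items with similar embeddings share leading code indices; that shared prefix is what makes such identifiers more informative than arbitrary item IDs.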

4. Molecular editing, agent memory, and vulnerability engineering

In molecular optimization, FORGE reformulates the task as context-aware local editing rather than prompt-conditioned whole-molecule generation (Zhang et al., 11 May 2026). The framework has two stages: Stage 1 ranks fragments under full-molecule context to learn where to edit, and Stage 2 generates an explicit replacement for a chosen fragment to learn how to edit. Supervision comes from GNN attribution, RDKit attribution, SME+, and ChEMBL matched molecular pairs, and the backbone is Qwen3-0.6B with an atom-level retokenization scheme called QwenAtom. The paper argues that natural-language prompts are a weak control signal for black-box objectives and can inherit chemical hallucinations, whereas verified fragment-level edit pairs are a scalable and hallucination-free alternative. Reported benchmark results include an aggregate score of 12.42 on PMO-1k, surpassing LICO (11.71), MOLLEO (11.65), Genetic GFN (11.56), GP-BO (11.27), Graph GA (10.90), Augmented Memory (10.81), and REINVENT (10.68), as well as favorable comparisons with Qwen-3-8B and GPT-4o on ChemCoTBench (Zhang et al., 11 May 2026).

The agent-memory FORGE introduces a prompt-only, gradient-free self-improvement protocol for LLM agents (Bogdanov et al., 15 May 2026). It wraps a Reflexion-style inner loop, which turns failed trajectories into reusable memory artifacts (Rules, Examples, or Mixed), in an outer loop that propagates the best-performing instance’s memory to the rest of the population and freezes converged instances via a graduation criterion. Evaluated on CybORG CAGE-2, a stochastic network-defense POMDP with a 30-step horizon against the B_line attacker, the method improves average evaluation return by 1.7–7.7$\times$ over zero-shot and by 29–72% over Reflexion across all 12 model-representation conditions (Bogdanov et al., 15 May 2026). The paper’s ablations identify population broadcast as the critical mechanism, while graduation primarily saves compute.

A security-oriented FORGE connects exploit generation, vulnerability prioritization, and detection engineering through graduated exploitation depth (Shaikh, 2 Jun 2026). Its five-agent pipeline—Intel, Generator, Planner, Exploit, and Detector—creates targeted vulnerable applications from CVE metadata, attempts coached multi-turn exploitation, scores progress on a taxonomy from L0 to L3, and generates Sigma and Snort rules from OpenTelemetry traces. On 603 CVEs from CVE-GENIE, the paper reports 409 / 603 = 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE, spanning 8 languages and 187 CWE types (Shaikh, 2 Jun 2026). The same study reports no significant rank correlation between exploit depth and either EPSS or CVSS, and it finds that 93.4% of generated Snort rules yield zero false positives on a synthetic benign corpus. This suggests a broader use of “forge” for systems that manufacture actionable artifacts—here vulnerable apps, exploit traces, and grounded detections—from incomplete metadata.

5. Training kernels, provenance maps, and large-scale software infrastructure

In large-model training, FORGE denotes a kernel- and scheduling-level change to reverse-mode differentiation: Fused On-Register Gradient Elimination (Kukreja et al., 22 Jun 2026). Standard training materializes each weight gradient, stores it in HBM, and only then lets the optimizer read it back; FORGE instead folds the optimizer step into the backward pass, consuming each gradient tile immediately in registers so that the gradient never becomes a tensor. The paper states that for any element-wise-separable optimizer—explicitly including AdamW, SGD, RMSprop, Lion, NAdam, RAdam, AdaGrad, and SGD with momentum—the tile-wise fused update is provably exact in full precision, with only the usual GEMM reduction-order effects in floating point. The reported practical effect is a structural reduction in memory usage: on Llama-3.1-8B with BS=1 and SEQ=512 on an H200, peak memory drops from 75.0 GiB for fused AdamW to 35.4 GiB for FORGE, while step time improves from 150.9 ms to 99.1 ms, or 1.52× faster (Kukreja et al., 22 Jun 2026). Integrated into tensor-parallel Megatron-LM, the method fits 8B training at four times the micro-batch that a standard optimizer allows on the same GPUs.
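The exactness claim for element-wise-separable optimizers can be checked on a toy linear layer. The NumPy sketch below is an illustration under stated assumptions, not the paper's GPU kernel: it uses a single layer `Y = X @ W.T`, a hand-written AdamW with arbitrary hyperparameters, and Python loops over tiles in place of register-level fusion. The function names are hypothetical.

```python
import numpy as np


def adamw_update(w, g, m, v, step, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """In-place AdamW on arrays of equal shape; works on a full tensor or on one tile."""
    m *= b1
    m += (1 - b1) * g
    v *= b2
    v += (1 - b2) * g * g
    m_hat = m / (1 - b1 ** step)
    v_hat = v / (1 - b2 ** step)
    w *= 1 - lr * wd                          # decoupled weight decay
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)


def step_materialized(W, m, v, X, dY, step):
    """Baseline: build the whole weight gradient, then run the optimizer."""
    dX = dY @ W
    dW = dY.T @ X                             # full (out, in) gradient tensor
    adamw_update(W, dW, m, v, step)
    return dX


def step_fused(W, m, v, X, dY, step, tile=4):
    """Tile-wise variant: each gradient tile is consumed as soon as it is computed."""
    dX = dY @ W                               # needs pre-update weights, so compute first
    for r in range(0, W.shape[0], tile):
        for c in range(0, W.shape[1], tile):
            rows, cols = slice(r, r + tile), slice(c, c + tile)
            g_tile = dY[:, rows].T @ X[:, cols]   # only this tile of the gradient exists
            adamw_update(W[rows, cols], g_tile, m[rows, cols], v[rows, cols], step)
    return dX


rng = np.random.default_rng(0)
W_ref = rng.normal(size=(8, 12))
W_fused = W_ref.copy()
state_ref = [np.zeros_like(W_ref), np.zeros_like(W_ref)]
state_fused = [np.zeros_like(W_ref), np.zeros_like(W_ref)]
for step in range(1, 4):
    X = rng.normal(size=(5, 12))
    dY = rng.normal(size=(5, 8))
    dX_ref = step_materialized(W_ref, *state_ref, X, dY, step)
    dX_fused = step_fused(W_fused, *state_fused, X, dY, step)

# Both comparisons should hold up to floating-point rounding.
print(np.allclose(W_ref, W_fused), np.allclose(dX_ref, dX_fused))
```

The update of each weight depends only on its own gradient entry and its own optimizer state, which is why the tiles can be processed independently. The one ordering constraint visible in the sketch is that the input gradient must be formed from the weights before any tile is overwritten.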

"Deforking the World of Code" uses forge in the repository-hosting sense and studies how to recover project identity across hosting platforms (Mockus, 28 Jun 2026). The central object is a deforking map from raw repositories to canonical projects, built from the global shared-commit relation in World of Code (WoC) using a hub-node star encoding and parallel Louvain clustering. The input scale is large: the paper reports 51.79M shared-commit groups. To prevent boilerplate commits from welding unrelated software into giant clusters, it releases capped variants cap250 and cap500, with the paper judging cap250 best overall. Validation against GitHub’s declared fork graph reconstructed from GHArchive ForkEvents yields 99.01% edge agreement conditional on both repositories being in WoC (Mockus, 28 Jun 2026). The conceptual contribution is the recovery of relationships that platform-native graphs cannot represent, including 5.41% multi-forge families and 1.51% of fork families whose canonical root is not on GitHub. Here, “forge” names not a constructive metaphor but the software-hosting environments whose boundaries the method explicitly transcends.
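The idea of grouping repositories through shared commits, while ignoring commits that are shared too widely, can be sketched with a union-find structure. This is a deliberately simplified stand-in: it computes connected components rather than the Louvain communities used in the paper, and it assumes that a cap means "skip any commit group with more than `cap` repositories", which the text above does not spell out. The function name, the repository names, and the choice of the lexicographically smallest name as canonical root are all hypothetical.

```python
def deforking_map(commit_to_repos: dict, cap: int = 250) -> dict:
    """Map each repository to a canonical representative via shared commits."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            if rb < ra:                     # smaller name stays root, for determinism
                ra, rb = rb, ra
            parent[rb] = ra

    for repos in commit_to_repos.values():
        repos = sorted(set(repos))
        for repo in repos:
            find(repo)                      # register every repository, even isolated ones
        if not repos or len(repos) > cap:
            continue                        # assumed cap rule: treat as boilerplate
        hub = repos[0]
        for repo in repos[1:]:
            union(hub, repo)                # star around one hub instead of all pairs
    return {repo: find(repo) for repo in parent}


commits = {
    "c1": ["github.com/a/x", "gitlab.com/b/x"],
    "c2": ["gitlab.com/b/x", "github.com/c/x"],
    "c3": ["github.com/d/y"],
    "license": ["github.com/a/x", "github.com/d/y", "github.com/e/z"],
}
mapping = deforking_map(commits, cap=2)
print(mapping["gitlab.com/b/x"])   # github.com/a/x
print(mapping["github.com/d/y"])   # github.com/d/y
```

Linking each group through a single hub keeps the edge count linear in group size rather than quadratic, which matters at the scale of tens of millions of groups. In the example, the widely shared `license` commit is skipped, so it does not merge the otherwise unrelated projects.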

6. Physical sciences and robotics

In seismology, FORGE refers to the Frontier Observatory for Research in Geothermal Energy near Milford, Beaver County, Utah, where a downhole Distributed Acoustic Sensing (DAS) array was used to monitor low-magnitude seismicity during hydraulic stimulation (Lellouch et al., 2020). The study analyzes 10.5 days of quasi-continuous data acquired in monitoring well 78-32 during Phase 2-C, while stimulation occurred in well 58-32. Using semblance-based detection, manual picking, and template matching, the authors produce a final 82-event DAS catalog, of which 16 events are visible on the regional surface network (Lellouch et al., 2020). The reported DAS magnitude completeness is about M = -0.7, compared with M = 0.2 for the surface network, implying an improvement of about 0.5 to 1.0 magnitude units. The main limitation is that a single vertical DAS array has azimuthal ambiguity, so future monitoring is argued to benefit from multiple DAS wells or hybrid DAS-plus-surface deployments.
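Semblance-based detection rests on a simple coherence measure across channels. The sketch below computes the standard semblance ratio on a window of traces, assuming the traces have already been aligned along a trial moveout; the alignment step, the detection threshold, and the synthetic data are not taken from the study, and the function name is hypothetical.

```python
import numpy as np


def semblance(window: np.ndarray) -> float:
    """Semblance of a (channels x samples) window.

    Equals 1 for identical traces and is close to 1/channels for incoherent noise.
    """
    n_channels = window.shape[0]
    stack_energy = (window.sum(axis=0) ** 2).sum()
    total_energy = n_channels * (window ** 2).sum()
    return float(stack_energy / total_energy) if total_energy > 0 else 0.0


rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
wavelet = np.sin(2 * np.pi * 15 * t) * np.exp(-((t - 0.5) ** 2) / 0.005)
coherent = np.tile(wavelet, (50, 1))        # same arrival on all 50 channels
noise = rng.normal(size=(50, 200))          # uncorrelated noise

print(round(semblance(coherent), 3))        # 1.0
print(semblance(noise) < 0.2)               # True
```

The many closely spaced channels of a DAS array are what make this measure effective for weak events: a small arrival that is coherent across channels stands out against noise whose semblance shrinks as the channel count grows.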

The cosmological FORGE project, by contrast, is the F-of-R Gravity Emulator, a simulation suite for building accurate emulators of observables in $f(R)$ gravity (Arnold et al., 2021). The suite contains 200 collisionless dark-matter-only simulations over 50 cosmological parameter nodes, with 4 simulations per node: a high-resolution pair using $1024^3$ particles in $500\,h^{-1}\,\mathrm{Mpc}$ boxes and a low-resolution pair using $512^3$ particles in $1500\,h^{-1}\,\mathrm{Mpc}$ boxes. The first product is a Gaussian-process emulator for the matter power spectrum, trained on smoothed ratios relative to a HALOFIT $\Lambda$CDM baseline. Cross-validation shows the emulator is accurate to better than 2.5% for the majority of nodes, particularly near the center of parameter space, up to $k = 10\,h\,\mathrm{Mpc}^{-1}$, and tests on additional unseen simulations find maximum relative errors of about 2.5% for F6 and 3% for F5 (Arnold et al., 2021). The stated use case is nonlinear prediction for galaxy clustering, weak gravitational lensing, and galaxy clusters.
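The emulation idea, interpolating a power-spectrum ratio between simulated parameter nodes, can be sketched with a small Gaussian-process regressor. Everything specific below is an assumption made for illustration: the single one-dimensional parameter, the eight evenly spaced nodes, the synthetic ratios, the RBF kernel, and its length scale. The actual emulator is trained on simulation outputs over a multi-dimensional parameter space.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, length_scale: float) -> np.ndarray:
    """Squared-exponential kernel between two sets of parameter vectors."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / length_scale ** 2)


def gp_posterior_mean(x_train, y_train, x_test, length_scale=1.0, noise=1e-6):
    """Posterior mean of a GP with constant mean and RBF kernel, one output per column."""
    k_train = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    k_cross = rbf_kernel(x_test, x_train, length_scale)
    offset = y_train.mean(axis=0)
    weights = np.linalg.solve(k_train, y_train - offset)
    return offset + k_cross @ weights


# Synthetic stand-ins: 8 nodes of one rescaled parameter, ratios in 3 wavenumber bins.
nodes = np.arange(8.0).reshape(-1, 1)
k_bins = np.array([0.1, 1.0, 10.0])
ratios = 1.0 + 0.05 * np.sin(nodes) * np.log10(k_bins * 10.0)

prediction = gp_posterior_mean(nodes, ratios, np.array([[3.5]]))
print(prediction.shape)   # (1, 3)
```

Emulating a ratio to a baseline spectrum rather than the spectrum itself keeps the target close to one and smooth in the parameters, which is the kind of function a Gaussian process interpolates well from a few dozen nodes.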

In robotics, FORGE stands for Force-Guided Exploration for Robust Contact-Rich Manipulation under Uncertainty (Noseworthy et al., 2024). The method is a sim-to-real reinforcement-learning framework for assembly tasks under pose uncertainty, trained with a force threshold, dynamics randomization, and a learned early-termination action. It is conditioned on a maximum allowable force and penalizes excessive contact so that policies search safely rather than “blast through” uncertainty. Deployed on a Franka Panda, the paper reports higher success and lower force than a baseline across 8 mm Peg, Medium Gear, and M16 Nut tasks: for peg insertion, success 0.84 versus 0.64, mean force 5.51 N versus 11.81 N, and max force 12.84 N versus 17.93 N; for medium gear, success 0.98 versus 0.69 (Noseworthy et al., 2024). The same framework is then used in a planetary gearbox assembly composed of multiple primitives, with full assembly succeeding in 3/5 trials and early termination saving about 65 seconds per trial on average.

7. Creative, multimodal, and geospatial interfaces

"Form Forge" is a prototype for interactive latent-space exploration of architectural forms trained on François Blanciak’s SITELESS: 1001 Building Forms (Dunnell et al., 2024). The authors collected 1017 sketches, standardized them to white sketches on black backgrounds, upscaled them to 512 × 512, and trained a fine-tuned StyleGAN2-ADA for 235 training ticks, about 461 epochs, with learning rate 0.002. The resulting web application uses a Python/Flask back end and a React front end. At the center of the interface is a 512 × 512 generated image, surrounded by 512 tick marks, each corresponding to one dimension of a 512-dimensional latent vector. The defining design choice is explicit latent variable manipulation rather than semantic sliders or projected navigation landmarks. The paper treats this as a preliminary investigation into designer agency over a high-dimensional generative space (Dunnell et al., 2024).

"Forge-and-Quench" addresses unified multimodal generation by having an MLLM first reason over the conversational context to produce an enhanced text instruction and then map that into a Bridge Feature, a virtual visual representation injected into a frozen text-to-image backbone (Zeng et al., 8 Jan 2026). The Bridge Adapter is trained on 200M image-text pairs for 500k steps, and the Injection Adapter is trained on 13M filtered image-text samples for 80k steps. The system is evaluated on MeiGen-Image and FLUX.1-dev, with reported COCO-30K FID improvements from 23.97 to 19.86 for MeiGen-Image and from 27.71 to 20.83 for FLUX.1-dev (Zeng et al., 8 Jan 2026). The paper also reports improved WISE scores, specifically 0.55 → 0.70 and 0.56 → 0.66, while maintaining prompt-following accuracy.

"Raster Forge" is a Python library and GUI for raster manipulation, particularly in remote sensing and wildfire management (Oliveira et al., 2024). Implemented with PySide6, NumPy, Rasterio, Spyndex, OpenCV, and Matplotlib, it organizes data around Layer and Raster container classes and supports six processing categories: image composites, multispectral indices, topographical features, distance fields, height maps, and fuel maps. The GUI is organized into three panels—Layers, Processes, and Viewer—and allows outputs to be saved as a new Layer, a rendered image, or a raw TIFF preserving geographic information (Oliveira et al., 2024). Because fuel-map generation combines vegetation coverage, distance fields, canopy height, water masks, artificial-structure masks, a tree-height benchmark, and predefined fuel models, the tool is explicitly oriented toward wildfire simulation and validation while also being presented as applicable to hydrological modeling, agriculture, environmental monitoring, medical imaging, architecture, and planetary/interplanetary remote sensing.
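As an illustration of the multispectral-index category, the following sketch computes a normalized difference index such as NDVI with plain NumPy. It does not use Raster Forge's own API, whose function names are not given here; the helper `normalized_difference` and the tiny reflectance arrays are hypothetical.

```python
import numpy as np


def normalized_difference(band_a: np.ndarray, band_b: np.ndarray) -> np.ndarray:
    """Compute (a - b) / (a + b) per pixel, leaving NaN where the denominator is zero."""
    a = band_a.astype(float)
    b = band_b.astype(float)
    denom = a + b
    out = np.full(a.shape, np.nan)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out


# NDVI uses the near-infrared and red bands.
nir = np.array([[0.8, 0.4], [0.0, 0.5]])
red = np.array([[0.2, 0.4], [0.0, 0.1]])
ndvi = normalized_difference(nir, red)
print(ndvi)
```

Pixels with no signal in either band are left as NaN rather than forced to zero, so they can be masked out later instead of being mistaken for bare ground.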

Across these creative and interface-oriented systems, the name Forge is used for environments that expose a substrate—latent variables, virtual visual features, or raster layers—to deliberate manipulation. This suggests continuity with the stronger engineering and mathematical uses of the term elsewhere in the literature: the common emphasis is not merely generation, but generation coupled to control, structure, or verification.