
ERNIE 3.0 Titan

Updated 26 December 2025
  • ERNIE 3.0 Titan is a large-scale, knowledge-enhanced Transformer model that integrates structured knowledge to strengthen both natural language understanding and generation.
  • It employs a shared-bottom, task-specific-top architecture using Transformer-XL and progressive learning strategies to achieve robust scalability and efficiency.
  • Innovations such as adversarial and controllable generation, online distillation, and green training promote faster convergence and reduced energy consumption.

ERNIE 3.0 Titan is a large-scale, knowledge-enhanced pre-trained LLM developed within the ERNIE 3.0 continual multi-paradigm framework. It advances the integration of structured knowledge into transformer-based LLMs, scaling from 10 billion to over 260 billion parameters. The model achieves strong results across a variety of Chinese natural language understanding (NLU) and generation (NLG) tasks, and incorporates novel architectural, algorithmic, and system-level innovations to improve knowledge incorporation, controllable generation, training efficiency, and scalability (Sun et al., 2021, Wang et al., 2021).

1. Model Architecture and Design

ERNIE 3.0 Titan employs a “shared-bottom, task-specific-top” architecture built on Transformer-XL. The Universal Representation Module (URM) encodes shared lexical and syntactic features through deep, recurrent self-attention. Two Task-specific Representation Modules (TRMs) operate on top of the URM to provide specialized representations for NLU and NLG tasks.

Architecture parameters:

Module                     Layers   Hidden Size   Heads   Seq Len   Mem Len   Parameters
Universal Representation   48       12,288        192     512       128       > 260B total (260B scale)
Task-specific NLU branch   12       768           12      512       —         —
Task-specific NLG branch   12       768           12      512       128       —

For the 10B-parameter scale (Sun et al., 2021), the URM uses a hidden size of 4096 with 64 heads; at 260B scale (Wang et al., 2021), the URM is expanded to a hidden size of 12,288 with 192 heads. TRMs remain at a manageable 12 layers to facilitate downstream fine-tuning. The NLU branch uses bi-directional attention, while the NLG branch employs uni-directional attention with recurrence memory. This modularization allows for disentangled optimization of NLU and NLG objectives in continual multi-paradigm pre-training.
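
To make the shared-bottom, task-specific-top layout concrete, the following is a minimal PyTorch-style sketch under stated assumptions: layer counts and widths are scaled far down, Transformer-XL recurrence memory is omitted, and the class and parameter names are invented for illustration rather than taken from the released ERNIE 3.0 Titan code.

```python
import torch
import torch.nn as nn

class SharedBottomSketch(nn.Module):
    """Illustrative shared-bottom / task-specific-top layout (not the actual Titan code).

    A deep Universal Representation Module (URM) is shared by two shallow
    Task-specific Representation Modules (TRMs): a bidirectional NLU head and
    a causally masked NLG head. Recurrence memory is omitted for brevity.
    """

    def __init__(self, vocab_size=30000, d_model=512, n_heads=8,
                 urm_layers=6, trm_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)

        def make_layer():
            return nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)

        self.urm = nn.TransformerEncoder(make_layer(), num_layers=urm_layers)
        self.trm_nlu = nn.TransformerEncoder(make_layer(), num_layers=trm_layers)
        self.trm_nlg = nn.TransformerEncoder(make_layer(), num_layers=trm_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, task="nlu"):
        x = self.embed(token_ids)
        shared = self.urm(x)                 # shared lexical/syntactic features
        if task == "nlu":                    # bidirectional attention
            return self.trm_nlu(shared)
        causal = nn.Transformer.generate_square_subsequent_mask(
            token_ids.size(1)).to(token_ids.device)
        hidden = self.trm_nlg(shared, mask=causal)   # uni-directional attention
        return self.lm_head(hidden)
```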

2. Knowledge-Enhanced Pre-training Objectives

ERNIE 3.0 Titan interleaves several pre-training tasks, unifying auto-encoding and auto-regressive objectives while introducing knowledge enrichment at scale. The pre-training loss is expressed as:

L_\text{total}(\theta) = L_\text{MLM} + L_\text{DLM} + L_\text{SR} + L_\text{SD} + L_\text{UKTP}

Where:

  • L_\text{MLM}: Masked Language Modeling over masked tokens, spans, and entities (AE).
  • L_\text{DLM}: Document Language Modeling (AR), leveraging Transformer-XL's recurrence memory.
  • L_\text{SR}: Sentence Reordering, predicting the correct ordering among the m! permutations of m shuffled segments.
  • L_\text{SD}: Sentence Distance, a 3-way classification for inter- and intra-document relationships.
  • L_\text{UKTP}: Universal Knowledge–Text Prediction, combining relation classification from triples and masked word prediction constrained by knowledge graph alignment.

Special entity markers ([HD], [/HD], [TL], [/TL]) ensure that the model explicitly learns mappings between knowledge graph triples and their textual mentions. This knowledge-infused formulation contrasts with plain-text-only paradigms such as GPT-3 and T5.
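
As a rough illustration of how such a knowledge-text pair might be assembled for UKTP, the sketch below wraps a triple's head and tail entities in the markers mentioned above and concatenates it with the aligned sentence before masking. The marker tokens come from the paper; the serialization format, helper function, masking rate, and example sentence are assumptions made for this sketch, not the authors' pipeline.

```python
import random

def build_uktp_sample(head, relation, tail, sentence,
                      mask_rate=0.15, mask_token="[MASK]"):
    """Hypothetical construction of a Universal Knowledge-Text Prediction sample.

    The triple is serialized with the entity markers [HD]/[/HD] and [TL]/[/TL]
    and concatenated with the aligned sentence; a fraction of the sentence
    tokens is masked so the model must use the triple to recover them.
    """
    triple_text = f"[HD] {head} [/HD] {relation} [TL] {tail} [/TL]"
    tokens = sentence.split()
    masked = [mask_token if random.random() < mask_rate else tok for tok in tokens]
    return triple_text + " " + " ".join(masked)

# Example (English stand-in for the Chinese corpus):
print(build_uktp_sample("Andersen", "wrote", "The Little Match Girl",
                        "Andersen wrote The Little Match Girl in 1845."))
```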

3. Additional Training Strategies and Infrastructure

Progressive Learning

ERNIE 3.0 Titan employs progressive learning to accelerate convergence for very large-scale training. Critical hyperparameters—sequence length, batch size, learning rate, and dropout—are increased gradually during pre-training. In small/medium variants, this reduces convergence time by up to 65%, enabling practical training at 10B+ scale (Sun et al., 2021).
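
The short sketch below shows one way such a progressive schedule could be expressed: each regularization/optimization setting is ramped from a small starting value to its full value over a warm-up phase. The warm-up length and the start/end values are illustrative assumptions, not the paper's actual settings.

```python
def progressive_value(step, warmup_steps, start, end):
    """Linearly ramp a hyperparameter from `start` to `end` over the warm-up phase."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

def schedule(step, warmup=10_000):
    # Illustrative ranges only; not the published ERNIE 3.0 Titan values.
    return {
        "seq_len":       int(progressive_value(step, warmup, 128, 512)),
        "batch_size":    int(progressive_value(step, warmup, 256, 4096)),
        "learning_rate": progressive_value(step, warmup, 0.0, 1e-4),
        "dropout":       progressive_value(step, warmup, 0.0, 0.1),
    }

print(schedule(0))       # small, cheap settings early in training
print(schedule(10_000))  # full settings once the warm-up phase ends
```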

Controllable and Credible Generation (260B Scale)

At 260B scale, two auxiliary losses are introduced (Wang et al., 2021):

  • Self-supervised Adversarial Loss (L_\text{adv}): The model learns to distinguish between original and adversarially generated paragraphs, promoting generation credibility and penalizing low-quality samples.
  • Controllable Language Modeling Loss (L_\text{ctrl}): Attribute-conditioned, prompt-based training enables targeted control over output attributes (genre, topic, sentiment, length). Randomly omitting soft prompts during training ensures generalization (see the prompt-construction sketch below).
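
A minimal sketch of attribute-conditioned input construction with random prompt dropout follows. The bracketed prompt format, attribute names, and dropout probability are assumptions made for illustration; the actual model uses learned soft prompts rather than text tags.

```python
import random

def build_controllable_input(text, attributes, drop_prob=0.3):
    """Prepend attribute prompts (genre, topic, sentiment, length) to a training text.

    Each attribute prompt is independently dropped with probability `drop_prob`,
    so the model also learns to generate without any control signal present.
    """
    kept = {k: v for k, v in attributes.items() if random.random() > drop_prob}
    prompt = " ".join(f"[{k}={v}]" for k, v in sorted(kept.items()))
    return (prompt + " " + text).strip()

sample = build_controllable_input(
    "The market rallied after the announcement.",
    {"genre": "news", "topic": "finance", "sentiment": "positive", "length": "short"},
)
print(sample)  # e.g. "[genre=news] [length=short] ... The market rallied ..."
```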

Large-scale Infrastructure

Training and inference leverage PaddlePaddle's 4D hybrid parallelism, combining data-, tensor-, pipeline-, and sharded-parallel strategies. Resource-aware partitioning and static shape conversion exploit both GPU (V100) and NPU (Ascend 910) hardware, achieving 91.7% weak scaling and reduced energy consumption by optimizing memory and compute resources (Wang et al., 2021).
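
To make the 4D decomposition concrete, the framework-agnostic sketch below assigns a flat list of accelerator ranks to a data × tensor × pipeline × sharding grid. The degrees chosen are arbitrary examples, and this is not PaddlePaddle's actual fleet configuration API.

```python
import itertools

def build_4d_mesh(n_devices, dp, mp, pp, sharding):
    """Map device ranks onto a (data, tensor, pipeline, sharding) parallel grid."""
    assert dp * mp * pp * sharding == n_devices, "degrees must multiply to device count"
    mesh = {}
    for rank, coords in enumerate(itertools.product(range(dp), range(mp),
                                                    range(pp), range(sharding))):
        mesh[rank] = dict(zip(("dp", "mp", "pp", "sharding"), coords))
    return mesh

# 16 devices split as 2-way data, 2-way tensor, 2-way pipeline, 2-way sharded parallelism.
for rank, coords in build_4d_mesh(16, dp=2, mp=2, pp=2, sharding=2).items():
    print(rank, coords)
```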

4. Online Distillation and Green Training

ERNIE 3.0 Titan introduces an online, multi-student knowledge distillation strategy to mitigate the computational and environmental costs of deploying multi-hundred-billion parameter models:

  • On-the-Fly Distillation (OFD): Teacher and student models are updated in tandem, eliminating separate teacher-only inference stages.
  • Teacher Assistant (TA): An intermediate-capacity model (e.g., 24 layers, ∼10B parameters) bridges the gap between Titan and smaller downstream student models.
  • Auxiliary Layer Distillation (ALD): Student models have an auxiliary layer during distillation to ensure full gradient flow, which is later removed during fine-tuning or deployment.

This strategy reduces peak GPU/NPU usage time and cumulative carbon emissions. Empirical speedups are observed, with 12L-768H students reaching baseline latency and 6L-768H students running up to 2× faster than baseline (Wang et al., 2021).
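
The loop below sketches the on-the-fly idea under simplifying assumptions: a single student, soft-label distillation via KL divergence, and no teacher-assistant or auxiliary-layer stages. Function names, the placeholder teacher objective, and hyperparameters are invented for illustration and do not reproduce the paper's distillation recipe.

```python
import torch
import torch.nn.functional as F

def ofd_step(teacher, student, batch, t_opt, s_opt, temperature=2.0):
    """One on-the-fly distillation step: the teacher trains on its own objective,
    and the student distills from the teacher's logits produced in the same pass,
    so no separate teacher-only inference stage is needed."""
    # Teacher update on a placeholder token-prediction loss.
    t_logits = teacher(batch["input_ids"])
    t_loss = F.cross_entropy(t_logits.flatten(0, 1), batch["labels"].flatten())
    t_opt.zero_grad()
    t_loss.backward()
    t_opt.step()

    # Student distills from the (detached) teacher logits of the same batch.
    s_logits = student(batch["input_ids"])
    kd = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    s_opt.zero_grad()
    kd.backward()
    s_opt.step()
    return t_loss.item(), kd.item()
```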

5. Training Corpus and Preprocessing

Titan is trained on a ∼4 TB pre-training corpus comprising eleven categories, including Baidu Baike, Chinese Wikipedia, web text, user logs (Zhidao, Tieba, Baijiahao), news, novels, poetry, couplets, and domain-specific sources (medical, law, finance). An additional 0.7B-token Baidu knowledge graph with 50M fact triples provides structured world knowledge. For the adversarial and controllable objectives, specifically curated data pools with synthetic and attribute-annotated samples are utilized (Sun et al., 2021, Wang et al., 2021).

Data is subjected to:

  • Deduplication at character, paragraph, and document levels, with document-level fingerprints taken as MD5 hashes of each document's top-3 sentences (see the sketch after this list).
  • Filtering (minimum sentence length, word segmentation).
  • Corpus balancing via upsampling for underrepresented domains.
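
A simple version of the document-level fingerprinting step described above could look like the sketch below; the sentence splitter and exact hashing granularity are assumptions for illustration.

```python
import hashlib
import re

def doc_fingerprint(text, n_sentences=3):
    """MD5 hash of a document's first `n_sentences` sentences (naive splitter)."""
    sentences = re.split(r"(?<=[。！？.!?])\s*", text.strip())
    key = "".join(sentences[:n_sentences])
    return hashlib.md5(key.encode("utf-8")).hexdigest()

def deduplicate(documents):
    """Keep the first occurrence of each fingerprint, dropping duplicate documents."""
    seen, kept = set(), []
    for doc in documents:
        fp = doc_fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            kept.append(doc)
    return kept
```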

6. Empirical Results and Benchmarking

Performance Highlights:

Setting                 Metrics/Benchmarks               Results (Titan)                           Comparison
SuperGLUE (English)     Overall score                    90.6                                      +0.8 over human baseline (Sun et al., 2021)
Chinese News (TNEWS)    Zero-shot accuracy               68.40%                                    vs. 60.26% (PanGu-α-13B)
Semantic Similarity     Zero-shot accuracy (AFQMC)       68.99%                                    vs. 65.76%
Fine-tuning (Chinese)   54 Chinese tasks; NLI (OCNLI)    SOTA across 54 tasks; OCNLI: 82.75%       Outperforms prior models and human baseline
Few-shot                FewCLUE, multi-task              Titan surpasses mT5-XXL, Yuan 1.0, ERNIE 3.0   —
Zero-shot (Chinese)     CBQA (CKBQA-sub); Cloze (CHID)   22.84%; 86.21%                            Superior to GPT-3, CPM-1

Human evaluations across 467 zero-shot cases assign Titan scores of 1.69/1.53/1.09 (coherence/fluency/accuracy out of 2), exceeding both GPT-3 and state-of-the-art Chinese baselines by 0.3–0.6 (Wang et al., 2021).

7. Innovations and Future Research

ERNIE 3.0 Titan demonstrates the benefit of scaling knowledge-infused dense models for both NLU and NLG, integrating:

  • Structured and unstructured data through knowledge-augmented pre-training objectives.
  • Disentangled representations for dual-mode (auto-encoding/autoregressive) training.
  • Adversarial and controllable generation objectives for output quality and user control.
  • Environmental sustainability through efficient online distillation.

Future research directions target continual pre-training over richer structures (e.g., multi-modal content, code, or tables), advanced sparsity and routing for further efficiency scaling, enhanced controllable and factual generation, and fine-tuning distilled student models for edge and specialized tasks (Wang et al., 2021). These priorities position ERNIE 3.0 Titan as a paradigm for the next generation of foundation models with integrated knowledge and scalable engineering.
