ERNIE 3.0 Titan
- ERNIE 3.0 Titan is a large-scale, knowledge-enhanced Transformer model that integrates structured knowledge to strengthen both natural language understanding and generation.
- It employs a shared-bottom, task-specific-top architecture using Transformer-XL and progressive learning strategies to achieve robust scalability and efficiency.
- Innovations such as adversarial and controllable generation, online distillation, and green training promote faster convergence and reduced energy consumption.
ERNIE 3.0 Titan is a large-scale, knowledge-enhanced pre-trained LLM developed within the ERNIE 3.0 continual multi-paradigm framework. It advances the integration of structured knowledge into transformer-based LLMs, scaling from 10 billion to over 260 billion parameters. The model achieves strong results across a variety of Chinese natural language understanding (NLU) and generation (NLG) tasks, and incorporates novel architectural, algorithmic, and system-level innovations to improve knowledge incorporation, controllable generation, training efficiency, and scalability (Sun et al., 2021, Wang et al., 2021).
1. Model Architecture and Design
ERNIE 3.0 Titan employs a “shared-bottom, task-specific-top” architecture built on Transformer-XL. The Universal Representation Module (URM) encodes shared lexical and syntactic features through deep, recurrent self-attention. Two Task-specific Representation Modules (TRMs) operate on top of the URM to provide specialized representations for NLU and NLG tasks.
Architecture parameters:
| Module | Layers | Hidden Size | Heads | Seq Len | Mem Len | Total Parameters |
|---|---|---|---|---|---|---|
| Universal Representation | 48 | 12,288 | 192 | 512 | 128 | > 260B (largest model) |
| Task-specific NLU branch | 12 | 768 | 12 | 512 | — | — |
| Task-specific NLG branch | 12 | 768 | 12 | 512 | 128 | — |
For the 10B-parameter scale (Sun et al., 2021), the URM uses a hidden size of 4096 with 64 heads; at 260B scale (Wang et al., 2021), the URM is expanded to a hidden size of 12,288 with 192 heads. TRMs remain at a manageable 12 layers to facilitate downstream fine-tuning. The NLU branch uses bi-directional attention, while the NLG branch employs uni-directional attention with recurrence memory. This modularization allows for disentangled optimization of NLU and NLG objectives in continual multi-paradigm pre-training.
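As a rough illustration of this layout, the sketch below stacks a deep shared encoder under two shallow task branches; the module names, toy dimensions, and simplified masking are assumptions, and Transformer-XL recurrence memory is omitted for brevity.

```python
import torch
import torch.nn as nn

class SharedBottomTaskTop(nn.Module):
    """Toy shared-bottom / task-specific-top layout (illustrative, not the actual Titan code).

    Paper-scale URM: 48 layers, hidden 12,288, 192 heads; TRMs: 12 layers, hidden 768.
    Toy dimensions are used here so the sketch runs on commodity hardware.
    """

    def __init__(self, vocab_size=30_000, d_univ=256, n_univ_layers=4, n_univ_heads=8,
                 d_task=128, n_task_layers=2, n_task_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_univ)
        # Universal Representation Module: deep shared encoder (recurrence memory omitted).
        univ_layer = nn.TransformerEncoderLayer(d_univ, n_univ_heads, batch_first=True)
        self.universal = nn.TransformerEncoder(univ_layer, n_univ_layers)
        # Project shared features into the narrower task-specific width.
        self.proj = nn.Linear(d_univ, d_task)
        task_layer = nn.TransformerEncoderLayer(d_task, n_task_heads, batch_first=True)
        self.nlu_branch = nn.TransformerEncoder(task_layer, n_task_layers)  # bi-directional
        self.nlg_branch = nn.TransformerEncoder(task_layer, n_task_layers)  # causal (masked below)

    def forward(self, token_ids, mode="nlu"):
        h = self.proj(self.universal(self.embed(token_ids)))
        if mode == "nlu":
            return self.nlu_branch(h)  # full bi-directional attention
        causal = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        return self.nlg_branch(h, mask=causal)  # uni-directional attention for generation

tokens = torch.randint(0, 30_000, (2, 16))
print(SharedBottomTaskTop()(tokens, mode="nlg").shape)  # torch.Size([2, 16, 128])
```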
2. Knowledge-Enhanced Pre-training Objectives
ERNIE 3.0 Titan interleaves several pre-training tasks, unifying auto-encoding and auto-regressive objectives while introducing knowledge enrichment at scale. The pre-training loss is the sum of the task losses:

$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{DLM}} + \mathcal{L}_{\text{SR}} + \mathcal{L}_{\text{SD}} + \mathcal{L}_{\text{UKTP}}$$

Where:
- $\mathcal{L}_{\text{MLM}}$: Masked Language Modeling over masked tokens, spans, and entities (AE).
- $\mathcal{L}_{\text{DLM}}$: Document Language Modeling (AR), leveraging Transformer-XL's recurrence memory.
- $\mathcal{L}_{\text{SR}}$: Sentence Reordering, predicting the correct ordering among permuted segments.
- $\mathcal{L}_{\text{SD}}$: Sentence Distance, a 3-way classification for inter- and intra-document relationships.
- $\mathcal{L}_{\text{UKTP}}$: Universal Knowledge–Text Prediction, combining relation classification from triples and masked word prediction constrained by knowledge graph alignment.
Special entity markers ([HD], [/HD], [TL], [/TL]) ensure that the model explicitly learns mappings between knowledge graph triples and their textual mentions. This knowledge-infused formulation contrasts with plain-text-only paradigms such as GPT-3 and T5.
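A minimal sketch of how a UKTP example might pair a knowledge-graph triple with its textual mention using these entity markers; the whitespace tokenization, marker placement around the triple, and masking policy are simplifying assumptions rather than the paper's exact preprocessing.

```python
import random

def build_uktp_example(triple, sentence, mask_token="[MASK]", mask_prob=0.15):
    """Build a Universal Knowledge-Text Prediction input from a triple and a sentence.

    The head and tail entities are wrapped in [HD]/[TL] markers so the model can
    align the structured triple with its mention in free text; sentence words are
    masked at random, but the aligned entities are kept visible.
    """
    head, relation, tail = triple
    triple_text = f"[HD] {head} [/HD] {relation} [TL] {tail} [/TL]"
    words = sentence.split()
    masked = [mask_token if (w not in (head, tail) and random.random() < mask_prob) else w
              for w in words]
    return f"{triple_text} [SEP] {' '.join(masked)}"

example = build_uktp_example(("李白", "朝代", "唐朝"), "李白 是 唐朝 著名 的 浪漫主义 诗人 。")
print(example)
```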
3. Additional Training Strategies and Infrastructure
Progressive Learning
ERNIE 3.0 Titan employs progressive learning to accelerate convergence for very large-scale training. Critical hyperparameters—sequence length, batch size, learning rate, and dropout—are increased gradually during pre-training. In small/medium variants, this reduces convergence time by up to 65%, enabling practical training at 10B+ scale (Sun et al., 2021).
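The idea can be sketched as a simple schedule that ramps these hyperparameters from small warm-up values to their targets; the linear ramp and the specific start/end values below are illustrative assumptions, not the published schedule.

```python
def progressive_schedule(step, warmup_steps,
                         seq_len=(128, 512), batch_size=(512, 8192),
                         lr=(1e-5, 1e-4), dropout=(0.0, 0.1)):
    """Linearly increase sequence length, batch size, learning rate, and dropout
    during warm-up, then hold each at its target value for the rest of training."""
    t = min(step / warmup_steps, 1.0)
    ramp = lambda lo_hi: lo_hi[0] + t * (lo_hi[1] - lo_hi[0])
    return {
        "seq_len": int(ramp(seq_len)),
        "batch_size": int(ramp(batch_size)),
        "learning_rate": ramp(lr),
        "dropout": ramp(dropout),
    }

# Hyperparameters a quarter of the way through warm-up.
print(progressive_schedule(step=2_500, warmup_steps=10_000))
```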
Controllable and Credible Generation (260B Scale)
At 260B scale, two auxiliary losses are introduced (Wang et al., 2021):
- Self-supervised Adversarial Loss: The model learns to distinguish original paragraphs from adversarially generated ones, promoting generation credibility and penalizing low-quality samples.
- Controllable Language Modeling Loss: Attribute-conditioned, prompt-based training enables targeted control over output attributes (genre, topic, sentiment, length). Randomly omitting soft prompts during training ensures the model also generalizes to unprompted generation; a sketch of this prompt-dropout idea follows this list.
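A toy sketch of attribute-conditioned inputs with random prompt dropout; the control-token format, attribute set, and keep probability are illustrative assumptions, not the paper's soft-prompt implementation.

```python
import random

# Hypothetical control-token templates for illustration only.
ATTRIBUTE_TOKENS = {
    "genre":     "[GENRE={}]",
    "topic":     "[TOPIC={}]",
    "sentiment": "[SENT={}]",
    "length":    "[LEN={}]",
}

def build_controllable_input(text, attributes, keep_prob=0.5):
    """Prefix the sample with attribute prompts; each prompt is independently dropped
    with probability 1 - keep_prob so the model also learns to generate without them."""
    prompts = [ATTRIBUTE_TOKENS[name].format(value)
               for name, value in attributes.items()
               if random.random() < keep_prob]
    return " ".join(prompts + [text])

print(build_controllable_input("秋雨敲窗，灯火可亲。",
                               {"genre": "poetry", "sentiment": "calm", "length": "short"}))
```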
Large-scale Infrastructure
Training and inference leverage PaddlePaddle's 4D hybrid parallelism, combining data-, tensor-, pipeline-, and sharded-parallel strategies. Resource-aware partitioning and static shape conversion exploit both GPU (V100) and NPU (Ascend 910) hardware, achieving 91.7% weak scaling and reduced energy consumption by optimizing memory and compute resources (Wang et al., 2021).
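A minimal sketch of what a 4D hybrid-parallel setup looks like with PaddlePaddle's fleet API, assuming the `hybrid_configs` interface; the parallelism degrees below are placeholders (their product must equal the number of devices), not the configuration used for Titan.

```python
import paddle.distributed.fleet as fleet

# Illustrative 4D hybrid-parallel configuration: data, tensor (intra-layer model),
# pipeline (inter-layer), and sharded parallelism. The product of the four degrees
# must match the number of participating GPUs/NPUs.
strategy = fleet.DistributedStrategy()
strategy.hybrid_configs = {
    "dp_degree": 4,        # data parallelism
    "mp_degree": 8,        # tensor (model) parallelism
    "pp_degree": 4,        # pipeline parallelism
    "sharding_degree": 2,  # optimizer-state / parameter sharding
}

# Launched under `python -m paddle.distributed.launch ...` on the target cluster.
fleet.init(is_collective=True, strategy=strategy)
```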
4. Online Distillation and Green Training
ERNIE 3.0 Titan introduces an online, multi-student knowledge distillation strategy to mitigate the computational and environmental costs of deploying multi-hundred-billion parameter models:
- On-the-Fly Distillation (OFD): Teacher and student models are updated in tandem, eliminating separate teacher-only inference stages.
- Teacher Assistant (TA): An intermediate-capacity model (e.g., 24 layers, ∼10B parameters) bridges the gap between Titan and smaller downstream student models.
- Auxiliary Layer Distillation (ALD): Student models have an auxiliary layer during distillation to ensure full gradient flow, which is later removed during fine-tuning or deployment.
This strategy reduces peak GPU/NPU usage time and cumulative carbon emissions. Empirical speedups are observed, with 12L-768H students reaching baseline latency and 6L-768H students running up to 2× faster than baseline (Wang et al., 2021).
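A simplified sketch of a single on-the-fly distillation step, where the teacher and a student are updated in tandem and the student reuses the teacher's freshly computed predictions; the `(logits, loss)` model interface, logit matching via KL divergence, and the loss weighting are assumptions for illustration.

```python
import torch.nn.functional as F

def ofd_step(teacher, student, batch, teacher_opt, student_opt,
             temperature=2.0, alpha=0.5):
    """One on-the-fly distillation step: the teacher trains on its own objective,
    and the student immediately distills from the teacher's detached predictions,
    so no separate teacher-only inference stage is required."""
    # Teacher update on the pre-training objective.
    t_logits, t_loss = teacher(batch)           # assumed to return (logits, loss)
    teacher_opt.zero_grad(); t_loss.backward(); teacher_opt.step()

    # Student update: its own loss plus distillation from the teacher's logits.
    s_logits, s_loss = student(batch)
    kd_loss = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                       F.softmax(t_logits.detach() / temperature, dim=-1),
                       reduction="batchmean") * temperature ** 2
    total = alpha * s_loss + (1 - alpha) * kd_loss
    student_opt.zero_grad(); total.backward(); student_opt.step()
    return t_loss.item(), total.item()
```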
5. Training Corpus and Preprocessing
Titan is trained on a ∼4 TB pre-training corpus comprising eleven categories, including Baidu Baike, Chinese Wikipedia, web text, user logs (Zhidao, Tieba, Baijiahao), news, novels, poetry, couplets, and domain-specific sources (medical, law, finance). An additional 0.7B-token Baidu knowledge graph with 50M fact triples provides structured world knowledge. For the adversarial and controllable objectives, specifically curated data pools with synthetic and attribute-annotated samples are utilized (Sun et al., 2021, Wang et al., 2021).
Data is subjected to:
- Deduplication at character, paragraph, and document levels; document-level duplicates are identified via MD5 hashes of each document's top-3 sentences (see the sketch after this list).
- Filtering (minimum sentence length, word segmentation).
- Corpus balancing via upsampling for underrepresented domains.
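A minimal sketch of the document-level deduplication step, hashing each document's first three sentences with MD5; the sentence splitting and hash-key construction here are simplified assumptions.

```python
import hashlib
import re

def dedup_documents(documents):
    """Keep only the first document among those whose top-3 sentences share an MD5 hash."""
    seen, kept = set(), []
    for doc in documents:
        sentences = [s for s in re.split(r"[。！？.!?]", doc) if s.strip()]
        key = hashlib.md5("".join(sentences[:3]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["第一句。第二句。第三句。后续内容甲。",
        "第一句。第二句。第三句。后续内容乙。"]
print(len(dedup_documents(docs)))  # -> 1: both documents share the same top-3 sentences
```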
6. Empirical Results and Benchmarking
Performance Highlights:
| Setting | Metrics/Benchmarks | Results (Titan) | Comparison |
|---|---|---|---|
| SuperGLUE (English) | Overall score | 90.6 | +0.8 over human baseline (Sun et al., 2021) |
| Chinese news classification (TNEWS) | Zero-shot accuracy | 68.40% | vs. 60.26% (PanGu-α-13B) |
| Semantic similarity (AFQMC) | Zero-shot accuracy | 68.99% | vs. 65.76% |
| Fine-tuning (Chinese) | 54 NLU/NLG tasks; NLI (OCNLI) | SOTA across 54 tasks; 82.75% on OCNLI | Outperforms prior models and human baseline |
| Few-shot | FewCLUE, multi-task | — | Surpasses mT5-XXL, Yuan 1.0, ERNIE 3.0 |
| Zero-shot (Chinese) | CBQA (CKBQA-sub); Cloze (CHID) | 22.84% (CKBQA-sub); 86.21% (CHID) | Superior to GPT-3, CPM-1 |
Human evaluations across 467 zero-shot cases assign Titan scores of 1.69/1.53/1.09 (coherence/fluency/accuracy out of 2), exceeding both GPT-3 and state-of-the-art Chinese baselines by 0.3–0.6 (Wang et al., 2021).
7. Innovations and Future Research
ERNIE 3.0 Titan demonstrates the benefit of scaling knowledge-infused dense models for both NLU and NLG, integrating:
- Structured and unstructured data through knowledge-augmented pre-training objectives.
- Disentangled representations for dual-mode (auto-encoding/autoregressive) training.
- Adversarial and controllable generation objectives for output quality and user control.
- Environmental sustainability through efficient online distillation.
Future research directions target continual pre-training over richer structures (e.g., multi-modal content, code, or tables), advanced sparsity and routing for further efficiency scaling, enhanced controllable and factual generation, and fine-tuning distilled student models for edge and specialized tasks (Wang et al., 2021). These priorities position ERNIE 3.0 Titan as a paradigm for the next generation of foundation models with integrated knowledge and scalable engineering.