TITAN Foundation Model Overview

Updated 12 December 2025
  • TITAN Foundation Model is a dual-architecture system, featuring a 260B-parameter Chinese NLP model (ERNIE 3.0 Titan) and a multimodal pathology model that aligns image and text data.
  • It employs advanced self-supervised pretraining techniques, including adversarial and controllable losses for NLP and contrastive losses for image-text alignment in computational pathology.
  • Empirical results demonstrate significant gains in language understanding, rare cancer retrieval, and cross-modal report generation, underscoring its clinical and research potential.

TITAN Foundation Model refers to two distinct large-scale architectures developed for advanced representation learning: (1) ERNIE 3.0 Titan, a dense Chinese LLM scaling knowledge-enhanced pretraining to 260 billion parameters for NLP, and (2) TITAN for computational pathology, a multimodal foundation model aligning whole slide images and clinical text for histopathologic analysis. Both leverage massive data, sophisticated pretraining objectives, and scalable architectures to establish state-of-the-art performance within their domains (Wang et al., 2021, Ding et al., 29 Nov 2024).

1. Model Architectures and Modalities

ERNIE 3.0 Titan (NLP)

ERNIE 3.0 Titan (“Titan”) is an extension of ERNIE 3.0, expanding from ∼10 billion to ≈260 billion parameters, making it the largest dense Chinese pre-trained LLM to date. The architecture includes:

  • Universal Representation Module: 48 Transformer-XL layers; hidden dimension d = 12,288; h = 192 attention heads; FFN inner dimension d_ff = 196,608.
  • Task-Specific Modules: 12 Transformer-XL layers each for NLU and NLG; d = 768, h = 12, d_ff = 3,072.
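
As a rough consistency check on these dimensions, a back-of-envelope count of the universal representation module's weight matrices alone (ignoring embeddings, biases, layer norms, relative-position terms, and the task-specific branches) already lands near the reported 260B scale. The sketch below uses only the standard Transformer parameter formulas and is illustrative, not a reproduction of the official count.

```python
# Rough parameter count for the universal representation module, using the
# dimensions quoted above. Embeddings, biases, layer norms, relative-position
# parameters, and the task-specific branches are deliberately ignored.
layers = 48
d_model = 12_288
d_ff = 196_608

attn_per_layer = 4 * d_model * d_model   # Q, K, V, and output projection matrices
ffn_per_layer = 2 * d_model * d_ff       # two feed-forward projection matrices

total = layers * (attn_per_layer + ffn_per_layer)
print(f"~{total / 1e9:.0f}B parameters")  # ~261B, consistent with the 260B figure
```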

TITAN (Computational Pathology)

TITAN (“Transformer‐based pathology Image and Text Alignment Network”) is a multimodal foundation model with:

  • Input Modalities: Pre-extracted region-of-interest (ROI) image features, synthetic ROI captions, and clinical pathology reports.
  • Patch Encoder: CONCHv1.5 processes 512×512 px tissue patches at 20×, outputting 768-dimensional embeddings.
  • Slide Encoder: 6-layer ViT-style Transformer; d = 768, 12 heads; uses 2D-ALiBi positional encoding for scalable context.
  • Text Modules (CoCa): 12-layer Transformer encoder/decoder for clinical reports and generated captions; 12 heads, embedding dimensions d = 768–3,072.
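
The 2D-ALiBi positional encoding is what lets the slide encoder scale from training-time crops to whole-slide context. Below is a minimal sketch of the idea; it assumes the bias is a per-head slope applied to the Euclidean distance between patch grid coordinates, with geometric slopes as in 1D ALiBi, which may differ in detail from TITAN's exact formulation.

```python
import torch

def alibi_2d_bias(coords: torch.Tensor, num_heads: int = 12) -> torch.Tensor:
    """Additive attention bias for patch tokens laid out on a 2D grid.

    coords: (N, 2) integer grid positions of the N patch tokens.
    Returns: (num_heads, N, N) bias added to the attention logits.
    Sketch only: assumes Euclidean grid distance and geometric per-head slopes
    (1/2, 1/4, ...) as in 1D ALiBi; TITAN's exact variant may differ.
    """
    dist = torch.cdist(coords.float(), coords.float())         # (N, N) pairwise distances
    slopes = 2.0 ** (-torch.arange(1, num_heads + 1).float())  # one slope per attention head
    return -slopes.view(num_heads, 1, 1) * dist                # farther patches -> larger penalty

# Example: a 4x4 grid of patch tokens
ys, xs = torch.meshgrid(torch.arange(4), torch.arange(4), indexing="ij")
coords = torch.stack([ys.flatten(), xs.flatten()], dim=1)      # (16, 2)
bias = alibi_2d_bias(coords)                                   # (12, 16, 16), added to attention scores
```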

2. Pretraining Paradigms and Losses

ERNIE 3.0 Titan

Employs multi-paradigm pretraining encompassing word-aware, structure-aware, and knowledge-aware objectives with additional credibility and controllability losses.

  • Self-Supervised Adversarial Loss: For a dataset D_a composed of original and model-generated text, a binary classifier over the [CLS] representation h_[CLS] distinguishes real from generated samples:

$L_{adv}(D_a) = -\sum_{n=1}^{|D_a|} \log P_\theta \big( y^n = \mathbb{1}_{h_{[CLS]}^n \in D_{original}} \,\big|\, h_{[CLS]}^n \big)$

  • Controllable LM Loss: Soft prompts encode genre, topic, sentiment, keywords, and length; the model alternates between the plain and prompt-conditioned LM objectives with p = 0.5:

$L_{ctrl}(D_c) = \begin{cases} -\sum \log P_\theta(x_t^n \mid x_{<t}^n), & p \leq 0.5 \\ -\sum \log P_\theta(x_t^n \mid x_{<t}^n, \text{prompts}^n), & p > 0.5 \end{cases}$

  • Total Objective:

$L_{total} = L_{word} + L_{structure} + L_{KG} + \alpha L_{adv} + \beta L_{ctrl}$

with α and β as balancing hyperparameters.
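
A compact, self-contained sketch of how these auxiliary objectives could be combined with the main LM objective is shown below. It is illustrative only: the tiny embedding-plus-linear "language model", the classifier head on h_[CLS], the learned soft prompts, and the weights alpha and beta are toy stand-ins, and only the loss bookkeeping mirrors the formulas above.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for illustration; ERNIE 3.0 Titan's actual modules are Transformer-XL stacks.
torch.manual_seed(0)
vocab, d = 1000, 64
embed = torch.nn.Embedding(vocab, d)
lm_head = torch.nn.Linear(d, vocab)                    # toy LM: embed context -> predict next token
adv_head = torch.nn.Linear(d, 2)                       # binary original-vs-generated head on h_[CLS]
soft_prompt = torch.nn.Parameter(torch.randn(4, d))    # learned control prompts (genre, topic, ...)

def lm_loss(token_ids, prompt_emb=None):
    """Toy next-token cross-entropy; optionally conditions on prepended soft prompts."""
    h = embed(token_ids[:, :-1])                       # (B, T-1, d) context embeddings
    if prompt_emb is not None:
        h = torch.cat([prompt_emb.expand(h.size(0), -1, -1), h], dim=1)
    logits = lm_head(h)[:, -(token_ids.size(1) - 1):]  # keep positions that predict text tokens
    return F.cross_entropy(logits.reshape(-1, vocab), token_ids[:, 1:].reshape(-1))

def adversarial_loss(h_cls, is_original):
    """L_adv: cross-entropy distinguishing original from model-generated samples via h_[CLS]."""
    return F.cross_entropy(adv_head(h_cls), is_original)

def controllable_loss(token_ids):
    """L_ctrl: with probability 0.5 train the plain LM, otherwise condition on soft prompts."""
    use_prompts = bool(torch.rand(()) > 0.5)
    return lm_loss(token_ids, soft_prompt if use_prompts else None)

# Toy batch and weighted total objective; the first term stands in for L_word + L_structure + L_KG.
tokens = torch.randint(0, vocab, (8, 16))
h_cls = torch.randn(8, d)
labels = torch.randint(0, 2, (8,))                     # 1 = original text, 0 = generated text
alpha, beta = 0.1, 0.1                                 # illustrative weights only
total = lm_loss(tokens) + alpha * adversarial_loss(h_cls, labels) + beta * controllable_loss(tokens)
print(float(total))
```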

TITAN (Pathology)

Three-stage pretraining:

  1. Vision-Only SSL: iBOT-based two-view contrastive distillation with masked-feature prediction:

$L_{ssl} = \frac{1}{N} \sum_i -\log \frac{\exp(\text{sim}(z_i, z_i^+)/\tau)}{\sum_j \exp(\text{sim}(z_i, z_j)/\tau)}$

  2. ROI–Caption Contrastive + Caption Generation: InfoNCE contrastive and cross-entropy losses over ⟨v_i, t_i⟩ pairs.
  3. WSI–Report Contrastive Alignment: Aligns slide representations to report embeddings for global context:

$L_{vla} = \frac{1}{N} \sum_i \left[ -\log \frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_j \exp(\text{sim}(v_i, t_j)/\tau)} - \log \frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_j \exp(\text{sim}(v_j, t_i)/\tau)} \right]$
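
These alignment stages reduce to symmetric InfoNCE objectives over matched pairs. A minimal PyTorch sketch of the slide–report term, assuming pre-computed slide and report embeddings, is given below; the ROI–caption stage has the same contrastive form with an added captioning cross-entropy.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(v, t, tau: float = 0.07):
    """Symmetric contrastive loss over matched slide (v) and report (t) embeddings.

    v, t: (N, d) embeddings where row i of v is paired with row i of t.
    Mirrors the L_vla formula above with cosine similarity and temperature tau; sketch only.
    """
    v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
    logits = v @ t.T / tau                           # (N, N) similarity matrix
    targets = torch.arange(v.size(0))                # matched pairs sit on the diagonal
    loss_v2t = F.cross_entropy(logits, targets)      # slide -> report direction
    loss_t2v = F.cross_entropy(logits.T, targets)    # report -> slide direction
    return loss_v2t + loss_t2v

# Toy usage with random embeddings standing in for slide and report features
slides, reports = torch.randn(16, 768), torch.randn(16, 768)
print(float(symmetric_infonce(slides, reports)))
```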

3. Training Data and Computational Considerations

ERNIE 3.0 Titan

  • Textual Data: 4 TB from 11 Chinese domains (web, QA, news, legal, finance, medicine, novels, poetry, MT, etc.).
  • Auxiliary Data: Knowledge graph triples; ∼2 million adversarial text samples (real + generated); soft-prompted corpus for control attributes.
  • Hardware: PaddlePaddle on hybrid Nvidia V100 GPU/Huawei Ascend 910 NPU clusters. Utilizes 4-D hybrid parallelism (data, tensor, pipeline, ZeRO/“Group Sharded”) with resource-aware kernel partitioning, static-shape execution, fault tolerance, and mixed precision.

TITAN

  • Slides and Text: 335,645 whole slide images, 423,122 synthetic ROI captions (generated with PathChat+Qwen2), 182,862 slide-level clinical reports.
  • Augmentation: Diverse ROI feature crops (8K×8K px), K-means patch selection (see the sketch after this list), data quality spot-checked by pathologists.
  • Compute: Pretraining completed with 4×80 GB A100 GPUs (vision) and 8×80 GB A100 GPUs (vision-language).
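
The K-means patch selection mentioned above could, in a simple form, cluster patch embeddings and keep the patch nearest each centroid as a compact representative set. The sketch below illustrates that idea only and is not TITAN's documented pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representative_patches(patch_features: np.ndarray, k: int = 64) -> np.ndarray:
    """Pick k representative patch indices by clustering patch embeddings.

    patch_features: (N, d) pre-extracted patch embeddings (e.g., from a patch encoder).
    Illustrative sketch only: keeps the patch closest to each K-means centroid.
    """
    km = KMeans(n_clusters=k, random_state=0).fit(patch_features)
    dists = km.transform(patch_features)   # (N, k) distance of each patch to each centroid
    return dists.argmin(axis=0)            # index of the nearest patch for every centroid

# Toy usage: 2,000 patches with 768-dimensional features
feats = np.random.randn(2000, 768).astype(np.float32)
keep = select_representative_patches(feats, k=64)   # 64 patch indices to retain
```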

4. Distillation, Scalability, and Ablation

ERNIE 3.0 Titan: Online Multi-Student Distillation

  • On-the-Fly Distillation (OFD): Students update parameters toward the simultaneously trained teacher, eliminating redundant teacher-only passes.
  • Teacher Assistant: 24-layer intermediate Transformer mediates between Titan and compact student models.
  • Auxiliary Layer Distillation: Additional Transformer layer appended to each student ensures end-to-end gradient flow; dropped at fine-tuning.
  • Distillation Loss: Minimizes KL divergence over softmaxed multi-head attention matrices between teacher (A^T) and student (A^S):

$L_{KL} = \sum_{l,a} \mathrm{KL}\left(\text{softmax}(A_{l,a}^{T}/\tau) \,\|\, \text{softmax}(A_{l,a}^{S}/\tau)\right)$
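
The sketch below illustrates this attention-based distillation term, assuming raw (pre-softmax) attention scores are available from already-aligned teacher and student layers; the teacher-assistant and auxiliary-layer machinery is omitted.

```python
import torch
import torch.nn.functional as F

def attention_kl(teacher_scores, student_scores, tau: float = 1.0):
    """KL(teacher || student) over temperature-softened attention distributions (L_KL).

    teacher_scores, student_scores: (layers, heads, T, T) raw attention logits,
    assumed already aligned between teacher and student. Sketch only.
    """
    p_teacher = F.softmax(teacher_scores / tau, dim=-1)
    log_p_student = F.log_softmax(student_scores / tau, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)  # KL per (layer, head, query)
    return kl.mean(dim=-1).sum()   # average over query positions, sum over layers and heads

# Toy usage: 2 aligned layers, 12 heads, sequence length 32
t_scores, s_scores = torch.randn(2, 12, 32, 32), torch.randn(2, 12, 32, 32)
print(float(attention_kl(t_scores, s_scores, tau=2.0)))
```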

TITAN: Pretraining Strategies and Ablation

  • Stage Scaling: Progressive improvements observed from increasing pretraining set size (from 12.5% to 100%).
  • Positional Encoding: 2D-ALiBi enables slide-scale context generalization, yielding minor but consistent improvements over absolute positional encoding.
  • Architecture Tuning: 6-layer slide encoder balances parameter efficiency and accuracy.
  • Ablated Modalities: Addition of ROI-caption and report alignment yields marginal but measurable gains beyond vision-only baseline.

5. Empirical Evaluation and Performance

ERNIE 3.0 Titan

Benchmarked on 68 Chinese datasets for fine-tuning, few-shot, and zero-shot tasks.

Fine-tuning Results (Selected Examples):

Task                | Dataset      | SoTA      | Prior ERNIE 3.0 | Titan (260B)
Sentiment           | NLPCC2014-SC | 83.53     | 86.00           | 86.32
MRC (EM/F1)         | DRCD         | 90.9/95.3 | 91.41/95.8      | 92.16/96.3
Text Classification | THUCNEWS     | 97.6      | 98.66           | 98.70

Few-shot (FewCLUE):

Titan shows gains over mT5-XXL, Yuan-13B, and its 10B predecessor, e.g., 65.2% (TC, TNEWS-FC), 90.6% (WSC, CLUEWSC-FC).

Zero-shot:

Substantial improvements in balanced accuracy, e.g., 72.6% (TNEWS), 81.1% (WSC, CLUEWSC), 45.3/52.6% (WebQA, EM/F1).

Generation (human evaluation, 467 cases):

Titan leads all compared models in coherence (1.69), fluency (1.53), and accuracy (1.09).

TITAN

Benchmarked against baselines (PRISM, GigaPath, CHIEF) and vision-only variant (TITAN₀) on linear probing, few-shot, zero-shot, rare cancer retrieval, cross-modal retrieval, and report generation tasks.

Encoder | Linear-Probe Balanced Acc. (UT-8K) | Few-shot Acc. (1- / 16-shot) | Zero-shot Acc. (UT-8K)
PRISM   | 77.4 ± 0.62                        | 52.0 / 72.5                  | 53.6 ± 1.0
TITAN₀  | 82.0 ± 0.56                        | 63.4 / 81.4                  | 76.1 ± 0.6
TITAN   | 83.2 ± 0.56                        | 68.3 / 83.0                  | 76.1 ± 0.6

Rare Cancer Retrieval (Acc@1, 43 rare/143 common): TITAN: 52.0 ± 1.8%; PRISM: 44.9 ± 1.4%.

Cross-Modal Retrieval (TCGA slides–reports): TITAN achieves Slide→Report R@1: 75.2%; Report→Slide R@1: 78.4%.

Report Generation: TITAN (METEOR/ROUGE-1/BLEU-1) = 19.8/47.6/24.6 vs. PRISM = 7.6/35.2/13.4.

6. Analysis, Insights, and Clinical Implications

ERNIE 3.0 Titan

  • Parameter Scaling: Scaling from 10B to 260B consistently improves NLU/NLG performance, especially on open-book QA and commonsense reasoning.
  • Self-supervised Adversarial Loss: Accelerates MLM convergence and improves output credibility by training the model to score individual samples for genuineness.
  • Controllable Loss: Genre prompts and length control enable output calibration (e.g., literary cloze task). Mixed-prompt sampling (p = 0.5) prevents over-dependence on prompts.

TITAN

  • Model Versatility: Large-scale ROI-SSL with multimodal vision-language alignment enables strong generalization to rare disease retrieval, zero-shot diagnosis, and report generation, without finetuning.
  • Synthetic Captions: Multimodal captioning pipeline (PathChat + Qwen2) diversifies training signals and provides granular semantic labels in the absence of exhaustive clinical annotation.
  • Positional Context: 2D-ALiBi encoding facilitates generalization from train-time crops to full-scale WSI inference.
  • Clinical Use: TITAN supports few-shot adaptation, retrieval-based case comparison, and slide-level prognostic modeling (Cox c-index avg. ~0.71 vs. 0.68 in baselines).
  • Limitations: Training size (340K slides) remains below multi-billion-patch paradigms; fixed-size ROI sampling may limit ultra-global context capture; manual prompt engineering needed for morphology-only reports.

7. Future Directions

For ERNIE 3.0 Titan, proposed research includes further scaling, refinement of online distillation for compact model variants, and enhanced control mechanisms within large Chinese LLMs.

For TITAN in computational pathology, scaling to “infinite” synthetic captions, advanced 2D positional encodings for true WSI context, automated report structuring, and larger backbone exploration constitute ongoing priorities. A plausible implication is that TITAN’s design could extend to other resource-limited multimodal medical domains leveraging similar self-supervision and synthetic augmentation strategies (Ding et al., 29 Nov 2024).
