CORE: Contrastive Relation Encoder
- CORE is a contrastive learning framework that integrates relational attention to unify object and entity representations in vision and language.
- It leverages auxiliary contrastive objectives to enhance instance and relation discrimination, significantly improving tasks like scene text detection and biomedical relation extraction.
- By embedding CORE modules into standard pipelines, researchers observe consistent gains in evaluation metrics such as Hmean and F1 across diverse benchmarks.
The COntrastive Relation Encoder (CORE) is a module and methodological framework that leverages contrastive learning to model relational structure in both vision and language. Originally developed in the context of scene text detection and relation extraction, CORE unifies object- and entity-level representation learning through auxiliary contrastive objectives that enforce instance- or relation-level discrimination. It yields instance-aware or relation-aware feature spaces, improving downstream tasks such as object detection in images with complex layouts and relation extraction from biomedical text (Lin et al., 2021; Theodoropoulos et al., 2021).
1. Relational Encoders: Architectural Fundamentals
CORE's central design integrates relation modeling with learned attention over pairs or sets of objects/entities. In computer vision (Lin et al., 2021), this is instantiated as a "vanilla relation block" inspired by Relation Networks, which augments each region proposal feature with a weighted sum over all other proposals:
Let $N$ denote the number of region proposals, with the $n$-th proposal described by an appearance feature $f_A^n$ and a geometric feature $f_G^n$ (bounding-box geometry). The block augments each proposal feature with concatenated relation features:

$$f_A^n \leftarrow f_A^n + \mathrm{concat}\big[f_R^n(1), \ldots, f_R^n(N_r)\big],$$

where $N_r$ is the number of relation heads, and each head's contribution is an attention-weighted sum over the other proposals:

$$f_R^n(r) = \sum_{m} w^{mn} \cdot \big(W_V \cdot f_A^m\big).$$

The attention weight is defined as

$$w^{mn} = \frac{w_G^{mn}\,\exp\!\big(w_A^{mn}\big)}{\sum_{k} w_G^{kn}\,\exp\!\big(w_A^{kn}\big)},$$

with the appearance affinity $w_A^{mn}$ parameterized as a scaled inner product of linear projections of $f_A^m$ and $f_A^n$, and the geometric affinity $w_G^{mn}$ as a learned transformation of the proposals' relative geometry. Analogous blocks occur in graph convolutional network (GCN) encoders for relation graphs in NLP (Theodoropoulos et al., 2021), where nodes correspond to entity mentions and GCN layers propagate context via a modified adjacency matrix.
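The relation block can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the `relation_head`/`relation_block` names, the log-domain geometric gate, and all shapes are assumptions, and the geometric gates `w_g` are taken as precomputed per-head matrices rather than derived from box coordinates.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_head(f_a, w_g, W_q, W_k, W_v):
    """One relation head: appearance attention gated by geometric affinity,
    normalized over all source proposals m for each target proposal n."""
    q = f_a @ W_q                                  # (N, d_k) queries
    k = f_a @ W_k                                  # (N, d_k) keys
    w_a = q @ k.T / np.sqrt(W_q.shape[1])          # w_a[n, m]: appearance affinity
    logits = w_a + np.log(np.maximum(w_g, 1e-6))   # gate so that w ∝ w_G · exp(w_A)
    w = softmax(logits, axis=-1)                   # normalize over source proposals m
    return w @ (f_a @ W_v)                         # (N, d_head) relation feature per proposal

def relation_block(f_a, geom_gates, heads):
    """Residual relation block: concatenate the head outputs and add to the input."""
    outs = [relation_head(f_a, g, W_q, W_k, W_v)
            for g, (W_q, W_k, W_v) in zip(geom_gates, heads)]
    return f_a + np.concatenate(outs, axis=-1)
```

With $N_r$ heads each projecting to $d/N_r$ dimensions, the concatenated output matches the input width, so the block acts as a drop-in residual refinement of the proposal features.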
2. Contrastive Learning for Instance and Relation Discrimination
Beyond relational attention, CORE enforces an auxiliary contrastive loss that structurally pulls together representations belonging to the same instance or relation and repels others:
- In scene text detection (Lin et al., 2021), proposals from the same text instance (the full text and its fragmented sub-texts) are labeled positive pairs, and proposals from different instances are negatives. Contrastive features $z_i$ are computed from proposal features by an MLP followed by $\ell_2$-normalization. The InfoNCE-style loss averages, for each anchor $i$, over its positive set $P(i)$:

$$\mathcal{L}_{\mathrm{InsCL}} = -\sum_i \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \neq i} \exp(z_i \cdot z_a / \tau)},$$

with temperature $\tau$. Minimizing $\mathcal{L}_{\mathrm{InsCL}}$ creates instance-aware embeddings.
- In relation extraction (Theodoropoulos et al., 2021), contrastive learning is applied between (i) sentence and subgraph pairs (CLGS), (ii) token pairs and relation-graph embeddings (CLDR), and (iii) token embeddings sharing the same entity type (CLNER). The SimCLR-style objectives ultimately structure embedding spaces so that positive (true) relations are tightly clustered and negatives are dispersed.
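A minimal NumPy sketch of a supervised InfoNCE objective of the kind described above follows; the function name and the exact positive-averaging convention are assumptions, not either paper's code.

```python
import numpy as np

def instance_contrastive_loss(z, labels, tau=0.1):
    """Supervised InfoNCE over contrastive features z (one row per proposal).
    Rows sharing an instance label are positives; all other rows are negatives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # l2-normalize features
    sim = z @ z.T / tau                                # temperature-scaled similarities
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue                                   # anchors without positives are skipped
        denom = sum(np.exp(sim[i, a]) for a in range(n) if a != i)
        for p in pos:
            loss += -np.log(np.exp(sim[i, p]) / denom)
            count += 1
    return loss / max(count, 1)
```

Tightly clustered same-instance features drive each positive's softmax ratio toward 1, so the loss falls as the embedding space becomes instance-aware.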
3. Integration into Downstream Pipelines
CORE modules are interleaved with standard pipelines to directly enhance end-task performance:
- In the visual domain (Lin et al., 2021), CORE is inserted in Mask R-CNN between the RPN and the box/classification heads, typically as a stack of two modules. The overall pipeline is:
- Input → Backbone + FPN → RPN → Proposal features → CORE modules → Refined features → Box/Mask heads.
- In language processing (Theodoropoulos et al., 2021), a character-aware BERT encoder (CharacterBERT) supplies initial embeddings; a GCN then encodes relation graphs on top. The contrastive objectives are integrated during fine-tuning phases; at inference, KNN classifiers operate directly on the learned relation or entity subspaces.
No additional post-processing such as instance linking is required in scene text detection, and minimal task-specific architecture modification is introduced in relation extraction.
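In the language setting, inference thus reduces to nearest-neighbor lookup in the learned subspace. A minimal cosine-similarity KNN classifier can be sketched as follows (a hypothetical helper for illustration, not the authors' code):

```python
import numpy as np
from collections import Counter

def knn_predict(train_emb, train_labels, query_emb, k=5):
    """Majority-vote KNN by cosine similarity over learned relation/entity embeddings."""
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    b = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sim = b @ a.T                                   # (Q, N) cosine similarities
    preds = []
    for row in sim:
        top = np.argsort(row)[-k:]                  # indices of the k nearest neighbors
        votes = Counter(train_labels[i] for i in top)
        preds.append(votes.most_common(1)[0][0])    # majority label among neighbors
    return preds
```

Because the contrastive objectives already cluster true relations tightly, even this parameter-free classifier performs competitively on the structured embedding space.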
4. Training Protocols & Hyperparameters
The CORE framework employs multi-phase or curriculum training schemes:
- For scene text detection (Lin et al., 2021):
- Warm-up: RPN loss plus the instance-level contrastive loss $\mathcal{L}_{\mathrm{InsCL}}$ for 40 epochs
- Fine-tuning: all Mask R-CNN detection losses plus $\mathcal{L}_{\mathrm{InsCL}}$, with the total loss a weighted sum of the detection and contrastive terms
- Backbone: ResNet-50 + FPN; optimizer: SGD, base LR 0.04, momentum 0.9
- For relation extraction (Theodoropoulos et al., 2021):
- Optimizer: Adam
- Batch size: 8 or 16 depending on module
- Temperature and self-loop hyperparameters selected via grid search
Negative and positive sampling for contrastive loss is carefully balanced to avoid class imbalance, especially relevant in biomedical text.
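Balanced sampling of the sort described can be sketched as below; this is a simplified illustration (hypothetical `sample_balanced_pairs` helper), as the actual sampling schemes are task-specific.

```python
import random

def sample_balanced_pairs(labels, n_pairs, seed=0):
    """Draw equal numbers of positive (same-label) and negative (different-label)
    index pairs for a contrastive objective, avoiding class imbalance."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    pos, neg = [], []
    while len(pos) < n_pairs or len(neg) < n_pairs:
        i, j = rng.sample(idx, 2)                  # two distinct indices
        if labels[i] == labels[j] and len(pos) < n_pairs:
            pos.append((i, j))
        elif labels[i] != labels[j] and len(neg) < n_pairs:
            neg.append((i, j))
    return pos, neg
```

Capping both lists at the same size keeps the contrastive loss from being dominated by the far more numerous negative pairs, which matters in biomedical corpora where true relations are sparse.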
5. Performance Analysis and Evaluation
Ablations and benchmarking consistently show absolute gains from the integration of the CORE module:
Scene Text Detection (ICDAR 2017 MLT val, Hmean):
| Model | Hmean | Sub-text Errors (IoU ∈ (0.1, 0.5), IoF > 0.7) |
|---|---|---|
| Base Mask R-CNN | 80.0 | 1,190 |
| + Relation Module | 81.1 | 923 |
| + CORE (Relation + InsCL) | 82.1 | 754 |
Final test Hmean improvements are reported on multiple datasets: ICDAR 2017 MLT, ICDAR 2015, CTW1500, and Total-Text (Lin et al., 2021).
Relation Extraction (ADE, macro-F1, 10-fold CV):
| Model | NER F1 | RE F1 | RE-only F1 |
|---|---|---|---|
| CharacterBERT + Lin RE Head | — | 66.8 | — |
| CharacterBERT_CLGS | — | 58.1 | — |
| CharacterBERT_CLDR | — | 81.7 | — |
| KNN end-to-end | 88.3 | 79.97 | 86.5 |
In both domains, embedding space structure is visually confirmed via t-SNE, revealing the disentanglement of relation and entity representations and the tight clustering of true relations or instances.
6. Addressing Fragmentation and Relation Modeling
CORE directly tackles sub-text and relation fragmentation through its integration of relational attention and contrastive objectives:
- In vision, fragmented scene texts (due to complex aspect ratios or occlusions) are "knitted" into coherent instance embeddings that reduce erroneous split detections—thereby boosting both precision and recall.
- In language, constraining sentence and graph (relation) embeddings via contrastive alignment results in highly discriminative, entity- and relation-specific subspaces.
In both cases, the explicit supervision on instance/relation structure substantially outperforms architectures lacking such relational or contrastive guidance.
7. Limitations and Prospects
Known limitations of current CORE instantiations include the risk of over-smoothing in graph pooling (CLGS), restriction to binary or two-node relation graphs (CLDR), and limited incorporation of external knowledge (e.g., UMLS) (Theodoropoulos et al., 2021). The simplicity of downstream classifiers (e.g., KNN) leaves open questions regarding integration with more complex neural inference mechanisms. Extending to higher-arity or multi-entity events, richer negative sampling, and continual contrastive learning are identified as active directions. A plausible implication is that CORE’s flexible, modular structure positions it as a generic relational encoder applicable to a broad spectrum of domains where entity or instance structure is weakly supervised but crucial for end-task performance.