
Domain-Aware Japanese Text Embedding

Updated 26 December 2025
  • Domain-aware Japanese text embeddings are fixed-length vector representations that capture specialized semantics from in-domain text.
  • They leverage methods like contrastive learning, synthetic hard negative generation, and two-stage fine-tuning to enhance domain adaptation.
  • BERT-based and Transformer architectures, combined with domain-specific data, achieve state-of-the-art performance on Japanese retrieval and similarity benchmarks.

Domain-aware Japanese text embedding refers to methods for constructing fixed-length vector representations of Japanese text (words, sentences, or passages) that explicitly capture domain-specific semantics, often via unsupervised or weakly supervised adaptation of a general model. Unlike generic text embeddings, which are trained on broad corpora such as news or Wikipedia, domain-aware embeddings encode the language phenomena, terminology, and relationships present in specialized domains such as clinical, legal, educational, or marketplace settings. Driven largely by advances in contrastive learning, synthetic hard negative generation, and scalable fine-tuning, current approaches substantially outperform direct-transfer and generic embedding strategies on domain-specific retrieval and similarity tasks (Chen et al., 2023, Chen et al., 12 Mar 2025, Trung et al., 2024, Rusli et al., 24 Dec 2025, Brandl et al., 2022).

1. Architectures and Representation Learning Frameworks

Domain-aware Japanese text embeddings are constructed atop varied architectures: BERT-based or Transformer encoders for sentence and passage representation (Chen et al., 2023, Chen et al., 12 Mar 2025, Rusli et al., 24 Dec 2025), and skip-gram/CBOW models for word-level embeddings (Brandl et al., 2022). Sentence-level methods typically proceed as follows:

  • Base Encoder: Pre-trained Japanese BERT or similar models (e.g., cl-tohoku BERT-Japanese, ruri-small-v2) are used as backbone encoders. The last-layer token outputs are mean-pooled or [CLS]-pooled to obtain fixed-size vectors, optionally with ℓ₂ normalization (Chen et al., 2023, Rusli et al., 24 Dec 2025, Chen et al., 12 Mar 2025).
  • Role-specific Mechanisms: For asymmetric tasks (e.g., query–item matching in search), methods prepend learned role prefixes (e.g., Japanese equivalents of "Query:" vs "Passage:") to the input sequence, feeding both queries and items through the same encoder but with distinct learned tokens (Rusli et al., 24 Dec 2025); a minimal pooling-and-prefix sketch follows this list.
  • Embedding Extraction for LLMs: In domain-adapted retrieval, e.g., legal text, the final <EOS> hidden state of LLaMA-2 (4096-dim, ℓ₂-normalized) is used as the embedding (Trung et al., 2024).
  • Context-dependent Word Embeddings: At the word level, domain-aware matrices $U^t$ for each domain $t$ are learned alongside a global $U^0$, with regularization to enforce structure across domains (Brandl et al., 2022).
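
To make the pooling and role-prefix mechanisms above concrete, the following minimal sketch mean-pools the last-layer token outputs of a Japanese BERT backbone, applies ℓ₂ normalization, and prepends a role prefix to queries and items. The model name, the prefix strings (クエリ:/商品:), and the `embed` helper are illustrative assumptions, not the exact configuration of the cited systems.

```python
# Minimal sketch: mean pooling + L2 normalization + role prefixes.
# Model name and prefix strings are illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "cl-tohoku/bert-base-japanese-v3"  # assumed backbone; swap as needed
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)


def embed(texts, role_prefix=""):
    """Encode texts into L2-normalized, mean-pooled vectors.

    role_prefix mimics a learned role token for asymmetric retrieval,
    e.g. "クエリ: " for queries and "商品: " for items (illustrative strings).
    """
    batch = tokenizer([role_prefix + t for t in texts],
                      padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    pooled = (out * mask).sum(dim=1) / mask.sum(dim=1)    # masked mean pooling
    return F.normalize(pooled, p=2, dim=-1)               # L2 normalization


queries = embed(["冬用のコート"], role_prefix="クエリ: ")
items = embed(["ダウンジャケット 黒 Mサイズ"], role_prefix="商品: ")
print((queries @ items.T).item())  # cosine similarity, since vectors are unit-norm
```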

2. Synthetic Data Generation and Domain Adaptation

Robust domain adaptation methods are crucial due to the scarcity of labeled data in Japanese specialist domains. Two key methodologies are prevalent:

  • Synthetic Hard Negative Generation: Models such as JCSE and SDJC use a data generator (e.g., T5-Japanese fine-tuned on in-domain plain text) to produce "hard negatives": syntactically similar but semantically divergent sentences. This is achieved by POS tagging, masking noun chunks with sentinel tokens, and decoding with beam search to fill in contradictory replacements (Chen et al., 2023, Chen et al., 12 Mar 2025); a minimal generation sketch appears at the end of this section. The generator is fine-tuned via span-masking denoising to ensure domain-appropriate terminology and fluency.
  • Contrastive Sampling from Domain Logs: In information retrieval, positive pairs are mined from transactional or behavioral logs (e.g., query–title pairs resulting in purchases on C2C marketplaces) (Rusli et al., 24 Dec 2025). Hard negatives are drawn via BM25+ or in-batch selection, and further refined via staged fine-tuning (Trung et al., 2024).

This synthetic or behavior-driven adaptation enables unsupervised or weakly supervised domain transfer, supporting adaptation with as few as 5–12 K unlabeled sentences (Chen et al., 12 Mar 2025).
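
As a rough illustration of the generation step above, the snippet below masks a noun chunk with a T5 sentinel token and decodes candidate fill-ins with beam search. The checkpoint name, the manually chosen noun chunk, and the generation settings are assumptions for illustration; the cited pipelines identify noun chunks automatically via POS tagging and use a generator fine-tuned on in-domain text with span-masking denoising.

```python
# Minimal sketch of synthetic hard-negative generation: mask a noun chunk with a
# T5 sentinel and decode replacements with beam search. All names and settings
# below are illustrative assumptions, not the cited papers' exact pipeline.
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "sonoisa/t5-base-japanese"  # assumed Japanese T5 checkpoint
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

sentence = "患者は頭痛と発熱を訴えた。"
noun_chunk = "頭痛と発熱"  # in practice selected automatically by a POS tagger

# Replace the noun chunk with T5's first sentinel token.
masked = sentence.replace(noun_chunk, "<extra_id_0>")

inputs = tokenizer(masked, return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=8,               # beam search over candidate fill-ins
    num_return_sequences=8,    # several candidates to choose hard negatives from
    max_new_tokens=16,
)

for seq in outputs:
    fill = tokenizer.decode(seq, skip_special_tokens=True)
    candidate = masked.replace("<extra_id_0>", fill)
    if noun_chunk not in candidate:   # keep only genuinely altered sentences
        print(candidate)
```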

3. Contrastive Objectives and Optimization

Most contemporary domain-aware Japanese text embedding frameworks employ contrastive learning objectives to align semantically similar pairs while repelling dissimilar (especially domain-contradictory) pairs:

  • InfoNCE Loss Variants: For each anchor sentence $v_i$, a positive ($v_i^+$, e.g., a dropout view or paraphrase) and one or more negatives ($v_i^*$, e.g., a T5-generated hard negative) are scored via cosine similarity. Losses are typically of the form

$$L_i = -\log \frac{\exp[\mathrm{sim}(v_i, v_i^+)/\tau]}{\exp[\mathrm{sim}(v_i, v_i^+)/\tau] + \alpha \exp[\mathrm{sim}(v_i, v_i^*)/\tau] + \sum_{j \neq i} \exp[\mathrm{sim}(v_i, v_j^-)/\tau]}$$

where $\tau$ is the temperature and $\alpha$ weights the synthetic negative (Chen et al., 12 Mar 2025, Chen et al., 2023); a PyTorch sketch of this objective (and its Matryoshka extension) follows this list.

  • Multiple Negatives Ranking (MNR) Loss: In IR, the other in-batch examples serve as the candidate negative set for each positive, supporting efficient negative mining (Rusli et al., 24 Dec 2025).
  • Matryoshka Representation Learning: MRL extends the objective to train embeddings that remain meaningful after dimension truncation. The loss sums MNR over multiple leading dims (e.g., 768, 256, 32), surfacing compact, robust representations (Rusli et al., 24 Dec 2025).
  • Structure-Predictive Regularization: For word embeddings, cross-domain alignment ($U^t$ to $U^0$) and structure prediction via an affinity matrix $W$ ensure not only local domain adaptation but discovery of latent domain relationships (Brandl et al., 2022).
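
The following PyTorch sketch implements the weighted-hard-negative InfoNCE objective displayed above, treating the other positives in the batch as the in-batch negatives $v_j^-$, together with a simple Matryoshka-style variant that sums the same loss over truncated leading dimensions. The temperature, $\alpha$, batch construction, and prefix dimensions are illustrative assumptions rather than the papers' exact settings.

```python
# Sketch of the alpha-weighted InfoNCE loss and an MRL-style truncation wrapper.
import math

import torch
import torch.nn.functional as F


def infonce_with_hard_negative(anchor, positive, hard_negative,
                               tau=0.05, alpha=1.0):
    """anchor, positive, hard_negative: (B, D) embeddings; alpha > 0."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    h = F.normalize(hard_negative, dim=-1)

    pos_sim = (a * p).sum(-1, keepdim=True) / tau   # (B, 1): sim(v_i, v_i^+)
    hard_sim = (a * h).sum(-1, keepdim=True) / tau  # (B, 1): sim(v_i, v_i^*)
    in_batch = a @ p.T / tau                        # (B, B): other positives as v_j^-

    b = a.size(0)
    mask = torch.eye(b, dtype=torch.bool, device=a.device)
    neg_sim = in_batch.masked_fill(mask, float("-inf"))  # drop each row's own positive

    # alpha * exp(s) == exp(s + log alpha), so the weight becomes a logit offset.
    logits = torch.cat([pos_sim, hard_sim + math.log(alpha), neg_sim], dim=1)
    target = torch.zeros(b, dtype=torch.long, device=a.device)  # index 0 = positive
    return F.cross_entropy(logits, target)


def matryoshka_loss(anchor, positive, hard_negative, dims=(768, 256, 32), **kw):
    """MRL-style objective: sum the same loss over leading-dimension truncations."""
    return sum(
        infonce_with_hard_negative(anchor[:, :d], positive[:, :d],
                                   hard_negative[:, :d], **kw)
        for d in dims
    )
```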

4. Training Regimes and Datasets

Effective training of domain-aware Japanese embeddings requires careful sequencing, data balancing, and resource management:

  • Two-Stage Fine-Tuning: Methods such as JCSE and SDJC implement a two-stage regime: first, synthetic contrastive adaptation on in-domain unlabeled text and generator-produced negatives; then, supervised or further contrastive learning on a general corpus (e.g., JSNLI entailment pairs) (Chen et al., 2023, Chen et al., 12 Mar 2025). Ablations show that reversing or skipping stages degrades performance by 1–2 points on STS and IR tasks.
  • Legal and Marketplace IR Pipelines: In legal retrieval, staged fine-tuning proceeds from global contrastive IR training to domain-focused hard negative mining, facilitated by parameter-efficient adaptation (e.g., QLoRA), gradient accumulation, and ensemble extensions (Trung et al., 2024); a configuration sketch follows this list.
  • Practical Industry Datasets: C2C search leverages purchase-validated query–title pairs (~5M) mined from historical marketplace logs, with the test set held out temporally (Rusli et al., 24 Dec 2025). Legal tasks are evaluated on real and synthetic queries paired with law articles (Trung et al., 2024). Clinical and educational domains source in-domain content from case reports, social media, syllabi, and QA logs (Chen et al., 2023, Chen et al., 12 Mar 2025).
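
As a rough configuration sketch of the parameter-efficient adaptation and gradient accumulation mentioned above, the snippet below loads a 4-bit quantized decoder backbone and attaches LoRA adapters via peft. The backbone name, LoRA hyperparameters, target modules, and accumulation factor are assumptions for illustration, not the values used in the cited work.

```python
# Sketch of QLoRA-style adaptation of a decoder backbone used as an embedder.
# All names and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModel, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # assumed backbone; gated on the Hub
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,   # only hidden states are needed for embeddings
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# Gradient accumulation: step the optimizer every `accum` micro-batches so a small
# per-device batch behaves like a larger contrastive batch. `loader`,
# `contrastive_loss`, and `optimizer` below are hypothetical placeholders.
accum = 16
# for step, batch in enumerate(loader):
#     loss = contrastive_loss(model, batch) / accum
#     loss.backward()
#     if (step + 1) % accum == 0:
#         optimizer.step(); optimizer.zero_grad()
```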

5. Evaluation Frameworks and Empirical Comparisons

Quantitative evaluation employs both intrinsic and extrinsic measures:

  • Japanese STS Benchmarks: Evaluation datasets such as JSICK (9,927 pairs), JSTS (13,908 pairs), and machine-translated STS12–16 + STS-B (43K pairs, filtered via BLEU-1) allow robust assessment of semantic similarity via Spearman's ρ (Chen et al., 2023, Chen et al., 12 Mar 2025). Comparison across methods (word-view, SimCSE, SBERT, domain-adapted models) situates domain-aware methods at up to ρ ≈ 0.84 on domain tasks; a minimal evaluation sketch follows the table below.
  • Retrieval Metrics: Dense retrieval tasks are scored via MRR, MAP, P@k, nDCG@k on IR benchmarks. For legal retrieval, the two-stage adapted model achieves nDCG@10 = 68.9, MAP@10 = 61.6 (single checkpoint), outperforming prior dense and sparse baselines (Trung et al., 2024).
  • Production and Online Metrics: In industry-scale search, impact includes a +97% lift in nDCG@100 at 32 dimensions versus a PCA baseline, and statistically significant gains in average revenue per user and order value in live A/B tests, along with improvements in item tap rank and search impression metrics (Rusli et al., 24 Dec 2025).
  • Qualitative Analysis: Manual review highlights proper-noun disambiguation, alignment with marketplace-specific meanings, and improved handling of term-importance after fine-tuning.

| Method | Domain(s) | Key Metric(s) | Performance Highlights |
| --- | --- | --- | --- |
| JCSE (Chen et al., 2023) | Clinical, Edu | STS ρ, IR MRR/MAP | JACSTS ρ = 0.8243, QAbot MRR = 0.8173 |
| SDJC (Chen et al., 12 Mar 2025) | Clinical, Edu | STS ρ, IR MRR/MAP | JACSTS ρ = 0.84, MAP = 0.70 |
| Legal IR (Trung et al., 2024) | Legal | nDCG@10, MAP@10 | nDCG@10 = 68.9 |
| C2C Search (Rusli et al., 24 Dec 2025) | Marketplace | nDCG@100, ARPU, AoV | nDCG@100 = 0.195 at 32 dims, ARPU +0.92% |
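
The two evaluation styles above reduce to a few lines of code: Spearman's ρ between embedding cosine similarities and gold STS scores, and nDCG@k over a ranked relevance list. The helpers and toy data below are illustrative assumptions (using a linear-gain nDCG), not the exact evaluation scripts of the cited benchmarks.

```python
# Sketch of intrinsic (STS Spearman) and extrinsic (nDCG@k) evaluation.
import numpy as np
from scipy.stats import spearmanr


def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(-1)


def sts_spearman(emb_a, emb_b, gold_scores):
    """Intrinsic STS evaluation: rank-correlate cosine similarity with gold labels."""
    rho, _ = spearmanr(cosine(emb_a, emb_b), gold_scores)
    return rho


def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query; `relevances` are graded labels in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = (rel * discounts).sum()
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = (ideal * discounts[: ideal.size]).sum()
    return dcg / idcg if idcg > 0 else 0.0


# Toy usage with random vectors standing in for model embeddings.
rng = np.random.default_rng(0)
emb_a, emb_b = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(sts_spearman(emb_a, emb_b, gold_scores=[4.5, 1.0, 3.2, 2.0, 5.0]))
print(ndcg_at_k([3, 2, 0, 1, 0], k=10))
```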

6. Advances, Limitations, and Future Research

Modern domain-aware embedding techniques for Japanese exhibit several notable advances and ongoing challenges:

  • Unsupervised Adaptation: Both JCSE and SDJC demonstrate strong gains via adaptation without labeled data, relying on T5-finetuned hard negative generation and synthetic pair mining (Chen et al., 2023, Chen et al., 12 Mar 2025). This enables scaling to new domains with only raw text.
  • Dimension Truncation: MRL delivers high retrieval quality even in 32-dim embeddings, critical for production-scale retrieval (Rusli et al., 24 Dec 2025).
  • Evaluation Resources: Public release of datasets (JSTS) and models supports reproducibility and benchmarking advancement (Chen et al., 12 Mar 2025).
  • Potential Weaknesses: Synthetic negative quality depends on generator finetuning; current pipelines mainly replace nouns, but verbs/adjectives could yield more challenging negatives. Robustness to nuanced pragmatic distinctions and cross-domain generalizability are not fully solved (Chen et al., 12 Mar 2025).
  • Open Directions: Extensions to broader domains (finance, legal), use of LLMs (GPT-style) for semantic negative mining, adapters or LoRA for parameter-efficient domain adaptation, and richer hybrid architectures are current areas of exploration (Chen et al., 12 Mar 2025, Trung et al., 2024).

Early approaches to domain-aware word embeddings (e.g., W2VPred) propose concurrent learning of a global embedding and domain-specific sub-embeddings, guided by structure-prediction regularization (Brandl et al., 2022). These methods align well with Transformer-based techniques in their emphasis on shared and domain-local representations, although the latter leverage richer contextualization and negative sampling.
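
For intuition, the schematic below sketches this shared/domain-local idea in numpy: domain matrices $U^t$ sit alongside a global $U^0$, with one penalty pulling each $U^t$ toward $U^0$ and another encouraging each domain to be reconstructable from the others through an affinity matrix $W$. This is a loose illustration of the structure described above, not the exact W2VPred objective or training procedure.

```python
# Schematic illustration of global + domain-specific embedding matrices with
# affinity-based regularization. A loose sketch, not the W2VPred objective.
import numpy as np

V, D, T = 1000, 100, 4                       # vocab size, embedding dim, #domains
rng = np.random.default_rng(0)
U0 = rng.normal(scale=0.1, size=(V, D))      # global embedding matrix U^0
U = rng.normal(scale=0.1, size=(T, V, D))    # one matrix U^t per domain
W = np.abs(rng.normal(size=(T, T)))          # affinity between domains
np.fill_diagonal(W, 0.0)
W = W / W.sum(axis=1, keepdims=True)         # row-normalize for a convex combination


def alignment_penalty(U, U0):
    """Pull every domain matrix toward the shared global matrix."""
    return sum(np.sum((U[t] - U0) ** 2) for t in range(U.shape[0]))


def structure_penalty(U, W):
    """Each domain should be reconstructable from the W-weighted other domains."""
    n_domains = U.shape[0]
    return sum(
        np.sum((U[t] - np.tensordot(W[t], U, axes=1)) ** 2) for t in range(n_domains)
    )


# These penalties would be added to the skip-gram/CBOW likelihood during training.
print(alignment_penalty(U, U0), structure_penalty(U, W))
```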

A plausible implication is that by combining explicit hard negative generation, multi-phase adaptation, and efficient production pipelines, domain-aware Japanese text embeddings now enable both state-of-the-art semantic similarity in low-resource scenarios and robust, scalable deployment in applied search and retrieval systems. This foundational advance supports downstream Japanese NLP applications, including named-entity recognition, information retrieval, question answering, and cross-domain knowledge transfer (Chen et al., 2023, Trung et al., 2024, Rusli et al., 24 Dec 2025, Brandl et al., 2022, Chen et al., 12 Mar 2025).
