Secure Linear Alignment of Large Language Models

Published 19 Mar 2026 in cs.AI | (2603.18908v1)

Abstract: LLMs increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently trained models introduces new opportunities for cross-model alignment to downstream objectives. Moreover, it unlocks new potential application domains, such as settings where security, privacy, or competitive constraints prohibit direct data or model sharing. In this work, we propose a privacy-preserving framework that exploits representational convergence to enable cross-silo inference between independent LLMs. The framework learns an affine transformation over a shared public dataset and applies homomorphic encryption to protect client queries during inference. By encrypting only the linear alignment and classification operations, the method achieves sub-second inference latency while maintaining strong security guarantees. We support this framework with an empirical investigation into representational convergence, in which we learn linear transformations between the final hidden states of independent models. We evaluate these cross-model mappings on embedding classification and out-of-distribution detection, observing minimal performance degradation across model pairs. Additionally, we show for the first time that linear alignment sometimes enables text generation across independently trained models.


Summary

The paper presents HELIX, a secure protocol that uses linear mapping to align representations across independently trained LLMs for classification and text generation.
It validates cross-model similarity with metrics like CKA and SVCCA, preserving performance with minimal degradation even in zero-shot settings.
By integrating homomorphic encryption, HELIX achieves sub-second inference and low communication overhead, making it practical for regulated cross-silo deployments.

Secure Linear Alignment of LLMs: Technical Review

Motivation and Problem Statement

LLMs, despite heterogeneous architectures and training paradigms, increasingly exhibit convergence in their deep representations. This empirical observation underpins a growing literature on representational similarity and functional interchangeability across model families. "Secure Linear Alignment of LLMs" (2603.18908) formalizes a protocol—HELIX—that leverages this emergent compatibility for privacy-preserving, cross-silo inference without sharing model parameters or raw data. The work exploits representational alignment to support downstream classification and, uniquely, text generation across independently trained LLMs via simple affine transformations. The protocol incorporates homomorphic encryption (HE), providing sub-second privacy-preserving inference, with cryptographic security guarantees focused on client query privacy.

Empirical Evidence of Cross-Model Representational Alignment

The paper provides systematic quantification of alignment across a diverse matrix of embedding APIs and autoregressive LLMs. Using centered kernel alignment (CKA) and SVCCA, the authors demonstrate substantial shared linear structure between embedding spaces of independent LLMs—even for models trained by different vendors or with divergent pretraining recipes.

Figure 1: Linear CKA similarity demonstrates robust shared structure across embedding APIs, supporting the feasibility of linear cross-model alignment.

CKA scores between 0.595 and 0.881 (with similar trends for SVCCA) indicate a degree of representational convergence that is hard to attribute to shared initialization or to chance agreement between overparameterized models. These measures support the assumption that a global affine mapping can align the activation spaces of independent models well enough for non-trivial behavioral transfer, such as classifier head sharing and, for certain pairs, text generation.
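To make the similarity metric concrete, the following is a minimal sketch of linear CKA in NumPy. It uses the standard linear-CKA formula on column-centered embedding matrices; the function name `linear_cka` and the synthetic data are illustrative assumptions, not the paper's code or models.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two embedding matrices with the same number of rows.

    X has shape (n, d1) and Y has shape (n, d2); row i of each matrix is the
    embedding of the same input under a different model.
    """
    X = X - X.mean(axis=0, keepdims=True)  # center every feature column
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))

# Synthetic stand-ins for two models' embeddings (assumption: not real model outputs).
rng = np.random.default_rng(0)
Z_a = rng.normal(size=(200, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
Z_b = Z_a @ Q                                   # same geometry, rotated basis
Z_c = rng.normal(size=(200, 16))                # unrelated embeddings

cka_rotated = linear_cka(Z_a, Z_b)  # close to 1: CKA ignores rotations
cka_random = linear_cka(Z_a, Z_c)   # much lower for unrelated features
```

Because linear CKA is invariant to orthogonal transformations, a high score indicates shared structure up to a change of basis, which is the situation in which a linear map between the two spaces can plausibly work.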

Privacy-Preserving Alignment Protocol

HELIX implements a two-party secure computation protocol. Party B (client) computes and encrypts embeddings from a public dataset with its private LLM. These encrypted representations are used by Party A (provider) to perform secure, homomorphic aggregation for linear alignment map estimation:

Figure 2: The protocol executes encrypted training for linear alignment and inference, ensuring that only encrypted embeddings and logit outputs are exchanged.

During deployment, Party B maps private embeddings using the learned affine transformation, sends ciphertexts of the mapped representations to Party A, and receives encrypted predictions post-classification. All cryptographically intensive computation is restricted to linear alignment and inference, yielding sub-second runtime and orders-of-magnitude smaller communication overhead compared to systems that protect the full transformer stack.
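The alignment step can be sketched in plaintext as a ridge-regression fit. The sketch below assumes the map sends Party B's embeddings $Z_B$ into Party A's space $Z_A$ and solves the regularized normal equations; the centering, the bias handling, the value of `lam`, and the function name `fit_affine_map` are assumptions made for illustration. In the protocol the cross-party statistic would be computed under encryption rather than in the clear as done here.

```python
import numpy as np

def fit_affine_map(Z_b: np.ndarray, Z_a: np.ndarray, lam: float = 1e-3):
    """Ridge-regression fit of Z_a ≈ Z_b @ W + bias on paired public embeddings."""
    mu_b = Z_b.mean(axis=0)
    mu_a = Z_a.mean(axis=0)
    Zb_c = Z_b - mu_b
    Za_c = Z_a - mu_a
    gram = Zb_c.T @ Zb_c    # (d_B, d_B): depends only on Party B's embeddings
    cross = Zb_c.T @ Za_c   # (d_B, d_A): the statistic that involves both parties
    W = np.linalg.solve(gram + lam * np.eye(gram.shape[0]), cross)
    bias = mu_a - mu_b @ W
    return W, bias

# Synthetic paired embeddings related by a noisy affine map (illustrative only).
rng = np.random.default_rng(1)
Z_b = rng.normal(size=(500, 8))
M_true = rng.normal(size=(8, 6))
c_true = rng.normal(size=6)
Z_a = Z_b @ M_true + c_true + 0.01 * rng.normal(size=(500, 6))

W, bias = fit_affine_map(Z_b, Z_a, lam=1e-6)
pred = Z_b @ W + bias
rel_err = np.linalg.norm(pred - Z_a) / np.linalg.norm(Z_a)
```

Using `np.linalg.solve` instead of forming an explicit inverse is a numerical-stability choice of this sketch, not something the source specifies.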

Downstream Task Transfer: Classification and OOD Detection

Task-supervised experiments show that applying a learned linear alignment map to source embeddings preserves classification accuracy and OOD detection performance, with minimal degradation relative to the baseline of using each model’s self-trained head. For example, classification accuracy and AUROC for TREC and AG News, respectively, vary by no more than 2–4 percentage points across model pairs after alignment, even when no in-distribution data is used to fit the map.

The framework is useful even in the zero-shot regime, where only public data is used for alignment. Adding a small number of in-distribution samples (64–128) to the alignment set narrows the remaining performance gap further.
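The source elsewhere describes the OOD detector as energy-based and computed from classifier logits. As a hedged illustration, the sketch below computes the standard energy score $-T \log \sum_k \exp(z_k / T)$ from logits; the function name, the temperature default, and the toy logits are assumptions, and no decision threshold is implied.

```python
import numpy as np

def energy_score(logits, temperature: float = 1.0) -> np.ndarray:
    """Energy score per row of logits; lower values suggest in-distribution inputs."""
    z = np.asarray(logits, dtype=float) / temperature
    m = z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse

# Toy logits (assumption): one confident prediction, one nearly flat prediction.
confident = np.array([8.0, 0.5, -1.0, 0.2])
uncertain = np.array([0.3, 0.1, 0.2, 0.0])
scores = energy_score(np.stack([confident, uncertain]))
```

In a deployment along the lines described here, the logits would come from the provider's head applied to aligned embeddings, and the client would score them after decryption.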

Cross-Model Text Generation via Linear Mapping

A distinguishing contribution is the empirical demonstration that linear alignment enables text generation across independently trained, instruction-tuned LLMs for select model pairs.

Figure 3: An affine map is trained between Qwen’s hidden states and Llama’s feature space, enabling Qwen activations to be decoded with Llama’s head for coherent text generation.
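The stitching idea in the caption can be sketched with toy matrices. The example below assumes hidden states from a "source" and a "target" model are available for the same tokens, fits an affine map by least squares, and decodes mapped source states with the target's output head. All matrices are random stand-ins, and the toy relation is constructed to be exactly affine, which real model pairs are not.

```python
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, vocab, n = 12, 10, 50, 400

# Toy paired hidden states (assumption: exactly affine relation, for illustration).
H_src = rng.normal(size=(n, d_src))
M_true = rng.normal(size=(d_src, d_tgt))
c_true = rng.normal(size=d_tgt)
H_tgt = H_src @ M_true + c_true

# Fit the affine map by least squares on the augmented matrix [H_src, 1].
X = np.hstack([H_src, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X, H_tgt, rcond=None)
W, b = coef[:-1], coef[-1]

# Decode mapped source states with the target model's output head (random here).
U_tgt = rng.normal(size=(vocab, d_tgt))
h_new = rng.normal(size=(5, d_src))
logits_cross = (h_new @ W + b) @ U_tgt.T
logits_native = (h_new @ M_true + c_true) @ U_tgt.T
next_tokens = logits_cross.argmax(axis=-1)  # greedy next-token choice
```

In this constructed case the cross-model logits match the native ones. With real models the match is only approximate, which is consistent with the source reporting coherent generation only for some pairs.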

A comprehensive evaluation over 34 model pairs reveals two necessary conditions for functional cross-generation:

Tokenizer Compatibility: Exact token match rate and vocabulary overlap (Jaccard index) are strongly correlated with generation quality (Pearson $r=0.898$ and $r=0.822$, $p<0.001$).
Model Scale: Success requires both models to have ≥4B parameters. When mapping from smaller models, even with identical architectures and tokenizers, generation is degraded or incoherent.

Figure 4: Exact token match rate between LLMs is a high-fidelity predictor of cross-model generation quality as measured by LLM-as-a-judge evaluation.

Text generation quality is quantitatively assessed by LLM-as-a-judge protocols (mean scores 4.0–4.7 for high-compatibility pairs, 1.1–1.9 for failures) and human ranking. Outputs for successful pairs cluster in high-cosine-similarity regions to native generations, as shown in representation space analyses.

Figure 5: Pairs with high embedding similarity between native and cross-model generations reliably yield coherent outputs; low similarity coincides with failure modes.

An additional analysis shows that vocabulary overlap (Jaccard index) exceeding 0.7 is a practical threshold for realistic cross-model generation, providing an actionable selection criterion for practitioners.
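As a small illustration of the selection criterion, the Jaccard index of two vocabularies is the size of their intersection divided by the size of their union. The helper name `vocab_jaccard` and the toy vocabularies below are hypothetical; in practice the inputs would be the token strings of each model's tokenizer.

```python
def vocab_jaccard(vocab_a, vocab_b) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over two collections of token strings."""
    a, b = set(vocab_a), set(vocab_b)
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(a | b)

# Toy vocabularies (assumption): four shared tokens out of eight distinct tokens.
toy_a = ["the", "cat", "##s", "sat", "on", "mat"]
toy_b = ["the", "cat", "s", "sat", "on", "rug"]

overlap = vocab_jaccard(toy_a, toy_b)
compatible = overlap > 0.7  # the 0.7 threshold reported in the text above
```

Here `overlap` is 0.5, so this toy pair would fall below the reported threshold.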

Security Model and Adversarial Considerations

The protocol assumes a semi-honest threat model, prioritizing client privacy. Client queries and embeddings are never revealed in plaintext to the provider; only encrypted representations and outputs are exchanged. The affine alignment map $W^*$, retained by the client, leaks only coarse geometric structure of the provider’s embedding space. Extensive membership inference analysis demonstrates that $W^*$ does not enable recovery of any individual training sample, with theoretical and empirical bounds placing the maximum inference advantage at $O(\sqrt{d}/N)$, negligible for practical $N$.

The provider’s classifier parameters are only used under encryption, with optional encrypted argmax for label-only output to further minimize leakage.
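To illustrate how a linear head can be evaluated on encrypted inputs, the sketch below uses textbook Paillier encryption as a stand-in. This is not the CKKS scheme the source refers to: Paillier is only additively homomorphic and works on integers, so the example quantizes values with a fixed-point scale. The key size is deliberately tiny and insecure, and all names and parameters are assumptions for illustration.

```python
import math
import secrets

# Toy Paillier keypair. INSECURE: small fixed Mersenne primes, illustration only.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1), private
MU = pow(LAM, -1, N)                               # valid for generator g = N + 1
SCALE = 1000                                       # fixed-point scale (assumption)

def encrypt(m: int) -> int:
    """Encrypt a signed integer under the public key N."""
    r = secrets.randbelow(N - 1) + 1
    return (1 + (m % N) * N) * pow(r, N, N2) % N2

def decrypt(c: int) -> int:
    """Client only: recover the signed integer using the private LAM and MU."""
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m

def encrypted_linear_head(enc_x, weights, bias):
    """Provider: return Enc(w . x + b) from ciphertexts of x, without decrypting."""
    acc = encrypt(round(bias * SCALE * SCALE))
    for c, w in zip(enc_x, weights):
        k = round(w * SCALE) % N        # plaintext weight as a fixed-point integer
        acc = acc * pow(c, k, N2) % N2  # Enc(a)^k * Enc(s) = Enc(k*a + s)
    return acc

# Client side: quantize and encrypt an (already aligned) embedding.
x = [0.5, -1.25, 2.0]
enc_x = [encrypt(round(v * SCALE)) for v in x]

# Provider side: one row of a linear classifier, which the provider keeps private.
w, b = [0.2, 0.4, -0.1], 0.05
enc_logit = encrypted_linear_head(enc_x, w, b)

# Client side: decrypt and undo the fixed-point scaling.
logit = decrypt(enc_logit) / SCALE**2
```

The provider only ever handles ciphertexts of the embedding, and the client only learns the resulting logit. A CKKS-based implementation would follow the same division of labor while operating on packed real-valued vectors.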

Efficiency and Benchmarking

Compared with full HE/MPC transformer inference (e.g., BOLT, Nimbus, MPCFormer), HELIX achieves sub-second inference latency and a communication footprint under 1 MB per query, versus tens of seconds to minutes and multiple gigabytes of communication for full-stack solutions. This scalability makes the framework suitable for deployment in regulated cross-silo settings, such as healthcare or finance, where neither data nor model sharing is permissible.

Theoretical and Practical Implications

The discovery of robust cross-model functional alignment by linear transformation, especially for generative tasks, strengthens hypotheses regarding the existence of a common latent structure governing LLMs trained at frontier scale with similar data and objectives. It also suggests that, as scaling persists and representational convergence intensifies, functional interoperability (including fine-grained tasks) may eventually be enabled for a broader class of heterogeneous models.

Practically, HELIX enables new secure ML workflows: private client-side feature extraction, server-side encrypted inference, and plug-and-play alignment without fine-tuning or transfer of parameters, lowering the barrier for privacy-preserving deployment of proprietary models and black-box APIs.

Future Directions

Outstanding questions include the extension to non-linear alignment (to further close the gap for generation), applicability to multi-modal and systematic OOD settings, and defenses for black-box model extraction attacks under repeated queries.

Conclusion

"Secure Linear Alignment of LLMs" demonstrates—with strong empirical and theoretical backing—that linear maps trained on public data can align large, independently trained LLMs to enable accurate and secure downstream classification and, for compatible pairs, surprisingly coherent cross-model text generation. The presented HELIX protocol achieves robust privacy guarantees and runtime superior to prior end-to-end cryptographic inference approaches, unlocking practical secure cross-silo AI deployment. This work solidifies the connection between representational alignment and functional modularity in LLMs, laying foundational groundwork for privacy-preserving and modular AI systems (2603.18908).

Explain it Like I'm 14

What this paper is about (big picture)

The paper explores a simple but powerful idea: many different LLMs seem to “think” in similar ways on the inside. Because of that, you can often translate the internal features from one model into the “language” of another model using a very simple rule. The authors show how to use this translation to:

combine parts of different models,
do tasks like classification and even some text generation across models, and
keep users’ data private and secure while doing it, using fast encryption.

They call their approach a secure alignment method that performs only simple, linear math on encrypted data, so it stays fast (sub‑second per input) and private.

What questions the researchers asked

The paper focuses on three easy-to-understand questions:

Do different LLMs learn similar “internal representations” (their hidden features) even if they were trained separately?
If yes, can a simple “translator” (a linear/affine map—think of it like a straightforward conversion formula) move features from one model into another model’s space so a shared tool (like a classifier or a text decoder) still works well?
Can we do this translation and prediction privately—so the service never sees the user’s input—without slowing things down too much?

How they approached it (methods in everyday terms)

Internal features as “embeddings”: When a model reads text, it turns it into a big list of numbers that summarize meaning—this is called an embedding. Think of it like a fingerprint of the sentence.
A simple translator: The authors learn a simple formula that takes embeddings from Model B and converts them into the form that Model A expects. This formula is linear (like “multiply by this matrix and add this bias”), which is fast and easy to compute.
Learned on public data: They learn this translator using a shared, non-sensitive public dataset (like Wikipedia or IMDB), so no private data needs to be exchanged.
Private prediction with “locked” data: They use a technique called homomorphic encryption (HE). Imagine locking your numbers in a box that still lets someone do math on them without opening it. The server runs the final, simple math (the linear classifier) on your locked (encrypted) features and sends back a locked answer. Only you can unlock it. Because the math is linear, this stays fast.
What they tested:
- Representational similarity: They measured how similar different models’ internal features are using a metric called CKA. You can think of CKA as a “how similarly do they see the world?” score.
- Downstream tasks: They tried text classification and finding out-of-distribution (OOD) inputs (detecting when something looks unfamiliar) after translating embeddings between models.
- Text generation: They tried a tougher test—using one model’s internal features but another model’s output head to generate text. This is like plugging one brain into another model’s mouth and asking it to speak.
Extra tech terms in simple words:
- Linear/affine map: A straightforward “conversion recipe” for numbers: multiply by a matrix, add a vector.
- Tokenizer: How a model chops text into small pieces (like splitting a sentence into word-parts). If two models chop text very differently, it’s harder to plug them together for generation.

What they found and why it matters

Different models are more similar than you might think:
- Their similarity scores (CKA) were high across many pairs of models from different companies, meaning the models’ internal representations share a lot of structure.
Classification still works after translation:
- They trained a classifier on Model A’s features and then fed it translated features from Model B. Accuracy stayed close to the original model’s performance on tasks like TREC, AG News, MNLI, and DBpedia.
- Even for OOD detection (spotting unfamiliar inputs based on confidence), the translated features often matched or even improved the baseline confidence behavior.
Text generation sometimes works—under the right conditions:
- When they mapped Model B’s hidden features into Model A’s feature space and used Model A’s output head, some model pairs produced coherent text.
- Two big factors predicted success:
  - Tokenizer compatibility: If the two models split text similarly, generation quality was much higher. This was strongly correlated with better outputs.
  - Model size: Smaller models (below ~4B parameters) struggled to generate good text when stitched to larger models. Stronger source models mapped into weaker targets worked better than the reverse.
Fast and private:
- Because they encrypt only the simple linear parts (translation and classification), prediction runs in under a second per example with strong security (128‑bit).
- The server never sees the user’s raw text or embeddings; it only sees encrypted vectors. The model owner also doesn’t have to share their model.
Practical trade-offs:
- Learning the translator only on public data already works well. Adding a tiny number of in‑distribution examples (like 64–128 samples) can boost performance more, but that involves sharing a small amount of task data, which some settings may not allow.

What this could mean going forward (implications)

Privacy-friendly collaboration: Companies or organizations can combine strengths—one uses their own embedding model locally, another provides a powerful classifier—without sharing private data or proprietary models.
Modular AI systems: If different models naturally “think” similarly, future systems can be built like Lego blocks: swap in an encoder from one place and a classifier or decoder from another, connected by a simple translator.
Faster secure AI: Encrypting only the simple, linear parts keeps private inference practical and fast, making it more feasible to deploy privacy-preserving AI in real apps (e.g., healthcare, finance, education).
Research insight: The results support the idea that big models converge on similar internal representations. This encourages more research into cross-model compatibility—and how to design tokenizers and architectures to make stitching even easier.
Limitations to keep in mind: The approach relies on a linear translator and a shared public dataset, works best for classification, and only sometimes for generation. It assumes both sides behave honestly (semi-honest security model) and doesn’t encrypt the entire transformer, just the final linear parts.

In one sentence

The paper shows that different LLMs often “think” alike enough that a simple, secure translator can connect them—letting people use one model’s features with another model’s tools, privately and quickly, with strong results on classification and, in some cases, even text generation.


Knowledge Gaps

Below is a concise, actionable list of the paper’s unresolved knowledge gaps, limitations, and open questions to guide future work.

Theoretical guarantees for cross-model linear alignment: No formal bounds are provided on when an affine map preserves task performance across models with differing architectures, objectives, tokenizers, and data. Develop conditions and error bounds (e.g., excess risk under linear mapping, stability under distribution shift) and sample complexity for learning $W^*$.
Sensitivity to public–private distribution shift: Alignment is fit on public data (optionally with 64–128 in-distribution samples), but the impact of domain mismatch on downstream accuracy and OOD reliability is not quantified. Characterize performance as a function of distribution divergence, and estimate how many public samples are needed to achieve given accuracy targets.
Choice of alignment objective: Only ridge regression (normal equations) is used. Compare alternatives (orthogonal Procrustes, CCA/SVCCA, partial least squares, constrained/orthogonal mappings, per-layer or token-wise maps) and assess trade-offs for classification and generation.
Layer selection and multi-layer alignment: Alignment is performed on the final/penultimate hidden state. Evaluate whether aligning earlier or multiple layers (or layer-wise ensembles) improves transfer, especially for generation and OOD detection.
Nonlinear adapters under HE constraints: The framework limits secure computation to linear heads. Investigate low-degree polynomial or piecewise-linear adapters compatible with CKKS to recover nonlinearity benefits while maintaining feasible latency.
Robustness and stability of $W^*$: Numerical stability of $(Z_B^\top Z_B+\lambda I)^{-1}$ under collinearity and high dimensionality is not analyzed. Study conditioning, regularization selection, and pre-/post-normalization (centering, whitening) impacts on performance and privacy.
Tokenization mismatch remediation: Success in generation correlates with tokenizer compatibility but no remedy is proposed. Explore learned token translation layers, byte/character-level bridging, shared vocabulary induction, or alignment in a tokenizer-agnostic space.
Causal factors in generation success: Correlations with tokenizer match rate and model size are reported, but causal mechanisms are untested. Perform ablations (same architecture/different tokenizers, same tokenizer/different architectures) and controlled token-mapping interventions to isolate causes.
Generality of generation results: Generation is evaluated on short prompts with greedy decoding and up to 128 tokens. Assess longer contexts, diverse decoding strategies (beam, nucleus), safety/toxicity, factuality, and robustness across broader tasks and domains.
Asymmetric transfer in generation: Strong→weak mappings work better than weak→strong. Clarify whether this is due to head capacity, representation quality, or tokenization. Test with controlled capacity-matched heads and calibrated normalization.
Multilingual and cross-modal applicability: All experiments appear English-centric and text-only. Evaluate alignment and secure inference in multilingual settings (different scripts and tokenizers) and across modalities (e.g., speech/text, text/vision embeddings).
OOD detection calibration under mapping: Energy-based OOD relies on calibrated logits; how mapping affects logit scale and calibration is unclear. Study temperature scaling, score calibration, and alternative OOD scores after alignment.
Security model limitations (semi-honest only): The protocol assumes honest-but-curious parties. Extend to malicious adversaries with verifiable computation (e.g., ZK proofs of correct encryption/computation), and analyze resilience to poisoning or adaptive attacks by a malicious client/provider.
Leakage from returning logits: The framework returns full logits to the client by default; this may enable model extraction or membership inference against the provider. Quantify leakage and benchmark encrypted argmax/top-k, noisy logits, or DP mechanisms to mitigate it.
Privacy of the alignment map $W^*$: Although initial membership inference analysis is referenced, $W^*$ may leak sensitive structural properties (e.g., subspace directions) of the provider’s representation. Provide formal leakage bounds (e.g., DP on $Z_A^\top Z_B$), test property inference attacks, and evaluate defense efficacy.
Training-time leakage and attack surface: The secure training protocol exposes encrypted $Z_B$ and plaintext $Z_A$ to the provider and returns $\mathsf{Enc}(Z_A^\top Z_B)$ to the client. Analyze whether repeated protocol runs, chosen-public sets, or collusion can amplify leakage, and evaluate protections (noise addition, auditing, rate-limiting).
End-to-end scalability of secure training: No measurements are given for time/compute to homomorphically compute $Z_A^\top Z_B$ at realistic scales (large $N$, high $d_A, d_B$). Benchmark training runtime, memory, and communication with streaming and batching, and assess GPU/ASIC acceleration for HE.
Inference throughput and batching: Latency is reported as sub-second per sample, but throughput under batching, multi-query pipelines, and varying $d_A$, $d_B$, and number of classes $K$ is not characterized. Provide throughput/latency curves and communication-volume measurements over realistic networks.
CKKS precision effects: The impact of CKKS quantization/noise on logit accuracy, calibration, and OOD detection is not separated from the plaintext baseline. Quantify accuracy degradation from HE alone and tune scale/modulus to trade precision vs. latency.
Applicability to nonlinear or multi-head classifiers: Many real systems use multilayer or attention-based heads. Explore structured linear heads (low-rank, block-diagonal) or HE-friendly approximations to better match practical classifiers.
Robustness to adversarial queries: The susceptibility of aligned, encrypted inference to adversarial example transfer or gradient-free extraction is unexplored. Evaluate adversarial robustness and design defenses (randomized smoothing, dropout at representation level, certified bounds).
Dynamic drift and model versioning: Vendor embeddings and tokenizers evolve. Devise mechanisms to detect drift, update $W^*$ with minimal public data, and ensure backward compatibility without exposing private information.
Multi-tenant and many-to-one deployments: How to manage and isolate many client-specific $W^*$ maps to a single provider head (scalability, storage, and interference)? Analyze cross-client leakage and collusion risks.
Key management and side channels: Practical deployment issues (key rotation, compromised keys, timing/traffic analysis, ciphertext size side channels) are not addressed. Propose mitigations and evaluate overhead.
Fairness and bias transfer: Whether alignment preserves, amplifies, or mitigates biases from either model is unexamined. Audit subgroup performance, measure bias transfer, and test fairness-aware alignment objectives.
Reproducibility with proprietary APIs: Reliance on vendor embeddings limits reproducibility and longitudinal stability. Provide open-source replications and document variability due to API updates or rate limiting.
Communication-cost quantification: Communication overhead is asserted (<1 MB per sample) but not empirically measured across dimensions, batch sizes, and packing schemes. Report end-to-end bytes transferred and sensitivity to HE parameters.
Integration with DP for formal client/provider guarantees: No differential privacy is applied. Explore DP-noised sufficient statistics for training $W^*$ and DP on logits at inference to provide formal privacy budgets for both parties.
Extending secure generation: The secure framework only covers linear heads; text generation experiments are plaintext. Investigate whether any part of cross-model generation can be secured efficiently (e.g., securing only token selection or partial linear projections) without prohibitive latency.


Practical Applications

Immediate Applications

The following applications can be deployed today using the paper’s method (secure linear alignment with homomorphic encryption over linear operations), empirical results (classification, OOD detection, sub-second latency), and observed constraints (public-data alignment, semi-honest threat model, tokenizer effects).

Bring-your-own-encoder (BYOE) privacy-preserving classification APIs
- Sectors: healthcare, finance, legal, government, enterprise IT, customer support
- What: Clients compute embeddings locally with their own model, linearly align them to the provider’s feature space, encrypt the aligned vector, and get predictions from the provider’s encrypted linear head (sub-second latency).
- Tools/products/workflows:
  - “Secure Feature Alignment SDK” (client-side): compute W*, apply alignment, CKKS encrypt, call provider’s API
  - “Encrypted Linear Head Service” (provider-side): CKKS-enabled microservice that performs homomorphic logits computation (e.g., TenSEAL-based)
  - MLOps integration: a minimal HE gateway that sits in front of existing linear heads (logistic regression, linear SVM, linear layer)
- Dependencies/assumptions: semi-honest threat model; linear head on provider side; availability of a shared public dataset to fit W*; client can compute embeddings locally; leakage via W* is bounded as analyzed; optional encrypted argmax for stronger provider IP protection
Privacy-preserving ticket routing, triage, and tagging
- Sectors: customer support, IT service management, HR, legal ops
- What: Classify tickets/emails/documents (e.g., urgency, team routing, topic) using private client data and public alignment into a provider’s task head without sharing raw text.
- Tools/products/workflows: “Encrypted Triage Router” plugin for helpdesk software (e.g., ServiceNow, Zendesk); linear heads trained per client vertical; on-device or VPC embedding generation
- Dependencies/assumptions: linear classification head; public data alignment; low-latency internet link; semi-honest model
Protected content moderation and safety classification
- Sectors: social platforms, gaming, enterprise collaboration
- What: On-device embeddings for user content; encrypted classification for toxicity, self-harm risk, policy violations using provider’s specialized head trained on sensitive data.
- Tools/products/workflows: “Privacy-Preserving Moderation API”; mobile SDK for on-device embedding and CKKS client; server-side HE inference shard
- Dependencies/assumptions: linear head; public alignment set; content not exfiltrated; semi-honest threat model
Secure OOD detection and drift monitoring as a service
- Sectors: regulated AI deployments across all industries
- What: Use energy-based OOD detection on encrypted aligned embeddings to monitor distribution shift or data quality issues without exposing inputs.
- Tools/products/workflows: “Encrypted Drift Monitor” that ingests encrypted aligned vectors and returns confidence/OOD flags; dashboards for risk teams
- Dependencies/assumptions: frozen linear head with accessible logits; public alignment set; semi-honest model; suitable thresholds per deployment
Cross-vendor embedding interoperability for MLOps
- Sectors: software/ML platforms, SaaS
- What: Align embeddings from Vendor B to Vendor A’s feature space to reuse trained linear heads, reduce relabeling, and ease migration/A/B tests.
- Tools/products/workflows: “Embedding Bridge” adapter between vectorization backends; batch re-scoring pipelines that map stored embeddings to new heads
- Dependencies/assumptions: representational convergence holds between chosen models; linear head; availability of public dataset similar enough to the task domain for W*
Private legal and compliance classification
- Sectors: legal, compliance, procurement
- What: Classify contracts, policies, or disclosures (e.g., clause type, risk flags, regulatory categories) with encrypted aligned embeddings; provider keeps specialized head proprietary.
- Tools/products/workflows: “Encrypted Clause Classifier” microservice; client-side alignment step integrated into DMS/CLM systems
- Dependencies/assumptions: linear head; semi-honest model; public alignment data; data residency/privacy approvals are simplified since only encrypted aligned vectors leave the client
Enterprise email/spam/phishing filtering with user privacy
- Sectors: enterprise security, productivity
- What: On-device or gateway embeddings; encrypted classification for spam/phishing categories using provider’s head trained on threat intel datasets.
- Tools/products/workflows: “HE Mail Filter” gateway; client-side alignment agent on mail server; provider head updated without client data sharing
- Dependencies/assumptions: linear head; timely inference (sub-second is sufficient for server processing); public alignment data
Research tooling for representational analysis and model stitching
- Sectors: academia, applied research labs
- What: Use linear CKA, linear alignment, and cross-model performance transfer to study representational convergence, reproducibility, and interpretability under privacy constraints.
- Tools/products/workflows: Open-source “Model Stitching Lab” package (affine map fitting, CKA/SVCCA, encrypted covariance exchange); shared benchmarks for classification and OOD
- Dependencies/assumptions: access to public corpora for alignment; standardized embedding extraction pipelines
Privacy-preserving benchmarking marketplaces for domain heads
- Sectors: AI marketplaces, specialized model providers
- What: Providers offer encrypted linear heads trained on proprietary data (e.g., medical triage, financial risk), and clients evaluate performance privately using BYOE embeddings plus alignment.
- Tools/products/workflows: “Encrypted Head Catalog” with trial keys, usage metering, rate limiting, and optional encrypted argmax to reduce extraction risk
- Dependencies/assumptions: linear head; public alignment data; licensing/commercial terms for head usage; semi-honest model
Edge-cloud split for regulated deployments
- Sectors: healthcare, finance, government, telecom
- What: Edge devices (or on-prem) compute embeddings; cloud performs encrypted classification with tight latency budgets for interactive workflows (e.g., clinician support).
- Tools/products/workflows: “Edge Embedding + Cloud HE Head” architecture; containerized CKKS microservices with autoscaling; audit logs showing encrypted-only data movement
- Dependencies/assumptions: reliable edge embedding throughput; linear head; tight networking SLAs; semi-honest assumptions
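The linear CKA metric named in the research-tooling entry above can be illustrated with a short sketch. This is a minimal pure-Python version of the standard linear CKA formula, not the paper's code; the helper names (`center_columns`, `cross_product`, `linear_cka`) and the toy embeddings are hypothetical, and a real pipeline would likely use a numerical library on much larger matrices.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]

def cross_product(a, b):
    """Return a^T b for row-major matrices with the same number of rows."""
    return [[sum(ra[i] * rb[j] for ra, rb in zip(a, b))
             for j in range(len(b[0]))] for i in range(len(a[0]))]

def frobenius_norm(m):
    return math.sqrt(sum(v * v for row in m for v in row))

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = frobenius_norm(cross_product(yc, xc)) ** 2
    denominator = (frobenius_norm(cross_product(xc, xc))
                   * frobenius_norm(cross_product(yc, yc)))
    return numerator / denominator

# Toy embeddings (hypothetical): four samples, two dimensions each.
emb_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
# emb_b is emb_a rotated by 90 degrees and scaled by 2.
emb_b = [[-2.0 * r[1], 2.0 * r[0]] for r in emb_a]

score = linear_cka(emb_a, emb_b)
print(f"linear CKA: {score:.6f}")
```

Because linear CKA is invariant to orthogonal transforms and isotropic scaling, the rotated and rescaled copy scores approximately 1, which is the property that makes it useful for comparing independently trained encoders.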

Long-Term Applications

The following applications require further research, scaling, or ecosystem development (e.g., stronger security models, tokenizer/architecture standards, or broader protocol support).

Hybrid LLMs via cross-model linear alignment for generation
- Sectors: software, creative tools, education, conversational AI
- What: Compose a source model’s transformer blocks with a target model’s LM head to generate text across independently trained models; exploit tokenizer compatibility and sufficient model scale.
- Tools/products/workflows: “Hybrid Decoder” runtime that learns affine maps, performs token-level alignment, and orchestrates cross-model decoding
- Dependencies/assumptions: strong tokenizer compatibility (e.g., exact token match rate ≥ ~0.67) and sufficiently large source model (≥ ~4B parameters) as indicated by the paper; robust token alignment; quality/consistency validation; not yet production-grade across arbitrary pairs
Cross-vendor RAG interoperability and secure retrieval
- Sectors: enterprise search, knowledge management
- What: Map client embeddings to the server’s VDB space to enable retrieval with mismatched encoders; optionally encrypt query vectors for privacy-preserving search.
- Tools/products/workflows: “Alignment Adapter” for vector databases (Pinecone/Weaviate) that accepts aligned queries; optional HE/MPC for private similarity search
- Dependencies/assumptions: alignment trained on public data that matches the domain; potential accuracy loss if representational convergence is weak; secure search beyond linear ops requires additional crypto systems
Federated transfer learning via alignment statistics
- Sectors: healthcare networks, consortia in finance/government
- What: Institutions exchange only encrypted sufficient statistics (cross-covariances) to compute W* and reuse heads across silos without sharing raw data, embeddings, or full models.
- Tools/products/workflows: “HE Federated Alignment” protocol and scheduler; secure aggregation for cross-covariances; governance templates for cross-institution usage
- Dependencies/assumptions: semi-honest model; alignment quality depends on public/pooled data; coordination overhead; compliance reviews
Stronger adversarial security (malicious settings) and provider IP protection
- Sectors: all regulated/competitive domains
- What: Extend to malicious adversary models (e.g., HE+MPC, ZK proofs of correct evaluation, encrypted argmax-only responses) to reduce extraction risks and tighten leakage bounds.
- Tools/products/workflows: “Hardened HELD” combining CKKS with MPC/NIZK; rate-limiting and watermarking; secure audit trails
- Dependencies/assumptions: higher latency/compute; protocol complexity; formal security proofs and red-team validation
Standardized interoperability layer for LLM feature spaces
- Sectors: AI infrastructure, standards bodies
- What: Define specs for feature-space dimensions, token alignment metadata, and linear adapter formats so model providers can advertise compatibility and clients can switch providers with minimal friction.
- Tools/products/workflows: “Feature Alignment Manifest” published with models; conformance tests (CKA thresholds, stitching performance)
- Dependencies/assumptions: multi-vendor collaboration; benchmarking and certification processes
Hardware-accelerated HE for real-time encrypted inference
- Sectors: cloud providers, chip vendors, telecom
- What: Accelerate CKKS (and related schemes) on GPUs/ASICs to support large-batch encrypted linear algebra and bring end-to-end inference latency well below 100 ms for interactive use.
- Tools/products/workflows: HE kernels integrated into BLAS/cuBLAS equivalents; HE-aware autoscaling; cost-aware schedulers
- Dependencies/assumptions: hardware availability; optimized packing/rotation strategies; engineering investment
Multi-head encrypted inference marketplaces
- Sectors: AI marketplaces, vertical model providers
- What: Compose multiple encrypted linear heads (toxicity, sentiment, policy, risk) over the same aligned embedding stream to deliver a “privacy-preserving analytics bundle.”
- Tools/products/workflows: “Encrypted Head Graph” runtime; unified billing/SLAs; client-side alignment once, many encrypted heads downstream
- Dependencies/assumptions: composability of linear heads; governance for downstream usage; pricing models
Cross-domain and multimodal extensions
- Sectors: robotics, autonomous systems, vision/speech analytics
- What: Investigate whether representational convergence and linear alignment extend across modalities (e.g., map audio/vision encoders to shared heads), enabling privacy-preserving classification from sensors.
- Tools/products/workflows: “Multimodal Alignment Lab” to estimate W* across modalities, evaluate OOD and safety classifications
- Dependencies/assumptions: empirical validation of convergence across modalities; domain-appropriate public datasets; potentially different tokenization/alignment challenges
Policy and compliance frameworks recognizing feature-level encrypted inference
- Sectors: regulators, standards bodies, compliance teams
- What: Establish guidelines that treat encrypted aligned features and HE-only inference as compliant cross-border processing for PII/PHI, reducing barriers to secure AI adoption.
- Tools/products/workflows: Model risk management templates; DPIAs tailored to HE alignment; procurement clauses specifying semi-honest guarantees and leakage analyses
- Dependencies/assumptions: regulator engagement; formal privacy analyses; sector-specific requirements (e.g., HIPAA, GDPR)
Automated alignment selection and quality prediction
- Sectors: MLOps, platform engineering
- What: Choose model pairs automatically based on tokenizer compatibility (exact match/Jaccard), CKA metrics, and pilot stitching scores to predict success before deployment.
- Tools/products/workflows: “Alignment Recommender” service; CI/CD checks that fail unsafe/low-compatibility pairings; monitoring for drift in alignment quality
- Dependencies/assumptions: access to compatibility signals; periodic revalidation as models/versions change
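The tokenizer compatibility signals mentioned in the automated-selection entry can be sketched as simple set computations. This is an illustrative sketch only: the function names are hypothetical, and it assumes that "exact match rate" means the fraction of source-vocabulary tokens that appear verbatim in the target vocabulary, which may differ from the paper's exact definition. The 0.67 default mirrors the approximate threshold quoted above.

```python
def jaccard_index(vocab_a, vocab_b):
    """Intersection over union of two token vocabularies."""
    a, b = set(vocab_a), set(vocab_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def exact_match_rate(source_vocab, target_vocab):
    """Assumed definition: share of source tokens found verbatim in the target."""
    source, target = set(source_vocab), set(target_vocab)
    return len(source & target) / len(source) if source else 0.0

def passes_compatibility_gate(source_vocab, target_vocab, min_exact_match=0.67):
    """Hypothetical CI-style check that rejects low-compatibility pairings."""
    return exact_match_rate(source_vocab, target_vocab) >= min_exact_match

# Toy vocabularies (hypothetical); real tokenizers have tens of thousands of entries.
source_vocab = ["the", "cat", "sat", "on", "##s"]
target_vocab = ["the", "cat", "sat", "on", "rug", "Ġmat"]

jaccard = jaccard_index(source_vocab, target_vocab)
match_rate = exact_match_rate(source_vocab, target_vocab)
gate_ok = passes_compatibility_gate(source_vocab, target_vocab)
print(f"Jaccard={jaccard:.3f} exact-match={match_rate:.3f} pass={gate_ok}")
```

A gate like this is cheap to run before fitting any alignment map, so it fits naturally as an early pre-deployment check ahead of the more expensive CKA and pilot stitching evaluations.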

Notes on Core Assumptions and Dependencies

Representational convergence: The approach relies on empirical linear compatibility between independently trained models; quality varies by pair.
Public dataset for alignment: W* is estimated from shared public data; utility improves with closer domain match or limited in-distribution augmentation (privacy–utility trade-off).
Linear heads: Strongest results and sub-second latency assume a linear classifier (or final linear token head).
Threat model: Semi-honest adversaries; malicious security requires additional protocols (at increased cost/latency).
Tokenizer compatibility and model scale: For generation, success depends on tokenizer overlap and sufficient source model size (≥ ~4B parameters) per the paper’s findings.
Performance/latency: CKKS over linear ops achieves sub-second inference; network conditions and batch size affect real-world latency.
Leakage: Alignment map W* reveals structural info (e.g., dimensions) but not training labels or head parameters; measured membership inference advantage is negligible under stated configurations.
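As a concrete illustration of how W* can be estimated from shared public data, the sketch below solves the ridge-regularized least-squares problem in closed form, W* = (Z_A^T Z_A + λI)^{-1} Z_A^T Z_B, using only the two sufficient statistics Z_A^T Z_A and Z_A^T Z_B. It is a plaintext, pure-Python sketch under stated assumptions: the bias term of the affine map is omitted, no encryption is performed, and all function names and toy data are hypothetical rather than taken from the paper.

```python
def transpose_product(a, b):
    """Return a^T b for row-major matrices with the same number of rows."""
    return [[sum(ra[i] * rb[j] for ra, rb in zip(a, b))
             for j in range(len(b[0]))] for i in range(len(a[0]))]

def solve(matrix, rhs):
    """Solve matrix @ X = rhs by Gauss-Jordan elimination with partial pivoting.

    Assumes the matrix is non-singular (true for a Gram matrix plus ridge > 0).
    """
    n = len(matrix)
    aug = [list(matrix[i]) + list(rhs[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]

def fit_alignment(z_a, z_b, ridge=1e-4):
    """Ridge solution W* mapping model A embeddings into model B's space."""
    gram = transpose_product(z_a, z_a)    # Z_A^T Z_A
    cross = transpose_product(z_a, z_b)   # Z_A^T Z_B (cross-covariance statistic)
    for i in range(len(gram)):
        gram[i][i] += ridge
    return solve(gram, cross)

def apply_alignment(z, w):
    """Map each row of z through the linear map w."""
    return [[sum(zi * w[i][j] for i, zi in enumerate(row))
             for j in range(len(w[0]))] for row in z]

# Toy public dataset (hypothetical): model B's embeddings are an exact linear
# function of model A's, so the fit should recover true_w up to the ridge bias.
true_w = [[2.0, 0.0], [1.0, -1.0]]
z_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
z_b = apply_alignment(z_a, true_w)

w_star = fit_alignment(z_a, z_b)
print(w_star)
```

Working from the Gram and cross-covariance matrices rather than the raw embeddings is what makes a statistics-only exchange possible in principle: the solver never needs the individual rows of Z_A or Z_B.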


Glossary

affine transformation: A linear mapping combined with a bias shift that preserves points, straight lines, and planes. Example: "The framework learns an affine transformation over a shared public dataset"
AUROC: Area Under the Receiver Operating Characteristic curve; a scalar measuring how well a score separates two classes. Example: "We report AUROC by thresholding $\mathcal{E}(z)$ to distinguish in- vs. out-of-distribution samples."
autoregressive architectures: Models that generate or predict the next element in a sequence based on previous elements. Example: "encoder-style and autoregressive architectures achieving strong generalization across diverse tasks"
bootstrapping: In HE, a costly operation that refreshes ciphertext noise to allow deeper computations. Example: "requiring no bootstrapping or modulus switching beyond standard rescaling."
Centered Kernel Alignment (CKA): A similarity metric for comparing representations across neural networks. Example: "Linear CKA similarity across embedding APIs."
CKKS: A homomorphic encryption scheme supporting approximate arithmetic on real-valued vectors. Example: "We implement HELD using TenSEAL CKKS with poly_modulus_degree=8192"
cosine similarity: A measure of similarity between two vectors based on the cosine of the angle between them. Example: "We compare Cross-Model text generation to the text produced by each base model using cosine similarity (using OpenAI's embedding-001)."
cross-covariance: A matrix capturing pairwise covariances between two sets of variables or representations. Example: "computes the encrypted cross-covariance $\mathsf{Enc}(Z_A^\top Z_B)$ "
cross-silo inference: Performing inference across different organizations or systems that cannot share data/models directly. Example: "enable cross-silo inference between independent LLMs."
Energy score: An OOD detection metric computed from the logits as the negative log-sum-exp (optionally temperature-scaled); higher values suggest OOD. Example: "We use the Energy score"
Homomorphic Encryption: Cryptography that enables computation on encrypted data without decrypting it. Example: "applies homomorphic encryption to protect client queries during inference."
Homomorphic Encryption Security Standard: A community standard specifying security levels and parameter choices for HE. Example: "according to the Homomorphic Encryption Security Standard"
honest-but-curious (semi-honest) threat model: Assumes parties follow the protocol but try to learn additional information from messages. Example: "We adopt a semi-honest (honest-but-curious) threat model"
Jaccard index: A set-similarity measure defined as intersection over union; here used for vocabulary overlap. Example: "Jaccard index (r = 0.822) correlating with text generation quality."
linear identifiability: The property that learned representations can be related by an invertible linear transform under certain conditions. Example: "Linear Identifiability."
logits: Pre-softmax scores output by a classifier that indicate relative confidence for each class. Example: "OOD detection evaluates whether a model can separate in-distribution inputs from unseen data by probing its logits confidence."
LLM-as-a-Judge: An evaluation approach where an LLM scores or compares outputs for quality. Example: "We assess quality through LLM-as-a-Judge evaluation"
membership inference attacks: Attacks aiming to determine whether a specific sample was in a model’s training data. Example: "defend against model extraction and membership inference attacks"
model extraction: Attempts to recover a model or its parameters by querying it and analyzing outputs. Example: "defend against model extraction and membership inference attacks"
model stitching: Connecting parts of different models via adapters to test interchangeability of representations. Example: "prior work on model stitching shows that independently trained models can be aligned"
multiplicative depth: The number of sequential multiplications in a circuit; a key complexity/feasibility metric in HE. Example: "minimal multiplicative depth (depth-1: one ciphertext-plaintext multiplication)"
ordinary least squares: A regression method minimizing squared errors between predictions and targets. Example: "ordinary least squares with ridge regularization ( $\lambda = 10^{-4}$ )"
out-of-distribution (OOD) detection: Identifying inputs that do not come from the training distribution. Example: "OOD detection evaluates whether a model can separate in-distribution inputs from unseen data"
perplexity: A measure of uncertainty in language modeling; lower values indicate better predictive performance. Example: "show lower perplexity degradation"
Platonic Representation Hypothesis: The idea that large models converge to similar latent structures capturing real-world statistics. Example: "the Platonic Representation Hypothesis"
ridge regularization: L2 penalty added to regression to improve conditioning and prevent overfitting. Example: "ridge regularization ( $\lambda = 10^{-4}$ )"
scaling laws: Empirical relationships linking model/data/compute scale to performance and capabilities. Example: "driven by scaling laws that link model size, compute, and data volume to emergent capabilities"
secure aggregation: A protocol that aggregates client data in encrypted form so the server learns only the aggregate. Example: "via secure aggregation."
secure multi-party computation (MPC): Cryptographic methods that allow parties to jointly compute a function without revealing inputs. Example: "combines HE with MPC to accelerate end-to-end interactive private inference."
semantic security: A guarantee that a computationally bounded adversary learns nothing about a plaintext from its ciphertext beyond what it could already infer without seeing it. Example: "semantic security implies these ciphertexts reveal no information"
SIMD packing: Packing multiple values into a single ciphertext to enable parallel encrypted operations. Example: "since CKKS uses SIMD packing to encrypt multiple values into a single ciphertext."
tokenizer compatibility: Degree to which two tokenizers produce matching token sequences/vocabularies; affects cross-model transfer. Example: "tokenizer compatibility strongly predicts success"
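To make the Energy score and AUROC entries concrete, the sketch below scores toy logit vectors and measures how well the scores separate in-distribution from OOD inputs. It is a minimal illustration, assuming the common convention E(z) = -log Σ_i exp(z_i) with temperature 1 and treating OOD as the positive class; the function names and logits are hypothetical, and the paper's exact configuration may differ.

```python
import math

def energy_score(logits):
    """Negative log-sum-exp of the logits (temperature 1), computed stably."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(v - m) for v in logits)))

def auroc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Toy logits (hypothetical): in-distribution inputs tend to have one large
# logit, while OOD inputs tend to have flat, low-magnitude logits.
in_dist_logits = [[6.0, 0.5, -1.0], [0.2, 5.5, 0.1], [4.0, 3.5, -2.0]]
ood_logits = [[0.3, 0.2, 0.1], [1.0, 0.9, 1.1], [4.5, 0.0, 0.1]]

in_dist_energy = [energy_score(z) for z in in_dist_logits]
ood_energy = [energy_score(z) for z in ood_logits]

# OOD is the positive class because higher energy suggests OOD.
ood_auroc = auroc(ood_energy, in_dist_energy)
print(f"AUROC: {ood_auroc:.3f}")
```

The pairwise formulation is quadratic in the number of samples, which is fine for a sketch; a rank-based implementation would be the usual choice at benchmark scale.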

