Curie Policy Language (CPL)
- CPL is a formal policy description language that precisely defines share and acquire clauses using structured syntax and conditionals.
- It supports secure, automated negotiation of data sharing agreements through mechanisms like secure multi-party computation and optional differential privacy.
- Validated in healthcare consortia, CPL enables fine-grained, alliance-based data control to enhance predictive model performance.
The Curie Policy Language (CPL) is a formal policy description language central to the Curie framework for secure, policy-driven data exchange among members whose relationships may be governed by complex regulatory, political, trust, and strategic considerations. CPL enables each participant to precisely declare both the conditions under which it is willing to share data and the conditions under which it seeks to acquire data, supporting controlled collaboration without requiring mutual trust or centralized governance. The language serves as the foundation for automated negotiation of global data-sharing agreements, which are enforced using secure multi-party computation (MPC) and, optionally, differential privacy (DP). Curie and CPL were introduced and validated in the context of healthcare prediction consortia, notably for secure multi-institutional warfarin dosing model development (Celik et al., 2017).
1. CPL Syntax and Structure
CPL provides an expressive, structured syntax presented in Backus–Naur Form (BNF). The core elements are "share" and "acquire" clauses, with optional "sub" clauses for modular filtering and attributes for named variables. Clause components include members (to/from whom data is shared/acquired), conditionals (logical or data-dependent requirements), and selections (row-level filters). Below is a simplified representation of the CPL grammar:
<curie_policy> ::= <statement> (';' <statement>)*
<statement> ::= <share_clause> | <acquire_clause> | <sub_clause> | <attribute>
<share_clause> ::= 'share' ':' [<members>] ':' [<conditionals>] '::' <selections>
<acquire_clause> ::= 'acquire' ':' [<members>] ':' [<conditionals>] '::' <selections>
<sub_clause> ::= <tag> ':' [<conditionals>] '::' <selections>
<attribute> ::= <identifier> ':=' '<' <value> '>' | <identifier> ':=' '<' <value_list> '>'
<conditionals> ::= (<var> '=' <value> (',' <conditionals>)*)
| 'evaluate' '(' <data_ref> ',' <alg_arg> ',' <threshold> ')' (',' <conditionals>)* | ε
<selections> ::= <filters> | <tag>
<filters> ::= <filter> (',' <filters>)*
<filter> ::= <var> <operation> <value> | ε
<members> ::= <member> (',' <members>)*
<member> ::= <identifier>
<operation> ::= '=' | '<' | '>' | '!=' | 'in' | 'like' | …
<value> ::= string or number
<value_list> ::= '{' <value> (',' <value>)* '}'
Conditionals can be straightforward Boolean predicates on member attributes or requester's metadata (e.g., "country=US") or privacy-preserving data-dependent evaluations (e.g., intersection size, Jaccard index, Pearson or cosine similarity) computed via secure two-party protocols.
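As a plain, non-private reference for what these data-dependent conditionals measure, the sketch below computes the four statistics in cleartext and compares them to a threshold. It is not the secure two-party protocol that Curie uses; the function names, the `METRICS` table, and the "score >= threshold" passing rule are assumptions made for illustration.

```python
import math

def intersection_size(a, b):
    """Number of distinct values present in both columns."""
    return len(set(a) & set(b))

def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B| over the distinct values of two columns."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def pearson(u, v):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    sd_u = math.sqrt(sum((x - mean_u) ** 2 for x in u))
    sd_v = math.sqrt(sum((y - mean_v) ** 2 for y in v))
    return cov / (sd_u * sd_v) if sd_u and sd_v else 0.0

# Hypothetical lookup table from metric name to function.
METRICS = {
    "Intersection size": intersection_size,
    "Jaccard index": jaccard_index,
    "Cosine": cosine_similarity,
    "Pearson": pearson,
}

def evaluate(local_column, remote_column, metric, threshold):
    """Cleartext stand-in for a data-dependent conditional.

    Assumption: the conditional passes when the score is at least the threshold.
    """
    return METRICS[metric](local_column, remote_column) >= threshold
```

For example, `evaluate([1, 2, 3], [2, 3, 4], "Jaccard index", 0.3)` passes because the two columns share two of four distinct values.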
2. Policy Semantics and Evaluation
CPL semantics are clause-oriented and evaluated top-down. A "share" clause of the form
share : M_j : C_1, ..., C_k :: S_1, ..., S_m
is interpreted as: when member M_j requests data, if all conditionals C_1, ..., C_k evaluate to true, exactly the subset of rows meeting the selections S_1, ..., S_m will be shared. Conversely, "acquire" clauses specify the data to request from a member, again gated by the declared conditionals.
Selections may reference simple row filters (e.g., "age > 65", "race in {Asian, White}"), or invoke named subclauses for modular filtering logic. Data-dependent conditionals such as
evaluate(local_column, 'Jaccard index', θ)
trigger secure two-party computation protocols to compute the relevant statistic, passing if the result meets the threshold θ. Clause matching halts at the first clause whose conditionals succeed. If no clause matches, no data are shared or acquired.
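The top-down, first-match behaviour can be sketched as follows. This is a minimal illustration, not Curie's implementation: the dictionary layout of a clause, the `OPS` table, and the helper names are hypothetical, and only attribute-equality conditionals are modelled (data-dependent `evaluate` conditionals are omitted).

```python
import operator

# Hypothetical mapping from CPL filter operators to Python predicates.
OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "in": lambda value, allowed: value in allowed,
}

def row_matches(row, filters):
    """A row is selected when every (var, op, value) filter holds.

    An empty filter list selects every row, like the '*' selection.
    """
    return all(OPS[op](row[var], value) for var, op, value in filters)

def evaluate_share(clauses, requester, requester_attrs, rows):
    """Return the rows shared with `requester` under first-match semantics."""
    for clause in clauses:  # clauses are tried top-down
        if requester not in clause["members"]:
            continue
        conditionals = clause["conditionals"]
        if all(requester_attrs.get(k) == v for k, v in conditionals.items()):
            # Matching halts at the first clause whose conditionals succeed.
            return [r for r in rows if row_matches(r, clause["selections"])]
    return []  # no clause matched: nothing is shared

# Two share clauses addressed to M1, mirroring an alliance-gated policy.
m2_share = [
    {"members": ["M1"], "conditionals": {"NATO": True, "EU": True},
     "selections": [("country", "in", {"US", "CA", "UK"})]},
    {"members": ["M1"], "conditionals": {},
     "selections": [("race", "=", "White")]},
]
rows = [
    {"country": "US", "race": "Asian"},
    {"country": "FR", "race": "White"},
    {"country": "UK", "race": "White"},
]
```

With `{"NATO": True, "EU": True}` as the requester's attributes the first clause applies and the country filter is used; if either attribute is false, evaluation falls through to the second clause.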
3. Local Policy Articulation
Each consortium member maintains a local CPL file, comprising two main types of statements:
- Share-clauses: Specify what data the participant is willing to share, to whom, and under what conditions.
- Acquire-clauses: Specify what data the participant intends to request, from whom, and under what conditions.
A minimal example for a three-member consortium is as follows:
@M1
acquire: M2: :: age > 25 ;
acquire: M3: evaluate(col=age,'Jaccard',0.3):: race='Asian' ;
share: M2: :: * ;
share: M3: :: * ;
@M2
acquire: M1: :: * ;
share: M1: NATO=true,EU=true :: country in {US,CA,UK} ;
share: M1: :: race='White' ;
@M3
acquire: M1: :: genotype='A/A' ;
share: M1: evaluate(col=genotype,'Intersection size',10) :: * ;
share: M1: :: weight>150 ;
These policies permit intricate, multi-faceted control, such as alliance-restricted sharing, attribute-based filtering, and data similarity gating.
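To make the structure of such a policy file concrete, here is a small parsing sketch. It assumes the layout used in the example (one clause per line, `@Member` headers, a trailing `;`), which is narrower than the full grammar; the `Clause` type and the function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Clause:
    kind: str           # "share" or "acquire"
    members: list       # member identifiers the clause addresses
    conditionals: list  # raw conditional strings, possibly empty
    selections: str     # raw selection text, e.g. "age > 25" or "*"

def split_top_level(text, sep=","):
    """Split on `sep`, ignoring separators nested inside () or {}."""
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch in "({":
            depth += 1
        elif ch in ")}":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append(current.strip())
            current = ""
        else:
            current += ch
    if current.strip():
        parts.append(current.strip())
    return parts

def parse_clause(text):
    """Parse 'kind: members: conditionals :: selections' into a Clause."""
    head, _, selections = text.rpartition("::")
    kind, members, conditionals = (p.strip() for p in head.split(":", 2))
    if kind not in ("share", "acquire"):
        raise ValueError(f"unsupported clause kind: {kind!r}")
    return Clause(kind, split_top_level(members),
                  split_top_level(conditionals), selections.strip())

def parse_policy(text):
    """Group clauses by the '@Member' header that precedes them."""
    policies, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            current = line[1:]
            policies[current] = []
        else:
            policies[current].append(parse_clause(line.rstrip(";").strip()))
    return policies

EXAMPLE = """
@M1
acquire: M2: :: age > 25 ;
acquire: M3: evaluate(col=age,'Jaccard',0.3):: race='Asian' ;
@M2
share: M1: NATO=true,EU=true :: country in {US,CA,UK} ;
"""
policies = parse_policy(EXAMPLE)
```

Keeping conditionals and selections as raw strings is a deliberate simplification; a fuller parser would turn them into typed filter and predicate objects.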
4. Negotiation and Global Agreement Formation
Pairwise negotiation is performed among all members:
- Each participant $M_i$ sends its acquire-clause for $M_j$ directly to $M_j$.
- On receipt, $M_j$ evaluates its share-clauses relevant to $M_i$, applies conditionals (including data-dependent privacy-preserving checks), and calculates the intersection between $M_i$'s requested subset and $M_j$'s allowable subset. This intersection forms the agreed subset $D_{j \to i}$.
- $M_j$ returns a minimal negotiation response (pointer to the subset) to $M_i$.
- After negotiation with all peers, $M_i$ holds the collection of agreed subsets $\{D_{j \to i}\}_{j \neq i}$ specifying permitted data slices.
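The agreed data $D_{j \to i}$ for $M_i$ to acquire from $M_j$ is defined by
$$D_{j \to i} = \begin{cases} \sigma_{S_i^{\mathrm{acq}}}(D_j) \cap \sigma_{S_j^{\mathrm{shr}}}(D_j) & \text{if } C_i^{\mathrm{acq}} \wedge C_j^{\mathrm{shr}} \\ \emptyset & \text{otherwise} \end{cases}$$
where $D_j$ is $M_j$'s local data, $\sigma_S(D_j)$ denotes the rows of $D_j$ that satisfy selections $S$, $S_i^{\mathrm{acq}}$ and $S_j^{\mathrm{shr}}$ are the selections of the acquire and share clauses, and $C_i^{\mathrm{acq}}$ and $C_j^{\mathrm{shr}}$ are each side's respective conditionals. If either party's policy is not satisfied, no data are exchanged between that pair.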
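The pairwise outcome can be sketched as a set intersection guarded by both sides' conditionals. The function names and the representation of rows as an id-to-record dictionary are hypothetical, and the conditionals are passed in as already-evaluated booleans.

```python
def select(rows, predicate):
    """Ids of the owner's rows that satisfy a selection predicate."""
    return {row_id for row_id, row in rows.items() if predicate(row)}

def negotiate_pair(owner_rows, acquire_pred, share_pred,
                   acquire_ok=True, share_ok=True):
    """Row ids the requester may obtain from the owner.

    acquire_ok / share_ok stand for the outcome of each side's conditionals;
    if either fails, nothing is exchanged for this pair.
    """
    if not (acquire_ok and share_ok):
        return set()
    requested = select(owner_rows, acquire_pred)  # what the requester asks for
    allowable = select(owner_rows, share_pred)    # what the owner permits
    return requested & allowable

# The owner's rows, keyed by a local row id.
owner_rows = {
    1: {"age": 30, "race": "White"},
    2: {"age": 20, "race": "White"},
    3: {"age": 40, "race": "Asian"},
}
agreed = negotiate_pair(
    owner_rows,
    acquire_pred=lambda r: r["age"] > 25,       # requester: age > 25
    share_pred=lambda r: r["race"] == "White",  # owner: race='White'
)
```

Only row 1 satisfies both sides here, so the owner would return a pointer to that single-row subset.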
5. Enforcement via Secure MPC and Differential Privacy
After policy negotiation, each member constructs local summary statistics over its permitted subset:
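- $X_i^\top X_i$ (feature covariance)
- $X_i^\top y_i$ (feature-outcome cross-term)
where $X_i$ is the matrix of selected feature rows held by member $i$ and $y_i$ the corresponding outcome vector. Because their sizes depend only on the number of features, not on the number of rows, these statistics are compact. The global linear model solution is computed by:
$$\hat{\beta} = \Big(\sum_i X_i^\top X_i\Big)^{-1} \Big(\sum_i X_i^\top y_i\Big)$$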
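The following sketch checks, on a toy example, that summing per-member statistics gives the same least-squares solution as pooling all rows. It uses exact rational arithmetic to avoid rounding questions; the helper names and the two-member data set are made up for illustration.

```python
from fractions import Fraction

def gram(X):
    """Feature covariance term X^T X for a list of feature rows."""
    d = len(X[0])
    return [[sum(r[a] * r[b] for r in X) for b in range(d)] for a in range(d)]

def cross(X, y):
    """Feature-outcome cross-term X^T y."""
    d = len(X[0])
    return [sum(r[a] * t for r, t in zip(X, y)) for a in range(d)]

def add_mat(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def add_vec(u, v):
    return [a + b for a, b in zip(u, v)]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination over exact fractions."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [row[n] for row in M]

# Two members; each row is (intercept, x) and outcomes follow y = 1 + 2x.
X1, y1 = [[1, 1], [1, 2]], [3, 5]
X2, y2 = [[1, 3], [1, 4]], [7, 9]

# Each member computes local statistics; only their sums are needed.
xtx = add_mat(gram(X1), gram(X2))
xty = add_vec(cross(X1, y1), cross(X2, y2))
beta_aggregated = solve(xtx, xty)

# Reference: the solution a single party would get from the pooled rows.
beta_pooled = solve(gram(X1 + X2), cross(X1 + X2, y1 + y2))
```

Both `beta_aggregated` and `beta_pooled` recover intercept 1 and slope 2, which is why exchanging only the two summary statistics suffices for the linear model.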
Computation proceeds via a homomorphic encryption (HE) ring protocol:
- Initiator generates (pk, sk), broadcasts pk.
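- The initiator encrypts its local statistics (covariance and cross-term) under pk and forwards the ciphertexts to the next member.
- Each party homomorphically adds its own encrypted local statistics to the running ciphertexts and forwards them along the ring.
- When the data returns to the initiator, decryption recovers the aggregated statistics, that is, the sums of all members' covariance and cross-term matrices.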
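The ring steps above can be sketched with a toy additively homomorphic scheme. Paillier is used here only as a well-known example of such a scheme; the source does not say which HE scheme Curie uses. The fixed primes, the fixed-point `scale`, and all function names are assumptions for illustration, and the key size is far too small for real use.

```python
import math
import secrets

# Two known Mersenne primes; toy parameters, NOT a secure key size.
P, Q = 2**31 - 1, 2**61 - 1

def keygen(p=P, q=Q):
    """Return (public key n, secret key) for Paillier with g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (n, lam, mu)

def encrypt(n, m):
    """Encrypt integer m; negatives are represented modulo n."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (1 + (m % n) * n) * pow(r, n, n2) % n2

def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds plaintexts."""
    return c1 * c2 % (n * n)

def decrypt(sk, c):
    n, lam, mu = sk
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m  # map back to a signed integer

def ring_aggregate(local_stats, scale=1000):
    """Sum each statistic entry across members along a ring.

    local_stats[0] belongs to the initiator, who holds the secret key.
    Values are encoded as fixed-point integers with `scale` steps per unit.
    """
    pk, sk = keygen()
    running = [encrypt(pk, round(v * scale)) for v in local_stats[0]]
    for stats in local_stats[1:]:  # each party adds its share and forwards
        running = [add(pk, c, encrypt(pk, round(v * scale)))
                   for c, v in zip(running, stats)]
    return [decrypt(sk, c) / scale for c in running]  # back at the initiator

# Flattened local statistics of three members (made-up numbers).
totals = ring_aggregate([[1.5, -2.0], [0.25, 4.0], [3.0, -1.0]])
```

In a real deployment each loop iteration would run on a different machine, with only ciphertexts crossing the network.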
No party ever learns another member's raw data or cleartext statistics. Optionally, to achieve $\epsilon$-differential privacy, the functional mechanism (Zhang et al., VLDB '12) is applied by adding calibrated Laplace noise to the coefficients of the objective function, with columns pre-scaled to a bounded range so that the sensitivity is bounded. This step occurs post-aggregation, with no extra communication.
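A rough sketch of the noise-injection step is given below. It perturbs the coefficients of the squared-error objective, which are built from the aggregated statistics. The sensitivity bound $2(d+1)^2$ is the value recalled from Zhang et al. for linear regression with each feature vector scaled to norm at most 1 and outcomes in $[-1, 1]$; treat it, the function names, and the mirrored noise on the quadratic term as assumptions rather than as Curie's exact procedure.

```python
import random

def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_objective(xtx, xty, yty, epsilon, rng=None):
    """Perturb the coefficients of f(w) = yty - 2 xty.w + w^T xtx w.

    Returns (quad, lin, const) such that the noisy objective is
    const + lin.w + w^T quad w.
    Assumption: sensitivity 2 * (d + 1)**2, valid only for suitably
    pre-scaled features and outcomes.
    """
    rng = rng or random.Random()
    d = len(xty)
    scale = 2 * (d + 1) ** 2 / epsilon
    quad = [[0.0] * d for _ in range(d)]
    for a in range(d):
        for b in range(a, d):
            # One draw per unordered pair, mirrored to keep quad symmetric
            # (a simplification of the per-monomial noise in the paper).
            noisy = xtx[a][b] + laplace(scale, rng)
            quad[a][b] = quad[b][a] = noisy
    lin = [-2 * v + laplace(scale, rng) for v in xty]
    const = yty + laplace(scale, rng)
    return quad, lin, const

# Aggregated statistics for d = 2 features (made-up, already scaled).
xtx = [[0.9, 0.1], [0.1, 0.8]]
xty = [0.3, -0.2]
quad, lin, const = noisy_objective(xtx, xty, yty=0.5, epsilon=1.0,
                                   rng=random.Random(0))
```

The model is then obtained by minimizing the noisy objective, for example by solving `2 * quad @ w = -lin`; with small `epsilon` the noisy `quad` may not be positive definite, in which case some form of regularization is needed before solving.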
6. Case Study: Healthcare Consortia and Policy Expressiveness
CPL expressiveness and enforcement were validated in a warfarin dosing study covering 24 institutions in 9 countries. Five principal consortium architectures were included:
- P.1: Single source (no cross-institution sharing)
- P.2: Nation-wide (e.g., US-within-US)
- P.3: Regional (e.g., North America, Europe, Asia grouped)
- P.4: NATO–EU alliance-based
- P.5: Global (all share with all)
Example policies for these settings:
- Nation-wide:
acquire: *: country='US' :: * ;
share: *: country='US' :: * ;
- Global:
acquire: *: :: * ;
share: *: :: * ;
In the experiments, for all configurations, acquire policies were fully satisfied by share policies; for the nation-wide (US) consortium, all 51 shares matched the 51 requests. The resulting global models reduced mean absolute percentage error (MAPE) by up to 25% over using only local models. Fine-grained policies (e.g., race-balanced, alliance-constrained) allowed participants to optimize data mixes for targeted accuracy gains.
7. Performance and Overhead Characteristics
Performance and scalability metrics indicated:
- Negotiation overhead scales as $O(n^2)$ in the number of members $n$ in the worst case, but each interaction involves only minimal metadata (e.g., 13 members incur ~156 messages).
- Non-cryptographic filters/conditionals execute in 1 ms per clause; data-dependent tests (e.g., Jaccard, intersection, Pearson, cosine) require 10–100 ms each, resulting in under 20 s for 25 × 25 pairs.
- The predominant runtime cost is the HE-based MPC matrix aggregation (one matrix-add per member). For the consortium sizes and feature dimensions evaluated, total time is generally less than 60 seconds. Key generation incurs one-time overhead.
- Differential privacy via the functional mechanism imposes negligible additional cost after statistic aggregation.
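The quoted message count is consistent with one negotiation message per ordered pair of members; that reading is an assumption, and the function name below is hypothetical.

```python
from itertools import permutations

def negotiation_messages(n_members):
    """Worst-case message count, assuming one message per ordered pair."""
    return n_members * (n_members - 1)

# Enumerating the ordered (requester, owner) pairs gives the same count.
members = [f"M{i}" for i in range(1, 14)]
pairs = list(permutations(members, 2))
```

For 13 members this gives 13 × 12 = 156 ordered pairs, matching the figure above and illustrating the quadratic growth.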
CPL, therefore, achieves a tractable balance between expressiveness and computational overhead. Consortium partners can articulate intricate, enforceable data-sharing requirements—ranging from attribute-based and alliance-based rules to private data similarity gates—while maintaining scalable and principled secure computation (Celik et al., 2017).