CTCF Liver ChIP-seq Analysis
- CTCF liver ChIP-seq maps genome-wide CTCF binding sites in adult liver tissue using chromatin immunoprecipitation followed by sequencing.
- MAP-Elites enhances motif discovery by evolving diverse, high-quality position weight matrices that reveal distinct CTCF motif subfamilies.
- Comparative benchmarking and rigorous experimental protocols highlight different binding modalities and their regulatory implications in liver genomics.
CTCF liver ChIP-seq data denotes genome-wide profiling of CTCF (CCCTC-binding factor) binding sites in primary adult human liver tissue via chromatin immunoprecipitation followed by sequencing. Recent analyses, especially using MAP-Elites quality-diversity algorithms, have revealed the existence of distinct CTCF motif subfamilies with structured diversity in sequence architecture, support, and genomic distribution (Medina et al., 25 Jan 2026). This paradigm enables discrimination of binding modalities and regulatory heterogeneity that traditional likelihood-maximization motif discovery methods (e.g., MEME) obscure.
1. Experimental Protocol and Dataset Generation
Primary adult human liver samples were acquired under ENCODE protocols and processed by cross-linking (1% formaldehyde), chromatin shearing (200–500 bp fragments by sonication), and immunoprecipitation with a validated anti-CTCF antibody (5 µg/IP, overnight incubation at 4 °C). DNA was purified by phenol–chloroform extraction and ethanol precipitation [§3]. Libraries were constructed using TruSeq adapters and sequenced on Illumina HiSeq 2000/2500 with 50 bp single-end reads; each replicate produced approximately 30 million reads.
Reads were mapped to the GRCh38/hg38 reference genome using BWA-MEM v0.7.17 with standard parameters, followed by filtering (removal of reads with MAPQ < 30, duplicate marking), yielding ≈90% unique mapping rates. Peak calling employed MACS2 (v2.1.1, q-value < 0.01), resulting in roughly 45,000–55,000 peaks per replicate. Irreproducible Discovery Rate (IDR) filtering defined a consensus set of ≈50,000 high-confidence peaks with an average width of ≈300 bp. This consensus set forms the foreground set S for downstream motif discovery. Background sets were matched to the foreground for length and GC content and used for threshold calibration [§3.1].
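The source does not say how the length- and GC-matched background sets were built. One simple way to get such a background, shown here only as an assumed stand-in, is to shuffle the bases of each foreground peak, which preserves length and GC content exactly. The function names `gc_fraction` and `shuffled_background` are hypothetical.

```python
import random

def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def shuffled_background(peaks, seed=0):
    """Assumed stand-in for a matched background: shuffle each peak's bases.

    Shuffling keeps the length and base composition (hence GC content)
    of every sequence while destroying any motif structure.
    """
    rng = random.Random(seed)
    background = []
    for seq in peaks:
        bases = list(seq.upper())
        rng.shuffle(bases)
        background.append("".join(bases))
    return background

# Toy foreground sequences, not real peak data.
peaks = ["ACGTGGCCCTCTAGTGGA", "TTTTCCGGAA"]
bg = shuffled_background(peaks)
```

A mononucleotide shuffle ignores dinucleotide structure such as CpG depletion; drawing matched genomic regions is a common alternative when that matters.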
2. MAP-Elites Quality-Diversity Motif Discovery Framework
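The motif discovery task was formulated as optimization over a position weight matrix (PWM) representation $P \in \mathbb{R}^{L \times 4}$ of length $L$, where $P$ is row-stochastic across {A, C, G, T}. For a PWM $P$ and nucleotide background $b$, the log-odds for a sequence window $w = w_1 \dots w_L$ is
$$\mathrm{LO}(w) = \sum_{i=1}^{L} \log \frac{P_{i,w_i}}{b_{w_i}}.$$
The best motif hit per sequence $x$ is $s(x) = \max_{w \subseteq x} \mathrm{LO}(w)$, the maximum over all length-$L$ windows of $x$. Fitness is computed as the average of $s(x)$ over the top-scoring sequences in $S$ after trimming the top 10% of extreme scores:
$$f(P) = \frac{1}{|T|} \sum_{x \in T} s(x),$$
where $T \subseteq S$ is the set of retained top-scoring sequences.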
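The scoring described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the base-2 logarithm, the uniform background, the pseudocount, and the `top_frac` parameter (how many of the remaining sequences count as "top") are all assumptions, since the source does not state them.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def best_hit(pwm, seq, background=(0.25, 0.25, 0.25, 0.25), pseudo=1e-9):
    """Maximum log-odds score over all length-L windows of seq.

    Assumes seq contains only A, C, G, T (ambiguous bases are not handled).
    """
    pwm = np.asarray(pwm, dtype=float)
    bg = np.asarray(background, dtype=float)
    log_odds = np.log2((pwm + pseudo) / bg)        # shape (L, 4)
    length = log_odds.shape[0]
    codes = np.array([BASE_INDEX[b] for b in seq.upper()])
    if len(codes) < length:
        return float("-inf")                       # too short to score
    rows = np.arange(length)
    return max(
        float(log_odds[rows, codes[i:i + length]].sum())
        for i in range(len(codes) - length + 1)
    )

def trimmed_fitness(scores, trim=0.10, top_frac=0.5):
    """Drop the highest `trim` fraction of scores, then average the top
    `top_frac` of what remains (top_frac is a hypothetical parameter)."""
    ordered = np.sort(np.asarray(scores, dtype=float))[::-1]  # descending
    n_trim = int(len(ordered) * trim)
    kept = ordered[n_trim:]
    k = max(1, int(len(kept) * top_frac))
    return float(kept[:k].mean())

# Toy PWM that strongly prefers "ACG".
toy_pwm = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
toy_score = best_hit(toy_pwm, "TTACGTT")
```

With a uniform background, each perfectly matched position contributes about 2 bits, so the toy score is close to 6.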
MAP-Elites structures PWM optimization as a search for diverse, high-fitness motifs. The archive is a two-dimensional grid of cells; at each generation, batches of 32 PWMs are evolved using IsoLineEmitter (pyribs v0.2.0), which combines isotropic Gaussian noise ($\sigma_{\text{iso}}$) with a directional perturbation ($\sigma_{\text{line}}$). Candidates are assigned to cells defined by behavioral descriptors; in each cell, only the highest-fitness motif is retained. For comparison, MEME (v4.11.4) is run on matching subsets, and its motif outputs are rescored under the same fitness criterion [§2.3, §3.2].
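The archive update can be illustrated with a small self-contained loop in plain NumPy. It deliberately does not use pyribs, so it is a sketch of the general MAP-Elites idea, not the pipeline described here. The 20 × 20 grid (chosen to give 400 cells), the noise scales `sigma_iso` and `sigma_line`, and the toy fitness and descriptor functions are all assumptions for illustration; only the batch size of 32 comes from the text.

```python
import numpy as np

def normalize(m):
    """Clip to positive values and renormalize rows to sum to 1."""
    m = np.clip(m, 1e-6, None)
    return m / m.sum(axis=1, keepdims=True)

def iso_line(parent_a, parent_b, rng, sigma_iso=0.01, sigma_line=0.2):
    """Iso+Line variation: isotropic noise plus a random step along the
    line between two elites (noise scales are illustrative)."""
    child = (parent_a
             + sigma_iso * rng.normal(size=parent_a.shape)
             + sigma_line * rng.normal() * (parent_b - parent_a))
    return normalize(child)

def toy_descriptors(pwm):
    """Placeholder descriptors in [0, 1]: GC content and scaled entropy."""
    gc = (pwm[:, 1] + pwm[:, 2]).mean()
    entropy = -(pwm * np.log2(pwm)).sum(axis=1).mean() / 2.0
    return np.array([gc, entropy])

def toy_fitness(pwm):
    """Placeholder fitness: mean of the largest probability per position."""
    return float(pwm.max(axis=1).mean())

def map_elites(fitness_fn, descriptor_fn, length=8, grid=(20, 20),
               generations=50, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    archive = {}                                   # cell -> (fitness, pwm)

    def try_insert(pwm):
        d = np.clip(descriptor_fn(pwm), 0.0, 1.0 - 1e-9)
        cell = tuple(int(v) for v in d * np.array(grid))
        f = fitness_fn(pwm)
        # Keep the candidate only if the cell is empty or it beats the elite.
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, pwm)

    for _ in range(batch):                         # random initial batch
        try_insert(normalize(rng.random((length, 4))))
    for _ in range(generations):
        elites = [p for _, p in archive.values()]
        for _ in range(batch):
            a = elites[rng.integers(len(elites))]
            b = elites[rng.integers(len(elites))]
            try_insert(iso_line(a, b, rng))
    return archive

archive = map_elites(toy_fitness, toy_descriptors, generations=20)
```

In a real run, `toy_fitness` would be replaced by the trimmed log-odds fitness and `toy_descriptors` by one of the behavioral characterizations.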
3. Behavioral Characterizations and Diversity Metrics
MAP-Elites partitions the search space using two-dimensional behavioral descriptors. Three characterizations were evaluated to organize motif diversity:
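- ME.SP: Information Content (IC) vs. Support (supp)
  - IC quantifies motif specificity: $\mathrm{IC} = \sum_{i=1}^{L} \left( 2 + \sum_{a} P_{i,a} \log_2 P_{i,a} \right)$, where $P_{i,a}$ is the probability of base $a$ at position $i$.
  - Support is the fraction of peaks $x$ with best-hit score $s(x) > \tau$, where $\tau$ is the 95th percentile of background scores.
- ME.CO: GC Content vs. Motif Entropy ($H$)
  - GC is the average of $P_{i,C} + P_{i,G}$ over positions.
  - Entropy: $H = -\frac{1}{L} \sum_{i=1}^{L} \sum_{a} P_{i,a} \log_2 P_{i,a}$.
- ME.RB: Support vs. Score-Tail Behavior
  - Tail is the mean of the top 10% of scores.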
Archive coverage (fraction of occupied archive cells) and QD-score (sum of all elite fitnesses) quantify map diversity and solution quality, respectively [§2.3].
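The PWM-based descriptors and the two archive metrics can be sketched as below. The formulas follow the standard definitions (IC relative to a uniform background, per-position Shannon entropy in bits); whether the source uses exactly these normalizations is an assumption. The archive is represented as a plain dictionary from cell to elite fitness, which is a hypothetical layout chosen for brevity.

```python
import numpy as np

def information_content(pwm, eps=1e-12):
    """Total information content in bits, relative to a uniform background."""
    pwm = np.asarray(pwm, dtype=float)
    return float(np.sum(2.0 + np.sum(pwm * np.log2(pwm + eps), axis=1)))

def gc_content(pwm):
    """Average probability mass on C and G across positions."""
    pwm = np.asarray(pwm, dtype=float)
    return float((pwm[:, 1] + pwm[:, 2]).mean())   # columns ordered A, C, G, T

def mean_entropy(pwm, eps=1e-12):
    """Mean per-position Shannon entropy in bits."""
    pwm = np.asarray(pwm, dtype=float)
    return float(-np.sum(pwm * np.log2(pwm + eps), axis=1).mean())

def coverage(archive, n_cells):
    """Fraction of archive cells that hold an elite."""
    return len(archive) / n_cells

def qd_score(archive):
    """Sum of the fitness values of all elites."""
    return float(sum(archive.values()))

uniform = [[0.25] * 4] * 3
sharp = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
toy_archive = {(0, 0): 1.0, (1, 2): 0.5}           # cell -> elite fitness
```

A uniform PWM has zero information content and 2 bits of entropy per position, while a fully deterministic PWM of length 3 has 6 bits of information content and zero entropy.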
4. Analytical Results and Benchmarking Against MEME
All MAP-Elites characterizations achieved >80% coverage of the 400 possible cells, with QD-scores plateauing by 500 generations, indicating broad, robust motif discovery. Fitness results across five stratified data subsets are summarized below (see Table 1, §4):
| Method | Max | Avg (mean ± std) |
|---|---|---|
| MEME | 1.11 | −2.02 ± 1.03 |
| ME.SP | 0.883 | 0.464 ± 0.010 |
| ME.CO | 0.950 | 0.745 ± 0.029 |
| ME.RB | 0.797 | 0.458 ± 0.010 |
Representative position weight matrices for the highest-scoring motifs are shown in Figure 1, including the ME.CO best motif (fitness = 0.950).
Other best motifs (ME.SP, ME.RB, MEME) exhibit distinctive sequence biases (see supplementary data).
Support of the top motifs: ME.SP ≈ 0.72, ME.CO ≈ 0.50, ME.RB ≈ 0.60, MEME ≈ 0.55 (proportion of S whose best-hit score exceeds the background-derived threshold). Leave-one-subset-out cross-validation for ME.CO yielded a fitness decrease of ≤3%, indicating robustness, whereas MEME motifs declined by >10% when rescored out-of-sample [§4.1].
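Support values such as these can be computed from per-sequence best-hit scores once a background threshold is fixed. The sketch below assumes the threshold is the 95th percentile of background best-hit scores and that a peak counts as supported when its score is strictly greater; the strict comparison and the function name `support` are assumptions.

```python
import numpy as np

def support(foreground_scores, background_scores, percentile=95.0):
    """Return (support, threshold): the fraction of foreground scores above
    the given percentile of the background scores, and that threshold."""
    tau = float(np.percentile(background_scores, percentile))
    fg = np.asarray(foreground_scores, dtype=float)
    return float(np.mean(fg > tau)), tau

# Toy scores, not real data.
toy_background = np.arange(101)                    # 0, 1, ..., 100
toy_foreground = [90, 96, 100, 50]
toy_support, toy_tau = support(toy_foreground, toy_background)
```

Here the threshold is 95 and two of the four toy foreground scores exceed it, giving a support of 0.5.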
5. Biological Implications of Motif Diversity
MAP-Elites unveiled two prominent motif subfamilies:
- Canonical GC-rich CTCF motif (positions 6–14, “CCCTC” at 8–12): Distinguished by high IC and moderate support, corresponding to ME.CO region elites.
- Degenerate variant: Exhibiting relaxed flanking positions, higher overall support but lower core IC; these likely represent context-dependent or lower-affinity CTCF sites (ME.SP region).
Genomic mapping shows that canonical high-IC motifs are significantly promoter-enriched (±1 kb from TSS), whereas degenerate motifs are distributed across enhancers and insulators. This structured motif diversity aligns with hypothesized dual CTCF binding modalities: a “tight” insulator/promoter-bound mode and a “broad” loop-mediated regulatory architecture in hepatocytes.
This suggests that single-motif extraction (e.g., MEME) conflates alternative modes, blurring locus-specific regulatory diversity.
6. Methodological and Comparative Considerations
MAP-Elites advances motif discovery by framing it as a quality-diversity optimization process that evolves multiple distinct, high-quality motifs, unlike consensus-oriented strategies such as MEME. The archive-based approach comes close to the peak fitness of MEME’s strongest motif (0.950 vs. 1.11) while achieving far higher average fitness, and it illuminates trade-offs among specificity, empirical coverage, and compositional robustness that are otherwise lost. The choice of behavioral descriptors (IC/support, GC/entropy, support/tail) directly shapes the diversity landscape explored, allowing modelers to tailor motif searches to specific biological or analytical hypotheses.
A plausible implication is that quality-diversity frameworks, by separating motif subfamilies, enable finer mapping of transcription factor binding heterogeneity in tissue-specific contexts, a capability essential for modern regulatory genomics (Medina et al., 25 Jan 2026).