
CTCF Liver ChIP-seq Analysis

Updated 1 February 2026
  • CTCF liver ChIP-seq data are genome-wide maps of CTCF binding sites in adult liver tissue, generated by chromatin immunoprecipitation followed by sequencing.
  • MAP-Elites enhances motif discovery by evolving diverse, high-quality position weight matrices that reveal distinct CTCF motif subfamilies.
  • Comparative benchmarking and rigorous experimental protocols highlight different binding modalities and their regulatory implications in liver genomics.

CTCF liver ChIP-seq data denotes genome-wide profiling of CTCF (CCCTC-binding factor) binding sites in primary adult human liver tissue via chromatin immunoprecipitation followed by sequencing. Recent analyses, especially those using MAP-Elites quality-diversity algorithms, have revealed distinct CTCF motif subfamilies with structured diversity in sequence architecture, support, and genomic distribution (Medina et al., 25 Jan 2026). This approach discriminates binding modalities and regulatory heterogeneity that traditional likelihood-maximization motif discovery methods (e.g., MEME) obscure.

1. Experimental Protocol and Dataset Generation

Primary adult human liver samples were acquired under ENCODE protocols. Chromatin was cross-linked (1% formaldehyde), sheared by sonication (200–500 bp fragments), and immunoprecipitated with a validated anti-CTCF antibody (5 µg/IP, overnight incubation at 4 °C). DNA was purified by phenol–chloroform extraction and ethanol precipitation [§3]. Libraries were constructed with TruSeq adapters and sequenced on Illumina HiSeq 2000/2500 instruments with 50 bp single-end reads; each replicate produced approximately 30 million reads.

Reads were mapped to the GRCh38/hg38 reference genome using BWA-MEM v0.7.17 with standard parameters, followed by filtering (removal of reads with MAPQ < 30, duplicate marking), yielding ≈90% unique mapping rates. Peaks were called with MACS2 (v2.1.1, q-value < 0.01), giving roughly 45,000–55,000 peaks per replicate. Irreproducible Discovery Rate (IDR) filtering defined a consensus set of ≈50,000 high-confidence peaks with an average width of ≈300 bp. This consensus set forms the foreground set $S$ for downstream motif discovery. Background sets matched for length and GC content were used for threshold calibration [§3.1].

2. MAP-Elites Quality-Diversity Motif Discovery Framework

The motif discovery task was formulated as optimization over a position weight matrix (PWM) representation of length $L = 19$, where $M \in [0, 1]^{19 \times 4}$ is row-stochastic across {A, C, G, T}. For a PWM $M$ and nucleotide background $P_{bg}$, the log-odds score of a sequence window $w$ is $\ell(M, w) = \sum_{j=1}^{L} \log \left[ P_M(w_j \mid j) / P_{bg}(w_j) \right]$.
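The log-odds score can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name `log_odds`, the uniform background, the natural logarithm, and the small pseudocount `eps` are all assumptions made here for the example.

```python
import numpy as np

ALPHABET = "ACGT"
IDX = {b: i for i, b in enumerate(ALPHABET)}

def log_odds(pwm, window, background=(0.25, 0.25, 0.25, 0.25), eps=1e-9):
    """Sum over positions j of log[P_M(w_j | j) / P_bg(w_j)] for one window.

    `eps` is an assumed pseudocount that avoids log(0); it is not from the source.
    """
    pwm = np.asarray(pwm, dtype=float)
    bg = np.asarray(background, dtype=float)
    if len(window) != pwm.shape[0]:
        raise ValueError("window length must equal PWM length")
    cols = np.array([IDX[b] for b in window])       # column index of each base
    probs = pwm[np.arange(len(window)), cols]       # P_M(w_j | j)
    return float(np.sum(np.log(probs + eps) - np.log(bg[cols])))

# Toy 3-position PWM (hypothetical, much shorter than the L = 19 used in the text).
toy_pwm = np.array([[0.7, 0.1, 0.1, 0.1],
                    [0.1, 0.7, 0.1, 0.1],
                    [0.1, 0.1, 0.7, 0.1]])
score_match = log_odds(toy_pwm, "ACG")      # window agreeing with the PWM
score_mismatch = log_odds(toy_pwm, "TTT")   # window disagreeing with the PWM
```

A window that agrees with the high-probability bases scores above zero, while a window built from disfavoured bases scores below zero.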

The best motif hit per sequence $s$ is $g(M,s) = \frac{1}{L} \max_{i, \mathrm{strand}} \ell(M, s_{i:i+L-1})$. Fitness $f(M)$ is computed as the average of $g(M,s)$ over the top $k = \lceil 0.2 |S| \rceil$ sequences in $S$, after trimming the top 10% extremes: $f(M) = \frac{1}{k} \sum_{s \in \mathrm{Top}_k(S; g)} g(M, s)$.
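A minimal sketch of $g(M,s)$ and $f(M)$ follows. It assumes a uniform background and a small pseudocount, and it omits the 10% trimming step because the text does not specify exactly how the extremes are removed. The names `best_hit` and `fitness` are illustrative.

```python
import math
import numpy as np

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def best_hit(pwm, seq, bg=0.25, eps=1e-9):
    """g(M, s): best log-odds over all windows and both strands, divided by L."""
    log_ratio = np.log(np.asarray(pwm, dtype=float) + eps) - np.log(bg)
    L = log_ratio.shape[0]
    best = -np.inf
    # Scan the forward strand and its reverse complement.
    for strand in (seq, seq.translate(COMPLEMENT)[::-1]):
        codes = np.array([IDX[b] for b in strand])
        for i in range(len(codes) - L + 1):
            score = log_ratio[np.arange(L), codes[i:i + L]].sum()
            best = max(best, score)
    return best / L

def fitness(pwm, seqs, top_frac=0.2):
    """f(M): mean of g(M, s) over the k = ceil(top_frac * |S|) best sequences.

    The trimming of extremes mentioned in the text is NOT implemented here.
    """
    scores = np.sort([best_hit(pwm, s) for s in seqs])[::-1]
    k = math.ceil(top_frac * len(scores))
    return float(scores[:k].mean())

# Hypothetical toy data: a 3-position PWM favouring "ACG" and five short sequences.
toy_pwm = np.array([[0.7, 0.1, 0.1, 0.1],
                    [0.1, 0.7, 0.1, 0.1],
                    [0.1, 0.1, 0.7, 0.1]])
seqs = ["TTACGTT", "TTTTTTT", "CGTAAAA", "GGGGGGG", "AAAAAAA"]
f_val = fitness(toy_pwm, seqs)
```

Because both strands are scanned, a sequence containing the reverse complement of the motif scores as highly as one containing the motif itself.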

MAP-Elites structures PWM optimization as a search for diverse, high-fitness motifs. The archive is a $20 \times 20$ cell grid; at each generation, batches of 32 PWMs are evolved using IsoLineEmitter (pyribs v0.2.0), with isotropic Gaussian noise ($\sigma_{iso} = 0.12$) and directional perturbation ($\sigma_{line} = 0.25$). Candidates are assigned to cells defined by behavioral descriptors; in each cell, the highest-fitness motif is retained. The comparison tool MEME (v4.11.4) is run on matching subsets, and its motif outputs are rescored under the same $f(M)$ criterion [§2.3, §3.2].
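The archive-and-emitter loop described above can be illustrated with a plain NumPy sketch. It deliberately does not use the pyribs API; instead it hand-rolls an iso+line variation step (isotropic noise plus a random step along the line between two elites). The repair step `normalise`, the GC/entropy descriptors, and the placeholder `toy_fitness` are assumptions for illustration and stand in for the real $f(M)$.

```python
import numpy as np

def normalise(x, eps=1e-6):
    """Clip and renormalise rows so the result is a valid PWM (assumed repair step)."""
    x = np.clip(x, eps, None)
    return x / x.sum(axis=1, keepdims=True)

def descriptors(pwm):
    """GC content and mean per-position entropy scaled to [0, 1] (2 bits maximum)."""
    gc = pwm[:, 1:3].sum(axis=1).mean()
    entropy = -(pwm * np.log2(pwm)).sum(axis=1).mean() / 2.0
    return gc, entropy

def map_elites(fitness_fn, L=19, grid=20, batch=32, generations=200,
               sigma_iso=0.12, sigma_line=0.25, seed=0):
    rng = np.random.default_rng(seed)
    archive = {}  # (row, col) -> (fitness, pwm)

    def try_insert(pwm):
        d = np.array(descriptors(pwm))
        cell = tuple(np.minimum((d * grid).astype(int), grid - 1))
        f = fitness_fn(pwm)
        # Keep only the highest-fitness PWM seen so far in each cell.
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, pwm)

    for _ in range(batch):  # random initial population
        try_insert(normalise(rng.random((L, 4))))

    for _ in range(generations):
        elites = [pwm for _, pwm in archive.values()]
        for _ in range(batch):
            x1 = elites[rng.integers(len(elites))]
            x2 = elites[rng.integers(len(elites))]
            # Iso+line variation: isotropic noise plus a step along x2 - x1.
            child = (x1 + sigma_iso * rng.normal(size=x1.shape)
                     + sigma_line * rng.normal() * (x2 - x1))
            try_insert(normalise(child))
    return archive

# Placeholder fitness: closeness to a hypothetical G-rich target PWM.
TARGET = normalise(np.tile([0.05, 0.10, 0.80, 0.05], (19, 1)))

def toy_fitness(pwm):
    return -float(np.mean((pwm - TARGET) ** 2))

archive = map_elites(toy_fitness, generations=50)
print(len(archive), "occupied cells")
```

In a real run, `toy_fitness` would be replaced by the fitness $f(M)$ evaluated on the peak set, and the descriptors by whichever characterization is being studied.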

3. Behavioral Characterizations and Diversity Metrics

MAP-Elites partitions the search space using two-dimensional behavioral descriptors. Three characterizations were evaluated to organize motif diversity:

  1. ME.SP: Information Content (IC) vs. Support (supp)
    • IC quantifies motif specificity: $\mathrm{IC} = \sum_{j=1}^{L} \sum_{b} p_j(b) \log_2[p_j(b)/P_{bg}(b)]$.
    • Support is the fraction of peaks $s \in S$ with $g(M, s) \geq \tau$, where $\tau$ is the 95th percentile of background scores.
  2. ME.CO: GC Content vs. Motif Entropy ($H$)
    • GC is the average $(1/L) \sum_j [p_j(G) + p_j(C)]$.
    • Entropy $H = -(1/L) \sum_{j=1}^{L} \sum_b p_j(b) \log_2 p_j(b)$.
  3. ME.RB: Support vs. Score-Tail Behavior
    • Tail is the mean of the top 10% of $g(M, s)$ scores.
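The five descriptor quantities listed above can be computed as follows. This is a sketch under stated assumptions: columns are ordered A, C, G, T, the background is uniform, `eps` is an added pseudocount, and `support` and `tail` take precomputed arrays of $g(M, s)$ scores rather than raw sequences. All function names are illustrative.

```python
import math
import numpy as np

def information_content(pwm, bg=(0.25, 0.25, 0.25, 0.25), eps=1e-9):
    """IC = sum_j sum_b p_j(b) * log2[p_j(b) / P_bg(b)], in bits."""
    pwm = np.asarray(pwm, dtype=float)
    return float(np.sum(pwm * np.log2((pwm + eps) / np.asarray(bg))))

def gc_content(pwm):
    """Mean over positions of p_j(C) + p_j(G) (columns 1 and 2)."""
    return float(np.asarray(pwm)[:, [1, 2]].sum(axis=1).mean())

def entropy(pwm, eps=1e-9):
    """H = -(1/L) * sum_j sum_b p_j(b) * log2 p_j(b), in bits per position."""
    pwm = np.asarray(pwm, dtype=float)
    return float(-(pwm * np.log2(pwm + eps)).sum(axis=1).mean())

def support(fg_scores, bg_scores, percentile=95):
    """Fraction of foreground scores at or above tau, the background percentile."""
    tau = np.percentile(bg_scores, percentile)
    return float(np.mean(np.asarray(fg_scores) >= tau))

def tail(fg_scores, frac=0.1):
    """Mean of the top `frac` of foreground scores (at least one score)."""
    scores = np.sort(np.asarray(fg_scores, dtype=float))[::-1]
    k = max(1, math.ceil(frac * len(scores)))
    return float(scores[:k].mean())

# Two extreme toy PWMs of length 19: fully uninformative and fully G-specific.
uniform_pwm = np.full((19, 4), 0.25)
all_g_pwm = np.tile([0.0, 0.0, 1.0, 0.0], (19, 1))

# Hypothetical score arrays for the support and tail descriptors.
fg = [0.1, 0.9, 1.2, 1.5]
bg = np.linspace(0.0, 1.0, 101)
```

The uniform PWM sits at IC ≈ 0 bits and entropy ≈ 2 bits per position, while the one-hot PWM sits at IC ≈ 38 bits (2 bits × 19 positions) and entropy ≈ 0, which brackets the range the archive axes must cover.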

Archive coverage (fraction of occupied archive cells) and QD-score (sum of all elite fitnesses) quantify map diversity and solution quality, respectively [§2.3].
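Both archive metrics reduce to simple array operations. The sketch below assumes the archive is stored as a 2-D array of elite fitnesses with NaN marking empty cells; that storage layout and the function names are choices made for this example.

```python
import numpy as np

def coverage(fitness_grid):
    """Fraction of archive cells that hold an elite (non-NaN entries)."""
    return float(np.mean(~np.isnan(fitness_grid)))

def qd_score(fitness_grid):
    """Sum of the fitnesses of all elites, ignoring empty cells."""
    return float(np.nansum(fitness_grid))

# Hypothetical 20 x 20 archive with 320 of 400 cells filled at fitness 0.5.
grid = np.full((20, 20), np.nan)
grid[:16, :] = 0.5
cov = coverage(grid)   # 320 / 400
qd = qd_score(grid)    # 320 * 0.5
```

Note that when elite fitnesses can be negative, as with the length-normalised log-odds used here, adding an elite can lower a raw sum, so QD-scores are only comparable between runs that use the same fitness scale.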

4. Analytical Results and Benchmarking Against MEME

All MAP-Elites characterizations achieved >80% coverage of the $400$ possible cells, and QD-scores plateaued by $\sim 500$ generations, indicating broad, robust motif discovery. Fitness results across five stratified data subsets (see Table 1, §4):

| Method | Max $f$ | Avg $f$ (mean ± std) |
|--------|---------|----------------------|
| MEME   | 1.11    | −2.02 ± 1.03         |
| ME.SP  | 0.883   | 0.464 ± 0.010        |
| ME.CO  | 0.950   | 0.745 ± 0.029        |
| ME.RB  | 0.797   | 0.458 ± 0.010        |

Representative position weight matrices for highest scoring motifs (see Figure 1) include:

ME.CO best motif (fitness = 0.950)

$$
M_{\mathrm{ME.CO}} = \begin{bmatrix}
0.02 & 0.03 & 0.90 & 0.05 \\
0.01 & 0.02 & 0.94 & 0.03 \\
0.01 & 0.01 & 0.97 & 0.01 \\
0.02 & 0.10 & 0.84 & 0.04 \\
0.05 & 0.15 & 0.75 & 0.05 \\
0.08 & 0.20 & 0.65 & 0.07 \\
0.10 & 0.25 & 0.55 & 0.10 \\
0.05 & 0.15 & 0.70 & 0.10 \\
0.02 & 0.05 & 0.90 & 0.03 \\
0.01 & 0.01 & 0.97 & 0.01 \\
0.04 & 0.08 & 0.85 & 0.03 \\
0.10 & 0.20 & 0.60 & 0.10 \\
0.15 & 0.25 & 0.45 & 0.15 \\
0.05 & 0.10 & 0.75 & 0.10 \\
0.01 & 0.02 & 0.95 & 0.02 \\
0.02 & 0.05 & 0.90 & 0.03 \\
0.03 & 0.07 & 0.85 & 0.05 \\
0.05 & 0.15 & 0.70 & 0.10 \\
0.10 & 0.20 & 0.60 & 0.10
\end{bmatrix}
$$
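As a sanity check, the reported matrix can be loaded and its ME.CO descriptors recomputed. The sketch assumes the columns are ordered A, C, G, T, which the text implies but does not state for this matrix; the variable names are illustrative.

```python
import numpy as np

# Rows are motif positions 1..19; columns are ASSUMED to be ordered A, C, G, T.
M_ME_CO = np.array([
    [0.02, 0.03, 0.90, 0.05], [0.01, 0.02, 0.94, 0.03],
    [0.01, 0.01, 0.97, 0.01], [0.02, 0.10, 0.84, 0.04],
    [0.05, 0.15, 0.75, 0.05], [0.08, 0.20, 0.65, 0.07],
    [0.10, 0.25, 0.55, 0.10], [0.05, 0.15, 0.70, 0.10],
    [0.02, 0.05, 0.90, 0.03], [0.01, 0.01, 0.97, 0.01],
    [0.04, 0.08, 0.85, 0.03], [0.10, 0.20, 0.60, 0.10],
    [0.15, 0.25, 0.45, 0.15], [0.05, 0.10, 0.75, 0.10],
    [0.01, 0.02, 0.95, 0.02], [0.02, 0.05, 0.90, 0.03],
    [0.03, 0.07, 0.85, 0.05], [0.05, 0.15, 0.70, 0.10],
    [0.10, 0.20, 0.60, 0.10],
])

# Each row should be a probability distribution over the four bases.
row_sums = M_ME_CO.sum(axis=1)

# ME.CO descriptors: GC content and mean per-position entropy (bits).
gc = M_ME_CO[:, 1:3].sum(axis=1).mean()
entropy = -(M_ME_CO * np.log2(M_ME_CO)).sum(axis=1).mean()

print(f"rows sum to 1: {np.allclose(row_sums, 1.0)}")
print(f"GC content: {gc:.3f}, entropy: {entropy:.3f} bits/position")
```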

Other best motifs (ME.SP, ME.RB, MEME) exhibit distinctive sequence biases (see supplementary data).

Support of the top motifs: ME.SP ≈ 0.72, ME.CO ≈ 0.50, ME.RB ≈ 0.60, MEME ≈ 0.55 (proportion of $S$ exceeding $\tau$). Leave-one-subset-out cross-validation for ME.CO yielded ≤3% fitness decrease, indicating robustness; MEME motifs declined >10% when rescored out-of-sample [§4.1].

5. Biological Implications of Motif Diversity

MAP-Elites unveiled two prominent motif subfamilies:

  1. Canonical GC-rich CTCF motif (positions 6–14, “CCCTC” at 8–12): Distinguished by high IC and moderate support, corresponding to ME.CO region elites.
  2. Degenerate variant: Exhibiting relaxed flanking positions, higher overall support but lower core IC; these likely represent context-dependent or lower-affinity CTCF sites (ME.SP region).

Genomic mapping shows canonical high-IC motifs are significantly promoter-enriched (±1 kb from TSS, $p < 10^{-5}$), whereas degenerate motifs are distributed across enhancers and insulators. This structured motif diversity aligns with hypothesized dual CTCF binding modalities: a “tight” insulator/promoter-bound mode and a “broad” loop-mediated regulatory architecture in hepatocytes.

This suggests that single-motif extraction (e.g., MEME) conflates alternative modes, blurring locus-specific regulatory diversity.

6. Methodological and Comparative Considerations

MAP-Elites advances motif discovery by framing it as a quality-diversity optimization process that evolves multiple distinct, high-quality motifs, unlike consensus-oriented strategies such as MEME. The archive-based approach comes close to the peak fitness of MEME’s strongest motif (maximum $f$ of 0.950 for ME.CO versus 1.11 for MEME) while achieving much higher average fitness, and it illuminates trade-offs among specificity, empirical coverage, and compositional robustness that are otherwise lost. The choice of behavioral descriptors (IC/support, GC/entropy, support/tail) directly shapes the diversity landscape explored, allowing modelers to tailor motif searches to targeted biological or analytical hypotheses.

A plausible implication is that quality-diversity frameworks, by separating motif subfamilies, enable finer mapping of transcription factor binding heterogeneity in tissue-specific contexts, a capability essential for modern regulatory genomics (Medina et al., 25 Jan 2026).
