Log Contrastive Units: Unsupervised Log Parsing

Updated 12 March 2026

LCUs are unsupervised constructs grouping log lines that share a core template while varying in parameter tokens.
They are selected by a hybrid ranking scheme that balances commonality and variability, both measured with Jaccard similarity.
This approach improves template F1 by more than 46 percentage points over prior unsupervised parsers and enables near real-time log processing at scale.

A Log Contrastive Unit (LCU) is a construct for unsupervised log parsing that facilitates template extraction by prompting large language models (LLMs) in a contrastive, in-context setting. An LCU is defined as a set of log lines that are hypothesized to instantiate the same structural template, differing only in their parameter tokens. By juxtaposing these similar-but-not-identical log messages, an LLM can infer the invariant tokens and mark the variable slots that correspond to parameters, so log templates are discovered without labeled data. LCUs are central to the LUNAR method for log parsing, where a hybrid combinatorial/contrastive grouping and ranking process lets LLMs parse logs accurately and efficiently at scale (Huang et al., 2024).

1. Formal Definition and Grouping Criteria

Given a set of raw log messages where each message $\ell$ consists of tokens $T(\ell) = \{ t_1, t_2, \ldots, t_k \}$, an LCU of size $L$ is a set

$\mathrm{LCU} = \{\ell_1, \ell_2, \ldots, \ell_L\}$

in which every log is believed to realize the same template, differing only in specific parameter positions. The selection of effective LCUs hinges on two competing properties:

Commonality: All log entries in the LCU must share a core structural template.
Variability: The logs must differ in at least one parameter position, ensuring that the mutable slots are exposed for inference by the LLM.

Commonality and variability are operationalized using the Jaccard similarity metric:

$JS(\ell_i,\ell_j) = \frac{|T(\ell_i) \cap T(\ell_j)|}{|T(\ell_i) \cup T(\ell_j)|}$

If $JS(\ell_i,\ell_j)$ is too small, the logs likely stem from divergent templates; $JS(\ell_i,\ell_j)=1.0$ indicates identical logs (insufficient variability).
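The Jaccard computation above can be sketched in a few lines of Python. Whitespace tokenization and the helper names `tokenize` and `jaccard` are assumptions made for illustration; the source does not specify how LUNAR tokenizes log lines.

```python
from __future__ import annotations


def tokenize(line: str) -> set[str]:
    """Split a log line on whitespace (an assumed tokenizer) into a token set."""
    return set(line.split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |T(a) & T(b)| / |T(a) | T(b)| of two log lines."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    if not union:  # two empty lines: treat them as identical
        return 1.0
    return len(ta & tb) / len(union)


# Same template, different parameter: high but not maximal similarity.
same_template = jaccard("session opened for user news",
                        "session opened for user admin")
# Identical lines: similarity 1.0, so no variability is exposed.
identical = jaccard("session opened for user news",
                    "session opened for user news")
# Unrelated templates: no shared tokens.
different = jaccard("session opened for user news",
                    "[ERROR] File /tmp/output.log not found")
print(same_template, identical, different)
```

The first pair shares four of six distinct tokens, which is the intermediate regime an LCU is looking for.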

2. Hybrid Ranking for LCU Selection

Efficient identification of high-quality LCUs requires balancing the above dual criteria. For any candidate LCU, a hybrid score is computed by interpolating two metrics:

Variability Score ($S^{Var}_{LCU}$): Quantifies average dissimilarity across all pairs within the LCU.

$S^{Var}_{LCU} = \frac{2}{L(L-1)} \sum_{1 \leq i < j \leq L} [1-JS(\ell_i,\ell_j)]$
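As a minimal sketch, the variability score can be computed by averaging $1 - JS$ over all unordered pairs, since $\frac{2}{L(L-1)}$ is the reciprocal of the number of pairs. The function names and the whitespace tokenizer are illustrative assumptions.

```python
from __future__ import annotations

from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens (assumed tokenizer)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def variability_score(lcu: list[str]) -> float:
    """Mean of 1 - JS over all unordered pairs of lines in the LCU."""
    pairs = list(combinations(lcu, 2))  # L * (L - 1) / 2 pairs
    if not pairs:
        raise ValueError("an LCU needs at least two log lines")
    return sum(1.0 - jaccard(a, b) for a, b in pairs) / len(pairs)


lcu = [
    "session opened for user news",
    "session opened for user test",
    "session opened for user admin",
]
score = variability_score(lcu)
print(score)
```

Every pair in this example has $JS = 4/6$, so each pair contributes $1/3$ and the average is $1/3$ as well.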

Commonality Score ($S^{Comm}_{LCU}$): Measures the consistency of pairwise similarities, rewarding LCUs where all members are equally similar. It is computed over the Jaccard similarities $JS(\ell_i,\ell_j)$ of all log pairs in the LCU.

Hybrid Score:

$S_{LCU} = \lambda\, S^{Comm}_{LCU} + (1-\lambda)\, S^{Var}_{LCU}$

for $\lambda \in [0,1]$. The LCU maximizing $S_{LCU}$ is selected.
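The ranking step can be illustrated with the following sketch. Two points are assumptions rather than LUNAR's definitions: the commonality term is a stand-in (one minus the population standard deviation of the pairwise similarities, which is 1 when all pairs are equally similar), and the interpolation coefficient is exposed as a hypothetical `weight` argument applied to the commonality term.

```python
from __future__ import annotations

from itertools import combinations
from statistics import pstdev


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens (assumed tokenizer)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def pairwise_similarities(lcu: list[str]) -> list[float]:
    if len(lcu) < 2:
        raise ValueError("an LCU needs at least two log lines")
    return [jaccard(a, b) for a, b in combinations(lcu, 2)]


def variability(sims: list[float]) -> float:
    """Average pairwise dissimilarity."""
    return sum(1.0 - s for s in sims) / len(sims)


def commonality_stand_in(sims: list[float]) -> float:
    """ASSUMED stand-in, not the paper's formula: uniform similarities score 1."""
    return 1.0 - pstdev(sims)


def hybrid_score(lcu: list[str], weight: float = 0.5) -> float:
    """Interpolate commonality and variability with a weight in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    sims = pairwise_similarities(lcu)
    return weight * commonality_stand_in(sims) + (1.0 - weight) * variability(sims)


def select_lcu(candidates: list[list[str]], weight: float = 0.5) -> list[str]:
    """Return the candidate LCU with the highest hybrid score."""
    return max(candidates, key=lambda lcu: hybrid_score(lcu, weight))


news = "session opened for user news"
test = "session opened for user test"
admin = "session opened for user admin"
identical = [news, news, news]         # no variability at all
mostly_identical = [news, news, test]  # uneven pairwise similarities
varied = [news, test, admin]           # uniform similarity, every pair differs
best = select_lcu([identical, mostly_identical, varied])
print(best)
```

The candidates here are assumed to come from a single bucket. Under this stand-in, a candidate that mixed unrelated templates would earn a high variability term, which is one reason ranking only makes sense after candidates have been restricted to similar logs.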

3. Algorithmic Extraction of LCUs

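LCUs are extracted from a collection of candidate logs as follows: the logs are partitioned into buckets, candidate LCUs of size $L$ are drawn within each bucket by anchor-based sampling, each candidate is scored with the hybrid score, and the highest-scoring candidate is selected.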

Buckets are typically constructed by hierarchical sharding using message length and top-k frequent tokens.
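One plausible reading of this two-level sharding is sketched below: the first key is the token count of the message, and the second is the tuple of its `k` tokens that are most frequent across the whole collection. The function name `build_buckets`, the key layout, and the alphabetical tie-breaking are assumptions for illustration, not LUNAR's documented procedure.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def build_buckets(logs: list[str], k: int = 3) -> dict[tuple, list[str]]:
    """Group logs by (token count, top-k corpus-frequent tokens in the line)."""
    tokenized = [line.split() for line in logs]
    # Corpus-wide token frequencies; template words tend to dominate.
    freq = Counter(tok for toks in tokenized for tok in toks)
    buckets: dict[tuple, list[str]] = defaultdict(list)
    for line, toks in zip(logs, tokenized):
        # Rank the line's distinct tokens by descending frequency,
        # breaking ties alphabetically so the key is deterministic.
        top = sorted(set(toks), key=lambda t: (-freq[t], t))[:k]
        buckets[(len(toks), tuple(top))].append(line)
    return dict(buckets)


logs = [
    "session opened for user news",
    "[ERROR] File /var/log/app.log not found",
    "session opened for user test",
    "[ERROR] File /home/alice/data.log not found",
    "session opened for user admin",
    "[ERROR] File /tmp/output.log not found",
]
buckets = build_buckets(logs, k=3)
for key, members in buckets.items():
    print(key, len(members))
```

Both templates in this example have five tokens, so message length alone would not separate them; the frequent-token key does.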

4. Illustrative Examples

Representative LCUs as used in practice:

Log Set Type	Example Logs	Resultant Template
User-session	session opened for user news<br>session opened for user test<br>session opened for user admin	session opened for user <*>
File-not-found error	[ERROR] File /var/log/app.log not found<br>[ERROR] File /home/alice/data.log not found<br>[ERROR] File /tmp/output.log not found	[ERROR] File <*> not found

In each case, variable tokens (e.g., usernames, file paths) are identified and replaced by the placeholder <*>, manifesting the extracted template.
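In LUNAR the template is inferred by an LLM. The purely positional sketch below is not that method; it only illustrates what the contrast within an LCU exposes when the lines happen to have equal token counts. The function name `contrast_template` is hypothetical.

```python
from __future__ import annotations


def contrast_template(lines: list[str]) -> str:
    """Keep tokens shared at a position, replace differing ones with <*>."""
    rows = [line.split() for line in lines]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("this sketch only handles lines of equal token count")
    template = []
    for column in zip(*rows):  # tokens at the same position across all lines
        template.append(column[0] if len(set(column)) == 1 else "<*>")
    return " ".join(template)


session_template = contrast_template([
    "session opened for user news",
    "session opened for user test",
    "session opened for user admin",
])
error_template = contrast_template([
    "[ERROR] File /var/log/app.log not found",
    "[ERROR] File /home/alice/data.log not found",
    "[ERROR] File /tmp/output.log not found",
])
print(session_template)
print(error_template)
```

A positional rule like this fails as soon as a parameter spans a variable number of tokens, which is part of the motivation for delegating the inference to an LLM.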

5. LLM-Aided Parsing via Contrastive Prompts

Once an LCU is selected, the log parsing task is presented to the LLM as a structured prompt with the following four components:

Task Instruction: "You are given several log lines that share a template but differ in parameter values. Identify the fixed template words and replace each varying token by <*>."
Parameter Examples: Samples, such as "directory → /var/www/html/", "username → alice".
Output Constraints: "For each input line Log[i], output LogTemplate[i]: followed by the template in backticks."
Queried LCU: The actual log lines, each labeled Log[i].

A response from the LLM would take the form LogTemplate[i]: followed by the template in backticks, one entry per input line. This format enables the LLM to generalize the observed variability in a purely unsupervised setting.
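The four components can be assembled into a prompt along the following lines. The exact layout, the `Log[i]:` line format with backticks, and the helper names `build_prompt` and `parse_response` are assumptions; no LLM is called, and the reply shown is hand-written to illustrate the expected shape.

```python
from __future__ import annotations

import re

TASK_INSTRUCTION = (
    "You are given several log lines that share a template but differ in "
    "parameter values. Identify the fixed template words and replace each "
    "varying token by <*>."
)
OUTPUT_CONSTRAINTS = (
    "For each input line Log[i], output LogTemplate[i]: followed by the "
    "template in backticks."
)


def build_prompt(lcu: list[str], parameter_examples: dict[str, str]) -> str:
    """Join instruction, parameter examples, constraints, and the queried LCU."""
    examples = "\n".join(f"{k} -> {v}" for k, v in parameter_examples.items())
    queried = "\n".join(f"Log[{i}]: `{line}`" for i, line in enumerate(lcu, 1))
    return "\n\n".join(
        [TASK_INSTRUCTION, "Parameter examples:\n" + examples,
         OUTPUT_CONSTRAINTS, queried]
    )


RESPONSE_LINE = re.compile(r"LogTemplate\[(\d+)\]:\s*`([^`]*)`")


def parse_response(text: str) -> dict[int, str]:
    """Map each line index to the template the reply assigned to it."""
    return {int(i): template for i, template in RESPONSE_LINE.findall(text)}


prompt = build_prompt(
    ["session opened for user news",
     "session opened for user test",
     "session opened for user admin"],
    {"directory": "/var/www/html/", "username": "alice"},
)
# Hand-written reply for illustration only, not real model output.
sample_reply = "\n".join(
    f"LogTemplate[{i}]: `session opened for user <*>`" for i in (1, 2, 3)
)
templates = parse_response(sample_reply)
print(prompt)
print(templates)
```

Parsing the reply by index makes it straightforward to check that every queried line received a template and whether the templates agree across the LCU.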

6. Impact on Accuracy and Computational Efficiency

The contrastive approach enabled by LCUs directly enhances parsing outcomes:

Accuracy: LCUs expose parametric positions by presenting token alignments across log variants. Empirical results indicate that LUNAR achieves a template F1 gain exceeding 46 percentage points over the strongest previous unsupervised parsers, matching the best label-dependent method (LILAC) without reliance on manual supervision.
Efficiency: LUNAR avoids naïve all-pairs similarity computations through bucket-based partitioning and anchor-based sampling. In deployment, it parses 3.6 million logs in approximately 532 seconds using parallel execution. The fastest neural unsupervised baseline takes 620 seconds and the syntax-based parser Drain takes 488 seconds, which makes LUNAR only about 9% slower than Drain (Huang et al., 2024).
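The timing figures can be put on a common footing with a short calculation. The numbers are the ones reported above; the derived quantities (throughput and relative differences) are simple arithmetic and not additional results from the paper.

```python
# Figures as reported in the text.
total_logs = 3_600_000
lunar_seconds = 532
neural_baseline_seconds = 620
drain_seconds = 488

# Logs parsed per second by LUNAR.
throughput = total_logs / lunar_seconds
# Extra wall-clock time relative to Drain, as a fraction of Drain's time.
slowdown_vs_drain = lunar_seconds / drain_seconds - 1
# Wall-clock time saved relative to the neural baseline.
time_saved_vs_neural = 1 - lunar_seconds / neural_baseline_seconds

print(f"LUNAR throughput: {throughput:,.0f} logs/s")
print(f"Slower than Drain by: {slowdown_vs_drain:.1%}")
print(f"Faster than neural baseline by: {time_saved_vs_neural:.1%}")
```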

7. Limitations and Prospects for Extension

LCUs as implemented in LUNAR possess several practical constraints:

Parameter Cardinality: Each parameter must manifest at least two distinct values; rare singleton values cannot be reliably identified as parameters.
Bucket Purity: Overly broad buckets risk incorporating logs from distinct templates, thereby confusing both LCU ranking and LLM inference.
Semantic Similarity: Jaccard similarity is limited to surface token overlaps; future work could improve LCU grouping by using embedding-based metrics capable of capturing synonymy and paraphrase.
Adaptation of LCU Size: Dynamically adjusting $L$ (the number of lines in the LCU) per bucket could optimize the trade-off between template exposure and LLM prompt complexity.

The LCU methodology constitutes a bridge between unsupervised combinatorial grouping and LLM-driven template extraction, providing scalable, label-free log parsing with strong empirical performance (Huang et al., 2024).

References

Huang et al. (2024). LUNAR: Unsupervised LLM-based Log Parsing.
