
Lookalike Address Generation

Updated 27 November 2025
  • Lookalike address generation is the process of using algorithms and machine learning to synthesize addresses that closely resemble authentic ones through structural and contextual similarities.
  • In blockchain security, brute-force methods generate hexadecimal addresses by matching specific prefix and suffix constraints, often leveraging GPU acceleration to enhance efficiency.
  • Advanced frameworks employing GANs, VAEs, and LLMs enable structured address synthesis for IPv6 scanning and postal data, while also posing risks like address poisoning and output bias.

Lookalike address generation refers to algorithmic or machine learning processes that synthesize addresses resembling real or target addresses according to specific syntactic, semantic, or probabilistic similarity criteria. This capability is central to a diverse spectrum of research and application domains, including digital forensics, synthetic data generation, network scanning, security and privacy (as in blockchain phishing), and large-scale statistical modeling. The term "lookalike address" encompasses both outright duplicates and variants engineered to mimic structural or contextual properties of authentic addresses, ranging from cryptographically formatted representations (e.g., blockchain wallets) to real-world postal addresses and IPv6 addressing schemes.

1. Cryptographic Lookalike Address Generation in Blockchain Security

In blockchain environments such as Ethereum and Binance Smart Chain, addresses are 40-digit hexadecimal strings ($A \in \Sigma^{40}$, with $\Sigma = \{0, 1, \ldots, 9, a, \ldots, f\}$). The generation of lookalike addresses is a documented attack vector, most notably enabling address poisoning attacks. Here, adversaries seek to generate one or more malicious addresses $L$ that appear visually similar to a target set of addresses $R = \{R_1, R_2, \ldots, R_r\}$. The standard formalization requires selected prefix and suffix similarity: $L[1..a] = R_i[1..a]$ and $L[41-b..40] = R_i[41-b..40]$ for some choice of $(a, b)$, controlling the degree of visual match (total $d = a + b$ digits) (Tsuchiya et al., 28 Jan 2025).

The process is purely brute-force:

  • For each candidate, generate random ECDSA keypairs, compute the address by taking the lower 20 bytes of the Keccak-256 hash of the public key, and compare prefix/suffix constraints.
  • The expected work is $\mathbb{E}[X] \approx 16^d / r$, with per-trial matching probability $p \approx r \cdot 16^{-d}$, given $r \ll 16^d$.
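The loop above can be sketched in a few lines of Python. This is a minimal illustration, not an attack tool: `secrets.token_bytes` stands in for real ECDSA keypair generation, `hashlib.sha3_256` stands in for Ethereum's Keccak-256 (the pre-standard Keccak differs from SHA3-256 in padding, so addresses produced here do not correspond to real chains), and $d = a + b$ is kept tiny so a match is found quickly.

```python
import hashlib
import secrets

def candidate_address() -> str:
    """Derive a 40-hex-digit address from one random trial.

    Stand-ins for illustration: a random 64-byte string replaces a real
    ECDSA public key, and SHA3-256 replaces Ethereum's Keccak-256.
    """
    pubkey = secrets.token_bytes(64)            # uncompressed public key body
    digest = hashlib.sha3_256(pubkey).digest()  # 32-byte hash
    return digest[-20:].hex()                   # lower 20 bytes -> 40 hex digits

def find_lookalike(target: str, a: int, b: int, max_trials: int = 500_000):
    """Brute-force a candidate matching target's a-digit prefix and
    b-digit suffix (d = a + b constrained digits, expected ~16^d trials)."""
    prefix, suffix = target[:a], target[40 - b:]
    for trial in range(1, max_trials + 1):
        addr = candidate_address()
        if addr.startswith(prefix) and addr.endswith(suffix):
            return addr, trial
    return None, max_trials

# Toy run with d = 2 (expected ~16^2 = 256 trials); real attacks use d >= 20.
target = "de709f2102306220921060314715629080e2fb77"
match, trials = find_lookalike(target, a=1, b=1)
```

Raising `a` and `b` by even a few digits makes the expected work grow by a factor of 16 per digit, which is why realistic values of $d$ push attackers onto GPUs.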

Empirical benchmarks report address generation rates of $4.6 \times 10^5$ APS (addresses per second) on a multi-core CPU, and up to $1 \times 10^9$ APS on modern GPUs. With $d \geq 20$ (i.e., 20 specified hex digits), attackers require roughly $3 \times 10^{12}$ generated candidates, achievable within 75 CPU-days or 0.07 GPU-days at these rates. Large attacker clusters invest thousands of GPU-days to carve out high-matching clusters, making the process both detectable at scale and economically viable given observed exploit revenues (Tsuchiya et al., 28 Jan 2025).
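The CPU-day figure follows from back-of-envelope arithmetic over the quoted rates (the GPU figure depends on the exact throughput assumed, so only the order of magnitude is meaningful):

```python
# Back-of-envelope check of the brute-force cost figures quoted above.
trials = 3e12                    # candidates required for d >= 20
cpu_aps = 4.6e5                  # addresses/sec, multi-core CPU
gpu_aps = 1e9                    # addresses/sec, modern GPU

cpu_days = trials / cpu_aps / 86_400   # roughly 75 CPU-days
gpu_days = trials / gpu_aps / 86_400   # well under a single GPU-day
```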

2. Lookalike Address Generation for Internet Network Scanning

IPv6 address space exploration benefits from lookalike address synthesis to prioritize likely-active scan targets. The "6GAN" framework leverages multiple LSTM-based generators, each specializing in a classified address pattern subset (e.g., Embedded-IPv4, randomized) derived by entropy clustering or vector-consensus methods. A CNN-based $(k+1)$-way discriminator evaluates generated candidates as belonging to one of $k$ patterns or as "fake," and an alias-detection module penalizes generators that emit sequences matching known aliased (non-unicast) prefixes (Cui et al., 2022).

Key features:

  • Reinforcement learning policy gradients are used to circumvent the non-differentiability of discrete outputs, with each generator policy $\pi_{\theta_i}$ updated under reward signals from pattern-correctness and alias avoidance.
  • Empirical metrics demonstrate state-of-the-art trade-offs: pattern-classification accuracy of $0.966$, aliasing reduced from $14\%$ to $0.02\%$, and a higher real-host "hit rate" compared to baselines such as 6Tree and Entropy/IP.
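The policy-gradient step can be illustrated with a toy REINFORCE loop. Everything here is a deliberate simplification of 6GAN: a single position-independent categorical policy replaces the LSTM generators, and a fixed rule (reward $+1$ when the sampled address starts with nibble `0x2`, standing in for one "allocation pattern") replaces the discriminator and alias-detection reward.

```python
import numpy as np

rng = np.random.default_rng(0)
V, LENGTH = 16, 8        # hex vocabulary; 8-nibble toy "addresses"
logits = np.zeros(V)     # policy parameters (toy stand-in for an LSTM generator)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(addr):
    # Stand-in for the discriminator + alias-detection signal:
    # +1 if the address starts with nibble 0x2 (a fake allocation pattern).
    return 1.0 if addr[0] == 2 else 0.0

lr = 0.5
for step in range(1000):
    probs = softmax(logits)
    addr = rng.choice(V, size=LENGTH, p=probs)  # sample a candidate address
    R = reward(addr)
    # REINFORCE: grad of log pi(addr) wrt logits = sum_t (one_hot(addr[t]) - probs)
    grad = np.bincount(addr, minlength=V) - LENGTH * probs
    logits += lr * R * grad                     # ascend expected reward

# After training, the policy concentrates probability mass on nibble 0x2.
```

Because the reward multiplies the whole gradient, zero-reward samples leave the policy untouched, which is exactly why 6GAN needs the denser shaped rewards (pattern correctness plus alias penalties) described above.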

This approach synthesizes high-quality, structurally accurate lookalike IPv6 addresses that mimic genuine network allocation patterns while suppressing undesirable alias artifacts (Cui et al., 2022).

3. Generative Models for Structured Postal Address Synthesis

For structured postal or physical address generation, variational autoencoders (VAEs) with discrete, tree-recursive architectures are effective. Each address is decomposed into protocol buffer fields (floats for coordinates; strings for street, city, etc.) and modeled with modular encoders/decoders plus a shared latent space ($D = 128$). Training optimizes the generalized ELBO $L_\beta(x) = \mathbb{E}_{z \sim q_\lambda}[\log p_\theta(x \mid z)] - \beta\,\mathrm{KL}[q_\lambda(z \mid x) \,\|\, p(z)]$, where $\beta$ is scheduled to avoid posterior collapse.
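The objective can be written down directly for the standard Gaussian-posterior case. This sketch assumes a single continuous field with a squared-error reconstruction term (the actual model sums per-field losses, cross-entropy for strings, over a tree-recursive protocol buffer structure), and the linear warm-up is just one common choice of $\beta$ schedule.

```python
import numpy as np

def beta_elbo(x, x_recon, mu, logvar, beta):
    """Generalized ELBO L_beta(x) for q(z|x) = N(mu, diag(exp(logvar)))
    against a standard-normal prior p(z) = N(0, I)."""
    recon = -0.5 * np.sum((x - x_recon) ** 2)                 # log p(x|z) up to a constant
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)  # closed-form KL[q || p]
    return recon - beta * kl

def beta_schedule(step, warmup=1000):
    """Linear warm-up of beta from 0 to 1 to discourage posterior collapse."""
    return min(1.0, step / warmup)

x = np.array([1.0, -2.0])
mu, logvar = np.zeros(2), np.zeros(2)         # q(z|x) = p(z)  ->  KL term = 0
loss = beta_elbo(x, x, mu, logvar, beta=beta_schedule(500))  # perfect recon -> 0.0
```

Keeping $\beta$ small early lets the decoder learn to use $z$ before the KL penalty pulls the posterior toward the prior.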

Major challenges and remedies:

  • VAEs display a generated-loss lag: synthetic samples may not map well back to latent codes, causing sharp drops in field correlation and string realism.
  • Augmented training ("multi-hop" reconstructions using model-generated samples) and multiscale VAEs (parallel training over a $\beta_k$ schedule) each mitigate mode collapse, improving the statistical fidelity of generated lookalike addresses.

Empirical metrics reveal major improvements in zip–coordinate correlation (median $0.05 \to 0.32$ with augmentation, $0.54$ with multiscale) and in Levenshtein distance for street-name realism. Field-level dependencies (e.g., correlated zip code and latitude/longitude) are better preserved under multiscale regimes (Chou et al., 2019).

4. LLMs for Random and Lookalike Address Generation

LLM-driven address generation is explored in "Stochastic Streets," which examines the ability of state-of-the-art LLMs (GPT-4.1, Gemini 2.5, Gemma 3-12B, LLaMA 3.1-8B, Mistral-7B, Qwen 3-8B) to uniformly sample real street addresses in major European cities. Models receive strict prompts emphasizing random selection but are not fine-tuned on explicit address datasets; sampling parameters include $T = 1.0$ and default top-$k$/top-$p$ (Fu et al., 16 Sep 2025).

Analysis demonstrates:

  • Pronounced token-level prior biases: e.g., GPT-4.1 outputs "Calle" as the first token with probability $\approx 99.87\%$ versus "Gran" at less than $0.0005\%$.
  • Non-uniformity: the same street appears in up to 60% of samples for some model–city pairs. Only $27.5\%$ of the most frequent LLM-generated streets overlap with lists of real "iconic" streets.
  • Hallucinations and geographic clustering: Invented street names (“Calle de las Letras”) and per-model concentration of outputs emerge, reflecting core weaknesses in out-of-the-box LLM sampling for address synthesis.
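Concentration of this kind is easy to quantify from raw samples. The sketch below uses an invented toy sample (the street names and counts are placeholders, not the paper's data) and reports the share taken by the single most frequent street, which would sit near $1/|\text{streets}|$ under uniform sampling.

```python
from collections import Counter

def top_share(samples):
    """Return the most frequent street and its share of all samples."""
    name, freq = Counter(samples).most_common(1)[0]
    return name, freq / len(samples)

# Hypothetical skewed draw mimicking the reported ~60% concentration:
samples = ["Calle Mayor"] * 60 + ["Gran Via"] * 5 + ["Calle de Alcala"] * 35
name, share = top_share(samples)   # ("Calle Mayor", 0.6)
```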

A plausible implication is that naively sampling LLMs without fine-tuning or structured prompts yields artifacts, statistical skew, and invalid outputs. Remedies include dataset-specific fine-tuning, post-hoc name filtering, or prompting that compels explicit selection from ground-truth entity lists (Fu et al., 16 Sep 2025).

5. Algorithmic, Mathematical, and Implementation Considerations

Summary of Algorithmic Approaches

  • Blockchain: brute-force address search (prefix/suffix matching, ECDSA + Keccak-256 loop, GPU acceleration).
  • IPv6 scanning: 6GAN combining RL, GAN training, and alias shaping (multiple generators, CNN discriminator, Monte Carlo rollouts, reward shaping).
  • Postal addresses: tree-recursive VAE with augmented and multiscale training (protocol buffer structure, per-field cross-entropy, $\beta$-annealing).
  • LLM synthesis: prompt engineering with token sampling at $T = 1$ (out-of-the-box LLMs, prompt templates, output filtering).

In all cases, model performance and output fidelity are critically shaped by the choice of similarity requirements, the domain structure, and the degree to which learning incorporates real-world pattern priors. Brute-force cryptographic generation is entirely algorithmic and resource-bound; ML approaches depend on loss surface shaping, target diversity, and explicit negative sample handling (e.g., alias detection) (Tsuchiya et al., 28 Jan 2025, Cui et al., 2022, Chou et al., 2019).

6. Practical Applications, Risks, and Mitigation

Lookalike address generation underpins a range of practical applications and threats:

  • Security/Phishing: Attacks exploiting visual similarity in blockchain addresses ("address poisoning") have been responsible for at least $83.8M USD in documented losses, affecting millions of users. The high scalability of GPU/ASIC-driven brute-force generators correlates with attacker profitability (Tsuchiya et al., 28 Jan 2025).
  • Network Scanning: Efficient candidate selection for IPv6 scanning campaigns improves coverage and reduces spurious probes, supporting security and network census research (Cui et al., 2022).
  • Synthetic Data Generation: Robust, high-fidelity address generation is essential for privacy-preserving datasets in research, urban modeling, and software testing (Chou et al., 2019, Fu et al., 16 Sep 2025).

Mitigation strategies include algorithmic detection, empirical monitoring of address space for lookalike clusters, post-hoc output validation against external datasets, and refinement of sampling/generation policy through task-specific fine-tuning or structured template integration.

7. Current Limitations and Future Directions

Across methodologies, common limitations arise:

  • Pure brute-force approaches are bounded by computational cost and diminish in feasibility with higher similarity requirements.
  • Generative ML models (GANs, VAEs, LLMs) often fail to uniformly cover the address space or to avoid hallucinations and pattern collapse without explicit structural constraints.
  • Out-of-the-box LLMs produce severe statistical and content bias, highlighting inadequacies of prompt-only steering for structured data synthesis (Fu et al., 16 Sep 2025).

Ongoing research focuses on data augmentation, reward shaping (to enforce negative and positive selection), hierarchical or multitask objectives (to preserve both local details and global structure), and fine-tuning on authoritative reference corpora. A plausible implication is that high-fidelity lookalike generation will increasingly leverage hybrid algorithmic–deep learning pipelines tailored to both domain structure and adversarial requirements (Tsuchiya et al., 28 Jan 2025, Cui et al., 2022, Chou et al., 2019, Fu et al., 16 Sep 2025).
