
SMILES: A Compact Chemical Notation

Updated 23 February 2026
  • SMILES is a linear ASCII-based notation that encodes molecular structures with explicit syntactic rules for atoms, bonds, branching, and stereochemistry, ensuring compact and unambiguous representations.
  • Its canonicalization and enumeration techniques allow both unique and randomized molecule encodings, which are critical for robust chemical modeling and enhanced machine learning performance.
  • SMILES-based models, from LSTMs to Transformer architectures, excel in property prediction and molecular translation, significantly impacting research in drug discovery and material science.

The Simplified Molecular Input Line Entry System (SMILES) is a linear, ASCII-based notation for representing the structures of chemical molecules as compact, human- and machine-readable character strings. SMILES encodes atoms, bonds, ring closures, branching, and stereochemistry according to formal grammar rules, enabling unambiguous conversion between molecular graphs and string representations. SMILES is foundational in computational chemistry, cheminformatics, and the chemical language modeling paradigm, supporting a wide range of applications from property prediction to generative molecular design.

1. SMILES Syntax, Canonicalization, and Enumeration

SMILES represents each atom by an element symbol (e.g., C, N, O), with explicit square-bracket notation for nonstandard cases (e.g., [Cl], [NH+]). Bonds are implicit (single) or expressed as “=” (double), “#” (triple), or “:” (aromatic), with “/” and “\” indicating cis/trans geometry. Branched structures use parentheses, and ring closures employ matching numeric labels—e.g., cyclohexane as “C1CCCCC1”. Stereochemistry is optionally included using “@” and “@@” (for chiral centers) or “/”, “\” for geometric isomers. See (Kikuchi et al., 11 May 2025, Bjerrum, 2017, Nath et al., 2021, Honda et al., 2019).
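Two of these syntactic invariants, balanced branch parentheses and paired single-digit ring-closure labels, can be checked with a short pure-Python sketch. This is illustrative only and deliberately not a full SMILES parser (it ignores `%nn` ring labels, valence, and aromaticity):

```python
def check_smiles_structure(smiles: str) -> bool:
    """Check balanced parentheses and paired single-digit ring closures.

    A sketch of two syntactic invariants only -- not a full SMILES parser.
    """
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue  # digits inside [...] are H counts/charges, not ring labels
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # branch closed before it was opened
        elif ch.isdigit():
            # a digit toggles a ring-closure label between open and closed
            if ch in open_rings:
                open_rings.remove(ch)
            else:
                open_rings.add(ch)
    return depth == 0 and not open_rings

print(check_smiles_structure("C1CCCCC1"))  # True: cyclohexane, ring 1 paired
print(check_smiles_structure("C1CCCC"))    # False: ring label 1 never closed
```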

Because a molecule admits infinitely many valid traversals, non-canonical SMILES are non-unique. Canonical SMILES are produced by deterministic atom ranking and traversal algorithms, but implementations differ substantially. For example, Daylight’s proprietary algorithm privileges aromatic notation (“c1ccccc1”), whereas RDKit's canonicalization may output “C1=CC=CC=C1” for benzene. Both approaches enforce a many-to-one mapping but are not cross-compatible (Kikuchi et al., 11 May 2025, Bjerrum, 2017).

SMILES enumeration refers to randomizing atom ordering and outputting all unique (non-canonical) SMILES for the same molecule. This is used as a data augmentation technique: exposing machine learning models to the full variability of possible encodings improves generalization and robustness (Bjerrum, 2017, Honda et al., 2019).
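The idea behind enumeration can be illustrated on a toy acyclic molecular graph: each random depth-first traversal yields a different but equivalent SMILES string. The sketch below handles only trees (no rings, stereochemistry, or bond orders), and the adjacency list and labels are hypothetical:

```python
import random

def write_smiles(graph, labels, atom, parent=None):
    """Serialize an acyclic molecular graph from `atom` via depth-first search.

    All-but-last children are parenthesized as branches, mirroring SMILES
    branching rules. Rings and stereochemistry are out of scope here.
    """
    neighbors = [n for n in graph[atom] if n != parent]
    random.shuffle(neighbors)  # a random traversal -> a random (valid) SMILES
    out = labels[atom]
    for i, n in enumerate(neighbors):
        sub = write_smiles(graph, labels, n, atom)
        out += sub if i == len(neighbors) - 1 else "(" + sub + ")"
    return out

# isobutane: central carbon 0 bonded to three methyl carbons
graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
labels = {0: "C", 1: "C", 2: "C", 3: "C"}

variants = {write_smiles(graph, labels, start) for start in graph for _ in range(20)}
print(sorted(variants))  # ['C(C)(C)C', 'CC(C)C']
```

Production toolkits (e.g., RDKit) provide randomized SMILES generation that also handles rings, aromaticity, and stereochemistry.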

| Aspect | Key Notation/Feature | Implications |
|---|---|---|
| Atoms | C, N, [Cl], [NH+] | Explicit and bracketed atoms |
| Bonds | -, =, #, :, /, \ | Single, double, triple, aromatic, stereo |
| Branching | ( ) | Encodes side chains |
| Ring closures | 1, 2, ..., 9 | Defines cycles/topology |
| Stereochemistry | @, @@, /, \ | Chiral centers, cis/trans |
| Canonicalization | Deterministic traversal | Implementation-dependent |
| Enumeration | Random traversal | Used for robust ML modeling |

2. Representational Inconsistencies and Stereochemical Annotation

Despite canonicalization protocols, SMILES notation suffers from significant representational variability, due both to divergent grammatical standards and to incomplete stereochemical labels. Empirical analysis shows that across widely used datasets, nearly half of enantiomeric centers and a third of geometric isomers lack explicit annotations in their SMILES representations (Kikuchi et al., 11 May 2025). String-distance analysis (normalized Levenshtein) across raw and standardized SMILES demonstrates substantial syntactic variability (mean distances up to 0.6), underscoring the need for reproducible workflows.
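The normalized Levenshtein diagnostic can be computed with standard dynamic programming; a minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))

# two common renderings of benzene (aromatic vs. Kekulé form)
print(round(normalized_levenshtein("c1ccccc1", "C1=CC=CC=C1"), 3))
```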

The stereochemical completeness rate (SCR) is defined as

$$\mathrm{SCR} = 1 - \frac{N_{\text{missing}}}{N_{\text{total}}}$$

where $N_{\text{total}}$ is the number of stereocenters and $N_{\text{missing}}$ the number without explicit SMILES annotation. An SCR of 0.5 implies that one-half of the centers are unlabeled.
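As a worked instance of the formula (the counts here are hypothetical):

```python
def stereochemical_completeness_rate(n_total: int, n_missing: int) -> float:
    """SCR = 1 - N_missing / N_total; 1.0 means every center is annotated."""
    if n_total == 0:
        return 1.0  # convention assumed here: no stereocenters -> complete
    return 1.0 - n_missing / n_total

# 10 stereocenters, 5 of them lacking explicit @/@@ or /,\ annotation
print(stereochemical_completeness_rate(10, 5))  # 0.5
```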

Stereochemical notations carry essential information for cyclic and chiral topology, which is critical for property prediction and molecular translation tasks. Manipulating or omitting stereochemical annotations leads to significant performance degradation, especially in cyclic molecule reconstruction. For example, eliminating all stereochemical (3D) markers reduces translation accuracy from ~98% to ~52%, highlighting the scaffolding role of stereochemical tokens (Kikuchi et al., 11 May 2025).
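The ablation described, removing stereochemical tokens from a SMILES string, amounts to deleting the “@”, “/”, and “\” characters. A naive sketch (it assumes these characters carry no other meaning, which holds in standard SMILES):

```python
def strip_stereo(smiles: str) -> str:
    """Remove stereochemical markers: @/@@ chirality and /,\ bond geometry.

    Naive sketch: in standard SMILES these three characters appear only
    in stereochemical annotations, so plain deletion is safe.
    """
    return smiles.translate(str.maketrans("", "", "@/\\"))

# alanine with a chiral center, and 2-butene with bond geometry
print(strip_stereo("C[C@@H](N)C(=O)O"))  # C[CH](N)C(=O)O
print(strip_stereo("C/C=C/C"))           # CC=CC
```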

3. Tokenization, Embedding, and Language Analogy

SMILES strings lend themselves naturally to character-level or substructure-level tokenization, supporting direct application of text modeling methods. Tokenization schemes include single characters, multi-character atom symbols (e.g., “Cl”, “Br”), ring-closure digits, branching symbols, and stereochemical markers. Advanced approaches apply pretokenization followed by unsupervised substructure discovery, commonly leveraging byte-pair encoding (BPE) to form high-frequency “words” in chemical substructure space (Lee et al., 2022, Spence et al., 30 Jul 2025).
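A regex-based atom-level tokenizer of the kind commonly used in this literature can be sketched as follows; the pattern below is a simplified variant (bracket atoms, two-letter halogens, ring closures, bonds, branching) and is not exhaustive for all of SMILES:

```python
import re

# One token per bracket atom, two-letter halogen, organic-subset atom,
# bond/branch symbol, or ring-closure label (simplified pattern).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFI]|[bcnops]|[-=#$:/\\().@+]|\d)"
)

def tokenize(smiles: str) -> list[str]:
    tokens = SMILES_TOKEN.findall(smiles)
    # round-trip check: every character must belong to some token
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)[O-]"))  # aspirin-like example
```

Note the ordering inside the alternation: bracket atoms and the two-letter halogens Br/Cl must be tried before single-letter atoms, otherwise “Cl” would split into “C” + “l”.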

Treating SMILES as a language analog motivates the use of NLP models for molecular property prediction, classification, and generative modeling (Wasi et al., 2024, Lim et al., 2020). Machine learning workflows commonly feature:

  • Character or substructure tokenization
  • Trainable embedding matrices mapping each token to a dense vector in $\mathbb{R}^d$
  • Sequence models (LSTM, Transformer, CNN) constructed over these embeddings

Embedding layers automatically identify chemical relationships between tokens (e.g., “C” vs. “c”), enabling models to learn chemical grammars directly from data (Rao et al., 2024, Lim et al., 2020).
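The embedding step itself is just a lookup table from token indices to trainable dense vectors. A minimal, dependency-free sketch (the vocabulary and dimension d below are hypothetical):

```python
import random

random.seed(0)

# Hypothetical toy vocabulary and an embedding matrix: one d-dimensional
# vector per token, randomly initialized as a model would do before training.
vocab = {tok: i for i, tok in enumerate(["C", "c", "O", "N", "(", ")", "=", "1"])}
d = 4
embedding = [[random.uniform(-1, 1) for _ in range(d)] for _ in vocab]

def embed(tokens: list[str]) -> list[list[float]]:
    """Map a token sequence to its dense vectors (an embedding lookup)."""
    return [embedding[vocab[t]] for t in tokens]

# cyclopropane, C1CC1, as a token sequence
seq = embed(["C", "1", "C", "C", "1"])
print(len(seq), len(seq[0]))  # 5 4
```

In a real model the matrix would live in a deep-learning framework and be updated by backpropagation; here it only illustrates the token-to-vector mapping.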

4. SMILES-Based Machine Learning Models: Methodologies and Performance

A broad range of sequence models and large language models (LLMs) have been developed on top of SMILES data:

  • n-gram text models and MLPs: Bag-of-n-grams features, when fed to multilayer perceptrons, yield classification accuracy competitive with domain-specific cheminformatics fingerprints such as Morgan, MACCS, or AtomPair (accuracy ≈ 0.74 vs. ≈ 0.80; ROC-AUC ≈ 0.85 vs. ≈ 0.88) (Wasi et al., 2024).
  • LSTM/BiLSTM sequence models: Unidirectional and bidirectional LSTM architectures capture sequential dependencies (e.g., ring closure references, functional group context). Bidirectional LSTMs achieve state-of-the-art ROC-AUC (0.96) in toxicity prediction, outperforming previous sequence and graph neural network models (Rao et al., 2024, Nath et al., 2021).
  • Transformer and BERT-style encoders: Self-attention architectures excel at extracting both local and global dependencies. Single-head, position-free Transformer encoders can achieve ROC-AUC ≥0.95 in multi-task toxicity prediction (Lim et al., 2020). Pretrained models such as the SMILES Transformer and SmilesT5 use unsupervised sequence-to-sequence objectives to yield fixed molecular fingerprints with high data-efficiency, facilitating effective property prediction even in low-data regimes (Honda et al., 2019, Spence et al., 30 Jul 2025).
  • Hybrid and adapter-augmented models: Transformers equipped with knowledge adapters trained to infer SMILES grammatical syntax (connectivity, ring-closure adjacency, etc.) further advance performance. Injecting such adapters into frozen BERT-style models increases ROC-AUC on challenging tasks, particularly when combining both syntactic and ring-type knowledge (Lee et al., 2022).
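The bag-of-n-grams featurization used by the MLP baselines above can be sketched in a few lines; the bigram vocabulary below is hypothetical:

```python
from collections import Counter

def char_ngrams(smiles: str, n: int) -> Counter:
    """Bag of character n-grams: counts of each length-n substring."""
    return Counter(smiles[i:i + n] for i in range(len(smiles) - n + 1))

def featurize(smiles: str, vocabulary: list[str], n: int = 2) -> list[int]:
    """Fixed-length count vector over a given n-gram vocabulary,
    suitable as input to an MLP classifier."""
    counts = char_ngrams(smiles, n)
    return [counts[g] for g in vocabulary]

# hypothetical bigram vocabulary; aspirin-like SMILES as input
ngram_vocab = ["CC", "C(", "=O", "c1", "cc"]
print(featurize("CC(=O)Oc1ccccc1C(=O)O", ngram_vocab))
```

In practice the vocabulary is built from the training corpus (e.g., the k most frequent n-grams) rather than fixed by hand.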

Performance metrics typically include classification accuracy, ROC-AUC, PRC-AUC, F1 score, and regression metrics (RMSE, $R^2$). Supervised and self-supervised approaches increasingly leverage SMILES data augmentation strategies (enumeration, input permutation) to boost robustness (Bjerrum, 2017, Honda et al., 2019).

5. SMILES in Cheminformatics Pipelines and Molecular Design

SMILES' versatility enables its use in diverse production workflows:

  • Peptide and biomolecule encoding: Tools such as p2smi convert peptide FASTA sequences into SMILES strings, supporting noncanonical amino acids, backbone/cycle modifications, and stereochemistry. These pipelines produce chemically valid SMILES for complex peptides and facilitate computation of properties (molecular weight, logP, TPSA, Lipinski's rules) directly from SMILES (Feller et al., 18 Apr 2025).
  • Kernel and fingerprinting pipelines: SMILES can be mapped to circular (Morgan) fingerprints and further embedded using kernel methods. Workflows comprising SMILES → fingerprint → Gaussian kernel → Sinkhorn-Knopp normalization → kernel PCA yield robust, low-dimensional representations effective for classification and regression (Ali et al., 2024).
  • Visual recognition: IMG2SMI demonstrates translation of 2D molecular structure images into canonical SMILES—an integration of computer vision and transformer captioning designed for high-throughput chemical literature mining (Campos et al., 2021).
  • Quantum-classical modeling: Hybrid autoencoders embedding SMILES via tensor-train representations (Word2Ket) and quantum circuits, followed by classical LSTM decoding, advance the interface between quantum machine learning and sequence-based chemical representations (Jahin et al., 26 Aug 2025).
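The middle steps of such a kernel pipeline (a Gaussian kernel over fingerprints, then Sinkhorn-Knopp normalization toward a doubly stochastic matrix) can be sketched in pure Python; the fingerprints and gamma below are hypothetical, and the kernel-PCA step is omitted:

```python
import math

def gaussian_kernel(X, gamma=0.5):
    """Pairwise Gaussian (RBF) kernel matrix over fingerprint rows."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [[math.exp(-gamma * sqdist(a, b)) for b in X] for a in X]

def sinkhorn_knopp(K, iters=50):
    """Alternate row/column normalization toward a doubly stochastic matrix."""
    n = len(K)
    K = [row[:] for row in K]
    for _ in range(iters):
        for i in range(n):  # normalize each row to sum to 1
            s = sum(K[i])
            K[i] = [v / s for v in K[i]]
        for j in range(n):  # then normalize each column to sum to 1
            s = sum(K[i][j] for i in range(n))
            for i in range(n):
                K[i][j] /= s
    return K

# three hypothetical 4-bit fingerprints
X = [[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 0]]
K = sinkhorn_knopp(gaussian_kernel(X))
cols = [round(sum(row[j] for row in K), 6) for j in range(len(K))]
print(cols)  # [1.0, 1.0, 1.0] -- columns sum to 1 after the final pass
```

Because the kernel matrix is strictly positive, the iteration converges, so row sums also approach 1.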

6. Impact of Notational Choices on Model Robustness and Reproducibility

Empirical studies reveal that notational inconsistencies (both grammatical/canonical and stereochemical) strongly affect latent molecular representations and translation fidelity in generative models, although supervised property predictors are less sensitive, likely due to target-driven feature selection (Kikuchi et al., 11 May 2025, Bjerrum, 2017). For generative and translation tasks (SMILES-to-SMILES, image-to-SMILES), reproducibility demands:

  • Explicitly documenting software, version, and canonicalization options
  • Uniform stereochemistry handling
  • Routine variability diagnostics (Levenshtein distances)
  • SMILES enumeration during training for model robustness

A best-practice paradigm is to apply a strict, uniform canonicalization pipeline, supplement with randomized SMILES in training, and ensure stereochemical completeness or consistent removal across both training and inference (Kikuchi et al., 11 May 2025). For all decoder or translation applications, the minimal acceptable standard requires a fully consistent set of grammatical and stereochemical SMILES to avoid propagation of errors in topology and chirality.

7. Future Directions and Methodological Extensions

Modern chemical language modeling is evolving beyond pure SMILES, incorporating:

  • Domain-specific pretraining tasks (scaffold reconstruction, fragment prediction) to improve downstream property prediction and data efficiency relative to traditional masked language modeling (Spence et al., 30 Jul 2025).
  • Knowledge-infusion approaches, where grammar-derived substructures and connectivity graphs are injected into model architectures via adapters or auxiliary objectives (Lee et al., 2022).
  • Quantum-classical hybrid encoding to enable physically inspired generative modeling for molecular discovery (Jahin et al., 26 Aug 2025).
  • Automated extraction and conversion pipelines from literature and images, scaling the creation of machine-actionable chemical datasets (Campos et al., 2021).

Combinatorial fusion of sequence- and graph-based features, integration of noncanonical residue handling (e.g., >100 NCAA templates), and end-to-end differentiable kernel methods with trainable transport costs represent promising avenues for expanding the expressivity and applicability of SMILES-driven pipelines (Feller et al., 18 Apr 2025, Ali et al., 2024, Spence et al., 30 Jul 2025, Lim et al., 2020).


In summary, SMILES remains a cornerstone of molecular representation, directly enabling the application of deep learning and other computational approaches in chemical discovery and property prediction. Adherence to rigorous canonicalization and stereochemical annotation protocols is critical to reproducibility and model fidelity, particularly in generative and translation applications. Current research demonstrates that SMILES-based models—spanning n-gram MLPs, BiLSTM/Transformer encoders, kernel methods, and quantum-classical architectures—are highly effective for a diverse set of tasks, with future improvements anticipated via hybridization with graph-, image-, peptide-, and quantum-aware modeling (Rao et al., 2024, Feller et al., 18 Apr 2025, Spence et al., 30 Jul 2025, Jahin et al., 26 Aug 2025, Kikuchi et al., 11 May 2025, Lee et al., 2022, Campos et al., 2021, Ali et al., 2024, Lim et al., 2020, Honda et al., 2019, Wasi et al., 2024, Bjerrum, 2017, Nath et al., 2021).
