Matlas: Dual Research Infrastructure
- "Matlas" names two unrelated research infrastructures: one organizes mathematical statements for semantic search, and the other reveals galactic fine structures through deep imaging.
- The mathematical Matlas employs a two-stage LLM pipeline to extract about 8M statements from 435K papers and 1.9K textbooks, constructing dependency graphs for AI-driven theorem retrieval.
- The astronomical MATLAS uses deep CFHT/MegaCam imaging to detect low-surface-brightness features, aiding studies of galaxy assembly and merger taxonomy.
In current research usage, “Matlas” denotes two unrelated projects. In mathematical knowledge retrieval, Matlas is a semantic search engine for mathematical statements, built to support natural-language search over definitions, lemmas, theorems, corollaries, and related formal units extracted from peer-reviewed literature and textbooks (Ju et al., 19 Apr 2026). In extragalactic astronomy, MATLAS—Mass Assembly of early-Type GaLAxies with their fine Structures—is a deep CFHT/MegaCam imaging program devoted to low-surface-brightness structures around nearby massive galaxies and to the dwarf-galaxy systems detected in the same fields (Duc, 2016).
1. Terminological scope
The two uses of the name differ in both domain and object of analysis. The mathematical Matlas treats the literature as a corpus of interdependent statements and aims at theorem retrieval and grounding for AI systems (Ju et al., 19 Apr 2026). The astronomical MATLAS treats galaxy outskirts as an archaeological record of accretion and interaction, using diffuse stellar light, tidal debris, dwarf satellites, globular clusters, and related low-surface-brightness phenomena to reconstruct recent mass assembly (Duc, 2020).
The difference between the two usages is substantive rather than merely orthographic. In the mathematical project, the atomic object is the statement; in the astronomical project, the atomic objects are fine structures such as streams, shells, tails, plumes, dwarf galaxies, and compact stellar systems. The shared label therefore links two distinct research infrastructures rather than a single field or method.
2. Matlas as a semantic search engine for mathematics
Matlas is designed as a semantic search engine built specifically for mathematical statements, rather than for whole papers or PDFs (Ju et al., 19 Apr 2026). Its stated aims include answering queries such as whether a result is already known, finding related theorems and variants, and identifying historical origins, with intended users comprising both human mathematicians and AI systems for mathematics.
Its corpus is large and explicitly curated. The system is built on 8.07 million statements extracted from 435K peer-reviewed papers spanning 1826–2025, drawn from 180 journals selected using an ICM citation-based criterion, together with 1.9K textbooks (Ju et al., 19 Apr 2026). The journal-selection procedure begins from references cited in International Congress of Mathematicians proceedings, counts citations to papers from 2007–2021, and retains journals cited more than 50 times and publishing at least 100 papers in that period. This yields the 180-journal set. The starting journal collection comprises 606K PDFs, of which 435K are found to contain mathematical statements.
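The journal-selection rule can be sketched as a simple filter. This is a minimal illustration, assuming that per-journal citation and publication counts for 2007–2021 have already been tallied; the `JournalStats` record, its field names, and the sample figures are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalStats:
    """Hypothetical per-journal tallies for the 2007-2021 window."""
    name: str
    icm_citations: int     # citations from ICM proceedings to the journal's papers
    papers_published: int  # papers the journal published in the window


def select_journals(stats, min_citations_exclusive=50, min_papers=100):
    """Keep journals cited more than 50 times that published at least 100 papers."""
    return [
        j.name
        for j in stats
        if j.icm_citations > min_citations_exclusive
        and j.papers_published >= min_papers
    ]


# Made-up sample data, for illustration only.
sample = [
    JournalStats("Journal A", icm_citations=120, papers_published=900),
    JournalStats("Journal B", icm_citations=50, papers_published=400),   # not > 50
    JournalStats("Journal C", icm_citations=75, papers_published=99),    # not >= 100
]
selected = select_journals(sample)
```

Note the asymmetry in the thresholds as stated: the citation bound is strict ("more than 50"), while the publication bound is inclusive ("at least 100").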
The core representational unit is the mathematical statement: a localized textual span corresponding to a definition, lemma, proposition, theorem, corollary, and related categories. Each statement has a type, textual content, and local dependency links to other statements in the same document. This design directly addresses a central problem in mathematical retrieval: isolated theorem statements are often uninterpretable without earlier definitions, notation, and lemmas (Ju et al., 19 Apr 2026).
3. Extraction pipeline, dependency graphs, and retrieval architecture
Matlas constructs a document-level dependency graph for each paper or textbook. If the statements in a document are $s_1, \dots, s_n$, the graph is written as $G = (V, E)$ with $V = \{s_1, \dots, s_n\}$, with an edge $s_j \to s_i$ in $E$ when statement $s_i$ depends on statement $s_j$ (Ju et al., 19 Apr 2026). The graph is then layered by repeated removal of zero-in-degree nodes, producing a topological order suitable for recursive processing.
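The layering step can be sketched as follows. This is a minimal illustration of layering by repeated removal of nodes with no unresolved prerequisites (a layered variant of Kahn's algorithm), not the paper's implementation; the function name, the dictionary representation, and the toy statement identifiers are assumptions.

```python
def layer_dependency_graph(depends_on):
    """Split a document's statements into dependency layers.

    depends_on maps a statement id to the set of statement ids it depends on.
    Layer 0 holds statements with no prerequisites; each later layer holds
    statements whose prerequisites all lie in earlier layers.
    """
    # Include prerequisites that never appear as keys, so they get a layer too.
    nodes = set(depends_on) | {p for ps in depends_on.values() for p in ps}
    remaining = {n: set(depends_on.get(n, ())) for n in nodes}

    layers = []
    while remaining:
        # Nodes whose prerequisites have all been removed (zero in-degree).
        ready = sorted(n for n, prereqs in remaining.items() if not prereqs)
        if not ready:
            raise ValueError("dependency cycle detected; graph is not a DAG")
        layers.append(ready)
        for n in ready:
            del remaining[n]
        for prereqs in remaining.values():
            prereqs.difference_update(ready)
    return layers


# Toy document: a definition, two lemmas using it, and a theorem using both.
toy = {
    "def1": set(),
    "lem1": {"def1"},
    "lem2": {"def1"},
    "thm1": {"lem1", "lem2"},
}
toy_layers = layer_dependency_graph(toy)
```

Concatenating the layers in order gives a topological order, so every statement is visited only after all of its prerequisites.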
Statement extraction is performed by a two-stage LLM pipeline. In the locator stage, DeepSeek-V3.2 inspects OCR-derived markdown and infers document-specific formatting patterns, including regular expressions that localize candidate statement spans. In the structurer stage, candidates are grouped into overlapping batches, with a default batch size of 5 and a default window length specified in characters; the LLM then assigns types, extracts structured statement text, and identifies local dependency links (Ju et al., 19 Apr 2026). The result is a typed dependency graph for each document.
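The grouping of candidates into overlapping batches can be illustrated with a short sketch. Only the default batch size of 5 comes from the description above; the amount of overlap (one shared candidate here), the function name, and the integer stand-ins for candidate spans are assumptions made for illustration.

```python
def overlapping_batches(items, batch_size=5, overlap=1):
    """Group items into consecutive batches that share `overlap` items.

    batch_size=5 mirrors the stated default; overlap=1 is an assumed value.
    """
    if batch_size <= 0 or not 0 <= overlap < batch_size:
        raise ValueError("need batch_size > 0 and 0 <= overlap < batch_size")
    step = batch_size - overlap
    batches = []
    for start in range(0, len(items), step):
        batches.append(items[start:start + batch_size])
        if start + batch_size >= len(items):
            break  # the last batch already reaches the end of the list
    return batches


# Twelve candidate spans, represented here by their indices.
candidate_ids = list(range(12))
batches = overlapping_batches(candidate_ids)
```

Sharing candidates between neighbouring batches gives the model some context for a statement that refers to the one just before it, which is one plausible motivation for overlapping rather than disjoint batches.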
Matlas then performs statement unfolding. Layer-0 nodes, having no dependencies, are retained as-is. Higher-layer statements are recursively expanded using already processed prerequisite statements from earlier layers, producing more self-contained representations (Ju et al., 19 Apr 2026). This differs from one-shot dependency recovery strategies by expanding only first-order dependencies and propagating contextualization incrementally along the dependency DAG.
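The unfolding pass can be sketched as a loop over dependency layers. This is a minimal illustration under stated assumptions: the `expand` callback stands in for whatever LLM-based rewriting the system performs, and the toy stub below merely appends the prerequisite text so that the propagation is visible. All names here are hypothetical.

```python
def unfold_statements(texts, depends_on, layers, expand):
    """Produce self-contained statement texts, layer by layer.

    texts:      statement id -> original text
    depends_on: statement id -> set of direct prerequisite ids
    layers:     list of lists of ids, with layer 0 having no prerequisites
    expand:     callable(text, prerequisite_texts) -> expanded text
    """
    unfolded = {}
    for depth, layer in enumerate(layers):
        for sid in layer:
            if depth == 0:
                # No dependencies: retained as-is.
                unfolded[sid] = texts[sid]
            else:
                # Only first-order prerequisites are passed in, but each one
                # is already unfolded, so earlier context propagates forward.
                context = [unfolded[p] for p in sorted(depends_on.get(sid, ()))]
                unfolded[sid] = expand(texts[sid], context)
    return unfolded


def toy_expand(text, context):
    """Stub for the LLM call: append the prerequisite texts in parentheses."""
    return f"{text} <= ({'; '.join(context)})"


chain_texts = {"a": "A", "b": "B", "c": "C"}
chain_deps = {"a": set(), "b": {"a"}, "c": {"b"}}
chain_layers = [["a"], ["b"], ["c"]]
chain_unfolded = unfold_statements(chain_texts, chain_deps, chain_layers, toy_expand)
```

In the toy chain, statement `c` depends directly only on `b`, yet its unfolded form still carries `a`, because `b` was expanded first.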
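For retrieval, unfolded statements are embedded with Qwen3-Embedding-8B and stored in a vector database (Ju et al., 19 Apr 2026). Given a natural-language query, the system prepends a theorem-retrieval-oriented instruction, embeds the resulting text, and ranks candidate statements $s$ by cosine similarity, $\mathrm{sim}(q, s) = \frac{\mathbf{e}_q \cdot \mathbf{e}_s}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_s \rVert}$, where $\mathbf{e}_q$ and $\mathbf{e}_s$ denote the embeddings of the instructed query and of the unfolded statement.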
The paper describes the system as a pure dense-retrieval pipeline and does not report a dedicated benchmark dataset, baseline comparisons, or numerical retrieval metrics such as Recall@k, MRR, or nDCG (Ju et al., 19 Apr 2026). It also notes deployment as a web service at https://matlas.ai/ with API documentation at https://matlas.ai/docs.
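The ranking step of a dense-retrieval pipeline of this kind can be illustrated in a few lines. The sketch below is not the Matlas service or its API: it uses hand-written three-dimensional toy vectors in place of real Qwen3-Embedding-8B outputs, and the function names and statement identifiers are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_statements(query_vec, statement_vecs, top_k=3):
    """Return the top_k (statement id, score) pairs, best match first."""
    scored = [
        (sid, cosine_similarity(query_vec, vec))
        for sid, vec in statement_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for embeddings of the query and of unfolded statements.
query = [1.0, 0.0, 0.0]
index = {
    "thm_a": [0.9, 0.1, 0.0],
    "def_b": [0.0, 1.0, 0.0],
    "lem_c": [0.5, 0.5, 0.0],
}
ranked = rank_statements(query, index)
```

In a deployed system the exhaustive scan would normally be replaced by the vector database's nearest-neighbour search; the scoring function is the same.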
4. MATLAS as a deep-imaging survey of nearby massive galaxies
In astronomy, MATLAS is a deep, wide-field imaging survey carried out with the Canada–France–Hawaii Telescope and the MegaCam wide-field imager, optimized for very low surface brightness emission (Duc, 2016). The survey is explicitly complementary to the Next Generation Virgo Cluster Survey for Virgo-cluster early-type galaxies, while MATLAS itself targets the broader local-volume environment outside Virgo (Duc, 2020).
Published descriptions differ slightly in sample accounting. Early reports describe a volume-limited program centered on 240 nearby massive early-type galaxies from the ATLAS$^{\mathrm{3D}}$ sample, augmented by approximately 120 late-type galaxies falling in the same wide MegaCam fields (Duc, 2016). Later reviews emphasize 177 nearby massive ETGs observed directly with the MATLAS strategy, and about 200 ETGs overall once Virgo/NGVS coverage is included (Bílek et al., 2020). All descriptions agree that the project is anchored in the ATLAS$^{\mathrm{3D}}$ nearby-ETG census and that the wide field simultaneously captures galaxy outskirts, companions, and environment.
The survey reaches a local surface-brightness limit of roughly $28.5$–$29\ \mathrm{mag\,arcsec^{-2}}$ in the $g$ band or, in the merger-debris analyses, a limiting depth of about $29\ \mathrm{mag\,arcsec^{-2}}$, in roughly 45 minutes per field (Duc, 2016). The observational strategy uses large dithers, typically 2–14 arcmin, and reduction with the Elixir-LSB pipeline or equivalent low-surface-brightness-optimized procedures to suppress flat-field residuals, parasitic light, and large-scale background structure (Duc, 2020). The effective field of view is about $1\ \mathrm{deg}^2$, and the image quality is typically sub-arcsecond, with representative seeing values of about 0.65″–0.95″ depending on band (Duc, 2020).
The scientific motivation is “galaxy archaeology”: stars in outer halos and tidal debris retain signatures of past accretion, merger geometry, and assembly timescale. MATLAS therefore targets not only the diffuse halos of ETGs but also dwarf galaxies, ultra-diffuse galaxies, globular cluster systems, and foreground Galactic cirrus, all of which become accessible once the imaging reaches surface-brightness levels well below those of standard wide surveys (Bílek et al., 2020).
5. Fine structures, merger taxonomy, and reconstruction of galaxy assembly
A central MATLAS result is a morphology-based census of collisional debris around massive galaxies. The survey classifies stellar streams as signatures of minor mergers, tidal tails as signatures of gas-rich major mergers, plumes as signatures of gas-poor major mergers, and shells as signatures of intermediate-mass mergers (Duc, 2016). Detection relies on deep multi-band images, color maps, model-subtracted residual images, and expert visual inspection.
To connect observed morphology to assembly history, MATLAS uses simulations “made in cosmological context.” Mock images are generated from multiple snapshots, projected at different orientations, and then truncated at the MATLAS surface-brightness limit, effectively discarding pixels fainter than that limit (Duc, 2016). These simulated images are classified by eye in the same way as the real data, allowing estimation of visibility windows for streams, tails, shells, and related structures. The conceptual inference is that if a structure class has visibility time $\tau$ and current occurrence fraction $f$, then the recent rate of the corresponding merger channel is approximately $f/\tau$ (Duc, 2016).
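Both steps of this inference can be made concrete with a small sketch: masking mock-image pixels fainter than a surface-brightness limit, and converting an occurrence fraction and a visibility time into a rate. The limit of 29 mag arcsec⁻² and the 3 Gyr visibility time used below are illustrative assumptions, the 0.15 fraction echoes the roughly 15% incidence quoted for individual structure classes, and the function names are hypothetical.

```python
def truncate_at_limit(sb_map, limit_mag_arcsec2):
    """Mask pixels fainter than the limit in a surface-brightness map.

    Surface brightness is in mag/arcsec^2, so FAINTER means NUMERICALLY LARGER.
    Masked pixels are returned as None.
    """
    return [
        [value if value <= limit_mag_arcsec2 else None for value in row]
        for row in sb_map
    ]


def merger_rate_per_gyr(occurrence_fraction, visibility_time_gyr):
    """Approximate recent event rate per galaxy per Gyr as fraction / visibility time."""
    if not 0.0 <= occurrence_fraction <= 1.0:
        raise ValueError("occurrence_fraction must lie in [0, 1]")
    if visibility_time_gyr <= 0.0:
        raise ValueError("visibility_time_gyr must be positive")
    return occurrence_fraction / visibility_time_gyr


# Toy 2x3 mock image (mag/arcsec^2) and an assumed limit of 29.0.
mock_image = [
    [24.0, 27.5, 30.2],
    [28.9, 29.4, 31.0],
]
visible = truncate_at_limit(mock_image, 29.0)

# Assumed inputs: 15% of galaxies show the feature, visible for 3 Gyr.
rate = merger_rate_per_gyr(0.15, 3.0)
```

The estimate treats the rate as constant over the visibility window and ignores overlap between structure classes, so it is an order-of-magnitude guide rather than a measurement.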
The empirical disturbance fraction rises markedly with imaging depth. For massive ETGs, the fraction of tidally perturbed systems increases from about 15% in classical shallower surveys to about 40% in MATLAS (Duc, 2016). Survey reviews summarize the overall disturbance incidence as about 30–40%, with likely or secure streams, tails, and shells each occurring in about 15% of ETGs, with overlap between categories (Duc, 2020). This indicates that a substantial fraction of late-time assembly signatures are invisible in shallower imaging.
Mass and kinematics correlate strongly with debris incidence. ETGs at the high-stellar-mass end, and especially slow rotators, show signatures of recent wet major mergers with a frequency increased by a factor of 3 relative to lower-mass or fast-rotator ETGs (Duc, 2016). Reviews likewise note that at the highest masses, fully relaxed objects become the minority and streams and shells are roughly twice as frequent as at lower masses (Duc, 2020). By contrast, environmental dependence is described as mild or weak in the current analyses, rather than dominant (Duc, 2016).
6. Dwarf galaxies, UDGs, globular clusters, and compact nuclei in MATLAS
Beyond the host ETGs, MATLAS has produced a major census of low-surface-brightness satellites. The global dwarf catalog contains 2210 dwarf candidates across about 142 deg$^2$, with roughly 75% morphologically classified as early-type dwarfs and 23.2% classified as nucleated (Habas et al., 2019). For 13.5% of the sample, pre-existing spectroscopic or H I information provides distances, and 99% of that subsample satisfy the dwarf criterion $M_g > -18$; about 90% have relative velocities indicating satellite status around nearby massive galaxies (Habas et al., 2019). Structural and photometric studies find that MATLAS dwarfs occupy size–luminosity–surface-brightness relations comparable to Local Group, Virgo, and Fornax dwarfs, while their average colors are as red as cluster dwarfs despite their lower-density environments (Poulain et al., 2021).
Within this dwarf population, a dedicated UDG study identifies 59 ultra-diffuse galaxies, about 3% of the dwarf catalog and about 0.4 UDG per square degree (Marleau et al., 2021). A broader MATLAS review gives a somewhat larger estimate (about 90 objects, or 4% of the dwarfs) under the standard UDG criteria $R_e \geq 1.5$ kpc and $\mu_{0,g} \geq 24\ \mathrm{mag\,arcsec^{-2}}$ (Duc, 2020). In the 59-object analysis, 61% of UDGs fall within group virial radii, their nucleated fraction is about 34%, only five show signs of tidal disruption, and only two are identified as tidal dwarf galaxy candidates (Marleau et al., 2021). Their globular-cluster specific frequencies and GC-inferred halo-to-stellar-mass ratios do not exceed those of classical dwarfs in the same environments, leading to the interpretation that the large majority of field-to-group UDGs do not require a formation scenario distinct from that of traditional dwarfs (Marleau et al., 2021).
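Because a UDG cut is applied to physical size, the classification depends on the adopted distance. The sketch below illustrates this with the small-angle relation, assuming the commonly used thresholds of an effective radius of at least 1.5 kpc and a central $g$-band surface brightness of at least 24 mag arcsec⁻²; the function names and the example galaxy are hypothetical, and cosmological corrections are ignored, which is reasonable only for nearby objects.

```python
import math

ARCSEC_TO_RAD = math.pi / (180.0 * 3600.0)


def effective_radius_kpc(theta_arcsec, distance_mpc):
    """Physical size from angular size via the small-angle approximation."""
    return theta_arcsec * ARCSEC_TO_RAD * distance_mpc * 1000.0  # Mpc -> kpc


def is_udg(re_kpc, mu0_g, min_re_kpc=1.5, min_mu0_g=24.0):
    """Apply the assumed size and central-surface-brightness thresholds."""
    return re_kpc >= min_re_kpc and mu0_g >= min_mu0_g


# Hypothetical dwarf: 10 arcsec effective radius, mu_0,g = 24.5 mag/arcsec^2.
re_at_30 = effective_radius_kpc(10.0, 30.0)  # host-based distance guess
re_at_35 = effective_radius_kpc(10.0, 35.0)  # revised distance
udg_at_30 = is_udg(re_at_30, 24.5)
udg_at_35 = is_udg(re_at_35, 24.5)
```

In this toy case the same object falls just below the size cut at 30 Mpc and above it at 35 Mpc, which is why candidates without direct distance measurements remain provisional.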
Radio follow-up extends this picture into the gas phase. Of 1773 MATLAS dwarfs with available H I observations, 145 have H I detections, an 8% detection fraction (Poulain et al., 2021). The H I-bearing sample includes 42 dwarf ellipticals—the largest sample of H I-bearing dEs reported there—along with 3 UDGs, 17 transition-type dwarfs, 7 tidal dwarf candidates, and 14 disrupted objects (Poulain et al., 2021). For 79% of the H I satellites of massive ETGs, the H I mass increases with projected distance to the host, and dynamical estimates identify 7 dwarfs, or 5% of the H I sample, as dark-matter-deficient candidates (Poulain et al., 2021).
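The rounded percentages quoted for the dwarf, UDG, and H I samples can be checked directly from the stated counts. The short sketch below only redoes that arithmetic; the variable and function names are illustrative.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


n_dwarfs = 2210          # dwarf candidates in the catalog
survey_area_deg2 = 142   # approximate survey area
n_udgs = 59              # UDGs in the dedicated study
n_hi_observed = 1773     # dwarfs with available H I observations
n_hi_detected = 145      # of which detected in H I
n_dm_deficient = 7       # dark-matter-deficient candidates

udg_fraction_pct = percent(n_udgs, n_dwarfs)             # quoted as "about 3%"
udg_density_per_deg2 = n_udgs / survey_area_deg2          # quoted as "about 0.4"
hi_detection_pct = percent(n_hi_detected, n_hi_observed)  # quoted as "8%"
dm_deficient_pct = percent(n_dm_deficient, n_hi_detected) # quoted as "5%"
```

All four quoted values agree with the counts once rounded to the precision used in the text.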
High-resolution follow-up with HST/ACS resolves the compact stellar systems at dwarf centers. A program targeting 79 MATLAS dwarfs and UDGs yields an NSC-focused sample of 41 nucleated dwarfs, including 13 newly identified nucleated dwarfs, 2 double-nucleus systems, and 5 candidate ultra-compact-dwarf progenitors (Poulain et al., 23 Sep 2025). In that sample, the NSC Sérsic index increases with luminosity and stellar mass, while bright NSCs tend to show bluer centers and fainter NSCs tend to show flatter color profiles, supporting a mixed formation picture in which globular-cluster migration is supplemented by in-situ star formation in the more massive nuclei (Poulain et al., 23 Sep 2025).
Subsequent dynamical follow-up has also used MATLAS satellite systems to test anisotropy claims. A lopsidedness analysis of 47 isolated MATLAS hosts finds that about 16% of systems are significantly lopsided under the wedge metric, rising to about 21% when six metrics are combined (Heesters et al., 2024). By contrast, MUSE spectroscopy of the NGC 474 system confirms 9 of 13 candidate dwarfs yet does not support a significant global plane-of-satellites within the virial radius, illustrating the importance of spectroscopic membership confirmation for MATLAS satellite studies (Müller et al., 2024).
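A wedge-style lopsidedness test can be sketched as follows. This is a simplified illustration, not the procedure of Heesters et al.: it counts the largest number of satellites that fit inside a wedge of fixed opening angle centred on the host, then estimates by Monte Carlo how often an isotropic system of the same size does at least as well. The opening angle, trial count, function names, and toy position angles are all assumptions.

```python
import random


def max_wedge_count(angles_deg, opening_deg):
    """Largest number of satellites inside any wedge of the given opening angle.

    An optimal wedge can always be rotated until one edge touches a satellite,
    so it suffices to try wedges starting at each satellite's position angle.
    """
    if not 0.0 < opening_deg <= 360.0:
        raise ValueError("opening_deg must lie in (0, 360]")
    best = 0
    for start in angles_deg:
        count = sum(1 for a in angles_deg if (a - start) % 360.0 <= opening_deg)
        best = max(best, count)
    return best


def isotropic_p_value(n_satellites, observed_count, opening_deg, trials=2000, seed=0):
    """Fraction of random isotropic systems with a wedge count >= observed_count."""
    rng = random.Random(seed)  # seeded for reproducibility
    hits = 0
    for _ in range(trials):
        angles = [rng.uniform(0.0, 360.0) for _ in range(n_satellites)]
        if max_wedge_count(angles, opening_deg) >= observed_count:
            hits += 1
    return hits / trials


# Toy system: position angles (degrees) of five satellites around a host.
toy_angles = [5.0, 20.0, 40.0, 75.0, 230.0]
observed = max_wedge_count(toy_angles, 90.0)
p_value = isotropic_p_value(len(toy_angles), observed, 90.0)
```

A small `p_value` would flag the system as lopsided under this particular metric. With only a handful of satellites per host the test has little statistical power, which is one reason to combine several metrics and to confirm membership spectroscopically.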
7. Limitations and broader significance
Both Matlas projects are explicitly limited by the quality of contextual reconstruction. For the mathematical Matlas, the paper notes coverage gaps, OCR noise, possible mis-detection of statements and dependencies, and the fact that dependencies are modeled within documents rather than across the literature as a whole (Ju et al., 19 Apr 2026). It also notes the absence of benchmark retrieval metrics, so current evaluation is methodological and corpus-based rather than standardized.
For astronomical MATLAS, the major limitations are surface-brightness completeness, projection and orientation effects, visual-classification subjectivity, foreground cirrus, bright-star reflection halos, and, for the dwarf catalog, the fact that many candidate satellites still lack direct distance measurements (Duc, 2016). Follow-up spectroscopy shows that a non-negligible minority of photometrically selected satellites are interlopers or are associated with different hosts, and even UDG classification can change when physical size is revised from measured redshift (Southon et al., 2 Jul 2025).
These limitations point to a shared methodological role behind the otherwise unrelated uses of the name. In mathematics, Matlas organizes the literature as a graph of interdependent statements; in astronomy, MATLAS organizes faint structures around galaxies into a census of assembly signatures, satellites, and environmental tracers. The term therefore functions, in both domains, as the name of a research infrastructure for recovering context that is not immediately visible in isolated documents or shallow images.