
AgML Collection: Graph ML & Domain Data Suite

Updated 20 December 2025
  • AgML Collection is an aggregation of datasets and frameworks designed for formalized mathematics, environmental modeling, and graph-based ML applications.
  • It represents Agda libraries as directed multigraphs using s-expressions, JSON, and CSV formats to capture complex interrelations effectively.
  • Advanced methods like GL-Coarsener and AGML meta-learning enable efficient numerical analysis, indoor localization, and predictive environmental analytics.

AgML Collection is an aggregation of data sets and machine learning frameworks developed for mathematical, scientific, and domain-specific applications, with an emphasis on graph-based representations, formalized mathematics, and agriculture/environmental modeling. The collection extends across three principal areas: mathematical formalization using Agda libraries, environmental and agricultural data integration, and advanced graph ML methods for numerical analysis and indoor localization.

1. Constituent Data Sets and Libraries

Agda MLFMF Libraries

The AgML Collection includes three Agda libraries from the MLFMF benchmark (Bauer et al., 2023):

  • Agda Standard Library: 5,336 entries
  • Agda-unimath: 50,120 entries
  • TypeTopology: 3,650 entries

Each library is represented as a directed multigraph $G = (V, E)$, where $V$ encompasses definitions, datatypes, functions, lemmas, and postulates. Edge types encode declaration and body references, module structure, and definitional relations. Entry types are categorized as :data, :constructor, :function, :record, :axiom, :primitive, :sort, :recursor, and :abstract.

  • Network statistics:

| Library           | \|V\|  | \|E\|   | avg deg | # modules |
|-------------------|--------|---------|---------|-----------|
| Agda Standard Lib | 5,336  | 25,712  | 9.64    | 27        |
| Agda-unimath      | 50,120 | 482,310 | 19.22   | 173       |
| TypeTopology      | 3,650  | 29,810  | 16.34   | 21        |

  • S-expression representation: Each entry's computational graph is exported as a Lisp-style s-expression. For example:

```lisp
(:entry
  (:name Agda.Builtin.Nat.id)
  (:type (Π (n : Nat) → Nat))
  (:data
    (:lambda n
      (:var n)))
)
```
These formats are stored as .lisp (s-expressions), JSON (modules, entries), and CSV (edges).
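The edge CSVs can be loaded into a directed multigraph without any graph library. A minimal Python sketch, assuming a hypothetical three-column schema `src,dst,edge_type` (the actual column names in `edges.csv` may differ); note that the table's "avg deg" column corresponds to the total degree 2|E|/|V|:

```python
import csv
import io
from collections import defaultdict

# Inline sample standing in for edges.csv (the schema is an assumption).
SAMPLE = """src,dst,edge_type
Nat.id,Nat,REF_DECL
Nat.id,Nat,REF_BODY
Nat._+_,Nat,REF_DECL
"""

def load_multigraph(fp):
    """Parse an edge CSV into a directed multigraph:
    adjacency[u][v] is the list of edge types (parallel edges kept)."""
    adjacency = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(fp):
        adjacency[row["src"]][row["dst"]].append(row["edge_type"])
    return adjacency

def average_degree(adjacency):
    """Average total degree 2|E|/|V| over all nodes seen as source or target."""
    nodes = set(adjacency)
    edges = 0
    for u, nbrs in adjacency.items():
        for v, types in nbrs.items():
            nodes.add(v)
            edges += len(types)
    return 2 * edges / len(nodes)

g = load_multigraph(io.StringIO(SAMPLE))
print(g["Nat.id"]["Nat"])    # ['REF_DECL', 'REF_BODY'] — parallel edges kept
print(average_degree(g))     # 2.0 (3 nodes, 3 edges)
```

Keeping parallel edges per type is what distinguishes the multigraph view from a plain adjacency matrix; the typed lists can later be collapsed into weights.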

Agrimonia Environmental Data

Agrimonia (Fassò et al., 2022) provides a harmonized dataset of daily air quality, meteorology, emissions, livestock, and land/soil use (2016–2021) for the Lombardy region. It is designed for direct ingestion into AgML workflows, supporting regression, classification, causal inference, and spatiotemporal modeling.

  • Core modalities:
    • Air Quality: PM₁₀, PM₂.₅, NO₂, NOₓ, CO, SO₂, NH₃ (daily avg at 141 ground stations)
    • Weather: Temperature, humidity, wind speed/direction, precipitation, surface pressure, solar radiation (ERA5 reanalysis)
    • Emissions: NH₃, NOₓ, SO₂ (CAMS anthropogenic inventory, monthly 0.1° grid, interpolated to daily/station via PCHIP)
    • Livestock: Swine and bovine densities (biannual municipal, harmonized via PCHIP)
    • Land/Soil: High/low vegetation indices, CORINE land cover (44 categories), SIARL soil/cultivation (21 categories)
  • Data harmonization: Inverse-distance weighting (IDW) is used for gridding; temporal interpolation uses PCHIP; air quality gaps are imputed by Kalman smoothing over state-space models.
  • Metadata: All outputs are on a unified spatiotemporal grid (daily × station), SI units, ISO 8601 timestamps, WGS84 coordinates.
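The inverse-distance weighting used for gridding can be sketched in a few lines; this is a generic IDW implementation under the usual power-law weighting assumption, not Agrimonia's exact gridding code:

```python
import math

def idw(stations, target, power=2.0):
    """Inverse-distance-weighted estimate at `target` from a list of
    ((x, y), value) station observations. `power` is the usual IDW exponent."""
    num = den = 0.0
    for (x, y), value in stations:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:                 # target coincides with a station
            return value
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Three stations at unit distance contribute equal weight:
obs = [((0, 1), 10.0), ((1, 0), 20.0), ((0, -1), 30.0)]
print(idw(obs, (0, 0)))  # → 20.0
```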

2. Graph- and ML-Based Methodologies

Heterogeneous Network Analysis and Embedding

Each AgML library can be parsed as a graph. Edges are weighted by reference type:

  • Unweighted adjacency: $A_{ij} = 1$ if entry $i$ refers to $j$
  • Weighted: $W_{ij}$ is the sum over edge types, e.g. $w_{ij} = \alpha\,\#(\text{REF\_DECL}(i, j)) + \beta\,\#(\text{REF\_BODY}(i, j))$ (typically $\alpha = \beta = 1$).
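The weighted adjacency above reduces to a per-pair weighted count of typed edges. A minimal sketch (the edge list and weight values are illustrative, not real AgML data):

```python
from collections import Counter

# (source, target, edge_type) triples; illustrative only.
edges = [
    ("f", "g", "REF_DECL"),
    ("f", "g", "REF_BODY"),
    ("f", "h", "REF_BODY"),
]

def weighted_adjacency(edges, alpha=1.0, beta=1.0):
    """w_ij = alpha * #REF_DECL(i, j) + beta * #REF_BODY(i, j)."""
    counts = Counter(edges)
    weights = {}
    for (u, v, t), n in counts.items():
        coeff = alpha if t == "REF_DECL" else beta
        weights[(u, v)] = weights.get((u, v), 0.0) + coeff * n
    return weights

print(weighted_adjacency(edges))  # {('f', 'g'): 2.0, ('f', 'h'): 1.0}
```

With α = β = 1 this collapses to a plain edge count between each ordered pair, matching the default noted in the text.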

The s-expressions facilitate AST-level parsing for ML; streaming Lisp parsers (e.g., sexpdata) are recommended.
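Where a dedicated parser such as sexpdata is unavailable, a minimal recursive s-expression reader (sufficient for the token shapes shown above, not a full Lisp reader) can be written with the standard library alone:

```python
import re

# An atom is any run of non-space, non-paren characters.
TOKEN = re.compile(r"[()]|[^\s()]+")

def parse_sexp(text):
    """Parse one s-expression into nested Python lists of string atoms."""
    tokens = TOKEN.findall(text)
    pos = 0
    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume ")"
            return items
        return tok
    return read()

tree = parse_sexp("(:entry (:name Agda.Builtin.Nat.id) (:data (:var n)))")
print(tree[0])  # :entry
print(tree[1])  # [':name', 'Agda.Builtin.Nat.id']
```

For deeply nested entries, an explicit stack (or raising Python's recursion limit) avoids hitting the default recursion depth.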

GL-Coarsener: ML-Driven AMG Coarsening

The GL-Coarsener framework (Namazi et al., 2020) exemplifies entity-level graph representation learning for efficient algebraic multi-grid (AMG) construction:

  • Graph definition: From an SPD matrix $A$, define the graph $G = (V, E, W)$ with edge weights $w_{ij} = |A_{ij}|$.
  • Embedding: node2vec with skip-gram training; embeddings $\mathbf{z}_i \in \mathbb{R}^{128}$, random walks with $p = 0.1$, $q = 1$.
  • Clustering: Mini-batch $K$-means with $K \approx n/5$.
  • Transfer operators: A "rough" prolongation matrix $\hat{P}$ assigns nodes to aggregates; optional smoothing; restriction $R = P^T$.
  • Efficiency: All stages are parallelizable; on $P$ processors the embedding and coarsening overhead scales as

$$T_{\mathrm{coarsen}}(P) \approx \frac{c_1|V| + c_2|E|}{P} + O(\log P)$$

  • Convergence: Comparable iteration counts to Vaněk aggregation AMG; significant improvement over Beck classical coarsener.
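The "rough" prolongation step amounts to a one-hot node-to-aggregate assignment matrix. A pure-Python sketch (the cluster labels here are illustrative stand-ins for the K-means output):

```python
def rough_prolongation(labels, num_clusters):
    """Build the rough prolongation P (n x K): P[i][k] = 1 iff node i
    belongs to aggregate k."""
    n = len(labels)
    P = [[0.0] * num_clusters for _ in range(n)]
    for i, k in enumerate(labels):
        P[i][k] = 1.0
    return P

def restriction(P):
    """Restriction operator R = P^T."""
    return [list(col) for col in zip(*P)]

labels = [0, 0, 1, 2, 1]          # e.g. 5 fine nodes mapped to K = 3 aggregates
P = rough_prolongation(labels, 3)
print(P[2])               # [0.0, 1.0, 0.0] — node 2 sits in aggregate 1
print(restriction(P)[0])  # [1.0, 1.0, 0.0, 0.0, 0.0] — aggregate 0's members
```

Optional prolongation smoothing (e.g. a damped-Jacobi sweep applied to this rough P) would replace the hard 0/1 assignments with interpolated weights.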

Attentional Graph Meta-Learning (AGML) for Localization

AGML (Yan et al., 7 Apr 2025) advances graph meta-learning for sparse fingerprint-based indoor localization:

  • Graph construction: Nodes = fingerprints; adjacency $A$ learned according to latent-space proximity and attention-derived pairwise scores.
  • Feature embedding: Raw fingerprints $x_i$ mapped via learned transforms to embeddings $l_i$.
  • Adjacency learning module (ALM): Two-stage (coarse threshold $T_h^0$; fine, attention-modulated threshold $T_{ij}^A$); differentiable adjacency via ReLU and tanh.
  • Message passing: Multiple graph attention layers (MGALs) update node embeddings; output matches 2D coordinates on labeled data.
  • Meta-learning: Model wrapped in MAML; tasks sampled from synthetic digital twin environments; adaptation via gradient-based inner and outer loops.
  • Data augmentation: Unlabeled roaming measurements and synthetic digital twin fingerprints; feature-space distribution alignment between synthetic and real-world data.
  • Performance: AGML yields RMSE reductions of up to 2× versus AGNN, CNN, and MLP baselines when labeled data are sparse. Ablations show performance drops when the ALM, meta-learning, or augmentation components are omitted.
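The coarse-then-fine thresholding in the ALM can be illustrated on a plain similarity matrix; this is an interpretive sketch of two-stage pruning with hard thresholds, not the paper's differentiable ReLU/tanh formulation:

```python
def two_stage_adjacency(similarity, attention, coarse_th, fine_th):
    """Stage 1 keeps pairs whose latent-space similarity exceeds coarse_th;
    stage 2 additionally requires the attention score to exceed fine_th.
    Returns a 0/1 adjacency matrix with self-loops excluded."""
    n = len(similarity)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if similarity[i][j] > coarse_th and attention[i][j] > fine_th:
                A[i][j] = 1
    return A

sim = [[1.0, 0.9, 0.2],
       [0.9, 1.0, 0.6],
       [0.2, 0.6, 1.0]]
att = [[1.0, 0.8, 0.9],
       [0.8, 1.0, 0.3],
       [0.9, 0.3, 1.0]]
print(two_stage_adjacency(sim, att, coarse_th=0.5, fine_th=0.5))
# [[0, 1, 0], [1, 0, 0], [0, 0, 0]] — (1, 2) passes stage 1 but fails stage 2
```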

3. Experimental Protocols, Metrics, and Baselines

All AgML libraries have associated link-prediction and recommendation baselines:

  • Splitting protocol: 20% of function nodes are held out; 10% of proof-body edges are retained.
  • Metrics:
    • Accuracy@k: Fraction of queries with at least one correct reference in top-k.
    • Recall@k: Fraction of correct references recovered in top-k.
    • minRank: Mean minimal rank of true references.
  • Methods: Dummy (max in-degree), BoW Jaccard, TFIDF, fastText BoW, analogies, node2vec + tree bagging.
  • Results: node2vec+bagging achieves 96–98% acc@5; TFIDF and analogies yield 51–52%.
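These retrieval metrics can be computed directly from ranked candidate lists. A minimal sketch, assuming each query carries a ranked prediction list and a set of true references (the data below is illustrative):

```python
def accuracy_at_k(preds, truths, k):
    """Fraction of queries with at least one true reference in the top-k."""
    hits = sum(1 for p, t in zip(preds, truths) if set(p[:k]) & t)
    return hits / len(preds)

def recall_at_k(preds, truths, k):
    """Fraction of all true references recovered within the top-k."""
    found = sum(len(set(p[:k]) & t) for p, t in zip(preds, truths))
    total = sum(len(t) for t in truths)
    return found / total

def min_rank(preds, truths):
    """Mean of the minimal (1-based) rank of any true reference per query."""
    ranks = []
    for p, t in zip(preds, truths):
        ranks.append(min(p.index(x) + 1 for x in t if x in p))
    return sum(ranks) / len(ranks)

preds  = [["a", "b", "c"], ["x", "y", "z"]]
truths = [{"b"}, {"z", "q"}]
print(accuracy_at_k(preds, truths, 2))  # 0.5  (only the first query hits)
print(recall_at_k(preds, truths, 3))    # 2/3  (b and z found, q missed)
print(min_rank(preds, truths))          # 2.5  ((2 + 3) / 2)
```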

Agrimonia Integration

Agrimonia is suited for ML tasks via AgML:

  • Predictive regression (e.g. daily PM₂.₅ with Random Forest or neural nets)
  • Spatial kriging or hierarchical modeling (Gaussian processes)
  • Time-series forecasting (LSTM, TCN)
  • Classification (threshold detection)
  • Causal inference (difference-in-differences)
  • Epidemiological modeling (exposure analytics)

All modalities are cross-compatible, and aggregation formulas adhere to documented standards (August–Roche–Magnus for relative humidity, PCHIP for temporal interpolation).
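The August–Roche–Magnus relation gives saturation vapor pressure from temperature, from which relative humidity follows from air temperature and dew point. A sketch using the widely cited Magnus coefficients 17.625 and 243.04 °C (which exact variant Agrimonia adopts is an assumption):

```python
import math

def saturation_vapor_pressure(t_celsius):
    """August–Roche–Magnus: e_s(T) in hPa for T in degrees Celsius."""
    return 6.1094 * math.exp(17.625 * t_celsius / (t_celsius + 243.04))

def relative_humidity(t_celsius, dewpoint_celsius):
    """RH (%) as the ratio of vapor pressure at the dew point to
    saturation vapor pressure at the air temperature."""
    return 100.0 * (saturation_vapor_pressure(dewpoint_celsius)
                    / saturation_vapor_pressure(t_celsius))

print(relative_humidity(20.0, 20.0))            # 100.0 (saturated air)
print(round(relative_humidity(25.0, 10.0), 1))  # ≈ 38.8
```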

4. Data Formats, Access, and Processing

  • Agda: sexp/*.lisp (entries), modules.json, entries.json, edges.csv; recommended tools are streaming Lisp parsers and standard JSON/CSV libraries. Parenthesis nesting depth on the order of 100 may require tailored parsing settings.
  • Agrimonia: Agrimonia_Dataset.csv, MATLAB/R conversions, metadata files for full variable provenance.
  • GL-Coarsener: Features extract directly from matrix coefficients; node2vec requires walk sampling parameters; $K$-means uses mini-batch size $n/15$.
  • AGML: Features and adjacency matrices learned with defined hyperparameters: α=10⁻⁴, β=10⁻³, ζ=10⁻²; λ₁=0.1, λ₂=0.01; hidden dims and attention parameters set per framework implementation.

5. Limitations and Prospects

  • Coverage: AgML's formalized mathematics is currently restricted to Agda libraries; Agrimonia covers only EU rural scenes.
  • Generalizability: Agricultural segmentation data are EU-centric, and depth estimation is monocular (no multispectral/LiDAR).
  • Evaluation: High-level recommendation and human-aligned tasks require expert evaluation; subjective scoring (e.g., aesthetics via Q-Align) remains a challenge.
  • Parameter Tuning: All major ML components require context-specific hyperparameter selection.
  • Extensions: GL-Coarsener can be advanced to supervised graph neural network coarseners and energy-norm–optimized embeddings. AGML meta-learning may be adapted for continuous re-coarsening or time-sequenced inference.
  • Responsible AI: Model fairness, bias, and explainability are open research axes.

6. Applications and Impact

AgML Collection underpins a wide array of ML and reasoning pipelines:

  • Mathematical formalization: Accelerates automated theorem proving, type inference, and link recommendation in Agda/Lean.
  • Environmental modeling: Integrates high-resolution air quality and agricultural emissions for causal and predictive analytics (Agrimonia).
  • Numerical solvers: ML-driven coarsening yields scalable, parallelizable AMG cycles for PDE solvers (GL-Coarsener).
  • Indoor localization: Fuses meta-learning and graph attention to deliver robust localization under data sparsity (AGML).

A plausible implication is that AgML Collection enables unified, graph-represented ML benchmarks for cross-domain learning, ranging from formal mathematics to environmental science and wireless localization, providing a foundation for both supervised and unsupervised graph-based ML research.
