
Plantnet-Fine-Tuned DinoV2

Updated 16 October 2025
  • The paper presents a novel framework that fine-tunes DINOv2 ViT backbones to achieve robust multi-label plant species classification through grid-based inference and ecological priors.
  • By integrating transformer-based feature extraction with sophisticated tiling and distributed data processing, the approach effectively handles large, heterogeneous botanical datasets.
  • Incorporating domain-prior adaptation and Bayesian reweighting, the method significantly improves performance metrics in competitions like PlantCLEF 2024 and 2025.

Plantnet-Fine-Tuned DinoV2 refers to a class of transfer learning methodologies that apply the DINOv2 self-supervised Vision Transformer (ViT) architecture to plant species classification tasks, particularly in large-scale, multi-label contexts such as the PlantCLEF 2024 and 2025 competitions. These approaches combine the representational power of DINOv2 with the plant-specific discrimination afforded by fine-tuning on relevant botanical datasets, supported by sophisticated data processing pipelines, tiling strategies, and domain-prior adaptations.

1. Vision Transformer Architecture and Fine-Tuning Protocols

The core architecture for Plantnet-Fine-Tuned DinoV2 is based on the ViT-B/14 or ViT-L/14 models distilled with DINOv2, wherein each image is divided into 256 fixed-size patches embedded into $\mathbb{R}^{768}$, with a prepended [CLS] token yielding an output tensor in $\mathbb{R}^{257\times 768}$ (Gustineli et al., 8 Jul 2024). Absolute positional embeddings preserve spatial structure for the transformer encoder. For PlantCLEF 2024, two fine-tuning regimens were implemented: (1) freezing the backbone and training only a new classification head, or (2) jointly fine-tuning both backbone and head ("dinov2-onlyclassifier-then-all"). The latter was utilized, with a downstream linear classifier trained using the negative log-likelihood (NLL) loss, $L = -\log p(i)$, where $p(i)$ is the predicted probability of the correct class $i$.
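As a concrete illustration, a minimal PyTorch sketch of the two-stage regimen might look as follows. The hub entry point is the public facebookresearch/dinov2 release; the class count and training-loop details are placeholders rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

# Load the distilled ViT-B/14 backbone from the public DINOv2 release.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
num_classes = 7806  # hypothetical species count, for illustration only
head = nn.Linear(768, num_classes)

# Stage 1 ("onlyclassifier"): freeze the backbone, train only the new head.
for p in backbone.parameters():
    p.requires_grad = False

criterion = nn.NLLLoss()  # implements L = -log p(i) for the true class i

def loss_fn(images: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    feats = backbone(images)                        # [CLS]-pooled features in R^768
    log_probs = torch.log_softmax(head(feats), -1)  # log p(i) over classes
    return criterion(log_probs, labels)

# Stage 2 ("then-all"): unfreeze everything and fine-tune end to end.
for p in backbone.parameters():
    p.requires_grad = True
```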

Similarly, in the PlantCLEF 2025 pipeline, the ViTD2PC24All model, pretrained with DINOv2 on the 142-million-image LVD-142M dataset, was further fine-tuned end-to-end on the PlantCLEF single-label dataset to capture plant-specific features (Gustineli et al., 8 Jul 2025). This comprehensive adaptation enables the transformer to specialize beyond general image semantics toward fine-grained botanical discrimination.

2. Embedding Extraction and Multi-label Classification Frameworks

For generalized feature extraction, two methods are described: (1) DCT-reduced full-image embeddings ($\mathbb{R}^{257\times 768}$ compressed to $\mathbb{R}^{1\times 64}$ using an $8\times 8$ DCT), and (2) direct extraction of the [CLS] token embedding ($\mathbb{R}^{1\times 768}$), which aggregates semantic information across all patches (Gustineli et al., 8 Jul 2024). The fine-tuned ViT, with updated weights, enhances the discriminative power of the [CLS] embeddings for classifying plant species.
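A hedged sketch of both embedding variants, assuming the token matrix has shape (257, 768) with the [CLS] token in row 0; selecting the top-left low-frequency coefficient block is an assumption consistent with the stated $8\times 8$ compression:

```python
import numpy as np
from scipy.fft import dctn

def cls_embedding(tokens: np.ndarray) -> np.ndarray:
    """Variant 2: return the [CLS] token, shape (768,)."""
    return tokens[0]

def dct_embedding(tokens: np.ndarray, k: int = 8) -> np.ndarray:
    """Variant 1: 2-D DCT over the full (257, 768) token matrix,
    keeping the k x k low-frequency block, flattened to 64 values."""
    coeffs = dctn(tokens, norm="ortho")
    return coeffs[:k, :k].reshape(-1)  # shape (64,)
```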

Classification in multi-label settings leverages grid-based tiling: images, often of high resolution and encompassing multiple species per vegetation plot, are partitioned into grids (e.g., $3\times3$ or $4\times4$), with each tile processed independently to yield local embeddings and predictions. Final species sets per image are aggregated from tile-level logits by either argmax per tile or top-$K$/top-$L$ selection, achieving robust handling of spatial heterogeneity and species diversity.
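The tile-level aggregation can be sketched as below; the function names and the choice of returning a set of species ids are illustrative:

```python
import torch

def aggregate_argmax(tile_logits: torch.Tensor) -> set:
    """Argmax per tile: tile_logits has shape (n_tiles, n_classes);
    each tile contributes its single most likely species."""
    return set(tile_logits.argmax(dim=-1).tolist())

def aggregate_topk(tile_logits: torch.Tensor, k: int = 5) -> set:
    """Top-K per tile: union of the k highest-scoring species per tile."""
    topk = tile_logits.topk(k, dim=-1).indices  # (n_tiles, k)
    return set(topk.flatten().tolist())
```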

3. Data Processing, Distributed Inference, and Tiling Strategies

To address the scale and heterogeneity of the datasets (e.g., 281 GiB of raw images), distributed processing with Apache Spark is employed for memory-efficient handling and computation across clusters (Gustineli et al., 8 Jul 2024). Raw image data, downloaded via tools like aria2 to Google Cloud Storage, are converted to Apache Parquet format to optimize batch processing.
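A minimal PySpark sketch of this conversion step, with hypothetical bucket paths and a reduced schema rather than the project's actual layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plantclef-preprocess").getOrCreate()

# Read raw JPEGs as binary records from cloud storage.
images = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.jpg")
    .load("gs://example-bucket/plantclef/raw/")  # hypothetical path
)

# Persist as Parquet for memory-efficient, partitioned batch processing.
images.select("path", "content").write.mode("overwrite").parquet(
    "gs://example-bucket/plantclef/parquet/"
)
```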

For tiling, each image is cropped and resized to standardized dimensions ($128\times128$ for embedding extraction), dramatically reducing storage requirements. At inference, images are segmented into grids (for example, $3\times3$ or $4\times4$) matching the ViT's receptive field ($\sim 518\times518$), and the resulting tiles are fed independently to the fine-tuned DINOv2 backbone, with predictions subsequently aggregated for multi-label outputs (Gustineli et al., 8 Jul 2025).
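A simple tiling sketch with Pillow, assuming the $518\times518$ input resolution noted above; edge-pixel handling and interpolation choices are left as defaults:

```python
from PIL import Image

def make_tiles(img: Image.Image, n: int = 4, size: int = 518) -> list:
    """Partition an image into an n x n grid and resize each tile
    to the ViT input resolution before feature extraction."""
    w, h = img.size
    tw, th = w // n, h // n
    tiles = []
    for row in range(n):
        for col in range(n):
            box = (col * tw, row * th, (col + 1) * tw, (row + 1) * th)
            tiles.append(img.crop(box).resize((size, size)))
    return tiles
```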

4. Domain-Prior Adaptation and Handling of Ecological Context

Domain adaptation strategies address the mismatch between the extensive single-label training species pool ($>7{,}800$ species) and the more restricted, region-specific test sets ($\sim 800$ species). One technique applies PaCMAP for dimensionality reduction on [CLS] token embeddings, followed by K-Means clustering to produce unsupervised groups reflecting ecological or geographic similarities among plots (Gustineli et al., 8 Jul 2025). Geolocation filtering compares species occurrence locations against a reference coordinate (44°N, 4°E) and retains only species with European provenance, sharply narrowing the candidate set.
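The clustering step might be sketched as follows, assuming pacmap and scikit-learn as dependencies and a matrix of per-plot [CLS] embeddings; the cluster count is a placeholder, not the value used in the papers:

```python
import numpy as np
import pacmap
from sklearn.cluster import KMeans

def cluster_plots(cls_embeddings: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Reduce (n_plots, 768) [CLS] embeddings with PaCMAP, then assign
    each plot to an unsupervised K-Means cluster."""
    reduced = pacmap.PaCMAP(n_components=2).fit_transform(cls_embeddings)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reduced)
```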

Predictions are further reweighted using cluster-specific Bayesian priors. The empirical prior distribution $P(y \mid c)$ for each cluster $c$ is derived by averaging class probability vectors within the cluster, and tile-level predictions are multiplied by these priors, biasing outputs toward ecologically plausible species.
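In NumPy terms, this amounts to an elementwise product followed by renormalization; the helper names below are illustrative:

```python
import numpy as np

def cluster_priors(probs: np.ndarray, labels: np.ndarray, n_clusters: int) -> np.ndarray:
    """Empirical prior P(y|c): mean class-probability vector per cluster.
    probs: (n_plots, n_classes); labels: (n_plots,) cluster assignments."""
    return np.stack([probs[labels == c].mean(axis=0) for c in range(n_clusters)])

def reweight(tile_probs: np.ndarray, prior: np.ndarray) -> np.ndarray:
    """Multiply tile-level predictions by the cluster prior and renormalize,
    biasing outputs toward ecologically plausible species."""
    weighted = tile_probs * prior  # elementwise P(y|x) * P(y|c)
    return weighted / weighted.sum(axis=-1, keepdims=True)
```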

5. Aggregation Techniques and Performance Metrics

Grid-based inference, coupled with maximum-logit or top-$K$ selection mechanisms, demonstrates significant gains over naive whole-image methods in multi-label settings. Macro F1 (averaged per plot or per species) and Micro F1 scores provide robust evaluation, with the most prominent approach (fine-tuned DINOv2 with grid-based inference using argmax per tile) achieving leaderboard scores of 20.77 (Macro F1 Per Plot), 47.42 (Macro F1 Per Species), and 19.67 (Micro F1) on PlantCLEF 2024 (Gustineli et al., 8 Jul 2024). In PlantCLEF 2025, the use of a $4\times4$ tiling strategy with Bayesian prior reweighting increased macro-averaged F1 by two orders of magnitude relative to whole-image inference, achieving 0.348 on the private test set (Gustineli et al., 8 Jul 2025).
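For reference, these metrics map onto standard scikit-learn calls over binary plot-by-species indicator matrices; the random data below is purely illustrative, and the mapping of per-plot averaging to average="samples" is an assumption about the leaderboard's definition:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(50, 800))  # plots x species, ground truth
y_pred = rng.integers(0, 2, size=(50, 800))  # plots x species, predictions

per_species_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
per_plot_macro = f1_score(y_true, y_pred, average="samples", zero_division=0)
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
```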

Performance comparisons and ablation studies underscore the efficacy of fine-tuning, tiling alignment with receptive field size, and domain prior adaptation, particularly in the context of species-level class imbalance.

6. Codebase Structure and Reproducibility

All stages of the methodology are modularized in the provided codebases (https://github.com/dsgt-kaggle-clef/plantclef-2024, https://github.com/dsgt-arc/plantclef-2025), comprising:

  • Download scripts for robust data retrieval.
  • Preprocessing modules for cropping, resizing, tiling, and Parquet conversion.
  • Modeling and inference routines for feature extraction and classification using PyTorch Lightning.
  • Aggregation utilities implementing argmax, top-$K$, and majority-vote selection for multi-label inference.
  • Integration with distributed/cloud libraries (Spark, Petastorm) and experiment tracking tools (Weights and Biases).

Comprehensive reproducibility is emphasized, with open access to code, configuration files, and experimental scripts to facilitate verification and extension by the research community (Gustineli et al., 8 Jul 2025).

7. Significance, Limitations, and Prospects

The integration of transfer learning via fine-tuned DINOv2 backbones, distributed data workflows, multi-scale tiling, and prior-informed aggregation constitutes a robust framework for multi-label plant species classification on large, heterogeneous botanical datasets. Localized grid inference captures patch-level diversity crucial for accurate ecological survey analysis, while domain adaptation mitigates species pool mismatch.

Primary limitations include reliance on extensive computational resources for large-scale distributed training and inference, sensitivity to tiling/grid parameters, and the complexity of balancing ecological priors in highly imbalanced test sets. Further improvements may focus on adaptive grid tiling, unsupervised domain adaptation, and hybrid models leveraging taxonomic metadata for hierarchical predictions.

Plantnet-Fine-Tuned DinoV2 methodologies, as advanced by DS@GT and collaborators, exemplify scalable and effective application of self-supervised transformers to complex, real-world ecological and agricultural vision tasks, with open reproducibility fostering ongoing innovation in automated plant identification and ecosystem monitoring (Gustineli et al., 8 Jul 2024, Gustineli et al., 8 Jul 2025).
