
SRE-Llama Platform: Automated SRE Framework

Updated 18 November 2025
  • SRE-Llama is a platform that automates SRE workflows by integrating time-series monitoring, federated learning, a fine-tuned LLM, and NFT-based governance.
  • It features a six-layer architecture combining blockchain contracts, microservices, metrics storage, federated learning, LLM synthesis, and NFT minting for immutable traceability.
  • The platform employs privacy-preserving federated learning and quantized LLMs to dynamically generate and enforce SLI/SLO policies with minimal operational overhead.

SRE-Llama is a platform for Site Reliability Engineering (SRE) that integrates time-series monitoring, federated learning, a fine-tuned LLM, and NFT/blockchain-based governance. Its primary aim is to automate and systematize the SRE workflow—specifically the identification, generation, and management of Service Level Indicators (SLIs), Service Level Objectives (SLOs), error budgets, and alerting policies—in cloud-native communication and networking software services. Leveraging federated learning for privacy-preserving metric selection, Llama-3 LLM for policy synthesis, and NFTs for immutable record-keeping, SRE-Llama targets a reduction in expertise barriers associated with designing, deploying, and auditing SRE policy frameworks in modern distributed software environments (Bandara et al., 11 Nov 2025).

1. Architectural Foundations

SRE-Llama features a six-layer architecture, each responsible for a core system function:

  • Blockchain & Smart Contracts Layer: Identity management, service registry, federated learning orchestration, LLM coordination, and NFT minting are managed by five Solidity-style contracts. All persistent records (user IDs, container metadata, model hashes, SLI/SLO definitions) are stored on-chain to ensure immutability.
  • Software Service Layer: Target microservices (e.g., Open5GS 5G-Core) are managed in Docker/Kubernetes clusters, with deployment metadata registered via the blockchain Service Registry.
  • Metrics Storage Layer: Prometheus and Mimir collect and store time-series performance data (e.g., latency, error rates), exposing PromQL query interfaces for downstream machine learning and operational dashboards.
  • Federated Learning Layer: Employs a coordinator-less, blockchain-enabled framework (Bassa-ML) to train an LSTM RNN on local metric shards across multiple peers, orchestrated via the FL smart contract without central control or raw data sharing.
  • LLM Layer: A quantized, QLoRA-fine-tuned Llama-3-8B model, served locally on CPU via Ollama, generates SLI/SLO definitions and associated PromQL alert rules, interacting with the smart contract infrastructure.
  • NFT Layer: Each generated SLI/SLO and its associated metadata are encoded as an NFT (per the s-528 schema) on the blockchain, with ERC-721 compliance for query and transfer operations.

This multi-layered approach enables end-to-end automation—data ingestion, policy derivation, and verifiable storage—while embedding cryptographic integrity and auditability by design (Bandara et al., 11 Nov 2025).
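The layered flow described above can be sketched end to end. Every function name below is an illustrative stand-in for the corresponding layer, not an actual platform API:

```python
# Hypothetical sketch of the six-layer pipeline; all names are illustrative.

def register_service(service):
    return {"service": service, "registered": True}   # Service Registry contract

def scrape_metrics(registration):
    return {"api_latency": [0.12, 0.30, 0.09]}        # Prometheus/Mimir time series

def federated_train(metrics):
    return {"sensitivity": {"api_latency": 0.42}}     # Bassa-ML LSTM (coordinator-less)

def rank_metrics(model, delta=0.1):
    # gradient-sensitivity SLI selection against threshold delta
    return [m for m, s in model["sensitivity"].items() if s > delta]

def llm_generate_slo(slis):
    return {"metric": slis[0], "target": 0.99}        # fine-tuned Llama-3 over Ollama

def mint_slo_nft(slo):
    return {"tokenId": 1, "slo": slo}                 # s-528 / ERC-721 NFT record

nft = mint_slo_nft(llm_generate_slo(rank_metrics(federated_train(
    scrape_metrics(register_service("Open5GS"))))))
```

Each stub stands in for one layer's contract or service; the real system wires these steps through the Solidity contracts described above.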

2. Federated Learning Approach

SRE-Llama leverages federated learning (FL) to identify service-relevant SLIs while keeping raw metric data private to each peer:

  • Each of $K$ peers holds local metric data $X_k \in \mathbb{R}^{n_k \times m}$ and quality labels $Y_k$.
  • The FL objective function is:

$$w^* = \arg\min_w \sum_{k=1}^K \frac{n_k}{N} F_k(w), \qquad F_k(w) = \frac{1}{n_k} \sum_{(x, y) \in D_k} \ell(f(x; w), y)$$

where $N = \sum_k n_k$.

  • Each peer performs local SGD:

$$w_k^{t+1} = w_k^t - \eta \nabla F_k(w_k^t)$$

and submits encrypted parameter deltas $\Delta w_k$ to the FL smart contract for global aggregation.

  • Metric importance is evaluated, after model convergence, via gradient sensitivity:

$$S_i = \frac{1}{T} \sum_{t=1}^{T} \left| \frac{\partial \hat{y}_t}{\partial x_{i,t}} \right|$$

SLIs are selected as those with a score $S_i > \delta$ for a preset threshold $\delta$.

This architecture eliminates centralized FL orchestrators, reducing single points of failure and enhancing privacy guarantees.
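The aggregation and selection steps above can be sketched in a few lines; the helper names are illustrative (the platform performs aggregation inside a smart contract, not in Python):

```python
# Minimal sketch of weighted FedAvg aggregation of per-peer deltas and
# gradient-sensitivity SLI selection, matching the equations above.

def fedavg(deltas, sizes):
    """Aggregate parameter deltas weighted by n_k / N, as in the FL objective."""
    total = sum(sizes)
    agg = [0.0] * len(deltas[0])
    for delta, n_k in zip(deltas, sizes):
        for i, d in enumerate(delta):
            agg[i] += (n_k / total) * d
    return agg

def select_slis(sensitivities, delta=0.1):
    """Keep metric indices i whose mean |d y_hat / d x_i| exceeds delta."""
    return [i for i, s in enumerate(sensitivities) if s > delta]

# Three peers with shard sizes 100/200/100 and two model parameters each
agg = fedavg([[0.2, -0.1], [0.4, 0.0], [0.1, 0.3]], sizes=[100, 200, 100])
slis = select_slis([0.05, 0.42, 0.31])   # metrics 1 and 2 pass the threshold
```

The $n_k/N$ weighting ensures peers with larger metric shards contribute proportionally more to the global model, exactly as in the objective function.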

3. LLM Fine-Tuning and Deployment

SRE-Llama fine-tunes Meta’s Llama-3-8B model using Quantized Low-Rank Adaptation (QLoRA):

  • Data: Training uses prompt/response pairs mapping SLI metrics and context to target PromQL SLO queries, including error budget and alert rule examples.
  • Quantization: 4-bit quantization of pre-trained Llama-3-8B weights with LoRA rank $r=16$ and $\alpha=32$, applied to the query and value projection layers.
  • Hyperparameters: Training at $\eta = 2 \times 10^{-4}$, batch size 32, for 3–5 epochs with early stopping on validation loss.
  • Objective: Cross-entropy loss on output tokens governs fine-tuning:

$$\mathcal{L}_{CE} = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, x; W)$$

  • Inference: Model is served locally (Ollama, CPU), and invoked by a smart contract for SLI/SLO generation or PromQL query synthesis, e.g., generateSLO(service='Vault', metric='api_latency', target=0.99).
  • Performance: Mean response time per prompt is approximately 1.2 seconds (Bandara et al., 11 Nov 2025).

This methodology enables policy automation without the operational complexity of deploying large-scale GPU clusters, making LLM-driven SRE accessible on standard hardware.
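A plausible fine-tuning configuration matching the hyperparameters above, sketched with the Hugging Face transformers/peft APIs. The choice of libraries is an assumption (the paper does not name its tooling), and this is a config fragment, not a runnable training script:

```python
# Config sketch only; assumes Hugging Face transformers + peft + bitsandbytes,
# which the paper does not specify. Values mirror the hyperparameters above.
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit quantization of base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # query/value projection layers
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="sre-llama-qlora",         # hypothetical path
    learning_rate=2e-4,
    per_device_train_batch_size=32,
    num_train_epochs=5,                   # 3-5 epochs, early stopping on val loss
)
```

Because only the low-rank adapters are trained over a 4-bit base model, this recipe fits on commodity hardware, consistent with the paper's CPU-only serving story.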

4. NFT and Blockchain Mechanisms

All SLI/SLO objects generated by SRE-Llama are minted as NFTs to guarantee immutability and facilitate transparent governance:

  • Schema s-528: Each token encodes:
    • serviceName
    • metricKey
    • sloTarget
    • timeWindow
    • errorBudget
    • owner (deployer address)
  • Smart Contract Functions:
    • Minting is restricted to the LLM contract, enforcing workflow provenance.
    • Transfer operations and on-chain metadata storage follow the ERC-721 standard; query via tokenURI(tokenId) returns the full SLI/SLO definition.
  • Governance: Only the LLM contract can mint; only the token owner can transfer.
  • Auditability: All operational policy records are permanently available, supporting verification and compliance checks with zero runtime inference overhead (Bandara et al., 11 Nov 2025).

This architecture supports reproducible, cryptographically verifiable SRE practices across organizations and services.
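The minting and transfer rules can be mimicked in a few lines. This is a Python stand-in for illustration, not the actual Solidity contract; all addresses and field values are hypothetical:

```python
# Illustrative registry enforcing the s-528 governance rules: only the LLM
# contract may mint, only the token owner may transfer.

LLM_CONTRACT = "0xLLM"  # hypothetical address of the LLM contract

class SLOTokenRegistry:
    def __init__(self):
        self.tokens = {}   # tokenId -> s-528 metadata
        self.owners = {}   # tokenId -> owner address

    def mint(self, caller, token_id, metadata):
        if caller != LLM_CONTRACT:
            raise PermissionError("only the LLM contract can mint")
        self.tokens[token_id] = metadata
        self.owners[token_id] = metadata["owner"]

    def transfer(self, caller, token_id, new_owner):
        if caller != self.owners[token_id]:
            raise PermissionError("only the token owner can transfer")
        self.owners[token_id] = new_owner

    def token_uri(self, token_id):
        return self.tokens[token_id]   # full SLI/SLO definition, as in ERC-721

reg = SLOTokenRegistry()
reg.mint(LLM_CONTRACT, 1, {
    "serviceName": "Vault", "metricKey": "api_latency",
    "sloTarget": 0.99, "timeWindow": "30d",
    "errorBudget": "7.2h", "owner": "0xDEPLOYER",
})
reg.transfer("0xDEPLOYER", 1, "0xNEWOWNER")
```

The two permission checks correspond to the provenance and governance rules above; in the real system they are enforced on-chain by the ERC-721 contract.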

5. Error-Budget Enforcement

Error budgets represent permissible deviations from SLO targets within a monitoring window:

  • For SLO target $\alpha$ over time window $T$:

$$\text{ErrorBudget}(\alpha, T) = (1 - \alpha) \cdot T$$

  • Example: For $\alpha = 0.999$ and $T = 30$ days, $\text{ErrorBudget} = 0.72$ h (43.2 minutes).
  • The accumulated error-budget usage up to time $t$ is:

$$\beta(t) = \int_{0}^{t} \mathbf{1}_{Q(u) < \alpha}\, du$$

where $Q(u)$ denotes the measured service quality at time $u$.

  • A violation occurs when $\beta(t) \geq \text{ErrorBudget}(\alpha, T)$.

This formalization underpins automated policy enforcement and alerting across the SRE-Llama platform (Bandara et al., 11 Nov 2025).
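The budget formula and violation check above translate directly into code (helper names are hypothetical):

```python
# Direct transcription of the error-budget formulas above.

def error_budget_hours(alpha, window_days):
    """ErrorBudget(alpha, T) = (1 - alpha) * T, expressed in hours."""
    return (1 - alpha) * window_days * 24

def violated(bad_minutes, alpha, window_days):
    """True when accumulated out-of-SLO time beta(t) has met the budget."""
    return bad_minutes >= error_budget_hours(alpha, window_days) * 60

budget = error_budget_hours(0.999, 30)   # 0.72 h, i.e. 43.2 minutes
```

With a 99.9% target over 30 days, roughly 43 minutes of non-compliant operation exhausts the budget and triggers enforcement.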

6. Prototype Deployment and Evaluation

The SRE-Llama prototype was validated on a use case involving a customized Open5GS 5G Core deployment with Ericsson RAN in a realistic testbed:

  • Testbed: Open5GS microservices on Kubernetes, monitored by Prometheus/Mimir; alerting via Alertmanager, dashboards via Grafana, and CI/CD via GitHub Actions.
  • Federated Learning: Bassa-ML with TensorFlow LSTM, deployed across 7 peers.
  • LLM Serving: Llama-3-8B quantized and fine-tuned, running locally.
  • Performance Metrics:
    • FL training converged within 1,000 iterations; loss < 0.03 and accuracy > 95% after ~600 rounds.
    • Block creation latency scaled with peer count: 50 ms (2 peers) to ~120 ms (7 peers).
    • LLM SLO synthesis mean time: 1.2 s per prompt.
  • Observations:
    • Blockchain-based FL eliminated single points of failure; operational latency < 200 ms.
    • LLM-generated PromQL SLO and alert rules were valid in more than 90% of evaluated cases, with edge cases addressed by prompt refinement.
    • NFT encoding introduced no measurable runtime overhead and supported immediate auditability (Bandara et al., 11 Nov 2025).

This suggests robust end-to-end automation of SRE processes is feasible with the SRE-Llama model, lowering technical barriers for cloud-native service developers while improving traceability and trust in operational policies.
