SGuard-v1: Automated Safety & Moderation

Updated 23 November 2025
  • SGuard-v1 is a family of automated safety and moderation systems using machine learning and symbolic analysis to address vulnerabilities in LLMs, smart contracts, and VR speech.
  • Its architecture integrates domain-tuned modules (transformer classifiers for LLM safety, symbolic execution for smart contracts, and an audio CNN paired with GPT-3.5 for VR moderation) to maintain high reliability and low overhead.
  • Empirical evaluations demonstrate robust performance with high precision, rapid inference, and formal verification methods, making it effective for real-world safety-critical applications.

SGuard-v1 denotes a family of automated safety and moderation systems that leverage machine learning and symbolic analysis across application domains such as LLMs, blockchain-based smart contracts, and real-time speech moderation in social virtual reality. The common theme is an architectural emphasis on modular risk detection, efficient inference, and principled coverage of high-impact vulnerabilities or unsafe behaviors. SGuard-v1 is released under the Apache-2.0 License for broad research and production use in AI and systems safety contexts (Lee et al., 16 Nov 2025, Nguyen et al., 2021, Xu et al., 23 Sep 2024).

1. Core Architectures and Functional Modules

SGuard-v1 spans distinct domains via application-specific module design:

  • In LLM safety (Lee et al., 16 Nov 2025), SGuard-v1 couples two transformer-based classifiers: ContentFilter (multi-class hazard detector) and JailbreakFilter (adversarial prompt resilience), each built on the Granite-3.3-2B-Instruct architecture.
  • For smart contracts (Nguyen et al., 2021), SGuard-v1 combines bounded symbolic execution, taint and control-dependency analysis, and automated Solidity AST rewriting to eliminate code-level vulnerabilities.
  • In VR-based speech moderation (Xu et al., 23 Sep 2024), SGuard-v1 fuses OpenAI GPT-3.5 text-chain analyses with a parallel audio CNN to detect hate speech from real-time voice input.

All configurations interleave domain-tuned detection components and follow minimal-overhead principles for deployment, with inference-oriented decision rules such as thresholded softmax outputs or intersection-based (AND) alerting, as sketched below.
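As a rough illustration of this modular layout, the following sketch places three hypothetical domain detectors behind one interface with a shared thresholded decision rule; class names, return types, and the threshold are assumptions for exposition, not the released API.

```python
from typing import Protocol, Dict

class SafetyModule(Protocol):
    """Hypothetical common interface for the domain-tuned detectors."""
    def detect(self, item: object) -> Dict[str, float]: ...

class ContentFilter:            # LLM hazard classifier (transformer-based)
    def detect(self, prompt: str) -> Dict[str, float]:
        return {"violence": 0.0, "illegal": 0.0}    # placeholder scores

class ContractAnalyzer:         # bounded symbolic execution over EVM bytecode
    def detect(self, bytecode: bytes) -> Dict[str, float]:
        return {"reentrancy": 0.0}                  # placeholder scores

class VoiceModerator:           # GPT text check fused with an audio CNN
    def detect(self, audio_segment: bytes) -> Dict[str, float]:
        return {"hate": 0.0}                        # placeholder scores

REGISTRY: Dict[str, SafetyModule] = {
    "llm": ContentFilter(),
    "contract": ContractAnalyzer(),
    "vr_speech": VoiceModerator(),
}

def screen(domain: str, item: object, threshold: float = 0.5) -> bool:
    """Shared thresholded decision rule: flag when any risk score crosses the bar."""
    return any(score >= threshold for score in REGISTRY[domain].detect(item).values())
```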

2. Training, Data, and Taxonomy Construction

LLM Content Safety and Adversarial Detection

ContentFilter leverages ≈400K bilingual prompt/response pairs derived from WildGuardMix and Aegis corpora, expanded by Contextual Harm Translation (CHT) and Benign-Harmful Contextual Blending (BHCB). JailbreakFilter is trained on ≈1M adversarial and benign prompts (public datasets) plus an additional curriculum-driven 5.4K high-priority instances. Both use carefully constructed multi-phase curriculum objectives and loss balancing (e.g., noise injection and reweighting via PSNI for the jailbreak task). The MLCommons hazard taxonomy is consolidated to five supported ContentFilter categories, and a 70B LLM is employed for label validation and consistency.
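The curriculum and loss-balancing details are dataset-specific; as a loose, hedged illustration of per-sample reweighting with mild label noise (a stand-in, not the paper's PSNI procedure), a binary objective of the following shape could be used:

```python
import torch
import torch.nn.functional as F

# Illustrative only: reweighted binary cross-entropy with light label noise,
# standing in for the curriculum-driven loss balancing described above.
# Weights, noise level, and shapes are assumptions, not the published recipe.

def reweighted_bce(logits, labels, sample_weights, label_noise=0.1):
    targets = labels * (1 - label_noise) + 0.5 * label_noise   # soften hard labels
    per_sample = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (sample_weights * per_sample).mean()                # emphasize priority samples

logits = torch.randn(8)                      # classifier outputs for 8 prompts
labels = torch.randint(0, 2, (8,)).float()   # 1 = adversarial/unsafe
weights = 1.0 + labels                       # e.g. up-weight unsafe prompts 2x
loss = reweighted_bce(logits, labels, weights)
```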

Smart Contract Vulnerability Mining

Contract detection is formulated as a finite-state symbolic analysis over EVM bytecode, with traces generated by loop-bounded symbolic execution. Vulnerabilities—reentrancy (intra/cross function), tx.origin misuse, arithmetic overflow/underflow—are encoded as dependency conditions over symbolic traces. All four vulnerability classes are precisely defined, e.g., intra-function reentrancy is present if a symbolic trace contains a SSTORE and an external call opcode with the relevant dependency.
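A minimal sketch of that trace-level check, assuming a simple trace and dependency representation (the real tool works over its own symbolic-execution data structures):

```python
# Opcode names follow the EVM; the trace/dependency encoding is illustrative.
EXTERNAL_CALLS = {"CALL", "CALLCODE", "DELEGATECALL", "STATICCALL"}

def has_intra_function_reentrancy(trace, depends):
    """trace: list of (index, opcode) pairs from one symbolic trace.
    depends(i, j): True if the ops at positions i and j are linked by the
    taint / control-dependency relation computed by the analyzer."""
    stores = [i for i, op in trace if op == "SSTORE"]
    calls = [i for i, op in trace if op in EXTERNAL_CALLS]
    # Flag the trace when a state write and an external call occur with the
    # required dependency between them.
    return any(depends(s, c) for s in stores for c in calls)
```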

Speech-Based Hate Moderation

Speech moderation components use a small CNN trained on HATEMM (1,083 annotated audio samples; 39.8% hate) and a hand-tuned few-shot prompt for GPT-3.5. The audio features are RMS energy and 40-dimensional MFCCs per frame, aggregated for robust inference. A "strict AND" fusion rule ensures the system triggers only on content flagged as hateful by both subsystems.
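A short sketch of the per-frame feature extraction named above (RMS energy plus 40 MFCCs) using librosa; the sample rate, frame settings, and mean/std aggregation are assumptions:

```python
import numpy as np
import librosa

def clip_features(path, sr=16000, n_mfcc=40):
    """Extract per-frame MFCCs and RMS energy, then aggregate over the clip."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (40, frames)
    rms = librosa.feature.rms(y=y)                           # (1, frames)
    frames = np.vstack([mfcc, rms])                          # (41, frames)
    # Collapse frames into a fixed-length clip descriptor for downstream inference.
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])
```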

3. Formal Models, Loss Functions, and Decision Rules

LLM Safety Filters

ContentFilter minimizes

L_{\mathrm{CF}} = -\sum_{i=1}^{5} y_i \log p(y_i|x)

where p(y_i|x) is the normalized softmax probability for each risk class (violence, illegal, sexual, privacy, manipulation). JailbreakFilter minimizes

L_{\mathrm{JB}} = -[y \log p(\mathrm{unsafe}|x) + (1-y) \log p(\mathrm{safe}|x)]

with a threshold \tau_{\mathrm{jb}} applied at inference. Output is a vector (p_{\mathrm{safe}}, p_{c_1}, \dots, p_{c_5}) enabling per-category threshold calibration and downstream workflow integration.
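A small sketch of that per-category thresholding over the six-element output vector; the thresholds below are placeholders, not calibrated values:

```python
import numpy as np

CATEGORIES = ["violence", "illegal", "sexual", "privacy", "manipulation"]

def classify(probs, thresholds):
    """probs: softmax vector (p_safe, p_c1, ..., p_c5); thresholds: one per category."""
    flagged = [c for c, p, t in zip(CATEGORIES, probs[1:], thresholds) if p >= t]
    return {"safe": not flagged, "categories": flagged}

probs = np.array([0.12, 0.70, 0.05, 0.04, 0.06, 0.03])
print(classify(probs, thresholds=[0.5] * 5))
# -> {'safe': False, 'categories': ['violence']}
```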

Smart Contract Security

The system models contracts as a tuple

\mathcal{S} = (Var, init, N, i, E)

and decomposes traces as

tr = \langle s_0, op_0, s_1, op_1, \ldots, s_n, op_n, s_{n+1} \rangle

where op_i denotes an EVM operation and dependencies are assessed per trace using taint tracking and post-dominator analysis. The patching process is formalized via algorithms that enumerate traces, check dependencies against each vulnerability definition, and rewrite the Solidity AST to insert nonReentrant modifiers, add SafeMath checks, and replace tx.origin with msg.sender.
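As a toy illustration of the last of those rewrites, the snippet below swaps tx.origin for msg.sender on source text; the actual tool performs this (and the nonReentrant/SafeMath insertions) on the Solidity AST rather than with regular expressions:

```python
import re

def patch_tx_origin(source: str) -> str:
    """Replace authentication via tx.origin with msg.sender (toy, text-level)."""
    return re.sub(r"\btx\.origin\b", "msg.sender", source)

print(patch_tx_origin("require(tx.origin == owner);"))
# -> require(msg.sender == owner);
```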

Voice-based Moderation

Audio CNN subsystems minimize a binary cross-entropy loss over frame-derived features,

L(\theta) = -\sum_{i=1}^N [y_i \log p_i + (1 - y_i) \log(1 - p_i)],

with decision-level fusion

y = \mathbf{1}\{G(T)=1\} \land \mathbf{1}\{A(x)=1\},

where G is the GPT-based judgment on the transcript T and A is the audio classifier on the raw signal x.
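That fusion rule is a strict conjunction of the two subsystem verdicts; a minimal sketch, with the two boolean inputs standing in for the real GPT and CNN outputs:

```python
def fused_alert(gpt_flags_hate: bool, audio_flags_hate: bool) -> bool:
    """Strict AND fusion: y = 1{G(T)=1} AND 1{A(x)=1}."""
    return gpt_flags_hate and audio_flags_hate

assert fused_alert(True, True) is True
assert fused_alert(True, False) is False   # one modality alone never alerts
```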

4. Empirical Evaluation and Benchmarking

LLM Safety

  • ContentFilter achieves average F1/AUPRC/pAUROC of 0.83/0.91/0.88 over five English public benchmarks (see the metric sketch after this list), surpassing baselines such as Qwen3Guard-4B and LlamaGuard-12B. In Korean, F1/AUPRC reach 0.90/0.969.
  • JailbreakFilter attains F1 = 0.85, FNR = 0.18, and FPR = 0.17, outperforming AWS Bedrock and Kanana-Prompt.
  • Red-teaming with three attack types shows Attack Success Rates (ASR) of 8.5%, 0.2%, and 7.9% in English, with combined filters outperforming comparable baselines.
  • Memory footprint: ContentFilter-2B ≈ 6.4 GB (NVIDIA H100), total SGuard-v1 ≈ 4B parameters in deployment, sub-100 ms per inference in optimized serving.
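For reference, the reported metrics can be computed from labels and scores along the following lines; treating pAUROC as a partial AUC under a fixed false-positive-rate cap is an assumption about the benchmark's exact definition, and the data below are dummies:

```python
from sklearn.metrics import f1_score, average_precision_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                  # 1 = unsafe
y_score = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]  # classifier probabilities
y_pred  = [int(s >= 0.5) for s in y_score]          # thresholded decisions

print("F1    ", f1_score(y_true, y_pred))
print("AUPRC ", average_precision_score(y_true, y_score))
print("pAUROC", roc_auc_score(y_true, y_score, max_fpr=0.2))  # partial AUROC
```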

Smart Contract Security

  • 5,000 Etherscan contracts used as testbed; after timeouts, 1,605 were either patched or certified safe.
  • 3,548 successful transactions show execution-time overhead of 14.79% and gas overhead of 0.79%. Patch instrumentation is 5× more selective than naïve “protect every op” approaches.
  • 90% of contracts are fixed within 36s; average tool runtime is 15s per contract (Nguyen et al., 2021).

Speech Moderation in VR

  • Combined LLM/audio system achieves 0.96 precision, 0.53 recall, 0.69 F1, and 0.01 false-positive rate on YouTube-based validation (204 clips).
  • Baseline GPT-only achieves 0.95 F1 but higher false positives (0.06); audio CNN alone is less effective (F1=0.62, FPR=0.21).
  • Real-time latency is ≈1.5s/segment, dominated by Whisper transcription and GPT analysis (no dedicated GPU required).

5. Interpretability, Deployment, and Policy Integration

SGuard-v1 offers output granularity and API alignment for safety-centric orchestration:

  • LLM: returns risk category distributions and binary jailbreak flags; downstream modules can set per-category thresholds \tau_c, implement action hierarchies (warn/block/escalate; see the routing sketch after this list), and log or escalate blocked interactions for human-in-the-loop review.
  • Smart contracts: provides source-to-bytecode artifact traceability and guarantees absence of the four targeted bug classes by formal argument (bounded trace coverage and patch completeness).
  • Speech moderation: only triggers user-visible alerts if both modalities agree, yielding low spurious alerts and higher user trust.
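A compact sketch of such an action hierarchy over the filter outputs; the thresholds and escalation policy below are illustrative placeholders, not a shipped configuration:

```python
def route(category_probs: dict, jailbreak_flag: bool,
          warn_at: float = 0.5, block_at: float = 0.8) -> str:
    """Map filter outputs to allow/warn/block/escalate actions."""
    top = max(category_probs.values(), default=0.0)
    if jailbreak_flag and top >= block_at:
        return "escalate"   # both filters fire strongly: human-in-the-loop review
    if jailbreak_flag or top >= block_at:
        return "block"      # log for later audit/appeal
    if top >= warn_at:
        return "warn"
    return "allow"

print(route({"violence": 0.65, "privacy": 0.10}, jailbreak_flag=False))  # -> warn
```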

Deployment recommendations span domain-specific scope limits (e.g., ≤8K tokens for LLMs; Solidity 0.4.x–0.5.x for contracts; limited language/voice domain for VR). Human review appeal flows are recommended for critical safety interventions.

6. Limitations and Prospective Extensions

  • LLM: Covers five risk categories and major jailbreak modalities, but offers no guarantees for novel attack classes; domain/language shift may degrade performance.
  • Smart contracts: Four vulnerability classes only; lacks handling for inline assembly, Vyper contracts, or certain access-control/exception-disorder patterns; symbolic execution bounds can induce path explosion.
  • VR speech: Audio CNN recall is moderate due to training data size; lacks robust front-end noise filtering or fine-grained hate taxonomy; open privacy/consent questions.
  • For all, future work points to richer specification-driven analysis, formal program verification, enhanced dataset construction, and cross-modal fusion for more resilient safety systems.

7. Licensing and Community Engagement

SGuard-v1 is fully released under the Apache-2.0 License, granting commercial and academic users rights to use, modify, and redistribute, subject to retention of license notices and explicit warranty disclaimers. Users are advised to periodically audit false positives/negatives, extend supported domains responsibly, and maintain transparency in deployment logs to support accountability and reproducibility in safety-critical contexts (Lee et al., 16 Nov 2025).
