Securing Reliability: A Brief Overview on Enhancing In-Context Learning for Foundation Models
Abstract: As foundation models (FMs) continue to reshape the AI landscape, the in-context learning (ICL) paradigm thrives but also faces issues such as toxicity, hallucination, disparity, adversarial vulnerability, and inconsistency. Ensuring the reliability and responsibility of FMs is crucial for the sustainable development of the AI ecosystem. In this concise overview, we survey recent advances in improving the reliability and trustworthiness of FMs within ICL frameworks, organized around four key methodologies, each with its corresponding subgoals. We hope this paper offers useful insights to researchers and practitioners working to build safe and dependable FMs and to foster a stable, consistent ICL environment, thereby unlocking their full potential.
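As a minimal sketch of the in-context learning setup the overview studies, the snippet below formats a few labeled demonstrations and an unlabeled query into a single prompt for a frozen foundation model; no weights are updated, and the model's continuation serves as the prediction. The `generate` function is a hypothetical placeholder for whatever FM inference endpoint is actually used, not an API named in the paper.

```python
# Illustrative ICL prompt construction (assumptions: task is sentiment
# classification; `generate` stands in for a real model call).

def generate(prompt: str) -> str:
    """Hypothetical placeholder for a foundation-model completion call."""
    raise NotImplementedError("Plug in your model's inference endpoint here.")

def build_icl_prompt(demonstrations, query):
    """Concatenate (input, label) demonstrations, then the unlabeled query."""
    lines = [f"Review: {x}\nSentiment: {y}" for x, y in demonstrations]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

demos = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
prompt = build_icl_prompt(demos, "A pleasant surprise with strong performances.")
# answer = generate(prompt)  # the model's continuation is taken as the label
```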