
Transparency in Agentic AI: A Survey of Interpretability, Explainability, and Governance

Transparency Gap

The transparency gap in agentic AI showing market growth asymmetry, research output gaps, enterprise adoption challenges, and sectoral requirements.

🌟 Overview


Welcome to our Transparency in Agentic AI survey paper repository. This comprehensive collection documents interpretability and explainability methods for LLM-based agentic systems, covering the full agent lifecycle from design through deployment.

Authors: Shaina Raza¹, Ahmed Y. Radwan¹, Sindhuja Chaduvula¹, Mahshid Alinoori¹, Christos Emmanouilidis²

¹Vector Institute for Artificial Intelligence · ²University of Groningen


What is Agentic AI Transparency?

Transparency in Agentic AI encompasses the principles, practices, and frameworks governing the interpretability and explainability of LLM-based agents that plan across multiple steps, use external tools, maintain memory, and coordinate with other agents. Unlike traditional XAI methods designed for static models, agentic transparency addresses the unique challenges of multi-step reasoning, tool interaction, stateful memory, and multi-agent coordination.


📊 The Transparency Gap (2022-2025)

Evolution Timeline

Evolution of methods showing the "transparency gap" period (2022–present) where Agentic AI development has accelerated while explainability methods remain focused on static models.

Key Trends:

  • Market Asymmetry: The global Agentic AI market is projected to reach USD 150–200 billion by 2033–2034, growing 6× faster than the market for XAI tools
  • Research Gap: XAI/interpretability surveys focus on static models while agentic AI surveys treat transparency as secondary
  • Enterprise Challenge: 90% of enterprises plan agentic AI deployment within three years, yet only 2% have deployed it at scale
  • Regulatory Pressure: Sectors driving adoption (banking, healthcare, government) face strictest transparency requirements (EU AI Act, NIST AI RMF, ISO/IEC 42001)

This repository provides comprehensive coverage across:

  • Interpretability Methods - Design-time and process-time transparency for perception, reasoning, tool use, memory, and multi-agent systems
  • Explainability Methods - Process-time and outcome-time explanations including chain-of-thought, faithfulness, and counterfactuals
  • Evaluation & Benchmarks - Assessment frameworks for multi-turn tasks, tool use, planning, memory, multi-agent systems, and safety
  • Governance & Compliance - Regulatory frameworks (EU AI Act, NIST, ISO/IEC 42001) and governance technologies

🎯 Survey Positioning

Literature Positioning

This survey occupies the intersection of XAI/interpretability and agentic AI research, addressing transparency challenges specific to agentic systems.


🏗️ Five-Axis Taxonomy

Five-Axis Taxonomy

The five-axis taxonomy organizing transparency across WHAT (cognitive objects), WHY (assurance objectives), HOW (mechanisms), WHEN (temporal stages), and WHO (stakeholders).

Our framework organizes transparency along five complementary dimensions: Cognitive Objects (WHAT), Assurance Objectives (WHY), Mechanisms (HOW), Temporal Stages (WHEN), and Stakeholders (WHO).
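The taxonomy lends itself to a simple annotation schema. The sketch below is illustrative only: the enum members paraphrase the five axes named above, while the `TaxonomyAnnotation` class and the example entry are hypothetical constructs, not an interface defined by the survey.

```python
# Illustrative sketch: tagging a transparency method along the five axes.
# Enum members paraphrase the survey's axes; the class and example are hypothetical.
from dataclasses import dataclass
from enum import Enum


class CognitiveObject(Enum):      # WHAT is being made transparent
    PERCEPTION = "perception"
    REASONING = "reasoning"
    TOOL_USE = "tool_use"
    MEMORY = "memory"
    MULTI_AGENT = "multi_agent"


class AssuranceObjective(Enum):   # WHY transparency is needed
    DEBUGGING = "debugging"
    SAFETY = "safety"
    COMPLIANCE = "compliance"
    USER_TRUST = "user_trust"


class Mechanism(Enum):            # HOW transparency is produced
    MECHANISTIC = "mechanistic_interpretability"
    PROBING = "probing"
    ATTENTION = "attention_analysis"
    SELF_EXPLANATION = "chain_of_thought"
    PROVENANCE = "provenance_logging"


class TemporalStage(Enum):        # WHEN in the lifecycle it applies
    DESIGN_TIME = "design_time"
    PROCESS_TIME = "process_time"
    OUTCOME_TIME = "outcome_time"


class Stakeholder(Enum):          # WHO consumes the transparency output
    DEVELOPER = "developer"
    AUDITOR = "auditor"
    END_USER = "end_user"
    REGULATOR = "regulator"


@dataclass
class TaxonomyAnnotation:
    method: str
    what: CognitiveObject
    why: AssuranceObjective
    how: Mechanism
    when: TemporalStage
    who: Stakeholder


# Example: a hypothetical tool-call attribution method tagged on all five axes.
example = TaxonomyAnnotation(
    method="tool_call_attribution_demo",
    what=CognitiveObject.TOOL_USE,
    why=AssuranceObjective.DEBUGGING,
    how=Mechanism.PROVENANCE,
    when=TemporalStage.OUTCOME_TIME,
    who=Stakeholder.DEVELOPER,
)
print(example)
```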


📖 Papers by Category

1. Interpretability Methods

1.1 Perception & Vision-Language Models

Recent Advances (2024-2025)

• [2025] Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs Zhining Liu et al. [paper]

• [2025] Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation P.Y. Lee et al. [paper]

• [2024] A Concept-Based Explainability Framework for Large Multimodal Models Mohammad Shukor et al. [paper]

• [2024] Inference Optimal VLMs Need Fewer Visual Tokens Kevin Y Li et al. [paper]

Foundational Works (2013-2020)

• [2020] Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models Jize Cao et al. [paper]

• [2016] Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Ramprasaath R Selvaraju et al. [paper]

• [2014] Visualizing and Understanding Convolutional Networks Matthew D Zeiler et al. [paper]

• [2013] Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps Karen Simonyan et al. [paper]

1.2 Mechanistic Interpretability

Recent Advances (2023-2025)

• [2025] The Mechanistic Emergence of Symbol Grounding in Language Models Shuyu Wu et al. [paper]

• [2025] Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation Qiming Li et al. [paper]

• [2024] Mechanistic Interpretability for AI Safety: A Review Leonard Bereska et al. [paper]

• [2024] Mechanistic Interpretability with Activation Patching Zhen Wang et al. [paper]

• [2023] Sparse Autoencoders Find Highly Interpretable Features in Language Models Hoagy Cunningham et al. [paper]

• [2023] Towards Automated Circuit Discovery for Mechanistic Interpretability Arthur Conmy et al. [paper]

Early Works (2020-2022)

• [2022] Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang et al. [paper]

• [2020] Zoom In: An Introduction to Circuits Chris Olah et al. [paper]

1.3 Probing & Representation Analysis

Recent Advances (2023-2025)

• [2025] Linear Personality Probing and Steering in LLMs: A Big Five Study Michel Frising, Daniel Balcells (submitted) [paper]

• [2025] CoT Vectors: Transferring and Probing the Reasoning Mechanisms of LLMs Li Li et al. (submitted) [paper]

• [2024] Probing Language Models on Their Knowledge Source Tighidet et al. [paper]

• [2024] Probing Conceptual Understanding of Large Visual-Language Models Schiappa et al. [paper]

• [2023] The Linear Representation Hypothesis and the Geometry of Large Language Models Kiho Park et al. [paper]

Early Works (2016-2019)

• [2019] A Structural Probe for Finding Syntax in Word Representations John Hewitt et al. [paper]

• [2016] Understanding Intermediate Layers using Linear Classifier Probes Guillaume Alain et al. [paper]

1.4 Attention Analysis

Recent Advances (2023-2025)

• [2025] Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers Andrew J. Nam et al. [paper]

• [2025] Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models Jinyeong Kim et al. [paper]

• [2024] Faithful Attention Explainer: Verbalizing Decisions Based on Discriminative Features Yao Rong et al. [paper]

• [2023] Causal Interpretation of Self-Attention in Pre-Trained Transformers Raanan Y. Rohekar et al. [paper]

Early Works (2019-2021)

• [2021] Transformer Interpretability Beyond Attention Visualization Hila Chefer et al. [paper]

• [2020] Quantifying Attention Flow in Transformers Samira Abnar et al. [paper]

• [2019] What Does BERT Look At? An Analysis of BERT's Attention Kevin Clark et al. [paper]

1.5 Tool Use Interpretability

Latest Methods (2025)

• [2025] MCP-Bench: Benchmarking Tool-Using LLM Agents at Scale Zhenting Wang et al. [paper]

• [2025] AgentSHAP: Explaining Black-Box LLM Agent Tool Use with Shapley Values Miriam Horovicz [paper]

Earlier Works (2023-2024)

• [2024] GTA: A Benchmark for General Tool Agents Jize Wang et al. [paper]

• [2023] Toolformer: Language Models Can Teach Themselves to Use Tools Timo Schick et al. [paper]

1.6 Memory & Retrieval

Recent Advances (2024-2025)

• [2025] MemBench: Memorization Capability Benchmark for Large Language Models Haoran Tan et al. [paper]

• [2025] KGRAG-Ex: Knowledge Graph-Enhanced Explainable Retrieval Augmented Generation Georgios Balanos et al. [paper]

• [2024] Lost in the Middle: How Language Models Use Long Contexts Nelson F Liu et al. [paper]

Foundational Work (2023)

• [2023] Retrieval-Augmented Generation for Large Language Models: A Survey Yunfan Gao et al. [paper]

• [2023] Generative Agents: Interactive Simulacra of Human Behavior Joon Sung Park et al. [paper]

1.7 Multi-Agent Interpretability

Latest Research (2024-2025)

• [2025] G-Safeguard: Topology-Guided Security Treatment for Multi-Agent LLM Systems Shilong Wang et al. [paper]

• [2024] NetSafe: Ensuring Topological Safety in Multi-Agent Networks Miao Yu et al. [paper]

Early Work (2023)

• [2023] CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society Guohao Li et al. [paper]

1.8 Causal Interventions & Provenance

Latest Developments (2025)

• [2025] Traceability in Multi-Agent LLM Pipelines Amine Barrak [paper]

• [2025] Because we have LLMs, we Can Pursue Agentic Interpretability Been Kim et al. [paper]

Earlier Methods (2013-2022)

• [2022] Locating and Editing Factual Associations in GPT Kevin Meng et al. [paper]

• [2013] PROV-Overview: An Overview of the PROV Family of Documents W3C Provenance Working Group [spec]


2. Explainability Methods

2.1 Chain-of-Thought & Reasoning

Recent Advances (2024-2025)

• [2025] Layered Chain-of-Thought Reasoning for Multi-Agent Systems Hao Wang et al. [paper]

• [2024] Graph of Thoughts: Solving Elaborate Problems with Large Language Models Maciej Besta et al. [paper]

• [2024] Self-Reflection in LLM Agents: Effects on Problem-Solving Performance Matthew Renze et al. [paper]

Foundational Methods (2022-2023)

• [2023] Tree of Thoughts: Deliberate Problem Solving with Large Language Models Shunyu Yao et al. [paper]

• [2023] Reflexion: Language Agents with Verbal Reinforcement Learning Noah Shinn et al. [paper]

• [2023] ReAct: Synergizing Reasoning and Acting in Language Models Shunyu Yao et al. [paper]

• [2022] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Jason Wei et al. [paper]

2.2 Faithfulness & Truthfulness

• [2023] Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting Miles Turpin et al. [paper]

• [2023] Faithful Chain-of-Thought Reasoning Qing Lyu et al. [paper]

• [2019] A Multiscale Visualization of Attention in the Transformer Model Jesse Vig [paper]

2.3 Counterfactual Explanations

• [2023] Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations Yanda Chen et al. [paper]

• [2017] Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR Sandra Wachter et al. [paper]

2.4 Interactive & User-Centered XAI

• [2024] Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era Xuansheng Wu et al. [paper]

• [2024] x-plAIn: Explainable AI Through Human-Centered Prompting and Interactive Explanations Philip Mavrepis et al. [paper]


3. Evaluation & Benchmarks

3.1 Multi-Turn & Long-Horizon Tasks

• [2024] TheAgentCompany: Benchmarking LLM Agents on Consequential Real-World Tasks Frank F Xu et al. [paper]

• [2024] AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? Ori Yoran et al. [paper]

• [2024] τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains Shunyu Yao et al. [paper]

3.2 Tool Use Evaluation

• [2025] The Tool Decathlon: A Diverse Benchmark for Tool Understanding and Use Junlong Li et al. [paper]

• [2024] GTA: A Benchmark for General Tool Agents Jize Wang et al. [paper]

3.3 Planning & Reasoning

• [2025] Can LLMs Truly Plan? Assessing the Planning Capabilities of LLMs Gayeon Jung et al. [paper]

• [2024] Agent-GPA: Generalized Process Automation via LLM Agent Planning Allison Sihan Jia et al. [paper]

3.4 Memory & Context

• [2025] MemBench: Memorization Capability Benchmark for Large Language Models Haoran Tan et al. [paper]

• [2024] Long-Term Memory in LLM Agents: A Survey and Empirical Evaluation Adyasha Maharana et al. [paper]

3.5 Multi-Agent Systems

• [2025] MultiAgentBench: Benchmarking Multi-Agent Collaboration and Competition Kunlun Zhu et al. [paper]

• [2024] MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration Lin Xu et al. [paper]

3.6 Safety & Risk Assessment

• [2024] AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents Maksym Andriushchenko et al. [paper]

• [2024] Agent-SafetyBench: Evaluating the Safety of LLM Agents Zhexin Zhang et al. [paper]

• [2024] R-Judge: Benchmarking Safety Risk Awareness for LLM Agents Tongxin Yuan et al. [paper]

3.7 Transparency Quality Metrics

• [2020] Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods Dylan Slack et al. [paper]

• [2018] Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) Been Kim et al. [paper]

• [2017] Towards A Rigorous Science of Interpretable Machine Learning Finale Doshi-Velez et al. [paper]

• [2016] "Why Should I Trust You?" Explaining the Predictions of Any Classifier Marco Tulio Ribeiro et al. [paper]


4. Governance & Compliance

4.1 Regulatory Frameworks

EU AI Act – Risk-based regulation for AI systems with transparency requirements

NIST AI Risk Management Framework – Voluntary framework for managing AI risks

ISO/IEC 42001:2023 – AI management system standard

4.2 Governance Technologies

• [2025] TRiSM for Agentic AI: Trust, Risk and Security Management Framework Shaina Raza et al. [paper]

• [2021] Datasheets for Datasets Timnit Gebru et al. [paper]

• [2019] Model Cards for Model Reporting Margaret Mitchell et al. [paper]


5. Related Surveys

5.1 XAI & Interpretability Surveys

Latest Comprehensive Surveys (2024-2025)

• [2025] Large Language Models for Explainable Recommendation: A Comprehensive Survey Ahsan Bilal et al. [paper]

• [2025] Mechanistic Interpretability for Multimodal Large Language Models: A Survey Zihao Lin et al. [paper]

• [2024] From Understanding to Utilization: A Survey on Explainability for Large Language Models Haoyan Luo et al. [paper]

• [2024] Explainability for Large Language Models: A Survey Haiyan Zhao et al. [paper]

• [2024] Mechanistic Interpretability for AI Safety: A Review Leonard Bereska et al. [paper]

Foundational Surveys (2018-2019)

• [2019] Definitions, Methods, and Applications in Interpretable Machine Learning W James Murdoch et al. [paper]

• [2018] Explaining Explanations: An Overview of Interpretability of Machine Learning Leilani H Gilpin et al. [paper]

• [2018] The Mythos of Model Interpretability Zachary C Lipton [paper]

5.2 Agentic AI Surveys

Recent Surveys (2024-2025)

• [2025] Agentic Large Language Models: A Survey on the Architectural Designs and Applications Aske Plaat et al. [paper]

• [2025] Large Language Model Agents: Methodology, Applications and Challenges Junyu Luo et al. [paper]

• [2025] A Systematic Literature Review on Large Language Model-Based Agents for Tool Learning Weikai Xu et al. [paper]

• [2024] A Survey on Large Language Model Based Autonomous Agents Lei Wang et al. [paper]


🔬 Research Gaps & Future Directions

Interpretability Coverage

Interpretability method coverage showing gaps in tool use, memory, and multi-agent settings.

Our systematic analysis reveals critical gaps where current transparency methods fall short of agentic requirements:

Multi-Step Faithfulness – Current methods explain individual steps but struggle across long trajectories where plan-trace drift accumulates

Tool Use Attribution – Limited techniques for attributing outcomes to specific tool calls in complex workflows

Memory & Belief Tracking – Sparse coverage of how beliefs evolve through retrieval and memory updates

Multi-Agent Coordination – Near-total gap in understanding distributed decision-making and emergent behaviors

Uncertainty Communication – Agents rarely communicate appropriate uncertainty when evidence conflicts

Temporal Explanations – Explaining how plans changed mid-execution requires new structures

Explainability Coverage

Explainability coverage by context showing major gaps in uncertainty communication and multi-agent attribution.


🔑 Key Artifacts

Minimal Explanation Packet (MEP)

MEP Framework

MEP operationalized showing lifecycle flow from design-time specs to outcome with integrity gates.

A standardized, cryptographically signed record containing a plan summary, tool traces, evidence references, and policy activation logs, with a signature that lets auditors verify the packet's integrity.
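As a rough illustration of how such a packet could be represented and integrity-checked, the sketch below uses hypothetical field names and an HMAC-SHA256 signature; the survey does not prescribe this schema or signing scheme.

```python
# Hypothetical MEP sketch: field names and the HMAC-SHA256 scheme are assumptions.
import hashlib
import hmac
import json
from dataclasses import dataclass, asdict
from typing import Any


@dataclass
class MinimalExplanationPacket:
    plan_summary: str                      # high-level plan the agent committed to
    tool_traces: list[dict[str, Any]]      # per-call records: tool, args, result digest
    evidence_refs: list[str]               # URIs / IDs of retrieved evidence
    policy_activations: list[str]          # which policies or guardrails fired
    signature: str = ""                    # filled in by sign()

    def canonical_bytes(self) -> bytes:
        """Serialize every field except the signature in a deterministic order."""
        payload = asdict(self)
        payload.pop("signature")
        return json.dumps(payload, sort_keys=True).encode()

    def sign(self, key: bytes) -> None:
        self.signature = hmac.new(key, self.canonical_bytes(), hashlib.sha256).hexdigest()

    def verify(self, key: bytes) -> bool:
        expected = hmac.new(key, self.canonical_bytes(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, self.signature)


# Usage: an auditor recomputes the signature to check the packet was not altered.
key = b"audit-shared-secret"               # placeholder; real key management is out of scope
mep = MinimalExplanationPacket(
    plan_summary="Look up the refund policy, then draft a customer response.",
    tool_traces=[{"tool": "search", "args": {"q": "refund policy"},
                  "result_sha256": hashlib.sha256(b"...").hexdigest()}],
    evidence_refs=["doc://policies/refunds#v3"],
    policy_activations=["pii_redaction"],
)
mep.sign(key)
assert mep.verify(key)
```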

Tool-Using Agent Example

Tool-Using Agent

Tool-using agent execution flow with transparency substrate showing provenance for replay and verification.
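One way such a substrate could work in practice is a hash-chained provenance log that records every tool call and supports later replay and verification. The sketch below is a hypothetical illustration; class and method names such as `ProvenanceLog.record` and `replay` are assumptions, not an interface from the paper.

```python
# Hypothetical provenance substrate for a tool-using agent: each call is appended
# to a hash-chained log that a reviewer can verify and replay.
import hashlib
import json
import time
from typing import Any, Callable


class ProvenanceLog:
    def __init__(self) -> None:
        self.entries: list[dict[str, Any]] = []

    def record(self, tool: str, args: dict[str, Any], result: Any) -> None:
        """Append a tamper-evident record of one tool call."""
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else "genesis"
        entry = {
            "tool": tool,
            "args": args,
            "result_sha256": hashlib.sha256(repr(result).encode()).hexdigest(),
            "timestamp": time.time(),
            "prev_hash": prev_hash,
        }
        entry["entry_hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self.entries.append(entry)

    def verify_chain(self) -> bool:
        """Check that no entry was inserted, dropped, or modified after the fact."""
        prev = "genesis"
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "entry_hash"}
            recomputed = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev_hash"] != prev or e["entry_hash"] != recomputed:
                return False
            prev = e["entry_hash"]
        return True

    def replay(self, tools: dict[str, Callable[..., Any]]) -> bool:
        """Re-run each logged call against deterministic tools and compare result digests."""
        for e in self.entries:
            result = tools[e["tool"]](**e["args"])
            if hashlib.sha256(repr(result).encode()).hexdigest() != e["result_sha256"]:
                return False
        return True


# Usage with a deterministic toy tool.
def calculator(expression: str) -> float:
    return eval(expression)                 # toy only; never eval untrusted input

log = ProvenanceLog()
log.record("calculator", {"expression": "2 + 2"}, calculator("2 + 2"))
assert log.verify_chain() and log.replay({"calculator": calculator})
```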

Evaluation Landscape

Evaluation Landscape

Agent performance evaluation landscape showing nine core evaluation areas.


🤝 Contributing

We welcome contributions! This is a living repository that will be updated as the field evolves.

How to contribute:

• 📝 Add papers: Submit via pull request using the format: • [Year] **Title** *Authors* [[paper](link)]

• 💡 Report issues: Open issues for missing categories, broken links, or errors

• 🔬 Share implementations: Let us know if you're implementing or extending the framework

• 📊 Suggest improvements: Propose better organization or additional categories


📜 Citation

If you find this survey useful in your research, please cite:


📧 Contact

Shaina Raza · [email protected]
Ahmed Y. Radwan · [email protected]
Sindhuja Chaduvula · [email protected]
Mahshid Alinoori · [email protected]
Christos Emmanouilidis · [email protected]


🙏 Acknowledgments

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. This research was funded by the European Union’s Horizon Europe research and innovation programme under the AIXPERT project (Grant Agreement No. 101214389), which aims to develop an agentic, multi-layered, GenAI-powered framework for creating explainable, accountable, and transparent AI systems.


📋 License

MIT License - see LICENSE for details


Last Updated: December 2025
