Further Reading: A Working Bibliography

The papers, framework docs, books, and reports cited across this site. Annotated, grouped, and maintained as a single reference asset.

Reading guide

New to the literature? Start with our annotated walkthrough of the field’s most-cited survey: A Survey on Large Language Model based Autonomous Agents (Wang et al., arXiv:2308.11432) and its Profile-Memory-Planning-Action framework.

Foundational papers

Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao
ReAct: Synergizing Reasoning and Acting in Language Models
Introduces the Thought-Action-Observation loop, which interleaves reasoning traces with task-specific actions.
Shinn, Cassano, Gopinath, Narasimhan, Yao
Reflexion: Language Agents with Verbal Reinforcement Learning
Verbal reinforcement: an explicit memory of past failures across episodes.
Madaan et al.
Self-Refine: Iterative Refinement with Self-Feedback
Single-model iterative improvement loop.
Schick, Dwivedi-Yu, Dessì et al.
Toolformer: Language Models Can Teach Themselves to Use Tools
Early systematic treatment of LLM tool use.
Wang, Ma, Feng, Zhang et al.
A Survey on Large Language Model based Autonomous Agents
The most-cited LLM-agent survey.
Xi, Chen, Guo et al.
The Rise and Potential of Large Language Model Based Agents: A Survey
Companion survey covering capabilities, applications, and risks.
Wu, Bansal, Zhang et al.
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
The AutoGen architecture paper. Conversational multi-agent foundations.
Yao, Yu, Zhao, Shafran, Griffiths, Cao, Narasimhan
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Generalisation of chain-of-thought to a tree search over reasoning steps.
Lewis, Perez, Piktus et al.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
The original RAG paper.
Liu, Lin, Hewitt et al.
Lost in the Middle: How Language Models Use Long Contexts
Shows that models use information less reliably when it sits in the middle of a long context than when it appears at the beginning or end.

Framework documentation (references, not rankings)

LangChain
LangGraph
Graph-based state-machine orchestration.
Microsoft Research
AutoGen
Conversational multi-agent framework.
CrewAI
CrewAI
Role-based crew orchestration.
OpenAI
OpenAI Agents SDK
Handoff-based coordination, successor to Swarm.
LlamaIndex
LlamaIndex Agent Workflows
Retrieval-first framework with agent support.
Anthropic
Claude Agent SDK
Anthropic’s production agent SDK and patterns.
Microsoft
Semantic Kernel
Planner-and-skills model with .NET, Java, and Python support.
Pydantic
Pydantic-AI
Type-driven agent framework with Pydantic validation.
deepset
Haystack
Retrieval-centric pipeline model with agent support.

Industry reports and indices

Stanford HAI
AI Index Report (annual)
Technical performance, cost, and adoption data sourced from public datasets.
MIT Technology Review
AI coverage
Editorial coverage with strong sourcing discipline.
Anthropic Research
Anthropic Research
Published research from Anthropic; agent and safety papers.
Google DeepMind
DeepMind Research
Published research from Google DeepMind.
OpenAI Research
OpenAI Research
Published research and technical reports from OpenAI.
Anthropic
Building effective agents
The clearest public treatment of the workflow-vs-agent distinction.

Books

S. Russell and P. Norvig
Artificial Intelligence: A Modern Approach (4th ed.)
Pearson, 2020. The canonical AI textbook. Chapter 2 on intelligent agents remains the reference for the classical taxonomy.
F. Chollet
Deep Learning with Python (2nd ed.)
Manning, 2021. The deep-learning foundation underneath language models.
S. Bubeck et al.
Sparks of Artificial General Intelligence: Early experiments with GPT-4
A preprint rather than a book; widely read as a long-form report on the capabilities of an early version of GPT-4 (early 2023).

Security and safety

OWASP
Top 10 for Large Language Model Applications
The reference list of LLM-application security risks.
NIST
AI Risk Management Framework
Voluntary U.S. government framework for identifying and managing AI risk.
Perez and Ribeiro
Ignore Previous Prompt: Attack Techniques For Language Models
Foundational direct prompt-injection paper.
Greshake, Abdelnabi, Mishra et al.
Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
The indirect prompt-injection threat model.
Anthropic
Responsible Scaling Policy
Published policy for scaling frontier model deployments.
Bai, Kadavath, Kundu et al.
Constitutional AI: Harmlessness from AI Feedback
The Constitutional AI methodology paper.

Benchmarks (cross-reference)

Liu, Yu, Zhang et al.
AgentBench: Evaluating LLMs as Agents
Broad agent capability evaluation across eight environments.
Jimenez, Yang, Wettig et al.
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
The standard coding-agent benchmark.
Zhou, Xu, Zhu et al.
WebArena: A Realistic Web Environment for Building Autonomous Agents
Web-navigation benchmark.
Sierra Research
tau-bench
Multi-turn customer-service tool-use benchmark.
Berkeley
Berkeley Function-Calling Leaderboard (BFCL)
Standard tool-use reliability leaderboard.
AgentCogito sister site
benchmarkingagents.com
The dedicated reference for agent and LLM benchmarks.

A note on citations

Where a claim on this site cites a number, a date, or an empirical result, the citation links to the source above. Where a claim is editorial synthesis (a generalisation across multiple sources, or an opinion about practice), it is marked as such inline rather than presented as a statistic. If a citation appears broken or a claim seems wrong, please send corrections via the contact route.
