In the first part of this series, we walked through the setup of our multi-agent spam classification system and defined key terminology like message passing, state management, and tools. Now it’s time to compare how these frameworks perform in practice.
Here, we’ll break down LangGraph, AutoGen, PydanticAI, CrewAI, Swarm, and Smolagents across five critical factors: message passing, state management, tool calling, quality of documentation, and ease of use. Whether you’re building a production-ready system, prototyping quickly, or creating a personal agentic tool, this comparison will help you pick the right framework for the job.
Message Passing

LangGraph, PydanticAI (A+): Message passing was consistent, with an orderly agent flow and no issues handling the feedback loop. LangGraph’s directed-graph workflow (agents as nodes, handoffs defined by edges) and conditional edges seamlessly managed execution order, especially for GPT-BERT disagreements. In PydanticAI, each agent had to be triggered individually, with outputs chained together to control execution order.
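To make the pattern concrete, here is a minimal sketch of a LangGraph workflow with a conditional edge routing on BERT-GPT agreement. The node names, state fields, and routing function are illustrative stand-ins, not our exact implementation:

```python
# Sketch of the LangGraph pattern described above; node names, state
# fields, and the routing logic are illustrative placeholders.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ClassifierState(TypedDict):
    message: str
    bert_label: str
    gpt_label: str
    final_label: str

def bert_node(state: ClassifierState) -> dict:
    # Run the BERT classifier here; stubbed for brevity.
    return {"bert_label": "spam"}

def gpt_node(state: ClassifierState) -> dict:
    # Run the GPT classifier here; stubbed for brevity.
    return {"gpt_label": "spam"}

def output_node(state: ClassifierState) -> dict:
    return {"final_label": state["gpt_label"]}

def route_after_gpt(state: ClassifierState) -> str:
    # Conditional edge: agreement goes straight to output,
    # disagreement loops back through the feedback path.
    return "agree" if state["bert_label"] == state["gpt_label"] else "disagree"

graph = StateGraph(ClassifierState)
graph.add_node("bert", bert_node)
graph.add_node("gpt", gpt_node)
graph.add_node("output", output_node)
graph.add_edge(START, "bert")
graph.add_edge("bert", "gpt")
graph.add_conditional_edges("gpt", route_after_gpt,
                            {"agree": "output", "disagree": "bert"})
graph.add_edge("output", END)

app = graph.compile()
result = app.invoke({"message": "Win a free iPhone now!"})
```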
Swarm (A): Mostly consistent but occasionally ran the BERT or GPT agent twice in a row. While it didn’t affect the overall execution, this behavior was less than ideal. We also had to write custom transfer functions for handoffs, which felt unnecessary.
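For reference, the custom transfer functions look roughly like this (a sketch using OpenAI Swarm’s public API; the agent names and instructions are placeholders):

```python
# Sketch of the hand-written transfer functions Swarm required of us.
from swarm import Swarm, Agent

gpt_agent = Agent(
    name="GPT Agent",
    instructions="Classify the message as spam or ham.",
)

def transfer_to_gpt_agent():
    """Hand off the conversation to the GPT agent."""
    return gpt_agent

bert_agent = Agent(
    name="BERT Agent",
    instructions="Run the BERT classifier, then hand off to the GPT agent.",
    functions=[transfer_to_gpt_agent],
)

client = Swarm()
response = client.run(
    agent=bert_agent,
    messages=[{"role": "user", "content": "Win a free iPhone now!"}],
)
print(response.messages[-1]["content"])
```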
AutoGen, CrewAI (B+): AutoGen’s Swarm (not to be confused with OpenAI Swarm) multi-agent team had dedicated handoff sequences but occasionally got stuck at the GPT agent during feedback loops. CrewAI was mostly consistent but sometimes routed to the wrong agent, causing errors or premature termination.
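The AutoGen team pattern we used is sketched below, assuming the autogen-agentchat v0.4-style Swarm team API; the agent names, prompts, and termination phrase are illustrative:

```python
# Sketch of an AutoGen Swarm team with explicit handoffs
# (autogen-agentchat v0.4-style API; names and prompts are placeholders).
import asyncio
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.conditions import TextMentionTermination
from autogen_agentchat.teams import Swarm
from autogen_ext.models.openai import OpenAIChatCompletionClient

model_client = OpenAIChatCompletionClient(model="gpt-4o-mini")

bert_agent = AssistantAgent(
    "bert_agent",
    model_client=model_client,
    handoffs=["gpt_agent"],
    system_message="Classify the message, then hand off to gpt_agent.",
)
gpt_agent = AssistantAgent(
    "gpt_agent",
    model_client=model_client,
    handoffs=["bert_agent"],
    system_message="Verify the label; reply TERMINATE when you agree.",
)

team = Swarm(
    [bert_agent, gpt_agent],
    termination_condition=TextMentionTermination("TERMINATE"),
)

async def main():
    result = await team.run(task="Classify: 'Win a free iPhone now!'")
    print(result.messages[-1].content)

asyncio.run(main())
```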
Smolagents (C): The execution sometimes skipped entire agents, jumping straight to the output agent and hallucinating results, and each step often required multiple retries.
State Management

LangGraph, PydanticAI, Swarm (A+): All three frameworks handled state and parameters robustly. LangGraph uses TypedDict to define states, which are cleanly updated and managed, with a checkpointer saving state snapshots after updates or handoffs. PydanticAI leverages dataclasses for type checking, validation, and dependency management, ensuring consistent and tool-accessible states. Swarm relies on context variables, in the form of Python dictionaries, that are manually updated after each agent run but remain consistent and accessible throughout execution.
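PydanticAI’s dependency pattern is worth a quick illustration: a dataclass holds the state, and every tool receives it through RunContext. This is a hedged sketch; the ClassificationDeps fields and the tool are placeholders, not our exact code:

```python
# Sketch of PydanticAI's dataclass-based dependency/state pattern.
from dataclasses import dataclass
from pydantic_ai import Agent, RunContext

@dataclass
class ClassificationDeps:
    message: str
    bert_label: str | None = None

agent = Agent(
    "openai:gpt-4o-mini",
    deps_type=ClassificationDeps,
    system_prompt="Classify the message as spam or ham.",
)

@agent.tool
def get_bert_label(ctx: RunContext[ClassificationDeps]) -> str:
    """Return the BERT classifier's label for the current message."""
    return ctx.deps.bert_label or "unknown"

deps = ClassificationDeps(message="Win a free iPhone now!", bert_label="spam")
result = agent.run_sync("Classify this message.", deps=deps)
print(result.output)  # result.data on older PydanticAI releases
```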
AutoGen, CrewAI (A): State management in our implementation of AutoGen was mostly automated, with states embedded in LLM prompts and response structures. While consistent, it lacked granular control. CrewAI sometimes failed to pass correct parameter values to function tools, requiring careful workarounds.
Smolagents (B): State management was unstructured, with responses forced into verbose formats (e.g., requiring agents to return three answer sections: Task outcome (short), Task outcome (detailed), Additional context). This created unnecessary bloat, burying key data in extra output. However, the critical information was usually present.
Tool Calling

AutoGen, LangGraph, CrewAI (A+): All three frameworks offer extensive prebuilt tools with easy integration. Tools are consistently called at the right time, and integration with external libraries, APIs, or custom code is seamless.
PydanticAI, Swarm, Smolagents (A): Tool calling was reliable, with tools registered and executed correctly. However, they lack the prebuilt tool libraries and advanced integration features of AutoGen and LangGraph.
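To illustrate that prebuilt-tool gap: wiring a ready-made LangChain tool into a LangGraph agent takes only a few lines. This sketch assumes the langchain-openai and duckduckgo-search packages; the tool choice is ours, not from our spam system:

```python
# Sketch of LangGraph's prebuilt-tool integration; the search tool and
# model choice are illustrative assumptions.
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    ChatOpenAI(model="gpt-4o-mini"),
    tools=[DuckDuckGoSearchRun()],  # prebuilt tool, no glue code needed
)
result = agent.invoke(
    {"messages": [("user", "Is 'win a free iPhone' typical spam phrasing?")]}
)
```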
Quality of Documentation

LangGraph, AutoGen, CrewAI, and PydanticAI (A): Documentation is extensive, with basic concepts and terminology clearly explained, core features detailed through technical explanations and practical examples, and specialized features covered as well. CrewAI and AutoGen stand out with plentiful multi-agent workflow examples, though LangGraph and PydanticAI have fewer. However, all four frameworks have gaps in explaining the under-the-hood workings of more complex topics like parallel execution and memory management.
Swarm and Smolagents (B): These lightweight frameworks focus on the basics, with documentation providing simple examples and straightforward explanations. However, they lack depth on advanced features and multi-agent workflows.
Ease of Use

Swarm, Smolagents, PydanticAI (A+): Super easy to create agents, manage state, and register tools. Chaining agent executions is straightforward, making these frameworks ideal for quick prototyping.
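As a taste of how little ceremony these lightweight frameworks require, here is a Smolagents agent with one registered tool. A hedged sketch: the tool is a stub, and the model class name differs across smolagents releases (HfApiModel in older ones, InferenceClientModel in newer ones):

```python
# Sketch of a minimal Smolagents setup; the classifier tool is stubbed.
from smolagents import CodeAgent, InferenceClientModel, tool

@tool
def classify_with_bert(message: str) -> str:
    """Classify a message with the BERT model (stubbed here).

    Args:
        message: The text message to classify.
    """
    return "spam"

agent = CodeAgent(tools=[classify_with_bert], model=InferenceClientModel())
agent.run("Classify this message: 'Win a free iPhone now!'")
```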
LangGraph (B+): Slightly higher complexity due to its graph-based workflows and extensive feature list. However, creating the actual agents and tools remains straightforward, balancing power with usability.
AutoGen, CrewAI (B): More challenging to use, especially when configuring prebuilt multi-agent teams and handling state management. These frameworks require more effort and carefully crafted prompts to get workflows running smoothly.
Final Rankings

1. LangGraph, PydanticAI (A+): LangGraph is ideal for building chatbot agents and real-time workflows, offering streaming of LLM responses, robust state management, granular control over execution, and prebuilt agents and tools. PydanticAI is the most user-friendly framework thanks to excellent documentation and reliable message passing, though debugging event execution errors can be challenging, and it lacks prebuilt tool libraries and advanced integrations.
2. AutoGen (A): AutoGen excels in large-scale multi-agent systems with speed, scalability, and flexible team patterns but has a steeper learning curve.
3. CrewAI (B+): CrewAI is great for prototyping with a diverse feature set, but has inconsistent message passing, making it unsuitable for production-level systems.
4. Swarm, Smolagents (B): Beginner-friendly but limited. Swarm helps create smooth workflows but relies heavily on OpenAI models. Smolagents supports multi-agent setups but struggles with consistency.