Building Custom Sub-Agents for AI-Assisted Workflows
Design patterns for specialized sub-agents with narrow scope, explicit contracts, and composable pipelines for AI-assisted development.
The Narrow Scope Philosophy
A common mistake when building AI agents is making them too general. A single agent that handles code generation, testing, documentation, deployment, and code review tends to underperform at all of them. It accumulates too much context, its system prompt becomes a wall of conflicting instructions, and its behavior becomes unpredictable.
The alternative: build narrow, focused sub-agents that each do one thing well. This is not a new idea; it is the Unix philosophy applied to AI: small, composable tools that can be chained together.
A sub-agent should have:
- One clear responsibility (“generate unit tests for Python functions”)
- Defined inputs (function source code, testing framework preference)
- Defined outputs (test file content, coverage estimate)
- No side effects outside its scope (it does not modify the source code it is testing)
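One way to make these four properties concrete is to give every sub-agent the same typed shape. The sketch below is illustrative only: `SubAgent`, `TestGeneratorInput`, `TestGeneratorOutput`, and the stub body are hypothetical names, not part of any particular framework, and the model call is replaced with a placeholder.

```typescript
// Hypothetical shape for a narrow sub-agent: one responsibility,
// a typed input, a typed output, and nothing else to reach for.
interface SubAgent<I, O> {
  readonly name: string;
  readonly responsibility: string;
  run(input: Readonly<I>): Promise<O>;
}

interface TestGeneratorInput {
  sourceCode: string;
  framework: "pytest" | "unittest";
}

interface TestGeneratorOutput {
  testCode: string;
  estimatedCoverage: number;
}

// Stub implementation: a real agent would call a model here.
const testGenerator: SubAgent<TestGeneratorInput, TestGeneratorOutput> = {
  name: "test-generator",
  responsibility: "generate unit tests for Python functions",
  async run(input) {
    // The source is only read, never modified (no side effects).
    const header = input.framework === "pytest" ? "import pytest" : "import unittest";
    return {
      testCode: `${header}\n# placeholder tests for ${input.sourceCode.length} chars of source`,
      estimatedCoverage: 0,
    };
  },
};
```

Because the input is `Readonly` and the only way out is the return value, anything the agent wants to change has to appear in its declared output, which keeps the "no side effects" rule visible in the type.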
Explicit I/O Contracts
Every sub-agent needs a contract — a clear specification of what it accepts and what it returns. Without this, composition breaks down. You cannot pipe the output of one agent into another if you do not know the shape of that output.
Defining Contracts
# agents/test-generator.yaml
name: test-generator
description: "Generates unit tests for Python functions"
model: claude-sonnet-4-20250514
input:
  format: json
  schema:
    source_code:
      type: string
      required: true
      description: "The Python function(s) to test"
    framework:
      type: enum
      values: [pytest, unittest]
      default: pytest
    coverage_target:
      type: number
      default: 0.8
      description: "Target branch coverage (0.0-1.0)"
output:
  format: json
  schema:
    test_code:
      type: string
      description: "Generated test file content"
    test_count:
      type: number
      description: "Number of test cases generated"
    estimated_coverage:
      type: number
      description: "Estimated branch coverage"
This contract serves triple duty: it documents the agent for humans, it gives the runtime a schema to validate inputs against, and it enables tooling to auto-generate pipeline configurations.
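To illustrate the runtime-validation role, here is a minimal sketch of checking an input object against the `input.schema` section of the contract above. The `FieldSpec` type and `validateInput` function are hypothetical; they mirror the field names used in the YAML (`type`, `required`, `values`, `default`) rather than any real schema library, and the schema is transcribed by hand instead of being parsed from the file.

```typescript
type FieldSpec =
  | { type: "string"; required?: boolean; default?: string }
  | { type: "number"; required?: boolean; default?: number }
  | { type: "enum"; values: string[]; required?: boolean; default?: string };

type ContractSchema = Record<string, FieldSpec>;

interface InputCheck {
  valid: boolean;
  errors: string[];
  value: Record<string, unknown>; // the input with defaults applied
}

function validateInput(schema: ContractSchema, input: Record<string, unknown>): InputCheck {
  const errors: string[] = [];
  const value: Record<string, unknown> = {};
  for (const field of Object.keys(schema)) {
    const spec = schema[field];
    // Missing (or null) fields fall back to the contract default, if any.
    const raw = input[field] ?? spec.default;
    if (raw === undefined) {
      if (spec.required) errors.push(`${field}: required field is missing`);
      continue;
    }
    if (spec.type === "enum") {
      if (typeof raw !== "string" || spec.values.indexOf(raw) === -1) {
        errors.push(`${field}: expected one of ${spec.values.join(", ")}`);
        continue;
      }
    } else if (typeof raw !== spec.type) {
      errors.push(`${field}: expected ${spec.type}, got ${typeof raw}`);
      continue;
    }
    value[field] = raw;
  }
  return { valid: errors.length === 0, errors, value };
}

// The test-generator input schema from the contract, transcribed by hand.
const testGeneratorInputSchema: ContractSchema = {
  source_code: { type: "string", required: true },
  framework: { type: "enum", values: ["pytest", "unittest"], default: "pytest" },
  coverage_target: { type: "number", default: 0.8 },
};
```

Calling `validateInput(testGeneratorInputSchema, { source_code: "def f(): pass" })` in this sketch yields a valid result with `framework` and `coverage_target` filled in from the defaults, while an input that omits `source_code` is rejected with a named error.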
Shell Scripting as Orchestration
You do not need a complex orchestration framework to chain sub-agents. Shell scripts work surprisingly well for linear pipelines, and they have the advantage of being universally understood.
#!/bin/bash
# pipeline: analyze → generate tests → review tests
set -euo pipefail
SOURCE_FILE="${1:?usage: $0 <source-file>}"
SOURCE_CODE=$(cat "$SOURCE_FILE")
# Step 1: Analyze the code structure
ANALYSIS=$(agent-run code-analyzer \
  --input "{\"source_code\": $(echo "$SOURCE_CODE" | jq -Rs .)}" \
  --output json)
echo "Analysis complete: $(echo "$ANALYSIS" | jq -r '.summary')"
# Step 2: Generate tests. The test-generator contract accepts only the source
# and framework, so the analysis is not forwarded here.
TESTS=$(agent-run test-generator \
  --input "{\"source_code\": $(echo "$SOURCE_CODE" | jq -Rs .), \"framework\": \"pytest\"}" \
  --output json)
echo "Generated $(echo "$TESTS" | jq -r '.test_count') test cases"
# Step 3: Review the generated tests for quality
# (jq without -r emits test_code as an already-quoted JSON string)
REVIEW=$(agent-run code-reviewer \
  --input "{\"code\": $(echo "$TESTS" | jq '.test_code'), \"review_type\": \"test_quality\"}" \
  --output json)
# Step 4: Write output if review passes
SCORE=$(echo "$REVIEW" | jq -r '.quality_score')
if (( $(echo "$SCORE > 0.7" | bc -l) )); then
  mkdir -p tests
  echo "$TESTS" | jq -r '.test_code' > "tests/test_$(basename "$SOURCE_FILE")"
  echo "Tests written. Quality score: $SCORE"
else
  echo "Tests below quality threshold ($SCORE). Review: $(echo "$REVIEW" | jq -r '.issues')"
  exit 1
fi
This pipeline is readable, debuggable (add set -x for tracing), and easy to modify. Each step’s output is captured in a variable and can be inspected if something goes wrong.
Composable Pipelines
The real power of sub-agents appears when you start composing them into reusable pipelines. A pipeline is just a sequence of sub-agents with defined handoff points.
Pipeline Definition
# pipelines/pr-review.yaml
name: pr-review
description: "Full pull request review pipeline"
steps:
  - agent: diff-analyzer
    input:
      diff: "${pipeline.input.diff}"
    output_as: analysis
  - agent: security-scanner
    input:
      code_changes: "${analysis.changed_files}"
    output_as: security
  - agent: test-coverage-checker
    input:
      source_files: "${analysis.changed_files}"
      test_files: "${analysis.test_files}"
    output_as: coverage
  - agent: review-summarizer
    input:
      analysis: "${analysis}"
      security: "${security}"
      coverage: "${coverage}"
    output_as: summary
output:
  review: "${summary.review_text}"
  approve: "${summary.should_approve}"
  issues: "${summary.blocking_issues}"
Each step references outputs from previous steps using a simple variable syntax. The pipeline runner resolves these references, validates the data shapes against the agent contracts, and handles errors at each handoff.
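The reference-resolution step can be sketched in a few lines. This is an assumption about how such a runner might work, not the behavior of a specific tool: `resolveRef` and `resolveInputs` are hypothetical helpers that look up `${name.path}` references in a map of earlier step outputs and fail loudly when a reference does not resolve.

```typescript
type StepOutputs = Record<string, unknown>;

// Resolve a single "${a.b.c}" reference against earlier step outputs.
// Strings that are not a whole reference are returned unchanged.
function resolveRef(template: string, outputs: StepOutputs): unknown {
  const match = /^\$\{([A-Za-z0-9_.-]+)\}$/.exec(template);
  if (!match) return template;
  let current: unknown = outputs;
  for (const key of match[1].split(".")) {
    if (typeof current !== "object" || current === null || !(key in current)) {
      throw new Error(`Unresolved reference ${template}: no "${key}"`);
    }
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Resolve every input of one pipeline step.
function resolveInputs(
  inputs: Record<string, string>,
  outputs: StepOutputs
): Record<string, unknown> {
  const resolved: Record<string, unknown> = {};
  for (const inputName of Object.keys(inputs)) {
    resolved[inputName] = resolveRef(inputs[inputName], outputs);
  }
  return resolved;
}
```

In this sketch the runner would store each step's result under its `output_as` name, so by the time `test-coverage-checker` runs, `${analysis.changed_files}` resolves to whatever `diff-analyzer` returned in that field. Throwing on an unresolved reference keeps a typo in the pipeline file from silently passing `undefined` to the next agent.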
Validation Between Handoffs
The space between two agents is where things break. Agent A returns something unexpected, agent B receives garbage, and the pipeline produces nonsense. Validation at handoff points catches these failures early.
interface HandoffValidator {
  validate(
    output: unknown,
    expectedSchema: Schema
  ): ValidationResult;
}
// Between each pipeline step:
let result = await agent.run(input);
let validation = validator.validate(result, nextAgent.inputSchema);
if (!validation.valid) {
  // Option 1: Retry with error context, then validate the retry as well
  result = await agent.run(input, {
    additionalContext: `Previous output was invalid: ${validation.errors.join(", ")}. Please fix.`,
  });
  validation = validator.validate(result, nextAgent.inputSchema);
  // Option 2: Fall back to a default
  // Option 3: Halt the pipeline with a clear error
}
Three ways to handle a failed validation, in order of preference:
- Retry with feedback. Give the agent its validation errors and ask it to fix them. Works well for formatting issues.
- Default values. If a non-critical field is missing, use a sensible default and continue.
- Halt with diagnostics. If the output is fundamentally wrong, stop the pipeline and report exactly which step failed and why.
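The three strategies can be combined into one handoff routine that tries them in the stated order. The sketch below is one possible arrangement under assumptions: `runWithRecovery`, `AgentFn`, `Check`, and the `defaults` parameter are hypothetical names, and the validator is passed in as a plain function that returns error messages rather than as the `HandoffValidator` interface above.

```typescript
type AgentFn = (input: unknown, feedback?: string) => Promise<Record<string, unknown>>;
type Check = (output: Record<string, unknown>) => string[]; // returns error messages

async function runWithRecovery(
  stepName: string,
  agent: AgentFn,
  input: unknown,
  check: Check,
  defaults: Record<string, unknown> = {},
  maxRetries = 1
): Promise<Record<string, unknown>> {
  let output = await agent(input);
  let errors = check(output);

  // Strategy 1: retry, passing the validation errors back as feedback.
  for (let attempt = 0; attempt < maxRetries && errors.length > 0; attempt++) {
    output = await agent(input, `Previous output was invalid: ${errors.join(", ")}. Please fix.`);
    errors = check(output);
  }

  // Strategy 2: fill missing non-critical fields from defaults.
  if (errors.length > 0) {
    output = { ...defaults, ...output };
    errors = check(output);
  }

  // Strategy 3: halt and report which step failed and why.
  if (errors.length > 0) {
    throw new Error(`Step "${stepName}" failed validation: ${errors.join("; ")}`);
  }
  return output;
}
```

Only fields that are safe to guess belong in `defaults`; a missing `test_code` should never be papered over, which is why the final check still runs after defaults are applied.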
Building with Agent Forge
Agent Forge provides scaffolding for this pattern. It generates the agent configuration files, sets up the contract validation, and creates the pipeline runner — so you can focus on writing the system prompts and testing the agent behavior rather than building infrastructure.
# Scaffold a new sub-agent
agent-forge create test-generator --model claude-sonnet --category testing
# Define the contract interactively
agent-forge contract test-generator
# Add it to a pipeline
agent-forge pipeline add pr-review --step test-generator --after diff-analyzer
# Run the pipeline locally
agent-forge pipeline run pr-review --input '{"diff": "..."}'
Design Guidelines
Keep prompts under 500 words. If your system prompt is longer, your agent’s scope is probably too broad. Split it.
Test with adversarial inputs. Send your agent malformed JSON, empty strings, and absurdly long inputs. The contract validation should catch these, but verify.
Version your agents. When you update a system prompt, the agent’s behavior changes. Tag versions so pipelines can pin to a known-good configuration.
Log everything. Every sub-agent invocation should log its input, output, model used, token count, and latency. This data is essential for debugging pipelines and optimizing costs.
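A thin wrapper is one way to make this logging automatic. The sketch below is hypothetical throughout: the `InvocationLog` fields follow the list above, `withLogging` and `AgentCall` are made-up names, and the token count is assumed to be reported by the agent call itself, since how it is obtained depends on the model client.

```typescript
interface AgentReply {
  output: unknown;
  model: string;
  tokenCount: number; // assumed to come from the model client's usage data
}

interface InvocationLog {
  agent: string;
  input: unknown;
  output: unknown;
  model: string;
  tokenCount: number;
  latencyMs: number;
}

type AgentCall = (input: unknown) => Promise<AgentReply>;

// Wrap an agent call so every successful invocation emits one log entry.
function withLogging(
  agentName: string,
  call: AgentCall,
  sink: (entry: InvocationLog) => void
): AgentCall {
  return async (input) => {
    const started = Date.now();
    const reply = await call(input);
    sink({
      agent: agentName,
      input,
      output: reply.output,
      model: reply.model,
      tokenCount: reply.tokenCount,
      latencyMs: Date.now() - started,
    });
    return reply;
  };
}
```

A sink such as `(entry) => console.log(JSON.stringify(entry))` gives one JSON line per invocation. This sketch logs only successful calls; a fuller version would also record failures before rethrowing.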
Start with two agents. Do not design a twelve-agent pipeline on paper. Build two agents, connect them, verify the handoff works, then add the third. Incremental composition catches integration issues early.
The goal is not to build the most sophisticated multi-agent system. The goal is to build reliable, predictable automation from small, testable pieces.