SE System Architecture Partner

A system architecture review agent that evaluates designs for security, scalability, reliability, and AI-specific concerns, applying Well-Architected framework principles to prevent architectural decisions that cause production incidents.

When to Use This Agent

Choose System Architecture Partner when:

Reviewing system designs before implementation begins
Evaluating architecture for security, scalability, and reliability risks
Assessing AI/ML system architectures for operational readiness
Conducting pre-launch architecture reviews for new services
Identifying single points of failure and blast radius concerns

Consider alternatives when:

Doing code-level reviews without architectural concerns (use a code review agent)
Building new systems from requirements (use a blueprint agent)
Optimizing existing infrastructure costs (use a cloud cost optimization tool)

Quick Start


# .claude/agents/se-system-architecture-partner.yml
name: System Architecture Partner
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a system architecture reviewer. Evaluate designs for
  security, scalability, reliability, and operational readiness.
  Apply Well-Architected framework principles. Focus on preventing
  the architecture decisions that cause 3AM incidents.

Example invocation:


claude --agent se-system-architecture-partner "Review the architecture
  for our new payment processing service. Here's the design doc.
  Focus on security, data consistency, and failure modes."

Core Concepts

Architecture Review Framework

| Pillar | Key Questions | Red Flags |
| --- | --- | --- |
| Security | How is data encrypted at rest and in transit? | Hardcoded credentials, no auth between services |
| Reliability | What happens when component X fails? | Single points of failure, no circuit breakers |
| Scalability | What's the bottleneck under 10x load? | Synchronous chains, unbounded queues |
| Performance | Where's the latency budget spent? | Chatty service calls, N+1 queries |
| Operations | How is this monitored and debugged? | No health checks, logs without correlation IDs |
| Cost | What drives cost under load growth? | Unbounded caching, over-provisioned always-on |

Failure Mode Analysis

For each component, answer:
1. What happens if it's unavailable for 5 minutes?
2. What happens if it's slow (10x normal latency)?
3. What happens if it returns incorrect data?
4. What happens if its storage is full?
5. What happens if credentials expire?

Classification:
  Critical: Service is unusable → needs redundancy + auto-failover
  Degraded: Feature unavailable → needs graceful degradation
  Invisible: No user impact → needs monitoring only
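
The classification above can be recorded per component so that the worst answer across the five questions decides what the component needs. The following is a minimal Python sketch, not part of the agent itself; the `Finding` type, its field names, and the component names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Impact(IntEnum):
    """Classification levels from the text, ordered by severity."""
    INVISIBLE = 0  # no user impact: monitoring only
    DEGRADED = 1   # feature unavailable: graceful degradation
    CRITICAL = 2   # service unusable: redundancy + auto-failover


@dataclass
class Finding:
    """A reviewer's answer to one failure-mode question (hypothetical shape)."""
    component: str
    scenario: str              # e.g. "unavailable for 5 minutes"
    service_unusable: bool     # the whole service stops working
    feature_unavailable: bool  # only a feature is lost


def classify(finding: Finding) -> Impact:
    # The most severe outcome wins.
    if finding.service_unusable:
        return Impact.CRITICAL
    if finding.feature_unavailable:
        return Impact.DEGRADED
    return Impact.INVISIBLE


def worst_by_component(findings):
    """Return the most severe classification seen for each component."""
    worst = {}
    for finding in findings:
        current = worst.get(finding.component, Impact.INVISIBLE)
        worst[finding.component] = max(current, classify(finding))
    return worst


# Illustrative answers for two made-up components.
findings = [
    Finding("auth-db", "unavailable for 5 minutes", True, False),
    Finding("auth-db", "storage is full", False, True),
    Finding("recs-cache", "10x normal latency", False, True),
]
ratings = worst_by_component(findings)
```

Keeping the worst rating per component means a single Critical answer is enough to require redundancy, even if the other four questions look harmless.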

Architecture Decision Checklist


## Pre-Implementation Review

### Data Flow
- [ ] All data paths mapped with sensitivity classification
- [ ] Encryption at rest and in transit verified
- [ ] PII handling documented and compliant
- [ ] Backup and recovery procedures defined

### Failure Handling
- [ ] Single points of failure identified and mitigated
- [ ] Circuit breakers on all external dependencies
- [ ] Timeout and retry policies defined
- [ ] Graceful degradation paths documented

### Scalability
- [ ] Bottleneck under 10x load identified
- [ ] Horizontal scaling strategy documented
- [ ] Database scaling plan (read replicas, sharding)
- [ ] Cache strategy with invalidation plan

### Operations
- [ ] Health check endpoints defined
- [ ] Structured logging with correlation IDs
- [ ] Alerting thresholds defined
- [ ] Runbook for common failure scenarios
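
If the checklist above is kept as a markdown file, the open items can be listed per section before a review is signed off. This is a small sketch under that assumption; the `open_items` helper is hypothetical and only understands the `###` headings and `- [ ]` / `- [x]` items shown above.

```python
import re

ITEM = re.compile(r"- \[( |x|X)\] (.+)")


def open_items(markdown: str) -> dict[str, list[str]]:
    """Map each '###' section to the checklist items still unchecked."""
    result: dict[str, list[str]] = {}
    section = "(no section)"
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("### "):
            section = line[4:].strip()
            continue
        match = ITEM.match(line)
        if match and match.group(1) == " ":
            # A space inside the brackets means the item is not done yet.
            result.setdefault(section, []).append(match.group(2))
    return result


sample = """\
### Failure Handling
- [x] Single points of failure identified and mitigated
- [ ] Circuit breakers on all external dependencies
### Operations
- [ ] Health check endpoints defined
"""
remaining = open_items(sample)
```

A review could then be treated as incomplete while `remaining` is non-empty.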

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| `framework` | Review framework to apply | AWS Well-Architected |
| `focus_areas` | Priority review areas | Security, Reliability |
| `risk_tolerance` | Acceptable risk level | Medium |
| `compliance_reqs` | Regulatory requirements | None specified |
| `scale_target` | Expected load for evaluation | Current + 10x |
| `review_depth` | Analysis detail level | Detailed |
| `output_format` | Review document format | Markdown |

Best Practices

Start every review with the failure modes, not the happy path. The architecture diagram shows how the system works when everything goes right. Your job is to find how it fails. For each component, ask what happens when it's unavailable, slow, or returning errors. The most dangerous architectures are the ones that look elegant in diagrams but have cascading failure modes in production.
Verify that every external dependency has a circuit breaker. When a downstream service slows down, synchronous callers pile up threads waiting for responses, eventually exhausting connection pools and crashing. Circuit breakers detect slow or failing dependencies and fail fast, protecting the calling service. Check that timeout, retry, and circuit breaker configurations exist for every external call.
Evaluate the blast radius of every component failure. If one microservice goes down, does it take the entire system with it? Map the dependency graph and identify which services are on the critical path. Critical-path services need redundancy, health checks, and auto-recovery. Non-critical services should degrade gracefully, showing cached data or a "feature unavailable" message rather than an error page.
Check that monitoring covers the unknown unknowns. It's easy to alert on CPU and memory. It's harder to detect when a service is running but returning stale data, or when latency increases gradually over weeks. Review the monitoring plan for business-level metrics (transactions per minute, error rate, data freshness), not just infrastructure metrics. If you can't tell from dashboards whether the system is healthy for users, monitoring is incomplete.
Question every synchronous chain longer than three services. Each synchronous hop adds latency and a failure point. A request that passes through five services synchronously accumulates the latency of every hop, and its availability is the product of theirs: assuming independent failures, five hops at 99.9% each give roughly 99.5% end to end. Look for opportunities to make interactions asynchronous through event-driven patterns, or to combine fine-grained services into coarser ones that reduce hop count.
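
The circuit-breaker practice described above can be sketched roughly as follows. This is a minimal, single-threaded illustration rather than a production implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for the example, and the wrapped call is expected to enforce its own timeout so that a slow dependency surfaces as an exception.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of tying up a thread or connection.
                raise CircuitOpenError("dependency marked unhealthy")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each external call, for example `breaker.call(fetch_quote, order_id)` with a hypothetical `fetch_quote`, and treat `CircuitOpenError` as the signal to fall back to the degraded path.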

Common Issues

Architecture looks scalable on paper but has hidden bottlenecks. Shared resources that don't appear on architecture diagrams often become bottlenecks: a single database connection pool shared by all services, a centralized configuration service, or a shared message queue. Trace the complete request path including infrastructure components and identify the component with the lowest throughput capacity.
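
As a rough illustration of tracing the request path described above, the sketch below picks the component with the lowest throughput capacity. The component names and requests-per-second figures are invented for the example; real numbers would come from load tests or capacity estimates.

```python
def find_bottleneck(path, capacity_rps):
    """Return (component, capacity) for the lowest-capacity component on the path."""
    missing = [name for name in path if name not in capacity_rps]
    if missing:
        # A component with no estimate is itself a review finding.
        raise KeyError(f"no capacity estimate for: {missing}")
    slowest = min(path, key=lambda name: capacity_rps[name])
    return slowest, capacity_rps[slowest]


# Hypothetical request path, including infrastructure that diagrams often omit.
request_path = ["api-gateway", "orders-service", "shared-db-pool", "payments-service"]
capacity_rps = {
    "api-gateway": 5000,
    "orders-service": 1200,
    "shared-db-pool": 300,
    "payments-service": 800,
}
bottleneck = find_bottleneck(request_path, capacity_rps)
```

Listing shared infrastructure such as the connection pool as its own entry on the path is what makes it visible here; it would not appear if only the services were traced.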

Security review is treated as a checkbox exercise. Security must be evaluated in context, not from a checklist. A service that handles PII needs different security controls than an internal metrics service. Focus the review on actual data flows: where sensitive data enters the system, how it's processed, where it's stored, and who can access it. Review authentication between services, not just at the API gateway.

Architecture review happens too late to influence design decisions. By the time code is written, changing the architecture is expensive. Conduct reviews during the design phase when changes are cheap. Use architecture decision records (ADRs) to document choices made during design, and review those decisions rather than reviewing completed code. The best review catches a problematic pattern before anyone writes a line of code.
