
SE System Architecture Partner

Production-ready agent for system architecture review. Includes structured workflows, validation checks, and reusable patterns for data and AI systems.

Agent | Cliptics | data ai | v1.0.0 | MIT

A system architecture review agent that evaluates designs for security, scalability, reliability, and AI-specific concerns, applying Well-Architected framework principles to prevent architectural decisions that cause production incidents.

When to Use This Agent

Choose System Architecture Partner when:

  • Reviewing system designs before implementation begins
  • Evaluating architecture for security, scalability, and reliability risks
  • Assessing AI/ML system architectures for operational readiness
  • Conducting pre-launch architecture reviews for new services
  • Identifying single points of failure and blast radius concerns

Consider alternatives when:

  • Doing code-level reviews without architectural concerns (use a code review agent)
  • Building new systems from requirements (use a blueprint agent)
  • Optimizing existing infrastructure costs (use a cloud cost optimization tool)

Quick Start

    # .claude/agents/se-system-architecture-partner.yml
    name: System Architecture Partner
    model: claude-sonnet-4-20250514
    tools:
      - Read
      - Write
      - Bash
      - Glob
      - Grep
    prompt: |
      You are a system architecture reviewer. Evaluate designs for security,
      scalability, reliability, and operational readiness. Apply Well-Architected
      framework principles. Focus on preventing the architecture decisions that
      cause 3AM incidents.

Example invocation:

claude --agent se-system-architecture-partner "Review the architecture for our new payment processing service. Here's the design doc. Focus on security, data consistency, and failure modes."

Core Concepts

Architecture Review Framework

| Pillar | Key Questions | Red Flags |
| --- | --- | --- |
| Security | How is data encrypted at rest and in transit? | Hardcoded credentials, no auth between services |
| Reliability | What happens when component X fails? | Single points of failure, no circuit breakers |
| Scalability | What's the bottleneck under 10x load? | Synchronous chains, unbounded queues |
| Performance | Where's the latency budget spent? | Chatty service calls, N+1 queries |
| Operations | How is this monitored and debugged? | No health checks, logs without correlation IDs |
| Cost | What drives cost under load growth? | Unbounded caching, over-provisioned always-on |

Failure Mode Analysis

For each component, answer:
1. What happens if it's unavailable for 5 minutes?
2. What happens if it's slow (10x normal latency)?
3. What happens if it returns incorrect data?
4. What happens if its storage is full?
5. What happens if credentials expire?

Classification:
  Critical: Service is unusable → needs redundancy + auto-failover
  Degraded: Feature unavailable → needs graceful degradation
  Invisible: No user impact → needs monitoring only
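
The classification above can be turned into a simple gap check. The following is a minimal sketch, not part of the agent itself: the `Component` type, the `missing_mitigations` helper, and the example component names are all hypothetical, and only the three impact classes and their mitigations come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    CRITICAL = "critical"    # service is unusable
    DEGRADED = "degraded"    # a feature is unavailable
    INVISIBLE = "invisible"  # no user impact


# Mitigations each class calls for, taken from the classification list.
REQUIRED_MITIGATIONS = {
    Impact.CRITICAL: {"redundancy", "auto-failover"},
    Impact.DEGRADED: {"graceful degradation"},
    Impact.INVISIBLE: {"monitoring"},
}


@dataclass
class Component:
    name: str
    impact: Impact
    mitigations: set[str]  # what the design currently provides


def missing_mitigations(component: Component) -> set[str]:
    """Return mitigations the classification requires but the design lacks."""
    return REQUIRED_MITIGATIONS[component.impact] - component.mitigations


# Hypothetical components, for illustration only.
components = [
    Component("payments-db", Impact.CRITICAL, {"redundancy"}),
    Component("recommendations", Impact.DEGRADED, {"graceful degradation"}),
    Component("audit-export", Impact.INVISIBLE, set()),
]

for c in components:
    gaps = missing_mitigations(c)
    if gaps:
        print(f"{c.name} ({c.impact.value}): missing {', '.join(sorted(gaps))}")
```

Here `payments-db` is flagged for lacking auto-failover and `audit-export` for lacking monitoring, while `recommendations` passes. Deciding which class a component belongs to still requires answering the five questions by hand.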

Architecture Decision Checklist

    ## Pre-Implementation Review

    ### Data Flow
    - [ ] All data paths mapped with sensitivity classification
    - [ ] Encryption at rest and in transit verified
    - [ ] PII handling documented and compliant
    - [ ] Backup and recovery procedures defined

    ### Failure Handling
    - [ ] Single points of failure identified and mitigated
    - [ ] Circuit breakers on all external dependencies
    - [ ] Timeout and retry policies defined
    - [ ] Graceful degradation paths documented

    ### Scalability
    - [ ] Bottleneck under 10x load identified
    - [ ] Horizontal scaling strategy documented
    - [ ] Database scaling plan (read replicas, sharding)
    - [ ] Cache strategy with invalidation plan

    ### Operations
    - [ ] Health check endpoints defined
    - [ ] Structured logging with correlation IDs
    - [ ] Alerting thresholds defined
    - [ ] Runbook for common failure scenarios
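
If the checklist is kept as a Markdown file alongside the design doc, open items can be listed per section before sign-off. This is a rough sketch under that assumption; the `open_items` helper and the partially completed `sample` are hypothetical and not something the agent ships with.

```python
import re
from collections import defaultdict


def open_items(checklist: str) -> dict[str, list[str]]:
    """Group unchecked '- [ ]' items under their nearest '###' heading."""
    section = "(no section)"
    result = defaultdict(list)
    for raw in checklist.splitlines():
        line = raw.strip()
        heading = re.match(r"^###\s+(.*)$", line)
        item = re.match(r"^- \[( |x|X)\]\s+(.*)$", line)
        if heading:
            section = heading.group(1)
        elif item and item.group(1) == " ":
            # A space between the brackets means the item is still open.
            result[section].append(item.group(2))
    return dict(result)


# Hypothetical, partially completed excerpt of the checklist.
sample = """\
## Pre-Implementation Review
### Data Flow
- [x] All data paths mapped with sensitivity classification
- [ ] Encryption at rest and in transit verified
### Failure Handling
- [ ] Circuit breakers on all external dependencies
- [x] Timeout and retry policies defined
"""

for section, items in open_items(sample).items():
    for text in items:
        print(f"{section}: {text}")
```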

Configuration

| Parameter | Description | Default |
| --- | --- | --- |
| framework | Review framework to apply | AWS Well-Architected |
| focus_areas | Priority review areas | Security, Reliability |
| risk_tolerance | Acceptable risk level | Medium |
| compliance_reqs | Regulatory requirements | None specified |
| scale_target | Expected load for evaluation | Current + 10x |
| review_depth | Analysis detail level | Detailed |
| output_format | Review document format | Markdown |

Best Practices

  1. Start every review with the failure modes, not the happy path. The architecture diagram shows how the system works when everything goes right. Your job is to find how it fails. For each component, ask what happens when it's unavailable, slow, or returning errors. The most dangerous architectures are the ones that look elegant in diagrams but have cascading failure modes in production.

  2. Verify that every external dependency has a circuit breaker. When a downstream service slows down, synchronous callers pile up threads waiting for responses, eventually exhausting connection pools and crashing. Circuit breakers detect slow or failing dependencies and fail fast, protecting the calling service. Check that timeout, retry, and circuit breaker configurations exist for every external call.

  3. Evaluate the blast radius of every component failure. If one microservice goes down, does it take the entire system with it? Map the dependency graph and identify which services are on the critical path. Critical-path services need redundancy, health checks, and auto-recovery. Non-critical services should degrade gracefully, showing cached data or a "feature unavailable" message rather than an error page.

  4. Check that monitoring covers the unknown unknowns. It's easy to alert on CPU and memory. It's harder to detect when a service is running but returning stale data, or when latency increases gradually over weeks. Review the monitoring plan for business-level metrics (transactions per minute, error rate, data freshness), not just infrastructure metrics. If you can't tell from dashboards whether the system is healthy for users, monitoring is incomplete.

  5. Question every synchronous chain longer than three services. Each synchronous hop adds latency and a failure point. A request that passes through five services synchronously pays the latency of every hop, and if the services fail independently its availability is roughly the product of theirs: five services at 99.9% each give about 99.5% end to end. Look for opportunities to make interactions asynchronous through event-driven patterns, or to combine fine-grained services into coarser ones that reduce hop count.
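
To make practice 2 concrete, here is a minimal circuit breaker sketch. It is an illustration of the pattern, not the configuration of any particular library: the `CircuitBreaker` and `CircuitOpenError` names, the thresholds, and the `flaky` function are assumptions. It counts failures only, so the wrapped call is assumed to enforce its own timeout and raise when the dependency is slow.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of tying up a thread on a bad dependency.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    # Hypothetical dependency that always times out.
    raise TimeoutError("downstream timed out")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError as exc:
        print("rejected:", exc)
    except TimeoutError as exc:
        print("failed:", exc)
```

The first two calls reach the dependency and fail; the third is rejected immediately without touching it. In a review, the things to look for are the same three knobs shown here: a per-call timeout, a failure threshold, and a reset interval.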

Common Issues

Architecture looks scalable on paper but has hidden bottlenecks. Shared resources that don't appear on architecture diagrams often become bottlenecks: a single database connection pool shared by all services, a centralized configuration service, or a shared message queue. Trace the complete request path including infrastructure components and identify the component with the lowest throughput capacity.
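
One way to trace the request path is to list every hop, including shared infrastructure, with its capacity and how many times a single user request touches it. The sketch below assumes such numbers are available from load tests; the `Hop` type, the helper names, and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    name: str
    capacity_rps: float             # sustained operations/second the hop can serve
    calls_per_request: float = 1.0  # times one user request hits this hop


def effective_rps(hop: Hop) -> float:
    """User requests/second this hop can sustain on its own."""
    return hop.capacity_rps / hop.calls_per_request


def bottleneck(path: list[Hop]) -> Hop:
    """Return the hop with the lowest effective throughput."""
    return min(path, key=effective_rps)


# Hypothetical request path; the shared pool would not appear on most diagrams.
path = [
    Hop("api-gateway", 5000),
    Hop("order-service", 2000),
    Hop("shared-db-pool", 1200, calls_per_request=3),
    Hop("message-queue", 3000),
]

worst = bottleneck(path)
print(f"bottleneck: {worst.name} at {effective_rps(worst):.0f} requests/s")
```

In this made-up path the shared database pool limits the system to 400 requests per second even though every service on the diagram can handle far more, because each request uses it three times.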

Security review is treated as a checkbox exercise. Security must be evaluated in context, not from a checklist. A service that handles PII needs different security controls than an internal metrics service. Focus the review on actual data flows: where sensitive data enters the system, how it's processed, where it's stored, and who can access it. Review authentication between services, not just at the API gateway.

Architecture review happens too late to influence design decisions. By the time code is written, changing the architecture is expensive. Conduct reviews during the design phase when changes are cheap. Use architecture decision records (ADRs) to document choices made during design, and review those decisions rather than reviewing completed code. The best review catches a problematic pattern before anyone writes a line of code.
