
AgentCliptics · data ai · v1.0.0 · MIT

Data Engineer Assistant

An autonomous agent that helps design and implement data platforms, covering pipeline architecture, ETL/ELT development, data lake and warehouse design, stream processing, and cost optimization for scalable data infrastructure.

When to Use This Agent

Choose Data Engineer Assistant when:

  • Designing data pipeline architectures for batch and streaming workloads
  • Building ETL/ELT pipelines with proper error handling and monitoring
  • Setting up data lake or data warehouse infrastructure
  • Optimizing query performance and storage costs in data platforms
  • Implementing data quality frameworks and lineage tracking

Consider alternatives when:

  • Doing exploratory data analysis on existing datasets (use a data analyst agent)
  • Building ML models without infrastructure concerns (use an ML engineer agent)
  • Creating dashboards and reports (use a BI/analytics tool)

Quick Start

```yaml
# .claude/agents/data-engineer-assistant.yml
name: Data Engineer Assistant
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior data engineer. Design and implement scalable data
  platforms covering pipelines, storage, processing, and monitoring.
  Prioritize reliability, cost efficiency, and data quality throughout
  the stack.
```

Example invocation:

```shell
claude --agent data-engineer-assistant "Design a pipeline to ingest clickstream events from Kafka, transform them with user session attribution, and load into BigQuery partitioned by event date"
```

Core Concepts

Data Platform Architecture

```
Sources → Ingestion → Storage → Processing → Serving → Consumers
  │          │           │          │           │          │
  APIs    Kafka/       Data       Spark/      Warehouse  Dashboards
  DBs     Firehose     Lake       dbt/SQL     API Cache  ML Models
  Files   CDC          (raw)      Airflow     (curated)  Reports
  Events  Batch pull   (staged)   Streaming   Marts      Apps
```

Storage Layer Strategy

| Layer | Purpose | Format | Retention |
|---|---|---|---|
| Raw/Bronze | Exact copies of source data | JSON, Avro, Parquet | Long-term |
| Staged/Silver | Cleaned, deduplicated, typed | Parquet, Delta | Medium-term |
| Curated/Gold | Business-ready aggregations | Parquet, Delta | As needed |
| Serving | Low-latency query access | Materialized views | Refreshed |
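The bronze-to-silver promotion can be sketched in plain Python. This is a minimal, illustrative version: the `event_id`/`user_id`/`event_ts` columns and the typing rules are assumptions for the example, not a fixed contract.

```python
from datetime import datetime

def promote_to_silver(raw_records):
    """Clean, deduplicate, and type raw (bronze) records for the silver layer."""
    seen = set()
    silver = []
    for rec in raw_records:
        key = rec.get("event_id")
        if key is None or key in seen:  # drop keyless rows and duplicates
            continue
        seen.add(key)
        silver.append({
            "event_id": key,
            "user_id": str(rec.get("user_id", "")),          # enforce type
            "event_ts": datetime.fromisoformat(rec["event_ts"]),  # parse timestamp
        })
    return silver

bronze = [
    {"event_id": "e1", "user_id": 42, "event_ts": "2024-05-01T10:00:00"},
    {"event_id": "e1", "user_id": 42, "event_ts": "2024-05-01T10:00:00"},  # duplicate
    {"event_id": "e2", "user_id": 7,  "event_ts": "2024-05-01T10:05:00"},
]
silver = promote_to_silver(bronze)
print(len(silver))  # → 2
```

In a real platform the same shape of logic would run in Spark or dbt; the point is that each layer's contract (keys, types, dedup rules) is explicit.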

Pipeline Patterns

```python
# Idempotent pipeline pattern
def process_partition(date: str, source: str, target: str):
    """Process a single date partition idempotently."""
    raw = read_source(source, date)
    validated = validate_schema(raw, expected_schema)
    transformed = apply_business_rules(validated)
    deduped = deduplicate(transformed, key_cols=['event_id'])
    # Atomic write: overwrite the partition, not the table
    write_partition(target, date, deduped, mode='overwrite')
    log_metrics(date, row_count=len(deduped),
                quality_score=compute_quality(deduped))
```
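The atomic partition overwrite that makes this pattern safe to rerun can be sketched for a file-based target. The `date=<date>/part-0.json` layout is a hypothetical example; the key idea is writing to a temp file and renaming into place so a rerun replaces, never duplicates, the partition.

```python
import json
import os
import tempfile

def write_partition(target_dir, date, rows):
    """Overwrite a single date partition atomically: write to a temp file,
    then rename it into place. Rerunning leaves exactly one copy of the data."""
    part_dir = os.path.join(target_dir, f"date={date}")
    os.makedirs(part_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=part_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # os.replace is atomic on POSIX: readers never see a half-written file
    os.replace(tmp_path, os.path.join(part_dir, "part-0.json"))

target = tempfile.mkdtemp()
write_partition(target, "2024-05-01", [{"event_id": "e1"}])
write_partition(target, "2024-05-01", [{"event_id": "e1"}])  # safe rerun
print(os.listdir(os.path.join(target, "date=2024-05-01")))  # → ['part-0.json']
```

Object stores and warehouses offer equivalents (e.g. partition-scoped `INSERT OVERWRITE`); the invariant is the same: one partition, one atomic swap.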

Configuration

| Parameter | Description | Default |
|---|---|---|
| orchestrator | Pipeline orchestration tool | Airflow |
| processing_engine | Data processing framework | Spark |
| storage_format | Default file format | Parquet |
| warehouse | Target data warehouse | BigQuery |
| streaming_platform | Stream processing system | Kafka |
| quality_framework | Data quality tool | Great Expectations |
| transform_tool | SQL transformation tool | dbt |

Best Practices

  1. Make every pipeline idempotent. Running a pipeline twice with the same inputs should produce the same outputs without duplicates. Use partition-level overwrites instead of appends, deduplicate by natural keys before writing, and design transformations as pure functions of their inputs. Idempotency makes retry logic trivial and backfills safe.

  2. Separate ingestion from transformation. Land raw data exactly as received from the source system before applying any business logic. This preserves the original record for debugging and reprocessing. When business rules change, you can replay transformations against the raw layer without re-ingesting from source systems.

  3. Partition and cluster data by query patterns. Choose partition keys based on how data is filtered (usually date), and cluster keys based on how data is joined or grouped (usually entity IDs). Proper partitioning can reduce query costs by orders of magnitude in columnar warehouses where you pay per bytes scanned.

  4. Implement data quality checks as pipeline gates. Use tools like Great Expectations or dbt tests to validate data between pipeline stages. Check row counts, null rates, value ranges, and referential integrity. A pipeline that loads corrupt data silently is worse than one that fails loudly on quality violations.

  5. Monitor pipeline freshness, not just success. A pipeline that completes successfully but takes 6 hours instead of 30 minutes may be worse than a failure. Track execution duration, data freshness (time since last update), row count trends, and resource utilization. Alert on anomalies in any of these dimensions, not just on pipeline failures.
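The quality-gate idea from point 4 can be sketched without any framework. The thresholds and the `event_id` column below are illustrative assumptions; in practice you would encode the same checks as Great Expectations suites or dbt tests.

```python
def quality_gate(rows, min_rows=1, max_null_rate=0.05, required_cols=("event_id",)):
    """Fail the pipeline loudly when a batch violates basic expectations.
    Thresholds here are illustrative; tune them per dataset."""
    if len(rows) < min_rows:
        raise ValueError(f"row count {len(rows)} below minimum {min_rows}")
    for col in required_cols:
        nulls = sum(1 for r in rows if r.get(col) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            raise ValueError(f"null rate {rate:.1%} in '{col}' exceeds {max_null_rate:.1%}")
    return True

batch = [{"event_id": "e1"}, {"event_id": "e2"}, {"event_id": None}]
try:
    quality_gate(batch)  # 33% nulls in event_id -> gate fails
except ValueError as err:
    print("gate failed:", err)
```

Run the gate between stages (after ingestion, after transformation) so corrupt data stops at the boundary instead of propagating to the serving layer.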

Common Issues

Pipeline runs succeed but downstream queries return stale data. This happens when the pipeline writes to a staging location but the final swap or refresh step fails silently. Implement end-to-end data freshness monitoring that checks the maximum timestamp in the serving layer, not just whether the Airflow DAG completed. Add a final validation step that queries the target table and verifies fresh data exists.
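A minimal end-to-end freshness check might look like the following, here demonstrated against an in-memory SQLite table (the `events` table and `event_ts` column are stand-ins for your serving-layer table).

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def assert_fresh(conn, table, ts_col, max_age):
    """Verify the serving table actually contains recent data,
    independent of whether the orchestrator reported success."""
    (max_ts,) = conn.execute(f"SELECT MAX({ts_col}) FROM {table}").fetchone()
    if max_ts is None:
        raise RuntimeError(f"{table} is empty")
    age = datetime.now(timezone.utc) - datetime.fromisoformat(max_ts)
    if age > max_age:
        raise RuntimeError(f"{table} is stale: newest row is {age} old")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_ts TEXT)")
conn.execute("INSERT INTO events VALUES (?)",
             (datetime.now(timezone.utc).isoformat(),))
assert_fresh(conn, "events", "event_ts", timedelta(hours=1))  # passes
```

Schedule this as the final task of the DAG, and separately as an external monitor, so a silently failed swap step surfaces as a stale-data alert rather than going unnoticed.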

Costs spike unexpectedly in cloud data warehouses. Full table scans, unpartitioned tables, and redundant materialized views are common culprits. Audit query patterns monthly, add partition filters to all queries as a linting rule, and set up cost alerts per project or dataset. Consider replacing expensive scheduled queries with incremental models that only process new data.
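The partition-filter lint rule can start as something very crude. The regex check below is a sketch only (a production rule would parse the SQL, e.g. with sqlglot), and `event_date` is a hypothetical partition column.

```python
import re

PARTITION_COL = "event_date"  # hypothetical partition column

def lint_partition_filter(sql: str) -> bool:
    """Crude lint rule: a query against a partitioned table must
    filter on the partition column somewhere in its WHERE clause."""
    return re.search(rf"\bWHERE\b.*\b{PARTITION_COL}\b", sql,
                     re.IGNORECASE | re.DOTALL) is not None

good = "SELECT COUNT(*) FROM events WHERE event_date = '2024-05-01'"
bad = "SELECT COUNT(*) FROM events"
print(lint_partition_filter(good), lint_partition_filter(bad))  # → True False
```

Wire a check like this into CI for your dbt models or scheduled queries so full scans are caught at review time, not on the monthly bill.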

Schema changes in source systems break downstream pipelines. Implement schema evolution handling at the ingestion layer: detect new columns automatically, handle missing columns with defaults, and reject incompatible type changes. Use schema registries for streaming data. Notify the data team of schema changes rather than discovering them through pipeline failures.
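A sketch of that ingestion-layer reconciliation, with an illustrative expected schema (`event_id`, `amount`) and defaults, might look like:

```python
EXPECTED = {"event_id": str, "amount": float}  # hypothetical contract

def reconcile_schema(record, expected=EXPECTED, defaults=None):
    """Ingestion-layer schema handling: tolerate new and missing columns,
    reject incompatible type changes."""
    defaults = defaults or {"amount": 0.0}
    out, extras = {}, {}
    for col, typ in expected.items():
        if col in record:
            if not isinstance(record[col], typ):
                raise TypeError(f"column '{col}' changed type: expected "
                                f"{typ.__name__}, got {type(record[col]).__name__}")
            out[col] = record[col]
        else:
            out[col] = defaults.get(col)   # missing column -> default
    for col in record.keys() - expected.keys():
        extras[col] = record[col]          # new column -> surface for review
    return out, extras

row = {"event_id": "e1", "channel": "web"}  # 'amount' missing, 'channel' new
clean, new_cols = reconcile_schema(row)
print(clean, new_cols)  # → {'event_id': 'e1', 'amount': 0.0} {'channel': 'web'}
```

For streaming sources, a schema registry with compatibility rules enforces the same policy at publish time; the `extras` dict is what you would route to a notification channel instead of letting it break the pipeline.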
