
AgentCliptics · data ai · v1.0.0 · MIT

Data Engineer Assistant

An autonomous agent that helps design and implement data platforms, covering pipeline architecture, ETL/ELT development, data lake and warehouse design, stream processing, and cost optimization for scalable data infrastructure.

When to Use This Agent

Choose Data Engineer Assistant when:

  • Designing data pipeline architectures for batch and streaming workloads
  • Building ETL/ELT pipelines with proper error handling and monitoring
  • Setting up data lake or data warehouse infrastructure
  • Optimizing query performance and storage costs in data platforms
  • Implementing data quality frameworks and lineage tracking

Consider alternatives when:

  • Doing exploratory data analysis on existing datasets (use a data analyst agent)
  • Building ML models without infrastructure concerns (use an ML engineer agent)
  • Creating dashboards and reports (use a BI/analytics tool)

Quick Start

```yaml
# .claude/agents/data-engineer-assistant.yml
name: Data Engineer Assistant
model: claude-sonnet-4-20250514
tools:
  - Read
  - Write
  - Bash
  - Glob
  - Grep
prompt: |
  You are a senior data engineer. Design and implement scalable data
  platforms covering pipelines, storage, processing, and monitoring.
  Prioritize reliability, cost efficiency, and data quality throughout
  the stack.
```

Example invocation:

```shell
claude --agent data-engineer-assistant "Design a pipeline to ingest clickstream events from Kafka, transform them with user session attribution, and load into BigQuery partitioned by event date"
```

Core Concepts

Data Platform Architecture

```
Sources → Ingestion → Storage → Processing → Serving → Consumers
  │          │           │          │           │          │
  APIs    Kafka/       Data       Spark/      Warehouse  Dashboards
  DBs     Firehose     Lake       dbt/SQL     API Cache  ML Models
  Files   CDC          (raw)      Airflow     (curated)  Reports
  Events  Batch pull   (staged)   Streaming   Marts      Apps
```

Storage Layer Strategy

| Layer | Purpose | Format | Retention |
|---|---|---|---|
| Raw/Bronze | Exact copies of source data | JSON, Avro, Parquet | Long-term |
| Staged/Silver | Cleaned, deduplicated, typed | Parquet, Delta | Medium-term |
| Curated/Gold | Business-ready aggregations | Parquet, Delta | As needed |
| Serving | Low-latency query access | Materialized views | Refreshed |
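The bronze-to-silver promotion can be sketched in plain Python. This is a minimal, illustrative version: the `event_id`/`user_id`/`event_ts` columns and the typing rules are assumptions for the example, not a fixed contract.

```python
from datetime import datetime

def promote_to_silver(raw_records):
    """Clean, deduplicate, and type raw (bronze) records for the silver layer."""
    seen = set()
    silver = []
    for rec in raw_records:
        key = rec.get("event_id")
        if key is None or key in seen:  # drop keyless rows and duplicates
            continue
        seen.add(key)
        silver.append({
            "event_id": key,
            "user_id": str(rec.get("user_id", "")),          # enforce type
            "event_ts": datetime.fromisoformat(rec["event_ts"]),  # parse timestamp
        })
    return silver

bronze = [
    {"event_id": "e1", "user_id": 42, "event_ts": "2024-05-01T10:00:00"},
    {"event_id": "e1", "user_id": 42, "event_ts": "2024-05-01T10:00:00"},  # duplicate
    {"event_id": "e2", "user_id": 7,  "event_ts": "2024-05-01T10:05:00"},
]
silver = promote_to_silver(bronze)
print(len(silver))  # → 2
```

In a real platform the same shape of logic would run in Spark or dbt; the point is that each layer's contract (keys, types, dedup rules) is explicit.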

Pipeline Patterns

```python
# Idempotent pipeline pattern
def process_partition(date: str, source: str, target: str):
    """Process a single date partition idempotently."""
    raw = read_source(source, date)
    validated = validate_schema(raw, expected_schema)
    transformed = apply_business_rules(validated)
    deduped = deduplicate(transformed, key_cols=['event_id'])
    # Atomic write: overwrite the partition, not the table
    write_partition(target, date, deduped, mode='overwrite')
    log_metrics(date, row_count=len(deduped),
                quality_score=compute_quality(deduped))
```
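The atomic partition overwrite that makes this pattern safe to rerun can be sketched for a file-based target. The `date=<date>/part-0.json` layout is a hypothetical example; the key idea is writing to a temp file and renaming into place so a rerun replaces, never duplicates, the partition.

```python
import json
import os
import tempfile

def write_partition(target_dir, date, rows):
    """Overwrite a single date partition atomically: write to a temp file,
    then rename it into place. Rerunning leaves exactly one copy of the data."""
    part_dir = os.path.join(target_dir, f"date={date}")
    os.makedirs(part_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=part_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # os.replace is atomic on POSIX: readers never see a half-written file
    os.replace(tmp_path, os.path.join(part_dir, "part-0.json"))

target = tempfile.mkdtemp()
write_partition(target, "2024-05-01", [{"event_id": "e1"}])
write_partition(target, "2024-05-01", [{"event_id": "e1"}])  # safe rerun
print(os.listdir(os.path.join(target, "date=2024-05-01")))  # → ['part-0.json']
```

Object stores and warehouses offer equivalents (e.g. partition-scoped `INSERT OVERWRITE`); the invariant is the same: one partition, one atomic swap.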

Configuration

| Parameter | Description | Default |
|---|---|---|
| orchestrator | Pipeline orchestration tool | Airflow |
| processing_engine | Data processing framework | Spark |
| storage_format | Default file format | Parquet |
| warehouse | Target data warehouse | BigQuery |
| streaming_platform | Stream processing system | Kafka |
| quality_framework | Data quality tool | Great Expectations |
| transform_tool | SQL transformation tool | dbt |

Best Practices

  1. Make every pipeline idempotent. Running a pipeline twice with the same inputs should produce the same outputs without duplicates. Use partition-level overwrites instead of appends, deduplicate by natural keys before writing, and design transformations as pure functions of their inputs. Idempotency makes retry logic trivial and backfills safe.

  2. Separate ingestion from transformation. Land raw data exactly as received from the source system before applying any business logic. This preserves the original record for debugging and reprocessing. When business rules change, you can replay transformations against the raw layer without re-ingesting from source systems.

  3. Partition and cluster data by query patterns. Choose partition keys based on how data is filtered (usually date), and cluster keys based on how data is joined or grouped (usually entity IDs). Proper partitioning can reduce query costs by orders of magnitude in columnar warehouses where you pay per bytes scanned.

  4. Implement data quality checks as pipeline gates. Use tools like Great Expectations or dbt tests to validate data between pipeline stages. Check row counts, null rates, value ranges, and referential integrity. A pipeline that loads corrupt data silently is worse than one that fails loudly on quality violations.

  5. Monitor pipeline freshness, not just success. A pipeline that completes successfully but takes 6 hours instead of 30 minutes may be worse than a failure. Track execution duration, data freshness (time since last update), row count trends, and resource utilization. Alert on anomalies in any of these dimensions, not just on pipeline failures.
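The quality-gate idea from point 4 can be sketched without any framework. The thresholds and the `event_id` column below are illustrative assumptions; in practice you would encode the same checks as Great Expectations suites or dbt tests.

```python
def quality_gate(rows, min_rows=1, max_null_rate=0.05, required_cols=("event_id",)):
    """Fail the pipeline loudly when a batch violates basic expectations.
    Thresholds here are illustrative; tune them per dataset."""
    if len(rows) < min_rows:
        raise ValueError(f"row count {len(rows)} below minimum {min_rows}")
    for col in required_cols:
        nulls = sum(1 for r in rows if r.get(col) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            raise ValueError(f"null rate {rate:.1%} in '{col}' exceeds {max_null_rate:.1%}")
    return True

batch = [{"event_id": "e1"}, {"event_id": "e2"}, {"event_id": None}]
try:
    quality_gate(batch)  # 33% nulls in event_id -> gate fails
except ValueError as err:
    print("gate failed:", err)
```

Run the gate between stages (after ingestion, after transformation) so corrupt data stops at the boundary instead of propagating to the serving layer.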

Common Issues

Pipeline runs succeed but downstream queries return stale data. This happens when the pipeline writes to a staging location but the final swap or refresh step fails silently. Implement end-to-end data freshness monitoring that checks the maximum timestamp in the serving layer, not just whether the Airflow DAG completed. Add a final validation step that queries the target table and verifies fresh data exists.
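A minimal end-to-end freshness check might look like the following, here demonstrated against an in-memory SQLite table (the `events` table and `event_ts` column are stand-ins for your serving-layer table).

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def assert_fresh(conn, table, ts_col, max_age):
    """Verify the serving table actually contains recent data,
    independent of whether the orchestrator reported success."""
    (max_ts,) = conn.execute(f"SELECT MAX({ts_col}) FROM {table}").fetchone()
    if max_ts is None:
        raise RuntimeError(f"{table} is empty")
    age = datetime.now(timezone.utc) - datetime.fromisoformat(max_ts)
    if age > max_age:
        raise RuntimeError(f"{table} is stale: newest row is {age} old")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_ts TEXT)")
conn.execute("INSERT INTO events VALUES (?)",
             (datetime.now(timezone.utc).isoformat(),))
assert_fresh(conn, "events", "event_ts", timedelta(hours=1))  # passes
```

Schedule this as the final task of the DAG, and separately as an external monitor, so a silently failed swap step surfaces as a stale-data alert rather than going unnoticed.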

Costs spike unexpectedly in cloud data warehouses. Full table scans, unpartitioned tables, and redundant materialized views are common culprits. Audit query patterns monthly, add partition filters to all queries as a linting rule, and set up cost alerts per project or dataset. Consider replacing expensive scheduled queries with incremental models that only process new data.
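The partition-filter lint rule can start as something very crude. The regex check below is a sketch only (a production rule would parse the SQL, e.g. with sqlglot), and `event_date` is a hypothetical partition column.

```python
import re

PARTITION_COL = "event_date"  # hypothetical partition column

def lint_partition_filter(sql: str) -> bool:
    """Crude lint rule: a query against a partitioned table must
    filter on the partition column somewhere in its WHERE clause."""
    return re.search(rf"\bWHERE\b.*\b{PARTITION_COL}\b", sql,
                     re.IGNORECASE | re.DOTALL) is not None

good = "SELECT COUNT(*) FROM events WHERE event_date = '2024-05-01'"
bad = "SELECT COUNT(*) FROM events"
print(lint_partition_filter(good), lint_partition_filter(bad))  # → True False
```

Wire a check like this into CI for your dbt models or scheduled queries so full scans are caught at review time, not on the monthly bill.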

Schema changes in source systems break downstream pipelines. Implement schema evolution handling at the ingestion layer: detect new columns automatically, handle missing columns with defaults, and reject incompatible type changes. Use schema registries for streaming data. Notify the data team of schema changes rather than discovering them through pipeline failures.
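A sketch of that ingestion-layer reconciliation, with an illustrative expected schema (`event_id`, `amount`) and defaults, might look like:

```python
EXPECTED = {"event_id": str, "amount": float}  # hypothetical contract

def reconcile_schema(record, expected=EXPECTED, defaults=None):
    """Ingestion-layer schema handling: tolerate new and missing columns,
    reject incompatible type changes."""
    defaults = defaults or {"amount": 0.0}
    out, extras = {}, {}
    for col, typ in expected.items():
        if col in record:
            if not isinstance(record[col], typ):
                raise TypeError(f"column '{col}' changed type: expected "
                                f"{typ.__name__}, got {type(record[col]).__name__}")
            out[col] = record[col]
        else:
            out[col] = defaults.get(col)   # missing column -> default
    for col in record.keys() - expected.keys():
        extras[col] = record[col]          # new column -> surface for review
    return out, extras

row = {"event_id": "e1", "channel": "web"}  # 'amount' missing, 'channel' new
clean, new_cols = reconcile_schema(row)
print(clean, new_cols)  # → {'event_id': 'e1', 'amount': 0.0} {'channel': 'web'}
```

For streaming sources, a schema registry with compatibility rules enforces the same policy at publish time; the `extras` dict is what you would route to a notification channel instead of letting it break the pipeline.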
