CSV Data Analyzer Skill

Description

This skill ingests CSV files and analyzes them: statistical profiling, correlation detection, anomaly identification, and generation of visualization code. It handles data cleaning and type inference, and presents its findings and recommendations in a structured markdown report. It is designed for data analysts and developers who need quick insights from tabular data.

Instructions

Load & Profile: Read the CSV, infer column types, compute basic statistics
Clean: Handle missing values, detect encoding issues, normalize column names
Analyze: Compute correlations, distributions, outliers, and trends
Visualize: Generate Python/matplotlib code for key charts
Report: Produce a markdown report with findings and recommendations
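
The Clean step can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the helper names `normalize_column_names` and `missing_value_report` are hypothetical, and snake_case is only one possible naming convention. Name collisions after normalization are not handled here.

```python
import re

import pandas as pd


def normalize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with snake_case column names (illustrative convention)."""
    def to_snake(name: object) -> str:
        # Collapse every run of non-alphanumeric characters into one underscore.
        cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip())
        return cleaned.strip("_").lower()

    return df.rename(columns=to_snake)


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column without dropping or filling anything."""
    return pd.DataFrame({
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })


# Hypothetical sample data, used only to exercise the helpers.
raw = pd.DataFrame({
    " User ID": [1, 2, 3],
    "Sign-Up Date": ["2024-01-05", None, "2024-02-11"],
})
clean = normalize_column_names(raw)
print(list(clean.columns))
print(missing_value_report(clean))
```

Both helpers return new objects and leave the input frame untouched, so the original data stays available for comparison.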

Rules

Always show data shape (rows x columns) and memory usage first
Handle missing values transparently: report counts, don't silently drop rows
Infer and validate column data types (dates stored as strings, numbers as text, etc.)
Use appropriate statistical tests based on data type (categorical vs numerical)
For large files (>100K rows), sample for exploratory analysis and state the sample size and method in the report
Generate reproducible Python code for all visualizations
Flag potential data quality issues: duplicates, inconsistent formatting, outliers
Never assume the first row is a header; verify it
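
One way to approach the header rule is a small heuristic over the first few rows. The sketch below uses only the standard library. The function name `first_row_is_header` and its three-way result are assumptions made for illustration. It returns `None` when the evidence is inconclusive, so that the caller can ask the user instead of guessing.

```python
import csv
import io
from typing import Optional


def _is_number(text: str) -> bool:
    try:
        float(text)
    except ValueError:
        return False
    return True


def first_row_is_header(sample: str, max_rows: int = 50) -> Optional[bool]:
    """Guess whether the first row of a CSV text sample is a header.

    Returns True or False when the evidence is clear, None when it is not.
    """
    rows = [row for row in csv.reader(io.StringIO(sample)) if row][: max_rows + 1]
    if len(rows) < 2:
        return None  # a single row cannot be compared against anything

    first, body = rows[0], rows[1:]

    # Header cells are normally non-empty text labels, not numbers.
    if any(not cell.strip() or _is_number(cell) for cell in first):
        return False

    # A text label sitting on top of an all-numeric column points to a header.
    for idx in range(len(first)):
        values = [row[idx] for row in body if idx < len(row) and row[idx].strip()]
        if values and all(_is_number(value) for value in values):
            return True

    return None  # all-text table: cannot tell from the values alone


print(first_row_is_header("name,age\nalice,30\nbob,25\n"))
print(first_row_is_header("1,2\n3,4\n"))
print(first_row_is_header("a,b\nc,d\n"))
```

The result can then drive the load call, for example `header=0` when a header is detected and `header=None` when it is not.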

Analysis Pipeline


import pandas as pd
import numpy as np

def analyze_csv(filepath: str) -> dict:
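    # 1. Load with type inference.
    # Note: read_csv uses the first row as the header by default, and this call
    # assumes UTF-8; verify both before trusting the column names and values.
    df = pd.read_csv(filepath, encoding='utf-8', low_memory=False)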

    report = {
        'shape': df.shape,
        'memory_mb': df.memory_usage(deep=True).sum() / 1e6,
        'columns': {},
        'quality_issues': [],
    }

    # 2. Column profiling
    for col in df.columns:
        profile = {
            'dtype': str(df[col].dtype),
            'missing': int(df[col].isna().sum()),
            'missing_pct': round(df[col].isna().mean() * 100, 1),
            'unique': int(df[col].nunique()),
        }
        if pd.api.types.is_numeric_dtype(df[col]):
            profile.update({
                'mean': round(df[col].mean(), 2),
                'median': round(df[col].median(), 2),
                'std': round(df[col].std(), 2),
                'min': df[col].min(),
                'max': df[col].max(),
                'skew': round(df[col].skew(), 2),
            })
        elif pd.api.types.is_string_dtype(df[col]):
            profile['top_values'] = df[col].value_counts().head(5).to_dict()

        report['columns'][col] = profile

    # 3. Quality checks: count fully duplicated rows
    dupes = int(df.duplicated().sum())
    if dupes > 0:
        report['quality_issues'].append(f"{dupes} duplicate rows found")

    return report
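
The function above covers loading, profiling, and one quality check. The Analyze step and the large-file sampling rule could be sketched along these lines. All three helper names are hypothetical. The 0.7 correlation threshold is an illustrative choice. The 3-standard-deviation cutoff mirrors the "3 std from mean" wording in the report template.

```python
import pandas as pd


def sample_for_exploration(df: pd.DataFrame, max_rows: int = 100_000, seed: int = 0):
    """Return (frame, note); note is None when no sampling was needed."""
    if len(df) <= max_rows:
        return df, None
    sampled = df.sample(n=max_rows, random_state=seed)
    note = f"Sampled {max_rows:,} of {len(df):,} rows (random_state={seed})"
    return sampled, note


def strong_correlations(df: pd.DataFrame, threshold: float = 0.7) -> list:
    """List numeric column pairs whose absolute Pearson r meets the threshold."""
    corr = df.select_dtypes(include="number").corr()  # Pearson by default
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            r = corr.loc[left, right]
            if pd.notna(r) and abs(r) >= threshold:
                pairs.append((left, right, round(float(r), 2)))
    return pairs


def zscore_outliers(series: pd.Series, z: float = 3.0) -> pd.Series:
    """Return the values lying more than z standard deviations from the mean."""
    values = series.dropna()
    std = values.std()
    if pd.isna(std) or std == 0:
        return values.iloc[0:0]  # constant or near-empty column: nothing to flag
    return values[(values - values.mean()).abs() > z * std]


# Hypothetical data, used only to exercise the helpers.
frame = pd.DataFrame({
    "tenure": range(1, 21),
    "revenue": [2 * t + 1 for t in range(1, 21)],
})
subset, note = sample_for_exploration(frame, max_rows=5)
print(note)
print(strong_correlations(frame))
print(zscore_outliers(pd.Series([10.0] * 30 + [1000.0])))
```

The sampling note is returned alongside the frame so that it can be copied into the report, which keeps the sampling visible to the reader. A z-score cutoff is a simple default; for heavily skewed columns an IQR-based rule may be a better fit.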

Report Template


# Data Analysis Report: [filename]

## Overview
| Metric | Value |
|--------|-------|
| Rows | 45,231 |
| Columns | 12 |
| Memory | 8.2 MB |
| Duplicate Rows | 23 (0.05%) |
| Total Missing Values | 1,847 (0.34%) |

## Column Summary
| Column | Type | Missing | Unique | Top Value / Mean |
|--------|------|---------|--------|------------------|
| user_id | int64 | 0 (0%) | 45,208 | - |
| email | object | 12 (0.03%) | 44,891 | - |
| signup_date | datetime64[ns] | 0 (0%) | 892 | - |
| revenue | float64 | 156 (0.34%) | 8,234 | mean: $47.82 |
| plan_type | object | 0 (0%) | 3 | "free" (68%) |

## Key Findings
1. **Revenue Distribution**: Right-skewed (skew: 2.3), median $32 vs mean $48
2. **Correlation**: Strong positive correlation (r=0.82) between tenure and revenue
3. **Anomaly**: 47 users with revenue > $500 (>3 std from mean)
4. **Missing Pattern**: 89% of missing revenue values are from "free" plan users

## Visualizations
[Python matplotlib code blocks for each chart]

## Recommendations
- Investigate 47 high-revenue outlier accounts
- Impute missing revenue as $0 for free-tier users
- Remove 23 duplicate rows before production use
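
The Visualizations section of the template is filled with generated chart code. A minimal sketch of such a generator is shown here, assuming a hypothetical helper `histogram_code` and a fixed output file name `histogram.png`. It only builds the code as text, so it runs without matplotlib installed.

```python
from textwrap import dedent


def histogram_code(csv_path: str, column: str, bins: int = 30) -> str:
    """Return standalone matplotlib code that plots one column's histogram."""
    title = f"Distribution of {column}"
    # repr() (!r) quotes and escapes the path, column name, and title safely.
    return dedent(f"""\
        import matplotlib.pyplot as plt
        import pandas as pd

        df = pd.read_csv({csv_path!r})
        values = df[{column!r}].dropna()

        fig, ax = plt.subplots(figsize=(8, 4))
        ax.hist(values, bins={bins})
        ax.set_title({title!r})
        ax.set_xlabel({column!r})
        ax.set_ylabel("Count")
        fig.tight_layout()
        fig.savefig("histogram.png", dpi=150)
        """)


snippet = histogram_code("users.csv", "revenue")
compile(snippet, "<generated chart>", "exec")  # syntax check only; nothing is plotted
print(snippet)
```

Because the emitted snippet reloads the CSV itself, it can be pasted into the report and rerun on its own, which is what makes the chart reproducible.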

Examples

"Analyze this CSV file and give me key insights"
"Profile the data quality of users.csv"
"Find correlations between columns in sales_data.csv"
"Identify outliers in the revenue column"
"Generate visualization code for the top trends in this dataset"
