CSV Data Analyzer Skill
Analyzes CSV files to generate statistical summaries, detect patterns, identify anomalies, and produce visualizations. Handles large datasets with pandas-based analysis and outputs markdown reports with inline chart code.
Description
This skill ingests CSV files and performs comprehensive data analysis including statistical profiling, correlation detection, anomaly identification, and visualization generation. It handles data cleaning, type inference, and produces actionable insights in a structured report format. Designed for data analysts and developers who need quick insights from tabular data.
Instructions
- Load & Profile: Read the CSV, infer column types, compute basic statistics
- Clean: Handle missing values, detect encoding issues, normalize column names
- Analyze: Compute correlations, distributions, outliers, and trends
- Visualize: Generate Python/matplotlib code for key charts
- Report: Produce a markdown report with findings and recommendations
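As a rough illustration of the Clean step, the sketch below normalizes column names, counts missing values without dropping rows, and flags numbers stored as text. The helper names (`normalize_column_name`, `clean`, `numeric_stored_as_text`) and the 0.95 threshold are assumptions made for this example, not part of the skill itself.

```python
import io
import re

import pandas as pd


def normalize_column_name(name: str) -> str:
    """Lowercase, trim, and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip().lower())
    return cleaned.strip("_")


def clean(df: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
    """Return a renamed copy plus a {column: missing_count} report.

    Rows are never dropped here; missing values are only counted.
    """
    out = df.rename(columns=normalize_column_name)
    missing = out.isna().sum()
    missing_report = {col: int(n) for col, n in missing.items() if n > 0}
    return out, missing_report


def numeric_stored_as_text(series: pd.Series, min_ratio: float = 0.95) -> bool:
    """True if a text column mostly parses as numbers (hypothetical threshold)."""
    is_text = (pd.api.types.is_object_dtype(series)
               or pd.api.types.is_string_dtype(series))
    non_null = series.dropna()
    if not is_text or non_null.empty:
        return False
    parsed = pd.to_numeric(non_null, errors="coerce")
    return bool(parsed.notna().mean() >= min_ratio)


raw = io.StringIO("User ID, Sign-Up Date ,Revenue ($)\n1,2024-01-05,10.5\n2,,\n")
df, missing_report = clean(pd.read_csv(raw))
print(list(df.columns))   # ['user_id', 'sign_up_date', 'revenue']
print(missing_report)     # {'sign_up_date': 1, 'revenue': 1}
```

Two different raw names can normalize to the same string, so a real implementation would probably want to check for collisions after renaming.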
Rules
- Always show data shape (rows x columns) and memory usage first
- Handle missing values transparently -- report counts, don't silently drop
- Infer and validate column data types; flag mismatches such as dates stored as strings or numbers stored as text
- Use appropriate statistical tests based on data type (categorical vs numerical)
- For large files (>100K rows), sample for exploratory analysis and note the sampling
- Generate reproducible Python code for all visualizations
- Flag potential data quality issues: duplicates, inconsistent formatting, outliers
- Never assume the first row is a header -- verify
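One possible way to apply the header rule is the standard library's `csv.Sniffer`, sketched below. The wrapper name `looks_like_header` is an assumption for this example, and `has_header` is only a heuristic, so its answer is best reported to the user rather than trusted silently.

```python
import csv


def looks_like_header(sample: str) -> bool:
    """Heuristic check: does the first row of `sample` look like a header?

    csv.Sniffer.has_header compares the first row against the types and
    lengths of the rows that follow; it can be wrong on unusual files.
    """
    try:
        return csv.Sniffer().has_header(sample)
    except csv.Error:
        # The sniffer could not determine the dialect; assume no header.
        return False


with_header = "user_id,revenue\n1,10.5\n2,20.0\n3,7.25\n"
without_header = "1,10.5\n2,20.0\n3,7.25\n4,1.0\n"

print(looks_like_header(with_header))     # True
print(looks_like_header(without_header))  # False
```

The result can then drive the load, for example by passing `header=0` or `header=None` to `pd.read_csv`.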
Analysis Pipeline
```python
import pandas as pd
import numpy as np

def analyze_csv(filepath: str) -> dict:
    # 1. Load with type inference
    df = pd.read_csv(filepath, encoding='utf-8', low_memory=False)

    report = {
        'shape': df.shape,
        'memory_mb': df.memory_usage(deep=True).sum() / 1e6,
        'columns': {},
        'quality_issues': [],
    }

    # 2. Column profiling
    for col in df.columns:
        profile = {
            'dtype': str(df[col].dtype),
            'missing': int(df[col].isna().sum()),
            'missing_pct': round(df[col].isna().mean() * 100, 1),
            'unique': int(df[col].nunique()),
        }

        if pd.api.types.is_numeric_dtype(df[col]):
            profile.update({
                'mean': round(df[col].mean(), 2),
                'median': round(df[col].median(), 2),
                'std': round(df[col].std(), 2),
                'min': df[col].min(),
                'max': df[col].max(),
                'skew': round(df[col].skew(), 2),
            })
        elif pd.api.types.is_string_dtype(df[col]):
            profile['top_values'] = df[col].value_counts().head(5).to_dict()

        report['columns'][col] = profile

    # 3. Quality checks
    dupes = df.duplicated().sum()
    if dupes > 0:
        report['quality_issues'].append(f"{dupes} duplicate rows found")

    return report
```
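The `analyze_csv` function stops after the quality checks. A minimal sketch of how the Analyze step might continue is shown below, covering sampling of large frames, z-score outliers, and strong correlations. The function names, the 0.7 correlation cutoff, and the fixed seed are assumptions for illustration; the 100K-row and 3-standard-deviation values are taken from the rules and the report template.

```python
import numpy as np
import pandas as pd


def sample_if_large(df, max_rows=100_000, seed=0):
    """Return (frame, note); sample only when the frame exceeds max_rows."""
    if len(df) <= max_rows:
        return df, None
    note = f"Sampled {max_rows:,} of {len(df):,} rows (seed={seed})"
    return df.sample(n=max_rows, random_state=seed), note


def zscore_outliers(series, threshold=3.0):
    """Return values lying more than `threshold` std devs from the mean."""
    values = series.dropna()
    std = values.std()
    if not std or np.isnan(std):
        return values.iloc[0:0]  # constant or single-value column: nothing to flag
    z = (values - values.mean()) / std
    return values[z.abs() > threshold]


def strong_correlations(df, min_abs_r=0.7):
    """List (col_a, col_b, r) for numeric column pairs with |r| >= min_abs_r."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if pd.notna(r) and abs(r) >= min_abs_r:
                pairs.append((a, b, round(float(r), 2)))
    return pairs


revenue = pd.Series([10.0] * 29 + [500.0], name="revenue")
print(zscore_outliers(revenue).tolist())  # [500.0]
```

The z-score rule assumes a roughly symmetric distribution; for heavily skewed columns such as revenue, an IQR-based rule may be a better fit. Any sampling note returned by `sample_if_large` should appear in the report.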
Report Template
```markdown
# Data Analysis Report: [filename]

## Overview

| Metric | Value |
|--------|-------|
| Rows | 45,231 |
| Columns | 12 |
| Memory | 8.2 MB |
| Duplicate Rows | 23 (0.05%) |
| Total Missing Values | 1,847 (0.34%) |

## Column Summary

| Column | Type | Missing | Unique | Top Value / Mean |
|--------|------|---------|--------|------------------|
| user_id | int64 | 0 (0%) | 45,208 | - |
| email | object | 12 (0.03%) | 44,891 | - |
| signup_date | datetime | 0 (0%) | 892 | - |
| revenue | float64 | 156 (0.34%) | 8,234 | mean: $47.82 |
| plan_type | object | 0 (0%) | 3 | "free" (68%) |

## Key Findings

1. **Revenue Distribution**: Right-skewed (skew: 2.3), median $32 vs mean $48
2. **Correlation**: Strong positive correlation (r=0.82) between tenure and revenue
3. **Anomaly**: 47 users with revenue > $500 (>3 std from mean)
4. **Missing Pattern**: 89% of missing revenue values are from "free" plan users

## Visualizations

[Python matplotlib code blocks for each chart]

## Recommendations

- Investigate 47 high-revenue outlier accounts
- Impute missing revenue as $0 for free-tier users
- Remove 23 duplicate rows before production use
```
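To show how parts of this template could be filled in from code, here is a minimal sketch that renders the Overview and Column Summary tables from a dict shaped like the one `analyze_csv` returns, and embeds generated chart code under Visualizations. The names `histogram_code` and `render_report`, the `chart_snippets` parameter, and the output file naming are assumptions for this example.

```python
FENCE = "`" * 3  # built indirectly so this example nests cleanly in markdown


def histogram_code(filepath, column, bins=30):
    """Return a standalone matplotlib script that plots one column's histogram."""
    title = f"Distribution of {column}"
    outfile = f"{column}_hist.png"
    return "\n".join([
        "import matplotlib.pyplot as plt",
        "import pandas as pd",
        "",
        f"df = pd.read_csv({filepath!r})",
        "fig, ax = plt.subplots(figsize=(8, 5))",
        f"ax.hist(df[{column!r}].dropna(), bins={bins})",
        f"ax.set_title({title!r})",
        f"ax.set_xlabel({column!r})",
        "ax.set_ylabel('Count')",
        "fig.tight_layout()",
        f"fig.savefig({outfile!r})",
    ])


def render_report(filename, report, chart_snippets=()):
    """Render an analyze_csv-style dict as a markdown report fragment."""
    rows, cols = report["shape"]
    lines = [
        f"# Data Analysis Report: {filename}",
        "",
        "## Overview",
        "",
        "| Metric | Value |",
        "|--------|-------|",
        f"| Rows | {rows:,} |",
        f"| Columns | {cols} |",
        f"| Memory | {report['memory_mb']:.1f} MB |",
        "",
        "## Column Summary",
        "",
        "| Column | Type | Missing | Unique |",
        "|--------|------|---------|--------|",
    ]
    for name, p in report["columns"].items():
        lines.append(
            f"| {name} | {p['dtype']} | {p['missing']} ({p['missing_pct']}%) "
            f"| {p['unique']:,} |"
        )
    lines += ["", "## Visualizations"]
    for snippet in chart_snippets:
        lines += ["", FENCE + "python", snippet, FENCE]
    return "\n".join(lines)
```

Emitting the chart as a self-contained script that reloads the CSV keeps each visualization reproducible on its own, in line with the rule about reproducible visualization code. Column names containing characters that are invalid in file names would need sanitizing before being used in `outfile`.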
Examples
"Analyze this CSV file and give me key insights"
"Profile the data quality of users.csv"
"Find correlations between columns in sales_data.csv"
"Identify outliers in the revenue column"
"Generate visualization code for the top trends in this dataset"