CSV Data Analyzer Skill

Analyzes CSV files to generate statistical summaries, detect patterns, identify anomalies, and produce visualizations. Handles large datasets with pandas-based analysis and outputs markdown reports with inline chart code.

Skill | Community | data science | v1.0.0 | MIT

Description

This skill ingests CSV files and performs comprehensive data analysis including statistical profiling, correlation detection, anomaly identification, and visualization generation. It handles data cleaning, type inference, and produces actionable insights in a structured report format. Designed for data analysts and developers who need quick insights from tabular data.

Instructions

  1. Load & Profile: Read the CSV, infer column types, compute basic statistics
  2. Clean: Handle missing values, detect encoding issues, normalize column names
  3. Analyze: Compute correlations, distributions, outliers, and trends
  4. Visualize: Generate Python/matplotlib code for key charts
  5. Report: Produce a markdown report with findings and recommendations
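
As a rough illustration of the "normalize column names" part of step 2, the sketch below converts raw headers to snake_case and keeps them unique. The function name `normalize_column_names`, the snake_case target, the `column_<position>` placeholder for blank headers, and the numeric suffix for duplicates are all assumptions made for this example; the skill itself does not prescribe a naming scheme.

```python
import re


def normalize_column_names(names):
    """Return snake_case names; duplicates get a numeric suffix (hypothetical helper)."""
    seen = {}
    result = []
    for position, raw in enumerate(names):
        # Lowercase, then collapse every run of characters outside [0-9a-z] into "_".
        # Assumption: ASCII-only output is acceptable; non-ASCII letters are dropped.
        name = re.sub(r"[^0-9a-z]+", "_", str(raw).strip().lower()).strip("_")
        if not name:
            name = f"column_{position}"  # placeholder for blank headers
        count = seen.get(name, 0)
        seen[name] = count + 1
        # First occurrence keeps the plain name; repeats become name_2, name_3, ...
        result.append(name if count == 0 else f"{name}_{count + 1}")
    return result


print(normalize_column_names([" User ID ", "Sign-up Date", "revenue ($)", "user id", ""]))
```

With a DataFrame in hand this could be applied as `df.columns = normalize_column_names(df.columns)`. A suffixed name can still collide with a header that already looks like `user_id_2`, so the report should list any renamed columns.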

Rules

  • Always show data shape (rows x columns) and memory usage first
  • Handle missing values transparently -- report counts, don't silently drop
  • Infer and validate column data types (dates stored as strings, numbers as text, etc.)
  • Use appropriate statistical tests based on data type (categorical vs numerical)
  • For large files (>100K rows), sample for exploratory analysis and state the sample size and sampling method in the report
  • Generate reproducible Python code for all visualizations
  • Flag potential data quality issues: duplicates, inconsistent formatting, outliers
  • Never assume the first row is a header -- verify
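
One possible way to apply the last rule is the standard library's `csv.Sniffer.has_header`, which guesses from a text sample whether the first row is a header. It is only a heuristic, so the result should be reported rather than trusted blindly. The helper names `looks_like_header` and `first_row_is_header`, the 64 KiB sample size, and the UTF-8 default are assumptions for this sketch.

```python
import csv


def looks_like_header(sample: str):
    """Return True/False for the header guess, or None if the sample cannot be sniffed."""
    try:
        return csv.Sniffer().has_header(sample)
    except csv.Error:
        # Raised when the sniffer cannot determine a delimiter (e.g. empty input).
        return None


def first_row_is_header(filepath: str, sample_chars: int = 64 * 1024,
                        encoding: str = "utf-8"):
    """Apply the heuristic to the first `sample_chars` characters of a file."""
    with open(filepath, newline="", encoding=encoding) as handle:
        return looks_like_header(handle.read(sample_chars))


print(looks_like_header("name,age\nalice,30\nbob,25\ncarol,41\n"))
print(looks_like_header("alice,30\nbob,25\ncarol,41\n"))
```

The result can then drive the load step: `pd.read_csv(filepath, header=0)` when a header is detected, `pd.read_csv(filepath, header=None)` otherwise, with a note in the report when the guess was inconclusive.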

Analysis Pipeline

    import pandas as pd
    import numpy as np

    def analyze_csv(filepath: str) -> dict:
        # 1. Load with type inference
        df = pd.read_csv(filepath, encoding='utf-8', low_memory=False)

        report = {
            'shape': df.shape,
            'memory_mb': df.memory_usage(deep=True).sum() / 1e6,
            'columns': {},
            'quality_issues': [],
        }

        # 2. Column profiling
        for col in df.columns:
            profile = {
                'dtype': str(df[col].dtype),
                'missing': int(df[col].isna().sum()),
                'missing_pct': round(df[col].isna().mean() * 100, 1),
                'unique': int(df[col].nunique()),
            }

            if pd.api.types.is_numeric_dtype(df[col]):
                profile.update({
                    'mean': round(df[col].mean(), 2),
                    'median': round(df[col].median(), 2),
                    'std': round(df[col].std(), 2),
                    'min': df[col].min(),
                    'max': df[col].max(),
                    'skew': round(df[col].skew(), 2),
                })
            elif pd.api.types.is_string_dtype(df[col]):
                profile['top_values'] = df[col].value_counts().head(5).to_dict()

            report['columns'][col] = profile

        # 3. Quality checks
        dupes = df.duplicated().sum()
        if dupes > 0:
            report['quality_issues'].append(f"{dupes} duplicate rows found")

        return report
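
The pipeline above stops at profiling and duplicate checks. A minimal sketch of the "Analyze" step (outliers and correlations) might look like the following. The helper names `flag_outliers` and `strong_correlations`, the 3-standard-deviation cutoff, and the |r| >= 0.8 threshold are assumptions chosen for illustration, not part of the original skill.

```python
import pandas as pd


def flag_outliers(series: pd.Series, z_threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where a value lies more than z_threshold std from the mean."""
    std = series.std()
    if pd.isna(std) or std == 0:
        # Constant or near-empty column: nothing can be flagged.
        return pd.Series(False, index=series.index)
    z_scores = (series - series.mean()) / std
    # Missing values produce NaN z-scores, which compare as False here.
    return z_scores.abs() > z_threshold


def strong_correlations(df: pd.DataFrame, min_abs_r: float = 0.8) -> list:
    """List (col_a, col_b, r) for numeric column pairs with |Pearson r| >= min_abs_r."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[left, right]
            if pd.notna(r) and abs(r) >= min_abs_r:
                pairs.append((left, right, round(float(r), 2)))
    return pairs
```

A z-score rule is sensitive to the very outliers it tries to find and works poorly on heavily skewed columns, so an IQR-based rule may be a better fit for data like the right-skewed revenue column in the report template.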

Report Template

    # Data Analysis Report: [filename]

    ## Overview

    | Metric | Value |
    |--------|-------|
    | Rows | 45,231 |
    | Columns | 12 |
    | Memory | 8.2 MB |
    | Duplicate Rows | 23 (0.05%) |
    | Total Missing Values | 1,847 (0.34%) |

    ## Column Summary

    | Column | Type | Missing | Unique | Top Value / Mean |
    |--------|------|---------|--------|------------------|
    | user_id | int64 | 0 (0%) | 45,208 | - |
    | email | object | 12 (0.03%) | 44,891 | - |
    | signup_date | datetime | 0 (0%) | 892 | - |
    | revenue | float64 | 156 (0.34%) | 8,234 | mean: $47.82 |
    | plan_type | object | 0 (0%) | 3 | "free" (68%) |

    ## Key Findings

    1. **Revenue Distribution**: Right-skewed (skew: 2.3), median $32 vs mean $48
    2. **Correlation**: Strong positive correlation (r=0.82) between tenure and revenue
    3. **Anomaly**: 47 users with revenue > $500 (>3 std from mean)
    4. **Missing Pattern**: 89% of missing revenue values are from "free" plan users

    ## Visualizations

    [Python matplotlib code blocks for each chart]

    ## Recommendations

    - Investigate 47 high-revenue outlier accounts
    - Impute missing revenue as $0 for free-tier users
    - Remove 23 duplicate rows before production use
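
To connect the template to the `analyze_csv` output, the Overview section could be rendered along these lines. This is a sketch only: `render_overview` is a hypothetical helper, and it assumes the report dictionary has the `shape`, `memory_mb`, and `columns` keys produced by the pipeline code, with a `missing` count per column. The Duplicate Rows line is omitted because the pipeline stores duplicates only as a message string.

```python
def render_overview(report: dict, filename: str) -> str:
    """Render the title and Overview table of the report template as markdown."""
    rows, cols = report["shape"]
    total_missing = sum(col["missing"] for col in report["columns"].values())
    cells = rows * cols
    # Share of all cells that are missing; guard against an empty file.
    missing_pct = (total_missing / cells * 100) if cells else 0.0
    lines = [
        f"# Data Analysis Report: {filename}",
        "",
        "## Overview",
        "",
        "| Metric | Value |",
        "|--------|-------|",
        f"| Rows | {rows:,} |",
        f"| Columns | {cols} |",
        f"| Memory | {report['memory_mb']:.1f} MB |",
        f"| Total Missing Values | {total_missing:,} ({missing_pct:.2f}%) |",
    ]
    return "\n".join(lines)


example_report = {
    "shape": (45231, 12),
    "memory_mb": 8.2,
    "columns": {"email": {"missing": 12}, "revenue": {"missing": 156}},
    "quality_issues": [],
}
print(render_overview(example_report, "users.csv"))
```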

Examples

"Analyze this CSV file and give me key insights"
"Profile the data quality of users.csv"
"Find correlations between columns in sales_data.csv"
"Identify outliers in the revenue column"
"Generate visualization code for the top trends in this dataset"