Smart CSV Data Summarizer Toolkit
Streamline your workflow with this skill for analyzing CSV files and generating comprehensive insights. Built for Claude Code with best practices and real-world patterns.
Comprehensive CSV analysis toolkit that reads, parses, summarizes, and transforms CSV data with statistical analysis, pattern detection, and automated report generation.
When to Use This Skill
Choose CSV Data Summarizer when:
- You need quick statistical summaries of CSV datasets
- Exploring unfamiliar data files for structure and content patterns
- Generating reports from raw CSV exports
- Cleaning and validating CSV data quality
- Comparing multiple CSV files for differences or trends
Consider alternatives when:
- Data is in databases — use SQL queries directly
- Complex transformations need pandas or dbt pipelines
- Real-time streaming data rather than static files
Quick Start
```bash
# Activate the skill
claude skill activate smart-csv-data-summarizer-toolkit

# Get a summary of a CSV file
claude "Summarize the data in sales_q4_2024.csv"

# Find patterns in data
claude "Analyze customer_data.csv and identify any anomalies"
```
Example Analysis Output
```typescript
import * as fs from 'fs';
import { parse } from 'csv-parse/sync';

// Load and parse CSV
const csvContent = fs.readFileSync('sales_data.csv', 'utf-8');
const records = parse(csvContent, { columns: true, skip_empty_lines: true });

// Generate summary statistics for one numeric column
function summarizeColumn(records: any[], column: string) {
  const values = records.map(r => parseFloat(r[column])).filter(v => !isNaN(v));
  const sorted = [...values].sort((a, b) => a - b);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return {
    count: values.length,
    nullCount: records.length - values.length,
    min: Math.min(...values),
    max: Math.max(...values),
    mean,
    median: sorted[Math.floor(sorted.length / 2)],
    stdDev: Math.sqrt(
      values.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0) / values.length
    ),
  };
}

// Detect column types from a sample of rows
function inferColumnType(records: any[], column: string): string {
  const sample = records.slice(0, 100).map(r => r[column]);
  if (sample.every(v => /^\d+$/.test(v))) return 'integer';
  if (sample.every(v => /^\d+\.\d+$/.test(v))) return 'float';
  if (sample.every(v => /^\d{4}-\d{2}-\d{2}/.test(v))) return 'date';
  if (sample.every(v => /^(true|false)$/i.test(v))) return 'boolean';
  return 'string';
}
```
Core Concepts
Analysis Capabilities
| Feature | Description | Output |
|---|---|---|
| Schema Detection | Auto-detect column names, types, and constraints | Schema table |
| Statistical Summary | Mean, median, mode, stddev, quartiles per numeric column | Stats table |
| Null Analysis | Count and percentage of missing values per column | Quality report |
| Cardinality Check | Unique value counts for categorical columns | Distribution table |
| Outlier Detection | Values beyond 2 standard deviations from mean | Outlier list |
| Pattern Matching | Regex patterns for string columns (emails, phones, dates) | Pattern report |
| Correlation Matrix | Pearson correlation between numeric columns | Correlation table |
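The correlation matrix is built from pairwise Pearson coefficients. A minimal sketch of the underlying computation (the helper name `pearson` is assumed for illustration, not part of the toolkit's API):

```typescript
// Pearson correlation between two equal-length numeric arrays.
// Returns a value in [-1, 1]; NaN if either column has zero variance.
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0, dx2 = 0, dy2 = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    num += dx * dy;   // covariance numerator
    dx2 += dx * dx;   // variance terms
    dy2 += dy * dy;
  }
  return num / Math.sqrt(dx2 * dy2);
}
```

Running this over every pair of numeric columns fills the correlation table.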
Data Quality Checks
| Check | Description | Severity |
|---|---|---|
| Missing Values | Columns with >10% null values | Warning |
| Duplicate Rows | Exact row duplicates | Error |
| Type Mismatches | Values that don't match inferred column type | Error |
| Outliers | Values beyond configurable standard deviation threshold | Warning |
| Encoding Issues | Non-UTF8 characters, BOM markers | Info |
| Delimiter Conflicts | Field values containing the delimiter character | Error |
| Header Issues | Duplicate column names, empty headers | Error |
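The Duplicate Rows check above reduces to an exact-match scan over parsed rows. A sketch of one way to do it (`findDuplicateRows` is a hypothetical helper, not a documented API):

```typescript
// Return the indices of rows that exactly duplicate an earlier row.
function findDuplicateRows(rows: string[][]): number[] {
  const seen = new Map<string, number>();
  const dupes: number[] = [];
  rows.forEach((row, i) => {
    const key = JSON.stringify(row); // exact match on every field
    if (seen.has(key)) dupes.push(i);
    else seen.set(key, i);
  });
  return dupes;
}
```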
# Common CSV operations # Count rows wc -l data.csv # Preview structure head -5 data.csv | column -t -s, # Extract specific columns cut -d, -f1,3,5 data.csv # Sort by column sort -t, -k3 -n data.csv # Remove duplicates sort -u data.csv > deduped.csv # Filter rows matching pattern awk -F, '$4 > 1000' data.csv
Configuration
| Parameter | Description | Default |
|---|---|---|
| delimiter | CSV field delimiter | , (auto-detect) |
| has_header | First row contains column names | true |
| encoding | File encoding | utf-8 |
| sample_size | Rows to sample for type inference | 1000 |
| outlier_threshold | Standard deviations for outlier detection | 2.0 |
| max_cardinality | Max unique values to enumerate for categoricals | 50 |
| null_values | Strings treated as null | ["", "NULL", "N/A", "n/a", "-"] |
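As a sketch, the configuration table could map onto a typed options object like the following (the interface name and camelCase field names are assumptions for illustration; defaults mirror the table):

```typescript
// Assumed shape of the summarizer configuration, with the table's defaults.
interface SummarizerConfig {
  delimiter: string;         // field delimiter, auto-detected if left as ","
  hasHeader: boolean;        // first row contains column names
  encoding: string;          // file encoding passed to the reader
  sampleSize: number;        // rows sampled for type inference
  outlierThreshold: number;  // std devs for outlier flagging
  maxCardinality: number;    // max unique values enumerated per categorical
  nullValues: string[];      // strings treated as null
}

const defaults: SummarizerConfig = {
  delimiter: ",",
  hasHeader: true,
  encoding: "utf-8",
  sampleSize: 1000,
  outlierThreshold: 2.0,
  maxCardinality: 50,
  nullValues: ["", "NULL", "N/A", "n/a", "-"],
};
```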
Best Practices
- Always preview the file structure before full analysis — Check the first 5-10 rows to verify delimiter, header presence, encoding, and quoting. Assumptions about CSV format cause silent data corruption.
- Handle encoding explicitly — Many CSV exports from Excel use Windows-1252 or Latin-1 encoding rather than UTF-8. Specify encoding upfront to prevent garbled characters in names, addresses, and descriptions.
- Validate data types after parsing, not before — Let the parser read everything as strings first, then apply type inference and conversion. This catches fields that mix types (like ZIP codes with leading zeros).
- Profile missing data patterns — Missing values are rarely random. Check if nulls concentrate in specific rows, time periods, or categories. Systematic gaps often indicate upstream data pipeline failures.
- Use checksums for data integrity — When processing CSV files in pipelines, compute row counts and column checksums at each stage. Mismatches between stages reveal silent data loss or corruption.
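The checksum practice could look like the following at each pipeline stage (a sketch; `stageChecksum` is a hypothetical helper built on Node's built-in crypto module):

```typescript
import { createHash } from "crypto";

// Compute a row count plus a SHA-256 digest over all field values.
// Comparing these between pipeline stages reveals silent data loss.
function stageChecksum(rows: string[][]): { rowCount: number; digest: string } {
  const hash = createHash("sha256");
  for (const row of rows) hash.update(row.join("\u0000")); // NUL-separated fields
  return { rowCount: rows.length, digest: hash.digest("hex") };
}
```

If the digest or row count differs between two stages, some row or field changed in between.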
Common Issues
CSV file has inconsistent row lengths (some rows have more or fewer fields than the header). This usually means unquoted field values contain the delimiter character. Re-parse with proper quoting enabled, or use a library that handles RFC 4180 compliant parsing. Check for newlines embedded within quoted fields, which many basic parsers mishandle.
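To illustrate why quoting matters, here is a minimal RFC 4180-style field splitter for a single record (a sketch only; a real parser must also handle newlines inside quoted fields, which this single-line version does not):

```typescript
// Split one CSV record, honoring double quotes so delimiters inside
// quoted fields don't split, and "" inside quotes means a literal quote.
function splitRecord(line: string, delimiter = ","): string[] {
  const fields: string[] = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"') {
        if (line[i + 1] === '"') { field += '"'; i++; } // escaped quote
        else inQuotes = false;                          // closing quote
      } else field += ch;
    } else if (ch === '"') inQuotes = true;
    else if (ch === delimiter) { fields.push(field); field = ""; }
    else field += ch;
  }
  fields.push(field);
  return fields;
}
```

A naive `line.split(",")` would break `a,"b,c",d` into four fields; this version keeps three.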
Numeric columns are parsed as strings due to formatting. Currency symbols ($, €), thousand separators (1,000), percentage signs (50%), and scientific notation (1.5e3) all prevent automatic numeric parsing. Strip formatting characters before conversion, and handle locale-specific decimal separators (period vs. comma).
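One possible cleanup step, assuming a period decimal separator (the function name is hypothetical, and real locale handling needs more care than this):

```typescript
// Strip common formatting before numeric conversion.
// Returns null when the value still isn't numeric after cleanup.
function parseFormattedNumber(raw: string): number | null {
  const cleaned = raw.trim()
    .replace(/[$€£]/g, "")          // currency symbols
    .replace(/,(?=\d{3}\b)/g, "");  // thousand separators like 1,000
  const isPercent = cleaned.endsWith("%");
  const n = Number(isPercent ? cleaned.slice(0, -1) : cleaned);
  if (Number.isNaN(n)) return null;
  return isPercent ? n / 100 : n;   // 50% -> 0.5
}
```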
Large CSV files exhaust memory during analysis. Stream the file instead of loading it entirely. Use line-by-line processing for summaries (running statistics), or sample a representative subset. For files over 1GB, consider converting to Parquet format first for columnar access and compression.
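Running statistics make the line-by-line approach concrete: Welford's algorithm computes mean and variance in one pass with O(1) memory, so no rows need to be retained. A sketch:

```typescript
// One-pass mean/variance accumulator (Welford's algorithm).
// Feed it values as rows stream in; memory use stays constant.
class RunningStats {
  private n = 0;
  private mean = 0;
  private m2 = 0; // sum of squared deviations from the running mean

  push(x: number): void {
    this.n++;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }

  get count(): number { return this.n; }
  get average(): number { return this.mean; }
  get variance(): number { return this.n > 1 ? this.m2 / this.n : 0; }
}
```

One accumulator per numeric column is enough to produce the summary table for arbitrarily large files.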