
Smart CSV Data Summarizer Toolkit

Streamline your workflow with this skill for analyzing CSV files and generating comprehensive insights. Built for Claude Code with best practices and real-world patterns.

Skill · Community · data · v1.0.0 · MIT

CSV Data Summarizer Toolkit

Comprehensive CSV analysis toolkit that reads, parses, summarizes, and transforms CSV data with statistical analysis, pattern detection, and automated report generation.

When to Use This Skill

Choose CSV Data Summarizer when:

  • You need quick statistical summaries of CSV datasets
  • Exploring unfamiliar data files for structure and content patterns
  • Generating reports from raw CSV exports
  • Cleaning and validating CSV data quality
  • Comparing multiple CSV files for differences or trends

Consider alternatives when:

  • Data is in databases — use SQL queries directly
  • Complex transformations need pandas or dbt pipelines
  • You're working with real-time streaming data rather than static files

Quick Start

# Activate the skill
claude skill activate smart-csv-data-summarizer-toolkit

# Get a summary of a CSV file
claude "Summarize the data in sales_q4_2024.csv"

# Find patterns in data
claude "Analyze customer_data.csv and identify any anomalies"

Example Analysis Output

import * as fs from 'fs';
import { parse } from 'csv-parse/sync';

// Load and parse CSV
const csvContent = fs.readFileSync('sales_data.csv', 'utf-8');
const records = parse(csvContent, { columns: true, skip_empty_lines: true });

// Generate summary statistics for a numeric column
function summarizeColumn(records: any[], column: string) {
  const values = records.map(r => parseFloat(r[column])).filter(v => !isNaN(v));
  const sorted = [...values].sort((a, b) => a - b);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return {
    count: values.length,
    nullCount: records.length - values.length,
    min: Math.min(...values),
    max: Math.max(...values),
    mean,
    median: sorted[Math.floor(sorted.length / 2)],
    stdDev: Math.sqrt(
      values.reduce((sum, v) => sum + Math.pow(v - mean, 2), 0) / values.length
    ),
  };
}

// Detect column types from a sample of rows
function inferColumnType(records: any[], column: string): string {
  const sample = records.slice(0, 100).map(r => r[column]);
  if (sample.every(v => /^\d+$/.test(v))) return 'integer';
  if (sample.every(v => /^\d+\.\d+$/.test(v))) return 'float';
  if (sample.every(v => /^\d{4}-\d{2}-\d{2}/.test(v))) return 'date';
  if (sample.every(v => /^(true|false)$/i.test(v))) return 'boolean';
  return 'string';
}

Core Concepts

Analysis Capabilities

Feature | Description | Output
Schema Detection | Auto-detect column names, types, and constraints | Schema table
Statistical Summary | Mean, median, mode, stddev, quartiles per numeric column | Stats table
Null Analysis | Count and percentage of missing values per column | Quality report
Cardinality Check | Unique value counts for categorical columns | Distribution table
Outlier Detection | Values beyond 2 standard deviations from the mean | Outlier list
Pattern Matching | Regex patterns for string columns (emails, phones, dates) | Pattern report
Correlation Matrix | Pearson correlation between numeric columns | Correlation table
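
The correlation matrix in the table above reduces to pairwise Pearson coefficients over the numeric columns. A minimal sketch of that computation (the column list is whatever type inference reported as numeric; nothing here is part of the skill's public API):

// Pearson correlation between two equal-length numeric arrays
function pearson(x: number[], y: number[]): number {
  const n = Math.min(x.length, y.length);
  const meanX = x.reduce((a, b) => a + b, 0) / n;
  const meanY = y.reduce((a, b) => a + b, 0) / n;
  let cov = 0, varX = 0, varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}

// Correlation matrix over numeric columns, using only rows where both values parse
function correlationMatrix(records: any[], numericColumns: string[]): number[][] {
  return numericColumns.map(a =>
    numericColumns.map(b => {
      const pairs = records
        .map(r => [parseFloat(r[a]), parseFloat(r[b])])
        .filter(([x, y]) => !isNaN(x) && !isNaN(y));
      return pearson(pairs.map(p => p[0]), pairs.map(p => p[1]));
    })
  );
}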

Data Quality Checks

Check | Description | Severity
Missing Values | Columns with >10% null values | Warning
Duplicate Rows | Exact row duplicates | Error
Type Mismatches | Values that don't match the inferred column type | Error
Outliers | Values beyond a configurable standard deviation threshold | Warning
Encoding Issues | Non-UTF-8 characters, BOM markers | Info
Delimiter Conflicts | Field values containing the delimiter character | Error
Header Issues | Duplicate column names, empty headers | Error

# Common CSV operations

# Count rows
wc -l data.csv

# Preview structure
head -5 data.csv | column -t -s,

# Extract specific columns
cut -d, -f1,3,5 data.csv

# Sort numerically by the third column
sort -t, -k3 -n data.csv

# Remove duplicate rows
sort -u data.csv > deduped.csv

# Keep only rows where column 4 exceeds 1000
awk -F, '$4 > 1000' data.csv
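
The row-level checks from the table above can run on the same parsed records as the earlier TypeScript example. A minimal sketch of the duplicate-row and missing-value checks, assuming records were parsed with columns: true (the 10% threshold mirrors the table's default):

// Flag exact duplicate rows by serializing each record
function findDuplicateRows(records: any[]): number[] {
  const seen = new Set<string>();
  const duplicates: number[] = [];
  records.forEach((record, index) => {
    const key = JSON.stringify(record);
    if (seen.has(key)) duplicates.push(index);
    else seen.add(key);
  });
  return duplicates;
}

// Report columns whose share of null-like values exceeds a threshold
const NULL_VALUES = new Set(['', 'NULL', 'N/A', 'n/a', '-']);

function missingValueReport(records: any[], threshold = 0.1) {
  const columns = Object.keys(records[0] ?? {});
  return columns
    .map(column => {
      const nulls = records.filter(r => NULL_VALUES.has(String(r[column] ?? '').trim())).length;
      return { column, nullRatio: nulls / records.length };
    })
    .filter(entry => entry.nullRatio > threshold);
}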

Configuration

Parameter | Description | Default
delimiter | CSV field delimiter | , (auto-detect)
has_header | First row contains column names | true
encoding | File encoding | utf-8
sample_size | Rows to sample for type inference | 1000
outlier_threshold | Standard deviations for outlier detection | 2.0
max_cardinality | Max unique values to enumerate for categoricals | 50
null_values | Strings treated as null | ["", "NULL", "N/A", "n/a", "-"]
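
Expressed as a typed options object, the defaults above might look like the sketch below. The interface name and shape are illustrative only, not the skill's actual configuration format:

// Illustrative configuration shape mirroring the table above
interface SummarizerConfig {
  delimiter?: string;        // auto-detected when omitted
  has_header: boolean;
  encoding: string;
  sample_size: number;
  outlier_threshold: number;
  max_cardinality: number;
  null_values: string[];
}

const defaultConfig: SummarizerConfig = {
  has_header: true,
  encoding: 'utf-8',
  sample_size: 1000,
  outlier_threshold: 2.0,
  max_cardinality: 50,
  null_values: ['', 'NULL', 'N/A', 'n/a', '-'],
};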

Best Practices

  1. Always preview the file structure before full analysis — Check the first 5-10 rows to verify delimiter, header presence, encoding, and quoting. Assumptions about CSV format cause silent data corruption.

  2. Handle encoding explicitly — Many CSV exports from Excel use Windows-1252 or Latin-1 encoding rather than UTF-8. Specify encoding upfront to prevent garbled characters in names, addresses, and descriptions.

  3. Validate data types after parsing, not before — Let the parser read everything as strings first, then apply type inference and conversion. This catches fields that mix types (like ZIP codes with leading zeros). A sketch combining this with explicit encoding handling follows this list.

  4. Profile missing data patterns — Missing values are rarely random. Check if nulls concentrate in specific rows, time periods, or categories. Systematic gaps often indicate upstream data pipeline failures.

  5. Use checksums for data integrity — When processing CSV files in pipelines, compute row counts and column checksums at each stage. Mismatches between stages reveal silent data loss or corruption.
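
Practices 2 and 3 work together: read raw bytes, decode with an explicit encoding, let the parser return plain strings, then convert deliberately. A minimal sketch assuming a Windows-1252 Excel export and the same csv-parse dependency used earlier (the file name and the leading-zero rule are illustrative assumptions):

import * as fs from 'fs';
import { parse } from 'csv-parse/sync';

// Decode explicitly rather than assuming UTF-8 (typical for Excel exports)
const raw = fs.readFileSync('export.csv');                 // hypothetical file name
const text = new TextDecoder('windows-1252').decode(raw);

// Parse everything as strings first; no implicit casting
const rows = parse(text, { columns: true, skip_empty_lines: true });

// Convert deliberately, keeping identifier-like fields (ZIP codes) as strings
function toNumberOrNull(value: string): number | null {
  if (/^0\d+$/.test(value)) return null;  // leading zero: treat as an identifier, not a number
  const n = Number(value);
  return Number.isFinite(n) ? n : null;
}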

Common Issues

CSV file has inconsistent row lengths (some rows have more or fewer fields than the header). This usually means unquoted field values contain the delimiter character. Re-parse with proper quoting enabled, or use a library that handles RFC 4180 compliant parsing. Check for newlines embedded within quoted fields, which many basic parsers mishandle.
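
With csv-parse, ragged rows normally abort parsing with a record-length error. One way to inspect the damage before fixing the source is to parse to raw arrays and relax the column-count check; this is a diagnostic sketch, not a production setting:

import * as fs from 'fs';
import { parse } from 'csv-parse/sync';

const csvContent = fs.readFileSync('data.csv', 'utf-8');

// Parse to arrays (no header mapping) so ragged rows stay visible
const rows: string[][] = parse(csvContent, {
  skip_empty_lines: true,
  relax_column_count: true,   // return short/long rows instead of throwing
});

// Count rows that deviate from the header width
const headerWidth = rows[0]?.length ?? 0;
const ragged = rows.filter(r => r.length !== headerWidth);
console.log(`${ragged.length} of ${rows.length} rows have an unexpected field count`);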

Numeric columns are parsed as strings due to formatting. Currency symbols ($, EUR), thousand separators (1,000), percentage signs (50%), and scientific notation (1.5e3) all prevent automatic numeric parsing. Strip formatting characters before conversion and handle locale-specific decimal separators (period vs comma).
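
A small normalization pass handles most of these cases before conversion. A sketch assuming period as the decimal separator (swap the comma handling for European locales):

// Strip common formatting before numeric conversion
function parseFormattedNumber(raw: string): number | null {
  const isPercent = /%\s*$/.test(raw);
  const cleaned = raw.trim()
    .replace(/[$€£]|EUR|USD/gi, '')   // currency symbols and codes
    .replace(/,(?=\d{3}\b)/g, '')     // thousand separators like 1,000
    .replace(/%\s*$/, '');            // trailing percent sign
  const n = Number(cleaned);          // Number() accepts scientific notation like 1.5e3
  if (!Number.isFinite(n)) return null;
  return isPercent ? n / 100 : n;
}

// parseFormattedNumber('$1,250.50') -> 1250.5
// parseFormattedNumber('50%')       -> 0.5
// parseFormattedNumber('1.5e3')     -> 1500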

Large CSV files exhaust memory during analysis. Stream the file instead of loading it entirely. Use line-by-line processing for summaries (running statistics), or sample a representative subset. For files over 1GB, consider converting to Parquet format first for columnar access and compression.
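
The streaming API of csv-parse pairs well with running statistics, so memory stays flat regardless of file size. A sketch using Welford's online algorithm for mean and standard deviation (the file and column names are placeholders):

import * as fs from 'fs';
import { parse } from 'csv-parse';

// Welford's online algorithm: update mean/variance one row at a time
let count = 0, mean = 0, m2 = 0;

fs.createReadStream('big_file.csv')
  .pipe(parse({ columns: true, skip_empty_lines: true }))
  .on('data', (row: Record<string, string>) => {
    const value = parseFloat(row['amount']);   // 'amount' is a placeholder column
    if (isNaN(value)) return;
    count++;
    const delta = value - mean;
    mean += delta / count;
    m2 += delta * (value - mean);
  })
  .on('end', () => {
    const stdDev = count > 1 ? Math.sqrt(m2 / count) : 0;
    console.log({ count, mean, stdDev });
  });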
