# Excel Analysis Complete
A comprehensive skill for analyzing Excel files programmatically — covering data loading with openpyxl and pandas, formula auditing, pivot table creation, chart generation, multi-sheet analysis, and automated report building.
## When to Use This Skill
Choose Excel Analysis Complete when you need to:
- Read and analyze complex Excel workbooks with multiple sheets
- Audit formulas and identify calculation errors
- Create pivot tables and summary statistics from raw data
- Generate charts and visualizations from Excel data
- Build automated reporting pipelines from Excel inputs
Consider alternatives when:
- You need to create Excel files from scratch (use an XLSX creation skill)
- You're working with CSV files only (use a data analysis skill)
- You need real-time Excel manipulation (use Excel VBA or add-ins)
## Quick Start

```bash
# Install dependencies
pip install pandas openpyxl xlsxwriter matplotlib
```

```python
import pandas as pd

# Load workbook with multiple sheets
xlsx = pd.ExcelFile("sales_report.xlsx")
print(f"Sheets: {xlsx.sheet_names}")

# Read specific sheet with header detection
df = pd.read_excel(xlsx, sheet_name="Q4 Sales", header=0)
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Quick analysis
print(f"\nTotal Revenue: ${df['Revenue'].sum():,.2f}")
print(f"Avg Deal Size: ${df['Revenue'].mean():,.2f}")
print(f"Top Region: {df.groupby('Region')['Revenue'].sum().idxmax()}")

# Pivot table
pivot = df.pivot_table(
    values="Revenue",
    index="Region",
    columns="Product",
    aggfunc="sum",
    margins=True,
)
print(f"\nRevenue by Region x Product:\n{pivot}")
```
## Core Concepts

### Excel Data Loading Strategies
| Method | Best For | Handles Formulas |
|---|---|---|
| `pd.read_excel()` | Flat data, single sheet | Values only |
| `openpyxl` read | Multi-sheet, formatting, formulas | Yes |
| `pd.ExcelFile()` | Multiple sheets, lazy loading | Values only |
| `xlrd` | Legacy `.xls` files | Values only |
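The "Handles Formulas" distinction can be verified directly: openpyxl exposes both views of a cell through its `data_only` flag. A minimal, self-contained sketch (the `demo.xlsx` filename is illustrative):

```python
from openpyxl import Workbook, load_workbook

# Build a tiny workbook so the example is self-contained
wb = Workbook()
ws = wb.active
ws["A1"] = 2
ws["A2"] = 3
ws["A3"] = "=SUM(A1:A2)"
wb.save("demo.xlsx")

# data_only=False (the default) preserves the formula string
formulas = load_workbook("demo.xlsx")
print(formulas.active["A3"].value)  # "=SUM(A1:A2)"

# data_only=True returns the value Excel last cached for the cell.
# openpyxl never evaluates formulas itself, so a file that Excel has
# never opened and saved has no cached value: this prints None.
values = load_workbook("demo.xlsx", data_only=True)
print(values.active["A3"].value)
```

The `None` result is a common surprise: `data_only=True` only works on files that Excel (or another calculating engine) has saved at least once.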
### Formula Auditing

```python
from openpyxl import load_workbook

# Load with formulas preserved (not calculated values)
wb = load_workbook("financial_model.xlsx")
ws = wb["P&L"]

# Find all cells with formulas
formula_cells = []
for row in ws.iter_rows(min_row=1, max_row=ws.max_row):
    for cell in row:
        if isinstance(cell.value, str) and cell.value.startswith("="):
            formula_cells.append({
                "cell": cell.coordinate,
                "formula": cell.value,
                "sheet": ws.title,
            })

print(f"Found {len(formula_cells)} formulas")
for fc in formula_cells[:10]:
    print(f"  {fc['cell']}: {fc['formula']}")

# Detect common formula issues
for fc in formula_cells:
    formula = fc["formula"]
    if "#REF" in formula:
        print(f"  ⚠️ Broken reference: {fc['cell']}")
    if formula.count("(") != formula.count(")"):
        print(f"  ⚠️ Unbalanced parens: {fc['cell']}")
```
### Multi-Sheet Analysis

```python
import pandas as pd

def analyze_workbook(filepath):
    """Analyze all sheets in a workbook."""
    xlsx = pd.ExcelFile(filepath)
    report = {}
    for sheet_name in xlsx.sheet_names:
        df = pd.read_excel(xlsx, sheet_name=sheet_name)
        report[sheet_name] = {
            "rows": len(df),
            "columns": len(df.columns),
            "numeric_cols": list(df.select_dtypes("number").columns),
            "null_pct": df.isnull().mean().mean() * 100,
            "memory_mb": df.memory_usage(deep=True).sum() / 1e6,
        }
    return report

# Generate workbook summary
summary = analyze_workbook("sales_report.xlsx")
for sheet, info in summary.items():
    print(f"\n{sheet}: {info['rows']} rows, {info['columns']} cols")
    print(f"  Numeric: {info['numeric_cols']}")
    print(f"  Missing: {info['null_pct']:.1f}%")
```
## Configuration

| Parameter | Description | Example |
|---|---|---|
| `file_path` | Path to Excel file | `"./data/report.xlsx"` |
| `sheet_name` | Target sheet (name or index) | `"Q4 Sales"` or `0` |
| `header_row` | Row containing column headers | `0` (first row) |
| `skip_rows` | Rows to skip at top | `2` |
| `use_formulas` | Preserve formulas vs. calculated values | `true` (openpyxl only) |
| `output_format` | Analysis output format | `"xlsx"` / `"csv"` |
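These parameters are configuration conventions rather than a fixed API. As a sketch, they might be mapped onto `pd.read_excel` keywords like this (the `load_with_config` helper and the demo file name are hypothetical, not part of the skill):

```python
import pandas as pd

# Hypothetical helper: maps the configuration parameters above onto
# pandas keyword arguments. Illustration only.
def load_with_config(config):
    return pd.read_excel(
        config["file_path"],
        sheet_name=config.get("sheet_name", 0),
        header=config.get("header_row", 0),
        skiprows=config.get("skip_rows", 0),
    )

# Demo: write a small file, then load it through the config
pd.DataFrame({"Region": ["East", "West"], "Revenue": [100, 200]}).to_excel(
    "config_demo.xlsx", sheet_name="Q4 Sales", index=False
)
df = load_with_config({"file_path": "config_demo.xlsx", "sheet_name": "Q4 Sales"})
print(df.shape)  # (2, 2)
```

Note that `use_formulas` has no `pd.read_excel` equivalent; as the table says, preserving formulas requires the openpyxl path.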
## Best Practices

- **Always inspect sheet structure before parsing** — Excel files frequently have merged cells, multi-row headers, blank rows, and inconsistent formatting. Load the first 10 rows with `nrows=10` and print them to verify header positions and data layout before writing analysis code.
- **Use `openpyxl` for structure, `pandas` for analysis** — Read formatting, formulas, and merged-cell information with openpyxl, then convert to pandas DataFrames for statistical analysis. Trying to do statistical work in openpyxl is unnecessarily complex.
- **Handle merged cells explicitly** — Merged cells in Excel become `None` in all but the top-left cell when read by openpyxl. Forward-fill these values with `df.ffill()` after loading (the older `df.fillna(method='ffill')` form is deprecated in pandas 2.x), or unmerge cells in openpyxl before converting to a DataFrame.
- **Validate numeric columns after loading** — Excel columns that look numeric may contain hidden text values, spaces, or special characters. Use `pd.to_numeric(df['col'], errors='coerce')` and check how many values became NaN to detect data quality issues.
- **Write analysis results to new sheets, not the source file** — Never overwrite the input file. Save results to a new file or add new sheets to a copy. This preserves the original data and prevents accidental corruption of shared workbooks.
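The numeric-validation practice above can be sketched in a few lines (the sample column values are invented for illustration):

```python
import pandas as pd

# A column that "looks numeric" but hides text and padding
df = pd.DataFrame({"Revenue": ["1200", "3400.5", "N/A", " 560 ", ""]})

# errors="coerce" turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(df["Revenue"], errors="coerce")
bad = numeric.isna().sum()
print(f"{bad} of {len(numeric)} values failed numeric conversion")  # 2 of 5
```

Whitespace-padded numbers like `" 560 "` still convert; only genuinely non-numeric entries become NaN, which makes the NaN count a quick data-quality signal.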
## Common Issues
**Dates load as serial numbers** — Excel stores dates as integer serial numbers internally. When pandas loads them, they sometimes appear as integers (e.g., `45275` instead of `2023-12-15`). Use `pd.to_datetime(df['date_col'], unit='D', origin='1899-12-30')` to convert Excel serial dates, or ensure `parse_dates` is specified during loading.
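To make the epoch arithmetic concrete, a minimal check (the serial values are chosen for illustration):

```python
import pandas as pd

# Excel's 1900 date system counts days from 1899-12-30; the two-day
# offset from 1900-01-01 absorbs Excel's historical leap-year bug.
serials = pd.Series([45275, 45292])
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print([str(d.date()) for d in dates])  # ['2023-12-15', '2024-01-01']
```

Legacy `.xls` files may use the 1904 date system instead (origin `1904-01-01`), so it is worth spot-checking one known date before converting a whole column.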
**Merged cells cause misaligned data** — Merged cells span multiple rows/columns but only store the value in the top-left cell. When reading with pandas, merged rows appear as NaN. Detect merged ranges with `ws.merged_cells.ranges` in openpyxl, then unmerge and forward-fill before analysis.
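A self-contained sketch of that detect–unmerge–fill sequence, using an in-memory workbook so the failure mode is visible:

```python
from openpyxl import Workbook

# Build a sheet with a merged range
wb = Workbook()
ws = wb.active
ws["A1"] = "East"
ws.merge_cells("A1:A3")  # "East" now spans three rows
ws["B1"], ws["B2"], ws["B3"] = 100, 200, 300

# Only the top-left cell holds the value; the rest read back as None
print([ws.cell(row=r, column=1).value for r in (1, 2, 3)])
# ['East', None, None]

# Unmerge each range, then copy the anchor value into the vacated cells
for rng in list(ws.merged_cells.ranges):  # copy: we mutate while iterating
    anchor = ws.cell(row=rng.min_row, column=rng.min_col).value
    ws.unmerge_cells(str(rng))
    for row in ws[rng.coord]:
        for cell in row:
            cell.value = anchor

print([ws.cell(row=r, column=1).value for r in (1, 2, 3)])
# ['East', 'East', 'East']
```

After this pass the sheet converts cleanly to a DataFrame with no spurious NaN rows.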
**Memory errors on large workbooks** — A 500MB Excel file will use several GB of RAM when loaded into pandas. Use `usecols` to load only needed columns, `nrows` to limit rows during exploration, and `dtype` specifications to reduce memory. For very large files, consider converting to Parquet first: `df.to_parquet("data.parquet")`.
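As a rough sketch of those three knobs together (the demo file and its size are invented; real savings depend on the data):

```python
import pandas as pd

# Write a small demo file standing in for a large workbook
pd.DataFrame({
    "Region": ["East", "West"] * 50,
    "Revenue": range(100),
    "Notes": ["long free text"] * 100,
}).to_excel("big_demo.xlsx", index=False)

df = pd.read_excel(
    "big_demo.xlsx",
    usecols=["Region", "Revenue"],  # skip columns we don't need
    nrows=10,                       # peek at the top during exploration
    dtype={"Revenue": "int32"},     # half the footprint of int64
)
print(df.shape)  # (10, 2)
```

Note that Excel files must still be fully parsed to find the requested cells, so `nrows` saves memory more than time; Parquet conversion is the better fix for repeated reads.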