
Excel Analysis Complete

A comprehensive skill for analyzing Excel files programmatically — covering data loading with openpyxl and pandas, formula auditing, pivot table creation, chart generation, multi-sheet analysis, and automated report building.

When to Use This Skill

Choose Excel Analysis Complete when you need to:

  • Read and analyze complex Excel workbooks with multiple sheets
  • Audit formulas and identify calculation errors
  • Create pivot tables and summary statistics from raw data
  • Generate charts and visualizations from Excel data
  • Build automated reporting pipelines from Excel inputs

Consider alternatives when:

  • You need to create Excel files from scratch (use an XLSX creation skill)
  • You're working with CSV files only (use a data analysis skill)
  • You need real-time Excel manipulation (use Excel VBA or add-ins)

Quick Start

```bash
# Install dependencies
pip install pandas openpyxl xlsxwriter matplotlib
```

```python
import pandas as pd
from pathlib import Path

# Load workbook with multiple sheets
xlsx = pd.ExcelFile("sales_report.xlsx")
print(f"Sheets: {xlsx.sheet_names}")

# Read specific sheet with header detection
df = pd.read_excel(xlsx, sheet_name="Q4 Sales", header=0)
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Quick analysis
print(f"\nTotal Revenue: ${df['Revenue'].sum():,.2f}")
print(f"Avg Deal Size: ${df['Revenue'].mean():,.2f}")
print(f"Top Region: {df.groupby('Region')['Revenue'].sum().idxmax()}")

# Pivot table
pivot = df.pivot_table(
    values="Revenue",
    index="Region",
    columns="Product",
    aggfunc="sum",
    margins=True,
)
print(f"\nRevenue by Region x Product:\n{pivot}")
```

Core Concepts

Excel Data Loading Strategies

| Method | Best For | Handles Formulas |
|---|---|---|
| `pd.read_excel()` | Flat data, single sheet | Values only |
| `openpyxl` read | Multi-sheet, formatting, formulas | Yes |
| `pd.ExcelFile()` | Multiple sheets, lazy loading | Values only |
| `xlrd` | Legacy `.xls` files | Values only |

Formula Auditing

```python
from openpyxl import load_workbook

# Load with formulas preserved (data_only=False is the default),
# not calculated values
wb = load_workbook("financial_model.xlsx")
ws = wb["P&L"]

# Find all cells with formulas
formula_cells = []
for row in ws.iter_rows(min_row=1, max_row=ws.max_row):
    for cell in row:
        if isinstance(cell.value, str) and cell.value.startswith("="):
            formula_cells.append({
                "cell": cell.coordinate,
                "formula": cell.value,
                "sheet": ws.title,
            })

print(f"Found {len(formula_cells)} formulas")
for fc in formula_cells[:10]:
    print(f"  {fc['cell']}: {fc['formula']}")

# Detect common formula issues
for fc in formula_cells:
    formula = fc["formula"]
    if "#REF" in formula:
        print(f"  ⚠️ Broken reference: {fc['cell']}")
    if formula.count("(") != formula.count(")"):
        print(f"  ⚠️ Unbalanced parens: {fc['cell']}")
```

Multi-Sheet Analysis

```python
import pandas as pd

def analyze_workbook(filepath):
    """Analyze all sheets in a workbook."""
    xlsx = pd.ExcelFile(filepath)
    report = {}
    for sheet_name in xlsx.sheet_names:
        df = pd.read_excel(xlsx, sheet_name=sheet_name)
        report[sheet_name] = {
            "rows": len(df),
            "columns": len(df.columns),
            "numeric_cols": list(df.select_dtypes("number").columns),
            "null_pct": df.isnull().mean().mean() * 100,
            "memory_mb": df.memory_usage(deep=True).sum() / 1e6,
        }
    return report

# Generate workbook summary
summary = analyze_workbook("sales_report.xlsx")
for sheet, info in summary.items():
    print(f"\n{sheet}: {info['rows']} rows, {info['columns']} cols")
    print(f"  Numeric: {info['numeric_cols']}")
    print(f"  Missing: {info['null_pct']:.1f}%")
```

Configuration

| Parameter | Description | Example |
|---|---|---|
| `file_path` | Path to Excel file | `"./data/report.xlsx"` |
| `sheet_name` | Target sheet (name or index) | `"Q4 Sales"` or `0` |
| `header_row` | Row containing column headers | `0` (first row) |
| `skip_rows` | Rows to skip at top | `2` |
| `use_formulas` | Preserve formulas vs. calculated values | `true` (openpyxl only) |
| `output_format` | Analysis output format | `"xlsx"` / `"csv"` |
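
One way these parameters might map onto `pd.read_excel()` keyword arguments (a sketch; the demo file, its contents, and the `config` dict are invented — `header_row` and `skip_rows` are this skill's names, not pandas argument names):

```python
import pandas as pd

# Fabricate a workbook whose real header sits below two junk rows,
# matching header_row=0 / skip_rows=2 from the table above
raw = pd.DataFrame([
    ["Report generated 2024-01-05", None],
    [None, None],
    ["Region", "Revenue"],
    ["East", 1200],
    ["West", 800],
])
raw.to_excel("demo_config.xlsx", sheet_name="Q4 Sales", index=False, header=False)

config = {
    "file_path": "demo_config.xlsx",
    "sheet_name": "Q4 Sales",
    "header_row": 0,   # header is the first row *after* skipping
    "skip_rows": 2,
}

df = pd.read_excel(
    config["file_path"],
    sheet_name=config["sheet_name"],
    header=config["header_row"],
    skiprows=config["skip_rows"],
)
print(list(df.columns))
```

Note that `header` is counted after `skiprows` is applied, which is why `header_row` stays `0` even with two skipped rows.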

Best Practices

  1. Always inspect sheet structure before parsing — Excel files frequently have merged cells, multi-row headers, blank rows, and inconsistent formatting. Load the first 10 rows with nrows=10 and print them to verify header positions and data layout before writing analysis code.

  2. Use openpyxl for structure, pandas for analysis — Read formatting, formulas, and merged cell information with openpyxl, then convert to pandas DataFrames for statistical analysis. Trying to do statistical work in openpyxl is unnecessarily complex.

  3. Handle merged cells explicitly — Merged cells in Excel become None in all but the top-left cell when read by openpyxl. Forward-fill these values with `df.ffill()` after loading (the older `fillna(method='ffill')` form is deprecated), or unmerge cells in openpyxl before converting to a DataFrame.

  4. Validate numeric columns after loading — Excel columns that look numeric may contain hidden text values, spaces, or special characters. Use pd.to_numeric(df['col'], errors='coerce') and check how many values became NaN to detect data quality issues.

  5. Write analysis results to new sheets, not the source file — Never overwrite the input file. Save results to a new file or add new sheets to a copy. This preserves the original data and prevents accidental corruption of shared workbooks.
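
Practices 3 and 4 can be sketched together on fabricated data (the column names and dirty values here are invented for illustration):

```python
import pandas as pd

# Simulate what a merged 'Region' column and a dirty numeric column
# look like after loading: merged gaps as None, numbers mixed with text
df = pd.DataFrame({
    "Region": ["East", None, None, "West", None],
    "Revenue": ["1200", " 850 ", "n/a", "900", "1,100"],
})

# Practice 3: forward-fill the merged-cell gaps
df["Region"] = df["Region"].ffill()

# Practice 4: coerce to numeric and count what failed
revenue = pd.to_numeric(
    df["Revenue"].str.replace(",", "").str.strip(),
    errors="coerce",
)
bad = revenue.isna().sum()
print(f"{bad} value(s) failed numeric conversion")
```

Checking `bad` against expectations is a cheap data-quality gate: a handful of NaNs usually means stray text values, while a whole column of NaNs means the wrong header row or sheet was parsed.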

Common Issues

Dates load as serial numbers — Excel stores dates as integer serial numbers internally. When pandas loads them, they sometimes appear as integers (e.g., 45280 instead of 2023-12-20). Use `pd.to_datetime(df['date_col'], unit='D', origin='1899-12-30')` to convert Excel serial dates, or ensure `parse_dates` is specified during loading.
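
The conversion can be sketched on a few hand-picked serials (the values are invented; the `1899-12-30` origin absorbs Excel's phantom 1900 leap day, so it is correct for any date after February 1900):

```python
import pandas as pd

# Serial numbers as they might appear when date cells load as ints
serials = pd.Series([45280, 45281, 45310])

dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates.dt.strftime("%Y-%m-%d").tolist())
```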

Merged cells cause misaligned data — Merged cells span multiple rows/columns but only store the value in the top-left cell. When reading with pandas, merged rows appear as NaN. Detect merged ranges with ws.merged_cells.ranges in openpyxl, then unmerge and forward-fill before analysis.
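
A minimal sketch of the unmerge-and-fill step (the two-row workbook here is fabricated so the merge is reproducible):

```python
import pandas as pd
from openpyxl import Workbook

# Build a sheet where 'East' is merged across two data rows
wb = Workbook()
ws = wb.active
ws.append(["Region", "Revenue"])
ws.append(["East", 100])
ws.append([None, 200])
ws.merge_cells("A2:A3")

# Detect merged ranges, unmerge, and copy the top-left value down
for rng in list(ws.merged_cells.ranges):
    top_left = ws.cell(rng.min_row, rng.min_col).value
    ws.unmerge_cells(rng.coord)
    for row in ws[rng.coord]:
        for cell in row:
            cell.value = top_left

# Now every row carries its region and converts cleanly
df = pd.DataFrame(
    list(ws.iter_rows(min_row=2, values_only=True)),
    columns=[c.value for c in ws[1]],
)
print(df)
```

The `list(...)` copy around `ws.merged_cells.ranges` matters: unmerging mutates that collection, so iterating it directly while unmerging is unsafe.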

Memory errors on large workbooks — A 500MB Excel file will use several GB of RAM when loaded into pandas. Use usecols to load only needed columns, nrows to limit rows during exploration, and dtype specifications to reduce memory. For very large files, consider converting to Parquet first: df.to_parquet("data.parquet").
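
The `usecols`/`dtype` trimming might look like this (a sketch; the demo file and column names are invented, and the absolute savings depend on the data):

```python
import pandas as pd

# Stand-in for a large workbook: one wide text column we don't need
pd.DataFrame({
    "id": range(1000),
    "region": ["East", "West"] * 500,
    "revenue": [1.5] * 1000,
    "notes": ["x" * 50] * 1000,
}).to_excel("demo_big.xlsx", index=False)

# Load only the needed columns, with compact dtypes
df = pd.read_excel(
    "demo_big.xlsx",
    usecols=["region", "revenue"],
    dtype={"region": "category", "revenue": "float32"},
)
print(df.dtypes)
print(f"{df.memory_usage(deep=True).sum() / 1e3:.1f} KB")
```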
