
Excel Analysis Complete

A comprehensive skill for analyzing Excel files programmatically — covering data loading with openpyxl and pandas, formula auditing, pivot table creation, chart generation, multi-sheet analysis, and automated report building.

When to Use This Skill

Choose Excel Analysis Complete when you need to:

  • Read and analyze complex Excel workbooks with multiple sheets
  • Audit formulas and identify calculation errors
  • Create pivot tables and summary statistics from raw data
  • Generate charts and visualizations from Excel data
  • Build automated reporting pipelines from Excel inputs

Consider alternatives when:

  • You need to create Excel files from scratch (use an XLSX creation skill)
  • You're working with CSV files only (use a data analysis skill)
  • You need real-time Excel manipulation (use Excel VBA or add-ins)

Quick Start

```bash
# Install dependencies
pip install pandas openpyxl xlsxwriter matplotlib
```

```python
import pandas as pd
from pathlib import Path

# Load workbook with multiple sheets
xlsx = pd.ExcelFile("sales_report.xlsx")
print(f"Sheets: {xlsx.sheet_names}")

# Read specific sheet with header detection
df = pd.read_excel(xlsx, sheet_name="Q4 Sales", header=0)
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Quick analysis
print(f"\nTotal Revenue: ${df['Revenue'].sum():,.2f}")
print(f"Avg Deal Size: ${df['Revenue'].mean():,.2f}")
print(f"Top Region: {df.groupby('Region')['Revenue'].sum().idxmax()}")

# Pivot table
pivot = df.pivot_table(
    values="Revenue",
    index="Region",
    columns="Product",
    aggfunc="sum",
    margins=True,
)
print(f"\nRevenue by Region x Product:\n{pivot}")
```

Core Concepts

Excel Data Loading Strategies

| Method | Best For | Handles Formulas |
|---|---|---|
| `pd.read_excel()` | Flat data, single sheet | Values only |
| `openpyxl` read | Multi-sheet, formatting, formulas | Yes |
| `pd.ExcelFile()` | Multiple sheets, lazy loading | Values only |
| `xlrd` | Legacy `.xls` files | Values only |

Formula Auditing

```python
from openpyxl import load_workbook

# Load with formulas preserved (data_only=False is the default),
# not calculated values
wb = load_workbook("financial_model.xlsx")
ws = wb["P&L"]

# Find all cells with formulas
formula_cells = []
for row in ws.iter_rows(min_row=1, max_row=ws.max_row):
    for cell in row:
        if isinstance(cell.value, str) and cell.value.startswith("="):
            formula_cells.append({
                "cell": cell.coordinate,
                "formula": cell.value,
                "sheet": ws.title,
            })

print(f"Found {len(formula_cells)} formulas")
for fc in formula_cells[:10]:
    print(f"  {fc['cell']}: {fc['formula']}")

# Detect common formula issues
for fc in formula_cells:
    formula = fc["formula"]
    if "#REF" in formula:
        print(f"  ⚠️ Broken reference: {fc['cell']}")
    if formula.count("(") != formula.count(")"):
        print(f"  ⚠️ Unbalanced parens: {fc['cell']}")
```

Multi-Sheet Analysis

```python
import pandas as pd

def analyze_workbook(filepath):
    """Analyze all sheets in a workbook."""
    xlsx = pd.ExcelFile(filepath)
    report = {}
    for sheet_name in xlsx.sheet_names:
        df = pd.read_excel(xlsx, sheet_name=sheet_name)
        report[sheet_name] = {
            "rows": len(df),
            "columns": len(df.columns),
            "numeric_cols": list(df.select_dtypes("number").columns),
            "null_pct": df.isnull().mean().mean() * 100,
            "memory_mb": df.memory_usage(deep=True).sum() / 1e6,
        }
    return report

# Generate workbook summary
summary = analyze_workbook("sales_report.xlsx")
for sheet, info in summary.items():
    print(f"\n{sheet}: {info['rows']} rows, {info['columns']} cols")
    print(f"  Numeric: {info['numeric_cols']}")
    print(f"  Missing: {info['null_pct']:.1f}%")
```

Configuration

| Parameter | Description | Example |
|---|---|---|
| `file_path` | Path to Excel file | `"./data/report.xlsx"` |
| `sheet_name` | Target sheet (name or index) | `"Q4 Sales"` or `0` |
| `header_row` | Row containing column headers | `0` (first row) |
| `skip_rows` | Rows to skip at top | `2` |
| `use_formulas` | Preserve formulas vs. calculated values | `true` (openpyxl only) |
| `output_format` | Analysis output format | `"xlsx"` / `"csv"` |
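
One way these parameters might map onto `pd.read_excel()` keyword arguments (a sketch; the demo file, its contents, and the `config` dict are invented — `header_row` and `skip_rows` are this skill's names, not pandas argument names):

```python
import pandas as pd

# Fabricate a workbook whose real header sits below two junk rows,
# matching header_row=0 / skip_rows=2 from the table above
raw = pd.DataFrame([
    ["Report generated 2024-01-05", None],
    [None, None],
    ["Region", "Revenue"],
    ["East", 1200],
    ["West", 800],
])
raw.to_excel("demo_config.xlsx", sheet_name="Q4 Sales", index=False, header=False)

config = {
    "file_path": "demo_config.xlsx",
    "sheet_name": "Q4 Sales",
    "header_row": 0,   # header is the first row *after* skipping
    "skip_rows": 2,
}

df = pd.read_excel(
    config["file_path"],
    sheet_name=config["sheet_name"],
    header=config["header_row"],
    skiprows=config["skip_rows"],
)
print(list(df.columns))
```

Note that `header` is counted after `skiprows` is applied, which is why `header_row` stays `0` even with two skipped rows.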

Best Practices

  1. Always inspect sheet structure before parsing — Excel files frequently have merged cells, multi-row headers, blank rows, and inconsistent formatting. Load the first 10 rows with nrows=10 and print them to verify header positions and data layout before writing analysis code.

  2. Use openpyxl for structure, pandas for analysis — Read formatting, formulas, and merged cell information with openpyxl, then convert to pandas DataFrames for statistical analysis. Trying to do statistical work in openpyxl is unnecessarily complex.

  3. Handle merged cells explicitly — Merged cells in Excel become None in all but the top-left cell when read by openpyxl. Forward-fill these values with `df.ffill()` after loading (the older `fillna(method='ffill')` form is deprecated), or unmerge cells in openpyxl before converting to a DataFrame.

  4. Validate numeric columns after loading — Excel columns that look numeric may contain hidden text values, spaces, or special characters. Use pd.to_numeric(df['col'], errors='coerce') and check how many values became NaN to detect data quality issues.

  5. Write analysis results to new sheets, not the source file — Never overwrite the input file. Save results to a new file or add new sheets to a copy. This preserves the original data and prevents accidental corruption of shared workbooks.
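
Practices 3 and 4 can be sketched together on fabricated data (the column names and dirty values here are invented for illustration):

```python
import pandas as pd

# Simulate what a merged 'Region' column and a dirty numeric column
# look like after loading: merged gaps as None, numbers mixed with text
df = pd.DataFrame({
    "Region": ["East", None, None, "West", None],
    "Revenue": ["1200", " 850 ", "n/a", "900", "1,100"],
})

# Practice 3: forward-fill the merged-cell gaps
df["Region"] = df["Region"].ffill()

# Practice 4: coerce to numeric and count what failed
revenue = pd.to_numeric(
    df["Revenue"].str.replace(",", "").str.strip(),
    errors="coerce",
)
bad = revenue.isna().sum()
print(f"{bad} value(s) failed numeric conversion")
```

Checking `bad` against expectations is a cheap data-quality gate: a handful of NaNs usually means stray text values, while a whole column of NaNs means the wrong header row or sheet was parsed.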

Common Issues

Dates load as serial numbers — Excel stores dates as integer serial numbers internally. When pandas loads them, they sometimes appear as integers (e.g., 45280 instead of 2023-12-20). Use `pd.to_datetime(df['date_col'], unit='D', origin='1899-12-30')` to convert Excel serial dates, or ensure `parse_dates` is specified during loading.
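
The conversion can be sketched on a few hand-picked serials (the values are invented; the `1899-12-30` origin absorbs Excel's phantom 1900 leap day, so it is correct for any date after February 1900):

```python
import pandas as pd

# Serial numbers as they might appear when date cells load as ints
serials = pd.Series([45280, 45281, 45310])

dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates.dt.strftime("%Y-%m-%d").tolist())
```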

Merged cells cause misaligned data — Merged cells span multiple rows/columns but only store the value in the top-left cell. When reading with pandas, merged rows appear as NaN. Detect merged ranges with ws.merged_cells.ranges in openpyxl, then unmerge and forward-fill before analysis.
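
A minimal sketch of the unmerge-and-fill step (the two-row workbook here is fabricated so the merge is reproducible):

```python
import pandas as pd
from openpyxl import Workbook

# Build a sheet where 'East' is merged across two data rows
wb = Workbook()
ws = wb.active
ws.append(["Region", "Revenue"])
ws.append(["East", 100])
ws.append([None, 200])
ws.merge_cells("A2:A3")

# Detect merged ranges, unmerge, and copy the top-left value down
for rng in list(ws.merged_cells.ranges):
    top_left = ws.cell(rng.min_row, rng.min_col).value
    ws.unmerge_cells(rng.coord)
    for row in ws[rng.coord]:
        for cell in row:
            cell.value = top_left

# Now every row carries its region and converts cleanly
df = pd.DataFrame(
    list(ws.iter_rows(min_row=2, values_only=True)),
    columns=[c.value for c in ws[1]],
)
print(df)
```

The `list(...)` copy around `ws.merged_cells.ranges` matters: unmerging mutates that collection, so iterating it directly while unmerging is unsafe.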

Memory errors on large workbooks — A 500MB Excel file will use several GB of RAM when loaded into pandas. Use usecols to load only needed columns, nrows to limit rows during exploration, and dtype specifications to reduce memory. For very large files, consider converting to Parquet first: df.to_parquet("data.parquet").
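
The `usecols`/`dtype` trimming might look like this (a sketch; the demo file and column names are invented, and the absolute savings depend on the data):

```python
import pandas as pd

# Stand-in for a large workbook: one wide text column we don't need
pd.DataFrame({
    "id": range(1000),
    "region": ["East", "West"] * 500,
    "revenue": [1.5] * 1000,
    "notes": ["x" * 50] * 1000,
}).to_excel("demo_big.xlsx", index=False)

# Load only the needed columns, with compact dtypes
df = pd.read_excel(
    "demo_big.xlsx",
    usecols=["region", "revenue"],
    dtype={"region": "category", "revenue": "float32"},
)
print(df.dtypes)
print(f"{df.memory_usage(deep=True).sum() / 1e3:.1f} KB")
```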
