
Pro Statsmodels

Streamline your statistical modeling workflow with this statsmodels toolkit. Includes structured workflows, validation checks, and reusable patterns for scientific analysis.

Skill · Cliptics · scientific · v1.0.0 · MIT


Perform advanced statistical modeling and econometric analysis using statsmodels, Python's comprehensive library for estimation, inference, and diagnostics. This skill covers linear and generalized linear models, time-series analysis, mixed effects models, survival analysis, and regression diagnostics.

When to Use This Skill

Choose Pro Statsmodels when you need to:

  • Fit regression models with detailed summary tables (coefficients, p-values, confidence intervals)
  • Perform time-series analysis (ARIMA, SARIMAX, VAR, state space models)
  • Run regression diagnostics (heteroscedasticity tests, influence measures, VIF)
  • Build generalized linear models (logistic, Poisson, negative binomial regression)

Consider alternatives when:

  • You need machine learning prediction pipelines (use scikit-learn)
  • You need Bayesian posterior inference (use PyMC)
  • You need quick statistical tests without model fitting (use scipy.stats or pingouin)

Quick Start

```shell
pip install statsmodels pandas numpy matplotlib
```
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

# Generate data
np.random.seed(42)
n = 200
data = pd.DataFrame({
    'education': np.random.randint(8, 20, n),
    'experience': np.random.randint(0, 30, n),
    'gender': np.random.choice(['Male', 'Female'], n),
})
data['salary'] = (
    3000
    + 2000 * data['education']
    + 500 * data['experience']
    + 5000 * (data['gender'] == 'Male').astype(int)
    + np.random.normal(0, 5000, n)
)

# Fit OLS with formula API
model = smf.ols('salary ~ education + experience + C(gender)', data=data)
results = model.fit()
print(results.summary())

# Key metrics
print(f"\nR²: {results.rsquared:.3f}")
print(f"Adjusted R²: {results.rsquared_adj:.3f}")
print(f"F-statistic: {results.fvalue:.1f}, p = {results.f_pvalue:.4f}")
```

Core Concepts

Model Types

| Model | Class | Use Case |
|-------|-------|----------|
| OLS | `sm.OLS` / `smf.ols` | Linear regression |
| WLS | `sm.WLS` | Weighted least squares |
| GLS | `sm.GLS` | Generalized least squares |
| Logit | `sm.Logit` / `smf.logit` | Binary classification |
| Probit | `sm.Probit` | Binary with normal link |
| Poisson | `sm.Poisson` / `smf.poisson` | Count data |
| NegativeBinomial | `sm.NegativeBinomial` | Overdispersed counts |
| MixedLM | `smf.mixedlm` | Multi-level / hierarchical |
| ARIMA | `sm.tsa.ARIMA` | Time-series forecasting |
| SARIMAX | `sm.tsa.SARIMAX` | Seasonal time series |
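For instance, a binary outcome from the table above can be fit with the formula-API logit. A minimal sketch on simulated data (the data-generating coefficients are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a binary outcome with one predictor (illustrative data)
np.random.seed(0)
n = 300
df = pd.DataFrame({'x': np.random.normal(size=n)})
p = 1 / (1 + np.exp(-(0.5 + 1.2 * df['x'])))  # true log-odds: 0.5 + 1.2*x
df['y'] = (np.random.uniform(size=n) < p).astype(int)

res = smf.logit('y ~ x', data=df).fit(disp=False)
print(res.params)          # estimated log-odds coefficients
print(np.exp(res.params))  # exponentiate to get odds ratios
```

Exponentiating logit coefficients converts them to odds ratios, which are usually easier to communicate than raw log-odds.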

Time-Series Analysis

```python
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate seasonal time series
np.random.seed(42)
n = 120  # 10 years of monthly data
dates = pd.date_range('2015-01', periods=n, freq='MS')
trend = np.linspace(100, 200, n)
seasonal = 20 * np.sin(2 * np.pi * np.arange(n) / 12)
noise = np.random.normal(0, 5, n)
y = trend + seasonal + noise
ts = pd.Series(y, index=dates)

# Decompose
decomposition = sm.tsa.seasonal_decompose(ts, model='additive', period=12)
fig = decomposition.plot()
fig.set_size_inches(10, 8)
fig.tight_layout()
fig.savefig('decomposition.pdf')

# Fit SARIMAX
model = sm.tsa.SARIMAX(
    ts,
    order=(1, 1, 1),               # (p, d, q)
    seasonal_order=(1, 1, 1, 12),  # (P, D, Q, s)
    enforce_stationarity=False,
    enforce_invertibility=False,
)
results = model.fit(disp=False)
print(results.summary())

# Forecast
forecast = results.get_forecast(steps=24)
mean_forecast = forecast.predicted_mean
ci = forecast.conf_int()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(ts.index, ts, label='Observed')
ax.plot(mean_forecast.index, mean_forecast, 'r-', label='Forecast')
ax.fill_between(ci.index, ci.iloc[:, 0], ci.iloc[:, 1], alpha=0.2, color='red')
ax.legend()
ax.set_title('SARIMAX Forecast')
fig.tight_layout()
fig.savefig('forecast.pdf')
```

Regression Diagnostics

```python
import numpy as np
from scipy.stats import shapiro
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

def run_diagnostics(results):
    """Run comprehensive regression diagnostics on a fitted OLS results object."""
    print("=" * 50)
    print("REGRESSION DIAGNOSTICS")
    print("=" * 50)

    # 1. Multicollinearity (VIF)
    X = results.model.exog
    feature_names = results.model.exog_names
    print("\n1. Variance Inflation Factors:")
    for i, name in enumerate(feature_names):
        if name == 'Intercept':
            continue
        vif = variance_inflation_factor(X, i)
        flag = " ⚠️ HIGH" if vif > 5 else ""
        print(f"   {name}: VIF = {vif:.2f}{flag}")

    # 2. Heteroscedasticity (Breusch-Pagan)
    bp_stat, bp_p, _, _ = het_breuschpagan(results.resid, results.model.exog)
    print(f"\n2. Breusch-Pagan test: stat={bp_stat:.3f}, p={bp_p:.3f}")
    print(f"   {'Heteroscedastic' if bp_p < 0.05 else 'Homoscedastic'}")

    # 3. Normality of residuals
    _, p_shapiro = shapiro(results.resid[:5000])  # Shapiro limited to 5000
    print(f"\n3. Shapiro-Wilk normality: p = {p_shapiro:.3f}")

    # 4. Influential observations (Cook's distance)
    influence = results.get_influence()
    cooks_d = influence.cooks_distance[0]
    n_influential = np.sum(cooks_d > 4 / len(cooks_d))
    print(f"\n4. Influential observations (Cook's d > 4/n): {n_influential}")

    # 5. Durbin-Watson (autocorrelation)
    dw = durbin_watson(results.resid)
    print(f"\n5. Durbin-Watson: {dw:.3f}")
    print(f"   {'No autocorrelation' if 1.5 < dw < 2.5 else 'Possible autocorrelation'}")

# Usage
# run_diagnostics(results)
```

Configuration

| Parameter | Description | Default |
|-----------|-------------|---------|
| `formula` | Patsy formula string (e.g., `"y ~ x1 + x2"`) | Required |
| `cov_type` | Covariance estimator (`nonrobust`, `HC0`-`HC3`, `HAC`) | `"nonrobust"` |
| `alpha` | Significance level for confidence intervals | `0.05` |
| `missing` | Missing data handling (`none`, `drop`, `raise`) | `"none"` |
| `hasconst` | Whether design matrix includes intercept | Auto-detected |
| `order` | ARIMA order (p, d, q) | `(1, 0, 0)` |
| `seasonal_order` | SARIMAX seasonal order (P, D, Q, s) | `(0, 0, 0, 0)` |
| `maxiter` | Maximum optimization iterations | `35` |
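A sketch combining two of these options, `missing='drop'` and `cov_type='HC3'`, on made-up data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Small illustrative dataset with one missing value
np.random.seed(1)
df = pd.DataFrame({'x': np.random.normal(size=50)})
df['y'] = 2 * df['x'] + np.random.normal(size=50)
df.loc[3, 'x'] = np.nan

# Drop missing rows at model construction; use robust (HC3) standard errors
res = smf.ols('y ~ x', data=df, missing='drop').fit(cov_type='HC3')
print(int(res.nobs))  # 49 (one row dropped)
```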

Best Practices

  1. Use the formula API for readable model specifications — smf.ols('y ~ x1 + x2 + C(category)', data=df) is clearer than manually constructing design matrices. The formula API automatically handles categorical encoding with C(), interactions with x1:x2, and polynomial terms with I(x**2).

  2. Always use robust standard errors for observational data — Real-world data almost always exhibits heteroscedasticity. Use results = model.fit(cov_type='HC3') to get heteroscedasticity-robust standard errors. HC3 is the most conservative and works well for moderate sample sizes.

  3. Run VIF checks before interpreting coefficients — Multicollinearity (VIF > 5-10) inflates standard errors and makes individual coefficients uninterpretable. Remove or combine correlated predictors, or use regularization. A model can predict well with collinear predictors, but individual coefficient interpretation is unreliable.

  4. Use AIC/BIC for model selection, not R² — R² always increases with more predictors. AIC and BIC penalize model complexity and are directly available: results.aic, results.bic. Lower values indicate better models. BIC penalizes complexity more strongly than AIC.

  5. Plot residuals before trusting model results — Use sm.graphics.plot_regress_exog(results, 'x1') and sm.qqplot(results.resid, line='s') to visually check linearity, homoscedasticity, and normality assumptions. Diagnostic tests can miss patterns that are obvious in plots.
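Points 2 and 4 can be illustrated in a few lines. A minimal sketch on simulated data, comparing a model with and without an irrelevant predictor:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(2)
n = 200
df = pd.DataFrame({
    'x1': np.random.normal(size=n),
    'x2': np.random.normal(size=n),  # irrelevant predictor
})
df['y'] = 1 + 2 * df['x1'] + np.random.normal(size=n)

small = smf.ols('y ~ x1', data=df).fit(cov_type='HC3')
big = smf.ols('y ~ x1 + x2', data=df).fit(cov_type='HC3')

# R² never decreases when a predictor is added...
print(f"R²:  {small.rsquared:.4f} vs {big.rsquared:.4f}")
# ...but AIC/BIC penalize the extra parameter
print(f"AIC: {small.aic:.1f} vs {big.aic:.1f}")
print(f"BIC: {small.bic:.1f} vs {big.bic:.1f}")
```

With information criteria, the model with the lower value is preferred; here BIC's heavier complexity penalty typically favors the smaller model.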

Common Issues

"MissingDataError: exog contains inf or nans" — statsmodels doesn't silently drop missing values. Clean data before fitting: data = data.dropna(subset=model_columns) or use missing='drop' in the formula API. Also check for infinite values: data = data.replace([np.inf, -np.inf], np.nan).dropna().

SARIMAX convergence warnings or non-invertible results — The optimizer failed to find stable parameters. Try different initial values with start_params, increase maxiter, or simplify the model order. Use sm.tsa.arma_order_select_ic(ts) to find optimal (p, q) based on information criteria rather than guessing.

Summary shows insignificant predictors but model is significant — This is classic multicollinearity: the predictors together explain variance, but individually they're redundant. Check VIF values, remove the predictor with highest VIF, and refit. Alternatively, if prediction is the goal rather than interpretation, keep all predictors.
