Pro Statsmodels
Perform advanced statistical modeling and econometric analysis using statsmodels, Python's comprehensive library for estimation, inference, and diagnostics. This skill covers linear and generalized linear models, time-series analysis, mixed effects models, survival analysis, and regression diagnostics.
When to Use This Skill
Choose Pro Statsmodels when you need to:
- Fit regression models with detailed summary tables (coefficients, p-values, confidence intervals)
- Perform time-series analysis (ARIMA, SARIMAX, VAR, state space models)
- Run regression diagnostics (heteroscedasticity tests, influence measures, VIF)
- Build generalized linear models (logistic, Poisson, negative binomial regression)
Consider alternatives when:
- You need machine learning prediction pipelines (use scikit-learn)
- You need Bayesian posterior inference (use PyMC)
- You need quick statistical tests without model fitting (use scipy.stats or pingouin)
Quick Start
```bash
pip install statsmodels pandas numpy matplotlib
```
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

# Generate data
np.random.seed(42)
n = 200
data = pd.DataFrame({
    'education': np.random.randint(8, 20, n),
    'experience': np.random.randint(0, 30, n),
    'gender': np.random.choice(['Male', 'Female'], n),
})
data['salary'] = (
    3000
    + 2000 * data['education']
    + 500 * data['experience']
    + 5000 * (data['gender'] == 'Male').astype(int)
    + np.random.normal(0, 5000, n)
)

# Fit OLS with formula API
model = smf.ols('salary ~ education + experience + C(gender)', data=data)
results = model.fit()
print(results.summary())

# Key metrics (results.fvalue is the scalar F-statistic)
print(f"\nR²: {results.rsquared:.3f}")
print(f"Adjusted R²: {results.rsquared_adj:.3f}")
print(f"F-statistic: {results.fvalue:.1f}, p = {results.f_pvalue:.4f}")
```
Core Concepts
Model Types
| Model | Class | Use Case |
|---|---|---|
| OLS | sm.OLS / smf.ols | Linear regression |
| WLS | sm.WLS | Weighted least squares |
| GLS | sm.GLS | Generalized least squares |
| Logit | sm.Logit / smf.logit | Binary classification |
| Probit | sm.Probit | Binary with normal link |
| Poisson | sm.Poisson / smf.poisson | Count data |
| NegativeBinomial | sm.NegativeBinomial | Overdispersed counts |
| MixedLM | smf.mixedlm | Multi-level / hierarchical |
| ARIMA | sm.tsa.ARIMA | Time-series forecasting |
| SARIMAX | sm.tsa.SARIMAX | Seasonal time series |
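The GLM rows in the table above can be exercised with a short sketch. This is a minimal illustration with synthetic data and invented variable names (`x1`, `x2`, `y`); any of the discrete models in the table (`smf.poisson`, `sm.Probit`, and so on) follow the same fit-and-inspect pattern.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic binary-outcome data (variable names are illustrative)
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
logit_p = 0.5 + 1.5 * df['x1'] - 1.0 * df['x2']
df['y'] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Logit via the formula API; coefficients are on the log-odds scale
logit_res = smf.logit('y ~ x1 + x2', data=df).fit(disp=False)
print(logit_res.params)
print(np.exp(logit_res.params))  # exponentiate to get odds ratios
```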
Time-Series Analysis
```python
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate seasonal time series
np.random.seed(42)
n = 120  # 10 years of monthly data
dates = pd.date_range('2015-01', periods=n, freq='MS')
trend = np.linspace(100, 200, n)
seasonal = 20 * np.sin(2 * np.pi * np.arange(n) / 12)
noise = np.random.normal(0, 5, n)
y = trend + seasonal + noise
ts = pd.Series(y, index=dates)

# Decompose
decomposition = sm.tsa.seasonal_decompose(ts, model='additive', period=12)
fig = decomposition.plot()
fig.set_size_inches(10, 8)
fig.tight_layout()
fig.savefig('decomposition.pdf')

# Fit SARIMAX
model = sm.tsa.SARIMAX(
    ts,
    order=(1, 1, 1),               # (p, d, q)
    seasonal_order=(1, 1, 1, 12),  # (P, D, Q, s)
    enforce_stationarity=False,
    enforce_invertibility=False,
)
results = model.fit(disp=False)
print(results.summary())

# Forecast
forecast = results.get_forecast(steps=24)
mean_forecast = forecast.predicted_mean
ci = forecast.conf_int()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(ts.index, ts, label='Observed')
ax.plot(mean_forecast.index, mean_forecast, 'r-', label='Forecast')
ax.fill_between(ci.index, ci.iloc[:, 0], ci.iloc[:, 1], alpha=0.2, color='red')
ax.legend()
ax.set_title('SARIMAX Forecast')
fig.tight_layout()
fig.savefig('forecast.pdf')
```
Regression Diagnostics
```python
import numpy as np
from scipy.stats import shapiro
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson


def run_diagnostics(results):
    """Run comprehensive regression diagnostics on a fitted OLS results object."""
    print("=" * 50)
    print("REGRESSION DIAGNOSTICS")
    print("=" * 50)

    # 1. Multicollinearity (VIF)
    X = results.model.exog
    feature_names = results.model.exog_names
    print("\n1. Variance Inflation Factors:")
    for i, name in enumerate(feature_names):
        if name == 'Intercept':
            continue
        vif = variance_inflation_factor(X, i)
        flag = " ⚠️ HIGH" if vif > 5 else ""
        print(f"   {name}: VIF = {vif:.2f}{flag}")

    # 2. Heteroscedasticity (Breusch-Pagan)
    bp_stat, bp_p, _, _ = het_breuschpagan(results.resid, results.model.exog)
    print(f"\n2. Breusch-Pagan test: stat={bp_stat:.3f}, p={bp_p:.3f}")
    print(f"   {'Heteroscedastic' if bp_p < 0.05 else 'Homoscedastic'}")

    # 3. Normality of residuals (Shapiro-Wilk is limited to 5000 observations)
    _, p_shapiro = shapiro(results.resid[:5000])
    print(f"\n3. Shapiro-Wilk normality: p = {p_shapiro:.3f}")

    # 4. Influential observations (Cook's distance)
    influence = results.get_influence()
    cooks_d = influence.cooks_distance[0]
    n_influential = np.sum(cooks_d > 4 / len(cooks_d))
    print(f"\n4. Influential observations (Cook's d > 4/n): {n_influential}")

    # 5. Autocorrelation (Durbin-Watson)
    dw = durbin_watson(results.resid)
    print(f"\n5. Durbin-Watson: {dw:.3f}")
    print(f"   {'No autocorrelation' if 1.5 < dw < 2.5 else 'Possible autocorrelation'}")


# Usage
# run_diagnostics(results)
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| formula | Patsy formula string (e.g., "y ~ x1 + x2") | Required |
| cov_type | Covariance estimator (nonrobust, HC0-HC3, HAC) | "nonrobust" |
| alpha | Significance level for confidence intervals | 0.05 |
| missing | Missing-data handling (none, drop, raise) | "none" |
| hasconst | Whether the design matrix includes an intercept | Auto-detected |
| order | ARIMA order (p, d, q) | (1, 0, 0) |
| seasonal_order | SARIMAX seasonal order (P, D, Q, s) | (0, 0, 0, 0) |
| maxiter | Maximum optimization iterations | 35 |
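A short sketch showing how the `missing` and `cov_type` parameters from the table combine in practice. The data here is synthetic and the variable names are illustrative; `missing='drop'` is a keyword of the formula API, and `cov_type='HC3'` is one of the covariance estimators listed above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with some missing responses injected
rng = np.random.default_rng(1)
df = pd.DataFrame({'x': rng.normal(size=100)})
df['y'] = 2.0 * df['x'] + rng.normal(size=100)
df.loc[::10, 'y'] = np.nan  # every 10th row becomes incomplete

# missing='drop' removes incomplete rows; cov_type='HC3' gives robust SEs
res = smf.ols('y ~ x', data=df, missing='drop').fit(cov_type='HC3')
print(res.nobs)                  # 90 rows remain after dropping NaNs
print(res.bse)                   # HC3 heteroscedasticity-robust standard errors
print(res.conf_int(alpha=0.05))  # 95% confidence intervals (alpha from the table)
```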
Best Practices
- Use the formula API for readable model specifications — `smf.ols('y ~ x1 + x2 + C(category)', data=df)` is clearer than manually constructing design matrices. The formula API automatically handles categorical encoding with `C()`, interactions with `x1:x2`, and polynomial terms with `I(x**2)`.
- Always use robust standard errors for observational data — Real-world data almost always exhibits heteroscedasticity. Use `results = model.fit(cov_type='HC3')` to get heteroscedasticity-robust standard errors. HC3 is the most conservative and works well for moderate sample sizes.
- Run VIF checks before interpreting coefficients — Multicollinearity (VIF > 5-10) inflates standard errors and makes individual coefficients uninterpretable. Remove or combine correlated predictors, or use regularization. A model can predict well with collinear predictors, but individual coefficient interpretation is unreliable.
- Use AIC/BIC for model selection, not R² — R² always increases with more predictors. AIC and BIC penalize model complexity and are directly available as `results.aic` and `results.bic`. Lower values indicate better models; BIC penalizes complexity more strongly than AIC.
- Plot residuals before trusting model results — Use `sm.graphics.plot_regress_exog(results, 'x1')` and `sm.qqplot(results.resid, line='s')` to visually check linearity, homoscedasticity, and normality assumptions. Diagnostic tests can miss patterns that are obvious in plots.
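The R² versus AIC/BIC point above can be seen directly with a small sketch. The data and model names here are invented for illustration: `m2` adds predictors unrelated to `y`, so its R² cannot be lower, while its information criteria pay a complexity penalty.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),          # irrelevant predictor
    'noise_pred': rng.normal(size=n),  # irrelevant predictor
})
df['y'] = 1.0 + 2.0 * df['x1'] + rng.normal(size=n)

m1 = smf.ols('y ~ x1', data=df).fit()
m2 = smf.ols('y ~ x1 + x2 + noise_pred', data=df).fit()

# R² never decreases when predictors are added to a nested OLS model,
# while BIC typically favors the smaller true model here
print(f"R²:  m1={m1.rsquared:.4f}  m2={m2.rsquared:.4f}")
print(f"BIC: m1={m1.bic:.1f}  m2={m2.bic:.1f}")
```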
Common Issues
`MissingDataError: exog contains inf or nans` — statsmodels doesn't silently drop missing values. Clean data before fitting: `data = data.dropna(subset=model_columns)`, or use `missing='drop'` in the formula API. Also check for infinite values: `data = data.replace([np.inf, -np.inf], np.nan).dropna()`.
SARIMAX convergence warnings or non-invertible results — The optimizer failed to find stable parameters. Try different initial values with `start_params`, increase `maxiter`, or simplify the model order. Use `sm.tsa.arma_order_select_ic(ts)` to find optimal (p, q) based on information criteria rather than guessing.
Summary shows insignificant predictors but model is significant — This is classic multicollinearity: the predictors together explain variance, but individually they're redundant. Check VIF values, remove the predictor with highest VIF, and refit. Alternatively, if prediction is the goal rather than interpretation, keep all predictors.
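The cleanup step for the first issue above can be sketched in a few lines. The tiny DataFrame here is invented purely to show the replace-then-drop pattern before fitting.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data containing both an infinity and a NaN
df = pd.DataFrame({
    'x': [1.0, 2.0, np.inf, 4.0, 5.0],
    'y': [2.1, 3.9, 6.2, np.nan, 10.1],
})

# Replace infinities with NaN, then drop incomplete rows before fitting
clean = df.replace([np.inf, -np.inf], np.nan).dropna()
print(len(clean))  # 3 complete rows survive

res = smf.ols('y ~ x', data=clean).fit()
print(res.params)
```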