
Pro Statsmodels

Streamline your statistical modeling workflow with this statsmodels toolkit. Includes structured workflows, validation checks, and reusable patterns for scientific analysis.

Skill · Cliptics · scientific · v1.0.0 · MIT


Perform advanced statistical modeling and econometric analysis using statsmodels, Python's comprehensive library for estimation, inference, and diagnostics. This skill covers linear and generalized linear models, time-series analysis, mixed effects models, survival analysis, and regression diagnostics.

When to Use This Skill

Choose Pro Statsmodels when you need to:

  • Fit regression models with detailed summary tables (coefficients, p-values, confidence intervals)
  • Perform time-series analysis (ARIMA, SARIMAX, VAR, state space models)
  • Run regression diagnostics (heteroscedasticity tests, influence measures, VIF)
  • Build generalized linear models (logistic, Poisson, negative binomial regression)

Consider alternatives when:

  • You need machine learning prediction pipelines (use scikit-learn)
  • You need Bayesian posterior inference (use PyMC)
  • You need quick statistical tests without model fitting (use scipy.stats or pingouin)

Quick Start

```shell
pip install statsmodels pandas numpy matplotlib
```
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

# Generate data
np.random.seed(42)
n = 200
data = pd.DataFrame({
    'education': np.random.randint(8, 20, n),
    'experience': np.random.randint(0, 30, n),
    'gender': np.random.choice(['Male', 'Female'], n),
})
data['salary'] = (
    3000
    + 2000 * data['education']
    + 500 * data['experience']
    + 5000 * (data['gender'] == 'Male').astype(int)
    + np.random.normal(0, 5000, n)
)

# Fit OLS with formula API
model = smf.ols('salary ~ education + experience + C(gender)', data=data)
results = model.fit()
print(results.summary())

# Key metrics
print(f"\nR²: {results.rsquared:.3f}")
print(f"Adjusted R²: {results.rsquared_adj:.3f}")
print(f"F-statistic: {results.fvalue:.1f}, p = {results.f_pvalue:.4f}")
```

Core Concepts

Model Types

| Model | Class | Use Case |
|-------|-------|----------|
| OLS | `sm.OLS` / `smf.ols` | Linear regression |
| WLS | `sm.WLS` | Weighted least squares |
| GLS | `sm.GLS` | Generalized least squares |
| Logit | `sm.Logit` / `smf.logit` | Binary classification |
| Probit | `sm.Probit` | Binary with normal link |
| Poisson | `sm.Poisson` / `smf.poisson` | Count data |
| NegativeBinomial | `sm.NegativeBinomial` | Overdispersed counts |
| MixedLM | `smf.mixedlm` | Multi-level / hierarchical |
| ARIMA | `sm.tsa.ARIMA` | Time-series forecasting |
| SARIMAX | `sm.tsa.SARIMAX` | Seasonal time series |
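For instance, a binary outcome from the table above can be fit with the formula-API logit. A minimal sketch on simulated data (the data-generating coefficients are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a binary outcome with one predictor (illustrative data)
np.random.seed(0)
n = 300
df = pd.DataFrame({'x': np.random.normal(size=n)})
p = 1 / (1 + np.exp(-(0.5 + 1.2 * df['x'])))  # true log-odds: 0.5 + 1.2*x
df['y'] = (np.random.uniform(size=n) < p).astype(int)

res = smf.logit('y ~ x', data=df).fit(disp=False)
print(res.params)          # estimated log-odds coefficients
print(np.exp(res.params))  # exponentiate to get odds ratios
```

Exponentiating logit coefficients converts them to odds ratios, which are usually easier to communicate than raw log-odds.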

Time-Series Analysis

```python
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate seasonal time series
np.random.seed(42)
n = 120  # 10 years of monthly data
dates = pd.date_range('2015-01', periods=n, freq='MS')
trend = np.linspace(100, 200, n)
seasonal = 20 * np.sin(2 * np.pi * np.arange(n) / 12)
noise = np.random.normal(0, 5, n)
y = trend + seasonal + noise
ts = pd.Series(y, index=dates)

# Decompose
decomposition = sm.tsa.seasonal_decompose(ts, model='additive', period=12)
fig = decomposition.plot()
fig.set_size_inches(10, 8)
fig.tight_layout()
fig.savefig('decomposition.pdf')

# Fit SARIMAX
model = sm.tsa.SARIMAX(
    ts,
    order=(1, 1, 1),               # (p, d, q)
    seasonal_order=(1, 1, 1, 12),  # (P, D, Q, s)
    enforce_stationarity=False,
    enforce_invertibility=False,
)
results = model.fit(disp=False)
print(results.summary())

# Forecast
forecast = results.get_forecast(steps=24)
mean_forecast = forecast.predicted_mean
ci = forecast.conf_int()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(ts.index, ts, label='Observed')
ax.plot(mean_forecast.index, mean_forecast, 'r-', label='Forecast')
ax.fill_between(ci.index, ci.iloc[:, 0], ci.iloc[:, 1], alpha=0.2, color='red')
ax.legend()
ax.set_title('SARIMAX Forecast')
fig.tight_layout()
fig.savefig('forecast.pdf')
```

Regression Diagnostics

```python
import numpy as np
from scipy.stats import shapiro
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

def run_diagnostics(results):
    """Run comprehensive regression diagnostics on a fitted OLS results object."""
    print("=" * 50)
    print("REGRESSION DIAGNOSTICS")
    print("=" * 50)

    # 1. Multicollinearity (VIF)
    X = results.model.exog
    feature_names = results.model.exog_names
    print("\n1. Variance Inflation Factors:")
    for i, name in enumerate(feature_names):
        if name == 'Intercept':
            continue
        vif = variance_inflation_factor(X, i)
        flag = " ⚠️ HIGH" if vif > 5 else ""
        print(f"   {name}: VIF = {vif:.2f}{flag}")

    # 2. Heteroscedasticity (Breusch-Pagan)
    bp_stat, bp_p, _, _ = het_breuschpagan(results.resid, results.model.exog)
    print(f"\n2. Breusch-Pagan test: stat={bp_stat:.3f}, p={bp_p:.3f}")
    print(f"   {'Heteroscedastic' if bp_p < 0.05 else 'Homoscedastic'}")

    # 3. Normality of residuals
    _, p_shapiro = shapiro(results.resid[:5000])  # Shapiro limited to 5000
    print(f"\n3. Shapiro-Wilk normality: p = {p_shapiro:.3f}")

    # 4. Influential observations (Cook's distance)
    influence = results.get_influence()
    cooks_d = influence.cooks_distance[0]
    n_influential = np.sum(cooks_d > 4 / len(cooks_d))
    print(f"\n4. Influential observations (Cook's d > 4/n): {n_influential}")

    # 5. Durbin-Watson (autocorrelation)
    dw = durbin_watson(results.resid)
    print(f"\n5. Durbin-Watson: {dw:.3f}")
    print(f"   {'No autocorrelation' if 1.5 < dw < 2.5 else 'Possible autocorrelation'}")

# Usage
# run_diagnostics(results)
```

Configuration

| Parameter | Description | Default |
|-----------|-------------|---------|
| `formula` | Patsy formula string (e.g., `"y ~ x1 + x2"`) | Required |
| `cov_type` | Covariance estimator (`nonrobust`, `HC0`-`HC3`, `HAC`) | `"nonrobust"` |
| `alpha` | Significance level for confidence intervals | `0.05` |
| `missing` | Missing data handling (`none`, `drop`, `raise`) | `"none"` |
| `hasconst` | Whether design matrix includes intercept | Auto-detected |
| `order` | ARIMA order (p, d, q) | `(1, 0, 0)` |
| `seasonal_order` | SARIMAX seasonal order (P, D, Q, s) | `(0, 0, 0, 0)` |
| `maxiter` | Maximum optimization iterations | `35` |
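A sketch combining two of these options, `missing='drop'` and `cov_type='HC3'`, on made-up data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Small illustrative dataset with one missing value
np.random.seed(1)
df = pd.DataFrame({'x': np.random.normal(size=50)})
df['y'] = 2 * df['x'] + np.random.normal(size=50)
df.loc[3, 'x'] = np.nan

# Drop missing rows at model construction; use robust (HC3) standard errors
res = smf.ols('y ~ x', data=df, missing='drop').fit(cov_type='HC3')
print(int(res.nobs))  # 49 (one row dropped)
```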

Best Practices

  1. Use the formula API for readable model specifications — smf.ols('y ~ x1 + x2 + C(category)', data=df) is clearer than manually constructing design matrices. The formula API automatically handles categorical encoding with C(), interactions with x1:x2, and polynomial terms with I(x**2).

  2. Always use robust standard errors for observational data — Real-world data almost always exhibits heteroscedasticity. Use results = model.fit(cov_type='HC3') to get heteroscedasticity-robust standard errors. HC3 is the most conservative and works well for moderate sample sizes.

  3. Run VIF checks before interpreting coefficients — Multicollinearity (VIF > 5-10) inflates standard errors and makes individual coefficients uninterpretable. Remove or combine correlated predictors, or use regularization. A model can predict well with collinear predictors, but individual coefficient interpretation is unreliable.

  4. Use AIC/BIC for model selection, not R² — R² always increases with more predictors. AIC and BIC penalize model complexity and are directly available: results.aic, results.bic. Lower values indicate better models. BIC penalizes complexity more strongly than AIC.

  5. Plot residuals before trusting model results — Use sm.graphics.plot_regress_exog(results, 'x1') and sm.qqplot(results.resid, line='s') to visually check linearity, homoscedasticity, and normality assumptions. Diagnostic tests can miss patterns that are obvious in plots.
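Points 2 and 4 can be illustrated in a few lines. A minimal sketch on simulated data, comparing a model with and without an irrelevant predictor:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(2)
n = 200
df = pd.DataFrame({
    'x1': np.random.normal(size=n),
    'x2': np.random.normal(size=n),  # irrelevant predictor
})
df['y'] = 1 + 2 * df['x1'] + np.random.normal(size=n)

small = smf.ols('y ~ x1', data=df).fit(cov_type='HC3')
big = smf.ols('y ~ x1 + x2', data=df).fit(cov_type='HC3')

# R² never decreases when a predictor is added...
print(f"R²:  {small.rsquared:.4f} vs {big.rsquared:.4f}")
# ...but AIC/BIC penalize the extra parameter
print(f"AIC: {small.aic:.1f} vs {big.aic:.1f}")
print(f"BIC: {small.bic:.1f} vs {big.bic:.1f}")
```

With information criteria, the model with the lower value is preferred; here BIC's heavier complexity penalty typically favors the smaller model.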

Common Issues

"MissingDataError: exog contains inf or nans" — statsmodels doesn't silently drop missing values. Clean data before fitting: data = data.dropna(subset=model_columns) or use missing='drop' in the formula API. Also check for infinite values: data = data.replace([np.inf, -np.inf], np.nan).dropna().

SARIMAX convergence warnings or non-invertible results — The optimizer failed to find stable parameters. Try different initial values with start_params, increase maxiter, or simplify the model order. Use sm.tsa.arma_order_select_ic(ts) to find optimal (p, q) based on information criteria rather than guessing.

Summary shows insignificant predictors but model is significant — This is classic multicollinearity: the predictors together explain variance, but individually they're redundant. Check VIF values, remove the predictor with highest VIF, and refit. Alternatively, if prediction is the goal rather than interpretation, keep all predictors.
