Scikit Survival Engine
Perform survival analysis and time-to-event modeling using scikit-survival, a Python library that combines survival analysis techniques with scikit-learn's API. This skill covers Cox proportional hazards models, random survival forests, Kaplan-Meier estimation, concordance metrics, and handling censored data.
When to Use This Skill
Choose Scikit Survival Engine when you need to:
- Model time-to-event data with right-censored observations
- Train survival models using scikit-learn-compatible pipelines and cross-validation
- Compute survival functions, hazard ratios, and concordance indices
- Apply machine learning methods (random survival forests, gradient boosting) to survival data
Consider alternatives when:
- You need basic Kaplan-Meier and log-rank tests only (use lifelines)
- You need deep learning survival models (use PyCox or DeepSurv)
- You need competing risks analysis (use lifelines or R's cmprsk)
Quick Start
```bash
pip install scikit-survival pandas numpy matplotlib
```
```python
import numpy as np
import pandas as pd
from sksurv.datasets import load_gbsg2
from sksurv.preprocessing import OneHotEncoder
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_censored
from sklearn.model_selection import train_test_split

# Load dataset (German Breast Cancer Study Group 2)
X, y = load_gbsg2()
print(f"Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"Events observed: {y['cens'].sum()} / {len(y)}")

# Encode categorical features
Xt = OneHotEncoder().fit_transform(X)

# Split data
Xt_train, Xt_test, y_train, y_test = train_test_split(
    Xt, y, test_size=0.25, random_state=42
)

# Train Random Survival Forest
rsf = RandomSurvivalForest(
    n_estimators=100,
    min_samples_split=10,
    min_samples_leaf=15,
    random_state=42,
    n_jobs=-1
)
rsf.fit(Xt_train, y_train)

# Evaluate
prediction = rsf.predict(Xt_test)
c_index = concordance_index_censored(
    y_test['cens'], y_test['time'], prediction
)
print(f"Concordance index: {c_index[0]:.3f}")
```
Core Concepts
Survival Models Available
| Model | Class | Best For |
|---|---|---|
| Cox PH | CoxPHSurvivalAnalysis | Linear hazard relationships |
| Cox PH + elastic net | CoxnetSurvivalAnalysis | High-dimensional, regularized |
| Survival SVM | FastSurvivalSVM | Linear ranking; use FastKernelSurvivalSVM for kernel methods |
| Random Survival Forest | RandomSurvivalForest | Non-linear, feature interactions |
| Gradient Boosted Survival | GradientBoostingSurvivalAnalysis | Often the strongest performance on tabular data |
| Component-wise GBS | ComponentwiseGradientBoostingSurvivalAnalysis | Feature selection + survival |
Cox Model with Pipeline
```python
from sksurv.linear_model import CoxPHSurvivalAnalysis, CoxnetSurvivalAnalysis
from sksurv.preprocessing import OneHotEncoder
from sksurv.metrics import concordance_index_censored
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

# Pipeline with a regularized Cox PH model
pipe = Pipeline([
    ('encode', OneHotEncoder()),
    ('scale', StandardScaler()),
    ('cox', CoxnetSurvivalAnalysis(l1_ratio=0.9, alpha_min_ratio=0.01, max_iter=1000))
])

# The structured array y must have dtype [('event', bool), ('time', float)].
# sksurv datasets provide this automatically. For custom data:
def make_survival_target(event_col, time_col):
    """Create structured array for sksurv from event and time columns."""
    dt = np.dtype([('event', bool), ('time', float)])
    y = np.empty(len(event_col), dtype=dt)
    y['event'] = event_col.astype(bool)
    y['time'] = time_col.astype(float)
    return y
```

Kaplan-Meier Estimation

```python
from sksurv.nonparametric import kaplan_meier_estimator
from sksurv.datasets import load_gbsg2
import matplotlib.pyplot as plt

# Example with the GBSG2 data
X, y = load_gbsg2()

# KM curve by hormone therapy status
for value in X['horTh'].unique():
    mask = X['horTh'] == value
    time, survival_prob, conf_int = kaplan_meier_estimator(
        y['cens'][mask], y['time'][mask], conf_type='log-log'
    )
    plt.step(time, survival_prob, where='post', label=f'horTh={value}')
    plt.fill_between(time, conf_int[0], conf_int[1], alpha=0.15, step='post')

plt.xlabel('Time (days)')
plt.ylabel('Survival probability')
plt.title('Kaplan-Meier Survival Curves')
plt.legend()
plt.tight_layout()
plt.savefig('km_curves.pdf')
```
Configuration
| Parameter | Description | Default |
|---|---|---|
| n_estimators | Number of trees (RSF/GBSA) | 100 |
| max_depth | Maximum tree depth | None (unlimited) |
| min_samples_split | Minimum samples to split a node | 6 |
| min_samples_leaf | Minimum samples in a leaf | 3 |
| l1_ratio | Elastic net mixing (CoxNet) | 0.5 |
| alpha | Regularization strength (CoxNet) | Auto-selected |
| learning_rate | Step size (gradient boosting) | 0.1 |
| n_jobs | Parallel workers | None (single core) |
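The tree parameters in the table are the natural targets for tuning. A minimal sketch of a search grid over them (the value choices here are illustrative, not recommendations):

```python
# Illustrative hyperparameter grid over the RandomSurvivalForest
# parameters listed above (example values, not recommendations).
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10],
    "min_samples_split": [6, 10],
    "min_samples_leaf": [3, 15],
}

# Such a grid would typically be handed to
# sklearn.model_selection.GridSearchCV(rsf, param_grid, cv=5),
# scoring by concordance rather than a regression metric.
print(sorted(param_grid))
```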
Best Practices
- Always check the proportional hazards assumption for Cox models — Cox PH assumes that hazard ratios are constant over time. Use Schoenfeld residuals to test this: if violated, consider time-varying covariates, stratified Cox models, or switch to tree-based methods, which don't require this assumption.
- Encode the survival target as a structured NumPy array — scikit-survival requires `y` as a structured array with `('event', bool)` and `('time', float)` fields. Use `np.dtype([('event', bool), ('time', float)])` and populate it carefully. A regular NumPy array or pandas DataFrame will raise cryptic errors.
- Use the concordance index as the primary evaluation metric — The C-index measures how well the model ranks individuals by risk. A C-index of 0.5 is random, >0.7 is acceptable, >0.8 is good. Use `concordance_index_censored`, which properly handles censored observations. Don't use accuracy or RMSE for survival data.
- Handle high-dimensional data with regularized Cox models — When features outnumber samples (common in genomics), use `CoxnetSurvivalAnalysis` with elastic net regularization. Set `l1_ratio > 0.5` for feature selection (sparse solutions) and tune `alpha` via cross-validation.
- Validate with time-aware cross-validation — Standard k-fold CV is acceptable for survival data if there is no temporal ordering. For longitudinal cohorts, use time-aware splitting to avoid using future data to predict past events. Report both the C-index and the integrated Brier score for comprehensive evaluation.
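Two of the practices above, the structured target and the C-index, can be illustrated without fitting a model. The sketch below builds the target array with plain NumPy and computes a simplified concordance index over comparable pairs; it ignores the censoring-aware tie handling that `concordance_index_censored` applies, and all data values are invented for illustration:

```python
import numpy as np

# Toy event/time data (hypothetical values for illustration)
events = np.array([1, 0, 1, 1, 0], dtype=int)
times = np.array([5.0, 12.0, 3.5, 8.0, 10.0])

# Build the structured target that scikit-survival expects
dt = np.dtype([('event', bool), ('time', float)])
y = np.empty(len(events), dtype=dt)
y['event'] = events.astype(bool)
y['time'] = times

def naive_c_index(event, time, risk):
    """Simplified C-index: a pair (i, j) is comparable when the earlier
    time ended in an observed event; a higher predicted risk should
    accompany the shorter survival time."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties count half
    return concordant / comparable

risk = np.array([2.0, 0.5, 3.0, 0.4, 0.8])  # hypothetical model scores
print(round(naive_c_index(y['event'], y['time'], risk), 3))  # → 0.778
```

Of the nine comparable pairs here, seven are ranked correctly, giving 7/9 ≈ 0.778; on real data, use `concordance_index_censored` rather than a hand-rolled version.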
Common Issues
"ValueError: y must be a structured array" when fitting — scikit-survival requires the target as a structured NumPy array, not a DataFrame or regular array. Create it with: y = np.array([(e, t) for e, t in zip(events, times)], dtype=[('event', bool), ('time', float)]). The event field must be boolean, not integer.
Concordance index returns 0.5 or lower — This indicates the model is no better than random. Check that features are properly encoded (categorical to numeric), scaled (for Cox models), and that the dataset has enough events relative to features. Very low event rates (e.g., over 90% of observations censored) can also make learning difficult.
CoxnetSurvivalAnalysis convergence warnings — Increase `max_iter` (try 10000) and ensure features are standardized. Also check for multicollinearity — highly correlated features cause numerical instability. Remove correlated features (r > 0.9) or use PCA before fitting the Cox model.
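For the multicollinearity issue, a pure-NumPy sketch of the correlation-based pruning described above (synthetic data; the 0.9 threshold matches the text):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic design matrix: column 2 is nearly a copy of column 0,
# mimicking the multicollinearity that destabilizes Coxnet fitting.
X = rng.normal(size=(200, 4))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=200)

corr = np.abs(np.corrcoef(X, rowvar=False))
keep = []
for j in range(X.shape[1]):
    # Keep a column only if it correlates <= 0.9 with every kept column
    if all(corr[j, k] <= 0.9 for k in keep):
        keep.append(j)
X_pruned = X[:, keep]
print(keep)  # column 2 is dropped
```

This greedy pass keeps the first column of each correlated group; PCA is the alternative when you would rather combine correlated features than drop them.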