
Scikit Survival Engine

All-in-one skill covering the scikit-survival toolkit for survival analysis. Includes structured workflows, validation checks, and reusable patterns for scientific computing.



Perform survival analysis and time-to-event modeling using scikit-survival, a Python library that combines survival analysis techniques with scikit-learn's API. This skill covers Cox proportional hazards models, random survival forests, Kaplan-Meier estimation, concordance metrics, and handling censored data.

When to Use This Skill

Choose Scikit Survival Engine when you need to:

  • Model time-to-event data with right-censored observations
  • Train survival models using scikit-learn-compatible pipelines and cross-validation
  • Compute survival functions, hazard ratios, and concordance indices
  • Apply machine learning methods (random survival forests, gradient boosting) to survival data

Consider alternatives when:

  • You need basic Kaplan-Meier and log-rank tests only (use lifelines)
  • You need deep learning survival models (use PyCox or DeepSurv)
  • You need competing risks analysis (use lifelines or R's cmprsk)

Quick Start

```
pip install scikit-survival pandas numpy matplotlib
```

```python
import numpy as np
import pandas as pd
from sksurv.datasets import load_gbsg2
from sksurv.preprocessing import OneHotEncoder
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_censored
from sklearn.model_selection import train_test_split

# Load dataset (German Breast Cancer Study Group 2)
X, y = load_gbsg2()
print(f"Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"Events observed: {y['cens'].sum()} / {len(y)}")

# Encode categorical features
Xt = OneHotEncoder().fit_transform(X)

# Split data
Xt_train, Xt_test, y_train, y_test = train_test_split(
    Xt, y, test_size=0.25, random_state=42
)

# Train Random Survival Forest
rsf = RandomSurvivalForest(
    n_estimators=100,
    min_samples_split=10,
    min_samples_leaf=15,
    random_state=42,
    n_jobs=-1
)
rsf.fit(Xt_train, y_train)

# Evaluate
prediction = rsf.predict(Xt_test)
c_index = concordance_index_censored(
    y_test['cens'], y_test['time'], prediction
)
print(f"Concordance index: {c_index[0]:.3f}")
```

Core Concepts

Survival Models Available

| Model | Class | Best For |
|---|---|---|
| Cox PH | CoxPHSurvivalAnalysis | Linear hazard relationships |
| Cox PH + elastic net | CoxnetSurvivalAnalysis | High-dimensional, regularized |
| Survival SVM | FastSurvivalSVM (kernel variant: FastKernelSurvivalSVM) | Ranking-based; non-linear via kernels |
| Random Survival Forest | RandomSurvivalForest | Non-linear, feature interactions |
| Gradient Boosted Survival | GradientBoostingSurvivalAnalysis | Strong predictive performance |
| Component-wise GBS | ComponentwiseGradientBoostingSurvivalAnalysis | Feature selection + survival |

Cox Model with Pipeline

```python
from sksurv.linear_model import CoxPHSurvivalAnalysis, CoxnetSurvivalAnalysis
from sksurv.preprocessing import OneHotEncoder
from sksurv.metrics import concordance_index_censored
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
import numpy as np
import warnings

warnings.filterwarnings('ignore')

# Pipeline with regularized Cox PH
pipe = Pipeline([
    ('encode', OneHotEncoder()),
    ('scale', StandardScaler()),
    ('cox', CoxnetSurvivalAnalysis(l1_ratio=0.9, alpha_min_ratio=0.01, max_iter=1000))
])

# The structured array y must have dtype [('event', bool), ('time', float)].
# sksurv datasets provide this automatically. For custom data:
def make_survival_target(event_col, time_col):
    """Create structured array for sksurv from event and time columns."""
    dt = np.dtype([('event', bool), ('time', float)])
    y = np.empty(len(event_col), dtype=dt)
    y['event'] = event_col.astype(bool)
    y['time'] = time_col.astype(float)
    return y
```

Kaplan-Meier Estimation

```python
from sksurv.nonparametric import kaplan_meier_estimator
from sksurv.datasets import load_gbsg2
import matplotlib.pyplot as plt

# Example with gbsg2 data
X, y = load_gbsg2()

# KM curve by hormone therapy status
for value in X['horTh'].unique():
    mask = X['horTh'] == value
    time, survival_prob, conf_int = kaplan_meier_estimator(
        y['cens'][mask], y['time'][mask], conf_type='log-log'
    )
    plt.step(time, survival_prob, where='post', label=f'horTh={value}')
    plt.fill_between(time, conf_int[0], conf_int[1], alpha=0.15, step='post')

plt.xlabel('Time (days)')
plt.ylabel('Survival probability')
plt.title('Kaplan-Meier Survival Curves')
plt.legend()
plt.tight_layout()
plt.savefig('km_curves.pdf')
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| n_estimators | Number of trees (RSF/GBSA) | 100 |
| max_depth | Maximum tree depth | None for RSF (unlimited); 3 for GBSA |
| min_samples_split | Minimum samples to split a node | 6 |
| min_samples_leaf | Minimum samples in a leaf | 3 |
| l1_ratio | Elastic net mixing (CoxNet) | 0.5 |
| alpha | Regularization strength (CoxNet) | Auto-selected |
| learning_rate | Step size (gradient boosting) | 0.1 |
| n_jobs | Parallel workers | None (single core) |

Best Practices

  1. Always check the proportional hazards assumption for Cox models — Cox PH assumes that hazard ratios are constant over time. Use Schoenfeld residuals to test this: if violated, consider time-varying covariates, stratified Cox models, or switch to tree-based methods which don't require this assumption.

  2. Encode the survival target as a structured NumPy array — scikit-survival requires y as a structured array with ('event', bool) and ('time', float) fields. Use np.dtype([('event', bool), ('time', float)]) and populate it carefully. A regular NumPy array or pandas DataFrame will raise cryptic errors.

  3. Use concordance index as the primary evaluation metric — The C-index measures how well the model ranks individuals by risk. A C-index of 0.5 is random, >0.7 is acceptable, >0.8 is good. Use concordance_index_censored which properly handles censored observations. Don't use accuracy or RMSE for survival data.

  4. Handle high-dimensional data with regularized Cox models — When features outnumber samples (common in genomics), use CoxnetSurvivalAnalysis with elastic net regularization. Set l1_ratio > 0.5 for feature selection (sparse solutions) and tune alpha via cross-validation.

  5. Validate with time-aware cross-validation — Standard k-fold CV is acceptable for survival data if there's no temporal ordering. For longitudinal cohorts, use time-aware splitting to avoid using future data to predict past events. Report both C-index and integrated Brier score for comprehensive evaluation.

Common Issues

"ValueError: y must be a structured array" when fitting — scikit-survival requires the target as a structured NumPy array, not a DataFrame or regular array. Create it with: y = np.array([(e, t) for e, t in zip(events, times)], dtype=[('event', bool), ('time', float)]). The event field must be boolean, not integer.

Concordance index returns 0.5 or lower — This indicates the model is no better than random. Check that features are properly encoded (categorical to numeric), scaled (for Cox models), and that the dataset has enough events relative to features. Very low event rates (e.g. fewer than 10% of observations experiencing the event, meaning heavy censoring) can also make learning difficult.

CoxnetSurvivalAnalysis convergence warnings — Increase max_iter (try 10000) and ensure features are standardized. Also check for multicollinearity — highly correlated features cause numerical instability. Remove correlated features (r > 0.9) or use PCA before fitting the Cox model.
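Dropping one member of each highly correlated pair can be scripted with pandas; a sketch using the r > 0.9 threshold suggested above (the toy data is synthetic, constructed so that column "b" nearly duplicates "a"):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),  # nearly duplicates "a"
    "c": rng.normal(size=200),                  # independent feature
})

# Keep the upper triangle of the |correlation| matrix, then drop any column
# whose correlation with an earlier column exceeds 0.9
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)  # ['b']
```

Scanning only the upper triangle ensures each correlated pair is counted once, so exactly one column of the pair is removed.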
