
Scikit Survival Engine

All-in-one skill covering the scikit-survival toolkit for survival analysis. Includes structured workflows, validation checks, and reusable patterns for scientific computing.



Perform survival analysis and time-to-event modeling using scikit-survival, a Python library that combines survival analysis techniques with scikit-learn's API. This skill covers Cox proportional hazards models, random survival forests, Kaplan-Meier estimation, concordance metrics, and handling censored data.

When to Use This Skill

Choose Scikit Survival Engine when you need to:

  • Model time-to-event data with right-censored observations
  • Train survival models using scikit-learn-compatible pipelines and cross-validation
  • Compute survival functions, hazard ratios, and concordance indices
  • Apply machine learning methods (random survival forests, gradient boosting) to survival data

Consider alternatives when:

  • You need basic Kaplan-Meier and log-rank tests only (use lifelines)
  • You need deep learning survival models (use PyCox or DeepSurv)
  • You need competing risks analysis (use lifelines or R's cmprsk)

Quick Start

```
pip install scikit-survival pandas numpy matplotlib
```

```python
import numpy as np
import pandas as pd
from sksurv.datasets import load_gbsg2
from sksurv.preprocessing import OneHotEncoder
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_censored
from sklearn.model_selection import train_test_split

# Load dataset (German Breast Cancer Study Group 2)
X, y = load_gbsg2()
print(f"Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"Events observed: {y['cens'].sum()} / {len(y)}")

# Encode categorical features
Xt = OneHotEncoder().fit_transform(X)

# Split data
Xt_train, Xt_test, y_train, y_test = train_test_split(
    Xt, y, test_size=0.25, random_state=42
)

# Train Random Survival Forest
rsf = RandomSurvivalForest(
    n_estimators=100,
    min_samples_split=10,
    min_samples_leaf=15,
    random_state=42,
    n_jobs=-1
)
rsf.fit(Xt_train, y_train)

# Evaluate
prediction = rsf.predict(Xt_test)
c_index = concordance_index_censored(
    y_test['cens'], y_test['time'], prediction
)
print(f"Concordance index: {c_index[0]:.3f}")
```

Core Concepts

Survival Models Available

| Model | Class | Best For |
|---|---|---|
| Cox PH | CoxPHSurvivalAnalysis | Linear hazard relationships |
| Cox PH + elastic net | CoxnetSurvivalAnalysis | High-dimensional, regularized |
| Survival SVM | FastSurvivalSVM (kernel variant: FastKernelSurvivalSVM) | Ranking-based; non-linear via kernels |
| Random Survival Forest | RandomSurvivalForest | Non-linear, feature interactions |
| Gradient Boosted Survival | GradientBoostingSurvivalAnalysis | Strong predictive performance |
| Component-wise GBS | ComponentwiseGradientBoostingSurvivalAnalysis | Feature selection + survival |

Cox Model with Pipeline

```python
from sksurv.linear_model import CoxPHSurvivalAnalysis, CoxnetSurvivalAnalysis
from sksurv.preprocessing import OneHotEncoder
from sksurv.metrics import concordance_index_censored
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
import numpy as np
import warnings

warnings.filterwarnings('ignore')

# Pipeline with regularized Cox PH
pipe = Pipeline([
    ('encode', OneHotEncoder()),
    ('scale', StandardScaler()),
    ('cox', CoxnetSurvivalAnalysis(l1_ratio=0.9, alpha_min_ratio=0.01, max_iter=1000))
])

# The structured array y must have dtype [('event', bool), ('time', float)].
# sksurv datasets provide this automatically. For custom data:
def make_survival_target(event_col, time_col):
    """Create structured array for sksurv from event and time columns."""
    dt = np.dtype([('event', bool), ('time', float)])
    y = np.empty(len(event_col), dtype=dt)
    y['event'] = event_col.astype(bool)
    y['time'] = time_col.astype(float)
    return y
```

Kaplan-Meier Estimation

```python
from sksurv.nonparametric import kaplan_meier_estimator
from sksurv.datasets import load_gbsg2
import matplotlib.pyplot as plt

# Example with gbsg2 data
X, y = load_gbsg2()

# KM curve by hormone therapy status
for value in X['horTh'].unique():
    mask = X['horTh'] == value
    time, survival_prob, conf_int = kaplan_meier_estimator(
        y['cens'][mask], y['time'][mask], conf_type='log-log'
    )
    plt.step(time, survival_prob, where='post', label=f'horTh={value}')
    plt.fill_between(time, conf_int[0], conf_int[1], alpha=0.15, step='post')

plt.xlabel('Time (days)')
plt.ylabel('Survival probability')
plt.title('Kaplan-Meier Survival Curves')
plt.legend()
plt.tight_layout()
plt.savefig('km_curves.pdf')
```

Configuration

| Parameter | Description | Default |
|---|---|---|
| n_estimators | Number of trees (RSF/GBSA) | 100 |
| max_depth | Maximum tree depth | None for RSF (unlimited); 3 for GBSA |
| min_samples_split | Minimum samples to split a node | 6 |
| min_samples_leaf | Minimum samples in a leaf | 3 |
| l1_ratio | Elastic net mixing (CoxNet) | 0.5 |
| alpha | Regularization strength (CoxNet) | Auto-selected |
| learning_rate | Step size (gradient boosting) | 0.1 |
| n_jobs | Parallel workers | None (single core) |

Best Practices

  1. Always check the proportional hazards assumption for Cox models — Cox PH assumes that hazard ratios are constant over time. Use Schoenfeld residuals to test this: if violated, consider time-varying covariates, stratified Cox models, or switch to tree-based methods which don't require this assumption.

  2. Encode the survival target as a structured NumPy array — scikit-survival requires y as a structured array with ('event', bool) and ('time', float) fields. Use np.dtype([('event', bool), ('time', float)]) and populate it carefully. A regular NumPy array or pandas DataFrame will raise cryptic errors.

  3. Use concordance index as the primary evaluation metric — The C-index measures how well the model ranks individuals by risk. A C-index of 0.5 is random, >0.7 is acceptable, >0.8 is good. Use concordance_index_censored which properly handles censored observations. Don't use accuracy or RMSE for survival data.

  4. Handle high-dimensional data with regularized Cox models — When features outnumber samples (common in genomics), use CoxnetSurvivalAnalysis with elastic net regularization. Set l1_ratio > 0.5 for feature selection (sparse solutions) and tune alpha via cross-validation.

  5. Validate with time-aware cross-validation — Standard k-fold CV is acceptable for survival data if there's no temporal ordering. For longitudinal cohorts, use time-aware splitting to avoid using future data to predict past events. Report both C-index and integrated Brier score for comprehensive evaluation.

Common Issues

"ValueError: y must be a structured array" when fitting — scikit-survival requires the target as a structured NumPy array, not a DataFrame or regular array. Create it with: y = np.array([(e, t) for e, t in zip(events, times)], dtype=[('event', bool), ('time', float)]). The event field must be boolean, not integer.

Concordance index returns 0.5 or lower — This indicates the model is no better than random. Check that features are properly encoded (categorical to numeric), scaled (for Cox models), and that the dataset has enough events relative to features. Very low event rates (e.g. fewer than 10% of observations experiencing the event, meaning heavy censoring) can also make learning difficult.

CoxnetSurvivalAnalysis convergence warnings — Increase max_iter (try 10000) and ensure features are standardized. Also check for multicollinearity — highly correlated features cause numerical instability. Remove correlated features (r > 0.9) or use PCA before fitting the Cox model.
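Dropping one member of each highly correlated pair can be scripted with pandas; a sketch using the r > 0.9 threshold suggested above (the toy data is synthetic, constructed so that column "b" nearly duplicates "a"):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),  # nearly duplicates "a"
    "c": rng.normal(size=200),                  # independent feature
})

# Keep the upper triangle of the |correlation| matrix, then drop any column
# whose correlation with an earlier column exceeds 0.9
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)  # ['b']
```

Scanning only the upper triangle ensures each correlated pair is counted once, so exactly one column of the pair is removed.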
