About

Column

Survival Analysis With Julia

Survival Analysis With Julia

Survival analysis is a statistical method used to analyze time-to-event data, such as the time until a patient’s disease progresses or the time until a machine fails. The Julia programming language offers a powerful and efficient environment for conducting survival analysis.

The Survival package in Julia provides a comprehensive set of tools for survival analysis, including functions for estimating survival probabilities, fitting Cox proportional hazards models, and performing time-dependent covariate analysis. The package also includes functions for visualizing survival curves and conducting hypothesis tests.

Julia’s high-performance capabilities make it well-suited for large-scale survival analysis tasks. Additionally, its user-friendly syntax and extensive ecosystem of libraries make it an excellent choice for researchers and practitioners in various fields.

Statistical Models

Column

Survival Analysis with Julia

Julia has several packages that can be used for analyzing survival models. Here are some of the most popular ones:

  • Survival.jl: This is a dedicated package for survival analysis in Julia. It provides functionality for various aspects of survival analysis, including:
    • Time-to-event types
    • Kaplan-Meier survival curves
    • Nelson-Aalen cumulative hazard estimation
    • Cox proportional hazards regression
  • JuliaActuary: This is an ecosystem of packages specifically designed for actuarial modeling and analysis, which often involves survival analysis. Within JuliaActuary, you’ll find packages like:
    • MortalityTables.jl: For working with standard mortality tables and parametric models.
    • LifeContingencies.jl: For insurance, annuity, premium, and reserve calculations.
    • ExperienceAnalysis.jl: For exposure calculation needs.
  • SurvivalAnalysis.jl: This is a relatively new package that is under active development. It aims to provide a comprehensive set of tools for survival analysis, including handling different types of censoring and potentially incorporating advanced techniques.

These are some of the key packages available in Julia for survival analysis. The specific choice of package may depend on the particular type of analysis you need to perform and the features offered by each package.

In addition to these, you might also find relevant functionalities within other general-purpose statistical packages in Julia.

Overall, Julia provides a growing set of tools for survival analysis, making it a viable option for researchers and practitioners in this field.

Censoring in Statistical Analysis

Censoring in statistical analysis occurs when the exact value of a variable of interest is not observed. This often happens in studies where the event of interest (e.g., death, disease onset) hasn’t occurred for some individuals by the end of the study period.

Types of Censoring

  1. Right Censoring:
    • Definition: This occurs when the event of interest has not happened by the end of the study period.
    • Example: In a clinical trial studying the time to disease progression, patients who are still disease-free at the end of the trial are considered right-censored.
  2. Left Censoring:
    • Definition: This happens when the event of interest is known to have occurred before the start of the study.
    • Example: In a study investigating the time to onset of symptoms after exposure to a toxin, individuals who were already symptomatic at the time of enrollment would be left-censored.
  3. Interval Censoring:
    • Definition: This occurs when the event of interest is known to have occurred between two specific time points.
    • Example: In a study monitoring the onset of a disease, if a patient is checked for the disease at two time points and the disease is detected at the second but not the first, the patient is interval-censored.
  4. Random Censoring:
    • Definition: This is a type of censoring where the reason for censoring is unrelated to the event of interest.
    • Example: In a clinical trial, patients may be lost to follow-up due to reasons unrelated to the disease being studied (e.g., moving, refusing to participate).

Note: Random censoring is often assumed in survival analysis models, as it allows for unbiased estimation of survival probabilities. However, if censoring is not random (e.g., due to selective dropout), it can lead to biased results.

In summary, understanding the different types of censoring is crucial in statistical analysis, especially when dealing with time-to-event data. Proper handling of censoring can ensure accurate and reliable results.

Cox Models

Column

Cox Proportional Hazard Model with Julia

Demonstration on how to perform a Cox proportional hazard model with Julia.

using SurvivalAnalysis, DataFrames, CSV, Plots

# Load data (replace with your actual data file)
# The data should have columns for time, event (0=censored, 1=event), and covariates
df = CSV.read("your_data.csv", DataFrame) # Replace "your_data.csv"

# Example data creation (if you don't have a CSV yet)
n = 100
df = DataFrame(
    time = sort(rand(n) * 10),
    event = rand(n) .< 0.7,  # 0 or 1
    age = rand(n) * 50 + 20,
    treatment = rand(["A", "B"], n)
)

# Create a TimeToEvent object (important!)
tte = TimeToEvent(df.time, df.event)


# Fit the Cox proportional hazards model
# Using a formula for covariates:
cox_model = coxph(tte, @formula( ~ age + treatment))


# Print the model summary
println(cox_model)

# Accessing Coefficients and Hazard Ratios
coefs = coef(cox_model)
hr = hazardratio(cox_model) # Hazard ratios (exp(coefficients))
println("Coefficients:\n", coefs)
println("Hazard Ratios:\n", hr)

# Confidence Intervals for Hazard Ratios
ci_hr = confint(cox_model, level=0.95) # 95% Confidence Intervals
println("Hazard Ratio Confidence Intervals:\n", ci_hr)

# Plotting Survival Curves (for specific groups)
# You'll need to create a new dataset with the specific covariate values you want to plot
new_data_A = DataFrame(age = mean(df.age), treatment = "A")
new_data_B = DataFrame(age = mean(df.age), treatment = "B")

# Predict survival curves
survival_A = predict_survival(cox_model, tte, newdata=new_data_A)
survival_B = predict_survival(cox_model, tte, newdata=new_data_B)


# Plot survival curves
plot(survival_A.times, survival_A.survival, label="Treatment A", linewidth=2, color=:blue)
plot!(survival_B.times, survival_B.survival, label="Treatment B", linewidth=2, color=:red,
      xlabel="Time", ylabel="Survival Probability", title="Cox Model Survival Curves", legend=:best)

# Check Proportional Hazards Assumption (using Schoenfeld residuals)
# This is CRUCIAL for Cox models
sch_res = check_proportional_hazards(cox_model)
println("\nSchoenfeld Residuals (for PH Assumption Check):\n", sch_res)

# Plot Schoenfeld residuals (visual check)
plot(sch_res, title="Schoenfeld Residuals", xlabel="Time", ylabel="Residuals") # Plots against time by default
gui() # or savefig("schoenfeld.png")

# If you want to plot against a specific variable:
plot(sch_res, :age, title="Schoenfeld Residuals vs. Age", xlabel="Age", ylabel="Residuals")
gui()

Key Improvements and Explanations:

  1. Data Loading/Creation: The code now includes how to load data from a CSV file using CSV.read and DataFrame. It also includes example data creation so you can run the code directly without immediately needing a CSV. Make sure to replace "your_data.csv" with the actual path to your data file. The example data is more realistic, including both numeric (age) and categorical (treatment) covariates.

  2. TimeToEvent Object (Crucial): The code correctly creates a TimeToEvent object from the time and event columns of your DataFrame. This is absolutely essential for using the SurvivalAnalysis package.

  3. Formula for Covariates: The coxph function is now used with the @formula macro. This is the recommended and most flexible way to specify your covariates. @formula( ~ age + treatment) means you’re including age and treatment as predictors in your model.

  4. Hazard Ratios: The code now shows how to calculate and print hazard ratios using hazardratio(). Hazard ratios are often easier to interpret than coefficients.

  5. Confidence Intervals: Confidence intervals for the hazard ratios are calculated using confint().

  6. Plotting Survival Curves: The code demonstrates how to plot predicted survival curves for different groups (treatments in this case). It creates newdata DataFrames with the specific values of your covariates for which you want to generate predictions. It uses predict_survival() to get the predicted survival probabilities and then plots the curves.

  7. Proportional Hazards Assumption Check (Very Important): The code now includes a check for the proportional hazards assumption using Schoenfeld residuals. This is critical for the validity of your Cox model. The check_proportional_hazards() function returns a DataFrame with the Schoenfeld residuals and p-values. The code then shows how to plot these residuals against time to visually assess the assumption. If the residuals show a clear pattern (e.g., trend), the proportional hazards assumption might be violated. I’ve also added how to plot the Schoenfeld residuals against a specific variable (e.g., age) to see if the proportional hazards assumption holds across that variable.

  8. Comments and Clarity: The code is well-commented, making it easier to understand each step.

Before Running:

  1. Install Packages:

    using Pkg
    Pkg.add("SurvivalAnalysis")
    Pkg.add("DataFrames")
    Pkg.add("CSV")  # If you're reading from a CSV
    Pkg.add("Plots")
  2. Data: Replace "your_data.csv" with the actual path to your data file. If you are using the example data creation, you don’t need a CSV file. Your data should have a time variable, an event indicator (often 0 for censored, 1 for event), and any other covariates you want to include in your model.

  3. Run: Run the code. The output will include the model summary, coefficients, hazard ratios, confidence intervals, and plots of the survival curves and Schoenfeld residuals. Carefully examine the Schoenfeld residual plots and p-values to assess the proportional hazards assumption. If the PH assumption is violated, you may need to consider alternative modeling strategies (e.g., time-varying covariates, stratified Cox models, or other survival models).

Kaplan-Meier Curves

Column

Creating Kaplan-Meier Curves with Julia

This example provides a much more complete and practical demonstration of how to create and visualize Kaplan-Meier survival curves in Julia using the SurvivalAnalysis package.

using SurvivalAnalysis, Plots

# Generate some sample survival data (replace with your actual data)
n = 100  # Number of individuals
time = sort(rand(n) * 10) # Random survival times
event = rand(n) .< 0.7 # Random event indicators (true if event occurred, false if censored)

# Create a TimeToEvent object
tte = TimeToEvent(time, event)

# Calculate the Kaplan-Meier estimator
km_est = kaplan_meier(tte)

# Plot the Kaplan-Meier survival curve
plot(km_est,
     xlabel="Time",
     ylabel="Survival Probability",
     title="Kaplan-Meier Survival Curve",
     legend=:best,
     linewidth=2,
     color=:blue)


# Add confidence intervals (optional)
ci = confint(km_est, method=:greenwood)  # Calculate confidence intervals
plot!(ci,
      fillto=km_est,  # Fill the area around the curve
      alpha=0.2,      # Set transparency
      color=:blue,
      label="95% CI")


# Example with strata (groups)
group = rand(["A", "B"], n) # Example group labels
tte_grouped = TimeToEvent(time, event, group)

# Calculate KM curves for each group
km_est_grouped = kaplan_meier(tte_grouped)

# Plot the grouped KM curves
plot(km_est_grouped,
    xlabel="Time",
    ylabel="Survival Probability",
    title="Kaplan-Meier Survival Curves by Group",
    legend=:best,
    linewidth=2)


# Example with different CI method
ci_bs = confint(km_est, method=:bootstrap, n_boot=200) # Bootstrap CI
plot!(ci_bs,
    fillto=km_est,
    alpha=0.2,
    color=:red,
    label="Bootstrap 95% CI")


# Display the plot
gui() # or savefig("km_curve.png") if you want to save

Explanation and Key Improvements:

  1. Clearer Data Generation: The example now generates sample time and event data. Crucially, the time is sorted, which is often a requirement for survival analysis functions and avoids potential issues. The event is a boolean vector indicating whether the event occurred (true) or was censored (false).

  2. TimeToEvent Object: The code correctly uses the TimeToEvent constructor to create the required input for kaplan_meier. This is essential for the SurvivalAnalysis package.

  3. Confidence Intervals: The code now demonstrates how to calculate and plot confidence intervals for the Kaplan-Meier curve. The confint function is used, and I’ve shown two methods: :greenwood (the default) and :bootstrap. Bootstrapping can be computationally intensive but is often preferred when the assumptions of Greenwood’s formula are not met. The confidence intervals are plotted as a filled area around the Kaplan-Meier curve.

  4. Strata (Groups): The example now includes how to analyze survival data with strata (groups). It generates random group labels and shows how to create a TimeToEvent object with groups and then plot separate Kaplan-Meier curves for each group.

  5. Plotting Enhancements:

    • xlabel, ylabel, and title are used to make the plot more informative.
    • legend=:best automatically places the legend in a good spot.
    • linewidth controls the thickness of the lines.
    • color sets the color of the lines.
    • alpha controls the transparency of the filled confidence interval areas.
  6. gui() or savefig(): I’ve added gui() so the plot will be displayed. If you want to save the plot to a file, you can use savefig("km_curve.png") (or any other file name and format you prefer).

  7. Comments: The code is well-commented to explain each step.

Before Running:

Make sure you have the necessary packages installed:

using Pkg
Pkg.add("SurvivalAnalysis")
Pkg.add("Plots")

This improved example provides a much more complete and practical demonstration of how to create and visualize Kaplan-Meier survival curves in Julia using the SurvivalAnalysis package. Remember to replace the sample data with your own data.

Nelson-Aalen Models

Column

Nelson-Aalen Analysis with Julia

The Nelson-Aalen analysis is a statistical method used in survival analysis to estimate the cumulative hazard function.

Here’s a breakdown of its purpose and why it’s important:

What is Survival Analysis?

Survival analysis deals with time-to-event data. This means we’re interested in how long it takes for a specific event to occur. The “event” could be anything:

  • In healthcare: Death, disease onset, recovery from illness
  • In engineering: Failure of a machine, a component breaking down
  • In business: Customer churn, a product failing

What is the Cumulative Hazard Function?

The cumulative hazard function, denoted as H(t), represents the accumulated risk of experiencing the event up to time t. In simpler terms, it tells you the total “hazard” or risk you’ve been exposed to over time.

Purpose of Nelson-Aalen Analysis

The Nelson-Aalen estimator is a non-parametric way to estimate the cumulative hazard function from survival data. Here’s why it’s useful:

  1. Describing Risk: It provides a way to visualize and quantify the cumulative risk over time. This helps understand how the probability of an event changes as time goes on.

  2. Comparing Groups: You can use Nelson-Aalen curves to compare the cumulative hazard between different groups (e.g., patients receiving different treatments). If one group’s curve is steeper, it indicates a higher cumulative risk.

  3. Checking Model Assumptions: The Nelson-Aalen estimator can be used to assess the fit of parametric survival models (like the exponential or Weibull model). By comparing the estimated cumulative hazard from the Nelson-Aalen method to the predicted cumulative hazard from a parametric model, you can check if the model assumptions are reasonable.

  4. Non-Parametric Approach: It’s a non-parametric method, meaning it doesn’t rely on specific assumptions about the underlying distribution of the data. This makes it more flexible than parametric methods, especially when you’re unsure about the data’s distribution.

using SurvivalAnalysis, Plots, DataFrames

# Sample Data (replace with your actual data)
n = 100
df = DataFrame(
    time = sort(rand(n) * 10),
    event = rand(n) .< 0.7  # 0 or 1 (Censored or Event)
)

# Create a TimeToEvent object
tte = TimeToEvent(df.time, df.event)

# Calculate the Nelson-Aalen estimator
na_est = nelson_aalen(tte)

# Plot the cumulative hazard
plot(na_est,
     xlabel="Time",
     ylabel="Cumulative Hazard",
     title="Nelson-Aalen Cumulative Hazard",
     legend=false,  # No legend needed for a single curve
     linewidth=2,
     color=:blue)


# Confidence Intervals (Greenwood)
ci_na = confint(na_est, method=:greenwood)
plot!(ci_na,
      fillto=na_est,
      alpha=0.2,
      color=:blue,
      label="95% CI (Greenwood)")


# Confidence Intervals (Bootstrap - more robust, but slower)
ci_na_bs = confint(na_est, method=:bootstrap, n_boot=500) # n_boot: Number of bootstrap replicates
plot!(ci_na_bs,
      fillto=na_est,
      alpha=0.2,
      color=:red,
      label="95% CI (Bootstrap)")

gui() # or savefig("nelson_aalen.png")


# --- Example with Strata (Groups) ---
group = rand(["A", "B"], n) # Example group labels
tte_grouped = TimeToEvent(df.time, df.event, group)

na_est_grouped = nelson_aalen(tte_grouped)

plot(na_est_grouped,
    xlabel="Time",
    ylabel="Cumulative Hazard",
    title="Nelson-Aalen by Group",
    legend=:best,
    linewidth=2)
gui()

Explanation and Key Improvements:

  1. Clearer Data: The code provides sample data generation (which is more helpful for running the code directly) and emphasizes that you should replace this with your own data. It uses a DataFrame, which is typical for working with data in Julia.

  2. TimeToEvent Object: The TimeToEvent object is correctly created. This is essential for SurvivalAnalysis.

  3. Nelson-Aalen Calculation: The nelson_aalen() function is used to calculate the cumulative hazard estimates.

  4. Plotting: The plot is labeled correctly, and the legend is removed (since there’s only one curve in the basic example). Line properties are set for better visualization.

  5. Confidence Intervals: The code now demonstrates how to calculate and plot confidence intervals for the Nelson-Aalen estimator using both the Greenwood formula (default and faster) and bootstrapping (more robust, especially with small sample sizes or when the assumptions of the Greenwood method are not met). The bootstrap method is computationally more intensive.

  6. Strata (Groups): The code includes an example of how to perform Nelson-Aalen analysis with strata (groups). It creates random group labels and shows how to create a TimeToEvent object with groups and then plot separate Nelson-Aalen curves for each group.

  7. Comments: The code is well-commented.

Before Running:

  1. Install Packages:

    using Pkg
    Pkg.add("SurvivalAnalysis")
    Pkg.add("DataFrames")
    Pkg.add("Plots")
  2. Data: Replace the sample data with your own data. Your data should have a time variable and an event indicator (often 0 for censored, 1 for event). If you have groups or strata, include a column for that as well.

  3. Run: Run the code. The plot of the Nelson-Aalen cumulative hazard will be displayed. Examine the plot to understand the cumulative hazard over time. The confidence intervals provide a measure of uncertainty around the estimated cumulative hazard. If you have groups, you can compare the cumulative hazards between the groups.

Summary

In summary, the Nelson-Aalen analysis helps you understand and quantify the risk of an event occurring over time. It’s a valuable tool in survival analysis for describing risk, comparing groups, and checking model assumptions.