The survival function, also known as the survivor function or reliability function, describes a random variable representing the time to some event, usually the death or failure of a system.
It gives the probability that the system will survive beyond a specified time.
Let \(T\) be a continuous random variable with cumulative distribution function \(F(t)\) on the interval \([0,\infty)\).
Its survival function or reliability function is:
\[R(t) = P(T > t)\]
\[R(t) = \int_t^{\infty} f(u)\,du,\]
\[R(t) = 1-F(t).\]
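As a concrete sanity check on these relationships, one can take \(T\) to be exponential with rate \(\lambda\); the rate value in this Python sketch is a hypothetical choice:

```python
import math

LAM = 0.5  # hypothetical failure rate, chosen only for illustration

def F(t, lam=LAM):
    """CDF of an exponential lifetime: F(t) = 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t)

def R(t, lam=LAM):
    """Survival / reliability function: R(t) = P(T > t) = exp(-lam * t)."""
    return math.exp(-lam * t)

# R(t) = 1 - F(t) holds at every t >= 0
for t in [0.0, 1.0, 2.5, 10.0]:
    assert abs(R(t) - (1.0 - F(t))) < 1e-12
```

The same check works for any continuous lifetime distribution; only the closed forms of \(F\) and \(R\) change.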
The most common model used to determine the effects of covariates on survival is the Cox proportional hazards model:
\[ h_i(t)=h_0(t)\exp(\beta_{1}x_{i1} + \beta_{2}x_{i2} + \cdots + \beta_{k}x_{ik}) \]
It is a semi-parametric model:

- The baseline hazard function is left unspecified.
- The effects of the covariates are multiplicative.
- It makes no arbitrary assumptions about the shape/form of the baseline hazard function.

The Proportional Hazards Assumption

- Covariates multiply the hazard by some constant, e.g. a drug may halve a subject's risk of death at any time \(t\).
- The effect is the same at any time point.
- Violating the PH assumption can seriously invalidate your model!
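The multiplicative structure can be sketched in Python; the baseline hazard and the coefficient below are made-up values for illustration, not fitted estimates:

```python
import math

def cox_hazard(t, x, beta, h0=lambda t: 0.05 * t + 0.1):
    """Cox model hazard h(t | x) = h0(t) * exp(beta . x).
    The baseline h0 and the coefficients are illustrative choices."""
    return h0(t) * math.exp(sum(b * xi for b, xi in zip(beta, x)))

beta = [-0.7]                    # hypothetical treatment effect
treated, control = [1.0], [0.0]

# The hazard ratio exp(-0.7) (roughly a halving of risk) is the same at
# every time point, even though the baseline hazard itself varies with t:
ratios = [cox_hazard(t, treated, beta) / cox_hazard(t, control, beta)
          for t in (1.0, 5.0, 20.0)]
```

Any departure of these ratios from a single constant across time points would be a proportional-hazards violation.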
\[ S(t)=Pr(T>t) \]
The survival function \(S\) is the probability that the time of death \(T\) is greater than some specified time \(t\).
It is composed of:
- The underlying hazard function: how the risk of death per unit time changes over time at baseline covariate values.
- The effect parameters: how the hazard varies in response to the covariates.
The Kaplan-Meier (KM) method is a non-parametric method used to estimate the survival probability from observed survival times (Kaplan and Meier, 1958).
The survival probability at time \(t_i\), \(S(t_i)\), is calculated as follows:
\[ S(t_i)=S(t_{i-1})\left(1-\frac{d_i}{n_i}\right) \]
where \(S(t_{i-1})\) is the survival probability at the preceding event time (with \(S(t_0) = 1\)), \(d_i\) the number of events at time \(t_i\), and \(n_i\) the number of individuals at risk just before \(t_i\).

The estimated probability \(S(t)\) is a step function that changes value only at the time of each event. It's also possible to compute confidence intervals for the survival probability.
The KM survival curve, a plot of the KM survival probability against time, provides a useful summary of the data that can be used to estimate measures such as median survival time.
The estimator of the survival function \(\normalsize S(t)\) (the probability that life is longer than \(\normalsize t\)) is given by:
\[{\normalsize {\widehat {S}}(t)=\prod \limits _{i:\ t_{i}\leq t}\left(1-{\frac {d_{i}}{n_{i}}}\right) =\prod \limits _{i:\ t_{i}\leq t}\left(1-\lambda_i \right),} \]
with \(\normalsize t_{i}\) a time when at least one event happened, \(\normalsize d_i\) the number of events (e.g., deaths) that happened at time \(\normalsize t_{i}\), and \(\normalsize n_{i}\) the individuals known to have survived (have not yet had an event or been censored) up to time \(\normalsize t_{i}\).
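The product-limit formula above can be implemented directly; the data in the usage lines are hypothetical:

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t) from (time, event) pairs.
    events[i] is 1 for an observed event (death) and 0 for censoring."""
    pairs = sorted(zip(times, events))
    s, curve = 1.0, []
    for t in sorted({tt for tt, e in pairs if e == 1}):  # distinct event times
        n_i = sum(1 for tt, _ in pairs if tt >= t)       # at risk just before t
        d_i = sum(e for tt, e in pairs if tt == t)       # events at t
        s *= 1.0 - d_i / n_i
        curve.append((t, s))
    return curve

# Hypothetical data: eight subjects, 0 marks a censored observation.
times  = [2, 3, 3, 5, 6, 8, 8, 9]
events = [1, 1, 0, 1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)  # steps down at event times 2, 3, 5, 8
```

Censored observations never trigger a step themselves, but they do shrink the risk set \(n_i\) for later event times.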
Rather than classifying the observed survival times into a life table, we can estimate the survival function directly from the continuous survival or failure times. Intuitively, imagine that we create a life table so that each time interval contains exactly one case. Multiplying out the survival probabilities across the “intervals” (i.e., for each single observation) we would get for the survival function:
\[S(t) = \prod_{j:\ t_{j}\leq t}\left[\frac{n-j}{n-j+1}\right]^{\delta(j)}\]
In this equation, \(S(t)\) is the estimated survival function, \(n\) is the total number of cases, and \(\prod\) denotes the multiplication (geometric product) across all cases with survival times less than or equal to \(t\); \(\delta(j)\) is a constant that is 1 if the j'th case is uncensored (complete) and 0 if it is censored. This estimate of the survival function is also called the product-limit estimator, and was first proposed by Kaplan and Meier (1958). An example plot of this function is shown below.
The advantage of the Kaplan-Meier Product-Limit method over the life table method for analyzing survival and failure time data is that the resulting estimates do not depend on the grouping of the data (into a certain number of time intervals). Actually, the Product-Limit method and the life table method are identical if the intervals of the life table contain at most one observation.
The following table shows the Kaplan-Meier estimate of the survival function, based on data from the 12 insects.
t (weeks) | \(\normalsize S(t)\)
---|---
\(\normalsize 0 \leq t < 1\) | 1.0000
\(\normalsize 1 \leq t < 3\) | 0.9167
\(\normalsize 3 \leq t < 6\) | 0.7130
\(\normalsize t \geq 6\) | 0.4278
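The tabulated values follow from the product-limit formula; the event counts \(d_i\) and risk-set sizes \(n_i\) below are inferred from the table, and the underlying censoring pattern is an assumption, not the original data:

```python
# Inferred from the table: 12 at risk with 1 death at week 1,
# 9 at risk with 2 deaths at week 3, 5 at risk with 2 deaths at week 6
# (the drops in the risk set between event times reflect assumed censoring).
s = 1.0
steps = []
for d, n in [(1, 12), (2, 9), (2, 5)]:
    s *= 1 - d / n
    steps.append(round(s, 4))
# steps reproduces the tabulated values 0.9167, 0.7130, 0.4278
```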
The Nelson-Aalen estimator is a non-parametric estimator of the cumulative hazard rate function in case of censored data or incomplete data.
It is used in survival theory, reliability engineering and life insurance to estimate the cumulative number of expected events.
An “event” can be the failure of a non-repairable component, the death of a human being, or any occurrence after which the experimental unit remains permanently in the “failed” state (e.g., death).
The Nelson-Aalen estimator of the cumulative hazard function is given by: \[\tilde{H}(t)=\sum_{t_i\leq t}\frac{d_i}{n_i},\] where \(d_i\) is the number of events (failures) at \(t_i\) and \(n_i\) the number of individuals at risk just before \(t_i\).
The curvature of the Nelson-Aalen estimator gives an idea of the hazard rate shape. A concave shape is an indicator for infant mortality while a convex shape indicates wear out mortality. It can be used for example when testing the homogeneity of Poisson processes.
Because of its simple relationship with the survival function, \(S(t)=e^{-H(t)}\), the cumulative hazard function can be used to estimate the survival function.
The estimator is calculated, then, by summing the proportions of those at risk who failed in each interval up to time \(t\).
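Both the summation and the \(S(t)=e^{-H(t)}\) transform can be sketched in Python; the data below are hypothetical:

```python
import math

def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard H(t).
    events[i] is 1 for an observed event and 0 for censoring."""
    pairs = sorted(zip(times, events))
    H, curve = 0.0, []
    for t in sorted({tt for tt, e in pairs if e == 1}):  # distinct event times
        n_i = sum(1 for tt, _ in pairs if tt >= t)       # at risk just before t
        d_i = sum(e for tt, e in pairs if tt == t)       # events at t
        H += d_i / n_i                                   # running sum of d_i/n_i
        curve.append((t, H))
    return curve

# Hypothetical data; S(t) = exp(-H(t)) then yields a survival estimate.
na = nelson_aalen([2, 3, 3, 5, 6, 8, 8, 9], [1, 1, 0, 1, 0, 1, 1, 0])
surv = [(t, math.exp(-H)) for t, H in na]
```

Because \(\tilde{H}(t)\) only ever accumulates non-negative increments, the derived survival estimate is non-increasing, as it should be.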
The proportional hazard model is the most general of the regression models because it is not based on any assumptions concerning the nature or shape of the underlying survival distribution. The model assumes that the underlying hazard rate (rather than survival time) is a function of the independent variables (covariates); no assumptions are made about the nature or shape of the hazard function. Thus, in a sense, Cox’s regression model may be considered to be a nonparametric method. The model may be written as:
\[h(t, z_1, z_2, \ldots, z_m) = h_0(t)\exp(b_1z_1 + \ldots + b_mz_m)\]
where \(h(t,\ldots)\) denotes the resultant hazard, given the values of the \(m\) covariates for the respective case \((z_1, z_2, \ldots, z_m)\) and the respective survival time \(t\). The term \(h_0(t)\) is called the baseline hazard; it is the hazard for the respective individual when all independent variable values are equal to zero. We can linearize this model by dividing both sides of the equation by \(h_0(t)\) and then taking the natural logarithm of both sides:
\[\log[h(t, z_1, \ldots, z_m)/h_0(t)] = b_1z_1 + \ldots + b_mz_m\]
We now have a fairly “simple” linear model that can be readily estimated.
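The linearized form can be sketched directly; the coefficients below are hypothetical, not fitted values:

```python
import math

def log_relative_hazard(z, b):
    """log[h(t, z) / h0(t)] = b1*z1 + ... + bm*zm — linear in the covariates."""
    return sum(bj * zj for bj, zj in zip(b, z))

b = [0.4, -1.2]                    # made-up coefficients for two covariates
z1, z2 = [1.0, 0.0], [0.0, 1.0]    # two hypothetical subjects

# The hazard ratio between the two subjects, exp(b.z1 - b.z2), depends
# only on their covariate values, never on t:
hr = math.exp(log_relative_hazard(z1, b) - log_relative_hazard(z2, b))
```

Note that \(h_0(t)\) cancels entirely in the ratio, which is exactly why the baseline hazard can be left unspecified during estimation.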
While no assumptions are made about the shape of the underlying hazard function, the model equations shown above do imply two assumptions. First, they specify a multiplicative relationship between the underlying hazard function and the log-linear function of the covariates. This assumption is also called the proportionality assumption.
In practical terms, it is assumed that, given two observations with different values for the independent variables, the ratio of the hazard functions for those two observations does not depend on time. The second assumption, of course, is that there is a log-linear relationship between the independent variables and the underlying hazard function.
An assumption of the proportional hazard model is that the hazard function for an individual (i.e., observation in the analysis) depends on the values of the covariates and the value of the baseline hazard. Given two individuals with particular values for the covariates, the ratio of the estimated hazards over time will be constant – hence the name of the method: the proportional hazard model. The validity of this assumption may often be questionable.
For example, age is often included in studies of physical health. Suppose you studied survival after surgery. It is likely that age is a more important predictor of risk immediately after surgery than some time after the surgery (after initial recovery). In accelerated life testing one sometimes uses a stress covariate (e.g., amount of voltage) that is slowly increased over time until failure occurs (e.g., until the electrical insulation fails). In this case, the impact of the covariate is clearly dependent on time. One can then specify arithmetic expressions that define covariates as functions of several variables and of survival time.
As indicated by the previous examples, there are many applications where it is likely that the proportionality assumption does not hold. In that case, one can explicitly define covariates as functions of time.
For example, the analysis of a data set presented by Pike (1966) consists of survival times for two groups of rats that had been exposed to a carcinogen (see also Lawless, 1982, page 393, for a similar example). Suppose that z is a grouping variable with codes 1 and 0 to denote whether or not the respective rat was exposed. One could then fit the proportional hazard model:
\[h(t,z) = h_0(t)\exp\{b_1z + b_2[z\log(t)-5.4]\}\]
Thus, in this model the conditional hazard at time \(t\) is a function of (1) the baseline hazard \(h_0\), (2) the covariate \(z\), and (3) of \(z\) times the logarithm of time. Note that the constant 5.4 is used here for scaling purposes only: the mean of the logarithm of the survival times in this data set is equal to 5.4. In other words, the conditional hazard at each point in time is a function of the covariate and time; thus, the effect of the covariate on survival is dependent on time; hence the name time-dependent covariate. This model allows one to specifically test the proportionality assumption. If parameter \(b_2\) is statistically significant (e.g., if it is at least twice as large as its standard error), then one can conclude that, indeed, the effect of the covariate \(z\) on survival is dependent on time, and, therefore, that the proportionality assumption does not hold.
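A small sketch of this test's logic, with made-up values for \(b_1\) and \(b_2\) (not Pike's fitted estimates):

```python
import math

def hazard_ratio(t, b1=0.5, b2=0.3):
    """Ratio h(t, z=1) / h(t, z=0) under the time-dependent model
    h(t, z) = h0(t) * exp{b1*z + b2*[z*log(t) - 5.4]}.
    The values of b1 and b2 are hypothetical, chosen only to illustrate."""
    exposed = math.exp(b1 * 1 + b2 * (1 * math.log(t) - 5.4))
    unexposed = math.exp(b1 * 0 + b2 * (0 * math.log(t) - 5.4))
    return exposed / unexposed

# With b2 != 0 the ratio varies with t, so the hazards are not proportional;
# with b2 == 0 the ratio collapses to the constant exp(b1).
```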