library(tidyverse)
library(survival)
On importe les données
pharmacoSmoking = read_csv("./pharmacoSmoking.csv")
Missing column names filled in: 'X1' [1]Parsed with column specification:
cols(
X1 = [32mcol_double()[39m,
id = [32mcol_double()[39m,
ttr = [32mcol_double()[39m,
relapse = [32mcol_double()[39m,
grp = [31mcol_character()[39m,
age = [32mcol_double()[39m,
gender = [31mcol_character()[39m,
race = [31mcol_character()[39m,
employment = [31mcol_character()[39m,
yearsSmoking = [32mcol_double()[39m,
levelSmoking = [31mcol_character()[39m,
ageGroup2 = [31mcol_character()[39m,
ageGroup4 = [31mcol_character()[39m,
priorAttempts = [32mcol_double()[39m,
longestNoSmoke = [32mcol_double()[39m
)
glimpse(pharmacoSmoking)
Observations: 125
Variables: 15
$ X1 [3m[38;5;246m<dbl>[39m[23m 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
$ id [3m[38;5;246m<dbl>[39m[23m 21, 113, 39, 80, 87, 29, 16, 35, 54, 70, 84, 8…
$ ttr [3m[38;5;246m<dbl>[39m[23m 182, 14, 5, 16, 0, 182, 14, 77, 2, 0, 12, 182,…
$ relapse [3m[38;5;246m<dbl>[39m[23m 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1…
$ grp [3m[38;5;246m<chr>[39m[23m "patchOnly", "patchOnly", "combination", "comb…
$ age [3m[38;5;246m<dbl>[39m[23m 36, 41, 25, 54, 45, 43, 66, 78, 40, 38, 64, 51…
$ gender [3m[38;5;246m<chr>[39m[23m "Male", "Male", "Female", "Male", "Male", "Mal…
$ race [3m[38;5;246m<chr>[39m[23m "white", "white", "white", "white", "white", "…
$ employment [3m[38;5;246m<chr>[39m[23m "ft", "other", "other", "ft", "other", "ft", "…
$ yearsSmoking [3m[38;5;246m<dbl>[39m[23m 26, 27, 12, 39, 30, 30, 54, 56, 25, 23, 30, 35…
$ levelSmoking [3m[38;5;246m<chr>[39m[23m "heavy", "heavy", "heavy", "heavy", "heavy", "…
$ ageGroup2 [3m[38;5;246m<chr>[39m[23m "21-49", "21-49", "21-49", "50+", "21-49", "21…
$ ageGroup4 [3m[38;5;246m<chr>[39m[23m "35-49", "35-49", "21-34", "50-64", "35-49", "…
$ priorAttempts [3m[38;5;246m<dbl>[39m[23m 0, 3, 3, 0, 0, 2, 0, 10, 4, 10, 12, 1, 5, 6, 5…
$ longestNoSmoke [3m[38;5;246m<dbl>[39m[23m 0, 90, 21, 0, 0, 1825, 0, 15, 7, 90, 365, 7, 1…
Elles contiennent 2 colonnes d’identifiants, on en enlève donc une.
pharmacoSmoking = pharmacoSmoking %>% select( -X1)
Les données contiennent maintenant un identifiant (id
), un temps censuré (ttr
), une indicatrice de censure (relapse
) et 11 variables explicatives
Nous allons estimer un premier modèle de Cox avec toutes les variables. Certaines variables discrètes ont plus de 2 facteurs, la dimension de la matrice de design est \(125 \times 16\)
coxph(Surv(ttr,relapse)~ . - id, pharmacoSmoking)
Call:
coxph(formula = Surv(ttr, relapse) ~ . - id, data = pharmacoSmoking)
coef exp(coef) se(coef) z p
grppatchOnly 6.138e-01 1.847e+00 2.223e-01 2.761 0.00576
age -4.414e-02 9.568e-01 3.173e-02 -1.391 0.16429
genderMale -3.792e-02 9.628e-01 2.537e-01 -0.150 0.88116
racehispanic -5.202e-01 5.944e-01 5.261e-01 -0.989 0.32273
raceother -1.180e+00 3.074e-01 1.053e+00 -1.120 0.26268
racewhite -3.456e-01 7.078e-01 2.603e-01 -1.328 0.18424
employmentother 7.373e-01 2.090e+00 2.886e-01 2.555 0.01062
employmentpt 6.136e-01 1.847e+00 3.446e-01 1.780 0.07500
yearsSmoking 1.036e-02 1.010e+00 1.905e-02 0.544 0.58638
levelSmokinglight -1.154e-01 8.910e-01 2.708e-01 -0.426 0.66994
ageGroup250+ 5.654e-01 1.760e+00 1.209e+00 0.467 0.64015
ageGroup435-49 2.405e-01 1.272e+00 4.828e-01 0.498 0.61834
ageGroup450-64 -7.928e-01 4.526e-01 5.929e-01 -1.337 0.18120
ageGroup465+ NA NA 0.000e+00 NA NA
priorAttempts 3.540e-04 1.000e+00 1.123e-03 0.315 0.75256
longestNoSmoke -9.122e-05 9.999e-01 1.242e-04 -0.735 0.46250
Likelihood ratio test=32.1 on 15 df, p=0.006232
n= 125, number of events= 89
L’un des coefficients associé à ageGroup4
n’a pas été estimé. C’est normal car ageGroup2
peut se déduire de ageGroup4
, on a donc des colonnes parfaitement colinéaires dans la matrice de design. On choisit d’nelever ageGroup2
et également la variable numérique age
.
fit_total = coxph(Surv(ttr,relapse)~ . - id - age - ageGroup2, pharmacoSmoking)
summary(fit_total)
Call:
coxph(formula = Surv(ttr, relapse) ~ . - id - age - ageGroup2,
data = pharmacoSmoking)
n= 125, number of events= 89
coef exp(coef) se(coef) z Pr(>|z|)
grppatchOnly 0.6433768 1.9028957 0.2218484 2.900 0.00373 **
genderMale 0.0199073 1.0201067 0.2503849 0.080 0.93663
racehispanic -0.3996167 0.6705770 0.5187766 -0.770 0.44112
raceother -1.2542734 0.2852831 1.0506379 -1.194 0.23255
racewhite -0.3125691 0.7315650 0.2616143 -1.195 0.23218
employmentother 0.6545940 1.9243610 0.2827911 2.315 0.02063 *
employmentpt 0.5674429 1.7637512 0.3463838 1.638 0.10138
yearsSmoking -0.0011233 0.9988774 0.0159852 -0.070 0.94398
levelSmokinglight -0.1412694 0.8682554 0.2728663 -0.518 0.60465
ageGroup435-49 -0.1695924 0.8440087 0.3825013 -0.443 0.65749
ageGroup450-64 -1.0700195 0.3430018 0.5058335 -2.115 0.03440 *
ageGroup465+ -0.7360911 0.4789825 0.7685627 -0.958 0.33819
priorAttempts 0.0002332 1.0002332 0.0011214 0.208 0.83526
longestNoSmoke -0.0001109 0.9998891 0.0001264 -0.877 0.38043
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
grppatchOnly 1.9029 0.5255 1.23191 2.9394
genderMale 1.0201 0.9803 0.62448 1.6664
racehispanic 0.6706 1.4913 0.24259 1.8537
raceother 0.2853 3.5053 0.03639 2.2366
racewhite 0.7316 1.3669 0.43809 1.2216
employmentother 1.9244 0.5197 1.10554 3.3496
employmentpt 1.7638 0.5670 0.89453 3.4776
yearsSmoking 0.9989 1.0011 0.96807 1.0307
levelSmokinglight 0.8683 1.1517 0.50861 1.4822
ageGroup435-49 0.8440 1.1848 0.39881 1.7862
ageGroup450-64 0.3430 2.9154 0.12727 0.9244
ageGroup465+ 0.4790 2.0878 0.10620 2.1603
priorAttempts 1.0002 0.9998 0.99804 1.0024
longestNoSmoke 0.9999 1.0001 0.99964 1.0001
Concordance= 0.668 (se = 0.031 )
Likelihood ratio test= 30.11 on 14 df, p=0.007
Wald test = 27.85 on 14 df, p=0.01
Score (logrank) test = 28.8 on 14 df, p=0.01
On obtient un premier modèle dont l’incide de concordance est plutôt bon (\(0.67\)). L’hypothèse \(\mathcal H_0 : \beta^\star_1 = \beta^\star_2 = \ldots = \beta^\star_p =0\) est rejetée par le LRT (p-value=\(7e-3\)). On remarque par ailleurs que les p-values des tests de Wald univariés associées au(x) coefficient(s) des variables grp
, employment
et ageGroup4
(au moins l’une d’entre elles pour les variables à plus de 2 modalités) sont inférieures à \(5e-2\).
On va donc faire une sélection de modèles par AIC.
library(MASS)
Attachement du package : ‘MASS’
The following object is masked from ‘package:dplyr’:
select
stepAIC(fit_total,trace = F)
Call:
coxph(formula = Surv(ttr, relapse) ~ grp + employment + ageGroup4,
data = pharmacoSmoking)
coef exp(coef) se(coef) z p
grppatchOnly 0.6564 1.9278 0.2198 2.986 0.00283
employmentother 0.6231 1.8648 0.2764 2.254 0.02418
employmentpt 0.5214 1.6844 0.3320 1.570 0.11631
ageGroup435-49 -0.1119 0.8942 0.3216 -0.348 0.72792
ageGroup450-64 -1.0233 0.3594 0.3597 -2.845 0.00444
ageGroup465+ -0.7071 0.4931 0.5017 -1.410 0.15868
Likelihood ratio test=25.89 on 6 df, p=0.0002333
n= 125, number of events= 89
Seules les variables associées à grp
, employment
et ageGroup4
sont conservées, comme attendu. Le concordance index est maintenant de \(0.65\).
On peut alors interpréter les résultats dans ce modèle.
fit_final = coxph(Surv(ttr, relapse) ~ grp + employment + ageGroup4,
data = pharmacoSmoking)
summary(fit_final)
Call:
coxph(formula = Surv(ttr, relapse) ~ grp + employment + ageGroup4,
data = pharmacoSmoking)
n= 125, number of events= 89
coef exp(coef) se(coef) z Pr(>|z|)
grppatchOnly 0.6564 1.9278 0.2198 2.986 0.00283 **
employmentother 0.6231 1.8648 0.2764 2.254 0.02418 *
employmentpt 0.5214 1.6844 0.3320 1.570 0.11631
ageGroup435-49 -0.1119 0.8942 0.3216 -0.348 0.72792
ageGroup450-64 -1.0233 0.3594 0.3597 -2.845 0.00444 **
ageGroup465+ -0.7071 0.4931 0.5017 -1.410 0.15868
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
grppatchOnly 1.9278 0.5187 1.2529 2.9661
employmentother 1.8648 0.5363 1.0848 3.2057
employmentpt 1.6844 0.5937 0.8787 3.2289
ageGroup435-49 0.8942 1.1184 0.4761 1.6793
ageGroup450-64 0.3594 2.7825 0.1776 0.7273
ageGroup465+ 0.4931 2.0281 0.1845 1.3180
Concordance= 0.647 (se = 0.033 )
Likelihood ratio test= 25.89 on 6 df, p=2e-04
Wald test = 24.59 on 6 df, p=4e-04
Score (logrank) test = 25.54 on 6 df, p=3e-04
Le fait de n’avoir qu’un patch (plutôt que la combinaison) multiplie le risque relatif par \(1.9278\). Pour la variable employment
les risques des modalités pt
et other
sont multipliés par \(1.68\) et \(1.86\) par rapport à celui de la modalité ft
. Pour l’âge, seule la modalité 50-64
a un coefficient significativement différent de \(0\). Le risque de cette modalité est multiplié par \(0.36\) par rapport à 21-34
.
On peut alors faire des prédictions dans le modèle final. La commande
marqueurs = predict(fit_final)
permet d’obtenir les marqueurs estimés pour tous les individus dans ce modèle.
La commande permet d’obtenir les estimations liées au risque de base
prediction = survfit(fit_final)
cbind(prediction$time,prediction$cumhaz) # pour le risque de base intégré
[,1] [,2]
[1,] 0 0.08643475
[2,] 1 0.12664605
[3,] 2 0.17836039
[4,] 3 0.18728907
[5,] 4 0.21477936
[6,] 5 0.23362036
[7,] 6 0.24317805
[8,] 7 0.25279400
[9,] 8 0.28291398
[10,] 10 0.29325093
[11,] 12 0.31425038
[12,] 14 0.39317530
[13,] 15 0.44266436
[14,] 16 0.45557402
[15,] 20 0.46853871
[16,] 21 0.49506815
[17,] 25 0.50881951
[18,] 28 0.55265526
[19,] 30 0.59891705
[20,] 40 0.61479931
[21,] 42 0.63108959
[22,] 45 0.64759990
[23,] 49 0.66435071
[24,] 50 0.68140657
[25,] 56 0.77205694
[26,] 60 0.81125920
[27,] 63 0.85197230
[28,] 65 0.87306432
[29,] 75 0.89452666
[30,] 77 0.93851316
[31,] 80 0.96135265
[32,] 84 0.98515078
[33,] 100 1.00987689
[34,] 105 1.03499670
[35,] 110 1.06088520
[36,] 140 1.17389639
[37,] 155 1.20450965
[38,] 170 1.27254088
[39,] 182 1.27254088
On représente graphiquement les prédictions des fonctions de survie pour les 16ième et 106ième individus des données ainsi que pour un individu dont tous les variables (dans la matrice de design) seraient nulles : c’est-à -dire ayant reçu la combinaison, à plein temps ft
et âgé de 21 à 34 ans.
time = prediction$time
pred_ind0 = prediction$surv
pred_ind16 = exp(-prediction$cumhaz*exp(marqueurs[16]))
pred_ind106 = exp(-prediction$cumhaz*exp(marqueurs[106]))
pred = tibble(time,pred_ind0, pred_ind16,pred_ind106) %>% gather("ind","value",2:4)
ggplot(pred,aes(x=time,y=value,color=ind)) + geom_step()