Set up

library(ggplot2)
# library(GGally)
data("mtcars")
# ggpairs(data = mtcars)

Simple stats

Percentiles

data(mtcars)
quantile(mtcars$mpg, c(0.25, 0.5, 0.75), na.rm = TRUE)
##    25%    50%    75% 
## 15.425 19.200 22.800

Chi-sq test

Make a contingency table with table() and nest it inside chisq.test():

data(infert) # built in fertility data
head(infert)
##   education age parity induced case spontaneous stratum pooled.stratum
## 1    0-5yrs  26      6       1    1           2       1              3
## 2    0-5yrs  42      1       1    1           0       2              1
## 3    0-5yrs  39      6       2    1           0       3              4
## 4    0-5yrs  34      4       2    1           0       4              2
## 5   6-11yrs  35      3       1    1           1       5             32
## 6   6-11yrs  36      4       2    1           1       6             36
chisq.test(table(infert$education, infert$case))
## Warning in chisq.test(table(infert$education, infert$case)): Chi-squared
## approximation may be incorrect
## 
## 	Pearson's Chi-squared test
## 
## data:  table(infert$education, infert$case)
## X-squared = 0.0022896, df = 2, p-value = 0.9989

The chi-squared test produces a warning because the expected count for the case/0-5yrs cell is < 5. Best to check with Fisher’s exact test:

fisher.test(table(infert$education, infert$case))
## 
## 	Fisher's Exact Test for Count Data
## 
## data:  table(infert$education, infert$case)
## p-value = 1
## alternative hypothesis: two.sided

It’s possible to get the observed and expected cell counts from the output of a chisq.test():

chisq.test(table(infert$education, infert$case))$expected
## Warning in chisq.test(table(infert$education, infert$case)): Chi-squared
## approximation may be incorrect
##          
##                   0         1
##   0-5yrs   7.983871  4.016129
##   6-11yrs 79.838710 40.161290
##   12+ yrs 77.177419 38.822581
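
The observed counts come out the same way (they are just the contingency table itself):

chisq.test(table(infert$education, infert$case))$observed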

Mantel-Haenszel

It pains me to say it, but I don’t think the output from R for Mantel-Haenszel analysis is as neat as that from Stata. The package epicalc had a reasonable function, mhor(), but it has been removed from CRAN.

Base R has mantelhaen.test(), which:

Performs a Cochran-Mantel-Haenszel chi-squared test of the null that two nominal variables are conditionally independent in each stratum, assuming that there is no three-way interaction.

However,

Currently, no inference on homogeneity of the odds ratios is performed.

i.e. one can’t use mantelhaen.test() to identify interaction, and the p-values obtained refer to the pooled odds ratio not being equal to one.
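
As a minimal sketch of the base-R route, using the built-in UCBAdmissions data (a 2 x 2 x 6 table of admission by gender, stratified by department):

# CMH test of conditional independence of admission and gender across departments;
# for a 2 x 2 x K table this also returns the pooled (common) odds ratio and its CI
mantelhaen.test(UCBAdmissions)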

# not run: epicalc has been removed from CRAN
library(epicalc)
data(Oswego)
mhor(Oswego$ill, Oswego$chocolate, Oswego$sex)

would return

    Stratified analysis by  Var3 
                    OR lower lim. upper lim. P value
    Var3 F       0.417     0.0617       2.06  0.3137
    Var3 M       0.331     0.0512       1.83  0.2635
    M-H combined 0.364     0.1258       1.05  0.0611

    M-H Chi2(1) = 3.51 , P value = 0.061 
    Homogeneity test, chi-squared 1 d.f. = 0.05 , P value = 0.827 

Linear regression

m1 <- lm(mpg ~ wt, data = mtcars)

# Multiple regression, with a categorical (factor) variable
m2 <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# check the results
summary(m1)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
# and get confidence intervals
confint(m1)
##                 2.5 %    97.5 %
## (Intercept) 33.450500 41.119753
## wt          -6.486308 -4.202635

The summary provides R-squared, the coefficient of determination: the proportion of the variance in the outcome explained by the regressors. Adjusted R-squared adjusts this for the number of predictor variables in the model. R-squared lies on the scale 0 to 1. However, a plot of fitted values (x axis) against residuals (y axis) is more useful.
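
A quick sketch of that diagnostic plot for m1, in base graphics (ggplot would work equally well):

# fitted values against residuals; look for curvature or fanning
plot(fitted(m1), resid(m1), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)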

Departures from linearity

library(ggplot2)
test.data <- data.frame(x = seq(1,100,1), y = seq(1,100,1))
test.data$z <- test.data$x^2

# plot z against x to see the departure from linearity (y is exactly linear in x)
qplot(x = x, y = z, data = test.data)
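
A sketch tying this back to the earlier point about residual plots: fit z on x with a straight line and the residuals show an obvious U-shape, the classic sign of a departure from linearity.

# residuals from a (misspecified) linear fit of the quadratic z on x
m_lin <- lm(z ~ x, data = test.data)
plot(fitted(m_lin), resid(m_lin), xlab = "Fitted values", ylab = "Residuals")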

Logistic regression

data(infert)
library(broom)
m1 <- glm(case ~ education + induced, data = infert, family = "binomial")
tidy(m1, exponentiate = TRUE, conf.int = TRUE)
## # A tibble: 4 x 7
##   term             estimate std.error statistic p.value conf.low conf.high
##   <chr>               <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)         0.471     0.651   -1.16     0.247    0.118      1.62
## 2 education6-11yrs    1.04      0.655    0.0536   0.957    0.299      4.16
## 3 education12+ yrs    1.04      0.652    0.0631   0.950    0.303      4.16
## 4 induced             1.05      0.186    0.273    0.785    0.727      1.51

Fitting an interaction

This uses data from Kirkwood and Sterne, Chapter 29, p324 onwards.

oncho <- readxl::read_xls(
  "C:/Users/simon/Documents/r_stuff/kirkwood_and_sterne/oncho_ems.xls")
head(oncho)
## # A tibble: 6 x 7
##      id    mf  area agegrp   sex mfload lesions
##   <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>   <dbl>
## 1     1     1     0      2     1      1       0
## 2     2     1     1      3     0      3       0
## 3     3     1     0      3     1      1       0
## 4     4     0     1      2     1      0       0
## 5     5     0     0      3     1      0       0
## 6     6     0     1      2     1      0       0
oncho$agegrp <- factor(oncho$agegrp)
m1 <- glm(mf ~ area + agegrp, data = oncho, family = "binomial")
tidy(m1, exponentiate = TRUE, conf.int = TRUE)
## # A tibble: 5 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    0.147     0.197     -9.74 2.02e-22   0.0992     0.215
## 2 area           3.08      0.138      8.18 2.82e-16   2.36       4.05 
## 3 agegrp1        2.60      0.222      4.30 1.70e- 5   1.69       4.04 
## 4 agegrp2        9.77      0.208     10.9  7.10e-28   6.54      14.8  
## 5 agegrp3       17.6       0.216     13.3  2.48e-40  11.7       27.2

The output above matches the values in table 29.3, bottom of p324. Now fit a model with an interaction term:

m2 <- glm(mf ~ area * agegrp, data = oncho, family = "binomial")
tidy(m2, exponentiate = TRUE, conf.int = TRUE)
## # A tibble: 8 x 7
##   term         estimate std.error statistic  p.value conf.low conf.high
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)     0.208     0.275    -5.72  1.07e- 8    0.117     0.346
## 2 area            1.83      0.349     1.73  8.36e- 2    0.933     3.69 
## 3 agegrp1         2.12      0.375     2.00  4.57e- 2    1.02      4.48 
## 4 agegrp2         6.96      0.309     6.28  3.30e-10    3.89     13.1  
## 5 agegrp3        10.5       0.319     7.36  1.81e-13    5.74     20.2  
## 6 area:agegrp1    1.39      0.463     0.708 4.79e- 1    0.557     3.44 
## 7 area:agegrp2    1.66      0.415     1.23  2.20e- 1    0.730     3.73 
## 8 area:agegrp3    2.59      0.438     2.17  2.99e- 2    1.09      6.09

As described in the text, the model now has eight parameters (equal to the number of area subgroups multiplied by the number of age subgroups). Again, the output from this matches that in table 29.4, part b.

As the text explains, the interaction parameters allow the effect of area to differ across the age groups. To calculate the odds ratio for the effect of area in the first age group, multiply the two estimates: $\mbox{area} \times \mbox{area:agegrp1} = 1.83 \times 1.39 = 2.54$.
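
Equivalently, straight from the model object:

# OR for the effect of area within agegrp1: exponentiate the sum of the two coefficients
exp(coef(m2)["area"] + coef(m2)["area:agegrp1"])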

And this is fine, but we want to go a bit further and do this more systematically, rather than relying on our manual calculations.

library(multcomp)
## Warning: package 'multcomp' was built under R version 3.6.1
## Warning: package 'TH.data' was built under R version 3.6.1
summary(glht(m2, linfct = c("area + area:agegrp1 = 1")))
## 
## 	 Simultaneous Tests for General Linear Hypotheses
## 
## Fit: glm(formula = mf ~ area * agegrp, family = "binomial", data = oncho)
## 
## Linear Hypotheses:
##                          Estimate Std. Error z value Pr(>|z|)
## area + area:agegrp1 == 1   0.9307     0.3049  -0.227     0.82
## (Adjusted p values reported -- single-step method)
summary(glht(m2, linfct = c("area + area:agegrp1 = 1", 
                            "area + area:agegrp2 = 1", "area + area:agegrp3 = 1")))
## 
## 	 Simultaneous Tests for General Linear Hypotheses
## 
## Fit: glm(formula = mf ~ area * agegrp, family = "binomial", data = oncho)
## 
## Linear Hypotheses:
##                          Estimate Std. Error z value Pr(>|z|)
## area + area:agegrp1 == 1   0.9307     0.3049  -0.227    0.994
## area + area:agegrp2 == 1   1.1121     0.2249   0.498    0.944
## area + area:agegrp3 == 1   1.5539     0.2653   2.088    0.106
## (Adjusted p values reported -- single-step method)

Note that the hypotheses above are specified as = 1 on the coefficient (log-odds) scale; the conventional null of no effect, an odds ratio of one, would be linfct = "area + area:agegrp1 = 0". Getting the confidence intervals out is not tricky: just wrap the glht() in confint(). Getting the exponentiated form out is trickier, though.

# tried converting to a data frame - won't work. Extracting with purrr also fails:
# library(purrr)
# map(t, "Estimate")
t <- glht(m2, linfct = c("area + area:agegrp1 = 1"))
s <- confint(t)
s
## 
## 	 Simultaneous Confidence Intervals
## 
## Fit: glm(formula = mf ~ area * agegrp, family = "binomial", data = oncho)
## 
## Quantile = 1.96
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                          Estimate lwr    upr   
## area + area:agegrp1 == 1 0.9307   0.3332 1.5282
names(s)
## [1] "model"       "linfct"      "rhs"         "coef"        "vcov"       
## [6] "df"          "alternative" "type"        "confint"
s$confint
##                      Estimate       lwr      upr
## area + area:agegrp1 0.9306795 0.3331821 1.528177
## attr(,"conf.level")
## [1] 0.95
## attr(,"calpha")
## [1] 1.959964
names(s$confint)
## NULL
str(s$confint)
##  num [1, 1:3] 0.931 0.333 1.528
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr "area + area:agegrp1"
##   ..$ : chr [1:3] "Estimate" "lwr" "upr"
##  - attr(*, "conf.level")= num 0.95
##  - attr(*, "calpha")= num 1.96
s$confint[1:3]
## [1] 0.9306795 0.3331821 1.5281768
exp(s$confint[1:3])
## [1] 2.536232 1.395401 4.609765

The exponentiated value is close enough to that calculated above and given in Kirkwood and Sterne.

Does this work across all the combinations?

s <- confint(glht(m2, linfct = c("area + area:agegrp1 = 1", 
                            "area + area:agegrp2 = 1", "area + area:agegrp3 = 1")))
s
## 
## 	 Simultaneous Confidence Intervals
## 
## Fit: glm(formula = mf ~ area * agegrp, family = "binomial", data = oncho)
## 
## Quantile = 2.3876
## 95% family-wise confidence level
##  
## 
## Linear Hypotheses:
##                          Estimate lwr    upr   
## area + area:agegrp1 == 1 0.9307   0.2028 1.6585
## area + area:agegrp2 == 1 1.1121   0.5751 1.6490
## area + area:agegrp3 == 1 1.5539   0.9206 2.1873
s$confint[1:3, ]
##                      Estimate       lwr      upr
## area + area:agegrp1 0.9306795 0.2028199 1.658539
## area + area:agegrp2 1.1120714 0.5751206 1.649022
## area + area:agegrp3 1.5539250 0.9205693 2.187281
exp(s$confint[1:3, ])
##                     Estimate      lwr      upr
## area + area:agegrp1 2.536232 1.224852 5.251633
## area + area:agegrp2 3.040650 1.777345 5.201891
## area + area:agegrp3 4.729999 2.510719 8.910949

Yes, it does.
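
A tidier alternative, assuming a broom version that includes the multcomp tidiers (tidy.confint.glht; the exact column names can vary between broom versions):

ci <- broom::tidy(s)  # one row per linear hypothesis
# move the estimate and limits onto the odds-ratio scale
exp(ci[c("estimate", "conf.low", "conf.high")])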

Interactions: Done.

Polynomial regression

# linear first
m1 <- lm(wt ~ disp, data = mtcars)
m2 <- lm(wt ~ poly(disp, 2), data = mtcars)
m3 <- lm(wt ~ poly(disp, 3), data = mtcars)
anova(m1, m2)
## Analysis of Variance Table
## 
## Model 1: wt ~ disp
## Model 2: wt ~ poly(disp, 2)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     30 6.2768                           
## 2     29 6.2748  1 0.0020473 0.0095 0.9232
anova(m2, m3)
## Analysis of Variance Table
## 
## Model 1: wt ~ poly(disp, 2)
## Model 2: wt ~ poly(disp, 3)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     29 6.2748                                  
## 2     28 3.0471  1    3.2277 29.659 8.205e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m1, m3)
## Analysis of Variance Table
## 
## Model 1: wt ~ disp
## Model 2: wt ~ poly(disp, 3)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 6.2768                                  
## 2     28 3.0471  2    3.2297 14.839 4.037e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p <- ggplot(data = mtcars, aes(x = disp, y = wt)) + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 3)) + 
  geom_point()
p

Quantile regression
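
A minimal sketch, assuming the quantreg package (an assumption; it is not used elsewhere here): median regression of mpg on weight in mtcars.

# tau = 0.5 fits the median; other quantiles via e.g. tau = 0.9
m_q <- quantreg::rq(mpg ~ wt, tau = 0.5, data = mtcars)
summary(m_q)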

Ordinal regression

# install.packages("ordinal")
library(ordinal)
## Warning: package 'ordinal' was built under R version 3.6.1
data(wine)
head(wine)
##   response rating temp contact bottle judge
## 1       36      2 cold      no      1     1
## 2       48      3 cold      no      2     1
## 3       47      3 cold     yes      3     1
## 4       67      4 cold     yes      4     1
## 5       77      4 warm      no      5     1
## 6       60      4 warm      no      6     1
m1 <- clm(rating ~ temp, data = wine)
m2 <- clm(rating ~ temp + contact, data = wine)

anova(m1, m2)
## Likelihood ratio tests of cumulative link models:
##  
##    formula:                link: threshold:
## m1 rating ~ temp           logit flexible  
## m2 rating ~ temp + contact logit flexible  
## 
##    no.par    AIC  logLik LR.stat df Pr(>Chisq)    
## m1      5 194.03 -92.013                          
## m2      6 184.98 -86.492  11.043  1  0.0008902 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The anova above tests the null hypothesis that there is no difference in fit between the two models. There is strong evidence against the null hypothesis, so it should be rejected.

The summary gives the un-exponentiated coefficients for the two variables.

summary(m2)
## formula: rating ~ temp + contact
## data:    wine
## 
##  link  threshold nobs logLik AIC    niter max.grad cond.H 
##  logit flexible  72   -86.49 184.98 6(0)  4.02e-12 2.7e+01
## 
## Coefficients:
##            Estimate Std. Error z value Pr(>|z|)    
## tempwarm     2.5031     0.5287   4.735 2.19e-06 ***
## contactyes   1.5278     0.4766   3.205  0.00135 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 1|2  -1.3444     0.5171  -2.600
## 2|3   1.2508     0.4379   2.857
## 3|4   3.4669     0.5978   5.800
## 4|5   5.0064     0.7309   6.850

I think verbalising the interpretation of the summary output is not easy. Here’s my understanding: the coefficients are on the log-odds scale, so the odds of a warm wine being rated above any given cut-point on the 1-5 rating scale are exp(2.5031) [12.2203428] times those of a cold wine, holding contact constant. The threshold coefficients are the estimated cut-points between adjacent rating categories on the same log-odds scale.
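
To pull the odds ratios out directly, exponentiate the coefficients; a sketch (confint() on a clm object gives profile-likelihood intervals for the regression coefficients):

exp(coef(m2))    # note this also exponentiates the threshold cut-points
exp(confint(m2)) # CIs for tempwarm and contactyes on the odds-ratio scale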

Poisson regression

Poisson regression can be used to calculate incidence rate ratios from either grouped or individual-level data. This is an example with grouped data from Kirkwood and Sterne, Medical Statistics (2nd Ed), p. 241. It compares numbers of lower respiratory tract infections among children living in either good or poor quality housing. The manual calculation in Kirkwood and Sterne gives a rate ratio of 2.01 (95% CI 1.19-3.39).

lrti <- data.frame(housing = c("poor", "good"), n_lrti = c(33, 24), pyar = c(355, 518), 
                   stringsAsFactors = FALSE)
lrti
##   housing n_lrti pyar
## 1    poor     33  355
## 2    good     24  518
library(broom)
m1 <- glm(n_lrti ~ housing + offset(log(pyar)), family = "poisson", data = lrti)
tidy(m1, exponentiate = TRUE, conf.int = TRUE)
## # A tibble: 2 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   0.0463     0.204    -15.0  3.49e-51   0.0302    0.0674
## 2 housingpoor   2.01       0.268      2.60 9.44e- 3   1.19      3.43
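
As a sanity check, the crude rate ratio calculated by hand matches the model output:

# (cases per person-year, poor housing) / (cases per person-year, good housing)
(33 / 355) / (24 / 518)  # approximately 2.01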