13 Generalised Linear Models
In a genome-wide association study, we perform an experiment where we select \(n\) individuals with a disease (cases) and \(n\) individuals without the diseases (controls) and look for genetic differences between these two groups. In particular, we are interested in specific genetic variants (SNPs) that might induce some predisposition towards the disease.
Suppose I observe the following genotypes for a SNP in 4,000 individuals (2,000 cases, 2,000 controls):
- Genotypes:
AA
Aa
aa
- Controls:
3 209 1788
- Cases:
83 621 1296
The cases seem to have relatively more A
alleles than the controls. This might make us suspect that having A
alleles at this SNP is associated with the disease.
This is the idea we will implement with data.
13.1 Data
For this practical we will use some preprepared data-files. You will find instructions on how to download files in the linke below:
13.2 Detecting SNP Associations
We have seen in lectures that we can do statistical tests for this type of contingency table using Chi Squared Tests. Let’s load the example data set and prepare a loop to scan through all SNPs to find those that are associated with the disease.
The first step is to load the data and check what the data looks like:
library(ggplot2) # for plots later
load("gwas-cc-ex1.Rdata")
# how many individuals are there
n <- length(y)
# How many SNPs do we have data for
p <- nrow(X)
# samples that are controls are encoded as 0 in y
control <- which(y == 0)
# disease cases are encoded as 1 in y
cases <- which(y == 1)
cat("No. of individuals:", n)
#> No. of individuals: 40000
#> No. of SNPs: 300
#> No. of controls: 200010000
#> No. of cases: 600010000
Now we need to write a loop that scans through all, \(p\), SNPs. Each SNP is a row in the matrix X
. For each SNP we will perform a chi-squared test to see if the genotype distribution is different between cases and controls. This means we will create a contingency table for each SNP and perform a chi-squared test.
# create a vector where p-values will be stored
p_vals <- rep(0, p)
# Loop over SNPs
for (i_p in 1:p) {
# 1. obtain genotype counts
counts <- matrix(0, nrow = 2, ncol = 3)
counts[1, ] <- c(sum(X[i_p, control] == 0),
sum(X[i_p, control] == 1),
sum(X[i_p, control] == 2))
counts[2, ] <- c(sum(X[i_p, cases] == 0),
sum(X[i_p, cases] == 1),
sum(X[i_p, cases] == 2))
# 2. expected probability of AA
# (assuming no dependence on case/control status)
expected_pr_AA <- sum(counts[, 1]) / n
# expected probability of Aa
expected_pr_Aa <- sum(counts[, 2]) / n
# expected probability of aa
expected_pr_aa <- sum(counts[, 3]) / n
expected_probs <- c(expected_pr_AA, expected_pr_Aa, expected_pr_aa )
# 3. do my chi-squared test
out <- chisq.test(counts, p = expected_probs)
# extract p value of test and store
p_vals[i_p] <- out$p.value
}
We went through each SNP (rows in matrix X
), extracted the counts of each genotype (marked 1.
in code) for cases and controls, then we compute expected probability (marked 2.
in code). Finally, we perform a chi-squared contingency table test comparing those observed counts to expected probabilities assuming that genotype is not related to disease status (marked 3.
in code).
p_val_df <- data.frame(p_val = p_vals, idx = 1:p)
ggplot(p_val_df, aes(x = idx, y = -log10(p_val))) +
geom_point(size = 2.5, col = "dodgerblue1")
![](glm_files/figure-html/plot-1.png)
This plot is knows as a Manhattan plot. One SNP (i_p = 250
) will pop out as being highly associated with the disease process. Look at the genotype counts (or MAF) for this SNP in the cases and controls to see for yourself that there is large difference in the distribution of genotypes (or MAF).
i_p <- 250
counts_v <- c(sum(X[i_p, control] == 0), sum(X[i_p, control] == 1),
sum(X[i_p, control] == 2), sum(X[i_p, cases] == 0),
sum(X[i_p, cases] == 1), sum(X[i_p, cases] == 2))
snp_procs <- data.frame(counts_v, type = rep(c("control", "cases"), each = 3),
genotype = rep(c("AA", "Aa", "aa"), 2))
ggplot(snp_procs, aes(x = genotype, y = counts_v, fill = type)) +
geom_bar(stat = "identity", position = "dodge")
![](glm_files/figure-html/snp-1.png)
13.3 GWAS and logistic regression
Now lets approach this problem using Generalised Linear Models. Lets load a data set containing genotypes in X
and case-control status in y
:
# load an example data set (genotypes in X, case-control (1/0) status in y)
load("gwas-cc-ex2.Rdata")
n <- length(y) # how many individuals do we have in total?
p <- nrow(X) # how many SNPs do I have data for?
For each of the p
SNPs we are going to call the R
GLM function glm
using the binomial
family option with the logit
link function because my outcomes are binary. We will then extract the p-value associated with the regression coefficient for the genotype. This is obtained from applying a hypothesis test (the Wald Test) on whether the coefficient has a null value zero.
p_vals <- rep(0, p)
for ( j in 1:p ) {
snp_data <- data.frame(y = y, x = X[j, ])
glm.fit <- glm(y ~ x, family = binomial(link = logit), data = snp_data )
p_vals[j] <- summary(glm.fit)$coefficients[2,4]
}
We are testing 1,000 SNPs so lets use Bonferroni correction to adjust these p-values to take into account multiple testing:
Lets use the adjusted -log10 p-values to plot a Manhattan plot:
# create data.frame with p-values for plotting with ggplot
p_val_df <- data.frame(p_val = adj_p_vals, idx = 1:p)
ggplot(p_val_df, aes(x = idx, y = -log10(p_val))) +
geom_point(size = 2.5, col = "dodgerblue1") +
labs(y = "-log10(adjusted p-value)")
![](glm_files/figure-html/gwas2_plot-1.png)
You should see a single SNP showing a strong association with disease status.
13.4 Negative binomial and Poisson regression
Molecular biologists study the behaviour of protein expression in normal and cancerous tissues. The hypothesis is that the total number of over-expressed proteins depends on the histopathological-derived tumour subtype and an immune cell contexture measure.
You are provided with data on 314 tumours in the file nb_data.Rdata
. The file contains one data frame
with the following variables:
-
overexpressed_proteins
: response variable of interest. -
immunoscore
: gives a standardised measure of immune cell contexture. -
tumor_subtype
: three-level nominal variable indicating the histopathological sub-type of the tumour. The three levels are Unstable, Stable, and Complex
Let’s load some prerequisite R libraries and the data to produce some summary statistics (install if required using install.package()
command ):
#> Warning: package 'foreign' was built under R version 4.3.3
#> sample_id gender immunoscore
#> 1001 : 1 female:160 Min. : 1.00
#> 1002 : 1 male :154 1st Qu.:28.00
#> 1003 : 1 Median :48.00
#> 1004 : 1 Mean :48.27
#> 1005 : 1 3rd Qu.:70.00
#> 1006 : 1 Max. :99.00
#> (Other):308
#> overexpressed_proteins tumor_subtype
#> Min. : 0.000 Complex : 40
#> 1st Qu.: 1.000 Unstable:167
#> Median : 4.000 Stable :107
#> Mean : 5.955
#> 3rd Qu.: 8.000
#> Max. :35.000
#>
13.4.1 Count-based GLMs
The overexpressed_proteins
measurements are counts. This implies we should use a Poisson based GLM.
In Poisson regression models, the conditional variance is by definition equal to the conditional mean. This can be limiting.
Negative binomial regression can be used for over-dispersed count data, that is when the conditional variance exceeds the conditional mean.
It can be considered as a generalization of Poisson regression since it has the same mean structure as Poisson regression but it has an extra parameter to model the over-dispersion. If the conditional distribution of the outcome variable is over-dispersed, the confidence intervals for the Poisson regression are likely to be narrower as compared to those from a Negative Binomial regression model.
In the following we will try both models to see which fits best.
13.4.2 Fitting a GLM
Below we use the glm.nb
function from the MASS
package to estimate a negative binomial regression. The use of the function is similar to that of lm
for linear models but with the additional requirement of a link function. As count data is always positive, a log link function is useful here.
glm_1 <- glm.nb(overexpressed_proteins ~ immunoscore + tumor_subtype + gender, data = dat, link=log)
# print summary statistics of glm.nb output object to Console
summary(glm_1)
#>
#> Call:
#> glm.nb(formula = overexpressed_proteins ~ immunoscore + tumor_subtype +
#> gender, data = dat, link = log, init.theta = 1.047288915)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.707484 0.204275 13.254 < 2e-16
#> immunoscore -0.006236 0.002492 -2.502 0.0124
#> tumor_subtypeUnstable -0.424540 0.181725 -2.336 0.0195
#> tumor_subtypeStable -1.252615 0.199699 -6.273 3.55e-10
#> gendermale -0.211086 0.121989 -1.730 0.0836
#>
#> (Intercept) ***
#> immunoscore *
#> tumor_subtypeUnstable *
#> tumor_subtypeStable ***
#> gendermale .
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for Negative Binomial(1.0473) family taken to be 1)
#>
#> Null deviance: 431.67 on 313 degrees of freedom
#> Residual deviance: 358.87 on 309 degrees of freedom
#> AIC: 1740.3
#>
#> Number of Fisher Scoring iterations: 1
#>
#>
#> Theta: 1.047
#> Std. Err.: 0.108
#>
#> 2 x log-likelihood: -1728.307
R first displays the call and the deviance residuals. Next, we see the regression coefficients for each of the variables, along with standard errors, z-scores, and p-values. The variable immunoscore
has a coefficient of -0.006, which is statistically significant at the 5% level (Pr(>|z|) = 0.0124*
). This means that for each one-unit increase in immunoscore
, the expected log count of the number of overexpressed_proteins
decreases by 0.006.
The indicator variable shown as tumor_subtypeUnstable
is the expected difference in log count between group Unstable and the reference group (tumor_subtype=Complex
). The expected log count for the Unstable type is approximately 0.4 lower than the expected log count for the Complex type.
The indicator variable for Stable type is the expected difference in log count between the Stable type and the reference Complex group. The expected log count for Stable is approximately 1.2 lower than the expected log count for the Complex type.
13.4.3 Comparing nested models
To determine if tumor_subtype
itself, overall, is statistically significant, we can compare a model with and without tumor_subtype
. The reason it is important to fit separate models is that, unless we do, the overdispersion parameter is held constant and it would not be a fair comparison.
We use the anova
function to compare models using a likelihood ratio test (LRT):
#> Warning in anova.negbin(glm_1, glm_2, test = "LRT"): only
#> Chi-squared LR tests are implemented
#> Likelihood ratio tests of Negative Binomial Models
#>
#> Response: overexpressed_proteins
#> Model theta Resid. df
#> 1 immunoscore + gender 0.8705939 311
#> 2 immunoscore + tumor_subtype + gender 1.0472889 309
#> 2 x log-lik. Test df LR stat. Pr(Chi)
#> 1 -1772.074
#> 2 -1728.307 1 vs 2 2 43.76737 3.133546e-10
The two degree-of-freedom chi-square test indicates that tumor_subtype
is a statistically significant predictor of overexpressed_proteins
(Pr(Chi) = 3.133546e-10
).
The anova
function performs a form of LRT. It computes the likelihood of the data under the two models being compared and then uses the ration of these likelihood values as a test statistic.
Theory tells us that, for large samples sizes, the (2x) log likelihood ratio has a chi-squared distribution with degrees of freedom equal to the difference in the number of free parameters between the two models being compared. The LRT only applies to nested models, i.e. a pair of models where one is a less complex subset of the other.
13.5 Negative-Binomial vs Poisson GLMs
Negative binomial models assume the conditional means are not equal to the conditional variances. This inequality is captured by estimating a dispersion parameter (not shown in the output) that is held constant in a Poisson model. Thus, the Poisson model is actually nested in the negative binomial model. We can then use a likelihood ratio test to compare these two models.
To do this, we will first fit a GLM Poisson regression:
glm_3 <- glm(overexpressed_proteins ~ immunoscore + tumor_subtype + gender, family = "poisson", data = dat)
Now, lets do our likelihood ratio test, we can extract the log-likelihood using logLik()
and then use pchisq()
to extract the probability of getting a statistic at least as extreme as this:
#> 'log Lik.' 3.847622e-198 (df=6)
Note that the more complex model goes first because more complex models always have the larger likelihood.
In this example the associated chi-squared value estimated from 2*(logLik(m1) – logLik(m3))
is around 900 with one degree of freedom. This strongly suggests the negative binomial model, estimating the dispersion parameter, is more appropriate than the Poisson model.
13.6 Further understanding the model (OPTIONAL)
For assistance in further understanding the model, we can look at predicted counts for various levels of our predictors. Below we create new datasets with values of immunoscore
and tumor_subtype
and then use the predict command to calculate the predicted number of overexpressed proteins
First, we can look at predicted counts for each value of tumor_subtype
while holding immunoscore
at its mean. To do this, we create a new dataset with the combinations of tumor_subtype
and immunoscore
for which we would like to find predicted values, then use the predict()
command.
newdata_1 <-
data.frame(
immunoscore = mean(dat$immunoscore),
tumor_subtype = factor(c("Complex", "Unstable", "Stable"), labels = levels(dat$tumor_subtype)),
gender="male")
newdata_2 <-
data.frame(
immunoscore = mean(dat$immunoscore),
tumor_subtype = factor(c("Complex", "Unstable", "Stable"), labels = levels(dat$tumor_subtype)),
gender="female")
new_data <- rbind(newdata_1, newdata_2)
new_data$phat <- predict(glm_1, new_data, type = "response")
print(new_data)
#> immunoscore tumor_subtype gender phat
#> 1 48.26752 Complex male 8.983829
#> 2 48.26752 Stable male 2.567187
#> 3 48.26752 Unstable male 5.876060
#> 4 48.26752 Complex female 11.095193
#> 5 48.26752 Stable female 3.170523
#> 6 48.26752 Unstable female 7.257042
newdata_3 <-
data.frame(
immunoscore = rep(seq(from = min(dat$immunoscore), to = max(dat$immunoscore), length.out = 100), 3),
tumor_subtype = rep(factor(c("Complex", "Unstable", "Stable"), labels = levels(dat$tumor_subtype)), each=100),
gender="male")
newdata_4 <-
data.frame(
immunoscore = rep(seq(from = min(dat$immunoscore), to = max(dat$immunoscore), length.out = 100), 3),
tumor_subtype = rep(factor(c("Complex", "Unstable", "Stable"), labels = levels(dat$tumor_subtype)), each=100),
gender="female")
new_data <- rbind(newdata_3, newdata_4)
new_data <- cbind(new_data, predict(glm_1, new_data, type = "link", se.fit=TRUE))
new_data <- within(new_data, {
overexpressed_proteins <- exp(fit)
LL <- exp(fit - 1.96 * se.fit)
UL <- exp(fit + 1.96 * se.fit)
})
library(ggplot2)
ggplot(new_data, aes(immunoscore, overexpressed_proteins)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = tumor_subtype), alpha = 0.2) +
geom_line(aes(colour = tumor_subtype), size = 1.5) +
labs(x = "Immunoscore",
y = "Overexpressed Proteins") +
facet_wrap(~ gender)
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
![](glm_files/figure-html/glm-plot-1.png)