The R package ggforestplot
allows to plot vertical forest plots,
a.k.a. blobbograms, and it’s based on ggplot2
.
In this tutorial we will go through its basic functionality, as well as how one can produce grouped plots, using demo data from Nightingale’s NMR platform.
You can install ggforestplot
from github as shown below
(unless already installed, you need to install devtools
first):
If you want display package vignettes with
utils::vignette()
, install ggforestplot
with
devtools::install_github("NightingaleHealth/ggforestplot", build_vignettes = TRUE)
.
However, installing with building the vignettes takes little bit longer.
(Note: If dependencies are not installed automatically, try updating
devtools
.)
The main plotting function is
ggforestplot::forestplot()
. It’s main input is a data frame
that contains the values and corresponding standard errors to be plotted
in a forestplot layout.
Let’s get right to it and plot an example before with delve into the details of the input parameters.
# Load and attach the package
library(ggforestplot)
# Load and attach other useful packages
# install.packages("tidyverse")
library(tidyverse)
# Filter only associations to BMI for the first 30 biomarkers of the example
# dataset
df <-
ggforestplot::df_linear_associations %>%
filter(
trait == "BMI",
dplyr::row_number() <= 30
)
# Draw a forestplot of cross-sectional, linear associations
ggforestplot::forestplot(
df = df,
name = name,
estimate = beta,
se = se
)
These are the linear associations of BMI to 30 blood biomarkers - selected somewhat randomly - along with their 95% confidence intervals.
Now let’s take a closer look at the example dataset used above in
order to understand what is the input that
ggforestplot::forestplot()
expects.
The dataset ggforestplot::df_linear_associations
is a tibble()
and
contains linear associations of blood biomarkers to Body Mass Index
(BMI), insulin resistance (log(HOMA-IR)) and fasting glucose as found in
A. V.
Ahola-Olli et al. (2019). Below are the first 10 rows.
# Linear associations of blood biomarkers to Body Mass Index (BMI), insulin
# resistance (log(HOMA-IR)) and fasting glucose
ggforestplot::df_linear_associations %>%
print()
#> # A tibble: 681 × 5
#> name trait beta se pvalue
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Isoleucine BMI 0.339 0.00945 1.11e-281
#> 2 Leucine BMI 0.343 0.00951 1.25e-285
#> 3 Valine BMI 0.287 0.00951 7.94e-200
#> 4 Phenylalanine BMI 0.343 0.00862 0
#> 5 Tyrosine BMI 0.261 0.00900 6.65e-185
#> 6 Alanine BMI 0.179 0.00890 8.62e- 90
#> 7 Glutamine BMI -0.134 0.00945 7.68e- 46
#> 8 Glycine BMI -0.0296 0.00937 1.56e- 3
#> 9 Histidine BMI 0.0364 0.00917 7.25e- 5
#> 10 Lactate BMI 0.131 0.00911 9.20e- 47
#> # ℹ 671 more rows
The variables, as also stated in the documentation, are:
name
: the names of the Nightingale serum biomarkers
(Note: glucose is missing as the results are adjusted for this
biomarker).trait
: the response variable of the regression model,
BMI, log(HOMA-IR) or fasting glucose.beta
: regression coefficient β.se
: standard deviationpvalue
: p-valueSo the first input parameter for forestplot()
must be a
data frame with at least the variables name
(character),
estimate
(numeric) and se
(numeric), which of
course may be named differently.
Let’s now start improving the plot above.
First, we need a more descriptive x axis label and we may also add a
title. We’d also like to visualize a Bonferroni correction to account
for multiple testing. Here we have a comparison of 30 biomarkers. If we
suppose the common significance threshold α = 0.05, the Bonferroni correction
for each individual hypothesis, assuming the 30 tests are independent,
is α = 0.05/30 ≈ 0.002 (Note:
in fact for biologically correlated measures this correction is too
strict; here we could, for example, perform a principal component
analysis and instead of the number of biomarkers we’d choose the number
of principal components that explain a fair amount of variance in the
metabolic data, e.g. 99%; see also
vignette("nmr-data-analysis-tutorial")
.
In the forestplot function, we will input the significance threshold
as the parameter psignif
and we will define explicitly the
variable name in df
that contains the p-values of the
linear regression, in this case the column pvalue
.
# Draw a forestplot of cross-sectional, linear associations
# Add a more descriptive x-label, a title and the threshold for statistical
# significance with Bonferroni correction.
# (Note that the df variable names 'name', 'beta' and 'se' do not really have to be
# explictly defined, as these are the default input values of these parameters.)
ggforestplot::forestplot(
df = df,
estimate = beta,
pvalue = pvalue,
psignif = 0.002,
xlab = "1-SD increment in BMI\nper 1-SD increment in biomarker concentration",
title = "Associations of blood biomarkers to BMI"
)
Notice how the non-significant results are now displayed as hollow points.
Let us now go ahead and plot, along with the BMI, the associations with insulin resistance (HOMA-IR) and fasting glucose.
# Extract the biomarker names
selected_bmrs <- df %>% pull(name)
# Filter the demo dataset for the biomarkers above and all three traits:
# BMI, HOMA-IR and fasting glucose
df_compare_traits <-
ggforestplot::df_linear_associations %>%
filter(name %in% selected_bmrs) %>%
# Set class to factor to set order of display.
mutate(
trait = factor(
trait,
levels = c("BMI", "HOMA-IR", "Fasting glucose")
)
)
# Draw a forestplot of cross-sectional, linear associations
# Notice how the df variable 'trait' is used here to color the points
ggforestplot::forestplot(
df = df_compare_traits,
estimate = beta,
pvalue = pvalue,
psignif = 0.002,
xlab = "1-SD increment in cardiometabolic trait\nper 1-SD increment in biomarker concentration",
title = "Biomarker associations to metabolic traits",
colour = trait
)
Finally, we would like to group the blood biomarkers by category to improve the readability of the plot.
The package includes also a data frame,
ggforestplot::df_NG_biomarker_metadata
, with metadata on
the Nightingale blood biomarkers, such as different naming options,
descriptions, and group/sugroup information. We will use the grouping
information in this data frame to plot the 30 biomarkers above in a
grouped layout.
We can achieve this by joining the biomarker metadata with the data
frame df_compare_traits
above, that contains the
associations. We will then use ggforce::facet_col
to facet
by group.
# Install and attach the ggforce library
# install.packages("ggforce")
library(ggforce)
# Filter df_NG_biomarker_metadata, that contain the groups, for only the 30
# biomarkers under discussion
df_grouping <-
df_NG_biomarker_metadata %>%
filter(name %in% df_compare_traits$name)
# Join the association data frame df_compare_traits with group data
df_compare_traits_groups <-
df_compare_traits %>%
# use right_join, with df_grouping on the right, to preserve the order of
# biomarkers it specifies.
dplyr::right_join(., df_grouping, by = "name") %>%
dplyr::mutate(
group = factor(.data$group, levels = unique(.data$group))
)
# Draw a forestplot of cross-sectional, linear associations.
forestplot(
df = df_compare_traits_groups,
estimate = beta,
pvalue = pvalue,
psignif = 0.002,
xlab = "1-SD increment in cardiometabolic trait\nper 1-SD increment in biomarker concentration",
colour = trait
) +
ggforce::facet_col(
facets = ~group,
scales = "free_y",
space = "free"
)
Notice above, how the order of the biomarkers is now different than
in the previous plots. This is the order of groups and biomarkers in
df_grouping
.
# Print the first 10 rows of df_grouping to inspect the order of biomarkers
df_grouping %>%
pull(name) %>%
print()
#> [1] "Total cholines" "Sphingomyelins" "Total fatty acids"
#> [4] "Unsaturation" "Omega-3 %" "Omega-6 %"
#> [7] "PUFA %" "MUFA %" "SFA %"
#> [10] "LA %" "DHA %" "Alanine"
#> [13] "Glutamine" "Glycine" "Histidine"
#> [16] "Isoleucine" "Leucine" "Valine"
#> [19] "Phenylalanine" "Tyrosine" "Lactate"
#> [22] "Pyruvate" "Citrate" "Glycerol"
#> [25] "3-Hydroxybutyrate" "Acetate" "Acetoacetate"
#> [28] "Creatinine" "Albumin" "Glycoprotein acetyls"
Another note is that we had to define manually a common axis for all
the plots. For that we added the layer
ggplot2::coord_cartesian(xlim = c(-0.3, 0.4))
in each
forestplot()
call. In order to know, a priori, what is the
correct common x limits you are looking for, you need to estimate the
confidence intervals from the standard errors. In practice however, you
may simply run the above piece of code twice, once without setting x
limits and, after visual inspection, rerun adding
ggplot2::coord_cartesian(xlim = c(xmin, xmax))
.
Finally, the implementation above removes x-axis texts and labels for
each group. You may want to comment that out depending on how long is
your column of forestplots but, at least, keep axis.text.x
for only the last plot.
The package includes also a second demo dataset from the same paper,
ggforestplot::df_logodds_associations
, with log odds ratios
of blood biomarkers with incident type 2 diabetes. The
beta
, se
and pvalue
variables in
this set are the result of logistic regression and the additonal
n
variable reports the cohorts sample size.
# Odds Ratios of Blood Biomarkers with Incident Type 2 Diabetes
ggforestplot::df_logodds_associations %>%
print()
#> # A tibble: 1,135 × 6
#> name study beta se pvalue n
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Isoleucine Meta-analysis 0.285 0.0500 0.0000000126 11777
#> 2 Leucine Meta-analysis 0.286 0.0506 0.0000000165 11783
#> 3 Valine Meta-analysis 0.182 0.0558 0.00110 11777
#> 4 Phenylalanine Meta-analysis 0.271 0.0538 0.000000453 11782
#> 5 Tyrosine Meta-analysis 0.168 0.0541 0.00189 11779
#> 6 Alanine Meta-analysis 0.122 0.0540 0.0237 11784
#> 7 Glutamine Meta-analysis -0.172 0.0575 0.00284 11744
#> 8 Glycine Meta-analysis -0.0773 0.0617 0.211 11496
#> 9 Histidine Meta-analysis 0.0473 0.0553 0.393 11777
#> 10 Lactate Meta-analysis 0.119 0.0529 0.0245 11781
#> # ℹ 1,125 more rows
We will use this dataset to demonstrate how to plot odds ratios (the same logic applies for hazard ratios).
The dataset includes log odds ratios with incident type 2 diabetes for a total of 4 cohorts plus the meta-analysis we saw above. Let’s plot the odds ratios for amino acids.
# Filter df_NG_biomarker_metadata for only amino acids
df_grouping <-
df_NG_biomarker_metadata %>%
filter(group %in% "Amino acids")
# Join the association data frame with group data
df <-
df_logodds_associations %>%
# Set the study variable to a factor to preserve order of appearance
mutate(
study = factor(
study,
levels = c("Meta-analysis", "NFBC-1997", "DILGOM", "FINRISK-1997", "YFS")
)
) %>%
# use right_join, with df_grouping on the right, to preserve the order of
# biomarkers it specifies.
dplyr::right_join(., df_grouping, by = "name") %>%
tidyr::drop_na(.data$beta)
#> Warning: Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0.
#> ℹ Please use `"beta"` instead of `.data$beta`
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
# Draw a forestplot of odds ratios
ggforestplot::forestplot(
df = df,
name = name,
estimate = beta,
se = se,
pvalue = pvalue,
psignif = 0.002,
colour = study,
xlab = "Odds ratio for incident type 2 diabetes (95% CI)\nper 1−SD increment in metabolite concentration",
title = "Amino acids",
logodds = TRUE
)
Notice that the additional parameter set here is
logodds = TRUE
. What happens, in this case, is that the
beta
values are exponentiated and plotted in a logarithmic
scale. The confidence intervals for the odds ratios are estimated by
default, using the standard errors of the log odds ratios, as follows
CIlow = exp (β − 1.96 * SE)
and CIhigh = exp (β + 1.96 * SE).
If you wish to use some confidence interval other than the 95%, you may
specify this in the parameter ci
of the function (see
below).
Additionally, you may set a scale_shape_manual()
. For
example, a common convention is to plot meta-analysis results with a
diamond shape.
# Draw a forestplot of odds ratios
ggforestplot::forestplot(
df = df,
name = name,
estimate = beta,
se = se,
pvalue = pvalue,
psignif = 0.002,
colour = study,
shape = study,
xlab = "Odds ratio for incident type 2 diabetes (95% CI)\nper 1−SD increment in metabolite concentration",
title = "Amino acids",
logodds = TRUE
) +
ggplot2::scale_shape_manual(
values = c(23L, 21L, 21L, 21L, 21L),
labels = c("Meta-analysis", "NFBC-1997", "DILGOM", "FINRISK-1997", "YFS")
)
#> Scale for shape is already present.
#> Adding another scale for shape, which will replace the existing scale.
Finally, the ggforestplot
package provides a custom
function that plots and prints all Nightingale blood biomarkers in a
single pdf file, called plot_all_NG_biomarkers()
. There are
two layouts available for 2016 and 2020 versions of the biomarker
platform. Alternatively, one can also input a custom layout. Its usage
is straightforward.
# Join the built-in association demo dataset with a variable that contains
# the machine readable names of Nightingale biomarkers. (Note: if you
# have built your association data frame using the Nightingale CSV result file,
# then your data frame should already contain machine readable names.)
df <-
df_linear_associations %>%
left_join(
select(
df_NG_biomarker_metadata,
name,
machine_readable_name
),
by = "name"
)
# Print effect sizes for all Nightingale biomarkers in a custom 2-page pdf
p <- plot_all_NG_biomarkers(
df = df,
machine_readable_name = machine_readable_name,
estimate = beta,
se = se,
pvalue = pvalue,
colour = trait,
layout = "2016",
xlab = "1-SD increment in cardiometabolic trait
per 1-SD increment in biomarker concentration"
)
# Results are missing for glucose, remove NA from legend
p <- purrr::map(
.x = p,
.f = ~ .x + ggforestplot::scale_colour_ng_d(na.translate = FALSE)
)
grDevices::pdf("biomarker_linear_associations.pdf", 15, sqrt(2) * 15)
p
grDevices::dev.off()
You may view the output of this function here.