In this tutorial we will go through the basic steps of importing blood metabolomics data into R, joining them with phenotypes or endpoint data, and performing basic epidemiological analysis. The tutorial uses simulated demo datasets from Nightingale’s NMR platform.
This tutorial uses R, which is free, compiles and runs in most operating systems. If you don’t have it already installed in your system, you may install it by following the instructions here. Additionally, we recommend using the RStudio IDE (Integrated Development Environment), which can be downloaded and installed following the instructions here for the free “RStudio Desktop (Open Source License)”.
You should be able to reproduce easily the code in this tutorial, even with very little knowledge in programming. If you have never programmed in R before and are intrested to learn more, you may find the following book useful, Hands on Programming with R by Garrett Grolemund.
Throughout the tutorial, we will be using the collection of tidyverse
R packages
and try to abide by their philosophy. You can install the complete
tidyverse by typing the following line of code in the R console:
# Install the tidyverse packages
install.packages("tidyverse")
# Attach the tidyverse packages into the current session
library(tidyverse)
We also note the usage of the pipe %>%
symbol
throughout the code of this tutorial. The pipe is a way to write a
series of operations on an R object, e.g. a dataset, in an easy-to-read
way. As an example, the operation x %>% f(y)
effectivly
means f(x, y)
. You can read more on the pipe here.
The NMR biomarker concentrations are provided in xlsx, csv and tsv formats. R can read either but in this tutorial we will use the csv format.
For the purposes of this tutorial we will use an example csv file. Download the zipped folder here and unzip it. It contains two files, one with metabolomics data and one with corresponding clinical data that we will use later on.
Type the commands below in your R session to load the metabolomics
data and view the available biomarkers. Mind to put the correct path in
the file
parameter below.
# Read the biomarker concentration file
df_nmr_results <- readr::read_csv(
# Enter the correct location for your file below
file = "/path/to/file/12345-Results.csv",
# Set not only NA but TAG string as <NA>
na = c("NA", "TAG"),
col_types = cols(.default = col_double(),
Sample_id = col_character())
)
Print the names of all the variables in the dataset for inspection.
names(df_nmr_results)
#> [1] "Sample_id" "Unidentified_small_molecule_c"
#> [3] "Unidentified_macromolecules" "Aminocaproic_acid"
#> [5] "Below_limit_of_quantification" "Citrate_plasma"
#> [7] "Low_protein" "Low_ethanol"
#> [9] "Low_glutamine_or_high_glutamate" "Gluconolactone"
#> [11] "High_ethanol" "High_lactate"
#> [13] "High_pyruvate" "Isopropyl_alcohol"
#> [15] "Low_glucose" "Medium_ethanol"
#> [17] "EDTA_plasma" "Polysaccharides"
#> [19] "N_methyl_2_pyrrolidone" "Unexpected_amino_acid_signals"
#> [21] "Unidentified_small_molecule_a" "Unidentified_small_molecule_b"
#> [23] "Total_C" "non_HDL_C"
#> [25] "Remnant_C" "VLDL_C"
#> [27] "Clinical_LDL_C" "LDL_C"
#> [29] "HDL_C" "Total_TG"
#> [31] "VLDL_TG" "LDL_TG"
#> [33] "HDL_TG" "Total_PL"
#> [35] "VLDL_PL" "LDL_PL"
#> [37] "HDL_PL" "Total_CE"
#> [39] "VLDL_CE" "LDL_CE"
#> [41] "HDL_CE" "Total_FC"
#> [43] "VLDL_FC" "LDL_FC"
#> [45] "HDL_FC" "Total_L"
#> [47] "VLDL_L" "LDL_L"
#> [49] "HDL_L" "Total_P"
#> [51] "VLDL_P" "LDL_P"
#> [53] "HDL_P" "VLDL_size"
#> [55] "LDL_size" "HDL_size"
#> [57] "Phosphoglyc" "TG_by_PG"
#> [59] "Cholines" "Phosphatidylc"
#> [61] "Sphingomyelins" "ApoB"
#> [63] "ApoA1" "ApoB_by_ApoA1"
#> [65] "Total_FA" "Unsaturation"
#> [67] "Omega_3" "Omega_6"
#> [69] "PUFA" "MUFA"
#> [71] "SFA" "LA"
#> [73] "DHA" "Omega_3_pct"
#> [75] "Omega_6_pct" "PUFA_pct"
#> [77] "MUFA_pct" "SFA_pct"
#> [79] "LA_pct" "DHA_pct"
#> [81] "PUFA_by_MUFA" "Omega_6_by_Omega_3"
#> [83] "Ala" "Gln"
#> [85] "Gly" "His"
#> [87] "Total_BCAA" "Ile"
#> [89] "Leu" "Val"
#> [91] "Phe" "Tyr"
#> [93] "Glucose" "Lactate"
#> [95] "Pyruvate" "Citrate"
#> [97] "Glycerol" "bOHbutyrate"
#> [99] "Acetate" "Acetoacetate"
#> [101] "Acetone" "Creatinine"
#> [103] "Albumin" "GlycA"
#> [105] "XXL_VLDL_P" "XXL_VLDL_L"
#> [107] "XXL_VLDL_PL" "XXL_VLDL_C"
#> [109] "XXL_VLDL_CE" "XXL_VLDL_FC"
#> [111] "XXL_VLDL_TG" "XL_VLDL_P"
#> [113] "XL_VLDL_L" "XL_VLDL_PL"
#> [115] "XL_VLDL_C" "XL_VLDL_CE"
#> [117] "XL_VLDL_FC" "XL_VLDL_TG"
#> [119] "L_VLDL_P" "L_VLDL_L"
#> [121] "L_VLDL_PL" "L_VLDL_C"
#> [123] "L_VLDL_CE" "L_VLDL_FC"
#> [125] "L_VLDL_TG" "M_VLDL_P"
#> [127] "M_VLDL_L" "M_VLDL_PL"
#> [129] "M_VLDL_C" "M_VLDL_CE"
#> [131] "M_VLDL_FC" "M_VLDL_TG"
#> [133] "S_VLDL_P" "S_VLDL_L"
#> [135] "S_VLDL_PL" "S_VLDL_C"
#> [137] "S_VLDL_CE" "S_VLDL_FC"
#> [139] "S_VLDL_TG" "XS_VLDL_P"
#> [141] "XS_VLDL_L" "XS_VLDL_PL"
#> [143] "XS_VLDL_C" "XS_VLDL_CE"
#> [145] "XS_VLDL_FC" "XS_VLDL_TG"
#> [147] "IDL_P" "IDL_L"
#> [149] "IDL_PL" "IDL_C"
#> [151] "IDL_CE" "IDL_FC"
#> [153] "IDL_TG" "L_LDL_P"
#> [155] "L_LDL_L" "L_LDL_PL"
#> [157] "L_LDL_C" "L_LDL_CE"
#> [159] "L_LDL_FC" "L_LDL_TG"
#> [161] "M_LDL_P" "M_LDL_L"
#> [163] "M_LDL_PL" "M_LDL_C"
#> [165] "M_LDL_CE" "M_LDL_FC"
#> [167] "M_LDL_TG" "S_LDL_P"
#> [169] "S_LDL_L" "S_LDL_PL"
#> [171] "S_LDL_C" "S_LDL_CE"
#> [173] "S_LDL_FC" "S_LDL_TG"
#> [175] "XL_HDL_P" "XL_HDL_L"
#> [177] "XL_HDL_PL" "XL_HDL_C"
#> [179] "XL_HDL_CE" "XL_HDL_FC"
#> [181] "XL_HDL_TG" "L_HDL_P"
#> [183] "L_HDL_L" "L_HDL_PL"
#> [185] "L_HDL_C" "L_HDL_CE"
#> [187] "L_HDL_FC" "L_HDL_TG"
#> [189] "M_HDL_P" "M_HDL_L"
#> [191] "M_HDL_PL" "M_HDL_C"
#> [193] "M_HDL_CE" "M_HDL_FC"
#> [195] "M_HDL_TG" "S_HDL_P"
#> [197] "S_HDL_L" "S_HDL_PL"
#> [199] "S_HDL_C" "S_HDL_CE"
#> [201] "S_HDL_FC" "S_HDL_TG"
#> [203] "XXL_VLDL_PL_pct" "XXL_VLDL_C_pct"
#> [205] "XXL_VLDL_CE_pct" "XXL_VLDL_FC_pct"
#> [207] "XXL_VLDL_TG_pct" "XL_VLDL_PL_pct"
#> [209] "XL_VLDL_C_pct" "XL_VLDL_CE_pct"
#> [211] "XL_VLDL_FC_pct" "XL_VLDL_TG_pct"
#> [213] "L_VLDL_PL_pct" "L_VLDL_C_pct"
#> [215] "L_VLDL_CE_pct" "L_VLDL_FC_pct"
#> [217] "L_VLDL_TG_pct" "M_VLDL_PL_pct"
#> [219] "M_VLDL_C_pct" "M_VLDL_CE_pct"
#> [221] "M_VLDL_FC_pct" "M_VLDL_TG_pct"
#> [223] "S_VLDL_PL_pct" "S_VLDL_C_pct"
#> [225] "S_VLDL_CE_pct" "S_VLDL_FC_pct"
#> [227] "S_VLDL_TG_pct" "XS_VLDL_PL_pct"
#> [229] "XS_VLDL_C_pct" "XS_VLDL_CE_pct"
#> [231] "XS_VLDL_FC_pct" "XS_VLDL_TG_pct"
#> [233] "IDL_PL_pct" "IDL_C_pct"
#> [235] "IDL_CE_pct" "IDL_FC_pct"
#> [237] "IDL_TG_pct" "L_LDL_PL_pct"
#> [239] "L_LDL_C_pct" "L_LDL_CE_pct"
#> [241] "L_LDL_FC_pct" "L_LDL_TG_pct"
#> [243] "M_LDL_PL_pct" "M_LDL_C_pct"
#> [245] "M_LDL_CE_pct" "M_LDL_FC_pct"
#> [247] "M_LDL_TG_pct" "S_LDL_PL_pct"
#> [249] "S_LDL_C_pct" "S_LDL_CE_pct"
#> [251] "S_LDL_FC_pct" "S_LDL_TG_pct"
#> [253] "XL_HDL_PL_pct" "XL_HDL_C_pct"
#> [255] "XL_HDL_CE_pct" "XL_HDL_FC_pct"
#> [257] "XL_HDL_TG_pct" "L_HDL_PL_pct"
#> [259] "L_HDL_C_pct" "L_HDL_CE_pct"
#> [261] "L_HDL_FC_pct" "L_HDL_TG_pct"
#> [263] "M_HDL_PL_pct" "M_HDL_C_pct"
#> [265] "M_HDL_CE_pct" "M_HDL_FC_pct"
#> [267] "M_HDL_TG_pct" "S_HDL_PL_pct"
#> [269] "S_HDL_C_pct" "S_HDL_CE_pct"
#> [271] "S_HDL_FC_pct" "S_HDL_TG_pct"
The first column, Sample_id
, contains the sample
identifiers. Columns 2 to 19 contain information on different tags that
a sample may contain. Explanation on any tags present in an actual
result file is given along with the analysis report but here all
possible tag columns are provided for demonstarational purposes. The
rest of the columns correspond to machine readable abbreviations of the
NMR-quantified blood biomarkers.
Let us drop the tag columns as they are not used in this demo.
df_nmr_results <-
df_nmr_results %>%
dplyr::select(
.data$Sample_id,
tidyselect::any_of(
ggforestplot::df_NG_biomarker_metadata$machine_readable_name
)
)
#> Warning: Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0.
#> ℹ Please use `"Sample_id"` instead of `.data$Sample_id`
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
If you have a Nightingale result file with alternative biomarker
names, for example XXL-VLDL-C_%
instead of
XXL_VLDL_C_pct
or LA/FA
instead of
LA_pct
, you may utilize the variable
alternative_ids
in the
ggforestplot::df_NG_biomarker_metadata
dataset to convert
from one type to the other. An example is provided below, where we
transform a dataset with alternative biomarker name (as long as they are
found in the
ggforestplot::df_NG_biomarker_metadata$alternative_ids
list) to their
ggforestplot::df_NG_biomarker_metadata$machine_readable_name
.
# Assume that your data frame, containing alternative biomarker names, is called
# df_nmr_results_alt_names
alt_names <-
names(df_nmr_results)
new_names <-
alt_names %>%
purrr::map_chr(function(id) {
# Look through the alternative_ids
hits <-
purrr::map_lgl(
df_NG_biomarker_metadata$alternative_names,
~ id %in% .
)
# If one unambiguous hit, return it.
if (sum(hits) == 1L) {
return(df_NG_biomarker_metadata$machine_readable_name[hits])
# If not found, give a warning and pass through the input.
} else {
warning("Biomarker not found: ", id, call. = FALSE)
return(id)
}
})
# Name the vector with the new names
names(alt_names) <- new_names
# Rename your result data frame with machine_readable_names
df_nmr_results_alt_names <-
df_nmr_results %>%
rename(!!alt_names)
Next we join the blood biomarker data above with other variables we may have, such as patient basic information (e.g. gender, age, e.t.c.) or patient outcomes (e.g. a disease event).
The second dataset you downloaded earlier (from here),
called clinical_data.csv
, has additional to the NMR
information corresponding to the same fictional individuals. It has
information on gender, age, body mass index (BMI) as well as incident
T2D diabetes.
We will join the two datasets using the IDs, i.e. column
Sample_id
in the NMR data and column
identifier
in the clinical data. Except for the columns
having different names in the two files, the IDs themselves are not the
same either (which is also often the case in real life).
Sample_id
in the NMR is a character column with the prefix
ResearchInstitutionProjectLabId_
while
identifier
in clinical data is a character consisting of 4
numbers.
Below we read the clinical data and join the two datasets after a few adjustments.
# Read the clinical_data.csv data file
df_clinical_data <- readr::read_csv(
# Enter the correct location for your file below
file = "/path/to/file/clinical_data.csv") %>%
# Rename the identifier column to Sample_id as in the NMR result file
rename(Sample_id = identifier) %>%
# Harmonize the id entries in clinical data with the ids in the NMR data
mutate(Sample_id = paste0(
"ResearchInstitutionProjectLabId_",
as.numeric(Sample_id)
))
# Inspect the first 10 entries
print(df_clinical_data)
#> # A tibble: 1,887 × 6
#> Sample_id gender baseline_age BMI incident_diabetes age_at_diabetes
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 ResearchInstitut… male 57.6 25.2 1 59.9
#> 2 ResearchInstitut… male 36.6 25.5 0 52.2
#> 3 ResearchInstitut… female 37.9 31.4 0 56.6
#> 4 ResearchInstitut… male 57.9 30.5 0 70.9
#> 5 ResearchInstitut… male 70.9 28.6 0 86.5
#> 6 ResearchInstitut… female 41.1 26.9 0 63.8
#> 7 ResearchInstitut… male 47.1 24.2 0 59.7
#> 8 ResearchInstitut… female 32.4 23.7 0 47.2
#> 9 ResearchInstitut… female 58.1 19.3 1 72.4
#> 10 ResearchInstitut… male 52.4 31.6 0 67.4
#> # ℹ 1,877 more rows
# Join NMR result file with the clinical data using column "Sample_id"
df_full_data <- dplyr::left_join(
x = df_nmr_results,
y = df_clinical_data,
by = "Sample_id"
)
We will first plot the distribution for a specific biomarker to inspect the data more closely. We choose glycoprotein acetyls (abbrev. GlycA) which reflects the amount of N-acetyl groups in circulating glycoproteins, many of which are involved in the acute-phase inflammatory response.
We first add a column to the dataset, marking the subjects as obese or not-obese. We then plot the GlycA distributions for the two sets as boxplots.
df_full_data %>%
# Add a column to mark obese/non-obese subjects
mutate(obesity = ifelse(BMI >= 30, yes = "Obese", no = "Not obese")) %>%
# Filter out subjects with missing values (if any)
filter(!is.na(obesity)) %>%
# Box plots for each group
ggplot(aes(y = GlycA, x = obesity, color = obesity)) +
geom_boxplot() +
# Remove x axis title
theme(axis.title.x = element_blank())
In the box plot above, the lower, middle and higher hinges correspond to 25%, 50% and 75% quantiles, respectively. Therefore, we see that the GlycA distribution for obese subjects is clearly shifted to higher values as compared to non-obese subjects, indicating that this marker is likely positively associated with obesity.
We will now estimate associations of each biomarker to BMI via linear regression:
where the blood biomarker is the outcome, y, and BMI is the exposure x. The association of x with y refers to the beta coefficient (β).
Below, we use function
ggforestplot::discovery_regression()
, with input parameter
model
set to 'lm'
to estimate multiple,
adjusted, linear associations, for all the biomarkers. Note that we
scale the biomarkers so that the magnitude of associations are directly
comparable across the different biomarkers. Log-transformation is done
in order to satisfy the assumption of linear regression for normal
distribution of the residuals.
# Extract names of NMR biomarkers
nmr_biomarkers <- dplyr::intersect(
ggforestplot::df_NG_biomarker_metadata$machine_readable_name,
colnames(df_nmr_results)
)
# NMR biomarkers here should be 250
stopifnot(length(nmr_biomarkers) == 250)
# Select only variables to be used for the model and collapse to a long data
# format
df_long <-
df_full_data %>%
# Select only model variables
dplyr::select(tidyselect::all_of(nmr_biomarkers), gender, baseline_age, BMI) %>%
# Log-transform and scale biomarkers
dplyr::mutate_at(
.vars = dplyr::vars(tidyselect::all_of(nmr_biomarkers)),
.funs = ~ .x %>% log1p() %>% scale %>% as.numeric()
) %>%
# Collapse to a long format
tidyr::gather(
key = biomarkerid,
value = biomarkervalue,
tidyselect::all_of(nmr_biomarkers)
)
# Estimate sex- and age-adjusted associations of metabolite to BMI
df_assoc_per_biomarker_bmi <-
ggforestplot::discovery_regression(
df_long = df_long,
model = "lm",
formula =
formula(
biomarkervalue ~ BMI + factor(gender) + baseline_age
),
key = biomarkerid,
predictor = BMI
) %>%
# Join this dataset with the grouping data in order to choose a different
# biomarker naming option
left_join(
select(
df_NG_biomarker_metadata,
name,
biomarkerid = machine_readable_name
),
by = "biomarkerid")
head(df_assoc_per_biomarker_bmi)
#> # A tibble: 6 × 5
#> biomarkerid estimate se pvalue name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 Total_C 0.0169 0.00544 1.91e- 3 Total-C
#> 2 non_HDL_C 0.0293 0.00536 5.01e- 8 non-HDL-C
#> 3 Remnant_C 0.0303 0.00535 1.79e- 8 Remnant-C
#> 4 VLDL_C 0.0428 0.00526 7.63e-16 VLDL-C
#> 5 Clinical_LDL_C 0.0251 0.00541 3.71e- 6 Clinical LDL-C
#> 6 LDL_C 0.0277 0.00539 2.94e- 7 LDL-C
The data frame df_assoc_per_biomarker_bmi
above,
contains 250 rows, one for each blood biomarker, and 5 variables
(columns): the biomarkerid
(same as variable
machine_readable_name
in
df_NG_biomarker_metadata
) which has the biomarker
abbreviations in the same format as in the input csv file, the biomarker
name
which is a bit more descriptive than
biomarkerid
, and the variables estimate
,
se
and pvalue
that correspond to the linear
regression coefficient β, the
standard error and the p-value, respectively.
Let’s now visualize the results. For the purposes of this demo, we
will plot a selected subset of the biomarkers in a forestplot layout. At
the end of the tutorial, you will find an example of how to plot
associations for all Nightingale biomarkers in a pdf (see also
vignette("ggforestplot")
for more details on the usage of
ggforestplot::forestplot()
).
The ggforestplot
package includes a data frame, called
ggforestplot::df_NG_biomarker_metadata
, with metadata on
the Nightingale blood biomarkers, such as different naming options,
descriptions and group/sugroup information. We will use the group (or
sugroup) information in this data frame to decide which groups of
biomarkers and in what order we want to visualize here.
# Display blood biomarker groups
ggforestplot::df_NG_biomarker_metadata %>%
pull(group) %>%
unique()
#> [1] "Cholesterol"
#> [2] "Triglycerides"
#> [3] "Phospholipids"
#> [4] "Cholesteryl esters"
#> [5] "Free cholesterol"
#> [6] "Total lipids"
#> [7] "Lipoprotein particle concentrations"
#> [8] "Lipoprotein particle sizes"
#> [9] "Other lipids"
#> [10] "Apolipoproteins"
#> [11] "Fatty acids"
#> [12] "Amino acids"
#> [13] "Glycolysis related metabolites"
#> [14] "Ketone bodies"
#> [15] "Fluid balance"
#> [16] "Inflammation"
#> [17] "Lipoprotein subclasses"
#> [18] "Relative lipoprotein lipid concentrations"
# # You may wish, alternatively, to display blood biomarker subgroups and plot
# # according to that. You can see which subgroups are available in the
# # grouping data by uncommenting the following lines. In several cases, subgroup
# # is identical to group.
# ggforestplot::df_NG_biomarker_metadata %>%
# pull(subgroup) %>%
# unique()
# Choose the groups you want to plot and define the order with which group
# categories will appear in the plot
group_order <- c(
"Branched-chain amino acids",
"Aromatic amino acids",
"Amino acids",
"Fluid balance",
"Inflammation",
"Fatty acids",
"Triglycerides",
"Other lipids",
"Cholesterol"
)
# Extract a subset of the df_NG_biomarker_metadata, with the desired group
# order, to set the order of biomarkers in the forestplot later on
df_with_groups <-
ggforestplot::df_NG_biomarker_metadata %>%
# Select subset of variables
select(name = name,
group) %>%
# Filter and arrange for the wanted groups
filter(group %in% group_order) %>%
arrange(factor(group, levels = group_order))
We’d also like to add to the forrestplot a Bonferroni correction to account for multiple testing. Here we have a comparison of 250 biomarkers. If we suppose the common significance threshold α = 0.05, the Bonferroni correction for each individual hypothesis, assuming the 250 tests are independent, is α = 0.05/250 ≈ 0.0002. However, for biologically correlated measures this correction is too strict; Here, instead of correcting for the total number of biomarkers we will use the number of principal components that explain 99% of the variance in the data.
# Perform principal component analysis on metabolic data
df_pca <-
df_full_data %>%
select(tidyselect::all_of(nmr_biomarkers)) %>%
nest(data = dplyr::everything()) %>%
mutate(
# Perform PCA on the data above, after scaling and centering to 0
pca = map(data, ~ stats::prcomp(.x, center = TRUE, scale = TRUE)),
# Augment the original data with columns containing each observation's
# projection into the PCA space. Each PCA-space dimension is signified
# with the .fitted prefix in its name
pca_aug = map2(pca, data, ~broom::augment(.x, data = .y))
)
# Estimate amount of variance explained by each principal component
df_pca_variance <-
df_pca %>%
unnest(pca_aug) %>%
# Estimate variance for the PCA projected variables
summarize_at(.vars = vars(starts_with(".fittedPC")), .funs = ~ var(.x)) %>%
# Gather data in a long format
gather(key = PC, value = variance) %>%
# Estimate cumulative normalized variance
mutate(cumvar = cumsum(variance / sum(variance)),
PC = str_replace(PC, ".fitted", ""))
# Find number of principal components that explain 99%
pc_99 <-
df_pca_variance %>%
filter(cumvar <= 0.99) %>%
nrow()
print(pc_99)
#> [1] 41
# Corrected significance threshold
psignif <- signif(0.05 / pc_99, 1)
Therefore, our significance threshold here will be α = 0.05/41 ≈ 0.001
Below we plot the results. In the forestplot function, our main input
is the data frame df_assoc_per_biomarker_bmi
, while we use
the previously constructed data frame df_with_groups
and
its variable group
to impose the grouping and the order of
biomarkers in the plot. For the statistical significance, we input the
significance threshold as the parameter psignif = 0.001
and
we define explicitly the variable name in
df_assoc_per_biomarker_bmi
that contains the p-values of
the linear regression, in this case the column pvalue.
# Join the association data frame with group data
df_to_plot <-
df_assoc_per_biomarker_bmi %>%
# use right_join, with df_grouping on the right, to preserve the order of
# biomarkers it specifies.
dplyr::right_join(., df_with_groups, by = "name") %>%
dplyr::mutate(
group = factor(.data$group, levels = group_order)
) %>%
tidyr::drop_na(.data$estimate)
#> Warning: Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0.
#> ℹ Please use `"estimate"` instead of `.data$estimate`
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
# Draw a forestplot of cross-sectional, linear associations.
forestplot(
df = df_to_plot,
pvalue = pvalue,
psignif = psignif,
xlab = "1-SD increment in biomarker concentration\nper unit increment in BMI"
) +
ggforce::facet_col(
facets = ~group,
scales = "free_y",
space = "free"
)
The figure shows that BMI is significantly associated with several of the plotted blood biomarkers. Specifically, higher values of branched chain amino acids, aromatic amino acids, omega-6 and glycoprotein acetylation are significantly associated with obesity, as has been previously seen in Würtz P et al. Metabolic Signatures of Adiposity in Young Adults: Mendelian Randomization Analysis and Effects of Weight Change., PLoS Med. 2014.
Similar analysis may be performed for other phenotypic traits. Two other very common regression approaches in epidemiology are logistic regression for studying the association to binary outcomes and Cox proportional hazards for time-to-event outcomes.
In similar spirit as above, we utilize
ggforestplot::discovery_regression()
to demonstrate an
example of fitting Cox proportional hazards in all of the biomarkers to
estimate the gender-, age- and BMI- adjusted hazard ratios for type 2
diabetes.
# Select only variables to be used for the model and collapse to a long data
# format
df_long <-
df_full_data %>%
# Select only model variables
dplyr::select(tidyselect::all_of(nmr_biomarkers), gender, baseline_age, BMI, age_at_diabetes, incident_diabetes) %>%
# Log-transform and scale biomarkers
dplyr::mutate_at(
.vars = dplyr::vars(tidyselect::all_of(nmr_biomarkers)),
.funs = ~ .x %>% log1p() %>% scale %>% as.numeric()
) %>%
# Collapse to a long format
tidyr::gather(
key = biomarkerid,
value = biomarkervalue,
tidyselect::all_of(nmr_biomarkers)
)
# Estimate sex-adjusted associations of metabolite to T2D
# Notice that parameter model below is set to 'coxph'
df_assoc_per_biomarker_diab <-
discovery_regression(
df_long = df_long,
model = "coxph",
formula =
formula(
survival::Surv(
time = baseline_age,
time2 = age_at_diabetes,
event = incident_diabetes
) ~ biomarkervalue + factor(gender) + BMI
),
key = biomarkerid,
predictor = biomarkervalue
) %>%
# Join this dataset with the grouping data in order to choose a different
# biomarker naming option
left_join(
select(
df_NG_biomarker_metadata,
name,
biomarkerid = machine_readable_name
),
by = "biomarkerid")
head(df_assoc_per_biomarker_diab)
#> # A tibble: 6 × 5
#> biomarkerid estimate se pvalue name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 Total_C 0.0161 0.0690 0.815 Total-C
#> 2 non_HDL_C 0.0524 0.0715 0.464 non-HDL-C
#> 3 Remnant_C 0.0876 0.0708 0.216 Remnant-C
#> 4 VLDL_C 0.238 0.0725 0.00101 VLDL-C
#> 5 Clinical_LDL_C -0.0229 0.0700 0.744 Clinical LDL-C
#> 6 LDL_C 0.0207 0.0707 0.770 LDL-C
Plot results for fatty acids.
# Filter df_NG_biomarker_metadata for only fatty acids
df_grouping <-
df_NG_biomarker_metadata %>%
filter(group %in% "Fatty acids")
# Join the association data frame with group data
df <-
df_assoc_per_biomarker_diab %>%
# use right_join, with df_grouping on the right, to preserve the order of
# biomarkers it specifies.
dplyr::right_join(., df_grouping, by = "name")
# Draw a forestplot of time-to-event associations for t2d.
ggforestplot::forestplot(
df = df,
name = name,
se = se,
pvalue = pvalue,
psignif = 0.001,
xlab = "Hazards ratio for incident type 2 diabetes (95% CI)\nper 1−SD increment in metabolite concentration",
title = "Fatty acids and risk of future T2D",
logodds = TRUE
)
Notice that the main difference in the latter forestplot as compared
to the forestplot for linear associations above, is parameter
logodds = TRUE
. In this case, the beta
values,
here log hazard ratios, are exponentiated inside the forestplot to
obtain hazard ratios and plotted in a logarithmic scale. The confidence
intervals for the odds ratios are estimated using the standard errors of
the log hazard ratios, as follows CIlow = exp (β − 1.96 * SE)
and CIhigh = exp (β + 1.96 * SE).
If you wish to use some confidence interval other than the default 95%,
you may specify this in the parameter ci
of the
function.
Finally, the ggforestplot
package provides a custom
function that plots and prints all Nightingale blood biomarkers in a
single pdf file, called plot_all_NG_biomarkers()
. There are
two layouts available for 2016 and 2020 versions of the biomarker
platform. Alternatively, one can also input a custom layout. Its usage
is straightforward.
# Plot in all biomarkers in a two-page pdf file
plot_all_NG_biomarkers(
df = df_assoc_per_biomarker_bmi,
machine_readable_name = biomarkerid,
name = name,
estimate = estimate,
se = se,
pvalue = pvalue,
xlab = "1-SD increment in biomarker concentration\nper unit increment in BMI",
filename = "all_ng_associations_bmi.pdf"
)
You may view the output of this function here.
The purpose of this demo is to familiarize the user with the Nightingale Health biomarker dataset and to showcase simple association analysis using linear and Cox proportional hazards regression.
Further details on data analysis approaches may be found in the following example publications.