glmmTMB
has the capability to simulate from a fitted
model. These simulations resample random effects from their estimated
distribution. In future versions of glmmTMB
, it may be
possible to condition on estimated random effects.
Fit a typical model:
data(Owls)
owls_nb1 <- glmmTMB(SiblingNegotiation ~ FoodTreatment*SexParent +
(1|Nest)+offset(log(BroodSize)),
family = nbinom1,
ziformula = ~1, data=Owls)
Then we can simulate from the fitted model with the
simulate.glmmTMB
function. It produces a list of simulated
observation vectors, each of which is the same size as the original
vector of observations. The default is to only simulate one vector
(nsim=1
) but we still return a list for consistency.
simo=simulate(owls_nb1, seed=1)
Simdat=Owls
Simdat$SiblingNegotiation=simo[[1]]
Simdat=transform(Simdat,
NegPerChick = SiblingNegotiation/BroodSize,
type="simulated")
Owls$type = "observed"
Dat=rbind(Owls, Simdat)
Then we can plot the simulated data against the observed data to check if they are similar.
what if you want to simulate data with specified parameters in the absence of a data set, for example for a power analysis?
glmmTMB
has a simulate_new
function that
can handle this case; the hardest part is understanding the meaning of
the parameter values, especially for random-effects covariances.
For the first example we’ll simulate something that looks like the
classic “sleep study” data, using the sleepstudy
data set
for structure and covariates. The conditional-fixed effects parameters
(beta
) are standard regression parameters (intercept and
slope): we use 250 and 10, which are close to the values from the actual
data. The only other parameter, betadisp
, is the log of the
dispersion parameter, which in the specific case of the Gaussian
(default) family is the log of the conditional (residual)
variance; the standard deviation from a simple regression on
these data1 is approximately 50, so we use
2*log(50)
.
data("sleepstudy", package = "lme4")
set.seed(101)
ss_sim <- transform(sleepstudy,
Reaction = simulate_new(~ Days,
newdata = sleepstudy,
family = gaussian,
newparams = list(beta = c(250, 10),
betadisp = 2*log(50)))[[1]])
For comparison, we’ll also fit the model and use the built-in simulation method:
ss_fit <- glmmTMB(Reaction ~ Days, sleepstudy)
ss_simlm <- transform(sleepstudy,
Reaction = simulate(ss_fit)[[1]])
Comparing against the real data set:
library(ggplot2); theme_set(theme_bw())
ss_comb <- rbind(data.frame(sleepstudy, sample = "real"),
data.frame(ss_sim, sample = "simulated"),
data.frame(ss_simlm, sample = "simulated (from fit)")
)
ggplot(ss_comb, aes(x = Days, y = Reaction, colour = Subject)) +
geom_line() +
facet_wrap(~sample)
The simulated data have about the right variability, but in contrast to the real data have no among-subject variation.
The next example will be more complex, getting into the nuts and
bolts of how to translate random effects covariances into the terms that
glmmTMB
expects.
The hardest piece is probably translating correlation parameters. The
“covariance structures” vignette has more details on how correlation
matrices are parameterized, and the put_cor()
function is a
general translator from a specified correlation matrix (or its lower
triangular elements) to the appropriate set of theta
parameters. For the specific case of 2x2 correlation matrices (i.e. with
a single correlation parameter), a correlation ρ corresponds to a
glmmTMB
parameter of $\rho/\sqrt{1-\rho^2}$. Here’s a utility
function:
rho_to_theta <- function(rho) rho/sqrt(1-rho^2)
## tests
stopifnot(all.equal(get_cor(rho_to_theta(-0.2)), -0.2))
## equivalent to general function
stopifnot(all.equal(rho_to_theta(-0.2), put_cor(-0.2, input_val = "vec")))
Setting up metadata/covariates (tools in the faux
package may also be useful for this):
We’ll set up some reasonable fixed effects. When in doubt about the
order of fixed effect parameters, use model.matrix()
to
check:
## [1] "(Intercept)" "trtB" "time" "trtB:time"
We’ll set SDs such that the average coeff var = 1 (SD = mean for
among-subject variation in intercept and slope). As described in the
“covstruct” vignette, the parameter vector for a random-effect
covariance consists of the log-standard-deviations followed by the
scaled correlations. Finally we’ll set the dispersion parameter for the
negative binomial conditional distribution to 1 (more detail on the
betadisp
parameterization for different families is given
in ?sigma.glmmTMB
).
sdvec <- c(1.5,0.15)
corval <- -0.2
thetavec <- c(log(sdvec), rho_to_theta(corval))
par1 <- list(beta = beta_vec,
betadisp = log(1), ## log(theta)
theta = thetavec)
Now simulate:
dd$y <- simulate_new(form1,
newdata = dd,
seed = 101,
family = nbinom2,
newparams = par1)[[1]]
range(dd$y)
## [1] 0 454
For comparison, we’ll do this by hand (with some help from
lme4
machinery). lme4
parameterizes covariance
matrices by the lower triangle of the Cholesky factor rather than using
glmmTMB
’s method …
library(lme4)
set.seed(101)
X <- model.matrix(nobars(form1), data = dd)
## generate random effects values
rt <- mkReTrms(findbars(form1),
model.frame(subbars(form1), data = dd))
Z <- t(rt$Zt)
## construct covariance matrix
Sigma <- diag(sdvec) %*% matrix(c(1, corval, corval, 1), 2) %*% diag(sdvec)
lmer_thetavec <- t(chol(Sigma))[c(1,2,4)]
## plug values into Cholesky factor of random effects covariance matrix
rt$Lambdat@x <- lmer_thetavec[rt$Lind]
u <- rnorm(nrow(rt$Lambdat))
b <- t(rt$Lambdat) %*% u
eta <- drop(X %*% par1$beta + t(rt$Zt) %*% b)
mu <- exp(eta)
y <- rnbinom(nrow(dd), size = 1, mu = mu)
range(y) ## range varies a lot
## [1] 0 1484
Alternatively we could have generated the random effects with:
We will simulate a single time series with AR1 structure, with a nugget (measurement error) variance σn2 = 1.0, an autoregressive variance of σa2 = 1, and an autoregressive parameter of ϕ = 0.7,
First by brute force, using the code from the “covariance structures” vignette:
set.seed(101)
n <- 1000 ## Number of time points
x <- MASS::mvrnorm(mu = rep(0,n),
Sigma = .7 ^ as.matrix(dist(1:n)) ) ## Simulate the process using the MASS package
## as.matrix(dist(1:n)) constructs a banded matrix with m_{ij} = abs(i-j)
y <- x + rnorm(n) ## Add measurement noise/nugget
dat0 <- data.frame(y,
times = factor(1:n, levels=1:n),
group = factor(rep(1, n)))
Now using simulate_new()
with matching parameters
beta = 0
(the only fixed effect is the intercept, which we
set to zero); betadisp = 0
(the log-variance of the
measurement error [note Gaussian family uses log-variance rather than
log-SD parameterization, although in this case it doesn’t make any
difference …]); theta[1] = 0
(log-SD of autoregressive
variance); and theta[2]
specifying a correlation parameter
ϕ = 0.7 (see “Covariance
structures” vignette for details).
phi_to_AR1 <- function(phi) phi/sqrt(1-phi^2)
s2 <- simulate_new(~ ar1(times + 0 | group),
newdata = dat0,
seed = 101,
newparams = list(
beta = 0,
betadisp = 0,
theta = c(0, phi_to_AR1(0.7)))
)[[1]]
With a nugget variance of σn2 = 1.0, an autoregressive variance of σa2 = 1, and an autoregressive parameter of ϕ = 0.7, we expect the ACF to be σa2/(σa2 + σn2)ϕd .
a1 <- drop(acf(dat0$y, plot = FALSE)$acf)
a2 <- drop(acf(s2, plot = FALSE)$acf)
par(las = 1, bty = "l")
matplot(0:(length(a1)-1), cbind(a1, a2), pch = 1,
ylab = "autocorrelation", xlab = "lag")
curve(0.7^x/2, add = TRUE, col = 4, lwd = 2)
The precise curves are different (because the multivariate normal deviates are generated in a different way), but the ACFs match.
lme4
(spatial, ZI, etc.). If necessary, more details on
parameterizations (shape/scale for spatial cov structures, etc.)I realize this violates the assumption of de novo simulation that we don’t know what the real data looks like yet …↩︎