Title: | Ordered Beta Regression Models with 'brms' |
---|---|
Description: | Implements ordered beta regression models, which are for modeling continuous variables with upper and lower bounds, such as survey sliders, dose-response relationships and indexes. For more information, see Kubinec (2023) <doi:10.31235/osf.io/2sx6y>. The package is a front-end to the R package 'brms', which facilitates a range of regression specifications, including hierarchical, dynamic and multivariate modeling. |
Authors: | Robert Kubinec [aut, cre] |
Maintainer: | Robert Kubinec <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.7.3.1 |
Built: | 2024-11-02 06:00:42 UTC |
Source: | https://github.com/saudiwin/ordbetareg_pack |
This function will return the density of given variates of the
ordered beta distribution conditional on values for
the mean (mu
), dispersion (phi
) and cutpoints
governing the ratio of degenerate (discrete) to continuous
responses.
dordbeta(x = 0.9, mu = 0.5, phi = 1, cutpoints = c(-1, 1), log = FALSE)
dordbeta(x = 0.9, mu = 0.5, phi = 1, cutpoints = c(-1, 1), log = FALSE)
x |
Variates of the ordered beta distribution (should be in the [0,1] interval). |
mu |
Value of the mean of the distribution. Should be in the \(0,1\) interval (cannot be strictly equal to 0 or 1). If length is greater than 1, should be of length x. |
phi |
Value of the dispersion parameter. Should be strictly greater than 0. If length is greater than 1, should be of length x. |
cutpoints |
A vector of two numeric values for the cutpoints. Second value should |
log |
where to return the log density be strictly greater than the first value. |
Returns a vector of length x
of the density of the ordered beta distribution
conditional on mu
and phi
.
# examine density (likelihood) of different possible values # given fixed values for ordered beta parameters x <- seq(0, 1, by=0.01) x_dens <- dordbeta(x, mu = 0.3, phi=2, cutpoints=c(-2, 2)) # Most likely value for x is approx 1 # Note discontinuity in density function between continuous/discrete values # density function is a combined PMF/PDF, so not a real PDF # can though be used for MLE plot(x_dens, x) # discrete values should be compared to each other: # prob of discrete 0 > prob of discrete 1 x_dens[x==0] > x_dens[x==1]
# examine density (likelihood) of different possible values # given fixed values for ordered beta parameters x <- seq(0, 1, by=0.01) x_dens <- dordbeta(x, mu = 0.3, phi=2, cutpoints=c(-2, 2)) # Most likely value for x is approx 1 # Note discontinuity in density function between continuous/discrete values # density function is a combined PMF/PDF, so not a real PDF # can though be used for MLE plot(x_dens, x) # discrete values should be compared to each other: # prob of discrete 0 > prob of discrete 1 x_dens[x==0] > x_dens[x==1]
A fitted ordered beta regression model on multiple imputed
datasets generated by the package mice
.
fit_imputed
fit_imputed
an ordbetareg
object
A fitted ordered beta regression model with two responses, one an ordered beta regression and the other a Gaussian/Normal outcome. Useful for examining mediation analysis.
fit_multivariate
fit_multivariate
an ordbetareg
object
This function takes a continuous (double) column of data and converts it to have 0 as the lower bound and 1 as the upper bound.
normalize(outcome, true_bounds = NULL)
normalize(outcome, true_bounds = NULL)
outcome |
Any non-character vector. Factors will be converted to numeric via coercion. |
true_bounds |
Specify this parameter with the lower and upper bound if the observed min/max of the outcome should not be used. Useful when an upper or lower bound exists but the observed data is less than/more than that bound. The normalization function will respect these bounds. |
Beta regression can only be done with a response that is continuous with a lower bound of 0 and an upper bound of 1. However, it is straightforward to transform any lower and upper-bounded continuous variable to the \[0,1\] interval. This function does the transformation and saves the original bounds as attributes so that the bounds can be reverse-transformed.
A numeric vector with an upper bound of 1 and a lower bound of 0. The original bounds are saved in the attributes "lower_bound" and "upper_bound".
# set up arbitrary upper and lower-bounded vector outcome <- runif(1000, min=-33, max=445) # normalize to \[0,1\] trans_outcome <- normalize(outcome=outcome) summary(trans_outcome) # only works with numeric vectors and factors try(normalize(outcome=c('a','b')))
# set up arbitrary upper and lower-bounded vector outcome <- runif(1000, min=-33, max=445) # normalize to \[0,1\] trans_outcome <- normalize(outcome=outcome) summary(trans_outcome) # only works with numeric vectors and factors try(normalize(outcome=c('a','b')))
A fitted ordered beta regression model to the mean of the thermometer column from the pew data.
ord_fit_mean
ord_fit_mean
an ordbetareg
object
A fitted ordered beta regression model to the dispersion parameter of the thermometer column from the pew data.
ord_fit_phi
ord_fit_phi
an ordbetareg
object
This function allows you to estimate an ordered beta regression model via a formula syntax.
The ordbetareg
package is essentially a wrapper around brms
that
enables the ordered beta regression model to be fit. This model has
advantages over other alternatives for continous data with upper
and lower bounds, such as survey sliders, indexes,
dose-response relationships,
and visual analog scales (among others). The package allows for all of the
many brms
regression modeling functions to be used with the ordered
beta regression distribution.
ordbetareg( formula = NULL, data = NULL, true_bounds = NULL, phi_reg = "none", use_brm_multiple = FALSE, coef_prior_mean = 0, coef_prior_SD = 5, intercept_prior_mean = NULL, intercept_prior_SD = NULL, phi_prior = 0.1, dirichlet_prior = c(1, 1, 1), phi_coef_prior_mean = 0, phi_coef_prior_SD = 5, phi_intercept_prior_mean = NULL, phi_intercept_prior_SD = NULL, extra_prior = NULL, init = "0", return_stancode = FALSE, ... )
ordbetareg( formula = NULL, data = NULL, true_bounds = NULL, phi_reg = "none", use_brm_multiple = FALSE, coef_prior_mean = 0, coef_prior_SD = 5, intercept_prior_mean = NULL, intercept_prior_SD = NULL, phi_prior = 0.1, dirichlet_prior = c(1, 1, 1), phi_coef_prior_mean = 0, phi_coef_prior_SD = 5, phi_intercept_prior_mean = NULL, phi_intercept_prior_SD = NULL, extra_prior = NULL, init = "0", return_stancode = FALSE, ... )
formula |
Either an R formula in the form response/DV ~ var1 + var2
etc. or formula object as created/called by the |
data |
An R data frame or tibble containing the variables in the formula |
true_bounds |
If the true bounds of the outcome/response don't exist in the data, pass a length 2 numeric vector of the minimum and maximum bounds to properly normalize the outcome/response |
phi_reg |
Whether you are including a linear model predicting
the dispersion parameter, phi, and/or for the response. If you are
including models for both, pass option 'both'. If you only have an
intercept for the outcome (i.e. a 1 in place of covariates), pass 'only'.
If you only have intercepts for phi (such as a varying intercepts/random effects)
model, pass the value "intercepts". To set priors on these intercepts,
use the |
use_brm_multiple |
(T/F) Whether the model should use
brms::brm_multiple for multiple
imputation over multiple dataframes passed
as a list to the |
coef_prior_mean |
The mean of the Normal distribution prior on the regression coefficients (for predicting the mean of the response). Default is 0. |
coef_prior_SD |
The SD of the Normal distribution prior on the regression coefficients (for predicting the mean of the response). Default is 5, which makes the prior weakly informative on the logit scale. |
intercept_prior_mean |
The mean of the Normal distribution prior
for the intercept. By default is NULL, which means the intercept
receives the same prior as |
intercept_prior_SD |
The SD of the Normal distribution prior
for the intercept. By default is NULL, which means the intercept
receives the same prior SD as |
phi_prior |
The mean parameter of the exponential prior on phi, which determines the dispersion of the beta distribution. The default is .1, which equals a mean of 10 and is thus weakly informative on the interval (0.4, 30). If the response has very low variance (i.e. tightly) clusters around a specific value, then decreasing this prior (and increasing the expected value) may be helpful. Checking the value of phi in the output of the model command will reveal if a value of 0.1 (mean of 10) is too small. |
dirichlet_prior |
A vector of three integers corresponding to the prior parameters for the dirchlet distribution (alpha parameter) governing the location of the cutpoints between the components of the response (continuous vs. degenerate). The default is 1 which puts equal probability on degenerate versus continuous responses. Likely only needs to be changed in a repeated sampling situation to stabilize the cutpoint locations across samples. |
phi_coef_prior_mean |
The mean of the Normal distribution prior on the regression coefficients for predicting phi, the dispersion parameter. Only useful if a linear model is being fit to phi. Default is 0. |
phi_coef_prior_SD |
The SD of the Normal distribution prior on the regression coefficients for predicting phi, the dispersion parameter. Only useful if a linear model is being fit to phi. Default is 5, which makes the prior weakly informative on the exponential scale. |
phi_intercept_prior_mean |
The mean of the Normal distribution prior
for the phi (dispersion) regression intercept. By default is NULL,
which means the intercept
receives the same prior as |
phi_intercept_prior_SD |
The SD of the Normal distribution prior
for the phi (dispersion) regression intercept. By default is NULL,
which means the intercept
receives the same prior SD as |
extra_prior |
An additional prior, such as a prior for a specific
regression coefficient, added to the outcome regression by passing one of the |
init |
This parameter is used to determine starting values for
the Stan sampler to begin Markov Chain Monte Carlo sampling. It is
set by default at 0 because the non-linear nature of beta regression
means that it is possible to begin with extreme values depending on the
scale of the covariates. Setting this to 0 helps the sampler find
starting values. It does, on the other hand, limit the ability to detect
convergence issues with Rhat statistics. If that is a concern, such as
with an experimental feature of |
return_stancode |
If |
... |
All other arguments passed on to the |
This function is a wrapper around the brms::brm function, which is a
powerful Bayesian regression modeling engine using Stan. To fully explore
the options available, including dynamic and hierarchical modeling, please
see the documentation for the brm
function above. As the ordered beta
regression model is currently not available in brms
natively, this modeling
function allows a brms
model to be fit with the ordered beta regression
distribution.
For more information about the model, see the paper here: https://osf.io/preprints/socarxiv/2sx6y/.
This function allows you to set priors on the dispersion parameter,
the cutpoints, and the regression coefficients (see below for options).
However, to add specific priors on individual covariates, you would need
to use the brms::set_prior function by specifying an individual covariate
(see function documentation) and passing the result of the function call
to the extra_prior
argument.
This function will also automatically normalize the outcome so that it lies in the \[0,1\] interval, as required by beta regression. For furthur information, see the documentation for the normalize function.
Priors can be set on a variety of coefficients in the model, see
the description of parameters coef_prior_mean
and intercept_prior_mean
,
in addition to setting a custom prior with the extra_prior
option.
When setting priors on intercepts, it is important to note that
by default, all intercepts in brms are centered (the means are
subtracted from the data). As a result, a prior set on the default
intercept will have a different interpretation than a traditional
intercept (i.e. the value of the outcome when the covariates are
all zero). To change this setting, use the brms::bf()
function
as a wrapper around the formula with the option center=FALSE
to
set priors on a traditional non-centered intercept.
Note that while brms
also supports adding 0 + Intercept
to the
formula to address this issue, ordbetareg
does not support this
syntax. Instead, use center=FALSE
as an option to brms::bf()
.
To learn more about how the package works, see the vignette by using
the command browseVignettes(package='ordbetareg')
.
For more info about the distribution, see this paper: https://osf.io/preprints/socarxiv/2sx6y/
To cite the package, please cite the following paper:
Kubinec, Robert. "Ordered Beta Regression: A Parsimonious, Well-Fitting Model for Continuous Data with Lower and Upper Bounds." Political Analysis. 2022.
A brms
object fitted with the ordered beta regression distribution.
Maintainer: Robert Kubinec [email protected] (ORCID)
Useful links:
Report bugs at https://github.com/saudiwin/ordbetareg_pack/issues
# load survey data that comes with the package library(dplyr) data("pew") # prepare data model_data <- select(pew,therm, education="F_EDUCCAT2_FINAL", region="F_CREGION_FINAL", income="F_INCOME_FINAL") # It takes a while to fit the models. Run the code # below if you want to load a saved fitted model from the # package, otherwise use the model-fitting code data("ord_fit_mean") # fit the actual model if(.Platform$OS.type!="windows") { ord_fit_mean <- ordbetareg(formula=therm ~ education + income + (1|region), data=model_data, cores=2,chains=2) } # access values of the coefficients summary(ord_fit_mean)
# load survey data that comes with the package library(dplyr) data("pew") # prepare data model_data <- select(pew,therm, education="F_EDUCCAT2_FINAL", region="F_CREGION_FINAL", income="F_INCOME_FINAL") # It takes a while to fit the models. Run the code # below if you want to load a saved fitted model from the # package, otherwise use the model-fitting code data("ord_fit_mean") # fit the actual model if(.Platform$OS.type!="windows") { ord_fit_mean <- ordbetareg(formula=therm ~ education + income + (1|region), data=model_data, cores=2,chains=2) } # access values of the coefficients summary(ord_fit_mean)
A dataset with the non-missing responses for the 28th wave of the Pew American Trends Panel survey.
pew
pew
A data frame with 140 variables and 2,538 observations.
https://www.pewresearch.org/social-trends/dataset/american-trends-panel-wave-28/]
The standard brms::pp_check plot available via brms
is not accurate for ordbetareg models because an
ordered beta regression has both continuous and discrete
components. This function implements a bar plot and a density
plot for the continuous and discrete elements separately,
and will return accurate posterior predictive plots
relative to the data.
pp_check_ordbeta( model = NULL, type = "both", ndraws = 10, cores = NULL, group = NULL, new_theme = NULL, outcome_label = NULL, animate = FALSE, reverse_bounds = TRUE, facet_scales = "fixed" )
pp_check_ordbeta( model = NULL, type = "both", ndraws = 10, cores = NULL, group = NULL, new_theme = NULL, outcome_label = NULL, animate = FALSE, reverse_bounds = TRUE, facet_scales = "fixed" )
model |
A fitted ordbetareg model. |
type |
Default is "both" for creating both a discrete (bar) and continuous (density) plot. Can also be "discrete" for only the bar plot for discrete values (0/1) or "continuous" for continuous values (density plot). |
ndraws |
Number of posterior draws to use to calculate estimates and show in plot. Defaults to 10. |
cores |
Number of cores to use to produce posterior predictive distribution. Defaults to NULL or 1 core. |
group |
A factor variable of the same number of rows as the data that is used to broduce grouped (faceted) plots of the posterior distribution. |
new_theme |
Any additional themes to be added to ggplot2 (default is NULL). |
outcome_label |
A character value that will replace the name of the outcome in the plot (default is the name of the response variable in the data frame). |
animate |
Whether to animate each posterior draw for continuous
distributions (defaults to FALSE). Requires installation of the
|
reverse_bounds |
Whether to plot data using the original bounds in the data (i.e. not 0 and 1). |
facet_scales |
The option passed on to the |
If "both", prints both plots and returns a list of both plots as
ggplot2
objects. Otherwise, prints and returnst the specific plot
as a ggplot2
object.
# need a fitted ordbetareg model data("ord_fit_mean") out_plots <- pp_check_ordbeta(ord_fit_mean) # view discrete bar plot out_plots$discrete # view continuous density plot out_plots$continuous # change title using ggplot2 ggtitle function out_plots$discrete + ggplot2::ggtitle("New title")
# need a fitted ordbetareg model data("ord_fit_mean") out_plots <- pp_check_ordbeta(ord_fit_mean) # view discrete bar plot out_plots$discrete # view continuous density plot out_plots$continuous # change title using ggplot2 ggtitle function out_plots$discrete + ggplot2::ggtitle("New title")
This function will generate ordered beta random variates given
values for the mean (mu
), dispersion (phi
) and cutpoints
governing the ratio of degenerate (discrete) to continuous
responses.
rordbeta(n = 100, mu = 0.5, phi = 1, cutpoints = c(-1, 1))
rordbeta(n = 100, mu = 0.5, phi = 1, cutpoints = c(-1, 1))
n |
Number of variates to generate. |
mu |
Value of the mean of the distribution.
Should be in the \(0,1\) interval (cannot be strictly equal to 0 or 1). If
length is greater than 1, should be of length |
phi |
Value of the dispersion parameter. Should be strictly greater than 0. If
length is greater than 1, should be of length |
cutpoints |
A vector of two numeric values for the cutpoints. Second value should be strictly greater than the first value. |
A vector of length n
of variates from the ordered beta distribution.
# generate 100 random variates with an average of 0.7 # all will be in the closed interval \[0,1\] ordbeta_var <- rordbeta(n=100, mu=0.7, phi=2) # Will be approx mean = 0.7 with high positive skew summary(ordbeta_var)
# generate 100 random variates with an average of 0.7 # all will be in the closed interval \[0,1\] ordbeta_var <- rordbeta(n=100, mu=0.7, phi=2) # Will be approx mean = 0.7 with high positive skew summary(ordbeta_var)
The simulated draws used in the vignette for calculating statistical power.
sim_data
sim_data
A dataframe
This function allows you to calculate power curves (or anything else) via simulating the ordered beta regression model.
sim_ordbeta( N = 1000, k = 5, iter = 1000, cores = 1, phi = 1, cutpoints = c(-1, 1), beta_coef = NULL, beta_type = "continuous", treat_assign = 0.5, return_data = FALSE, seed = as.numeric(Sys.time()), ... )
sim_ordbeta( N = 1000, k = 5, iter = 1000, cores = 1, phi = 1, cutpoints = c(-1, 1), beta_coef = NULL, beta_type = "continuous", treat_assign = 0.5, return_data = FALSE, seed = as.numeric(Sys.time()), ... )
N |
The sample size for the simulation. Include a vector of integers to examine power/results for multiple sample sizes. |
k |
The number of covariates/predictors. |
iter |
The number of simulations to run. For power calculation, should be at least 500 (yes, this will take some time). |
cores |
The number of cores to use to parallelize the simulation. |
phi |
Value of the dispersion parameter in the beta distribution. |
cutpoints |
Value of the two cutpoints for the ordered model. By default are the values -1 and +1 (these are interpreted in the logit scale and so should not be too large). The farther apart, the fewer degenerate (0 or 1) responses there will be in the distribution. |
beta_coef |
If not null, a vector of length |
beta_type |
Can be either |
treat_assign |
If |
return_data |
Whether to return the simulated dqta as a list
in the |
seed |
The seed to use to make the results reproducible. Set automatically to a date-time stamp. |
... |
Any other arguments are passed on to the brms::brm function to control modeling options. |
This function implements the simulation found in Kubinec (2022). This
simulation allows you to vary the sample size, number & type of predictors,
values of the predictors (or treatment values), and the power to target.
The function returns a data frame
with one row per simulation draw and covariate k
.
a tibble data frame with columns of simulated and estimated values and
rows for each simulation iteration X coefficient combination. I.e.,
if there are five predictors, and 1,000 iterations, the resulting data frame
will have 1,000 rows. If there are multiple values for N
,
then each value
of N
will have its own set of iterations, making the final size of the
data a multiple of the number of sample sizes to iterate over. The
data frame will have the following columns:
1.
# This function takes a while to run as it has # to fit an ordered beta regression to each # draw. The package comes with a saved # simulation dataset you can inspect to see what the # result looks like data("sim_data") library(dplyr) # will take a while to run this if(.Platform$OS.type!="windows") { sim_data <- sim_ordbeta(N=c(250,750), k=1, beta_coef = .5, iter=5,cores=2, beta_type="binary", treat_assign=0.3) } # to get the power values by N, simply summarize/group # by N with functions from the R package dplyr sim_data %>% group_by(N) %>% summarize(mean_power=mean(power))
# This function takes a while to run as it has # to fit an ordered beta regression to each # draw. The package comes with a saved # simulation dataset you can inspect to see what the # result looks like data("sim_data") library(dplyr) # will take a while to run this if(.Platform$OS.type!="windows") { sim_data <- sim_ordbeta(N=c(250,750), k=1, beta_coef = .5, iter=5,cores=2, beta_type="binary", treat_assign=0.3) } # to get the power values by N, simply summarize/group # by N with functions from the R package dplyr sim_data %>% group_by(N) %>% summarize(mean_power=mean(power))