Title: | Simulation of Multivariate Linear Model Data |
---|---|
Description: | Researchers have been using simulated data from a multivariate linear model to compare and evaluate different methods, ideas and models. Additionally, teachers and educators have been using a simulation tool to demonstrate and teach various statistical and machine learning concepts. This package helps users to simulate linear model data with a wide range of properties by tuning few parameters such as relevant latent components. In addition, a shiny app as an 'RStudio' gadget gives users a simple interface for using the simulation function. See more on: Sæbø, S., Almøy, T., Helland, I.S. (2015) <doi:10.1016/j.chemolab.2015.05.012> and Rimal, R., Almøy, T., Sæbø, S. (2018) <doi:10.1016/j.chemolab.2018.02.009>. |
Authors: | Raju Rimal [aut, cre] |
Maintainer: | Raju Rimal |
License: | GPL-3 |
Version: | 2.1.0 |
Built: | 2025-02-28 04:47:30 UTC |
Source: | https://github.com/simulatr/simrel |
Simulation of Multivariate Linear Model Data
AppSimrel()
No return value, runs the shiny interface for simulation
Simulation of Multivariate Linear Model data with response
bisimrel( n = 50, p = 100, q = c(10, 10, 5), rho = c(0.8, 0.4), relpos = list(c(1, 2), c(2, 3)), gamma = 0.5, R2 = c(0.8, 0.8), ntest = NULL, muY = NULL, muX = NULL, sim = NULL )
n |
Number of training samples |
p |
Number of x-variables |
q |
Vector of number of relevant predictor variables for first, second and common to both responses |
rho |
A 2-element vector, unconditional and conditional correlation between y_1 and y_2 |
relpos |
A list of positions of the relevant components for the predictor variables. The list contains vectors of position indices, one vector for each response |
gamma |
A declining (decaying) factor of the eigenvalues of the predictors (X). The higher the value of gamma, the steeper the decline of the eigenvalues. |
R2 |
Vector of coefficient of determination for each response |
ntest |
Number of test observations |
muY |
Vector of average (mean) for each response variable |
muX |
Vector of average (mean) for each predictor variable |
sim |
A simrel object for reusing parameters setting |
A simrel object with all the input arguments along with the following additional items
X |
Simulated predictors |
Y |
Simulated responses |
beta |
True regression coefficients |
beta0 |
True regression intercept |
relpred |
Position of relevant predictors |
testX |
Test Predictors |
testY |
Test Response |
minerror |
Minimum model error |
Rotation |
Rotation matrix of predictor (R) |
type |
Type of simrel object, in this case bivariate |
lambda |
Eigenvalues of predictors |
Sigma |
Variance-Covariance matrix of response and predictors |
Sæbø, S., Almøy, T., & Helland, I. S. (2015). simrel—A versatile tool for linear model data simulation based on the concept of a relevant subspace and relevant predictors. Chemometrics and Intelligent Laboratory Systems, 146, 128-135.
Almøy, T. (1996). A simulation study on comparison of prediction methods when only a few components are relevant. Computational statistics & data analysis, 21(1), 87-107.
sobj <- bisimrel(
  n = 100, p = 10, q = c(5, 5, 3),
  rho = c(0.8, 0.4),
  relpos = list(c(1, 2, 3), c(2, 3, 4)),
  gamma = 0.7, R2 = c(0.8, 0.8)
)
# Regression coefficients from this simulation
sobj$beta
Extract various sigma matrices
cov_mat(obj, which = c("xy", "zy", "zw"), use_population = TRUE)
obj |
A simrel object |
which |
A character string to specify which covariance matrix to extract, possible values are "xy", "zy" and "zw" |
use_population |
A boolean indicating whether to compute population values or to estimate them from the sample |
A matrix of covariances with as many columns as there are responses and as many rows as there are predictors
set.seed(1983)
sobj <- multisimrel()
cov_mat(sobj, which = "xy", use_population = TRUE)
cov_mat(sobj, which = "xy", use_population = FALSE)
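As a rough illustration of what use_population switches between, the base-R sketch below compares a known (population) covariance block with its sample estimate. It does not call cov_mat; the matrix Sigma, the sizes n, p and m, and all variable names are hypothetical and chosen only for this illustration.

```r
set.seed(1983)
p <- 3; m <- 2; n <- 500

# Hypothetical joint covariance of (Y, X): any symmetric
# positive-definite matrix serves for the illustration
A <- matrix(rnorm((m + p)^2), m + p)
Sigma <- crossprod(A) / (m + p)

# Population covariance between X (rows) and Y (columns)
pop_xy <- Sigma[(m + 1):(m + p), 1:m, drop = FALSE]

# Draw n observations with covariance Sigma using the Cholesky factor
dat <- matrix(rnorm(n * (m + p)), n) %*% chol(Sigma)
Y <- dat[, 1:m, drop = FALSE]
X <- dat[, (m + 1):(m + p), drop = FALSE]

# Sample estimate of the same block
samp_xy <- cov(X, Y)

pop_xy
samp_xy
```

Both matrices have one row per predictor and one column per response; the sample version fluctuates around the population version and gets closer as n grows.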
Prepare data for Plotting Covariance Matrix
cov_plot_data(sobj, type = "relpos", ordering = TRUE, facetting = TRUE)
sobj |
A simrel object |
type |
Type of covariance matrix - can take two values |
ordering |
TRUE for ordering the covariance for block diagonal display |
facetting |
TRUE for facetting the predictor and response space. FALSE will give a single facet plot |
A data frame with covariances and related values, based on the type argument, that is ready to plot
sobj <- simrel(
  n = 100, p = 10, q = c(4, 5),
  relpos = list(c(1, 2, 3), c(4, 6, 7)),
  m = 3, R2 = c(0.8, 0.7),
  ypos = list(c(1, 3), 2),
  gamma = 0.7, type = "multivariate"
)
head(cov_plot_data(sobj))
Covariance between X and Y
cov_xy(obj, use_population = TRUE)
obj |
A simrel object |
use_population |
A boolean to specify whether to use population or sample values |
A covariance matrix of X and Y
Helper Functions
cov_zw(obj)
obj |
A simrel object |
A covariance matrix of Z and W
Covariance between Z and Y
cov_zy(obj, use_population = TRUE)
obj |
A simrel object |
use_population |
A boolean to specify whether to use population or sample values |
A covariance matrix of Z and Y
Extra test functions
expect_subset( object, expected, info = NULL, label = NULL, expected.label = NULL )
object |
object to test |
expected |
Expected value |
info |
extra information to be included in the message (useful when writing tests in loops). |
label |
object label. When 'NULL', computed from deparsed object. |
expected.label |
Equivalent of 'label' for shortcut form. |
Returns the object itself if the expected values are found in the object as a subset; otherwise raises an error
expect_subset(c(1, 2, 3, 4, 5), c(2, 4, 5))
Simulation Plot with ggplot: The true beta, relevant component and eigen structure
ggsimrelplot( obj, ncomp = min(obj$p, obj$n, 20), which = 1L:3L, layout = NULL, print.cov = FALSE, use_population = TRUE )
obj |
A simrel object |
ncomp |
Number of components to plot |
which |
A character indicating which plot you want as output, it can take |
layout |
A layout matrix of how to layout multiple plots |
print.cov |
Output estimated covariance structure |
use_population |
Logical, TRUE if population values should be used and FALSE if sample values should be used |
A list of plots
sim.obj <- simrel(
  n = 50, p = 16, q = c(3, 4, 5),
  relpos = list(c(1, 2), c(3, 4), c(5, 7)),
  m = 5, ypos = list(c(1, 4), 2, c(3, 5)),
  type = "multivariate",
  R2 = c(0.8, 0.7, 0.9), gamma = 0.8
)
ggsimrelplot(sim.obj, layout = matrix(c(2, 1, 3, 1), 2))
ggsimrelplot(sim.obj, which = c(1, 2), use_population = TRUE)
ggsimrelplot(sim.obj, which = c(1, 2), use_population = FALSE)
ggsimrelplot(sim.obj, which = c(1, 3), layout = matrix(c(1, 2), 1))
Function to create multi-level binary replacement (MBR) design (Martens et al., 2010). The MBR approach was developed for constructing experimental designs for computer experiments. MBR makes it possible to set up fractional designs for multi-factor problems with potentially many levels for each factor. In this package it is mainly called by the mbrdsim function.
mbrd( l2levels = c(2, 2), fraction = 0, gen = NULL, fnames1 = NULL, fnames2 = NULL )
l2levels |
A vector indicating the number of log2-levels for each factor. E.g. |
fraction |
Design fraction at bit-level. Full design: fraction=0, half-fraction: fraction=1, and so on... |
gen |
list of generators at bit-factor level. Same as generators in function FrF2. |
fnames1 |
Factor names of original multi-level factors (optional). |
fnames2 |
Factor names at bit-level (optional). |
The MBR design approach was developed for designing fractional designs in multi-level multi-factor experiments, typically computer experiments. The basic idea can be summarized in the following steps: 1) Choose the number of levels $L$ for each multi-level factor as a power of 2, that is $L \in \{2^1, 2^2, \ldots\}$. 2) Replace any given multi-level factor by a set of $M = \log_2(L)$ two-level "bit factors". The complete bit-factor design can then be expressed as a $2^K$ design, where $K$ is the total number of bit factors across all original multi-level factors. 3) Choose a fraction level $P$ defining a fractional design $2^{K-P}$ (see e.g. Montgomery, 2008), as for regular two-level factorial designs. 4) Express the reduced design in terms of the original multi-level factors.
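The bookkeeping behind these steps can be sketched in base R. This is only an illustration under stated assumptions: it does not call mbrd or FrF2, the binary coding used to map bits back to factor levels is a hypothetical choice, and the selection of the fractional runs (which depends on the generators) is left out.

```r
l2levels <- c(3, 3)   # two factors, each with 2^3 = 8 levels
fraction <- 1         # half fraction at bit level

K <- sum(l2levels)            # total number of two-level bit factors
n_full <- 2^K                 # runs in the full bit-level design
n_frac <- 2^(K - fraction)    # runs kept in the fractional design

# Full two-level design at bit level (0/1 coding), one column per bit factor
bits <- as.matrix(expand.grid(rep(list(0:1), K)))

# Which original factor each bit column belongs to
owner <- rep(seq_along(l2levels), times = l2levels)

# Step 4: express each run in terms of the original factors by reading
# each group of bits as a binary number (assumed coding) plus one
levels_full <- sapply(seq_along(l2levels), function(f) {
  b <- bits[, owner == f, drop = FALSE]
  as.vector(b %*% 2^(seq_len(ncol(b)) - 1)) + 1
})

c(K = K, n_full = n_full, n_frac = n_frac)
head(levels_full)
```

With these settings the full design has 64 runs covering every combination of the 8 x 8 levels, and a half fraction would keep 32 of them.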
BitDesign |
The design at bit-factor level (inherits from FrF2). Function design.info() can be used to get extra design info of the bit-design. |
Design |
The design at original factor levels, non-randomized. |
Martens, H., Måge, I., Tøndel, K., Isaeva, J., Høy, M. and Sæbø, S., 2010, Multi-level binary replacement (MBR) design for computer experiments in high-dimensional nonlinear systems, J. Chemom., 24, 748–756.
Montgomery, D., Design and analysis of experiments, John Wiley & Sons, 2008.
# Two variables with 8 levels each (2^3 = 8), a half-fraction design.
res <- mbrd(c(3, 3), fraction = 1, gen = list(c(1, 4)))
# plot(res$Design, pch = 20, cex = 2, col = 2)

# Three variables with 8 levels each, a 1/16-fraction.
res <- mbrd(c(3, 3, 3), fraction = 4)
# library(rgl)
# plot3d(res$Design, type = "s", col = 2)
The multi-level binary replacement (MBR) design approach is used here in order to facilitate the investigation of the effects of the data properties on the performance of estimation/prediction methods. The mbrdsim function takes as input a list containing a set of factors with their levels. The output is an MBR-design with the combinations of the factor levels to be run.
mbrdsim(simlist, fraction, gen = NULL)
simlist |
A named list containing the levels of a set of (multi-level) factors. |
fraction |
Design fraction at bit-level. Full design: fraction=0, half-fraction: fraction=1, and so on. |
gen |
Generators for the fractioning at the bit level. Default is NULL. |
BitDesign |
The design at bit-factor level. The object is of class design, as output from FrF2. Function design.info() can be used to get extra design info of the bit-design. The bit-factors are named.numbered if the input factor list is named. |
Design |
The design at original factor level, non-randomized. The factors are named if the input factor list is named. |
Solve Sæbø
Martens, H., Måge, I., Tøndel, K., Isaeva, J., Høy, M. and Sæbø, S., 2010, Multi-level binary replacement (MBR) design for computer experiments in high-dimensional nonlinear systems, J. Chemom., 24, 748–756.
# Input: A list of factors with their levels (number of levels must be a power of 2).
## Simrel Parameters ----
sim_list <- list(
  p = c(20, 150),
  gamma = seq(0.2, 1.1, length.out = 4),
  relpos = list(list(c(1, 2, 3), c(4, 5, 6)), list(c(1, 5, 6), c(2, 3, 4))),
  R2 = list(c(0.4, 0.8), c(0.8, 0.8)),
  ypos = list(list(1, c(2, 3)), list(c(1, 3), 2))
)

## 1/8 fractional Design ----
dgn <- mbrdsim(sim_list, fraction = 3)
design <- cbind(
  dgn[["Design"]],
  q = lapply(dgn[["Design"]][, "p"], function(x) rep(x/2, 2)),
  type = "multivariate",
  n = 100,
  ntest = 200,
  m = 3,
  eta = 0.6
)

## Simulation ----
sobj <- apply(design, 1, function(x) do.call(simrel, x))
names(sobj) <- paste0("Design", seq.int(sobj))

# Info about the bit-design including bit-level aliasing (and resolution if gen = NULL)
if (requireNamespace("DoE.base", quietly = TRUE)) {
  dgn <- mbrdsim(sim_list, fraction = 3)
  DoE.base::design.info(dgn$BitDesign)
}
Simulation of Multivariate Linear Model Data
msim( p = 15, q = c(5, 4, 3), m = 5, relpos = list(c(1, 2), c(3, 4, 6), c(5, 7)), gamma = 0.6, R2 = c(0.8, 0.7, 0.8), eta = 0, muX = NULL, muY = NULL, ypos = list(c(1), c(3, 4), c(2, 5)) )
p |
Number of variables |
q |
Vector containing the number of relevant predictor variables for each relevant response component |
m |
Number of response variables |
relpos |
A list of positions of the relevant components for the predictor variables. The list contains vectors of position indices, one vector for each relevant response component |
gamma |
A declining (decaying) factor of the eigenvalues of the predictors (X). The higher the value of gamma, the steeper the decline of the eigenvalues. |
R2 |
Vector of coefficients of determination (proportion of variation explained by the predictor variables) for each relevant response component |
eta |
A declining (decaying) factor of the eigenvalues of the response (Y). The higher the value of eta, the steeper the decline of the eigenvalues. |
muX |
Vector of average (mean) for each predictor variable |
muY |
Vector of average (mean) for each response variable |
ypos |
List of position of relevant response components that are combined to generate response variable during orthogonal rotation |
A simrel object with all the input arguments along with the following additional items
X |
Simulated predictors |
Y |
Simulated responses |
W |
Simulated response components |
Z |
Simulated predictor components |
beta |
True regression coefficients |
beta0 |
True regression intercept |
relpred |
Position of relevant predictors |
testX |
Test Predictors |
testY |
Test Response |
testW |
Test response components |
testZ |
Test predictor components |
minerror |
Minimum model error |
Xrotation |
Rotation matrix of predictor (R) |
Yrotation |
Rotation matrix of response (Q) |
type |
Type of simrel object univariate or multivariate |
lambda |
Eigenvalues of predictors |
SigmaWZ |
Variance-Covariance matrix of components of response and predictors |
SigmaWX |
Covariance matrix of response components and predictors |
SigmaYZ |
Covariance matrix of response and predictor components |
Sigma |
Variance-Covariance matrix of response and predictors |
RsqW |
Coefficient of determination corresponding to response components |
RsqY |
Coefficient of determination corresponding to response variables |
Sæbø, S., Almøy, T., & Helland, I. S. (2015). simrel—A versatile tool for linear model data simulation based on the concept of a relevant subspace and relevant predictors. Chemometrics and Intelligent Laboratory Systems, 146, 128-135.
Almøy, T. (1996). A simulation study on comparison of prediction methods when only a few components are relevant. Computational statistics & data analysis, 21(1), 87-107.
Simulation of Multivariate Linear Model Data
multisimrel( n = 100, p = 15, q = c(5, 4, 3), m = 5, relpos = list(c(1, 2), c(3, 4, 6), c(5, 7)), gamma = 0.6, R2 = c(0.8, 0.7, 0.8), eta = 0, ntest = NULL, muX = NULL, muY = NULL, ypos = list(c(1), c(3, 4), c(2, 5)) )
n |
Number of observations |
p |
Number of variables |
q |
Vector containing the number of relevant predictor variables for each relevant response component |
m |
Number of response variables |
relpos |
A list of positions of the relevant components for the predictor variables. The list contains vectors of position indices, one vector for each relevant response component |
gamma |
A declining (decaying) factor of the eigenvalues of the predictors (X). The higher the value of gamma, the steeper the decline of the eigenvalues. |
R2 |
Vector of coefficients of determination (proportion of variation explained by the predictor variables) for each relevant response component |
eta |
A declining (decaying) factor of the eigenvalues of the response (Y). The higher the value of eta, the steeper the decline of the eigenvalues. |
ntest |
Number of test observations |
muX |
Vector of average (mean) for each predictor variable |
muY |
Vector of average (mean) for each response variable |
ypos |
List of position of relevant response components that are combined to generate response variable during orthogonal rotation |
A simrel object with all the input arguments along with the following additional items
X |
Simulated predictors |
Y |
Simulated responses |
W |
Simulated response components |
Z |
Simulated predictor components |
beta |
True regression coefficients |
beta0 |
True regression intercept |
relpred |
Position of relevant predictors |
testX |
Test Predictors |
testY |
Test Response |
testW |
Test response components |
testZ |
Test predictor components |
minerror |
Minimum model error |
Xrotation |
Rotation matrix of predictor (R) |
Yrotation |
Rotation matrix of response (Q) |
type |
Type of simrel object univariate or multivariate |
lambda |
Eigenvalues of predictors |
SigmaWZ |
Variance-Covariance matrix of components of response and predictors |
SigmaWX |
Covariance matrix of response components and predictors |
SigmaYZ |
Covariance matrix of response and predictor components |
Sigma |
Variance-Covariance matrix of response and predictors |
RsqW |
Coefficient of determination corresponding to response components |
RsqY |
Coefficient of determination corresponding to response variables |
Sæbø, S., Almøy, T., & Helland, I. S. (2015). simrel—A versatile tool for linear model data simulation based on the concept of a relevant subspace and relevant predictors. Chemometrics and Intelligent Laboratory Systems, 146, 128-135.
Almøy, T. (1996). A simulation study on comparison of prediction methods when only a few components are relevant. Computational statistics & data analysis, 21(1), 87-107.
This function helps to parse a character string into a list object and also creates parameters for performing multiple simulations
parse_parm(character_string, in_list = FALSE)
character_string |
A character string for a parameter, where the items in a list are separated by semicolons. For example: 1, 2; 3, 4 |
in_list |
TRUE if the result needs to be wrapped in a list; default is FALSE |
A list or a vector
parse_parm("1, 2; 3, 4")
parse_parm("1, 2")
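To make the string format concrete, here is a minimal base-R sketch of the parsing idea: semicolons separate list items and commas separate the values within an item. The function parse_parm_sketch is a hypothetical stand-in written for this illustration, not the package's implementation, and the handling of in_list is an assumption based on the argument description.

```r
parse_parm_sketch <- function(character_string, in_list = FALSE) {
  # Split on semicolons to get the list items, then on commas to get the values
  items <- strsplit(character_string, ";", fixed = TRUE)[[1]]
  parsed <- lapply(items, function(item) {
    as.numeric(trimws(strsplit(item, ",", fixed = TRUE)[[1]]))
  })
  # Assumed behaviour: a single item is returned as a plain vector
  # unless in_list = TRUE asks for a list wrapper
  if (length(parsed) == 1 && !in_list) parsed[[1]] else parsed
}

parse_parm_sketch("1, 2; 3, 4")             # a list of two numeric vectors
parse_parm_sketch("1, 2")                   # a numeric vector
parse_parm_sketch("1, 2", in_list = TRUE)   # a list holding one vector
```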
Plotting Functions
plot_beta(obj, base_theme = theme_grey, lab_list = NULL, theme_list = NULL)
obj |
A simrel object |
base_theme |
Base ggplot theme to apply |
lab_list |
List of labs arguments such as x, y, title, subtitle |
theme_list |
List of theme arguments to apply in the plot |
A plot of true regression coefficients for the simulated data
sobj <- multisimrel()
sobj %>%
  plot_beta(
    base_theme = ggplot2::theme_bw,
    lab_list = list(
      title = "Regression Coefficients",
      subtitle = "From Simulation",
      y = "True Regression Coefficients"
    ),
    theme_list = list(
      legend.position = "bottom"
    )
  )
Plotting Covariance Matrix
plot_cov(sobj, type = "relpos", ordering = TRUE, facetting = TRUE)
sobj |
A simrel object |
type |
Type of covariance matrix - can take two values |
ordering |
TRUE for ordering the covariance for block diagonal display |
facetting |
TRUE for facetting the predictor and response space. FALSE will give a single facet plot |
A covariance plot
Sæbø, S., Almøy, T., & Helland, I. S. (2015). simrel—A versatile tool for linear model data simulation based on the concept of a relevant subspace and relevant predictors. Chemometrics and Intelligent Laboratory Systems, 146, 128-135.
Almøy, T. (1996). A simulation study on comparison of prediction methods when only a few components are relevant. Computational statistics & data analysis, 21(1), 87-107.
Rimal, R., Almøy, T., & Sæbø, S. (2018). A tool for simulating multi-response linear model data. Chemometrics and Intelligent Laboratory Systems, 176, 1-10.
sobj <- simrel(
  n = 100, p = 10, q = c(4, 5),
  relpos = list(c(1, 2, 3), c(4, 6, 7)),
  m = 3, R2 = c(0.8, 0.7),
  ypos = list(c(1, 3), 2),
  gamma = 0.7, type = "multivariate"
)
p1 <- plot_cov(sobj, type = "relpos", facetting = FALSE)
p2 <- plot_cov(sobj, type = "rotation", facetting = FALSE)
p3 <- plot_cov(sobj, type = "relpred", facetting = FALSE)
gridExtra::grid.arrange(p1, p2, p3, ncol = 3)
Plot Covariance between predictor (components) and response (components)
plot_covariance( sigma_df, lambda_df = NULL, base_theme = theme_grey, lab_list = NULL, theme_list = NULL )
sigma_df |
A data.frame generated by tidy_sigma |
lambda_df |
A data.frame generated by tidy_lambda |
base_theme |
Base ggplot theme to apply |
lab_list |
List of labs arguments such as x, y, title, subtitle |
theme_list |
List of theme arguments to apply in the plot |
A plot of the covariance between predictor (components) and response (components) for the simulated data
sobj <- bisimrel(p = 12)
sigma_df <- sobj %>%
  cov_mat(which = "zy") %>%
  tidy_sigma() %>%
  abs_sigma()
lambda_df <- sobj %>%
  tidy_lambda()
plot_covariance(
  sigma_df,
  lambda_df,
  base_theme = ggplot2::theme_bw,
  lab_list = list(
    title = "Covariance between Response and Predictor Components",
    subtitle = "The bar represents the eigenvalues predictor covariance",
    y = "Absolute covariance",
    x = "Predictor Component",
    color = "Response Component"
  ),
  theme_list = list(
    legend.position = "bottom"
  )
)
A wrapper function for a simrel object
plot_simrel( obj, ncomp = min(obj$p, obj$n, 20), which = c(1L:4L), layout = NULL, print.cov = FALSE, use_population = TRUE, palette = "Set1", base_theme = ggplot2::theme_grey, lab_list = NULL, theme_list = NULL )
obj |
A simrel object |
ncomp |
Number of components to show in x-axis |
which |
An integer specifying which simrel plot to obtain |
layout |
A layout matrix for arranging the simrel plots |
print.cov |
A boolean specifying whether to print covariance matrices |
use_population |
A boolean specifying whether to get the plot for the population or the sample |
palette |
Name of a color palette compatible with RColorBrewer |
base_theme |
Base ggplot theme to apply |
lab_list |
List of labs arguments such as x, y, title, subtitle. A nested list if the argument which has length greater than 1. |
theme_list |
List of theme arguments to apply in the plot. A nested list if the argument which has length greater than 1. |
Simrel Plot(s)
sobj <- bisimrel(p = 12)
plot_simrel(sobj, layout = matrix(1:4, 2, 2))
Prepare design for experiment from a list of simulation parameter
prepare_design(option_list, tabular = TRUE)
option_list |
A list of options that is to be parsed |
tabular |
logical if output is needed in tabular form or list format |
A list of parsed parameters for simulatr
opts <- list(
  n = rep(100, 2),
  p = c(20, 40),
  q = c("5, 5, 4", "10, 5, 5"),
  m = c(5, 5),
  relpos = c("1; 2, 4; 3", "1, 2; 3, 4; 5"),
  gamma = c(0.2, 0.4),
  R2 = c("0.8, 0.9, 0.7", "0.6, 0.8, 0.7"),
  ypos = c("1, 4; 2, 5; 3", "1; 2, 4; 3, 5"),
  ntest = rep(1000, 2)
)
design <- prepare_design(opts)
design
Simulation of Multivariate Linear Model Data
simrel(n, p, q, relpos, gamma, R2, type = "univariate", ...)
n |
Number of observations. |
p |
Number of variables. |
q |
Number of predictors related to each relevant component. An integer for univariate, a vector of 3 integers for bivariate and 3 or more for multivariate simulation (for details see Notes). |
relpos |
A list (vector in case of univariate simulation) of position of relevant component for predictor variables corresponding to each response. |
gamma |
A declining (decaying) factor of the eigenvalues of the predictors (X). The higher the value of gamma, the steeper the decline of the eigenvalues. |
R2 |
Vector of coefficients of determination (proportion of variation explained by the predictor variables) for each relevant response component. |
type |
Type of simulation - |
... |
Since this is a wrapper function to simulate univariate, bivariate or multivariate data, it calls the respective function. This parameter should contain all the necessary arguments for the respective simulation. See unisimrel, bisimrel and multisimrel. |
A simrel object with all the input arguments along with the following additional items. For more detail on the return values see the individual simulation functions unisimrel, bisimrel and multisimrel.
Common returns from univariate, bivariate and multivariate simulation:
call |
the matched call |
X |
simulated predictors |
Y |
simulated responses |
beta |
true regression coefficients |
beta0 |
true regression intercept |
relpred |
position of relevant predictors |
n |
number of observations |
p |
number of predictors (as supplied in the arguments) |
m |
number of responses (as supplied in the arguments) |
q |
number of relevant predictors (as supplied in the arguments) |
gamma |
declining factor of eigenvalues of predictors (as supplied in the arguments) |
lambda |
eigenvalues corresponding to the predictors |
R2 |
theoretical R-squared value (as supplied in the arguments) |
relpos |
position of relevant components (as supplied in the arguments) |
minerror |
minimum model error |
Sigma |
variance-Covariance matrix of response and predictors |
testX |
simulated test predictor (in univariate simulation |
testY |
simulated test response (in univariate simulation |
Rotation |
Random rotation matrix used to rotate latent components. It is equivalent to the transpose of the eigenvector matrix. In multivariate simulation, |
type |
type of simrel object |
Returns from multivariate simulation:
eta |
a declining factor of eigenvalues of response (Y) (as supplied in the arguments) |
ntest |
number of simulated test observations |
W |
simulated response components |
Z |
simulated predictor components |
testW |
test response components |
testZ |
test predictor components |
SigmaWZ |
Variance-Covariance matrix of components of response and predictors |
SigmaWX |
Covariance matrix of response components and predictors |
SigmaYZ |
Covariance matrix of response and predictor components |
RsqW |
Coefficient of determination corresponding to response components |
RsqY |
Coefficient of determination corresponding to response variables |
The parameter q represents the number of predictor variables that form a basis for the relevant components. For example, with q = 8 and relevant components 1, 2, and 3 specified by the parameter relpos, 8 randomly selected predictor variables form a basis for these three relevant components, and thus in the model these 8 predictors will be relevant for the response (outcome).
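One way to picture this note is with a block-structured rotation. The base-R sketch below is an illustration under assumptions, not the package's code: the coefficient values in alpha, the helper rand_orth and the way the extra components are picked are all hypothetical. It builds an orthogonal matrix R that rotates the relevant components (plus a few irrelevant ones, q in total) onto q randomly chosen predictors, and shows that the resulting regression coefficients are non-zero only for those q predictors.

```r
set.seed(1)
p <- 10; q <- 4; relpos <- c(1, 2)

# Assumed latent model: coefficients on the components are non-zero
# only at the relevant positions (values are illustrative)
alpha <- numeric(p)
alpha[relpos] <- c(0.8, -0.5)

# Components sharing a rotation block with the relevant ones:
# relpos plus (q - length(relpos)) randomly chosen irrelevant components
extra <- sample(setdiff(seq_len(p), relpos), q - length(relpos))
block <- c(relpos, extra)
rest <- setdiff(seq_len(p), block)

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix
rand_orth <- function(k) qr.Q(qr(matrix(rnorm(k * k), k)))

# q randomly selected predictors span the block; the others span the rest
relpred <- sort(sample(seq_len(p), q))
irrpred <- setdiff(seq_len(p), relpred)

R <- matrix(0, p, p)
R[relpred, block] <- rand_orth(q)
R[irrpred, rest] <- rand_orth(p - q)

# With X = R Z, the coefficients on the predictors are R %*% alpha
beta <- as.vector(R %*% alpha)
round(beta, 3)
relpred
```

Only the entries of beta at the positions in relpred are non-zero, which is the sense in which q predictors are relevant for the response.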
Sæbø, S., Almøy, T., & Helland, I. S. (2015). simrel—A versatile tool for linear model data simulation based on the concept of a relevant subspace and relevant predictors. Chemometrics and Intelligent Laboratory Systems, 146, 128-135.
Almøy, T. (1996). A simulation study on comparison of prediction methods when only a few components are relevant. Computational statistics & data analysis, 21(1), 87-107.
Simulation Plot: The true beta, relevant component and eigen structure
simrelplot( obj, ncomp = min(obj$p, obj$n, 20), ask = TRUE, print.cov = FALSE, which = 1L:3L )
obj |
A simrel object |
ncomp |
Number of components to plot |
ask |
logical; TRUE: the function asks for confirmation, FALSE: the function lays out the plots in a predefined format |
print.cov |
Output estimated covariance structure |
which |
A character indicating which plot you want as output, it can take |
A list of plots
Tidy Functions to make plotting easy
Absolute value of sigma scaled by the overall maximum absolute value
tidy_beta(obj) abs_sigma(sigma_df)
obj |
A Simrel Object |
sigma_df |
A tidy covariance data frame generated by tidy_sigma function |
A tibble with three columns: Predictor, Response and BetaCoef
Another data.frame (tibble) of the same dimension with absolute covariance scaled by the overall maximum absolute value
sobj <- multisimrel()
beta_df <- tidy_beta(sobj)
beta_df

sobj <- multisimrel()
sobj %>%
  cov_mat("zy") %>%
  tidy_sigma() %>%
  abs_sigma()
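The scaling performed on the tidy covariance data can be sketched in base R without the package. The small matrix covs and the object names below are hypothetical; the column names follow the Predictor, Response and Covariance layout documented for tidy_sigma, and the real functions return tibbles rather than plain data frames.

```r
# A small, made-up covariance matrix: predictors in rows, responses in columns
covs <- matrix(
  c(0.5, -1.0, 0.25, 0.0, 2.0, -0.5),
  nrow = 3,
  dimnames = list(paste0("X", 1:3), paste0("Y", 1:2))
)

# Long ("tidy") format: one row per predictor-response pair
sigma_long <- as.data.frame(as.table(covs), stringsAsFactors = FALSE)
names(sigma_long) <- c("Predictor", "Response", "Covariance")

# Absolute covariance scaled by the overall maximum absolute value
sigma_abs <- sigma_long
sigma_abs$Covariance <- abs(sigma_long$Covariance) / max(abs(sigma_long$Covariance))

sigma_abs
```

After scaling, every value lies between 0 and 1 and the largest absolute covariance maps to exactly 1, which keeps different simulations comparable on one plot.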
Extract Eigenvalues of predictors
tidy_lambda(obj, use_population = TRUE)
obj |
A simrel Object |
use_population |
A boolean to specify whether to use the population values or to calculate them from the sample |
A data frame of eigenvalues for each predictor
sobj <- multisimrel()
sobj %>%
  tidy_lambda()
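The difference between population eigenvalues and eigenvalues calculated from a sample can be illustrated in base R. This sketch does not call tidy_lambda. The population eigenvalues, sample size and column names below are assumptions chosen for the illustration.

```r
set.seed(1)

# Assumed population eigenvalues (variances of uncorrelated components)
lambda_pop <- c(1, 0.6, 0.4, 0.2, 0.1)
p <- length(lambda_pop)
n <- 200

# Simulate n observations of p uncorrelated components with these variances
Z <- matrix(rnorm(n * p), nrow = n, ncol = p) %*% diag(sqrt(lambda_pop))

# Sample eigenvalues: eigenvalues of the sample covariance matrix
lambda_sample <- eigen(cov(Z), symmetric = TRUE, only.values = TRUE)$values

# Side-by-side comparison, one row per component
data.frame(
  Predictor  = seq_len(p),
  Population = lambda_pop,
  Sample     = lambda_sample
)
```

The sample values fluctuate around the population values and move closer to them as n grows.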
Tidy covariance matrix
tidy_sigma(covs)
covs |
A sigma matrix obtained from cov_mat function |
A tibble with three columns: Predictor, Response and Covariance
sobj <- multisimrel()
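The reshaping of a covariance matrix into three columns can be sketched in base R. This is not the package's implementation. The matrix `covs` is invented, and the layout (components in rows, responses in columns) is an assumption about what cov_mat returns.

```r
# Hypothetical covariance matrix between two components and two responses
covs <- matrix(
  c(0.4, -0.8, 0.1, 0.2),
  nrow = 2,
  dimnames = list(c("Z1", "Z2"), c("Y1", "Y2"))
)

# Long ("tidy") format: one row per (row name, column name) pair
long <- as.data.frame(as.table(covs), stringsAsFactors = FALSE)
names(long) <- c("Predictor", "Response", "Covariance")
long
```

Each cell of the matrix becomes one row, which is the shape that plotting functions working on data frames usually expect.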
Functions for data simulation from a random regression model with one response variable where the data properties can be controlled by a few input parameters. The data simulation is based on the concept of relevant latent components and relevant predictors, and was developed for the purpose of testing methods for variable selection for prediction.
unisimrel( n, p, q, relpos, gamma, R2, ntest = NULL, muY = NULL, muX = NULL, lambda.min = .Machine$double.eps, sim = NULL )
n |
The number of (training) samples to generate. |
p |
The total number of predictor variables to generate. |
q |
The number of relevant predictor variables (as a subset of |
relpos |
A vector indicating the position (between 1 and |
gamma |
A number defining the speed of decline in eigenvalues (variances) of the latent components. The eigenvalues are assumed to decline according to an exponential model. The first eigenvalue is set equal to 1. |
R2 |
The theoretical R-squared according to the true linear model. A number between 0 and 1. |
ntest |
The number of test samples to be generated (optional). |
muY |
The true mean of the response variable (optional). Default is muY=NULL. |
muX |
The |
lambda.min |
Lower bound of the eigenvalues. Defaults to .Machine$double.eps. |
sim |
A fitted simrel object. If this is given, the same regression coefficients will be used to simulate a new data set of requested size. Default is NULL, for which new regression coefficients are sampled. |
The data are simulated according to a multivariate normal model for the
vector $(y, z_1, z_2, \ldots, z_p)^t$, where $y$ is the response
variable and $z = (z_1, \ldots, z_p)$ is the vector of latent (principal)
components. The ordered principal components are uncorrelated variables with
declining variances (eigenvalues) defined for component $j$
as $e^{-\gamma(j-1)}$. Hence, the variance (eigenvalue) of the
first principal component is equal to 1, and a large value of $\gamma$
gives a rapid decline in the variances. The variance of the response
variable is by default fixed equal to 1.
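The effect of gamma on the eigenvalues can be sketched in base R. The sketch assumes the exponential form exp(-gamma (j - 1)) for component j, which matches a first eigenvalue of 1. `eigen_decay` is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper: eigenvalue of component j is exp(-gamma * (j - 1))
eigen_decay <- function(p, gamma) {
  exp(-gamma * (seq_len(p) - 1))
}

slow <- eigen_decay(p = 5, gamma = 0.1)  # small gamma: slow decline
fast <- eigen_decay(p = 5, gamma = 0.9)  # large gamma: rapid decline

# Both sequences start at 1; the second one drops much faster
round(rbind(slow, fast), 3)
```

A small gamma gives predictors with many components of similar variance, while a large gamma concentrates the variance in the first few components.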
Some of the principal components (ordered by their decreasing variances) are
assumed to be relevant for the prediction of the response. The indices of
the positions of the relevant components are set by the relpos
argument. The joint degree of relevance for the relevant components is
determined by the population R-squared defined by R2.
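How the population R-squared ties together the relevant components and the noise variance can be sketched in base R. The sketch assumes that var(y) = 1, that the components are uncorrelated with eigenvalues exp(-gamma (j - 1)), and that the covariances between y and the relevant components are drawn at random and rescaled. This sampling scheme is a hypothetical choice for illustration and is not taken from the package.

```r
set.seed(7)

p      <- 6
gamma  <- 0.25
R2     <- 0.75
relpos <- c(2, 4)                       # positions of the relevant components
lambda <- exp(-gamma * (seq_len(p) - 1))

# Hypothetical choice: random weights on the relevant components only
w <- runif(length(relpos), min = -1, max = 1)

# Covariances between y and the components; zero for irrelevant components.
# Rescaled so that sum(sigma_zy^2 / lambda) equals R2 when var(y) = 1.
sigma_zy <- numeric(p)
sigma_zy[relpos] <- w * sqrt(R2 / sum(w^2 / lambda[relpos]))

beta_z    <- sigma_zy / lambda          # regression coefficients on the components
explained <- sum(beta_z^2 * lambda)     # variance of y explained by the components
noise_var <- 1 - explained              # remaining (noise) variance

c(explained = explained, noise_var = noise_var)
```

Under these assumptions the explained variance equals R2 and the noise variance equals 1 - R2, which is the smallest prediction error any method can achieve.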
In order to obtain predictor variables $x = (x_1, x_2, \ldots, x_p)$ for
$y$, a random rotation of the principal components is performed. Hence,
$x = Rz$ for some random rotation matrix $R$. For values of $q$
satisfying $m \le q < p$, only a subspace of dimension $q$ containing the
$m$ relevant component(s) is rotated. This facilitates
the possibility to generate $q$ relevant predictor variables
($x$'s). The indices of the relevant predictors are randomly selected,
with the only restriction that the index set contains the indices in
relpos. The final index set of the relevant predictors is saved in
the output argument relpred. If q=p, all predictor
variables are relevant for the prediction of $y$.
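The rotation step can be sketched in base R. The sketch builds a random orthogonal matrix from the QR decomposition of a Gaussian matrix, which is an assumption made for illustration and not necessarily how the package constructs its rotation. It rotates all p components, which corresponds to the case q = p.

```r
set.seed(42)

p      <- 4
lambda <- exp(-0.5 * (seq_len(p) - 1))  # assumed eigenvalues of the components

# Random orthogonal matrix (may include a reflection) via QR decomposition
R <- qr.Q(qr(matrix(rnorm(p * p), nrow = p, ncol = p)))

# Covariance of the components z and of the rotated predictors x = R z
Sigma_z <- diag(lambda)
Sigma_x <- R %*% Sigma_z %*% t(R)

# Rotation mixes the components but leaves the eigenvalues unchanged
eig_x <- eigen(Sigma_x, symmetric = TRUE, only.values = TRUE)$values
all.equal(eig_x, lambda)
```

The predictors become correlated with each other after the rotation, yet their covariance matrix keeps the eigenvalue structure of the latent components.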
For further details on the simulation approach, please see Sæbø, Almøy and Helland (2015).
A simrel object, which is a list with the following items:
call |
The call to simrel. |
X |
The (n x p) simulated predictor matrix. |
Y |
The n-vector of simulated response values. |
beta |
The vector of true regression coefficients. |
beta0 |
The true intercept. This is zero if muY=NULL and muX=NULL |
muY |
The true mean of the response variable. |
muX |
The |
relpred |
The index of the true relevant predictors, that is the x-variables with non-zero true regression coefficients. |
TESTX |
The (ntest x p) matrix of optional test samples. |
TESTY |
The ntest-vector of responses of the optional test samples. |
n |
The number of simulated samples. |
p |
The number of predictor variables. |
m |
The number of relevant components. |
q |
The number of relevant predictors. |
gamma |
The decline parameter in the exponential model for the true eigenvalues. |
lambda |
The true eigenvalues of the covariance matrix of the p predictor variables. |
R2 |
The true R-squared value of the linear model. |
relpos |
The positions of the relevant components. |
minerror |
The minimum achievable prediction error. Also the variance of the noise term in the linear model. |
r |
The sampled correlations between the principal components and the response. |
Sigma |
The true covariance matrix of |
Rotation |
The random rotation matrix which is used to obtain the predictor variables as rotations of the latent components. Equals the transpose of the eigenvector matrix of the covariance matrix of |
type |
The type of response generated, either "univariate" as returned from |
Solve Sæbø and Kristian H. Liland
Helland, I. S. and Almøy, T., 1994, Comparison of prediction methods when only a few components are relevant, J. Amer. Statist. Ass., 89(426), 583-591.
Sæbø, S., Almøy, T. and Helland, I. S., 2015, simrel - A versatile tool for linear model data simulation based on the concept of a relevant subspace and relevant predictors, Chemometr. Intell. Lab. (in press), doi:10.1016/j.chemolab.2015.05.012.
#Linear model data, large n, small p
mydata <- unisimrel(n = 250, p = 20, q = 5, relpos = c(2, 4), gamma = 0.25, R2 = 0.75)

#Estimating model parameters using ordinary least squares
lmfit <- lm(mydata$Y ~ mydata$X)
summary(lmfit)

#Comparing true with estimated regression coefficients
plot(mydata$beta, lmfit$coef[-1],
     xlab = "True regression coefficients",
     ylab = "Estimated regression coefficients")
abline(0, 1)

#Linear model data, small n, large p
mydata <- unisimrel(n = 50, p = 200, q = 25, relpos = c(2, 4), gamma = 0.25, R2 = 0.8)

#Simulating more samples with identical distribution as previous simulation
mydata2 <- unisimrel(n = 2500, sim = mydata)

#Estimating model parameters using partial least squares regression with
#cross-validation to determine the number of relevant components.
if (requireNamespace("pls", quietly = TRUE)) {
  require(pls)
  plsfit <- plsr(mydata$Y ~ mydata$X, 15, validation = "CV")

  #Validation plot and finding the number of relevant components.
  plot(0:15, c(plsfit$validation$PRESS0, plsfit$validation$PRESS),
       type = "b", xlab = "Components", ylab = "PRESS")
  mincomp <- which(plsfit$validation$PRESS == min(plsfit$validation$PRESS))

  #Comparing true with estimated regression coefficients
  plot(mydata$beta, plsfit$coef[, 1, mincomp],
       xlab = "True regression coefficients",
       ylab = "Estimated regression coefficients")
  abline(0, 1)
}