Simulate data which satisfies certain conditions
Source: R/simulate_data.R
simulate_data_conditional.Rd
Generate a simulated dataset based on a transformation of an underlying base distribution while checking that certain conditions are met.
Usage
simulate_data_conditional(
generator,
n_obs = 1,
reject = function(x) TRUE,
reject_max_iter = 10,
on_reject = "ignore",
return_tries = FALSE,
seed = NULL,
...
)
Arguments
- generator
Function which generates data from the underlying base distribution. It is assumed to take the number of simulated observations n_obs as its first argument, as all random generation functions in the stats and extraDistr packages do. Furthermore, it is expected to return a two-dimensional array as output (matrix or data.frame). See details.
- n_obs
Number of simulated observations.
- reject
Function which takes a matrix or data.frame X as single input and outputs TRUE or FALSE. Specifies when a simulated final datamatrix X should be rejected. The function must output TRUE if the conditions are not met (X is rejected) and FALSE if the conditions are met and X can be accepted. Intended to be used with function_list. See details.
- reject_max_iter
Integer > 0. In case of rejection, how many times should a new datamatrix be simulated until the conditions in reject are met?
- on_reject
If "stop", an error is raised if no suitable datamatrix X could be found after reject_max_iter tries. If "current", the current datamatrix is returned, regardless of the conditions in reject. Otherwise, NULL is returned. In each case a warning is reported.
- return_tries
If TRUE, then the function also outputs the number of tries necessary to find a dataset fulfilling the condition. Useful to record to assess the possible bias of the simulated datasets. See Value.
- seed
Set random seed to ensure reproducibility of results. See Note below.
- ...
All further parameters are passed to simulate_data.
Value
Data.frame or matrix with n_obs rows for simulated dataset X if all
conditions are met within the iteration limit. Otherwise NULL, unless
on_reject is "current", in which case the last simulated dataset is returned.
If return_tries is TRUE, then the output is a list with the first entry
being the data.frame or matrix as described above, and the second entry
(n_tries) giving a numeric with the number of tries necessary to
find the returned dataset.
Details
For details on generating, transforming and post-processing datasets, see
simulate_data. This function simulates data conditional
on certain requirements that must be met by the final datamatrix X.
This checking is conducted on the output of simulate_data (i.e. it
also includes possible post-processing steps).
Note
Seeding the random number generator is tricky in this case. The seed cannot
be passed to simulate_data; instead, it is set once before simulate_data is
called, because otherwise the random number generation would be the same for
each of the tries.
This means that the seed used to call this function might not be the seed
corresponding to the returned dataset.
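The effect described in this note can be illustrated with a small base-R sketch. The helper draw is hypothetical and only stands in for a single simulation try; it is not part of the package.

```r
# Hypothetical stand-in for one simulation try (not part of simdata).
draw <- function(seed = NULL) {
  # Re-seeding inside the try resets the random number stream.
  if (!is.null(seed)) set.seed(seed)
  rnorm(3)
}

# Seeding inside every try: each try reproduces the same numbers,
# so a rejected dataset would simply be generated again.
try_a <- draw(seed = 18)
try_b <- draw(seed = 18)
identical(try_a, try_b)

# Seeding once before the first try: later tries continue the stream
# and therefore yield new candidate datasets.
set.seed(18)
try_c <- draw()
try_d <- draw()
identical(try_c, try_d)
```

In the second variant, only the first try corresponds directly to the seed, which is why the seed passed to the function may not reproduce the returned dataset on its own.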
Rejecting Datasets
Examples for restrictions include
- variance restrictions (e.g. no constant columns, which could happen due
to extreme transformations of the initial Gaussian distribution Z),
- ensuring a sufficient number of observations in a given class (e.g. certain
binary variables should have at least x% of observations in each class),
- multicollinearity (e.g. X must have full column rank).
If reject evaluates to TRUE, the current datamatrix X is rejected.
In case of rejection, new datasets are simulated until the conditions
are met or a given maximum iteration limit is hit (reject_max_iter),
after which, depending on on_reject, the latest datamatrix or NULL is
returned, or an error is reported.
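The rejection procedure described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the function simulate_until_accepted, its arguments and the element names of the returned list are hypothetical.

```r
# Hypothetical sketch of a rejection loop; NOT the simdata implementation.
simulate_until_accepted <- function(generate, reject, max_iter = 10,
                                    on_reject = "ignore") {
  x <- NULL
  for (n_tries in seq_len(max_iter)) {
    x <- generate()
    # Accept the candidate as soon as reject() does not return TRUE.
    if (!isTRUE(reject(x))) {
      return(list(x = x, n_tries = n_tries))
    }
  }
  # No candidate was accepted within max_iter tries.
  if (on_reject == "stop") stop("No suitable dataset found.")
  warning("No suitable dataset found within the iteration limit.")
  list(x = if (on_reject == "current") x else NULL, n_tries = max_iter)
}

gen <- function() matrix(rnorm(6), nrow = 3)

# A rejection function that always accepts succeeds on the first try.
accepted <- simulate_until_accepted(gen, reject = function(x) FALSE)

# A rejection function that always rejects exhausts the iteration limit.
exhausted <- suppressWarnings(
  simulate_until_accepted(gen, reject = function(x) TRUE, max_iter = 5)
)
kept <- suppressWarnings(
  simulate_until_accepted(gen, reject = function(x) TRUE, max_iter = 5,
                          on_reject = "current")
)
```

Recording the number of tries alongside the dataset, as in this sketch, is what makes it possible to judge afterwards how strongly the conditions filtered the simulated data.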
Rejection Function
The reject function should take a single input (a data.frame or matrix)
and output TRUE if the dataset is to be rejected or FALSE if it is to be
accepted.
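A hand-written rejection function following this contract might look like the base-R sketch below. The helper names has_constant_column, is_rank_deficient and reject_example are hypothetical and are not the test functions shipped with the package.

```r
# Hypothetical checks written in base R (not part of simdata).
has_constant_column <- function(x) {
  x <- as.matrix(x)
  # A column is constant if it has at most one distinct value.
  any(apply(x, 2, function(col) length(unique(col)) <= 1))
}

is_rank_deficient <- function(x) {
  x <- as.matrix(x)
  # Full column rank means the QR rank equals the number of columns.
  qr(x)$rank < ncol(x)
}

# TRUE means "reject this dataset", FALSE means "accept it".
reject_example <- function(x) {
  has_constant_column(x) || is_rank_deficient(x)
}

set.seed(1)
ok <- matrix(rnorm(50), nrow = 10)

# Column 5 is an exact linear combination of columns 1 and 2.
collinear <- cbind(ok[, 1:4], ok[, 1] + ok[, 2])

# Column 3 is constant.
constant <- ok
constant[, 3] <- 1

reject_example(ok)
reject_example(collinear)
reject_example(constant)
```

Because the function returns a single TRUE or FALSE, it could be passed directly as the reject argument.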
This package provides the function_list convenience function, which makes
it easy to create a rejection function that assesses several conditions on
the input dataset by simply passing individual test functions to
function_list. Such test function templates are found in is_collinear and
contains_constant.
See the example below.
Examples
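library(simdata)

# Design with five independent standard normal variables
dsgn <- simdesign_mvtnorm(diag(5))
# Reject datasets that are collinear or contain constant columns
simulate_data_conditional(dsgn, 10,
                          reject = function_list(is_collinear, contains_constant),
                          seed = 18)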
#>               v1         v2         v3          v4         v5
#>  [1,]  0.9264592  1.8228212 -1.6105669 -0.28510975 -0.3420730
#>  [2,]  0.3661761 -1.3270408  2.4125922  0.06381535  1.5455148
#>  [3,] -1.8847271  0.9114273 -1.3054559  0.04207365 -0.7853306
#>  [4,]  1.2116326 -0.9245488 -0.6780290  1.33215937  0.4625876
#>  [5,] -1.3022219  1.1084913 -0.7721368 -0.67717026  0.4643073
#>  [6,] -1.9503188 -1.0555950 -0.1178095 -0.25952112 -1.7482390
#>  [7,] -0.9088646  0.2069206  0.3134738  1.16177107 -1.6939916
#>  [8,]  1.0352176  0.6658341 -1.2718919  0.60963927  0.8209692
#>  [9,]  0.6948703  2.5308357  0.4522430  0.98874442 -0.1265098
#> [10,]  0.5149085  0.8279297 -1.3236720 -0.32455266 -0.5578677