Simulate data which satisfies certain conditions

Generate a simulated dataset based on transformation of an underlying base distribution, while checking that certain conditions are met.

Usage

simulate_data_conditional(
  generator,
  n_obs = 1,
  reject = function(x) TRUE,
  reject_max_iter = 10,
  on_reject = "ignore",
  return_tries = FALSE,
  seed = NULL,
  ...
)

Arguments

generator: Function which generates data from the underlying base distribution. It is assumed that it takes the number of simulated observations n_obs as its first argument, as all random generation functions in the stats and extraDistr packages do. Furthermore, it is expected to return a two-dimensional array as output (matrix or data.frame). See details.
n_obs: Number of simulated observations.
reject: Function which takes a matrix or data.frame X as single input and outputs TRUE or FALSE. Specifies when a simulated final datamatrix X should be rejected. The function must output TRUE if the condition IS NOT met, and FALSE if the condition IS met and the matrix can be accepted. Intended to be used with function_list. See details.
reject_max_iter: Integer > 0. In case of rejection, how many times should a new datamatrix be simulated until the conditions in reject are met?
on_reject: If "stop", an error is returned if after reject_max_iter times no suitable datamatrix X could be found. If "current", the current datamatrix is returned, regardless of the conditions in reject. Otherwise, NULL is returned. In each case a warning is reported.
return_tries: If TRUE, then the function also outputs the number of tries necessary to find a dataset fulfilling the condition. Useful to record to assess the possible bias of the simulated datasets. See Value.
seed: Set random seed to ensure reproducibility of results. See Note below.
...: All further parameters are passed to simulate_data.

Value

Data.frame or matrix with n_obs rows for simulated dataset X if all conditions are met within the iteration limit. Otherwise the result depends on on_reject (NULL by default).

If return_tries is TRUE, then the output is a list with the first entry being the data.frame or matrix as described above, and the second entry (n_tries) giving a numeric with the number of tries necessary to find the returned dataset.

Details

For details on generating, transforming and post-processing datasets, see simulate_data. This function simulates data conditional on certain requirements that must be met by the final datamatrix X. This checking is conducted on the output of simulate_data (i.e. also includes possible post-processing steps).

Note

Seeding the random number generator is tricky in this case. The seed cannot be passed on to simulate_data; instead, it is set before calling simulate_data. Otherwise the random number generation would be the same for each of the tries. This means that the seed used to call this function might not be the seed corresponding to the returned dataset.
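The effect described in this note can be reproduced with base R alone. The following is a minimal sketch, not the package's implementation: draw_with_reseed and draw_seed_once are hypothetical helpers, and rnorm(3) stands in for a single simulation try.

```r
# Hypothetical helpers (base R only): compare re-seeding on every try
# with seeding once before the loop.
draw_with_reseed <- function(n_tries, seed) {
  lapply(seq_len(n_tries), function(i) {
    set.seed(seed)  # re-seeding restarts the RNG stream on each try
    rnorm(3)
  })
}

draw_seed_once <- function(n_tries, seed) {
  set.seed(seed)    # seed set once, so each try continues the stream
  lapply(seq_len(n_tries), function(i) rnorm(3))
}

reseeded    <- draw_with_reseed(2, seed = 18)
seeded_once <- draw_seed_once(2, seed = 18)

identical(reseeded[[1]], reseeded[[2]])        # TRUE: every try repeats the same draw
identical(seeded_once[[1]], seeded_once[[2]])  # FALSE: the tries differ
identical(reseeded[[1]], seeded_once[[1]])     # TRUE: only the first try corresponds to the seed
```

If the first try is rejected, the accepted dataset comes from a later point in the random number stream, which is why calling the generator directly with the same seed may not reproduce it.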

Rejecting Datasets

Examples for restrictions include variance restrictions (e.g. no constant columns, which could happen due to extreme transformations of the initial Gaussian distribution Z), ensuring a sufficient number of observations in a given class (e.g. certain binary variables should have at least x% frequency) and preventing multicollinearity (e.g. X must have full column rank). If reject evaluates to TRUE, the current datamatrix X is rejected. In case of rejection, new datasets are simulated until the conditions are met or the maximum iteration limit (reject_max_iter) is hit. What happens then depends on on_reject: an error is reported, the latest datamatrix is returned, or NULL is returned.
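The rejection logic described above can be sketched in base R. This is an illustration under stated assumptions, not the package's code: simulate_until_accepted, draw_binary and has_constant_column are hypothetical names, and draw stands in for the call to simulate_data.

```r
# Hypothetical sketch of the rejection loop (base R only).
simulate_until_accepted <- function(draw, reject, max_iter = 10,
                                    on_reject = "ignore", seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # seed once, before the first try
  x <- NULL
  for (n_tries in seq_len(max_iter)) {
    x <- draw()
    if (!isTRUE(reject(x))) {
      # reject() did not return TRUE: conditions are met, accept x
      return(list(x = x, n_tries = n_tries))
    }
  }
  # Iteration limit hit without an accepted dataset
  msg <- sprintf("No suitable dataset found after %d tries.", max_iter)
  if (on_reject == "stop") stop(msg)
  warning(msg)
  list(x = if (on_reject == "current") x else NULL, n_tries = max_iter)
}

# Assumed toy setting: 5 observations of 2 binary variables,
# rejected whenever a column is constant.
draw_binary <- function() matrix(rbinom(10, size = 1, prob = 0.5), ncol = 2)
has_constant_column <- function(x) {
  any(apply(x, 2, function(col) length(unique(col)) == 1))
}

res <- simulate_until_accepted(draw_binary, has_constant_column,
                               max_iter = 100, seed = 1)
res$n_tries  # number of tries needed for the accepted dataset
```

Recording n_tries alongside the dataset mirrors the purpose of return_tries: a high number of tries indicates that the accepted datasets are a strongly selected subset of what the design generates.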

Rejection Function

The reject function should take a single input (a data.frame or matrix) and output TRUE if the dataset is to be rejected or FALSE if it is to be accepted. This package provides the function_list convenience function, which makes it easy to create a rejection function that assesses several conditions on the input dataset: simply pass the individual test functions to function_list. Templates for such test functions are found in is_collinear and contains_constant. See the example below.
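Custom test functions follow the same contract: one dataset in, TRUE for rejection out. The sketch below uses base R only. The names is_rank_deficient, has_rare_class and reject_any are hypothetical and not part of the package; reject_any is a stand-in written for this illustration so that the example does not depend on how function_list combines results.

```r
# Hypothetical test functions: each takes a matrix or data.frame and
# returns TRUE if the dataset should be rejected.

# Reject if x does not have full column rank
is_rank_deficient <- function(x) {
  x <- as.matrix(x)
  qr(x)$rank < ncol(x)
}

# Reject if either class of the 0/1 column `col` has a relative
# frequency below min_freq
has_rare_class <- function(x, col = 1, min_freq = 0.2) {
  p <- mean(as.matrix(x)[, col] == 1)
  min(p, 1 - p) < min_freq
}

# Hypothetical combinator: reject if any individual test rejects
reject_any <- function(...) {
  tests <- list(...)
  function(x) any(vapply(tests, function(f) isTRUE(f(x)), logical(1)))
}

reject <- reject_any(is_rank_deficient, has_rare_class)

ok        <- cbind(c(1, 0, 1, 0, 1, 0), c(0.3, 1.2, -0.5, 0.8, 2.1, -1.4))
collinear <- cbind(ok[, 1], 2 * ok[, 1])

reject(ok)         # FALSE: full rank, both classes frequent enough
reject(collinear)  # TRUE: second column is a multiple of the first
```

Keeping each condition in its own small function makes it easy to test the conditions individually before combining them into a single rejection function.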

Examples

dsgn <- simdesign_mvtnorm(diag(5))
simulate_data_conditional(dsgn, 10,
    reject = function_list(is_collinear, contains_constant), 
    seed = 18)
#>               v1         v2         v3          v4         v5
#>  [1,]  0.9264592  1.8228212 -1.6105669 -0.28510975 -0.3420730
#>  [2,]  0.3661761 -1.3270408  2.4125922  0.06381535  1.5455148
#>  [3,] -1.8847271  0.9114273 -1.3054559  0.04207365 -0.7853306
#>  [4,]  1.2116326 -0.9245488 -0.6780290  1.33215937  0.4625876
#>  [5,] -1.3022219  1.1084913 -0.7721368 -0.67717026  0.4643073
#>  [6,] -1.9503188 -1.0555950 -0.1178095 -0.25952112 -1.7482390
#>  [7,] -0.9088646  0.2069206  0.3134738  1.16177107 -1.6939916
#>  [8,]  1.0352176  0.6658341 -1.2718919  0.60963927  0.8209692
#>  [9,]  0.6948703  2.5308357  0.4522430  0.98874442 -0.1265098
#> [10,]  0.5149085  0.8279297 -1.3236720 -0.32455266 -0.5578677