Generate simulated dataset based on transformation of an underlying base distribution.

Usage

simulate_data(generator, ...)

# Default S3 method
simulate_data(
  generator = function(n) matrix(rnorm(n)),
  n_obs = 1,
  transform_initial = base::identity,
  names_final = NULL,
  prefix_final = NULL,
  process_final = list(),
  seed = NULL,
  ...
)

# S3 method for class 'simdesign'
simulate_data(
  generator,
  n_obs = 1,
  seed = NULL,
  apply_transformation = TRUE,
  apply_processing = TRUE,
  ...
)

Arguments

generator

Function which generates data from the underlying base distribution. It is assumed to take the number of simulated observations n_obs as its first argument, as all random generation functions in the stats and extraDistr packages do. Furthermore, it is expected to return a two-dimensional array as output (matrix or data.frame). Alternatively, an R object derived from the simdata::simdesign class can be passed. See details.

...

Further arguments passed to the generator function.

n_obs

Number of simulated observations.

transform_initial

Function which specifies the transformation of the underlying dataset Z to final dataset X. See details.

names_final

NULL or character vector with variable names for final dataset X. Length needs to equal the number of columns of X. Overrides other naming options. See details.

prefix_final

NULL or prefix attached to variables in final dataset X. Overridden by the names_final argument. Set to NULL if no prefixes should be added. See details.

process_final

List of lists specifying post-processing functions applied to the final data matrix X before returning it. See do_processing.

seed

Set random seed to ensure reproducibility of results.

apply_transformation

This argument can be set to FALSE to override the information stored in the passed simdesign object so that the data is neither transformed nor post-processed. In that case, the raw data from the design generator is returned. This can be useful for debugging purposes.

apply_processing

This argument can be set to FALSE to override the information stored in the passed simdesign object so that the data is not post-processed after the initial transformation. This can be useful for debugging purposes.

Value

Data.frame or matrix with n_obs rows for simulated dataset X.

Details

Data is generated using the following procedure:

  1. An underlying dataset Z is sampled from some distribution. This is done by a call to the generator function.

  2. Z is then transformed into the final dataset X by applying the transform_initial function to Z.

  3. X is post-processed if specified (e.g. truncation to avoid outliers).
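The three steps above can be sketched in base R without the package. This is only an illustration of the procedure under stated assumptions: the two-column generator, the transformation, and the truncate helper are hypothetical and are not taken from simdata.

```r
# Step 1: sample the underlying dataset Z (assumed: two independent standard normals).
generator <- function(n) matrix(rnorm(n * 2), ncol = 2)

# Step 2: transform Z into X (hypothetical transformation for illustration).
transform_initial <- function(z) cbind(z[, 1], z[, 1] + z[, 2])

# Step 3: post-process X (hypothetical truncation of all values to [-2, 2]).
truncate <- function(x) pmin(pmax(x, -2), 2)

set.seed(24)                          # reproducibility, as with the seed argument
z <- generator(10)                    # underlying dataset Z
x <- truncate(transform_initial(z))   # final dataset X
dim(x)                                # 10 rows, 2 columns
```

In simulate_data these roles are played by the generator, transform_initial and process_final arguments.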

Methods (by class)

  • simulate_data(default): Function to be used if no simdesign S3 class is used.

  • simulate_data(simdesign): Function to be used with simdesign S3 class.

Note

This function is best used in conjunction with the simdesign S3 class or any template based upon it, which facilitates further data visualization and conveniently stores information as a template for simulation tasks.

Generators

The generator function, which is passed either directly or via a simdata::simdesign object, is assumed to provide the same interface as the random generation functions in the R stats and extraDistr packages. Specifically, it takes the number of observations as its first argument. All further arguments can be set by passing them as named arguments to this function. It is expected to return a two-dimensional array (matrix or data.frame) for which the number of columns can be determined. Otherwise the check_and_infer step will fail.
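A generator following this interface might look like the sketch below. The function and its n_var, mean and sd arguments are hypothetical and chosen only for illustration. The call to simulate_data is guarded so the snippet also runs where simdata is not installed; it relies on the documented behaviour that further named arguments are passed on to the generator.

```r
# Hypothetical generator: number of observations first, further settings as
# named arguments, two-dimensional output with a well-defined number of columns.
generator <- function(n, n_var = 3, mean = 0, sd = 1) {
  matrix(rnorm(n * n_var, mean = mean, sd = sd), nrow = n, ncol = n_var)
}

z <- generator(5, n_var = 2, mean = 10)
is.matrix(z)   # TRUE
ncol(z)        # 2

# Named arguments not used by simulate_data itself are forwarded via `...`.
if (requireNamespace("simdata", quietly = TRUE)) {
  x <- simdata::simulate_data(generator, n_obs = 5, seed = 1,
                              n_var = 2, mean = 10)
}
```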

Transformations

Transformations should be applicable to the output of the generator function (i.e. take a data.frame or matrix as input) and output another data.frame or matrix. This package provides the convenience function function_list to specify transformations as a list of functions, each of which takes the whole data matrix Z as its single argument and can be used to apply specific transformations to the columns of that matrix. See the documentation for function_list for details.
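As a minimal sketch of such a transformation, the plain function below takes the whole matrix Z and returns a new matrix X. The three-column generator and the chosen column operations are assumptions made for illustration; function_list is not used here.

```r
# Assumed generator: three independent standard normal columns.
generator <- function(n) matrix(rnorm(n * 3), ncol = 3)

# Hypothetical transformation: receives the whole matrix Z, returns a matrix X.
transform_initial <- function(z) {
  cbind(
    z[, 1],            # keep the first column unchanged
    exp(z[, 2]),       # log-normal variable derived from the second column
    z[, 1] + z[, 3]    # third variable depends on the first column
  )
}

set.seed(1)
x <- transform_initial(generator(4))
dim(x)   # 4 rows, 3 columns
```

A function of this shape could be supplied as the transform_initial argument.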

Post-processing

Post-processing of the data matrix is based on do_processing.

Naming of variables

Variables are named by names_final if it is not NULL and of correct length. Otherwise, if prefix_final is not NULL, it is used as prefix for variable numbers. Otherwise, variable names remain as returned by the generator function.
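The precedence described above can be illustrated with a small helper. Note that resolve_names is a hypothetical function written for this sketch; it mirrors the documented rules and is not the package's internal code.

```r
# Hypothetical helper mirroring the documented naming precedence.
resolve_names <- function(x, names_final = NULL, prefix_final = NULL) {
  if (!is.null(names_final) && length(names_final) == ncol(x)) {
    # names_final wins when it has one entry per column
    colnames(x) <- names_final
  } else if (!is.null(prefix_final)) {
    # otherwise the prefix is combined with the variable number
    colnames(x) <- paste0(prefix_final, seq_len(ncol(x)))
  }
  # otherwise existing names are left untouched
  x
}

x <- matrix(0, nrow = 2, ncol = 3)
names_both   <- colnames(resolve_names(x, names_final = c("a", "b", "c"),
                                       prefix_final = "v"))  # "a" "b" "c"
names_prefix <- colnames(resolve_names(x, prefix_final = "v"))  # "v1" "v2" "v3"
names_none   <- colnames(resolve_names(x))                      # NULL
```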

Examples

generator <- function(n) mvtnorm::rmvnorm(n, mean = 0)
simulate_data(generator, 10, seed = 24)
#>               [,1]
#>  [1,] -0.545880758
#>  [2,]  0.536585304
#>  [3,]  0.419623149
#>  [4,] -0.583627199
#>  [5,]  0.847460017
#>  [6,]  0.266021979
#>  [7,]  0.444585270
#>  [8,] -0.466495124
#>  [9,] -0.848370044
#> [10,]  0.002311942