Generate a simulated dataset based on a transformation of an underlying base distribution.
Usage
simulate_data(generator, ...)
# Default S3 method
simulate_data(
  generator = function(n) matrix(rnorm(n)),
  n_obs = 1,
  transform_initial = base::identity,
  names_final = NULL,
  prefix_final = NULL,
  process_final = list(),
  seed = NULL,
  ...
)
# S3 method for class 'simdesign'
simulate_data(
  generator,
  n_obs = 1,
  seed = NULL,
  apply_transformation = TRUE,
  apply_processing = TRUE,
  ...
)
Arguments
- generator
Function which generates data from the underlying base distribution. It is assumed to take the number of simulated observations n_obs as its first argument, as all random generation functions in the stats and extraDistr packages do. Furthermore, it is expected to return a two-dimensional array (matrix or data.frame). Alternatively, an R object derived from the simdata::simdesign class. See details.
- ...
Further arguments passed to the generator function.
- n_obs
Number of simulated observations.
- transform_initial
Function which specifies the transformation of the underlying dataset Z to the final dataset X. See details.
- names_final
NULL or character vector with variable names for the final dataset X. Its length needs to equal the number of columns of X. Overrides other naming options. See details.
- prefix_final
NULL or prefix attached to variables in the final dataset X. Overridden by the names_final argument. Set to NULL if no prefixes should be added. See details.
- process_final
List of lists specifying post-processing functions applied to the final data matrix X before returning it. See do_processing.
- seed
Set random seed to ensure reproducibility of results.
- apply_transformation
This argument can be set to FALSE to override the information stored in the passed simdesign object and skip both transformation and post-processing. The raw data from the design generator is then returned. This can be useful for debugging purposes.
- apply_processing
This argument can be set to FALSE to override the information stored in the passed simdesign object and skip post-processing after the initial data has been transformed. This can be useful for debugging purposes.
Details
Data is generated using the following procedure:
- An underlying dataset Z is sampled from some distribution. This is done by a call to the generator function.
- Z is then transformed into the final dataset X by applying the transform_initial function to Z.
- X is post-processed if specified (e.g. truncation to avoid outliers).
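The three steps above can be sketched in base R as follows. This is an illustration only and not the package's implementation: the helper `sketch_simulate` and the functions `gen`, `expo` and `truncate_at_5` are hypothetical names introduced here, and post-processing is simplified to a single function instead of the list format used by do_processing.

```r
# Hypothetical helper (not part of simdata) mirroring the documented steps.
sketch_simulate <- function(generator, n_obs, transform = identity,
                            post_process = identity, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  Z <- generator(n_obs)  # step 1: sample the underlying dataset Z
  X <- transform(Z)      # step 2: transform Z into the final dataset X
  post_process(X)        # step 3: optional post-processing of X
}

# Illustrative building blocks (assumptions, chosen for this example only)
gen <- function(n) matrix(rnorm(n * 2), ncol = 2)  # two standard normal columns
expo <- function(Z) exp(Z)                         # makes both columns log-normal
truncate_at_5 <- function(X) pmin(X, 5)            # caps large values at 5

X <- sketch_simulate(gen, 100, expo, truncate_at_5, seed = 1)
dim(X)
range(X)
```

Keeping the steps as separate functions makes it possible to inspect the intermediate Z, which is the same idea behind the apply_transformation and apply_processing switches.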
Note
This function is best used in conjunction with the simdesign S3 class or any template based upon it, which facilitates further data visualization and conveniently stores information as a template for simulation tasks.
Generators
The generator function, which is either passed directly or via a simdata::simdesign object, is assumed to provide the same interface as the random generation functions in the R stats and extraDistr packages. Specifically, it takes the number of observations as its first argument. All further arguments can be set by passing them as named arguments to simulate_data. It is expected to return a two-dimensional array (matrix or data.frame) for which the number of columns can be determined. Otherwise the check_and_infer step will fail.
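A generator that follows this interface might look like the sketch below. The name `two_normals` and its `mean` and `sd` parameters are hypothetical and chosen only for illustration; the point is the shape of the interface (observation count first, further named arguments, matrix returned).

```r
# Hypothetical generator: n comes first, further named arguments tune the
# distribution, and the result is always a matrix with a known column count.
two_normals <- function(n, mean = 0, sd = 1) {
  matrix(stats::rnorm(n * 2, mean = mean, sd = sd), nrow = n, ncol = 2)
}

Z <- two_normals(5, mean = 10, sd = 0.1)
dim(Z)

# A bare vector has no column count, which is the shape to avoid:
is.null(ncol(stats::rnorm(5)))
```

According to the argument description above, extra arguments such as `mean` would be forwarded through `...`, for example `simulate_data(two_normals, n_obs = 5, mean = 10)`.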
Transformations
Transformations should be applicable to the output of the generator function (i.e. take a data.frame or matrix as input) and output another data.frame or matrix. A convenience function function_list is provided by this package to specify transformations as a list of functions, each of which takes the whole data matrix Z as its single argument and can be used to apply specific transformations to the columns of that matrix. See the documentation for function_list for details.
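As a minimal sketch of a transformation with this contract, the plain function below takes the whole matrix Z and returns a new matrix. It is written in base R and does not use function_list; the name `transform_cols` and the chosen column operations are assumptions made for illustration.

```r
# Hypothetical transformation: receives the whole matrix Z, returns matrix X.
transform_cols <- function(Z) {
  cbind(
    Z[, 1],           # column 1 is passed through unchanged
    exp(Z[, 2]),      # column 2 becomes log-normal
    Z[, 1] * Z[, 2]   # a new column derived from both inputs
  )
}

set.seed(2)
Z <- matrix(rnorm(8), ncol = 2)
X <- transform_cols(Z)
dim(X)
```

Because the function sees all of Z at once, the output may have a different number of columns than the input, as it does here.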
Post-processing
Post-processing the data matrix is based on do_processing.
Naming of variables
Variables are named by names_final if it is not NULL and of correct length. Otherwise, if prefix_final is not NULL, it is used as a prefix for variable numbers. Otherwise, variable names remain as returned by the generator function.
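The naming rules can be tried out with calls along the following lines. This sketch assumes the simdata package is installed and is skipped otherwise; the generator `two_cols` and the variable names are made up for illustration. The names built from a prefix are printed rather than stated here, since their exact format is not described above.

```r
# Runs only if simdata is available; two_cols is a hypothetical generator.
if (requireNamespace("simdata", quietly = TRUE)) {
  two_cols <- function(n) matrix(rnorm(n * 2), ncol = 2)

  # names_final has one entry per column, so it is used directly
  named <- simdata::simulate_data(two_cols, n_obs = 3,
                                  names_final = c("height", "weight"),
                                  seed = 1)
  print(colnames(named))

  # Only prefix_final given: names are built from the prefix and column numbers
  prefixed <- simdata::simulate_data(two_cols, n_obs = 3,
                                     prefix_final = "x", seed = 1)
  print(colnames(prefixed))
}
```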
Examples
generator <- function(n) mvtnorm::rmvnorm(n, mean = 0)
simulate_data(generator, 10, seed = 24)
#>               [,1]
#>  [1,] -0.545880758
#>  [2,]  0.536585304
#>  [3,]  0.419623149
#>  [4,] -0.583627199
#>  [5,]  0.847460017
#>  [6,]  0.266021979
#>  [7,]  0.444585270
#>  [8,] -0.466495124
#>  [9,] -0.848370044
#> [10,]  0.002311942