simdata: Technical documentation
Michael Kammer
2024-12-03
Source: vignettes/Technical_documentation.Rmd
Introduction
This document is intended to elaborate on the inner workings of the
simdata package, for users who may wish to extend it for
their purposes.
The simdata package is based on a very simple idea, which rests on two key components:
- the simdesign S3 class, and any concrete subclass implemented by the user, which provides a data generating mechanism and stores all data necessary to simulate from it
- the simulate_data method for the simdesign class, which actually implements drawing from the data generating mechanism
Both key functionalities can be embellished by further features to
adapt to the task of interest. How to do this is presented in the
Demo vignette of the package. The package further provides
some utilities around the core functionality, to assist in simulation
tasks, but which are not essential to the usage of the package.
simdesign S3 class
The main class of this package is the simdesign S3
class. It is a list with class attribute simdesign and
entries as defined in the documentation of the simdesign
class.
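To make the idea of "a list with a class attribute" concrete, the following base R sketch builds a toy object by hand. The name toy_design and its two entries are purely illustrative assumptions. Real objects should be created with the simdesign constructor, whose entries are described in its documentation.

```r
# Hypothetical toy object: a plain list carrying a class attribute.
# This only illustrates the S3 mechanics; it is not the simdesign constructor.
toy_design <- structure(
  list(
    generator = function(n) matrix(rnorm(n * 2), nrow = n, ncol = 2),
    name = "toy design"
  ),
  class = "simdesign"
)

is.list(toy_design)                # the object is still an ordinary list
inherits(toy_design, "simdesign")  # S3 dispatch relies on the class attribute
dim(toy_design$generator(5))       # entries are accessed like list elements
```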
Subclassing simdesign
A template for a constructor implementing a subclass for a specific simulation design is given by:
# constructor takes any number of arguments arg1, arg2, and so on
# and it must use the ellipsis ... as final argument
new_simdesign <- function(arg1, arg2, ...) {
    # define generator function in one argument
    generator = function(n) {
        # implement data generating mechanism
        # make use of any argument passed to the new_simdesign constructor
        # make sure it returns a two-dimensional array
    }
    # setup simdesign subclass
    # make sure to pass generator function and ...
    # all other information passed is optional
    dsgn = simdesign(
        generator = generator,
        arg1 = arg1,
        arg2 = arg2,
        ...
    )
    # extend the class attribute
    # ("binomial_simdesign" serves as an example subclass name)
    class(dsgn) = c("binomial_simdesign", class(dsgn))
    # return the object
    dsgn
}

Examples for actual implementations are provided in the
Demo vignette of this package.
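As a hedged illustration of the template, the sketch below fills it in for a single binomial variable. The constructor name new_binomial_simdesign and the arguments size and prob are assumptions made for this example. The calls to simdesign and simulate_data (with n_obs) assume the interface described in the package documentation. The usage part only runs if simdata is installed.

```r
# Hypothetical constructor following the template above.
new_binomial_simdesign <- function(size = 1, prob = 0.5, ...) {
  # generator in one argument; returns an n x 1 matrix (two-dimensional)
  generator <- function(n) {
    matrix(rbinom(n, size = size, prob = prob), ncol = 1)
  }
  # pass generator and ...; size and prob are stored as optional extras
  dsgn <- simdata::simdesign(
    generator = generator,
    size = size,
    prob = prob,
    ...
  )
  # prepend the subclass name so that methods can dispatch on it
  class(dsgn) <- c("binomial_simdesign", class(dsgn))
  dsgn
}

# Usage, guarded so that the sketch is harmless without the package
if (requireNamespace("simdata", quietly = TRUE)) {
  set.seed(1)
  dsgn <- new_binomial_simdesign(size = 10, prob = 0.3)
  x <- simdata::simulate_data(dsgn, n_obs = 5)
  print(class(dsgn))
  print(dim(x))
}
```

Because the subclass name is placed in front of the existing class vector, the object still inherits from simdesign, so the simulate_data method for simdesign continues to apply.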
Simulation of data
simulate_data method
The data generation in the simulate_data method follows
a simple recipe. In principle, the method can be used without a
simdesign object, but here we assume they are used
together. In the following graphic, circular shapes denote
functions.
- Data is drawn from an initial distribution using the generator field (a function object) of the simdesign class.
  - Relevant input: the function stored in the generator field of the simdesign class, n_obs (number of observations), any further argument passed to simulate_data which is not specified in the documentation
  - Output: initial generated dataset Z
- The initial data Z is transformed by one or several functions which are applied to the dataset.
  - Relevant input: Z, function stored in the transform_initial field of the simdesign class (can be implemented by using a function_list, see documentation of this package)
  - Default: base::identity is used to return the dataset Z unchanged
  - Output: final generated dataset X
- Optional: the final data X can be post-processed before further usage.
  - Relevant input: X, functions stored in the process_final field of the simdesign object
  - Default: base::identity is used to return the dataset X unchanged
  - Output: post-processed dataset X'.
The final output of the method is a dataset (a matrix or data.frame depending on the data generating mechanism) which can be used in further analysis steps.
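The three steps above can be restated as a minimal base R sketch. The function sketch_simulate and its argument handling are assumptions for illustration only and do not reproduce the package implementation. In particular, treating process_final as a plain list of functions applied in order is a simplification.

```r
# Hypothetical base R re-statement of the three steps; not the package code.
sketch_simulate <- function(generator, n_obs,
                            transform_initial = base::identity,
                            process_final = list(base::identity),
                            ...) {
  # Step 1: draw the initial dataset Z; extra arguments go to the generator
  z <- generator(n_obs, ...)
  # Step 2: transform Z into the final dataset X
  x <- transform_initial(z)
  # Step 3 (optional): apply the post-processing functions in order
  for (f in process_final) {
    x <- f(x)
  }
  x
}

set.seed(1)
out <- sketch_simulate(
  generator = function(n, mean = 0) matrix(rnorm(n * 2, mean = mean), ncol = 2),
  n_obs = 10,
  transform_initial = function(z) cbind(z, z[, 1] * z[, 2]),
  process_final = list(as.data.frame),
  mean = 5
)
dim(out)  # 10 rows, 3 columns
```

Here the generator produces two columns, the transformation appends their product as a third column, and the post-processing step converts the matrix to a data.frame.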
simulate_data_conditional function
Data can be simulated to conform to specific user-specified
constraints. These constraints are implemented through a rejection
function applied to a simulated dataset. Only datasets for which the
function returns FALSE (i.e. not rejected) are returned. This is
implemented by repeatedly calling simulate_data to obtain
new instances of datasets from the data generating mechanism, either
until the rejection function accepts the dataset, or until a maximum
number of iterations has been reached. This process is depicted in the
following diagram, in which circular shapes denote functions.
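The rejection loop described above can be sketched in base R as follows. The function sketch_simulate_conditional, its arguments, and the choice to return NULL once the iteration limit is reached are assumptions of this sketch. The behaviour of simulate_data_conditional itself is defined in its documentation.

```r
# Hypothetical base R sketch of the rejection loop; not the package code.
sketch_simulate_conditional <- function(simulate, reject, max_iter = 10) {
  for (i in seq_len(max_iter)) {
    x <- simulate()            # one new draw from the data generating mechanism
    if (isFALSE(reject(x))) {  # FALSE means "not rejected"
      return(x)
    }
  }
  NULL  # assumption of this sketch: signal failure with NULL
}

set.seed(1)
draw <- function() matrix(rnorm(20), ncol = 2)
# reject datasets whose first column has a negative mean
reject_negative_mean <- function(x) mean(x[, 1]) < 0
x <- sketch_simulate_conditional(draw, reject_negative_mean, max_iter = 100)
```

A rejection function that never returns FALSE exhausts the iteration limit, in which case this sketch returns NULL.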
R session information
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 desc_1.4.3 R6_2.5.1 fastmap_1.2.0
## [5] xfun_0.49 cachem_1.1.0 knitr_1.49 htmltools_0.5.8.1
## [9] rmarkdown_2.29 lifecycle_1.0.4 cli_3.6.3 pkgdown_2.1.1
## [13] sass_0.4.9 textshaping_0.4.0 jquerylib_0.1.4 systemfonts_1.1.0
## [17] compiler_4.4.2 tools_4.4.2 ragg_1.3.3 evaluate_1.0.1
## [21] bslib_0.8.0 yaml_2.3.10 jsonlite_1.8.9 rlang_1.1.4
## [25] fs_1.6.5