Stores information necessary to simulate datasets based on the NORTA procedure (Cario and Nelson 1997).
Usage
simdesign_norta(
  cor_target_final = NULL,
  cor_initial = NULL,
  dist = list(),
  tol_initial = 0.001,
  n_obs_initial = 10000,
  seed_initial = 1,
  conv_norm_type = "O",
  method = "svd",
  name = "NORTA based simulation design",
  ...
)
Arguments
- cor_target_final
Target correlation matrix for simulated datasets. At least one of cor_target_final or cor_initial must be specified.
- cor_initial
Correlation matrix of the underlying multivariate standard normal distribution on which the final data is based. At least one of cor_target_final or cor_initial must be specified. If NULL, then cor_initial will be numerically optimized by simulation for the NORTA procedure using cor_target_final.
- dist
List of functions of marginal distributions for simulated variables. Must have the same length as the specified correlation matrix (cor_target_final and / or cor_initial), and the order of the entries must correspond to the variables in the correlation matrix. See details for the specification of the marginal distributions.
- tol_initial
If cor_initial is numerically optimized, specifies the tolerance for the difference to the target correlation cor_target_final. Parameter passed to optimize_cor_for_pair.
- n_obs_initial
If cor_initial is numerically optimized, specifies the number of draws in simulation during optimization used to estimate correlations. Parameter passed to optimize_cor_for_pair.
- seed_initial
Seed used for draws of the initial distribution used during optimization to estimate correlations.
- conv_norm_type
If cor_initial is numerically optimized and found not to be a proper correlation matrix (i.e. not positive-definite), specifies the metric used to find the nearest positive-definite correlation matrix. Parameter passed to Matrix::nearPD (conv.norm.type), see there for details.
- method
method argument of mvtnorm::rmvnorm.
- name
Character, optional name of the simulation design.
- ...
Further arguments are passed to the simdesign constructor.
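To illustrate what the conv_norm_type argument controls, the following sketch calls Matrix::nearPD directly on a matrix that is not positive-definite. It is only an illustration of the underlying call, assuming the Matrix package is installed; the example matrix m is made up, and the exact way simdesign_norta invokes nearPD internally is not shown here.

```r
# Requires the Matrix package, which ships with standard R installations.
library(Matrix)

# A symmetric matrix with unit diagonal that is not positive-definite
# (made-up example values).
m <- matrix(c( 1.0, 0.9, -0.9,
               0.9, 1.0,  0.9,
              -0.9, 0.9,  1.0), nrow = 3, byrow = TRUE)

# The smallest eigenvalue is negative, so m is not a valid correlation matrix.
min(eigen(m, symmetric = TRUE)$values)

# Nearest positive-definite correlation matrix. conv.norm.type is the
# nearPD argument that conv_norm_type is passed to.
fixed <- as.matrix(nearPD(m, corr = TRUE, conv.norm.type = "O")$mat)
fixed
```

The corrected matrix keeps a unit diagonal but its off-diagonal entries differ from those of m, which is one reason the final correlation may deviate from the target.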
Value
List object with class attribute "simdesign_norta" (S3 class), inheriting from "simdesign". It contains the same entries as a simdesign object and, in addition, the following entries:
- cor_target_final
- cor_initial
Initial correlation matrix of multivariate normal distribution
- dist
- tol_initial
- n_obs_initial
- conv_norm_type
- method
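A minimal usage sketch of constructing a design and inspecting the returned object follows. It assumes the simdata package is installed and that its simulate_data function, which is not described in this section, draws datasets from a design. The marginal distributions and correlation values are arbitrary illustrations.

```r
# Runs only if the simdata package is installed.
if (requireNamespace("simdata", quietly = TRUE)) {
  # Arbitrary target correlation and marginals for illustration.
  cor_target <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
  dist <- list(
    function(x) qnorm(x, mean = 10, sd = 4),
    function(x) qexp(x, rate = 2)
  )

  # cor_initial is left NULL, so it is optimized from cor_target_final.
  design <- simdata::simdesign_norta(
    cor_target_final = cor_target,
    dist = dist,
    n_obs_initial = 10000,
    seed_initial = 1
  )

  print(class(design))       # S3 class vector of the design object
  print(design$cor_initial)  # optimized initial correlation matrix

  # Assumption: simulate_data() draws n_obs observations from the design.
  sim <- simdata::simulate_data(design, n_obs = 1000, seed = 2)
  print(cor(sim))            # compare with cor_target
}
```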
Details
This S3 class implements a simulation design based on the NORmal-To-Anything (NORTA) procedure by Cario and Nelson (1997). See the corresponding NORTA vignette for usage examples showing how to approximate real datasets.
Data Generation
Data will be generated using the following procedure:
- An underlying data matrix Z is sampled from a multivariate standard normal distribution with correlation structure given by cor_initial.
- Z is then transformed into a dataset X by applying the functions given in dist to the columns of Z. The resulting dataset X will then have the desired marginal distributions, and approximate the target correlation cor_target_final, if specified.
- X is further transformed by the transformation transform_initial (note that this may affect the correlation of the final dataset and is not respected by the optimization procedure), and post-processed if specified.
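The first two steps can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's generator: the helper name norta_sketch is hypothetical, the Cholesky factor stands in for the method argument of mvtnorm::rmvnorm, and the explicit pnorm step reflects that the entries of dist are quantile functions defined on [0, 1].

```r
# Hypothetical helper illustrating NORTA data generation in base R.
norta_sketch <- function(n_obs, cor_initial, dist, seed = 1) {
  stopifnot(length(dist) == ncol(cor_initial))
  set.seed(seed)
  p <- ncol(cor_initial)

  # Step 1: draw Z from a multivariate standard normal distribution with
  # correlation cor_initial, using the upper Cholesky factor.
  Z <- matrix(rnorm(n_obs * p), nrow = n_obs) %*% chol(cor_initial)

  # Map each column to the unit interval so that quantile functions apply.
  U <- pnorm(Z)

  # Step 2: apply the i-th quantile function to the i-th column.
  X <- matrix(NA_real_, nrow = n_obs, ncol = p)
  for (i in seq_len(p)) {
    X[, i] <- dist[[i]](U[, i])
  }
  X
}

cor_initial <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
dist <- list(
  function(x) qnorm(x, mean = 10, sd = 4),
  function(x) qbinom(x, size = 1, prob = 0.3)
)

X <- norta_sketch(1000, cor_initial, dist)
colMeans(X)  # sample means of the two simulated variables
cor(X)       # final correlation; need not equal cor_initial
```

Because the second margin is binary, the correlation of X generally differs from cor_initial, which is why the initial correlation has to be tuned when a target correlation is requested.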
Marginal distributions
A list of functions dist is used to define the marginal distributions of the variables. Each entry must be a quantile function, i.e. a function that maps [0, 1] to the domain of a probability distribution. Each entry must take a single input vector, and return a single numeric vector. Examples for acceptable entries include all standard quantile functions implemented in R (e.g. qnorm, qbinom, ...), user defined functions wrapping these (e.g. function(x) qnorm(x, mean = 10, sd = 4)), or empirical quantile functions. The helper function quantile_functions_from_data can be used to automatically estimate empirical quantile functions from given data in order to reproduce it using the NORTA approach. See the example in the NORTA vignette of this package for workflow details.
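As an illustration, a dist list might combine a standard quantile function, wrappers with fixed parameters, and an empirical quantile function. The sketch below uses only base R. The data in observed are made up, and empirical_q is a hypothetical stand-in built on stats::quantile, not the package's quantile_functions_from_data helper.

```r
# Made-up observations used to build an empirical quantile function.
observed <- c(2.1, 3.4, 3.9, 4.4, 5.0, 6.2, 7.7, 9.5)

# Hypothetical stand-in for an empirical quantile function.
empirical_q <- function(x) {
  as.numeric(stats::quantile(observed, probs = x, names = FALSE))
}

dist <- list(
  qnorm,                                        # standard quantile function
  function(x) qnorm(x, mean = 10, sd = 4),      # wrapper fixing parameters
  function(x) qbinom(x, size = 1, prob = 0.3),  # binary variable
  empirical_q                                   # empirical quantiles
)

# Each entry takes one vector of probabilities and returns one numeric
# vector of the same length.
u <- c(0.1, 0.5, 0.9)
lapply(dist, function(q) q(u))
```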
Target correlations
Not every valid correlation matrix (i.e. symmetric, positive-definite matrix with elements in [-1, 1] and unit diagonal) for a number of variables is feasible for given desired marginal distributions (see e.g. Ghosh and Henderson 2003). Therefore, if cor_target_final is specified as target correlation, this class optimizes cor_initial in such a way that the final simulated dataset has a correlation which approximates cor_target_final. However, the actual correlation in the end may differ if cor_target_final is infeasible for the given specification, or the NORTA procedure cannot exactly reproduce the target correlation. In general, however, approximations should be acceptable if target correlations and marginal structures are derived from real datasets. See e.g. Ghosh and Henderson (2003) for the motivation of why this works.
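The idea of tuning the initial correlation by simulation can be sketched for a single pair of variables. This is an assumption-laden illustration in base R, not the implementation of optimize_cor_for_pair: the helper find_cor_initial is hypothetical, and it uses fixed random draws together with uniroot to solve for the initial correlation.

```r
# Hypothetical helper: find the initial (normal) correlation for one pair
# of marginal quantile functions q1 and q2.
find_cor_initial <- function(cor_target, q1, q2, n_obs = 10000, seed = 1,
                             tol = 0.001) {
  set.seed(seed)
  # Fixed standard normal draws, reused for every candidate correlation.
  w1 <- rnorm(n_obs)
  w2 <- rnorm(n_obs)

  # Estimated final correlation for a given initial correlation rho.
  cor_final <- function(rho) {
    z2 <- rho * w1 + sqrt(1 - rho^2) * w2
    cor(q1(pnorm(w1)), q2(pnorm(z2)))
  }

  # Solve cor_final(rho) = cor_target. uniroot raises an error if the
  # target is not bracketed, i.e. infeasible for these marginals.
  root <- uniroot(function(rho) cor_final(rho) - cor_target,
                  interval = c(-0.99, 0.99), tol = tol)
  root$root
}

q_normal <- function(x) qnorm(x, mean = 10, sd = 4)
q_expo   <- function(x) qexp(x, rate = 2)

rho_initial <- find_cor_initial(0.5, q_normal, q_expo)
rho_initial  # initial correlation; need not equal the target of 0.5
```

Reusing the same draws for every candidate value keeps the objective a deterministic function of rho, so a standard root finder can be applied. An infeasible target shows up here as a failure to bracket a root.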
References
Cario, M. C. and Nelson, B. L. (1997) Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Technical Report, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois.
Ghosh, S. and Henderson, S. G. (2003) Behavior of the NORTA method for correlated random vector generation as the dimension increases. ACM Transactions on Modeling and Computer Simulation.