Stores information necessary to simulate datasets based on the NORTA procedure (Cario and Nelson 1997).

Usage

simdesign_norta(
  cor_target_final = NULL,
  cor_initial = NULL,
  dist = list(),
  tol_initial = 0.001,
  n_obs_initial = 10000,
  seed_initial = 1,
  conv_norm_type = "O",
  method = "svd",
  name = "NORTA based simulation design",
  ...
)

Arguments

cor_target_final

Target correlation matrix for simulated datasets. At least one of cor_target_final or cor_initial must be specified.

cor_initial

Correlation matrix of the underlying multivariate standard normal distribution on which the final data is based. At least one of cor_target_final or cor_initial must be specified. If NULL, then cor_initial will be numerically optimized by simulation for the NORTA procedure using cor_target_final.

dist

List of functions specifying the marginal distributions of the simulated variables. Must have the same length as the specified correlation matrix (cor_target_final and/or cor_initial), and the order of the entries must correspond to the variables in the correlation matrix. See Details for the specification of the marginal distributions.

tol_initial

If cor_initial is numerically optimized, specifies the tolerance for the difference to the target correlation cor_target_final. Parameter passed to optimize_cor_for_pair.

n_obs_initial

If cor_initial is numerically optimized, specifies the number of draws in simulation during optimization used to estimate correlations. Parameter passed to optimize_cor_for_pair.

seed_initial

Seed used for draws of the initial distribution used during optimization to estimate correlations.

conv_norm_type

If cor_initial is numerically optimized and found not to be a proper correlation matrix (i.e. not positive-definite), specifies the metric used to find the nearest positive-definite correlation matrix. Parameter passed to Matrix::nearPD (conv.norm.type), see there for details.

method

method argument of mvtnorm::rmvnorm.

name

Character, optional name of the simulation design.

...

Further arguments are passed to the simdesign constructor.
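
To illustrate what conv_norm_type controls, the following sketch calls Matrix::nearPD directly on a symmetric matrix with unit diagonal that is not positive-definite. The matrix values are hypothetical, and this is not the package's internal code; it only shows the kind of repair step that the argument is forwarded to.

```r
library(Matrix)

# Hypothetical candidate matrix: symmetric, unit diagonal, but not positive-definite.
cor_candidate <- matrix(c( 1.0,  0.9, -0.9,
                           0.9,  1.0,  0.9,
                          -0.9,  0.9,  1.0), nrow = 3, byrow = TRUE)

# At least one eigenvalue is negative, so this is not a proper correlation matrix.
min_eigen_before <- min(eigen(cor_candidate, symmetric = TRUE)$values)

# Find the nearest correlation matrix. conv.norm.type sets the norm type
# used in the convergence check of the underlying algorithm.
fixed <- nearPD(cor_candidate, corr = TRUE, conv.norm.type = "O")
cor_fixed <- as.matrix(fixed$mat)

min_eigen_after <- min(eigen(cor_fixed, symmetric = TRUE)$values)
```

After the repair, cor_fixed keeps a unit diagonal, but its off-diagonal entries are shrunk relative to cor_candidate, so the achievable correlations differ from the ones originally requested.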

Value

List object with class attribute "simdesign_norta" (S3 class), inheriting from "simdesign". It contains the same entries as a simdesign object but in addition the following entries:

cor_target_final
cor_initial

Initial correlation matrix of multivariate normal distribution

dist
tol_initial
n_obs_initial
conv_norm_type
method

Details

This S3 class implements a simulation design based on the NORmal-To-Anything (NORTA) procedure by Cario and Nelson (1997). See the corresponding NORTA vignette for usage examples showing how to approximate real datasets.

Data Generation

Data will be generated using the following procedure:

  1. An underlying data matrix Z is sampled from a multivariate standard Normal distribution with correlation structure given by cor_initial.

  2. Z is then transformed into a dataset X by applying the functions given in dist to the columns of Z. The resulting dataset X will then have the desired marginal distributions, and approximate the target correlation cor_target_final, if specified.

  3. X is further transformed by the transformation transform_initial (note that this may affect the correlation of the final dataset and is not respected by the optimization procedure), and post-processed if specified.
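
Steps 1 and 2 can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: it draws Z via a Cholesky factor rather than mvtnorm::rmvnorm, and it assumes, as in the standard NORTA procedure, that each column of Z is first mapped to (0, 1) with the standard normal CDF before the quantile function is applied. The marginals and the correlation value are hypothetical.

```r
set.seed(1)
n_obs <- 5000

# Correlation of the underlying multivariate standard normal (role of cor_initial).
cor_initial <- matrix(c(1.0, 0.6,
                        0.6, 1.0), nrow = 2, byrow = TRUE)

# Quantile functions for the two marginals (role of dist).
dist <- list(
  function(p) qexp(p, rate = 2),
  function(p) qbinom(p, size = 1, prob = 0.3)
)

# Step 1: draw Z with correlation cor_initial using the upper Cholesky factor.
z_indep <- matrix(rnorm(n_obs * 2), nrow = n_obs, ncol = 2)
z <- z_indep %*% chol(cor_initial)

# Step 2: map each column to (0, 1) via pnorm, then apply its quantile function.
u <- pnorm(z)
x <- sapply(seq_along(dist), function(j) dist[[j]](u[, j]))

# Marginals follow dist; the correlation of x is typically weaker than 0.6.
colMeans(x)
cor(x)
```

The gap between cor(x) and cor_initial is the reason cor_initial has to be tuned when a specific cor_target_final is requested.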

Marginal distributions

A list of functions dist is used to define the marginal distributions of the variables. Each entry must be a quantile function, i.e. a function that maps [0, 1] to the domain of a probability distribution. Each entry must take a single input vector and return a single numeric vector. Examples of acceptable entries include all standard quantile functions implemented in R (e.g. qnorm, qbinom, ...), user-defined functions wrapping these (e.g. function(x) qnorm(x, mean = 10, sd = 4)), or empirical quantile functions. The helper function quantile_functions_from_data can be used to automatically estimate empirical quantile functions from given data in order to reproduce it using the NORTA approach. See the example in the NORTA vignette of this package for workflow details.
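
The three kinds of entries can be sketched as below. The empirical quantile function here is a hand-rolled illustration based on stats::quantile with hypothetical data; it is not the output of quantile_functions_from_data, whose estimation method may differ.

```r
# 1. A standard quantile function used directly (standard normal).
q_std_normal <- qnorm

# 2. A user-defined wrapper that fixes the distribution parameters.
q_normal_10_4 <- function(p) qnorm(p, mean = 10, sd = 4)

# 3. An empirical quantile function built from observed data (hypothetical values).
observed <- c(2.1, 3.4, 3.9, 4.4, 5.0, 6.2, 7.7, 9.1)
q_empirical <- function(p) as.numeric(quantile(observed, probs = p, type = 7))

dist <- list(q_std_normal, q_normal_10_4, q_empirical)

# Each entry takes one numeric vector of probabilities and
# returns one numeric vector of the same length.
p <- c(0.1, 0.5, 0.9)
lapply(dist, function(q) q(p))
```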

Target correlations

Not every valid correlation matrix (i.e. symmetric, positive-definite matrix with elements in [-1, 1] and unit diagonal) for a given number of variables is feasible for the desired marginal distributions (see e.g. Ghosh and Henderson 2003). Therefore, if cor_target_final is specified as target correlation, this class optimizes cor_initial such that the correlation of the final simulated dataset approximates cor_target_final. However, the resulting correlation may still differ if cor_target_final is infeasible for the given specification, or if the NORTA procedure cannot exactly reproduce the target correlation. In general, however, approximations should be acceptable if target correlations and marginal structures are derived from real datasets. See e.g. Ghosh and Henderson (2003) for the motivation why this works.
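
The idea of tuning the initial correlation for a single pair of variables can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's optimize_cor_for_pair: it uses hypothetical marginals, fixes the normal draws so that the achieved correlation is a deterministic function of the initial correlation, and applies a simple root search with uniroot.

```r
set.seed(1)
n_obs <- 20000

# Fixed standard normal draws, reused for every candidate correlation.
z1 <- rnorm(n_obs)
z2 <- rnorm(n_obs)

# Hypothetical marginals for the pair.
q1 <- function(p) qexp(p, rate = 1)
q2 <- function(p) qgamma(p, shape = 2, rate = 1)

# Correlation of the transformed pair when the underlying normals
# have correlation rho.
cor_final <- function(rho) {
  z2_cor <- rho * z1 + sqrt(1 - rho^2) * z2
  cor(q1(pnorm(z1)), q2(pnorm(z2_cor)))
}

target <- 0.5

# Search for the initial correlation whose transformed correlation hits the target.
fit <- uniroot(function(rho) cor_final(rho) - target,
               interval = c(-0.99, 0.99), tol = 1e-4)
rho_initial <- fit$root

c(initial = rho_initial, achieved = cor_final(rho_initial))
```

If the target lies outside the range of correlations attainable for the chosen marginals, the search interval contains no root and uniroot stops with an error, which mirrors the infeasibility issue described above.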

References

Cario, M. C. and Nelson, B. L. (1997) Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Technical Report, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois.

Ghosh, S. and Henderson, S. G. (2003) Behavior of the NORTA method for correlated random vector generation as the dimension increases. ACM Transactions on Modeling and Computer Simulation.