
sim_data() generates a simulated dataset D = L + S + Z for experimentation with Principal Component Pursuit (PCP) algorithms.

Usage

sim_data(
  n = 100,
  p = 10,
  r = 3,
  sparse_nonzero_idxs = NULL,
  sigma = 0.05,
  seed = 42
)

Arguments

n, p

(Optional) A pair of integers specifying the simulated dataset's number of observations n (rows) and number of variables p (columns). By default, n = 100 and p = 10.

r

(Optional) An integer specifying the rank of the simulated dataset's low-rank component. Intuitively, this is the number of latent patterns governing the simulated dataset. Must satisfy r <= min(n, p). By default, r = 3.

sparse_nonzero_idxs

(Optional) An integer vector with length(sparse_nonzero_idxs) <= n * p specifying the indices of the non-zero elements in the sparse component. By default, sparse_nonzero_idxs = NULL, in which case it is defined to be the vector seq(1, n * p, n + 1) (placing sparse noise along the diagonal of the simulated dataset).
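The default index vector relies on R's column-major linear indexing of matrices. The following is a minimal base-R sketch, using small hypothetical dimensions (n = 5, p = 3) chosen only for illustration, of how seq(1, n * p, n + 1) lands on the diagonal. It does not call sim_data() itself.

```r
# Hypothetical small dimensions, chosen only for illustration.
n <- 5
p <- 3

# The default index vector described above.
idxs <- seq(1, n * p, n + 1)

# R stores matrices column by column, so linear index k maps to
# row ((k - 1) %% n) + 1 and column ((k - 1) %/% n) + 1.
S <- matrix(0, n, p)
S[idxs] <- 1

# Row and column positions of the non-zero entries.
which(S != 0, arr.ind = TRUE)
```

Stepping by n + 1 moves one column to the right and one row down at each step, which is why the entries fall on the diagonal. If p is larger than n, the same formula taken literally keeps stepping past the last diagonal entry, so it may be worth inspecting the positions with which(..., arr.ind = TRUE) in that case.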

sigma

(Optional) A double specifying the standard deviation of the dense (Gaussian) noise component Z. By default, sigma = 0.05.

seed

(Optional) An integer specifying the seed for random number generation. By default, seed = 42.

Value

A list containing:

  • D: The observed data matrix, where D = L + S + Z.

  • L: The ground truth rank-r low-rank matrix.

  • S: The ground truth sparse matrix.

  • Z: The ground truth dense (Gaussian) noise matrix.

Details

The data is simulated as follows:

L <- matrix(runif(n * r), n, r) %*% matrix(runif(r * p), r, p)

S <- matrix(0, n, p)

S[sparse_nonzero_idxs] <- 1

Z <- matrix(rnorm(n * p, sd = sigma), n, p)

D <- L + S + Z
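The steps above can be run as a standalone base-R sketch using the documented default argument values. The set.seed() call and the numeric_rank() helper are assumptions added for illustration: how sim_data() applies its seed internally is not stated here, and numeric_rank() is a hypothetical stand-in rather than the package's matrix_rank(). The values produced are therefore not guaranteed to match the output of sim_data().

```r
# Assumption: seeding once up front; sim_data() may seed differently.
set.seed(42)

# Documented default argument values.
n <- 100
p <- 10
r <- 3
sigma <- 0.05
sparse_nonzero_idxs <- seq(1, n * p, n + 1)

# Low-rank component: product of an n x r and an r x p uniform matrix.
L <- matrix(runif(n * r), n, r) %*% matrix(runif(r * p), r, p)

# Sparse component: ones at the chosen linear indices, zeros elsewhere.
S <- matrix(0, n, p)
S[sparse_nonzero_idxs] <- 1

# Dense Gaussian noise with standard deviation sigma.
Z <- matrix(rnorm(n * p, sd = sigma), n, p)

# Observed data matrix.
D <- L + S + Z

# Hypothetical helper: count singular values above a relative tolerance.
numeric_rank <- function(M, tol = 1e-8) {
  d <- svd(M)$d
  sum(d > tol * max(d))
}

numeric_rank(L)  # rank of the low-rank component
numeric_rank(D)  # rank of the observed matrix
```

Because L is the product of an n x r and an r x p matrix, its rank cannot exceed r, while the dense noise in Z generally makes D full rank. This is the same contrast between data$L and data$D shown in the Examples.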

Examples

# rank 3 example
data <- sim_data()
matrix_rank(data$D)
#> [1] 10
matrix_rank(data$L)
#> [1] 3
# rank 7 example
data <- sim_data(n = 1000, p = 25, r = 7)
matrix_rank(data$D)
#> [1] 25
matrix_rank(data$L)
#> [1] 7