sim_data() generates a simulated dataset D = L + S + Z for
experimentation with Principal Component Pursuit (PCP) algorithms.
Arguments
- n, p
(Optional) A pair of integers specifying the simulated dataset's number of observations n (rows) and variables p (columns). By default, n = 100 and p = 10.
- r
(Optional) An integer specifying the rank of the simulated dataset's low-rank component. Intuitively, the number of latent patterns governing the simulated dataset. Must satisfy r <= min(n, p). By default, r = 3.
- sparse_nonzero_idxs
(Optional) An integer vector with length(sparse_nonzero_idxs) <= n * p specifying the indices of the non-zero elements in the sparse component. By default, sparse_nonzero_idxs = NULL, in which case it is defined to be the vector seq(1, n * p, n + 1) (placing sparse noise along the diagonal of the simulated dataset).
- sigma
(Optional) A double specifying the standard deviation of the dense (Gaussian) noise component Z. By default, sigma = 0.05.
- seed
(Optional) An integer specifying the seed for random number generation. By default, seed = 42.
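The default for sparse_nonzero_idxs relies on R's column-major linear indexing: stepping through a matrix with n rows in increments of n + 1 lands on the diagonal. The following base R sketch illustrates this on a small matrix and shows one way to build a custom index vector. The sizes and the names idxs, pos, and custom are illustrative assumptions, not part of the package.

```r
# Small illustrative sizes (assumed for the example).
n <- 5
p <- 3

# Default rule from the documentation: every (n + 1)-th linear index.
idxs <- seq(1, n * p, n + 1)   # 1, 7, 13

# Convert linear indices to (row, column) pairs to confirm they are diagonal.
S <- matrix(0, n, p)
S[idxs] <- 1
pos <- arrayInd(idxs, dim(S))  # rows (1,1), (2,2), (3,3)

# A custom alternative: four randomly chosen entries instead of the diagonal.
set.seed(1)
custom <- sample(n * p, size = 4)
S_custom <- matrix(0, n, p)
S_custom[custom] <- 1
```

A vector such as custom could then be supplied as the sparse_nonzero_idxs argument, provided its length does not exceed n * p.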
Value
A list containing:
- D: The observed data matrix, where D = L + S + Z.
- L: The ground truth rank-r low-rank matrix.
- S: The ground truth sparse matrix.
- Z: The ground truth dense (Gaussian) noise matrix.
Details
The data is simulated as follows:
L <- matrix(runif(n * r), n, r) %*% matrix(runif(r * p), r, p)
S <- matrix(0, n, p)
S[sparse_nonzero_idxs] <- 1
Z <- matrix(rnorm(n * p, sd = sigma), n, p)
D <- L + S + Z
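For readers who want to experiment without the package, the steps above can be wrapped in a small base R function. This is only a sketch under stated assumptions: the name sim_data_sketch is hypothetical, and the placement of set.seed() (once, before any random draws) and the argument check are guesses about how the documented seed and rank constraint might be applied, not the package's actual source.

```r
# Hypothetical helper (not part of the package) mirroring the documented steps.
sim_data_sketch <- function(n = 100, p = 10, r = 3,
                            sparse_nonzero_idxs = NULL,
                            sigma = 0.05, seed = 42) {
  # Documented constraint on the rank of the low-rank component.
  stopifnot(r <= min(n, p))

  # Documented default: sparse entries along the diagonal.
  if (is.null(sparse_nonzero_idxs)) {
    sparse_nonzero_idxs <- seq(1, n * p, n + 1)
  }

  # Assumption: the seed is set once, before any random draws.
  set.seed(seed)

  # Low-rank component: product of an n x r and an r x p uniform matrix.
  L <- matrix(runif(n * r), n, r) %*% matrix(runif(r * p), r, p)

  # Sparse component: ones at the requested linear indices.
  S <- matrix(0, n, p)
  S[sparse_nonzero_idxs] <- 1

  # Dense Gaussian noise component.
  Z <- matrix(rnorm(n * p, sd = sigma), n, p)

  list(D = L + S + Z, L = L, S = S, Z = Z)
}

out <- sim_data_sketch()
dim(out$D)       # dimensions of the observed matrix
qr(out$L)$rank   # numerical rank of the low-rank component via base R's qr()
```

Because the order of random draws here is an assumption, the values produced may not match those of sim_data() for the same seed.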
Examples
# rank 3 example
data <- sim_data()
matrix_rank(data$D)
#> [1] 10
matrix_rank(data$L)
#> [1] 3
# rank 7 example
data <- sim_data(n = 1000, p = 25, r = 7)
matrix_rank(data$D)
#> [1] 25
matrix_rank(data$L)
#> [1] 7
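The examples report full rank for D but rank r for L. A quick way to see why, using only base R, is to compare singular values: L has only r singular values above numerical precision, while the noise Z lifts every singular value of D well above it. The sketch below regenerates the components with the formulas from Details; the tolerance rule is a common convention assumed here and is not necessarily what matrix_rank() uses.

```r
set.seed(42)
n <- 100; p <- 10; r <- 3; sigma <- 0.05

# Components built as described in Details.
L <- matrix(runif(n * r), n, r) %*% matrix(runif(r * p), r, p)
S <- matrix(0, n, p)
S[seq(1, n * p, n + 1)] <- 1
Z <- matrix(rnorm(n * p, sd = sigma), n, p)
D <- L + S + Z

# Count singular values above a relative tolerance (assumed convention).
numerical_rank <- function(M) {
  sv <- svd(M)$d
  tol <- max(dim(M)) * max(sv) * .Machine$double.eps
  sum(sv > tol)
}

rank_L <- numerical_rank(L)   # only r singular values are non-negligible
rank_D <- numerical_rank(D)   # noise makes all min(n, p) of them non-negligible
```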
