sim_data() generates a simulated dataset D = L + S + Z for experimentation with Principal Component Pursuit (PCP) algorithms.
Arguments
- n, p
(Optional) A pair of integers specifying the simulated dataset's number of n observations (rows) and p variables (columns). By default, n = 100 and p = 10.
- r
(Optional) An integer specifying the rank of the simulated dataset's low-rank component. Intuitively, this is the number of latent patterns governing the simulated dataset. Must satisfy r <= min(n, p). By default, r = 3.
- sparse_nonzero_idxs
(Optional) An integer vector with length(sparse_nonzero_idxs) <= n * p specifying the indices of the non-zero elements in the sparse component. By default, sparse_nonzero_idxs = NULL, in which case it is defined to be the vector seq(1, n * p, n + 1) (placing sparse noise along the diagonal of the simulated dataset).
- sigma
(Optional) A double specifying the standard deviation of the dense (Gaussian) noise component Z. By default, sigma = 0.05.
- seed
(Optional) An integer specifying the seed for random number generation. By default, seed = 42.
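The default index vector can be checked in base R. The following is a minimal sketch, assuming small illustrative dimensions (n = 5, p = 3) that are not taken from the documentation above. Because R stores matrices in column-major order, stepping through linear indices by n + 1 moves one row down and one column across at each step, which lands on the diagonal when n >= p.

```r
# Illustrative dimensions (an assumption for this sketch, with n >= p).
n <- 5
p <- 3

# Default index vector as described for sparse_nonzero_idxs = NULL.
idxs <- seq(1, n * p, n + 1)

# Place ones at those linear (column-major) indices.
S <- matrix(0, n, p)
S[idxs] <- 1

# Recover the (row, col) position of every non-zero entry.
pos <- which(S != 0, arr.ind = TRUE)
print(pos)
```

Each non-zero entry has equal row and column indices, so the sparse component sits on the diagonal of the n-by-p matrix.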
Value
A list containing:
- D: The observed data matrix, where D = L + S + Z.
- L: The ground truth rank-r low-rank matrix.
- S: The ground truth sparse matrix.
- Z: The ground truth dense (Gaussian) noise matrix.
Details
The data is simulated as follows:
L <- matrix(runif(n * r), n, r) %*% matrix(runif(r * p), r, p)
S <- matrix(0, n, p)
S[sparse_nonzero_idxs] <- 1
Z <- matrix(rnorm(n * p, sd = sigma), n, p)
D <- L + S + Z
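The steps above can be wrapped into a self-contained base R function for experimentation. This is only a sketch under stated assumptions, not the package's source: the names sim_data_sketch and numeric_rank are hypothetical, and calling set.seed(seed) once before any random draws is an assumption about how the seed argument is applied.

```r
# Hypothetical re-implementation for illustration; not the package's own code.
sim_data_sketch <- function(n = 100, p = 10, r = 3,
                            sparse_nonzero_idxs = NULL,
                            sigma = 0.05, seed = 42) {
  stopifnot(r <= min(n, p))
  # Assumption: the seed is set once, before any random draws.
  set.seed(seed)
  # Default: linear indices that step along the diagonal.
  if (is.null(sparse_nonzero_idxs)) {
    sparse_nonzero_idxs <- seq(1, n * p, n + 1)
  }
  # Low-rank component: product of an n-by-r and an r-by-p uniform matrix.
  L <- matrix(runif(n * r), n, r) %*% matrix(runif(r * p), r, p)
  # Sparse component: ones at the requested indices, zeros elsewhere.
  S <- matrix(0, n, p)
  S[sparse_nonzero_idxs] <- 1
  # Dense Gaussian noise with standard deviation sigma.
  Z <- matrix(rnorm(n * p, sd = sigma), n, p)
  list(D = L + S + Z, L = L, S = S, Z = Z)
}

# Hypothetical helper: numerical rank as the count of singular values above tol.
numeric_rank <- function(X, tol = 1e-8) sum(svd(X)$d > tol)

sim <- sim_data_sketch()
numeric_rank(sim$L)  # rank of the low-rank component
numeric_rank(sim$D)  # rank of the observed matrix
```

Adding the dense noise Z makes the observed matrix D full rank even though L has rank r, which is why a PCP algorithm is needed to recover L from D.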
Examples
# rank 3 example
data <- sim_data()
matrix_rank(data$D)
#> [1] 10
matrix_rank(data$L)
#> [1] 3
# rank 7 example
data <- sim_data(n = 1000, p = 25, r = 7)
matrix_rank(data$D)
#> [1] 25
matrix_rank(data$L)
#> [1] 7