sim_na() corrupts a given data matrix D such that a random perc percent of its entries are set to be missing (set to NA). Used by grid_search_cv() in constructing test matrices for PCP models. Can be used for experimentation with PCP models.
Note: only observed values can be corrupted as NA. This means that if a matrix D already has e.g. 20% of its values missing, then sim_na(D, perc = 0.2) would result in a matrix with 40% of its values missing.
Should e.g. perc = 0.6 be passed as input when D has only e.g. 10% of its entries left as observed, then all remaining corruptible entries will be set to NA.
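The arithmetic in the note above can be sketched in a few lines of base R. The helper missing_share() is hypothetical and is not part of the package; it assumes that perc is taken as a share of all entries of D (which is what the 20% to 40% example implies) and that the resulting count is rounded to a whole number of entries, which is an assumption.

```r
# Hypothetical helper, not part of the package: expected share of missing
# entries after corruption, assuming perc is a share of ALL entries and
# only observed entries can be set to NA.
missing_share <- function(n_total, n_missing, perc) {
  n_observed <- n_total - n_missing
  # Cannot corrupt more entries than are still observed
  n_new <- min(round(perc * n_total), n_observed)
  (n_missing + n_new) / n_total
}

# 25 entries, 5 (20%) already missing, perc = 0.2
missing_share(25, 5, 0.2)
#> [1] 0.4

# 20 entries, only 2 (10%) still observed, perc = 0.6
missing_share(20, 18, 0.6)
#> [1] 1
```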
Value
A list containing:
D_tilde: The original matrix D with a random perc percent of its entries set to NA.
tilde_mask: A binary matrix of dim(D) specifying the locations of corrupted entries (1) and uncorrupted entries (0).
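To make the relationship between the two list elements concrete, here is a minimal base-R stand-in that mimics the behaviour described on this page. It is a sketch only, not the package's source: the name sim_na_sketch(), the seed argument, and the rounding of the corruption count are all assumptions made for illustration.

```r
# Hypothetical stand-in for the documented behaviour; NOT the package's code.
sim_na_sketch <- function(D, perc, seed = 42) {
  set.seed(seed)                               # assumed: reproducible draws
  n_target <- round(perc * length(D))          # assumed: share of ALL entries
  observed <- which(!is.na(D))                 # only observed entries qualify
  n_corrupt <- min(n_target, length(observed)) # cap at what is left to corrupt
  idx <- observed[sample.int(length(observed), n_corrupt)]

  tilde_mask <- matrix(0, nrow(D), ncol(D))    # 1 = corrupted, 0 = untouched
  tilde_mask[idx] <- 1

  D_tilde <- D
  D_tilde[idx] <- NA
  list(D_tilde = D_tilde, tilde_mask = tilde_mask)
}

D <- matrix(1:25, 5, 5)
out <- sim_na_sketch(D, perc = 0.2)

# Every masked entry is NA in D_tilde, and unmasked entries are unchanged
all(is.na(out$D_tilde[out$tilde_mask == 1]))
identical(D[out$tilde_mask == 0], out$D_tilde[out$tilde_mask == 0])

# The mask recovers the held-out ground truth from the original matrix,
# e.g. to score a model's reconstruction on the corrupted entries
held_out <- D[out$tilde_mask == 1]
length(held_out)
```

Because tilde_mask records only the newly corrupted entries, it stays usable even when D already contained missing values before the call.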
Examples
# Simple example corrupting 20% of a 5x5 matrix
D <- matrix(1:25, 5, 5)
corrupted_data <- sim_na(D, perc = 0.2)
corrupted_data$D_tilde
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]   NA    6   11   16   21
#> [2,]    2    7   12   NA   22
#> [3,]    3    8   13   18   23
#> [4,]   NA    9   14   19   24
#> [5,]   NA   NA   15   20   25
sum(is.na(corrupted_data$D_tilde)) / prod(dim(corrupted_data$D_tilde))
#> [1] 0.2
# Now corrupting another 20% on top of the original 20%
double_corrupted <- sim_na(corrupted_data$D_tilde, perc = 0.2)
double_corrupted$D_tilde
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]   NA    6   11   16   21
#> [2,]   NA   NA   12   NA   NA
#> [3,]    3   NA   13   18   23
#> [4,]   NA    9   NA   19   24
#> [5,]   NA   NA   15   20   25
sum(is.na(double_corrupted$D_tilde)) / prod(dim(double_corrupted$D_tilde))
#> [1] 0.4
# Corrupting the remaining entries by passing in a large value for perc
all_corrupted <- sim_na(double_corrupted$D_tilde, perc = 1)
all_corrupted$D_tilde
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]   NA   NA   NA   NA   NA
#> [2,]   NA   NA   NA   NA   NA
#> [3,]   NA   NA   NA   NA   NA
#> [4,]   NA   NA   NA   NA   NA
#> [5,]   NA   NA   NA   NA   NA