sim_na() corrupts a given data matrix D by setting a randomly chosen proportion perc of its entries to missing (NA). It is used by grid_search_cv() to construct test matrices for PCP models, and can also be used for experimentation with PCP models.

Note: only observed values can be corrupted as NA, and perc is measured against all entries of D. This means that if a matrix D already has e.g. 20% of its values missing, then sim_na(D, perc = 0.2) results in a matrix with 40% of its values missing.

If perc exceeds the fraction of entries that are still observed (e.g. perc = 0.6 is passed when only 10% of the entries in D are observed), then all remaining corruptible entries are set to NA.
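The behaviour described above can be illustrated with a minimal sketch. This is not the package's implementation: sim_na_sketch() is a hypothetical name, and the use of round() to turn perc into a count and of set.seed() inside the function are assumptions made for illustration only.

```r
# Hypothetical sketch of the documented behaviour; NOT the pcpr source code.
sim_na_sketch <- function(D, perc, seed = 42) {
  stopifnot(is.matrix(D), perc >= 0, perc <= 1)
  set.seed(seed)                            # assumption: seed is applied via set.seed()
  n_target <- round(perc * length(D))       # count is relative to ALL entries of D
  observed <- which(!is.na(D))              # only observed entries can be corrupted
  n_corrupt <- min(n_target, length(observed))  # cap at what is left to corrupt
  # sample.int() avoids the sample(x) surprise when only one entry is observed
  idx <- observed[sample.int(length(observed), n_corrupt)]
  tilde_mask <- matrix(0, nrow(D), ncol(D))
  tilde_mask[idx] <- 1                      # 1 marks a newly corrupted entry
  D_tilde <- D
  D_tilde[idx] <- NA
  list(D_tilde = D_tilde, tilde_mask = tilde_mask)
}

D <- matrix(1:25, 5, 5)
once <- sim_na_sketch(D, perc = 0.2)              # 5 of 25 entries become NA
twice <- sim_na_sketch(once$D_tilde, perc = 0.2)  # 5 more observed entries become NA
rest <- sim_na_sketch(twice$D_tilde, perc = 1)    # capped: the 15 remaining entries become NA
```

Because the sketch calls set.seed(), it resets the global random number stream as a side effect; whether sim_na() itself does the same is not stated here.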

Usage

sim_na(D, perc, seed = 42)

Arguments

D

The input data matrix.

perc

A double in the range [0, 1] specifying the proportion of entries in D to corrupt as missing (NA).

seed

(Optional) An integer specifying the seed for the random selection of entries in D to corrupt as missing (NA). By default, seed = 42.

Value

A list containing:

  • D_tilde: The original matrix D with a random proportion perc of its entries set to NA.

  • tilde_mask: A binary matrix of dim(D) specifying the locations of corrupted entries (1) and uncorrupted entries (0).
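One way the two returned elements might be used together is to score a model only on the entries that were held out. The sketch below builds toy stand-ins for D_tilde and tilde_mask by hand, so it runs without the package. D_hat is a hypothetical model estimate, and the relative error shown is an illustrative choice, not necessarily the metric grid_search_cv() uses.

```r
# Toy stand-ins with the documented shapes (built by hand, not by sim_na())
D <- matrix(as.numeric(1:6), nrow = 2)
tilde_mask <- matrix(c(1, 0, 0, 0, 1, 0), nrow = 2)  # 1 = corrupted, 0 = untouched
D_tilde <- D
D_tilde[tilde_mask == 1] <- NA

# Hypothetical estimate of the full matrix, e.g. from a model fit on D_tilde
D_hat <- D + 0.5

# Compare truth and estimate on the held-out entries only
held_out <- tilde_mask == 1
err <- sqrt(sum((D[held_out] - D_hat[held_out])^2)) / sqrt(sum(D[held_out]^2))
```

Indexing with tilde_mask == 1 rather than is.na(D_tilde) keeps entries that were already missing in D out of the comparison.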

Examples

# Simple example corrupting 20% of a 5x5 matrix
D <- matrix(1:25, 5, 5)
corrupted_data <- sim_na(D, perc = 0.2)
corrupted_data$D_tilde
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]   NA    6   11   16   21
#> [2,]    2    7   12   NA   22
#> [3,]    3    8   13   18   23
#> [4,]   NA    9   14   19   24
#> [5,]   NA   NA   15   20   25
sum(is.na(corrupted_data$D_tilde)) / prod(dim(corrupted_data$D_tilde))
#> [1] 0.2
# Now corrupting another 20% on top of the original 20%
double_corrupted <- sim_na(corrupted_data$D_tilde, perc = 0.2)
double_corrupted$D_tilde
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]   NA    6   11   16   21
#> [2,]   NA   NA   12   NA   NA
#> [3,]    3   NA   13   18   23
#> [4,]   NA    9   NA   19   24
#> [5,]   NA   NA   15   20   25
sum(is.na(double_corrupted$D_tilde)) / prod(dim(double_corrupted$D_tilde))
#> [1] 0.4
# Corrupting the remaining entries by passing in a large value for perc
all_corrupted <- sim_na(double_corrupted$D_tilde, perc = 1)
all_corrupted$D_tilde
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]   NA   NA   NA   NA   NA
#> [2,]   NA   NA   NA   NA   NA
#> [3,]   NA   NA   NA   NA   NA
#> [4,]   NA   NA   NA   NA   NA
#> [5,]   NA   NA   NA   NA   NA