
grid_search_cv() conducts a Monte Carlo-style cross-validated grid search of PCP parameters for a given data matrix D, PCP function pcp_fn, and grid of parameter settings to search through (grid). The run time of the grid search can be sped up using bespoke parallelization settings. The call to grid_search_cv() can be wrapped in a call to progressr::with_progress() for progress bar updates. See the sections below for details.

Usage

grid_search_cv(
  D,
  pcp_fn,
  grid,
  ...,
  parallel_strategy = "sequential",
  num_workers = 1,
  perc_test = 0.05,
  num_runs = 100,
  return_all_tests = FALSE,
  verbose = TRUE
)

Arguments

D

The input data matrix (can contain NA values). Note that PCP will converge much more quickly when D has been standardized in some way (e.g. scaling columns by their standard deviations, or column-wise min-max normalization).

pcp_fn

The PCP function to use when grid searching. Must be either rrmc or root_pcp, passed as a bare function name without parentheses (i.e. pcp_fn = rrmc, not pcp_fn = rrmc()).

grid

A data.frame of dimension j by k containing the j-many unique settings of k-many parameters to try. NOTE: The columns of grid should be named after the required parameters in the function header of pcp_fn. For example, if pcp_fn = root_pcp and you want to search through lambda and mu, then names(grid) must be set to c("lambda", "mu"). If instead you want to keep e.g. lambda fixed and search through only mu, you can either have a grid with only one column, mu, and pass lambda as a constant via ..., or you can have names(grid) set to c("lambda", "mu") where lambda is constant. The same logic applies for pcp_fn = rrmc and eta and r.
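
As an illustrative sketch, a grid for pcp_fn = root_pcp could be built with base R's expand.grid() (the parameter values below are placeholders, not recommendations):

# Search over both lambda and mu:
grid <- expand.grid(
  lambda = c(0.01, 0.05, 0.1),
  mu = c(1, 5, 10)
)

# Or keep lambda fixed and search over mu only, passing lambda as a
# constant through `...` in the grid_search_cv() call:
grid_mu_only <- data.frame(mu = c(1, 5, 10))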

...

Any parameters required by pcp_fn that should be kept constant throughout the grid search, or those parameters that cannot be stored in grid (e.g. the LOD parameter). A parameter should not be passed with ... if it is already a column in grid, as that behavior is ambiguous.

parallel_strategy

(Optional) The parallelization strategy used when conducting the grid search (to be passed on to the future::plan() function). Must be one of: "sequential", "multisession", "multicore" or "cluster". By default, parallel_strategy = "sequential", which runs the grid search in serial and ignores the num_workers argument. The option parallel_strategy = "multisession" parallelizes the search via sockets in separate R sessions. The option parallel_strategy = "multicore" is not supported on Windows machines, nor in .Rmd files (it must be run in a .R script), but parallelizes the search much faster than "multisession" since it runs separate forked R processes. The option parallel_strategy = "cluster" parallelizes using separate R sessions, typically running on one or more machines. Support for other parallel strategies will be added in a future release of pcpr. It is recommended to use parallel_strategy = "multicore" or "multisession" when possible.

num_workers

(Optional) An integer specifying the number of workers to use when parallelizing the grid search, to be passed on to future::plan(). By default, num_workers = 1. When possible, it is recommended to use num_workers = parallel::detectCores(logical = F), which computes the number of physical CPUs available on the machine (see parallel::detectCores()). num_workers is ignored when parallel_strategy = "sequential", and must be > 1 otherwise.
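
For example, a parallelized search might be configured as in the following sketch, where D and the eta values are illustrative stand-ins:

# Multisession grid search using all physical cores
gs <- grid_search_cv(
  D,
  rrmc,
  grid = data.frame(eta = c(0.1, 0.35)),
  r = 3,
  parallel_strategy = "multisession",
  num_workers = parallel::detectCores(logical = FALSE)
)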

perc_test

(Optional) The fraction of entries of D that will be randomly corrupted as NA missing values (the test set). Can be anything in the range [0, 1). By default, perc_test = 0.05. See the Best practices section for more details.

num_runs

(Optional) The number of times to test a given parameter setting. By default, num_runs = 100. See the Best practices section for more details.

return_all_tests

(Optional) A logical indicating if you would like the output from all the calls made to pcp_fn over the course of the grid search to be returned to you in list format. If set to FALSE, then only statistics on the parameters tested will be returned. If set to TRUE, then every L and S matrix recovered during the grid search will be returned in the lists L_mats and S_mats, every test set matrix will be returned in the list test_mats, the original input matrix will be returned as original_mat, and the parameters passed in to ... will be returned in the constant_params list. By default, return_all_tests = FALSE, which is highly recommended: setting return_all_tests = TRUE can consume a massive amount of memory depending on the size of grid, the input matrix D, and the value of num_runs.

verbose

(Optional) A logical indicating if you would like verbose output displayed or not. By default, verbose = TRUE. To obtain progress bar updates, the user must wrap the grid_search_cv() call with a call to progressr::with_progress(). The progress bar does not depend on the value passed for verbose.
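
For instance, a minimal sketch of the progressr wrapping pattern described above (with illustrative inputs):

# Progress bar updates for the grid search
progressr::with_progress({
  gs <- grid_search_cv(D, rrmc, data.frame(eta = 0.35), r = 3)
})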

Value

A list containing:

  • all_stats: A data.frame containing the statistics of every run comprising the grid search. These statistics include the parameter settings for the run, along with the run_num (used as the seed for the corruption step, step 1 in the grid search procedure), the relative error for the run rel_err, the rank of the recovered L matrix L_rank, the sparsity of the recovered S matrix S_sparsity, the number of iterations PCP took to reach convergence (for root_pcp() only), and the error status run_error of the PCP run (NA if no error, otherwise a character string).

  • summary_stats: A data.frame containing a summary of the information in all_stats. The summary is made by column-wise averaging the results in all_stats.

  • metadata: A character string containing the metadata associated with the grid search instance.

If return_all_tests = TRUE then the following are also returned as part of the list:

  • L_mats: A list containing all the L matrices returned from PCP throughout the grid search. Therefore, length(L_mats) == nrow(all_stats). Row i in all_stats corresponds to L_mats[[i]].

  • S_mats: A list containing all the S matrices returned from PCP throughout the grid search. Therefore, length(S_mats) == nrow(all_stats). Row i in all_stats corresponds to S_mats[[i]].

  • test_mats: A list of length num_runs containing all the corrupted test matrices (and their masks) used throughout the grid search. Note: all_stats$run_num[i] corresponds to test_mats[[i]].

  • original_mat: The original data matrix D.

  • constant_params: A copy of the constant parameters that were originally passed to the grid search (for record keeping).

The Monte Carlo-style cross-validation procedure

Each hyperparameter setting is cross-validated by the following steps (a code sketch of one run follows the list):

  1. Randomly corrupting perc_test percent of the entries in D as missing (i.e. NA values), yielding D_tilde. Done via sim_na().

  2. Running the PCP function pcp_fn on D_tilde, yielding estimates L and S.

  3. Recording the relative recovery error of L compared with the input data matrix D for only those values that were imputed as missing during the corruption step (step 1 above). Mathematically, calculate: \(||P_{\Omega^c}(D - L)||_F / ||P_{\Omega^c}(D)||_F\), where \(P_{\Omega^c}\) selects only those entries where is.na(D_tilde) == TRUE.

  4. Repeating steps 1-3 for a total of num_runs-many times, where each "run" has a unique random seed from 1 to num_runs associated with it.

  5. Performance statistics can then be calculated for each "run", and then summarized across all runs for average model performance statistics.
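
As a concrete illustration of steps 1-3, below is a minimal sketch of a single run. The manual masking stands in for sim_na(), and rrmc() is assumed to return its estimates in a list with L and S components (these are what the grid search harvests into L_mats and S_mats):

library(pcpr)

data <- sim_data()
D <- data$D

# Step 1: corrupt perc_test = 5% of entries as missing; the run number
# doubles as the random seed
set.seed(1)
test_mask <- matrix(runif(length(D)) < 0.05, nrow = nrow(D))
D_tilde <- D
D_tilde[test_mask] <- NA

# Step 2: run the PCP function on the corrupted matrix
pcp_out <- rrmc(D_tilde, r = 3, eta = 0.35)

# Step 3: relative recovery error restricted to the held-out entries
rel_err <- sqrt(sum((D[test_mask] - pcp_out$L[test_mask])^2)) /
  sqrt(sum(D[test_mask]^2))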

Best practices for perc_test and num_runs

Experimentally, this grid search procedure retrieves the best performing PCP parameter settings when perc_test is relatively low, e.g. perc_test = 0.05, or 5%, and num_runs is relatively high, e.g. num_runs = 100.

The larger perc_test is, the more the test set turns into a matrix completion problem, rather than the desired matrix decomposition problem. To better resemble the actual problem PCP will be faced with come inference time, perc_test should therefore be kept relatively low.

Choosing a reasonable value for num_runs is dependent on the need to keep perc_test relatively low. Ideally, a large enough num_runs is used so that many (if not all) of the entries in D are likely to eventually be tested. Note that since test set entries are chosen randomly for all runs 1 through num_runs, in the pathological worst-case scenario, the exact same test set could be drawn each time. In the best case, a different test set is obtained each run, providing balanced coverage of D. Viewed another way, the smaller num_runs is, the more the results are susceptible to overfitting to the relatively few selected test sets.
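
To make the coverage intuition concrete: if test entries were drawn independently on each run, the chance that any particular entry of D is never held out would be (1 - perc_test)^num_runs:

# With the defaults perc_test = 0.05 and num_runs = 100, a given entry
# of D escapes every test set with probability ~0.0059, i.e. it is
# tested at least once ~99.4% of the time (assuming independent draws)
(1 - 0.05)^100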

Interpretation of results

Once the grid search has been conducted, the optimal hyperparameters can be chosen by examining the output statistics in summary_stats. Below are a few suggestions for how to interpret the summary_stats table, followed by a short code sketch:

  • Generally speaking, the first thing a user will want to inspect is the rel_err statistic, capturing the relative discrepancy between recovered test sets and their original, observed (yet possibly noisy) values. Lower rel_err means the PCP model was better able to recover the held-out test set. So, in general, the best parameter settings are those with the lowest rel_err. Having said this, it is important to remember that this statistic should be taken with a grain of salt: Because no ground truth L matrix exists, the rel_err measurement is forced to rely on the comparison between the noisy observed data matrix D and the estimated low-rank model L. So the rel_err metric is an "apples to oranges" relative error. For data that is a priori expected to be subject to a high degree of noise, it may actually be better to discard parameter settings with suspiciously low rel_errs (in which case the solution may be hallucinating an inaccurate low-rank structure from the observed noise).

  • For grid searches using root_pcp() as the PCP model, parameters that fail to converge can be discarded. Generally, fewer root_pcp() iterations (num_iter) taken to reach convergence portend a more reliable / stable solution. In rare cases, the user may need to increase root_pcp()'s max_iter argument to reach convergence. rrmc() does not report convergence metadata, as its optimization scheme runs for a fixed number of iterations.

  • Parameter settings with unreasonable sparsity or rank measurements can also be discarded. Here, "unreasonable" means these reported metrics flagrantly contradict prior assumptions, knowledge, or work. For instance, most air pollution datasets contain a number of extreme exposure events, so PCP solutions returning sparse S models with 100% sparsity have obviously been regularized too heavily. Solutions with lower sparsities should be preferred. Note that reported sparsity and rank measurements are estimates heavily dependent on the thresh set by the sparsity() & matrix_rank() functions. E.g. it could be that the actual average matrix rank is much higher or lower when a threshold that better takes into account the relative scale of the singular values is used. Likewise for the sparsity estimations. Also, recall that the given value for perc_test artificially sets a sparsity floor, since those missing entries in the test set cannot be recovered in the S matrix. E.g. if perc_test = 0.05, then no parameter setting will have an estimated sparsity lower than 5%.
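
Putting these suggestions together, a hypothetical post-hoc filter of summary_stats might look like the following sketch (column names as documented in the Value section above):

# Drop settings whose runs errored or whose S estimates are fully
# sparse, then rank the remainder by relative error
ss <- gs$summary_stats
candidates <- ss[ss$run_error_perc == "0%" & ss$S_sparsity < 1, ]
candidates[order(candidates$rel_err), ]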

Examples

#### -------Simple simulated PCP problem-------####
# First we will simulate a simple dataset with the sim_data() function.
# The dataset will be a 100x10 matrix comprised of:
# 1. A rank-3 component as the ground truth L matrix;
# 2. A ground truth sparse component S w/outliers along the diagonal; and
# 3. A dense Gaussian noise component
data <- sim_data()
#### -------Tiny grid search-------####
# Here is a tiny grid search just to test the function quickly.
# In practice we would recommend a larger grid search.
# For examples of larger searches, see the vignettes.
gs <- grid_search_cv(
  data$D,
  rrmc,
  data.frame("eta" = 0.35),
  r = 3,
  num_runs = 2
)
#> 
#> Initializing grid search...
#> Beginning sequential grid search...
#> Start time: 2025-04-02 19:20:26.028764
#> 
#> Grid search completed at time: 2025-04-02 19:20:26.57631
#> Metrics calculations complete.
#> Grid search completed!
gs$summary_stats
#> # A tibble: 3 × 7
#>     eta     r rel_err L_rank S_sparsity iterations run_error_perc
#>   <dbl> <int>   <dbl>  <dbl>      <dbl>      <dbl> <chr>         
#> 1  0.35     3   0.113      3      0.990        NaN 0%            
#> 2  0.35     2   0.162      2      0.998        NaN 0%            
#> 3  0.35     1   0.172      1      1            NaN 0%
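
As a follow-up sketch, the best-performing setting by relative error can be pulled straight from the summary table (here, the eta = 0.35, r = 3 row):

gs$summary_stats[which.min(gs$summary_stats$rel_err), ]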