grid_search_cv() conducts a Monte Carlo style cross-validated grid search of PCP parameters for a given data matrix D, PCP function pcp_fn, and grid of parameter settings to search through grid. The run time of the grid search can be sped up using bespoke parallelization settings. The call to grid_search_cv() can be wrapped in a call to progressr::with_progress() for progress bar updates. See the sections below for details.
Usage
grid_search_cv(
  D,
  pcp_fn,
  grid,
  ...,
  parallel_strategy = "sequential",
  num_workers = 1,
  perc_test = 0.05,
  num_runs = 100,
  return_all_tests = FALSE,
  verbose = TRUE
)
Arguments
- D: The input data matrix (can contain NA values). Note that PCP will converge much more quickly when D has been standardized in some way (e.g. scaling columns by their standard deviations, or column-wise min-max normalization).
- pcp_fn: The PCP function to use when grid searching. Must be either rrmc or root_pcp (passed without the trailing parentheses).
- grid: A data.frame of dimension j by k containing the j-many unique settings of the k-many parameters to try. NOTE: The columns of grid should be named after the required parameters in the function header of pcp_fn. For example, if pcp_fn = root_pcp and you want to search through lambda and mu, then names(grid) must be set to c("lambda", "mu"). If instead you want to keep e.g. lambda fixed and search through only mu, you can either have a grid with only one column, mu, and pass lambda as a constant via ..., or you can have names(grid) set to c("lambda", "mu") where lambda is constant. The same logic applies for pcp_fn = rrmc and eta and r.
- ...: Any parameters required by pcp_fn that should be kept constant throughout the grid search, or those parameters that cannot be stored in grid (e.g. the LOD parameter). A parameter should not be passed with ... if it is already a column in grid, as that behavior is ambiguous.
- parallel_strategy: (Optional) The parallelization strategy used when conducting the grid search (passed on to the future::plan() function). Must be one of: "sequential", "multisession", "multicore" or "cluster". By default, parallel_strategy = "sequential", which runs the grid search in serial, and the num_workers argument is ignored. The option parallel_strategy = "multisession" parallelizes the search via sockets in separate R sessions. The option parallel_strategy = "multicore" is not supported on Windows machines, nor in .Rmd files (it must be run in a .R script), but parallelizes the search much faster than "multisession" since it runs separate forked R processes. The option parallel_strategy = "cluster" parallelizes using separate R sessions typically running on one or more machines. Support for other parallel strategies will be added in a future release of pcpr. It is recommended to use parallel_strategy = "multicore" or "multisession" when possible.
- num_workers: (Optional) An integer specifying the number of workers to use when parallelizing the grid search, passed on to future::plan(). By default, num_workers = 1. When possible, it is recommended to use num_workers = parallel::detectCores(logical = FALSE), which computes the number of physical CPUs available on the machine (see parallel::detectCores()). num_workers is ignored when parallel_strategy = "sequential", and must be > 1 otherwise.
- perc_test: (Optional) The fraction of entries of D that will be randomly corrupted as NA missing values (the test set). Can be anything in the range [0, 1). By default, perc_test = 0.05. See the Best practices section for more details.
- num_runs: (Optional) The number of times to test a given parameter setting. By default, num_runs = 100. See the Best practices section for more details.
- return_all_tests: (Optional) A logical indicating if you would like the output from all the calls made to pcp_fn over the course of the grid search to be returned to you in list format. If set to FALSE, then only statistics on the parameters tested will be returned. If set to TRUE, then every L and S matrix recovered during the grid search will be returned in the lists L_mats and S_mats, every test set matrix will be returned in the list test_mats, the original input matrix will be returned as original_mat, and the parameters passed in to ... will be returned in the constant_params list. By default, return_all_tests = FALSE, which is highly recommended. Setting return_all_tests = TRUE can consume a massive amount of memory depending on the size of grid, the input matrix D, and the value of num_runs.
- verbose: (Optional) A logical indicating if you would like verbose output displayed or not. By default, verbose = TRUE. To obtain progress bar updates, the user must wrap the grid_search_cv() call with a call to progressr::with_progress(). The progress bar does not depend on the value passed for verbose.
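As an illustration of the shape grid should take, a small search grid for rrmc over eta and r can be built with base R's expand.grid() (the specific candidate values below are arbitrary, for demonstration only):

```r
# Every combination of the candidate eta and r values; the column
# names must match the parameter names in rrmc's function header
grid <- expand.grid(eta = c(0.01, 0.1, 0.35), r = 1:4)

nrow(grid)   # 12 unique parameter settings to try
names(grid)  # "eta" "r"
```

expand.grid() returns a data.frame, so the result can be passed to grid_search_cv() directly.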
Value
A list containing:
- all_stats: A data.frame containing the statistics of every run comprising the grid search. These statistics include the parameter settings for the run, along with the run_num (used as the seed for the corruption step, step 1 in the grid search procedure), the relative error for the run rel_err, the rank of the recovered L matrix L_rank, the sparsity of the recovered S matrix S_sparsity, the number of iterations PCP took to reach convergence (for root_pcp() only), and the error status run_error of the PCP run (NA if no error, otherwise a character string).
- summary_stats: A data.frame containing a summary of the information in all_stats, made by column-wise averaging the results in all_stats.
- metadata: A character string containing the metadata associated with the grid search instance.
If return_all_tests = TRUE, then the following are also returned as part of the list:
- L_mats: A list containing all the L matrices returned from PCP throughout the grid search, so length(L_mats) == nrow(all_stats). Row i in all_stats corresponds to L_mats[[i]].
- S_mats: A list containing all the S matrices returned from PCP throughout the grid search, so length(S_mats) == nrow(all_stats). Row i in all_stats corresponds to S_mats[[i]].
- test_mats: A list of length num_runs containing all the corrupted test matrices (and their masks) used throughout the grid search. Note: all_stats$run[i] corresponds to test_mats[[i]].
- original_mat: The original data matrix D.
- constant_params: A copy of the constant parameters that were originally passed to the grid search (for record keeping).
The Monte Carlo style cross-validation procedure
Each hyperparameter setting is cross-validated by:
1. Randomly corrupting perc_test percent of the entries in D as missing (i.e. NA values), yielding D_tilde. Done via sim_na().
2. Running the PCP function pcp_fn on D_tilde, yielding estimates L and S.
3. Recording the relative recovery error of L compared with the input data matrix D for only those values that were imputed as missing during the corruption step (step 1 above). Mathematically, calculate: \(||P_{\Omega^c}(D - L)||_F / ||P_{\Omega^c}(D)||_F\), where \(P_{\Omega^c}\) selects only those entries where is.na(D_tilde) == TRUE.
4. Repeating steps 1-3 for a total of num_runs-many times, where each "run" has a unique random seed from 1 to num_runs associated with it.
Performance statistics can then be calculated for each run, and then summarized across all runs for average model performance statistics.
Best practices for perc_test and num_runs
Experimentally, this grid search procedure retrieves the best performing
PCP parameter settings when perc_test
is relatively low, e.g.
perc_test = 0.05
, or 5%, and num_runs
is relatively high, e.g.
num_runs = 100
.
The larger perc_test
is, the more the test set turns into a matrix
completion problem, rather than the desired matrix decomposition problem. To
better resemble the actual problem PCP will be faced with come inference
time, perc_test
should therefore be kept relatively low.
Choosing a reasonable value for num_runs
is dependent on the need to keep
perc_test
relatively low. Ideally, a large enough num_runs
is used so
that many (if not all) of the entries in D
are likely to eventually be
tested. Note that since test set entries are chosen randomly for all runs 1
through num_runs
, in the pathological worst-case scenario, the same exact
test set could be drawn each time. In the best case scenario, a different
test set is obtained each run, providing balanced coverage of D
. Viewed
another way, the smaller num_runs
is, the more the results are susceptible
to overfitting to the relatively few selected test sets.
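This trade-off can be checked directly: assuming each run draws its test set independently, the probability that any fixed entry of D is never held out across all runs is (1 - perc_test)^num_runs. With the recommended defaults, nearly every entry is tested at least once:

```r
perc_test <- 0.05
num_runs <- 100

# Probability a fixed entry of D escapes every one of the num_runs
# independently drawn test sets
p_never_tested <- (1 - perc_test)^num_runs
p_never_tested  # ~0.006, i.e. ~99.4% of entries are tested at least once
```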
Interpretation of results
Once the grid search has been conducted, the optimal hyperparameters can be chosen by examining the output statistics in summary_stats. Below are a few suggestions for how to interpret the summary_stats table:
- Generally speaking, the first thing a user will want to inspect is the rel_err statistic, capturing the relative discrepancy between recovered test sets and their original, observed (yet possibly noisy) values. Lower rel_err means the PCP model was better able to recover the held-out test set. So, in general, the best parameter settings are those with the lowest rel_err. Having said this, it is important to remember that this statistic should be taken with a grain of salt: because no ground truth L matrix exists, the rel_err measurement is forced to rely on the comparison between the noisy observed data matrix D and the estimated low-rank model L. So the rel_err metric is an "apples to oranges" relative error. For data that is a priori expected to be subject to a high degree of noise, it may actually be better to discard parameter settings with suspiciously low rel_errs (in which case the solution may be hallucinating an inaccurate low-rank structure from the observed noise).
- For grid searches using root_pcp() as the PCP model, parameters that fail to converge can be discarded. Generally, fewer root_pcp() iterations (num_iter) taken to reach convergence portend a more reliable / stable solution. In rare cases, the user may need to increase root_pcp()'s max_iter argument to reach convergence. rrmc() does not report convergence metadata, as its optimization scheme runs for a fixed number of iterations.
- Parameter settings with unreasonable sparsity or rank measurements can also be discarded. Here, "unreasonable" means these reported metrics flagrantly contradict prior assumptions, knowledge, or work. For instance, most air pollution datasets contain a number of extreme exposure events, so PCP solutions returning sparse S models with 100% sparsity have obviously been regularized too heavily. Solutions with lower sparsities should be preferred. Note that reported sparsity and rank measurements are estimates heavily dependent on the thresh set by the sparsity() and matrix_rank() functions. E.g. it could be that the actual average matrix rank is much higher or lower when a threshold that better takes into account the relative scale of the singular values is used. Likewise for the sparsity estimations. Also, recall that the given value for perc_test artificially sets a sparsity floor, since those missing entries in the test set cannot be recovered in the S matrix. E.g. if perc_test = 0.05, then no parameter setting will have an estimated sparsity lower than 5%.
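Putting these suggestions together, a simple base R filter over summary_stats might look like the following (the toy data.frame here mirrors the example output in the Examples section and is purely illustrative):

```r
# Toy stand-in for gs$summary_stats
summary_stats <- data.frame(
  eta = c(0.35, 0.35, 0.35),
  r = c(3, 2, 1),
  rel_err = c(0.113, 0.162, 0.172),
  S_sparsity = c(0.990, 0.998, 1.000)
)

# Drop settings whose S matrix is 100% sparse (over-regularized),
# then keep the setting with the lowest relative error
candidates <- subset(summary_stats, S_sparsity < 1)
best <- candidates[which.min(candidates$rel_err), ]
best$r        # 3
best$rel_err  # 0.113
```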
Examples
#### -------Simple simulated PCP problem-------####
# First we will simulate a simple dataset with the sim_data() function.
# The dataset will be a 100x10 matrix comprised of:
# 1. A rank-3 component as the ground truth L matrix;
# 2. A ground truth sparse component S w/outliers along the diagonal; and
# 3. A dense Gaussian noise component
data <- sim_data()
#### -------Tiny grid search-------####
# Here is a tiny grid search just to test the function quickly.
# In practice we would recommend a larger grid search.
# For examples of larger searches, see the vignettes.
gs <- grid_search_cv(
  data$D,
  rrmc,
  data.frame("eta" = 0.35),
  r = 3,
  num_runs = 2
)
#>
#> Initializing grid search...
#> Beginning sequential grid search...
#> Start time: 2025-04-02 19:20:26.028764
#>
#> Grid search completed at time: 2025-04-02 19:20:26.57631
#> Metrics calculations complete.
#> Grid search completed!
gs$summary_stats
#> # A tibble: 3 × 7
#> eta r rel_err L_rank S_sparsity iterations run_error_perc
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 0.35 3 0.113 3 0.990 NaN 0%
#> 2 0.35 2 0.162 2 0.998 NaN 0%
#> 3 0.35 1 0.172 1 1 NaN 0%
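For larger searches, the same call can be parallelized and given a progress bar. This is a usage sketch only (it assumes pcpr, future, and progressr are installed, and reuses the simulated data from above); it is not run here, and the eta values are illustrative:

```r
library(pcpr)

# Parallel grid search over two workers, with a progress bar
progressr::with_progress({
  gs <- grid_search_cv(
    data$D,
    rrmc,
    data.frame("eta" = c(0.1, 0.35)),
    r = 3,
    parallel_strategy = "multisession",
    num_workers = 2
  )
})
```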