rrmc() implements the non-convex PCP algorithm "Rank-based robust matrix completion" as described in Cherapanamjeri et al. (2017) (see Algorithm 3), outfitted with environmental health (EH)-specific extensions as described in Gibson et al. (2022).
Given an observed data matrix D, a maximum rank r to search up to, and a regularization parameter eta, rrmc() seeks to find the best low-rank and sparse estimates L and S using an incremental rank-based strategy. The L matrix encodes latent patterns that govern the observed data. The S matrix captures any extreme events in the data unexplained by the underlying patterns in L.
rrmc()'s incremental rank-based strategy first estimates a rank-1 model \((L^{(1)}, S^{(1)})\), before using the rank-1 model as the initialization point to then construct a rank-2 model \((L^{(2)}, S^{(2)})\), and so on, until the desired rank-r model \((L^{(r)}, S^{(r)})\) is recovered. All models from ranks 1 through r are returned by rrmc() in this way.
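As a minimal sketch of what this means in practice (assuming the pcpr package is attached and D is a standardized observed data matrix; the eta value here is arbitrary):

fit <- rrmc(D, r = 3, eta = 0.35) # one call fits the rank-1, rank-2, and rank-3 models
length(fit$L_list)                # 3: one low-rank estimate per rank searched
fit$L_list[[1]]                   # the rank-1 low-rank estimate
fit$L                             # the final, rank-3 low-rank estimate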
Experimentally, the rrmc() approach to PCP has been best able to handle datasets governed by complex underlying patterns characterized by slowly decaying singular values, such as EH data. For observed data with a well-defined low-rank structure (rapidly decaying singular values), root_pcp() may offer a better model estimate.
Two EH-specific extensions are currently supported by rrmc():
1. The model can handle missing values in the input data matrix D; and
2. The model can also handle measurements that fall below the limit of detection (LOD), if provided LOD information by the user.
Support for a non-negativity constraint on rrmc()'s output will be added in a future release of pcpr.
Arguments
- D: The input data matrix (can contain NA values). Note that PCP will converge much more quickly when D has been standardized in some way (e.g. scaling columns by their standard deviations, or column-wise min-max normalization).
- r: An integer >= 1 specifying the maximum rank PCP model to return. All models from rank 1 through r will be returned.
- eta: (Optional) A double in the range [0, Inf) defining the ratio between the model's sensitivity to sparse and dense noise. Larger values of eta will place a greater emphasis on penalizing the non-zero entries in S over penalizing dense noise Z, i.e. errors between the predicted and observed data Z = L + S - D. It is recommended to tune eta using grid_search_cv() for each unique data matrix D. By default, eta = NULL, in which case eta is retrieved using get_pcp_defaults().
- LOD: (Optional) The limit of detection (LOD) data. Entries in D that satisfy D >= LOD are understood to be above the LOD; otherwise those entries are treated as below the LOD. LOD can be either:
  - A double, implying a universal LOD common across all measurements in D;
  - A vector of length ncol(D), signifying a column-specific LOD, where each entry in the LOD vector corresponds to the LOD for each column in D; or
  - A matrix of dimension dim(D), indicating an observation-specific LOD, where each entry in the LOD matrix corresponds to the LOD for each entry in D.
  By default, LOD = -Inf, indicating there are no known LODs for PCP to leverage. See the sketch following this list for the three formats in code.
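As referenced above, a minimal sketch of the three accepted LOD formats (using the simulated data from the Examples below; the LOD values themselves are arbitrary and purely illustrative):

D <- sim_data()$D                            # simulated 100 x 10 observed data matrix
lod_scalar <- 0.5                            # a universal LOD shared by every measurement
lod_by_col <- apply(D, 2, quantile, 0.25)    # a column-specific LOD (length ncol(D))
lod_by_obs <- matrix(0.5, nrow(D), ncol(D))  # an observation-specific LOD (dim(D))
fit <- rrmc(D, r = 3, eta = 0.35, LOD = lod_by_col)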
Value
A list containing:
- L: The rank-r low-rank matrix encoding the r-many latent patterns governing the observed input data matrix D. dim(L) will be the same as dim(D). To explicitly obtain the underlying patterns, L can be used as the input to any matrix factorization technique of choice, e.g. PCA, factor analysis, or non-negative matrix factorization (see the sketch following this list).
- S: The sparse matrix containing the rare outlying or extreme observations in D that are not explained by the underlying patterns in the corresponding L matrix. dim(S) will be the same as dim(D). Most entries in S are 0, while non-zero entries identify the extreme outlying observations in D.
- L_list: A list of the r-many L matrices recovered over the course of rrmc()'s iterative optimization procedure. The first element in L_list corresponds to the rank-1 L matrix, the second to the rank-2 L matrix, and so on.
- S_list: A list of the r-many corresponding S matrices recovered over the course of rrmc()'s iterative optimization procedure. The first element in S_list corresponds to the rank-1 solution's S matrix, the second to the rank-2 solution's S matrix, and so on.
- objective: A vector containing the values of rrmc()'s objective function over the course of optimization.
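A brief sketch of working with the returned list (again using sim_data() from the Examples below; prcomp() here simply stands in for the analyst's factorization method of choice):

fit <- rrmc(sim_data()$D, r = 3, eta = 0.35)
patterns <- prcomp(fit$L)        # e.g. PCA on L to extract the latent patterns
mean(fit$S == 0)                 # proportion of entries not flagged as extreme events
plot(fit$objective, type = "l")  # objective value over the course of optimization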
The objective function
rrmc() implicitly optimizes the following objective function:
$$\min_{L, S} I_{rank(L) \leq r} + \eta ||S||_0 + ||L + S - D||_F^2$$
The first term is the indicator function enforcing that the L matrix is rank r or less, implemented using a rank r projection operator proj_rank_r(). The second term is the \(\ell_0\) norm applied to the S matrix to encourage sparsity, and is implemented with the help of an adaptive hard-thresholding operator hard_threshold(). The third term is the squared Frobenius norm applied to the model's noise.
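To make the first two terms concrete, below is an illustrative base R sketch of a rank-r projection (via truncated SVD) and a simple, non-adaptive hard-thresholding step. It is only a conceptual analogue: pcpr's actual proj_rank_r() and hard_threshold() may differ in their details, e.g. in how the threshold is chosen adaptively.

# Rank-r projection: keep only the r leading singular values/vectors of M.
project_rank_r_sketch <- function(M, r) {
  s <- svd(M)
  s$u[, 1:r, drop = FALSE] %*% diag(s$d[1:r], r, r) %*% t(s$v[, 1:r, drop = FALSE])
}

# Hard-thresholding: zero out entries of M whose magnitude falls below `thresh`,
# leaving the surviving (sparse) entries untouched.
hard_threshold_sketch <- function(M, thresh) {
  M * (abs(M) >= thresh)
}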
The eta parameter
The eta parameter scales the sparse penalty applied to rrmc()'s output sparse S matrix. Larger values of eta penalize non-zero entries in S more stringently, driving the recovery of sparser S matrices. Because there are no other parameters scaling the other terms in rrmc()'s objective function, eta can intuitively be thought of as the dial that balances the model's sensitivity to extreme events (placed in S) and its sensitivity to noise Z (captured by the last term in the objective, which measures the discrepancy between the predicted model and the observed data). Larger values of eta will place a greater emphasis on penalizing the non-zero entries in S over penalizing the errors between the predicted and observed data Z = L + S - D.
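This effect is easy to check empirically; in the sketch below (the two eta values are arbitrary), the larger eta should yield a sparser S:

D <- sim_data()$D
fit_small_eta <- rrmc(D, r = 3, eta = 0.05)
fit_large_eta <- rrmc(D, r = 3, eta = 1)
sum(fit_small_eta$S != 0)  # more entries flagged as extreme events
sum(fit_large_eta$S != 0)  # fewer non-zero entries, i.e. a sparser S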
Environmental health specific extensions
We refer interested readers to Gibson et al. (2022) for the complete details regarding the EH-specific extensions.
Missing value functionality: PCP assumes that the same data-generating mechanisms govern both the missing and the observed entries in D. Because PCP primarily seeks accurate estimation of patterns rather than individual observations, this assumption is reasonable, but it may not be justified in some edge cases. Missing values in D are therefore reconstructed in the recovered low-rank L matrix according to the underlying patterns in L. There are three corollaries to keep in mind regarding the quality of recovered missing observations:
1. Recovery of missing entries in D relies on accurate estimation of L;
2. The fewer observations there are in D, the harder it is to accurately reconstruct L (therefore estimation of both unobserved and observed measurements in L degrades); and
3. Greater proportions of missingness in D artificially drive up the sparsity of the estimated S matrix. This is because it is not possible to recover a sparse event in S when the corresponding entry in D is unobserved. By definition, sparse events in S cannot be explained by the consistent patterns in L. Practically, if 20% of the entries in D are missing, then at least 20% of the entries in S will be 0 (see the sketch following this list).
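For example, the third corollary can be checked directly by masking a share of the observations before fitting; a rough sketch (the 20% masking rate is arbitrary):

data <- sim_data()
D_miss <- data$D
mask <- sample(length(D_miss), size = round(0.2 * length(D_miss)))  # hide 20% of entries
D_miss[mask] <- NA
fit <- rrmc(D_miss, r = 3, eta = 0.35)
mean(fit$S == 0)  # per the corollary above, the masked 20% of entries are guaranteed to be 0 in S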
Handling measurements below the limit of detection: When equipped with LOD information, PCP treats any estimations of values known to be below the LOD as equally valid if their approximations fall between 0 and the LOD. Over the course of optimization, observations below the LOD are pushed into this known range \([0, LOD]\) using penalties from above and below: should a \(< LOD\) estimate be \(< 0\), it is stringently penalized, since measured observations cannot be negative. On the other hand, if a \(< LOD\) estimate is \(>\) the LOD, it is also heavily penalized: less so than when \(< 0\), but more so than observations known to be above the LOD, because we have prior information that these observations must be below LOD. Observations known to be above the LOD are penalized as usual, using the Frobenius norm in the above objective function.
Gibson et al. (2022) demonstrates that
in experimental settings with up to 50% of the data corrupted below the LOD,
PCP with the LOD extension boasts superior accuracy of recovered L
models
compared to PCA coupled with \(LOD / \sqrt{2}\) imputation. PCP even
outperforms PCA in low-noise scenarios with as much as 75% of the data
corrupted below the LOD. The few situations in which PCA bettered PCP were
those pathological cases in which D
was characterized by extreme noise and
huge proportions (i.e., 75%) of observations falling below the LOD.
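A rough sketch of this kind of comparison on simulated data (the censoring level, eta value, and LOD / sqrt(2) single-imputation choice are illustrative; exact results will vary):

data <- sim_data()
lod <- quantile(data$D, 0.3)                   # a universal LOD censoring ~30% of entries
D_censored <- data$D
D_censored[D_censored < lod] <- lod / sqrt(2)  # naive LOD / sqrt(2) single imputation
pca_L <- proj_rank_r(D_censored, r = 3)        # PCA-style rank-3 approximation
pcp_L <- rrmc(D_censored, r = 3, eta = 0.35, LOD = lod)$L  # PCP given the LOD information
norm(data$L - pca_L, "F") / norm(data$L, "F")  # relative error of each recovered L
norm(data$L - pcp_L, "F") / norm(data$L, "F")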
References
Cherapanamjeri, Yeshwanth, Kartik Gupta, and Prateek Jain. "Nearly optimal robust matrix completion." International Conference on Machine Learning. PMLR, 2017.
Gibson, Elizabeth A., Junhui Zhang, Jingkai Yan, Lawrence Chillrud, Jaime Benavides, Yanelli Nunez, Julie B. Herbstman, Jeff Goldsmith, John Wright, and Marianthi-Anna Kioumourtzoglou. "Principal component pursuit for pattern identification in environmental mixtures." Environmental Health Perspectives 130, no. 11 (2022): 117008.
Examples
#### -------Simple simulated PCP problem-------####
# First we will simulate a simple dataset with the sim_data() function.
# The dataset will be a 100x10 matrix comprised of:
# 1. A rank-3 component as the ground truth L matrix;
# 2. A ground truth sparse component S w/outliers along the diagonal; and
# 3. A dense Gaussian noise component
data <- sim_data()
# Best practice is to conduct a grid search with the grid_search_cv() function,
# but we skip that here for brevity.
pcp_model <- rrmc(data$D, r = 3, eta = 0.35)
data.frame(
  "Observed_relative_error" = norm(data$L - data$D, "F") / norm(data$L, "F"),
  "PCA_error" = norm(data$L - proj_rank_r(data$D, r = 3), "F") / norm(data$L, "F"),
  "PCP_L_error" = norm(data$L - pcp_model$L, "F") / norm(data$L, "F"),
  "PCP_S_error" = norm(data$S - pcp_model$S, "F") / norm(data$S, "F")
)
#> Observed_relative_error PCA_error PCP_L_error PCP_S_error
#> 1 0.1286753 0.0738434 0.03283472 0.04249677