Open Access

Ridge-type regularization method for questionnaire data analysis

  • Yuta Umezu1,
  • Hidetoshi Matsuoka2,
  • Hiroshi Ikeda2 and
  • Yoshiyuki Ninomiya3
Pacific Journal of Mathematics for Industry 2016, 8:5

DOI: 10.1186/s40736-016-0024-x

Received: 30 April 2016

Accepted: 18 August 2016

Published: 30 August 2016

Abstract

In questionnaire studies for evaluating objects such as manufacturing products, evaluators are required to respond to several evaluation items for the objects. When the number of objects is large, a subset of the objects is often assigned randomly to each evaluator, and the response becomes a matrix with missing components. To handle this kind of data, we consider a model that uses a dummy matrix representing the existence of the missing components, which can be interpreted as an extension of the GMANOVA model. In addition, to cope with the case where the numbers of objects and evaluation items are large, we consider a ridge-type estimator peculiar to our model to avoid instability in estimation. Moreover, we derive a \(C_{p}\) criterion in order to select the tuning parameters included in our estimator. Finally, we check the validity of the proposed method through simulation studies and real data analysis.

Keywords

Array data; \(C_{p}\) criterion; GMANOVA; Missing value; Ridge estimator

Introduction

In questionnaire studies, N evaluators are often required to respond to K evaluation items for M objects selected randomly from J objects. For instance, they rate each of the evaluation items on a scale of 1 to 5. Because each evaluator responds only for M objects, evaluations for the remaining J−M objects are missing, that is, we cannot observe them. Therefore, we have three-dimensional array data of size J×K×N consisting of MKN observations and (J−M)KN missing values.

For such data, we are often interested in predicting the missing values based on the observations (for example, recommendation systems of Amazon or Netflix). Methods such as collaborative filtering or matrix completion have been developed to predict the missing part. In general, such prediction requires some assumptions on the data structure. For example, Candès & Recht [2], and Koltchinskii et al. [5] reconstruct the matrix by assuming that the data structure is low-rank, and this is useful since it enables us to use a popular method of convex optimization. However, it is difficult to select the tuning parameter included in the method because we have no reasonable information criterion. In addition, it is difficult to evaluate the prediction accuracy because we have no evaluation formula for the variance of the predicted value.

On the other hand, correspondence analysis has been used in questionnaire data analysis in order to extract features from the data (e.g., Benzécri [1]). Correspondence analysis, however, is an exploratory method just like principal component analysis and is not applicable to data including missing values, so we cannot use it to analyze our data. As described in Section 2, it is possible to construct a parametric model by using a dummy matrix representing the existence of the missing values. The model we will consider can be interpreted as an extension of the generalized multivariate analysis of variance (GMANOVA) model in Potthoff & Roy [9] to three-dimensional array data. Usually, the noise in the GMANOVA model is assumed to follow a Gaussian distribution. Unfortunately, since the data obtained from a questionnaire study are discrete in general, it is unnatural to assume normality for the noise. Even so, we can express the ordinary least squares estimator explicitly and moreover evaluate the mean and variance of the estimator. However, we encounter the problem that the estimator becomes unstable when M or K is large or when multicollinearity is present in the data.

The ridge-type estimator is often used in order to assure the stability of the estimator (e.g., Hoerl & Kennard [4]). We then need to choose appropriate tuning parameters included in the estimator. Computational methods such as cross validation (CV; Stone [10]) are usually used for this choice although they come at a considerable computational cost. Information criteria such as \(C_{p}\) (Mallows [6, 7]) may also be used to choose them. For example, Nagai [8] derived an unbiased estimator of the standardized mean squared error for the ridge-type estimator in the GMANOVA model. However, the objective variable in our data is an (M×K)-dimensional matrix, and so we cannot apply his result since it is only for the usual GMANOVA model, that is, he assumed normality for the noise and did not consider missing values.

Although it is sometimes important to predict the missing part in questionnaire studies, our goal in this paper is to construct an appropriate model. To do this, we derive an unbiased estimator of the standardized mean squared error for the model that is defined in Section 2. Moreover, in Section 3, making good use of matrix calculations, we develop a \(C_{p}\)-type information criterion in order to select the tuning parameters included in the estimator. The proposed method is shown to be valid through a simulation study in Section 4, and the result of applying the method to real data is reported in Section 5. Some concluding remarks are presented in Section 6. Several matrix identities used in this paper and some proofs are relegated to the Appendix.

Setting and assumptions

In the following sections, we denote a d-dimensional zero-vector, a d-dimensional one-vector and the (d×d)-dimensional identity matrix by \(\mathbf{0}_{d}\), \(\mathbf{1}_{d}\) and \(I_{d}\), respectively, for a positive integer d.

Let \({\mathcal {J}}=\{1,2,\ldots, J\}\) be an index set of objects and \({\mathcal {J}}_{i}=\{j_{i1},j_{i2},\ldots,j_{iM}\}\) be a subset of \({\mathcal {J}}\) arranged in ascending order. In addition, let \(y_{ij_{im}k}\) be the response to the k-th evaluation item of the \(j_{im}\)-th object for the i-th evaluator, and we denote the data for the i-th evaluator by an (M×K)-dimensional matrix \(Y_{i}=(y_{ij_{im}k})_{m=1,2,\ldots,M; k=1,2,\ldots,K}\). For these data, we consider the model
$$\begin{array}{*{20}l} y_{ijk} =\mu+\alpha_{j}+\beta_{k}+\gamma_{jk}+\varepsilon_{ijk}, \end{array} $$
where μ is a general mean, \(\alpha_{j}\) and \(\beta_{k}\) are main effects, \(\gamma_{jk}\) is an interaction effect between the j-th object and the k-th item, and \(\varepsilon_{ijk}\) is noise. Note that we cannot fully observe the responses \(y_{ijk}\); more specifically, \(y_{ijk}\) is missing whenever \(j\not \in {\mathcal {J}}_{i}\). Let \(\tilde {X}_{i}\) be an (M×J)-dimensional matrix whose (m,j)-th element is 1 when \(j=j_{im}\) and 0 otherwise. Then, we can rewrite this model as
$$\begin{array}{*{20}l} Y_{i}=\bar{X}_{i}\bar{B}\bar{A}+E_{i}, \end{array} $$
(1)
where \(\bar {X}_{i}=(\mathbf {1}_{M},\tilde {X}_{i})\in \mathbb {R}^{M\times (J+1)},\;\bar {A}=(\mathbf {1}_{K},I_{K})'\in \mathbb {R}^{(K+1)\times K}\), and
$$\begin{array}{*{20}l} \bar{B}=\left(\begin{array}{cccc} \mu&\beta_{1}&\cdots&\beta_{K} \\ \alpha_{1}&\gamma_{11}&\cdots&\gamma_{1K} \\ \vdots&\vdots&\ddots&\vdots \\ \alpha_{J}&\gamma_{J1}&\cdots&\gamma_{JK} \end{array} \right)\in\mathbb{R}^{(J+1)\times (K+1)}. \end{array} $$

Let us suppose that \(E_{i}=(\boldsymbol {\varepsilon }_{i1},\boldsymbol {\varepsilon }_{i2},\ldots,\boldsymbol {\varepsilon }_{iK})=(\varepsilon _{ij_{im}k})_{m=1,2,\ldots,M;k=1,2,\ldots,K}\) are independent random matrices with mean \(\mathrm{E}[E_{i}]=\mathbf{0}_{M}\mathbf{0}_{K}'\) and covariance \(\mathrm{V}[\text{vec}(E_{i})]=\Sigma\otimes\Xi\), where Σ and Ξ are an unknown (K×K)-dimensional matrix and a known (M×M)-dimensional matrix, respectively. This means that the k-th column \(\boldsymbol{\varepsilon}_{ik}\) and the ℓ-th column \(\boldsymbol{\varepsilon}_{i\ell}\) of \(E_{i}\) have a covariance matrix \(\mathrm {E}[\boldsymbol {\varepsilon }_{ik}\boldsymbol {\varepsilon }'_{i\ell }]=\sigma _{k\ell }\Xi \) for k,ℓ=1,2,…,K. Although the \(j_{im}\)'s are assigned randomly, we treat \(X_{i}\) as deterministic for simplicity. Note that this model includes the so-called GMANOVA model of Potthoff & Roy [9] as a special case when M=1 and \(E_{i}\) is distributed according to some Gaussian distribution.
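The matrix form (1) can be checked numerically against the element-wise model. The following is a minimal sketch, assuming toy sizes and randomly drawn effects; the helper name `dummy_matrix` and all numerical values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def dummy_matrix(assigned, J):
    """Return the (M x J) matrix X~_i with a 1 at (m, j_im) and 0 elsewhere.

    `assigned` holds the 0-based indices of the objects shown to one evaluator
    (a hypothetical encoding chosen for this sketch).
    """
    assigned = np.sort(np.asarray(assigned))
    X_tilde = np.zeros((assigned.size, J))
    X_tilde[np.arange(assigned.size), assigned] = 1.0
    return X_tilde

rng = np.random.default_rng(0)
J, M, K = 6, 3, 2                      # toy sizes (assumed)
mu = 3.0                               # general mean
alpha = rng.normal(size=J)             # object main effects
beta = rng.normal(size=K)              # item main effects
gamma = rng.normal(size=(J, K))        # interaction effects

# B_bar stacks (mu, beta') over (alpha, Gamma); A_bar = (1_K, I_K)'.
B_bar = np.block([[np.array([[mu]]), beta[None, :]],
                  [alpha[:, None], gamma]])
A_bar = np.vstack([np.ones((1, K)), np.eye(K)])

assigned = rng.choice(J, size=M, replace=False)
X_bar = np.hstack([np.ones((M, 1)), dummy_matrix(assigned, J)])

# Noise-free part of Y_i in (1), compared with mu + alpha_j + beta_k + gamma_jk.
mean_matrix = X_bar @ B_bar @ A_bar
js = np.sort(assigned)
elementwise = mu + alpha[js, None] + beta[None, :] + gamma[js, :]
```

Rows of `mean_matrix` correspond only to the assigned objects, which is how the dummy matrix encodes the missing components.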

To avoid redundancy of the model, we impose
$$\begin{array}{*{20}l} \sum_{j=1}^{J}\alpha_{j} =\sum_{k=1}^{K}\beta_{k} =\sum_{j=1}^{J}\gamma_{jk} =\sum_{k=1}^{K}\gamma_{jk} =0 \end{array} $$
on the parameters, as is often done in the ANOVA model. Since
$$\begin{array}{*{20}l} \alpha_{J}&=-\sum_{j=1}^{J-1}\alpha_{j},\;\; \beta_{K}=-\sum_{k=1}^{K-1}\beta_{k}, \\ \gamma_{Jk}&=-\sum_{j=1}^{J-1}\gamma_{jk},\;\;\;\;\;\text{and}\;\;\;\;\; \gamma_{jK}=-\sum_{k=1}^{K-1}\gamma_{jk}, \end{array} $$
we can remove this restriction. In fact, by defining \(C=(I_{J-1},-\mathbf {1}_{J-1})',\; D=(I_{K-1},-\mathbf {1}_{K-1})',\; \bar {\boldsymbol {\alpha }}=(\alpha _{1},\alpha _{2},\ldots,\alpha _{J-1})',\;\bar {\boldsymbol {\beta }}=(\beta _{1},\beta _{2},\ldots,\beta _{K-1})'\) and \(\tilde {\Gamma }=(\gamma _{jk})_{j=1,2,\ldots,J-1;k=1,2,\ldots,K-1}\), \(\bar {B}\) can be rewritten as
$$\begin{array}{*{20}l} \bar{B}= \left(\begin{array}{cc} 1&\mathbf{0}' \\ \mathbf{0}&C \end{array} \right) \left(\begin{array}{cc} \mu&\bar{\boldsymbol{\beta}}' \\ \bar{\boldsymbol{\alpha}}&\tilde{\Gamma} \end{array} \right) \left(\begin{array}{cc} 1&\mathbf{0}' \\ \mathbf{0}&D' \end{array} \right), \end{array} $$
and thus we can define
$$\begin{array}{*{20}l} \bar{X}_{i}\left(\begin{array}{cc} 1&\mathbf{0}' \\ \mathbf{0}&C \end{array} \right)\in\mathbb{R}^{M\times J},\;\; \left(\begin{array}{cc} \mu&\bar{\boldsymbol{\beta}}' \\ \bar{\boldsymbol{\alpha}}&\tilde{\Gamma} \end{array} \right)\in\mathbb{R}^{J\times K}, \end{array} $$
and
$$\begin{array}{*{20}l} \left(\begin{array}{cc} 1&\mathbf{0}' \\ \mathbf{0}&D' \end{array} \right)\bar{A} =\left(\begin{array}{cc} \mathbf{1}_{K-1}'&1 \\ I_{K-1}&-\mathbf{1}_{K-1} \end{array} \right)\in\mathbb{R}^{K\times K} \end{array} $$

by \(X_{i}\), \(B\) and \(A\), respectively.
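The reparametrization can be illustrated with a short numerical check. This is a sketch under assumed toy dimensions; the helper names `contrast` and `block_diag_one` are hypothetical. It confirms that the free-parameter matrix B reproduces a \(\bar{B}\) satisfying the sum-to-zero constraints and that A has the stated block form.

```python
import numpy as np

def contrast(d):
    """Return the (d x (d-1)) matrix (I_{d-1}, -1_{d-1})'."""
    return np.vstack([np.eye(d - 1), -np.ones((1, d - 1))])

def block_diag_one(T):
    """Return the block-diagonal matrix diag(1, T)."""
    out = np.zeros((T.shape[0] + 1, T.shape[1] + 1))
    out[0, 0] = 1.0
    out[1:, 1:] = T
    return out

J, K = 5, 4                                    # toy sizes (assumed)
C, D = contrast(J), contrast(K)
A_bar = np.vstack([np.ones((1, K)), np.eye(K)])
A = block_diag_one(D.T) @ A_bar                # (K x K) matrix A

rng = np.random.default_rng(1)
B = rng.normal(size=(J, K))                    # free parameters: (mu, beta_bar'; alpha_bar, Gamma_tilde)
B_bar = block_diag_one(C) @ B @ block_diag_one(D.T)   # ((J+1) x (K+1)) constrained matrix
```

Because the constraints are absorbed into C and D, the J×K entries of B can be estimated without any side conditions.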

In the following, let us denote \(\sum _{i=1}^{N}X_{i}'X_{i}\) and \(\sum _{i=1}^{N}X_{i}'Y_{i}\) by \(X'X\) and \(X'Y\), respectively. Then an ordinary least squares estimator of the model (1), that is, a minimizer of \(\sum _{i=1}^{N}\|Y_{i}-X_{i}BA\|_{F}^{2}\), is given by
$$\begin{array}{*{20}l} \tilde{B}=(X'X)^{-1}X'YA'(AA')^{-1}, \end{array} $$
where \(\|\cdot\|_{F}\) denotes the Frobenius norm, i.e., \(\|T\|_{F}=(\text{tr}(T'T))^{1/2}\) for a matrix T. It is easy to see that \(\tilde {B}\) is an unbiased estimator of B, and that, if \(X'X/N\) converges to some positive definite matrix, \(\tilde {B}\) is a consistent estimator of B by Chebyshev's inequality. In addition, an unbiased estimator of Σ is given by
$$\begin{array}{*{20}l} \hat{\Sigma} =\frac{1}{N\text{tr}(\Xi)-S}\sum_{i=1}^{N}(Y_{i}-X_{i}\tilde{B}A)'(Y_{i}-X_{i}\tilde{B}A), \end{array} $$
(2)

where \(S=\sum _{i=1}^{N}\text {tr}\{X_{i}(X'X)^{-1}X_{i}'\Xi \}\). The details for deriving the unbiasedness of (2) are given in Appendix 2.
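As an illustration, the least squares estimator and the estimator (2) can be computed directly from their definitions. The sketch below uses assumed toy sizes and noise-free data so that the fit can be checked exactly; the function names `design`, `make_A` and `ols` are hypothetical, and objects are indexed from 0.

```python
import numpy as np

def design(assigned, J):
    """X_i = (1_M, X~_i C) for one evaluator; `assigned` are 0-based object indices."""
    assigned = np.sort(np.asarray(assigned))
    M = assigned.size
    X_tilde = np.zeros((M, J))
    X_tilde[np.arange(M), assigned] = 1.0
    C = np.vstack([np.eye(J - 1), -np.ones((1, J - 1))])
    return np.hstack([np.ones((M, 1)), X_tilde @ C])

def make_A(K):
    """A = [[1_{K-1}', 1], [I_{K-1}, -1_{K-1}]]."""
    return np.vstack([np.ones((1, K)),
                      np.hstack([np.eye(K - 1), -np.ones((K - 1, 1))])])

def ols(Xs, Ys, A, Xi):
    """Return (B_tilde, Sigma_hat): the least squares estimator and the estimator (2)."""
    XtX = sum(X.T @ X for X in Xs)
    XtY = sum(X.T @ Y for X, Y in zip(Xs, Ys))
    XtX_inv = np.linalg.inv(XtX)
    B_tilde = XtX_inv @ XtY @ A.T @ np.linalg.inv(A @ A.T)
    S = sum(np.trace(X @ XtX_inv @ X.T @ Xi) for X in Xs)
    resid = [Y - X @ B_tilde @ A for X, Y in zip(Xs, Ys)]
    Sigma_hat = sum(R.T @ R for R in resid) / (len(Xs) * np.trace(Xi) - S)
    return B_tilde, Sigma_hat

rng = np.random.default_rng(2)
N, J, M, K = 200, 6, 3, 3                      # toy sizes (assumed)
A, Xi = make_A(K), np.eye(M)
B_true = rng.normal(size=(J, K))
Xs = [design(rng.choice(J, M, replace=False), J) for _ in range(N)]
Ys_noiseless = [X @ B_true @ A for X in Xs]    # no noise, so the fit should be exact
B_tilde, Sigma_hat = ols(Xs, Ys_noiseless, A, Xi)
```

With noise-free responses the estimator recovers `B_true` and the residual-based `Sigma_hat` vanishes, which is a convenient sanity check before adding noise.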

However, when J or K is large, the inverse of \(X'X\) or \(AA'\) may not exist or the variance of the estimator may become unstable, and so we consider the ridge-type estimator given by
$$\begin{array}{*{20}l} \hat{B}_{\lambda,\mu} =(X'X+\lambda I_{J})^{-1}X'YA'(AA'+\mu I_{K})^{-1}, \end{array} $$
(3)
where λ and μ are positive constants, which are also known as tuning parameters (see, e.g., Hoerl & Kennard [4], Nagai [8]). Then we can obtain the predictor
$$\begin{array}{*{20}l} \hat{Y}_{i}=X_{i}\hat{B}_{\lambda,\mu}A. \end{array} $$
(4)
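The estimator (3) and the predictor (4) translate almost line by line into code. The following is a minimal sketch with assumed toy sizes and Gaussian noise; `ridge_fit`, `predict`, `design` and `make_A` are hypothetical names, and the argument `mu` denotes the tuning parameter μ of (3), not the general mean.

```python
import numpy as np

def design(assigned, J):
    """X_i = (1_M, X~_i C) for one evaluator; `assigned` are 0-based object indices."""
    assigned = np.sort(np.asarray(assigned))
    M = assigned.size
    X_tilde = np.zeros((M, J))
    X_tilde[np.arange(M), assigned] = 1.0
    C = np.vstack([np.eye(J - 1), -np.ones((1, J - 1))])
    return np.hstack([np.ones((M, 1)), X_tilde @ C])

def make_A(K):
    """A = [[1_{K-1}', 1], [I_{K-1}, -1_{K-1}]]."""
    return np.vstack([np.ones((1, K)),
                      np.hstack([np.eye(K - 1), -np.ones((K - 1, 1))])])

def ridge_fit(Xs, Ys, A, lam, mu):
    """Ridge-type estimator (3): (X'X + lam I_J)^{-1} X'Y A' (AA' + mu I_K)^{-1}."""
    XtX = sum(X.T @ X for X in Xs)
    XtY = sum(X.T @ Y for X, Y in zip(Xs, Ys))
    J, K = XtX.shape[0], A.shape[0]
    left = np.linalg.solve(XtX + lam * np.eye(J), XtY @ A.T)
    return left @ np.linalg.inv(A @ A.T + mu * np.eye(K))

def predict(X, B_hat, A):
    """Predictor (4) for one evaluator."""
    return X @ B_hat @ A

rng = np.random.default_rng(3)
N, J, M, K = 150, 8, 4, 3                      # toy sizes (assumed)
A = make_A(K)
B_true = rng.normal(size=(J, K))
Xs = [design(rng.choice(J, M, replace=False), J) for _ in range(N)]
Ys = [X @ B_true @ A + 0.5 * rng.normal(size=(M, K)) for X in Xs]

B_ols = ridge_fit(Xs, Ys, A, lam=0.0, mu=0.0)  # lam = mu = 0 gives the least squares fit
B_ridge = ridge_fit(Xs, Ys, A, lam=5.0, mu=0.5)
Y_hat_first = predict(Xs[0], B_ridge, A)
```

Both penalties shrink the estimate: the ridge fit equals the least squares fit multiplied on each side by a matrix with spectral norm below one, so its Frobenius norm is smaller.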

Deriving the \(C_{p}\) criterion

3.1 Preparation

Nagai [8] derived a \(C_{p}\) criterion for a ridge-type estimator in the GMANOVA model. His result and ours are different because there are missing values in the data and the observation is an (M×K)-dimensional matrix in our case. Moreover, we do not assume the normality of \(E_{i}\).

To derive a \(C_{p}\) criterion, we need some preparation with matrix calculation. Let us define
$$\begin{array}{*{20}l} H_{\mu}=A'(AA'+\mu I_{K})^{-1}A \end{array} $$
and
$$\begin{array}{*{20}l} G_{\lambda}=(X'X+\lambda I_{J})^{-1}. \end{array} $$
Note that by definition of A,
$$\begin{array}{*{20}l} AA' =\left(\begin{array}{cc} K&\mathbf{0}' \\ \mathbf{0}& I_{K-1}+\mathbf{1}_{K-1}\mathbf{1}_{K-1}' \end{array} \right). \end{array} $$
Because the inverse matrix of \((1+\mu)I_{K-1}+\mathbf{1}_{K-1}\mathbf{1}_{K-1}'\) is given by
$$\begin{array}{*{20}l} \frac{1}{1+\mu}\left(I_{K-1}-\frac{1}{K+\mu}\mathbf{1}_{K-1}\mathbf{1}_{K-1}'\right) \end{array} $$
from (20), it follows that
$$\begin{array}{*{20}l} H_{\mu} =\frac{1}{(1+\mu)(K+\mu)}\left(\begin{array}{cc} \tilde{H}_{\mu}&\mathbf{0} \\ \mathbf{0}'&K(1+\mu) \end{array} \right), \end{array} $$
(5)

where \(\tilde {H}_{\mu }=(K+\mu)I_{K-1}+\mu \mathbf {1}_{K-1}\mathbf {1}_{K-1}'\).
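The closed form (5) is easy to confirm numerically. The sketch below compares it with a brute-force evaluation of \(H_{\mu}\) and also evaluates its trace, \((K^{2}+3K\mu-2\mu)/\{(1+\mu)(K+\mu)\}\), which is used later in the derivation; the function names are illustrative assumptions.

```python
import numpy as np

def make_A(K):
    """A = [[1_{K-1}', 1], [I_{K-1}, -1_{K-1}]] from Section 2."""
    return np.vstack([np.ones((1, K)),
                      np.hstack([np.eye(K - 1), -np.ones((K - 1, 1))])])

def H_direct(K, mu):
    """H_mu = A'(AA' + mu I_K)^{-1} A by brute force."""
    A = make_A(K)
    return A.T @ np.linalg.solve(A @ A.T + mu * np.eye(K), A)

def H_closed(K, mu):
    """Closed form (5), with H~_mu in the upper-left block."""
    H = np.zeros((K, K))
    H[:K - 1, :K - 1] = (K + mu) * np.eye(K - 1) + mu * np.ones((K - 1, K - 1))
    H[K - 1, K - 1] = K * (1 + mu)
    return H / ((1 + mu) * (K + mu))

def trace_H(K, mu):
    """Trace of H_mu in closed form."""
    return (K ** 2 + 3 * K * mu - 2 * mu) / ((1 + mu) * (K + mu))
```

For example, `H_closed(4, 0.5)` and `H_direct(4, 0.5)` agree up to rounding error, and only the closed form avoids a matrix inversion.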

Next, we see that
$$\begin{array}{*{20}l} X_{i} =(\mathbf{1}_{M},\tilde{X}_{i}) \left(\begin{array}{cc} 1&\mathbf{0}' \\ \mathbf{0}&C \end{array} \right) =(\mathbf{1}_{M},\tilde{X}_{i}C) \end{array} $$
and then \(X'X\) can be expressed as
$$\begin{array}{*{20}l} \sum_{i=1}^{N} \left(\begin{array}{cc} M&\mathbf{1}_{M}'\tilde{X}_{i}C \\ C'\tilde{X}_{i}'\mathbf{1}_{M}&C'\tilde{X}_{i}'\tilde{X}_{i}C \end{array} \right). \end{array} $$
Let us define
$$\begin{array}{*{20}l} \boldsymbol{\delta}=(\delta_{1},\delta_{2},\ldots,\delta_{J})' =\sum_{i=1}^{N}\tilde{X}_{i}'\mathbf{1}_{M}. \end{array} $$
(6)
Note that \(\delta_{j}\) represents the number of times the j-th object is assigned and \(\Delta =\sum _{i=1}^{N}\tilde {X}_{i}'\tilde {X}_{i}\) is a diagonal matrix whose (j,j)-th element is \(\delta_{j}\). From (18) in Appendix 1, \(G_{\lambda}\) can be expressed as
$$\begin{array}{*{20}l} G_{\lambda}= \left(\begin{array}{cc} {\frac{(NM+\lambda)+\boldsymbol{\delta}'C\tilde{G}_{\lambda}^{-1}C'\boldsymbol{\delta}}{(NM+\lambda)^{2}}} &{-\frac{\boldsymbol{\delta}'C\tilde{G}_{\lambda}^{-1}}{NM+\lambda}} \\ {-\frac{\tilde{G}_{\lambda}^{-1}C'\boldsymbol{\delta}}{NM+\lambda}}&{\tilde{G}}_{\lambda}^{-1} \end{array} \right), \end{array} $$
(7)
where
$$\begin{array}{*{20}l} \tilde{G}_{\lambda} =C'\left(\Delta-\frac{1}{NM+\lambda}\boldsymbol{\delta}\boldsymbol{\delta}'\right)C+\lambda I_{J-1}. \end{array} $$
Let us define \(\tilde {\Delta }=\Delta -(NM+\lambda)^{-1}\boldsymbol {\delta }\boldsymbol {\delta }'\). Then, from (19), we see that
$$\begin{array}{*{20}l} \tilde{\Delta}^{-1} =\Delta^{-1}+\frac{1}{\lambda}\mathbf{1}_{J}\mathbf{1}_{J}', \end{array} $$
(8)
since \(\Delta^{-1}\boldsymbol{\delta}=\mathbf{1}_{J}\) and \(\boldsymbol{\delta}'\mathbf{1}_{J}=NM\). Moreover, by using (19) again, we have
$$\begin{array}{*{20}l} \tilde{G}_{\lambda}^{-1} =\frac{1}{\lambda}I_{J-1} -\frac{1}{\lambda^{2}}C'\left(\tilde{\Delta}^{-1}+\frac{1}{\lambda}CC'\right)^{-1}C. \end{array} $$
Let \(\Delta _{-J}\in \mathbb {R}^{(J-1)\times (J-1)}\) be the sub-matrix of Δ obtained by removing the J-th column and row of Δ, and \(\Delta ^{\dagger }=\Delta _{-J}^{-1}+\lambda ^{-1}I_{J-1}\). Note that the (j,j)-th element of \(\Delta^{\dagger}\) is given by \(\delta _{j}^{-1}+\lambda ^{-1}\) for j=1,2,…,J−1. Then \(\tilde {\Delta }^{-1}+\lambda ^{-1}CC'\) can be expressed as
$$\begin{array}{*{20}l} \left(\begin{array}{cc} \Delta^{\dagger}+\lambda^{-1}\mathbf{1}_{J-1}\mathbf{1}_{J-1}'&\mathbf{0} \\ \mathbf{0}'&\delta_{J}^{-1}+\lambda^{-1}J \end{array} \right) \end{array} $$
from (8). Let us define
$$\begin{array}{*{20}l} P=\Delta_{-J}(\Delta_{-J}+\lambda I_{J-1})^{-1}. \end{array} $$
(9)
Note that \(\Delta^{\dagger -1}=\lambda P\). Then the inverse matrix of \(\Delta^{\dagger}+\lambda^{-1}\mathbf{1}_{J-1}\mathbf{1}_{J-1}'\) can be expressed as
$$\begin{array}{*{20}l} &\Delta^{\dagger -1}-\frac{\Delta^{\dagger -1}\mathbf{1}_{J-1}\mathbf{1}_{J-1}'\Delta^{\dagger -1}}{\lambda+\mathbf{1}_{J-1}'\Delta^{\dagger -1}\mathbf{1}_{J-1}} \\ =&\lambda\left(P-\frac{P\mathbf{1}_{J-1}\mathbf{1}_{J-1}'P}{1+\text{tr}(P)}\right). \end{array} $$
In this equality, we just use \(\mathbf{1}_{J-1}'\Delta^{\dagger -1}\mathbf{1}_{J-1}=\lambda\text{tr}(P)\). Finally, we obtain that
$$\begin{array}{*{20}l} &\tilde{G}_{\lambda}^{-1}= \frac{I_{J-1}-P}{\lambda} +\frac{P\mathbf{1}_{J-1}\mathbf{1}_{J-1}'P}{\lambda(1+\text{tr}(P))} -\frac{\delta_{J}\mathbf{1}_{J-1}\mathbf{1}_{J-1}'}{\lambda(\lambda+ J\delta_{J})}. \end{array} $$
(10)
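Formula (10) can be verified by inverting \(\tilde{G}_{\lambda}\) numerically. The sketch below assumes an arbitrary vector of assignment counts `delta` and a value of λ chosen only for illustration; the function names are hypothetical.

```python
import numpy as np

def G_tilde_direct(delta, lam):
    """G~_lambda = C'(Delta - delta delta'/(NM + lam))C + lam I_{J-1}, with NM = sum(delta)."""
    J = delta.size
    NM = delta.sum()
    C = np.vstack([np.eye(J - 1), -np.ones((1, J - 1))])
    inner = np.diag(delta) - np.outer(delta, delta) / (NM + lam)
    return C.T @ inner @ C + lam * np.eye(J - 1)

def G_tilde_inv_closed(delta, lam):
    """Closed form (10) for the inverse of G~_lambda."""
    J = delta.size
    d_J = delta[-1]
    p = delta[:-1] / (delta[:-1] + lam)        # diagonal of P in (9)
    P = np.diag(p)
    ones = np.ones((J - 1, 1))
    return ((np.eye(J - 1) - P) / lam
            + (P @ ones @ ones.T @ P) / (lam * (1 + p.sum()))
            - d_J * (ones @ ones.T) / (lam * (lam + J * d_J)))

delta = np.array([7.0, 4.0, 9.0, 5.0, 6.0])    # assumed assignment counts
lam = 0.8                                      # assumed tuning parameter
numeric_inverse = np.linalg.inv(G_tilde_direct(delta, lam))
closed_inverse = G_tilde_inv_closed(delta, lam)
```

The two matrices agree, and the closed form needs only the diagonal of P, so no (J−1)×(J−1) inversion is required.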

3.2 Main result

Now, we can derive the \(C_{p}\) criterion as an unbiased estimator of a standardized mean squared error (MSE) defined by
$$\begin{array}{*{20}l} \sum_{i=1}^{N}\mathrm{E}\Bigl[\text{vec}(\hat{Y}_{i}-\mathrm{E}[Y_{i}])'(\Sigma\otimes\Xi)^{-1}\text{vec}(\hat{Y}_{i}-\mathrm{E}[Y_{i}])\Bigr], \end{array} $$
where \(\hat {Y}_{i}\) is the predictor defined in (4). From (21), this is equivalent to
$$\begin{array}{*{20}l} \sum_{i=1}^{N}\mathrm{E}\Bigl[\text{tr}\Bigl\{(\hat{Y}_{i}-\mathrm{E}[Y_{i}])'\Xi^{-1}(\hat{Y}_{i}-\mathrm{E}[Y_{i}])\Sigma^{-1}\Bigr\}\Bigr]. \end{array} $$
Because \(\mathrm{E}[Y_{i}]=Y_{i}-E_{i}\) and
$$\begin{array}{*{20}l} &\mathrm{E}[\text{tr}\{(Y_{i}-\hat{Y}_{i})'\Xi^{-1}E_{i}\Sigma^{-1}\}] \\ =&\mathrm{E}[\text{tr}(E_{i}'\Xi^{-1}E_{i}\Sigma^{-1})]-\mathrm{E}[\text{tr}(\hat{Y}_{i}'\Xi^{-1}E_{i}\Sigma^{-1})], \end{array} $$
the MSE can be rewritten as
$$\begin{array}{*{20}l} &\sum_{i=1}^{N}\mathrm{E}[\text{tr}\{(Y_{i}-\hat{Y}_{i})'\Xi^{-1}(Y_{i}-\hat{Y}_{i})\Sigma^{-1}\}] \\ &-\sum_{i=1}^{N}\mathrm{E}[\text{tr}(E_{i}'\Xi^{-1}E_{i}\Sigma^{-1})] \\ &+2\sum_{i=1}^{N}\mathrm{E}[\text{tr}(\hat{Y}_{i}'\Xi^{-1}E_{i}\Sigma^{-1})]. \end{array} $$
(11)
By using (21) and \(\mathrm{V}[\text{vec}(E_{i})]=\Sigma\otimes\Xi\), the second term of the right-hand side in (11) can be reduced to NMK since
$$\begin{array}{*{20}l} &\mathrm{E}[\text{tr}(E_{i}'\Xi^{-1}E_{i}\Sigma^{-1})] \\ =&\mathrm{E}[\text{vec}(E_{i})'(\Sigma^{-1}\otimes\Xi^{-1})\text{vec}(E_{i})] \\ =&\text{tr}(I_{M}\otimes I_{K}) \\ =&MK. \end{array} $$
Next, we evaluate the third term of the right-hand side in (11). From (3) and the definition of the model in (1), we have
$$\begin{array}{*{20}l} \hat{Y}_{i} &=\sum_{h=1}^{N}X_{i}G_{\lambda}X_{h}'Y_{h}H_{\mu} \\ &=X_{i}G_{\lambda}X'XBAH_{\mu} +\sum_{h=1}^{N}X_{i}G_{\lambda}X_{h}'E_{h}H_{\mu}. \end{array} $$
Because the first term of the right-hand side in this equality is non-stochastic and the \(E_{i}\)'s are independent, we see from (22) that
$$\begin{array}{*{20}l} &\mathrm{E}[\text{tr}(\hat{Y}_{i}'\Xi^{-1}E_{i}\Sigma^{-1})] \\ =&\sum_{h=1}^{N}\mathrm{E}[\text{vec}(E_{h})'(H_{\mu}\Sigma^{-1}\otimes X_{h}G_{\lambda}X_{i}'\Xi^{-1})\text{vec}(E_{i})] \\ =&\text{tr}\{(H_{\mu}\Sigma^{-1}\otimes X_{i}G_{\lambda}X_{i}'\Xi^{-1})(\Sigma\otimes\Xi)\} \\ =&\text{tr}(H_{\mu})\text{tr}(G_{\lambda}X_{i}'X_{i}). \end{array} $$
Thus the third term of the right-hand side in (11) is reduced to \(2\text{tr}(H_{\mu})\text{tr}(G_{\lambda}X'X)\). From (5), we have
$$\begin{array}{*{20}l} \text{tr}(H_{\mu}) &=\frac{1}{(1+\mu)(K+\mu)}\{\text{tr}(\tilde{H}_{\mu})+K(1+\mu)\} \\ &=\frac{K^{2}+3K\mu-2\mu}{(1+\mu)(K+\mu)}. \end{array} $$
(12)
On the other hand, by the definition of \(G_{\lambda}\) in (7), we have \(G_{\lambda}X'X=I_{J}-\lambda G_{\lambda}\) and
$$\begin{array}{*{20}l} \text{tr}(G_{\lambda}) =&\frac{1}{NM+\lambda} +\frac{\boldsymbol{\delta}'C\tilde{G}_{\lambda}^{-1}C'\boldsymbol{\delta}}{(NM+\lambda)^{2}} +\text{tr}(\tilde{G}_{\lambda}^{-1}) \end{array} $$
from (7). Since \(\text{tr}(P\mathbf{1}_{J-1}\mathbf{1}_{J-1}'P)=\text{tr}(P^{2})\), the last term \(\text {tr}(\tilde {G}_{\lambda }^{-1})\) is reduced to
$$\begin{array}{*{20}l} \frac{J-1-\text{tr}(P)}{\lambda} +\frac{\text{tr}(P^{2})}{\lambda(1+\text{tr}(P))} -\frac{\delta_{J}(J-1)}{\lambda(\lambda+J\delta_{J})} \end{array} $$
from (10). Moreover, by a simple calculation, we have
$$\begin{array}{*{20}l} \boldsymbol{\delta}'C\mathbf{1}_{J-1} &=(NM+\lambda)-(\lambda+J\delta_{J}), \\ \boldsymbol{\delta}'CP\mathbf{1}_{J-1} &=(NM+\lambda)-(\lambda+\delta_{J})(1+\text{tr}(P)), \end{array} $$
and
$$\begin{array}{*{20}l} &\boldsymbol{\delta}'C(I_{J-1}-P)C'\boldsymbol{\delta} \\ =&\delta_{J}(J-1)(\lambda+\delta_{J})+\lambda(NM-J\delta_{J}) \\ &-(\lambda+\delta_{J})^{2}\text{tr}(P). \end{array} $$
Then, it follows that
$$\begin{array}{*{20}l} &\lambda{\boldsymbol{\delta}}'C\tilde{G}_{\lambda}^{-1}C'\boldsymbol{\delta} \\ =&(NM+\lambda)^{2}\biggl(\frac{1}{1+\text{tr}(P)}-\frac{\delta_{J}}{\lambda+J\delta_{J}}\biggr) \\ &-\lambda(NM+\lambda), \end{array} $$
and thus we have
$$\begin{array}{*{20}l} \text{tr}(G_{\lambda}X'X)=f(P)+\frac{J\delta_{J}}{\lambda+J\delta_{J}}, \end{array} $$
(13)
where
$$\begin{array}{*{20}l} f(P)=\text{tr}(P)+\frac{\text{tr}(P)-\text{tr}(P^{2})}{1+\text{tr}(P)}. \end{array} $$
(14)
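The identity (13) can be checked against a direct computation of \(\text{tr}(G_{\lambda}X'X)\). The sketch below uses the fact that \(X_{i}=\tilde{X}_{i}(\mathbf{1}_{J},C)\), so that \(X'X=(\mathbf{1}_{J},C)'\Delta(\mathbf{1}_{J},C)\) depends on the design only through the counts \(\delta_{j}\); the count vector and λ are assumed toy values, and the function names are hypothetical.

```python
import numpy as np

def f_of_P(delta, lam):
    """f(P) in (14), with P = Delta_{-J}(Delta_{-J} + lam I)^{-1} diagonal."""
    p = delta[:-1] / (delta[:-1] + lam)
    return p.sum() + (p.sum() - (p ** 2).sum()) / (1 + p.sum())

def trace_closed(delta, lam):
    """Right-hand side of (13)."""
    J = delta.size
    return f_of_P(delta, lam) + J * delta[-1] / (lam + J * delta[-1])

def trace_direct(delta, lam):
    """tr{(X'X + lam I_J)^{-1} X'X} with X'X = T' Delta T and T = (1_J, C)."""
    J = delta.size
    C = np.vstack([np.eye(J - 1), -np.ones((1, J - 1))])
    T = np.hstack([np.ones((J, 1)), C])
    XtX = T.T @ np.diag(delta) @ T
    return np.trace(np.linalg.solve(XtX + lam * np.eye(J), XtX))

delta = np.array([12.0, 7.0, 9.0, 15.0, 10.0, 8.0])   # assumed assignment counts
lam = 2.0                                             # assumed tuning parameter
```

Evaluating `trace_closed(delta, lam)` costs only O(J) operations, which is the source of the low computational cost of the criterion compared with a direct inversion.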

Combining all the above, we obtain the following theorem:

Theorem 1

An unbiased estimator of MSE in (11) is given by
$$\begin{array}{*{20}l} &\sum_{i=1}^{N}\text{tr}\{(Y_{i}-\hat{Y}_{i})'\Xi^{-1}(Y_{i}-\hat{Y}_{i})\Sigma^{-1}\}-NMK \\ &+2\left(f(P)+\frac{J\delta_{J}}{\lambda+J\delta_{J}}\right)\frac{K^{2}+3K\mu-2\mu}{(1+\mu)(K+\mu)}, \end{array} $$

where \(\delta_{j}\), \(\hat {Y}_{i}\), P and f(P) are defined in (6), (4), (9) and (14), respectively.

Our result coincides with Nagai [8] in the special case when K=1. In this case, we can interpret our model (1) as the usual multivariate linear regression model except for the missing data.

As a result, we propose the following index as a \(C_{p}\)-type information criterion:
$$\begin{array}{*{20}l} C_{p} =&\sum_{i=1}^{N}\text{tr}\{(Y_{i}-\hat{Y}_{i})'\Xi^{-1}(Y_{i}-\hat{Y}_{i})\hat{\Sigma}^{-1}\}-NMK \\ &+2\left(f(P)+\frac{J\delta_{J}}{\lambda+J\delta_{J}}\right)\frac{K^{2}+3K\mu-2\mu}{(1+\mu)(K+\mu)}, \end{array} $$
(15)

where \(\hat {\Sigma }\) is an unbiased estimator of Σ defined in (2). By minimizing the \(C_{p}\) in (15), we can obtain the optimal values of the tuning parameters (λ,μ).
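The selection of (λ,μ) by minimizing (15) might be organized as a simple grid search. The following is a sketch under several assumptions: toy sizes, Gaussian noise, Ξ equal to the identity, an arbitrary grid, and hypothetical function names (`design`, `make_A`, `weighted_rss`, `cp_criterion`). It recomputes the least squares fit in every call for readability rather than speed.

```python
import numpy as np

def design(assigned, J):
    """X_i = (1_M, X~_i C); `assigned` holds 0-based object indices."""
    assigned = np.sort(np.asarray(assigned))
    M = assigned.size
    X_tilde = np.zeros((M, J))
    X_tilde[np.arange(M), assigned] = 1.0
    C = np.vstack([np.eye(J - 1), -np.ones((1, J - 1))])
    return np.hstack([np.ones((M, 1)), X_tilde @ C])

def make_A(K):
    return np.vstack([np.ones((1, K)),
                      np.hstack([np.eye(K - 1), -np.ones((K - 1, 1))])])

def weighted_rss(Xs, Ys, B, A, Xi_inv, Sigma_inv):
    """sum_i tr{(Y_i - X_i B A)' Xi^{-1} (Y_i - X_i B A) Sigma^{-1}}."""
    total = 0.0
    for X, Y in zip(Xs, Ys):
        R = Y - X @ B @ A
        total += np.trace(R.T @ Xi_inv @ R @ Sigma_inv)
    return total

def cp_criterion(assignments, Ys, J, Xi, lam, mu):
    """Evaluate (15) for one pair of tuning parameters (lam, mu)."""
    Xs = [design(a, J) for a in assignments]
    N, (M, K) = len(Ys), Ys[0].shape
    A = make_A(K)
    XtX = sum(X.T @ X for X in Xs)
    XtYAt = sum(X.T @ Y for X, Y in zip(Xs, Ys)) @ A.T
    AAt = A @ A.T

    # Least squares fit and the estimator of Sigma in (2).
    XtX_inv = np.linalg.inv(XtX)
    B_ols = XtX_inv @ XtYAt @ np.linalg.inv(AAt)
    S = sum(np.trace(X @ XtX_inv @ X.T @ Xi) for X in Xs)
    resid = [Y - X @ B_ols @ A for X, Y in zip(Xs, Ys)]
    Sigma_hat = sum(R.T @ R for R in resid) / (N * np.trace(Xi) - S)

    # Ridge-type fit (3) and the weighted residual term of (15).
    B_hat = (np.linalg.solve(XtX + lam * np.eye(J), XtYAt)
             @ np.linalg.inv(AAt + mu * np.eye(K)))
    fit = weighted_rss(Xs, Ys, B_hat, A, np.linalg.inv(Xi), np.linalg.inv(Sigma_hat))

    # Closed-form penalty 2 tr(G X'X) tr(H) from (12)-(14).
    delta = np.bincount(np.concatenate(assignments), minlength=J).astype(float)
    p = delta[:-1] / (delta[:-1] + lam)
    tr_G = (p.sum() + (p.sum() - (p ** 2).sum()) / (1 + p.sum())
            + J * delta[-1] / (lam + J * delta[-1]))
    tr_H = (K ** 2 + 3 * K * mu - 2 * mu) / ((1 + mu) * (K + mu))
    return fit - N * M * K + 2 * tr_G * tr_H

rng = np.random.default_rng(5)
N, J, M, K = 200, 8, 4, 3                      # toy sizes (assumed)
Xi = np.eye(M)                                 # assumed known row covariance
assignments = [rng.choice(J, M, replace=False) for _ in range(N)]
B_true = 0.3 * rng.normal(size=(J, K))
Ys = [design(a, J) @ B_true @ make_A(K) + rng.normal(size=(M, K)) for a in assignments]

grid = [(lam, mu) for lam in (0.01, 0.1, 1.0, 10.0) for mu in (0.01, 0.1, 1.0)]
scores = {pair: cp_criterion(assignments, Ys, J, Xi, *pair) for pair in grid}
best_lam, best_mu = min(scores, key=scores.get)
```

Since the penalty term is available in closed form, only the residual term has to be recomputed for each grid point, in contrast to cross validation, which refits the model on every fold.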

Simulation study

In this section, we conduct some simulation studies to check the performance of the tuning parameter selection based on \(C_{p}\) in (15). The performances of \(C_{p}\) and CV are compared.

Specifically, we assessed the performances in terms of the prediction squared error (PSE), that is,
$$\begin{array}{*{20}l} \tilde{\mathrm{E}}\left[(\tilde{Y}_{i}-X_{i}\hat{B}_{\hat{\lambda},\hat{\mu}}A)'\Xi^{-1}(\tilde{Y}_{i}-X_{i}\hat{B}_{\hat{\lambda},\hat{\mu}}A)\hat{\Sigma}^{-1}\right], \end{array} $$
(16)

where \(\tilde {Y}_{i}\) is a copy of \(Y_{i}\), \(\hat {\lambda }\) and \(\hat {\mu }\) are the values of the tuning parameters which minimize each of the criteria, and \(\hat {\Sigma }\) is an unbiased estimator of Σ given by (2). In addition, \(\tilde {\mathrm {E}}\) denotes the expectation with respect to only \(\tilde {Y}_{i}\). The expectation in PSE is evaluated using an empirical mean of n (=1,000) tuples of the test data \(\{(\tilde {Y}_{i},\tilde {X}_{i});i=1,2,\ldots,n\}\) and we conclude that the criterion giving the smaller value of the PSE is better. Moreover, we checked the standard deviation of the difference between the values of PSE given by the two criteria, because the performances of the criteria may be almost the same when it is large, even if the difference between the values of PSE is large. Thus, we conclude that the difference is significant when the value of the standard deviation is small. We also checked the computation time (sec) needed to compute \(C_{p}\) and CV for each value of the tuning parameters as a secondary index for the assessment.

The simulation settings were as follows. First, we made (M×J)-dimensional matrices \(X_{i}\) (i=1,2,…,N) by sampling uniformly without replacement from {1,2,…,J}. We then made \(Y_{i}\) based on the model in (1) for each i=1,2,…,N, and rounded so that the elements of \(Y_{i}\) were in {1,2,…,5}. Next, we used \(\Sigma=(0.5^{|i-j|})_{i,j=1,2,\ldots,K}\) and \(\Xi=(1-\rho)I_{M}+\rho\mathbf{1}_{M}\mathbf{1}_{M}'\) with a fixed ρ. This matrix Ξ is known as an intra-class correlation matrix and it represents correlations among the rows of \(Y_{i}\). We used it as one of the simplest matrices appropriate for representing correlations among objects. Note that while we use the intra-class correlation matrix here, our theory does not depend on any specific structure for Ξ. The true parameters \(\boldsymbol{\alpha}=(\alpha_{1},\alpha_{2},\ldots,\alpha_{J-1})'\), \(\boldsymbol{\beta}=(\beta_{1},\beta_{2},\ldots,\beta_{K-1})'\) and \(\Gamma=(\gamma_{jk})_{j=1,2,\ldots,J-1;k=1,2,\ldots,K-1}\) were drawn from
$$\begin{array}{*{20}l} \boldsymbol{\alpha} &\sim\mathrm{N}(\mathbf{0},10^{-3}\mathbb{I}_{J-1}),\;\;\;\;\; \boldsymbol{\beta} \sim\mathrm{N}(\mathbf{0},10^{-3}\mathbb{I}_{K-1}), \end{array} $$
and
$$\begin{array}{*{20}l} \Gamma &\sim\mathrm{N}(\mathbf{0},10^{-3}\mathbb{I}_{K-1}\otimes \mathbb{I}_{J-1}), \end{array} $$
where \(\mathbb {I}_{q}=I_{q}+\mathbf {1}_{q}\mathbf {1}_{q}'\) for a positive integer q, and μ=3. In this case, we can evaluate
$$\begin{array}{*{20}l} N\text{tr}(\Xi)-S=NM-J \end{array} $$
(17)

in (2). The details for deriving (17) are given in Appendix 3. The sample size N was set to 500 or 1,000, nine cases were considered for the triple (J,M,K), and fifty simulations were conducted.
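The covariance structures used in the simulation can be set up as follows. This is only a sketch: the text does not state the noise distribution or the exact rounding rule, so Gaussian noise and round-then-clip are assumptions made here, and the function names are hypothetical. Object and item effects are omitted from the toy draw.

```python
import numpy as np

def ar1_cov(K, r=0.5):
    """Sigma = (r^{|i-j|}) for i, j = 1, ..., K."""
    idx = np.arange(K)
    return r ** np.abs(idx[:, None] - idx[None, :])

def intraclass_cov(M, rho):
    """Xi = (1 - rho) I_M + rho 1_M 1_M'."""
    return (1 - rho) * np.eye(M) + rho * np.ones((M, M))

def draw_noise(rng, Sigma, Xi):
    """Draw E with V[vec(E)] = Sigma (x) Xi, assuming Gaussian noise."""
    L_Sigma, L_Xi = np.linalg.cholesky(Sigma), np.linalg.cholesky(Xi)
    Z = rng.normal(size=(Xi.shape[0], Sigma.shape[0]))
    return L_Xi @ Z @ L_Sigma.T

def round_to_scale(Y, low=1, high=5):
    """Round to the nearest integer and clip to the scale {low, ..., high} (assumed rule)."""
    return np.clip(np.rint(Y), low, high).astype(int)

rng = np.random.default_rng(6)
M, K, rho = 5, 4, 0.5
Sigma, Xi = ar1_cov(K), intraclass_cov(M, rho)
E = draw_noise(rng, Sigma, Xi)
Y = round_to_scale(3.0 + E)        # general mean 3 plus noise only
```

The construction `L_Xi @ Z @ L_Sigma.T` relies on the identity vec(AZB') = (B⊗A)vec(Z), which gives the Kronecker covariance Σ⊗Ξ for the column-stacked noise.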

Table 1 shows the average and standard deviation of the PSE for ρ=0 and ρ=0.5. The standard deviations of the differences between the values of the PSE (Diff) are also provided. In each case, we see that the average of the PSE for \(C_{p}\) in (15) is smaller than that of CV, and its standard deviation is also smaller in most cases. Moreover, comparing N=500 and N=1,000 with the same value of (J,M,K), the value of Diff is smaller when N=1,000. Thus we can say that the difference between the values given by \(C_{p}\) and CV becomes more significant as N increases.
Table 1

Comparison between \(C_{p}\) and CV for simulated data

ρ     (J,M,K)     N      \(C_{p}\) (sd)      CV (sd)           Diff
0     (30,5,2)    500    11.726 (0.432)    12.764 (0.697)    0.516
                  1000   11.370 (0.369)    11.661 (0.480)    0.226
      (60,5,2)    500    13.311 (0.713)    16.189 (1.547)    1.146
                  1000   12.136 (0.408)    13.003 (0.703)    0.476
      (90,5,2)    500    14.544 (0.688)    21.219 (2.249)    1.918
                  1000   12.819 (0.432)    14.757 (0.835)    0.664
      (30,5,4)    500    23.623 (0.931)    25.612 (1.391)    1.068
                  1000   22.411 (0.539)    22.870 (0.694)    0.326
      (60,5,4)    500    25.642 (1.207)    31.381 (3.331)    2.484
                  1000   23.919 (0.715)    26.073 (1.317)    1.050
      (90,5,4)    500    26.141 (0.835)    37.661 (4.553)    4.198
                  1000   24.376 (0.616)    27.986 (1.371)    1.113
      (30,10,4)   500    43.211 (1.101)    43.525 (1.230)    0.336
                  1000   41.958 (0.664)    42.084 (0.663)    0.161
      (60,10,4)   500    46.600 (1.136)    49.628 (1.929)    1.475
                  1000   44.857 (0.781)    45.629 (0.831)    0.408
      (90,10,4)   500    48.550 (1.040)    56.096 (3.186)    2.735
                  1000   45.542 (0.956)    47.192 (0.991)    0.611
0.5   (30,5,2)    500    12.905 (0.670)    14.282 (1.080)    0.892
                  1000   12.294 (0.650)    12.609 (0.774)    0.288
      (60,5,2)    500    15.197 (1.043)    19.896 (2.325)    1.899
                  1000   13.760 (0.659)    14.895 (0.752)    0.481
      (90,5,2)    500    16.520 (0.950)    28.466 (4.101)    3.773
                  1000   14.553 (0.560)    17.854 (1.386)    1.101
      (30,5,4)    500    24.779 (1.289)    26.680 (1.715)    1.126
                  1000   22.189 (0.644)    22.972 (0.832)    0.548
      (60,5,4)    500    27.027 (1.295)    36.329 (4.517)    3.960
                  1000   24.047 (0.714)    27.204 (1.710)    1.322
      (90,5,4)    500    28.378 (1.413)    44.593 (5.329)    4.685
                  1000   25.965 (0.635)    33.100 (1.840)    1.698
      (30,10,4)   500    45.319 (1.232)    46.186 (1.577)    0.725
                  1000   42.442 (1.009)    42.490 (0.917)    0.491
      (60,10,4)   500    52.135 (3.347)    56.454 (3.215)    3.110
                  1000   46.593 (1.320)    48.044 (1.696)    0.841
      (90,10,4)   500    55.531 (4.301)    65.578 (4.003)    3.381
                  1000   48.748 (1.115)    50.934 (1.566)    1.337


On the other hand, Fig. 1 compares the computation time needed to compute \(C_{p}\) and CV for each value of the tuning parameters. We set M=5 and K=4. On the left, we can see that the computation time for CV increases as N or J increases, whereas that for \(C_{p}\) changes little. An enlarged view of the computation time for \(C_{p}\) is drawn on the right. Since the difference among the lines is small, we can say that the computation time for \(C_{p}\) is robust to scale changes. Moreover, the model selection via \(C_{p}\) is easily implemented because \(C_{p}\) in (15) has a simple form. On the whole, we conclude that the \(C_{p}\) in (15) is better than CV.
Fig. 1

Transition of computation time. In each figure, the horizontal axis indicates the sample size N and the vertical axis indicates the average computation time (sec) for each value of the tuning parameters. On the left, the solid and the dashed lines represent the average computation time via \(C_{p}\) and CV, respectively. An enlarged view of the average computation time via \(C_{p}\) is drawn on the right

Real data analysis

In this section, we compare the methods by applying them to real data. In the data, objects are grouped into three categories and we assume that the data for the three categories are independent of each other. For the three categories, the values of (N,J) are respectively (1884,60), (1364,21), and (1425,44), and (K,M)=(4,5).

We used 1,200 samples obtained at random as training data for each category, and the rest of the data was used as test data. In addition, we set ρ=0 or ρ=0.5. Table 2 shows the PSE in (16) evaluated from the test data after selecting the tuning parameters based on \(C_{p}\) and CV. As in Section 4, we regard the criterion giving a smaller value of the PSE as better, and so we can say that the tuning parameter selection based on \(C_{p}\) is superior to that of CV. Looking at the result for \(C_{p}\), the value of the PSE with ρ=0.5 is smaller in categories 1 and 3, and the value of the PSE with ρ=0 is smaller in category 2. Thus it is suggested that the correlations among the objects in category 2 are smaller than those of the other categories. Note that there is no significant difference between the results for \(C_{p}\) and CV in category 2. This is because the size of the test data is small compared to that in categories 1 and 3.
Table 2

Comparison between \(C_{p}\) and CV for real data

ρ     category    \(C_{p}\)     CV
0     1           21.433    22.239
      2           20.323    20.364
      3           19.989    20.177
0.5   1           19.861    20.390
      2           20.523    20.566
      3           19.000    19.221


Concluding remarks and future work

In this paper, we have considered an appropriate model and estimation method for a questionnaire study and derived the \(C_{p}\) criterion to choose the tuning parameters included in the estimator. More precisely, using a dummy matrix representing the existence of the missing values, we have constructed a model which can be interpreted as an extension of the GMANOVA model of Potthoff & Roy [9] to three-dimensional array data. We have explicitly evaluated the penalty term in the \(C_{p}\) without assuming the normality of the noise and shown that it takes a simple form. Through the simulation study and real data analysis, we have confirmed the usefulness of the derived \(C_{p}\). This criterion has a high prediction accuracy and low computational cost compared to CV because it can be expressed explicitly in a simple form.

It is well known that predicting a missing part is important when we construct a recommendation system, which is sometimes required in recent questionnaire studies or web surveys. However, it is in general difficult to evaluate the prediction accuracy for methods such as collaborative filtering or matrix completion. For this problem, by extending the method in this paper to a model which contains a random effect for the evaluators, it might be possible to draw common statistical inferences, including the evaluation of the prediction accuracy.

In the future, it is expected that similar results will be obtained for more complex models because the model we considered has a particular structure for \(X_{i}\) and A. In addition, it will be necessary to treat the case where \(X_{i}\) is random, Ξ is unknown or there exist correlations among the categories in order to use more flexible models.

Appendix 1: Matrix algebra

Here, we describe some matrix algebra that we have used in this paper. All the proofs can be found in Harville [3].

First, we describe two matrix inversion formulae. Let \(A\in \mathbb {R}^{p\times p},\;B\in \mathbb {R}^{p\times q}\), and \(C\in \mathbb {R}^{q\times q}\), and assume that A is non-singular. Then
$$\begin{array}{*{20}l} &\left(\begin{array}{cc} A&B \\ B'&C \end{array} \right)^{-1} \\ =&\left(\begin{array}{cc} A^{-1}+A^{-1}BD^{-1}B'A^{-1}&-A^{-1}BD^{-1} \\ -D^{-1}B'A^{-1}&D^{-1} \end{array} \right) \end{array} $$
(18)
if and only if \(D=C-B'A^{-1}B\) is non-singular. In addition, assume that C is non-singular. Then
$$\begin{array}{*{20}l} &(A+BCB')^{-1} \\ =&A^{-1}-A^{-1}B(C^{-1}+B'A^{-1}B)^{-1}B'A^{-1} \end{array} $$
(19)
if and only if \(C^{-1}+B'A^{-1}B\) is non-singular. This is also known as Woodbury’s formula, and in the special case where \(A=\alpha I_{p},\;B=\boldsymbol {b}\in \mathbb {R}^{p}\), and \(C=\beta \in \mathbb {R}\) such that \(\alpha \neq 0\) and \(\alpha +\beta \boldsymbol {b}'\boldsymbol {b}\neq 0\), we have
$$\begin{array}{*{20}l} (\alpha I_{p}+\beta \boldsymbol{b}\boldsymbol{b}')^{-1} =\frac{1}{\alpha}\left(I_{p}-\frac{\beta}{\alpha+\beta\boldsymbol{b}'\boldsymbol{b}}\boldsymbol{b}\boldsymbol{b}'\right) \end{array} $$
(20)
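The inversion formulae (18), (19), and (20) can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated matrices; the sizes, the seed, and the helper `random_spd` are illustrative choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 5, 3  # illustrative dimensions (assumption)

def random_spd(n):
    """Random symmetric positive definite matrix, hence non-singular."""
    G = rng.standard_normal((n, n))
    return G @ G.T + n * np.eye(n)

A = random_spd(p)
B = rng.standard_normal((p, q))
C = random_spd(q)
A_inv = np.linalg.inv(A)

# (18): block inverse built from D = C - B' A^{-1} B
D = C - B.T @ A_inv @ B
D_inv = np.linalg.inv(D)
block = np.block([[A, B], [B.T, C]])
block_inv = np.block([
    [A_inv + A_inv @ B @ D_inv @ B.T @ A_inv, -A_inv @ B @ D_inv],
    [-D_inv @ B.T @ A_inv, D_inv],
])
ok_18 = np.allclose(block_inv, np.linalg.inv(block))

# (19): Woodbury's formula
inner = np.linalg.inv(np.linalg.inv(C) + B.T @ A_inv @ B)
lhs_19 = np.linalg.inv(A + B @ C @ B.T)
rhs_19 = A_inv - A_inv @ B @ inner @ B.T @ A_inv
ok_19 = np.allclose(lhs_19, rhs_19)

# (20): special case A = alpha * I_p, B = b, C = beta
alpha, beta = 2.0, 0.7
b = rng.standard_normal(p)
bbt = np.outer(b, b)
lhs_20 = np.linalg.inv(alpha * np.eye(p) + beta * bbt)
rhs_20 = (np.eye(p) - beta / (alpha + beta * (b @ b)) * bbt) / alpha
ok_20 = np.allclose(lhs_20, rhs_20)

print(ok_18, ok_19, ok_20)
```

Formula (20) is the useful one in practice: it replaces a general \(p\times p\) inversion by a single inner product and an outer product.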
Next, we describe the relationship between tr and vec operators. For matrices \(A\in \mathbb {R}^{p\times q},\;B\in \mathbb {R}^{q\times q}\), and \(C\in \mathbb {R}^{p\times p}\), we have
$$\begin{array}{*{20}l} \text{tr}(A'CAB') =\text{vec}(A)'(B\otimes C)\text{vec}(A). \end{array} $$
(21)
From (21), we can immediately see that
$$\begin{array}{*{20}l} \text{tr}(D'A'CAB') =\text{vec}(A)'(DB\otimes C)\text{vec}(A) \end{array} $$
(22)

for matrices A, B, C defined in (21), and \(D\in \mathbb {R}^{q\times q}\).
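Identities (21) and (22) can also be confirmed numerically. The sketch below assumes that vec stacks the columns of a matrix (column-major order) and that \(\otimes\) is the Kronecker product as implemented by `np.kron`; the dimensions and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 4, 3  # illustrative dimensions (assumption)
A = rng.standard_normal((p, q))
B = rng.standard_normal((q, q))
C = rng.standard_normal((p, p))
D = rng.standard_normal((q, q))

def vec(M):
    """Stack the columns of M into a single vector (column-major order)."""
    return M.reshape(-1, order="F")

# (21): tr(A'CAB') = vec(A)' (B kron C) vec(A)
lhs_21 = np.trace(A.T @ C @ A @ B.T)
rhs_21 = vec(A) @ np.kron(B, C) @ vec(A)

# (22): tr(D'A'CAB') = vec(A)' (DB kron C) vec(A)
lhs_22 = np.trace(D.T @ A.T @ C @ A @ B.T)
rhs_22 = vec(A) @ np.kron(D @ B, C) @ vec(A)

print(np.isclose(lhs_21, rhs_21), np.isclose(lhs_22, rhs_22))
```

Note that NumPy flattens row by row by default, so `order="F"` is needed for the column-stacking convention under which (21) holds in this form.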

Appendix 2: Unbiasedness of (2)

Noting that \(A'(AA')^{-1}A=I_{K}\) by the definition of \(A\), it follows that
$$\begin{array}{*{20}l} Y_{i}-X_{i}\tilde{B}A =E_{i}-\sum_{h=1}^{N}X_{i}(X'X)^{-1}X_{h}'E_{h}. \end{array} $$
(23)
In addition, for an (M×M)-dimensional matrix T, we can see that
$$\begin{array}{*{20}l} \mathrm{E}[E_{i}'TE_{i}]=\text{tr}(T\Xi)\Sigma. \end{array} $$
(24)
In fact, the (h,l)-th element of \(E_{i}'TE_{i}\) is given by \(\boldsymbol {\varepsilon }_{ih}'T\boldsymbol {\varepsilon }_{il}\) for h,l=1,2,…,K, and thus we have
$$\begin{array}{*{20}l} \mathrm{E}[\boldsymbol{\varepsilon}_{ih}'T\boldsymbol{\varepsilon}_{il}] =\text{tr}(T\mathrm{E}[\boldsymbol{\varepsilon}_{il }\boldsymbol{\varepsilon}_{ih}']) =\text{tr}(T\Xi)\sigma_{lh}. \end{array} $$
This and the symmetry of \(\Sigma\) imply (24). From (23), (24), and the independence of the \(E_{i}\), we have
$$\begin{array}{*{20}l} &\sum_{i=1}^{N}\mathrm{E}[(Y_{i}-X_{i}\tilde{B}A)'(Y_{i}-X_{i}\tilde{B}A)] \\ =&\sum_{i=1}^{N}\mathrm{E}[E_{i}'E_{i}] -2\sum_{i,h}\mathrm{E}[E_{h}'X_{h}(X'X)^{-1}X_{i}'E_{i}] \\ &+\sum_{i,h,l}\mathrm{E}[E_{h}'X_{h}(X'X)^{-1}X_{i}'X_{i}(X'X)^{-1}X_{l}'E_{l}] \\ =&N\text{tr}(\Xi)\Sigma-\sum_{h=1}^{N}\text{tr}\{X_{h}(X'X)^{-1}X_{h}'\Xi\}\Sigma \\ =&\{N\text{tr}(\Xi)-S\}\Sigma. \end{array} $$

This completes the proof.
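Identity (24) can be illustrated with a small Monte Carlo experiment. The sketch below assumes Gaussian noise purely for convenience (the identity itself only uses the second moments \(\mathrm{E}[\boldsymbol{\varepsilon}_{il}\boldsymbol{\varepsilon}_{ih}']=\sigma_{lh}\Xi\)); the values of \(\Sigma\), \(T\), \(\rho\), and the number of draws are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
M, K, n_draws = 4, 3, 200_000  # illustrative sizes (assumption)
rho = 0.3

# Covariance among the M objects and among the K items (assumed values)
Xi = (1 - rho) * np.eye(M) + rho * np.ones((M, M))
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
# Arbitrary (M x M) matrix T; it need not be symmetric
T = np.arange(M * M, dtype=float).reshape(M, M) / 10.0

# Draw E = L_xi Z L_sigma' so that E[eps_l eps_h'] = sigma_lh * Xi
L_xi = np.linalg.cholesky(Xi)
L_sigma = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((n_draws, M, K))
E = L_xi @ Z @ L_sigma.T

# Monte Carlo average of E' T E versus the closed form tr(T Xi) Sigma
mc_mean = np.einsum("nmh,mk,nkl->hl", E, T, E) / n_draws
exact = np.trace(T @ Xi) * Sigma
rel_err = np.max(np.abs(mc_mean - exact)) / np.max(np.abs(exact))
print(rel_err)
```

The relative error shrinks at the usual Monte Carlo rate as `n_draws` grows, which is consistent with (24) holding exactly in expectation.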

Appendix 3: Derivation of (17)

By the same argument as in Section 3, we see that \((X'X)^{-1}\) can be expressed as
$$\begin{array}{*{20}l} \left(\begin{array}{cc} J^{-2}\text{tr}(\Delta^{-1})& J^{-1}\mathbf{1}_{J-1}'Q \\ J^{-1}Q'\mathbf{1}_{J-1}&R \end{array}\right), \end{array} $$
where \(Q=\Delta _{-J}^{-1}-J^{-1}\text {tr}(\Delta ^{-1})I_{J-1}\) and \(R=(C'C)^{-1}C'\Delta ^{-1}C(C'C)^{-1}\). Then we have
$$\begin{array}{*{20}l} &X_{i}(X'X)^{-1}X_{i}' \\ =&\frac{1}{J^{2}}\text{tr}(\Delta^{-1})\mathbf{1}_{M}\mathbf{1}_{M}' +\frac{1}{J}\tilde{X}_{i}CQ'\mathbf{1}_{J-1}\mathbf{1}_{M}' \\ &+\frac{1}{J}\mathbf{1}_{M}\mathbf{1}_{J-1}'QC'\tilde{X}_{i}' +\tilde{X}_{i}CRC'\tilde{X}_{i}'. \end{array} $$
Note that \(CQ'\mathbf {1}_{J-1}=(\Delta ^{-1}-J^{-1}\text {tr}(\Delta ^{-1})I_{J})\mathbf {1}_{J}\) and \(CRC'=(I_{J}-J^{-1}\mathbf {1}_{J}\mathbf {1}_{J}')\Delta ^{-1}(I_{J}-J^{-1}\mathbf {1}_{J}\mathbf {1}_{J}')\) by a simple calculation, and that \(\tilde {X}_{i}\mathbf {1}_{J}=\mathbf {1}_{M}\). Hence, \(X_{i}(X'X)^{-1}X_{i}'\) reduces to \(\tilde {X}_{i}\Delta ^{-1}\tilde {X}_{i}'\), which is a diagonal matrix. Finally, since \(\Xi =(1-\rho)I_{M}+\rho \mathbf {1}_{M}\mathbf {1}_{M}'\) and
$$\begin{array}{*{20}l} &\sum_{i=1}^{N}\text{tr}\{X_{i}(X'X)^{-1}X_{i}'\mathbf{1}_{M}\mathbf{1}_{M}'\} \\ =&\sum_{i=1}^{N}\text{tr}\{\tilde{X}_{i}\Delta^{-1}\tilde{X}_{i}'\mathbf{1}_{M}\mathbf{1}_{M}'\} =\sum_{i=1}^{N}\mathbf{1}_{M}'\tilde{X}_{i}\Delta^{-1}\tilde{X}_{i}'\mathbf{1}_{M} \\ =&\sum_{i=1}^{N}\text{tr}\{\tilde{X}_{i}\Delta^{-1}\tilde{X}_{i}'\} =J, \end{array} $$
we obtain
$$\begin{array}{*{20}l} &\sum_{i=1}^{N}\text{tr}\{X_{i}(X'X)^{-1}X_{i}'\Xi\} \\ =&(1-\rho)\sum_{i=1}^{N}\text{tr}\{X_{i}(X'X)^{-1}X_{i}'\} \\ &+\rho\sum_{i=1}^{N}\text{tr}\{X_{i}(X'X)^{-1}X_{i}'\mathbf{1}_{M}\mathbf{1}_{M}'\} \\ =&(1-\rho)J+\rho J =J. \end{array} $$

This and tr(Ξ)=M imply (17).
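The last step of this derivation can be sketched numerically. The code below assumes that \(\tilde{X}_{i}\) is the \(M\times J\) indicator matrix of the objects assigned to evaluator \(i\) and that \(\Delta=\sum_{i}\tilde{X}_{i}'\tilde{X}_{i}\) is the diagonal matrix counting how often each object is evaluated; both are inferred from the formulas above rather than restated here, and the sizes, \(\rho\), and the cyclic assignment are hypothetical.

```python
import numpy as np

J, M, N = 6, 3, 8  # objects, objects per evaluator, evaluators (assumption)
rho = 0.4

def indicator(objects, n_objects):
    """M x J matrix whose m-th row is the unit vector of the m-th assigned object."""
    X = np.zeros((len(objects), n_objects))
    X[np.arange(len(objects)), objects] = 1.0
    return X

# Hypothetical cyclic assignment, so that every object is evaluated at least once
assignments = [[(i + m) % J for m in range(M)] for i in range(N)]
X_tilde = [indicator(a, J) for a in assignments]

# Assumed definition: Delta counts the evaluations of each object
Delta = sum(X.T @ X for X in X_tilde)
Delta_inv = np.linalg.inv(Delta)
Xi = (1 - rho) * np.eye(M) + rho * np.ones((M, M))

# Each X~_i Delta^{-1} X~_i' is diagonal, so the 1 1' part of Xi adds only the trace
H = [X @ Delta_inv @ X.T for X in X_tilde]
all_diagonal = all(np.allclose(h, np.diag(np.diag(h))) for h in H)
total = sum(np.trace(h @ Xi) for h in H)

print(all_diagonal, np.isclose(total, J))
```

Under these assumptions the sum does not depend on \(\rho\), which is why the final expression involves only \(J\).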

Declarations

Acknowledgements

The authors would like to thank the reviewer for his/her valuable comments and advice to improve the paper. This research was partially supported by a Grant-in-Aid for Scientific Research (23500353) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors’ Affiliations

(1) Nagoya Institute of Technology
(2) Fujitsu Laboratories Ltd
(3) Institute of Mathematics for Industry

References

  1. Benzécri, J-P: Correspondence Analysis Handbook. Marcel Dekker, New York (1992).
  2. Candès, EJ, Recht, B: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009).
  3. Harville, DA: Matrix Algebra from a Statistician’s Perspective. Springer, New York (1997).
  4. Hoerl, AE, Kennard, RW: Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 12(1), 55–67 (1970).
  5. Koltchinskii, V, Lounici, K, Tsybakov, AB: Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 39(5), 2302–2329 (2011).
  6. Mallows, CL: Some comments on \(C_{p}\). Technometrics. 15(4), 661–675 (1973).
  7. Mallows, CL: More comments on \(C_{p}\). Technometrics. 37(4), 362–372 (1995).
  8. Nagai, I: Modified \(C_{p}\) criterion for optimizing ridge and smooth parameters in the MGR estimator for the nonparametric GMANOVA model. Open J. Stat. 1, 1–14 (2011).
  9. Potthoff, RF, Roy, S: A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika. 51(3/4), 313–326 (1964).
  10. Stone, M: Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B Stat Methodol. 36(2), 111–147 (1974).

Copyright

© The Author(s) 2016