Ridge-type regularization method for questionnaire data analysis
Pacific Journal of Mathematics for Industry volume 8, Article number: 5 (2016)
Abstract
In questionnaire studies for evaluating objects such as manufactured products, evaluators are required to respond to several evaluation items for each object. When the number of objects is large, a random subset of the objects is often assigned to each evaluator, and the response becomes a matrix with missing components. To handle this kind of data, we consider a model that uses a dummy matrix representing the existence of the missing components, which can be interpreted as an extension of the GMANOVA model. In addition, to cope with the case where the numbers of objects and evaluation items are large, we consider a ridge-type estimator specific to our model to avoid instability in estimation. Moreover, we derive a \(C_p\) criterion in order to select the tuning parameters included in our estimator. Finally, we check the validity of the proposed method through simulation studies and real data analysis.
Introduction
In questionnaire studies, N evaluators are often required to respond to K evaluation items for M objects selected randomly from J objects. For instance, they rate each evaluation item on a scale of 1 to 5. Because each evaluator responds only for M objects, the evaluations for the remaining J−M objects are missing, that is, we cannot observe them. Therefore, we have three-dimensional array data of size J×K×N consisting of MKN observations and (J−M)KN missing values.
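The array structure described above can be sketched as follows. This is only an illustration under assumed sizes: the function name `simulate_responses`, the uniform random ratings, and the values of N, J, M and K are hypothetical and are not taken from the study.

```python
import numpy as np

def simulate_responses(N, J, M, K, seed=0):
    """Return a J x K x N array of 1-5 ratings with NaN for unassigned objects."""
    rng = np.random.default_rng(seed)
    Y = np.full((J, K, N), np.nan)
    for i in range(N):
        objs = rng.choice(J, size=M, replace=False)        # M of the J objects for evaluator i
        Y[objs, :, i] = rng.integers(1, 6, size=(M, K))    # ratings on the 1..5 scale
    return Y

N, J, M, K = 6, 10, 3, 4          # hypothetical sizes
Y = simulate_responses(N, J, M, K)
n_observed = int(np.sum(~np.isnan(Y)))
n_missing = int(np.sum(np.isnan(Y)))
# Compare the counts with MKN and (J-M)KN from the text.
print(n_observed == M * K * N, n_missing == (J - M) * K * N)
```

Storing the unobserved cells as NaN keeps the full J×K×N shape visible, although the model in the following sections works with the per-evaluator (M×K) blocks instead.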
For such data, we are often interested in predicting the missing values from the observations (for example, in the recommendation systems of Amazon or Netflix). Methods such as collaborative filtering and matrix completion have been developed to predict the missing part. In general, such prediction requires some assumptions on the data structure. For example, Candès & Recht [2] and Koltchinskii et al. [5] reconstruct the matrix by assuming that it is low-rank, which is useful because it allows the use of standard convex optimization methods. However, it is difficult to select the tuning parameter included in the method because no reasonable information criterion is available. In addition, it is difficult to evaluate the prediction accuracy because there is no formula for the variance of the predicted value.
On the other hand, correspondence analysis has been used in questionnaire data analysis to extract features from the data (e.g., Benzécri [1]). Correspondence analysis, however, is an exploratory method like principal component analysis and is not applicable to data with missing values, so we cannot use it to analyze our data. As described in Section 2, it is possible to construct a parametric model by using a dummy matrix representing the existence of the missing values. The model we consider can be interpreted as an extension of the generalized multivariate analysis of variance (GMANOVA) model of Potthoff & Roy [9] to three-dimensional array data. Usually, the noise in the GMANOVA model is assumed to follow a Gaussian distribution. However, since data obtained from questionnaire studies are discrete in general, it is unnatural to assume normality of the noise. Even so, we can express the ordinary least squares estimator explicitly and, moreover, evaluate its mean and variance. A remaining problem is that the estimator becomes unstable when M or K is large or when multicollinearity is present in the data.
The ridge-type estimator is often used to ensure the stability of the estimator (e.g., Hoerl & Kennard [4]). We then need to choose appropriate values for the tuning parameters included in the estimator. Computational methods such as cross validation (CV; Stone [10]) are usually used for this choice, although they come at a considerable computational cost. Information criteria such as \(C_p\) (Mallows [6, 7]) may also be used. For example, Nagai [8] derived an unbiased estimator of the standardized mean squared error for the ridge-type estimator in the GMANOVA model. However, the objective variable in our data is an (M×K)-dimensional matrix, and so we cannot apply his result, since it is only for the usual GMANOVA model; that is, it assumes normality of the noise and does not consider missing values.
Although it is sometimes important to predict the missing part in questionnaire studies, our goal in this paper is to construct an appropriate model. To do this, we derive an unbiased estimator of the standardized mean squared error for the model defined in Section 2. Moreover, in Section 3, making good use of matrix calculations, we develop a \(C_p\)-type information criterion in order to select the tuning parameters included in the estimator. The proposed method is shown to be valid through a simulation study in Section 4, and the result of applying the method to real data is reported in Section 5. Some concluding remarks are presented in Section 6. The matrix algebra used in this paper and some proofs are relegated to the Appendices.
Setting and assumptions
In the following sections, for a positive integer d, we denote the d-dimensional zero vector, the d-dimensional vector of ones and the (d×d)-dimensional identity matrix by \(\mathbf{0}_d\), \(\mathbf{1}_d\) and \(I_d\), respectively.
Let \({\mathcal {J}}=\{1,2,\ldots,J\}\) be an index set of objects and \({\mathcal {J}}_{i}=\{j_{i1},j_{i2},\ldots,j_{iM}\}\) be a subset of \({\mathcal {J}}\) arranged in ascending order. In addition, let \(y_{ij_{im}k}\) be the response to the k-th evaluation item of the \(j_{im}\)-th object for the i-th evaluator, and we denote the data for the i-th evaluator by an (M×K)-dimensional matrix \(Y_{i}=(y_{ij_{im}k})_{m=1,2,\ldots,M; k=1,2,\ldots,K}\). For these data, we consider the model
\[ y_{ijk}=\mu+\alpha_{j}+\beta_{k}+\gamma_{jk}+\varepsilon_{ijk}, \]
where μ is a general mean, \(\alpha_j\) and \(\beta_k\) are main effects, \(\gamma_{jk}\) is an interaction effect between the j-th object and the k-th item, and \(\varepsilon_{ijk}\) is noise. Note that we cannot fully observe the responses \(y_{ijk}\); more specifically, \(y_{ijk}\) is missing whenever \(j\not \in {\mathcal {J}}_{i}\). Let \(\tilde {X}_{i}\) be an (M×J)-dimensional matrix whose (m,j)-th element is 1 when \(j=j_{im}\) and 0 otherwise. Then, we can rewrite this model as
where \(\bar {X}_{i}=(\mathbf {1}_{M},\tilde {X}_{i})\in \mathbb {R}^{M\times (J+1)},\;\bar {A}=(\mathbf {1}_{K},I_{K})'\in \mathbb {R}^{(K+1)\times K}\), and
Let us suppose that \(E_{i}=(\boldsymbol {\varepsilon }_{i1},\boldsymbol {\varepsilon }_{i2},\ldots,\boldsymbol {\varepsilon }_{iK})=(\varepsilon _{ij_{im}k})_{m=1,2,\ldots,M;k=1,2,\ldots,K}\) are independent random matrices with mean \(\mathrm{E}[E_i]=\mathbf{0}_M\mathbf{0}_K'\) and covariance \(\mathrm{V}[\mathrm{vec}(E_i)]=\Sigma\otimes\Xi\), where Σ and Ξ are an unknown (K×K)-dimensional matrix and a known (M×M)-dimensional matrix, respectively. This means that the k-th column \(\boldsymbol{\varepsilon}_{ik}\) and the ℓ-th column \(\boldsymbol{\varepsilon}_{i\ell}\) of \(E_i\) have the covariance matrix \(\mathrm {E}[\boldsymbol {\varepsilon }_{ik}\boldsymbol {\varepsilon }'_{i\ell }]=\sigma _{k\ell }\Xi \) for k,ℓ=1,2,…,K. Although the \(j_{im}\)'s are assigned randomly, we treat \(X_i\) as deterministic for simplicity. Note that this model includes the so-called GMANOVA model of Potthoff & Roy [9] as a special case when M=1 and \(E_i\) follows a Gaussian distribution.
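To make the role of the dummy matrix concrete, the following sketch builds \(\tilde{X}_i\), \(\bar{X}_i\) and \(\bar{A}\) for one evaluator and checks that a matrix product reproduces the element-wise mean μ+α_j+β_k+γ_jk. The block layout used for \(\bar{B}\) (general mean in the corner, main effects on the border, interactions in the body) is an assumption chosen so that the product matches the element-wise model; the helper `dummy_matrix` and all numerical values are hypothetical.

```python
import numpy as np

def dummy_matrix(objs, J):
    """(M x J) 0/1 matrix whose (m, j) entry is 1 iff j is the m-th assigned object."""
    objs = np.array(sorted(objs))                 # ascending order, as in the text
    Xt = np.zeros((len(objs), J))
    Xt[np.arange(len(objs)), objs - 1] = 1.0      # objects are numbered from 1
    return Xt

rng = np.random.default_rng(0)
J, K = 5, 3
objs = [4, 2]                                     # hypothetical assignment for one evaluator
M = len(objs)

mu = 3.0                                          # general mean
alpha = rng.normal(size=J)                        # object main effects
beta = rng.normal(size=K)                         # item main effects
Gamma = rng.normal(size=(J, K))                   # object-by-item interactions

Xt = dummy_matrix(objs, J)
Xbar = np.hstack([np.ones((M, 1)), Xt])           # (1_M, X~_i), size M x (J+1)
Abar = np.vstack([np.ones((1, K)), np.eye(K)])    # (1_K, I_K)', size (K+1) x K
# Assumed block layout of B-bar: [[mu, beta'], [alpha, Gamma]]
Bbar = np.block([[np.array([[mu]]), beta[None, :]],
                 [alpha[:, None], Gamma]])

mean_matrix = Xbar @ Bbar @ Abar                  # matrix form of the model mean
expected = np.array([[mu + alpha[j - 1] + beta[k] + Gamma[j - 1, k]
                      for k in range(K)] for j in sorted(objs)])
print(np.allclose(mean_matrix, expected))
```

Each row of \(\tilde{X}_i\) simply selects the parameters of one assigned object, which is how the missing objects drop out of the evaluator's block.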
To avoid redundancy of the model, we impose
on the parameter as is often used in the ANOVA model. Since
we can remove this restriction. In fact, by defining \(C=(I_{J-1},-\mathbf {1}_{J-1})',\; D=(I_{K-1},-\mathbf {1}_{K-1})',\; \bar {\mathbf {\alpha }}=(\alpha _{1},\alpha _{2},\ldots,\alpha _{J-1})',\;\bar {\boldsymbol {\beta }}=(\beta _{1},\beta _{2},\ldots,\beta _{K-1})'\) and \(\bar {\Gamma }=(\gamma _{jk})_{j=1,2,\ldots,J-1;k=1,2,\ldots,K-1}\), \(\bar {B}\) can be rewritten as
and thus we can define
and
by X _{ i }, B and A, respectively.
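The removal of the sum-to-zero restriction can be checked numerically. The sketch below assumes that the reduced matrices take the form X_i=(1_M, X̃_i C), A=(1_K, D)' and that B stacks μ, the free main effects and the free interactions; this parametrization is our reading of the definitions of C and D above and should be treated as an assumption, as should all numerical values.

```python
import numpy as np

rng = np.random.default_rng(1)
J, K, M = 5, 4, 2

C = np.vstack([np.eye(J - 1), -np.ones((1, J - 1))])   # (I_{J-1}, -1_{J-1})', J x (J-1)
D = np.vstack([np.eye(K - 1), -np.ones((1, K - 1))])   # (I_{K-1}, -1_{K-1})', K x (K-1)

# Free parameters
mu = 3.0
alpha_bar = rng.normal(size=J - 1)
beta_bar = rng.normal(size=K - 1)
Gamma_bar = rng.normal(size=(J - 1, K - 1))

# Full parameters implied by the sum-to-zero restrictions
alpha, beta, Gamma = C @ alpha_bar, D @ beta_bar, C @ Gamma_bar @ D.T

# Redundant form with the dummy matrix (objects 2 and 4 assigned, hypothetical)
Xt = np.zeros((M, J))
Xt[0, 1] = Xt[1, 3] = 1.0
Xbar = np.hstack([np.ones((M, 1)), Xt])
Abar = np.vstack([np.ones((1, K)), np.eye(K)])
Bbar = np.block([[np.array([[mu]]), beta[None, :]],
                 [alpha[:, None], Gamma]])

# Reduced form (assumed parametrization)
X = np.hstack([np.ones((M, 1)), Xt @ C])                # M x J
A = np.vstack([np.ones((1, K)), D.T])                   # K x K
B = np.block([[np.array([[mu]]), beta_bar[None, :]],
              [alpha_bar[:, None], Gamma_bar]])         # J x K

same_mean = np.allclose(Xbar @ Bbar @ Abar, X @ B @ A)
# A is square and non-singular here, so A'(AA')^{-1}A equals I_K
projector_is_identity = np.allclose(A.T @ np.linalg.solve(A @ A.T, A), np.eye(K))
print(same_mean, projector_is_identity)
```

Under this parametrization B has J×K free entries, so no further restriction is needed when estimating it.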
In the following, let us denote \(\sum _{i=1}^{N}X_{i}'X_{i}\) and \(\sum _{i=1}^{N}X_{i}'Y_{i}\) by \(X'X\) and \(X'Y\), respectively. Then an ordinary least squares estimator for the model (1), that is, a minimizer of \(\sum _{i=1}^{N}\|Y_{i}-X_{i}BA\|_{F}^{2}\), is given by
where \(\|\cdot\|_{F}\) denotes the Frobenius norm, i.e., \(\|T\|_{F}=(\mathrm{tr}(T'T))^{1/2}\) for a matrix T. It is easy to see that \(\tilde {B}\) is an unbiased estimator of B and that, if \(X'X/N\) converges to some positive definite matrix, \(\tilde {B}\) is a consistent estimator of B by Chebyshev's inequality. In addition, an unbiased estimator of Σ is given by
where \(S=\sum _{i=1}^{N}\text {tr}\{X_{i}(X'X)^{-1}X_{i}'\Xi \}\). The details of the proof of the unbiasedness of (2) are given in Appendix 2.
However, when J or K is large, the inverse of \(X'X\) or \(AA'\) may not exist or the estimator may become unstable, and so we consider the ridge-type estimator given by
where λ and μ are positive constants known as tuning parameters (see, e.g., Hoerl & Kennard [4], Nagai [8]); the tuning parameter μ here is distinct from the general mean μ in the model. Then we can obtain the predictor
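A minimal sketch of a ridge-type estimator of this kind is given below. It assumes the form (X'X+λI_J)^{-1}X'Y A'(AA'+μI_K)^{-1}, which is consistent with the identities used in Section 3 but is stated here as an assumption; the function name `ridge_estimate`, the cyclic assignment of objects and the noise-free toy data are hypothetical.

```python
import numpy as np

def ridge_estimate(X_list, Y_list, A, lam, mu):
    """Assumed form: (X'X + lam I)^{-1} X'Y A' (AA' + mu I)^{-1}."""
    J, q = X_list[0].shape[1], A.shape[0]
    XtX = sum(X.T @ X for X in X_list)
    XtY = sum(X.T @ Y for X, Y in zip(X_list, Y_list))
    left = np.linalg.solve(XtX + lam * np.eye(J), XtY)       # (X'X + lam I)^{-1} X'Y
    right = np.linalg.solve(A @ A.T + mu * np.eye(q), A)     # (AA' + mu I)^{-1} A
    return left @ right.T

# Toy design: evaluator i receives M consecutive objects (cyclically)
J, K, M, N = 4, 3, 2, 6
C = np.vstack([np.eye(J - 1), -np.ones((1, J - 1))])
D = np.vstack([np.eye(K - 1), -np.ones((1, K - 1))])
A = np.vstack([np.ones((1, K)), D.T])
X_list = []
for i in range(N):
    Xt = np.zeros((M, J))
    for m in range(M):
        Xt[m, (i * M + m) % J] = 1.0
    X_list.append(np.hstack([np.ones((M, 1)), Xt @ C]))

rng = np.random.default_rng(0)
B_true = rng.normal(size=(J, K))
Y_list = [X @ B_true @ A for X in X_list]                    # noise-free responses

B_ols = ridge_estimate(X_list, Y_list, A, lam=0.0, mu=0.0)   # reduces to least squares
B_ridge = ridge_estimate(X_list, Y_list, A, lam=1.0, mu=1.0)
Y_hat = [X @ B_ridge @ A for X in X_list]                    # predictor for each evaluator
print(np.allclose(B_ols, B_true),
      np.linalg.norm(B_ridge) < np.linalg.norm(B_true))
```

With λ=μ=0 the estimate coincides with least squares, while positive values shrink the estimate toward zero, which is the source of the added stability.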
Deriving the \(C_p\) criterion
3.1 Preparation
Nagai [8] derived a \(C_p\) criterion for a ridge-type estimator in the GMANOVA model. Our result differs from his because our data contain missing values and each observation is an (M×K)-dimensional matrix. Moreover, we do not assume the normality of \(E_i\).
To derive a \(C_p\) criterion, we need some preparation with matrix calculations. Let us define
and
Note that by definition of A,
Because the inverse matrix of \((1+\mu)I_{K-1}+\mathbf{1}_{K-1}\mathbf{1}_{K-1}'\) is given by
from (20), it follows that
where \(\tilde {H}_{\mu }=(K+\mu)I_{K-1}+\mu \mathbf {1}_{K-1}\mathbf {1}_{K-1}'\).
Next, we see that
and then X ^{′} X can be expressed as
Let us define
Note that \(\delta_j\) represents the number of times the j-th object is assigned, and \(\Delta =\sum _{i=1}^{N}\tilde {X}_{i}'\tilde {X}_{i}\) is a diagonal matrix whose (j,j)-th element is \(\delta_j\). From (18) in Appendix 1, \(G_\lambda\) can be expressed as
where
Let us define \(\tilde {\Delta }=\Delta -(NM+\lambda)^{-1}\boldsymbol {\delta }\boldsymbol {\delta }'\). Then, from (19), we see that
since \(\Delta^{-1}\boldsymbol{\delta}=\mathbf{1}_J\) and \(\boldsymbol{\delta}'\mathbf{1}_J=NM\). Moreover, by using (19) again, we have
Let \(\Delta _{-J}\in \mathbb {R}^{(J-1)\times (J-1)}\) be the sub-matrix of Δ obtained by removing the J-th row and column of Δ, and \(\Delta ^{\dagger }=\Delta _{-J}^{-1}+\lambda ^{-1}I_{J-1}\). Note that the (j,j)-th element of \(\Delta^{\dagger}\) is given by \(\delta _{j}^{-1}+\lambda ^{-1}\) for j=1,2,…,J−1. Then \(\tilde {\Delta }^{-1}+\lambda ^{-1}CC'\) can be expressed as
from (8). Let us define
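Note that \((\Delta^{\dagger})^{-1}=\lambda P\). Then the inverse matrix of \(\Delta^{\dagger}+\lambda^{-1}\mathbf{1}_{J-1}\mathbf{1}_{J-1}'\) can be expressed as
In this equality, we just use \(\mathbf{1}_{J-1}'(\Delta^{\dagger})^{-1}\mathbf{1}_{J-1}=\lambda\,\mathrm{tr}(P)\). Finally, we obtain that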
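Several of the relations stated in this subsection can be checked numerically. In the sketch below, P is taken to be the diagonal matrix with entries δ_j/(δ_j+λ), which is what the relation \((\Delta^{\dagger})^{-1}=\lambda P\) implies, and the closed form for the inverse of \(\Delta^{\dagger}+\lambda^{-1}\mathbf{1}\mathbf{1}'\) is obtained from the Woodbury special case in Appendix 1; both are our reading rather than quoted formulas, and the sizes and random assignment are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, J, M, lam = 50, 6, 3, 2.0

# Accumulate Delta = sum_i X~_i' X~_i over randomly assigned objects
Delta = np.zeros((J, J))
for i in range(N):
    objs = np.sort(rng.choice(J, size=M, replace=False))
    Xt = np.zeros((M, J))
    Xt[np.arange(M), objs] = 1.0
    Delta += Xt.T @ Xt
delta = np.diag(Delta).copy()                 # delta_j = number of times object j is assigned

is_diagonal = np.allclose(Delta, np.diag(delta))
total_is_NM = np.isclose(delta.sum(), N * M)                       # delta' 1_J = NM
inv_gives_ones = np.allclose(np.linalg.solve(Delta, delta), 1.0)   # Delta^{-1} delta = 1_J

# Delta-dagger and P (P as implied by (Delta-dagger)^{-1} = lam * P)
d = delta[:J - 1]
Delta_dag = np.diag(1.0 / d + 1.0 / lam)
P = np.diag(d / (d + lam))
dag_inverse_ok = np.allclose(np.linalg.inv(Delta_dag), lam * P)

# Inverse of Delta-dagger + (1/lam) 1 1' via the rank-one update formula
ones = np.ones(J - 1)
direct = np.linalg.inv(Delta_dag + np.outer(ones, ones) / lam)
closed = lam * (P - P @ np.outer(ones, ones) @ P / (1.0 + np.trace(P)))
rank_one_ok = np.allclose(direct, closed)
print(is_diagonal, total_is_NM, inv_gives_ones, dag_inverse_ok, rank_one_ok)
```

Because every matrix involved is diagonal plus a rank-one term, no general matrix inversion is needed, which is what makes the resulting criterion cheap to evaluate.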
3.2 Main result
Now, we can derive the \(C_p\) criterion as an unbiased estimator of a standardized mean squared error (MSE) defined by
where \(\hat {Y}_{i}\) is the predictor defined in (4). From (21), this is equivalent to
Because \(\mathrm{E}[Y_i]=Y_i-E_i\) and
the MSE can be rewritten as
By using (21) and \(\mathrm{V}[\mathrm{vec}(E_i)]=\Sigma\otimes\Xi\), the second term of the right-hand side in (11) can be reduced to NMK since
Next, we evaluate the third term of the right-hand side in (11). From (3) and the definition of the model in (1), we have
Because the first term of the right-hand side in this equality is non-stochastic and the \(E_i\)'s are independent, we see from (22) that
Thus the third term of the right-hand side in (11) is reduced to \(2\,\mathrm{tr}(H_\mu)\,\mathrm{tr}(G_\lambda X'X)\). From (5), we have
On the other hand, by the definition of \(G_\lambda\) in (7), we have \(G_\lambda X'X=I_J-\lambda G_\lambda\) and
from (7). Since \(\mathrm{tr}(P\mathbf{1}_{J-1}\mathbf{1}_{J-1}'P)=\mathrm{tr}(P^2)\), the last term \(\text {tr}(\tilde {G}_{\lambda }^{-1})\) is reduced to
from (10). Moreover, by a simple calculation, we have
and
Then, it follows that
and thus we have
where
Combining all the above, we obtain the following theorem:
Theorem 1
An unbiased estimator of MSE in (11) is given by
where \(\delta_j\), \(\hat {Y}_{i}\), P and f(P) are defined in (6), (4), (9) and (14), respectively.
Our result coincides with that of Nagai [8] in the special case when K=1. In this case, we can interpret our model (1) as the usual multivariate linear regression model, except for the missing data.
As a result, we propose the following index as a \(C_p\)-type information criterion:
where \(\hat {\Sigma }\) is the unbiased estimator of Σ defined in (2). By minimizing the \(C_p\) in (15), we can obtain the optimal values of the tuning parameters (λ,μ).
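The sketch below evaluates only the trace product 2 tr(H_μ) tr(G_λ X'X) that Section 3.2 obtains for the third term of (11), over a small grid of tuning parameters. It is not an implementation of the full criterion (15), which also needs the residual term and \(\hat{\Sigma}\). We assume G_λ=(X'X+λI_J)^{-1}, which matches the identity G_λX'X=I_J−λG_λ, and H_μ=A'(AA'+μI_K)^{-1}A, which is our assumption; the design and grid values are hypothetical.

```python
import numpy as np

def penalty(XtX, A, lam, mu):
    """2 tr(H_mu) tr(G_lam X'X) under the assumed forms of G_lam and H_mu."""
    J, q = XtX.shape[0], A.shape[0]
    G = np.linalg.inv(XtX + lam * np.eye(J))                 # G_lam
    H = A.T @ np.linalg.solve(A @ A.T + mu * np.eye(q), A)   # H_mu (assumed form)
    return 2.0 * np.trace(H) * np.trace(G @ XtX)

# Toy design with cyclic assignment of M objects per evaluator
J, K, M, N = 4, 3, 2, 6
C = np.vstack([np.eye(J - 1), -np.ones((1, J - 1))])
D = np.vstack([np.eye(K - 1), -np.ones((1, K - 1))])
A = np.vstack([np.ones((1, K)), D.T])
XtX = np.zeros((J, J))
for i in range(N):
    Xt = np.zeros((M, J))
    for m in range(M):
        Xt[m, (i * M + m) % J] = 1.0
    X = np.hstack([np.ones((M, 1)), Xt @ C])
    XtX += X.T @ X

# Identity used in the text: G X'X = I_J - lam G, hence tr(G X'X) = J - lam tr(G)
lam = 0.5
G = np.linalg.inv(XtX + lam * np.eye(J))
identity_ok = np.allclose(G @ XtX, np.eye(J) - lam * G)

grid = [0.0, 0.1, 1.0, 10.0]                                 # hypothetical grid
table = {(l, m): penalty(XtX, A, l, m) for l in grid for m in grid}
for key in sorted(table):
    print(key, round(table[key], 3))
```

Without regularization the product equals 2JK, twice the number of free parameters, and it decreases as either tuning parameter grows, so the product behaves like an effective number of parameters.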
Simulation study
In this section, we conduct simulation studies to check the performance of the tuning parameter selection based on the \(C_p\) in (15), and compare it with that of CV.
Specifically, we assessed the performance in terms of the prediction squared error (PSE), that is,
where \(\tilde {Y}_{i}\) is an independent copy of \(Y_i\), \(\hat {\lambda }\) and \(\hat {\mu }\) are the values of the tuning parameters that minimize each criterion, and \(\hat {\Sigma }\) is the unbiased estimator of Σ given by (2). In addition, \(\tilde {\mathrm {E}}\) denotes the expectation with respect to \(\tilde {Y}_{i}\) only. The expectation in the PSE is evaluated by the empirical mean over n (=1,000) tuples of test data \(\{(\tilde {Y}_{i},\tilde {X}_{i});i=1,2,\ldots,n\}\), and we regard the criterion giving the smaller PSE as better. Moreover, we checked the standard deviation of the difference between the PSE values given by the two criteria, because when this standard deviation is large the two criteria may perform almost equally even if the difference between their PSE values is large. Thus, we conclude that the difference is significant when the standard deviation is small. We also recorded the computation time (sec) needed to compute \(C_p\) and CV for each value of the tuning parameters as a secondary index for the assessment.
The simulation settings were as follows. First, we made (M×J)-dimensional matrices \(X_i\) (i=1,2,…,N) by sampling M objects uniformly without replacement from {1,2,…,J}. We then generated \(Y_i\) based on the model in (1) for each i=1,2,…,N, and rounded so that the elements of \(Y_i\) were in {1,2,…,5}. Next, we used \(\Sigma=(0.5^{|i-j|})_{i,j=1,2,\ldots,K}\) and \(\Xi=(1-\rho)I_M+\rho\mathbf{1}_M\mathbf{1}_M'\) with a fixed ρ. This matrix Ξ is known as an intra-class correlation matrix and it represents correlations among the rows of \(Y_i\). We used it as one of the simplest matrices appropriate for representing correlations among objects. Note that while we use the intra-class correlation matrix here, our theory does not depend on any specific structure for Ξ. The true parameters \(\boldsymbol{\alpha}=(\alpha_1,\alpha_2,\ldots,\alpha_{J-1})'\), \(\boldsymbol{\beta}=(\beta_1,\beta_2,\ldots,\beta_{K-1})'\) and \(\Gamma=(\gamma_{jk})_{j=1,2,\ldots,J-1;k=1,2,\ldots,K-1}\) were drawn from
and
where \(\mathbb {I}_{q}=I_{q}+\mathbf {1}_{q}\mathbf {1}_{q}'\) for a positive integer q, and μ=3. In this case, we can evaluate
in (2). The details of the derivation of (17) are given in Appendix 3. The sample size N was set to 500 or 1,000, nine cases were considered for the triple (J,M,K), and fifty simulations were conducted.
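The covariance structure of the simulation can be sketched as follows. The matrices Σ and Ξ follow the settings described above; the use of Gaussian draws for the noise, the constant mean of 3, and rounding by `np.rint` followed by clipping to the 1 to 5 scale are assumptions made for illustration, since the text only states that the responses were rounded into {1,2,…,5}.

```python
import numpy as np

def make_covariances(K, M, rho):
    """Sigma = (0.5^{|i-j|}) and the intra-class correlation matrix Xi."""
    idx = np.arange(K)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    Xi = (1.0 - rho) * np.eye(M) + rho * np.ones((M, M))
    return Sigma, Xi

def draw_noise(Sigma, Xi, rng):
    """Draw an (M x K) matrix E with V[vec(E)] = Sigma (x) Xi (Gaussian, by assumption)."""
    L_xi = np.linalg.cholesky(Xi)          # Xi = L_xi L_xi'
    L_sigma = np.linalg.cholesky(Sigma)    # Sigma = L_sigma L_sigma'
    Z = rng.standard_normal((Xi.shape[0], Sigma.shape[0]))
    return L_xi @ Z @ L_sigma.T            # vec(E) = (L_sigma (x) L_xi) vec(Z)

def to_scale(Y):
    """Round to the nearest integer and clip to the 1..5 rating scale."""
    return np.clip(np.rint(Y), 1, 5).astype(int)

rng = np.random.default_rng(0)
K, M, rho = 4, 5, 0.5
Sigma, Xi = make_covariances(K, M, rho)
E = draw_noise(Sigma, Xi, rng)
Y = to_scale(3.0 + E)                      # hypothetical mean level of 3 plus noise
print(Y)
```

Multiplying by the Cholesky factor of Ξ on the left and that of Σ on the right yields the Kronecker covariance Σ⊗Ξ for the column-stacked noise, so correlations among objects (rows) and among items (columns) are controlled separately.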
Table 1 shows the average and standard deviation of the PSE for ρ=0 and ρ=0.5. The standard deviation of the differences between the PSE values (Diff) is also provided. In each case, both the average and the standard deviation of the PSE for the \(C_p\) in (15) are smaller than those for CV. Moreover, comparing N=500 and N=1,000 for the same (J,M,K), the value of Diff is smaller when N=1,000. Thus we can say that the difference between the values given by \(C_p\) and CV becomes more significant as N increases.
On the other hand, Fig. 1 compares the computation time needed to compute \(C_p\) and CV for each value of the tuning parameters. We set M=5 and K=4. On the left, we can see that the computation time for CV increases as N or J increases, whereas that for \(C_p\) changes little. An enlarged view of the computation time for \(C_p\) is drawn on the right. Since the differences among the lines are small, we can say that the computation time for \(C_p\) is robust to changes in scale. Moreover, model selection via \(C_p\) is easily implemented because the \(C_p\) in (15) has a simple form. On the whole, we conclude that the \(C_p\) in (15) is preferable to CV.
Real data analysis
In this section, we compare the methods by applying them to real data. In the data, the objects are grouped into three categories, and we assume that the data for the three categories are independent of each other. For the three categories, the values of (N,J) are (1884,60), (1364,21), and (1425,44), respectively, and (K,M)=(4,5).
We used 1,200 samples chosen at random as training data for each category, and the rest of the data as test data. In addition, we set ρ=0 or ρ=0.5. Table 2 shows the PSE in (16) evaluated on the test data after selecting the tuning parameters based on \(C_p\) and CV. As in Section 4, we regard the criterion giving the smaller PSE as better, and so we can say that the tuning parameter selection based on \(C_p\) is superior to that based on CV. Looking at the results for \(C_p\), the PSE with ρ=0.5 is smaller in categories 1 and 3, and the PSE with ρ=0 is smaller in category 2. This suggests that the correlations among the objects in category 2 are smaller than those in the other categories. Note that there is no significant difference between the results for \(C_p\) and CV in category 2. This is because the test data for category 2 are small compared with those for categories 1 and 3.
Concluding remarks and future work
In this paper, we have considered an appropriate model and estimation method for questionnaire studies and derived the \(C_p\) criterion to choose the tuning parameters included in the estimator. More precisely, using a dummy matrix representing the existence of the missing values, we have constructed a model which can be interpreted as an extension of the GMANOVA model of Potthoff & Roy [9] to three-dimensional array data. We have explicitly evaluated the penalty term in the \(C_p\) without assuming the normality of the noise and shown that it takes a simple form. Through the simulation study and real data analysis, we have confirmed the usefulness of the derived \(C_p\). This criterion has high prediction accuracy and a low computational cost compared with CV because it can be expressed explicitly in a simple form.
It is well known that predicting the missing part is important when we construct a recommendation system, which is sometimes required in recent questionnaire studies or web surveys. However, it is in general difficult to evaluate the prediction accuracy of methods such as collaborative filtering or matrix completion. For this problem, by extending the method in this paper to a model which contains a random effect for the evaluators, it might be possible to draw common statistical inferences, including the evaluation of the prediction accuracy.
In the future, it is expected that similar results will be obtained for more complex models because the model we considered has a particular structure for X _{ i } and A. In addition, it will be necessary to treat the case where X _{ i } is random, Ξ is unknown or there exist correlations among the categories in order to use more flexible models.
Appendix 1: Matrix algebra
Here, we describe some matrix algebra that we have used in this paper. All the proofs can be found in Harville [3].
First, we describe two matrix inversion formulae. Let \(A\in \mathbb {R}^{p\times p},\;B\in \mathbb {R}^{p\times q}\), and \(C\in \mathbb {R}^{q\times q}\), and assume that A is non-singular. Then
if and only if \(D=C-B'A^{-1}B\) is non-singular. In addition, assume that C is non-singular. Then
if and only if \(C^{-1}+B'A^{-1}B\) is non-singular. This is also known as Woodbury's formula, and in the special case where \(A=\alpha I_{p},\;B=\boldsymbol {b}\in \mathbb {R}^{p}\), and \(C=\beta \in \mathbb {R}\) such that α≠0 and \(\alpha+\beta\boldsymbol{b}'\boldsymbol{b}\neq 0\), we have
Next, we describe the relationship between tr and vec operators. For matrices \(A\in \mathbb {R}^{p\times q},\;B\in \mathbb {R}^{q\times q}\), and \(C\in \mathbb {R}^{p\times p}\), we have
From (21), we can immediately see that
for matrices A, B, C defined in (21), and \(D\in \mathbb {R}^{q\times q}\).
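The two kinds of identities in this appendix can be checked numerically. The sketch below verifies the rank-one special case of Woodbury's formula in its standard form, and one standard form of the relation between the tr and vec operators, namely tr(A'CAB)=vec(A)'(B'⊗C)vec(A); whether this is exactly the arrangement used in (21) is an assumption. The scalars `a` and `b` play the roles of α and β above, and all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 4, 3

# Rank-one special case: (a I_p + b v v')^{-1} = I_p / a - b v v' / (a (a + b v'v))
a, b = 2.0, 0.7
v = rng.normal(size=p)
direct = np.linalg.inv(a * np.eye(p) + b * np.outer(v, v))
closed = np.eye(p) / a - (b / (a * (a + b * (v @ v)))) * np.outer(v, v)
woodbury_ok = np.allclose(direct, closed)

# tr/vec relation: tr(A' C A B) = vec(A)' (B' kron C) vec(A)
A = rng.normal(size=(p, q))
B = rng.normal(size=(q, q))
C = rng.normal(size=(p, p))
vecA = A.reshape(-1, order="F")             # column-stacking vec operator
lhs = np.trace(A.T @ C @ A @ B)
rhs = vecA @ np.kron(B.T, C) @ vecA
trvec_ok = np.isclose(lhs, rhs)
print(woodbury_ok, trvec_ok)
```

The `order="F"` argument matters: NumPy flattens row by row by default, whereas the vec operator stacks columns.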
Appendix 2: Unbiasedness of (2)
Noting that \(A'(AA')^{-1}A=I_K\) by the definition of A, it follows that
In addition, for an (M×M)-dimensional matrix T, we can see that
In fact, the (h,l)-th element of \(E_i'TE_i\) is given by \(\boldsymbol{\varepsilon}_{ih}'T\boldsymbol{\varepsilon}_{il}\) for h,l=1,2,…,K, and thus we have
This and the symmetry of Σ imply (24). From (23), (24), and the independence of the \(E_i\)'s, we have
This completes the proof.
Appendix 3: Derivation of (17)
By the same argument as in Section 3, we see that \((X'X)^{-1}\) can be expressed as
where \(Q=\Delta _{-J}^{-1}-J^{-1}\text {tr}(\Delta ^{-1})I_{J-1}\) and \(R=(C'C)^{-1}C'\Delta^{-1}C(C'C)^{-1}\). Then we have
Note that \(CQ'\mathbf{1}_{J-1}=(\Delta^{-1}-J^{-1}\mathrm{tr}(\Delta^{-1})I_J)\mathbf{1}_J\) and \(CRC'=(I_J-J^{-1}\mathbf{1}_J\mathbf{1}_J')\Delta^{-1}(I_J-J^{-1}\mathbf{1}_J\mathbf{1}_J')\) by a simple calculation, and that \(\tilde {X}_{i}\mathbf {1}_{J}=\mathbf {1}_{M}\). Hence, \(X_i(X'X)^{-1}X_i'\) is reduced to \(\tilde {X}_{i}\Delta ^{-1}\tilde {X}_{i}'\), which is a diagonal matrix. Finally, since \(\Xi=(1-\rho)I_M+\rho\mathbf{1}_M\mathbf{1}_M'\) and
we obtain
This and tr(Ξ)=M imply (17).
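The reduction of \(X_i(X'X)^{-1}X_i'\) to \(\tilde{X}_i\Delta^{-1}\tilde{X}_i'\) can be checked on a toy design. The sketch assumes the reduced design X_i=(1_M, X̃_i C), which is our reading of Section 2, and uses a hypothetical cyclic assignment so that every object is assigned at least once. In this example the sum S defined in Section 2 comes out equal to J, as expected when the blocks are diagonal with entries 1/δ_j and Ξ has unit diagonal.

```python
import numpy as np

J, M, N, rho = 6, 3, 8, 0.5
C = np.vstack([np.eye(J - 1), -np.ones((1, J - 1))])
Xi = (1.0 - rho) * np.eye(M) + rho * np.ones((M, M))     # intra-class correlation matrix

Xt_list, X_list = [], []
for i in range(N):
    objs = np.sort((i * M + np.arange(M)) % J)           # cyclic assignment (hypothetical)
    Xt = np.zeros((M, J))
    Xt[np.arange(M), objs] = 1.0
    Xt_list.append(Xt)
    X_list.append(np.hstack([np.ones((M, 1)), Xt @ C]))  # assumed reduced design

Delta = sum(Xt.T @ Xt for Xt in Xt_list)                 # diagonal matrix of counts
XtX_inv = np.linalg.inv(sum(X.T @ X for X in X_list))
Delta_inv = np.linalg.inv(Delta)

S = 0.0
blocks_match = True
blocks_diagonal = True
for Xt, X in zip(Xt_list, X_list):
    block = X @ XtX_inv @ X.T                            # X_i (X'X)^{-1} X_i'
    blocks_match = blocks_match and np.allclose(block, Xt @ Delta_inv @ Xt.T)
    blocks_diagonal = blocks_diagonal and np.allclose(block, np.diag(np.diag(block)))
    S += np.trace(block @ Xi)                            # term of S from Section 2
print(blocks_match, blocks_diagonal, S)
```

Since only the diagonal of Ξ enters each trace, the value of S in this sketch does not depend on ρ.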
References
- 1
Benzécri, J-P: Correspondence Analysis Handbook. Marcel Dekker, New York (1992).
- 2
Candès, EJ, Recht, B: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009).
- 3
Harville, DA: Matrix Algebra from a Statistician’s Perspective. Springer, New York (1997).
- 4
Hoerl, AE, Kennard, RW: Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 12(1), 55–67 (1970).
- 5
Koltchinskii, V, Lounici, K, Tsybakov, AB: Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 39(5), 2302–2329 (2011).
- 6
Mallows, CL: Some comments on \(C_p\). Technometrics. 15(4), 661–675 (1973).
- 7
Mallows, CL: More comments on \(C_p\). Technometrics. 37(4), 362–372 (1995).
- 8
Nagai, I: Modified \(C_p\) criterion for optimizing ridge and smooth parameters in the MGR estimator for the nonparametric GMANOVA model. Open J. Stat. 1, 1–14 (2011).
- 9
Potthoff, RF, Roy, S: A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika. 51(3/4), 313–326 (1964).
- 10
Stone, M: Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B Stat Methodol. 36(2), 111–147 (1974).
Acknowledgements
The authors would like to thank the reviewer for his/her valuable comments and advice to improve the paper. This research was partially supported by a Grant-in-Aid for Scientific Research (23500353) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Keywords
- Array data
- \(C_p\) criterion
- GMANOVA
- Missing value
- Ridge estimator