- ORIGINAL ARTICLE
- Open Access
Ridge-type regularization method for questionnaire data analysis
- Yuta Umezu^{1},
- Hidetoshi Matsuoka^{2},
- Hiroshi Ikeda^{2} and
- Yoshiyuki Ninomiya^{3}
https://doi.org/10.1186/s40736-016-0024-x
© The Author(s) 2016
- Received: 30 April 2016
- Accepted: 18 August 2016
- Published: 30 August 2016
Abstract
In questionnaire studies for evaluating objects such as manufacturing products, evaluators are required to respond to several evaluation items for the objects. When the number of objects is large, a subset of the objects is often assigned randomly to each evaluator, and the response becomes a matrix with missing components. To handle this kind of data, we consider a model that uses a dummy matrix representing the existence of the missing components, which can be interpreted as an extension of the GMANOVA model. In addition, to cope with the case where the numbers of objects and evaluation items are large, we consider a ridge-type estimator peculiar to our model to avoid instability in estimation. Moreover, we derive a C _{ p } criterion in order to select the tuning parameters included in our estimator. Finally, we check the validity of the proposed method through simulation studies and real data analysis.
Keywords
- Array data
- C _{ p } criterion
- GMANOVA
- Missing value
- Ridge estimator
Introduction
In questionnaire studies, N evaluators are often required to respond to K evaluation items for M objects selected randomly from J objects. For instance, they rate each evaluation item on a scale of 1 to 5. Because each evaluator responds for only M objects, the evaluations for the remaining J−M objects are missing, that is, we cannot observe them. Therefore, we have three-dimensional array data of size J×K×N consisting of MKN observations and (J−M)KN missing values.
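To make the data layout concrete, the following is a minimal sketch in Python that builds such an array with missing entries and checks the two counts stated above. The function name `simulate_responses`, the use of NaN to mark missing values, and the uniformly drawn ratings are assumptions made for illustration only; they are not part of the model studied in this paper.

```python
import numpy as np

def simulate_responses(N, J, M, K, seed=0):
    """Build a J x K x N array in which each evaluator rates M of J objects.

    Unrated objects are marked with NaN (an illustrative convention).
    """
    rng = np.random.default_rng(seed)
    data = np.full((J, K, N), np.nan)
    for i in range(N):
        # Objects assigned at random to evaluator i, without repetition.
        objects = rng.choice(J, size=M, replace=False)
        # Ratings on a 1-to-5 scale for each of the K evaluation items.
        slab = data[:, :, i]  # view on the J x K slice of evaluator i
        slab[objects, :] = rng.integers(1, 6, size=(M, K))
    return data

N, J, M, K = 100, 30, 5, 2
data = simulate_responses(N, J, M, K)
n_observed = int(np.sum(~np.isnan(data)))
n_missing = int(np.sum(np.isnan(data)))
print(n_observed, n_missing)
```

With these settings the observed and missing counts equal MKN and (J−M)KN, respectively.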
For such data, we are often interested in predicting the missing values based on the observations (for example, in the recommendation systems of Amazon or Netflix). Methods such as collaborative filtering and matrix completion have been developed to predict the missing part. In general, such prediction requires some assumptions on the data structure. For example, Candès & Recht [2] and Koltchinskii et al. [5] reconstruct the matrix by assuming that the data structure is low-rank, which is useful because it enables the use of popular convex optimization methods. However, it is difficult to select the tuning parameter included in the method because no reasonable information criterion is available. In addition, it is difficult to evaluate the prediction accuracy because there is no formula for the variance of the predicted value.
On the other hand, correspondence analysis has been used in questionnaire data analysis to extract features from the data (e.g., Benzécri [1]). Correspondence analysis, however, is an exploratory method, just like principal component analysis, and is not applicable to data that include missing values, so we cannot use it to analyze our data. As described in Section 2, it is possible to construct a parametric model by using a dummy matrix representing the existence of the missing values. The model we consider can be interpreted as an extension of the generalized multivariate analysis of variance (GMANOVA) model of Potthoff & Roy [9] to three-dimensional array data. Usually, the noise in the GMANOVA model is assumed to follow a Gaussian distribution. Unfortunately, since data obtained from questionnaire studies are discrete in general, it is unnatural to assume normality of the noise. Even so, we can express the ordinary least squares estimator explicitly and, moreover, evaluate its mean and variance. However, the estimator becomes unstable when M or K is large or when multicollinearity is present in the data.
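As an illustration of what a dummy matrix of this kind might look like, the sketch below assumes that the dummy matrix for evaluator i is an M×J matrix of zeros and ones whose m-th row selects the m-th assigned object. This shape and the helper name `selection_matrix` are assumptions made for the example; the precise definition is the one given with the model in Section 2.

```python
import numpy as np

def selection_matrix(objects, J):
    """Return an M x J 0/1 matrix whose m-th row picks out objects[m]."""
    M = len(objects)
    X = np.zeros((M, J))
    X[np.arange(M), objects] = 1.0
    return X

rng = np.random.default_rng(1)
N, J, M = 200, 30, 5
X_list = [selection_matrix(rng.choice(J, size=M, replace=False), J)
          for _ in range(N)]

# Sum of X_i' X_i over evaluators: a J x J matrix.
gram = sum(X.T @ X for X in X_list)
# Under the assumption above, the diagonal counts how often each object was rated.
counts = np.diag(gram)
print(counts.min(), counts.max())
```

Under this assumption the accumulated matrix is diagonal, and an object rated by only a few evaluators contributes a small diagonal entry, which gives one intuition for why an unregularized inverse can be sensitive.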
The ridge-type estimator is often used to ensure the stability of the estimator (e.g., Hoerl & Kennard [4]). We then need to choose appropriate tuning parameters included in the estimator. Computational methods such as cross-validation (CV; Stone [10]) are usually used for this choice, although they come at a considerable computational cost. Information criteria such as C _{ p } (Mallows [6, 7]) may also be used. For example, Nagai [8] derived an unbiased estimator of the standardized mean squared error for the ridge-type estimator in the GMANOVA model. However, the objective variable in our data is an (M×K)-dimensional matrix, and we cannot apply his result because it covers only the usual GMANOVA model: it assumes normality of the noise and does not consider missing values.
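For readers less familiar with ridge estimation, the following sketch shows the classical single-response ridge estimator together with a common C _{ p }-type score of the form RSS/σ² − n + 2 tr(H), where H is the ridge hat matrix. It is a textbook illustration under simplifying assumptions (one response vector, no missing values, simulated data) and is not the estimator or the criterion derived in this paper; the function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Return ridge coefficients and the effective degrees of freedom tr(H)."""
    p = X.shape[1]
    G = X.T @ X + lam * np.eye(p)
    beta = np.linalg.solve(G, X.T @ y)
    df = float(np.trace(X @ np.linalg.solve(G, X.T)))
    return beta, df

def cp_ridge(X, y, lam, sigma2):
    """Cp-type score: scaled residual sum of squares plus a penalty on tr(H)."""
    beta, df = ridge_fit(X, y, lam)
    rss = float(np.sum((y - X @ beta) ** 2))
    return rss / sigma2 - len(y) + 2.0 * df

# Simulated design with two strongly correlated columns (illustrative only).
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x3 + rng.normal(size=n)

# Noise variance estimated from the unpenalized fit.
beta_ols, df_ols = ridge_fit(X, y, 0.0)
sigma2_hat = float(np.sum((y - X @ beta_ols) ** 2)) / (n - X.shape[1])

lams = [0.0, 0.1, 1.0, 10.0, 100.0]
cp_values = [cp_ridge(X, y, lam, sigma2_hat) for lam in lams]
best_lam = lams[int(np.argmin(cp_values))]
print(best_lam)
```

Each score requires a single fit, whereas CV refits the model once per fold for every candidate value, which is the source of the computational cost mentioned above.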
Although it is sometimes important to predict the missing part in questionnaire studies, our goal in this paper is to construct an appropriate model. To do this, we derive an unbiased estimator of the standardized mean squared error for the model defined in Section 2. Moreover, in Section 3, making good use of matrix calculations, we develop a C _{ p }-type information criterion to select the tuning parameters included in the estimator. The proposed method is shown to be valid through a simulation study in Section 4, and the results of applying the method to real data are reported in Section 5. Some concluding remarks are presented in Section 6. Some matrix algebra used in this paper and some proofs are relegated to the Appendix.
Setting and assumptions
In the following sections, we denote by 0 _{ d }, 1 _{ d } and I _{ d } the d-dimensional zero vector, the d-dimensional vector of ones and the (d×d)-dimensional identity matrix, respectively, for a positive integer d.
Let us suppose that \(E_{i}=(\boldsymbol {\varepsilon }_{i1},\boldsymbol {\varepsilon }_{i2},\ldots,\boldsymbol {\varepsilon }_{iK})=(\varepsilon _{ij_{im}k})_{m=1,2,\ldots,M;k=1,2,\ldots,K}\) are independent random matrices with mean E[E _{ i }]=0 _{ M }0 _{ K }′ and covariance V[vec(E _{ i })]=Σ⊗Ξ, where Σ and Ξ are an unknown (K×K)-dimensional matrix and a known (M×M)-dimensional matrix, respectively. This means that the k-th column ε _{ i k } and the ℓ-th column ε _{ i ℓ } of E _{ i } have a covariance matrix \(\mathrm {E}[\boldsymbol {\varepsilon }_{ik}\boldsymbol {\varepsilon }'_{i\ell }]=\sigma _{k\ell }\Xi \) for k,ℓ=1,2,…,K. Although j _{ m }’s are assigned randomly, we consider X _{ i } deterministic for simplicity. Note that this model includes the so-called GMANOVA model of Potthoff & Roy [9] in a special case when M=1 and E _{ i } is distributed according to some Gaussian distribution.
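The block structure implied by the Kronecker covariance can be checked numerically. The sketch below is illustrative only: the particular Σ and Ξ, the uniform noise and the helper names are assumptions chosen for the example, and vec(·) is taken to stack the columns of the matrix.

```python
import numpy as np

K, M = 3, 4
# Illustrative choices (not taken from the paper): equicorrelated Sigma,
# and Xi with entries 0.5^{|m - m'|}, which has unit diagonal.
Sigma = 0.5 * np.eye(K) + 0.5 * np.ones((K, K))
idx = np.arange(M)
Xi = 0.5 ** np.abs(np.subtract.outer(idx, idx))

cov = np.kron(Sigma, Xi)  # covariance of vec(E_i), size MK x MK

def block(cov, k, l, M):
    """Return the M x M block of cov for columns k and l of E_i."""
    return cov[k * M:(k + 1) * M, l * M:(l + 1) * M]

def sample_noise(Sigma, Xi, rng):
    """Draw a non-Gaussian M x K matrix E with V[vec(E)] = Sigma (x) Xi."""
    L_sigma = np.linalg.cholesky(Sigma)
    L_xi = np.linalg.cholesky(Xi)
    # Uniform(-sqrt(3), sqrt(3)) entries have mean 0 and variance 1.
    Z = rng.uniform(-np.sqrt(3), np.sqrt(3),
                    size=(Xi.shape[0], Sigma.shape[0]))
    return L_xi @ Z @ L_sigma.T

# vec(L_xi Z L_sigma') = (L_sigma (x) L_xi) vec(Z), so the covariance matches.
L = np.kron(np.linalg.cholesky(Sigma), np.linalg.cholesky(Xi))
E = sample_noise(Sigma, Xi, np.random.default_rng(0))
print(np.allclose(block(cov, 0, 2, M), Sigma[0, 2] * Xi), E.shape)
```

The sampler uses uniform rather than Gaussian entries to emphasize that only the first two moments are specified.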
by X _{ i }, B and A, respectively.
where \(S=\sum _{i=1}^{N}\text {tr}\{X_{i}(X'X)^{-1}X_{i}\Xi \}\). The details for deriving the unbiasedness of (2) are given in Appendix 2.
Deriving the C _{ p } criterion
3.1 Preparation
Nagai [8] derived a C _{ p } criterion for a ridge-type estimator in the GMANOVA model. His result and ours are different because there are missing values in the data and the observation is an (M×K)-dimensional matrix in our case. Moreover, we do not assume the normality of E _{ i }.
where \(\tilde {H}_{\mu }=(K+\mu)I_{K-1}+\mu \mathbf {1}_{K-1}\mathbf {1}_{K-1}'\).
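A matrix of this form, a multiple of the identity plus a multiple of the all-ones matrix, has a closed-form inverse by the Sherman–Morrison formula: for μ≥0, \(\tilde {H}_{\mu }^{-1}=(K+\mu)^{-1}\{I_{K-1}-\mu \{K(1+\mu)\}^{-1}\mathbf {1}_{K-1}\mathbf {1}_{K-1}'\}\). This is stated here as a general matrix fact and is not necessarily the step used in the derivation. A minimal numerical check, with hypothetical function names, is:

```python
import numpy as np

def h_tilde(K, mu):
    """(K + mu) I_{K-1} + mu 1 1', as defined in the text."""
    n = K - 1
    return (K + mu) * np.eye(n) + mu * np.ones((n, n))

def h_tilde_inverse(K, mu):
    """Closed-form inverse obtained from the Sherman-Morrison formula."""
    n = K - 1
    c = mu / (K * (1.0 + mu))
    return (np.eye(n) - c * np.ones((n, n))) / (K + mu)

K, mu = 4, 0.5
H = h_tilde(K, mu)
H_inv = h_tilde_inverse(K, mu)
print(np.allclose(H @ H_inv, np.eye(K - 1)))
```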
3.2 Main result
Combining all the above, we obtain the following theorem:
Theorem 1
where δ _{ j }, \(\hat {Y}_{i}\), P and f(P) are defined in (6), (4), (9) and (14), respectively.
Our result coincides with that of Nagai [8] in the special case when K=1. In this case, we can interpret our model (1) as the usual multivariate linear regression model, except for the missing data.
where \(\hat {\Sigma }\) is an unbiased estimator of Σ defined in (2). By minimizing the C _{ p } in (15), we can obtain the optimal values of the tuning parameters (λ,μ).
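In practice, the minimization over (λ,μ) can be carried out by evaluating the criterion on a grid of candidate values. The sketch below assumes a user-supplied function `criterion(lam, mu)` that returns the value of the criterion; the toy function used here is a placeholder and is not the C _{ p } in (15), and the grid values are arbitrary.

```python
import itertools

def grid_search(criterion, lambdas, mus):
    """Return the (lambda, mu) pair on the grid with the smallest criterion value."""
    return min(itertools.product(lambdas, mus),
               key=lambda pair: criterion(*pair))

def toy_criterion(lam, mu):
    """Placeholder with a known minimizer; stands in for a real criterion."""
    return (lam - 1.0) ** 2 + (mu - 0.1) ** 2

lambdas = [0.0, 0.1, 1.0, 10.0]
mus = [0.0, 0.1, 1.0]
best_lam, best_mu = grid_search(toy_criterion, lambdas, mus)
print(best_lam, best_mu)
```

Because a closed-form criterion needs no refitting across folds, each grid point costs one evaluation, in contrast to CV.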
Simulation study
In this section, we conduct simulation studies to check the performance of the tuning parameter selection based on the C _{ p } in (15). The performances of C _{ p } and CV are compared.
where \(\tilde {Y}_{i}\) is a copy of Y _{ i }, \(\hat {\lambda }\) and \(\hat {\mu }\) are the values of the tuning parameters that minimize each criterion, and \(\hat {\Sigma }\) is the unbiased estimator of Σ given by (2). In addition, \(\tilde {\mathrm {E}}\) denotes the expectation with respect to \(\tilde {Y}_{i}\) only. The expectation in the PSE is evaluated by the empirical mean over n (=1,000) tuples of test data \(\{(\tilde {Y}_{i},\tilde {X}_{i});i=1,2,\ldots,n\}\), and we regard the criterion giving the smaller PSE as better. Moreover, we checked the standard deviation of the difference between the PSE values given by the two criteria, because when this standard deviation is large the two criteria may perform almost equally even if the difference between their PSE values is large. Thus, we conclude that the difference is significant when the standard deviation is small. We also measured the computation time (sec) needed to compute C _{ p } and CV for each value of the tuning parameters as a secondary index for the assessment.
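The paired comparison described above can be sketched as follows. The function name, the use of the sample standard deviation (divisor R−1 for R replications) and the three input values are assumptions for illustration; the inputs are made-up numbers, not results from this study.

```python
import numpy as np

def compare_criteria(pse_a, pse_b):
    """Summarize PSE values of two criteria over the same simulation replications.

    Returns the mean PSE of each criterion and the standard deviation of the
    replication-wise differences pse_b - pse_a.
    """
    pse_a = np.asarray(pse_a, dtype=float)
    pse_b = np.asarray(pse_b, dtype=float)
    diff = pse_b - pse_a
    return pse_a.mean(), pse_b.mean(), diff.std(ddof=1)

# Made-up PSE values for three replications (illustrative only).
mean_a, mean_b, sd_diff = compare_criteria([1.0, 2.0, 3.0], [2.0, 3.0, 5.0])
print(mean_a, mean_b, sd_diff)
```

Pairing the two criteria on the same replications removes the variation shared by both, so the standard deviation of the difference can be much smaller than the standard deviation of either PSE alone.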
in (2). The details for deriving (17) are given in Appendix 3. The sample size N was set to 500 or 1,000, nine cases were considered for the triple (J,M,K), and fifty simulations were conducted.
Comparison between C _{ p } and CV for simulated data
ρ | (J,M,K) | N | C _{ p } (sd) | CV (sd) | Diff |
---|---|---|---|---|---|
0 | (30,5,2) | 500 | 11.726 (0.432) | 12.764 (0.697) | 0.516 |
0 | (30,5,2) | 1000 | 11.370 (0.369) | 11.661 (0.480) | 0.226 |
0 | (60,5,2) | 500 | 13.311 (0.713) | 16.189 (1.547) | 1.146 |
0 | (60,5,2) | 1000 | 12.136 (0.408) | 13.003 (0.703) | 0.476 |
0 | (90,5,2) | 500 | 14.544 (0.688) | 21.219 (2.249) | 1.918 |
0 | (90,5,2) | 1000 | 12.819 (0.432) | 14.757 (0.835) | 0.664 |
0 | (30,5,4) | 500 | 23.623 (0.931) | 25.612 (1.391) | 1.068 |
0 | (30,5,4) | 1000 | 22.411 (0.539) | 22.870 (0.694) | 0.326 |
0 | (60,5,4) | 500 | 25.642 (1.207) | 31.381 (3.331) | 2.484 |
0 | (60,5,4) | 1000 | 23.919 (0.715) | 26.073 (1.317) | 1.050 |
0 | (90,5,4) | 500 | 26.141 (0.835) | 37.661 (4.553) | 4.198 |
0 | (90,5,4) | 1000 | 24.376 (0.616) | 27.986 (1.371) | 1.113 |
0 | (30,10,4) | 500 | 43.211 (1.101) | 43.525 (1.230) | 0.336 |
0 | (30,10,4) | 1000 | 41.958 (0.664) | 42.084 (0.663) | 0.161 |
0 | (60,10,4) | 500 | 46.600 (1.136) | 49.628 (1.929) | 1.475 |
0 | (60,10,4) | 1000 | 44.857 (0.781) | 45.629 (0.831) | 0.408 |
0 | (90,10,4) | 500 | 48.550 (1.040) | 56.096 (3.186) | 2.735 |
0 | (90,10,4) | 1000 | 45.542 (0.956) | 47.192 (0.991) | 0.611 |
0.5 | (30,5,2) | 500 | 12.905 (0.670) | 14.282 (1.080) | 0.892 |
0.5 | (30,5,2) | 1000 | 12.294 (0.650) | 12.609 (0.774) | 0.288 |
0.5 | (60,5,2) | 500 | 15.197 (1.043) | 19.896 (2.325) | 1.899 |
0.5 | (60,5,2) | 1000 | 13.760 (0.659) | 14.895 (0.752) | 0.481 |
0.5 | (90,5,2) | 500 | 16.520 (0.950) | 28.466 (4.101) | 3.773 |
0.5 | (90,5,2) | 1000 | 14.553 (0.560) | 17.854 (1.386) | 1.101 |
0.5 | (30,5,4) | 500 | 24.779 (1.289) | 26.680 (1.715) | 1.126 |
0.5 | (30,5,4) | 1000 | 22.189 (0.644) | 22.972 (0.832) | 0.548 |
0.5 | (60,5,4) | 500 | 27.027 (1.295) | 36.329 (4.517) | 3.960 |
0.5 | (60,5,4) | 1000 | 24.047 (0.714) | 27.204 (1.710) | 1.322 |
0.5 | (90,5,4) | 500 | 28.378 (1.413) | 44.593 (5.329) | 4.685 |
0.5 | (90,5,4) | 1000 | 25.965 (0.635) | 33.100 (1.840) | 1.698 |
0.5 | (30,10,4) | 500 | 45.319 (1.232) | 46.186 (1.577) | 0.725 |
0.5 | (30,10,4) | 1000 | 42.442 (1.009) | 42.490 (0.917) | 0.491 |
0.5 | (60,10,4) | 500 | 52.135 (3.347) | 56.454 (3.215) | 3.110 |
0.5 | (60,10,4) | 1000 | 46.593 (1.320) | 48.044 (1.696) | 0.841 |
0.5 | (90,10,4) | 500 | 55.531 (4.301) | 65.578 (4.003) | 3.381 |
0.5 | (90,10,4) | 1000 | 48.748 (1.115) | 50.934 (1.566) | 1.337 |
Real data analysis
In this section, we compare the methods by applying them to real data. In the data, the objects are grouped into three categories, and we assume that the data for the three categories are independent of each other. For the three categories, the values of (N,J) are (1884,60), (1364,21) and (1425,44), respectively, and (K,M)=(4,5).
Comparison between C _{ p } and CV for real data
ρ | category | C _{ p } | CV |
---|---|---|---|
0 | 1 | 21.433 | 22.239 |
0 | 2 | 20.323 | 20.364 |
0 | 3 | 19.989 | 20.177 |
0.5 | 1 | 19.861 | 20.390 |
0.5 | 2 | 20.523 | 20.566 |
0.5 | 3 | 19.000 | 19.221 |
Concluding remarks and future work
In this paper, we have considered an appropriate model and estimation method for questionnaire studies and derived the C _{ p } criterion to choose the tuning parameters included in the estimator. More precisely, using a dummy matrix representing the existence of the missing values, we have constructed a model which can be interpreted as an extension of the GMANOVA model of Potthoff & Roy [9] to three-dimensional array data. We have explicitly evaluated the penalty term in the C _{ p } without assuming the normality of the noise and shown that it takes a simple form. Through the simulation study and real data analysis, we have confirmed the usefulness of the derived C _{ p }. This criterion has a high prediction accuracy and low computational cost compared to CV because it can be expressed explicitly in a simple form.
It is well known that predicting the missing part is important when we construct a recommendation system, which is sometimes required in recent questionnaire studies or web surveys. However, it is in general difficult to evaluate the prediction accuracy of methods such as collaborative filtering or matrix completion. For this problem, by extending the method in this paper to a model which contains a random effect for the evaluators, it might be possible to draw common statistical inferences, including the evaluation of the prediction accuracy.
In the future, it is expected that similar results will be obtained for more complex models because the model we considered has a particular structure for X _{ i } and A. In addition, it will be necessary to treat the case where X _{ i } is random, Ξ is unknown or there exist correlations among the categories in order to use more flexible models.
Appendix 1: Matrix algebra
Here, we describe some matrix algebra that we have used in this paper. All the proofs can be found in Harville [3].
for matrices A, B, C defined in (21), and \(D\in \mathbb {R}^{q\times q}\).
Appendix 2: Unbiasedness of (2)
This completes the proof.
Appendix 3: Derivation of (17)
This and tr(Ξ)=M imply (17).
Declarations
Acknowledgements
The authors would like to thank the reviewer for his/her valuable comments and advice to improve the paper. This research was partially supported by a Grant-in-Aid for Scientific Research (23500353) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- Benzécri, J-P: Correspondence Analysis Handbook. Marcel Dekker, New York (1992).
- Candès, EJ, Recht, B: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009).
- Harville, DA: Matrix Algebra from a Statistician’s Perspective. Springer, New York (1997).
- Hoerl, AE, Kennard, RW: Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 12(1), 55–67 (1970).
- Koltchinskii, V, Lounici, K, Tsybakov, AB: Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 39(5), 2302–2329 (2011).
- Mallows, CL: Some comments on C _{ p }. Technometrics. 15(4), 661–675 (1973).
- Mallows, CL: More comments on C _{ p }. Technometrics. 37(4), 362–372 (1995).
- Nagai, I: Modified C _{ p } criterion for optimizing ridge and smooth parameters in the MGR estimator for the nonparametric GMANOVA model. Open J. Stat. 1, 1–14 (2011).
- Potthoff, RF, Roy, S: A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika. 51(3/4), 313–326 (1964).
- Stone, M: Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B Stat. Methodol. 36(2), 111–147 (1974).