A kernel method for canonical correlation analysis†
†This is the full version of a paper presented at IMPS2001 (International Meeting of the Psychometric Society, Osaka, 2001).
Abstract
Canonical correlation analysis is a technique to extract common
features from a pair of multivariate data. In complex situations,
however, it does not extract useful features because of its linearity.
On the other hand, the kernel method used in support vector
machines is an efficient approach to improving such linear methods.
In this paper, we investigate the effectiveness of applying the kernel method
to canonical correlation analysis.
Keywords: multivariate analysis, multimodal data, kernel method, regularization
1 Introduction
This paper deals with methods to extract common features from multiple information sources. For instance, consider a learning task in pattern recognition in which an object is presented as an image and its name is given by speech. For a newly given image, the system is required to answer its name by speech, and for newly given speech, the system must answer the corresponding image. The task can be considered a regression problem from image to speech and vice versa. However, since the dimensionalities of images and speech signals are generally very large, regression analysis may not work effectively. In order to solve the problem, it is useful to map the inputs into a low-dimensional feature space and then solve the regression problem there.
Canonical correlation analysis (CCA) has been used for such a purpose. CCA finds linear transformations of a pair of multivariates such that the correlation coefficient between the transformed variables is maximized. From an information-theoretic point of view, the transformation maximizes the mutual information between the extracted features. However, if there is a nonlinear relation between the variates, CCA does not always extract useful features.
On the other hand, support vector machines (SVM) have attracted a lot of attention for their state-of-the-art performance in pattern recognition [8]. The kernel trick used in SVM is applicable not only to classification but also to other linear techniques, for example kernel regression and kernel PCA [6].
In this paper, we apply the kernel method to CCA. Since the kernel method is likely to overfit the data, we incorporate a regularization technique to avoid the overfitting.
2 Canonical correlation analysis
CCA was proposed by Hotelling in 1935 [3]. Suppose there is a pair of multivariates x and y. CCA finds a pair of linear transformations such that the correlation coefficient between the extracted features is maximized (Fig. 1). For the sake of simplicity, we assume that the averages of x and y are 0 and that the dimensionality of the features is 1. Then, with the transformations

u = (a, x),   (1)

v = (b, y),   (2)

we would like to find the transformations a and b that maximize the correlation coefficient

ρ = E[uv],   (3)

where (·, ·) represents the inner product and E[·] the expectation. We have to further assume

E[u^2] = E[v^2] = 1   (4)

to reduce the freedom of scaling of a and b. a and b can be found as the eigenvectors corresponding to the maximal eigenvalue of a generalized eigenvalue problem. If we need more than one feature dimension, we can take the eigenvectors corresponding to the other largest eigenvalues.
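As a concrete illustration, the generalized eigenvalue formulation of linear CCA can be sketched numerically as follows. This is our own minimal helper (not code from the paper), assuming centered data matrices and using SciPy's symmetric generalized eigensolver; the function name is ours.

```python
import numpy as np
from scipy.linalg import eigh

def linear_cca(X, Y):
    """Top canonical pair for centered data matrices X (n x p) and Y (n x q)."""
    n = X.shape[0]
    Cxx = X.T @ X / n
    Cyy = Y.T @ Y / n
    Cxy = X.T @ Y / n
    # a solves  Cxy Cyy^{-1} Cyx a = rho^2 Cxx a  (a generalized eigenproblem)
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    w, V = eigh(M, Cxx)                     # eigenvalues in ascending order
    a = V[:, -1]                            # direction of the largest rho^2
    rho = float(np.sqrt(max(w[-1], 0.0)))
    # b is proportional to Cyy^{-1} Cyx a; normalize so that E[u^2] = E[v^2] = 1
    b = np.linalg.solve(Cyy, Cxy.T @ a)
    a = a / np.sqrt(a @ Cxx @ a)
    b = b / np.sqrt(b @ Cyy @ b)
    return a, b, rho
```

Further canonical pairs correspond to the eigenvectors of the next-largest eigenvalues of the same problem.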
CCA is also important from an information-theoretic viewpoint, since it finds the transformation that maximizes the mutual information between the features when x and y are jointly Gaussian. Even if this assumption is not fulfilled, CCA can still be used in some cases. However, if the purpose is regression, large correlation coefficients are crucially necessary. The correlation coefficients can be small in the following cases:

1. x and y have almost no relation.

2. There is a strong nonlinear relation between x and y.
No improvement is possible in the first case. In the second case, however, we can recover the relation by some methods. One such method is to allow nonlinear transformations: Asoh et al. [4] have proposed a neural network model that approximates the optimal nonlinear canonical correlation analysis. However, this model requires a lot of computation time and also has many local optima. In this paper, we instead incorporate the kernel method, which enables nonlinear transformations with small computation and no undesired local optima.
3 Kernel CCA
First, x and y are transformed into elements of Hilbert spaces, Φ(x) and Ψ(y). By taking inner products with parameter vectors a and b in those Hilbert spaces, we find the features

u = (a, Φ(x)),   (5)

v = (b, Ψ(y)),   (6)

that maximize the correlation coefficient.
Now, suppose we have n pairs of training samples (x_1, y_1), …, (x_n, y_n), and write u_i = (a, Φ(x_i)) and v_i = (b, Ψ(y_i)). a and b can be found by solving the Lagrangian

L_0(λ, a, b) = Σ_i u_i v_i − (λ_1/2)(Σ_i u_i^2 − 1) − (λ_2/2)(Σ_i v_i^2 − 1).   (7)

However, the Lagrangian is ill-posed as it is when the dimensionalities of the Hilbert spaces are large. Therefore, we introduce quadratic regularization terms and get the well-posed Lagrangian

L(λ, a, b) = Σ_i u_i v_i − (λ_1/2)(Σ_i u_i^2 + η ||a||^2 − 1) − (λ_2/2)(Σ_i v_i^2 + η ||b||^2 − 1),   (8)
where η is a regularization constant. Note that the average of u over the samples is given by

(1/n) Σ_i u_i = (a, (1/n) Σ_i Φ(x_i)),   (9)

and the average of v is given by

(1/n) Σ_i v_i = (b, (1/n) Σ_i Ψ(y_i)),   (10)

which are not zero in general; the features can be centered by subtracting these averages.
Now, from the condition that the derivative of L with respect to a is equal to 0, we find that a lies in the span of Φ(x_1), …, Φ(x_n), that is,

a = Σ_i α_i Φ(x_i),   (11)

where each α_i is a scalar. As a result, we have

u_j = Σ_i α_i (Φ(x_i), Φ(x_j)).   (12)

Therefore, u_j can be calculated using only inner products in the Hilbert space. The kernel trick used in SVM employs a kernel function k(x_i, x_j) in place of the inner product between Φ(x_i) and Φ(x_j). In practice, since we do not need an explicit form of Φ, we first choose a kernel k that can be decomposed in the form of an inner product. By Mercer's theorem, any symmetric positive definite kernel can be decomposed into such an inner product form.
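The kernel trick can be checked directly in a case where the feature map is explicit. The quadratic map below is a textbook example, not one from this paper: for Φ(x) = (x_1^2, √2 x_1 x_2, x_2^2) on R^2, the inner product (Φ(x), Φ(x')) equals the polynomial kernel (x, x')^2, so the kernel evaluation never needs Φ itself.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi(x) @ phi(xp)   # inner product computed in feature space
rhs = (x @ xp) ** 2      # kernel evaluation, no explicit map needed
# both equal 1.0 here
```

The Gaussian kernel used later in the paper has an infinite-dimensional feature map, which is exactly why working with kernel evaluations alone is essential.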
Let us rewrite L by the kernels. First, let α = (α_1, …, α_n)^T and β = (β_1, …, β_n)^T, where b = Σ_i β_i Ψ(y_i) analogously to (11), and define the Gram matrices

(K_1)_{ij} = k_1(x_i, x_j),   (13)

(K_2)_{ij} = k_2(y_i, y_j).   (14)

Then, we obtain L in the form

L(λ, α, β) = u^T v − (λ_1/2)(u^T u + η ||a||^2 − 1) − (λ_2/2)(v^T v + η ||b||^2 − 1),   (15)

where

u = (u_1, …, u_n)^T,   (16)

v = (v_1, …, v_n)^T,   (17)

u = K_1 α,   (18)

v = K_2 β,   (19)

||a||^2 = α^T K_1 α,   (20)

and ||b||^2 = β^T K_2 β.
If the samples are drawn from continuous distributions, the sample points are distinct and K_1 and K_2 are positive definite almost surely, and we can show λ_1 = λ_2 (= λ, say) from the constraints. As a result, we have a generalized eigenvalue problem for α and β,

K_2 β = λ (K_1 + η I) α,   (21)

K_1 α = λ (K_2 + η I) β.   (22)

It can be solved by a generalized eigenvalue problem package, or reduced to an ordinary eigenvalue problem through the Cholesky decomposition of K_1 + η I and K_2 + η I.
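Putting the pieces together, the whole procedure can be sketched numerically as follows. This is our own illustration, not the paper's implementation: it assumes Gaussian kernels with a common width, centers the Gram matrices so that the extracted features have zero mean, and adds a small jitter term (our addition) to keep the right-hand-side matrix positive definite after centering.

```python
import numpy as np
from scipy.linalg import eigh

def gauss_gram(Z, sigma):
    """Centered Gram matrix of the Gaussian kernel exp(-||z-z'||^2 / (2 sigma^2))."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    H = np.eye(len(Z)) - np.ones((len(Z), len(Z))) / len(Z)
    return H @ K @ H   # centering makes the features zero-mean

def kernel_cca(X, Y, sigma=1.0, eta=0.1, jitter=1e-6):
    """Top component of regularized kernel CCA; returns features u, v and the eigenvalue."""
    n = X.shape[0]
    K1, K2 = gauss_gram(X, sigma), gauss_gram(Y, sigma)
    Z, I = np.zeros((n, n)), np.eye(n)
    # Stationarity of the regularized Lagrangian, written as one symmetric
    # block generalized eigenproblem  A w = lam B w  for w = (alpha, beta):
    A = np.block([[Z, K1 @ K2], [K2 @ K1, Z]])
    B = np.block([[K1 @ K1 + eta * K1 + jitter * I, Z],
                  [Z, K2 @ K2 + eta * K2 + jitter * I]])
    lam, V = eigh(A, B)                   # eigenvalues in ascending order
    alpha, beta = V[:n, -1], V[n:, -1]    # eigenvector of the largest one
    return K1 @ alpha, K2 @ beta, lam[-1]
```

On a nonlinearly related pair such as x = z and y = z^2, the extracted training features come out strongly correlated even though the linear correlation of the raw variables is near zero, which is the behavior observed in the simulations below.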
4 Computer simulation
4.1 Simulation 1
We generate training samples and test samples independently as follows. First, a one-dimensional latent variable is generated from a uniform distribution, and then a pair of two-dimensional variables x and y is generated from it by

(23)

(24)

where the additive noise terms are independent two-dimensional Gaussian noises with standard deviation 0.05.
We test with 40 training samples and 100 test samples. The u–v scatter plot of (linear) CCA is shown in Fig. 3. The correlation coefficients between the components of u (rows) and v (columns) are as follows, with the values for the test samples in parentheses:

          v_1           v_2
u_1   0.71 (0.40)   0.00 (0.09)
u_2   0.00 (0.00)   0.27 (0.19)
The u–v plot of kernel CCA is shown in Fig. 4. We used the Gaussian kernel

k(z, z') = exp(-||z - z'||^2 / (2σ^2))   (25)

for both x and y, with the kernel width and the regularization constant chosen by hand. The correlation coefficients are as follows, with the values for the test samples in parentheses:

          v_1           v_2
u_1   0.98 (0.95)   0.00 (0.02)
u_2   0.00 (0.02)   0.97 (0.93)
We only show up to the second components, though kernel CCA also yields higher components.
4.2 Simulation 2
This section examines an artificial pattern recognition task in the multimodal setting described at the beginning of the paper.
Training samples x_i and y_i are generated randomly from a uniform distribution and combined into random pairs; each pair of training samples represents a class center. Test samples are generated by adding independent Gaussian noise with standard deviation 0.05 to randomly chosen training samples.
We test with 10 training samples (classes) and 100 test samples.
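This data-generation process can be sketched as follows. The snippet is our own reconstruction: the paper does not state the range of the uniform distribution or the dimensionality of the samples, so the unit square [0, 1]^2 and all variable names are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_class, n_test, sd = 10, 100, 0.05
Xc = rng.uniform(0.0, 1.0, size=(n_class, 2))    # class centers, modality 1
Yc = rng.uniform(0.0, 1.0, size=(n_class, 2))    # randomly paired centers, modality 2
labels = rng.integers(0, n_class, size=n_test)   # class of each test sample
Xt = Xc[labels] + sd * rng.normal(size=(n_test, 2))   # noisy test samples, modality 1
Yt = Yc[labels] + sd * rng.normal(size=(n_test, 2))   # noisy test samples, modality 2
```

Because test samples of the same class cluster around a shared pair of centers, features that are common to the two modalities correspond to class identity.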
The u–v plot of the CCA result is shown in Fig. 5. The correlation coefficients between the components of u (rows) and v (columns) are as follows, with the values for the test samples in parentheses:

          v_1           v_2
u_1   0.40 (0.44)   0.00 (0.10)
u_2   0.00 (0.05)   0.13 (0.19)
The u–v plot of the kernel CCA result for the same dataset is shown in Fig. 6.
We use the Gaussian kernel, again with the kernel width and the regularization constant chosen by hand. The correlation coefficients between the features are as follows, with the values for the test samples in parentheses:

          v_1           v_2
u_1   0.97 (0.90)   0.00 (0.04)
u_2   0.00 (0.01)   0.95 (0.88)
5 Concluding remarks
5.1 Kernel method and regularization
We have proposed kernel canonical correlation analysis, in which the kernel method is incorporated into CCA. As in SVM, the essential points are nonlinearization by the kernel method and avoidance of overfitting by a regularization technique.
In general, it is important to determine the regularization parameter appropriately. Moreover, the choice of kernel is crucial for performance. Although all parameters were determined by hand in the simulations of this paper, more systematic approaches are possible, such as resampling methods like cross-validation and empirical Bayes approaches [7]. Such techniques usually require iterative algorithms, which are time-consuming and are also likely to be trapped in local optima. Examining these issues is future work.
As for the regularization term, we can use α^T α and β^T β instead of the quadratic forms α^T K_1 α and β^T K_2 β used in this paper. In the kernel discriminant analysis described below, such a different type of regularization term is used. The time complexities of the two types are the same, and empirically we could not find a significant difference in performance. However, more realistic experiments may be needed.
5.2 Relation to kernel discriminant analysis
Canonical correlation analysis is closely related to Fisher's discriminant analysis (FDA), which finds a mapping that minimizes the within-class variance and maximizes the between-class variance for effective pattern recognition. FDA can be considered a special case of CCA. Mika et al. [5] have proposed a kernel method for FDA, which is not strictly included in kernel CCA because kernel FDA does not transform the class labels by a nonlinear mapping. In both kernel CCA and kernel FDA, it is difficult to obtain a sparse representation of the mapping. It would be a promising idea to incorporate sparsity into the utility function.
5.3 Future issues from the information theory
The author's group has proposed multimodal independent component analysis (multimodal ICA), which extends CCA by incorporating an information-theoretic viewpoint [2]. Its transformation is restricted to be linear, so it has sometimes been difficult to extract useful features from nonlinearly related multivariates. This raises a question: can we integrate kernel CCA with multimodal ICA in order to extract useful features?
The answer to this question depends on the properties of the given data. If the noise level is low, as in the simulations of this paper, the regularization constants are set to small values and the correlation coefficients are expected to be almost 1. In that case, we cannot expect multimodal ICA to improve performance, because a correlation coefficient close to 1 already achieves a large amount of mutual information.
On the other hand, when the noise level is high, multimodal ICA can possibly improve performance. However, in such a case, linear CCA is sometimes sufficient in practice. If we learn a multiple-valued function, as in the acquisition of multiple concepts [1], the integration may be worth trying, because the correlation coefficients are small there even if the noise level is low.
Let us consider further the case where the noise level is low. From the results of the simulations in the previous section, samples are mapped into a few clusters, which will make regression between u and v difficult. In such a case, the distributions of u and v should rather be scattered; from the information-theoretic viewpoint, the feature distribution should have a large entropy. Since the distribution with the largest entropy under a fixed mean and variance is the Gaussian, Gaussianity can be used as a utility function; for example, the third and fourth cumulants should be as small as possible. This seems opposite to projection pursuit and independent component analysis, but the difference may come from the difference in purpose: ICA is for visualization, while our task is regression. The assumptions on noise are also different. These issues are related to the sparsity discussed in the previous section, and they are also left as future work.
References
 [1] S. Akaho, S. Hayamizu, O. Hasegawa, T. Yoshimura, H. Asoh: Concept acquisition from multiple information sources by the EM algorithm, Trans. of IEICE, Vol. J80-A, No. 9, pp. 1546–1553, 1997. (in Japanese; an English version (ETL technical report 97-8) is available at http://www.neurosci.aist.go.jp/~akaho/papers/ETLTR978E.ps.gz)
 [2] S. Akaho and S. Umeyama: Multimodal Independent Component Analysis — A method of feature extraction from multiple information sources, Electronics and Communications in Japan, Part 3, Fundamental Electronic Science, Vol. 84, No. 11, pp. 21–28, 2001. (a summary version is available in the IJCNN'99 paper, http://www.neurosci.aist.go.jp/~akaho/papers/IJCNN99.ps.gz)
 [3] T. W. Anderson: An Introduction to Multivariate Statistical Analysis — Second edition, John Wiley & Sons, 1984.
 [4] H. Asoh, O. Takechi: An approximation of Nonlinear Canonical Correlation Analysis by Multilayer Perceptrons, Proc. of Int. Conf. Artificial Neural Networks, pp. 713–716, 1994.
 [5] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.-R. Müller: Fisher discriminant analysis with kernels, in Y.-H. Hu et al. (eds.): Neural Networks for Signal Processing IX, pp. 41–48, IEEE, 1999.
 [6] B. Schölkopf, A. Smola and K. Müller: Kernel principal component analysis, In B. Schölkopf et al. (eds), Advances in Kernel Methods, Support Vector Learning, MIT Press, 1998.
 [7] M. E. Tipping: The relevance vector machine, in Advances in Neural Information Processing Systems (NIPS) 12, 2000.
 [8] V.N. Vapnik : Statistical Learning Theory, John Wiley & Sons, 1998.