
Thank you, as always.
I am doing density estimation in Python, and an error occurs when I increase the number of dimensions of the data.
I could not figure out why this error occurs, so I am asking here.
(I would appreciate it if you could tell me the cause.)

With 4 or more dimensions, the following error sometimes occurs.
It occurs roughly 50% of the time, and the probability rises as the number of dimensions increases.
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-11-ce4c335d8dd1> in <module>
      2 xy = np.vstack([a1, a2, a3, a4, a5])
      3 kernel = stats.gaussian_kde(xy)
----> 4 z_est = kernel.evaluate(xy)

~\program\anaconda\lib\site-packages\scipy\stats\kde.py in evaluate(self, points)
    244         result = zeros((m,), dtype=float)
    245
--> 246         whitening = linalg.cholesky(self.inv_cov)
    247         scaled_dataset = dot(whitening, self.dataset)
    248         scaled_points = dot(whitening, points)

~\program\anaconda\lib\site-packages\scipy\linalg\decomp_cholesky.py in cholesky(a, lower, overwrite_a, check_finite)
     89     """
     90     c, lower = _cholesky(a, lower=lower, overwrite_a=overwrite_a, clean=True,
---> 91                          check_finite=check_finite)
     92     return c
     93

~\program\anaconda\lib\site-packages\scipy\linalg\decomp_cholesky.py in _cholesky(a, lower, overwrite_a, clean, check_finite)
     38     if info > 0:
     39         raise LinAlgError("%d-th leading minor of the array is not positive "
---> 40                           "definite" % info)
     41     if info < 0:
     42         raise ValueError('LAPACK reported an illegal value in {}-th argument'

LinAlgError: 1-th leading minor of the array is not positive definite
Corresponding source code
# import
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

# Data creation (generated arbitrarily)
N1 = np.random.normal(size=400)
N2 = np.random.normal(scale=0.5, size=400)
N3 = np.random.normal(scale=0.5, size=400)
N4 = np.random.normal(scale=0.5, size=400)
N5 = np.random.normal(scale=0.5, size=400)
N6 = np.random.normal(scale=0.5, size=400)
a1 = N1 + 1 * N2
a2 = N1 - 1 * N2
a3 = N1 + 1 * N3
a4 = N1 - 1 * N3
a5 = N1 + 1 * N4
a6 = N1 - 1 * N4
a7 = N1 + 1 * N5
a8 = N1 - 1 * N5
a9 = N1 + 1 * N6
a0 = N1 - 1 * N6

# Density estimation (the error occurs once a4 or later is included)
xy = np.vstack([a1, a2, a3])
kernel = stats.gaussian_kde(xy)
z_est = kernel.evaluate(xy)

# Visualization
x = a1
y = a2
plt.scatter(x, y, c=z_est)
What I tried

Up to 3 dimensions, no error occurred (I tried about 30 times).
With 4 or more dimensions, the error occurs roughly 50% of the time,
and the probability rises as the number of dimensions increases.

Supplementary information (FW/tool version, etc.)

Python 3.7 / Anaconda / Core i5

I would appreciate your help.

  • Answer #1

    This is a self-answer.
    1: This problem is hard to solve directly (it comes from the Cholesky decomposition performed inside evaluate; see the diagnostic sketch after this list).
    2: Other libraries besides sklearn that can compute kernel density all have problems of their own and cannot handle high dimensions (many pages on the web claim it works, but I found no properly investigated results for 3 or more dimensions).
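
    As a side note on point 1: the traceback fails at linalg.cholesky(self.inv_cov), and with the generation code in the question this is not too surprising, because a1 + a2 and a3 + a4 both equal 2 * N1. Any stack containing those four variables is therefore exactly linearly dependent, its covariance matrix is singular, and whether the inversion and Cholesky step inside gaussian_kde happens to go through presumably comes down to floating-point rounding, which would fit the roughly 50% failure rate. A quick diagnostic sketch (NumPy only, reusing the arrays from the question's code):

    # Diagnostic sketch: check whether the stacked data is rank-deficient
    # (reuses a1..a4 and N1 from the question's code)
    import numpy as np

    xy = np.vstack([a1, a2, a3, a4])   # the 4-variable case that fails
    cov = np.cov(xy)                   # the matrix that gaussian_kde inverts

    print(np.linalg.matrix_rank(cov))  # expect 3, not 4: one variable is redundant
    print(np.linalg.cond(cov))         # huge condition number -> numerically singular

    # The dependence is exact up to rounding: a1 + a2 = 2*N1 = a3 + a4
    print(np.max(np.abs(a1 + a2 - (a3 + a4))))  # ~1e-16, i.e. essentially zero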

    So in the end I used sklearn's KernelDensity, choosing the bandwidth by cross-validation.
    However, sklearn's calculation method differs from scipy's, so the result will differ slightly.
    (Even with the difference, there is no practical problem, since both deviate from the true value anyway.)
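
    For concreteness, a minimal sketch of that approach, assuming scikit-learn's KernelDensity with GridSearchCV for the bandwidth (the bandwidth grid and cv=5 are illustrative choices, and the arrays come from the question's code):

    # Sketch: KDE via sklearn, bandwidth chosen by cross-validation
    # (the grid and cv below are illustrative, not from the original answer)
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KernelDensity

    # sklearn expects samples as rows: shape (n_samples, n_dims)
    data = np.vstack([a1, a2, a3, a4, a5]).T

    grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                        {'bandwidth': np.logspace(-1, 1, 20)},
                        cv=5)
    grid.fit(data)

    kde = grid.best_estimator_
    z_est = np.exp(kde.score_samples(data))  # score_samples returns log-density

    One likely source of the "slight difference" mentioned above is that KernelDensity uses a single scalar bandwidth, while scipy's gaussian_kde scales the kernel by the full covariance of the data.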

    At present there seems to be no easier solution than this.
    I hope this helps someone.