
I have multiple data points and want to calculate how they rank relative to each other by cosine similarity.

# create dataset
import numpy as np
d = 3 # dimension
n = 10 # number of data points
np.random.seed(0) # make reproducible
X = np.random.random((n, d)).astype('float32')
X[:, 0] += np.arange(n) / 1000.
# contents of X
array([[0.5488135, 0.71518934, 0.60276335],
       [0.5458832, 0.4236548, 0.6458941],
       [0.4395872, 0.891773, 0.96366274],
       [0.3864415, 0.79172504, 0.5288949],
       [0.57204455, 0.92559665, 0.07103606],
       [0.09212931, 0.0202184, 0.83261985],
       [0.78415674, 0.87001216, 0.9786183],
       [0.8061586, 0.46147937, 0.7805292],
       [0.12627442, 0.639921, 0.14335328],
       [0.9536689, 0.5218483, 0.41466194]], dtype=float32)


I found that scikit-learn's NearestNeighbors can return the points closest to each data point.

from sklearn.neighbors import NearestNeighbors
# compute nearest neighbors
distance, indices = NearestNeighbors(n_neighbors=4, metric='cosine').fit(X).kneighbors(X)
# contents of indices
array([[0, 6, 3, 2], # point 0 is closest to itself, then to points 6, 3, and 2
       [1, 7, 6, 0], # and so on
       [2, 6, 3, 0],
       [3, 0, 2, 6],
       [4, 8, 3, 0],
       [5, 1, 2, 7],
       [6, 0, 1, 2],
       [7, 1, 6, 0],
       [8, 4, 3, 0],
       [9, 7, 1, 0]])

Conversely, I want to know where point 0 ranks when viewed from point 6, that is, which nearest neighbor of point 6 it is.
Is there an existing implementation of an algorithm that finds this rank?

If not, is it possible to calculate it in real time?
There are at most about 5 million data points, and I assume about 100 dimensions.

  • Answer # 1

      

    Conversely, I want to know where point 0 ranks when viewed from point 6, that is, which nearest neighbor of point 6 it is.
      Is there an existing implementation of an algorithm that finds this rank?

    NearestNeighbors doesn't have such a feature.
    In k-nearest neighbor search, the task at query time is to find the k points closest to a given point.
    A naive implementation only needs to compute the distances to all points, sort them, and return the k closest ones. However, this becomes slow as the number of samples grows.
    NearestNeighbors can therefore store the data in a k-d tree or ball tree built from the coordinates of the points, so that a search for the k nearest points only computes distances to points near the query, which reduces the amount of computation.
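
For reference, the naive approach described above (compute the distance to every point, sort, keep the k closest) might look like the sketch below. The helper name `brute_force_knn` is made up for illustration and is not part of scikit-learn; the dataset is the toy one from the question.

```python
import numpy as np

def brute_force_knn(X, query, k):
    """Return indices and cosine distances of the k points in X closest to query.

    Hypothetical helper for illustration only; not a scikit-learn function.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    q_unit = query / np.linalg.norm(query)
    # Cosine distance = 1 - cosine similarity, computed against every point.
    dist = 1.0 - X_unit @ q_unit
    # Sort all distances and keep the k smallest.
    order = np.argsort(dist)[:k]
    return order, dist[order]

# Same toy dataset as in the question.
np.random.seed(0)
X = np.random.random((10, 3)).astype('float32')
X[:, 0] += np.arange(10) / 1000.

idx, dist = brute_force_knn(X, X[0], k=4)
print(idx)  # should match the first row of `indices` in the question: [0 6 3 2]
```

Every query touches all n points, which is the cost that the tree-based structures try to avoid.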

    Therefore, NearestNeighbors cannot tell you where a point ranks when viewed from another point.
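
That said, the rank for one specific pair does not need a full neighbor list. As a rough sketch (not something NearestNeighbors provides), it should be enough to compute the similarity from the viewing point to every other point once and count how many are strictly more similar than the target. The function name `rank_from` is hypothetical, and ties are ignored.

```python
import numpy as np

def rank_from(X_unit, i, j):
    """Rank of point j as seen from point i (0 = i itself, 1 = nearest other point).

    Hypothetical helper; assumes the rows of X_unit are already unit-normalized.
    """
    # Cosine similarity from point i to every point: one pass over the data.
    sims = X_unit @ X_unit[i]
    # Count the points strictly more similar to i than j is (this includes i itself).
    return int(np.sum(sims > sims[j]))

# Same toy dataset as in the question.
np.random.seed(0)
X = np.random.random((10, 3)).astype('float32')
X[:, 0] += np.arange(10) / 1000.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

rank_0_from_6 = rank_from(X_unit, 6, 0)  # row 6 of indices is [6, 0, 1, 2] -> rank 1
rank_0_from_1 = rank_from(X_unit, 1, 0)  # row 1 of indices is [1, 7, 6, 0] -> rank 3
print(rank_0_from_6, rank_0_from_1)
```

A single query like this costs about n × d multiply-adds, so it scales linearly with the number of points rather than quadratically. Whether that counts as "real time" for 5 million points would have to be measured.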

      If not, is it possible to calculate it in real time?

    I can't say for sure without trying, but computing the distances between all pairs of points means 5 million × 5 million calculations, which I think would be tough on a home PC.
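
To put rough numbers on that, here is a back-of-the-envelope calculation. It assumes float32 storage (4 bytes per value), as in the question's toy dataset, and only counts sizes; it says nothing about actual running time.

```python
n = 5_000_000  # number of data points (from the question)
d = 100        # number of dimensions (from the question)
bytes_per_value = 4  # assumption: float32

# All-pairs distance matrix: n * n entries.
pairs = n * n
matrix_bytes = pairs * bytes_per_value
print(f"pairs: {pairs:.1e}")                                # 2.5e+13
print(f"full distance matrix: {matrix_bytes / 1e12:.0f} TB")  # 100 TB

# The raw data itself is far smaller: n * d values.
data_bytes = n * d * bytes_per_value
print(f"raw data: {data_bytes / 1e9:.0f} GB")               # 2 GB
```

So the full matrix cannot be held in memory on a typical PC, while the data itself can.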