I have multiple data points, and I want to calculate how they rank against each other by cosine similarity.
# create dataset
import numpy as np
d = 3   # dimension
n = 10  # number of data points
np.random.seed(0)  # make reproducible
X = np.random.random((n, d)).astype('float32')
X[:, 0] += np.arange(n) / 1000.
# contents of X
array([[0.5488135 , 0.71518934, 0.60276335],
       [0.5458832 , 0.4236548 , 0.6458941 ],
       [0.4395872 , 0.891773  , 0.96366274],
       [0.3864415 , 0.79172504, 0.5288949 ],
       [0.57204455, 0.92559665, 0.07103606],
       [0.09212931, 0.0202184 , 0.83261985],
       [0.78415674, 0.87001216, 0.9786183 ],
       [0.8061586 , 0.46147937, 0.7805292 ],
       [0.12627442, 0.639921  , 0.14335328],
       [0.9536689 , 0.5218483 , 0.41466194]], dtype=float32)
I found that scikit-learn's NearestNeighbors can return the points closest to each data point.
from sklearn.neighbors import NearestNeighbors
# compute nearest neighbors
distance, indices = NearestNeighbors(n_neighbors=4, metric='cosine').fit(X).kneighbors(X)
# contents of indices
array([[0, 6, 3, 2],  # point 0 is closest to point 6, then to points 3 and 2
       [1, 7, 6, 0],  # and so on
       [2, 6, 3, 0],
       [3, 0, 2, 6],
       [4, 8, 3, 0],
       [5, 1, 2, 7],
       [6, 0, 1, 2],
       [7, 1, 6, 0],
       [8, 4, 3, 0],
       [9, 7, 1, 0]])
What I actually want is the reverse: how close point 0 ranks when viewed from point 6, that is, the position of point 0 in point 6's list of neighbors.
Is there an existing implementation of an algorithm that finds this rank?
The data set has at most about 5 million points, and the dimensionality is assumed to be about 100.

Answer #1
NearestNeighbors doesn't have such a feature.
In k-nearest-neighbor search, the k points closest to a given point have to be found at query time.
A naive implementation only needs to compute the distances to all points, sort them, and return the k closest points. However, this takes longer and longer as the number of samples grows.
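The naive approach described above can be sketched roughly as follows. This is a minimal illustration and not how NearestNeighbors is implemented; the function name `brute_force_knn` and the small 2-D data set are made up for the example.

```python
import numpy as np

def brute_force_knn(X, query, k):
    """Return the indices of the k rows of X closest to `query` by cosine distance.

    Hypothetical helper for illustration: it computes the distance to every
    row, sorts all of them, and keeps the first k.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
    q = query / np.linalg.norm(query)                  # unit-length query
    dist = 1.0 - Xn @ q                                # cosine distance to every row
    order = np.argsort(dist, kind="stable")            # full sort, nearest first
    return order[:k]

# toy data, chosen only so the ordering is easy to check by hand
X_demo = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
nearest = brute_force_knn(X_demo, X_demo[0], k=3)
```

Every query touches all n rows and sorts n distances, which is why the cost grows with the number of samples.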
For this reason, NearestNeighbors stores the data in a k-d tree or ball tree built from the coordinates of the points, so that a search for the k nearest points only computes distances to points near the given point, which reduces the amount of computation.
As a consequence, it cannot tell you what rank one point has when viewed from another point.
I can't be sure without trying, but computing the distances between all pairs of points means 5 million * 5 million computations, which I think would be too heavy for a home PC.
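If the rank is only needed for particular pairs rather than for all pairs, one row of similarities is enough for each query. The sketch below assumes that use case; `normalize_rows` and `cosine_rank` are hypothetical helper names, not part of scikit-learn, and tied points share the same rank.

```python
import numpy as np

def normalize_rows(X):
    """Scale every row of X to unit length (do this once, up front)."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def cosine_rank(Xn, i, j):
    """Rank of point j when all points are ordered by cosine similarity to point i.

    Xn must already be row-normalized. Rank 0 is point i itself and
    rank 1 is its nearest neighbor. Hypothetical helper for illustration.
    """
    sim = Xn @ Xn[i]                   # similarity of every point to point i
    return int(np.sum(sim > sim[j]))   # count points strictly closer to i than j is

# toy data, chosen only so the ranks are easy to check by hand
X_demo = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
Xn_demo = normalize_rows(X_demo)
rank_of_3_from_0 = cosine_rank(Xn_demo, 0, 3)
```

Each call reads every row once, so the cost of one query grows linearly with the number of points and no n-by-n distance matrix is ever stored.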