
Application of k-nearest neighbor method to time series data
↑ This question is an extension of the question linked above.

I tried to plot the degree of anomaly of the x-axis acceleration by applying the k-nearest neighbor method to time series data (acceleration data).
The anomaly score does come out clearly high at the anomalous parts.
Two questions remain:

・How large should the anomaly score be before a point is judged anomalous? In other words, how should the threshold be determined?
・What kind of code should I write to evaluate how accurate the anomaly judgment actually is compared with the original data?

I would appreciate any guidance.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def main():
    df = pd.read_csv("20191121.csv")
    # Remove columns that are not needed from the DataFrame
    df = df.drop(['name', 'x_rad/s', 'y_rad/s', 'z_rad/s'], axis=1)
    df = df.set_index('time')
    # Visualize the x, y, z-axis accelerations
    df.plot().legend(loc='upper left')
    # The first 2480 x-axis accelerations are used as training data,
    # and the next 2479 as test data.
    # df.iloc[2479] -> 53845130
    # df.iloc[2480] -> 53845150
    train_data = df.loc[:53845130, 'x_ags']
    test_data = df.loc[53845150:, 'x_ags'].reset_index(drop=True)
    # Window width
    width = 30
    # k for the k-nearest neighbor method
    nk = 1
    # Create a set of vectors using the window width
    train = embed(train_data, width)
    test = embed(test_data, width)
    # Fit the k-nearest neighbor model on the training windows
    neigh = NearestNeighbors(n_neighbors=nk)
    neigh.fit(train)
    # Calculate the distance of each test window to its nearest training window
    d = neigh.kneighbors(test)[0]
    # Normalize the distances
    mx = np.max(d)
    d = d / mx
    # Training data
    plt.subplot(221)
    plt.plot(train_data, label='Training')
    plt.xlabel("Sample", fontsize=12)
    plt.ylabel("Amplitude", fontsize=12)
    plt.grid()
    leg = plt.legend(loc=1, fontsize=15)
    leg.get_frame().set_alpha(1)
    # Anomaly score
    plt.subplot(222)
    plt.plot(d, label='d')
    plt.xlabel("Sample", fontsize=12)
    plt.ylabel("Anomaly score", fontsize=12)
    plt.grid()
    leg = plt.legend(loc=1, fontsize=15)
    leg.get_frame().set_alpha(1)
    # Test data
    plt.subplot(223)
    plt.plot(test_data, label='Test')
    plt.xlabel("Sample", fontsize=12)
    plt.ylabel("Amplitude", fontsize=12)
    plt.grid()
    leg = plt.legend(loc=1, fontsize=15)
    leg.get_frame().set_alpha(1)
    plt.show()

def embed(lst, dim):
    """Divide the series into sliding windows of the given width."""
    emb = np.empty((0, dim), float)
    for i in range(lst.size - dim + 1):
        tmp = np.array(lst[i:i + dim])[::-1].reshape((1, -1))
        emb = np.append(emb, tmp, axis=0)
    return emb

if __name__ == '__main__':
    main()
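
As an aside, embed() rebuilds the array with np.append on every iteration, which is quadratic in the number of windows. On NumPy 1.20 or newer, sliding_window_view builds an equivalent window matrix in one call; a minimal sketch (embed_fast is a hypothetical name):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def embed_fast(lst, dim):
    # All sliding windows at once; shape is (len(lst) - dim + 1, dim).
    windows = sliding_window_view(np.asarray(lst), dim)
    # Reverse each window to match the original embed(); the reversal
    # does not change nearest-neighbor Euclidean distances as long as
    # train and test are embedded the same way.
    return windows[:, ::-1].copy()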
  • Answer #1

    You may have misunderstood the k-nearest neighbor method.

    The k-nearest neighbor method with k = 1 simply adopts the label of the single closest training example; the magnitude of that distance plays no role in the decision.

    Wikipedia - k-nearest neighbor method:
    "The k-nearest neighbor method with k = 1 is called the nearest neighbor method; the class of the nearest training example is adopted."

    Note that k = 1 is not essential to this point. Even for k > 1, the k training examples are selected purely in order of closeness, regardless of absolute distance, and the label is predicted by majority vote among them. In the k-nearest neighbor method, once k is fixed, only the rank order of the distances matters for prediction; their absolute magnitude is irrelevant.
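
    To make this concrete, here is a minimal sketch with made-up 1-D data showing that KNeighborsClassifier's prediction depends only on which training point is closest, not on how far away it is:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D training data with two classes.
X_train = np.array([[0.0], [1.0], [10.0], [11.0]])
y_train = np.array([0, 0, 1, 1])

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)

# Both queries get class 1, because their nearest neighbor is a
# class-1 point -- even though the second query is far from every
# training point. Absolute distance plays no role.
print(clf.predict([[10.5], [1000.0]]))  # -> [1 1]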

    Conversely, if your domain knowledge tells you that the judgment should depend on the absolute magnitude of the distance, then the k-nearest neighbor classifier is not the right tool, and you should consider other techniques.
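
    That said, for your two concrete questions about the distance-based anomaly score itself: a common recipe is to take a high quantile of the scores observed on data assumed to be normal as the threshold (note that with k = 1 you cannot score the training windows against themselves, since each window is its own nearest neighbor at distance 0; use a held-out normal segment or the second-nearest neighbor instead), and then, given ground-truth labels for the test period, compare the thresholded predictions with standard metrics. A minimal sketch, where scores_normal, scores_test, and y_true are hypothetical placeholders you would replace with your own arrays:

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
# Placeholder anomaly scores (e.g. the normalized distances d above):
# scores on a held-out normal segment and on the test windows.
scores_normal = rng.random(2451)
scores_test = rng.random(2450)
# Placeholder 0/1 ground-truth labels, one per test window.
y_true = (rng.random(2450) > 0.95).astype(int)

# Threshold = 99th percentile of the scores on normal data; the
# quantile is a knob to tune, not a magic constant.
threshold = np.quantile(scores_normal, 0.99)
y_pred = (scores_test > threshold).astype(int)

# Compare the thresholded judgments with the ground truth.
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))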