
The dataset is a dictionary whose keys are alphabetic symbols and whose values are lists of numbers.
The length of each list varies from key to key.
For every pair of keys, multiply each element of one key's list by every element of the other key's list, and keep the largest product for each element.
The sum of those largest products divided by the length of the list (that is, their average) is output as the value between the two keys.

I want to do this with a large data set.

Error message

The code works fine for small datasets.
However, when the number of keys in the dataset is large (e.g. 30,000),
each key has to be compared against an enormous number of targets,
so the current code takes a very long time to run.

I would like to parallelize it, but I don't know how to modify the code.
I would appreciate any advice.

Applicable source code
data = {
    'A': [1, 3, 5, 2, 1, 8, 9],
    'B': [9, 4, 3],
    'C': [8, 5, 5, 6, 1]
}
output = {}
for alph, nums in data.items():
    avg = {}
    my_list = data[alph]
    for target_alph, target_nums in data.items():
        target_list = data[target_alph]
        if alph == target_alph:
            continue
        max_nums = []
        for i in my_list:
            max_num = 0
            for j in target_list:
                result = i * j
                if result is not None and result > max_num:
                    max_num = result
            max_nums.append(max_num)
        avg[target_alph] = sum(max_nums) / len(max_nums)
    output[alph] = avg
print(output)


Output (line breaks added because the single line is long)

{'A': {'B': 37.285714285714285, 'C': 33.142857142857146},
 'B': {'A': 48.0, 'C': 42.666666666666664},
 'C': {'A': 45.0, 'B': 45.0}}
What I tried

I tried to extract the inner loops into a function, but I couldn't do it properly
and got an error. Even inside the function, alph and target_alph still seem to be needed, so I don't see how using a function would improve efficiency.

#!/usr/bin/env python
# coding: utf-8
data = {
    'A': [1, 3, 5, 2, 1, 8, 9],
    'B': [9, 4, 3],
    'C': [8, 5, 5, 6, 1]
}
output = {}
for alph, nums in data.items():
    avg = {}
    my_list = nums
    for target_alph, target_nums in data.items():
        target_list = target_nums
        """
        if alph == target_alph:
            continue
        max_nums = []
        for i in my_list:
            max_num = 0
            for j in target_list:
                result = i * j
                if result is not None and result > max_num:
                    max_num = result
            max_nums.append(max_num)
        """
        avg[target_alph] = avg_of_max(my_list, target_list)
    output[alph] = avg
print(output)

def avg_of_max_nums(my_list, target_list):
    for alph, nums in data.items():
        for target_alph, target_nums in data.items():
            if alph == target_alph:
                continue
            max_nums = []
            for i in my_list:
                max_num = 0
                for j in target_list:
                    result = i * j
                    if result is not None and result > max_num:
                        max_num = result
                max_nums.append(max_num)
             return sum(max_nums) / len(max_nums)

Error text

$python sample.py
  File "sample.py", line 49
    return sum (max_nums)/len (max_nums)
                                       ^
IndentationError: unindent does not match any outer indentation level
Supplemental information (FW/tool version etc.)

Python 3.6

  • Answer # 1

    By improving the algorithm and implementing it with NumPy, you can make it several orders of magnitude faster on reasonably large data.

    Since only the largest product is used, the other products never need to be computed:
    max([x * y for y in lst]) is x * max(lst) (x, y, lst stand for any suitable data, assuming non-negative values).

    Dividing the sum by the length of the list is simply taking the mean.
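The two observations above can be checked on the question's own data. This is a small sketch, assuming non-negative numbers as stated; the variable names brute_force and shortcut are only illustrative. The two results may differ in the last floating-point digit, so they are compared with math.isclose rather than ==.

```python
import math

my_list = [1, 3, 5, 2, 1, 8, 9]   # the list for 'A' in the question
target_list = [9, 4, 3]           # the list for 'B' in the question

# Brute force: for each x take the largest product with any y, then average.
brute_force = sum(max(x * y for y in target_list) for x in my_list) / len(my_list)

# Shortcut: mean of my_list multiplied by the maximum of target_list.
shortcut = (sum(my_list) / len(my_list)) * max(target_list)

print(brute_force, shortcut)
assert math.isclose(brute_force, shortcut)
```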

    In short, you only need to compute the maximum and the mean of each list and take their outer product. This can be written concisely with NumPy.

    import timeit
    from itertools import permutations
    import numpy as np
    data1 = {
        'A': [1, 3, 5, 2, 1, 8, 9],
        'B': [9, 4, 3],
        'C': [8, 5, 5, 6, 1]
    }
    data2 = {str(i): [np.random.randint(1000) for _ in range(50)]
             for i in range(50)}
    data3 = {str(i): [np.random.randint(1000) for _ in range(500)]
             for i in range(500)}

    def original(data):
        output = {}
        for alph, nums in data.items():
            avg = {}
            my_list = data[alph]
            for target_alph, target_nums in data.items():
                target_list = data[target_alph]
                if alph == target_alph:
                    continue
                max_nums = []
                for i in my_list:
                    max_num = 0
                    for j in target_list:
                        result = i * j
                        if result is not None and result > max_num:
                            max_num = result
                    max_nums.append(max_num)
                avg[target_alph] = sum(max_nums) / len(max_nums)
            output[alph] = avg
        return output

    def changed1(data):
        idx = list(data.keys())
        maxes = np.array([max(data[k]) for k in idx])
        means = np.array([sum(data[k]) / len(data[k]) for k in idx])
        A = np.outer(means, maxes)
        idx_ij = {k: i for i, k in enumerate(idx)}
        d = {k: {} for k in idx}
        for k1, k2 in permutations(idx, 2):
            d[k1][k2] = A[idx_ij[k1]][idx_ij[k2]]
        return d

    print(original(data1))
    print(changed1(data1))
    """=>
    {'B': {'A': 48.0, 'C': 42.666666666666664}, 'A': {'B': 37.285714285714285, 'C': 33.142857142857146}, 'C': {'B': 45.0, 'A': 45.0}}
    {'B': {'A': 48.0, 'C': 42.666666666666664}, 'A': {'B': 37.28571428571429, 'C': 33.142857142857146}, 'C': {'B': 45.0, 'A': 45.0}}
    """
    print(timeit.timeit(lambda: original(data1), number=10000))
    print(timeit.timeit(lambda: changed1(data1), number=10000))
    """=>
    0.2061475639929995
    0.27564139396417886
    """
    print(timeit.timeit(lambda: original(data2), number=10))
    print(timeit.timeit(lambda: changed1(data2), number=10))
    """=>
    4.815463805978652
    0.01332262804498896
    """
    print(timeit.timeit(lambda: changed1(data3), number=10))
    """=>
    1.288064556021709
    """

    It is not especially elegant code, but for now it appears to be a little over two orders of magnitude faster.
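For the 30,000 keys mentioned in the question, a nested dict would hold roughly 900 million entries, so one option is to keep only the NumPy matrix and look values up on demand. The following is a sketch of that idea, not part of the answer above; pairwise_matrix and position are hypothetical names, and the same non-negative assumption applies. Note that a 30,000 x 30,000 float64 matrix still needs about 7.2 GB, and its diagonal (a key paired with itself) has no meaning in the original output.

```python
import numpy as np


def pairwise_matrix(data):
    """Return (matrix, position) where matrix[i, j] = mean(list i) * max(list j)."""
    keys = list(data)
    maxes = np.array([max(data[k]) for k in keys], dtype=float)
    means = np.array([sum(data[k]) / len(data[k]) for k in keys], dtype=float)
    matrix = np.outer(means, maxes)
    position = {k: i for i, k in enumerate(keys)}  # key -> row/column index
    return matrix, position


data1 = {
    'A': [1, 3, 5, 2, 1, 8, 9],
    'B': [9, 4, 3],
    'C': [8, 5, 5, 6, 1],
}
matrix, position = pairwise_matrix(data1)

# Value "between" 'B' and 'A', looked up without building a dict of dicts.
value_b_a = matrix[position['B'], position['A']]
print(value_b_a)
```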

  • Answer # 2

    * This is not a direct answer.

    Before considering parallelization or a different algorithm, is the code itself written without wasted work?

    for alph, nums in data.items():
        avg = {}
        my_list = data[alph]

    Here nums == my_list, so the extra lookup data[alph] is unnecessary.
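A tiny check of this point, using a shortened version of the question's data as an assumption: items() already yields the value stored under each key, so nums is the very same list object as data[alph].

```python
data = {
    'A': [1, 3, 5, 2, 1, 8, 9],
    'B': [9, 4, 3],
}

for alph, nums in data.items():
    # items() hands back the stored list itself, so a second lookup adds nothing.
    same_object = nums is data[alph]
    print(alph, same_object)
```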

  • Answer # 3

    First, parallelization becomes easier if you extract only the following part into a function.

            max_nums = []
            for i in my_list:
                max_num = 0
                for j in target_list:
                    result = i * j
                    if result is not None and result > max_num:
                        max_num = result
                max_nums.append(max_num)
            avg[target_alph] = sum(max_nums) / len(max_nums)

    def avg_of_max_nums(my_list, target_list):
        ...  # Write a function that computes max_nums as above and returns its average
        return sum(max_nums) / len(max_nums)

    # Use it like this:
    # avg[target_alph] = avg_of_max_nums(my_list, target_list)
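Filled in, the function might look like the sketch below. It only needs the two lists, not alph or target_alph, and the key comparison stays in the calling loop. The dict comprehension in the caller is an illustrative choice, not something the answer prescribes.

```python
def avg_of_max_nums(my_list, target_list):
    """Average, over my_list, of the largest product with any element of target_list."""
    max_nums = []
    for i in my_list:
        max_num = 0  # as in the question, products are assumed to be non-negative
        for j in target_list:
            result = i * j
            if result > max_num:
                max_num = result
        max_nums.append(max_num)
    return sum(max_nums) / len(max_nums)


data = {
    'A': [1, 3, 5, 2, 1, 8, 9],
    'B': [9, 4, 3],
    'C': [8, 5, 5, 6, 1],
}

output = {}
for alph, my_list in data.items():
    # The keys are only needed here, to skip the pair of a key with itself.
    output[alph] = {
        target_alph: avg_of_max_nums(my_list, target_list)
        for target_alph, target_list in data.items()
        if target_alph != alph
    }
print(output)
```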

    Beyond that, here are two tips for speeding things up.

    In the condition if result is not None and result > max_num:, the result is not None part is always true, so isn't it unnecessary?

    I think computing the average is probably faster with statistics.mean.
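Whether statistics.mean is actually faster than sum(...) / len(...) depends on the Python version and the data, so it is worth measuring before switching. A minimal sketch, using the largest products of 'A' against 'B' from the question as sample input; the variable names are illustrative:

```python
import statistics
import timeit

max_nums = [9, 27, 45, 18, 9, 72, 81]  # largest products of 'A' against 'B'

# Both approaches give the same value for this integer input.
mean_stat = statistics.mean(max_nums)
mean_plain = sum(max_nums) / len(max_nums)

# Time each approach; compare the two numbers on your own machine.
t_stat = timeit.timeit(lambda: statistics.mean(max_nums), number=10000)
t_plain = timeit.timeit(lambda: sum(max_nums) / len(max_nums), number=10000)
print(mean_stat, mean_plain)
print(t_stat, t_plain)
```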

    Finally, for parallelization, you should use concurrent.futures.ProcessPoolExecutor or multiprocessing.pool.Pool.