The dataset is given as a dictionary whose keys are alphabetic symbols and whose values are lists of numbers. The length of the list varies depending on the key.
For each pair of keys, multiply every element of one key's list by every element of the other key's list, keep the largest product for each element, and output the sum of those maxima divided by the length of the list as the value between the two keys.
I want to do this with a large dataset.
The code below works fine for small datasets, but when the number of keys in the dataset is large (e.g. 30,000), the number of comparisons performed for each single key is enormous, and executing the current code takes a very long time.
I want to use parallel processing, but I don't know how to modify the code. I would appreciate some advice.
data = {
    'A': [1, 3, 5, 2, 1, 8, 9],
    'B': [9, 4, 3],
    'C': [8, 5, 5, 6, 1]
}

output = {}
for alph, nums in data.items():
    avg = {}
    my_list = data[alph]
    for target_alph, target_nums in data.items():
        target_list = data[target_alph]
        if alph == target_alph:
            continue
        max_nums = []
        for i in my_list:
            max_num = 0
            for j in target_list:
                result = i * j
                if result is not None and result > max_num:
                    max_num = result
            max_nums.append(max_num)
        avg[target_alph] = sum(max_nums) / len(max_nums)
    output[alph] = avg
print(output)
Output (line break added because the single line is long)
{'A': {'B': 37.285714285714285, 'C': 33.142857142857146},
 'B': {'A': 48.0, 'C': 42.666666666666664},
 'C': {'A': 45.0, 'B': 45.0}}
What I have tried
I tried to cut part of the code out into a function, but I couldn't extract it cleanly and got an error. Since alph and target_alph are still needed inside the function, I don't see how efficiency would improve even if I used a function.
#!/usr/bin/env python
# coding: utf8
data = {
    'A': [1, 3, 5, 2, 1, 8, 9],
    'B': [9, 4, 3],
    'C': [8, 5, 5, 6, 1]
}

output = {}
for alph, nums in data.items():
    avg = {}
    my_list = nums
    for target_alph, target_nums in data.items():
        target_list = target_nums
        """
        if alph == target_alph:
            continue
        max_nums = []
        for i in my_list:
            max_num = 0
            for j in target_list:
                result = i * j
                if result is not None and result > max_num:
                    max_num = result
            max_nums.append(max_num)
        """
        avg[target_alph] = avg_of_max(my_list, target_list)
    output[alph] = avg
print(output)

def avg_of_max_nums(my_list, target_list):
    for alph, nums in data.items():
        for target_alph, target_nums in data.items():
            if alph == target_alph:
                continue
            max_nums = []
            for i in my_list:
                max_num = 0
                for j in target_list:
                    result = i * j
                    if result is not None and result > max_num:
                        max_num = result
                max_nums.append(max_num)
    return sum(max_nums) / len(max_nums)
Error text
$python sample.py
File "sample.py", line 49
return sum (max_nums)/len (max_nums)
^
IndentationError: unindent does not match any outer indentation level
Supplemental information (FW/tool version etc.)
python3.6


Answer # 2
* This is not a direct answer.
Before considering parallelization or a different algorithm, is the code itself free of waste?

for alph, nums in data.items():
    avg = {}
    my_list = data[alph]

Here nums == my_list, so the extra lookup is redundant.
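Building on that remark, here is a condensed sketch of the same loop that uses the loop variables directly and drops the redundant lookups (the inner loops are folded into a comprehension; the behavior is otherwise unchanged):

```python
data = {
    'A': [1, 3, 5, 2, 1, 8, 9],
    'B': [9, 4, 3],
    'C': [8, 5, 5, 6, 1]
}

output = {}
for alph, nums in data.items():
    avg = {}
    for target_alph, target_nums in data.items():
        if alph == target_alph:
            continue
        # nums and target_nums already hold the lists; no data[...] lookup needed
        max_nums = [max(i * j for j in target_nums) for i in nums]
        avg[target_alph] = sum(max_nums) / len(max_nums)
    output[alph] = avg
print(output)
```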

Answer # 3
First, it becomes easier to parallelize if you cut out just this part into a function:

max_nums = []
for i in my_list:
    max_num = 0
    for j in target_list:
        result = i * j
        if result is not None and result > max_num:
            max_num = result
    max_nums.append(max_num)
avg[target_alph] = sum(max_nums) / len(max_nums)

def avg_of_max_nums(my_list, target_list):
    ...  # compute max_nums as above and return their average
    return sum(max_nums) / len(max_nums)

# then call it as:
# avg[target_alph] = avg_of_max_nums(my_list, target_list)
After that, there are two points for speeding things up.

if result is not None and result > max_num:

The `result is not None` part is always true (the product of two numbers is never None), so isn't it unnecessary? The average calculation is probably also faster with statistics.mean. Finally, for parallelization, you should use concurrent.futures.ProcessPoolExecutor or multiprocessing.pool.Pool.
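A minimal sketch of that parallelization with concurrent.futures.ProcessPoolExecutor, assuming the kernel has been cut out into a top-level function as suggested (the names avg_of_max_nums and build_output, and the tuple-argument packing, are my own choices; worker count is left at its default):

```python
from concurrent.futures import ProcessPoolExecutor

data = {
    'A': [1, 3, 5, 2, 1, 8, 9],
    'B': [9, 4, 3],
    'C': [8, 5, 5, 6, 1]
}

def avg_of_max_nums(args):
    # Serial kernel: for each element of my_list, keep its largest
    # product with target_list, then average those maxima.
    my_list, target_list = args
    max_nums = [max(i * j for j in target_list) for i in my_list]
    return sum(max_nums) / len(max_nums)

def build_output(data):
    pairs = [(a, b) for a in data for b in data if a != b]
    # Each task gets a plain (list, list) tuple so the arguments
    # can be pickled and sent to the worker processes.
    with ProcessPoolExecutor() as executor:
        results = executor.map(
            avg_of_max_nums, [(data[a], data[b]) for a, b in pairs])
    output = {}
    for (a, b), value in zip(pairs, results):
        output.setdefault(a, {})[b] = value
    return output

if __name__ == '__main__':
    # The guard is required with process pools on platforms that
    # spawn rather than fork new processes (e.g. Windows).
    print(build_output(data))
```

Note that the worker function must be defined at module level so it is picklable.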
By improving the algorithm and implementing it with numpy, it becomes several orders of magnitude faster for large data.
Since only the largest product is kept, none of the other multiplications are needed:

max([x * y for y in lst])

is equal to

x * max(lst)

(where x, y and the contents of lst are nonnegative). And the value divided by the length of the array is just an average.
In short, it is enough to compute the maximum and the average of each key's list and form their direct (outer) product. This can be written concisely with numpy.
The code is not that elegant, but for now it seems to be faster by a little over two orders of magnitude.
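A minimal numpy sketch of that idea, assuming all values are nonnegative as stated (the variable names means, maxes, and table are my own):

```python
import numpy as np

data = {
    'A': [1, 3, 5, 2, 1, 8, 9],
    'B': [9, 4, 3],
    'C': [8, 5, 5, 6, 1]
}

keys = list(data)
means = np.array([np.mean(data[k]) for k in keys])  # per-key averages
maxes = np.array([np.max(data[k]) for k in keys])   # per-key maxima

# For nonnegative values, mean_i(max_j(a_i * b_j)) == mean(a) * max(b),
# so every pair's result is one entry of the outer product below.
table = np.outer(means, maxes)

output = {
    a: {b: table[i, j] for j, b in enumerate(keys) if j != i}
    for i, a in enumerate(keys)
}
print(output)
```

This replaces the O(len(a) * len(b)) inner loops for every pair with one pass over each list plus a single outer product.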