Find messages in a large python list

My environment is python3. I created a list of stored user messages. The format of the list is similar to [class_user,], which is a class and collects information about the identity of the user. I have a question. This list is very large now, so I get very slow messages. If we don’t use database, what should I do to improve the speed.

A multiprocessing pool could help
Something in the neighborhood of

from multiprocessing import cpu_count, Pool, Queue

def check_user(user, outputQueue):
    if 'nancy' in user.name.lower():
        outputQueue.put(user)
        
allUsers = [user01, user02, user03]
outputQueue = Queue() 
pool = Pool(cpu_count())
for user in pool.starmap(check_user, zip(allUsers, outputQueue)):
    pass

You might process them with the pool rather than just sorting but you do need something atomic like a Queue to collect the results

Do you have to store users in a list? Can you create a more efficient datastructure alongside?

Um… I understand your thoughts and thanks. But your code I can’t run directly, because Queue() is not like an iterator, so I am modifying the code, is this correct?

from multiprocessing import cpu_count, Pool, Queue


class User(object):

  def __init__(self, name):
      self.name = name


def check_user(user_temp):
    if 'bob' in user_temp.name.lower():
        return 200, user_temp
    else:
        return 400, None


allUsers = [User('Alice'), User('Bob'), User('Tony')]
outputQueue = Queue() 
pool = Pool(cpu_count())
for status, user in pool.starmap(check_user, zip(allUsers)):
    if status == 200:
        print(user.__dict__)
        break

I still have a problem: the pool.starmap method that gets the result must wait until all child processes are running, and I can exit the event after receiving the message. I can’t use pool.Queue() to communicate.

Because my data is store in user, and users is a class store in list.
so i think list is more adapt me, if i use directory, maybe i get message is more difficulty. Maybe i can create a tree myself, and use multiprocess to control. how do you think?

Try this
Maybe you don’t need the queue after all

#!/usr/bin/env python

from multiprocessing import cpu_count, Pool


class User(object):
  def __init__(self, name):
      self.name = name


def check_user(user):
    if 'bob' in user.name.lower():
        return user


def sort_all(users):
    pool = Pool(cpu_count())
    for user in pool.imap(check_user, users):
        if user:
            yield user


allUsers = [User('Alice'), User('Bob'), User('Tony')]
found = [user for user in sort_all(allUsers)]


for user in found:
    print(user.__dict__)

That should use as much cpu as possible, bound most likely by I/O
a pool in Python is quite efficient; you can slam all cores to 100% if you are doing enough to need it

2 Likes

Databases typically support indexing. An index is a datastructure, usually some kind of tree, sometimes a hashmap that allows efficient lookups.

You could use in process/in memory sqlite, load the data, and query it.

Alternatively, you could index the data yourself. For example, if you have Alice/Bob/Tony, and you need exact matches, your
index would be {"Alice":user1, "Bob": user2, "Tony": user3}
You could also use suffix trees if you wanted efficient substring matches.

One more thing you could do before having to go and figure out computer science stuff, you could try using attr/attrs/pyattr to define the user class and telling it to use slots to store data in object instances.

How big is your list, and of you happen to know what criteria you want to match on.

How many users do you have in a list? how many lookups per second do you need? And how quickly do you need the result of each lookup?

2 Likes

@khaudio Your method is perfect for me, I understand your thoughts. You are a good friend. Thank you, I look forward to communicating with you.

1 Like

@risk In fact, this list is used in the blockchain. I created a small blockchain with python3. When the number of blocks exceeds 10,000, the search speed is very slow, so I want to find a way to improve the speed.