Hey folks. Here’s a problem I’m currently working on. Basically, I have a dataframe of orders, and I need to track and update the version of each order as it reappears in the data.
Sample data:
id, version
1, 1
2, 1
3, 1
1, 1
3, 1
4, 1
2, 1
2, 1
3, 1
What the transformed data should look like:
id, version
1, 1
2, 1
3, 1
1, 2
3, 2
4, 1
2, 2
2, 3
3, 3
Currently, the data is in a df, and I’m using a dict that maps each id to its latest version.
I loop over each row of the data:
- If the current row’s id is not in the dict, I add the id and its version as a key-value pair.
- If the current row’s id already exists in the dict, I get the value, increment it by 1, and set it back in the dict for future rows; that incremented value is the row’s new version.
Note: I’m not writing the updated version back into the df row by row. Instead, I append each row’s version to a list and, at the end, replace the df’s version column with that list in one go (lists maintain insertion order, so everything lines up).
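In code, the loop looks roughly like this (a simplified sketch, not the actual job; column names match the sample above):

    import pandas as pd

    def assign_versions(df: pd.DataFrame) -> pd.DataFrame:
        seen = {}      # id -> latest version seen so far
        versions = []  # per-row versions, collected in row order
        for order_id in df["id"]:
            if order_id not in seen:
                seen[order_id] = 1   # first occurrence keeps version 1
            else:
                seen[order_id] += 1  # repeat occurrence bumps the version
            versions.append(seen[order_id])
        df["version"] = versions     # replace the column in one shot
        return df

    df = pd.DataFrame({"id": [1, 2, 3, 1, 3, 4, 2, 2, 3],
                       "version": [1] * 9})
    print(assign_versions(df))

Running this on the sample reproduces the expected output above.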
As you might have guessed, I’m using Python. Using tqdm, I measured a throughput of 8,000 to 9,000 rows per second (the data is 3 to 8 million rows). The job runs on a Zen 3-based VM on AWS; on an M1 Mac I get roughly 14k rows per second.
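(For what it’s worth, the rows-per-second figures are just tqdm’s it/s readout from wrapping the loop, roughly like this:)

    import pandas as pd
    from tqdm import tqdm

    df = pd.DataFrame({"id": [1, 2, 3, 1, 3, 4, 2, 2, 3]})
    # One iteration == one row, so tqdm's it/s readout is the
    # rows-per-second figure quoted above.
    for order_id in tqdm(df["id"], total=len(df)):
        pass  # dict lookup/update as in the sketch above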
Any ideas as to how I can improve my processing time?