Hey folks. Here’s a problem I’m currently working on. Basically, I have a dataframe of orders, and I need to track and update the version of each order as it reappears in the data.
Sample data:
id, version
1, 1
2, 1
3, 1
1, 1
3, 1
4, 1
2, 1
2, 1
3, 1
What the transformed data should look like:
id, version
1, 1
2, 1
3, 1
1, 2
3, 2
4, 1
2, 2
2, 3
3, 3
Currently, the data is in a df, and I’m using a dict that maps each id to its latest version.
I loop over each row of the data:
- If the current row’s id is not in the dict, I add the id and its version as a key-value pair.
- If the current row’s id already exists in the dict, I get the value, increment it by 1, and set it back in the dict for future rows; that incremented value is the row’s new version.
Note: I’m not writing the updated version back into the df row by row. Instead, I append each row’s version to a list and, at the end, replace the df’s version column with that list in one go (lists maintain insertion order, so everything lines up).
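In code, the loop looks roughly like this (a simplified sketch, not the actual job; column names match the sample above):

    import pandas as pd

    def assign_versions(df: pd.DataFrame) -> pd.DataFrame:
        seen = {}      # id -> latest version seen so far
        versions = []  # per-row versions, collected in row order
        for order_id in df["id"]:
            if order_id not in seen:
                seen[order_id] = 1   # first occurrence keeps version 1
            else:
                seen[order_id] += 1  # repeat occurrence bumps the version
            versions.append(seen[order_id])
        df["version"] = versions     # replace the column in one shot
        return df

    df = pd.DataFrame({"id": [1, 2, 3, 1, 3, 4, 2, 2, 3],
                       "version": [1] * 9})
    print(assign_versions(df))

Running this on the sample reproduces the expected output above.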
As you might have guessed, I’m using Python. Using tqdm, I measured a throughput of 8,000 to 9,000 rows per second (the data is 3 to 8 million rows). The job runs on a Zen 3-based VM on AWS; on an M1 Mac I get roughly 14k rows per second.
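(For what it’s worth, the rows-per-second figures are just tqdm’s it/s readout from wrapping the loop, roughly like this:)

    import pandas as pd
    from tqdm import tqdm

    df = pd.DataFrame({"id": [1, 2, 3, 1, 3, 4, 2, 2, 3]})
    # One iteration == one row, so tqdm's it/s readout is the
    # rows-per-second figure quoted above.
    for order_id in tqdm(df["id"], total=len(df)):
        pass  # dict lookup/update as in the sketch above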
Any ideas as to how I can improve my processing time?