I’ve lurked on these build logs and mess threads for a while and thought it looked pretty fun. I primarily do programming projects but I also keep up a few servers and will hopefully get a new space for them soon so maybe we’ll dive into that move at some point.
This isn’t a build log, it’s documentation of the slow decay one’s brain experiences due to prolonged exposure to the internet.
i use arch btw
Current Project: Reddit data stuff
I’m always playing with the reddit submission corpus (there are comment dumps and more recent datasets out there too) since it’s a fairly large dataset with a lot of easily thread-able work that produces reasonably interesting results. Mostly I do NLP experiments or build out networks of who posts where.
The current build I’m working on takes any given subreddit, sub A, and builds a network of all the other subreddits that posters in sub A also post to. From that I can draw neat little force graphs showing clusters of subreddits and how strongly they’re ‘connected’, weighted by the number of shared posters.
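The core of it is roughly this shape, if you squint. This is a simplified sketch, not the actual code, and it assumes the pushshift-style line-delimited JSON where each record carries `author` and `subreddit` fields:

```python
import json
from collections import defaultdict, Counter

def load_posters(path):
    """Map subreddit -> set of authors from a newline-delimited JSON dump."""
    posters = defaultdict(set)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            # Skip deleted/missing authors so they don't link everything together.
            if rec.get("author") in (None, "[deleted]"):
                continue
            posters[rec["subreddit"]].add(rec["author"])
    return posters

def related_subs(sub_a, posters):
    """For every other subreddit, count how many of sub_a's posters also post
    there. These counts become the edge weights in the force graph."""
    authors_a = posters[sub_a]
    overlap = Counter()
    for sub_b, authors_b in posters.items():
        if sub_b != sub_a:
            shared = len(authors_a & authors_b)
            if shared:
                overlap[sub_b] = shared
    return overlap
```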
Performance
The test dataset I’m using is 47M entries (65GB). Processing all the submissions and building networks for ~40k subreddits takes about 40min on my laptop.
I haven’t done any cleaning or serialization yet, so right now I’m re-reading all 65GB of JSON every time the thing boots. A full startup takes 75s, which makes testing a pain.
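The obvious cheap fix (which I haven’t done yet) is to serialize the parsed structure once and reload that on later startups instead of re-parsing the raw JSON. A minimal sketch, with a made-up cache filename, pickling whatever the parse step produces:

```python
import os
import pickle

CACHE_PATH = "posters_by_sub.pkl"  # hypothetical cache file

def load_or_build(raw_path, build):
    """Return the subreddit -> author-set map, parsing the raw JSON only the
    first time and reloading the pickled result on subsequent startups."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    posters = build(raw_path)  # e.g. the load_posters sketch above
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(posters, f, protocol=pickle.HIGHEST_PROTOCOL)
    return posters
```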
Most networks can be built in under a second; the larger ones take up to 5 seconds. I’m hoping to get this to the point where it’s realtime-usable behind a webserver (<1s per request, always).
Still a ways to go here.
Is my NVMe drive set up right?
The way I read in all the JSON is dumb and quick and should, in theory, be reading as fast as I can pull data off disk. However my CPU utilization is pretty low and the disk read rate never exceeds 1GB/s. lspci shows the drive connected via an x4 PCIe link. I haven’t taken much time to look into this, but that seems wrong; this drive should do 2GB/s sequential. Perhaps the threaded read pattern is closer to random reads, but the hard 1GB/s ceiling is suspicious to me.
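Before blaming the drive I should probably separate disk throughput from parse throughput. A rough sketch of that, timing a raw sequential read against a read-plus-json.loads pass (the second pass mostly hits the page cache, so it isolates parse cost rather than disk speed):

```python
import json
import time

def measure(path, chunk=1 << 20):
    # Pass 1: raw sequential read, no parsing -- roughly what the disk can give us.
    start, total = time.perf_counter(), 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            total += len(buf)
    read_s = time.perf_counter() - start

    # Pass 2: read + json.loads per line -- what the loader actually pays.
    start = time.perf_counter()
    with open(path) as f:
        for line in f:
            json.loads(line)
    parse_s = time.perf_counter() - start

    gb = total / 1e9
    print(f"raw read:   {gb / read_s:.2f} GB/s")
    print(f"read+parse: {gb / parse_s:.2f} GB/s")
```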
Roaring Bitmaps are neat
In order to do a bunch of the work I need to keep sets of user IDs around so I can compute accurate intersections of the users who post in two different subreddits. Roaring Bitmaps turned out to be a pretty easy way to do this, and they’re quick. They’re great for larger subreddits, where the bitmaps are well filled; smaller subreddits are sparse, which isn’t the best use case for them.
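For the curious, the idea looks roughly like this. The sketch uses the pyroaring bindings (one of several implementations; not necessarily what I’m actually using), and usernames have to be interned to integer IDs first since roaring bitmaps only hold ints:

```python
from pyroaring import BitMap

user_ids = {}  # username -> dense integer id

def uid(name):
    # Intern usernames to small integers so they fit in a roaring bitmap.
    return user_ids.setdefault(name, len(user_ids))

# One bitmap of poster IDs per subreddit (toy data for illustration).
sub_a = BitMap(uid(u) for u in ["alice", "bob", "carol"])
sub_b = BitMap(uid(u) for u in ["bob", "carol", "dave"])

# The intersection is the set of shared posters; its size is the edge weight.
shared = sub_a & sub_b
print(len(shared))  # 2
```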
Current Status
I’m working through the best way to show just the interesting bits. Right now just about every graph includes subreddits like askreddit, pics, wtf, etc. The normie shit. I’m trying to normalize the data so an edge reflects subreddits that a given sub’s posters frequent more than the average user does, rather than just the biggest defaults. This all gets a bit heady and confusing for a project I’m typically only working on before work in the morning.
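One way to formalize that (not necessarily what I’ll land on) is a lift-style score: the share of sub A’s posters who post in sub B, divided by the share of all users who post in B. A sketch, reusing the hypothetical subreddit -> author-set map from earlier:

```python
def lift(sub_a, sub_b, posters, total_users):
    """How over-represented sub_b is among sub_a's posters vs. the average user.
    >1 means sub_a's posters go to sub_b more than the baseline; ~1 means
    sub_b is just a big default sub (askreddit, pics, ...)."""
    authors_a = posters[sub_a]
    authors_b = posters[sub_b]
    p_b_given_a = len(authors_a & authors_b) / len(authors_a)
    p_b = len(authors_b) / total_users
    return p_b_given_a / p_b
```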