Consolepunk's slow creep into madness

I’ve lurked on these build logs and mess threads for a while and thought it looked pretty fun. I primarily do programming projects but I also keep up a few servers and will hopefully get a new space for them soon so maybe we’ll dive into that move at some point.

This isn’t a build log, it’s documentation of the slow decay one’s brain experiences due to prolonged exposure to the internet.

i use arch btw

Current Project: Reddit data stuff

I’m always playing with the reddit submission corpus (there are comments and more recent datasets out there) as it’s a fairly large dataset to play around with and has a lot of thread-able work that can be done to it for reasonably interesting results. Primarily I play around with NLP or build out networks of who posts where.

The current build I’m working on takes any given subreddit, sub A, and builds a network of all the other subreddits that posters in sub A also post to. With that I can build neat little force graphs depicting clusters of subreddits and how they are more or less ‘connected’ by the number of users that post in each sub.
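For anyone who wants the idea in code, here is a minimal sketch of that kind of network build. It is not the actual implementation: the `(author, subreddit)` pair format and the `build_network` name are assumptions made up for illustration, and plain Python sets are used throughout.

```python
from collections import defaultdict


def build_network(submissions, target):
    """Return {other_subreddit: shared_user_count} for one target subreddit.

    `submissions` is assumed to be an iterable of (author, subreddit) pairs;
    that layout is hypothetical, not the real corpus format.
    """
    # Collect the set of distinct posters for every subreddit.
    users_by_sub = defaultdict(set)
    for author, subreddit in submissions:
        users_by_sub[subreddit].add(author)

    target_users = users_by_sub.get(target, set())

    # An edge's weight is the number of users who post in both subreddits.
    edges = {}
    for subreddit, users in users_by_sub.items():
        if subreddit == target:
            continue
        shared = len(target_users & users)
        if shared:
            edges[subreddit] = shared
    return edges
```

The resulting dictionary is the edge list for one node of the force graph: each key is a neighbouring subreddit and each value is the edge weight.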

Performance

The test dataset I’m using is 47M entries (65GB), and I can process all the submissions, resulting in networks for ~40k subreddits, in about 40min on my laptop.

I haven’t done any cleaning or serialization yet, so I’m reading in all 65GB of JSON every time I boot right now. Full startup takes 75s, which makes it a pain for testing.

Most networks can be built in under a second, larger networks take up to 5 seconds. I’m hoping to get this to a point where it would be realtime usable like a webserver (<1s per request, always).

Still a ways to go here.

Is my NVME drive set up right?

The way I read in all the JSON is dumb and quick and should theoretically be reading as fast as I can pull data off disk. However, my CPU utilization is pretty low and the disk usage rate never exceeds 1GB/s. lspci shows it connected via an x4 PCIe link. I haven’t taken much time to look into this but that seems wrong; this drive should do 2GB/s sequential. Perhaps the threaded read is closer to random reads, but the hard 1GB/s limit is suspicious to me.
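As a rough sanity check on those numbers, here is the arithmetic as a small sketch. It assumes the 75s startup is dominated by reading the 65GB of JSON, which may not be exactly true; the function names are made up for illustration.

```python
def throughput_gb_per_s(size_gb, seconds):
    """Average read rate in gigabytes per second."""
    return size_gb / seconds


def gbit_to_gbyte(gigabits):
    """Convert gigabits to gigabytes (8 bits per byte)."""
    return gigabits / 8


# Figures from the post: 65GB read during a 75s startup.
startup_rate = throughput_gb_per_s(65, 75)

# For comparison, what one gigabit per second would be in gigabytes per second.
one_gigabit_rate = gbit_to_gbyte(1)
```

`startup_rate` comes out a bit under 0.9 gigabytes per second, while one gigabit per second is only 0.125 gigabytes per second. So the observed ceiling and the expected sequential figure only line up with the startup time when both are read as gigabytes per second.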

Roaring Bitmaps are neat

In order to do a bunch of the work I need to keep sets of user IDs so I can get accurate intersections of the users who post in two different subreddits. I found Roaring Bitmaps to be a pretty easy way to do this and it’s pretty quick. For larger subreddits it is great as the bitmaps are pretty well filled but for smaller subreddits the data is sparse which isn’t the best use case.
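To show the basic idea of a set of user IDs stored as a bitmap, here is a sketch that uses a plain Python integer as an uncompressed bitmap. This is not Roaring and not the library API; it only illustrates intersection as a bitwise AND, and it assumes user IDs are small non-negative integers.

```python
def to_bitmap(user_ids):
    """Pack non-negative integer user IDs into one int, one bit per ID."""
    bits = 0
    for uid in user_ids:
        bits |= 1 << uid
    return bits


def cardinality(bitmap):
    """Count how many IDs are present (number of set bits)."""
    return bin(bitmap).count("1")


def shared_users(bitmap_a, bitmap_b):
    """Number of users present in both bitmaps."""
    return cardinality(bitmap_a & bitmap_b)
```

A plain bitmap like this pays for the largest ID it holds even when only a few bits are set, which is the sparse case mentioned above. Roaring compresses its storage, so treat this as the underlying idea only.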

Current Status

I’m working through the best way to show just the interesting bits. Currently just about every graph is going to include subreddits like askreddit, pics, wtf, etc. The normy shit. I’m currently trying to normalize the data so that it reflects the users who post in a given subreddit more than it reflects the average set of users. This is all a bit heady and confusing for a project I’m typically only working on before work in the morning.
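One possible way to do that kind of normalization is a "lift" score: compare how common sub B is among sub A's posters with how common it is among all users. This is only a sketch of one option, not necessarily what the project ends up using, and the function name and parameters are hypothetical.

```python
def lift(overlap, users_in_a, users_in_b, total_users):
    """Ratio of sub B's share among sub A's posters to its share overall.

    Around 1 means B is no more popular with A's posters than with
    everyone else; well above 1 means A's posters favour B.
    """
    if users_in_a == 0 or users_in_b == 0 or total_users == 0:
        return 0.0
    share_among_a = overlap / users_in_a      # fraction of A's posters also in B
    share_overall = users_in_b / total_users  # fraction of all users in B
    return share_among_a / share_overall
```

With made-up counts (1000 users total, 100 of them in sub A), a catch-all sub with 800 users and 80 shared posters scores about 1, while a niche sub with 50 users and 40 shared posters scores about 8. Filtering or weighting edges by a score like this would push the general subs toward the background.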


I took the time to write the roaring bitmaps to disk so I can pull them back in rather than read all 65GB of json every time. Now startup is ~15s and testing is much better.

The output of the graph data is pretty good at this point, I’m now playing with various ways to build and filter the graph such that useful information can be derived from it.

Preview of what a graph looks like for the subreddit ProGun:


We can see the very general subreddits (gifs, pics, aww, etc) in one cluster and the gun-related subs in another. This kind of clustering is what I am after and it’s starting to take shape.

Several big projects more important than my typical tinkerings here have come up, some of them relatively interesting.

The most recent has been an ethernet run. I needed to make a run from the basement to the attic (2 levels in between), preferably without making holes in any walls. I managed it but it nearly killed me. I found a void in one of my walls which had access to the inside of an exterior wall where a sewer gas vent pipe runs the full height of the house. There are small cutouts at each floor to allow that pipe and some other plumbing through.

I was able to run fish tape from the attic, through the void, along the vent pipe and down to the first floor. I could not really manipulate the fish tape once it got through to the first floor, so I went to the basement and used an endoscope-style camera (amazon link) with a little hook attachment to find it up through a hole in the floor and bring it down. A hell of a lot easier said than done, but that little camera made it possible.

The house is old and has lath and plaster; cutting that out is fairly dangerous as you risk cracking the wall. Using an oscillating multi-tool (amazon link) with a grout removal blade got through the plaster pretty easily, then I used a wood blade for the lath. You do have to be pretty sure of your measurements: if you don’t remove enough lath, it’s practically impossible to just take another few millimeters off, as it’s then only attached to the stud by a single nail and just shakes.

The wall I wanted to run down from the attic was an exterior wall at one point and had a fire break (horizontal stud). Real pain in the ass. I ended up running down an interior wall and poking all the way through to the keystone on the other side.

Lessons learned I guess.
