Genetic Research -- Who's got some spare cores?

Hello all!

I've got a project likely coming up where I need to parse through more than 36TB of data. In particular, I will need to perform a cluster analysis for transcription factor prediction based on gene microarrays in the Gene Expression Omnibus (GEO). The analysis will use a new version of TF-Cluster, with a yet-to-be-written wrapper.

Source here:
Paper here:

The easy way to help: comment below, so I can follow up when the program is all done and I just need cycles or feedback on the documentation. The hard way to help: tracking down a bizarre memory leak of 101 vertexes in the graph, creating a set of scripts to deploy and collect jobs, and adding support for the file formats used by the GEO. I'm also open to suggestions on how to do things better in general.

Right now, if you want to play around with TF-Cluster, it requires git, libc, and make. Documentation should be mostly up to date, but there are certainly places where it is lacking -- I'm working on it.

If anything is unclear, I should be able to explain everything related to the project.


Commenting for the easy part. Got about halfway through the paper, will finish it later. It's an interesting hypothesis, and it seems like kind of the brute force approach. Especially interested in the ipscs.

Believe me, it's not brute force. What do you mean by "ipscs"?

I have no idea what I just read.


You know the science that keeps getting tossed around in the lounge? Well, this is that, but unfiltered.

I can't keep up with the lounge so I never go in there.

Whatever it was, it sounded interesting. Sadly I'm just a lowly security analyst and not smart enough to comprehend 1/4 of that.

You could suggest comments and formatting changes in the code, though!

Don't know much about bioinformatics so I'll take your word for it. Basically cataloguing all the genes shared by the TFs, and charting their connectivity? Seems like "an immense task" might be better phrasing.

IPSCs = induced pluripotent stem cells. I worked with a graduate student who was trying to figure out a way to model Prader-Willi syndrome with IPSCs, but she ran into lots of issues with the stem cells. Some of the genes were no longer functioning properly, particularly ones that coded for one of the ligases of interest to the disease. I have a nutrition background so I wasn't too heavily involved (mainly just as a reference for the enzyme kinetics), but when I left, the issue they were stuck on was that the DNA wasn't methylated, so they weren't sure why it wasn't working properly and were blaming transcription factors.

We're finding how the genes correlate in their expression. Using this, together with a list of the expected TFs (I'll be making that more flexible, but that's another thing), similarly expressed genes are grouped into clusters that are expected to be governed by the same or similar TFs. The algorithm is O(NM) ~ O(N^2).
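To give a rough feel for where an O(NM) cost comes from, here is a minimal Python sketch. It is not the TF-Cluster code (which is C++), and it makes several assumptions: that N is the number of genes and M the number of TFs, that plain Pearson correlation stands in for whatever statistic the program actually uses, and that the names `pearson`, `correlate_tfs_with_genes`, and the toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # flat profile: treat as uncorrelated (an assumption)
    return cov / math.sqrt(var_x * var_y)

def correlate_tfs_with_genes(tf_profiles, gene_profiles):
    """Return {tf: {gene: r}}, calling pearson() once per (TF, gene) pair."""
    return {
        tf: {gene: pearson(tf_values, gene_values)
             for gene, gene_values in gene_profiles.items()}
        for tf, tf_values in tf_profiles.items()
    }

# Hypothetical toy data: expression levels across 4 samples.
tf_profiles = {"TF_A": [1.0, 2.0, 3.0, 4.0]}
gene_profiles = {
    "gene_up": [2.0, 4.0, 6.0, 8.0],
    "gene_down": [4.0, 3.0, 2.0, 1.0],
}
table = correlate_tfs_with_genes(tf_profiles, gene_profiles)
```

The nested loop makes one correlation call per (TF, gene) pair, so the number of calls is N times M. If the TF list grows toward the size of the gene list, that approaches N^2. Each call also costs time proportional to the number of samples, which this sketch does not try to optimize.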

Yeah, that's a transcription issue. Stem cells would work if the problem was the gene, so that means they should have been looking at TFs, but that's a pretty big task. Methylation does a lot of funny things, but those are for the lab techs to deal with :P


@jajone4 idk if this is your wheelhouse or not

How do you validate this code so that you know it does what you think it does?

I've been comparing against the original Perl version and testing by hand. The original program was verified by comparing results against known lab-collected data. The technology and resources do not exist at this time to perform a thorough evaluation, which is why transcription factors are a hot area of research. With the full GEO clustered, it will be easier and more accessible for other researchers and lab technicians to evaluate program validity.

It's the best anyone has right now.
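For anyone who wants to help with that kind of checking, the comparison against the original Perl version could be automated roughly like this. This is only a sketch in Python: it assumes both programs' results have already been loaded into dictionaries mapping a label to a score, and the function name `compare_results`, the labels, the scores, and the tolerances are all made up for illustration.

```python
import math

def compare_results(reference, candidate, rel_tol=1e-6, abs_tol=1e-9):
    """Return a list of readable mismatches between two {label: score} maps."""
    problems = []
    for key in sorted(set(reference) | set(candidate)):
        if key not in candidate:
            problems.append(f"{key}: missing from candidate")
        elif key not in reference:
            problems.append(f"{key}: missing from reference")
        elif not math.isclose(reference[key], candidate[key],
                              rel_tol=rel_tol, abs_tol=abs_tol):
            problems.append(f"{key}: {reference[key]} != {candidate[key]}")
    return problems

# Hypothetical scores, standing in for output parsed from each program.
perl_scores = {"TF_A/gene_1": 0.91, "TF_A/gene_2": -0.40}
port_scores = {"TF_A/gene_1": 0.9100000001, "TF_A/gene_2": -0.35}
mismatches = compare_results(perl_scores, port_scores)
```

Comparing with a tolerance rather than exact equality matters here because two implementations in different languages will rarely agree to the last floating-point digit. An empty list means the two runs agree within the chosen tolerance.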

Oh interesting. So are you adding anything new or just mainly porting it from Perl to C++? By setting this up and running it on my PC, does that help you debug at all?

I've ported it, but also added some functionality, accuracy, speed (5 hours -> ~2 seconds on my workstation), and alternative statistical methods, and I'm adding a different run mode soon. At some point, I'd like to add file support for the data stored in the GEO, but there's quite a lot going on right now. I'm having to defend my work against the prof's entire grad student team, teach them what is going on, and so on.

Right now, you can test it, and feedback from setting it up and running it would help me identify weak areas in that process. However, I would be totally blind to any work done by the program on your machine. If I can gather interest on here, I can push to make the distributed scripts and stuff, add file support, and get funds and time to take on the project of computing through the GEO.

cool, I will try and get it running this weekend and let you know.

Thanks! It should be easy, but I need people outside of my box to think that, too :P

I could really use testing from people over the weekend!

This is a very interesting topic. I have a background in both Biology and Computer Science. I will have to remember to look into this post further when I have time. I don't really have a moment yet, but it definitely looks interesting. Pairing transcription factors with cellular pathways is a very important thing which I think has definitely been overlooked in the past because of just how hard it is to do.

Interesting, and if you want tech people to help, maybe a Docker or snap package so it runs isolated. Windows users can always just reinstall...
I might clone a VM and put the code on it to lend a hand... wimpy PC, but hey, it's in the mix.

By cores I thought you meant you wanted us to donate our bodies or something. Slightly disappointed to find out you meant CPU cores.