Code used to produce Covid-19 projections released. (15,000 line C file)

Hondo · May 12, 2020, 7:35am

Slightly baity title but that single fact really made a big impression on me. I’ve been looking through the code they released on GitHub and whilst I don’t know maths and statistics well enough to comment on their models, their code is kind of atrocious. I’m not saying it’s wrong, but I am saying I’m horrified that trillion-dollar decisions have been made on the back of some “how hard can coding be” professor’s hack job. And the version they released is one they spent a month cleaning up - they won’t release the version they actually used, which, as one of the engineers assigned to clean it up let slip, was a single 15,000-line C file.

I couldn’t find this mentioned anywhere else on here - apologies if this is a dupe. I missed the last few shows so maybe they already covered it.

The site below is obviously partisan, but the article seems solid.

My C++ days are behind me but just from a dev lead point of view, this wouldn’t pass muster in my team on testing, commenting and structure alone. I shouldn’t be surprised - academics routinely assume coding is easy and have no involvement by actual professional programmers. But still… trillions of dollars rest on this.

thro · May 12, 2020, 7:55am

Why the fuck would any sane person write this in C in any case?

It can be the most beautifully crafted C project ever designed… It’s still like using a circular saw with the guard removed for personal grooming…

Hondo · May 12, 2020, 8:28am

Yep. One of the things people are saying is “why is this not in Python”? Seems the answer is that this was a single script that was written to and edited for over 15 years. He and his students just kept hacking it around and adding to it as needed. One of the people cleaning it up for public release said something like “to be fair, most of the bugs we found were in unused code”.

PhaseLockedLoop · May 12, 2020, 10:27am

Why the heck does that matter? Any scientist or engineer knows math and statistics don’t care about your programming language. Use what works for you.

thro · May 12, 2020, 10:27am

Even if it was originally written in C, surely inside of 10 years porting it to something fit for purpose would have been a good idea.

PhaseLockedLoop · May 12, 2020, 10:28am

So you are saying C is not fit for fast and optimized mathematical calculations?

thro · May 12, 2020, 10:29am

Because C is prone to subtle bugs, terse, etc.

Using C when you don’t need to is dumb.

thro · May 12, 2020, 10:30am

I’m saying that for scientific modelling where accuracy is important and speed less so - no it is not suitable.

PhaseLockedLoop · May 12, 2020, 10:31am

I’m aware of its thread safety bugs all too well. I have a career that requires me to use it.

I’m saying these people are picking apart his code and saying they are skeptical of the model because of the language. Look, if they are so skeptical, why didn’t they attack it from a different angle? Take the statistical model used and verify it elsewhere. It’s not rocket science when it comes to stats. If you can repeat it, then you have a finding. If you can repeat it a sufficiently large number of times with big sample sizes, you have a good statistic. I don’t understand the point of picking it apart based on its language.

Do you work in the field?

The only issue I see with his code is a scientific one, and as an engineer I have a problem with this:

"
Undocumented equations. Much of the code consists of formulas for which no purpose is given. John Carmack (a legendary video-game programmer) surmised that some of the code might have been automatically translated from FORTRAN some years ago.

For example, on line 510 of SetupModel.cpp there is a loop over all the “places” the simulation knows about. This code appears to be trying to calculate R0 for “places”. Hotels are excluded during this pass, without explanation.

This bit of code highlights an issue Caswell Bligh has discussed in your site’s comments: R0 isn’t a real characteristic of the virus. R0 is both an input to and an output of these models, and is routinely adjusted for different environments and situations. Models that consume their own outputs as inputs is a problem well known to the private sector – it can lead to rapid divergence and incorrect prediction. There’s a discussion of this problem in section 2.2 of the Google paper, “Machine learning: the high interest credit card of technical debt”.
"

For me it’s not about the language but more about him needing to properly document not only his code but his math. It’s bad science to not give every detail so everybody can reproduce and prove it. Settled science LOVES to make these kinds of models and skip over the real science.

thro · May 12, 2020, 10:36am

No, but I’ve seen enough C bugs to last a lifetime.

Did you read the article? People are/were skeptical of the code due to the non-deterministic results given identical seed conditions…
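
To make the "identical seed" point concrete, here is a minimal sketch of what reproducibility is supposed to look like in a stochastic simulation. It is a toy, not the model under discussion: `run_simulation`, its parameters and the infection rule are all made up for illustration. The only point is that a run driven by its own seeded generator should give the same output every time.

```python
import random

def run_simulation(seed, days=30, population=1000, p_infect=0.3):
    """Toy stochastic epidemic; every parameter here is hypothetical."""
    rng = random.Random(seed)  # private generator, no shared global state
    infected = 1
    history = []
    for _ in range(days):
        susceptible = population - infected
        # Chance that a given susceptible person is infected today.
        p_today = p_infect * infected / population
        new_cases = sum(1 for _ in range(susceptible) if rng.random() < p_today)
        infected += new_cases
        history.append(infected)
    return history

# Same seed in, same trajectory out. If this ever fails, something other
# than the seed is influencing the result.
first = run_simulation(seed=42)
second = run_simulation(seed=42)
print(first == second)
```

A check like this is cheap to keep as a regression test; when it fails, the usual suspects are shared random state, uninitialised memory or thread scheduling.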

PhaseLockedLoop · May 12, 2020, 10:38am

I’m not disagreeing that it’s bad math. I read the article. I was just pointing out that arguing about the language is nonsensical… you can have crappy, buggy code in every language. Yes, some are harder than others, but the problem with the results here is bad math, not necessarily all on the code. Ferguson’s work simply isn’t documented and neither is his math. Being skeptical of that is really good. I encourage it. I don’t eat the bullshit I’m told to believe.

Don’t judge a language by bad code examples. Code is elegant and almost all bugs are an operator/author error.

Oroy · May 12, 2020, 10:39am

The best thing about this is that they refused to release it at all until a couple companies had fixed it up and then squashed the commit history in order to obfuscate what the actual original code was like.

thro · May 12, 2020, 10:40am

Sure.

But when you’re a math professor and not an actual professional C coder, then perhaps using something other than C, at least for your prototype is prudent. To make sure your sim actually works.

Re-write the slow bits if required.

You don’t start hacking away in C unless you have to.

Code is written by humans. If humans continually make errors with the language (that lead to run-time behaviour problems) it is a language shortcoming.

Yes, C has its place. But if you don’t need to use it for performance reasons, making the choice to use it from the outset (as this guy did, given it’s a single C file) is dumb.

As a C programmer who thinks it’s always user error - you’ve never written any bugs in C, right? Only people who make mistakes do that…?

PhaseLockedLoop · May 12, 2020, 10:44am

See, this would lead people to think it’s a cover-up or something paranoia-inducing. Honestly it’s probably just bad math… that led to bad science… that led to bad code. That’s literally what I see here.

The article mentions it was possibly translated from Fortran. I actually believe that claim after opening it and taking a look at the reference.

What needs to happen now is everyone needs to go take the model and put it together in other languages etc. They need to take it and see if it holds up. It needs to be done by multiple independent third parties.

Yeah, it is, but I see the article calling out the results… for me it’s not just a code debug but a math “debug”: we need his model. We need it all explained. It can be written in any language at that point. Quite frankly, this looks like a college student’s work… and that’s an even scarier thought.

Without knowing the exact stochastic model that was used, we can’t take averages elsewhere and compare them to real-world results. I agree with a lot of what’s being said, but I don’t really want to blame it on the language as much as on bad science.

We need to call out bad science as engineers, scientists and doctors. We have the brains and requisite knowledge to understand it. It is our responsibility

When a coder can call it out better than the rest of the scientific community… there’s a big problem.

Oroy · May 12, 2020, 10:56am

See this would lead people to thinking its a cover up or something paranoia inducing. Honestly its probably just bad math…

I am perfectly fine viewing their actions as malicious if what they do is leaned on that much for treading on people.

No problem with C in and of itself, but given that it’s 15K LoC this looks more like it was the prof’s artefact that students had to pay tribute to, as opposed to being a project with carefully weighed pros and cons with regard to the model’s validity.

PhaseLockedLoop · May 12, 2020, 11:15am

I hate when science gets political. It needs to be non partisan

Hondo · May 12, 2020, 2:11pm

C is not fit for people who don’t know what they’re doing with it. I think that’s the main point. I think it’s worth reading the actual critique of the code. There are attempts and then abandoned attempts at concurrency in there; egregious misuse of memory… I certainly wouldn’t say you can’t use C for this. But I think it’s reasonable to say that in this day and age an academic would be better off using something more modern rather than C. It’s less about the language, though I agree with thro that’s an issue, and more what it signifies. If anything a C programmer should have an even more horrified reaction to a program being one giant 15,000-line C file than the rest of us! Because a C programmer understands what an amateur will do with the language. I mean, when I’m staring at a code file filled with goto statements, as I am with this, I am concerned.

The articles are well worth reading. It’s not so much saying their results are wrong, but that it’s deeply worrying when trillion dollar decisions are made based on something that would maybe scrape through a basic programming course. I think at least some of us here are familiar with academics who think programming is just a side task that doesn’t really require much skill or, more importantly, proper processes to develop and maintain it.

Because they’re not critiquing the model. The article author (and myself) do not say we’re qualified to dispute his model or mathematics (though some who work in the same field are doing so); we’re concerned about the implementation used. It ought to be for others to examine and review the actual models, but in this case they seem to be somewhat inseparable from the implementation for two reasons: because they got their projections by just running this program over and over and averaging its varying results in a non-repeatable way; and because there’s enough concern about the actual implementation to wonder how it affects the model’s theory. Run this on one type of CPU, get one result. Run it on a different type of CPU, get another. This is worrying stuff and independent of politics, imo.
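
One well-known way results can vary between machines or thread schedules, offered only as an illustration and not as a diagnosis of this particular code base, is that floating-point addition is not associative. Summing the same numbers in a different order can round differently. The sketch below shows the effect and one order-independent alternative; the helper names `naive_total` and `stable_total` are made up for the example.

```python
import math
import random

# Floating-point addition is not associative: grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

def naive_total(values):
    """Plain left-to-right accumulation; the result depends on the order."""
    total = 0.0
    for v in values:
        total += v
    return total

def stable_total(values):
    """math.fsum returns the correctly rounded sum, so order does not matter."""
    return math.fsum(values)

rng = random.Random(0)
samples = [rng.uniform(-1e6, 1e6) for _ in range(10_000)]
shuffled = samples[:]
rng.shuffle(shuffled)

# The naive totals may differ in the last digits between the two orderings.
print(naive_total(samples), naive_total(shuffled))
# The fsum totals agree regardless of ordering.
print(stable_total(samples) == stable_total(shuffled))  # True
```

If per-thread partial sums are combined in whatever order the threads finish, the total can change from run to run even with the same seed, which is one reason a fixed reduction order matters for repeatability.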

And to re-emphasize a key point: they’re refusing to release the original version, only this modified version which outsiders came in and worked on for a month.

DevBlox · May 12, 2020, 2:27pm

There’s a lot of legacy like this in academia. At least this is in C. Often it’s in an obscure language that no one uses anymore or ever used in the first place. I encountered some 15-year-old programs coded in a visual programming tool of some sort (like electronics schematics, to give you an idea) that were still in use for converting experiment output, until recently, when a friend who works with them asked me for some help.

One of them was converting a file from a binary form to CSV. A file of a few gigs would take 10 minutes to process and would crash if it ran out of memory. It loaded everything into memory and matrix-transposed the whole thing, because guess what, the array was populated in the wrong orientation and that was the only one the tool supported. I rewrote the thing in ~60 LoC of Go. A file of a few gigs now churns through in under 20 seconds. Apparently it was a major breakthrough for the lab.
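
For anyone curious what the streaming approach looks like, here is a minimal sketch in Python rather than Go. It is not the rewrite described above: the record layout (three little-endian float64 values per row) and the name `binary_to_csv` are assumptions made purely for illustration, since the real file format isn't given. The idea is simply to read one record, write one row, and never hold the whole file in memory.

```python
import csv
import struct

# Assumed layout (hypothetical): each record is three little-endian float64s.
RECORD = struct.Struct("<3d")

def binary_to_csv(src_path, dst_path):
    """Convert fixed-size binary records to CSV rows, one record at a time."""
    rows = 0
    with open(src_path, "rb") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        while True:
            chunk = src.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file (a trailing partial record is ignored)
            writer.writerow(RECORD.unpack(chunk))
            rows += 1
    return rows
```

Memory use stays constant no matter how big the input is, because only one record is alive at a time. If the data really is stored column-major, the same idea still works with `seek` calls or a memory map instead of a full in-memory transpose.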

My conclusion was that academics can’t write efficient and correct code if their life depended on it. That was not the only occasion. So I do agree with the argument that it should not have been written in C, but rather, perhaps, an academic favorite - Python. Would have saved them from undefined behavior, and outputs being non-deterministic, which I guess, one led to another. You first have to make it correct, then make it fast. It’s not the language’s or anyone else’s fault. The model has to be correct before it is fast. EDIT: it may have been legacy that accidentally found its way out, but they wanted to release it, even though it wasn’t solid.

I doubt there’s any kind of malice there. My competitive experience with students from that college wasn’t the best. They’re quite stuck up, but I guess you get a little of that when being at one of the best colleges, as it’s being touted. I agree with the article’s author that it’s probably embarrassment, rather than anything else, based on the emotional state a stuck up person might have upon failure.

Anyway, this is a highly personal opinion. Politics does find its way into things, so I wouldn’t be surprised there.

thro · May 13, 2020, 12:26am

If people are wanting to analyse the model/algorithm then surely the professor has an actual algorithm/design from before he started hacking away in C? Or some documentation? Because you know… that’s the first thing a proper comp sci major (or programming class from before university) would have been taught (at least, we were back when I started, maybe in the days of web development this has changed).

Write the spec, then write the code… and verify it implements the spec…

I jest, because according to Carmack, a bunch of the bugs fixed were in code that never even got called. Clearly there is no design document, and this poorly written C hack is the documentation.

This project is a shit-show.

I say that as someone who is very much not politically aligned with the clear objective (downplaying covid19) of the site this review is hosted on…

Agreed.

A lot of people who write C really shouldn’t. As one of my comp sci professors told me back in the day - the 90/10 rule. 90% of the time is spent in 10% of the code. Even if you optimise the non-hot 90% of the code to zero, you’ve saved 10% of your run-time. At the cost of how much maintainability? Good job! Get 90% of the result for 10% of the effort (and much better maintainability) instead.
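
The 90/10 arithmetic can be checked with Amdahl's law. This is just a worked example using the round numbers from the paragraph above; the function name `overall_speedup` is made up.

```python
def overall_speedup(time_fraction, local_speedup):
    """Amdahl's law: whole-program speedup when a part that takes
    `time_fraction` of the run-time is made `local_speedup` times faster."""
    remaining = (1.0 - time_fraction) + time_fraction / local_speedup
    return 1.0 / remaining

# The cold 90% of the code accounts for 10% of the run-time.
# Optimising it "to zero" is an infinite local speedup.
cold = overall_speedup(0.10, float("inf"))

# The hot 10% of the code accounts for 90% of the run-time.
# Suppose only that part is made 10x faster.
hot = overall_speedup(0.90, 10.0)

print(f"cold path removed entirely: {cold:.2f}x")  # 1.11x
print(f"hot path 10x faster:        {hot:.2f}x")   # 5.26x
```

So deleting the cold path entirely buys about 11%, while a 10x win on the hot path alone makes the whole program over five times faster.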

Plus, modern libraries are much better and more battle-tested than some random guy’s pet algorithm project.

Sure, I get it, this code is old, but if there was a spec (as drilled into me constantly during my education) both debugging and porting to something more maintainable/safe/performant would be trivial.

Hossam · June 4, 2020, 9:28pm

Not if you know what you’re doing