We are developing the social individualist meta-context for the future. From the very serious to the extremely frivolous... let’s see what is on the mind of the Samizdata people.
Samizdata, derived from Samizdat /n. - a system of clandestine publication of banned literature in the USSR [Russ.,= self-publishing house]
The ‘elite’ should learn to code (better)
“Science is the belief in the ignorance of experts.” (Richard Feynman)
Over a decade ago, Climategate confirmed that Jones, Mann and Briffa knew exactly what they were doing when they scaled the hockey stick to hide the decline while having not a clue about what the decline meant. However, it incidentally revealed the, uh, ‘quality’ of their code.
Neil Ferguson’s extra-lockdown/marital escapade says much about his elite opinion of us common people (and of the lockdown), but meanwhile someone has taken a look at the, uh, ‘quality’ of his code.
Conclusions. All papers based on this code should be retracted immediately. Imperial’s modelling efforts should be reset with a new team that isn’t under Professor Ferguson, and which has a commitment to replicable results with published code from day one.
On a personal level, I’d go further and suggest that all academic epidemiology be defunded. This sort of work is best done by the insurance sector.
Read the review to see what leads to these conclusions. You have to laugh (in order not to cry).
Who Are We? The Samizdata people are a bunch of sinister and heavily armed globalist illuminati who seek to infect the entire world with the values of personal liberty and several property. Amongst our many crimes is a sense of humour and the intermittent use of British spelling.
We are also a varied group made up of social individualists, classical liberals, whigs, libertarians, extropians, futurists, ‘Porcupines’, Karl Popper fetishists, recovering neo-conservatives, crazed Ayn Rand worshipers, over-caffeinated Virginia Postrel devotees, witty Frédéric Bastiat wannabes, cypherpunks, minarchists, kritarchists and wild-eyed anarcho-capitalists from Britain, North America, Australia and Europe.
The populist right continue to blame anyone (the media, scientists, China, etc.) but the government itself for lockdown, because they have put so many of their hopes in the figure of Boris Johnson. As one MP has said, “We can’t blame the PM so we’ve decided to blame the advisers”.
If Boris announces only minor changes to current policy tomorrow I wonder if the anti lockdown right will start to turn their fire on the government. My guess is most will just continue to blame ‘The Liberal Elite’, a usefully nebulous scapegoat if ever there was one.
Let us say, for the sake of argument, that the mathematical death model of Professor Ferguson had been CORRECT – not wildly wrong.
This would still in no way have justified the “lockdown”: there was no evidence that locking up the population would stop the spread of the virus. Vast numbers of people would have died either way (with or without the “lockdown”). And that assumes that the purpose of the “lockdown” was to combat the virus – which it is now clear that it WAS NOT.
The purpose of the “lockdown” was to spread the sort of social control that the Green activist that Professor Ferguson was involved with wanted – but it is not JUST her (or him). It is the entire international establishment – including much of Corporate Big Business.
Some “conspiracy theories” are true – that does not mean they released the virus on purpose (it may have been a genuine accident in their internationally funded lab near Wuhan) – but “NEVER LET A CRISIS GO TO WASTE”.
The “educated” international establishment were eager to use the virus as an excuse for what they have long wanted to do – the destruction of liberty is their aim, the virus is just an excuse for actions they have long wished to undertake anyway.
People such as various “Progressive” Governors in the United States make this very clear “close the Churches – but keep the abortion mills open”.
No medical treatment for cancer or heart disease – but if you want to cut a baby up, there is plenty of hospital space open to you. But why one baby – let us cut up LOTS of babies. And why stop when the baby is born – both the Governor of Virginia and the Governor of New York (Media Darling Andrew Cuomo) say it is fine to kill babies AFTER they are born.
You want to go to Church? How dare you! Men with guns will burst in to prevent religious services (as they did in France – President Macron).
You want booze and drugs? Excellent! We want a degenerate population dependent on us for booze and drugs – have them for free (as they are doing in California) and we will put you up in a hotel – a hotel we will STEAL (as they are also doing in California).
“You wish to be critical of our allies the Communist Party Dictatorship in China? HOW DARE YOU. You RACIST. This is HATE SPEECH” the entire council of once conservative San Antonio (in Texas – not officially yet Mexico) voted to end Freedom of Speech – no doubt that will help treat the Chinese Virus (oh dear I just committed a crime).
It goes on – round the world.
They are releasing criminals (including murderers) from prison whilst, at the same time, disarming the honest citizens. This (both parts) is the policy of the government of Canada – and many other governments (fully supported by the international establishment elite, including much of big business).
The only good thing is that it is all out in the open now.
These are not good people who happen to have another point of view – they are utterly evil people who have taken the virus as an excuse to engage in the destruction of liberty they have long longed to do.
One cannot “reason with” a vicious individual such as the Governor of Michigan – one can only defeat them, or be destroyed by them.
One needs to take the “new normal” threatened (as an eternal destruction of liberty) by the Prime Minister of the Republic of Ireland, and the rest of the international establishment elite, and shove their “new normal” down their throats.
rosenquist makes a good point – and I have myself made excuses for Prime Minister Johnson and the rest of the elected politicians in this country, and I APOLOGISE FOR THAT.
Yes it is very difficult to say NO to the “experts” and “public servants” – but the elected government does have the power to say NO.
If the Prime Minister in his address on Sunday does not give a clear and early date for the end of “lockdown”, then what remains of his credibility will be gone.
As for the United States – only four (4) members of the House of Representatives voted against the bailout orgy that will, de facto, bankrupt the United States.
No other members of the House of Representatives (other than these four) deserve any respect – certainly not the “libertarian” Justin Amash who was given the chance to vote NO and did not do so.
The British House of Commons?
Looking past their tears to their actual voting behaviour – none of them (none of them – of any party) have behaved well.
The idea that the world economy was shut down due to simulations run on code that hasn’t been, and can’t be, regression tested is really something.
I mean the model isn’t even wrong. It’s just garbage.
“Read the review to see what leads to these conclusions. You have to laugh”
I did. But more at the review than the code. If this is the worst issue they can find, then I’d say it’s not a problem.
Of course, we’ve only just started. I appreciate the link to the code – I’ll definitely have a look later on – and I agree this is how science ought to work. And I might point out, this is hugely more impressive than the climate scientists, who were still resisting releasing their code decades later.
Monte Carlo methods work this way by design. You’ve got a hugely complicated statistical model that you can’t find the distribution for analytically. So what you do is you pick random input values covering the range of uncertainty, run the model on each, and you get a random spread of outputs that shows you the range of possible outputs fitting with the input. It’s like testing the accuracy of a rifle by picking it up and firing lots of shots at the target. The way you pick it up and hold it is slightly random, the wind and temperature and density of the air are random, the amount of propellant in each bullet and the weight and shape of each bullet is random, and so the spread of holes in the target is random. If you can get them all within a handspan of one another, that’s pretty accurate.
It’s a nice property to be able to replicate the test exactly – same handling, same air currents, same bullets. It makes it easier to identify problems. But nobody would say that rifle accuracy trials were useless because they could not be exactly replicated. What’s important is that you can replicate the size and shape of the spread, not the exact location of each hole. And this code review doesn’t even mention that.
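The rifle analogy can be sketched in a few lines of Python. This is only an illustration: the `fire_shot` model and all of its noise parameters are invented here and have nothing to do with the Imperial code. Two trials with different seeds put every individual hole in a different place, yet the size of the spread comes out much the same.

```python
import random
import statistics

def fire_shot(rng):
    """One 'shot' from a hypothetical toy model with uncertain inputs."""
    grip = rng.gauss(0.0, 1.0)         # how the rifle happens to be held
    wind = rng.gauss(0.0, 2.0)         # crosswind at the moment of firing
    charge = rng.uniform(0.95, 1.05)   # propellant varies from round to round
    return (grip + wind) / charge      # horizontal miss, arbitrary units

def run_trial(seed, shots=20_000):
    """Fire many shots and summarise where the holes landed."""
    rng = random.Random(seed)
    holes = [fire_shot(rng) for _ in range(shots)]
    return statistics.mean(holes), statistics.stdev(holes)

mean_a, spread_a = run_trial(seed=1)
mean_b, spread_b = run_trial(seed=2)
print(f"trial A: mean={mean_a:+.3f} spread={spread_a:.3f}")
print(f"trial B: mean={mean_b:+.3f} spread={spread_b:.3f}")
```

What one would check between trials is that the summary statistics agree to within sampling error, not that the individual shots match.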
But we’ll see. I await further developments with interest.
Niv,
Apparently the sim code does not reproduce output even in a single thread environment.
Assuming there are no hw interrupt or synch issues, which there should not be, then in my experience the most likely problem is failure to initialize some part of memory correctly. This can cause “random” results depending on the state of the hw at the start of each run. There are other possibilities.
Back when I was doing this stuff for pay, I’d usually resort to a complete power cycle of the system in an attempt to reproduce such behavior. But it’s definitely a bug.
And yes, I do know how rng’s and seeds work.
The non-portability to other computer systems is indicative of a related class of poor coding practices, which can also cause non-reproducibility.
What regression testing is there for the human brain?
This especially when the code examined is not the original code? That requires pregression testing.
Gross QA policy failure here. Pot calling the kettle black.
It’s not that that ‘expert’ is not wrong, but that the code analysis ‘evidence’ is junk.
Keep safe and best regards
“Assuming there are no hw interrupt or synch issues, which there should not be, then in my experience the most likely problem is failure to initialize some part of memory correctly.”
The most likely problem, I suspect, is that they’re not caching the state of the random number generator along with the tables of results.
So the first run initialises the random number generator, reads out a few thousand random numbers to generate the tables, caches them to a file, then carries on with more random numbers being used to generate the final results.
The second run initialises the random number generator, loads the intermediate results tables from the file, then carries on reading out more random numbers to get the final results. But because these random numbers are now coming from the start of the sequence, not a few thousand values in, the results are different.
If you want exactly replicable results when caching intermediate values, you also have to cache all the random number generator states, or reset them again to new starting seeds, or use different sub-streams before and after the cache step.
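A minimal Python sketch of that suspected failure and one of the fixes. The two stage functions are hypothetical stand-ins, not taken from the Imperial code, and nobody in this thread has confirmed that this is the actual bug. It relies only on the standard library's `random.Random.getstate()` and `setstate()`.

```python
import random

def stage_one(rng):
    """Slow first stage: consumes 1000 random numbers."""
    return sum(rng.random() for _ in range(1000))

def stage_two(rng, intermediate):
    """Fast second stage: consumes 10 more random numbers."""
    return intermediate + sum(rng.random() for _ in range(10))

SEED = 12345

# Run 1: do everything, caching the intermediate result AND the generator state.
rng = random.Random(SEED)
cached_result = stage_one(rng)
cached_state = rng.getstate()
full_run = stage_two(rng, cached_result)

# Run 2 (naive): reseed, load the cached result, skip stage one.
# Stage two now draws from the START of the sequence.
naive_rerun = stage_two(random.Random(SEED), cached_result)

# Run 2 (fixed): restore the generator to where run 1 left off.
rng = random.Random()
rng.setstate(cached_state)
fixed_rerun = stage_two(rng, cached_result)

print("naive rerun matches:", full_run == naive_rerun)
print("fixed rerun matches:", full_run == fixed_rerun)
```

In real use the state would be written to the same file as the cached tables, so that the two can never get out of step.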
I don’t know what the problem actually was, but at first glance at the description of what they did, that’s where I’d start investigating.
Oh, and Fred the Fourth + a million.
And maybe NiV too – though without evidence from the original code and dynamic examination, how would one know?
The person who wrote that doesn’t seem to grasp what the review explained. Well designed programs using Monte Carlo methods use pseudo-random number generators that take a “seed” value to initiate them. Given the same seed, the pseudo-random number generator will produce the same sequence of “random” numbers. That allows a Monte Carlo simulation to *exactly* reproduce the same result each time it is run using the same seed. If it doesn’t, that means there is a bug. The Imperial model code does *not* reproduce the same results each time and is therefore buggy.
If someone wrote a program to solve a mathematical equation that has a single result, it should produce the same answer each time. If it produced the result 40 once and then 400 the next time it ran: people would consider it buggy and have no reason to know whether *any* of the results it produced were accurate. Instead, the approach used by the Imperial College model folks would be to average the 40 and 400 results and come up with 220 and say that is the answer (or whatever other average result from more runs): when there is no reason to know for sure what the nature of the bug was and whether the result was anywhere in that range at all.
A stochastic simulation should be able to be run more than once to produce the exact same results: it merely internally runs multiple simulations with different random numbers started from the same seed, or is run multiple times given different seeds. Ideally you’d have more than one person/team produce a simulation model using the same specs: and then test them running against each other.
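As a sketch of what “same seed, same result” looks like in practice, here is a toy branching-process outbreak in Python. The function name and every parameter are made up for illustration; it is not the Imperial model.

```python
import random

def simulate_outbreak(seed, days=20, contacts=4, p_transmit=0.3):
    """Toy outbreak: each case meets `contacts` people per day and infects
    each of them with probability `p_transmit`. Returns total cases."""
    rng = random.Random(seed)          # all randomness flows from this seed
    infected, total = 1, 1
    for _ in range(days):
        new_cases = sum(
            1 for _ in range(infected * contacts) if rng.random() < p_transmit
        )
        total += new_cases
        infected = new_cases
    return total

# Same seed: identical output, which is what a regression test relies on.
print(simulate_outbreak(42), simulate_outbreak(42))

# Different seeds: a spread of outcomes, which is the actual scientific result.
print(sorted(simulate_outbreak(seed) for seed in range(10)))
```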
Even aside from the problems with the code itself: there are problems with the underlying nature of the model and the assumptions it uses. This is the same modeling group that in the past produced results that proved to be wildly out of touch with the reality of what happened: and yet they don’t seem to have ever gone back to understand why and updated their methods, since they continued to produce predictions that were absurdly high compared to what wound up happening.
Most epidemiological models are too simplistic since they assume the same constant R, the number of people a virus infects, for the whole populace. I haven’t examined this one, but I suspect it does the same thing, whereas in many cases there are “superspreaders” that are responsible for a disproportionate number of infections. In this case I’ve seen at least one study indicating that may be true of this virus, though I haven’t checked for critiques or confirmation:
Epidemiology and Transmission of COVID-19 in Shenzhen China: Analysis of 391 cases and 1,286 of their close contacts
Most models just use a static average R but neglect to take into account the impact of these superspreaders on how R changes over time. A top level analysis suggests this should undermine the typically projected exponential spread. If superspreaders come into contact with many people to spread the disease, they may also be more likely to come into contact with other superspreaders earlier than the rest of the public and therefore get the disease earlier. As superspreaders become immune, that would lower the average R and therefore lower the level required for herd immunity. Picture the case where all superspreaders were immune, the average R would drop drastically.
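The argument can be made concrete with a toy calculation in Python. It assumes “proportionate mixing”: a person with contact rate c is both c times as likely to catch the virus and c times as likely to pass it on, so each group contributes to R in proportion to c squared. The two groups, their sizes and their contact rates are invented for illustration and are not drawn from the studies cited here.

```python
def effective_r(groups, scale):
    """groups: list of (population_fraction, contact_rate, susceptible_fraction).
    Returns R under the proportionate-mixing assumption described above."""
    weighted_spread = sum(frac * rate * rate * sus for frac, rate, sus in groups)
    total_contacts = sum(frac * rate for frac, rate, _ in groups)
    return scale * weighted_spread / total_contacts

ORDINARY = (0.9, 1.0)    # 90% of people, baseline contact rate (assumed)
SUPER = (0.1, 10.0)      # 10% of people, ten times the contacts (assumed)

def population(ordinary_susceptible, super_susceptible):
    return [ORDINARY + (ordinary_susceptible,), SUPER + (super_susceptible,)]

# Calibrate so that the fully susceptible population has R = 3.
scale = 3.0 / effective_r(population(1.0, 1.0), 1.0)

r_start = effective_r(population(1.0, 1.0), scale)
r_uniform = effective_r(population(0.9, 0.9), scale)   # 10% immune, spread evenly
r_super = effective_r(population(1.0, 0.0), scale)     # same 10%, all superspreaders

print(f"everyone susceptible:      R = {r_start:.2f}")
print(f"10% immune, evenly spread: R = {r_uniform:.2f}")
print(f"all superspreaders immune: R = {r_super:.2f}")
```

In this toy setup the same number of immune people has a very different effect depending on who they are, which is the point being made about a static average R.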
At least one study has attempted to take into account the variance of susceptibility to the virus which also varies, though I haven’t delved into it to see how credible it is or looked for critiques:
Individual variation in susceptibility or exposure to SARS-CoV-2 lowers the herd immunity threshold.
PS, here is a page from the Santa Fe Institute on the impact of a populace with superspreaders where R varies:
Transmission T-024: Cristopher Moore on the heavy tail of outbreaks
“If someone wrote a program to solve a mathematical equation that has a single result, it should produce the same answer each time. If it produced the result 40 once and then 400 the next time it ran: people would consider it buggy and have no reason to know whether *any* of the results it produced were accurate.”
The solution to the equation in this case is a statistical distribution. The ‘answer’ is that it should output a number between 10 and 500, say, with a particular complicated, non-analytical distribution. Run it 1000 times, it comes up with various numbers between 10 and 500. If you tweak a line that shouldn’t change the output and run it again and it suddenly comes up with 50,000, that’s a problem. But if the right answer is “between 10 and 500”, then both 40 and 400 are perfectly satisfactory.
Writing simulations that can be made deterministic by setting the seed is useful for debugging, and comparing differing implementations, and in particular for producing automated regression tests that can be run without the application of any intelligence, but it’s actually a bad idea to do it when doing the science. If you always use the same sequence, then you’re not exploring the full range of variation that genuinely random inputs will give you. If the particular bit of sequence you pick happens to miss a problem case that would trigger a failure, and you keep on using the same set of random numbers, you’ll never see it. A better approach for the science is to generate new seeds each time you run, but make a record of them so if such a problem does show up, you can debug what happened.
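A sketch of that “fresh seed every run, but write it down” approach in Python. `run_model` is a hypothetical placeholder for a real simulation; the seed comes from the standard library’s `secrets.randbits`.

```python
import random
import secrets

def run_model(seed):
    """Hypothetical placeholder for an expensive stochastic simulation."""
    rng = random.Random(seed)
    return sum(rng.gauss(0.0, 1.0) for _ in range(1000))

def run_with_logged_seed(log):
    """Draw a genuinely fresh seed, run the model, and record both."""
    seed = secrets.randbits(64)
    result = run_model(seed)
    log.append((seed, result))
    return result

run_log = []
for _ in range(5):
    run_with_logged_seed(run_log)

# Later: something looked odd in the third run, so replay it exactly.
seed, original = run_log[2]
replayed = run_model(seed)
print("replay matches original:", replayed == original)
```

Every run explores a new part of the random sequence, yet any single run can still be reproduced for debugging.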
I agree absolutely that it’s good practice and very useful if the simulation has this property, but it’s also very hard to do, takes a lot of extra time and effort to get right, and is not as critical to the correctness of the scientific results as the article might give the impression is the case.
And yes, using a fixed R0 for the entire population and using average values for transmission rates misses rare events out in the tails or in particular sub-populations, which is precisely why these sorts of models use Monte Carlo methods to model the population in much finer detail than the SIR-based models everyone else uses.
It’s great that people are looking at it now, but I’ll wait until they find something more interesting before I get excited. As for the practices of the software engineering community, I’d be curious to know why my PC spends time every month downloading software/security patches if commercially engineered software is all bug-free? ‘Microsoft!’ used to be a swearword, round my way. I’ve never come across any large bit of software that didn’t have bugs, despite all the trendy methods they use, and some of the worst were the ones engineered by the strictest, most formal and bureaucratic procedures. (Let’s not even talk about the ‘Waterfall’ model…) What’s more important is whether it’s open to examination and continual testing, and whether they fix problems when they find them. It’s the process that matters.
This is getting to be a surprising habit, but I largely agree with NiV.
In particular, I have great problems with the Imperial team’s modelling. However I have not seen a single criticism of Imperial’s modelling in Sue Denim’s paper.
You have got to distinguish between a model (an attempt at a mathematical idealisation of reality, probably including some randomness) and the code, which is an attempt to implement that model.
It is one of the major fallacies of the 21st century to confuse these two.
Indeed, I would argue that “ensemble modelling”, as perpetrated by climate scientists, is, in part, an example of that fallacy in practice.
Here are a few propositions that I might defend in future comments.
* Whether the ICL model is correct or not, is of no more practical interest than whether Prof. Ferguson’s lover is married or not.
* One thing that IS relevant, is whether the ‘lockdown’ worked. That is up for debate, since in many countries +US states the ‘lockdown’ was imposed at the same time as distancing. How could we possibly know whether distancing alone would have been just as good?
* In many countries +states, people talk about a ‘lockdown’ when in fact all that happened is no more than enforced distancing.
* Almost all of the economic damage would have been achieved anyway, due to people being rationally afraid of catching the virus. Most of the rest of the economic damage could have been avoided by hiring intelligent policepersons.
* Absence of evidence is not evidence of absence. The fact that we do not have a good model is exactly why governments should have closed borders, enforced masks +distancing, and done as many tests as possible **EARLY**. Intelligent governments have done 3 or 4 of the above. Boris did none of the above until it was TOO LATE.
* Paul Marks has an increasingly tenuous grasp of reality; but more firm than e.g. Mr Ecks.
I would suggest that the whole argument is moot anyway as, just as it is impossible to predict the weather more than a few days into the future, the progress of a virus is likewise impossible to predict. There are just too many variables and unknowns to make it work, even if your computer program is perfect and replicable.
Niv,
Cache…
Yup, quite possible.
Way way back at Big Silicon Valley Co. we used the term (from biology and other places) “emergent behavior” to describe the unpredictable behavior of systems with multiple interacting caches.
But we also ran automated regression testing round the clock. Good design is one thing, but so is Defense In Depth.
“Chaos: When the present determines the future, but the approximate present does not approximately determine the future.” ~ Edward Lorenz
Most real world systems are chaotic. Engineering (and modeling) work only if you avoid chaotic systems like the plague, or the virus.
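Lorenz’s definition can be seen in a standard textbook example, the logistic map x → r·x·(1 − x) with r = 4. The Python sketch below is generic and has no connection to any epidemic model: two starting points that differ by one part in ten billion stay close for a few steps and then part company completely.

```python
def logistic_orbit(x0, steps=60, r=4.0):
    """Iterate the logistic map x -> r*x*(1-x) and return the whole orbit."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit

exact = logistic_orbit(0.2)
nudged = logistic_orbit(0.2 + 1e-10)   # the "approximate present"

gaps = [abs(a - b) for a, b in zip(exact, nudged)]
print(f"gap after 5 steps:   {gaps[5]:.2e}")
print(f"largest gap overall: {max(gaps):.2f}")
```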
Niv,
No one uses pseudorandom number generators that way. One calls it with a seed and it returns a number. Subsequent calls without a seed use the previous result as the seed for the next call. So calls to the generator with the same original seed produce an identical sequence of random numbers. No need to cache them, the generator effectively does that for you.
So a second run with the same seed should produce the same final result since every call to the generator should return the same value that it did on the first run. That’s the beauty of using a pseudo random number generator. Its results are reproducible for testing purposes.
Put the whole simulation process in a loop times a thousand. Put the original call with a seed inside the loop, you get 1000 identical results. Put it outside, you get 1000 different results.
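The loop described above, sketched in Python with a stand-in `simulation` function (hypothetical, just a sum of random draws):

```python
import random

def simulation(rng):
    """Stand-in for the whole simulation: consumes 100 random numbers."""
    return sum(rng.random() for _ in range(100))

# Seed INSIDE the loop: every pass restarts the same sequence.
inside = []
for _ in range(1000):
    rng = random.Random(6)
    inside.append(simulation(rng))

# Seed OUTSIDE the loop: each pass continues where the last one stopped.
rng = random.Random(6)
outside = [simulation(rng) for _ in range(1000)]

print("distinct results, seed inside: ", len(set(inside)))
print("distinct results, seed outside:", len(set(outside)))
```

Note that the “outside” list as a whole is still reproducible: run the script again and the same thousand different values come out in the same order.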
If they used the same seed and got a different result, it’s not the random number generator that’s causing it. There’s one or more bugs. As others have said, probably uninitialized storage.
That’s about the code. Whether the model’s algorithm is correct or not is another question entirely. I’m a software designer and haven’t got a clue about how epidemiological models should work. Variables, relationships between variables, etc.
My experience was modeling oil well production on large IBM mainframes.
Forgive the typos; using a finger instead of a mouse is difficult!
BTW, I think the software community is being too hard on him. “One off” software is rarely tested and documented as much as it should be. And repurposing it years later is just asking for trouble. He’s a biological mathematician, not a programmer. He should be judged on the quality of the model’s design, not on its implementation. And that judgment should come from other epidemiologists.
From someone who is mostly just baffled reading this thread:
Am I reading correctly that a “random number generator” is supposed to return the same result time after time when given the same starting input?
How does he fare if you simply judge him on his results?
“No one uses pseudorandom number generators that way. One calls it with a seed and it returns a number. Subsequent calls without a seed use the previous result as the seed for the next call. So calls to the generator with the same original seed produce an identical sequence of random numbers.”
Yes, but that doesn’t work when you’re doing concurrent programming, and it’s not the issue in this case. We’re not talking about running the same code doing the same thing twice getting different results. We’re talking about someone first doing a big, time-consuming computation pulling out the first part of the sequence, saving the results so you don’t have to re-do it every time, then doing a bit more computation with more random numbers, versus someone loading the previously saved intermediate results, then doing just the end bit again that’s now pulling the random numbers from a much earlier part of the sequence.
So the first run is:
Set seed to fixed value.
(R1, R2, R3, R4, …, R1000) producing [intermediate result]
[Intermediate Result] + R1001, R1002, R1003, …, R1010 producing Final result.
The second run is:
Set seed to fixed value.
Load [intermediate result] saving a huge amount of time.
[Intermediate Result] + R1, R2, R3, …, R10 producing Final result.
It’s not a big surprise that if you do that, you get different results.
Handling random number generation for exactly reproducible results is a tricky business. You usually have several stages of processing, each pulling out random numbers. If you want later stages to function the same, even when earlier stages can pull out different numbers of random numbers, you have to have each call to the random number generator done in such a way that it doesn’t affect any of the others. The standard way to do this is to have sub-streams – you effectively have lots of independent random number generators, each with its own seed and state – and every call and every thread uses a different sub-stream. Then even if the code is processed out of order (as happens with concurrent processing) or if you change the functioning of early parts of the process, it doesn’t affect the sequence of numbers generated later on in the processing. But in some places, you might have different paths through the code that are supposed to do the same thing, in which case they have to use the same substream, so you can’t just automatically make every call a different sub-stream. It’s complicated.
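A minimal Python sketch of the sub-stream idea, using only the standard library. The `substream` helper and the stage names are hypothetical. Deriving each generator from a string seed is a simple way to get repeatable, separate streams, but it carries no formal guarantee of statistical independence; purpose-built tools such as NumPy’s `SeedSequence.spawn` exist for that.

```python
import random

def substream(master_seed, name):
    """Hypothetical helper: a separate generator for one named stage."""
    return random.Random(f"{master_seed}/{name}")

def run(master_seed, stage_one_draws):
    rng_one = substream(master_seed, "stage-one")
    rng_two = substream(master_seed, "stage-two")
    # Stage one may consume any number of draws from its own stream...
    intermediate = [rng_one.random() for _ in range(stage_one_draws)]
    # ...without disturbing what stage two sees from its stream.
    final = [rng_two.random() for _ in range(5)]
    return intermediate, final

_, final_long = run(2020, stage_one_draws=1000)
_, final_short = run(2020, stage_one_draws=10)
print("stage two unaffected by stage one:", final_long == final_short)
```

With a single shared generator, changing `stage_one_draws` would shift every number stage two receives.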
If you just use the standard basic set-the-seed/call-repeatedly method, with a single random number generator, it can then be a real pig to upgrade it to concurrent processing when the size of the problem gets bigger and the wait for results gets longer. (and that’s besides all the other nightmares of concurrent processing!) The ideal is to code it that way from the start. Clearly they didn’t, and then they attempted at some point to upgrade it to concurrent deterministic processing but didn’t complete the conversion. I’m not particularly surprised, and I agree it’s less than ideal, but it doesn’t necessarily mean the results are wrong.
Bobby,
If one calls random with seed of 6 one might get 4 as the first result. Subsequent calls without a seed might return 6,9,3,6,4,3,7,8
Call it again with seed 6 the subsequent calls will return the same 6,9,3,6,4,3,7,8
That’s why it is really a pseudo random number generator.
When the number of seed values is 2**32, that usually isn’t a problem.
Call it with seed of 5 one might get 2 as a result. Subsequent calls without a seed might return 9,9,6,2,5,1,7,0
Poorly, I’m afraid.
But not as poorly as the politicians and bureaucrats that decided that the model was “good enough for government work”
“Am I reading correctly that a “random number generator” is supposed to return the same result time after time when given the same starting input?”
Yes. But mainly for testing/debugging purposes.
A pseudorandom number generator is a mathematical function that outputs a sequence of values that ‘look random’. You start with an initial value, called the ‘seed’, and then you transform it using a ‘scrambling’ function to produce the next value, then scramble that to produce the value after, and so on.
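One of the simplest ‘scrambling’ functions is a linear congruential generator, sketched below in Python. The constants are the ones commonly attributed to Numerical Recipes; the class name is made up, and a generator this crude is for illustration only, not for serious simulation work.

```python
class TinyLCG:
    """Linear congruential generator: next = (A * state + C) mod M."""
    A = 1664525
    C = 1013904223
    M = 2**32

    def __init__(self, seed):
        self.state = seed % self.M

    def next(self):
        # Scramble the current value to get the next one in the sequence.
        self.state = (self.A * self.state + self.C) % self.M
        return self.state

gen = TinyLCG(6)
first = [gen.next() for _ in range(5)]

gen = TinyLCG(6)                 # same seed again
second = [gen.next() for _ in range(5)]

gen = TinyLCG(5)                 # different seed
third = [gen.next() for _ in range(5)]

print(first == second)   # same seed, same sequence
print(first == third)    # different seed, different sequence
```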
In ‘normal’ use, you set the seed to some genuinely random value (the current time is commonly used) and then produce a random-looking sequence in the software. It’s an easy/cheap method of simulating randomness on a deterministic computer, but it does have its pitfalls. However, using randomness in an algorithm makes it more difficult to tell if you’ve implemented your algorithm right, because you can’t compare results with anyone else. So there is a way to use it to produce a fixed sequence every time of random-looking values, which makes your code reproducible even while relying on ‘randomness’.
Niv,
But they said they tried it with a single cpu and it still didn’t work.
But I agree that multitasking (multithreading) is a lot more difficult even without having to provide a separate random function for each task.
“But they said they tried it with a single cpu and it still didn’t work.”
Yes, but they weren’t doing the same thing each time. See the first run/second run example above.
I fail to see why your first run second run example wouldn’t produce identical results. The intermediate value is the same and the final processing in both runs uses the same seed. Why would the results differ?
Chris,
Because the first run sucked the first thousand random numbers out of the generator (to generate the intermediate results), and the second didn’t. One is
[Intermediate Result] + R1001, R1002, R1003, …, R1010
and the other is
[Intermediate Result] + R1, R2, R3, …, R10
But if you reset the seed in the second run to same value you used in the first run you will get the same first 1000 pseudo random numbers.
I must be missing something here, but since I can’t imagine why someone would write a program that way and I’m not about to try and figure out by analyzing 15,000 lines of undocumented, uncommented code, I guess I’ll just let it go.
Writing more than a sentence or two on an iPad is tortuous.
Niv,
Oh! I see what you’re saying. That they didn’t set the seed after they got the intermediate result. And use that seed for the second run. A mistake even a junior basic programmer shouldn’t make.
Good grief!
“But if you reset the seed in the second run to same value you used in the first run you will get the same first 1000 pseudo random numbers.”
Yes you would, but in the second run you’re not doing so. You don’t need to, because you don’t need to calculate the intermediate values, because they were already calculated and saved when you did the first run.
If the first stage takes an hour to run, and the second stage a minute, you can work a lot faster this way. You do the first run to generate the intermediate results in an hour, and then you play with setting different parameters in the second stage, each run now only taking a minute. It’s a common way of working with lengthy calculations.
“That they didn’t set the seed after they got the intermediate result. And use that seed for the second run. A mistake even a junior basic programmer shouldn’t make.”
Yes! Exactly! (They should have used a new seed, different for each part.)
“Good grief!”
Indeed!
So they optimized their program for performance on multiple runs.
Sadly for them and the rest of us, optimizing an incorrect program merely means you generate incorrect results more efficiently.
Yay?
And hell’s bells! I would have been fired from every SW job I ever had over 25 years if I tried to pass off code like that as usable product. But hey, no big deal. I was merely doing boring stuff like Kanji font renderers and nuclear cruise missile targeting data systems.
“So they optimized their program for performance on multiple runs. Sadly for them and the rest of us, optimizing an incorrect program merely means you generate incorrect results more efficiently.”
But if you don’t have a requirement for exact replication, it’s not incorrect.
It’s still perfectly usable, and mathematically valid. It’s just harder to debug/validate.
I’ve got a tenner that says that the code was “better” before Microsoft got their hands on it.
Chris in Texas and NiV:
Thank you. I know just enough about RNG’s to be . . . well . . . ignorant.
I can give you a possible explanation of why multiple runs of a sophisticated numerical processing program (even on a single processor) do not give the exact same result. It might or might not explain the lack of repeatability that everyone is worrying about.
The simplest example (from my experience) is a digital signal processing program that has to run in real time. I have seen it in speech coding and in demodulation of sophisticated types of radio signals. The processing takes a variable amount of execution time, but must be completed (as a short-term average) within a fixed time – otherwise the output will not be available for such things as speech output or modem Automatic Retransmission Requests (ARQs). The sophisticated processing involves such things as inversion of matrices that sometimes are near the edge of mathematical stability; this means the answer might take longer to produce and then be rubbish. A typical way of dealing with this is to detect that a timing overrun is close to occurring and abort the computation and substitute the results of the previous signal frame (say a millisecond or less old). As the exact computation time does vary by minuscule amounts, just once in a while, the computation is abandoned in a different place in the computation – which leads to a different result. With more than one place in the algorithms where exact execution time varies (which also makes using elapsed time advantageous over using iteration counts), this can all be made very obscure. It is also true that different numbers of invocations of a pseudo-random number generator (which might be used to initialise sub-parts of other algorithms) can ‘amplify’ the effect and make chaotic the effect of such not very frequent and minor timing differences.
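The mechanism can be sketched in Python. Everything here is a hypothetical stand-in: a Newton square-root iteration plays the part of the expensive computation, and a fake clock with one injected stall plays the part of real timing jitter so that the effect is repeatable. In real code the clock would be something like `time.monotonic`, and the stall would come and go from run to run.

```python
def refine_sqrt(value, clock, deadline, max_iters=50):
    """Iterative computation that gives up if it overruns its deadline."""
    guess = value
    for _ in range(max_iters):
        if clock() > deadline:
            return None                      # timing overrun: abort
        guess = 0.5 * (guess + value / guess)
    return guess

def process(frames, clock, budget):
    """Process each frame; on overrun, substitute the previous frame's result."""
    outputs, previous = [], 0.0
    for frame in frames:
        deadline = clock() + budget
        result = refine_sqrt(frame, clock, deadline)
        if result is None:
            result = previous
        outputs.append(result)
        previous = result
    return outputs

def fake_clock(hiccup_at=None):
    """Deterministic timer: one unit per call, plus a one-off 100-unit
    stall (say, an interrupt) on call number `hiccup_at`."""
    state = {"calls": 0, "now": 0.0}
    def clock():
        state["calls"] += 1
        state["now"] += 1.0
        if state["calls"] == hiccup_at:
            state["now"] += 100.0
        return state["now"]
    return clock

frames = [4.0, 9.0, 16.0]
steady = process(frames, fake_clock(), budget=60.0)
stalled = process(frames, fake_clock(hiccup_at=60), budget=60.0)
print("steady: ", steady)
print("stalled:", stalled)
```

Same input, same code, same arithmetic; the only difference is one stall in the timer, and the second frame’s output changes.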
Such effects might also occur in programs that model, for example, weather systems or other large numerical models. The only requirement for this is that execution time is very high so there is considerable benefit from aborting worst-case computations – this to save overall execution time by using an execution-time dependent algorithm that is an approximation of what is already known to be an approximation. In this case, a 3D cell value might be substituted by an average of adjacent cells, or copied from the previous time step.
For fun, I’ll describe what I think is the most obscure occurrence I have come across (so far) in my life. This involved computer code that was a perfectly correct implementation in a numerically algorithmic way (for the radio demodulator), in the ‘C’ programming language – and the amount of commenting of the code was irrelevant. It involved calling the intrinsic sin(x) function. The problem was that with certain input data (IIRC high signal-to-noise ratio made it more likely), the computer library’s built-in function used a different algorithm for values of ‘x’ near to zero. This added execution time and also generated a synchronous instruction trap whenever a floating point underflow occurred – the synchronous instruction trap did nothing useful (in this case) beyond returning result zero for the single floating point operation, but doing that much more slowly than usual invocations of sin(x) – this leading to timing out and the program aborting itself – for no good reason that the perfectly competent programmer who wrote perfectly correct code for the algorithms could determine – nor his boss, nor his boss’s boss, nor the several co-workers who were consulted. I (with history from the days when computers were even slower and computation time at even more of a premium) wondered why he was calling the sin(x) function rather than using a lookup table (for the 1024 values of x that would ever occur in practice). He said (I think not unreasonably) that it was such a small part of the computation that its timing could not possibly be a problem – and that his program worked as expected nearly all of the time on lots of different input data. I had to resort to looking at the computer manufacturer’s source code for the mathematical libraries (which fortunately my big scientific client had available) in order to understand what was going on.
Needless to say, thereafter a look-up table was used: hope the programmer added a comment to stop anyone later reversing the solution to save memory and a few lines of code.
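A minimal sketch of the look-up-table fix described above, in Python rather than the original C. The table size of 1024 comes from the comment. The assumption that those 1024 values are evenly spaced phases over one full cycle, and the names SINE_TABLE and sin_lookup, are hypothetical and only for illustration.

```python
import math

TABLE_SIZE = 1024  # number of distinct arguments that occur in practice (from the comment)

# Built once at start-up. Assumption: the arguments are evenly spaced
# phases covering one full cycle of the sine wave.
SINE_TABLE = [math.sin(2.0 * math.pi * k / TABLE_SIZE) for k in range(TABLE_SIZE)]


def sin_lookup(phase_index: int) -> float:
    """Return sin(2*pi*phase_index/TABLE_SIZE) by table look-up.

    The cost is one index operation, so it does not depend on the
    argument value the way a library sin() with special-case paths can.
    """
    return SINE_TABLE[phase_index % TABLE_SIZE]
```

The point of the design is constant execution time per call, bought with a small fixed amount of memory.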
I am sure that it is possible that execution-time dependencies generating different computational paths would be a feature of the code some of you are criticising here (especially given the interest/attempts in porting to MIMD architectures – that’s a type of parallelism). And it need not involve any poor coding. And it would not be at all obvious how the timing issue affected computational results from totally different pieces of code.
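One concrete mechanism by which scheduling alone can change a numerical result, offered as an illustration and not as a diagnosis of the code being discussed: floating-point addition is not associative, so a parallel reduction that combines partial results in whatever order they happen to arrive can give different totals from run to run. The sketch below fakes two "arrival orders" in a single thread; the function name sum_in_order and the input values are hypothetical.

```python
def sum_in_order(values, order):
    """Add the values one at a time in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [1e16, 1.0, -1e16, 1.0]

# Same four numbers, two different orders of accumulation.
forward = sum_in_order(values, [0, 1, 2, 3])    # the first 1.0 is absorbed by 1e16
reordered = sum_in_order(values, [0, 2, 1, 3])  # the big terms cancel first

print(forward, reordered)  # prints 1.0 2.0
```

Neither order is "wrong" code; the difference comes purely from rounding at each step.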
[Aside: I might say those claiming to be programming experts because of work on large project commercial software, databases etc might be viewed as total amateurs on this sort of number-crunching. But then some numerical mathematician or avionics system programmer might turn up and say these digital signal processors are not the experts they think.]
Keep safe and best regards
@NIV, I think you are completely wrong regarding the PRNG issue. The reason code has to be made deterministic is not so that you can regression test it. Regression testing is a means, not an end. It is entirely to miss the territory for the map. The purpose of the regression tests (and tests more broadly) is to determine the correctness of the code. If the code is non deterministic that is almost impossible to do. (BTW, for non programmers reading along “deterministic” in this context means that if you give a program the same inputs you always get the same outputs. Which is to say the inputs “determine” the outputs, reliably.)
What I think is shocking about the article the OP cites is not the sloppy lack of proper process interlocks and proper management of the PRNG, it is the fact that the code is evidently horrible. (15kloc all in one file? I mean that tells you all you need to know. The fact it was written in C++ is a further indication, that is totally the wrong language for this kind of work in today’s world.)
And also, the fact that the “code is the documentation” is a perfect example that this is amateur hour. Don’t get me wrong, I certainly understand the reality of software and documentation, however, in this type of situation, where you are modelling something, you ABSOLUTELY need to write down a mathematical description of what you are modelling and that should translate directly into the code. It seems from the comments in this article, that that was not done. No doubt it is a mish mash of a bunch of papers, all jammed together without a thought of how the pieces interacted.
And that is where these three things come together. The problem is not the lack of determinism, or the lack of a comprehensive description of the function or the low quality of the code per se, it is the consequence of that, namely that nobody has the first damn clue if the output of the program is right or has any meaning at all. And a large part of that is because nobody knows what the right output is in the first place.
I’ve seen this sort of thing many times. Some scientist with no training in how to write software well, writes a little program for fun. Little programs are easy to write after all. He then adds to it over the years, and it is just his plaything, and nobody cares about its correctness. Then all of a sudden he becomes valuable to the politicians, and so they wheel him out of the dusty basement, throw lots of money at him and his little piece of crap toy program that has never been tested or validated, and is just an accreted bunch of crap suddenly becomes a gold standard. It tells you a lot that they brought in experts to try to clean it up. I’ve been one of those experts in similar situations: lipstick on a pig is too generous a description.
I think the author’s conclusion paragraph, which is readily understandable by non programmers, is the bullseye in this piece.
Fraser,
An unfortunately large part of my career was spent doing what I called “janitorial programming”.
I.e. Cleaning up other people’s messes.
In industry we referred to the phenomenon you describe as a piece of code “escaping” as opposed to being properly productized and released. Happened a lot with various useful tools and utilities.
OTOH, that’s basically how PhotoShop entered the world, so…
Wonderful comments from many here, especially Nigel S and Fraser O.
Just noting that judging Newton’s physics by peer review would have been largely unhelpful, and doing the same with his alchemy utterly pointless.
And here, when ‘science’ is used as a cloak for political ends, judge the scientist too. You don’t need to be a mathematical biologist to smell a rat.
“The purpose of the regression tests (and tests more broadly) is to determine the correctness of the code. If the code is non deterministic that is almost impossible to do.”
Not at all. As I said above, the ‘output’ of the program is actually a statistical distribution. What is required is that the statistical distribution of the output for a given set of input distributions remains the same.
The scientist using the software knows this. It probably never even occurs to him that anyone would expect different outputs of a given distribution to be exactly the same! Your uniform distribution random number generator always produces the exact same number every time it is run? Something wrong with that!
If you’re testing a random number generator, you require it to produce different outputs each time it’s run. What you’re testing is that the distribution of the numbers is correct.
Of course, that’s a more difficult and much slower thing to determine than whether two numbers are equal, so instead you can spot-test it by finding an exact output for a given initial seed. That doesn’t necessarily tell you that the result is still correct for any other initial seed, which is the actual requirement, but Hey! It’s easy to test!
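The two kinds of check being contrasted here can be sketched side by side. This is a hedged illustration using Python's standard random module, not anything taken from the code under review: chi_square_uniform measures how far a sample is from a flat distribution, while same_seed_same_output is the cheap exact-match spot test. The bin count, sample size and the 0.1% critical value of 27.88 (chi-square, 9 degrees of freedom) are choices made for this example.

```python
import random


def chi_square_uniform(samples, bins=10):
    """Chi-square statistic of samples in [0, 1) against a flat histogram."""
    counts = [0] * bins
    for x in samples:
        counts[min(int(x * bins), bins - 1)] += 1
    expected = len(samples) / bins
    return sum((c - expected) ** 2 / expected for c in counts)


def looks_uniform(rng, n=100_000, bins=10, critical=27.88):
    """Distribution test: slow, statistical, works for any seed.

    With a correct generator this still fails about once in a thousand
    runs, which is the price of testing a distribution.
    """
    samples = [rng.random() for _ in range(n)]
    return chi_square_uniform(samples, bins) < critical


def same_seed_same_output(seed, n=5):
    """Spot test: fast and exact, but says nothing about other seeds."""
    a, b = random.Random(seed), random.Random(seed)
    return [a.random() for _ in range(n)] == [b.random() for _ in range(n)]
```

The first test addresses the actual requirement (the shape of the output); the second only confirms that nothing changed for one particular seed.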
The way the scientist works is to start with simple cases, and check if the output makes intuitive sense, given their understanding of the phenomenon being simulated. Sometimes the output does something unexpected, so the scientist digs into the internals to figure out what is going on. Sometimes he finds a bug and corrects it. Sometimes he realises his intuition was wrong, and debugs that. Debugging the intuition is a big part of what makes him an expert. The code in the computer and the ‘code’ in his brain are evolving together, in tandem. It’s not like an accountancy or stock control package, where it’s written once and then used ‘as is’ forever after. It is being constantly and continually redeveloped and debugged, every day, by its user. New features are added to aid that debugging, to provide more sophisticated diagnostics to help the expert spot where something subtle may be going wrong. He is constantly exploring new territory, doing new things with it, so it’s of no use whatsoever to know that it still works back in well-explored civilisation. There is no once-and-for-all requirement that you can lay down, fix, freeze, and code a well-structured highly-engineered machine to address. The entire point of research software is that the requirement is continually changing. It’s a moving target. And half the code is in the user’s brain.
It’s like the difference between a battered land rover being used to cross the jungle, full of improvised bodges and fixes and tools the explorer finds useful for improvising their way around fallen logs and ditches and river beds, and a limousine used for moving around the city in luxury. The limousine expects a smooth flat road. No low-hanging tree branches to rip off the wing mirrors. No unexpected boulders to be shifted out of the way. The limousine designer expects requirements to change only rarely. And the designers of luxury limousines sniff disdainfully at the land rover’s crude duct-tape-and-string engineering.
It’s a completely different mindset, resulting in a different requirement, and the mutual incomprehension between scientists and software engineers is a long-standing problem. There are a huge number of advanced software engineering techniques that would be incredibly useful to the scientist explorer. They’re usually untrained amateurs who taught themselves to code, and so their code has lots of unnecessary inefficiencies and flaws. The jungle is full of the rusting wreckage of abandoned land rovers that got broken by bad maintenance and couldn’t be rescued. But the software engineers totally fail to communicate any of that because they insist on bringing everything back to the only paradigm they know, fitted to a particular sort of environment, where the requirements don’t keep changing, where you’re always doing the same relatively simple and well-understood thing over and over again. They don’t try to understand the customer’s requirement and work with it, they instead tell the customer they’re wrong.
A piece of code adds 12 uniformly distributed random numbers together and subtracts 6 to produce a random number with a standard Gaussian distribution. Somebody points out that the library has a proper Box-Muller Gaussian distribution generator, so you change the code. How do you regression-test your program, to be sure the output distribution hasn’t significantly changed? Can you see that a method that sets a particular seed, and demands a specific exact output is useless here? The requirement is about the distribution. The regression test has to test the distribution. If you are continually making such changes to the program, if every change you make breaks all your tests, so you have a constant added overhead of constantly rewriting (and testing!) all your tests too, you need to change your approach to regression testing.
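A rough sketch of what a distribution-level regression test for this change could look like. It compares samples from the old sum-of-12-uniforms generator with samples from a hand-written Box-Muller generator using a two-sample Kolmogorov-Smirnov distance. The function names, the sample size of 20,000 and the threshold coefficient 1.95 (roughly a 0.1% false-alarm rate) are assumptions made for the example, not anything from the code under discussion.

```python
import math
import random


def old_gaussian(rng):
    """Approximate N(0, 1): twelve uniforms summed, minus 6."""
    return sum(rng.random() for _ in range(12)) - 6.0


def box_muller(rng):
    """N(0, 1) via the Box-Muller transform (cosine branch only)."""
    u1 = 1.0 - rng.random()  # lies in (0, 1], so log() is safe
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def same_distribution(a, b, coeff=1.95):
    """True if the KS distance is below coeff * sqrt((n + m) / (n * m))."""
    n, m = len(a), len(b)
    return ks_statistic(a, b) < coeff * math.sqrt((n + m) / (n * m))


rng = random.Random(2020)
old = [old_gaussian(rng) for _ in range(20_000)]
new = [box_muller(rng) for _ in range(20_000)]
```

A test like same_distribution(old, new) survives the swap of generator, whereas a test that demands an exact output for a fixed seed breaks the moment the generator consumes random numbers differently. The trade-off is that it is slower and can only detect differences larger than its statistical resolution.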
The virus needs to be contained by regional authorities acting independently, with minimal involvement of the central government.
But Johnson wants to centralize power into Whitehall, despite Whitehall being utterly inept at handling anything outside of London.
Let Scotland, Wales, Midlands, devise and administer their own methods. Whitehall should stay in London and kindly sod off.
Apparently John Locke got his private peer review: he asked Christiaan Huygens if the math was correct.
There is a case to be made that the code used in models should always be released by publicly-funded research institutions. In this case, of course, we will simply see whether the prediction is correct. Not that i care one way or the other.
@Fraser
Maybe, but that is why real mathematicians do sense checks and other more technical analysis, and it has to be a real mathematician to do that when we’re talking about random outputs.
Whether Ferguson merits the description of mathematician is up for debate. I personally strongly doubt it but I don’t have enough evidence to be sure.
I would rather a shitty, cobbled-together program which produces results which are roughly in line with mathematical approximations than beautiful, sleek, optimised and well-documented code that produces bullshit after professional programmers have spent nine months writing it.
[Ducks and legs it, at speed]
I am minded to recall to our attention my (unusually confrontational) response to Mr Ecks back on March 18, 2020 at 10:20 am. Here, for today’s purpose, is the essence of it:
For those not following COVID-19 matters very closely, Report 9 is that of Imperial College’s Professor Ferguson, issued shortly before I posted my comment.
I (largely) support the views of NiV immediately above. Combining those with mine, I ask (a man’s Pretty Please – and why would that not be allowed here) can we look (too) to the relative ratios (for infections and/or deaths) for the various scenarios, rather than solely the absolute values. That would be so much more professional.
Keep safe and best regards
NiV:
No. Their program’s output was not meant to be a statistical distribution. It was meant to map inputs to outputs reliably. They made it a distribution because they could not make it run reliably – they just started running it many times and averaging what was meant to be a reliable output, but was in fact a random variable. They (and you) claim this is good enough, but this is an unsupported statement.
Apart from the coding implementation, their model seems to be so wildly over-defined as to be able to give you any result you want. It has 450 different parameters, many of them not a single float value, but a float matrix – and there is no indication of any sort which are based on ground truth, and which are pulled out of thin air. Pause to wrap your mind around this. 450+ degrees of freedom – floats, not boolean – most of them from parameters that are abso-bleeping-lutely unknowable and also changing over time: try, for example, “Relative place contact rate given enhanced social distancing by place type after change” or “Proportion of key workers whose households are also treated as key workers”. Ferguson and his ilk have admitted before that their models’ outputs can swing wildly based on a single input’s change, and therefore they restrict very tightly how much inputs can be varied between runs; this of course means that ALL these obscure variables absolutely MUST have a basis in reality. It’s also not (per your earlier claim) how you run a proper Monte Carlo. You need to let your variables, well, vary, and you need to run, oh, who knows how many with that many inputs, but certainly many more than hundreds of simulations.
Oh, also, the model has numerous special rules (e.g. hotels are excluded from “place sweep”) for which there is no documented reason.
It’s a modeling disaster. It’s a Rube Goldberg machine that consistently overcounts. Wassily Leontief’s input-output models were a similarly ambitious and similarly successful attempt to model a system whose complexity is many orders of magnitude beyond human ability to conceptualize, but at least they were simple (ha!) systems of thousands of linear equations.
Plamus writes:
I prefer the substitution:
But actually, even that is not true. The required outputs are all probability distributions or their close derivatives.
Keep safe and best regards
I think this thread is now long enough for me, the originator of the post, to chip in without danger of distorting it.
This is where I work. If one can have expertise in software engineering, I have it. Although by far the larger part of my experience was gained commercially, I have coded in academia and in academic-commercial joint projects. (I have even coded in an EU-run joint academic-commercial project, which was insightful – the insight gained was that the word ‘dysfunctional’ was far too weak.) I once reduced an academic to tears, and to giving me a solemn promise he would never work on the software side again, at the end of explaining to him how completely his ‘expertise’ was ignorance. (Although I wished he had not cried – and felt that, though firm, my manner had hardly justified it – I respected him for having that very basic competence that lets a man see how little he knows, plus that basic honesty that did not try to bluster his way out of it.)
So with my claim to being an expert and Feynman’s quote about the ignorance of experts both kept well in mind, let’s try and grasp what we are looking at here. A possible domain equivalent of Feynman’s remark could be
A teenager – the teenage Niall Kilmartin, for example – writing their first program in school, thinks that since they understand what they are trying to do, and understand the programming language, their code will just naturally express their intent, needing no great further help. But as programs grow from ‘Hello, World!’ towards the solving of hard problems, this opinion of the unaided capacity of domain knowledge and computer literacy moves from being probable, through imprudent to risky and then to being quite ridiculous.
I have known really competent coding academics – cryptographers, for example. But I also know that academia is full of people who have never progressed beyond my teenage state – and who do not even suspect this. Professor Ferguson and his team are in this state.
For clarity, consider the following analogy: imagine the prof turns his epidemiology model into, not a computer program, but a legal action plan for US lockdown. Imagine bobby b, say, explaining to us that the so-called plan betrays quite laughable ignorance of how law works in general let alone of the US constitution in particular, that it tells us the professor does not even suspect how little he knows of very basic details of how courts work, of what legal terms of art really mean, of how the police function, etc., etc. Bobby b would not be criticising the prof’s epidemiological skills; he would not even be commenting on those. He would be noting Ferguson’s failure to turn his model into a legally viable approach to containing the virus. Bobby b’s damning critique would be quite independent of the question whether the model, if turned into legal rules and/or emergency decrees by someone competent at that task, would or would not be beneficial.
We are not even talking here about whether the numbers predicted by the model would correspond to likely outcomes in the real world or not. We are talking about whether the numbers spat out by the program correspond to the numbers the model would have predicted if humans alone had been capable of solving it for real-world-sized examples.
So far, what we appear to know from what code is in the public domain is that, for two decades, the team are more likely to have been taking their code for a random walk than developing it in any flattering sense of that term. Did they ever even once push pen and paper around to get stats for a country of 30 inhabitants living in 10 families and then verify that the model’s output stats conformed? Did they ever verify that they still did so, as they added new features? Or did they see the program’s output as numbers to be believed, not verified, relying on a largely-unconscious view of the coding process that I had when I was 15 years old?
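To make the pen-and-paper idea concrete, here is a toy check of that kind. It does not use the Imperial model: it assumes a textbook Reed-Frost chain in a single household of three with one initial case, where each infectious person independently infects each susceptible with probability p per generation. For that case the final-size probabilities can be worked out by hand, with q = 1 - p: P(1) = q^2, P(2) = 2pq^2, P(3) = p^2 + 2p^2q. All function names are hypothetical.

```python
import random
from collections import Counter


def exact_final_size(p):
    """Hand-derived final-size distribution for a household of three."""
    q = 1.0 - p
    return {1: q * q, 2: 2 * p * q * q, 3: p * p + 2 * p * p * q}


def simulate_household(p, rng, size=3):
    """One Reed-Frost outbreak; returns the total number ever infected."""
    susceptible, infectious, total = size - 1, 1, 1
    while infectious and susceptible:
        escape = (1.0 - p) ** infectious  # chance one susceptible avoids everyone
        new = sum(1 for _ in range(susceptible) if rng.random() >= escape)
        susceptible -= new
        total += new
        infectious = new
    return total


def simulated_final_size(p, runs, seed):
    """Relative frequency of each final size over many simulated outbreaks."""
    rng = random.Random(seed)
    counts = Counter(simulate_household(p, rng) for _ in range(runs))
    return {k: counts[k] / runs for k in (1, 2, 3)}
```

Comparing simulated_final_size(0.3, 100_000, seed) against exact_final_size(0.3) with a small tolerance is the sort of check that can be re-run every time a feature is added, and it tests statistical rather than bit-identical agreement.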
I cannot close without noticing the revealingly-wasted time spent by critics of the OP link argument challenging the secondary matter of reproducibility from same-seed, as if that could restore credibility – critics who did not remark the CPU affinity or data table organisation parallels. That tests might check for statistical, not identical, conformance, depending on their manner of interaction with the program, is as true as it is irrelevant: the more basic failure was that there should have been some, and the non-reproducibility of runs was a probable but secondary symptom of their lack. As far as we common people have so far been allowed to know, there are no tests – and good grounds to think that manual regression testing was never done to the level this task requires, and maybe not at all.
I am late to this game. I can add nothing to what NiV or Nigel Sedgwick have said. Even though it is my field. Oh, well. I’m not a huge fan of numerical methods. It’s kinda why I have a mechanical watch – Omega Speedmaster Automatic. I mean I know that Monte-Carlo and Runge-Kutta and all that malarkey are useful but… They’re just ugly and I like my maths to be beautiful. Gimli might give a technically excellent blowjob but I’d still be thinking of Galadriel.
Thank you again NickM. Mr Ed too.
Please note I am making a special exception for you both, as my personally chosen limit per Samizdata comment stream is 5 (higher IIRC than for anywhere else). And this is the 6th.
If Niall, sensible waiter[sic] that he is, opens another OP of suitable content – I might well have further views.
Keep safe all – keep engaged too
Niall: thanks to your comment, i have had another look at the Sue Denim article.
The 1st time, i stopped reading in the middle of the very long section titled Non-deterministic outputs (which takes up half of the article).
Apparently other commenters did the same, which is why they have been going down that particular rabbit hole.
That section should have been placed last, with a warning to readers that it can be skipped.
The following sections about problems with the code:
No tests;
Undocumented equations;
Continuing development;
highlight what i’d think are more serious problems.
Not that failure to replicate results starting with the same seed isn’t worrying, but i’d think that it does not necessarily imply that the code won’t work. (Different results with different hardware is even more worrying — or is it? if you get different results with the same hardware, of course you get different results with different hardware.)
+1, Snorri.
I have the impression that it is not unusual for domain experts to put their most important comments briefly and spend much time on secondary matters, unconsciously assuming their readers will be other domain experts who will rank the author’s points by importance not by length and/or will want more on the secondary matters that are less usual, needing no long-winded expansion of basic points whose importance will be instantly grasped.
I recall an example from the military domain; a historian quoted a dramatically-phrased report by a German officer of beating back a Russian attack in WWII and then (very helpfully for the general reader) explained that the key sentence
is so brief, so quietly mentioned within other far more exciting information that a non-military reader could easily miss it – and could easily miss that they did not know what it means.
(FYI, it means that defending troops must be so well disciplined that they wholly resist the urge to fire until the attacking infantry are terrifyingly close – then abruptly on-command start to fire in sustained bursts, bravely ignoring returned and/or covering fire as they stay on aim to control their weapons, so killing the maximum number in the short period before the enemy go to ground too close to evade further losses or to be an effective fire-step for the next attack.)
The officer expected to be read by a superior or a colleague, i.e. another domain expert, and/or was not focussed on a wider audience. The historian was writing for the general reader, so knew to translate, not the words, but the relative importance of the points.
(Arguably, I should have done the same as that historian. One sees these things in retrospect.)
The only thing we know for sure about this and all other models is they are all at least partly wrong. Question is how right should they be to base public policy on them? Especially when such policy involves millions of lives or trillions of dollars.
I think the answer is “they need to be a lot more right than any of them are”. But how to turn that statement into policy?
I like one comment above saying ‘keep the central government out’. That’s a very good start. Here’s another: no public health policy to be based on any model that can’t predict next year’s flu season any better than 20%. And the model has to do that three years’ running before the government can use it for anything.
Oh, and peer review, lots of it (multiple institutions, dozens of reviewers, anonymous and open reviews, etc.), of the model, the code, the validation testing, etc. No scientist should have any problem with such requirements.
“I once reduced an academic to tears, and to giving me a solemn promise he would never work on the software side again, at the end of explaining to him how completely his ‘expertise’ was ignorance.”
Mmm. Yes. I’ve had the same sort of experience explaining to software engineers (and software auditors) how completely their ‘expertise’ too was ignorance. There were no tears, but they were a bit annoyed about it!
“Did they ever even once push pen and paper around to get stats for a country of 30 inhabitants living in 10 families and then verify that the model’s output stats conformed? Did they ever verify that they still did so, as they added new features?”
OK. Did they?
How do you know? Do you have *evidence* they didn’t, or are you just *assuming* they didn’t, because that’s your view of academics?
“I cannot close without noticing the revealingly-wasted time spent by critics of the OP link argument challenging the secondary matter of reproducibility from same-seed, as if that could restore credibility”
I’m not sure I follow this. So far as I could see, the OP link’s argument seemed to consist almost *entirely* of criticising non-reproducibility from same-seed, which of course is not an issue for stochastic methods. (And we only spent so long discussing it because a number of people didn’t understand and needed a more detailed explanation.)
That’s not to say there are no material criticisms to be made – I’m sure there are. But this non-exact-reproducibility isn’t a material failure in the code, and the fact the author of the OP link seemed to think it was seems to me more indicative of the reviewer’s ignorance. Likewise, the way other people leapt on his comments as a vindication, also apparently without understanding what the issue was. “Belief in the ignorance of experts” applies as much to this anonymous reviewer as to Prof Ferguson.
As I said, I think it’s great that people are working on the code, and when they find some actual material problems with the simulation I’m sure that will be an interesting debate. But until then, I think we’re being a bit premature.
+1 to Niall’s comment. Your analysis is spot on. This is a classic Dunning-Kruger event.
I have hired and fired a lot of programmers and I can tell you quite simply the difference. Bad programmers write code and assume that what they wrote is correct. Good programmers write code and assume that what they wrote is not correct. The good programmers are almost always right, so good programmers write their code in such a way as to prevent or make easily detectable these bugs.
The idea that you can bodge together a large software model like a bodged together Land Rover betrays a deep ignorance of this reality. Frankly, Land Rovers are simple machines compared to even moderately sized programs, and Land Rover mechanical parts work together in fairly intuitive ways. Programs are notorious for their non obvious or non intuitive behaviors. And when you introduce multi-threading, or GPUs into the mix the complexity and non intuitive behavior increases one or two orders of magnitude. Good programmers spend most of their time building tools and infrastructure to wrestle these beasts of complexity and unpredictability into submission, to manage it down to simplicity and predictability. When multithreading is involved good programmers’ antennae are up, wrapping that threading up in protective layers because they know not only that the synchronization problems are the most likely cause of bugs, but to avoid the hell that is the process of finding and fixing bugs arising out of multithreading sync problems. They are often almost impossible, and impossibly frustrating, to find and fix.[*]
Which is why what this programmer did is so very wrong. It is deeply revealing that he doesn’t have the first clue how to do this.
Now to NIV’s point, sure, screwing around in your lab to put together a toy program to experiment with your ideas, I totally get it. It makes some sense. But to allow that program to “escape” as some commenter excellently described it, is not only a mistake, but a gross failure of the most basic scientific duty. But in a sense I don’t really blame him. He is just some unimportant scientist wheeled out of the basement into the starry light of the red carpet because he is a useful idiot to the politicians.
Bottom line — the reason why systematic testing (and consequently, determinism) and complexity management is necessary is not some idle software engineering curiosity. It is to determine that the results are correct. Just because your program pukes out a bunch of data doesn’t tell you anything at all about how correct it is. Musing with your intuitive thoughts is all very well — that is something you do with the mathematical model and publish in scientific papers. Translating that into code is an engineering discipline. Normally it doesn’t matter, because most of what these sorts of scientist do doesn’t really matter. However, the real world evidence is that his models did not model the real world accurately. In fact his results were DRAMATICALLY wrong. And his carelessness has been the proximate cause of burning down the whole world.
That is quite an achievement for a minor academic. It would be nice if he had the decency to issue a public apology and slink away into his previous irrelevancy. The fact that he doesn’t tells you a lot about the state of science today.
[*] BTW, to go full circle, the reason multi-threaded bugs are so hard to fix is that they are non deterministic. It is almost impossible to recreate them reliably. The first priority in addressing such bugs is to reproduce them reliably — to eliminate the non determinism. Which is why the correct thing to do with this model is throw it away and rewrite it using people who know what they are doing.
NiV: “and when they find some actual material problems with the simulation”
Would, ‘his simulation has been consistently wrong on every occasion Ferguson put his results into the public domain’, be a material problem?
Would there be much difference between his simulation and, me walking down the pub* – giving you a low end figure at the beginning of the evening, getting pissed, then giving you a bigger number at the end of the evening. The measure is, can I be wrong more often than Neil Ferguson’s Commodore 64 ?
*for added randomness, it could be a different pub each night.
NiV: “I think we’re being a bit premature.”
Economy takes a nose dive, couple of million extra unemployed, airline industry decimated. Now, you decide to be cautious. Huh!
“The good programmers are almost always right, so good programmers write their code in such a way as to prevent or make easily detectable these bugs.”
So why doesn’t that work?
“Bottom line — the reason why systematic testing (and consequently, determinism) and complexity management is necessary is not some idle software engineering curiosity.”
I agree that systematic testing is absolutely necessary. But the testing has to test the *actual* requirements for the program’s output, which in the case of research software are nearly impossible to capture, and even harder to automate. It requires an active intelligence to investigate and understand what the software is doing internally.
People who write down a bunch of requirements they ‘captured’, write automated tests, and then say “It passed the automated tests, so I can assume it’s accurate and deliver it” are doing precisely what you complained about – assuming that the code they wrote (to test it!) did what it was intended to.
And this was a classic case of the problem. To write automatic intelligence-free tests of software that uses random numbers, you need to be able to set the seeds and generate output test vectors to check against. “Are the numbers output exactly the same? Check! Move on.” But the test proposed is testing the wrong thing. The software is by design not supposed to be deterministic. So if you make it deterministic to be able to test it, you’re contradicting the requirement! It’s wrong by definition.
In practice, requirements are always missed, misunderstood, ignored as too difficult, approximated, or done incompletely. And even if you can correctly specify what you want, testing the correctness of software is like automated theorem-proving, an NP-hard problem. The number of paths through the software expands exponentially. There are Formal Methods that can, at huge expense, prove the correctness (assuming correct specification) of some simple programs, but that’s not what 99% of software engineers actually do. Nobody in the industry guarantees correctness. You can take it for granted that any complex bit of software is almost certainly not correct.
“However, the real world evidence is that his models did not model the real world accurately. In fact his results were DRAMATICALLY wrong.”
What are you referring to here? Do you mean when he claimed we were on track for around 20,000 deaths, and we have just passed 30,000? I think he caveated that quite heavily at the time.
GregWA:
Glenn Reynolds, channeling George Box, put it best:
All models are wrong, but some are useful.
That goes for quantum mechanics and general relativity.
If you ask me, Boris is a silly twit (and that is an understatement) for even looking at them.
He should have looked at the exponential rise* in the number of deaths, first in Italy, then in the UK, and asked “experts” when it is going to slow down. The “experts” who gave a definite answer should have been dismissed as obvious charlatans.
Those who said that we cannot possibly know, should have been asked what can be done about it.
There is too much to say about this, so I won’t say anything.
You should follow Nassim Taleb on twitter.
You should have put a full stop after ‘model’.
In this situation, of course: when it comes to well-understood diseases, that’s a different matter.
* That is assuming, of course, that Boris understands the concept of ‘exponential rise’.
@ Nullius in Verba
So why doesn’t that work?
I’ll assume your question is “why does most software have bugs?” And the answer is twofold: first most software is written by bad programmers (since the large majority of “programmers” are really terrible), second the amount of complexity there is in software — it is a miracle it works at all.
Life critical software — and epidemiological models must surely fit in that category since millions of people’s lives depend on them — is written to a very different standard than the Alarm Clock app on your phone.
I agree that systematic testing is absolutely necessary. But the testing has to test the *actual* requirements for the program’s output, which in the case of research software are nearly impossible to capture, and even harder to automate. It requires an active intelligence to investigate and understand what the software is doing internally.
You are confusing requirements and software. Certainly requirements are hard to capture, and certainly they develop over time. However, whatever the requirements are the software should be built to meet them and tested to ensure they are met as written. You don’t fiddle with the requirements in code — you fiddle with the requirements in the requirements. In fact this is the very reason that fiddling around burying your requirements in the source code is an absolutely terrible idea. Requirements are a much higher level of abstraction than C++ code, and consequently are much easier to analyze, manipulate, investigate and so on to determine what you want. Doing that inside the software — where the requirements are mashed into the software along with all the million implementation decisions, is an absolutely terrible idea. It makes the requirements harder to see, much harder to change, and impossible to test.
For this type of software in particular written requirements along with a process to convert the requirements into code faithfully is absolutely vital. Perhaps the epidemiologist should just be writing the requirements and he can hop on RentACoder.com to get someone competent to put his ideas into practice. That way he can operate at the level of a scientist rather than pretending he is a code jockey.
And this was a classic case of the problem. To write automatic intelligence-free tests of software that uses random numbers, you need to be able to set the seeds and generate output test vectors to check against. “Are the numbers output exactly the same? Check! Move on.” But the test proposed is testing the wrong thing. The software is by design not supposed to be deterministic. So if you make it deterministic to be able to test it, you’re contradicting the requirement! It’s wrong by definition.
That isn’t true. What you are doing is isolating all the non deterministic behavior into one single seed variable. That way you can test the software adequately, and also have a pseudo non determinism to simulate the stochastic process. So run your simulation ten thousand times with different seeds. Each run will be deterministic (which is to say your testing can say with some confidence that the results are correct), but the results from different runs will be different which will provide the distribution you are seeking.
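A minimal Python sketch of that idea, assuming a toy outbreak model. The `simulate` function, its parameters and its growth rule are invented for illustration and have nothing to do with the Imperial code. All randomness flows from one seed, so any single run is repeatable, while a sweep over seeds gives the distribution.

```python
import random
import statistics

def simulate(seed, population=1000, p_infect=0.3, rounds=10):
    """Toy stochastic outbreak: every random draw comes from one seeded PRNG."""
    rng = random.Random(seed)  # the single source of non-determinism
    infected = 1
    for _ in range(rounds):
        # Each infected person gets two chances per round to infect someone.
        new_cases = sum(1 for _ in range(infected * 2) if rng.random() < p_infect)
        infected = min(population, infected + new_cases)
    return infected

# Same seed, same answer: this is what a regression test can pin down.
assert simulate(42) == simulate(42)

# Different seeds: the spread of outcomes the stochastic model is meant to produce.
results = [simulate(seed) for seed in range(1000)]
print(min(results), statistics.median(results), max(results))
```

Each individual run can be checked against a stored result, and the sweep over seeds is what you would analyse as the model's actual output.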
Perhaps it would be helpful if I explained how I would approach a problem like this. (I’m not super familiar with the problem domain, but this is a good general approach to developing simulation software. It is, for example, the approach I used when I wrote some software to simulate people’s behavior when evacuating a building in a fire.)
1. Define a set of rules that you believe describes the problem domain.
2. Define a set of composition rules that you believe describes how these rules interact.
3. Write unit tests to verify that the rules in 1 give the correct behavior
4. Write unit tests to verify that the rules in 2 give the correct behavior using IoC to isolate the composition from the individual rules
5. Write an engine that executes a table of these rules.
a. A crucial part of this engine is that it defines a timebase, so that the rules are executed in a set of rounds (like a game) so that we abstract away the impact of real time behavior. The lack of a timebase is probably the main reason the software in the model is screwed up.
b. This engine manages a thread pool with their own seeded PRNGs which are generated deterministically from the main PRNG
c. The thread interaction takes place based on the timebase, not the caprice of the thread scheduler
d. The engine has a seedable PRNG, which is used to seed the threads and generate a random set of starting conditions.
6. Write some unit tests to ensure that the engine, the timebase and thread interaction work correctly (using IoC to abstract away from the particulars of the rules.)
7. Write some simple End To End tests that use a very, very small model for which the results can be manually calculated.
8. Optionally write some special case End To End tests that test some special case behaviors that might be important.
9. If possible build tools that monitor what is going on and use redundant calculations to verify that the results are going as expected.
Using this approach you can produce a deterministic multi threaded product that is seedable, manageable, and debuggable. It tests as best as is possible, and doesn’t bury the requirements in the code.
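To make step 5 concrete, here is a rough Python sketch of such an engine. Everything in it is an assumption made for illustration: the two rules, the ring of cells, and names such as `run` and `step_worker` are hypothetical, not part of any real model. It shows a table of rules, a round-based timebase, per-worker PRNGs derived from one master seed, and thread interaction that happens only at round boundaries.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def rule_grow(value, neighbours, rng):
    """Hypothetical rule: existing cases in a cell generate some new ones."""
    return value + (rng.randint(0, value) if value > 0 else 0)

def rule_spread(value, neighbours, rng):
    """Hypothetical rule: a cell may pick up one case from an infected neighbour."""
    return value + (1 if any(neighbours) and rng.random() < 0.5 else 0)

RULES = [rule_grow, rule_spread]  # the table of rules the engine executes

def step_worker(worker, n_workers, snapshot, rng):
    """Apply every rule to the cells this worker owns, reading only the snapshot."""
    n = len(snapshot)
    out = {}
    for i in range(worker, n, n_workers):  # fixed ownership: cell i -> worker i % n_workers
        neighbours = (snapshot[(i - 1) % n], snapshot[(i + 1) % n])
        value = snapshot[i]
        for rule in RULES:
            value = rule(value, neighbours, rng)
        out[i] = value
    return out

def run(master_seed, n_cells=8, n_workers=4, rounds=5):
    master = random.Random(master_seed)
    # Worker PRNGs are derived deterministically from the master PRNG.
    rngs = [random.Random(master.getrandbits(64)) for _ in range(n_workers)]
    # The master PRNG also picks the starting conditions.
    state = [0] * n_cells
    state[master.randrange(n_cells)] = 1
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(rounds):  # the timebase: one iteration is one round
            snapshot = tuple(state)  # immutable view of the previous round
            futures = [pool.submit(step_worker, w, n_workers, snapshot, rngs[w])
                       for w in range(n_workers)]
            for future in futures:  # results are committed only at the round boundary
                for i, value in future.result().items():
                    state[i] = value
    return state

print(run(2020))
print(run(2020) == run(2020))
```

Because each worker owns a fixed set of cells, draws only from its own PRNG, and reads only the previous round's snapshot, the order in which the scheduler happens to run the threads cannot change the result for a given seed and worker count.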
The rules in 1 and 2 are very simple mathematical equations, or sometimes simple program functions. These rules? The scientist can write them, and play around with them to their heart’s content. They change the rules, run the unit tests and, on failure, determine if the rule change is good or if the test is bad. Moreover, with this simulation engine he can interact with other scientists and have them examine his rules (nobody is going to look at his rat’s nest software) and discuss them on an appropriate level, propose new rules, rerun with different sets of rules etc.
The point is that the experimental bit — the rules — are isolated and clearly stated. The engine can be debugged and tested to make sure it will execute the rules by using fixed debuggable rules, and the scientist can do science by defining the rules at a high level, without getting buried in a big messy steaming pile of crap.
You wouldn’t let an epidemiologist design a bridge that people’s lives depended on, why the hell would you let him write a piece of software that people’s lives depended on?
As I was thinking about this I remembered a project that I did. I worked for a chemical company who treated water with various chemicals to prevent fouling and pollution. There were hundreds of chemicals and the chemical interactions were extremely complicated. Of course my high school chemistry wasn’t up to the job. So I wrote a tool that allowed the scientists to describe the way various chemicals interacted in different concentrations, with different catalysts, in different environments, temps etc.. They didn’t write any code, just chemistry in a bunch of excel spreadsheets using a template I provided for them (and which we interactively modified to meet their needs.)
The chemists specified all the chemical behaviors they cared about and my software read all this in and generated C# code which was a set of unit tests. Then the main software that was designed to know all this information could run its calculation engines through my unit tests to verify it behaved in accordance with what the chemists said was the correct behavior.
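The original tool generated C# source from Excel files. As a loose illustration of the same idea, here is a Python sketch that builds unit tests directly from a CSV table. The column names, the sample rows and the `classify` stand-in for the calculation engine are all invented for the example.

```python
import csv
import io
import unittest

# Hypothetical specification, as a domain expert might export it from a spreadsheet.
SPEC_CSV = """chemical,concentration_ppm,temp_c,expected
inhibitor_a,10,25,ok
inhibitor_a,500,25,overdose
inhibitor_a,10,95,degraded
"""

def classify(chemical, concentration_ppm, temp_c):
    """Stand-in for the real calculation engine (invented for this sketch)."""
    if temp_c > 90:
        return "degraded"
    if concentration_ppm > 100:
        return "overdose"
    return "ok"

def build_test_case(spec_text):
    """Turn each spreadsheet row into one generated unit test method."""
    class GeneratedSpecTests(unittest.TestCase):
        pass

    for n, row in enumerate(csv.DictReader(io.StringIO(spec_text))):
        def test(self, row=row):  # bind the current row to this test
            got = classify(row["chemical"],
                           float(row["concentration_ppm"]),
                           float(row["temp_c"]))
            self.assertEqual(got, row["expected"])
        setattr(GeneratedSpecTests, f"test_row_{n}_{row['chemical']}", test)
    return GeneratedSpecTests

suite = unittest.defaultTestLoader.loadTestsFromTestCase(build_test_case(SPEC_CSV))
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())
```

The domain experts only ever edit the table. A disagreement between the table and the engine shows up as a failing test named after the offending row.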
I sat in a few meetings where the chemists discussed these spreadsheet specifications and the right behaviors. Looking stuff up in books, doing experiments in the lab etc. The chemists talked chemistry, not code.
Chemists were doing chemistry, and could focus on that rather than fiddling around with C# (or god forbid C++) code. Chemists are good at chemistry, but they are terrible at C++.
The project was a tremendous success.
“You are confusing requirements and software. Certainly requirements are hard to capture, and certainly they develop over time. However, whatever the requirements are the software should be built to meet them and tested to ensure they are met as written.”
Sorry, I was unclear in the way I phrased that. What I meant was that you have to test that the code meets your *actual* requirements, not the requirements you’ve written down. In the same way that people write code thinking that it captures what they wanted the computer to do rather than what they told the computer to do, the same applies to requirements.
The problem is compounded with research software, because in this case you don’t know what you want it to do until you have tried to get it to do it and seen how and why it fails. You can’t sit down up front and write down an equation, because *nobody* knows what the equation is supposed to be. It’s research!
And that’s why any development paradigm based on the Waterfall model – Requirements, then Design, then Implementation, then Verification, then Maintenance – doesn’t work for research, and why we have this persistent problem with paradigms. You write down requirements, implement tests, implement code, look at the output, and realise the requirements and equations you wrote were wrong and it’s not doing what you want. So you change the code one way and try it, no, try something else, no, try another thing, yes! That’s a lot closer! And you gradually home in on the equations you need.
Testing is at the very heart of how it works, but not the sort of automated pass-or-fail test script you’re talking about. You plot out pictures and animations and timelines, you plot graphs, you spot outliers and oddities and dig into them with more detail, you pick particular cases and follow what happens to them, you trace the execution of the code step-by-step to understand how the results you see arise, and you think about whether those changes are realistic and what you want. Is the code wrong, or is your intuition? Did you implement the equation wrongly, or did you implement the wrong equation? This is testing. It’s the same sort of testing a coder does when they’re trying to locate a bug. But it’s not a formal ‘test’ test. It’s quite likely you don’t even write down what tests you did. But quite often the instrumentation you insert to diagnose and identify the issue becomes the new functionality, that enables you to diagnose similar problems in future. The software *is* the test harness.
Only when you have *finished* the process can you sit down, write out the rules and equations, write requirements and describe the functionality you want, and then hand it all over to a software engineer and hope you expressed yourself clearly and unambiguously enough that they can build what you want from the verbal description alone.
“The lack of a timebase is probably the main reason the software in the model is screwed up.”
Does it lack a timebase? It looks like it’s got one to me.
So far as I can see, its use of multithreading is solely via OpenMP that does ‘parallel for’ to various loops and spreads the iterations in each loop across the available threads. Each thread is allocated its own separately-seeded PRNG. I’m not sure, but I strongly suspect the issue is that you don’t have control over how that allocation is done, so the iterations (and hence simulated individuals) don’t always get allocated to the thread with the same PRNG. (Or at least, I can’t see anything that would let you control it.)
In a sense, the problem would be they didn’t implement their own multi-threading, they used a library implementation produced by software engineers, and that didn’t give them the control they needed! I think they need to divide the data into blocks themselves, and then ensure that the set of blocks processed together in parallel each get the PRNG appropriate to the data block number, not the thread number, and remain sequential within each block.
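A small Python sketch of that suggestion, written with a thread pool rather than OpenMP and using made-up names (`process_block`, `run`, the seed-mixing constant). The PRNG is keyed to the data block number, never to the thread, and each block is processed sequentially.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def process_block(master_seed, block_no, block):
    """The PRNG depends only on the master seed and the block number."""
    rng = random.Random(master_seed * 1_000_003 + block_no)  # arbitrary mixing, for illustration
    # Sequential within the block, so draws are consumed in a fixed order.
    return [x + rng.random() for x in block]

def run(data, master_seed, n_threads, block_size=4):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda item: process_block(master_seed, *item),
                         enumerate(blocks))
        return [y for part in parts for y in part]

data = list(range(20))
# The thread count changes who does the work, not what the answer is.
print(run(data, 123, n_threads=1) == run(data, 123, n_threads=8))
```

Which thread happens to pick up a block no longer matters, because nothing about the thread feeds into the random stream.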
But I’ve only just started looking at the code. It’s a first impression.
@Nullius in Verba
Sorry, I was unclear in the way I phrased that. What I meant was that you have to test that the code meets your *actual* requirements, not the requirements you’ve written down.
No, you were clear enough. That is what I understood you to mean. But I think your perspective is completely wrong. Code should implement the requirements exactly as written. Of course if those requirements are incorrect or soft, that is fine, you change the requirements then you change the code. The idea that you randomly change the requirements as you write the code is dreadful. It is untestable, brittle, guaranteed to be buggy and perhaps most importantly in these cases, not transparent. Nobody can see what you are doing without digging deep into the code. Apparently you are doing that dig, good luck, I don’t have the stomach for it.
And that’s why any development paradigm based on the Waterfall model – Requirements, then Design, then Implementation, then Verification, then Maintenance
I didn’t advocate for that, I would never advocate for that. That paradigm of software development simply doesn’t work. Even in cases where it did work, it only worked because they pretended to do it that way while actually doing it a different way. The correct approach in this situation where, as you say, the requirements are fluid, is an agile model, which is specifically designed to manage changing requirements. Though in this case, one guy fiddling for years, probably better is a rules based model without a specific development paradigm. You can play around with the rules to your heart’s content, just don’t mess with the engine that implements them without extreme care. And maintain the unit tests as you fiddle around. Keep the rules simple, cohesive, side effect free, short, mathematical in nature and so forth. Written in such a way that a non programmer can look at them and understand the rule.
Only when you have *finished* the process can you sit down, write out the rules and equations
That’s not true. When you first write the software you have rules in mind, you just bury them in the code in an ugly intractable way. This approach makes it harder to play around with the requirements, not easier. And as I said above, if you have a rules based approach you can actually discuss the rules with other scientists, because you have the level of abstraction correct.
The thing with the East Anglia CRU is that when people dug into the code their rules were shmushed all over the place and people dug in and found all these incriminating comments. If they had abstracted the rules out into an explicit upfront statement then their kludges could have been discussed and reviewed. For example, they could have made the case for the rule that changed the tree ring data treatment in 1960. Maybe that is justifiable, but by burying it in the code in an impossible to find way it looks mighty suspicious. You can publish the rules statement in an academic paper and have an academic discussion about it without having to have your colleagues read over 15kloc of impenetrable garbage.
Writing code this way is the opposite of peer review. It is exactly the opposite of how science software should be written.
Like I say with my Chemist example — the chemists discussed the rules as given in a form that was designed to make it easy to understand because it was stripped of all detail except the chemistry. I sat in the meetings and often didn’t have a clue about the subtle points they were making to each other. I still don’t really know what “enthalpy” means. However, they encoded it in the rules and it made its way into the test suite.
Does it lack a timebase? It looks like it’s got one to me.
Maybe. I don’t know; I haven’t looked at the code. So I was speculating on the cause. It would explain the behavior, but for sure I could be wrong. It is the sort of dumb ass mistake that somebody who doesn’t know what they are doing would make.
But I’ve only just started looking at the code. It’s a first impression.
Good luck with that.
A bit late to comment on this one, been busy.
I agree with “Sue Denim”s analysis and conclusion. I was impressed by Nigel Sedgwick’s description of a sin() issue: I’ve not encountered that one, though I have comparable war stories.
Many commenters failed to read enough of Sue’s article, as she clearly separated the non-deterministic behaviour from the use of a stochastic (monte-carlo) algorithm.
I have similar experience, in my case, for generating cryptographic authentication challenges used in a £bn industry.
1. You design to use a settable seed IN TEST so you can regression test, and ensure that whatever correctness you expensively tested for at the beginning, you retain after every change. Redoing a regression test is cheap.
2. You perform a design validation run to establish that what comes out matches what should come out, per spec. This is done at the beginning, as it’s expensive.
3. You perform separate tests on random number ‘randomness’ and ensure that sufficient entropy is used IN LIVE to generate a suitable statistical variance. This needs re-doing after each change in any area that relates to the PRNG.
4. That’s not all, of course.
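Points 1 and 3 can be sketched in a few lines of Python. The names `make_rng` and `make_challenge` are hypothetical, and the bit-count check at the end is only a crude sanity check, not a substitute for a proper randomness test suite.

```python
import random

def make_rng(test_seed=None):
    """Settable seed IN TEST; operating-system entropy IN LIVE."""
    if test_seed is not None:
        return random.Random(test_seed)  # repeatable, for regression tests
    return random.SystemRandom()         # draws from the OS entropy source

def make_challenge(rng, n_bytes=16):
    """Produce a hex-encoded challenge from whichever generator is supplied."""
    return bytes(rng.getrandbits(8) for _ in range(n_bytes)).hex()

# Regression test: the same seed must keep giving the same challenge after every change.
assert make_challenge(make_rng(1)) == make_challenge(make_rng(1))

# Crude live check: roughly half of the bits should be set.
live = make_rng()
ones = sum(bin(live.getrandbits(8)).count("1") for _ in range(10_000))
print(ones / 80_000)
```

The code under test is identical in both modes. Only the generator handed to it differs, which is what keeps the regression test cheap to rerun.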
But enough of the crap code.
My angle on this is what liability does Neil Ferguson have?
If he knew his code was crap and the output was garbage, and he knowingly instructed Government on that basis, then there are clearly criminal wrongs here.
If he didn’t know how bad his code was, but nonetheless, instructed Government based upon it, the evil is replaced by gross negligence, but nonetheless, a criminal matter.
Or do we just let him say: “I know a lot about the science, but my code is so crap. Sorry for the inconvenience, seems that the output is garbage and they all crash…”
We could call it the Boeing Defence.
Crap code or not, it still seems to me that reality comes in within his predictions.
At least when I was in the development game, any time the requirements were found to be inadequate or needed to be revised we filed documentation bugs against the dev docs, just as we would for the code itself. Then the test docs got updated following the revisions to the dev docs, and then (using the test-driven development model) eventually the relevant sections of code would get modified to meet the new test framework.
If you do it the other way around you end up with a mess, like trying to add a room to your house by extending the roof and then adding bits from the top down.
All models need validation.
A model is an imaginary, mathematical representation of reality, or pretends to be. For the model to have any value at all it must be validated, i.e. its results must be tested against real-world results. Only when such correspondence is proven can the model have any forecasting value.
This particular model (whether correctly coded or not) has never been validated. Moreover – **it cannot be validated** – we don’t know exactly how the virus spreads, we can’t run a real-world test spread for model validation. The model has failed abysmally in predicting former epidemics (SARS, MERS, Swine flu).
So, the value of the model is NULL and ZERO. Sure – it’s the only model we have – but this does not make it correct and valid.
It is possible that the model’s predictions are correct or near to truth in the current plague – but if that is the case (I don’t think so) then it is a purely random occurrence. Any guess can be correct, but that does not make it a “scientific model”.
Those promoting this model (Ferguson and his college) are con men.
Another side argument would be that the model has never been “peer reviewed” – given that its code is illegible.
“No, you were clear enough. That is what I understood you to mean. But I think your perspective is completely wrong. Code should implement the requirements exactly as written.”
I disagree with that, but for the sake of argument suppose we say so. In that case, why are you not doing so now?
The code was originally written *without* any requirement for exact replicability. Indeed, the science imposes no such requirement – it’s only needed for a particular choice of testing strategy. As such, it meets the requirement ‘exactly as written’. So why are you now saying it’s wrong?
What you have done is to add a new ‘requirement’ that the original coder didn’t recognise and didn’t code to achieve, (because he wasn’t using that testing strategy,) and then declare it ‘wrong’ because it doesn’t meet your new requirement that didn’t exist at the time. You’re testing against alternative requirements you’ve just made up yourself, not against the requirements ‘exactly as written’ by the original owner. And you’re doing it not because there is any functional need for it, but because you want to use a particular testing method to keep the testing simple/easy.
That’s mixed up. The choice of testing strategy is a design decision, not a requirement.
Consider again the example I gave above:
The requirement is for a particular distribution, so the distribution is what you must test. Generate a million values, and then do statistical tests on the output (e.g. Chi-square) to ensure the results are within an acceptable degree of accuracy. The DIEHARD tests are a good example. There is no problem at all doing a pass-or-fail test on a non-deterministic program.
Before the change it passes the test. Make the change. After the change it still passes the test. So you know (with a known degree of confidence) the requirement is met, and the code is correct.
Now consider how you would use a fixed seed and explicit output to test it. Before the change, twelve uniformly distributed values are used per number generated. After the change there are only two (or one, depending on how the library codes it). There is *no way* the output can be the same. The test fails. Does this mean that the code is incorrect? That it doesn’t correctly implement the user’s requirement? No! Because you’re not testing an actual requirement!
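A Python sketch of the two testing styles being contrasted. It assumes the "before" generator sums twelve uniforms and the "after" generator is the standard library's `gauss`. The bin count, sample size and pass threshold are choices made for this illustration only.

```python
import random
from statistics import NormalDist

def normal_sum12(rng):
    """'Before': approximate N(0,1) as the sum of twelve uniforms minus 6."""
    return sum(rng.random() for _ in range(12)) - 6.0

def normal_library(rng):
    """'After': let the library generate the normal variate."""
    return rng.gauss(0.0, 1.0)

def chi_square(sampler, seed=None, n=10_000, bins=10):
    """Chi-square statistic against N(0,1), using equal-probability bins."""
    rng = random.Random(seed)  # seed=None means seed from system entropy
    std = NormalDist()
    counts = [0] * bins
    for _ in range(n):
        u = std.cdf(sampler(rng))  # N(0,1) values map to uniform on [0, 1]
        counts[min(int(u * bins), bins - 1)] += 1
    expected = n / bins
    return sum((c - expected) ** 2 / expected for c in counts)

CRITICAL = 27.88  # roughly the 0.1% point of chi-square with 9 degrees of freedom

# Distribution test: a pass-or-fail check that needs no fixed seed.
for sampler in (normal_sum12, normal_library):
    stat = chi_square(sampler)
    print(sampler.__name__, round(stat, 2), "PASS" if stat < CRITICAL else "FAIL")

# Exact-output test: same seed, different algorithm, different numbers.
rng_a, rng_b = random.Random(2020), random.Random(2020)
print(normal_sum12(rng_a) == normal_library(rng_b))
```

The sample size is kept modest on purpose: the twelve-uniform sum is only approximately normal, so a large enough sample would eventually show the difference. A test like this will also fail by chance about once in a thousand runs, which is the "known degree of confidence" mentioned above.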
This is a design choice. Being able to make the code deterministic is a useful property as it makes debugging and comparison of implementations much easier, but it comes at a cost, because it is significantly more complex to code and far more difficult to get right, and it constrains you or adds overhead generating new test vectors if you want to make changes that change the use of random inputs (as in my example). There’s a trade-off to be made. Are you going to spend a lot of time doing the sort of debugging where the specific instance matters? The maths on which the application is based is stochastic, so probably not. Are you going to spend a lot of time updating and modifying the code doing new experiments? Yes! So there’s a considerable cost to having to maintain strict determinism (as you say yourself, achieving it in a multi-threaded environment is hard even for trained programmers, the cost will be considerably heavier for non-specialists), and of limited benefit. And since the user will be running it many times and analysing the output to check the distribution makes sense anyway, they may well decide on a different trade-off. It will be seen as a nice-to-have if you can do it cheaply, but probably more trouble than it’s worth. So they could have decided instead to test by examining the distribution manually each time, and skip implementing strict determinism.
This is the sort of thing a good software engineer ought to understand. There is a difference between the actual requirements, the requirements as written down, and the design choices made to get the job done most cost-effectively. The methodology and coding standards to be used are a design choice. There are always trade-offs to be made – any given technique always has costs and benefits. Those costs and benefits will differ from project to project, and even between different parts of a single project. Some code will last forever, other bits of code will be constantly revised. Some bits of code are load-bearing, other bits of code are decorative sugar. Some bits of code will be solely maintained by one guy who knows it inside out, other bits of code will be worked on by teams and by different people at different times. The costs and benefits are different. The trade-offs should also be different. Like a structural engineer choosing materials for building a bridge trades strength for weight for corrosion resistance for elasticity for ductility for thermal coefficient of expansion for cost and so on, so a software engineer should have quantified the benefits of their design choices and be able to trade them off against one another in the specific context of each project. They should be able to provide the evidence for their claims. Numbers, not slogans.
A bad software engineer, on the other hand, is dogmatic, inflexible, and a devotee of an endless variety of passing fads and fashions sold by bullshit ‘consultants’ who con the management out of millions for their elixirs and panaceas. They are ‘engineers by rote‘. There’s a procedure to be followed. A checklist. A ritual. Management know that the code is bad and expensive. They hire a consultant who tells them if they follow the process (trade mark, patent pending) then it will become better and cheaper. Wonderful! And then two years later they’re back again, hiring another consultant. Another charlatan selling fake diet pills. (“I tried the lemon-and-cat-food diet and in six weeks lost 50 pounds! Only £50/pack.”) They fully accept now that the previous paradigm they used to fervently believe in was foolish nonsense, but are nevertheless convinced the new paradigm is guaranteed to work. And people get so religious about it.
There’s no quantification. There are no trade-offs. There is just one right way to do it: follow these rules. All of them. All the time. (You want to write a three line program? You still have to write a user requirements document, and a system requirements document, and an architectural design document, and an interface specification, and a software management plan, and a configuration control policy, and a quality management plan, and test data, and test logs, and any changes have to be submitted via the correct software change request forms and be considered by the change review board…) And yet, mysteriously, the code is still bad and expensive. If we’re all following these industry-standard methods now, how come?
“And the answer is twofold: first most software is written by bad programmers (since the large majority of “programmers” are really terrible), second the amount of complexity there is in software — it is a miracle it works at all.”
Quite.
“But enough of the crap code.
My angle on this is what liability does Neil Ferguson have?
If he knew his code was crap and the output was garbage, and he knowingly instructed Government on that basis, then there are clearly criminal wrongs here.”
He has no liability.
1. He made clear at the time he gave his advice that the science was uncertain, the error bounds broad, the data scant.
2. Nobody yet has shown that the output was ‘garbage’. Nobody has yet demonstrated anything specifically it got wrong. All that’s happened is that some people have made a different design choice and then pointed out the code didn’t do it that way.
3. The reasoning behind those recommendations that everyone is complaining about *was not based on that model*. (It was certainly not based on the deterministic replicability of the model!) And when people don’t even understand the science to know that much, how much are the rest of their claims really worth?
At the risk of turning bits of this thread into a mutual admiration society, I will +1 a remark of Fraser Orr’s in a comment he began by +1-ing one of mine.
+1 to this.
It is not surprising that more than one person who understands the above can be found on a blog like this. To paraphrase:
Bad politicians and bad programmers fail to plan for their own fallibility.
My first comment (May 10, 2020 at 2:00 pm) starting at the paragraph beginning ‘For clarity’ and continuing to the end of the next paragraph, explains why Sue Denim’s paper very appropriately says nothing at all about the Imperial team’s modelling. (Of course, Clovis could not read that when commenting the day before I wrote it, so maybe then did read it and agreed, but I expand here for yet more clarity in case anyone wants it.)
For example, the Imperial team predicted 40,000 deaths in Sweden for an early-May date when there were in fact under 3,000 deaths. It would not be strictly inconsistent with Sue Denim’s analysis for Professor Ferguson one day to claim that a proper implementation of his model would have predicted only a mildly-overestimated 4,000 deaths in unlocked-down Sweden, and it was accumulated bugs in the wretched software that increased the prediction by a factor of 10.
I too have grave doubts about Imperial’s epidemiological modelling, but commenting on that was not Sue Denim’s job, nor mine as regards this post – and neither of us would claim particular domain expertise if we did.
@Niall
I understood what you were saying about implementation. I know perfectly well that at a certain (fairly low) level of complexity I should turn over my stupid little programme to an expert who will shred it and produce something that works [provided I know exactly what I want and exactly what my model is].
That wasn’t my point.
True, but almost inconceivable, the reason being that despite all the stochasticity (whether deliberate or inadvertent) the key determinants in these “models” are the input parameters (estimates). There’s a whole lot of [probability] theory tied up with analysing behaviour of epidemiological (and many other) systems near criticality. The input parameters determine whether outcomes are near criticality or not, with small changes in values toggling that answer. Anybody facing real criticism will point to the sensitivity and say that they just relied on others and used (what turned out to be) the wrong estimates as input.
I’m not saying this is right, or good, just what would happen.
NiV:
Wrong. (Facebook link)
Plamus: from your link:
This is quite correct*. But the implication is that we cannot know either the number of victims OR the economic damage without enforced distancing/lockdown.
BOTH the number of deaths AND the economic damage might have been almost identical with spontaneous, instead of mandatory, distancing/lockdowns.
Which is why the entire modelling debate reminds me of how many angels can dance on the head of a pin.
* Apart from the confusion between “social distancing” and “lockdown”.
“Wrong. (Facebook link)”
Thanks.
Your link said:
Wrong. Totally and utterly wrong. The calculation of the 2 million in the US and half a million in the UK was based entirely on the infection fatality rate of ~1% they got from empirical studies of case data previously published, combined with a rough assessment of the herd immunity level, which was estimated to be 60-80%, based on estimates of R0 again derived externally from empirical estimates. The UK population is 65 million. 65 million times 0.8 times 0.01 is 520,000. That’s it.
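For anyone who wants to check the arithmetic in the paragraph above, here is a minimal sketch. The inputs are just the round figures quoted in the comment (65 million people, a 60-80% infected fraction, roughly 1% infection fatality rate); the function name is made up for illustration and nothing here comes from the Imperial code.

```python
# Back-of-envelope check of the figure quoted above.
# All inputs are the rough values stated in the comment, not model output.

def unmitigated_deaths(population, infected_fraction, fatality_rate):
    """Deaths if `infected_fraction` of the population catches the disease
    and `fatality_rate` of those infected die."""
    return population * infected_fraction * fatality_rate

uk_population = 65_000_000  # figure used in the comment
fatality_rate = 0.01        # roughly 1% infection fatality rate

for infected_fraction in (0.6, 0.8):  # quoted herd immunity range
    deaths = unmitigated_deaths(uk_population, infected_fraction, fatality_rate)
    print(f"{infected_fraction:.0%} infected -> about {round(deaths):,} deaths")
```

The upper end of the range reproduces the 520,000 in the comment. The point is only that the headline number is a three-term product and needs no simulation at all.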
These criticisms of the code are being offered by people so utterly incompetent as to not even understand that the outputs they’re complaining about didn’t even come from the model they’re criticising! If you don’t even understand the basic, obvious, explained-in-the-original-paper science, how are you going to be able to comment intelligently on the details of the code?!
It’s a bit hypocritical to demand such high standards of others, but then get caught yourself making silly mistakes like this. Muphry’s law, I suppose.
For the rest, there’s some moaning about the state of the documentation (which is fair enough, but not actually identification of an error in the model), and an application to Sweden, which was obviously carefully selected because it’s already known that it doesn’t fit the basic assumption that behaviour is changed by government policy. That was already known to be a limitation of the models. (“However, there are very large uncertainties around the transmission of this virus, the likely effectiveness of different policies and the extent to which the population spontaneously adopts risk reducing behaviours. This means it is difficult to be definitive about the likely initial duration of measures which will be required, except that it will be several months. Future decisions on when and for how long to relax policies will need to be informed by ongoing surveillance.”) The fact remains that in Sweden the public’s behaviour changed and hence the contact rate dropped massively without government restrictions being applied, and in the UK it didn’t. We still had 30,000 dead, and counting.
Niall Kilmartin writes above (summarising a point made elsewhere by Sue Denim):
Now back to around mid-March. Did Professor Ferguson, in his model runs for Report 9, make any allowance for the influence (initial condition herd immunity and partial prophylaxis) of smoking/smokeless nicotine consumption? Such an allowance of much more intrinsic herd immunity might well have pitched the model run death results closer to 3,000 than to 40,000.
If Ferguson did not and the revised model parameters then better predict the subsequent actuality, what is he guilty of? Perhaps not knowing enough at the time, just like everyone else.
This is my 7th comment on this thread; sorry about that.
Keep safe and best regards
Interesting recent post on Streetwise Professor:
https://streetwiseprofessor.com/imperial-should-have-called-winston-wolf/
relating to:
https://lockdownsceptics.org/second-analysis-of-fergusons-model/
where we will all be surprised to find that ICL have been making false statements about their random number generators.
@Nullius in Verba
What you have done is to add a new ‘requirement’ that the original coder didn’t recognise and didn’t code to achieve,
It isn’t a “requirement” it is a best practice. Imagine if I delivered software to one of my customers that was full of bugs. Would they accept the excuse “well the requirements didn’t state that the code be bug free.” Or imagine I delivered it where the GUI was full of spelling and grammar errors… “the requirements didn’t state that the text be correct English.” Or imagine I delivered their latest traffic management application program written in Z80 assembler language. “Sorry, the requirements didn’t state what language it should be in.”
These things are not requirements, they are best practices. That a program runs the same every time is best practice, because without it you can’t test it, and you can’t determine if the results are correct. Of course if your requirements don’t include “test the output to verify it is correct”, then you might be on to something, but I have found very few software users who don’t consider “correct behavior” a necessity.
That’s mixed up. The choice of testing strategy is a design decision, not a requirement.
It is indeed a design decision. It is just a very bad design decision.
I find it quite troubling that you would defend a software design that is of such poor quality, with such inability to verify the results in what has turned out to be a life critical application. Imagine someone designed a heart pacemaker like this. At least a heart pacemaker bug only kills a few people. This stupid piece of software has been proximately responsible for burning down the world. Have you any idea the layers and layers of quality checks that go into designing life critical software? I have done it, and it is not an easy path to walk. I assure you nobody goes in and fiddles with the code.
I doubt we will agree on this matter, but all due respect to you that you at least took the time to read the code.
“It isn’t a “requirement” it is a best practice.”
No, it’s a design choice. One option out of many. Whether it’s “best” in any sense depends on the circumstances.
“Imagine if I delivered software to one of my customers that was full of bugs.”
Everyone always does. The industry standard is somewhere around 5-20 defects per thousand lines of delivered code. (Source)
There are big businesses around the world employing millions of people whose livelihoods depend on the correctness of Microsoft Excel. (And Microsoft, despite its reputation, is pretty good by industry standards.) Have you ever seen Excel’s random number generator?!
“Would they accept the excuse “well the requirements didn’t state that the code be bug free.””
I thought that was what you were saying earlier? “Code should implement the requirements exactly as written.” If it wasn’t written down, it’s not a requirement. Nor would any sensible software developer ever promise to deliver bug-free code.
And as I keep on saying, not being able to use your favourite testing strategy is not a bug!
“I find it quite troubling that you would defend a software design that is of such poor quality, with such inability to verify the results in what has turned out to be a life critical application.”
First, as I’ve already explained several times, you *can* verify the results. You test the statistical distribution. It tests the actual requirement, and it’s much more robust to changes that alter the number or order of PRNG calls. It’s better.
And second, the model was never ‘life critical’. It was heavily caveated at the time it was presented, and the primary points made by the paper that led to the change in policy did not depend on the model.
The (conditional) prediction of half a million deaths comes from the estimates of the infection fatality rate and R0 which came from empirical observations of cases in the literature. That it would hit very big numbers within about 4-8 weeks was obvious from the 10-fold per week rate at which deaths were then rising. It’s simple maths that because of the 3 week time lag between infection and death and the 10-fold per week rate of rise, the numbers could increase a thousand-fold in the three weeks between you taking action and it having any effect, which meant you were already close to the edge and didn’t have much time to muck about.
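The "simple maths" in that paragraph can be written out in a few lines. This is only a sketch of the compounding argument, taking the 10-fold per week growth rate and the 3 week lag from the comment as given; the function names are hypothetical.

```python
import math

def growth_over_lag(fold_per_week, lag_weeks):
    """Factor by which deaths multiply during the lag, if growth stays exponential."""
    return fold_per_week ** lag_weeks

def doubling_time_days(fold_per_week):
    """Doubling time implied by a given weekly growth factor."""
    return 7.0 * math.log(2.0) / math.log(fold_per_week)

print(growth_over_lag(10, 3))            # 1000
print(round(doubling_time_days(10), 1))  # 2.1
```

So a 10-fold weekly rise compounds to a thousand-fold over three weeks, and corresponds to a doubling time of about two days. Whether the 10-fold figure was itself right is a separate, empirical question.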
The capacity of the NHS reported in the paper was provided by the NHS itself. That if left unchecked it would overwhelm the NHS was again obvious. And the effect that 10% of the population not being able to get hospital treatment when they needed it would have on raising mortality rates above 1% was not estimated, since there is no data on that, but was left to the imagination, which even politicians could understand the potential implications of.
Because you were already so close to the edge, and because of the time lag between action and results, you would have to drop R0 to below 1 right now to avoid disaster, and given the rapid rate of rise it was evident R0 was pretty big, and so it would take a massive drop to do so.
The model simply provided some pictures of possible/plausible scenarios to assist the intuition. It was explained numerous times that they were subject to lots of assumptions, many of which were very unlikely to be true, based on very shaky statistics about the virus because it was so new, and there was little data or research on it yet, and some pretty huge assumptions about how the public would react or what that would do to person-to-person contact rates. It was explained exhaustively that the plots were, at the current state of knowledge, no more than illustrative sketches, that would need to be refined by further observation. Readers were warned repeatedly not to put too much weight on the fine details of what were then no more than preliminary ballpark estimates. (The correctness of the model is far more relevant to the decisions about how to end lockdown than it was to start it.)
Even if you got rid of the model entirely, it wouldn’t even touch the main conclusions of the paper with regard to the lockdown decision. The decision was in fact based on the empirical observations (mainly from China) of R0 and mortality, which were at that time far shakier. Even with a perfect program, GIGO applies. If you want to criticise, you’d be far better off starting there.
“I doubt we will agree on this matter, but all due respect to you that you at least took the time to read the code.”
Thank you! I try. And to you too.
Nullius in Verba
First, as I’ve already explained several times, you *can* verify the results. You test the statistical distribution. It tests the actual requirement, and it’s much more robust to changes that alter the number or order of PRNG calls. It’s better.
This is exactly wrong. The purpose of the software is to produce this statistical distribution. How do you test if it is correct?
Think about this, if I write TurboTax and it tells me that my tax liability is $6,503.12, how do I know that number is right? It is a valid number plainly. It might even seem reasonable. But how do I know that it didn’t forget my personal deduction, or to tax my capital gains at a lower rate, or included my dividend payments? Or even if it gets my taxes right how do I know it’ll get your taxes right?
You are saying it gets a result, but how do you know the result is right? If it is because you intuit (if you’ll excuse the pun) that it is right then you might as well not even write the program, just write down your intuition.
Otherwise you are just fiddling around with the program to make it produce the result you want rather than actually simulating anything.
And perhaps even more importantly, how does anyone else know your result is right? Nobody can read your 15kloc of code and have any understanding of what you are doing. So nobody can critique your approach, except to say your code is crap.
Anyway, I don’t think anyone else is listening, and I doubt we are going to change each other’s minds.
“The purpose of the software is to produce this statistical distribution. How do you test if it is correct?”
There are lots of ways. Chi-square test, Kolmogorov-Smirnov test, Anderson-Darling test, Kuiper’s test. You can produce a histogram and test if the bars fall inside expected intervals, you can apply the inverse CDF and test if the result is uniform, you can test for independence between successive values. You can apply any measure of distance between the expected and the observed outcome, measure the statistical spread, and set bounds on how large it is allowed to be.
Have a look at the DIEHARD tests, for example, which are specifically designed for testing random number generators.
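To make one of those concrete, here is a minimal sketch of a one-sample Kolmogorov-Smirnov check written with only the Python standard library. The sample sizes, the seed and the deliberately skewed "bad" sample are all assumptions chosen for illustration; the 1.36 coefficient is the usual large-sample value for roughly the 5% level.

```python
import math
import random

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDF of `sample` and the reference `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_critical_value(n, coefficient=1.36):
    """Large-sample critical value; 1.36 corresponds to roughly the 5% level."""
    return coefficient / math.sqrt(n)

def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))

rng = random.Random(12345)  # arbitrary seed, only so the demo is repeatable
n = 5000
good = [rng.random() for _ in range(n)]      # should look uniform
bad = [rng.random() ** 2 for _ in range(n)]  # deliberately skewed towards 0

for name, sample in (("uniform", good), ("squared", bad)):
    d = ks_statistic(sample, uniform_cdf)
    verdict = "consistent" if d < ks_critical_value(n) else "rejected"
    print(f"{name}: D={d:.4f}, critical={ks_critical_value(n):.4f}, {verdict}")
```

The skewed sample is rejected by a wide margin because its true CDF differs from the uniform one by up to 0.25. A genuinely uniform sample will still be rejected about one time in twenty at this level, which is why the bound has to be chosen with the false alarm rate in mind.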
“You are saying it gets a result, but how do you know the result is right?”
How do you *ever* know a calculation is right?
For a regression test, you want to know if the result is the same. So you do it a whole bunch of times, measure the distribution, and then the test is that with the same input it should produce output values within the same distribution.
For a correctness test, you can apply consistency checks – does it satisfy the properties the calculation is meant to look for? Like when you solve an equation, you can plug the result back into the equation and see if it is satisfied. Do percentages add up to 100? Do population numbers remain constant? Do probabilities add up to 1? If you increase an input, does the output tend to increase or decrease as expected? If you’re doing physics, are energy and momentum conserved?
Or you can spot-check the details. Follow the events simulated for a single individual – do they make intuitive sense? Do they occur as often as you expect? In the order you expect? Follow the events in a single place. Look for cases of a particular sort of event. Super-spreaders, for example. Do they occur at large gatherings? Is it a particular sort of person, or a particular sort of event that drives it?
Or you can look at special cases. Build a small model, or a model with a particular property that makes it easy to analyse. If you set it up with a uniform population and random population-wide interconnections, do you get the standard simple SIR-model behaviour? If you put them on a long, thin island with only short-range connections and infect one end, does the epidemic propagate along the length of the island? If you start with everyone immune, does the epidemic immediately die out?
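The consistency checks and special cases above can be sketched on a toy model. What follows is a hypothetical, heavily simplified stochastic SIR simulation (uniform mixing, discrete time steps); it is an assumption made for illustration and bears no relation to the structure of the Imperial code. The checks hold whatever the random draws happen to be, so they need no fixed seed.

```python
import random

def simulate_sir(n, infected, immune, beta, gamma, steps, rng):
    """Toy discrete-time stochastic SIR model with uniform mixing.
    Returns the list of (S, I, R) counts after each step."""
    s, i, r = n - infected - immune, infected, immune
    history = [(s, i, r)]
    for _ in range(steps):
        # Chance that a given susceptible is infected during this step.
        p_infect = 1.0 - (1.0 - beta / n) ** i
        new_infections = sum(1 for _ in range(s) if rng.random() < p_infect)
        new_recoveries = sum(1 for _ in range(i) if rng.random() < gamma)
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

rng = random.Random()  # unseeded on purpose: the checks must hold for any draws

# Consistency check: population is conserved and no count goes negative.
run = simulate_sir(500, 5, 0, beta=0.4, gamma=0.1, steps=100, rng=rng)
assert all(s + i + r == 500 and min(s, i, r) >= 0 for s, i, r in run)

# Consistency check: susceptibles never increase, recovered never decrease.
assert all(a[0] >= b[0] and a[2] <= b[2] for a, b in zip(run, run[1:]))

# Special case: everyone else immune, so there are no new infections at all.
immune_run = simulate_sir(500, 5, 495, beta=0.4, gamma=0.1, steps=100, rng=rng)
assert all(s == 0 for s, _, _ in immune_run)
assert all(a[1] >= b[1] for a, b in zip(immune_run, immune_run[1:]))

# Special case: nobody infected at the start, so nothing ever happens.
quiet_run = simulate_sir(500, 0, 0, beta=0.4, gamma=0.1, steps=20, rng=rng)
assert all(state == (500, 0, 0) for state in quiet_run)
print("all consistency checks passed")
```

None of these checks says the model matches reality; they only catch bookkeeping errors. That is the sense in which they complement, rather than replace, the comparison against observed data.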
Or you can compare to reality. If you put in the parameters for seasonal flu, or mumps, or syphilis, or the black death in 14th century England, does it fit the observed pattern? Does the distribution of cases reported at hospitals match observed statistics from real hospitals? Do employer records, and claims submitted for sick pay match up? If you measure how many people in a particular location have antibodies to a particular disease, does it fit the distribution in the model? If the model says that people in hotels tend to travel longer distances than most, and so have a different distribution of disease, does that happen in reality? What properties lead to diseases spreading fast or slow, and do the fast/slow-spreading diseases in reality have those properties?
The tests you can do are endless. The more different tests you do, the more confidence you gain. And in general, when a tool is in constant use, and when minds are constantly poking at the output, trying to understand what’s going on, trying to work out why it’s not doing what you expected, or to answer some new question you’ve not asked before, it can act as a continual testing process. Because you’re not feeding in the inputs, getting an output, and passively accepting it as truth because “the computer says so”. You are constantly and actively asking “Why did it do that? Is this right? What’s going on?” This is what good researchers do.
“And perhaps even more importantly, how does anyone else know your result is right? Nobody can read your 15kloc of code and have any understanding of what you are doing. So nobody can critique your approach, except to say your code is crap.”
I certainly agree with and fully support the idea that code should be openly published and systematically challenged, as a way of gaining scientific credibility.
But the same thing applies to people’s thoughts and opinions. Your brain builds a model of the world, and comes to all sorts of conclusions about it – ‘beliefs’ we’ll call them. How can we know they’re correct? Nobody can read your mind and have any understanding of what you’re doing in there, so nobody can critique your approach except to say your thinking is crap.
The only way you can find and fix your own errors (and *everybody* makes them!) is to expose your thoughts to examination, and have other people try to rip them apart. As Mill put it: “Complete liberty of contradicting and disproving our opinion, is the very condition which justifies us in assuming its truth for purposes of action; and on no other terms can a being with human faculties have any rational assurance of being right.” You have to debate your beliefs with people who disagree, present your reasoning, listen to the reasoning of those whose beliefs differ, and try to figure out who has the better reasons for belief. On no other terms can a being with human faculties have any rational assurance of being right.
So even if we have no chance of changing one another’s minds, there’s still enormous value in discussing it. It tests our beliefs and our understanding. It cleans our beliefs of error, and provides better justified confidence in the surviving beliefs’ truth. You can never guarantee bug-free beliefs, but more testing, and in particular more diverse and independent testing, is the best way to approach it.
It’s been an enjoyable debate, and with you in particular. Thank you.
It’s also been an informative one – I too have enjoyed reading Fraser’s well-expressed contributions, and Tim’s and others. Perhaps I learned most from Snorri (see Snorri Godhi, May 10, 2020 at 2:50 pm and Niall Kilmartin, May 10, 2020 at 3:14 pm) but yours, Nullius, as often, have compelled (I choose that word advisedly) a deeper review of the arguments. The net effect is that someone who reads the thread can then understand the post’s link much better than just from my domain-knowledge-assuming OP.
Sadly, you yourself, Nullius, appear – as often – to have learnt nothing and forgotten nothing. Your last (so-far) comment above (Nullius in Verba, May 12, 2020 at 7:48 am), shows you ending where you began, presenting as the open-minded defender of debate who needs to encourage us all to more of this virtue, while seeming to have learnt least of all the significant participants in this thread. This not-uncommon outcome is peculiarly obvious to me here since, just as Mr Ed or bobby b doubtless notice when non-lawyers are attempting to grasp technicalities, and whether they then engage with the explanations provided or not, so on this subject I see how your long comments intermingle defences of your view with beginner’s questions about critiques of it. Like the professor, you avoid the first step of competence – that
(A certain amount we have commented, likewise.)
So it is with no great expectation of being understood, but in hopes it anyway may be of interest to any (few, no doubt) remaining readers, that I comment on your remarks about statistical testing in the above. The start point is that in computing, what you don’t know how to test you don’t know how to write. Suppose your program uses a pseudo-random generator.
1) Write tests for the non-random subunits/side-units.
2) Either by using the hidden data of the pseudo-random generator or by mocking in a ‘pseudo’ pseudo-random generator, write ordinary-style tests for the pseudo-random unit.
3) And, perhaps, – ‘and’, not ‘or’ – write multi-run statistical tests which (as well, perhaps, as being stress tests) verify (i.e. a computer test is verifying, not the human eyeball) that the output’s overall statistical distribution conforms to what the model claims it should under the input’s overall statistical distribution.
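The three steps above can be sketched as follows. Everything here is hypothetical: the two small functions stand in for a "non-random subunit" and a "pseudo-random unit", and `ScriptedRandom` is a made-up test double playing the part of the ‘pseudo’ pseudo-random generator. It is one way of doing it, under those assumptions, not the only one.

```python
import random

def infection_probability(beta, infectious, population):
    """Pure helper: chance that one susceptible is infected in a time step."""
    return 1.0 - (1.0 - beta / population) ** infectious

def count_infections(susceptible, p, rng):
    """Random unit: each susceptible is infected independently with chance p."""
    return sum(1 for _ in range(susceptible) if rng.random() < p)

class ScriptedRandom:
    """Test double that replays a fixed list of 'random' values."""
    def __init__(self, values):
        self._values = iter(values)

    def random(self):
        return next(self._values)

# 1) Ordinary test of the non-random helper.
assert infection_probability(0.5, 0, 100) == 0.0

# 2) Ordinary-style test of the random unit, with scripted draws:
#    0.10 and 0.29 fall below 0.3, while 0.50 and 0.30 do not.
scripted = ScriptedRandom([0.10, 0.50, 0.29, 0.30])
assert count_infections(4, 0.3, scripted) == 2

# 3) Multi-run statistical test with a real generator: the mean count
#    should sit within a few standard errors of susceptible * p.
rng = random.Random(2020)  # arbitrary seed, fixed only for repeatability
runs, susceptible, p = 2000, 100, 0.3
mean = sum(count_infections(susceptible, p, rng) for _ in range(runs)) / runs
standard_error = (susceptible * p * (1 - p) / runs) ** 0.5
assert abs(mean - susceptible * p) < 5 * standard_error
print("all three kinds of test passed")
```

Step 2 pins down exactly what the unit does with each draw; step 3 checks the aggregate behaviour and keeps working if the unit is later rewritten to consume its draws differently.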
– I knew a fair amount about the black death before I prepped myself (i.e. prepped my optimism by reminding myself how much worse it could be) by watching the 26 episode ‘great-courses’ lectures on the black death earlier this year.
– Industry practice knows another kind of test: the sanity test (or smoke test, as in turning on a car engine to see if smoke starts pouring out). This is a minimal test, usually done within some process, whose sole function is to tell whether things have gone so totally fruitcake that you should abort right away – continuing risks not even leaving you usable debug data.
If you think we know enough about the black death for what you propose to be more than a sanity test – one with grave dangers of tempting you to fit your data – then you need to know more about the black death. We know more about bugs that are more recent and more minor – but one of the things recent events have reminded us of is how much minor illness is not recorded even today, because it is (mostly) minor. Perhaps the 1918 Spanish flu is the sweet spot of recent-knowledge plus seriousness – but it’s still just a sanity test.
If a model can withstand no harder test than a sanity test then, as Aristotle pointed out long ago, it is not ready for numbers and should stay in the judgement and gut feel of the experienced epidemiologist (valuable, just not computationally so). It is so not ready for the prime time it has been getting. If the professor had said e.g.,
then Sue Denim would not have written, nor I nor the commenters. There would be other comments but not the ones above. As it is, what he should have said, if truthful and comprehending, is (borrowing Fraser’s eloquently accurate phrase):
Some of my remarks earlier in this comment may strike you as harsh, Nullius. Consider the possibility that, like professor Ferguson’s numbers, they greatly overestimate a nevertheless-real issue. Like Fraser, I feel done in this thread. As always, feel free to have the last word if you wish.
“Sadly, you yourself, Nullius, appear – as often – to have learnt nothing and forgotten nothing.”
On the contrary! The first and most important thing was I learnt that Ferguson had published the code! That enabled me to download the code and learn (in rough outline, so far) how his approach works. When Sue Denim complains about the multi-threading code not maintaining determinism, I now know how Ferguson attempted it, I know how he used OpenMP to parallelize certain loops, that he did set each thread to use a separate PRNG, and I can see why that wouldn’t work. I note that Sue Denim didn’t mention any of that – and evidently hadn’t looked. When people say the code is ‘evidently horrible’ knowing nothing more than that there’s 15kloc in one file and it’s written in C++, I on the contrary know exactly how readable or not it is. I know, for example, that about half that 15 kloc was just reading in the parameters – each bit is simple to understand, hard-coded checks to identify the label and when found read it into the appropriate variables, the only reason it is so long is simply that there are lots of parameters to load!
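For readers wondering how a separate PRNG per thread can still fail to give repeatable output, here is a generic sketch of one well-known way it happens. It is an assumption about the general technique, not a description of the Imperial code: the "threads" are simulated sequentially, and the schedules, seeds and function names are all invented for the example.

```python
import random

ITEMS = list(range(8))  # work items, e.g. people or places in a simulation

def run_per_thread(schedule, seeds=(101, 202)):
    """Each simulated thread owns one PRNG. `schedule` lists, per thread,
    the items it processes, in processing order."""
    results = {}
    for thread_id, items in enumerate(schedule):
        rng = random.Random(seeds[thread_id])
        for item in items:
            results[item] = rng.random()
    return [results[item] for item in ITEMS]

def run_per_item(schedule, base_seed=101):
    """Alternative: derive the stream from the item, not from the thread.
    (Simple seed arithmetic like this is only adequate for a sketch.)"""
    results = {}
    for items in schedule:
        for item in items:
            results[item] = random.Random(base_seed * 1_000_003 + item).random()
    return [results[item] for item in ITEMS]

# Two ways the same eight items might be shared out between two threads.
blocked = [[0, 1, 2, 3], [4, 5, 6, 7]]
interleaved = [[0, 2, 4, 6], [1, 3, 5, 7]]

print("per-thread PRNG repeatable across schedules:",
      run_per_thread(blocked) == run_per_thread(interleaved))
print("per-item PRNG repeatable across schedules:",
      run_per_item(blocked) == run_per_item(interleaved))
```

With one generator per thread, the value an item receives depends on which thread happens to process it and after how many other items, so a different division of work (or a different number of threads) changes the output even though every seed is fixed. Tying the stream to the item removes that dependence, at the cost of more generator set-up.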
I learned where the code was, and learned enough about it by reading it to make informed comments and informed judgements about it. I could see clearly that Sue Denim hadn’t reviewed the code at all – all he/she had done was to trawl for dirt looking through the issues log and the bug fixes! And I could see other people making statements about how ‘horrible’ the code was or how to ‘fix’ the code evidently without having read it. (Like “The lack of a timebase is probably the main reason the software in the model is screwed up.”)
As for the rest of the conversation, I’ve been having this same argument with software engineers for more than 20 years now, and nobody here came up with any better arguments than I’ve seen presented before. But it’s always useful to have the debate with a new audience, as you never know where you might find a new insight.
I’ve said that good engineering is quantified. You have to measure how much time you spend on documentation, writing unit tests, testing, and so on, and how those costs can be traded off against one another. So did anyone cite the data to back up their opinions? Did anyone acknowledge that trade-offs exist and lay out what they would be in this case?
No, all we got was dogmatic appeal to authority and re-iteration of the dogma: “That a program runs the same every time is best practice, because without it you can’t test it, and you can’t determine if the results are correct.” Despite the fact that I’d already shown that yes you can test it, and determine that the results are correct.
I’d already given the example of replacing one method of random number generation with another more efficient one, that the deterministic approach necessarily fails on – it reports false positives, it requires rewriting all your test data every time the number or order of PRNG calls changes, it imposes notoriously complex constraints that even expert coders find hard, it provides far more ways to ‘break’ the code.
Did anyone comment on that? Explain how their approach could deal efficiently with such changes? Explain how to make it easy to parallelize stochastic simulations without messing up the determinism? Explain how the tests could be rebuilt without adding lots of overhead? Explain how the trade-off of all the different costs and benefits made keeping determinism worth it? No.
But in science, you can learn even from a negative result. You give half the patients the new drug, half of them a placebo, observe the outcomes, and see that there is no statistically significant difference between them. You have learnt something, even if it was what you expected.
(You should have also learnt, if you didn’t already know, that it is perfectly possible to test if two statistical distributions are the same even without exact determinism!)
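The example of swapping one sampling method for another can be made concrete with a small sketch. Both functions below draw from the same geometric distribution (days until recovery, with a fixed daily chance), but one consumes a variable number of PRNG calls and the other exactly one. The functions, seed and sample size are assumptions invented for illustration.

```python
import math
import random
import statistics

def days_by_trials(p, rng):
    """Day of recovery, drawing one uniform per day until recovery happens."""
    day = 1
    while rng.random() >= p:
        day += 1
    return day

def days_by_inverse_cdf(p, rng):
    """Same geometric distribution from a single uniform (inverse transform)."""
    u = rng.random()
    return int(math.log(1.0 - u) / math.log(1.0 - p)) + 1

def sample(draw, p, n, seed):
    """Draw n values with a freshly seeded generator."""
    rng = random.Random(seed)
    return [draw(p, rng) for _ in range(n)]

p, n, seed = 0.2, 20000, 7  # arbitrary illustrative values
old = sample(days_by_trials, p, n, seed)
new = sample(days_by_inverse_cdf, p, n, seed)

# Exact-output comparison: the same seed no longer gives the same numbers.
print("bit-for-bit identical:", old == new)

# Distribution comparison: difference in means, in units of its standard error.
z = (statistics.mean(old) - statistics.mean(new)) / math.sqrt(
    statistics.variance(old) / n + statistics.variance(new) / n)
print("difference in means, in standard errors:", round(z, 2))
print("means:", statistics.mean(old), statistics.mean(new), "expected:", 1 / p)
```

A stored-output regression test would flag the new version as broken even though nothing is wrong with it, while the statistical comparison treats the two as equivalent. Comparing means is the crudest possible check; a fuller version would compare the whole distribution.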
It was, as I said, an enjoyable debate, and several people argued well, not just Fraser. But you’re right, the arguments presented were not good enough to make me change my mind on software engineering. I do hope I haven’t reduced any to tears!
You probably aren’t talking to me here, but for my part, after this:
I assumed that your intention in putting out such a silly strawman was to wrap up the discussion. But if not: I never heard of anyone ever writing any sort of docs for a three-line program. On the other hand, for actual software that requires a team to write, we always considered the accompanying documentation to be part of the deliverables, every bit as much as the code. And if you have ever tried to make substantial modifications to a 5 year old code base written by some other team you can’t reach, without any docs or bug db or BVTs or regression test sets or even any specs, you might be able to appreciate the value of such documentation to the future usefulness of the software.
“But if not: I never heard of anyone ever writing any sort of docs for a three-line program.”
I have! It was mandatory policy for *all* software development at the company I used to work at, and even a three-line program counted as ‘software’ for the purposes of the rules. I had several arguments with the software auditors about exactly that issue. The rules were widely ignored and not strictly enforced, but there were software engineers there who seriously argued for it. It started at around the time of ISO9001, and has been the official policy on and off ever since.
There was one time I unwisely mentioned I had about a hundred undocumented scripts I had used to generate various graphics for reports. They were generally about 20-30 lines each – you write the code, generate the picture, fiddle with it until it looks right, cut and paste, and move on with the report. My manager at the time was a software engineer, and said they needed documenting. I argued. “Nevertheless”, he said. So I wrote a quick script to fill in some templates and generate all the documentation they said they required. I then printed them all out – 100+ scripts, remember – and shoved them in the bottom of my filing cabinet.
Ten years later, I shredded them all. Nobody had ever looked at them. Nobody had ever used them. A total waste of a tree.
But if an auditor had ever come round and asked to see the documentation, I’d have been safe! And this, I concluded, was the fundamental reason for its existence. Bureaucracy is all about protecting yourself from auditors.
“On the other hand, for actual software that requires a team to write, we always considered the accompanying documentation to be part of the deliverables, every bit as much as the code.”
I agree with that. For development in large teams, distributed over time or space, a certain amount of infrastructure to enable communication and coordination is essential. Documentation is part of that. For *some* software projects, it is useful.
My objection is that methods genuinely useful in *some* circumstances get dogmatically applied in *all* circumstances, because the people making the rules don’t actually understand what they’re doing, they’re just invoking a ragbag of fashionable methods they picked up in a training course by rote. They don’t think about when and why.
“And if you have ever tried to make substantial modifications to a 5 year old code base written by some other team you can’t reach, without any docs or bug db or BVTs or regression test sets or even any specs, you might be able to appreciate the value of such documentation to the future usefulness of the software.”
I’ve spent much of my professional life doing exactly that!
And the answer is that its usefulness depends on how long you’ve got. If you have only a very short time to develop an interface or some modifications to the code, then there are some bits of documentation that can be helpful. If you’ve got a day to write it – you don’t have time to wade through the code and you need a quick summary. However, when taking over a project like that in the longer term, I generally ignore all the existing documentation and instead work my way through the source code.
There was one particular project that I took over, that had been one guy’s private analysis tool for about 8 years. (He retired. And I was the only other person who had used it regularly.) There *was* in fact some documentation for it (most of it I think written retrospectively because the software engineers had nagged about it) but I didn’t actually get round to looking at it until about three years after taking over. And frankly, the only use I ever made of it was to reference it when management asked me whether it was documented.
We did actually try to develop a properly software engineered version of it. They spent a lot of time and money capturing the requirements, documenting, writing tests, following the process. The first real job they used it on, the old guy did the technical review for. Turned out it produced the wrong numbers; it did the calculations wrong! Many of the requirements were ambiguously phrased, and the coder had implemented the wrong interpretation of what was intended. This was about two weeks before the delivery deadline. So the old man spent a week of late nights re-doing the analysis with his old software, to meet the contract deliverable. He was furious! And quite rude! The newly engineered version got canned, and we stuck with the old one.
In the 10+ years I’ve been running it, in 50,000 lines of code, we only found a total of around five actual bugs in it that affected the results. And in four of those cases they were detected immediately by the days of routine validation we do for every single analysis. (In the other case, I picked it up about two weeks after we delivered, and we had to do a revision.)
So yes, I’m very well aware of what it’s like to take over a project without documentation! And I never had any problems with it.
https://www.youtube.com/watch?v=_ZLrZc9NPVw&feature=youtu.be