TECH NEWS – DarkBERT is almost exactly the same as any other “friendly neighbourhood AI”, but it shouldn’t be trusted with nuclear launch codes.
If you’re worried that the current iteration of generative AIs is too kind and empathetic, DarkBERT is for you. This new language model is trained on the worst part of the internet, the Dark Web.
Perhaps the funniest-named model yet, DarkBERT is a language model trained on Dark Web data so that it can be compared with its traditional counterparts. The team behind it, which reports its findings in a preprint that is still awaiting peer review, wanted to understand whether using the Dark Web as a dataset would give the AI a better grasp of the language used there, and thus make it more valuable to those who comb the Dark Web for research or for cybercrime law enforcement.
The project has also thoroughly combed a place that most people don’t really want to go and indexed its different domains, for which the DarkBERT team certainly deserves thanks.
The Dark Web is an area of the internet that Google and other search engines do not index, so the vast majority of people never visit it. It is only accessible through special software such as Tor. As such, it has gained quite a reputation for what happens there. Urban legends tell of torture chambers, assassins and all sorts of horrific crimes. But the truth is that most of what is found there is just scams and other ways of stealing data, without the browser security we take for granted. Even so, the Dark Web is allegedly used by cybercrime networks to chat anonymously. It is, therefore, a critical target for law enforcement agencies.
A South Korean team crawled the Dark Web through Tor and fed the collected raw data into a language model, creating a model that can better interpret the language used there. Once it was complete, the researchers compared its performance with that of existing models, including RoBERTa and BERT.
The results presented in the preprint showed that DarkBERT outperformed the others on all datasets, though only by a narrow margin.
Since all the models were derived from similar frameworks, they were expected to perform similarly, but DarkBERT excelled specifically on Dark Web content.
What will DarkBERT be used for? Hopefully, it won’t get the nuclear launch codes. But the team expects it to be an effective tool for scanning the Dark Web for cybersecurity threats, as well as for monitoring forums to identify illicit activity. Let’s hope this doesn’t give OpenAI any ideas.