
Mid-Winter Nights Hallucinations: Some Thoughts on Our New GenAI Category

Written by Gregg Jones | Jan 30, 2024 9:51:23 PM


AI, LLM, generative content, NLP, big data, neural processing, machine learning, GPT. In 2023, these were undeniably some of the most-repeated terms across businesses, news outlets, and the social media sphere. This alphabet soup can mean a great deal or very little at all, and, as is often the case, the internet leans into the trend.


Sites popped up everywhere (some reputable, others less so) promising cyberpunk profile pictures, curated dating advice, a quick summary of that book you swore up and down you'd read for book club, business propositions, tweets, essays, marketing releases, code. The list of capabilities is dizzying when you get right down to it. An abundance of tools, ready for you at the drop of a hat. But at what cost?

Generative AI Content Filtering

DNSFilter has recently implemented a Generative AI (GenAI) content category, and we want to take some time to discuss this new block category, as well as some security considerations around the sites that fall under it.

First, let’s define what you’ll find in the GenAI category when you toggle it on. We’ve focused primarily on free and open Generative AI tools that generate content from a prompt, with a few extra chatbots from more unusual places. “But my [insert legitimate tool]!” I can hear you say. Worry not! Generally, the paid tools you’ve integrated into your workflows, and that your IT or security department has approved, are not likely to be found in this category.
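Under the hood, the mechanism is straightforward: with the category toggled on, a DNS lookup for a categorized domain resolves to a block page instead of the real site. As a rough illustration (the domain and block-page IP below are hypothetical placeholders, not DNSFilter’s actual values), you could spot-check a domain from your network like this:

```python
import socket

# Hypothetical block-page IP, for illustration only; your filtering
# service publishes its own block-page address.
BLOCK_PAGE_IP = "203.0.113.10"

def is_blocked(domain: str) -> bool:
    """Resolve a domain and check whether it lands on the block page."""
    try:
        addresses = socket.gethostbyname_ex(domain)[2]
    except socket.gaierror:
        return False  # NXDOMAIN or resolver error: not a category block
    return BLOCK_PAGE_IP in addresses

print(is_blocked("example-genai-tool.com"))  # hypothetical domain
```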

Why focus on free Generative AI tools?

Let’s look at some basics of training an artificial intelligence to do what you want it to do. To keep it simple, we won’t touch on the various algorithms, data science, and statistical math involved in getting these things to work.

Picture an untrained AI as a freshman in college—let’s call this program “Brian.” 

Brian has gone through high school, gotten the basics of some things down, and can ballpark some concepts from “college level” questions. This is your first AI framework, coded and ready to answer a question. That question can frankly be about anything, and in this scenario we’ve made Brian passionate about writing and understanding poems, so Brian majors in English Literature.

In order to write their own poems and understand underlying themes, Brian is going to need to spend *a lot* of time in the library reading and interpreting other authors' works. Doubly so if they want to write in a specific form like iambic pentameter or haiku. And then Brian is going to try, and fail, and try, and fail, again and again, until the professor says “close” and Brian gets a little closer to being a proficient haiku writer. Rinse and repeat until, nine times out of ten, Brian can produce a haiku the professor is satisfied with.

Good AI training
Is incredibly complex
It snows on Mt Fuji
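Brian’s try-fail-adjust cycle is, at its core, the training loop: make a guess, get graded, nudge yourself toward a better answer, repeat until the grades are good enough. Here’s a deliberately tiny sketch of that loop, fitting a toy pattern with gradient descent; real models do the same dance with billions of parameters:

```python
# A toy "Brian": learn the pattern y = 2x by guess, grade, adjust.
data = [(1, 2), (2, 4), (3, 6), (4, 8)]  # the "library" Brian studies
w = 0.0                 # Brian's current belief about the pattern
learning_rate = 0.01

for epoch in range(200):                # try, fail, try again...
    for x, y in data:
        guess = w * x                   # Brian drafts an answer
        error = guess - y               # the professor grades it
        w -= learning_rate * error * x  # Brian adjusts from the feedback

print(f"learned w = {w:.3f}")  # lands near 2.0: the professor says "close"
```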

Show me the (copyrighted) data

Now that you have an idea of how it generally (sometimes) works: what’s the problem? Part of the issue is how much data Brian needs just to get vaguely close to writing a haiku. Where does that data come from? How is it sourced? Maybe (or even probably) it contains information you never meant to include, such as phone numbers, addresses, and financial information.

In most cases, free tools use data from anywhere they can get their hands on, and more often than not your prompts are being used to train the AI even further. Public data, paid third-party data, general web scraping, and even libraries of images on the internet are all common training sources. It’s truly a case of “once it’s on the internet, it’s fair game” at its purest.
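To see how low the bar is, this is roughly what “general web scraping” looks like in practice: a few lines that harvest every paragraph off a page (using the third-party requests and beautifulsoup4 packages). Anything a crawler can reach this way can end up in someone’s training corpus:

```python
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

# Fetch one page and harvest its text, the same way a crawler
# building a training corpus would, across millions of pages.
resp = requests.get("https://example.com", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

corpus = [p.get_text(strip=True) for p in soup.find_all("p")]
print(corpus)
```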

You may find yourself asking, “Isn’t that piracy?” or “Isn’t that copyright infringement?” 

Those questions, and where exactly the boundaries lie, are what lawmakers are trying to work out right now.

Another risk of open AI tools in your work environment is that you have zero visibility into where the training data came from. There is a very low chance the output will be mind-blowingly original; more likely than not it will feel like an off-brand knockoff. There’s an old trope that humanity has been telling the same seven stories over and over since the dawn of time, and AI-generated content takes this quite literally: mathematically, the AI is just finding another variation on the themes it was fed in the beginning.

The very real security risks of free Generative AI tools

Copyright risks aside, there are genuine security risks to using free Generative AI tools, or to allowing their use on your network. Remember how indiscriminately some AI engines consume content; permissions rarely enter into it. If you were to connect one to an internal database, you would have to assume that your database is now part of an external training set. That puts you at risk of leaking proprietary information, and in some malicious cases the owner of the tool may simply “run off” with the data itself.
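If users are going to paste text into these tools regardless, one defensive layer (alongside blocking) is scrubbing obvious sensitive patterns before anything leaves the network. Here’s a crude, illustrative sketch; the patterns are naive placeholders, and real data-loss-prevention tooling goes far beyond regex:

```python
import re

# Naive patterns, for illustration only; production DLP uses far
# more robust detection than a handful of regexes.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "card":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scrub(prompt: str) -> str:
    """Replace likely PII in a prompt with labeled placeholders."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[REDACTED-{label.upper()}]", prompt)
    return prompt

print(scrub("Call Dana at 555-867-5309 or email dana@example.com"))
# -> Call Dana at [REDACTED-PHONE] or email [REDACTED-EMAIL]
```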

Researchers recently discovered that even ChatGPT could be made to leak by asking the engine to repeat a single word indefinitely: after a period of time, a fault would cause it to begin emitting unrelated, and sometimes sensitive, training data. Exploiting this bug is now against OpenAI’s Terms of Service. (For more information on this fascinating bug, see the researchers’ write-up, “Scalable Extraction of Training Data from (Production) Language Models.”)

At the other end of malicious use, these tools are not bastions of security; they can be just as vulnerable to attacks and exploits as any other software. It just so happens that their free, open nature increases those odds. When it’s all right there as an interface on a page, or a Git repo away, it tends to be open season on trying to get it to break, bend, or leak.

Those risks don’t even take into consideration AIs that are malicious from inception. There have been dummy bots that propagate malware, generate phishing emails, flat-out spread misinformation, poison the well for other chatbots, participate in cryptoscams and background cryptomining, and commit identity or credential theft. The list goes on, all while imitating a “normal” chatbot experience.

Overall, we feel it is a net positive to toggle on the new GenAI content category. Covering your bases against both unintentional leaks and malicious behavior can’t be a bad thing. It’s a new category, so it will keep improving over time. The discussion around AI and generative content is also constantly evolving, a moving target for business and security professionals alike. We may as well give the best effort we can now to prevent the issues of tomorrow.

Found a tool we overlooked, or want to submit one for review? Send us a message and we’ll look into it.

That’s all from the Intelligence Desk today, thanks for reading.