Solving Captchas with AI
A couple of days ago, I half-jokingly tried to solve a captcha using an AI, because it was funny to use AI to solve something another AI model made, but the results genuinely shocked me. Using a couple of Ollama instances, you can solve captchas in a little under 5 seconds.
This project is strictly for educational understanding. It's not about creating tools for spamming websites or creating bot accounts, but rather about understanding how consumer AI models can interpret and interact with tests that were designed to thwart their less advanced counterparts. Because of this, I have no plans to release the full source code for this project. If you want to make spam accounts with AI, you can go build a tool yourself; I am not going to take responsibility for skid behavior stemming from my project.
To understand how this tool works, here are the key technologies involved:
- Ollama, which runs the models locally (several instances at once, to parallelize the work)
- LLaVA, a vision model that describes each captcha image
- A more general-purpose LLM, which reads the captcha's prompt and picks the matching images
These technologies combined offer a new approach to solving captchas.
The process involves breaking the captcha down into its individual images and describing each of them with the LLaVA image model. By distributing the workload across several Ollama instances, we can parallelize the image recognition. You don't need multiple instances, but the time savings from using several are pretty insane. After getting a description of each image, we use a more general-purpose LLM to understand the prompt of the captcha and select the images accordingly.
The first step is getting the images from the captcha to our Ollama instances. The problem is, Ollama can't just take an image file directly without downloading it and writing that download to storage, which is inefficient and prone to error. Instead, we can convert the image URLs (extracted directly from the captcha) to base64 strings.
// Fetch an image and encode it as a base64 string for the Ollama API.
const getBase64 = async (url) => {
  const imageUrlData = await fetch(url);
  const buffer = await imageUrlData.arrayBuffer();
  return Buffer.from(buffer).toString("base64");
};

let image_urls = [
  "https://simo.ng/captcha/img1.png",
  "https://simo.ng/captcha/img2.png",
  "https://simo.ng/captcha/img3.png",
];
let base64Images = await Promise.all(image_urls.map((url) => getBase64(url)));
The second step is distributing the captcha's images across the Ollama instances. This is done with a simple algorithm that assigns images to instances based on their capacity and speed. Here's a simplified version of the algorithm:
let ollama_instances = [
  { url: "xxx.xxx.xxx.xxx:1000", amount: 3 },
  { url: "xxx.xxx.xxx.xxx:8101", amount: 3 },
  { url: "xxx.xxx.xxx.xxx:8081", amount: 2 },
  { url: "xxx.xxx.xxx.xxx:80", amount: 1 },
];

// Assign images based on the capacity and speed of each AI instance.
let assignments = images.map((image) => {
  const instance = ollama_instances.find((inst) => inst.amount > 0);
  instance.amount -= 1;
  return { image, url: instance.url };
});
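Each assignment then becomes a request to that instance's Ollama API. Here's a minimal sketch of that call, assuming the llava model is pulled on every instance (the helper name and the prompt wording are mine, not the exact ones from the project):

// Ask one Ollama instance to describe a single base64-encoded image.
const describeImage = async (instanceUrl, base64Image) => {
  const res = await fetch(`http://${instanceUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llava",
      prompt: "Describe the object in this image in one sentence.",
      images: [base64Image],
      stream: false,
    }),
  });
  return (await res.json()).response;
};

// Fan the assignments out in parallel; Promise.all preserves the order.
const results = await Promise.all(
  assignments.map((a) => describeImage(a.url, a.image))
);
// Key each description by its image index, matching the output shown below.
const descriptions = Object.fromEntries(results.map((text, i) => [i, text]));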
After sending these, the respective models do their magic and return descriptions of the images. The combined responses look something like:
{
  '0': ' The image shows a brass cymbal, which is a musical instrument commonly used in percussion.',
  '1': ' The image shows a candle placed inside a clear glass holder on a windowsill.',
  '2': ' The image shows a metal object with intricate patterns, which appears to be a cover for a saucepan or a similar cooking utensil.',
  '3': ' The object in the image is a cymbal.',
  '4': ' The object in the image is a drum, specifically a cymbal.',
  '5': ' The object in the image is a brass bongo drum.',
  '6': ' The object in the image is a Tibetan singing bowl, also known as a bell.',
  '7': ' The object in the image is a lit candle.',
  '8': ' The object in the image is a record turntable.'
}
We then pipe this to a more general-purpose LLM to understand the prompt of the captcha and select the images accordingly. The prompt we send this LLM is along the lines of:
if the prompt was "Please click each image containing an object made from wax" and
the options are {"0":" The image shows... } What options are you picking?
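In code, that's just one more generation call, this time against a text model. A sketch under the assumption that a general-purpose model such as llama2 is available (the model choice and helper name are illustrative):

// Ask a general-purpose LLM which image numbers match the captcha's prompt.
const pickImages = async (instanceUrl, captchaPrompt, descriptions) => {
  const prompt =
    `if the prompt was "${captchaPrompt}" and the options are ` +
    `${JSON.stringify(descriptions)} What options are you picking? ` +
    `Answer with the numbers only.`;
  const res = await fetch(`http://${instanceUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama2", prompt, stream: false }),
  });
  return (await res.json()).response;
};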
The LLM then outputs the numbers of the matching images, and we can use some client-side JavaScript to click them.
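The clicking itself might look something like this; the .captcha-tile selector is hypothetical, since the real markup depends on the captcha:

// Pull the numbers out of the LLM's answer and click the matching tiles.
const picks = (llmAnswer.match(/\d+/g) || []).map(Number);
const tiles = document.querySelectorAll(".captcha-tile");
picks.forEach((i) => tiles[i]?.click());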
It really is interesting to see what AI can do nowadays, even on consumer hardware. As these image models become easier to run and GPUs become more powerful, how will captchas adapt? Will they eventually be replaced by something else? Only time will tell.
For more insights and a deeper dive into the technologies used, see the blog post Ollama wrote on vision models.