Solving Captchas with AI
A couple of days ago, I half-jokingly tried to solve a captcha using an AI, because it was funny to use AI to solve something another AI model made, but the results genuinely shocked me. Using a couple of Ollama instances, you can solve captchas in a little under 5 seconds.
This project is strictly for educational understanding. It's not about creating tools for spamming websites or creating bot accounts, but rather about understanding how consumer AI models can interpret and interact with tests that were designed to thwart their less advanced counterparts. Because of this, I have no plans to release the full source code for this project. If you want to make spam accounts with AI, you can go build a tool yourself; I am not going to take responsibility for skid behavior stemming from my project.
To understand how this tool works, here are the key technologies involved:
- Ollama, which runs the models locally (several instances at once, to parallelize the work)
- LLaVA, a vision model that describes each captcha image
- A more general-purpose LLM, which reads the captcha's prompt and picks the matching images
These technologies combined offer a new approach to solving captchas.
The process involves breaking the captcha down into its individual images and describing each of them with the LLaVA image model. By distributing the workload across several Ollama instances, we can parallelize the image recognition. You don't need multiple instances, but the time savings from using several are pretty insane. After getting a description of each image, we use a more general-purpose LLM to understand the prompt of the captcha and select the images accordingly.
The first step is getting the images from the captcha to our Ollama instances. The problem is, Ollama can't just take an image file directly without downloading it and writing that download to storage, which is inefficient and prone to error. Instead, we can convert the image URLs (extracted directly from the captcha) to base64 strings.
// Fetch an image and encode it as a base64 string for the Ollama API.
const getBase64 = async (url) => {
  const imageUrlData = await fetch(url);
  const buffer = await imageUrlData.arrayBuffer();
  return Buffer.from(buffer).toString("base64");
};

let image_urls = [
  "https://simo.ng/captcha/img1.png",
  "https://simo.ng/captcha/img2.png",
  "https://simo.ng/captcha/img3.png",
];
let base64Images = await Promise.all(image_urls.map((url) => getBase64(url)));
The second step is distributing the captcha's images across the Ollama instances. This is done with a simple algorithm that assigns images to instances based on their capacity and speed. Here's a simplified version of the algorithm:
let ollama_instances = [
  { url: "xxx.xxx.xxx.xxx:1000", amount: 3 },
  { url: "xxx.xxx.xxx.xxx:8101", amount: 3 },
  { url: "xxx.xxx.xxx.xxx:8081", amount: 2 },
  { url: "xxx.xxx.xxx.xxx:80", amount: 1 },
];

// Assign images based on the capacity and speed of each AI instance.
let assignments = images.map((image) => {
  const instance = ollama_instances.find((inst) => inst.amount > 0);
  instance.amount -= 1;
  return { image, url: instance.url };
});
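Each assignment then becomes a request to that instance's Ollama API. Here's a minimal sketch of that call, assuming the llava model is pulled on every instance (the helper name and the prompt wording are mine, not the exact ones from the project):

// Ask one Ollama instance to describe a single base64-encoded image.
const describeImage = async (instanceUrl, base64Image) => {
  const res = await fetch(`http://${instanceUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llava",
      prompt: "Describe the object in this image in one sentence.",
      images: [base64Image],
      stream: false,
    }),
  });
  return (await res.json()).response;
};

// Fan the assignments out in parallel; Promise.all preserves the order.
const results = await Promise.all(
  assignments.map((a) => describeImage(a.url, a.image))
);
// Key each description by its image index, matching the output shown below.
const descriptions = Object.fromEntries(results.map((text, i) => [i, text]));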
After sending these, the respective models do their magic and return descriptions of the images. The combined responses look something like:
{
  '0': ' The image shows a brass cymbal, which is a musical instrument commonly used in percussion.',
  '1': ' The image shows a candle placed inside a clear glass holder on a windowsill.',
  '2': ' The image shows a metal object with intricate patterns, which appears to be a cover for a saucepan or a similar cooking utensil.',
  '3': ' The object in the image is a cymbal.',
  '4': ' The object in the image is a drum, specifically a cymbal.',
  '5': ' The object in the image is a brass bongo drum.',
  '6': ' The object in the image is a Tibetan singing bowl, also known as a bell.',
  '7': ' The object in the image is a lit candle.',
  '8': ' The object in the image is a record turntable.'
}
We then pipe this to a more general-purpose LLM to understand the prompt of the captcha and select the images accordingly. The prompt we send this LLM is along the lines of:
if the prompt was "Please click each image containing an object made from wax" and
the options are {"0":" The image shows... } What options are you picking?
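In code, that's just one more generation call, this time against a text model. A sketch under the assumption that a general-purpose model such as llama2 is available (the model choice and helper name are illustrative):

// Ask a general-purpose LLM which image numbers match the captcha's prompt.
const pickImages = async (instanceUrl, captchaPrompt, descriptions) => {
  const prompt =
    `if the prompt was "${captchaPrompt}" and the options are ` +
    `${JSON.stringify(descriptions)} What options are you picking? ` +
    `Answer with the numbers only.`;
  const res = await fetch(`http://${instanceUrl}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama2", prompt, stream: false }),
  });
  return (await res.json()).response;
};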
The LLM then outputs the numbers of the matching images, and we can use some client-side JavaScript to click them.
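The clicking itself might look something like this; the .captcha-tile selector is hypothetical, since the real markup depends on the captcha:

// Pull the numbers out of the LLM's answer and click the matching tiles.
const picks = (llmAnswer.match(/\d+/g) || []).map(Number);
const tiles = document.querySelectorAll(".captcha-tile");
picks.forEach((i) => tiles[i]?.click());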
It really is interesting to see what AI can do nowadays, even on consumer hardware. As these image models become easier to run and GPUs become more powerful, how will captchas adapt? Will they eventually be replaced by something else? Only time will tell.
For more insights and a deeper dive into the technologies used, see the blog post Ollama wrote on vision models.