Cloudflare offers easier way to stop AI bots – Computerworld

Cloudflare’s content delivery network makes it easy for customers who are tired of malicious bots to block them from their website.

It’s long been possible to prevent well-behaved bots from crawling your business website by adding a “robots.txt” file that lists who is welcome and who isn’t. Content delivery networks like Cloudflare offer visual interfaces to make creating such files simple.
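For example, a minimal robots.txt that welcomes ordinary crawlers but turns away specific AI bots might look like this (the user agent names are those of the AI crawlers discussed in this article; the rules themselves are illustrative):

```
# Disallow specific AI crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# All other crawlers may index the public site
User-agent: *
Allow: /
```

Crucially, robots.txt is a request, not an enforcement mechanism: only well-behaved bots honor it.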

But with a new generation of ill-behaved AI bots scraping content to feed their large language models, Cloudflare has introduced an even faster way to block them all with a single click.

“The popularity of generative AI has skyrocketed demand for content used to train models or run inferences, and while some AI companies clearly identify their web scraping bots, not all AI companies are transparent,” Cloudflare staff wrote in a blog post.

According to the authors of the article, “Google reportedly pays $60 million a year to license Reddit’s user-generated content, Scarlett Johansson claimed that OpenAI used her voice for its new personal assistant without her consent, and most recently, Perplexity was accused of impersonating legitimate visitors to scrape content from websites. The value of original content en masse has never been higher.”

Last year, Cloudflare introduced a way for each of its customers, regardless of their plan, to block specific categories of bots, including certain AI crawlers. These bots, Cloudflare said, respect the directives in sites’ robots.txt files and do not use unlicensed content to train their models or collect data to power retrieval-augmented generation (RAG) applications.

Cloudflare does this by identifying bots by their “user agent string” — a sort of business card presented by browsers, robots, and other tools that request data from a web server.
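As a minimal sketch of what user-agent-based blocking looks like (this is illustrative only, not Cloudflare's implementation), a server-side filter might match incoming user agent strings against a blocklist of the AI crawlers named in this article:

```python
# Illustrative sketch of user-agent-based bot blocking -- not
# Cloudflare's implementation. The bot names are the AI crawlers
# mentioned in this article.
BLOCKED_AI_BOTS = ["Bytespider", "Amazonbot", "ClaudeBot", "GPTBot"]

def is_blocked(user_agent: str) -> bool:
    """Return True if the request's user agent matches a blocked AI bot."""
    ua = user_agent.lower()
    return any(bot.lower() in ua for bot in BLOCKED_AI_BOTS)

# A crawler typically announces itself in its user agent string:
print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"))  # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"))  # False
```

The obvious weakness, discussed below, is that this only works when the bot presents its real identity.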

“Even if these AI bots play by the rules, Cloudflare customers overwhelmingly choose to block them. We hear loud and clear that customers do not want AI bots visiting their websites, especially ones that do so dishonestly,” the post reads.

According to Cloudflare, the top four AI web crawlers visiting protected sites are Bytespider, Amazonbot, ClaudeBot, and GPTBot. Bytespider, the most frequent visitor, is operated by ByteDance, the Chinese company that owns TikTok. It visited 40.4% of protected websites and is reportedly used to collect training data for ByteDance’s large language models (LLMs), including those that power its ChatGPT rival Doubao. Amazonbot is reportedly used to index content to help Amazon’s Alexa assistant answer questions, while ClaudeBot collects data for Anthropic’s AI assistant Claude.

Blocking malicious bots

Blocking bots based on their user agent string will only work if those bots are telling the truth about their identity – but there are signs that this is not the case for all of them, or not all of the time.

In such cases, other measures will be necessary – and companies’ primary recourse against unwanted web scraping is usually reactive: taking legal action, according to Thomas Randall, director of AI market research at Info-Tech Research Group.

“While there are some software applications for web scraping prevention (like DataDome and Cloudflare), they can only go so far: if an AI bot rarely scrapes a site, it can still go unnoticed,” he said via email.

To justify legal action against malicious bot operators, companies will have to do more than claim that the bot didn’t leave when asked.

According to Randall, the best solution is for companies to “hide their intellectual property or other valuable information behind a paywall. Any scraping done behind that paywall is subject to legal action, backed up by a clear and restrictive copyright license on the site. So the organization must be prepared to take legal action. Any scraping done on the public site is accepted as part of the organization’s risk tolerance.”

Randall noted that if organizations have the resources to go further, they might consider rate-limiting connections to their site, temporarily and automatically blocking suspicious IP addresses, limiting information about why access was blocked to a message such as “For assistance, contact support at [email protected]” to force human interaction, and double-checking how much of their website is available on their mobile site and apps.

“Ultimately, the scraping cannot be stopped, but at best hindered,” he said.