Websites Block Tech Giants from Using Their Data to Train LLMs


  • Top websites are starting to block big tech companies from using their content to train AI models, marking a shift in how the web operates.
  • Google has launched an opt-out tool, Google-Extended, but its adoption lags behind comparable tools such as OpenAI's GPTBot.
  • Balancing content protection against visibility in AI-driven search is a major challenge for publishers.

A significant shift has been unfolding recently: top websites are starting to guard their content against tech giants like Google and OpenAI. This step, prompted by the rise of artificial intelligence (AI) technologies, changes the longstanding relationship between web publishers and search engines.

Websites protect their content

Traditionally, websites have used a simple yet powerful tool known as `robots.txt` to manage how search engines interact with their content. This arrangement allowed websites to benefit from the traffic directed by search engines. However, advanced AI models have introduced new complexities to this relationship. Companies such as OpenAI and Google have been using vast amounts of online content to train their AI systems. These AIs can now answer user queries directly, reducing the need for users to visit the original websites and disrupting the flow of traffic from search engines to those sites.
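The same `robots.txt` convention that has long governed search crawlers is what publishers now use against AI training bots. A minimal sketch of how a well-behaved crawler is expected to check it, using Python's standard-library parser (the file contents and URL path here are illustrative; `GPTBot` is the token OpenAI documents for its training crawler):

```python
# Sketch: checking a site's robots.txt rules with the standard library.
# The robots.txt body and paths below are illustrative, not from a real site.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The AI training crawler is refused; an ordinary crawler is not.
print(parser.can_fetch("GPTBot", "/news/article.html"))        # False
print(parser.can_fetch("SearchCrawler", "/news/article.html")) # True
```

Note that `robots.txt` is purely advisory: it signals the site's wishes, and compliance depends on the crawler choosing to honor it.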

In response, Google has introduced a new protocol called Google-Extended. It enables websites to block the use of their content for training AI models. Rolled out in September last year, the protocol has since been adopted by around 10% of the top 1,000 websites, including high-profile names like The New York Times and CNN.
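In practice, opting out via Google-Extended is a short addition to a site's existing `robots.txt` file. A sketch of what such an entry looks like (the comment and blanket `Disallow` are illustrative; sites can scope the rule to specific paths instead):

```
# Ask Google not to use this site's content for AI model training
User-agent: Google-Extended
Disallow: /
```

Google-Extended is a control token rather than a separate crawler: Google's existing crawlers read the directive and exclude the site's content from AI training, without affecting how the site is indexed for ordinary search.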

Comparing adoption and looking ahead

While Google-Extended represents a step toward giving websites control over their content, its adoption rate trails behind other tools such as OpenAI's GPTBot. The hesitance may stem from worry over visibility in future AI-driven search results: websites that block access risk being overlooked by AI models and missing out on inclusion in answers to relevant queries.

The scenario with The New York Times is particularly telling. Since engaging in a copyright dispute with OpenAI, the publication has taken a firm stance, using Google-Extended to block AI model training access to its content.

Google's experimental Search Generative Experience (SGE) hints at a potential shift in how information is curated and presented to users, highlighting AI-generated content over traditional search results. The decisions made by tech companies and web publishers will shape the digital ecosystem, influencing how information is accessed and consumed in the AI age.

Disclaimer. The information provided is not trading advice. Cryptopolitan.com holds no liability for any investments made based on the information provided on this page. We strongly recommend independent research and/or consultation with a qualified professional before making any investment decisions.


Randa Moses

Randa is a passionate blockchain consultant and researcher. Deeply engrossed with the transformative power of blockchain, she weaves data into fascinating true-to-life next generation businesses. Guided by a steadfast commitment to research and continual learning, she keeps herself updated with the latest trends and advancements in the marriage between blockchain and artificial intelligence spheres.
