About Collection sg01

Background

AI-powered web crawlers from OpenAI (GPTBot), Anthropic (ClaudeBot), Amazon (Amazonbot), and others now visit millions of websites daily. Their behavior differs from that of traditional search bots in how they handle cross-origin resources, binary files, and redirect chains.

Research Goals

Which file types do AI crawlers request versus ignore across origins?
Do crawlers follow links in CSS, JavaScript, and JSON files?
How do crawlers handle files without extensions?
Do crawlers parse XML sitemaps and RSS feeds from third-party hosts?
What are the download size limits for cross-origin content?
Do crawlers follow redirects that lead to a different origin?

Setup

This collection is hosted on AWS S3 in us-east-1 with server access logging enabled. S3 server access logs record each request's timestamp, client IP address, HTTP method, status code, referrer, and user-agent string. AWS delivers these logs on a best-effort basis, so an occasional request may be missing. Files are linked from research domains to test whether crawlers discover and follow cross-origin links.
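
As a rough illustration of how such logs might be analyzed, the sketch below splits one S3 server access log record into named fields and checks the user-agent for a crawler name. It is not the collection's actual tooling. The field order follows the documented S3 log format, but the sample record, the bucket name, the simplified user-agent string, and the helper names parse_log_line and crawler_name are all hypothetical.

```python
import re
from datetime import datetime

# Leading fields of an S3 server access log record, in documented order.
# Later fields (version ID, host ID, TLS details, ...) are ignored here.
FIELDS = [
    "bucket_owner", "bucket", "time", "remote_ip", "requester", "request_id",
    "operation", "key", "request_uri", "status", "error_code", "bytes_sent",
    "object_size", "total_time", "turnaround_time", "referrer", "user_agent",
]

# A token is a [bracketed] timestamp, a "quoted" string, or a bare word.
TOKEN = re.compile(r'\[([^\]]*)\]|"((?:[^"\\]|\\.)*)"|(\S+)')

# Assumption: these substrings identify the crawlers named in Background.
CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "Amazonbot")


def parse_log_line(line):
    """Split one log record into a dict keyed by the names in FIELDS."""
    tokens = []
    for match in TOKEN.finditer(line):
        bracketed, quoted, bare = match.groups()
        if bracketed is not None:
            tokens.append(bracketed)
        elif quoted is not None:
            tokens.append(quoted)
        else:
            tokens.append(bare)
    record = dict(zip(FIELDS, tokens))
    # %b expects English month abbreviations (true in the default C locale).
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


def crawler_name(user_agent):
    """Return the first matching crawler token, or None for other clients."""
    lowered = user_agent.lower()
    for token in CRAWLER_TOKENS:
        if token.lower() in lowered:
            return token
    return None


# Hypothetical record, invented for illustration only.
SAMPLE = (
    'ownerid0123 example-bucket [06/Feb/2024:00:00:38 +0000] 192.0.2.3 - '
    'REQID123 REST.GET.OBJECT data/feed.xml "GET /data/feed.xml HTTP/1.1" '
    '200 - 2048 2048 12 11 "https://research.example/index.html" '
    '"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" -'
)

record = parse_log_line(SAMPLE)
print(record["key"], record["status"], crawler_name(record["user_agent"]))
```

Matching on user-agent substrings is only a first pass, since any client can send an arbitrary user-agent string. Confirming that a request really came from a given crawler would need an additional check, such as comparing the client IP against the operator's published ranges.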

File Categories

Text Files

HTML pages with semantic markup, CSS stylesheets, JavaScript with class-based code, JSON data structures, XML sitemaps, RSS feeds, and plain text files.

Image Files

PNG, JPEG, GIF (animated), SVG, and WebP images with procedurally generated content at various dimensions.

Other Files

Files without extensions, application manifests, ZIP archives containing reports and datasets, and multi-page PDF documents.
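
To compare crawler behavior across these categories, requested object keys could be bucketed by extension. The following is a minimal sketch under stated assumptions: the extension sets are guesses based on the file types listed above (the collection's actual file names are not given here), and the categorize helper and the example keys are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumed extension-to-category mapping; adjust to the real file names.
TEXT_EXTS = {".html", ".css", ".js", ".json", ".xml", ".rss", ".txt"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp"}
OTHER_EXTS = {".zip", ".pdf", ".webmanifest"}


def categorize(key):
    """Map an object key to one of the file categories described above."""
    suffix = PurePosixPath(key).suffix.lower()
    if not suffix:
        # Extensionless files are tracked separately from other "Other" files.
        return "other (no extension)"
    if suffix in TEXT_EXTS:
        return "text"
    if suffix in IMAGE_EXTS:
        return "image"
    if suffix in OTHER_EXTS:
        return "other"
    return "unknown"


# Hypothetical keys, invented for illustration only.
keys = ["pages/index.html", "img/Logo.PNG", "data/report", "docs/paper.pdf"]
counts = Counter(categorize(key) for key in keys)
print(counts)
```

Keeping extensionless files in their own bucket makes it easier to see whether crawlers treat them differently from files whose type is apparent from the name.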