About Collection sg01
Background
AI-powered web crawlers from OpenAI (GPTBot), Anthropic (ClaudeBot), Amazon (Amazonbot), and others now visit millions of websites daily. Their behavior may differ from that of traditional search bots in how they handle cross-origin resources, binary files, and redirect chains.
Research Goals
- Which file types do AI crawlers request versus ignore across origins?
- Do crawlers follow links in CSS, JavaScript, and JSON files?
- How do crawlers handle files without extensions?
- Do crawlers parse XML sitemaps and RSS feeds from third-party hosts?
- What download size limits, if any, do crawlers apply to cross-origin content?
- Do crawlers follow redirects that lead to a different origin?
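Several of these questions depend on deciding whether two URLs share an origin. A minimal sketch of that check follows, assuming the usual definition of an origin as the scheme, host, and port triple. The function names and example URLs are hypothetical and are not part of the collection's tooling.

```python
from urllib.parse import urlsplit

# Default ports, so that https://host and https://host:443 compare equal.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> tuple:
    """Return the (scheme, host, port) triple for a URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def is_cross_origin(source_url: str, target_url: str) -> bool:
    """True when a link or redirect leaves the source URL's origin."""
    return origin(source_url) != origin(target_url)
```

Applied to a redirect, the source is the URL that was requested and the target is the value of the Location header.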
Setup
This collection is hosted on AWS S3 in us-east-1 with server access logging enabled. The access logs record each request's timestamp, remote IP address, HTTP method, status code, referrer, and user-agent string; S3 delivers these logs on a best-effort basis, so a small number of requests may be missing. Files are linked from research domains to test whether crawlers discover and follow cross-origin links.
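As an illustration of how such logs might be analyzed, the sketch below parses the leading fields of one S3 server access log record and tags it with a crawler name. It assumes the space-delimited field order documented by AWS and matches crawlers by user-agent substring only. The sample record, bucket name, and function names are hypothetical, and user-agent strings containing embedded quotes would need extra handling.

```python
import re
from typing import Optional

# Leading fields of an S3 server access log record, in documented order.
# Trailing fields (version ID, host ID, TLS details, ...) are ignored.
LOG_PATTERN = re.compile(
    r'(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<remote_ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request_uri>[^"]*)" (?P<status>\S+) (?P<error_code>\S+) '
    r'(?P<bytes_sent>\S+) (?P<object_size>\S+) (?P<total_time>\S+) '
    r'(?P<turnaround>\S+) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Crawler tokens named in the Background section.
CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "Amazonbot")


def identify_crawler(user_agent: str) -> Optional[str]:
    """Return the first known crawler token found in the user-agent string."""
    ua = user_agent.lower()
    for token in CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None


def parse_record(line: str) -> Optional[dict]:
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["crawler"] = identify_crawler(record["user_agent"])
    return record


# Hypothetical record for illustration; not taken from the real logs.
SAMPLE = (
    '79a59df900b949e5 sg01-example [06/Feb/2019:00:00:38 +0000] 192.0.2.3 - '
    '3E57427F3EXAMPLE REST.GET.OBJECT data/feed.xml '
    '"GET /data/feed.xml HTTP/1.1" 200 - 2662 2662 70 10 '
    '"https://research.example/index.html" '
    '"Mozilla/5.0 (compatible; GPTBot/1.0)" -'
)
```

Because user-agent strings can be spoofed, a substring match like this identifies claimed crawlers only; confirming the source would require checking the remote IP against ranges published by each operator.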
File Categories
Text Files
HTML pages with semantic markup, CSS stylesheets, JavaScript with class-based code, JSON data structures, XML sitemaps, RSS feeds, and plain text files.
Image Files
PNG, JPEG, GIF (animated), SVG, and WebP images with procedurally generated content at various dimensions.
Other Files
Files without extensions, application manifests, ZIP archives containing reports and datasets, and multi-page PDF documents.
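To aggregate requests by the categories above, each object key can be mapped to a category from its extension. The following is a rough sketch; the extension lists are assumptions inferred from the descriptions in this section, not a definitive inventory of the collection.

```python
import posixpath

# Assumed extension-to-category mapping, inferred from the descriptions above.
TEXT_EXTENSIONS = {".html", ".css", ".js", ".json", ".xml", ".rss", ".txt"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp"}


def categorize(key: str) -> str:
    """Map an S3 object key to 'text', 'image', or 'other'."""
    extension = posixpath.splitext(key)[1].lower()
    if extension in TEXT_EXTENSIONS:
        return "text"
    if extension in IMAGE_EXTENSIONS:
        return "image"
    # Extensionless files, manifests, ZIP archives, and PDFs fall through here.
    return "other"
```

Counting requests per crawler and category then reduces to tallying `categorize(key)` over the requested object keys.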