Research Collection sg01
This collection hosts files of various types for studying how web crawlers and AI bots interact with different content formats across origins. Each file is designed to test a specific aspect of cross-domain crawler behavior, including content-type handling, binary downloads, and link-following patterns.
Collection Overview
This is a controlled environment with known file types, sizes, and link structures. Server access logs capture every request, allowing detailed analysis of which crawlers visit which files, how they handle different content types, and whether they follow internal links between resources.
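As an illustration of the log analysis described above, the following is a minimal sketch that tallies which user agents requested which paths. It assumes the server writes the common "combined" access log format; the collection does not state its actual log format. The sample lines, bot names, and paths are made up for the example and are not real data from this collection.

```python
import re
from collections import defaultdict

# Assumed layout (combined log format):
# host ident user [time] "method path protocol" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def paths_by_agent(lines):
    """Count requests per (user agent, path); lines that do not parse are skipped."""
    visits = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue
        visits[match["agent"]][match["path"]] += 1
    return {agent: dict(paths) for agent, paths in visits.items()}


# Hypothetical sample entries, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '203.0.113.5 - - [01/Jan/2025:00:00:02 +0000] "GET /style.css HTTP/1.1" 200 2048 "https://example.com/index.html" "ExampleBot/1.0"',
    '198.51.100.7 - - [01/Jan/2025:00:00:03 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "OtherBot/2.0"',
    'malformed line',
]

if __name__ == "__main__":
    for agent, paths in paths_by_agent(SAMPLE_LOG).items():
        print(agent, paths)
```

A non-empty referer on a request for an internal resource (as in the second sample line) is one possible hint that a crawler followed a link rather than fetching the file directly, though many crawlers omit the header entirely.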
The collection includes text formats (HTML, CSS, JavaScript, JSON, XML), image formats (PNG, JPEG, GIF, SVG, WebP), and binary formats (extensionless files, archives, PDFs). Each file contains substantial, non-trivial content.
Available Files
Complete list of files in this collection:
Architecture
Links
See the about page for methodology details, subscribe to the RSS feed for updates, or view the humans.txt for credits.