Research Collection sg01

This collection hosts files of several types to study how web crawlers and AI bots interact with different content formats across origins. Each file is designed to test a specific aspect of cross-domain crawler behavior, including content-type handling, binary downloads, and link-following patterns.

Collection Overview

This is a controlled environment with known file types, sizes, and link structures. Server access logs capture every request, allowing detailed analysis of which crawlers visit which files, how they handle different content types, and whether they follow internal links between resources.
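As a rough illustration of the log analysis described above, the sketch below tallies requests per crawler and file. It assumes the server writes the common "combined" access log format used by Apache and nginx; the collection's actual log format is not stated here, and the function names (parse_line, tally_requests, referred_requests) are hypothetical. The Referer check is only a weak signal of link following, since many crawlers do not send that header.

```python
import re
from collections import Counter

# Assumption: logs use the "combined" access log format (Apache/nginx style).
# The real format used for this collection is not documented in the source.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def tally_requests(lines):
    """Count requests per (user agent, path) pair, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[(entry["agent"], entry["path"])] += 1
    return counts


def referred_requests(lines, site_prefix):
    """List (user agent, path) pairs whose Referer starts with site_prefix.

    A matching Referer hints that the client arrived via an internal link,
    but its absence proves nothing: many crawlers omit the header.
    """
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["referer"].startswith(site_prefix):
            hits.append((entry["agent"], entry["path"]))
    return hits
```

To use it, pass any iterable of log lines, for example an open file object: tally_requests(open("access.log")), where access.log is a placeholder file name.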

The collection includes text formats (HTML, CSS, JavaScript, JSON, XML), image formats (PNG, JPEG, GIF, SVG, WebP), and binary formats (extensionless files, archives, PDFs). Each file contains substantial, non-trivial content.
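For analysis, requested paths can be grouped into the three families listed above. The following is a minimal sketch under stated assumptions: the extension sets are inferred from the format names in this section, the archive extensions (.zip, .tar, .gz) are guesses because the source does not name specific archive types, and the function name categorize is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Extension sets inferred from the formats named in the text.
TEXT_EXTENSIONS = {".html", ".css", ".js", ".json", ".xml"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp"}
# Assumption: the source says "archives" without naming types; these are guesses.
BINARY_EXTENSIONS = {".pdf", ".zip", ".tar", ".gz"}


def categorize(request_path):
    """Map a request path to 'text', 'image', 'binary', or 'other'."""
    # Drop any query string or fragment before inspecting the file name.
    path = PurePosixPath(urlsplit(request_path).path)
    if not path.name:
        return "other"  # e.g. "/" has no file name to classify
    suffix = path.suffix.lower()
    if suffix == "":
        return "binary"  # extensionless files are listed under binary formats
    if suffix in TEXT_EXTENSIONS:
        return "text"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in BINARY_EXTENSIONS:
        return "binary"
    return "other"
```

Classifying by extension is a simplification: the Content-Type header the server actually sends may differ, and that header is what a crawler sees.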

Available Files

Complete list of files in this collection:

Architecture

Links

See the about page for methodology details, subscribe to the RSS feed for updates, or view the humans.txt for credits.