Website archival / crawling
Good comparison article: 11 best open-source web crawlers and scrapers in 2024
Tools
Sorted by how promising each candidate is for scraping with authentication:
- Katana
  - Go, > 11k GitHub ⭐
  - ✅ Can crawl sites that require authentication, not only basic auth
- MechanicalSoup
  - Python
  - ✅ Can crawl sites that require authentication, not only basic auth, by submitting forms
- heritrix3
  - Java
  - ✅ Can crawl sites that require authentication, not only basic auth
- Apache Nutch
  - Java
  - ✅ Can crawl sites that require authentication, not only basic auth
  - ❌ "Overly complicated for simple or small-scale crawling tasks, whereas lighter, more straightforward tools could be more effective"
- gospider
  - ❌ Doesn't handle authentication
- spidy
  - ❌ Doesn't handle authentication
- scrapy
  - ❌ Can only handle basic/http auth
- crawlee
  - 🤷 Authentication: Scraping auth data using JSDOM
- node-crawler
- Selenium
  - ❌ Heavy: drives a full browser (optionally headless)
  - Doesn't specialize in crawling
  - ✅ Can crawl sites that require authentication, not only basic auth
- Webmagic
  - Java
  - 🤷 Authentication?
- Nokogiri
  - Ruby
  - 🤷 Authentication?
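
Note on "not only basic auth": the list above separates HTTP basic auth (credentials sent in a header on every request) from form-based login (credentials posted once, then a session cookie is reused). A rough sketch of that difference, using only the Python standard library rather than any of the tools listed. The URLs and form field names are hypothetical, and a real login form may also need a CSRF token or other hidden fields.

```python
import base64
import urllib.parse
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar


def basic_auth_header(user: str, password: str) -> str:
    """Basic auth: the value of the Authorization header sent with every request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


def form_login_request(login_url: str, fields: dict) -> urllib.request.Request:
    """Form-based auth: a single POST of the login form fields."""
    body = urllib.parse.urlencode(fields).encode()
    return urllib.request.Request(login_url, data=body, method="POST")


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urllib.parse.urljoin(self.base_url, href))


def crawl_after_login(login_url: str, fields: dict, start_url: str) -> list:
    """Log in once, then fetch start_url with the session cookie and list its links."""
    # The cookie jar keeps whatever session cookie the login response sets.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.open(form_login_request(login_url, fields)).close()
    with opener.open(start_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    collector = LinkCollector(start_url)
    collector.feed(html)
    return collector.links
```

Hypothetical usage: `crawl_after_login("https://example.test/login", {"username": "me", "password": "secret"}, "https://example.test/members/")`. Tools marked ✅ above take care of this login-then-crawl flow themselves; tools limited to basic auth only cover the first function.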
ArchiveBox
- ArchiveBox: open-source, self-hosted web archiving
Outdated
- crawler4j
  - Last commit 2020