Website archival / crawling

Tools

Sorted by potential as candidates for scraping with authentication:

Katana
- Golang, > 11k GH ⭐
- ✅ Can crawl sites that require authentication, not only basic auth
MechanicalSoup
- Python
- ✅ Can crawl sites that require authentication, not only basic auth, through submitting forms
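Form-based login usually means reading the login form's fields (including hidden ones such as CSRF tokens), filling in the credentials, and posting everything back. A minimal stdlib-only sketch of that step follows. It is not MechanicalSoup's API; the names `FormFieldParser`, `build_login_payload`, and `LOGIN_PAGE`, and the field names in the sample HTML, are assumptions made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collect name/value pairs from every <input> element on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if attr.get("name"):
            # Inputs without a value attribute are stored as empty strings.
            self.fields[attr["name"]] = attr.get("value") or ""


def build_login_payload(login_html, credentials):
    """Merge the form's existing fields with the given credentials."""
    parser = FormFieldParser()
    parser.feed(login_html)
    payload = dict(parser.fields)
    payload.update(credentials)
    return urlencode(payload)


# Hypothetical login page; real field names differ from site to site.
LOGIN_PAGE = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

print(build_login_payload(LOGIN_PAGE, {"username": "alice", "password": "secret"}))
```

The resulting string would be sent as the body of a POST request to the form's action URL, with cookies kept between requests so the session survives the rest of the crawl.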
heritrix3
- Java
- ✅ Can crawl sites that require authentication, not only basic auth
  - Authentication and Cookies
Apache Nutch
- Java
- ✅ Can crawl sites that require authentication, not only basic auth
  - httpclient-auth.xml.template
- ❌ "Overly complicated for simple or small-scale crawling tasks, whereas lighter, more straightforward tools could be more effective"
gospider
- ❌ Doesn't handle authentication
spidy
- ❌ Doesn't handle authentication
scrapy
- ❌ Built-in auth middleware (HttpAuthMiddleware) only handles HTTP basic auth
crawlee
- 🤷 Authentication: Scraping auth data using JSDOM
node-crawler
- ❌ Doesn't handle authentication
Selenium
- ❌ Heavy, drives a full browser (optionally headless)
- Doesn't specialize in crawling
- ✅ Can crawl sites that require authentication, not only basic auth
Webmagic
- Java
- 🤷 Authentication?
Nokogiri
- Ruby
- 🤷 Authentication?
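
The list above separates tools that only do basic auth from tools that can handle other login flows. As a rough illustration of the difference, the sketch below uses only the Python standard library: basic auth is a header sent with every request, while most other flows depend on a session cookie kept across requests. The helper names `basic_auth_header` and `session_opener` are assumptions for this example and are not taken from any of the tools listed.

```python
import base64
import http.cookiejar
import urllib.request


def basic_auth_header(user, password):
    """Return the Authorization header value for HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def session_opener():
    """Return an opener that stores cookies, plus the jar it writes to.

    A crawler that logs in through a form needs this kind of state:
    the server sets a session cookie after login, and every later
    request has to send it back.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz
```

A crawler with only the first capability cannot get past a login form, which is why the entries above call out "not only basic auth" explicitly.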