Web scraping with an AI web crawler can be a powerful way to automate data collection, but it's not always straightforward, especially when JavaScript-heavy websites are involved. Recently, I tried using Crawl4AI to download PDFs from a dataset on data.humdata.org. While the tool is designed to handle JavaScript rendering, I ran into an error that left me scratching my head:

```
Error: Wait condition failed: 'int' object has no attribute 'strip'
```
In this post, I’ll walk through the problem, explain why it happened, and share the solution that finally worked. If you’re struggling with similar issues, read on!
The Goal
The website in question hosts a repository of PDF files, and I wanted to download all of them programmatically. The site's `robots.txt` allows scraping (except for a few specific disallowed paths), so this was a legitimate use case.
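If you want to double-check that programmatically, here is a minimal sketch using Python's built-in `urllib.robotparser`; the `*` user agent is just a placeholder for whatever agent string your crawler actually sends:

```python
# Minimal sketch: verify robots.txt rules before crawling.
# The "*" user agent is a placeholder; substitute your crawler's actual user agent.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://data.humdata.org/robots.txt")
rp.read()

target = "https://data.humdata.org/dataset/repository-for-pdf-files"
print(rp.can_fetch("*", target))   # True if the path is not disallowed
print(rp.crawl_delay("*"))         # the declared Crawl-Delay, or None
```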
The Initial Approach
Crawl4AI's documentation provides a template for downloading files using JavaScript automation. Here's the code I adapted:
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai import AsyncWebCrawler
import os, asyncio
from pathlib import Path

async def download_multiple_files(url: str, download_path: str):
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    async with AsyncWebCrawler(config=config) as crawler:
        run_config = CrawlerRunConfig(
            js_code="""
            const downloadLinks = document.querySelectorAll(
                "a[title='Download']",
                document,
                null,
                XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
                null
            );
            for (const link of downloadLinks) {
                link.click();
            }
            """,
            wait_for=10  # Wait for downloads to start
        )
        result = await crawler.arun(url=url, config=run_config)
        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")

# Usage
download_path = os.path.join(Path.cwd(), "Downloads")
os.makedirs(download_path, exist_ok=True)
asyncio.run(download_multiple_files("https://data.humdata.org/dataset/repository-for-pdf-files", download_path))
```
The Error
The code failed with:
```
× Unexpected error in _crawl_web at line 1551...
Error: Wait condition failed: 'int' object has no attribute 'strip'
```
The traceback pointed to an issue with the `wait_for` parameter in `CrawlerRunConfig`.
Diagnosing the Problem
- JavaScript Code Issues:
  - The original code used `document.querySelectorAll` with parameters meant for XPath (e.g., `XPathResult`). This is incorrect because `querySelectorAll` only accepts CSS selectors, not XPath arguments.
  - The correct approach for XPath queries is to use `document.evaluate()`.
- Wait Condition Misconfiguration:
  - The `wait_for` parameter was set to an integer (`10`), but Crawl4AI expects a string representing a JavaScript condition or CSS selector. Passing an integer caused a type error when the library tried to call string methods like `.strip()` on it (see the toy sketch after this list).
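To make the root cause concrete, here is a toy reproduction of the failure mode. The helper below is hypothetical, not Crawl4AI's actual internals; it only illustrates what happens when string handling such as `.strip()` meets an integer:

```python
# Toy sketch of the failure mode (hypothetical helper, not Crawl4AI's real code):
# the wait condition is handled as a string, so string methods are called on it.
def check_wait_condition(wait_for):
    return wait_for.strip()  # fine for str, fails for int

print(check_wait_condition("a[title='Download']"))  # returns the trimmed selector

try:
    check_wait_condition(10)
except AttributeError as err:
    print(err)  # 'int' object has no attribute 'strip' -- the same message as the traceback
```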
The Fix
Correct the JavaScript Code
Replace the flawed `querySelectorAll` logic with proper XPath execution:
```javascript
const links = document.evaluate(
    "//a[@title='Download']",
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
);
for (let i = 0; i < links.snapshotLength; i++) {
    const link = links.snapshotItem(i);
    link.click();
}
```
Adjust the wait_for Parameter
Instead of passing `wait_for=10`, use a valid condition. To wait for a fixed time, add a delay in the JavaScript code:
```javascript
// After clicking links, wait 5 seconds
await new Promise(resolve => setTimeout(resolve, 5000));
```
Or set `wait_for` to a selector that appears after downloads start (e.g., a confirmation popup).
Final Code:
```python
run_config = CrawlerRunConfig(
    js_code="""
    const links = document.evaluate(
        "//a[@title='Download']",
        document,
        null,
        XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
        null
    );
    for (let i = 0; i < links.snapshotLength; i++) {
        const link = links.snapshotItem(i);
        link.click();
    }
    // Optional: Add a delay to ensure downloads start
    await new Promise(resolve => setTimeout(resolve, 5000));
    """,
    wait_for="a[title='Download']"  # Wait for download links to render
)
```
Additional Tips
- Respect `robots.txt`: The site specifies `Crawl-Delay: 10`, so add a delay between requests to avoid overloading the server (see the sketch after this list).
- Handle Dynamic Content: Some websites load content lazily. Use `wait_for` to wait for specific elements to appear.
- Debugging: Enable Crawl4AI's logging or use `try`/`except` blocks to capture errors.
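Here is a rough sketch of what pacing could look like when crawling several pages with the same API used above; the URL list is hypothetical, and the 10-second pause mirrors the declared Crawl-Delay:

```python
# Sketch: honor Crawl-Delay: 10 by pausing between requests.
# The URL list is hypothetical; swap in the pages you actually need.
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def crawl_politely(urls, delay_seconds=10):
    async with AsyncWebCrawler(config=BrowserConfig()) as crawler:
        for url in urls:
            await crawler.arun(url=url, config=CrawlerRunConfig())
            print(f"Crawled {url}")
            await asyncio.sleep(delay_seconds)  # pause between requests per robots.txt

# asyncio.run(crawl_politely(["https://data.humdata.org/dataset/repository-for-pdf-files"]))
```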
Final Thoughts
Crawl4AI is a robust tool for scraping JavaScript-heavy sites, but it requires careful configuration. The key takeaways:
- Use XPath correctly: Avoid mixing CSS and XPath syntax.
- Validate parameters: Ensure `wait_for` receives the data type it expects (a string, not an integer).
- Test incrementally: Start with a single download before scaling up (a sketch follows below).
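As a sketch of that incremental approach, the configuration below follows the same pattern as the final code but clicks only the first download link as a dry run before looping over all of them:

```python
# Sketch: dry run that clicks only the first download link before scaling up.
from crawl4ai.async_configs import CrawlerRunConfig

single_run_config = CrawlerRunConfig(
    js_code="""
    const links = document.evaluate(
        "//a[@title='Download']", document, null,
        XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
    );
    if (links.snapshotLength > 0) {
        links.snapshotItem(0).click();  // click just the first match
    }
    await new Promise(resolve => setTimeout(resolve, 5000));
    """,
    wait_for="a[title='Download']"
)
```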
With these adjustments, I successfully downloaded all PDFs from the dataset.