How can I download PDFs using an AI WebCrawler?

An AI-assisted web crawler can be a powerful tool for automating data collection, but it’s not always straightforward, especially when JavaScript-heavy websites are involved. Recently, I tried using Crawl4AI to download PDFs from a dataset on data.humdata.org. While the tool is designed to handle JavaScript rendering, I ran into an error that left me scratching my head:
Error: Wait condition failed: 'int' object has no attribute 'strip'.

In this post, I’ll walk through the problem, explain why it happened, and share the solution that finally worked. If you’re struggling with similar issues, read on!

The Goal

The website in question hosts a repository of PDF files, and I wanted to download all of them programmatically. The site’s robots.txt allows scraping (except specific disallowed paths), so this was a legitimate use case.
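
If you want to verify this yourself, Python’s standard-library urllib.robotparser can check a path against robots.txt and read any Crawl-Delay directive. A quick sketch (the generic * user agent is just an example):

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt.
rp = RobotFileParser("https://data.humdata.org/robots.txt")
rp.read()

# Check whether a generic crawler may fetch the dataset page.
print(rp.can_fetch("*", "https://data.humdata.org/dataset/repository-for-pdf-files"))
print(rp.crawl_delay("*"))  # the site's Crawl-Delay, if one is declared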

The Initial Approach

Crawl4AI’s documentation provides a template for downloading files with JavaScript automation. Here’s the code I adapted:

from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai import AsyncWebCrawler
import os, asyncio
from pathlib import Path

async def download_multiple_files(url: str, download_path: str):
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    async with AsyncWebCrawler(config=config) as crawler:
        run_config = CrawlerRunConfig(
            js_code="""
                const downloadLinks = document.querySelectorAll(
                  "a[title='Download']",
                  document,
                  null,
                  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
                  null
                );
                for (const link of downloadLinks) {
                  link.click();              
                }
            """,
            wait_for=10  # Wait for downloads to start
        )
        result = await crawler.arun(url=url, config=run_config)

        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")

# Usage
download_path = os.path.join(Path.cwd(), "Downloads")
os.makedirs(download_path, exist_ok=True)
asyncio.run(download_multiple_files("https://data.humdata.org/dataset/repository-for-pdf-files", download_path))

The Error

The code failed with:

 × Unexpected error in _crawl_web at line 1551...  
   Error: Wait condition failed: 'int' object has no attribute 'strip'

The traceback pointed to an issue with the wait_for parameter in CrawlerRunConfig.
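
The message itself hints at the cause: somewhere internally the library assumes wait_for is a string and calls .strip() on it (my inference from the traceback, not a documented detail). Two lines reproduce the same type error:

wait_for = 10     # an int, as passed to CrawlerRunConfig
wait_for.strip()  # AttributeError: 'int' object has no attribute 'strip'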

Diagnosing the Problem

  1. JavaScript Code Issues:
    • The original code passed document.querySelectorAll the arguments meant for document.evaluate() (e.g., XPathResult.ORDERED_NODE_SNAPSHOT_TYPE). querySelectorAll accepts only a CSS selector; JavaScript silently ignores the extra arguments, so the call ran as a plain CSS query rather than the intended XPath query.
    • The correct API for XPath queries is document.evaluate().
  2. Wait Condition Misconfiguration:
    • The wait_for parameter was set to the integer 10, but Crawl4AI expects a string: a CSS selector or a JavaScript condition. Passing an integer caused a type error when the library tried to process it as a string.

The Fix

Correct the JavaScript Code

Replace the flawed querySelectorAll call with a proper XPath query via document.evaluate(). (In this case a[title='Download'] happens to be a valid CSS selector too, so a plain querySelectorAll loop would also work; document.evaluate() is the right tool when you genuinely need XPath.)

const links = document.evaluate(
    "//a[@title='Download']",
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
);
for (let i = 0; i < links.snapshotLength; i++) {
    const link = links.snapshotItem(i);
    link.click();
}

Adjust the wait_for Parameter

Instead of passing wait_for=10, use a valid condition. If you simply want to wait a fixed time, add the delay inside the JavaScript code instead:

// After clicking links, wait 5 seconds
await new Promise(resolve => setTimeout(resolve, 5000));

Or set wait_for to a selector that appears after downloads start (e.g., a confirmation popup).
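
For reference, Crawl4AI’s documentation describes two string forms for wait_for: a CSS selector prefixed with css: and a JavaScript condition prefixed with js: (a bare string is treated as a CSS selector, which is why the final code below works). A minimal sketch; the selectors are illustrative:

from crawl4ai.async_configs import CrawlerRunConfig

# Wait until at least one download link exists in the DOM.
run_config = CrawlerRunConfig(wait_for="css:a[title='Download']")

# Or wait on an arbitrary JavaScript condition.
run_config = CrawlerRunConfig(
    wait_for="js:() => document.querySelectorAll(\"a[title='Download']\").length > 0"
)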

Final Code:

run_config = CrawlerRunConfig(
    js_code="""
        const links = document.evaluate(
            "//a[@title='Download']",
            document,
            null,
            XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
            null
        );
        for (let i = 0; i < links.snapshotLength; i++) {
            const link = links.snapshotItem(i);
            link.click();
        }
        // Optional: Add a delay to ensure downloads start
        await new Promise(resolve => setTimeout(resolve, 5000));
    """,
    wait_for="a[title='Download']"  # Wait for download links to render
)

Additional Tips

  1. Respect robots.txt:
    The site specifies Crawl-Delay: 10, so leave at least ten seconds between requests to avoid overloading the server (see the sketch after this list).
  2. Handle Dynamic Content:
    Some websites load content lazily. Use wait_for to wait for the specific elements you need to appear.
  3. Debugging:
    Enable Crawl4AI’s logging or wrap arun() calls in try/except blocks to capture errors (also shown below).
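
Putting tips 1 and 3 together, here’s a sketch of how I’d space out requests across several pages and capture per-page failures; the URL list is a placeholder, and the ten-second sleep mirrors the site’s Crawl-Delay:

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def crawl_politely(urls, download_path):
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    run_config = CrawlerRunConfig(wait_for="a[title='Download']")
    async with AsyncWebCrawler(config=config) as crawler:
        for i, url in enumerate(urls):
            if i > 0:
                await asyncio.sleep(10)  # honor the site's Crawl-Delay: 10
            try:
                result = await crawler.arun(url=url, config=run_config)
                files = result.downloaded_files or []
                print(f"{url}: {len(files)} file(s) downloaded")
            except Exception as exc:
                print(f"Failed on {url}: {exc}")  # log the failure and keep going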

Final Thoughts

Crawl4AI is a robust tool for scraping JavaScript-heavy sites, but it requires careful configuration. The key takeaways:

  • Use XPath correctly: Avoid mixing CSS and XPath syntax.
  • Validate parameters: Make sure wait_for receives the type it expects (a string, not an integer).
  • Test incrementally: Start with a single download before scaling up.

With these adjustments, I successfully downloaded all PDFs from the dataset.
