I’m sharing my recent experience debugging a Puppeteer Timeout Error while working on a web scraping project in JavaScript. I’ll walk you through the problem I encountered, how I diagnosed it, and the steps I took to resolve it. I hope that by sharing my process, you’ll pick up some insights and practical techniques you can apply to your own projects.
Here’s the code:
const puppeteer = require('puppeteer');

async function scrapeProduct(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  const [el] = await page.$x('//*[@id="leftColumn"]/div[1]/h1');
  const txt = await el.getProperty('textContent');
  const title = await txt.jsonValue();

  const [el2] = await page.$x('//*[@id="quotes_summary_current_data"]/div[2]/div[1]/span[2]');
  const txt2 = await el2.getProperty('textContent');
  const type = await txt2.jsonValue();

  const [el3] = await page.$x('//*[@id="quotes_summary_current_data"]/div[2]/div[3]/span[2]');
  const txt3 = await el3.getProperty('textContent');
  const issuer = await txt3.jsonValue();

  const [el4] = await page.$x('//*[@id="quotes_summary_current_data"]/div[2]/div[4]/span[2]');
  const txt4 = await el4.getProperty('textContent');
  const isin = await txt4.jsonValue();

  const [el5] = await page.$x('//*[@id="quotes_summary_current_data"]/div[2]/div[5]/span[2]');
  const txt5 = await el5.getProperty('textContent');
  const bclass = await txt5.jsonValue();

  const [el6] = await page.$x('//*[@id="last_last"]');
  const txt6 = await el6.getProperty('textContent');
  const price = await txt6.jsonValue();

  const [el7] = await page.$x('//*[@id="quotes_summary_current_data"]/div[1]/div[2]/div[1]/span[2]');
  const txt7 = await el7.getProperty('textContent');
  const daily_movement = await txt7.jsonValue();

  const [el8] = await page.$x('//*[@id="quotes_summary_secondary_data"]/div/ul/li[1]/span[2]');
  const txt8 = await el8.getProperty('textContent');
  const morning_star_rating = await txt8.jsonValue();

  console.log({title, type, issuer, isin, bclass, price, daily_movement, morning_star_rating});

  browser.close();
}

scrapeProduct('https://www.investing.com/funds/allan-gray-balanced-fund-c-chart');
This is the error I receive whenever I run the code, exactly as it appears in the terminal:

TimeoutError: Navigation timeout of 30000 ms exceeded

Debugging a Puppeteer Timeout Error
This error typically happens when the page takes too long to load, usually because of a slow network or heavy dynamic content. A closely related failure occurs when the XPath selectors don’t match any element on the page, which surfaces as an unhandled promise rejection rather than a timeout. Let me explain in detail.
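Before changing anything, it can help to watch what the page is actually doing while it stalls. The snippet below is a minimal diagnostic sketch (the URL is a placeholder): it logs failed requests and the page’s own console output, which often points at the asset or script that keeps the navigation from settling.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Log any request that fails, along with the reason the browser reports.
  page.on('requestfailed', (request) => {
    console.log('Request failed:', request.url(), request.failure()?.errorText);
  });

  // Mirror the page's own console messages into the Node terminal.
  page.on('console', (msg) => console.log('Page console:', msg.text()));

  try {
    await page.goto('https://example.com'); // placeholder URL
  } catch (error) {
    console.error('Navigation error:', error.message);
  } finally {
    await browser.close();
  }
})();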
What Went Wrong?
Navigation Timeout
The default timeout of 30,000 ms (or 30 seconds) is sometimes too short, especially with pages that load a lot of dynamic content or under slow network conditions. To address this, I needed to adjust the timeout setting and ensure that the page had indeed fully loaded before proceeding.
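In isolation, that adjustment looks like the sketch below. It uses placeholder values (the 60-second figure is an illustration, not a recommendation): the limit can be raised for a single navigation through the goto options, or for every navigation on the page with setDefaultNavigationTimeout.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Option 1: raise the limit for every navigation on this page.
  page.setDefaultNavigationTimeout(60000);

  // Option 2: override it for a single call via the goto options.
  await page.goto('https://example.com', { timeout: 60000, waitUntil: 'networkidle2' });

  await browser.close();
})();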
Unhandled Promise Rejection
Sometimes, if an element isn’t found with the provided XPath, page.$x returns an empty array, so the destructured element is undefined and the very next call on it throws. Because the async function is never awaited or given a catch handler, that error surfaces as an unhandled promise rejection. I learned that wrapping my code in a try/catch block is crucial to capture these errors and handle them gracefully.
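As a minimal sketch of that idea (the function name and URL are placeholders), errors can be handled inside the async function itself, with a .catch at the call site as a second safety net:

const puppeteer = require('puppeteer');

async function scrape(url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // ... scraping work goes here ...
  } catch (error) {
    // Anything thrown in the try block is caught here instead of
    // surfacing as an unhandled promise rejection.
    console.error('Scrape failed:', error.message);
  } finally {
    await browser.close();
  }
}

// Catching at the call site covers anything the function itself misses.
scrape('https://example.com').catch(console.error);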
Direct XPath Access
Directly destructuring the result of the XPath query without checking if an element exists can lead to errors. I solved this by creating a helper function that safely retrieves text from an element selected via XPath. This approach not only streamlines the code but also adds robustness.
The Correct Code
Below is the refactored version of my code that incorporates better error handling, an increased timeout, and an additional feature that writes the scraped data to a JSON file for further practice.
const puppeteer = require('puppeteer');
const fs = require('fs');
async function scrapeProduct(url) {
  // Launch the browser in headless mode.
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  try {
    // Increase timeout to 60 seconds and wait until the network is idle.
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });

    // Helper function to get text content via XPath.
    const getTextByXPath = async (xpath) => {
      const [element] = await page.$x(xpath);
      if (!element) {
        throw new Error(`Element not found for XPath: ${xpath}`);
      }
      const property = await element.getProperty('textContent');
      return property.jsonValue();
    };

    // Retrieve the required elements.
    const title = await getTextByXPath('//*[@id="leftColumn"]/div[1]/h1');
    const type = await getTextByXPath('//*[@id="quotes_summary_current_data"]/div[2]/div[1]/span[2]');
    const issuer = await getTextByXPath('//*[@id="quotes_summary_current_data"]/div[2]/div[3]/span[2]');
    const isin = await getTextByXPath('//*[@id="quotes_summary_current_data"]/div[2]/div[4]/span[2]');
    const bclass = await getTextByXPath('//*[@id="quotes_summary_current_data"]/div[2]/div[5]/span[2]');
    const price = await getTextByXPath('//*[@id="last_last"]');
    const daily_movement = await getTextByXPath('//*[@id="quotes_summary_current_data"]/div[1]/div[2]/div[1]/span[2]');
    const morning_star_rating = await getTextByXPath('//*[@id="quotes_summary_secondary_data"]/div/ul/li[1]/span[2]');

    // Consolidate all scraped data.
    const productData = {
      title,
      type,
      issuer,
      isin,
      bclass,
      price,
      daily_movement,
      morning_star_rating
    };
    console.log(productData);

    // Practice functionality: Write the scraped data to a JSON file.
    fs.writeFileSync('productData.json', JSON.stringify(productData, null, 2));
    console.log("Product data saved to productData.json");
  } catch (error) {
    console.error("Error occurred during scraping:", error);
  } finally {
    await browser.close();
  }
}

// Run the scraping function with the target URL.
scrapeProduct('https://www.investing.com/funds/allan-gray-balanced-fund-c-chart');
Explanation of the Code
Handling Navigation Timeout
I increased the timeout from 30 seconds to 60 seconds by adjusting the goto options. The addition of waitUntil: 'networkidle2' ensures that Puppeteer waits until there are no more than two network connections for at least 500 ms, implying that the page has mostly finished loading.
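When networkidle2 still isn’t enough (some pages keep long-polling connections open indefinitely), another approach is to wait for the specific element you need before reading it. The sketch below assumes a Puppeteer version that still provides page.$x and page.waitForXPath, and the XPath and timeout values are only illustrations:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  try {
    await page.goto('https://www.investing.com/funds/allan-gray-balanced-fund-c-chart', {
      waitUntil: 'domcontentloaded',
      timeout: 60000
    });
    // Block until the title element actually exists in the DOM.
    await page.waitForXPath('//*[@id="leftColumn"]/div[1]/h1', { timeout: 30000 });
    const [el] = await page.$x('//*[@id="leftColumn"]/div[1]/h1');
    console.log(await (await el.getProperty('textContent')).jsonValue());
  } finally {
    await browser.close();
  }
})();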
Robust Element Retrieval
I introduced a helper function called getTextByXPath that not only retrieves the text content of an element located via XPath, but also checks if the element exists. If the element isn’t found, it throws a meaningful error which is then caught in the try/catch block. This approach minimizes the occurrence of unhandled promise rejections.
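A variation worth knowing (shown below as a hypothetical sketch, not part of the script above) is to return a fallback value instead of throwing, so one missing field doesn’t abort the whole scrape:

// Hypothetical variant of the helper. It takes the page as a parameter and
// assumes a Puppeteer version that still supports page.$x.
async function getTextByXPathOrDefault(page, xpath, fallback = null) {
  const [element] = await page.$x(xpath);
  if (!element) {
    return fallback; // e.g. null or 'N/A', letting the rest of the scrape continue
  }
  const property = await element.getProperty('textContent');
  return property.jsonValue();
}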
Error Handling
The use of try/catch blocks around the core code ensures that any errors encountered during the scraping process are properly logged, and the browser instance is closed gracefully in the finally block. This improves the robustness of the script.
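As an extra, optional safety net (not strictly necessary once the try/catch is in place), Node.js also lets you register a process-level handler so any rejection that still slips through gets logged instead of producing the generic unhandled-rejection warning. A minimal sketch:

// Last-resort logging for any promise rejection nothing else caught.
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled promise rejection:', reason);
  process.exitCode = 1; // mark the run as failed without killing it mid-flight
});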
Additional Practice Functionality
To further extend the functionality, I added code to write the scraped product data into a JSON file (productData.json). This not only helps with debugging but also allows you to practice working with file I/O in Node.js, making the script useful for both scraping and data storage.
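Reading the data back is the mirror image of writing it. A small sketch, assuming the file was produced by the script above:

const fs = require('fs');

// Load and parse the file written by scrapeProduct().
const saved = JSON.parse(fs.readFileSync('productData.json', 'utf8'));
console.log(saved.title, saved.isin);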
Final Thoughts
I’ve learned that robust error handling, thoughtful timeout configurations, and creating helper functions for repetitive tasks are critical when building reliable web scrapers. Debugging issues such as the Puppeteer Timeout Error can be challenging, but with systematic troubleshooting and by leveraging Node.js best practices, you can overcome them. This project was not only a great learning experience but also helped me build a more resilient scraper that can handle the intricacies of dynamic web pages.