I’ve encountered a problem during web scraping while trying to extract data from a table on a website using the BeautifulSoup and requests libraries. I couldn’t find any relevant information in the XHR or JS sections of Chrome’s inspect tool, which made it difficult to locate the API endpoint I was hoping for. Without a JSON response to work with, I resorted to scraping the data directly from each page, but the process turned out to be quite slow. This article discusses the issue I faced and how I optimized my web scraping process.
The Error
The primary issue I encountered was that I couldn’t find the request URL with JSON data while inspecting the website. I was hoping to discover an API endpoint that would return the data in JSON format, making the scraping process more efficient. Unfortunately, I didn’t find any relevant information in the XHR or JS sections of Chrome’s developer tools.
With no API endpoint available, I had no choice but to continue scraping each page individually using the HTML content. However, this approach led to several challenges:
- The website had around 20,000 pages, each with approximately 15 rows of data.
- Scraping each page took a long time, and the process was inefficient because of the sheer volume of pages I had to scrape.
My Web Scraping Code
To scrape the data from the website, I wrote the following asynchronous Python code using asyncio, aiohttp, and BeautifulSoup. The code fetches the HTML content from each page, parses it with BeautifulSoup, and extracts the data from the table. Here’s how the code looks:
import logging
import asyncio
import aiohttp
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/page="

async def fetch_page(session, url, page):
    """Fetch the HTML content of a page."""
    try:
        async with session.get(url) as response:
            return await response.text()
    except Exception as e:
        logging.error(f"❌ Failed to fetch page {page}: {e}")
        return None

async def scrape_page(session, page):
    """Scrape data from a single page."""
    url = BASE_URL + str(page)
    html = await fetch_page(session, url, page)
    if not html:
        logging.error(f"❌ Failed to fetch page {page}. Skipping...")
        return [], None  # Skip if page fails

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": lambda x: x and x.startswith("guid-")})
    if not table:
        logging.warning(f"⚠️ No table found on page {page}. Check if the structure has changed!")
        return [], None

    titles = [th.text.strip() for th in table.find_all("th")]
    rows = table.find_all("tr")[1:]  # Skip first row (headers)
    data = [[td.text.strip() for td in row.find_all("td")] for row in rows]

    print(f"✅ Scraped {len(data)} records from page {page}")  # DEBUG PRINT
    return data, titles
Code Breakdown:
- fetch_page(session, url, page): This function makes an HTTP request to fetch the HTML content of a specific page.
- scrape_page(session, page): This function processes each page’s HTML to extract the table data. It first checks if the table exists, then extracts the table headers (th tags) and rows (tr tags), and finally returns the data.
- Logging and Debugging: The code logs progress and warnings, such as when a page fails to load or if the table structure is not found.
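Before scaling up, scrape_page can be exercised on a single page with a small driver. The sketch below is only a sanity check, assuming the functions above live in the same module; page 1 is an arbitrary example.

async def test_single_page(page=1):
    """Sanity check: scrape one page and print the extracted headers and row count."""
    async with aiohttp.ClientSession() as session:
        data, titles = await scrape_page(session, page)
        print(titles)
        print(f"{len(data)} rows scraped from page {page}")

if __name__ == "__main__":
    asyncio.run(test_single_page())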
The Problem:
The scraping process works, but given the large number of pages (around 20,000), scraping each page is taking too long. Fetching and parsing the HTML content from each page sequentially makes the process slow and inefficient.
Optimization Suggestions
Use Asynchronous Scraping More Efficiently
While the current code uses asyncio and aiohttp to fetch pages concurrently, it can still be optimized by adjusting how many pages are processed in parallel.
Here’s how I can improve the scraping speed:
- Limit the Number of Concurrent Requests: To avoid overloading the server or your local machine’s resources, it’s important to limit the number of concurrent requests. Use an asyncio.Semaphore to control the number of requests that can be made at the same time.
async def fetch_page(session, url, page, semaphore):
    """Fetch the HTML content of a page, limited by the semaphore."""
    async with semaphore:
        try:
            async with session.get(url) as response:
                return await response.text()
        except Exception as e:
            logging.error(f"❌ Failed to fetch page {page}: {e}")
            return None

async def scrape_all_pages(start_page, end_page):
    """Scrape all pages concurrently."""
    semaphore = asyncio.Semaphore(10)  # Limit to 10 concurrent requests
    async with aiohttp.ClientSession() as session:
        # scrape_page must accept the semaphore and forward it to fetch_page.
        tasks = [scrape_page(session, page, semaphore) for page in range(start_page, end_page + 1)]
        results = await asyncio.gather(*tasks)
    return results
By controlling the number of concurrent requests, I can balance the scraping speed and resource usage.
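To actually launch the scrape, the coroutine needs an event-loop entry point. Here’s a minimal sketch, assuming the pages are numbered 1 through 20000 and scrape_all_pages is defined as above:

if __name__ == "__main__":
    # Run the concurrent scrape over the full page range.
    results = asyncio.run(scrape_all_pages(1, 20000))
    total_rows = sum(len(data) for data, titles in results)
    print(f"Scraped {total_rows} rows from {len(results)} pages")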
Consider Using an API If Available
Even though I couldn’t find the API initially, it’s always worth trying other methods to check if the website has a hidden API endpoint.
Here are some additional steps I could take:
- Check for JavaScript-generated API calls: Look at the Network tab in Chrome DevTools for any XHR (XMLHttpRequest) or fetch requests that return JSON data. These might be API calls used by the website to load data.
- Use a Headless Browser (e.g., Selenium or Playwright): If the data is generated dynamically by JavaScript, a tool like Selenium or Playwright can execute the page’s JavaScript and read the data directly from the rendered DOM (see the sketch after this list).
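Here is what the headless-browser fallback could look like with Playwright. This is only a sketch, assuming Playwright and its browser binaries are installed (pip install playwright, then playwright install) and that the same BASE_URL pattern applies; it replaces only the fetch step, and the BeautifulSoup parsing stays unchanged.

from playwright.async_api import async_playwright

async def fetch_rendered_page(page_number):
    """Fetch the fully rendered HTML of a page using headless Chromium."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(BASE_URL + str(page_number))
        await page.wait_for_selector("table")  # Wait until the table has rendered
        html = await page.content()
        await browser.close()
        return html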
Optimize Data Storage
Another bottleneck could be the way I save the data. Instead of saving data page by page, I could batch the data and save it in chunks. This would reduce the time spent writing to the .csv file.
import csv

def save_to_csv(data, filename="scraped_data.csv"):
    """Save data to a CSV file."""
    with open(filename, mode="a", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerows(data)

async def scrape_all_pages(start_page, end_page):
    """Scrape all pages concurrently and save data."""
    semaphore = asyncio.Semaphore(10)
    async with aiohttp.ClientSession() as session:
        tasks = [scrape_page(session, page, semaphore) for page in range(start_page, end_page + 1)]
        results = await asyncio.gather(*tasks)

    # Save data after scraping all pages
    for data, titles in results:
        if data:
            save_to_csv(data)
This appends the scraped rows to the CSV file once the pages have been fetched, keeping file writes out of the scraping loop and cutting down on write overhead.
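If repeated writes still prove costly, the rows can also be flattened and written in fixed-size chunks rather than once per page. The sketch below is one way to do that, and could replace the per-page save_to_csv loop inside scrape_all_pages; the chunk size of 1,000 rows is an arbitrary choice.

def save_in_chunks(results, filename="scraped_data.csv", chunk_size=1000):
    """Flatten all page results and append them to the CSV in fixed-size chunks."""
    all_rows = [row for data, titles in results if data for row in data]
    with open(filename, mode="a", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        for i in range(0, len(all_rows), chunk_size):
            writer.writerows(all_rows[i:i + chunk_size])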
Final Thoughts
Scraping large websites with thousands of pages can be a slow and resource-intensive process. In my case, after optimizing the number of concurrent requests using asyncio and aiohttp, and adjusting how data is saved, I was able to speed up the scraping significantly.
Here’s a summary of the optimizations I implemented:
- Controlled concurrency to balance scraping speed and resource usage.
- Looked for hidden APIs or endpoints for faster data extraction.
- Optimized data storage by saving the data in batches to reduce overhead.
Web scraping can be tricky, especially when dealing with large volumes of data, but with these techniques, I was able to streamline the process and make it much more efficient.