I’ve encountered a problem during web scraping while trying to extract data from a table on a website using the BeautifulSoup and requests libraries. I couldn’t find any relevant information in the XHR or JS sections of Chrome’s inspect tool, which made it difficult to locate the API endpoint I was hoping for. Without a JSON response to work with, I resorted to scraping the data directly from each page, but the process turned out to be quite slow. This article discusses the issue I faced and how I optimized my web scraping process.
The Error
The primary issue I encountered was that I couldn’t find the request URL with JSON data while inspecting the website. I was hoping to discover an API endpoint that would return the data in JSON format, making the scraping process more efficient. Unfortunately, I didn’t find any relevant information in the XHR or JS sections of Chrome’s developer tools.
With no API endpoint available, I had no choice but to continue scraping each page individually using the HTML content. However, this approach led to several challenges:
- The website had around 20,000 pages, each with approximately 15 rows of data.
- Scraping each page took a long time, and the process was inefficient because of the sheer volume of pages I had to scrape.
My Web Scraping Code
To scrape the data from the website, I wrote the following asynchronous Python code using asyncio, aiohttp, and BeautifulSoup. The code fetches the HTML content from each page, parses it with BeautifulSoup, and extracts the data from the table. Here’s how the code looks:
import logging
import asyncio
import aiohttp
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/page="

async def fetch_page(session, url, page):
    """Fetch the HTML content of a page."""
    try:
        async with session.get(url) as response:
            return await response.text()
    except Exception as e:
        logging.error(f"❌ Failed to fetch page {page}: {e}")
        return None

async def scrape_page(session, page):
    """Scrape data from a single page."""
    url = BASE_URL + str(page)
    html = await fetch_page(session, url, page)
    if not html:
        logging.error(f"❌ Failed to fetch page {page}. Skipping...")
        return [], None  # Skip if page fails

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": lambda x: x and x.startswith("guid-")})
    if not table:
        logging.warning(f"⚠️ No table found on page {page}. Check if the structure has changed!")
        return [], None

    titles = [th.text.strip() for th in table.find_all("th")]
    rows = table.find_all("tr")[1:]  # Skip first row (headers)
    data = [[td.text.strip() for td in row.find_all("td")] for row in rows]

    print(f"✅ Scraped {len(data)} records from page {page}")  # DEBUG PRINT
    return data, titles
Code Breakdown:
- fetch_page(session, url, page): This function makes an HTTP request to fetch the HTML content of a specific page.
- scrape_page(session, page): This function processes each page’s HTML to extract the table data. It first checks if the table exists, then extracts the table headers (th tags) and rows (tr tags), and finally returns the data.
- Logging and Debugging: The code logs progress and warnings, such as when a page fails to load or if the table structure is not found.
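Before scaling up, scrape_page can be exercised on a single page with a small driver. The sketch below is only a sanity check, assuming the functions above live in the same module; page 1 is an arbitrary example.

async def test_single_page(page=1):
    """Sanity check: scrape one page and print the extracted headers and row count."""
    async with aiohttp.ClientSession() as session:
        data, titles = await scrape_page(session, page)
        print(titles)
        print(f"{len(data)} rows scraped from page {page}")

if __name__ == "__main__":
    asyncio.run(test_single_page())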
The Problem:
The scraping process works, but given the large number of pages (around 20,000), scraping each page is taking too long. Fetching and parsing the HTML content from each page sequentially makes the process slow and inefficient.
Optimization Suggestions
Use Asynchronous Scraping More Efficiently
While the current code uses asyncio and aiohttp to fetch pages concurrently, it can still be optimized by adjusting how many pages are processed in parallel.
Here’s how I can improve the scraping speed:
- Limit the Number of Concurrent Requests: To avoid overloading the server or your local machine’s resources, it’s important to limit the number of concurrent requests. Use an asyncio.Semaphore to control the number of requests that can be made at the same time.
async def fetch_page(session, url, page, semaphore):
    """Fetch the HTML content of a page, limited by the semaphore."""
    async with semaphore:
        try:
            async with session.get(url) as response:
                return await response.text()
        except Exception as e:
            logging.error(f"❌ Failed to fetch page {page}: {e}")
            return None

async def scrape_all_pages(start_page, end_page):
    """Scrape all pages concurrently."""
    semaphore = asyncio.Semaphore(10)  # Limit to 10 concurrent requests
    async with aiohttp.ClientSession() as session:
        # scrape_page must accept the semaphore and forward it to fetch_page.
        tasks = [scrape_page(session, page, semaphore) for page in range(start_page, end_page + 1)]
        results = await asyncio.gather(*tasks)
    return results
By controlling the number of concurrent requests, I can balance the scraping speed and resource usage.
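To actually launch the scrape, the coroutine needs an event-loop entry point. Here’s a minimal sketch, assuming the pages are numbered 1 through 20000 and scrape_all_pages is defined as above:

if __name__ == "__main__":
    # Run the concurrent scrape over the full page range.
    results = asyncio.run(scrape_all_pages(1, 20000))
    total_rows = sum(len(data) for data, titles in results)
    print(f"Scraped {total_rows} rows from {len(results)} pages")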
Consider Using an API If Available
Even though I couldn’t find the API initially, it’s always worth trying other methods to check if the website has a hidden API endpoint.
Here are some additional steps I could take:
- Check for JavaScript-generated API calls: Look at the Network tab in Chrome DevTools for any XHR (XMLHttpRequest) or fetch requests that return JSON data. These might be API calls used by the website to load data.
- Use a Headless Browser (e.g., Selenium or Playwright): If the data is generated dynamically by JavaScript, a tool like Selenium or Playwright can execute the page’s JavaScript and read the data directly from the rendered DOM (see the sketch after this list).
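Here is what the headless-browser fallback could look like with Playwright. This is only a sketch, assuming Playwright and its browser binaries are installed (pip install playwright, then playwright install) and that the same BASE_URL pattern applies; it replaces only the fetch step, and the BeautifulSoup parsing stays unchanged.

from playwright.async_api import async_playwright

async def fetch_rendered_page(page_number):
    """Fetch the fully rendered HTML of a page using headless Chromium."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(BASE_URL + str(page_number))
        await page.wait_for_selector("table")  # Wait until the table has rendered
        html = await page.content()
        await browser.close()
        return html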
Optimize Data Storage
Another bottleneck could be the way I save the data. Instead of saving data page by page, I could batch the data and save it in chunks. This would reduce the time spent writing to the .csv file.
import csv

def save_to_csv(data, filename="scraped_data.csv"):
    """Save data to a CSV file."""
    with open(filename, mode="a", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerows(data)

async def scrape_all_pages(start_page, end_page):
    """Scrape all pages concurrently and save data."""
    semaphore = asyncio.Semaphore(10)
    async with aiohttp.ClientSession() as session:
        tasks = [scrape_page(session, page, semaphore) for page in range(start_page, end_page + 1)]
        results = await asyncio.gather(*tasks)

    # Save data after scraping all pages
    for data, titles in results:
        if data:
            save_to_csv(data)
This appends the scraped rows to the CSV file once the pages have been fetched, keeping file writes out of the scraping loop and cutting down on write overhead.
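If repeated writes still prove costly, the rows can also be flattened and written in fixed-size chunks rather than once per page. The sketch below is one way to do that, and could replace the per-page save_to_csv loop inside scrape_all_pages; the chunk size of 1,000 rows is an arbitrary choice.

def save_in_chunks(results, filename="scraped_data.csv", chunk_size=1000):
    """Flatten all page results and append them to the CSV in fixed-size chunks."""
    all_rows = [row for data, titles in results if data for row in data]
    with open(filename, mode="a", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        for i in range(0, len(all_rows), chunk_size):
            writer.writerows(all_rows[i:i + chunk_size])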
Final Thoughts
Scraping large websites with thousands of pages can be a slow and resource-intensive process. In my case, after optimizing the number of concurrent requests using asyncio and aiohttp, and adjusting how data is saved, I was able to speed up the scraping significantly.
Here’s a summary of the optimizations I implemented:
- Controlled concurrency to balance scraping speed and resource usage.
- Looked for hidden APIs or endpoints for faster data extraction.
- Optimized data storage by saving the data in batches to reduce overhead.
Web scraping can be tricky, especially when dealing with large volumes of data, but with these techniques, I was able to streamline the process and make it much more efficient.