I built my first Wikipedia crawler while working through Web Scraping with Python, and it crashed with an SSL: CERTIFICATE_VERIFY_FAILED error as soon as urlopen() redirected from HTTP to HTTPS. Installing certifi, creating a proper SSL context, and adding a user-agent, a depth limit, timeouts, and polite delays cured the error and made the script stable. The cleaned-up version now roams two clicks deep, logs every internal link, and skips broken pages without hammering Wikipedia. One little SSL snag ended up teaching me certificate basics, respectful crawling habits, and the value of incremental refactors.
I’m working through Web Scraping with Python and my very first crawler looked like this:
My Code
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html)
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                # We have encountered a new page
                newPage = link.attrs["href"]
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
```
It looks harmless, but the very first run crashed with:
Error Code
```
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] …>
```
Below is the path I took to solve it, improve the script, and squeeze a few extra lessons out of the problem.
Why the CERTIFICATE_VERIFY_FAILED Error Popped Up
- What really happened? urlopen() quietly redirected me from http:// to https://.
- How Python reacted: it asked my machine, "Do we trust Wikipedia's certificate?"
- My machine's reply: "I'm not sure, I don't see the right root CA," so OpenSSL threw the error and urlopen wrapped it in a friendly URLError.
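To see what my machine was actually answering with, it helps to peek at where Python looks for trusted roots. This is just a diagnostic sketch, separate from the crawler, assuming a stock CPython install:

```python
import ssl

# Paths and environment variables OpenSSL consults for trusted root CAs
print(ssl.get_default_verify_paths())

# The default context confirms that verification and hostname checks are on
ctx = ssl.create_default_context()
print(ctx.verify_mode, ctx.check_hostname)
```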
Usual Culprits
| Cause | Quick test | Simple fix |
|---|---|---|
| Out-of-date or missing root certificates | python -c "import ssl, certifi; print(certifi.where())" | pip install certifi and point your SSL context to it |
| A company proxy that re-signs traffic | Open the site in a browser; do you get a warning? | Import the proxy's CA certificate into your store |
| Previous meddling with SSL settings | — | Create a fresh virtual environment |
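Before touching the crawler, the fix from the first row can be tried on its own. A minimal sketch, assuming certifi is already installed:

```python
import ssl, certifi
from urllib.request import urlopen

# Build a context that trusts certifi's CA bundle instead of a stale system store
ctx = ssl.create_default_context(cafile=certifi.where())
resp = urlopen("https://en.wikipedia.org/wiki/Main_Page", context=ctx, timeout=10)
print(resp.getcode())  # 200 means the certificate chain verified
```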
Cleaner Version of the Crawler
After a bit of trial and error, here’s the script I use now:
```python
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
from bs4 import BeautifulSoup
import ssl, certifi, re, time

START_PAGE = "/wiki/Main_Page"
MAX_DEPTH = 2
BASE_URL = "https://en.wikipedia.org"

visited = set()
ctx = ssl.create_default_context(cafile=certifi.where())  # trust certifi's CA bundle

def get_links(page_url: str, depth: int = 0):
    """Walk Wikipedia links up to MAX_DEPTH levels deep."""
    if depth > MAX_DEPTH:
        return

    full_url = BASE_URL + page_url
    try:
        req = Request(full_url,
                      headers={"User-Agent": "SimpleCrawler/0.1 (+tutorial)"})
        html = urlopen(req, context=ctx, timeout=10)
    except (HTTPError, URLError, ssl.SSLError) as e:
        print(f"✗ {full_url} → {e}")
        return

    soup = BeautifulSoup(html, "html.parser")
    print(f"{' '*depth}✓ {page_url}")

    for link in soup.find_all("a", href=re.compile(r"^(/wiki/[^:#]*$)")):
        href = link.get("href")
        if href and href not in visited:
            visited.add(href)
            get_links(href, depth + 1)
            time.sleep(0.5)  # be polite

if __name__ == "__main__":
    visited.add(START_PAGE)
    get_links(START_PAGE)
    print(f"\nTotal unique pages collected: {len(visited)}")
```
Explain Code
- HTTPS + custom User-Agent – avoids silent redirects and looks more polite.
- Certifi CA bundle – wipes out the SSL error on any platform.
- Depth limit – prevents accidental self-DDOS.
- Timeouts and error handling – keeps the crawl moving even when pages fail.
- Politeness delay – gives Wikipedia a breather.
Practice Tasks
Task | Skill I sharpened |
---|---|
Save URLs to SQLite instead of printing | Basic database I/O |
Pull page titles and first paragraphs | Parsing and data cleaning |
Count external links vs internal ones | Simple analytics |
ThreadPoolExecutor crawl | Concurrency without going overboard |
robots.txt check with urllib.robotparser | Web-crawler etiquette |
Add argparse for CLI options | Reusable scripts |
Tackling each mini goal forced me to touch new parts of the standard library and keep my code tidy.
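As an example, the robots.txt task needs nothing beyond the standard library. A rough sketch, reusing the SimpleCrawler user-agent from the script above:

```python
from urllib.robotparser import RobotFileParser

# Download and parse Wikipedia's robots.txt once, then consult it before each request
rp = RobotFileParser("https://en.wikipedia.org/robots.txt")
rp.read()

url = "https://en.wikipedia.org/wiki/Main_Page"
if rp.can_fetch("SimpleCrawler/0.1", url):
    print(f"allowed: {url}")
else:
    print(f"disallowed, skipping: {url}")
```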
A Quick Word on "scrapy: command not found"
Install for the right interpreter

```bash
python -m pip install --upgrade pip
python -m pip install scrapy
```
Add the scripts directory to PATH (if your shell can't see the command):
```bash
# macOS / Linux
export PATH="$HOME/.local/bin:$PATH"

# Windows (Command Prompt)
setx PATH "%USERPROFILE%\AppData\Roaming\Python\Python37\Scripts;%PATH%"
```
Verify

```bash
python -m scrapy version
```
If you see a version string, you’re good to go.
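If you are not sure which scripts directory to add, Python itself can tell you. A quick check that works on any platform (the exact paths will differ from the examples above):

```python
import site
import sysconfig

# Scripts from a plain `pip install` land here for the current interpreter
print(sysconfig.get_path("scripts"))

# Scripts from `pip install --user` land under this base, in bin/ or Scripts/
print(site.USER_BASE)
```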
Final Thought
Fixing one SSL error sounded like a quick chore, yet it pushed me to:
- Read up on certificate chains,
- Build a friendlier crawler,
- Add error handling and delays,
- And jot down a tidy list of future improvements.
Each bump made the project stronger, and more importantly, taught me why the tools behave the way they do.