How I Fix the SSL Error in My First Wikipedia Crawler Using Python

I built my first Wikipedia crawler while working through Web Scraping with Python, and it crashed with an SSL: CERTIFICATE_VERIFY_FAILED error as soon as urlopen() redirected from HTTP to HTTPS. Installing certifi, creating an SSL context, and adding a user-agent, a depth limit, timeouts, and polite delays cured the error and made the script stable. The cleaned-up version now roams two clicks deep, logs every internal link, and skips broken pages without taking Wikipedia down with it. One little SSL snag ended up teaching me certificate basics, respectful crawling habits, and the value of incremental refactors.

I’m working through Web Scraping with Python and my very first crawler looked like this:

My Code

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html)
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                # We have encountered a new page
                newPage = link.attrs["href"]
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")

It looks harmless, but the very first run crashed with:

Error Code

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] …>

Below is the path I took to solve it, improve the script, and squeeze a few extra lessons out of the problem.

Why the CERTIFICATE_VERIFY_FAILED Error Popped Up

  • What really happened?
    urlopen() quietly redirected me from http:// to https://.
  • How Python reacted:
    Python asked my machine, “Do we trust Wikipedia’s certificate?”
  • My machine’s reply:
    “I’m not sure, I don’t see the right root CA,” so OpenSSL threw the error and urlopen wrapped it in a friendly URLError.
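
A quick way to see what is going on is to compare the trust store OpenSSL consults with the bundle certifi ships. This is just a diagnostic sketch (it assumes certifi is already installed); if the default paths point somewhere empty or outdated while certifi’s bundle exists, that explains the failure:

import ssl
import certifi

# Paths where Python/OpenSSL looks for root certificates by default
print(ssl.get_default_verify_paths())

# Path to certifi's bundled, regularly refreshed CA file
print(certifi.where())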

Usual Culprits

  • Cause: out-of-date or missing root certificates. Quick test: run python -c "import ssl, certifi; print(certifi.where())". Simple fix: pip install certifi and point your SSL context at it.
  • Cause: a company proxy that re-signs traffic. Quick test: open the site in a browser and see whether you get a certificate warning. Simple fix: import the proxy’s CA certificate into your store.
  • Cause: previous meddling with SSL settings. Simple fix: create a fresh virtual environment.
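
If the missing CA bundle is the culprit, the smallest patch to the original script is to build an SSL context from certifi’s bundle and hand it to urlopen. A minimal sketch, assuming certifi is installed (the cleaned-up crawler below does the same thing):

import ssl, certifi
from urllib.request import urlopen

ctx = ssl.create_default_context(cafile=certifi.where())  # trust certifi's CA bundle
html = urlopen("https://en.wikipedia.org/wiki/Main_Page", context=ctx)
print(html.status)  # 200 means the TLS handshake and the request both succeeded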

Cleaner Version of the Crawler

After a bit of trial and error, here’s the script I use now:

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
from bs4 import BeautifulSoup
import ssl, certifi, re, time

START_PAGE = "/wiki/Main_Page"
MAX_DEPTH = 2
BASE_URL = "https://en.wikipedia.org"

visited = set()
ctx = ssl.create_default_context(cafile=certifi.where()) # trust certifi’s CA bundle

def get_links(page_url: str, depth: int = 0):
    """Walk Wikipedia links up to MAX_DEPTH levels deep."""
    if depth > MAX_DEPTH:
        return

    full_url = BASE_URL + page_url
    try:
        req = Request(full_url,
                      headers={"User-Agent": "SimpleCrawler/0.1 (+tutorial)"})
        html = urlopen(req, context=ctx, timeout=10)
    except (HTTPError, URLError, ssl.SSLError) as e:
        print(f"✗ {full_url} → {e}")
        return

    soup = BeautifulSoup(html, "html.parser")
    print(f"{' '*depth}✓ {page_url}")

    for link in soup.find_all("a", href=re.compile(r"^(/wiki/[^:#]*$)")):
        href = link.get("href")
        if href and href not in visited:
            visited.add(href)
            get_links(href, depth + 1)
            time.sleep(0.5)  # be polite

if __name__ == "__main__":
    visited.add(START_PAGE)
    get_links(START_PAGE)
    print(f"\nTotal unique pages collected: {len(visited)}")

What Each Change Does

  • HTTPS + custom User-Agent – avoids silent redirects and looks more polite.
  • Certifi CA bundle – wipes out the SSL error on any platform.
  • Depth limit – prevents accidental self-DDOS.
  • Timeouts and error handling – keeps the crawl moving even when pages fail.
  • Politeness delay – gives Wikipedia a breather.

Practice Tasks

  • Save URLs to SQLite instead of printing (skill sharpened: basic database I/O)
  • Pull page titles and first paragraphs (skill sharpened: parsing and data cleaning)
  • Count external links vs internal ones (skill sharpened: simple analytics)
  • Crawl with ThreadPoolExecutor (skill sharpened: concurrency without going overboard)
  • Check robots.txt with urllib.robotparser, sketched below (skill sharpened: web-crawler etiquette)
  • Add argparse for CLI options (skill sharpened: reusable scripts)

Tackling each mini goal forced me to touch new parts of the standard library and keep my code tidy.
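
For the robots.txt task, a minimal sketch with urllib.robotparser could look like this; "SimpleCrawler/0.1" is just the user-agent string from my script, not anything Wikipedia prescribes:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()  # download and parse the site's robots.txt

# Ask before fetching: skip any URL the site disallows for our user-agent
url = "https://en.wikipedia.org/wiki/Main_Page"
if rp.can_fetch("SimpleCrawler/0.1", url):
    print(f"Allowed to crawl {url}")
else:
    print(f"robots.txt disallows {url}; skipping")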

A Quick Word on “scrapy: command not found”

1. Install for the right interpreter.

   python -m pip install --upgrade pip
   python -m pip install scrapy

2. Add the scripts directory to PATH (if your shell can’t see the command).

   # macOS / Linux
   export PATH="$HOME/.local/bin:$PATH"

   # Windows (Command Prompt)
   setx PATH "%USERPROFILE%\AppData\Roaming\Python\Python37\Scripts;%PATH%"

3. Verify the install.

   python -m scrapy version

If you see a version string, you’re good to go.
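
One extra sanity check I like: import scrapy with the same interpreter you just installed into, so a PATH problem can’t fool you (a quick sketch; it only prints the interpreter path and the installed version):

python -c "import sys, scrapy; print(sys.executable, scrapy.__version__)"

If that prints a path and a version but the scrapy command still isn’t found, the problem is purely PATH, not the installation.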

Final Thought

Fixing one SSL error sounded like a quick chore, yet it pushed me to:

  • Read up on certificate chains,
  • Build a friendlier crawler,
  • Add error handling and delays,
  • And jot down a tidy list of future improvements.

Each bump made the project stronger and, more importantly, taught me why the tools behave the way they do.
