I built my first Wikipedia crawler while working through Web Scraping with Python, and it crashed with an SSL: CERTIFICATE_VERIFY_FAILED error as soon as urlopen() redirected from HTTP to HTTPS. Installing certifi, creating a proper SSL context, and adding a user-agent, a depth limit, timeouts, and polite delays cured the error and made the script stable. The cleaned-up version now roams two clicks deep, logs every internal link, and skips broken pages without hammering Wikipedia. One little SSL snag ended up teaching me certificate basics, respectful crawling habits, and the value of incremental refactors.
I’m working through Web Scraping with Python and my very first crawler looked like this:
My Code
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html)
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                # We have encountered a new page
                newPage = link.attrs["href"]
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
```
It looks harmless, but the very first run crashed with:
Error Code
```
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] …>
```
Below is the path I took to solve it, improve the script, and squeeze a few extra lessons out of the problem.
Why the CERTIFICATE_VERIFY_FAILED Error Popped Up
- What really happened? urlopen() quietly redirected me from http:// to https://.
- How Python reacted: it asked my machine, "Do we trust Wikipedia's certificate?"
- My machine's reply: "I'm not sure, I don't see the right root CA," so OpenSSL threw the error and urlopen wrapped it in a friendly URLError.
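To see what my machine was actually answering with, it helps to peek at where Python looks for trusted roots. This is just a diagnostic sketch, separate from the crawler, assuming a stock CPython install:

```python
import ssl

# Paths and environment variables OpenSSL consults for trusted root CAs
print(ssl.get_default_verify_paths())

# The default context confirms that verification and hostname checks are on
ctx = ssl.create_default_context()
print(ctx.verify_mode, ctx.check_hostname)
```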
Usual Culprits
| Cause | Quick test | Simple fix |
|---|---|---|
| Out-of-date or missing root certificates | python -c "import ssl, certifi; print(certifi.where())" | pip install certifi and point your SSL context to it |
| A company proxy that re-signs traffic | Open the site in a browser; do you get a warning? | Import the proxy's CA certificate into your store |
| Previous meddling with SSL settings | — | Create a fresh virtual environment |
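Before touching the crawler, the fix from the first row can be tried on its own. A minimal sketch, assuming certifi is already installed:

```python
import ssl, certifi
from urllib.request import urlopen

# Build a context that trusts certifi's CA bundle instead of a stale system store
ctx = ssl.create_default_context(cafile=certifi.where())
resp = urlopen("https://en.wikipedia.org/wiki/Main_Page", context=ctx, timeout=10)
print(resp.getcode())  # 200 means the certificate chain verified
```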
Cleaner Version of the Crawler
After a bit of trial and error, here’s the script I use now:
```python
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
from bs4 import BeautifulSoup
import ssl, certifi, re, time

START_PAGE = "/wiki/Main_Page"
MAX_DEPTH = 2
BASE_URL = "https://en.wikipedia.org"

visited = set()
ctx = ssl.create_default_context(cafile=certifi.where())  # trust certifi's CA bundle

def get_links(page_url: str, depth: int = 0):
    """Walk Wikipedia links up to MAX_DEPTH levels deep."""
    if depth > MAX_DEPTH:
        return

    full_url = BASE_URL + page_url
    try:
        req = Request(full_url,
                      headers={"User-Agent": "SimpleCrawler/0.1 (+tutorial)"})
        html = urlopen(req, context=ctx, timeout=10)
    except (HTTPError, URLError, ssl.SSLError) as e:
        print(f"✗ {full_url} → {e}")
        return

    soup = BeautifulSoup(html, "html.parser")
    print(f"{' '*depth}✓ {page_url}")

    for link in soup.find_all("a", href=re.compile(r"^(/wiki/[^:#]*$)")):
        href = link.get("href")
        if href and href not in visited:
            visited.add(href)
            get_links(href, depth + 1)
            time.sleep(0.5)  # be polite

if __name__ == "__main__":
    visited.add(START_PAGE)
    get_links(START_PAGE)
    print(f"\nTotal unique pages collected: {len(visited)}")
```
Explain Code
- HTTPS + custom User-Agent – avoids silent redirects and looks more polite.
- Certifi CA bundle – wipes out the SSL error on any platform.
- Depth limit – prevents accidental self-DDOS.
- Timeouts and error handling – keeps the crawl moving even when pages fail.
- Politeness delay – gives Wikipedia a breather.
Practice Tasks
Task | Skill I sharpened |
---|---|
Save URLs to SQLite instead of printing | Basic database I/O |
Pull page titles and first paragraphs | Parsing and data cleaning |
Count external links vs internal ones | Simple analytics |
ThreadPoolExecutor crawl | Concurrency without going overboard |
robots.txt check with urllib.robotparser | Web-crawler etiquette |
Add argparse for CLI options | Reusable scripts |
Tackling each mini goal forced me to touch new parts of the standard library and keep my code tidy.
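As an example, the robots.txt task needs nothing beyond the standard library. A rough sketch, reusing the SimpleCrawler user-agent from the script above:

```python
from urllib.robotparser import RobotFileParser

# Download and parse Wikipedia's robots.txt once, then consult it before each request
rp = RobotFileParser("https://en.wikipedia.org/robots.txt")
rp.read()

url = "https://en.wikipedia.org/wiki/Main_Page"
if rp.can_fetch("SimpleCrawler/0.1", url):
    print(f"allowed: {url}")
else:
    print(f"disallowed, skipping: {url}")
```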
A Quick Word on "scrapy: command not found"
Install for the right interpreter

```bash
python -m pip install --upgrade pip
python -m pip install scrapy
```
Add the scripts directory to PATH (if your shell can't see the command):
```bash
# macOS / Linux
export PATH="$HOME/.local/bin:$PATH"

# Windows (Command Prompt)
setx PATH "%USERPROFILE%\AppData\Roaming\Python\Python37\Scripts;%PATH%"
```
Verify

```bash
python -m scrapy version
```
If you see a version string, you’re good to go.
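If you are not sure which scripts directory to add, Python itself can tell you. A quick check that works on any platform (the exact paths will differ from the examples above):

```python
import site
import sysconfig

# Scripts from a plain `pip install` land here for the current interpreter
print(sysconfig.get_path("scripts"))

# Scripts from `pip install --user` land under this base, in bin/ or Scripts/
print(site.USER_BASE)
```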
Final Thought
Fixing one SSL error sounded like a quick chore, yet it pushed me to:
- Read up on certificate chains,
- Build a friendlier crawler,
- Add error handling and delays,
- And jot down a tidy list of future improvements.
Each bump made the project stronger, and more importantly, taught me why the tools behave the way they do.