Working with CSV files is a common task in Python data processing, especially when dealing with client information, business data, or any other structured dataset. However, CSV files can sometimes throw unexpected errors, particularly when they contain non-ASCII characters or hidden formatting issues. I’ll walk you through a practical example of how to handle such issues and enhance your CSV processing script with additional functionality.
UnicodeDecodeError
I recently encountered a `UnicodeDecodeError` while working on a small project to manage client information for a business. The error looked like this:

```
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 1956: ordinal not in range(128)
```
This error occurs because the file is being decoded with the ASCII codec (Python 2’s default, and still the fallback under some locales in Python 3), which cannot decode non-ASCII characters (e.g., accented letters, non-breaking spaces, or special symbols). Even though the file appeared to contain plain text, there were hidden characters causing the issue.
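To see the failure in isolation, here’s a minimal reproduction; the sample bytes are invented, but the byte `0xa0` and the failure mode are the same ones from the traceback above:

```python
>>> b'Groom Notes:\xa0nervous'.decode('ascii')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 12: ordinal not in range(128)
```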
Step-by-Step Approach
To resolve this issue, I took the following steps:
Specify the Encoding
The first step was to explicitly specify the file’s encoding when opening it. Common encodings include `utf-8` and `latin-1`. Here’s how I did it:
```python
import csv

with open('notes.csv', newline='', encoding='utf-8') as csvfile:
    notes = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in notes:
        for x in row:
            print(x)
```
If `utf-8` didn’t work, I tried `latin-1`:
```python
with open('notes.csv', newline='', encoding='latin-1') as csvfile:
    notes = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in notes:
        for x in row:
            print(x)
```
Check for Hidden Characters
If specifying the encoding didn’t resolve the issue, I inspected the file for hidden or non-printable characters. I read the file in binary mode to identify problematic bytes:
```python
with open('notes.csv', 'rb') as f:
    content = f.read()
print(content)
```
This revealed unexpected byte sequences, such as `0xa0` (a non-breaking space).
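When the raw dump is too long to scan by eye, a short helper can tally every non-ASCII byte and report where each first appears. This is a quick sketch of my own (the name `find_non_ascii` is not from any library):

```python
from collections import Counter

def find_non_ascii(file_path):
    with open(file_path, 'rb') as f:
        content = f.read()
    # Count every byte above 0x7F and note its first offset
    counts = Counter(b for b in content if b > 0x7f)
    for byte, count in counts.most_common():
        offset = content.index(bytes([byte]))
        print(f"0x{byte:02x}: {count} occurrence(s), first at offset {offset}")

find_non_ascii('notes.csv')
```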
Clean the CSV File
Once I identified the problematic characters, I cleaned the file by replacing or removing them. For example, I replaced non-breaking spaces with regular spaces:
```python
with open('notes.csv', 'rb') as f:
    content = f.read()

# Replace non-breaking spaces with regular spaces
cleaned_content = content.replace(b'\xa0', b' ')

with open('notes_cleaned.csv', 'wb') as f:
    f.write(cleaned_content)
```
Handle Errors Gracefully
To avoid crashes, I used the `errors` parameter of the `open()` function to handle decoding errors gracefully. For example, I replaced problematic characters with a placeholder:
```python
with open('notes.csv', newline='', encoding='utf-8', errors='replace') as csvfile:
    notes = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in notes:
        for x in row:
            print(x)
```
Alternatively, I could drop problematic characters entirely with `errors='ignore'`, at the cost of silently losing data.
Verify the CSV File
Finally, I verified the CSV file for irregularities (e.g., mismatched quotes or unexpected line breaks) using a text editor like VS Code or Sublime Text.
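If you’d rather script this check than eyeball it, counting double quotes per physical line is a crude but useful heuristic. This is a sketch under the assumption that quotes are never escaped inside fields:

```python
def check_quotes(file_path, encoding='utf-8'):
    with open(file_path, encoding=encoding, errors='replace') as f:
        for lineno, line in enumerate(f, start=1):
            # An odd quote count often signals a mismatched quote
            # or a field split by an unexpected line break
            if line.count('"') % 2 != 0:
                print(f"Line {lineno}: odd number of quotes, inspect manually")

check_quotes('notes.csv')
```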
Enhancing the Code with Additional Functionality
To make the script more robust and practical, I added the following features:
Automatic Encoding Detection
I used the `chardet` library to automatically detect the file’s encoding:
```python
import chardet

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        raw_data = f.read()
    result = chardet.detect(raw_data)
    return result['encoding']
```
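A quick way to sanity-check the helper on its own (the detected name varies; chardet may report `ascii`, `utf-8`, `Windows-1252`, and so on depending on the file’s contents):

```python
print(detect_encoding('notes.csv'))
```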
Logging
I added logging to provide informative messages during execution:
```python
import logging

logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
```
Data Cleaning
I created a function to clean the CSV file by replacing non-breaking spaces and stripping extra whitespace:
```python
def clean_csv(file_path, output_path, encoding='utf-8'):
    try:
        # Fall back to automatic detection when no encoding is supplied
        if not encoding:
            encoding = detect_encoding(file_path)
            logging.info(f"Detected encoding: {encoding}")
        with open(file_path, 'r', encoding=encoding, errors='replace') as infile:
            reader = csv.reader(infile, delimiter=',', quotechar='"')
            cleaned_rows = []
            for row in reader:
                # Replace non-breaking spaces and trim stray whitespace
                cleaned_row = [cell.replace('\xa0', ' ').strip() for cell in row]
                cleaned_rows.append(cleaned_row)
        with open(output_path, 'w', encoding='utf-8', newline='') as outfile:
            writer = csv.writer(outfile, delimiter=',', quotechar='"',
                                quoting=csv.QUOTE_MINIMAL)
            writer.writerows(cleaned_rows)
        logging.info(f"Cleaned CSV saved to: {output_path}")
    except Exception as e:
        logging.error(f"An error occurred: {e}")
```
Data Validation
I added a function to validate the CSV file’s structure:
```python
def validate_csv(file_path, encoding='utf-8'):
    try:
        with open(file_path, 'r', encoding=encoding, errors='replace') as csvfile:
            reader = csv.reader(csvfile, delimiter=',', quotechar='"')
            header = next(reader)
            logging.info(f"CSV Header: {header}")
            # Data rows start at 2 because row 1 is the header
            for i, row in enumerate(reader, start=2):
                if len(row) != len(header):
                    logging.warning(f"Row {i} has {len(row)} columns, expected {len(header)}")
    except Exception as e:
        logging.error(f"Validation error: {e}")
```
Export Cleaned Data
The cleaned data is saved to a new CSV file for further use.
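None of this runs by itself, so here’s a sketch of the driver block I’d put at the bottom of `clean_csv.py`. Passing `encoding=None` forces automatic detection via `chardet`, which is what produces the “Detected encoding” line in the sample output below; adjust the file names for your own data:

```python
if __name__ == '__main__':
    input_file = 'notes.csv'
    output_file = 'notes_cleaned.csv'

    # encoding=None triggers detect_encoding() inside clean_csv()
    clean_csv(input_file, output_file, encoding=None)
    validate_csv(output_file, encoding='utf-8')
```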
How to Use the Enhanced Script
- Install the `chardet` library: `pip install chardet`
- Save the script to a Python file (e.g., `clean_csv.py`).
- Run the script: `python clean_csv.py`
- Check the output:
  - The cleaned CSV file (`notes_cleaned.csv`) will be created.
  - Logs will provide details about the cleaning and validation process.
Example Output
```
INFO: Detected encoding: utf-8
INFO: Cleaned CSV saved to: notes_cleaned.csv
INFO: CSV Header: ['Client Last Name', 'Client First Name', 'Phone', 'Alt Phone', 'Dog', 'Gender', 'Breed', 'Weight', 'Price', 'Rabies', 'Vet', 'Groomer', 'Health', 'Behavioral Notes', 'Scheduled', 'Groom Notes', '', '', '']
WARNING: Row 43 has 18 columns, expected 19
```
Final Thoughts
Handling CSV files can be tricky, especially when dealing with encoding issues or hidden characters. By specifying the encoding, cleaning the data, and adding robust functionality like automatic encoding detection and logging, you can create a more reliable and user-friendly script. This approach not only solves immediate problems but also makes your code easier to maintain and debug in the future.