I hit this snag while cleaning up an old Python 2.7 box that still churns out reports. One Excel file refused to convert because of a single en dash hiding in the data. Here’s how I spotted the bug, squashed it, and then polished the script so it can serve me (and maybe you) for years to come.
My Code
```python
#!/home/casper/python/core/2.7.14/exec/bin/python2.7
# -*- coding: utf-8 -*-
import openpyxl
import csv

wb = openpyxl.load_workbook('RiskLimitSnapshot.xlsx')
sh = wb.get_active_sheet()
with open('goodRiskLimitSnapshot.csv', 'wb') as f:
    c = csv.writer(f)
    for r in sh.rows:
        c.writerow([cell.value for cell in r])
```
The Error
```
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 74: ordinal not in range(128)
```
Explain Error
- `u'\u2013'` is the en dash. Perfectly normal in Unicode, but not part of ASCII's tiny 128-character family.
- Python 2 tries ASCII first whenever it has to squeeze a `unicode` string into raw bytes. Anything above 127 explodes with `UnicodeEncodeError`.
- `csv` in Python 2 only writes bytes, so the unwanted conversion happens inside `csv.writerow`, miles away from my loop.
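The trap is easy to reproduce in isolation. A minimal sketch (written in Python 3 syntax, where the same `encode` calls behave identically):

```python
# -*- coding: utf-8 -*-
# The en dash encodes fine to UTF-8 but blows up under ASCII --
# exactly the implicit conversion Python 2's csv module attempts.
text = u'Risk limit \u2013 breached'

try:
    text.encode('ascii')  # what csv tried to do behind my back
except UnicodeEncodeError as exc:
    print('ascii choked at position', exc.start)

print(text.encode('utf-8'))  # the en dash becomes the bytes e2 80 93
```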
I tried opening the file with `io.open(..., encoding='utf8')` next, but that swung the pendulum the other way:

```
TypeError: write() argument 1 must be unicode, not str
```

Now the file object expected `unicode`, while `csv.writer` was still sending byte strings. Same fight, opposite corner.
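For contrast: on Python 3 there is no fight at all, because `csv` speaks text on both sides and the byte encoding is chosen once, when the file is opened. A minimal sketch, with an in-memory buffer standing in for the file:

```python
import csv
import io

# Python 3's csv writes text, not bytes, so the en dash sails through;
# the encoding decision belongs to the file object, not the writer.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['RiskLimit', '\u2013', '42'])

print(buf.getvalue())
```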
Correct Code
```python
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import sys
import csv
import openpyxl

def excel_to_csv(xlsx_path, csv_path, sheet_name=None):
    """Convert one sheet from an .xlsx file to a UTF-8 CSV file."""
    wb = openpyxl.load_workbook(xlsx_path, data_only=True, read_only=True)
    sh = wb[sheet_name] if sheet_name else wb.active
    with open(csv_path, 'wb') as fh:
        writer = csv.writer(fh)
        for row in sh.rows:
            utf8_row = []
            for cell in row:
                # Make every value Unicode first...
                text = u'' if cell.value is None else unicode(cell.value)
                # ...then encode explicitly to UTF-8 bytes
                utf8_row.append(text.encode('utf-8'))
            writer.writerow(utf8_row)

if __name__ == '__main__':
    # Usage: python excel2csv.py sheet.xlsx sheet.csv "Optional Sheet Name"
    if len(sys.argv) < 3:
        sys.exit("Usage: excel2csv.py <xlsx> <csv> [<sheet-name>]")
    excel_to_csv(sys.argv[1], sys.argv[2],
                 sys.argv[3] if len(sys.argv) > 3 else None)
    print("Saved ->", os.path.abspath(sys.argv[2]))
```
Why it works
- I keep the file in binary mode (`'wb'`) because `csv` demands raw bytes on Python 2.
- I explicitly encode every cell to UTF-8, side-stepping the silent ASCII trap.
- `read_only=True` streams rows, so even a giant workbook doesn't hog RAM.
Fun Practice Tweaks
| What to add | One-line hint |
|---|---|
| Convert all sheets | Loop over `wb.sheetnames`, dump `<sheet>.csv`. |
| Skip blank rows | `if all(c.value is None for c in row): continue` |
| Swap weird dashes | `text = text.replace(u'\u2013', '-')` |
| Try another delimiter | `csv.writer(fh, delimiter='\t')` |
| See progress | `if i % 1000 == 0: print("row", i)` |
| Batch every .xlsx | `for path in glob.glob('*.xlsx'):` |
| Tiny unit test | Write `clean(text)` and hit it with `pytest`. |
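The dash-swapping and unit-test rows pair nicely. Here is one possible shape for that `clean()` helper; the name and the exact set of dashes are my own choices rather than anything from the original script, and the code runs unchanged on Python 2 and 3:

```python
# -*- coding: utf-8 -*-
# Hypothetical clean() helper: normalize the typographic dashes that
# tend to leak out of Excel into a plain ASCII hyphen.
DASH_MAP = {
    u'\u2013': u'-',  # en dash
    u'\u2014': u'-',  # em dash
    u'\u2212': u'-',  # minus sign
}

def clean(text):
    """Return text with typographic dashes replaced by '-'."""
    for dash, repl in DASH_MAP.items():
        text = text.replace(dash, repl)
    return text

print(clean(u'Limit \u2013 breached'))  # -> Limit - breached
```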
These micro challenges keep the muscle memory fresh when you’re stuck on legacy Python.
Key Takeaways
- Declare the encoding yourself in Python 2; never let ASCII sneak in.
- `csv` is byte-only on Py2. Either hand it bytes or use the `unicodecsv` back-port for nicer semantics.
- A few quality-of-life flags (sheet name, delimiter, logging) turn a throwaway script into a tiny command-line app you'll reuse.
Final Thought
I wrote the first version just to dodge one stubborn en dash, but the refactor gave me a durable tool that chews through spreadsheets all day. That’s the sweet spot with these little maintenance chores: fix one headache, learn a clean pattern, and walk away with code that pays rent every morning.