The script builds a tiny XML file with lxml: it creates a <results> root, adds two nodes for “Country” and “City,” then tries to save the tree with doc.write(outFile). Unfortunately, it opens the file using open('output.xml', 'w'), which returns a text stream that expects str data, while ElementTree.write() emits encoded bytes (ASCII by default in lxml). When those bytes hit a text-only handle, Python 3 raises TypeError: must be str, not bytes. Opening the file in binary mode ('wb') or letting lxml handle the path fixes the clash.
How I tripped over Python 3’s text/bytes split and tightened up my XML generator
The Error
I was dusting off a small XML-report script, nothing mission-critical, just a helper that spits out a list of countries and cities for a test harness. On Python 2.7 it purred. I upgraded the repo to Python 3.2, slammed Run, and the interpreter hit me with:
TypeError: must be str, not bytes
The crash happened on the very last write. Classic Py3 moment. Here’s the forensic rundown and the tidy upgrade I ended up shipping.
The Original Code
# -*- coding: utf-8 -*-
import time
from datetime import date
from lxml import etree
from collections import OrderedDict

page = etree.Element('results')
doc = etree.ElementTree(page)

# Elements
etree.SubElement(page, 'Country', Tim='Now',
                 name='Germany', AnotherParameter='Bye',
                 Code='DE', Storage='Basic')
etree.SubElement(page, 'City',
                 name='Germany', Code='PZ',
                 Storage='Basic', AnotherParameter='Hello')

# Save
outFile = open('output.xml', 'w')  # <- trouble lives here
doc.write(outFile)
Explain Error
Python 3 draws a hard line between text (str) and binary (bytes):
What I did | What that means in Py3 |
---|---|
open('output.xml', 'w') | I opened the file in text mode. The stream expects str. |
doc.write(outFile) | lxml.etree.ElementTree.write() serializes the tree to encoded bytes (ASCII when no encoding is given) and hands those bytes to the stream's write(). |
Bytes flowing into a text stream trigger the exact TypeError I saw. The function isn’t misbehaving; my file handle is.
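To see the split in isolation, here is a minimal sketch that leaves lxml out entirely. The payload value and the in-memory streams are stand-ins I picked for illustration (they are not part of the original script): io.TextIOWrapper plays the role of a file opened with 'w', and io.BytesIO the role of one opened with 'wb'.

```python
import io

# Stand-in for the bytes an XML serializer hands to the stream's write().
payload = b"<results/>"

# A text stream, comparable to what open('output.xml', 'w') returns.
text_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
try:
    text_stream.write(payload)
    text_error = None
except TypeError as exc:
    text_error = exc  # text streams accept str only

# A binary stream, comparable to what open('output.xml', 'wb') returns.
binary_stream = io.BytesIO()
binary_stream.write(payload)

print(type(text_error).__name__)
print(binary_stream.getvalue())
```

The exact wording of the TypeError message varies between Python versions, so the sketch only records the exception type.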
The Fix Error
Option | Change | Why it works |
---|---|---|
A. Let lxml open the file | doc.write('output.xml', encoding='utf-8', xml_declaration=True) | I pass a file path (a str). lxml opens the file itself—in binary mode—so its bytes go where bytes belong. |
B. Open the file in binary mode | with open('output.xml', 'wb') as f:<br>    doc.write(f, encoding='utf-8', xml_declaration=True) | 'wb' returns a binary stream, which welcomes the bytes without complaint. |
Either way, goodbye TypeError.
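A quick way to convince yourself that the two options are interchangeable is to write the same tree both ways and parse each file back. The sketch below is only that, a sketch: the trimmed-down tree, the temporary directory, and the names path_a, path_b, and read_children are my own illustration, and the fallback to the standard library's xml.etree.ElementTree is an assumption added so the snippet still runs where lxml is not installed.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API if lxml is missing
    import xml.etree.ElementTree as etree

root = etree.Element('results')
etree.SubElement(root, 'Country', name='Germany', Code='DE')
doc = etree.ElementTree(root)


def read_children(path):
    """Parse a file back and return its child tags and attributes."""
    parsed_root = etree.parse(path).getroot()
    return [(child.tag, dict(child.attrib)) for child in parsed_root]


with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, 'option_a.xml')
    path_b = os.path.join(tmp, 'option_b.xml')

    # Option A: pass a path and let the library open the file.
    doc.write(path_a, encoding='utf-8', xml_declaration=True)

    # Option B: open the file in binary mode ourselves.
    with open(path_b, 'wb') as fh:
        doc.write(fh, encoding='utf-8', xml_declaration=True)

    children_a = read_children(path_a)
    children_b = read_children(path_b)

print(children_a == children_b)
```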
Fix Code
While I was in the code, I polished it into a reusable helper:
- Pretty prints the XML (so I can eyeball diffs).
- Builds nodes from an OrderedDict, not hard-coded literals.
- Accepts an output filename on the command line.
- Times the run—handy when I batch-generate big reports.
#!/usr/bin/env python3
"""
xml_builder.py – bite-size XML generator
Run:
    python xml_builder.py [outfile.xml]
"""
import sys
import time
from datetime import datetime
from collections import OrderedDict
from lxml import etree


# ---------- helpers ----------
def build_tree(records):
    root = etree.Element('results', generated=datetime.utcnow().isoformat())
    for tag, attrs in records:
        etree.SubElement(root, tag, **attrs)
    return etree.ElementTree(root)


def save_tree(tree, filename='output.xml'):
    with open(filename, 'wb') as fh:  # binary = no TypeError
        tree.write(fh,
                   encoding='utf-8',
                   xml_declaration=True,
                   pretty_print=True)


# ---------- main ----------
if __name__ == '__main__':
    t0 = time.perf_counter()
    data = [
        ('Country', OrderedDict([
            ('Tim', 'Now'),
            ('name', 'Germany'),
            ('AnotherParameter', 'Bye'),
            ('Code', 'DE'),
            ('Storage', 'Basic')
        ])),
        ('City', OrderedDict([
            ('name', 'Germany'),
            ('Code', 'PZ'),
            ('Storage', 'Basic'),
            ('AnotherParameter', 'Hello')
        ]))
    ]
    tree = build_tree(data)
    target = sys.argv[1] if len(sys.argv) > 1 else 'output.xml'
    save_tree(tree, target)
    print(f'XML written to {target} in {time.perf_counter() - t0:.4f}s')
Explain it
Want to stretch the script? Try these:
- Add argparse so you can inject tag/attribute pairs from the shell.
- Validate the output against an XSD before writing, to catch schema drift early.
- Benchmark pretty_print=True vs False with a few hundred thousand nodes.
- Swap libraries: rewrite with xml.etree.ElementTree and compare speed and API clarity.
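To give the first and last ideas a starting point, here is a rough sketch that combines them: an argparse front end feeding a standard-library xml.etree.ElementTree build. Everything specific in it is an assumption of mine rather than part of the script above: the --node option, its TAG:key=value,key=value format, and the helper names parse_node, build_parser, and to_bytes are hypothetical. It serializes to an in-memory buffer and skips pretty-printing to stay short.

```python
import argparse
import io
import xml.etree.ElementTree as ET


def parse_node(spec):
    """Turn 'Tag:key=value,key=value' into a (tag, attrs) pair."""
    tag, _, attr_text = spec.partition(':')
    if not tag:
        raise argparse.ArgumentTypeError(f'missing tag name in {spec!r}')
    attrs = {}
    for pair in filter(None, attr_text.split(',')):
        key, sep, value = pair.partition('=')
        if not key or not sep:
            raise argparse.ArgumentTypeError(f'expected key=value, got {pair!r}')
        attrs[key] = value
    return tag, attrs


def build_parser():
    parser = argparse.ArgumentParser(description='bite-size XML generator')
    parser.add_argument('outfile', nargs='?', default='output.xml')
    # --node may be repeated; each use appends one (tag, attrs) pair.
    parser.add_argument('--node', action='append', type=parse_node,
                        default=[], metavar='SPEC')
    return parser


def build_tree(records):
    root = ET.Element('results')
    for tag, attrs in records:
        ET.SubElement(root, tag, attrs)
    return ET.ElementTree(root)


def to_bytes(tree):
    # Serialize into a binary buffer, mirroring a file opened with 'wb'.
    buffer = io.BytesIO()
    tree.write(buffer, encoding='utf-8', xml_declaration=True)
    return buffer.getvalue()


# Simulated command line; a real script would call parse_args() with no arguments.
args = build_parser().parse_args(
    ['report.xml',
     '--node', 'Country:name=Germany,Code=DE',
     '--node', 'City:name=Germany,Code=PZ'])
xml_bytes = to_bytes(build_tree(args.node))
print(xml_bytes.decode('utf-8'))
```

Because parse_node returns the same (tag, attrs) pairs that build_tree already consumes, the CLI layer can be swapped in without touching the tree-building code.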
Final Thought
The bug took five minutes to squash, but it reminded me why Python 3’s harsh stance on text versus bytes is a blessing. After the fix I walked away with cleaner I/O, a faster script, and a little utility I can drop into any project. Worth the detour, and now my CI pipeline stays green.