How to Sign an XML Document with Python?

If you’re building a Django-based sales system and need to incorporate electronic invoicing compliant with government regulations, XML signing is a critical step. Many developers face challenges here, especially when transitioning from testing to production. I’ll walk through how to sign XML documents in Python, troubleshoot common errors (like the dreaded “Incorrect reference digest value”), and ensure compliance with regulatory standards.

XML Signature Validation Failures

In my case, the regulatory authority required XML invoices to be signed with a company’s digital certificate, zipped, and sent via a web service. Despite using Python’s lxml and cryptography libraries, the web service returned an error:
Error: The entered electronic document has been altered - Detail: Incorrect reference digest value.

This error indicates that the digest value (a hashed summary of the XML content) calculated during signing doesn’t match what the web service computed. The culprit? Incorrect canonicalization or improper handling of the XML structure during signing.

Understanding XML Signatures

XML signatures follow the XML Signature Syntax and Processing standard. The process involves:

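  1. Canonicalization: Converting the XML to a standardized byte representation, so that equivalent serializations (attribute order, quoting, spacing inside tags, line endings) hash identically. Whitespace inside element content is preserved.
  2. Digest Calculation: Hashing the canonicalized content and storing the Base64-encoded result as the DigestValue inside SignedInfo.
  3. Signature Generation: Canonicalizing SignedInfo and signing it with the private key to produce the SignatureValue.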

Mistakes in any step break the signature validation.
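
To make steps 1 and 2 concrete, here is a minimal sketch that uses only the standard library so it runs without extra dependencies. Note the assumption: xml.etree.ElementTree.canonicalize implements C14N 2.0, not the C14N 1.0 algorithm used later in this post, so treat it as an illustration of the idea rather than what your authority expects. The digest_value helper and the sample documents are made up for this example.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def digest_value(xml_text):
    """Canonicalize, hash with SHA-256, and Base64-encode (steps 1 and 2)."""
    canonical = ET.canonicalize(xml_text)  # C14N 2.0, illustration only
    raw_digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.b64encode(raw_digest).decode("ascii")


# Same data, different serialization: attribute order, quoting, in-tag spacing
a = digest_value('<Invoice b="2" a="1"><Total>10</Total></Invoice>')
b = digest_value("<Invoice a='1'   b='2'><Total>10</Total></Invoice >")
# Different data: extra whitespace inside the element content
c = digest_value('<Invoice a="1" b="2"><Total> 10 </Total></Invoice>')

print("a == b:", a == b)  # serialization noise is normalized away
print("a == c:", a == c)  # content whitespace is significant
```

The first comparison holds because canonicalization removes serialization differences; the second fails because whitespace inside an element is part of the signed content.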

What Went Wrong?

Let’s dissect the original code and identify issues:

Incorrect Canonicalization

The code uses etree.tostring(copy_tree, exclusive=1) for canonicalization. While exclusive=1 applies Exclusive XML Canonicalization, the regulatory authority might require Inclusive Canonicalization (common in Latin American e-invoicing systems). This discrepancy alters the digest value.
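
To see how the two modes differ, here is a small sketch with lxml. The sample document and the urn:example:ext namespace are invented for illustration; the point is only that the two modes treat an unused namespace declaration differently, which is enough to change the digest.

```python
from lxml import etree

# The ext prefix is declared on the root but never used by any element
doc = etree.fromstring(
    b'<Invoice xmlns:ext="urn:example:ext"><Total>10</Total></Invoice>'
)

# Inclusive C14N keeps every namespace declaration that is in scope
inclusive = etree.tostring(doc, method="c14n", exclusive=False)

# Exclusive C14N only emits namespaces that are visibly used
exclusive = etree.tostring(doc, method="c14n", exclusive=True)

print(inclusive.decode())
print(exclusive.decode())
print("same bytes:", inclusive == exclusive)
```

Since UBL-style invoices declare many namespaces on the root, picking the wrong mode almost always produces a different byte stream and therefore a different digest.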

Manual Whitespace Stripping

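Using re.sub(b'>\s*<', b'><', xml_serialized) is a red flag. Canonicalization preserves whitespace inside element content, so the verifier hashes it too. If the digest is computed over stripped bytes while the transmitted document still contains that whitespace, the two digests cannot match, and the regex can also erase whitespace-only element values.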
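
A quick way to convince yourself is to hash a document before and after the regex. This sketch uses only hashlib and re; the sample invoice is made up.

```python
import hashlib
import re

pretty = b"<Invoice>\n  <Note>  </Note>\n  <Total>10</Total>\n</Invoice>"

# The regex from the original code: drop all whitespace between tags
stripped = re.sub(rb">\s*<", b"><", pretty)

digest_pretty = hashlib.sha256(pretty).hexdigest()
digest_stripped = hashlib.sha256(stripped).hexdigest()

print(stripped.decode())
print("digests equal:", digest_pretty == digest_stripped)
```

Besides changing the digest, the regex also emptied the Note element, whose value consisted only of spaces.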

Reference URI Mismatch

The Reference element’s URI attribute was empty (URI=""), which means the entire document is signed. However, some systems expect a fragment identifier (e.g., URI="#Invoice"), which must match an ID-type attribute (commonly Id) on the target element, especially if the XML contains multiple signable sections.
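
As a rough sketch of how a same-document reference is resolved, the hypothetical helper below returns the element that a given URI points at. The function name, the id_attr parameter, and the sample document are my own; real schemas may name the attribute Id, ID, or id, so check yours.

```python
import xml.etree.ElementTree as ET


def resolve_reference(root, uri, id_attr="Id"):
    """Return the element a same-document Reference URI points at."""
    if uri == "":
        return root  # empty URI: the whole document is referenced
    if not uri.startswith("#"):
        raise ValueError("only same-document references are handled here")
    target_id = uri[1:]
    for element in root.iter():
        if element.get(id_attr) == target_id:
            return element
    raise LookupError(f"no element with {id_attr}={target_id!r}")


root = ET.fromstring('<Invoice Id="Invoice"><Line Id="L1"/></Invoice>')
print(resolve_reference(root, "").tag)
print(resolve_reference(root, "#L1").tag)
```

If the lookup fails on your own document, the verifier will hash something other than what you signed, or reject the reference outright.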

Incorrect Order of Operations

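The DigestValue was calculated before the document and the SignedInfo structure were final. The DigestValue must be computed over the referenced content exactly as it will be sent, and the SignatureValue over the canonicalized SignedInfo after the DigestValue has been inserted. Any modification made after either hash is taken invalidates it.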
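
The ordering problem is easy to reproduce. In this standard-library sketch, a SHA-256 fingerprint stands in for the bytes that would be signed; the helper name and the placeholder digest are invented, and ElementTree's C14N 2.0 is used only for illustration.

```python
import hashlib
import xml.etree.ElementTree as ET

DS = "http://www.w3.org/2000/09/xmldsig#"


def signed_info_fingerprint(signed_info):
    """SHA-256 of the canonical SignedInfo, a stand-in for the signed bytes."""
    canonical = ET.canonicalize(ET.tostring(signed_info, encoding="unicode"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


signed_info = ET.Element(f"{{{DS}}}SignedInfo")
reference = ET.SubElement(signed_info, f"{{{DS}}}Reference", URI="")

# Wrong order: fingerprint taken before DigestValue exists
before = signed_info_fingerprint(signed_info)

# DigestValue added afterwards changes what a verifier will canonicalize
ET.SubElement(reference, f"{{{DS}}}DigestValue").text = "placeholder-digest"
after = signed_info_fingerprint(signed_info)

print("fingerprints equal:", before == after)
```

A signature computed at the "before" point would no longer match the SignedInfo that ends up in the file.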

Revised Python Code

Here’s an improved approach using lxml and cryptography, with careful attention to canonicalization and transformations:

from lxml import etree
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.serialization import load_pem_private_key
import base64

def sign_xml(xml_path, private_key_path, cert_path, output_path):
    # Load XML
    tree = etree.parse(xml_path)
    root = tree.getroot()

    # Register namespaces
    ns = {
        "ds": "http://www.w3.org/2000/09/xmldsig#",
        "ext": "urn:oasis:names:specification:ubl:schema:xsd:CommonExtensionComponents-2",
        "cac": "...",  # Add all required namespaces
    }

    # Prepare the Signature element in the correct location
    extension_content = root.find(".//ext:ExtensionContent", namespaces=ns)
    if extension_content is None:
        raise ValueError("ExtensionContent not found!")

    # Create Signature structure
    signature = etree.SubElement(extension_content, "{http://www.w3.org/2000/09/xmldsig#}Signature", Id="SignatureSP")
    signed_info = etree.SubElement(signature, "{http://www.w3.org/2000/09/xmldsig#}SignedInfo")

    # Canonicalization method (confirm with your authority!)
    etree.SubElement(
        signed_info,
        "{http://www.w3.org/2000/09/xmldsig#}CanonicalizationMethod",
        Algorithm="http://www.w3.org/TR/2001/REC-xml-c14n-20010315"  # Inclusive Canonicalization
    )

    # Signature method
    etree.SubElement(
        signed_info,
        "{http://www.w3.org/2000/09/xmldsig#}SignatureMethod",
        Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256"
    )

    # Reference element with correct URI
    reference = etree.SubElement(
        signed_info,
        "{http://www.w3.org/2000/09/xmldsig#}Reference",
        URI=""  # Or "#Invoice" if targeting a specific element
    )

    # Transforms
    transforms = etree.SubElement(reference, "{http://www.w3.org/2000/09/xmldsig#}Transforms")
    etree.SubElement(
        transforms,
        "{http://www.w3.org/2000/09/xmldsig#}Transform",
        Algorithm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"
    )
    # Add canonicalization transform if needed
    etree.SubElement(
        transforms,
        "{http://www.w3.org/2000/09/xmldsig#}Transform",
        Algorithm="http://www.w3.org/TR/2001/REC-xml-c14n-20010315"
    )

    # Digest method
    etree.SubElement(
        reference,
        "{http://www.w3.org/2000/09/xmldsig#}DigestMethod",
        Algorithm="http://www.w3.org/2001/04/xmlenc#sha256"
    )

    # --- Calculate DigestValue ---
    # Apply the enveloped-signature transform: detach the whole Signature
    # element (not just SignedInfo) before canonicalizing the document.
    extension_content.remove(signature)

    # Inclusive C14N without comments, matching the Transform declared above
    canonical_xml = etree.tostring(
        root,
        method="c14n",
        exclusive=False,
        with_comments=False
    )

    # Re-attach the Signature so SignedInfo ends up in the saved document
    extension_content.append(signature)

    # Compute digest
    digest = hashes.Hash(hashes.SHA256(), backend=default_backend())
    digest.update(canonical_xml)
    digest_value = base64.b64encode(digest.finalize()).decode()
    etree.SubElement(reference, "{http://www.w3.org/2000/09/xmldsig#}DigestValue").text = digest_value

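    # --- Calculate SignatureValue ---
    # Canonicalize the finished SignedInfo (DigestValue included) with the
    # algorithm declared in CanonicalizationMethod
    signed_info_canonical = etree.tostring(
        signed_info,
        method="c14n",
        exclusive=False,
        with_comments=False
    )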

    # Load private key
    with open(private_key_path, "rb") as f:
        private_key = load_pem_private_key(f.read(), password=None, backend=default_backend())

    # Sign
    signature_bytes = private_key.sign(
        signed_info_canonical,
        padding.PKCS1v15(),
        hashes.SHA256()
    )
    signature_value = base64.b64encode(signature_bytes).decode()
    etree.SubElement(signature, "{http://www.w3.org/2000/09/xmldsig#}SignatureValue").text = signature_value

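    # Add X509 certificate (assumes the file holds a single certificate)
    key_info = etree.SubElement(signature, "{http://www.w3.org/2000/09/xmldsig#}KeyInfo")
    x509_data = etree.SubElement(key_info, "{http://www.w3.org/2000/09/xmldsig#}X509Data")
    with open(cert_path, "rb") as cert_file:
        cert_data = cert_file.read()
    # X509Certificate must hold the Base64 of the DER bytes. A PEM file already
    # contains that Base64 between its BEGIN/END markers, so reuse it instead
    # of encoding the PEM text a second time.
    if cert_data.lstrip().startswith(b"-----BEGIN"):
        pem_lines = [line.strip() for line in cert_data.decode("ascii").splitlines()]
        cert_b64 = "".join(line for line in pem_lines if line and not line.startswith("-----"))
    else:
        cert_b64 = base64.b64encode(cert_data).decode()
    etree.SubElement(x509_data, "{http://www.w3.org/2000/09/xmldsig#}X509Certificate").text = cert_b64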

    # Save signed XML
    tree.write(output_path, encoding="UTF-8", xml_declaration=True)
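
Before sending anything to the web service, it helps to recompute the Reference digest locally. The sketch below is a self-check under stated assumptions, not a full verifier: it assumes a single enveloped Signature, URI="", SHA-256, and inclusive C14N without comments, and it does not check the SignatureValue. The function names are my own.

```python
import base64
import copy
import hashlib

from lxml import etree

DS_NS = "http://www.w3.org/2000/09/xmldsig#"


def enveloped_digest(root):
    """Base64 SHA-256 of the document with its ds:Signature detached."""
    work = copy.deepcopy(root)  # never mutate the signed document itself
    sig = work.find(f".//{{{DS_NS}}}Signature")
    if sig is not None:
        parent, prev, tail = sig.getparent(), sig.getprevious(), sig.tail
        parent.remove(sig)  # lxml removes the tail text along with the element
        if tail:
            # Put the tail back: the transform drops only the element
            if prev is not None:
                prev.tail = (prev.tail or "") + tail
            else:
                parent.text = (parent.text or "") + tail
    canonical = etree.tostring(
        work, method="c14n", exclusive=False, with_comments=False
    )
    return base64.b64encode(hashlib.sha256(canonical).digest()).decode()


def digest_matches(root):
    """Compare the stored DigestValue with a freshly computed one."""
    stored = root.findtext(f".//{{{DS_NS}}}DigestValue")
    return stored is not None and stored.strip() == enveloped_digest(root)
```

Calling digest_matches(etree.parse(output_path).getroot()) on the file you just wrote should return True; if it does not, the web service will very likely report the same digest error.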

Key Fixes Explained

  1. Canonicalization Method:
    Switched to method="c14n" (Inclusive Canonicalization) instead of manual regex stripping. Confirm with your authority which method they require.
  2. Transforms Order:
    Added both enveloped-signature (to exclude the Signature itself during hashing) and a canonicalization transform if needed.
  3. Digest Calculation:
    Removed the Signature element before canonicalizing the XML to avoid hashing an incomplete structure.
  4. Proper Namespace Handling:
    Explicitly defined all namespaces to prevent mismatches.

Testing and Validation

  1. Validate XML Structure:
    Use an XSD validator (for example, lxml's etree.XMLSchema) to ensure compliance with the regulatory schema.
  2. Signature Validators:
    Verify your signed XML locally with a tool such as the xmlsec1 command-line utility, or with government-provided validators.
  3. Compare with Official Samples:
    Obtain a correctly signed XML sample from your regulatory authority and compare structures using diff tools.
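
For the schema check, a minimal sketch with lxml's etree.XMLSchema looks like this. The inline XSD is a toy schema invented for the example; in practice you would load the XSD files published by your regulatory authority. The schema_errors helper is my own naming.

```python
from lxml import etree

# Toy schema for illustration only; replace with the authority's XSD
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Invoice">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Total" type="xs:decimal"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))


def schema_errors(xml_bytes):
    """Return a list of validation messages, empty when the document is valid."""
    doc = etree.fromstring(xml_bytes)
    if schema.validate(doc):
        return []
    return [f"line {err.line}: {err.message}" for err in schema.error_log]


print(schema_errors(b"<Invoice><Total>10.50</Total></Invoice>"))
print(schema_errors(b"<Invoice><Total>abc</Total></Invoice>"))
```

Running the schema check before signing keeps structural errors from being mistaken for signature problems.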

Final Thoughts

XML signing in Python is feasible but requires meticulous attention to canonicalization, transforms, and digest calculations. While lxml and cryptography work, consider using specialized libraries like xmlsec for a more streamlined workflow. Always double-check the regulatory requirements for canonicalization methods, URI references, and certificate formatting—these details are often the difference between success and cryptic errors.
