How do I Calculate a Hash of a File Using Python

Daniyal Ahmed

9 months ago

I am excited to share with you one of my favorite little projects: a simple Python script that calculates the SHA-256 hash of a file. Have you ever needed to verify the integrity of a file or ensure it hasn’t been tampered with? I certainly have, and I found that cryptographic hashing is a powerful way to do just that. Whether you’re validating downloads, auditing data, or simply exploring the world of cryptography, this tool can be incredibly handy. Let’s dive into the code and break it down step by step!

The Code

Below is the complete Python script that computes a file’s hash. I encourage you to copy and paste this into a .py file or even a Jupyter notebook, then run it with your file of choice.

import hashlib

def file_hash(file_path, algo="sha256"):
    """
    Calculate the hash of a file using the specified algorithm.
    Default algorithm is SHA-256.
    """
    h = hashlib.new(algo)
    with open(file_path, "rb") as f:
        # Read the file in chunks to handle large files efficiently
        while chunk := f.read(8192):
            h.update(chunk)
    return h.hexdigest()

# Example usage
file_path = "clcoding.txt"
print("SHA-256 Hash:", file_hash(file_path))

Importing the hashlib Library

I started by importing Python’s built-in hashlib library. This library offers a suite of cryptographic hash functions, which means I can create hash objects for various algorithms like SHA-256, MD5, or SHA-1 without any additional installations. It’s one of those fantastic features of Python’s standard library that makes coding fun and efficient.
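To see what hashlib offers on your own machine, a short sketch like the one below can help. It lists the algorithms every Python build is guaranteed to include and shows that hashlib.new("sha256") and the named constructor hashlib.sha256() give the same result. The b"hello" input is just a placeholder I picked for illustration.

```python
import hashlib

# Algorithms that every Python build is guaranteed to provide
guaranteed = sorted(hashlib.algorithms_guaranteed)
print(guaranteed)

# hashlib.new() looks an algorithm up by name; named constructors such as
# hashlib.sha256() build the same kind of hash object directly.
by_name = hashlib.new("sha256", b"hello").hexdigest()
by_constructor = hashlib.sha256(b"hello").hexdigest()
print(by_name == by_constructor)
```

hashlib.algorithms_available may list more names than this, depending on the OpenSSL build your Python is linked against.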

Defining the `file_hash` Function

The heart of the script is the file_hash function. This function accepts two parameters:

file_path: The location of the file whose hash you want to compute (for example, "data.zip").
algo: The hashing algorithm to use, defaulting to "sha256".

Inside the function, here’s what I do:

Create a Hash Object:
I initialize a new hash object using hashlib.new(algo). This object will compute the hash using the chosen algorithm.
Open the File in Binary Mode:
I open the file in binary mode using open(file_path, "rb"). This is crucial because hash objects accept only bytes. Text mode would return decoded strings (and can translate line endings), which h.update() rejects.
Read in Chunks:
To efficiently handle large files, I read the file in 8192-byte chunks. This is done with a while loop that uses the := assignment expression, so it needs Python 3.8 or later:

while chunk := f.read(8192):
    h.update(chunk)

This approach ensures that even if your file is several gigabytes in size, it won’t overwhelm your system’s memory.

Return the Digest:
Finally, I call h.hexdigest() to retrieve the computed hash as a hexadecimal string, which serves as a fingerprint for your file.
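If you are wondering whether feeding the data in chunks changes the result, it does not. Here is a minimal sketch that compares a one-shot hash with an incremental one. The in-memory data variable is a stand-in I made up so the example runs without a file on disk.

```python
import hashlib

data = b"example payload " * 1000  # in-memory stand-in for file contents

# One-shot: hash all the bytes at once
one_shot = hashlib.sha256(data).hexdigest()

# Incremental: feed the same bytes in 8192-byte slices, like the loop above
h = hashlib.sha256()
for start in range(0, len(data), 8192):
    h.update(data[start:start + 8192])
incremental = h.hexdigest()

print(one_shot == incremental)  # True
```

Repeated calls to update() are equivalent to a single call with all the chunks concatenated, which is why the chunked loop is safe.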

Example Usage

In the example, the script calculates the SHA-256 hash of a file named clcoding.txt. You can replace "clcoding.txt" with any file path you wish to check. Running the code will print the hash to your console, giving you a quick and reliable way to verify file integrity.
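Printing the hash is only half of an integrity check; you usually want to compare it against a published value. The sketch below shows one way that could look. verify_file is a hypothetical helper name of my own, not part of the original script, and the expected hash would come from wherever the file's publisher lists it.

```python
import hashlib
import hmac

def file_hash(file_path, algo="sha256"):
    """Same function as in the script above, repeated so this block runs alone."""
    h = hashlib.new(algo)
    with open(file_path, "rb") as f:
        while chunk := f.read(8192):
            h.update(chunk)
    return h.hexdigest()

def verify_file(file_path, expected_hex, algo="sha256"):
    """Return True if the file's digest matches expected_hex (hypothetical helper)."""
    actual = file_hash(file_path, algo)
    # Normalize whitespace and case, then compare in constant time
    return hmac.compare_digest(actual, expected_hex.strip().lower())

# Example call (the expected value would be the hash published for your file):
# verify_file("clcoding.txt", "<published sha256 hex string>")
```

A plain == comparison would also work for checking downloads; hmac.compare_digest is just a cautious default.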

Why Use SHA-256?

I chose SHA-256 for this project because it’s one of the most widely used cryptographic hash functions. Here’s why:

Unique Fingerprint:
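SHA-256 produces a 256-bit digest, written as a 64-character hexadecimal string. Even the tiniest change in the file (say, a single bit) results in a completely different hash, and no practical method is known for finding two different files that share the same SHA-256 hash.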
Security and Reliability:
Although MD5 and SHA-1 can be faster, practical collision attacks have been demonstrated against both. For any application where security matters, SHA-256 (or even SHA-512) is a much safer bet.
Versatility:
Whether you’re verifying a file download, ensuring data consistency in your pipeline, or auditing file modifications, SHA-256 provides a robust solution.
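The "one bit changes everything" property is easy to try for yourself. This is a small sketch using an arbitrary byte string I chose for illustration: it flips a single bit of the input and counts how many of the 256 output bits differ.

```python
import hashlib

original = b"hello world"
# Flip the lowest bit of the first byte: 'h' (0x68) becomes 'i' (0x69)
modified = bytes([original[0] ^ 0x01]) + original[1:]

d1 = hashlib.sha256(original).hexdigest()
d2 = hashlib.sha256(modified).hexdigest()

# Count how many of the 256 output bits differ between the two digests
diff_bits = bin(int(d1, 16) ^ int(d2, 16)).count("1")

print(len(d1), len(d2))  # both digests are 64 hex characters
print(d1 == d2)          # False
print(diff_bits)
```

You should see that a large share of the output bits change, even though only one input bit did.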

Customize Your Hashing

One of the great things about this script is its flexibility. If you need a different algorithm, simply change the algo parameter. For instance:

MD5:

file_hash(file_path, "md5")

SHA-1:

file_hash(file_path, "sha1")

SHA-512:

file_hash(file_path, "sha512")

Just remember, while MD5 and SHA-1 might be faster, they are not recommended for security-critical applications because practical collision attacks exist for both.
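To get a feel for how the algorithms differ, and what happens if you mistype a name, here is a rough sketch. It assumes a standard Python build where MD5 and SHA-1 are enabled (some locked-down builds disable them), and the payload and the bogus algorithm name are placeholders of my own.

```python
import hashlib

payload = b"same input for every algorithm"

# Digest length in hex characters for each algorithm
lengths = {}
for name in ("md5", "sha1", "sha256", "sha512"):
    lengths[name] = len(hashlib.new(name, payload).hexdigest())
    print(f"{name:>7}: {lengths[name]} hex chars")

# An unknown algorithm name makes hashlib.new() raise ValueError
rejected = False
try:
    hashlib.new("not-a-real-algo")
except ValueError as exc:
    rejected = True
    print("Rejected:", exc)
```

Because file_hash passes algo straight to hashlib.new(), a typo in the algorithm name surfaces as a ValueError before the file is even opened.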

Handling Large Files

I paid special attention to how the script handles large files. By reading the file in chunks (8192 bytes at a time), the script remains memory-efficient. This means you can compute the hash for files that are several gigabytes in size without running into memory issues. If you ever need to adjust the chunk size, feel free to tweak that number to suit your file sizes and system capabilities.
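If you want to experiment with the chunk size, one option is to make it a parameter. The sketch below does that, and also shows hashlib.file_digest, which was added in Python 3.11 and handles the chunked reading for you. Both function names and the 1 MiB default are my own choices for illustration, not part of the original script.

```python
import hashlib

def file_hash_sized(file_path, algo="sha256", chunk_size=1024 * 1024):
    """Variant of file_hash with a configurable chunk size (hypothetical name)."""
    h = hashlib.new(algo)
    with open(file_path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def file_hash_builtin(file_path, algo="sha256"):
    """Same result via hashlib.file_digest; calling this needs Python 3.11+."""
    with open(file_path, "rb") as f:
        return hashlib.file_digest(f, algo).hexdigest()
```

The chunk size affects speed and memory use only; the resulting hash is the same whatever value you pick.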

Final Thoughts

I believe that calculating file hashes is an essential skill for anyone working with data or interested in security. This script is more than just a piece of code; it’s a tool that adds an extra layer of trust and security to your workflow. Whether you’re a developer, a security enthusiast, or a data professional, knowing how to verify file integrity is invaluable.