Explain to Dev

Empowering developers with the knowledge to build, create, and innovate in the software world.

Detecting Duplicate Files by Hash in Python

etd_admin, April 11, 2026

A common file-management problem is finding files that have different names but exactly the same content. A simple and reliable way to solve this in Python is to detect duplicates by content hash: scan each file, compute a hash value such as SHA-256, and group together the files that produce the same hash.

How it works

A hash function converts file content into a short, fixed-length string. Two files with identical content always produce the same hash, and with a strong algorithm like SHA-256 it is practically impossible for files with different content to collide.
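As a quick sketch with Python's standard hashlib module, identical byte content always yields an identical digest, while even a one-character change produces a completely different one:

```python
import hashlib

# Two identical byte strings produce identical SHA-256 digests.
a = hashlib.sha256(b"hello world").hexdigest()
b = hashlib.sha256(b"hello world").hexdigest()

# A single extra character changes the digest entirely.
c = hashlib.sha256(b"hello world!").hexdigest()

print(a == b)  # True
print(a == c)  # False
```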

The process is:

  1. Walk through all files in a directory.
  2. Open each file and calculate its content hash.
  3. Store file paths by hash value.
  4. Print the groups that have more than one file.

Example Python code:

from pathlib import Path
import hashlib
from collections import defaultdict


def hash_file(file_path, chunk_size=8192):
    sha256 = hashlib.sha256()

    with open(file_path, "rb") as f:
        while chunk := f.read(chunk_size):
            sha256.update(chunk)

    return sha256.hexdigest()


def find_duplicate_files(directory):
    hashes = defaultdict(list)

    for path in Path(directory).rglob("*"):
        if path.is_file():
            try:
                file_hash = hash_file(path)
                hashes[file_hash].append(str(path))
            except OSError as e:
                print(f"Could not read {path}: {e}")

    duplicates = {
        file_hash: paths
        for file_hash, paths in hashes.items()
        if len(paths) > 1
    }

    return duplicates


if __name__ == "__main__":
    folder = "your_directory_here"
    duplicates = find_duplicate_files(folder)

    if duplicates:
        print("Duplicate files found:\n")
        for file_hash, paths in duplicates.items():
            print(f"Hash: {file_hash}")
            for path in paths:
                print(f"  - {path}")
            print()
    else:
        print("No duplicate files found.")

Explanation of the code

The hash_file() function reads the file in small chunks and updates the SHA-256 hash step by step. This is important because large files should not be loaded fully into memory.

The find_duplicate_files() function scans every file inside the target directory and its subdirectories using Path(directory).rglob("*"). For each file, it computes the hash and stores the file path in a dictionary where the key is the hash.

If multiple files share the same hash, they are treated as duplicates.

When detecting duplicates by content hash, reading files in chunks is a best practice: it keeps memory usage low and works well even for very large files.
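As a small illustration of why chunked reading is safe, hashing in 8 KB pieces produces exactly the same digest as hashing all of the content at once (the 100 kB byte string here just stands in for a large file's contents):

```python
import hashlib

data = b"x" * 100_000  # stand-in for a large file's contents

# Hash everything in one call.
whole = hashlib.sha256(data).hexdigest()

# Hash in 8 KB chunks, as hash_file() does.
h = hashlib.sha256()
for i in range(0, len(data), 8192):
    h.update(data[i:i + 8192])

print(whole == h.hexdigest())  # True
```

Incremental `update()` calls let the script process files of any size with only a small, constant amount of memory.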

Example output:

Duplicate files found:

Hash: 3a7bd3e2360a3d...
  - folder/file1.txt
  - folder/copy/file1_copy.txt

Useful notes

  • SHA-256 is a strong and commonly used hash algorithm.
  • For very large folders, hashing every file can take time.
  • If you want extra safety, you can compare files byte-for-byte after matching hashes, although for most cases the hash match is enough.
  • You can also skip files based on extension, size, or folder if needed.
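The byte-for-byte check mentioned above can be done with Python's standard filecmp module. A minimal sketch (the two files here are created just for the example, standing in for a pair that matched by hash):

```python
import filecmp
import tempfile
from pathlib import Path

# Create two small files with identical content to stand in for a hash match.
tmp = Path(tempfile.mkdtemp())
p1 = tmp / "file1.txt"
p2 = tmp / "file1_copy.txt"
p1.write_text("same content")
p2.write_text("same content")

# shallow=False forces a byte-for-byte comparison instead of
# trusting os.stat() metadata such as size and modification time.
identical = filecmp.cmp(p1, p2, shallow=False)
print(identical)  # True
```

Running this extra check before deleting anything costs one more read per candidate pair but removes any doubt about a hash collision.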

The easiest way to detect duplicate files by content hash in Python is to walk the directory, calculate a hash for each file, and group files that share a hash. This method is simple, accurate, and easy to extend into cleanup tools or storage-management scripts.


