Explain to Dev

Empowering developers with the knowledge to build, create, and innovate in the software world.

Detecting Duplicate Files by Hash in Python

etd_admin, April 11, 2026

A common file-management problem is finding files that have different names but exactly the same content. A simple and reliable way to solve this in Python is to detect duplicates by content hash: scan each file, compute a hash value such as SHA-256, and group together the files that produce the same hash.

How it works

A hash function converts file content into a short, fixed-length string. Two files with identical content always produce the same hash, and with a strong algorithm like SHA-256 it is practically impossible for files with different content to collide.
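As a quick sketch with Python's standard hashlib module, identical byte content always yields an identical digest, while even a one-character change produces a completely different one:

```python
import hashlib

# Two identical byte strings produce identical SHA-256 digests.
a = hashlib.sha256(b"hello world").hexdigest()
b = hashlib.sha256(b"hello world").hexdigest()

# A single extra character changes the digest entirely.
c = hashlib.sha256(b"hello world!").hexdigest()

print(a == b)  # True
print(a == c)  # False
```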

The process is:

  1. Walk through all files in a directory.
  2. Open each file and calculate its content hash.
  3. Store file paths by hash value.
  4. Print the groups that have more than one file.

Example Python code:

from pathlib import Path
import hashlib
from collections import defaultdict


def hash_file(file_path, chunk_size=8192):
    sha256 = hashlib.sha256()

    with open(file_path, "rb") as f:
        while chunk := f.read(chunk_size):
            sha256.update(chunk)

    return sha256.hexdigest()


def find_duplicate_files(directory):
    hashes = defaultdict(list)

    for path in Path(directory).rglob("*"):
        if path.is_file():
            try:
                file_hash = hash_file(path)
                hashes[file_hash].append(str(path))
            except OSError as e:
                print(f"Could not read {path}: {e}")

    duplicates = {
        file_hash: paths
        for file_hash, paths in hashes.items()
        if len(paths) > 1
    }

    return duplicates


if __name__ == "__main__":
    folder = "your_directory_here"
    duplicates = find_duplicate_files(folder)

    if duplicates:
        print("Duplicate files found:\n")
        for file_hash, paths in duplicates.items():
            print(f"Hash: {file_hash}")
            for path in paths:
                print(f"  - {path}")
            print()
    else:
        print("No duplicate files found.")

Explanation of the code

The hash_file() function reads the file in small chunks and updates the SHA-256 hash step by step. This is important because large files should not be loaded fully into memory.

The find_duplicate_files() function scans every file inside the target directory and its subdirectories using Path(directory).rglob("*"). For each file, it computes the hash and stores the file path in a dictionary where the key is the hash.

If multiple files share the same hash, they are treated as duplicates.

When detecting duplicates by content hash, reading files in chunks is a best practice: it keeps memory usage low and works well even for very large files.
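As a small illustration of why chunked reading is safe, hashing in 8 KB pieces produces exactly the same digest as hashing all of the content at once (the 100 kB byte string here just stands in for a large file's contents):

```python
import hashlib

data = b"x" * 100_000  # stand-in for a large file's contents

# Hash everything in one call.
whole = hashlib.sha256(data).hexdigest()

# Hash in 8 KB chunks, as hash_file() does.
h = hashlib.sha256()
for i in range(0, len(data), 8192):
    h.update(data[i:i + 8192])

print(whole == h.hexdigest())  # True
```

Incremental `update()` calls let the script process files of any size with only a small, constant amount of memory.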

Example output:

Duplicate files found:

Hash: 3a7bd3e2360a3d...
  - folder/file1.txt
  - folder/copy/file1_copy.txt

Useful notes

  • SHA-256 is a strong and commonly used hash algorithm.
  • For very large folders, hashing every file can take time.
  • If you want extra safety, you can compare files byte-for-byte after matching hashes, although for most cases the hash match is enough.
  • You can also skip files based on extension, size, or folder if needed.
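The byte-for-byte check mentioned above can be done with Python's standard filecmp module. A minimal sketch (the two files here are created just for the example, standing in for a pair that matched by hash):

```python
import filecmp
import tempfile
from pathlib import Path

# Create two small files with identical content to stand in for a hash match.
tmp = Path(tempfile.mkdtemp())
p1 = tmp / "file1.txt"
p2 = tmp / "file1_copy.txt"
p1.write_text("same content")
p2.write_text("same content")

# shallow=False forces a byte-for-byte comparison instead of
# trusting os.stat() metadata such as size and modification time.
identical = filecmp.cmp(p1, p2, shallow=False)
print(identical)  # True
```

Running this extra check before deleting anything costs one more read per candidate pair but removes any doubt about a hash collision.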

The easiest way to detect duplicate files by content hash in Python is to walk the directory, calculate a hash for each file, and group files that share a hash. This method is simple, accurate, and easy to extend into cleanup tools or storage-management scripts.


