How to Efficiently Read Huge CSVs in Chunks and Process Them in Python

etd_admin, September 11, 2025

If a CSV won’t fit in memory, you don’t need a bigger machine—you need a streaming mindset. In this guide, we’ll walk through simple, reliable patterns to read a very large CSV file in chunks and process it in Python, so you can filter, aggregate, transform, and export results without crashing your laptop.

The core idea: stream rows, don’t load everything

Instead of pandas.read_csv("file.csv") (which loads the whole file), use chunked iteration. You’ll read a portion of rows at a time, do work on that slice, and then discard it before moving to the next slice. This keeps memory flat while your work scales to arbitrarily large files.

Quick start with pandas (chunksize)

pandas.read_csv supports chunked reads out of the box.

import pandas as pd

# Tune chunksize based on memory and work per chunk (e.g., 50_000–500_000)
chunks = pd.read_csv(
    "big.csv",
    chunksize=200_000,       # number of rows per chunk
    usecols=["user_id", "amount", "country"],  # read only what you need
    dtype={"user_id": "int64", "amount": "float64", "country": "category"},
    low_memory=True
)

total_by_country = {}

for i, df in enumerate(chunks, start=1):
    # Work on the chunk
    df = df[df["amount"] > 0]                # filter
    grp = df.groupby("country")["amount"].sum()

    # Merge partial results into running totals
    for country, s in grp.items():
        total_by_country[country] = total_by_country.get(country, 0.0) + float(s)

    print(f"Processed chunk {i}")

print("Totals:", total_by_country)

Why this works:

  • chunksize turns read_csv into an iterator.
  • usecols and dtype reduce memory and speed parsing.
  • Each loop processes only ~N rows at a time, keeping RSS steady.
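One small detail worth knowing: the object read_csv returns when chunksize is set is a plain iterator (a TextFileReader), and in recent pandas versions it also works as a context manager, which closes the underlying file handle as soon as the loop ends. A minimal sketch, reusing the same file and the amount column from above:

import pandas as pd

# The chunked reader is an iterator; using it as a context manager
# (supported in recent pandas) closes the file promptly when done.
with pd.read_csv("big.csv", chunksize=200_000, usecols=["amount"]) as reader:
    running_total = 0.0
    for chunk in reader:
        running_total += chunk["amount"].sum()

print(f"Grand total: {running_total}")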

Writing chunked results out as you go

If your processing generates rows to save, append them incrementally—don’t store everything in a Python list first.

import pandas as pd
from pathlib import Path

out_path = Path("filtered.csv")
header_written = False

for df in pd.read_csv("big.csv", chunksize=100_000):
    # Example: keep only paid orders
    small = df.query("status == 'PAID'")[["order_id", "user_id", "amount"]]
    small.to_csv(out_path, mode="a", index=False, header=not header_written)
    header_written = True

Tips:

  • Use mode="a" to append.
  • Write the header only once.
  • Prefer column subsets to lower I/O.
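One pitfall with append mode: if filtered.csv is left over from an earlier run, the new rows pile up behind the stale ones. A small sketch of a safeguard, assuming Python 3.8+ for missing_ok:

from pathlib import Path

out_path = Path("filtered.csv")
# Delete any output left over from a previous run so mode="a" starts clean
out_path.unlink(missing_ok=True)   # missing_ok requires Python 3.8+

# ...then run the chunked append loop shown above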

Stream with the standard library (csv) for ultra-low overhead

When you don’t need pandas’ power (groupby, joins, etc.), the built-in csv module is minimal and memory-efficient.

import csv

totals = {}

with open("big.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:  # iterates one row at a time
        if row["status"] == "PAID":
            country = row["country"]
            amount = float(row["amount"])
            totals[country] = totals.get(country, 0.0) + amount

print(totals)

This reads one record at a time (constant memory), ideal for simple passes.
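The same row-at-a-time idea works for output. Here is a sketch that streams filtered rows straight into a new file with csv.DictWriter; the paid.csv output name and the column subset are placeholders, and the columns mirror the earlier examples:

import csv

fields = ["order_id", "user_id", "amount"]

with open("big.csv", newline="", encoding="utf-8") as src, \
     open("paid.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    for row in reader:               # one row in, at most one row out
        if row["status"] == "PAID":
            writer.writerow({k: row[k] for k in fields})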

Combine chunked partials into a single result

Two common patterns:

1) Reduce to scalars/maps (running totals)

As shown earlier, accumulate values into dictionaries or counters.

from collections import Counter
import pandas as pd

country_counts = Counter()

for df in pd.read_csv("big.csv", chunksize=200_000, usecols=["country"]):
    country_counts.update(df["country"].value_counts().to_dict())

print(country_counts.most_common(10))
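Once the pass finishes, the accumulated partials are tiny compared to the input, so they can be turned back into a regular pandas object for reporting. A quick sketch, continuing from the Counter above (the output file name is a placeholder):

import pandas as pd

# country_counts is the Counter built in the loop above
report = pd.Series(country_counts, name="rows").sort_values(ascending=False)
report.to_csv("country_counts.csv", header=True)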

2) Append to an on-disk store (Parquet/Feather/CSV/DB)

For large outputs, write each chunk out. Parquet is efficient and columnar.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

writer = None

for df in pd.read_csv("big.csv", chunksize=250_000):
    table = pa.Table.from_pandas(df, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter("big.parquet", table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()

Parquet advantages:

  • Smaller files than CSV.
  • Much faster reads for analytics later.
  • Schema stored once; works well with Arrow/pandas/Polars.
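One caveat with the ParquetWriter pattern above: every chunk must produce the same Arrow schema, and pandas may infer different dtypes per chunk (for example, a column that happens to be empty in one slice). Pinning dtypes in read_csv avoids the mismatch; a sketch with assumed column names and types:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Assumed columns/types; adjust to your file (user_id assumed to have no missing values)
dtypes = {"user_id": "int64", "amount": "float64", "country": "string"}

writer = None
for df in pd.read_csv("big.csv", chunksize=250_000, usecols=list(dtypes), dtype=dtypes):
    table = pa.Table.from_pandas(df, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter("big.parquet", table.schema)
    writer.write_table(table)  # every chunk now matches the first chunk's schema

if writer is not None:
    writer.close()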

Apply functions safely per chunk

Avoid row-wise apply calls and lambdas that loop over rows in Python when possible. Prefer vectorized operations:

import pandas as pd

def clean_chunk(df: pd.DataFrame) -> pd.DataFrame:
    # Vectorized transforms
    df = df.assign(
        amount_usd=df["amount"] * 1.12,
        is_high_value=df["amount"] >= 1000,
    )
    return df

for df in pd.read_csv("big.csv", chunksize=100_000):
    df = clean_chunk(df)
    # write or aggregate...
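To make the contrast concrete, here is a small sketch of the same transform done two ways on a toy frame; the row-wise apply runs a Python function once per row, while the vectorized version operates on the whole column at once and is typically far faster on large chunks:

import pandas as pd

df = pd.DataFrame({"amount": [10.0, 250.0, 1500.0]})

# Slow: a Python-level function call for every row
df["amount_usd_slow"] = df.apply(lambda row: row["amount"] * 1.12, axis=1)

# Fast: one vectorized operation over the whole column
df["amount_usd_fast"] = df["amount"] * 1.12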

When the file is really huge (or you want parallelism)

Dask: keeps a pandas-like API but plans work lazily and executes in parallel.

import dask.dataframe as dd

ddf = dd.read_csv("big*.csv", usecols=["country", "amount"], dtype={"country": "category", "amount": "float64"})
result = ddf[ddf["amount"] > 0].groupby("country")["amount"].sum().compute()
print(result)

Polars (with scan_csv) and pyarrow.dataset can also stream CSVs lazily and tend to perform very well.
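As an illustration, a minimal Polars sketch assuming the same country and amount columns as above; scan_csv builds a lazy query plan and nothing is materialized until collect():

import polars as pl

result = (
    pl.scan_csv("big.csv")
      .filter(pl.col("amount") > 0)
      .group_by("country")          # spelled groupby in older Polars releases
      .agg(pl.col("amount").sum())
      .collect()
)
print(result)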

Choose these when a single-process, chunked pandas loop becomes the bottleneck and you can benefit from multi-core execution.

End-to-end example: Filter, aggregate, export

import pandas as pd
from pathlib import Path

INPUT = "orders_2024.csv.gz"
OUT_FILTERED = Path("orders_paid.csv.gz")

header_written = False
country_totals = {}

for chunk in pd.read_csv(
    INPUT,
    chunksize=300_000,
    compression="gzip",
    usecols=["order_id", "user_id", "country", "amount", "status"],
    dtype={"order_id": "int64", "user_id": "int64", "country": "category", "amount": "float64", "status": "category"},
):
    paid = chunk.query("status == 'PAID' and amount > 0")

    # Append filtered rows to compressed CSV
    paid.to_csv(
        OUT_FILTERED,
        mode="a",
        index=False,
        header=not header_written,
        compression="gzip",
    )
    header_written = True

    # Update totals
    grp = paid.groupby("country")["amount"].sum()
    for c, s in grp.items():
        country_totals[c] = country_totals.get(c, 0.0) + float(s)

print("Country totals:", country_totals)
print(f"Wrote filtered rows to {OUT_FILTERED}")

This pattern will scale to tens or hundreds of millions of rows while keeping memory stable.

To read a very large CSV file in chunks and process it in Python, stream the data with chunksize, process each slice with vectorized operations, and write results as you go. You get predictable memory usage and excellent throughput even on modest hardware, and the same mindset scales up to Dask or Polars when you need parallelism or more speed. Master these patterns and file size stops being a constraint for ETL, analytics, and data-cleaning work of any size.
