# How to Efficiently Read Huge CSVs in Chunks and Process Them in Python

etd_admin, September 11, 2025

If a CSV won't fit in memory, you don't need a bigger machine; you need a streaming mindset. In this guide, we'll walk through simple, reliable patterns to read a very large CSV file in chunks and process it in Python, so you can filter, aggregate, transform, and export results without crashing your laptop.

## The core idea: stream rows, don't load everything

Instead of `pandas.read_csv("file.csv")` (which loads the whole file), use chunked iteration. You read a portion of rows at a time, do the work on that slice, and discard it before moving to the next slice. This keeps memory flat while your work scales to arbitrarily large files.

## Quick start with pandas (chunksize)

`pandas.read_csv` supports chunked reads out of the box.

```python
import pandas as pd

# Tune chunksize based on memory and work per chunk (e.g., 50_000–500_000)
chunks = pd.read_csv(
    "big.csv",
    chunksize=200_000,  # number of rows per chunk
    usecols=["user_id", "amount", "country"],  # read only what you need
    dtype={"user_id": "int64", "amount": "float64", "country": "category"},
    low_memory=True,
)

total_by_country = {}
for i, df in enumerate(chunks, start=1):
    # Work on the chunk
    df = df[df["amount"] > 0]  # filter
    grp = df.groupby("country")["amount"].sum()

    # Merge partial results into running totals
    for country, s in grp.items():
        total_by_country[country] = total_by_country.get(country, 0.0) + float(s)

    print(f"Processed chunk {i}")

print("Totals:", total_by_country)
```

Why this works:

- `chunksize` turns `read_csv` into an iterator.
- `usecols` and `dtype` reduce memory and speed up parsing.
- Each loop iteration processes only ~N rows at a time, keeping resident memory (RSS) steady.

## Writing chunked results out as you go

If your processing generates rows to save, append them incrementally; don't store everything in a Python list first.

```python
import pandas as pd
from pathlib import Path

out_path = Path("filtered.csv")
header_written = False

for df in pd.read_csv("big.csv", chunksize=100_000):
    # Example: keep only paid orders
    small = df.query("status == 'PAID'")[["order_id", "user_id", "amount"]]
    small.to_csv(out_path, mode="a", index=False, header=not header_written)
    header_written = True
```

Tips:

- Use `mode="a"` to append (a single-open-handle variant is sketched below).
- Write the header only once.
- Prefer column subsets to lower I/O.
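Because `to_csv` also accepts an open file handle, you can open the output once instead of reopening it for every chunk. Here is a minimal sketch of that variant, assuming the same hypothetical `big.csv` with `status`, `order_id`, `user_id`, and `amount` columns as above:

```python
import pandas as pd

# Open the output once; every chunk appends to the same handle.
with open("filtered.csv", "w", newline="", encoding="utf-8") as out:
    for i, df in enumerate(pd.read_csv("big.csv", chunksize=100_000)):
        small = df.query("status == 'PAID'")[["order_id", "user_id", "amount"]]
        small.to_csv(out, index=False, header=(i == 0))  # header only on the first chunk
```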
## Stream with the standard library (csv) for ultra-low overhead

When you don't need pandas' power (groupby, joins, etc.), the built-in `csv` module is minimal and memory-efficient.

```python
import csv

totals = {}
with open("big.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:  # iterates one row at a time
        if row["status"] == "PAID":
            country = row["country"]
            amount = float(row["amount"])
            totals[country] = totals.get(country, 0.0) + amount

print(totals)
```

This reads one record at a time (constant memory), ideal for simple passes.

## Combine chunked partials into a single result

Two common patterns:

### 1) Reduce to scalars/maps (running totals)

As shown earlier, accumulate values into dictionaries or counters.

```python
from collections import Counter

import pandas as pd

country_counts = Counter()
for df in pd.read_csv("big.csv", chunksize=200_000, usecols=["country"]):
    country_counts.update(df["country"].value_counts().to_dict())

print(country_counts.most_common(10))
```

### 2) Append to an on-disk store (Parquet/Feather/CSV/DB)

For large outputs, write each chunk out. Parquet is efficient and columnar.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
for df in pd.read_csv("big.csv", chunksize=250_000):
    table = pa.Table.from_pandas(df, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter("big.parquet", table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()
```

Parquet advantages:

- Smaller files than CSV.
- Much faster reads for analytics later.
- Schema stored once; works well with Arrow/pandas/Polars.

## Apply functions safely per chunk

Avoid lambdas that create Python-level loops over rows when possible. Prefer vectorized ops:

```python
import pandas as pd

def clean_chunk(df: pd.DataFrame) -> pd.DataFrame:
    # Vectorized transforms
    df = df.assign(
        amount_usd=df["amount"] * 1.12,
        is_high_value=df["amount"] >= 1000,
    )
    return df

for df in pd.read_csv("big.csv", chunksize=100_000):
    df = clean_chunk(df)
    # write or aggregate...
```
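To make the "avoid Python-level row loops" advice concrete, here is a small, self-contained comparison on synthetic data (not from the examples above): a row-wise `apply` calls the lambda once per row, while the vectorized form runs as a single NumPy operation over the whole column and is typically far faster on large chunks.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one chunk of a big CSV
chunk = pd.DataFrame({"amount": np.random.default_rng(0).uniform(0, 2000, 500_000)})

# Row-wise: a Python-level loop under the hood (slow)
usd_slow = chunk["amount"].apply(lambda x: x * 1.12)

# Vectorized: one operation over the entire column (fast)
usd_fast = chunk["amount"] * 1.12

assert np.allclose(usd_slow, usd_fast)  # same result, very different cost
```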
## When the file is really huge (or you want parallelism)

Dask keeps a pandas-like API but plans work lazily and executes it in parallel.

```python
import dask.dataframe as dd

ddf = dd.read_csv(
    "big*.csv",
    usecols=["country", "amount"],
    dtype={"country": "category", "amount": "float64"},
)
result = ddf[ddf["amount"] > 0].groupby("country")["amount"].sum().compute()
print(result)
```

Polars (with `scan_csv`) or `pyarrow.dataset` can also stream efficiently with strong performance; a minimal Polars sketch appears at the end of this post. Choose these tools when a single-process, chunked pandas loop becomes the bottleneck and you can benefit from multi-core execution.

## End-to-end example: Filter, aggregate, export

```python
import pandas as pd
from pathlib import Path

INPUT = "orders_2024.csv.gz"
OUT_FILTERED = Path("orders_paid.csv.gz")

header_written = False
country_totals = {}

for chunk in pd.read_csv(
    INPUT,
    chunksize=300_000,
    compression="gzip",
    usecols=["order_id", "user_id", "country", "amount", "status"],
    dtype={
        "order_id": "int64",
        "user_id": "int64",
        "country": "category",
        "amount": "float64",
        "status": "category",
    },
):
    paid = chunk.query("status == 'PAID' and amount > 0")

    # Append filtered rows to compressed CSV
    paid.to_csv(
        OUT_FILTERED,
        mode="a",
        index=False,
        header=not header_written,
        compression="gzip",
    )
    header_written = True

    # Update totals
    grp = paid.groupby("country")["amount"].sum()
    for c, s in grp.items():
        country_totals[c] = country_totals.get(c, 0.0) + float(s)

print("Country totals:", country_totals)
print(f"Wrote filtered rows to {OUT_FILTERED}")
```

This pattern scales to tens or hundreds of millions of rows while keeping memory stable.

To read a very large CSV file in chunks and process it in Python, stream your data with `chunksize`, process each slice with vectorized operations, and write results as you go. You'll get predictable memory usage and excellent throughput even on modest hardware, and with the same mindset you can scale up to Dask or Polars when you need parallelism or more speed. Mastering these patterns lets you handle ETL, analytics, and data cleaning tasks of any size, so the file size stops being your problem and becomes just another input.
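For reference, here is a minimal Polars sketch of the same country-totals aggregation, assuming a recent Polars release and the same hypothetical `big.csv` used throughout. `scan_csv` builds a lazy query, so nothing is read from disk until `collect()` runs.

```python
import polars as pl

# Lazy scan: the file is only read when collect() executes the query
result = (
    pl.scan_csv("big.csv")
    .filter(pl.col("amount") > 0)
    .group_by("country")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect()  # recent Polars versions can also run this on the streaming engine
)
print(result)
```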