How to Find and Extract All URLs from a Text File in Python (With Clean, Reliable Results)

etd_admin, September 11, 2025

If you need to find and extract all URLs from a given text file in Python, you can do it in a few lines—and do it well enough to handle real-world messiness (trailing punctuation, duplicates, bad fragments). Below are practical, copy-pasteable patterns that scale from a quick script to production-ready parsing.

What counts as a “URL”?

In everyday text, links show up as:

  • https://example.com/path?q=1
  • http://sub.domain.co.uk
  • www.example.org/docs
  • Bare domains in emails or logs (rare): example.com

We’ll target standard web links (http://, https://, and www.), and we’ll filter obviously broken hits.
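
As a running example, suppose sample.txt (used by the scripts below) contains a made-up snippet like this:

Release notes: https://example.com/path?q=1 (mirrored at http://sub.domain.co.uk).
Docs moved to www.example.org/docs, see https://example.com/path?q=1 again.

The goal is to end up with the three distinct links, minus the trailing punctuation and the duplicate.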

The quick recipe (regex + urllib.parse cleanup)

This approach works for most cases and is easy to maintain.

1) Use a pragmatic regex

A strict, RFC-compliant regex is huge and slow. A pragmatic pattern works better in practice:

URL_PATTERN = r"""
(?:
    https?://
    | www\.
)
[^\s<>"')\]]+        # one or more non-space, non-closing-delimiter chars
"""

It catches http://, https://, and www. links.

It stops at the closing delimiters excluded by the character class – ), ], ', ", and > – which often wrap links in prose.

Tip: We’ll still post-clean each match to trim any leftover trailing punctuation.
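
A quick sanity check before wiring the pattern into a file reader (the sample string here is made up):

import re

# URL_PATTERN as defined above
sample = 'See https://example.com/docs (and www.example.org), plus "http://sub.domain.co.uk".'
print(re.findall(URL_PATTERN, sample, re.IGNORECASE | re.VERBOSE))
# ['https://example.com/docs', 'www.example.org', 'http://sub.domain.co.uk']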

2) Read the file, find matches, normalize, deduplicate

from pathlib import Path
import re
from urllib.parse import urlparse

URL_REGEX = re.compile(URL_PATTERN, re.IGNORECASE | re.VERBOSE)

TRAILING_PUNCT = '.,;:!?)"\''  # trim these if they cling to the end

def is_probable_url(s: str) -> bool:
    """
    Accept http(s) links and www.* links; reject fragments that look broken.
    """
    if s.startswith("www."):
        return True
    parsed = urlparse(s)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)

def normalize_url(s: str) -> str:
    """
    - Trim common trailing punctuation
    - Lowercase the scheme + host only (leave path/query case as-is)
    - Ensure 'www.' links get a scheme for consistency
    """
    s = s.rstrip(TRAILING_PUNCT)
    if s.startswith("www."):
        s = "https://" + s

    parsed = urlparse(s)
    if parsed.scheme and parsed.netloc:
        # Lowercase scheme + host; keep path/query/fragment untouched
        s = parsed._replace(scheme=parsed.scheme.lower(),
                            netloc=parsed.netloc.lower()).geturl()
    return s

def extract_urls_from_text(text: str) -> list[str]:
    raw = URL_REGEX.findall(text)
    cleaned = []
    for u in raw:
        u = normalize_url(u)
        if is_probable_url(u):
            cleaned.append(u)

    # Deduplicate but preserve order
    seen = set()
    unique = []
    for u in cleaned:
        if u not in seen:
            seen.add(u)
            unique.append(u)
    return unique

def extract_urls_from_file(path: str | Path) -> list[str]:
    p = Path(path)
    text = p.read_text(encoding="utf-8", errors="ignore")
    return extract_urls_from_text(text)

if __name__ == "__main__":
    urls = extract_urls_from_file("sample.txt")
    for u in urls:
        print(u)
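
If you just want to check behavior interactively, call extract_urls_from_text on a string; this input (and the expected output) is only an illustration:

messy = (
    "Read the docs at www.Example.org/Docs, then see "
    "https://example.com/path?q=1 (or https://example.com/path?q=1)."
)
print(extract_urls_from_text(messy))
# ['https://www.example.org/Docs', 'https://example.com/path?q=1']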

Why this works well:

  • The regex is fast and avoids pathological backtracking.
  • We fix common issues (e.g., www.example.com → https://www.example.com).
  • We filter out obvious junk via urlparse.
  • We deduplicate while preserving the first occurrence order.

Handling big files (streaming)

For multi-gigabyte logs, avoid loading everything into memory. Process line by line:

def extract_urls_streaming(path: str | Path):
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        seen = set()
        for line in f:
            for u in URL_REGEX.findall(line):
                u = normalize_url(u)
                if is_probable_url(u) and u not in seen:
                    seen.add(u)
                    yield u

# Usage
for url in extract_urls_streaming("huge.log"):
    print(url)

This keeps memory usage low and starts producing results immediately.

Common pitfalls and how we guard against them

  1. Trailing punctuation – prose often leaves a stray parenthesis, period, comma, quote, or exclamation mark stuck to the end of a link.
    Fix: rstrip(TRAILING_PUNCT).
  2. Mixed case hosts – hosts are case-insensitive; paths are not.
    Fix: Lowercase scheme + host only.
  3. “www.” without a scheme – some tools expect absolute URLs.
    Fix: Prepend https://.
  4. Duplicates – the same link appears many times.
    Fix: Ordered de-duplication (seen set).
  5. False positives – e.g., http:/broken.
    Fix: urlparse validation (scheme in {http, https} and netloc present).
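
To see these guards in action, here is a small, made-up check using the helpers defined above:

print(normalize_url("http://Example.COM/Page)."))      # http://example.com/Page
print(normalize_url("www.example.org"))                # https://www.example.org
print(is_probable_url(normalize_url("http:/broken")))  # False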

If your file is actually HTML

If the “text file” contains HTML, parse it with BeautifulSoup to avoid pulling links from scripts or attributes you don’t want:

from bs4 import BeautifulSoup
from pathlib import Path

def extract_urls_from_html_file(path: str | Path) -> list[str]:
    html = Path(path).read_text(encoding="utf-8", errors="ignore")
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for a in soup.find_all("a", href=True):
        u = normalize_url(a["href"])
        if is_probable_url(u):
            hrefs.append(u)
    # Deduplicate
    seen, unique = set(), []
    for u in hrefs:
        if u not in seen:
            seen.add(u)
            unique.append(u)
    return unique

Use this method for real web pages; for plain text logs/emails, stick with the regex approach.
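
Usage mirrors the plain-text version (the filename is just a placeholder):

for url in extract_urls_from_html_file("page.html"):
    print(url)

Note that relative hrefs such as /docs are dropped by is_probable_url; if you also want those, one option (not covered above) is to resolve them against a known base URL with urllib.parse.urljoin before validating.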

Performance notes

  • Regex choice matters. Avoid overly strict patterns—real-world text is noisy. The above pattern is intentionally simple and fast.
  • Streaming beats buffering for large files.
  • If you need to process millions of lines, consider:
    • Compiling the regex once (we did).
    • Avoiding expensive per-match operations.
    • Writing outputs in batches.
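
As a sketch of that last point, here is one way to batch the streaming output into a file; the batch size and paths are arbitrary:

from itertools import islice

def write_urls_in_batches(src_path, out_path, batch_size=10_000):
    urls = extract_urls_streaming(src_path)
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            batch = list(islice(urls, batch_size))  # pull up to batch_size URLs
            if not batch:
                break
            out.write("\n".join(batch) + "\n")      # one write call per batch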

Security & hygiene tips

  • Don’t fetch URLs blindly just because you extracted them.
  • If you need to ping them, rate-limit requests and validate domains against an allowlist.
  • Log the source line number for auditing if required (modify the streaming function to include enumerate(f, start=1)).
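
As a sketch of that last tip, the streaming function can yield (line_number, url) pairs with a small change:

def extract_urls_with_lines(path):
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        seen = set()
        for lineno, line in enumerate(f, start=1):
            for u in URL_REGEX.findall(line):
                u = normalize_url(u)
                if is_probable_url(u) and u not in seen:
                    seen.add(u)
                    yield lineno, u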

You’ve now seen a pragmatic way to find and extract all URLs from a text file in Python: a compact regex, careful normalization, and light validation. For massive logs, stream; for HTML, parse anchors. The same functions drop straight into a pipeline or CLI, so you can export or analyze the extracted links across different projects and environments.
