# How to Find and Extract All URLs from a Text File in Python (With Clean, Reliable Results)

etd_admin, September 11, 2025

If you need to find and extract all URLs from a text file in Python, you can do it in a few lines, and do it well enough to handle real-world messiness: trailing punctuation, duplicates, and broken fragments. Below are practical, copy-pasteable patterns that scale from a quick script to production-ready parsing.

## What counts as a “URL”?

In everyday text, links show up as:

- `https://example.com/path?q=1`
- `http://sub.domain.co.uk`
- `www.example.org/docs`
- Bare domains in emails or logs (rare): `example.com`

We'll target standard web links (`http://`, `https://`, and `www.`), and we'll filter out obviously broken hits.

## The quick recipe (regex + urllib.parse cleanup)

This approach works for most cases and is easy to maintain.

### 1) Use a pragmatic regex

A strict, RFC-compliant regex is huge and slow. A pragmatic pattern works better in practice:

```python
URL_PATTERN = r"""
    (?:
        https?://
        |
        www\.
    )
    [^\s<>"')\]]+   # one or more non-space, non-closing-delimiter chars
"""
```

It catches `http://`, `https://`, and `www.` links, and it stops before common trailing delimiters (`)`, `]`, `'`, `"`, `>`) that often appear in prose.

Tip: we'll still post-clean each match to trim any leftover trailing punctuation.

### 2) Read the file, find matches, normalize, deduplicate

```python
from pathlib import Path
import re
from urllib.parse import urlparse

URL_REGEX = re.compile(URL_PATTERN, re.IGNORECASE | re.VERBOSE)
TRAILING_PUNCT = '.,;:!?)"\''  # trim these if they cling to the end


def is_probable_url(s: str) -> bool:
    """Accept http(s) links and www.* links; reject fragments that look broken."""
    if s.startswith("www."):
        return True
    parsed = urlparse(s)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def normalize_url(s: str) -> str:
    """
    - Trim common trailing punctuation
    - Lowercase the scheme + host only (leave path/query case as-is)
    - Ensure 'www.' links get a scheme for consistency
    """
    s = s.rstrip(TRAILING_PUNCT)
    if s.startswith("www."):
        s = "https://" + s
    parsed = urlparse(s)
    if parsed.scheme and parsed.netloc:
        # Lowercase scheme + host; keep path/query/fragment untouched
        s = parsed._replace(scheme=parsed.scheme.lower(),
                            netloc=parsed.netloc.lower()).geturl()
    return s


def extract_urls_from_text(text: str) -> list[str]:
    raw = URL_REGEX.findall(text)
    cleaned = []
    for u in raw:
        u = normalize_url(u)
        if is_probable_url(u):
            cleaned.append(u)
    # Deduplicate but preserve order
    seen = set()
    unique = []
    for u in cleaned:
        if u not in seen:
            seen.add(u)
            unique.append(u)
    return unique


def extract_urls_from_file(path: str | Path) -> list[str]:
    p = Path(path)
    text = p.read_text(encoding="utf-8", errors="ignore")
    return extract_urls_from_text(text)


if __name__ == "__main__":
    urls = extract_urls_from_file("sample.txt")
    for u in urls:
        print(u)
```
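As a quick sanity check, here's an illustrative run of `extract_urls_from_text` on a deliberately messy snippet. The sample string and the expected output in the comments are ours, assuming the code above is used unchanged:

```python
sample = (
    "Docs live at https://Example.COM/Guide). "
    "See also www.example.org/start, or the mirror at http://mirror.example.net/path."
)

for u in extract_urls_from_text(sample):
    print(u)

# Expected output (punctuation trimmed, scheme/host lowercased, scheme added to the www link):
# https://example.com/Guide
# https://www.example.org/start
# http://mirror.example.net/path
```

Note that the path segment `/Guide` keeps its case; only the scheme and host are lowercased.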
Why this works well:

- The regex is fast and avoids pathological backtracking.
- We fix common issues (e.g., `www.example.com` → `https://www.example.com`).
- We filter out obvious junk via `urlparse`.
- We deduplicate while preserving first-occurrence order.

## Handling big files (streaming)

For multi-gigabyte logs, avoid loading everything into memory. Process line by line:

```python
def extract_urls_streaming(path: str | Path):
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        seen = set()
        for line in f:
            for u in URL_REGEX.findall(line):
                u = normalize_url(u)
                if is_probable_url(u) and u not in seen:
                    seen.add(u)
                    yield u


# Usage
for url in extract_urls_streaming("huge.log"):
    print(url)
```

This keeps memory usage low and starts producing results immediately.

## Common pitfalls and how we guard against them

- Trailing punctuation – prose often ends links with `)`, `.`, `,`, `"`, or `!`. Fix: `rstrip(TRAILING_PUNCT)`.
- Mixed-case hosts – hosts are case-insensitive; paths are not. Fix: lowercase the scheme and host only.
- `www.` without a scheme – some tools expect absolute URLs. Fix: prepend `https://`.
- Duplicates – the same link appears many times. Fix: ordered de-duplication (the `seen` set).
- False positives – e.g., `http:/broken`. Fix: `urlparse` validation (scheme in `{http, https}` and netloc present), as illustrated in the quick check below.
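To see those guards behave as described, here's a small, illustrative check. It assumes `normalize_url` and `is_probable_url` from the recipe above are in scope, and the results in the comments follow from that code:

```python
from urllib.parse import urlparse

# Trailing punctuation is stripped, a scheme is added, and the host is
# lowercased while the path keeps its case.
print(normalize_url("www.Example.COM/Docs),"))
# -> https://www.example.com/Docs

# A malformed link like "http:/broken" parses with an empty netloc,
# so the scheme + netloc check in is_probable_url() rejects it.
print(repr(urlparse("http:/broken").netloc))   # -> ''
print(is_probable_url("http:/broken"))         # -> False
```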
## If your file is actually HTML

If the “text file” contains HTML, parse it with BeautifulSoup so you don't pull links out of scripts or attributes you don't want:

```python
from pathlib import Path

from bs4 import BeautifulSoup


def extract_urls_from_html_file(path: str | Path) -> list[str]:
    html = Path(path).read_text(encoding="utf-8", errors="ignore")
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for a in soup.find_all("a", href=True):
        u = normalize_url(a["href"])
        if is_probable_url(u):
            hrefs.append(u)
    # Deduplicate while preserving order
    seen, unique = set(), []
    for u in hrefs:
        if u not in seen:
            seen.add(u)
            unique.append(u)
    return unique
```

Use this method for real web pages; for plain-text logs and emails, stick with the regex approach.

## Performance notes

- Regex choice matters. Avoid overly strict patterns; real-world text is noisy. The pattern above is intentionally simple and fast.
- Streaming beats buffering for large files.
- If you need to process millions of lines, consider compiling the regex once (we did), avoiding expensive per-match operations, and writing output in batches.

## Security & hygiene tips

- Don't fetch URLs blindly just because you extracted them.
- If you need to ping them, rate-limit requests and validate domains against an allowlist.
- If you need an audit trail, log the source line number (modify the streaming function to use `enumerate(f, start=1)`).

You've now seen a pragmatic way to find and extract all URLs from a text file in Python: a compact regex, careful normalization, and light validation. For massive logs, stream; for HTML, parse anchors. With these patterns you can pull out the links and immediately export or further analyze them, and if you're building a pipeline or CLI, the same functions drop right in, as the sketch below shows.
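For example, a minimal command-line wrapper might look like the sketch below. It assumes the recipe above has been saved as a module named `url_extract.py` (a name chosen here purely for illustration):

```python
# url_cli.py -- a minimal CLI sketch around the extraction functions above.
import argparse

from url_extract import extract_urls_from_file  # hypothetical module holding the recipe code


def main() -> None:
    parser = argparse.ArgumentParser(description="Extract URLs from a text file.")
    parser.add_argument("path", help="Path to the text file to scan")
    parser.add_argument("-o", "--output", help="Optional file to write URLs to, one per line")
    args = parser.parse_args()

    urls = extract_urls_from_file(args.path)
    if args.output:
        with open(args.output, "w", encoding="utf-8") as f:
            f.write("\n".join(urls) + "\n")
    else:
        for url in urls:
            print(url)


if __name__ == "__main__":
    main()
```

Run it as `python url_cli.py sample.txt -o urls.txt` to write the results to a file, or omit `-o` to print them to stdout.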