Cleaning HTML in Django Fields: Recipes, Regex, and Commands

When managing large amounts of HTML stored in Django model fields (e.g. an html_rewrite field), we often need to clean, sanitize, or strip elements while preserving the meaningful text. This guide documents practical patterns using BeautifulSoup, regex-based cleanup, and Django management commands.


1. Remove Tags That Contain Certain Keywords

Goal: Remove entire HTML tags that include any specified keyword (e.g. "cookiebeleid", "powered by", "facebook").

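from bs4 import BeautifulSoup

keywords = ['cookiebeleid', 'powered by', 'facebook', 'instagram', 'privacybeleid']
soup = BeautifulSoup(html, 'html.parser')  # html: the field's raw HTML string
for tag in soup.find_all():
    text = tag.get_text().lower()  # includes the text of all nested tags
    if any(word in text for word in keywords):
        tag.decompose()  # removes the tag and everything inside it

⚠️ Note: This can be destructive. get_text() includes the text of every nested tag, and find_all() visits outer tags first, so a keyword inside one paragraph also matches each of its ancestors. The outermost matching tag (a wrapper <div>, or even <html>) is the one that gets removed, along with any important content it holds.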
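To see why, it helps to list which tags the check matches before removing anything. The following is a minimal dry-run sketch; the sample HTML and the single keyword are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a wrapper <div> around two paragraphs
html = '<div id="page"><p>Welcome</p><p>Powered by Acme</p></div>'
keywords = ['powered by']

soup = BeautifulSoup(html, 'html.parser')

# Dry run: record which tags would be removed instead of removing them
matching = [
    tag.name
    for tag in soup.find_all()
    if any(word in tag.get_text().lower() for word in keywords)
]
print(matching)  # ['div', 'p']
```

The wrapper <div> matches before the paragraph does, so decomposing it would also discard the unrelated "Welcome" paragraph.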


2. Remove Only the Tag Containing a Keyword, Not Siblings

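This version removes only the innermost tag whose own text contains the keyword, leaving its ancestors, its siblings, and everything that follows in place:

for tag in soup.find_all():
    # Look only at text nodes that are direct children of this tag,
    # so wrapper elements do not match because of text in nested tags
    own_text = " ".join(tag.find_all(string=True, recursive=False)).lower()
    if any(word in own_text for word in keywords):
        tag.decompose()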
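As a worked example, here is a self-contained sketch of matching on a tag's own text only. The function name remove_keyword_tags and the sample HTML are hypothetical, and the sketch collects matches first and detaches them afterwards so the tree is not modified while it is being scanned.

```python
from bs4 import BeautifulSoup


def remove_keyword_tags(html, keywords):
    """Remove tags whose own (direct) text contains one of the keywords."""
    soup = BeautifulSoup(html, 'html.parser')
    matches = []
    for tag in soup.find_all():
        # Only text nodes that are direct children of this tag
        own_text = ' '.join(tag.find_all(string=True, recursive=False)).lower()
        if any(word in own_text for word in keywords):
            matches.append(tag)
    # Detach after scanning so the loop above sees an unmodified tree
    for tag in matches:
        tag.extract()
    return str(soup)


sample = '<div><p>Welcome</p><p>Powered by Acme</p><p>Contact us</p></div>'
result = remove_keyword_tags(sample, ['powered by'])
print(result)  # <div><p>Welcome</p><p>Contact us</p></div>
```

One trade-off of this approach: a keyword split across inline tags, such as "Powered <b>by</b> Acme", is not found, because no single tag holds the whole phrase in its own text.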

3. Remove Everything After the First Keyword Match

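This loosely mimics a jQuery-style :contains selector combined with the ~ sibling combinator (:contains is not part of standard CSS), done by hand with tree navigation:

for tag in soup.find_all():
    # Match on the tag's own text so wrappers such as <body> do not match first
    own_text = " ".join(tag.find_all(string=True, recursive=False)).lower()
    if any(word in own_text for word in keywords):
        # find_all_next() returns tags only, so collect loose text nodes as well
        following = tag.find_all_next() + tag.find_all_next(string=True)
        for node in [tag] + following:
            node.extract()
        break  # stop after first match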

🚫 But this might remove too much. Use carefully.
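A self-contained sketch on a hypothetical sample shows how far the removal reaches. The function name truncate_at_keyword is made up for illustration, and the sketch assumes matching on each tag's own text so that wrapper elements do not match first.

```python
from bs4 import BeautifulSoup


def truncate_at_keyword(html, keywords):
    """Drop the first tag whose own text has a keyword, plus all that follows."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all():
        own_text = ' '.join(tag.find_all(string=True, recursive=False)).lower()
        if any(word in own_text for word in keywords):
            # Collect everything first, then detach: following tags and text nodes
            following = list(tag.find_all_next()) + list(tag.find_all_next(string=True))
            for node in [tag] + following:
                node.extract()
            break  # only the first match triggers the cut
    return str(soup)


sample = '<div><p>Intro</p><p>Powered by Acme</p><p>Legal</p></div><p>Footer</p>'
truncated = truncate_at_keyword(sample, ['powered by'])
print(truncated)  # <div><p>Intro</p></div>
```

Note that the cut crosses element boundaries: the "Footer" paragraph sits outside the <div> and is removed too, which is exactly the "too much" case to watch for.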


4. Regex-Based Tag Removal (Safe, Targeted)

If tag-based logic is too broad, use regex to surgically remove tags with a keyword inside:

import re

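# Matches <tag ...>text containing a keyword</tag>; the named group "tag"
# ensures the closing tag has the same name as the opening tag
pattern = re.compile(
    r"<(?P<tag>\w+)([^>]*?)>(?P<content>[^<]*?(?:powered by|facebook).*?)</(?P=tag)>",
    flags=re.IGNORECASE | re.DOTALL
)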
html_cleaned = pattern.sub('', html)

💡 Then optionally re-parse with BeautifulSoup for tidy HTML.
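The following sketch runs such a pattern against two hypothetical samples, assuming the opening tag name is captured in a named group called tag. It shows both the intended removal and one limitation worth knowing about.

```python
import re

pattern = re.compile(
    r"<(?P<tag>\w+)([^>]*?)>(?P<content>[^<]*?(?:powered by|facebook).*?)</(?P=tag)>",
    flags=re.IGNORECASE | re.DOTALL,
)

# Case 1: the keyword appears directly in the tag's text
flat = '<p>Hello</p><p class="credit">Powered by Acme</p><span>Follow us on Facebook</span><p>Bye</p>'
flat_cleaned = pattern.sub('', flat)
print(flat_cleaned)  # <p>Hello</p><p>Bye</p>

# Case 2: the keyword sits inside a nested tag, so only the inner tag matches
nested = '<p><b>Powered by</b> Acme</p>'
nested_cleaned = pattern.sub('', nested)
print(nested_cleaned)  # <p> Acme</p>
```

Because [^<]*? cannot cross a nested opening tag, the outer <p> in the second case survives with leftover text. Checking the output before saving is a reasonable precaution.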


5. Strip <body> and <html> Tags, Keep Only Inner Content

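if soup.body:
    # Full document: keep only what is inside <body>
    cleaned_html = soup.body.decode_contents().strip()
else:
    # Fragment without <body>: serialize the top-level nodes as they are
    cleaned_html = ''.join(str(tag) for tag in soup.contents).strip()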

6. Remove Only <a href=> Tags, Keep Link Text

Goal: Deactivate hyperlinks but keep the visible link text intact.

for tag in soup.find_all('a'):
    tag.replace_with(tag.get_text())  # Remove link, keep text

🎯 Turns <a href="...">Visit</a> into just Visit.
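If a link wraps other markup, replace_with(tag.get_text()) flattens that markup to plain text as well. BeautifulSoup's unwrap() is an alternative that drops only the <a> wrapper. The sketch below compares the two on a hypothetical sample.

```python
from bs4 import BeautifulSoup

html = '<p>See <a href="https://example.com"><strong>our site</strong></a> today.</p>'

# Variant 1: replace_with(get_text()) flattens the link to plain text
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('a'):
    tag.replace_with(tag.get_text())
text_only = str(soup)

# Variant 2: unwrap() removes only the <a> wrapper and keeps nested markup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('a'):
    tag.unwrap()
unwrapped = str(soup)

print(text_only)  # <p>See our site today.</p>
print(unwrapped)  # <p>See <strong>our site</strong> today.</p>
```

Which variant fits depends on whether formatting inside links (bold text, images) should survive the cleanup.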


7. Detect and Remove Loose Text After HTML

Goal: Find and remove trailing disclaimer-style text after valid HTML structure.

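import re

# Use the last closing tag: re.search() would stop at the first </div> in the document
matches = list(re.finditer(r'</(html|body|div)>', html, flags=re.IGNORECASE))
if matches:
    end_pos = matches[-1].end()
    trailing = html[end_pos:].strip()

If trailing is non-empty, you can strip it from the field by keeping only html[:end_pos].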
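A small helper can make the split explicit. This is a sketch, assuming the trailing text is whatever follows the last closing </html>, </body>, or </div> tag; the function name split_trailing_text and the sample are hypothetical.

```python
import re

CLOSING_TAG = re.compile(r'</(html|body|div)>', flags=re.IGNORECASE)


def split_trailing_text(html):
    """Return (markup, trailing_text), split after the last closing tag."""
    matches = list(CLOSING_TAG.finditer(html))
    if not matches:
        # No recognizable closing tag: treat everything as markup
        return html, ''
    end_pos = matches[-1].end()
    return html[:end_pos], html[end_pos:].strip()


sample = '<div><div>Intro</div><p>Hello</p></div>\nAll rights reserved.'
markup, trailing = split_trailing_text(sample)
print(markup)    # <div><div>Intro</div><p>Hello</p></div>
print(trailing)  # All rights reserved.
```

Returning both parts lets you log or inspect the trailing text before deciding to drop it.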


Bonus: Django Management Command Pattern

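from bs4 import BeautifulSoup
from django.core.management.base import BaseCommand

from myapp.models import Domain  # assumption: adjust to where your Domain model lives


class Command(BaseCommand):
    def handle(self, *args, **options):
        # Skip rows where the field is NULL
        domains = Domain.objects.exclude(html_rewrite__isnull=True)
        for domain in domains:
            soup = BeautifulSoup(domain.html_rewrite or '', 'html.parser')
            # modify soup...
            domain.html_rewrite = str(soup)
            domain.save()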

Each script can be adapted to your model, field, and cleanup goal.


Conclusion

This collection of techniques helps you clean and sanitize HTML fields in Django databases. From stripping tags and removing links to applying targeted regex patterns, you now have a flexible toolbox for most content-cleaning tasks.

