Sometimes you need to find pages without a meta description. If you work in a file system with plain text files and no content management system, or even with a CMS, and you have to add a description to 1000 articles because the writers thought it was not important... then you can try automatic summary generation and add the description to the HTML, to a context file, or to the CMS database.
So we are going to generate a summary with an extractive summarization technique.
We first count how many files there are without a description. In my case these are just text (JSON context) files where the description field is empty.
grep -r '"description": "",' context/
I tried to do this manually; that would take no less than a week for sure. In 1 or 2 hours I had done only 50 or so... So I need it fully automated. But how?
Read the context file; if there is no description in it, fetch the HTML page (with requests, or straight from the file system). With BeautifulSoup, strip all tags down to plain text, split the text into sentences with split('.'), pick a random sentence from it and keep only the beginning. Meta descriptions can technically be any length, but Google generally truncates snippets to ~155-160 characters.
So the plan looks like this:
1. Read the grep output from a file (or stream) in Python:
   grep -r '"description": "",' context* > /tmp/files.txt
2. Open a context file, for example context/0571-ramen-standaardmaten.json, and read the JSON.
3. Request the page URL, for example example.be/page.html.
4. With BeautifulSoup, grab the div with class "content content-width".
5. Remove the HTML tags.
from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, 'lxml')
# Only the main content block, not menus or footers
content = soup.find("div", {"class": "content content-width"})
# Collect every text node inside that div
text = ' '.join(content.find_all(string=True))
We could of course use an extractive summarization technique with spaCy to fill in the description with the most relevant sentence of this text... But with equal success we can select a random sentence from the text and cut it off at ~155 characters. So you have to choose. Let's first look at the spaCy route:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest
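Those imports point at the classic frequency-based extractive approach. A minimal sketch of what that could look like, continuing from the imports above (the model name en_core_web_sm and the helper summarize are my own illustration, not from the original script):

nlp = spacy.load("en_core_web_sm")  # assumed model; any pipeline with a parser works

def summarize(text, n=1):
    doc = nlp(text)
    # Count how often each meaningful word occurs
    words = [token.text.lower() for token in doc
             if token.text.lower() not in STOP_WORDS
             and token.text not in punctuation]
    freq = Counter(words)
    # Score each sentence by the frequencies of the words it contains
    scores = {}
    for sent in doc.sents:
        for token in sent:
            scores[sent] = scores.get(sent, 0) + freq.get(token.text.lower(), 0)
    # The n highest-scoring sentences form the "summary"
    best = nlargest(n, scores, key=scores.get)
    return ' '.join(sent.text.strip() for sent in best)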
In this case the generated description is not as clickable and not as readable as a sentence randomly selected from the text, so we stick with simple random sentence selection: it reads more human than a generated description. Afterwards a copywriter will check all the pages anyway and rewrite the descriptions; this is just a quick fix for other people's earlier mistake. We will see how useful this method is.
#!/usr/bin/python
from bs4 import BeautifulSoup
import sys
import json
import random
import traceback
import requests

def get_filename(line):
    """
    Extract the file name from a line of grep output, e.g.:
    context/0202-pvc-deuren-te-koop.json:"description": "",
    """
    return line.split(":")[0]

def get_context(filename):
    # Load the JSON context file; return it only if the description is empty
    with open(filename, "r") as f:
        context = json.loads(f.read())
    if len(context["description"]) == 0:
        return context
    return None

def get_content(url):
    # Fetch the page and pull the plain text out of the main content div
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    content = soup.find("div", {"class": "content content-width"})
    return ' '.join(content.find_all(string=True))

def select_random(text):
    # Drop sentences that are too short to be meaningful, pick one at
    # random, and truncate it to ~155 characters (the snippet length)
    sentences = [s for s in text.split(".") if len(s) > 60]
    return random.choice(sentences)[:155].replace("\n", "")

def update_description(filename, sentence, context):
    # Write the chosen sentence back into the context file
    context["description"] = sentence
    text = json.dumps(context)
    print(text)
    with open(filename, "w") as f:
        f.write(text)

def main(argv):
    fname = "/tmp/files.txt"
    lines = open(fname, "r").readlines()
    for line in lines:
        try:
            filename = get_filename(line)
            context = get_context(filename)
            if context is not None:
                text = get_content(context['url'])
                sentence = select_random(text)
                update_description(filename, sentence, context)
        except Exception:
            # Keep going on errors; just log the traceback
            traceback.print_exc(file=sys.stdout)

if __name__ == "__main__":
    main(sys.argv[1:])
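To run the whole pipeline: regenerate the file list, then run the script (the name fill_descriptions.py is just my placeholder; the original post doesn't name the file):
grep -r '"description": "",' context* > /tmp/files.txt
python fill_descriptions.py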