Sometimes you need to find pages without a meta description. If you work in a file system with plain text files and no content management system, or even with a CMS, and you have to add a description to 1000 articles because the writers thought it was not important... then you can try automatic summary generation and add the description to the HTML, to a context file, or to the CMS database.
So we are going to generate a summary with an extractive summarization technique.
We first count how many files there are without a description. In my case these are just text (JSON context) files where the description field is empty.
grep -r '"description": "",' context/
I tried to do this manually; that would take no less than a week for sure. In 1 or 2 hours I had done only 50 or so... So I need it fully automated. But how?
Read the context file; if there is no description in it, fetch the HTML page (with requests, or straight from the file system). With BeautifulSoup, strip all tags down to plain text, split the text into sentences with split('.'), pick a random sentence from it and keep only the beginning. Meta descriptions can technically be any length, but Google generally truncates snippets to ~155-160 characters.
So the plan looks like this:
1. Read the grep output from a file (or stream) in Python:
   grep -r '"description": "",' context* > /tmp/files.txt
2. Open a context file, for example context/0571-ramen-standaardmaten.json, and read the JSON.
3. Request the page URL, for example example.be/page.html.
4. With BeautifulSoup, grab the div with class "content content-width".
5. Remove the HTML tags.
from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, 'lxml')
# Only the main content block, not menus or footers
content = soup.find("div", {"class": "content content-width"})
# Collect every text node inside that div
text = ' '.join(content.find_all(string=True))
We could of course use an extractive summarization technique with spaCy to fill in the description with the most relevant sentence of this text... But with equal success we can select a random sentence from the text and cut it off at ~155 characters. So you have to choose. Let's first look at the spaCy route:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest
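Those imports point at the classic frequency-based extractive approach. A minimal sketch of what that could look like, continuing from the imports above (the model name en_core_web_sm and the helper summarize are my own illustration, not from the original script):

nlp = spacy.load("en_core_web_sm")  # assumed model; any pipeline with a parser works

def summarize(text, n=1):
    doc = nlp(text)
    # Count how often each meaningful word occurs
    words = [token.text.lower() for token in doc
             if token.text.lower() not in STOP_WORDS
             and token.text not in punctuation]
    freq = Counter(words)
    # Score each sentence by the frequencies of the words it contains
    scores = {}
    for sent in doc.sents:
        for token in sent:
            scores[sent] = scores.get(sent, 0) + freq.get(token.text.lower(), 0)
    # The n highest-scoring sentences form the "summary"
    best = nlargest(n, scores, key=scores.get)
    return ' '.join(sent.text.strip() for sent in best)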
In this case the generated description is not as clickable and not as readable as a sentence randomly selected from the text, so we stick with simple random sentence selection: it reads more human than a generated description. Afterwards a copywriter will check all the pages anyway and rewrite the descriptions; this is just a quick fix for other people's earlier mistake. We will see how useful this method is.
#!/usr/bin/python
from bs4 import BeautifulSoup
import sys
import json
import random
import traceback
import requests

def get_filename(line):
    """
    Extract the file name from a line of grep output, e.g.:
    context/0202-pvc-deuren-te-koop.json:"description": "",
    """
    return line.split(":")[0]

def get_context(filename):
    # Load the JSON context file; return it only if the description is empty
    with open(filename, "r") as f:
        context = json.loads(f.read())
    if len(context["description"]) == 0:
        return context
    return None

def get_content(url):
    # Fetch the page and pull the plain text out of the main content div
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    content = soup.find("div", {"class": "content content-width"})
    return ' '.join(content.find_all(string=True))

def select_random(text):
    # Drop sentences that are too short to be meaningful, pick one at
    # random, and truncate it to ~155 characters (the snippet length)
    sentences = [s for s in text.split(".") if len(s) > 60]
    return random.choice(sentences)[:155].replace("\n", "")

def update_description(filename, sentence, context):
    # Write the chosen sentence back into the context file
    context["description"] = sentence
    text = json.dumps(context)
    print(text)
    with open(filename, "w") as f:
        f.write(text)

def main(argv):
    fname = "/tmp/files.txt"
    lines = open(fname, "r").readlines()
    for line in lines:
        try:
            filename = get_filename(line)
            context = get_context(filename)
            if context is not None:
                text = get_content(context['url'])
                sentence = select_random(text)
                update_description(filename, sentence, context)
        except Exception:
            # Keep going on errors; just log the traceback
            traceback.print_exc(file=sys.stdout)

if __name__ == "__main__":
    main(sys.argv[1:])
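To run the whole pipeline: regenerate the file list, then run the script (the name fill_descriptions.py is just my placeholder; the original post doesn't name the file):
grep -r '"description": "",' context* > /tmp/files.txt
python fill_descriptions.py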