Find and remove duplicate texts in your site.
I tried an external tool, but it doesn't do much: there are too many errors, and I still have to pay for each check. The check itself is of lower quality, and it's mostly useful for very small sites of fewer than 1000 pages.
I won't go into detail about why I don't like it and which mistakes it makes. So I have to do a little research, build my own code, and then improve the duplicated texts.
So, a first test:
with open('some_file_1.txt', 'r') as file1:
    with open('some_file_2.txt', 'r') as file2:
        same = set(file1).intersection(file2)

same.discard('\n')

with open('some_output_file.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)
Rewrite it as a Django command.
😅😆 I received something very far from what I expected...
Ah, OK, I forgot to split: job.description.split('\n')
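A quick illustration of why the split matters: iterating a string gives individual characters, so intersecting two descriptions without split('\n') compares characters instead of lines. A minimal sketch with two made-up descriptions:

```python
# Hypothetical descriptions, just to show the difference.
a = "Great team\nFree coffee"
b = "Remote work\nFree coffee"

# Without split: set() of a string is a set of characters.
chars = set(a).intersection(b)   # common characters, not useful here

# With split: intersect the actual lines.
lines = set(a.split('\n')).intersection(b.split('\n'))
print(lines)  # {'Free coffee'}
```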
https://github.com/sergejdergatsjev/PageAdmin/blob/master/diffpermanentjobs.py
Now, create a Django command to see duplicate content in the descriptions.
from django.core.management.base import BaseCommand

# Adjust this import to your app's actual models module.
from pageadmin.models import PermanentJob


class Command(BaseCommand):

    def handle(self, *args, **options):
        jobs = PermanentJob.objects.all()
        self.used = {}
        self.same = {}
        for job in jobs:
            self.find_same(job, jobs)
        print("Count: " + str(len(self.same)))
        self.create_txt_report()

    def create_txt_report(self):
        for k, v in self.same.items():
            print("---------------------- " + str(k) + " ----------------")
            print("pages: " + str(len(v)))
            for same_text in v:
                print(same_text[0])
                print(same_text[1])
            print("--------------------- //// -------------------------")

    def is_not_used(self, job_id):
        if job_id in self.used:
            return False
        self.used[job_id] = ""
        return True

    def find_same(self, job, jobs):
        for current_job in jobs:
            if current_job != job:
                same = set(job.description.split('\n')).intersection(
                    current_job.description.split('\n'))
                # rough size threshold on the stringified set
                if len(str(same)) > 200 and self.is_not_used(current_job.id):
                    self.add_same(job, current_job, same)

    def add_same(self, job, current_job, same):
        if job.id in self.same:
            self.same[job.id].append((current_job.id, same))
        else:
            self.same[job.id] = [(current_job.id, same)]
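The pairing logic above can be checked without Django at all. A minimal sketch using plain dicts in place of model instances (the data and the lower threshold are made up so the toy example triggers a match):

```python
# Hypothetical stand-ins for PermanentJob rows.
jobs = [
    {"id": 1, "description": "Intro\nWe offer coffee\nApply now"},
    {"id": 2, "description": "Other intro\nWe offer coffee\nApply now"},
    {"id": 3, "description": "Totally different text"},
]

used = {}
same = {}

def is_not_used(job_id):
    if job_id in used:
        return False
    used[job_id] = ""
    return True

for job in jobs:
    for current in jobs:
        if current is not job:
            shared = set(job["description"].split('\n')).intersection(
                current["description"].split('\n'))
            # the command uses len(str(same)) > 200; a lower value here
            # so the short toy descriptions register as duplicates
            if len(str(shared)) > 20 and is_not_used(current["id"]):
                same.setdefault(job["id"], []).append((current["id"], shared))

print(same)
```

Jobs 1 and 2 report each other as duplicates; job 3 stays out of the result.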
This approach already works well, and I will use it going forward. The diff and filecmp examples below are just extra information.
————————
import difflib

text1 = open("sample1.txt").readlines()
text2 = open("sample2.txt").readlines()

for line in difflib.unified_diff(text1, text2):
    print(line, end="")
OUTPUT
---
+++
@@ -1 +1 @@
-Sample file 1
+Sample file 2
INPUT FILES
sample1.txt
sample2.txt
——————————
from difflib import Differ

with open('cfg1.txt') as f1, open('cfg2.txt') as f2:
    differ = Differ()
    for line in differ.compare(f1.readlines(), f2.readlines()):
        if line.startswith(" "):
            print(line[2:], end="")
The filecmp example that can be found on the internet is unusable in this case, because I need to see what is different, and I need to see what percentage is the same and what differs.
So this example is for information only:
import filecmp
f1 = "C:/Users/user/Documents/intro.txt"
f2 = "C:/Users/user/Desktop/intro1.txt"
# shallow comparison
result = filecmp.cmp(f1, f2)
print(result)
# deep comparison
result = filecmp.cmp(f1, f2, shallow=False)
print(result)
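Since filecmp only answers True/False, the percentage I'm after can be sketched with difflib.SequenceMatcher instead; its ratio() returns a similarity between 0 and 1. A minimal example on two short made-up strings:

```python
import difflib

# Hypothetical file contents, matching the sample files above.
text1 = "Sample file 1"
text2 = "Sample file 2"

ratio = difflib.SequenceMatcher(None, text1, text2).ratio()
print(f"{ratio:.0%} similar")  # 92% similar for these two strings
```

For whole files, pass readlines() of each file instead of the strings.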
Links on roughly the same topic
https://stackoverflow.com/questions/55061542/how-to-check-for-differences-between-two-spacy-doc-objects
spaCy similarities
https://stackoverflow.com/questions/11008519/detecting-and-printing-the-difference-between-two-text-files-using-python-3-2
https://docs.python.org/3/library/difflib.html