Internal note about crawlers
Summary of the XML method
Copy a crontab command from an existing source such as jobboost and update the URL and methods (see the crontab sketch after the checklist below).
Copy the jobboost commands and name them after the new source, for example Jooble or Talentus.
- Check the structure of the cleaner and the parser.
- Normally checkduplicatesbysource and cleanjob are generic commands which you can reuse with parameters.
- The parser command you have to check case by case: it contains XML parser functions that depend on the feed structure, which sometimes changes at the source, so it does not make much sense to make them generic.
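For example, the crontab entries for a new source could look roughly like this; the schedule, the project path and the per-source command name (joobleparser) are illustrative assumptions, while checkduplicatesbysource and cleanjob are the generic commands mentioned above:
# illustrative schedule; command names and arguments are assumptions
0 3 * * * cd /var/www/vindazo_de && python manage.py joobleparser
30 4 * * * cd /var/www/vindazo_de && python manage.py checkduplicatesbysource jooble
0 5 * * * cd /var/www/vindazo_de && python manage.py cleanjob jooble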
Summary of the crawler method
- Use rich data for indexing via JSON
- If there is no such markup in the HTML or JSON, you can parse the HTML or the raw string directly.
The XML parser in detail
When we welcome a new customer, we create a separate process for them that takes care of several things. Here we are talking about large clients with thousands of vacancies. Automation happens mostly through the XML mechanism and sometimes through a crawler. A crawler is less desirable because it has to parse HTML itself or rely on the structured job data. First I will explain the XML working method. We do not use external frameworks; we only use our own input mechanisms because they are optimized for the tasks we need.
In most cases the XML looks completely different on different websites and needs a mapping of categories. We do the mapping with natural language processing. The rest we simply parse with a parser such as etree. So with a Django command we can build a connector and XML parser like this:
self.session = session = jobutils.create_session()   # jobutils is an internal helper module
for item in self.start_url:
    try:
        response = session.get(item.get("url"))
    except Exception:
        traceback.print_exc(file=sys.stdout)
        sleep(2)
        continue
    # premium feeds are paid and get redirect handling
    if item.get("premium"):
        self.redirect = True
        self.payed = True
    else:
        self.redirect = False
        self.payed = False
    root = etree.fromstring(response.content)
    jobs = root.xpath("//job")
    for job in jobs:
        self.parse_job(job, source)
print("Done")
Then you can walk through the XML job elements and map them onto Job objects, for example:
jobs = Job.objects.filter(source_unique=job_id)
if not jobs.exists():
    o_job = Job()
    o_job.source_unique = job_id
    o_job.url = job.findtext('url')
    o_job.email = utils.email_validation(self.get_email(description))
    o_job.sol_url = job.findtext('url')
    # o_job.user = user
    o_job.source = source
    o_job.weight = source.weight
    o_job.title = title[:240]
    o_job.slug = slugify(o_job.title)
    o_job.save()  # store the new job via the Django ORM
This way you avoid the overhead of external frameworks and can, in principle, work with all customers without additional customer-specific requirements.
If a client does not yet use XML: about 10 years ago hardly anyone had XML interfaces, and we simply had to set up crawlers for client sites. Often these sites have not only an HTML structure but also JavaScript navigation, and that was quite a challenge when creating bots for such clients.
You then need three mechanisms: 1) one that finds vacancies on the site and moves quickly through the navigation, 2) one that parses the jobs found by the first process, and 3) one that runs through a source once a day and checks whether each vacancy is still online on the website of origin.
For this method we also use our own framework, which is based on extendable classes; the methods of these objects can be overridden for any specific solution.
There is no better method than reading the examples:
vim crawler/management/commands/abstractjob.py
This command is based on Django's BaseCommand and does only one thing: retrieve a page where a vacancy is found, parse it into one object, and store it in the database via the Django ORM.
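A stripped-down sketch of such a command could look like this; it is not the real abstractjob.py, the url argument and the parse_job helper are illustrative, and jobutils.create_session() is the internal helper used earlier:
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Fetch one vacancy page, parse it into a Job and store it."

    def add_arguments(self, parser):
        parser.add_argument("url")

    def handle(self, *args, **options):
        session = jobutils.create_session()                    # internal session helper
        response = session.get(options["url"])                 # fetch the vacancy page
        job = self.parse_job(response.text, options["url"])    # site-specific parsing
        job.save()                                             # store via the Django ORM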
One of the problems after parsing is duplicates.
Many interim and selection agencies post the same vacancy several times, the only difference being that they add a different location. I can't say whether it helps them get better results, but it can definitely cause problems, so we need to remove the duplicates and add the extra cities only in the case of such vacancies.
These people think they are helping the job receive more traffic, but in fact they are adding extra work and slowing down the indexing process.
We also need a separate command so that we do not show duplicates in our index, because such results quickly become frustrating to look at.
The problem is that the city is a text field. It is therefore easier to adjust the titles: a separate command appends the location to the title with a separator. That way the results look more unique and the employer actually receives more responses.
You can see an example of such a command here:
vim crawler/management/commands/checkduplicatesbysource.py
This works mostly per source, because we usually don't do a cross-source check. There are hundreds of methods to find exactly the same text, the same images, or texts that are, say, 70% alike. A common method is to generate a slug of the title and filter objects with the same slug. That is one of the simplest methods, unlike natural language processing, and it is the fastest; it works reasonably effectively when you need to filter thousands or even millions of records.
Example Python function:
def check_by_slug(self, job, source):
    if self.verbose:
        print("ID %d" % (job.id))
        print(job.slug)
        print(datetime.datetime.now())
    # same slug, company and source but a different id means a duplicate
    jobs = Job.objects.filter(slug=job.slug, for_index=True,
                              company_name=job.company_name,
                              source=source).exclude(id=job.id)
    count = jobs.count()
    if self.verbose:
        print("count %d" % count)
    if count > 0:
        try:
            # jobs.delete()
            jobs.update(has_checked_duplicate=True)
            jobs.update(for_index=False)
        except Exception:
            traceback.print_exc(file=sys.stdout)
            transaction.rollback()
    if self.verbose:
        print(datetime.datetime.now())
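A minimal sketch of how such a command could walk through one source and apply the check; the Source lookup and the option name are assumptions, not the exact interface of checkduplicatesbysource.py:
def handle(self, *args, **options):
    # hypothetical wiring: pick one source and check every indexed job of it
    source = Source.objects.get(name=options["source"])
    for job in Job.objects.filter(source=source, for_index=True).iterator():
        self.check_by_slug(job, source)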
Crawler configuration in detail
Just try browsing the site manually first to study how the navigation is built and how to set up a crawler to move through the categories and down to the job descriptions. Don't forget to test in a separate incognito window, because some sites let you browse and display jobs in one session but not in other sessions. Those URLs live for one session and are made unique for each visitor. That makes browsing difficult and sometimes impossible, because you can no longer tell where you stopped and which pages you still need to fetch. Make a copy of a similar crawler so you can automate this site.
cp crawler/management/commands/jobbsitecat.py crawler/management/commands/newjobsitecat.py
Put a screenshot path in so you can monitor via the browser what is happening with the crawler and see what the bot actually sees, because some sites apply cloaking principles.
SCREENSHOT_FILE = "/var/www/vindazo_de/static/newjobsite.png"
Then start debugging right away, without sentiment, and stop where you need to browse, so you can run everything manually and note down the correct actions.
import pdb;pdb.set_trace()
Then you can check the screenshots over HTTPS and see what the crawler is doing. Just make sure the screenshot is saved in a publicly served folder such as /static/.
self.driver.find_element_by_xpath('//span[@class="typauswahl start"]/a').click()
For example:
element = self.driver.find_element_by_xpath('//a[@class="position-link"]')
You can find elements and click to see what appears.
self.driver.save_screenshot(SCREENSHOT_FILE)
Then, in a real browser, go to roughly the same point and see what you have to do to perform the next action. That is how it's done.
element = self.driver.find_element_by_xpath('//a[@class="jobview-paging-control jobview-paging-next"]')
Also make sure you always get things like consent banners out of the way.
* ElementClickInterceptedException: Message: Element <a class="jobview-paging-control jobview-paging-next" href="#next"> is not clickable at point (577,951) because another element <div class="consent-banner"> obscures it
element = self.driver.find_element_by_xpath('//button[@id="accept-sta-consent"]')
element.click()
Thus, with already tested snippets, you can continue step by step in the debugger:
(Pdb) element = self.driver.find_element_by_xpath('//a[@class="jobview-paging-control jobview-paging-next"]')
(Pdb) element.click()
(Pdb) self.driver.save_screenshot(SCREENSHOT_FILE)
(Pdb) self.driver.current_url
Structured data
When we start parsing content, we check whether there is already structured data on the page.
https://search.google.com/test/rich-results
If we see the JobPosting structure directly and can parse it from the JSON (or XML), we don't need any extra HTML parser, and this JSON-LD format is an industry standard.
So you can parse the JSON that sits between the script tags.
<script type="application/ld+json">
script = soup.find('script', {"type":"application/ld+json"})
import json
job_data = json.loads(script.text)
job_data.keys()
[u'description', u'title', u'employmentType', u'datePosted', u'validThrough', u'directApply', u'jobLocation', u'@context', u'baseSalary', u'hiringOrganization', u'@type']
job_data["title"]
job_data["jobLocation"]
[{u'geo': {u'latitude': 52.5099338311689, u'@type': u'GeoCoordinates', u'longitude': 13.3867898863636}, u'@type': u'Place', u'address': {u'addressCountry': u'DE', u'addressLocality': u'Berlin', u'addressRegion': u'berlin', u'streetAddress': u'', u'postalCode': u'10115', u'@type': u'PostalAddress'}}]
job_data["jobLocation"][0].keys()
[u'geo', u'@type', u'address']
So the city, for example, you can get with:
job_data['jobLocation'][0]['address']['addressLocality']
job_data['jobLocation'][0]['address']['postalCode']
job.title = job_data["title"]
job.company_name = job_data['hiringOrganization']['name']
job.email = self.get_email(page_source)
job.phone = self.get_phone(page_source)
job.city = job_data['jobLocation'][0]['address']['addressLocality']
job.zip_code = job_data['jobLocation'][0]['address']['postalCode']
job.address = job_data['jobLocation'][0]['address']['streetAddress']
job.country = "Deutschland"
job.slug = slugify(job.title)
text = utils.parse_all_text(job_data['description'])
job.description = text
job.raw_text = utils.clean_job_html(job_data['description'])
The quality via JSON is actually better and, more importantly, this is something we can turn into a generic command.
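Because the JobPosting JSON-LD is standardised, the extraction step itself can live in one reusable function. A minimal sketch, assuming BeautifulSoup with the lxml parser; the function name is just an illustration:
import json
from bs4 import BeautifulSoup

def parse_jobposting_jsonld(page_source):
    """Return the JobPosting dict embedded in a page, or None if there is none."""
    soup = BeautifulSoup(page_source, "lxml")
    for script in soup.find_all("script", {"type": "application/ld+json"}):
        try:
            data = json.loads(script.string or "")
        except ValueError:
            continue
        # some pages wrap several objects in a list
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "JobPosting":
                return item
    return None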
Direct HTML parsing or raw string parsing when there is no structured rich data
In principle we should parse each job immediately while crawling, since we already see the content directly and still have to move from one job to the next anyway.
next_page = True
while next_page:
    try:
        # click the "next page" arrow in the pagination
        self.driver.find_element_by_xpath('//img[contains(@src, "paginierung_rechts_aktiv")]').click()
        sleep(3)
        soup = BeautifulSoup(self.driver.page_source, 'lxml')
        urls = self.get_job_urls(soup)
        self.create_pages(urls)
    except Exception:
        # no clickable "next" element left: stop paging
        next_page = False
        traceback.print_exc(file=sys.stdout)
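The loop above relies on a helper that collects the job detail URLs from the listing page. A minimal sketch, where the a.job-link selector and self.base_url are hypothetical placeholders that differ per site:
from urllib.parse import urljoin

def get_job_urls(self, soup):
    # collect absolute job detail URLs from the listing page
    urls = []
    for link in soup.select("a.job-link"):    # placeholder selector, adjust per site
        href = link.get("href")
        if href:
            urls.append(urljoin(self.base_url, href))
    return urls
The parse_job method below then handles each job page that was found.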
def parse_job(self, text, source, url, session):
    """
    Parse one job page into a Job object.
    """
    job = Job()
    job.status = 0
    job.source_unique = url
    job.url = url
    job.source = source
    job.weight = source.weight
    job.online_since = datetime.now()
    job.online_since_refreshed = datetime.now()
    job.online_to = datetime.now() + timedelta(days=60)
    if self.test:
        import pdb; pdb.set_trace()
    if text.find(self.offline) != -1:
        # the page says the vacancy is no longer online
        return False
    soup = BeautifulSoup(text, 'lxml')
    job.title = self.get_title(soup).strip()
    if not job.title:
        return False
    else:
        job.title = job.title[0:120]
    job.slug = slugify(job.title)
    try:
        job = self.get_contact_information(soup, job, session)
    except Exception:
        traceback.print_exc(file=sys.stdout)
    if not self.has_correct_location(soup):
        return False
    soup = self.get_content_zone(soup)
    if soup is None:
        return False
def get_phone(self, soup):
    tmp = ''
    company_phone = soup.find(text=re.compile('Telefonnummer'))
    if company_phone:
        tmp = company_phone.split(':')[-1].strip()
    return tmp

def get_city(self, soup):
    tmp = ''
    company_city = soup.find("span", id=re.compile(".*Ort*."))
    if company_city:
        tmp = company_city.text
    return tmp

def get_zip_code(self, soup):
    tmp = ''
    company_zip = soup.find("span", id=re.compile(".*Plz*."))
    if company_zip:
        tmp = company_zip.text
    return tmp

def get_address(self, soup):
    tmp = ''
    company_address = soup.find("span", id=re.compile(".*Strasse*."))
    if company_address:
        tmp = company_address.text
    return tmp
def get_contact_information(self, soup, job, session):
    # fills the contact fields on the job; called from parse_job() above
    job.company_name = self.get_company_name(soup)
    job.email = self.get_email(soup)
    job.phone = self.get_phone(soup)
    job.city = self.get_city(soup)
    job.zip_code = self.get_zip_code(soup)
    job.address = self.get_address(soup)
    job.country = "Deutschland"
    return job
def get_company_name(self, soup):
    tmp = ''
    company_name = soup.find('a', id=re.compile(".*arbeitgeber"))
    if company_name:
        tmp = company_name.find('span').text
    return tmp

def get_title(self, soup):
    # tmp = soup.find('div', {"id":"containerInhaltKopf"}).find('h3')
    tmp = soup.select("div#containerInhaltKopf > h3")
    try:
        tmp = tmp[0].text.replace("\n", "").replace("\t", "").replace("Stellenangebot -", "").strip()
    except Exception:
        tmp = ""
    if not tmp:
        tmp = soup.title.text.replace("JOBBÖRSE - Stellenangebot -", "").strip()
    return tmp
Raw string parsing with regular expressions
For such matters as telephone numbers or email, please use regular expressions.
For example Telephone:
https://regex101.com/r/McD0KW/2/
You can, for example, divide your function into two or more stages, as here.
def get_phone(self, page_source):
    # stage 1: find everything that looks remotely like a phone number
    pattern = re.compile(r"[0\+\(][\d\-\.\+ \)\(\/]{10,22}")
    phones = re.findall(pattern, page_source)
    if len(phones) > 0:
        return self.filter_phone(phones)
    else:
        return None

def filter_phone(self, phones):
    # stage 2: keep only candidates with enough separator symbols
    pattern = re.compile(r"[\-\.\+ \)\(\/]")
    for phone in phones:
        symbols = re.findall(pattern, phone)
        if len(symbols) > 3:
            return phone
    return None
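For example, against a raw page string (the sample values are illustrative) the two stages behave like this:
import re

page_source = "Kontakt: Frau Muster, Telefon: +49 (0)30 123456-78, Raum 4"
pattern = re.compile(r"[0\+\(][\d\-\.\+ \)\(\/]{10,22}")
print(re.findall(pattern, page_source))    # ['+49 (0)30 123456-78']
# filter_phone() then keeps this candidate because it contains more than three
# separator symbols, which random digit runs usually do not.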
A good regular expression is like a work of art. You can write the same thing with much worse performance, or something that doesn't work at all, or that only works in some cases, and such a pattern can take several hours to debug.
An e-mail pattern that covers most cases:
https://regex101.com/r/xKUnaN/1
Then the email parser could look like this.
def parse_emails(content):
    pattern = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
    return re.findall(pattern, content)

def get_email(self, page_source):
    email = None
    # in text
    emails = parse_emails(page_source)
    # mailto
    if len(emails) > 0:
        try:
            email = normalization_email(emails[0])
            validate_email(emails[0])
        except Exception:
            email = None
    return email
So if you have to parse HTML, it is sometimes useful to abstract away from the tags and do rough string parsing, finding the necessary information with patterns.