To perform pattern matching for extracting car makes from a sentence, you can use spaCy's `Matcher` class. The `Matcher` allows you to define patterns based on token attributes and match them against a given document. Here's an example of how you can use `Matcher` for this task:
```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

def extract_car_make(sentence, car_make_list):
    doc = nlp(sentence)
    matcher = Matcher(nlp.vocab)
    # One single-token pattern per make: [{"LOWER": "ford"}], [{"LOWER": "bmw"}], ...
    patterns = [[{"LOWER": make.lower()}] for make in car_make_list]
    matcher.add("CAR_MAKE", patterns)
    matches = matcher(doc)
    car_make = None
    if matches:
        match_id, start, end = matches[0]
        car_make = doc[start:end].text
    return car_make

# Example usage
sentence = "I saw a red Ford Mustang on the street."
car_make_list = ["Ford", "Chevrolet", "Toyota", "BMW"]
make = extract_car_make(sentence, car_make_list)
print(make)  # Output: Ford
```
In this example, the `extract_car_make` function takes a sentence and a list of car makes as input. The sentence is processed using spaCy's `nlp` function to obtain a parsed `doc` object. We initialize a `Matcher` instance by providing the vocabulary of the loaded language model.
We build one pattern per car make, converting each make to lowercase with a list comprehension. Each pattern is a list containing a single token specification: for example, `[{"LOWER": "ford"}]` matches any token whose lowercase form is "ford". Note that the patterns must be kept separate; flattening all makes into one pattern would require every make to appear consecutively in the text.
We add the patterns to the matcher with `matcher.add("CAR_MAKE", patterns)`, where "CAR_MAKE" is a unique ID for this group of patterns.
We then use `matcher(doc)` to find matches in the document. If there is a match, we extract the car make from the first match by getting the start and end indices and using `doc[start:end].text`.
In the example usage, the sentence "I saw a red Ford Mustang on the street" is passed to the `extract_car_make` function, along with a car make list containing "Ford", "Chevrolet", "Toyota", and "BMW". The function identifies "Ford" as the car make, and it is printed as the output.
You can modify the `car_make_list` variable to include the specific car makes you want to extract using pattern matching. The function will then find the first matching car make from the list in the sentence and return it.
Every brand has to be matched by its own pattern. A working method could look like this:
```python
def extract_car_make(self, text):
    doc = self.nlp(text)
    matcher = Matcher(self.nlp.vocab)
    patterns = [[{"LOWER": make.lower()}] for make in self.car_make_list]
    matcher.add("CAR_MAKE", patterns)
    matches = matcher(doc)
    make = None
    if matches:
        match_id, start, end = matches[0]
        car_make = doc[start:end].text
        make = VehicleMerk.objects.get(name=car_make)
    return make
```
With this method you can also parse a subcategory from the same text. For example:
```python
def extract_car_make_and_model(self, text):
    doc = self.nlp(text)
    make = None
    model = None
    matcher = Matcher(self.nlp.vocab)
    patterns = [[{"LOWER": m.lower()}] for m in self.car_make_list]
    matcher.add("CAR_MAKE", patterns)
    matches = matcher(doc)
    if matches:
        match_id, start, end = matches[0]
        car_make = doc[start:end].text
        make = VehicleMerk.objects.get(name=car_make)
        # Restrict the model search to models of the matched make
        model_list = VehicleModel.objects.filter(vehicle_merk=make).values_list('name', flat=True)
        matcher_model = Matcher(self.nlp.vocab)
        model_patterns = [[{"LOWER": name.lower()}] for name in model_list]
        matcher_model.add("CAR_MODEL", model_patterns)
        model_matches = matcher_model(doc)
        if model_matches:
            match_id, start, end = model_matches[0]
            car_model = doc[start:end].text
            model = VehicleModel.objects.get(name=car_model, vehicle_merk=make)
    return make, model
```
Alternatively, you can define make and model together in a single pattern if you need them as one combined string. In our case we need to add a car from the list on our site and run price calculations, so the make and the model have to come out as separate strings.
The current approach runs into trouble with multi-word makes such as Alfa Romeo, Panther Westwinds, and Martin Motors. The matcher compares patterns token by token, so a make like "Alfa Romeo" spans two tokens and a single-token pattern can never match it. To address this, the matching has to cover the entire phrase rather than a single token.
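One way to handle multi-word makes is spaCy's `PhraseMatcher`, which matches whole phrases instead of hand-written per-token patterns. This is a minimal sketch, not the approach used in the methods above; the make list here is illustrative, and a blank English pipeline stands in for the loaded model:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# attr="LOWER" makes the phrase match case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
makes = ["Alfa Romeo", "Panther Westwinds", "Ford"]
matcher.add("CAR_MAKE", [nlp.make_doc(m) for m in makes])

doc = nlp("I drive an alfa romeo Giulia")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # alfa romeo
```

Because the patterns are plain `Doc` objects, a two-token make like "Alfa Romeo" is matched as a whole span rather than token by token.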
Customizing spaCy’s Tokenizer class
Furthermore, there is a problem with the standard tokenizer. We have to use our own, because a model name can contain a hyphen, and the default tokenizer splits such a name into two or three tokens, so matching no longer works for those models. A make like Mercedes-Benz does not match.
Our custom configuration treats only the `~` character as an infix, so hyphenated names stay as single tokens (this requires `import re` and `from spacy.tokenizer import Tokenizer`; note that passing only `infix_finditer` also discards the default prefix and suffix rules):

```python
self.nlp.tokenizer = Tokenizer(self.nlp.vocab, infix_finditer=re.compile(r'[~]').finditer)
```
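To see the effect, compare the default tokenizer with the custom one on a hyphenated make. This sketch uses a blank English pipeline instead of the loaded model, which is enough to show the tokenization difference:

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")
default_tokens = [t.text for t in nlp("Mercedes-Benz")]
print(default_tokens)  # the default rules split on the hyphen

# Treat only "~" as an infix, so hyphens no longer split tokens
nlp.tokenizer = Tokenizer(nlp.vocab, infix_finditer=re.compile(r"[~]").finditer)
custom_tokens = [t.text for t in nlp("Mercedes-Benz")]
print(custom_tokens)  # one token: Mercedes-Benz
```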
Regular expressions
When using the `REGEX` operator, it is important to note that it operates on individual tokens rather than the entire text. Every expression you provide is matched against a single token. If you need to match on the entire text instead, refer to spaCy's documentation on regex matching on the full text.
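A small sketch of that token-level behaviour (the pattern and sentence here are illustrative): the expression is tested against each token separately, so it can match the single token `Mercedes`, but it could never match a two-token phrase like `Alfa Romeo` as a whole.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# REGEX is evaluated per token, here against the token's lowercase form
matcher.add("MAKE_PREFIX", [[{"LOWER": {"REGEX": r"^merc"}}]])

doc = nlp("She bought a Mercedes last year")
for _, start, end in matcher(doc):
    print(doc[start:end].text)  # Mercedes
```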
Car make and model example
```python
for pattern in self.make_patterns:
    car_make = self.parse_make(pattern, doc)
    if car_make:
        break

def parse_make(self, expression, doc):
    for match in re.finditer(expression, doc.text):
        start, end = match.span()
        # char_span returns a Span, or None if the match does not
        # align with token boundaries
        span = doc.char_span(start, end)
        if span is not None:
            return span.text
```