GSoC 2017 : Week 3

Status update for the 3rd week

Krishanu Konar

3 minute read

Time is passing by ever so quickly and things are starting to get really intense. Although it has only been three weeks, it feels like I’m a veteran developer now (professional developers everywhere cringed :P). Anyway, here’s the progress report from my third week.

For the next few weeks, my main focus will be on expanding the scope of the extractor, adding a few common domains, and making it more scalable so that it can handle previously unseen lists with existing rules. This week, I started adding new domains. This time around, I took my mentor’s suggestion and tried to implement a single mapper that can map multiple list items, instead of having a separate mapping function for every single type of element. Previously, all the properties were hard-coded in the mapping functions themselves, as in the example below:

# mapping bibliography for Writer, snippet from mapper.py
g.add((rdflib.URIRef(uri), dbo.author, res))
isbn = isbn_mapper(elem)
if isbn:
    g.add((rdflib.URIRef(uri), dbo.isbn, rdflib.Literal(isbn, datatype=rdflib.XSD.string)))
if year:
    add_years_to_graph(g, uri, year)
if lit_genre:
    g.add((rdflib.URIRef(uri), dbo.literaryGenre, dbo + rdflib.URIRef(lit_genre)))

This led me to change the way Federica and I have been using the mapping rules. The ontology classes/properties are now stored in mapping_rules.py instead of in the mapping functions.

# (new) mapping contribution type for Person, snippet from mapper.py
if contrib_type is None:
    feature = bracket_feature_mapper(elem)
    # look the property up in the rules instead of hard-coding it here
    for t in CONTRIBUTION_TYPE[lang]:
        if re.search(t, feature, re.IGNORECASE):
            contrib_type = CONTRIBUTION_TYPE[lang][t]

if contrib_type:
    g.add((rdflib.URIRef(uri), dbo[contrib_type], res))  # notice the property
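To make the idea concrete, here is a minimal, self-contained sketch of how such a rule table could be laid out and queried. Only the CONTRIBUTION_TYPE name comes from the snippet above; the entries in it and the find_contribution_type helper are assumptions made up for illustration, so the real mapping_rules.py may well look different. Unlike the snippet above, which keeps the last matching rule, this sketch returns on the first match.

```python
import re

# Hypothetical excerpt of mapping_rules.py: keys are regular expressions
# matched against the bracketed part of a list item, values are ontology
# property names. The real entries may differ.
CONTRIBUTION_TYPE = {
    "en": {
        r"director": "director",
        r"producer": "producer",
        r"writer|screenplay": "writer",
    },
}


def find_contribution_type(feature, lang="en"):
    """Return the property name of the first rule matching feature, else None."""
    # Hypothetical helper, not part of the project's mapper.py.
    for pattern, prop in CONTRIBUTION_TYPE.get(lang, {}).items():
        if re.search(pattern, feature, re.IGNORECASE):
            return prop
    return None


print(find_contribution_type("as Director"))
print(find_contribution_type("cameo"))
```

With a layout like this, supporting a new kind of contribution or a new language only means adding an entry to the dictionary; the mapper code itself stays untouched.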

Among the new domains, I also analysed the EducationalInstitution domain and finished writing the rules/mappers for it. The list extractor can now extract triples from EducationalInstitution, as well as from its subdomains such as College, School and University. After that, I looked at different domains within Person in order to generalize the extractor to work on this superclass. Domains like Painter, Architect, Astronaut, Ambassador, Athlete, BusinessPerson, Chef, Celebrity, Coach etc. will now also work with the extractor, increasing its coverage, but I still have to work on the quality of extraction, as Person is one of the biggest domains on Wikipedia and has extreme variability. To that end, I changed various functions to support generalized domains (e.g. year_mapper, role etc.). The extractor now extracts all the years in which a person has won the same award/honor.
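Collecting every year attached to a single award could look roughly like the sketch below. This is not the project's year_mapper: the extract_years name, the regular expression and the sample list item are all assumptions made for illustration.

```python
import re

# Four-digit years from 1000 to 2099; the word boundaries keep the pattern
# from matching inside longer digit runs such as ISBNs.
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def extract_years(text):
    """Return every distinct year found in text, in order of appearance."""
    # Hypothetical helper, not the project's year_mapper.
    years = []
    for match in YEAR_RE.finditer(text):
        year = int(match.group(1))
        if year not in years:
            years.append(year)
    return years


print(extract_years("Academy Award for Best Director (1994, 1999)"))
```

Each returned year can then be attached to the same award resource, instead of keeping only the first year found.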

Finally, I had a meeting with Luca to discuss ways to merge the mapping_rules for both the list and table extractor projects. Another meeting is scheduled for next week, after discussing the idea with the mentors. For the next week, I’ll keep adding domains to the extractor, while adding the new rules/functions in a generalized way. I also hope to come to a resolution about the final structure of my extractor after a discussion with Luca.

You can follow my project on GitHub here.
