GSoC 2017 : Week 2

Status update for the 2nd week

Krishanu Konar

4 minute read

The second week has flown by, and it’s time for another progress report.

The primary tasks for the second week were to add support for Spanish and German to the existing Writer and Actor domains, and to integrate the MusicalArtist domain that was part of my warm-up task.

So, I started off with adding the MusicalArtist domain to my current codebase. It was fairly straightforward (for the most part) and it worked like a charm.

ref = reference_mapper(elem)  # look for resource references
if ref:  # current element contains a reference
    uri = wikidataAPI_call(ref, lang)  # try to reconcile the resource with the Wikidata API
    if uri:
        dbpedia_uri = find_DBpedia_uri(uri, lang)  # try to find an equivalent DBpedia resource
        if dbpedia_uri:  # if a DBpedia resource is found, use it as the statement subject
            uri = dbpedia_uri
        else:  # take the reference name anyway if it can't be reconciled
            ref = list_elem_clean(ref)
            elem = elem.replace(ref, "")  # remove the reference from the list element to ease further parsing
            uri_name = ref.replace(' ', '_')
            uri_name = urllib2.quote(uri_name)
            uri = dbr + uri_name.decode('utf-8', errors='ignore')
        g.add((rdflib.URIRef(uri), rdf.type, dbo.Album))
        g.add((rdflib.URIRef(uri), dbo.musicalArtist, res))
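The snippet above is Python 2 (`urllib2`). As a rough sketch, the fallback branch that builds a `dbr` URI from the reference name could look like this in Python 3 (`fallback_uri` and the example title are hypothetical, not part of the project code):

```python
from urllib.parse import quote

dbr = "http://dbpedia.org/resource/"  # DBpedia resource namespace

def fallback_uri(ref):
    """Build a dbr URI from a reference name when reconciliation fails."""
    name = ref.strip().replace(' ', '_')  # mirror the ref.replace(' ', '_') step
    return dbr + quote(name)            # percent-encode any unsafe characters

fallback_uri("The Dark Side of the Moon")
# -> 'http://dbpedia.org/resource/The_Dark_Side_of_the_Moon'
```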

However, diving deeper into many musical artists, I noticed that the extractor wasn’t working very efficiently and constantly missed many elements. It was then that I realised an actor could well have recorded a few songs, or a musician might’ve acted in a movie, while the extractor was looking only for one particular section per resource. It’s funny and astounding at the same time that one can completely miss such an intuitive thing. Anyway, a big overhaul was needed.

So, after analyzing many articles from different domains, I realised that several domains have intersecting sections, and I had to change my approach. From now on, I’ll be focusing on writing mapping functions that can extract list elements from a given section. Later, domains can be added in mapping_rules.py, including the various sections that might exist in that domain’s articles.

For this, I had to completely restructure my current mapping_rules file. The rules now consist of two multi-level dictionaries: the first maps the domain of a resource to the sections it could be related to, and the second maps each section to its language-specific section titles.

MAPPING = {
            'Person': ['FILMOGRAPHY', 'DISCOGRAPHY', 'BIBLIOGRAPHY', 'HONORS'],
            'Writer': ['BIBLIOGRAPHY', 'HONORS'],
            'MusicalArtist': ['DISCOGRAPHY', 'FILMOGRAPHY', 'CONCERT_TOURS', 'HONORS'],
            'Band': ['DISCOGRAPHY', 'CONCERT_TOURS', 'BAND_MEMBERS', 'HONORS'],
}

BIBLIOGRAPHY = {
    'en': ['bibliography', 'works', 'novels', 'books', 'publications'],
    'it': ['opere', 'romanzi', 'saggi', 'pubblicazioni', 'edizioni'],
    'de': ['bibliographie', 'werke','arbeiten', 'bücher', 'publikationen'],
    'es': ['Obras', 'Bibliografía']
}
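Putting the two dictionaries together, selecting the candidate section titles for a resource class in a given language could be sketched like this (a minimal, hypothetical sketch: the `HONORS` contents and the `section_keys` helper are made up for illustration, and a plain name lookup stands in for the `eval(domain)` the extractor uses):

```python
# Domain -> candidate section groups (from the post's mapping_rules)
MAPPING = {
    'Writer': ['BIBLIOGRAPHY', 'HONORS'],
}

# Section group -> language-specific section titles
BIBLIOGRAPHY = {
    'en': ['bibliography', 'works', 'novels', 'books', 'publications'],
    'es': ['Obras', 'Bibliografía'],
}

HONORS = {  # hypothetical contents, for illustration only
    'en': ['honors', 'awards'],
}

def section_keys(res_class, lang):
    """Collect every candidate section title for a resource class in a language."""
    keys = []
    for domain in MAPPING.get(res_class, []):
        table = globals().get(domain, {})  # name lookup instead of eval(domain)
        keys.extend(table.get(lang, []))
    return keys
```

For example, `section_keys('Writer', 'en')` gathers the English titles from both BIBLIOGRAPHY and HONORS.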

I also had to change the mapping-selection function to handle multiple sections.

domains = MAPPING[res_class]  # e.g. ['BIBLIOGRAPHY', 'FILMOGRAPHY']
domain_keys = []
resource_class = res_class

for domain in domains:
    if domain in mapped_domains:
        continue
    if lang in eval(domain):
        domain_keys = eval(domain)[lang]  # e.g. ['bibliography', 'works', ..]
    else:
        print("The language provided is not available yet for this mapping")
        continue  # skip this domain, otherwise stale domain_keys would be reused

    mapped_domains.append(domain)  # this domain won't be used again for mapping

    for res_key in resDict.keys():  # iterate over resource dictionary keys
        mapped = False

        for dk in domain_keys:  # search for resource keys related to the selected domain
            # if the section hasn't been mapped yet and the title matches, apply the domain-related mapping
            dk = dk.decode('utf-8')  # make sure utf-8 mismatches don't skip sections
            if not mapped and re.search(dk, res_key, re.IGNORECASE):
                mapper = "map_" + domain.lower() + "(resDict[res_key], res_key, db_res, lang, g, 0)"
                res_elems += eval(mapper)  # call the proper mapping for that domain and count extracted elements
                mapped = True  # prevents the same section from being mapped again
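In isolation, the title-matching step above behaves like this (a small illustrative sketch; `domain_keys` and `res_key` are made-up values, not taken from the project):

```python
import re

domain_keys = ['bibliography', 'works', 'novels']  # candidate titles for one domain
res_key = 'Selected works'                         # a section title from an article

# keep every candidate that occurs in the section title, case-insensitively
matched = [dk for dk in domain_keys if re.search(dk, res_key, re.IGNORECASE)]
# 'works' matches 'Selected works' despite the different case
```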

So, this major change in the selection of mapper functions greatly improved the extractor. It is now possible to attach multiple mappers to a domain, effectively increasing the number of extracted elements and hence the extraction coverage.

Then, I continued with adding support for German and Spanish in all three initial domains (Actor, Writer, MusicalArtist). And that concluded the work for my second week.

This coming week, I’ll be looking forward to adding new domains to the extractor. Another task for next week will be to discuss an approach with Luca, a friend of mine who is working on a similar DBpedia project, to come up with a common template for mapping rules and make the extractor more effective and scalable.

You can follow my project on GitHub here.
