GSoC 2017 : Week 2
Status update for the 2nd week
The second week has flown by, so it's time for another progress report.
The primary tasks for the second week were to add Spanish and German support to the existing Writer and Actor domains, and to integrate the MusicalArtist domain that was part of my warmup task.
So, I started off by adding the MusicalArtist domain to my current codebase. It was fairly straightforward (for the most part) and it worked like a charm.
ref = reference_mapper(elem)  # look for resource references
if ref:  # current element contains a reference
    uri = wikidataAPI_call(ref, lang)  # try to reconcile the resource with the Wikidata API
    if uri:
        dbpedia_uri = find_DBpedia_uri(uri, lang)  # try to find the equivalent DBpedia resource
        if dbpedia_uri:  # if a DBpedia resource is found, use it as the statement subject
            uri = dbpedia_uri
    else:  # take the reference name anyway if it can't be reconciled
        ref = list_elem_clean(ref)
        elem = elem.replace(ref, "")  # subtract the reference part from the list element, to facilitate further parsing
        uri_name = ref.replace(' ', '_')
        uri_name = urllib2.quote(uri_name)  # percent-encode the resource name
        uri = dbr + uri_name.decode('utf-8', errors='ignore')
    g.add((rdflib.URIRef(uri), rdf.type, dbo.Album))
    g.add((rdflib.URIRef(uri), dbo.musicalArtist, res))

However, diving deeper into many musical artists, I noticed that the extractor wasn't working very efficiently and constantly missed many elements. That's when I realised that an actor could well have recorded a few songs, or a musician might have acted in a movie, while the extractor was looking only in one particular section for each resource. It's funny and astounding at the same time that one can completely miss such an intuitive thing. Anyway, a big overhaul was needed.
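Before moving on, the control flow of the snippet above can be made easier to test in isolation. Here is a minimal, self-contained sketch (Python 3, so `urllib.parse.quote` replaces `urllib2.quote`; the Wikidata/DBpedia lookup chain is stubbed out and all names here are hypothetical):

```python
from urllib.parse import quote

DBR = "http://dbpedia.org/resource/"  # dbr namespace prefix

def build_subject_uri(ref, lang, reconcile=None):
    """Sketch of the reconciliation-with-fallback logic above.

    `reconcile` stands in for the Wikidata API / DBpedia lookup chain;
    when it is unavailable or returns None, we fall back to building a
    dbr: URI directly from the cleaned reference name.
    """
    uri = reconcile(ref, lang) if reconcile else None
    if uri:
        return uri
    # fallback: percent-encode the underscore-joined reference name
    return DBR + quote(ref.strip().replace(' ', '_'))

# with no reconciliation available, we get a dbr: URI built from the name
print(build_subject_uri("The Dark Side of the Moon", "en"))
# http://dbpedia.org/resource/The_Dark_Side_of_the_Moon
```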
So, after analyzing many articles from different domains, I realised that several domains have intersecting sections, and I had to change my approach. From now on, I'll be focusing on writing mapping functions that can extract list elements from a given section. New domains can then be added in mapping_rules.py, including the various sections that might exist in that domain's articles.
For this, I had to completely restructure my mapping_rules file. The rules now consist of two multi-level dictionaries: the first maps the domain of a resource to the sections it could be related to, and the second maps each section type to its language-specific section titles.
MAPPING = {
    'Person': ['FILMOGRAPHY', 'DISCOGRAPHY', 'BIBLIOGRAPHY', 'HONORS'],
    'Writer': ['BIBLIOGRAPHY', 'HONORS'],
    'MusicalArtist': ['DISCOGRAPHY', 'FILMOGRAPHY', 'CONCERT_TOURS', 'HONORS'],
    'Band': ['DISCOGRAPHY', 'CONCERT_TOURS', 'BAND_MEMBERS', 'HONORS'],
}
BIBLIOGRAPHY = {
    'en': ['bibliography', 'works', 'novels', 'books', 'publications'],
    'it': ['opere', 'romanzi', 'saggi', 'pubblicazioni', 'edizioni'],
    'de': ['bibliographie', 'werke', 'arbeiten', 'bücher', 'publikationen'],
    'es': ['Obras', 'Bibliografía'],
}

I also had to change the select mapping function to handle multiple sections.
domains = MAPPING[res_class]  # e.g. ['BIBLIOGRAPHY', 'FILMOGRAPHY']
domain_keys = []
resource_class = res_class
for domain in domains:
    if domain in mapped_domains:
        continue
    if lang in eval(domain):
        domain_keys = eval(domain)[lang]  # e.g. ['bibliography', 'works', ..]
    else:
        print("The language provided is not available yet for this mapping")
    mapped_domains.append(domain)  # this domain won't be used again for mapping
    for res_key in resDict.keys():  # iterate over resource dictionary keys
        mapped = False
        for dk in domain_keys:  # search for resource keys related to the selected domain
            # if the section hasn't been mapped yet and the titles match, apply the domain-related mapping
            dk = dk.decode('utf-8')  # make sure utf-8 mismatches don't skip sections
            if not mapped and re.search(dk, res_key, re.IGNORECASE):
                mapper = "map_" + domain.lower() + "(resDict[res_key], res_key, db_res, lang, g, 0)"
                res_elems += eval(mapper)  # calls the proper mapping for that domain and counts extracted elements
                mapped = True  # prevents the same section from being mapped again

This major change in the selection of mapper functions greatly improved the extractor. It is now possible to attach multiple mappers to a single domain, which increases the number of extracted elements and, in turn, the coverage of the extraction.
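The dictionary-driven dispatch above can be illustrated with a small self-contained sketch. The tables and section titles here are trimmed-down, hypothetical stand-ins; the real code calls the map_* functions via eval, while this sketch just records which mapper each section title would be routed to:

```python
import re

# hypothetical, trimmed-down versions of the real mapping tables
MAPPING = {'Writer': ['BIBLIOGRAPHY']}
SECTION_KEYWORDS = {
    'BIBLIOGRAPHY': {'en': ['bibliography', 'works', 'novels']},
}

def select_mappings(res_class, res_dict, lang):
    """Return a dict routing each matched section title to a mapper name.

    Sketch of the two-level lookup: domain -> section types -> per-language
    section keywords, matched case-insensitively against section titles.
    """
    matched = {}
    for domain in MAPPING.get(res_class, []):
        keywords = SECTION_KEYWORDS[domain].get(lang, [])
        for section in res_dict:
            if section in matched:
                continue  # each section is mapped at most once
            if any(re.search(kw, section, re.IGNORECASE) for kw in keywords):
                matched[section] = "map_" + domain.lower()
    return matched

sections = {'Bibliography': [], 'Early life': []}
print(select_mappings('Writer', sections, 'en'))
# {'Bibliography': 'map_bibliography'}
```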
Then, I continued by adding support for German and Spanish in all three initial domains (Actor, Writer, MusicalArtist). And that concluded the work for my second week.
This coming week, I'll be looking forward to adding new domains to the extractor. Another task for next week will be discussing an approach with Luca, a friend who is also working on a similar DBpedia project, to come up with a common template for the mapping rules and make the extractor more effective and scalable.
You can follow my project on GitHub here.
