GSoC 2017: Week 1
Status update for the first week
With the first week behind us, it’s time for the first progress report.
The first week was mainly about checking the existing code for potential improvements, so I went over it and made slight tweaks. I added the __init__ module and docstring. I also improved the method that creates the resource dictionary and extracts and stores the triples, to get rid of the junk values that were observed during extraction. One of the added methods was remove_symbols().

    def remove_symbols(listDict_key):
        ''' Removes symbols and garbage characters that pollute the values to be inserted.

        :param listDict_key: list of entries (values) obtained from parsing; may contain nested lists
        :return: the same list, with every string value cleaned
        '''
        for i in range(len(listDict_key)):
            value = listDict_key[i]
            if isinstance(value, list):
                # nested lists are cleaned in place by the recursive call
                remove_symbols(value)
            else:
                listDict_key[i] = value.replace(' ', '')
        return listDict_key

Another addition was a method that stores the statistics of every extraction run in a csv file. It will be used later to evaluate the performance of the extractor, and in the meantime it logs the statistics of the extractions that are performed.

    import csv

    def evaluate(lang, source, tot_extracted_elems, tot_elems):
        ''' Evaluates the extraction process and stores the result in a csv file.

        :param lang: language of the resources
        :param source: resource type (DBpedia ontology type)
        :param tot_extracted_elems: number of list elements extracted in the resources.
        :param tot_elems: total number of list elements present in the resources.
        '''
        print("\nEvaluation:\n===========\n")
        print("Resource Type: " + lang + ":" + source)
        print("Total list elements found: " + str(tot_elems))
        print("Total elements extracted: " + str(tot_extracted_elems))
        accuracy = (1.0 * tot_extracted_elems) / tot_elems
        print("Accuracy: " + str(accuracy))
        # append one row per run so the statistics accumulate across extractions
        with open('evaluation.csv', 'a') as csvfile:
            filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
            filewriter.writerow([lang, source, tot_extracted_elems, tot_elems, accuracy])

Lastly, I merged the MusicalArtist domain, which I had already implemented as part of my GSoC warm-up task, into the existing code. It still requires finer-grained extraction functions, which will be added later on. As discussed with my mentors, I’m currently looking at ways to make the list-extractor more scalable. I’ll also keep looking for potential problems in the existing code and improve it wherever required.
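Coming back to the evaluation log for a moment: since evaluate() appends one row per run, the file can later be read back and summarized. The following is only a minimal sketch of how that could look, not part of the extractor itself. It is written for Python 3, the helper name summarize_evaluations is hypothetical, and it assumes every row has the column order lang, source, tot_extracted_elems, tot_elems, accuracy used in evaluate().

```python
import csv
from collections import defaultdict


def summarize_evaluations(path='evaluation.csv'):
    """Aggregate the logged rows per (lang, source) pair.

    Hypothetical helper: assumes each row is
    lang, source, tot_extracted_elems, tot_elems, accuracy
    and was written with ',' as delimiter and '|' as quote character.
    Returns a dict mapping (lang, source) to the overall extraction ratio.
    """
    extracted = defaultdict(int)
    total = defaultdict(int)
    with open(path, newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='|')
        for row in reader:
            if len(row) < 4:
                continue  # skip blank or malformed rows
            key = (row[0], row[1])
            extracted[key] += int(row[2])
            total[key] += int(row[3])
    # Recompute the ratio from the summed counts rather than averaging the
    # stored per-run accuracies, so that larger runs carry more weight.
    return {key: extracted[key] / total[key] for key in total if total[key] > 0}
```

Calling summarize_evaluations() with no argument reads evaluation.csv from the current directory. Pairs whose total element count is zero are left out of the result to avoid dividing by zero.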
This week, I plan to add more languages to the existing domains and then, as discussed with my mentors, look into the scalability potential of the list-extractor.
You can follow my project on GitHub here.
