GSoC 2017 : Week 1

Status update for the first week

June 5, 2017 Krishanu Konar


With the first week now behind us, it’s time for the first progress report.

The first week was mainly about reviewing the existing code for potential improvements. I went over the code base and made small tweaks to it, and I added the __init__ module and docstring. I also improved the method that creates the resource dictionary, extracts the triples, and stores them, so that the junk values observed during extraction are removed. One of the methods added for this is remove_symbols().

def remove_symbols(listDict_key):
    ''' Removes symbols and garbage characters that pollute the values to be inserted.

    :param listDict_key: list of dictionary entries (values) obtained from parsing; may contain nested lists
    :return: the same list, cleaned in place, with '&nbsp;' removed from every string
    '''
    for i in range(len(listDict_key)):
        value = listDict_key[i]
        if isinstance(value, list):
            # nested lists are cleaned in place by the recursive call
            remove_symbols(value)
        else:
            listDict_key[i] = value.replace('&nbsp;', '')

    return listDict_key
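To make the behaviour concrete, here is a small self-contained sketch of the same idea, written for Python 3 as a non-mutating variant. The name `clean_values` and the sample data are invented for illustration and are not part of the list-extractor.

```python
def clean_values(values):
    """Return a cleaned copy of `values` with '&nbsp;' removed from every string.

    Hypothetical helper for illustration only; nested lists are cleaned recursively.
    """
    cleaned = []
    for value in values:
        if isinstance(value, list):
            # Recurse into nested lists and keep the nesting intact.
            cleaned.append(clean_values(value))
        else:
            cleaned.append(value.replace('&nbsp;', ''))
    return cleaned


# Sample data, made up for this example.
raw = ['Abbey Road&nbsp;', ['Let It Be', '&nbsp;Revolver']]
print(clean_values(raw))  # ['Abbey Road', ['Let It Be', 'Revolver']]
```

Unlike remove_symbols(), this variant leaves its input untouched, which can make it easier to test in isolation.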

Another addition is a method that stores the statistics of every extraction in a CSV file. It will be used later to evaluate the extractor's performance, and in the meantime it logs the statistics of each extraction that is run.

import csv

def evaluate(lang, source, tot_extracted_elems, tot_elems):
    ''' Evaluates the extraction process and stores the result in a csv file.

    :param lang: language of the resources
    :param source: resource type (dbpedia ontology type)
    :param tot_extracted_elems: number of list elements extracted in the resources.
    :param tot_elems: total number of list elements present in the resources.
    '''
    print("\nEvaluation:\n===========\n")
    print("Resource Type: " + lang + ":" + source)
    print("Total list elements found: %s" % tot_elems)
    print("Total elements extracted: %s" % tot_extracted_elems)
    # fraction of list elements that were extracted; assumes tot_elems > 0
    accuracy = (1.0*tot_extracted_elems)/tot_elems
    print("Accuracy: %s" % accuracy)

    # append one row per run, so the file accumulates the history of all extractions
    with open('evaluation.csv', 'a') as csvfile:
        filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        filewriter.writerow([lang, source, tot_extracted_elems, tot_elems, accuracy])
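Since evaluate() appends one row per run, the accumulated file can later be read back to get an overall picture. The following is only a rough Python 3 sketch of how that might look; the function name `summarize` is hypothetical, and it assumes the five-column row layout and the csv dialect (`,` delimiter, `|` quote character) used above.

```python
import csv
from collections import defaultdict


def summarize(path='evaluation.csv'):
    """Aggregate logged rows into one accuracy value per (lang, source).

    Hypothetical helper: expects rows of the form
    lang, source, extracted, total, accuracy.
    """
    totals = defaultdict(lambda: [0, 0])  # (lang, source) -> [extracted, total]
    with open(path, newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='|')
        for row in reader:
            if len(row) < 4:
                continue  # skip blank or malformed lines
            lang, source, extracted, total = row[:4]
            totals[(lang, source)][0] += int(extracted)
            totals[(lang, source)][1] += int(total)
    # Recompute accuracy from the summed counts; report 0.0 when nothing was found.
    return {key: (ext / tot if tot else 0.0)
            for key, (ext, tot) in totals.items()}
```

Calling `summarize()` after a few extraction runs would return a dictionary keyed by (language, resource type). Recomputing accuracy from the summed counts, rather than averaging the stored accuracy column, weights each run by the number of list elements it saw.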

Lastly, I merged the MusicalArtist domain, which I had already implemented as part of my GSoC warm-up task, into the existing code. It still needs finer-grained extraction functions, which will be added later. As discussed with my mentors, I’m currently looking at ways to make the list-extractor more scalable, and I’ll keep looking for potential problems in the existing code and improve it wherever required.

In the coming week, I plan to add more languages to the existing domains and then, as discussed with my mentors, look into the list-extractor's potential for scalability.

You can follow my project on GitHub here.
