GSoC 2017 : Week 4
Status update for the 4th week
Last Sunday marked the end of the 4th week of my 3-month-long Summer of Code project. It was also the final week before the first evaluations, which take place this week. I’ve done what I could, and now my fate lies in the hands of my mentors…
Continuing from last week, I kept adding new domains to the list-extractor. The main idea is to identify domains that are likely to contain list elements and to write mapping rules and mapper functions for them; these mapper functions can later be reused by other domains too. This week I started on the PeriodicalLiterature domain, since it contains many lists that are common to many institutions.
While exploring these domains, I realized that most list elements have a date entry, which is an important piece of information in the lists. The existing year extractor only picked up years matching this regex:
#old regex
year_regex = ur'\s(\d{4})\s'
This missed nearly all of the date information, as it didn’t support months or periods of time. A major part of this week went into rewriting year_mapper() so that it adds months (if present) to the dates. The new mapper also tries to extract the period of years for an element (start date - end date) when one is present.
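To make the limitation concrete, here is a minimal sketch that runs the old pattern against a few made-up list elements. The sample strings and the `old_years` helper are my own illustrations, not part of the list-extractor, and the sketch uses a plain raw string because the `ur''` prefix only exists in Python 2.

```python
import re

# The old pattern from the post: a 4-digit year with whitespace on both sides.
old_year_regex = r'\s(\d{4})\s'

def old_years(list_elem):
    """Hypothetical helper: return every year the old pattern captures."""
    return re.findall(old_year_regex, list_elem)

# Made-up list elements, only for illustration.
samples = [
    "Founded in 1995 as a weekly",       # plain year surrounded by spaces
    "Published March 2005 – June 2010",  # period with months
    "(1998)",                            # year in parentheses
]

for text in samples:
    print(repr(text), '->', old_years(text))
```

The second sample yields only `2005`: the month names are ignored and the end year is dropped because nothing follows it. The third sample yields nothing at all, since the parentheses take the place of the required whitespace.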
#regex to figure out the presence of months in elements
month_list = { r'(january\s?)\d{4}': '1^', r'\W(jan\s?)\d{4}': '1^', r'(february\s?)\d{4}': '2^', r'\W(feb\s?)\d{4}': '2^',
    r'(march\s?)\d{4}': '3^', r'\W(mar\s?)\d{4}': '3^', r'(april\s?)\d{4}': '4^', r'\W(apr\s?)\d{4}': '4^',
    r'(may\s?)\d{4}': '5^', r'\W(may\s?)\d{4}': '5^', r'(june\s?)\d{4}': '6^', r'\W(jun\s?)\d{4}': '6^',
    r'(july\s?)\d{4}': '7^', r'\W(jul\s?)\d{4}': '7^', r'(august\s?)\d{4}': '8^', r'\W(aug\s?)\d{4}': '8^',
    r'(september\s?)\d{4}': '9^', r'\W(sep\s?)\d{4}': '9^', r'\W(sept\s?)\d{4}': '9^', r'(october\s?)\d{4}': '10^',
    r'\W(oct\s?)\d{4}': '10^', r'(november\s?)\d{4}': '11^', r'\W(nov\s?)\d{4}': '11^',
    r'(december\s?)\d{4}': '12^', r'\W(dec\s?)\d{4}': '12^'}
#flags to check presence of months/period
month_present = False
period_dates = False
for mon in month_list:
    if re.search(mon, list_elem, re.IGNORECASE):
        rep = re.search(mon, list_elem, re.IGNORECASE).group(1)
        list_elem = re.sub(rep, month_list[mon], list_elem, flags=re.I)
        month_present = True
#new year regex (complex, isn't it :P)
year_regex = ur'(?:\(?\d{1,2}\^)?\s?\d{4}\s?(?:–|-)\s?(?:\d{1,2}\^)?\s?\d{4}(?:\))?'
#regex for checking if it's a single year or period
if re.search(period_regex, list_elem, flags=re.IGNORECASE):
    period_dates = True
After rewriting year_mapper(), I finished the mappers and rules for PeriodicalLiterature and tested them on some resources belonging to the Magazine, Newspaper and AcademicJournal sub-domains. I also updated the awards/honors mapper, which can now differentiate between honorary degrees and awards. I finished the week by optimizing the code a bit: removing redundant code, replacing the existing year_mapper() with the new mapper in each module, and adding the newly written quote_mapper() resource extractor to the URI extraction process. After that, I merged all progress into master, as this is the final stable running version before the evaluations begin.
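The month tagging and period check from the year_mapper() rewrite can be sketched in a condensed, self-contained form as below. This is not the project's actual code: the `MONTHS` table, `tag_months` and `has_period` are hypothetical names, the month lookup is collapsed into a single alternation, and I am assuming that the range pattern shown in the post is the one used for the period test. It is written for Python 3, so the `ur''` prefix is replaced by a plain raw string.

```python
import re

# Hypothetical condensed month table: full and abbreviated names -> month number.
MONTHS = {
    'january': 1, 'jan': 1, 'february': 2, 'feb': 2, 'march': 3, 'mar': 3,
    'april': 4, 'apr': 4, 'may': 5, 'june': 6, 'jun': 6, 'july': 7, 'jul': 7,
    'august': 8, 'aug': 8, 'september': 9, 'sept': 9, 'sep': 9,
    'october': 10, 'oct': 10, 'november': 11, 'nov': 11,
    'december': 12, 'dec': 12,
}

# One alternation, longest names first, matched only when a 4-digit year follows.
MONTH_RE = re.compile(
    r'\b(' + '|'.join(sorted(MONTHS, key=len, reverse=True)) + r')\.?\s?(?=\d{4})',
    re.IGNORECASE,
)

# Range pattern copied from the post; using it as the period test is an assumption.
PERIOD_RE = re.compile(
    r'(?:\(?\d{1,2}\^)?\s?\d{4}\s?(?:–|-)\s?(?:\d{1,2}\^)?\s?\d{4}(?:\))?'
)

def tag_months(list_elem):
    """Rewrite 'March 2005' as '3^2005'; return the new text and a month_present flag."""
    tagged, count = MONTH_RE.subn(
        lambda m: '%d^' % MONTHS[m.group(1).lower()], list_elem
    )
    return tagged, count > 0

def has_period(list_elem):
    """True if the (already month-tagged) element contains a start-end year range."""
    return PERIOD_RE.search(list_elem) is not None

tagged, month_present = tag_months("Editor, March 2005 – June 2010")
print(tagged, month_present, has_period(tagged))
# prints: Editor, 3^2005 – 6^2010 True True
```

The `^` marker keeps the month number attached to its year, so a later step can tell `3^2005` (March 2005) apart from two unrelated numbers.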
Should I pass the first evaluation (I really feel I will :P), my next task, as discussed with Luca, is to work on a module that creates a new settings file and lets the user select the mapping functions to use for a domain during extraction. This will increase support for unmatched domains.
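Since that module doesn't exist yet, the following is only a rough sketch of what a user-editable settings file could look like; the real design may well differ. Everything in it is hypothetical: the JSON layout, the domain names used as keys, and the stand-in mappers `years_mapper` and `title_mapper`.

```python
import json
import re

# Stand-in mapper functions, for illustration only.
def years_mapper(list_elem):
    return {"years": re.findall(r"\b\d{4}\b", list_elem)}

def title_mapper(list_elem):
    return {"title": list_elem.split(",")[0].strip()}

# Registry of the mappers a user is allowed to pick from.
AVAILABLE_MAPPERS = {"years_mapper": years_mapper, "title_mapper": title_mapper}

# Hypothetical settings file content: domain -> ordered list of mapper names.
SETTINGS_JSON = """
{
  "Magazine": ["title_mapper", "years_mapper"],
  "SomeUnmatchedDomain": ["years_mapper"]
}
"""

def load_mappers(settings_text, domain):
    """Return the mapper callables selected for a domain (empty list if none)."""
    names = json.loads(settings_text).get(domain, [])
    unknown = [name for name in names if name not in AVAILABLE_MAPPERS]
    if unknown:
        raise ValueError("unknown mapper(s) in settings: %s" % ", ".join(unknown))
    return [AVAILABLE_MAPPERS[name] for name in names]

def extract(list_elem, domain, settings_text=SETTINGS_JSON):
    """Run every selected mapper on one list element and merge the results."""
    result = {}
    for mapper in load_mappers(settings_text, domain):
        result.update(mapper(list_elem))
    return result

print(extract("Wired, 1993 - 2017", "Magazine"))
```

With a layout like this, supporting a previously unmatched domain would only mean adding one entry to the settings file rather than writing new code.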
Let’s hope for the best!! :)
You can follow my project on GitHub here.
