About List-Extractor

About my Project for GSoC 2017: Wikipedia List-Extractor

May 31, 2017 Krishanu Konar


Okay, so today I’ll be writing a brief summary of what my project is all about. As the name itself suggests…

It extracts data from Wikipedia lists.

A Wikipedia list

Now hold on….

Isn’t that a simple task? That’s something a noob can do by writing a simple script that scrapes data off the Wikipedia pages. What’s so special about your project, huh?

It’s slightly more subtle than that. It’s not just about scraping the data and dumping it. The whole point of this project is to extract data and make it meaningful and connected, which is the very essence of the Semantic Web. It should also be user friendly, so that a person with limited computer knowledge can add more domains to the extractor.

From the data present in the wiki lists, we form triples that follow the W3C RDF standard. Instead of using static constants or strings, we store the URIs of the resources. This connects the triples to one another, and the result is a large knowledge graph that can be used to answer complex queries. The following snippet shows sample extracted triples for two musical albums and their related artists. Pretty sweet, eh?

@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .


dbr:In_The_Light_of_Fires_Burning dbo:musicalArtist <http://dbpedia.org/resource/John_Howard_(singer-songwriter)> ;
    dbo:releaseYear "2016-01-01"^^xsd:gYear .

dbr:In_The_Mood dbo:musicalArtist dbr:Nicole_Moudaber ;
    dbo:releaseYear "2013-01-01"^^xsd:gYear .
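To make the idea of "URIs instead of strings" concrete, here is a minimal Python sketch of how list entries could be turned into linked triples. It is not the list-extractor's actual code: the helper names (`resource_uri`, `album_triples`, `to_ntriples`, `albums_by`) are hypothetical, it uses only the standard library, and it writes N-Triples rather than the Turtle shown above. It also writes the year as a plain four-digit value, which is the lexical form that `xsd:gYear` defines.

```python
from urllib.parse import quote

DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"
XSD_GYEAR = "http://www.w3.org/2001/XMLSchema#gYear"


def resource_uri(label):
    """Turn a list-entry label into a DBpedia-style resource URI (assumed naming rule)."""
    # Spaces become underscores; other unsafe characters are percent-encoded.
    return DBR + quote(label.strip().replace(" ", "_"), safe="_()-,'")


def album_triples(title, artist, year):
    """Return (subject, predicate, object) triples for one album entry."""
    subject = resource_uri(title)
    return [
        # The artist is stored as a URI, not a string, so it can link to other triples.
        (subject, DBO + "musicalArtist", ("uri", resource_uri(artist))),
        (subject, DBO + "releaseYear", ("literal", str(year), XSD_GYEAR)),
    ]


def to_ntriples(triples):
    """Serialize triples in N-Triples syntax, one statement per line."""
    lines = []
    for s, p, o in triples:
        if o[0] == "uri":
            obj = "<%s>" % o[1]
        else:
            obj = '"%s"^^<%s>' % (o[1], o[2])
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines)


def albums_by(triples, artist):
    """Follow the URI link: find every album whose musicalArtist is this artist."""
    target = ("uri", resource_uri(artist))
    return [s for s, p, o in triples if p == DBO + "musicalArtist" and o == target]


# One entry taken from the sample output above.
entries = [("In The Mood", "Nicole Moudaber", 2013)]
triples = [t for e in entries for t in album_triples(*e)]
print(to_ntriples(triples))
```

Because the artist is a URI, `albums_by(triples, "Nicole Moudaber")` can find the album by matching the resource itself rather than comparing free-text names. That is a tiny version of what the knowledge graph lets you do at scale.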

We use the existing DBpedia ontology to gather the related resources. In this project, we use JSONpedia Live, a service from another project that was started in GSoC 2014 and is currently maintained by Michele Mostarda. This live service returns a valid JSON response for a given resource, which can be parsed to extract the relevant information. Of course, being a web-based service, it might go down if it receives a high volume of requests, so we need to use the JSONpedia library directly in our project. A small catch though: it’s written in Java. Integrating the library will be an important task in the later stages of my project.

So, to summarize, the main objectives of my project are to extend the existing list-extractor tool to cover more kinds of resources, to generate new datasets that can be added to the DBpedia datasets (and so add more data to the existing knowledge base), and to integrate the JSONpedia library so that the extractor no longer depends on the live service!

Let the code begin!!
