Linked Data

An increasing number of information providers are publishing their data and information as Linked Data. Linked Data is seen by many as a key step in the realisation of the Semantic Web or Data Web. Linked Data uses the Web to facilitate connections between related data that was not previously linked.

Publishing Linked Data involves following a number of principles or guidelines, originally set out in Linked Data Design Issues. In summary:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that applications can dereference or look up those names.
  3. When a client requests a URI, provide useful information.
  4. Include links to other URIs, so that clients can discover more things.

In practice (for more details, see How to Publish Linked Data on the Web or Jeni Tennison's blog), this involves a publisher designing an appropriate URI scheme to refer to the "things" they are publishing information about, along with content negotiation and redirection configurations on the server to ensure that appropriate content is served to applications.

For example, a user's browser may be provided with human-readable HTML describing a resource, while a crawler application may be provided with an RDF description of the resource. Publishing Linked Data also often involves identifying "non-information resources", i.e., those things whose essential characteristics cannot be conveyed in a message. For such resources, the response to a request for information will usually be a redirection or "See Other" (HTTP 303), pointing the client at an information resource (a document) that provides a description of the non-information resource.
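As an illustration of the content negotiation and redirection described above, the following is a minimal server-side sketch in Python, not a description of how myExperiment is actually configured. It assumes a hypothetical URI scheme in which a "thing" URI such as http://example.org/workflows/12 is described by documents at the same URI with a .html or .rdf suffix; the function names, the suffix mapping and the choice of application/rdf+xml as the RDF media type are all assumptions made for this example.

```python
# Hypothetical mapping from media type to document suffix (an assumption for
# this sketch, not the myExperiment URI scheme). HTML is listed first so that
# it acts as the default representation.
SUFFIX = {"text/html": ".html", "application/rdf+xml": ".rdf"}


def preferred_type(accept_header, offered):
    """Return the offered media type with the highest q-value in the Accept
    header. Wildcards such as */* are ignored in this simplified sketch, so
    the first offered type is used when nothing matches exactly."""
    best, best_q = offered[0], 0.0
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0]
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if media_type in offered and q > best_q:
            best, best_q = media_type, q
    return best


def see_other(thing_uri, accept_header):
    """Build a 303 "See Other" response (status, headers) that points the
    client at a document describing the non-information resource."""
    media_type = preferred_type(accept_header, list(SUFFIX))
    headers = {"Location": thing_uri + SUFFIX[media_type], "Vary": "Accept"}
    return 303, headers


# A crawler asking for RDF and a browser asking for HTML get different documents.
crawler = see_other("http://example.org/workflows/12", "application/rdf+xml")
browser = see_other("http://example.org/workflows/12",
                    "text/html,application/xhtml+xml,*/*;q=0.8")
print(crawler[1]["Location"])
print(browser[1]["Location"])
```

In a real deployment this logic would more likely live in the web server's content negotiation and rewrite configuration than in application code; the sketch only shows the decision being made.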

Publishing as Linked Data brings a number of benefits.

  1. You can potentially ask questions nobody asked before.
  2. You can join, link or refer to things from outside your repository or information store that are not appropriate to store or keep up to date internally.
  3. You can expose all the information you can, in an interoperable way, to support querying of similar parts of heterogeneous systems.

myExperiment contains data about a number of things including users, workflows, files, packs and other contributions. By exposing this information as Linked Data, we will be able to integrate this information with other sources -- the use of common infrastructure and representation formats can also make it easier for others to discover and use the information that is published by myExperiment.

Publishing myExperiment data

There are two myExperiment services which make public data available.

RESTful API

www.myexperiment.org has a RESTful API, which allows programmatic access to the repository. We are currently trialling a version which will expose and publish this data using Linked Data principles. This requires us to define a URI scheme for the resources that myExperiment describes (users, workflows, files, packs, contributions, etc.), along with the delivery of information about those resources. A key question for us to investigate is how we might link the data in myExperiment with the rest of the Linked Data cloud. Possible links include links to the services that are used within a workflow (within the BioCatalogue project, we are also working towards Linked Data publication of information about the services described in BioCatalogue), links to alternative user identities, or links to the concepts that are salient to particular workflows. For example, the Bio2RDF effort is providing Linked Data versions of data sources such as KEGG or the NCI Thesaurus, which could serve as potential targets for links from workflows.

SPARQL Endpoint

RDF, SPARQL and Linked Data are closely linked -- indeed the latest version of Linked Data Design Issues explicitly refers to the use of RDF and SPARQL for delivery of information. myExperiment provides a SPARQL endpoint at http://rdf.myexperiment.org/sparql that provides access to the myExperiment data. Applications can then query this myExperiment data directly. Although the provision of the SPARQL endpoint is not Linked Data publishing per se, the intention is that the endpoint provides access to the same data that the Linked Data is a view on, and that the URIs referred to in the results served from the endpoint are "Linked Data friendly".
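To make "query this myExperiment data directly" concrete, here is a minimal Python sketch of how an application might build a request to the endpoint using only the standard library. It relies on the SPARQL Protocol convention of passing the query text in a `query` parameter. The query itself is a generic placeholder that does not use any myExperiment vocabulary, and the assumption that the endpoint can return the application/sparql-results+json format is ours and has not been checked against the service.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint address as given in the text above.
ENDPOINT = "http://rdf.myexperiment.org/sparql"


def build_sparql_request(query, endpoint=ENDPOINT):
    """Return an HTTP GET request that carries the SPARQL query in the
    'query' parameter, as defined by the SPARQL Protocol."""
    url = endpoint + "?" + urlencode({"query": query})
    # Assumption: the endpoint supports the JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Generic placeholder query: any five triples held by the endpoint.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

request = build_sparql_request(QUERY)
print(request.get_method(), request.full_url)

# Sending the request needs network access, so it is left commented out:
# from urllib.request import urlopen
# with urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```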

http://wiki.myexperiment.org/images/LinkedDataFigure.png

The figure above illustrates the ways in which the rdf.myexperiment.org and www.myexperiment.org sites/servers interact.

  1. A client poses a SPARQL query that asks for the workflows owned by a particular user. The query is sent to the myExperiment SPARQL endpoint at rdf.myexperiment.org. The endpoint makes use of the myExperiment data.
  2. The results of the query are returned. These include URIs for the workflows found.
  3. A client application can then invoke an HTTP GET on the URI for a particular workflow. HTTP header information is used to indicate the representation in which the response should be returned.
  4. If RDF is requested, the client gets back a machine-readable description of the workflow. This will include metadata about the resource (e.g. owner, creation date, etc.). A key aspect of Linked Data is that this information should include links to "other things". Here, the information about the workflow includes links to a service S used in the workflow and a data type D that is consumed by the workflow.
  5. S and D are URIs which refer to resources in BioCatalogue and the Bio2RDF data respectively. Thus our client can make requests to those services, seeking more data about the service and/or datatype.
  6. If HTML was requested by the client, an HTTP response containing an HTML (human-readable) representation will be returned.
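Steps 2 and 3 above, together with the choice between RDF (step 4) and HTML (step 6), can be sketched from the client's side as follows. This is an illustrative sketch only. It assumes the endpoint returned results in the standard SPARQL JSON results format, it uses an in-memory sample with hypothetical example.org workflow URIs rather than real myExperiment URIs, and it assumes application/rdf+xml as the media type a client would request for RDF. The function and variable names are invented for this example.

```python
import json
from urllib.request import Request

# Hypothetical query results in the SPARQL JSON results format (step 2).
# The URIs are placeholders, not real myExperiment workflow URIs.
SAMPLE_RESULTS = json.dumps({
    "head": {"vars": ["workflow"]},
    "results": {"bindings": [
        {"workflow": {"type": "uri", "value": "http://example.org/workflows/1"}},
        {"workflow": {"type": "uri", "value": "http://example.org/workflows/2"}},
    ]},
})


def workflow_uris(results_json, variable="workflow"):
    """Extract the URIs bound to one variable from SPARQL JSON results."""
    bindings = json.loads(results_json)["results"]["bindings"]
    return [b[variable]["value"] for b in bindings
            if variable in b and b[variable]["type"] == "uri"]


def describe_request(uri, want_rdf=True):
    """Build the HTTP GET of step 3. The Accept header tells the server
    whether the client wants RDF (step 4) or HTML (step 6)."""
    accept = "application/rdf+xml" if want_rdf else "text/html"
    return Request(uri, headers={"Accept": accept})


uris = workflow_uris(SAMPLE_RESULTS)
requests = [describe_request(uri) for uri in uris]
for req in requests:
    print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Following the links to S and D in step 5 would repeat the same pattern: take a URI found in the RDF description and issue another GET with an appropriate Accept header.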

The http://rdf.myexperiment.org site is also used to host the ontologies or vocabularies that have been defined for use in the myExperiment data. These ontologies use terms from a number of existing vocabularies such as Dublin Core, FOAF, SIOC, Creative Commons and OAI-ORE.

Future Plans

Our future plans include the improvement of the Linked Data publication via the RESTful API, providing a service that will then sit within the "Linked Data Cloud". In tandem with that, we will be conducting investigations into how myExperiment data might be linked (either through automation or user interaction) with other data sources such as BioCatalogue.

Eprints is also working towards the provision of data using Linked Data principles. A SPARQL endpoint will be provided in due course. As part of our myExperiment enhancement project, we will be investigating possible interactions between Eprints and myExperiment Linked Data provision.