Validation in JSON-LD

Introduction

Schema validation is an important tool in the distribution of metadata in formats such as XML and JSON: the standard-provider creates a schema (an XML Schema, or XSD, for XML documents; a JSON Schema for JSON documents). Schemas go beyond the basic notion of checking that a file is simply valid XML or valid JSON, a requirement just to be read in by any parser. By detailing how the metadata must be structured, which elements must, may, and may not be included, and which data types may be used for those elements, schemas help developers consuming the data to anticipate these details and thus build applications that know how to process them. For the data creator, validation is a convenient way to catch data input errors and ensure a consistent data structure.

Because schema validation must ensure predictable behavior without knowledge of what any specific application is going to do with the data, it tends to be very strict. A simple application may not care if certain fields are missing or if integers are mistaken for characters, while in another application the same differences could cause fatal errors.

The approach of JSON-LD is less prescriptive. JSON-LD uses the notion of “framing” to let each application specify how it expects its data to be structured. Frames allow each developer consuming the data to handle many of the same issues that schema validation has previously assured. Readers should consult the official JSON-LD framing documentation for details on this approach.

library(jsonld)
library(jsonlite)
library(magrittr)
library(codemetar)

A motivating example:

Consider the following codemeta document:

codemeta <- 
'
{
  "@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",
  "@type": "SoftwareSourceCode",
  "name": "codemetar: Generate CodeMeta Metadata for R Packages",
  "datePublished": "2017-05-20",
  "version": 1.0,
  "author": [
    {
      "@type": "Person",
      "givenName": "Carl",
      "familyName": "Boettiger",
      "email": "cboettig@gmail.com",
      "@id": "http://orcid.org/0000-0002-1642-628X"
    }],
  "maintainer":  {"@id": "http://orcid.org/0000-0002-1642-628X"}
}
'

Perhaps our application wants to construct an R person object from the author fields of the metadata (e.g. to build up a bibtex citation object). We might try something like:

# simplifyVector = FALSE keeps the nested list structure of the JSON intact
meta <- fromJSON(codemeta, simplifyVector = FALSE)
# Use the full field names rather than relying on partial matching by `$`
lapply(meta$author, 
       function(author) 
         person(given = author$givenName, 
                family = author$familyName, 
                email = author$email,
                role = "aut"))

[[1]]
[1] "Carl Boettiger <cboettig@gmail.com> [aut]"

Yay, that works as expected, since our metadata had all the fields we needed. However, other data is missing from our example, which could cause problems for our application. For instance, our first author lists no affiliation, so the following code returns NULL instead of a value:

meta$author[[1]]$affiliation

NULL

If we’re processing a lot of codemeta.json files and only one input file is missing the affiliation, it could disrupt our whole process. If codemeta.json were prescribed by a JSON schema, we could insist in the schema that affiliation may not be missing. But that feels a bit heavy-handed: many use cases have no need for affiliation. (Of course, we could just leave this problem for each developer to address explicitly with their own error-handling logic, but no developer would like that.)
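
To see how a single missing field can derail batch processing, consider the base R sketch below. The second author and both helper names (`authors`, `affiliation_or_na`) are made up for illustration; the point is only that strict collection with `vapply()` fails on the first NULL, and that the hand-written workaround has to be repeated by every consumer.

```r
# Two parsed author records; the second is a hypothetical, invented entry
authors <- list(
  list(givenName = "Carl", familyName = "Boettiger"),   # no affiliation
  list(givenName = "Ada", familyName = "Example",
       affiliation = "Example University")
)

# vapply() demands exactly one string per author, so the NULL returned
# for the first author raises an error that aborts the whole batch
result <- tryCatch(
  vapply(authors, function(a) a[["affiliation"]], character(1)),
  error = function(e) "failed"
)

# Hand-written error handling: substitute NA when the field is absent
affiliation_or_na <- function(a) {
  value <- a[["affiliation"]]
  if (is.null(value)) NA_character_ else value
}
safe <- vapply(authors, affiliation_or_na, character(1))
```

Here `result` ends up as the fallback string because the strict version failed, while `safe` holds one entry per author, with NA for the missing affiliation.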

Framing: missing data

To solve this issue, we construct a frame defining what we want. The frame can set a default value (using the keyword @default) for the author affiliation, so that our application does not trip over the missing field. Here we use the keyword @null for the JSON null type, though we could also give other defaults:

frame <- '{
  "@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",
  "@embed": "@always",
  "author": {
    "familyName": {},
    "affiliation": {"@default": "@null"}
  }
}'

meta <- 
  jsonld_frame(codemeta, frame) %>% 
  fromJSON(simplifyVector = FALSE) %>%   ## the framed JSON is piped in; keep it as a nested list
  getElement("@graph") %>% getElement(1) ## a piped version of [["@graph"]][[1]]

meta$author[[1]]$familyName

[1] "Boettiger"

meta$author[[1]]$affiliation

NULL
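
As a variation on the frame above, a non-null default can be supplied instead of @null. The sketch below is an assumption-laden illustration rather than part of the original example: it swaps the remote codemeta context for a small inline `@vocab` context (so no network access is needed), uses a trimmed-down document, and picks "Unknown" as an arbitrary default value.

```r
library(jsonld)

# Trimmed-down document with an inline context (assumption: schema.org vocabulary)
doc <- '{
  "@context": {"@vocab": "http://schema.org/"},
  "@type": "SoftwareSourceCode",
  "name": "example",
  "author": {"@type": "Person", "familyName": "Boettiger"}
}'

# Frame asking for a placeholder string when affiliation is absent
frame <- '{
  "@context": {"@vocab": "http://schema.org/"},
  "@type": "SoftwareSourceCode",
  "author": {"affiliation": {"@default": "Unknown"}}
}'

framed <- jsonld_frame(doc, frame)
cat(framed)
```

If the default is applied as expected, the author node in the printed output should carry the placeholder, which gives downstream code a string to work with rather than a missing element.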

Framing: subsetting data

By default, frames return all the input data, while our application may only be interested in some subset. Often it is sufficient just to ignore these additional terms: in the example above it’s just as easy for our application to work with author elements whether or not we have dropped other elements such as meta$version. To restrict a frame to returning only the nodes we explicitly mention, we can use the keyword @explicit:

frame <- '{
  "@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",
  "@explicit": "true",
  "@type": "Person",
  "givenName": {},
  "familyName": {}
}'

meta <- 
  jsonld_frame(codemeta, frame)  %>%
  fromJSON(simplifyVector = FALSE) %>% 
  getElement("@graph") 

meta[[1]]

$id
[1] "http://orcid.org/0000-0002-1642-628X"

$type
[1] "Person"

$familyName
[1] "Boettiger"

$givenName
[1] "Carl"

Note that this has returned only the requested fields in the graph (along with the @id and @type, which are always included if provided, since they may be required to interpret the data properly). This frame extracts the givenName and familyName of any Person node it finds, regardless of where it occurs, while omitting the rest of the data. Because the frame requests these elements at the top level, they are returned as such, with each match a separate entry in the @graph. Our example has only one person, in meta[[1]]; had we more matches, they would appear in meta[[2]], and so on. Note that these results are unordered.
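
Because the matches are unordered, it is safer to look nodes up by their identifier than by position. A minimal base R sketch follows; the `graph` list stands in for the parsed @graph, and its second entry is an invented placeholder added only so there is more than one match.

```r
# Stand-in for the parsed @graph; the second node is hypothetical
graph <- list(
  list(id = "http://orcid.org/0000-0002-1642-628X", type = "Person",
       familyName = "Boettiger", givenName = "Carl"),
  list(id = "https://example.org/people/2", type = "Person",
       familyName = "Example", givenName = "Ada")
)

# Name each node by its id so lookups do not depend on the order of matches
ids <- vapply(graph, function(node) node[["id"]], character(1))
by_id <- setNames(graph, ids)

family <- by_id[["http://orcid.org/0000-0002-1642-628X"]]$familyName
```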

Framing: expanding node references

The same underlying data can often be expressed in different ways, particularly when dealing with nested data. Framing can be of great help here to reshape the data into the structure required by the application. For instance, it would be natural to access the email of the maintainer in the same manner we did the author, but this fails for our example as maintainer is defined only by reference to an ID:

meta <- fromJSON(codemeta, simplifyVector = FALSE) 
paste("For complaints, email", meta$maintainer$email)

[1] "For complaints, email "

We can confirm that maintainer is just an ID:

meta$maintainer

$`@id`
[1] "http://orcid.org/0000-0002-1642-628X"

We can use a frame with the special directive "@embed": "@always" to say that we want the full maintainer information embedded and not just referred to by ID alone. Then we can subset maintainer just like we do author.

frame <- '{
  "@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",
  "@embed": "@always"
}'

meta <- 
  jsonld_frame(codemeta, frame) %>%
  fromJSON(simplifyVector = FALSE) %>% 
  getElement("@graph") %>% getElement(1)

Now we can do

paste("For complaints, email", meta$maintainer$email)

[1] "For complaints, email cboettig@gmail.com"

and see that email has been successfully returned from the matching ID under author data.
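
For intuition, the lookup that embedding performs can be imitated by hand in this simple case. The base R sketch below is only an illustration and not how the jsonld package works internally: `resolve_ref` is a hypothetical helper, and the list literal stands in for a subset of the parsed, unframed document.

```r
# Stand-in for the parsed, unframed codemeta document (subset of fields)
meta <- list(
  author = list(
    list(`@id` = "http://orcid.org/0000-0002-1642-628X",
         givenName = "Carl",
         familyName = "Boettiger",
         email = "cboettig@gmail.com")
  ),
  maintainer = list(`@id` = "http://orcid.org/0000-0002-1642-628X")
)

# Replace a bare {"@id": ...} reference with the full node sharing that id
resolve_ref <- function(ref, nodes) {
  for (node in nodes) {
    if (identical(node[["@id"]], ref[["@id"]])) return(node)
  }
  ref  # no match: leave the reference untouched
}

maintainer <- resolve_ref(meta$maintainer, meta$author)
complaint <- paste("For complaints, email", maintainer$email)
```

Framing spares each application from writing and maintaining this kind of lookup, and it also covers cases the sketch ignores, such as references nested at arbitrary depth.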

Handling unexpected types

JSON-LD routines will simply refuse to compact data if the type differs from what the context expects. Here is a sample data file declaring that the buildInstructions are included as text, whereas the context file explicitly states that buildInstructions should be a URL:

codemeta <- 
'
{
  "@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",
  "@type": "SoftwareSourceCode",
  "name": "codemetar: Generate CodeMeta Metadata for R Packages",
  "buildInstructions": { 
      "@value": "Just install this package using devtools::install_github", 
      "@type": "Text"
  }
}
'

When we perform a framing or compaction operation, buildInstructions is not compacted to the plain term but is left in the prefixed form codemeta:buildInstructions, because its declared type does not match the context. This means that if our application asks for meta$buildInstructions:

meta <-
  jsonld_frame(codemeta, '{"@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld"}') %>% 
  fromJSON(simplifyVector = FALSE) %>%  
  getElement("@graph") %>% getElement(1)    

## above is same as compacting:
#jsonld_compact(codemeta, "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld") %>% 
#  fromJSON(simplifyVector = FALSE)

meta$buildInstructions

NULL

We just get NULL, rather than some unexpected type of object (e.g. a string that is not a URL). Note that the data is not lost; it is simply not compacted to the term our application asked for:

names(meta)

[1] "id"                         "type"                      
[3] "name"                       "codemeta:buildInstructions"

meta["codemeta:buildInstructions"]

$`codemeta:buildInstructions`
$`codemeta:buildInstructions`$type
[1] "Text"

$`codemeta:buildInstructions`$`@value`
[1] "Just install this package using devtools::install_github"

Note that this behavior only happens because the data declared the "@type": "Text" explicitly. JSON-LD algorithms only believe what they are told about type and only look for consistency in declared types. If you give text but declare it as a "@type": "URL", or don’t declare the type at all, JSON-LD algorithms won’t know anything is amiss and the property will be compacted as usual.
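
An application that wants to notice such mismatches, rather than silently receive NULL, could scan for property names that were left in prefixed form. The base R sketch below assumes the names shown in the output above; the variable names are illustrative only.

```r
# Property names as returned by names(meta) in the example above
property_names <- c("id", "type", "name", "codemeta:buildInstructions")

# Terms that compacted cleanly contain no colon; prefixed or full IRIs do
uncompacted <- grep(":", property_names, fixed = TRUE, value = TRUE)

if (length(uncompacted) > 0) {
  message("Properties not matching the context: ",
          paste(uncompacted, collapse = ", "))
}
```

This only catches type conflicts that the data itself declares, in line with the caveat above.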

Carl Boettiger