Semaku supports its customers not only in delivering Business Intelligence solutions and Data Integration Services, but also in analysing their data and fine-tuning the data requirements.

Our data scientists help define the aggregated data intelligence and/or integration question, translate it into a data model, and deliver the solution in a user-friendly format or frontend.

Currently we offer data integration and Business Intelligence services for our customers NXP Semiconductors and Nexperia Semiconductors.

Linked Data Quality Assurance application for NXP Semiconductors

Semaku developed a product search application for NXP Semiconductors in cooperation with Dydra. With this lightweight solution the customer is able to apply Linked Data technology for Business Intelligence and search purposes. Combining different sources via a “datahub” with a user-friendly front-end application offers insight into detailed product information. Semaku provided data science, data modelling and application development support. Multiple databases are queried via Dydra to gather as much information as possible about each product.


We co-developed the eptos™ Spec Sheet Generator (SSG) with Paradine GmbH. The SSG enables you to minimize the time and effort required to render and publish technical documents from your product master data. The eptos™ SSG is offered as an add-on module for the Paradine eptos™ framework.

Semaku is currently working on a Linked Data intelligence application in which different data sources are coupled in a powerful, cost-effective solution. In addition, the process to onboard new data sources is standardized in terms of modelling and interfacing. A first faceted-search viewer provides the required Business Intelligence insights. This solution is currently used for quality analysis at one of our customers.

Open Linked Data Platform Kadaster

The Kadaster is developing the data platform of the future, in which open datasets such as the Basis Adres Gegevens (BAG), BasisRegistratie Topografie (BRT) and BasisRegistratie Kadastrale kaart (BRK) are made available as Linked Open Data. The Linked Data is used for exploring data across datasets; once it is known exactly which data can be used for, e.g., an application or Business Intelligence purposes, the dataset APIs, which are also served via the data platform, can be used. Semaku provides consultancy services here in the form of Product Owner and architecture/data science expertise.


Genealogy Linked Data application for the Dommeldal library

Semaku supports the Dommeldal library by improving access to genealogy data. An application will be implemented to visualise the data. Personal events such as birth and marriage will be linked to the people involved in these events. For any person a timeline can be shown which lists all their events and links them to other persons.


Semaku develops custom applications for customers; several examples can be found in the use cases.

Development, Service and Maintenance

Semaku offers development and maintenance support for Java, XML, RDF and Vue.js applications.

For our customer Nexperia Semiconductors, Semaku provides DevOps engineering services in the Content and Product Information Management area related to the XML database, RDF database and the Java application stack. These applications are deployed to the AWS Cloud and are used for content syndication purposes.

For our partner Paradine GmbH we provide Java development support utilizing technologies such as Apache Solr.


We offer time hire services for architectural, development, project & program management, product owner and business analyst roles.

Currently we fulfil a range of roles for customers such as Philips HealthTech, with a cloud infrastructure consulting role, and NXP Semiconductors, with project management, architecture and development support.

For the Kadaster we support the development of the data platform of the future in a Product Owner role.

In a recent post, Dean Allemang explained why he’s not excited about RDF-star. In this post I want to expand on Dean’s post with more concrete examples to illustrate why I’m concerned about RDF-star.

Global triples and occurrences of triples

In the flights example, Dean mentions that using RDF-star would encourage poor modeling practice. So how exactly would that manifest if we choose to model multiple flights with multiple triples:

:NYC :connection :SFO .
:NYC :connection :SFO .
:NYC :connection :SFO .

First we note that “an RDF-star triple is an abstract entity whose identity is entirely defined by its subject, predicate, and object” and “this unique triple (s, p, o) can be quoted as the subject or object of multiple other triples, but must be assumed to represent the same thing everywhere it occurs”. So in the above example, we have the same RDF-star triple asserted three times.

This is already familiar to experienced RDF users: when processing a graph with duplicate triples, those duplicates may be de-duplicated.
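For example, counting matches of the asserted pattern after parsing (a minimal SPARQL sketch using the prefixes from the example) should report the triple only once, however many times it appeared in the source document:

select (count(*) as ?count)
where {
  :NYC :connection :SFO .
}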

When quoting this triple in RDF-star, a novice user might assert additional information about the multiple flights as follows:

:NYC :connection :SFO .
<< :NYC :connection :SFO >>
  :departureGate "10" ;
  :departureTime "2023-04-22T00:00:00Z"^^xsd:dateTime ;
  :arrivalTime "2023-04-22T06:00:00Z"^^xsd:dateTime .

:NYC :connection :SFO .
<< :NYC :connection :SFO >>
  :departureGate "20" ;
  :departureTime "2023-04-23T00:00:00Z"^^xsd:dateTime ;
  :arrivalTime "2023-04-23T06:00:00Z"^^xsd:dateTime .

:NYC :connection :SFO .
<< :NYC :connection :SFO >>
  :departureGate "30" ;
  :departureTime "2023-04-24T00:00:00Z"^^xsd:dateTime ;
  :arrivalTime "2023-04-24T06:00:00Z"^^xsd:dateTime .

Or the equivalent using annotation syntax:

:NYC :connection :SFO {|
    :departureGate "10" ;
    :departureTime "2023-04-22T00:00:00Z"^^xsd:dateTime ;
    :arrivalTime "2023-04-22T06:00:00Z"^^xsd:dateTime .
  |} .

:NYC :connection :SFO {|
    :departureGate "20" ;
    :departureTime "2023-04-23T00:00:00Z"^^xsd:dateTime ;
    :arrivalTime "2023-04-23T06:00:00Z"^^xsd:dateTime .
  |} .

:NYC :connection :SFO {|
    :departureGate "30" ;
    :departureTime "2023-04-24T00:00:00Z"^^xsd:dateTime ;
    :arrivalTime "2023-04-24T06:00:00Z"^^xsd:dateTime .
  |} .

However, running this through the Jena Riot command-line tool with formatted output, we immediately see a problem:

:NYC :connection :SFO .

<< :NYC :connection :SFO >>
  :arrivalTime
    "2023-04-24T06:00:00Z"^^xsd:dateTime ,
    "2023-04-23T06:00:00Z"^^xsd:dateTime ,
    "2023-04-22T06:00:00Z"^^xsd:dateTime ;
  :departureGate
    "30" ,
    "20" ,
    "10" ;
  :departureTime
    "2023-04-24T00:00:00Z"^^xsd:dateTime ,
    "2023-04-23T00:00:00Z"^^xsd:dateTime ,
    "2023-04-22T00:00:00Z"^^xsd:dateTime .

Because the quoted triple represents the same thing (resource) everywhere it occurs, we have clashing/conflicting use of the resource identifier (being the quoted triple). Thus, we end up with all the statements being about the same thing, not three different flights. So it is impossible to know which departure gate relates to which departure time, which is not ideal if we don’t want to miss our flight!

Querying the graph using a SPARQL-star query further illustrates the effect:

select *
where {
  :NYC :connection :SFO {|
      :departureGate ?gate ;
      :departureTime ?departs ;
      :arrivalTime ?arrives
    |} .
}

Gives the results:

--------------------------------------------------------------------------------------
| gate | departs                              | arrives                              |
======================================================================================
| "10" | "2023-04-24T00:00:00Z"^^xsd:dateTime | "2023-04-23T06:00:00Z"^^xsd:dateTime |
| "10" | "2023-04-24T00:00:00Z"^^xsd:dateTime | "2023-04-24T06:00:00Z"^^xsd:dateTime |
| "10" | "2023-04-24T00:00:00Z"^^xsd:dateTime | "2023-04-22T06:00:00Z"^^xsd:dateTime |
| "10" | "2023-04-22T00:00:00Z"^^xsd:dateTime | "2023-04-23T06:00:00Z"^^xsd:dateTime |
| "10" | "2023-04-22T00:00:00Z"^^xsd:dateTime | "2023-04-24T06:00:00Z"^^xsd:dateTime |
| "10" | "2023-04-22T00:00:00Z"^^xsd:dateTime | "2023-04-22T06:00:00Z"^^xsd:dateTime |
| "10" | "2023-04-23T00:00:00Z"^^xsd:dateTime | "2023-04-23T06:00:00Z"^^xsd:dateTime |
| "10" | "2023-04-23T00:00:00Z"^^xsd:dateTime | "2023-04-24T06:00:00Z"^^xsd:dateTime |
| "10" | "2023-04-23T00:00:00Z"^^xsd:dateTime | "2023-04-22T06:00:00Z"^^xsd:dateTime |
| "20" | "2023-04-24T00:00:00Z"^^xsd:dateTime | "2023-04-23T06:00:00Z"^^xsd:dateTime |
| "20" | "2023-04-24T00:00:00Z"^^xsd:dateTime | "2023-04-24T06:00:00Z"^^xsd:dateTime |
| "20" | "2023-04-24T00:00:00Z"^^xsd:dateTime | "2023-04-22T06:00:00Z"^^xsd:dateTime |
| "20" | "2023-04-22T00:00:00Z"^^xsd:dateTime | "2023-04-23T06:00:00Z"^^xsd:dateTime |
| "20" | "2023-04-22T00:00:00Z"^^xsd:dateTime | "2023-04-24T06:00:00Z"^^xsd:dateTime |
| "20" | "2023-04-22T00:00:00Z"^^xsd:dateTime | "2023-04-22T06:00:00Z"^^xsd:dateTime |
| "20" | "2023-04-23T00:00:00Z"^^xsd:dateTime | "2023-04-23T06:00:00Z"^^xsd:dateTime |
| "20" | "2023-04-23T00:00:00Z"^^xsd:dateTime | "2023-04-24T06:00:00Z"^^xsd:dateTime |
| "20" | "2023-04-23T00:00:00Z"^^xsd:dateTime | "2023-04-22T06:00:00Z"^^xsd:dateTime |
| "30" | "2023-04-24T00:00:00Z"^^xsd:dateTime | "2023-04-23T06:00:00Z"^^xsd:dateTime |
| "30" | "2023-04-24T00:00:00Z"^^xsd:dateTime | "2023-04-24T06:00:00Z"^^xsd:dateTime |
| "30" | "2023-04-24T00:00:00Z"^^xsd:dateTime | "2023-04-22T06:00:00Z"^^xsd:dateTime |
| "30" | "2023-04-22T00:00:00Z"^^xsd:dateTime | "2023-04-23T06:00:00Z"^^xsd:dateTime |
| "30" | "2023-04-22T00:00:00Z"^^xsd:dateTime | "2023-04-24T06:00:00Z"^^xsd:dateTime |
| "30" | "2023-04-22T00:00:00Z"^^xsd:dateTime | "2023-04-22T06:00:00Z"^^xsd:dateTime |
| "30" | "2023-04-23T00:00:00Z"^^xsd:dateTime | "2023-04-23T06:00:00Z"^^xsd:dateTime |
| "30" | "2023-04-23T00:00:00Z"^^xsd:dateTime | "2023-04-24T06:00:00Z"^^xsd:dateTime |
| "30" | "2023-04-23T00:00:00Z"^^xsd:dateTime | "2023-04-22T06:00:00Z"^^xsd:dateTime |
--------------------------------------------------------------------------------------

We end up with the cross-product of the three values for each of :departureGate, :departureTime and :arrivalTime.

The RDF-star specification recognizes the distinction between global triples and specific asserted occurrences of triples. The latter requires additional nodes in the graph to represent the distinct occurrences of the triple. Following the approach presented in the specification:

:NYC :connection :SFO .

[] :occurenceOf << :NYC :connection :SFO >> ;
  :departureGate "10" ;
  :departureTime "2023-04-22T00:00:00Z"^^xsd:dateTime ;
  :arrivalTime "2023-04-22T06:00:00Z"^^xsd:dateTime .

[] :occurenceOf << :NYC :connection :SFO >> ;
   :departureGate "20" ;
   :departureTime "2023-04-23T00:00:00Z"^^xsd:dateTime ;
   :arrivalTime "2023-04-23T06:00:00Z"^^xsd:dateTime .

[] :occurenceOf << :NYC :connection :SFO >> ;
   :departureGate "30" ;
   :departureTime "2023-04-24T00:00:00Z"^^xsd:dateTime ;
   :arrivalTime "2023-04-24T06:00:00Z"^^xsd:dateTime .

Though at that point we end up so close to a ‘proper’ domain model that it is a better proposition to improve the model from the start.
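For comparison, a minimal sketch of such a model could introduce an explicit flight resource per occurrence (the :Flight class and the :origin/:destination properties are illustrative, not taken from the original example):

[] a :Flight ;
  :origin :NYC ;
  :destination :SFO ;
  :departureGate "10" ;
  :departureTime "2023-04-22T00:00:00Z"^^xsd:dateTime ;
  :arrivalTime "2023-04-22T06:00:00Z"^^xsd:dateTime .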

Other options might be to retain the context of the qualifying statements using named graphs, or add more recursive levels of reification. Neither is particularly appealing.
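For completeness, the named-graph variant could look something like the following TriG sketch (graph names illustrative), with each occurrence asserted in its own graph and the qualifying statements attached to the graph name:

graph :flight1 { :NYC :connection :SFO . }

:flight1 :departureGate "10" ;
  :departureTime "2023-04-22T00:00:00Z"^^xsd:dateTime ;
  :arrivalTime "2023-04-22T06:00:00Z"^^xsd:dateTime .

graph :flight2 { :NYC :connection :SFO . }

:flight2 :departureGate "20" ;
  :departureTime "2023-04-23T00:00:00Z"^^xsd:dateTime ;
  :arrivalTime "2023-04-23T06:00:00Z"^^xsd:dateTime .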

Ultimately the advice on n-ary relations and reification in RDF in the (draft) Working Group Note “Defining N-ary Relations on the Semantic Web” remains relevant, with some tweaks (below, the Note’s references to RDF reification and rdf:Statement are replaced by RDF-star and rdf-star:Triple):

It may be natural to think of RDF-star when representing n-ary relations. We do not want to use RDF-star to represent n-ary relations in general for the following reasons. RDF-star is designed to talk about statements: individuals that are instances of rdf-star:Triple. A statement is a subject, predicate, object triple and RDF-star is used to assert additional information about this triple. This information may include the source of the information in the triple, for example. In n-ary relations, however, additional arguments in the relation do not usually characterize the statement but rather provide additional information about the relation instance itself. Thus, it is more natural to talk about instances of a diagnosis relation or a purchase rather than about a statement. In the use cases that we discussed in the note, the intent is to talk about instances of a relation, not about statements about such instances.

Qualification of triples

Even when using RDF-star to make statements about statements, we still need to carefully consider how we model those meta-statements. A good example is if we want to model assertions and retractions of statements (who said what, and when).

Consider the following RDF-star triple:

<< _:a :name "Alice" >> :statedBy :bob.

Should we want to model when Bob stated this, what would be a good approach? As shown above, we cannot make more statements about the triple << _:a :name "Alice" >> without potential clashes. Instead, we need to talk about an occurrence of the triple.

Given that rdfs:Literal is a subclass of rdfs:Resource, it follows that any literal is a resource. Similarly, a quoted triple is itself a resource. So, if we want to relate the quoted triple << _:a :name "Alice" >> with Bob and a timestamp, then we have an n-ary relation involving three resources.

As the intent is to talk about the instance of the relation, rather than the statement of the relation, we might introduce some ‘Assertion’ class:

[] a :Assertion ;
  :occurenceOf << _:a :name "Alice" >> ;
  :agent :bob ;
  :atTime "2023-04-30T20:40:40"^^xsd:dateTime

Arguably this saves a couple of statements when compared to the “traditional” RDF reification vocabulary.
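For comparison, the standard reification vocabulary spells out the subject, predicate and object of the statement explicitly; a rough equivalent of the above (assuming the usual rdf: prefix and the same illustrative :agent and :atTime properties) would be:

[] a rdf:Statement ;
  rdf:subject _:a ;
  rdf:predicate :name ;
  rdf:object "Alice" ;
  :agent :bob ;
  :atTime "2023-04-30T20:40:40"^^xsd:dateTime .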

Interoperability with Property Graphs

Using RDF-star to provide interoperability with Property Graphs is mentioned in both the RDF-star Working Group Charter and RDF-star Use Cases and Requirements.

The Property Graphs and object-oriented databases that I’ve encountered use edges to represent instances of relations, as opposed to some globally unique triple. That is, there may be multiple instances of edges/relations with the same relation type between the same nodes, but with different attributes and values on the edges.

A natural representation of Property Graph data in RDF is therefore to introduce a qualified class per relation type. Arguably RDF-star is not necessary for this, but if we wish to introduce ‘simple’ binary relations in addition to qualified relations, then using quoted triples would allow us to relate the two.

To illustrate this, representing a PG relation as follows:

<< :Alice :worksFor :ACME >>
  :role :CEO ;
  :since 2010 ;
  :probability 0.8 ;
  :source <http://example.com/news> .

Would introduce the issues described above if we were to scrape data about Alice’s employment details from multiple sources that gave differing accounts of the employment history.

To avoid this, we can represent the relation as the instance of some class:

[] a :EmployeeRole ;
  :occurenceOf << :Alice :worksFor :ACME >> ;
  :roleName :CEO ;
  :since 2010 ;
  :probability 0.8 ;
  :source <http://example.com/news> .
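Querying those relation instances keeps each set of attributes together and avoids the cross-product problem shown earlier. A SPARQL-star sketch over the illustrative terms above:

select ?role ?since ?probability ?source
where {
  ?employment a :EmployeeRole ;
    :occurenceOf << :Alice :worksFor :ACME >> ;
    :roleName ?role ;
    :since ?since ;
    :probability ?probability ;
    :source ?source .
}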

Usage with PROV-O

Using quoted statements to relate instances of a relation to the unqualified triple could be a useful way to look up qualifying information. This pattern could also be useful when incrementally evolving a model, where we only discover the need to reify certain relations as requirements emerge. Rather than prospectively modeling all those classes up-front, we can start with a minimum-viable model and extend it later as needed.

We can illustrate this using examples based on PROV-O. The PROV-O Qualified classes and properties provide elaborated information about binary relations asserted using the Starting Point and Expanded properties. We might start by talking about some activity using the starting-point properties:

:sortActivity a prov:Activity ;
  prov:startedAtTime "2011-07-16T01:52:02Z"^^xsd:dateTime ;
  prov:used :datasetA ;
  prov:generated :datasetB .

Later we might figure out that we need to express the role of :datasetA, while retaining the relation to the original triple:

:sortActivity a prov:Activity ;
  prov:startedAtTime "2011-07-16T01:52:02Z"^^xsd:dateTime ;
  prov:qualifiedUsage [
    a prov:Usage ;
    prov:entity :datasetA ;         ## The entity used by the prov:Usage
    prov:hadRole :inputToBeSorted ; ## the role of the entity in this prov:Usage
    prov:specializationOf << :sortActivity prov:used :datasetA >> ;
  ];
  prov:used :datasetA ; ## retain the original asserted statement for backwards compatibility
  prov:generated :datasetB;
.
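With that in place, the qualifying information can be looked up starting from the simple triple. A sketch of such a SPARQL-star query, assuming the data above:

select ?role
where {
  ?usage prov:specializationOf << :sortActivity prov:used :datasetA >> ;
    prov:hadRole ?role .
}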

Conclusion

RDF-star and SPARQL-star add an extra level of complexity to standards that are already viewed as complex by many developers. Knowing when and how to apply these extensions to best effect, and avoid shooting oneself in the foot later, is not obvious to novice and casual users. Practitioners in the community will likely need to build experience applying these extensions in the wild to figure out what works and what doesn’t.

Based on the examples presented above, I am not convinced that I will use RDF-star in my work. Whilst there probably are cases where it is useful to make statements about statements, in over 10 years of using RDF and SPARQL, I have yet to encounter them.

I am also unconvinced there are many real-world use cases where it is useful to make statements about some RDF-star notion of a globally unique triple. Apart from trivial “toy” examples, I expect any real-world use cases will involve n-ary relations between a quoted statement and multiple other resources (IRIs and literals). Certainly that should at least raise questions about the utility of quoted triples as subjects in statements. Perhaps constraining their usage to only the object position, in the same way as literals, would help avoid a rash of questionable modeling practices.

RDF-star seemingly seeks to solve a corner case of a corner case. Given that, I find it curious that it has attracted so much attention in the community over the past years. Perhaps we should instead focus our efforts on helping onboard new users and driving long-term adoption:

  • How do we promote good RDF modeling practices?
  • How do we know when we talk about instances of relations versus statements?
  • How do we know when we talk about a triple versus an occurrence of a triple?

Presenting RDF-star as the solution for interoperability with property graphs also appears to be a red herring. Without involving some node in the graph that gives an identity to an instance of a relation, this is destined to end up with confused and dissatisfied users. I hope the bright minds involved in the RDF-star Working Group can find a way to square the circle.

This post shows how to connect to Amazon Neptune using curl with Signature Version 4 authentication. For more information about AWS Identity and Access Management (IAM) in Amazon Neptune, see the Neptune IAM overview.

The Neptune engine version 1.2.0.0 added support for more granular access control in Neptune IAM policies than was previously available. We use this to grant access to applications using IAM roles based on the principle of least privilege. The applications are typically deployed as AWS Lambda functions or run as containerized workloads on Amazon EKS.

To develop, test and debug SPARQL queries and HTTP requests, it is useful to be able to reproduce the requests from the command line using curl. See also these instructions to assume an IAM role using the AWS CLI.

Prerequisites

  • curl 7.86.0 or higher.
  • IAM credentials to sign the requests.

To connect to Neptune using curl with Signature Version 4 signing

Check you have the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN environment variables set.

echo -e "AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}\nAWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}\nAWS_SESSION_TOKEN: ${AWS_SESSION_TOKEN}"

Enter the following command to get the status of your Neptune endpoint. Replace your-neptune-endpoint with the hostname or IP address of your Neptune DB instance. The default port is 8182.

curl https://your-neptune-endpoint:8182/status \
  --aws-sigv4 "aws:amz:eu-west-1:neptune-db" \
  --user "${AWS_ACCESS_KEY_ID}:${AWS_SECRET_ACCESS_KEY}" \
  --header "x-amz-security-token: ${AWS_SESSION_TOKEN}" \
  --no-progress-meter

Enter the following command to run a simple SPARQL query against your Neptune SPARQL endpoint.

curl https://your-neptune-endpoint:8182/sparql \
  --aws-sigv4 "aws:amz:eu-west-1:neptune-db" \
  --user "${AWS_ACCESS_KEY_ID}:${AWS_SECRET_ACCESS_KEY}" \
  --header "x-amz-security-token: ${AWS_SESSION_TOKEN}" \
  --header "Accept: application/sparql-results+json" \
  --header "Content-Type: application/sparql-query" \
  --data-binary "select * where { ?s ?p ?o } limit 10" \
  --no-progress-meter

Although this post uses the Dydra graph database cloud service to illustrate the concepts, the approach is equally applicable to any RDF graph store that supports the SPARQL 1.1 Protocol and SPARQL 1.1 Graph Store HTTP Protocol.

Following a short conversation on Twitter, the awesome Daniel Stenberg kindly implemented a new --url-query option in curl. This option is available in the curl 7.87.0 release. Daniel has blogged more about this new option, but here I want to demonstrate how it can make life easier for SPARQL 1.1 Protocol and SPARQL 1.1 Graph Store HTTP Protocol requests.

I’ve written previously about managing data in Dydra with SPARQL 1.1 Graph Store HTTP Protocol, so I’ll build on those examples in this post.

Authenticating requests to Dydra

Rather than passing your API key using the ?auth_token query string parameter, it is preferable to use either the -u/--user or --oauth2-bearer options. This can help avoid your credentials being inadvertently logged if your application logs request URLs.

Examples using curl to append data to the URL query with SPARQL 1.1 Graph Store HTTP Protocol

Get the named graph http://example.com/mygraph in the nlv01111/gsp repository in JSON-LD format

Previously it would be necessary to either URL-encode the graph IRI when constructing the request URL, or use the -G/--get and --data-urlencode options to make curl do a GET request with URL-encoded query string parameters:

curl "https://dydra.com/nlv01111/gsp/service" \
  --header "Accept: application/ld+json" \
  --data-urlencode "graph=http://example.com/mygraph" \
  -G

With the new --url-query option, we can omit the -G option:

curl "https://dydra.com/nlv01111/gsp/service" \
  --header "Accept: application/ld+json" \
  --url-query "graph=http://example.com/mygraph"

Put a local RDF/XML file data.rdf to the named graph http://example.com/mygraph in the nlv01111/gsp repository:

Previously we had no way to use curl to URL encode the query string parameters for anything other than GET requests. Life is now easier with --url-query:

curl -X PUT "https://dydra.com/nlv01111/gsp/service" \
  --data-binary @data.rdf \
  --header "Content-Type: application/rdf+xml" \
  --url-query "graph=http://example.com/mygraph" \
  --oauth2-bearer $MY_API_KEY

Examples using curl to append data to the URL query with SPARQL 1.1 Protocol

SPARQL query via GET

We can use --url-query to URL encode the query using the GET method:

curl "https://dydra.com/nlv01111/gsp/sparql" \
  --header "Accept: application/sparql-results+json" \
  --url-query "query=select * where { ?s ?p ?o }"

Additionally we can specify the RDF Dataset for a query via the default-graph-uri and named-graph-uri parameters:

curl "https://dydra.com/nlv01111/gsp/sparql" \
  --header "Accept: application/sparql-results+json" \
  --url-query "query=select * where { ?s ?p ?o }" \
  --url-query "default-graph-uri=http://example.com/mygraph"

We can also specify multiple graph IRIs:

curl "https://dydra.com/nlv01111/gsp/sparql" \
  --header "Accept: application/sparql-results+json" \
  --url-query "query=select * where { ?s ?p ?o }" \
  --url-query "default-graph-uri=http://example.com/mygraph" \
  --url-query "default-graph-uri=http://example.com/mygraph2"

SPARQL query via POST directly

Where the query text is longer, we may prefer to use the POST method and include the query directly and unencoded as the HTTP request message body:

curl -X POST "https://dydra.com/nlv01111/gsp/sparql" \
  --header "Accept: application/sparql-results+json" \
  --header "Content-Type: application/sparql-query" \
  --data-binary @query.rq

Additionally we can use --url-query to specify the RDF Dataset for a query via the default-graph-uri and named-graph-uri parameters:

curl -X POST "https://dydra.com/nlv01111/gsp/sparql" \
  --header "Accept: application/sparql-results+json" \
  --header "Content-Type: application/sparql-query" \
  --data-binary @query.rq \
  --url-query "default-graph-uri=http://example.com/mygraph"

SPARQL update via POST directly

For updates, we must use the POST method. Here we include the update directly and unencoded as the HTTP request message body:

curl -X POST "https://dydra.com/nlv01111/gsp/sparql" \
  --header "Content-Type: application/sparql-update" \
  --data-binary @update.ru \
  --oauth2-bearer $MY_API_KEY

Additionally we can use --url-query to specify the RDF Dataset for an update via the using-graph-uri or using-named-graph-uri parameters:

curl -X POST "https://dydra.com/nlv01111/gsp/sparql" \
  --header "Content-Type: application/sparql-update" \
  --data-binary @update.ru \
  --url-query "using-graph-uri=http://example.com/mygraph" \
  --oauth2-bearer $MY_API_KEY

Medior IT Business Analyst and Project Manager

Organization

Semaku B.V. is an IT company offering Master Data Management / Data Integration expertise and solutions, more generic consultancy roles and production services.

  • Delivery of end-to-end projects, e.g. product traceability solutions
  • Offering and developing software products and services
  • Expertise roles: data science, DevOps, frontend and backend development
  • Consultancy roles: project and program management, Product Owner, Scrum Master, information analyst and business analyst
  • Production services, e.g. definition of decision trees for the Dutch environmental law (Omgevingswet)

We have a broad range of customers, from multinational high-tech manufacturing companies and governmental institutions to smaller local organisations. Our projects involve consultancy support within customer projects or Semaku projects.

Your responsibilities

Semaku is seeking a medior IT Business Analyst and Project Manager in the above-mentioned work areas. In this medior role you will be in direct contact with multiple parties, including the customer. You will work closely with our senior consultants, who will support and guide you in your activities. You will be responsible for a broad range of activities, from gathering requirements and creating specification documentation to developing project plans and following up on day-to-day project activities. You may be required to work on site with the customer.

General duties and responsibilities include:

  • Create structure within projects and project teams following Agile principles
  • Build and maintain good customer relations and create structure in customer communication
  • Translate business requirements into a structured form
  • Adhere to project plans, ensuring that project deliverables are completed on time
  • Set up IT projects together with business representatives
  • Monitor project progress and create progress reports for steering committees
  • Perform review, analysis and evaluation of business and user requirements
  • Write documentation of detailed business requirements and business process flows
  • Review all design solutions for accuracy and adherence to the business requirements
  • Give design input through knowledge of software and industry
  • Answer functional questions about the system
  • Document training materials and provide training if needed
  • Research and document bugs, and communicate issues to developers

In your role you will be in close contact with our customers and, depending on the project, may act as a liaison between the customer and our developers. As we are a young company, you will also have a lot of freedom and possibilities to shape your tasks, define your development path and help the company grow.

Desired Skills

  • 5-10 years of work experience and a master's degree in (Technical) Business Administration, Economics and/or Business Economics, preferably with a specialization in Business Information Systems
  • Certifications: ITIL Foundation, PSM I Scrum Master, PSPO I Product Owner, PRINCE2 Foundation, SAFe Foundation
  • Strong analytical and conceptual thinking required
  • Self-starter
  • Able to gather requirements from customers, understand business needs and articulate solutions
  • Able to conduct consulting engagements
  • Negotiation and presentation skills
  • Ability to manage stakeholders, multivendor/practice/party coordination
  • Ability to effectively prioritize and execute tasks in a fast-paced environment
  • Ability to work independently
  • Willing to learn new skills
  • Fast learner, curious and a team player

We offer

  • Highly innovative work and inspiring environment
  • Competitive salary
  • Opportunity for rapid growth and skills development
  • Possibility for learning via additional courses supporting you in your projects and activities
  • Be part of and shape an ambitious young company

Apply

For any questions you may contact Tim Nelissen (Co-owner and Principal Consultant)

Mobile: +31 637281400 | Email: tim.nelissen@semaku.com.

You can send your application letter and CV to careers@semaku.com.

Medior IT Data Scientist

Organization

Semaku B.V. is an IT company offering Master Data Management / Data Integration expertise and solutions, more generic consultancy roles and production services.

  • Delivery of end-to-end projects, e.g. product traceability solutions
  • Offering and developing software products and services
  • Expertise roles: data science, DevOps, frontend and backend development
  • Consultancy roles: project and program management, Product Owner, Scrum Master, information analyst and business analyst
  • Production services, e.g. definition of decision trees for the Dutch environmental law (Omgevingswet)

We have a broad range of customers, from multinational high-tech manufacturing companies and governmental institutions to smaller local organisations. Our projects involve cloud-native architectures and the integration of cutting-edge AI/ML technology along with innovative multi-agent systems for Industry 4.0 solutions.

Your responsibilities

In this role you will be in direct contact with multiple parties, including the customer. You will work closely with our senior consultants, who will support and guide you in your activities. You will be responsible for a broad range of activities, from gathering requirements, creating data models and integrating datasets to analysing data patterns and identifying data enhancement opportunities. You may be required to work on site with the customer.

General duties and responsibilities include:

  • Translate business requirements into a structured form
  • Perform gap analysis with the assistance of other team members
  • Define data models, e.g. SHACL, OWL, RDFS, UML
  • Perform review, analysis and evaluation of business and user requirements
  • Write documentation of detailed business requirements and business process flows
  • Review all design solutions for accuracy and adherence to the business requirements
  • Give design input through knowledge of software and industry
  • Answer functional questions about the system
  • Document training materials and provide training if needed
  • Research and document bugs, and communicate issues to developers

In your role you will be in close contact with our customers and depending on the project may act as a liaison between the customer and our developers. Within our company you will have a lot of freedom and possibilities to shape your tasks, define your development path and help the company grow.

Desired Skills

  • 5 years of work experience and a master's degree in Mathematics or Computer Science
  • Strong communicator and relationship builder
  • Strong analytical and conceptual thinking required
  • Experience with Semantic Web technologies and concepts such as RDF, SPARQL, SHACL
  • Experience with ‘big data’ tools like Hadoop, Spark and so on is a plus
  • Able to gather requirements from customers, understand business needs and articulate solutions
  • Able to conduct consulting engagements
  • Negotiation and presentation skills
  • Ability to effectively prioritize and execute tasks in a fast-paced environment
  • Ability to work independently; self-starter
  • Located in the Netherlands, preferably in the Eindhoven area
  • Good written and oral communication skills, fluent in English and Dutch
  • Willing to learn new skills, fast learner, curious and a team player

We offer

  • Highly innovative work and inspiring environment
  • Competitive salary
  • Opportunity for rapid growth and skills development
  • Possibility for learning via additional courses supporting you in your projects and activities
  • Be part of and shape an ambitious young company

Apply

For any questions you may contact Tim Nelissen (Co-owner and Principal Consultant)

Mobile: +31 637281400 | Email: tim.nelissen@semaku.com

You can send your application letter and CV to careers@semaku.com.

Senior / Medior DevOps Engineer

Organization

Semaku B.V. is an IT company offering Master Data Management / Data Integration expertise and solutions, more generic consultancy roles and production services.

  • Delivery of end-to-end projects, e.g. product traceability solutions
  • Offering and developing software products and services
  • Expertise roles: data science, DevOps, frontend and backend development
  • Consultancy roles: project and program management, Product Owner, Scrum Master, information analyst and business analyst
  • Production services, e.g. definition of decision trees for the Dutch environmental law (Omgevingswet)

We have a broad range of customers, from multinational high-tech manufacturing companies and governmental institutions to smaller local organisations. Our projects involve cloud-native architectures and the integration of cutting-edge AI/ML technology along with innovative multi-agent systems for Industry 4.0 solutions.

Your responsibilities

As a senior / medior DevOps engineer you will be in direct contact with multiple parties, including the customer. You will be actively involved in software development projects and work closely with the system and process architect, other front-end and back-end developers, the product owner and the scrum master. As we are working in a small team, you will have direct input into implementation decisions. In addition, you will support the resolution of customer incidents and change requests. You may be required to work on site with the customer.

Desired Skills

  • 5 to 10 years of work experience and a Bachelor's or Master's degree in Computer Science
  • Experience with containerization, e.g. Kubernetes, Docker
  • Experience with Terraform
  • Experience with CI/CD tools, e.g. Jenkins
  • Experience with serverless architecture and services, e.g. AWS Lambda and Step Functions
  • AWS certification as DevOps Engineer or SysOps Administrator
  • Good written and oral communication skills, fluent in English and Dutch
  • Able to define cloud infrastructure
  • Able to estimate, plan and execute in line with the plan
  • Proven analytical and problem-solving abilities
  • Ability to effectively prioritize and execute tasks in a fast-paced environment
  • Ability to work independently
  • Willing to learn new skills
  • Fast learner, curious and a team player

We offer

  • Highly innovative work and inspiring environment
  • Competitive salary
  • Opportunity for rapid growth and skills development
  • Possibility for learning via additional courses supporting you in your projects and activities
  • Be part of and shape an ambitious young company

Apply

For any questions please contact Tim Nelissen (Co-owner and Principal Consultant)

Mobile: +31 637281400 | Email: tim.nelissen@semaku.com

You can send your application letter and CV to careers@semaku.com.

Senior / Medior Fullstack Developer

Organization

Semaku B.V. is an IT company offering Master Data Management / Data Integration expertise and solutions, more generic consultancy roles and production services.

  • Delivery of end-to-end projects, e.g. product traceability solutions
  • Offering and developing software products and services
  • Expertise roles: data science, DevOps, frontend and backend development
  • Consultancy roles: project and program management, Product Owner, Scrum Master, information analyst and business analyst
  • Production services, e.g. definition of decision trees for the Dutch environmental law (Omgevingswet)

We have a broad range of customers, from multinational high-tech manufacturing companies and governmental institutions to smaller local organisations. Our projects involve cloud-native architectures and the integration of cutting-edge AI/ML technology along with innovative multi-agent systems for Industry 4.0 solutions.

Your responsibilities

As a senior / medior fullstack developer you will be in direct contact with multiple parties, including the customer. You will be actively involved in software development projects and work closely with the system and process architect, other front-end and back-end developers, the product owner and the scrum master. As we are working in a small team, you will have direct input into implementation decisions. In addition, you will support the resolution of customer incidents and change requests. You may be required to work on site with the customer.

Desired Skills

  • 5 to 10 years of work experience and a Bachelor's or Master's degree in Computer Science
  • Experience with Java frameworks, Maven and Spring framework
  • Experience building Vue.js applications
  • Knowledge of HTML5 and CSS3
  • Experience with Kubernetes, Docker and Cloud architectures
  • AWS certification is an advantage
  • Experience with Elasticsearch is an advantage
  • Good written and oral communication skills, fluent in English and Dutch
  • Able to define technical application architectures
  • Able to estimate, plan and execute in line with the plan
  • Proven analytical and problem-solving abilities
  • Ability to effectively prioritize and execute tasks in a fast-paced environment
  • Ability to work independently
  • Willing to learn new skills
  • Fast learner, curious and a team player

We offer

  • Highly innovative work and inspiring environment
  • Competitive salary
  • Opportunity for rapid growth and skills development
  • Possibility for learning via additional courses supporting you in your projects and activities
  • Be part of and shape an ambitious young company

Apply

For any questions you may contact Tim Nelissen (Co-owner and Principal Consultant)

Mobile: +31 637281400 | Email: tim.nelissen@semaku.com

You can send your application letter and CV to careers@semaku.com.

Aggregate functions like SUM, MIN, MAX, AVG and GROUP_CONCAT provide an easy way to aggregate over a complete dataset or, in combination with a GROUP BY clause, over groups within a dataset. This makes it easy to summarise information, but often we want to know more about specific solutions within a group.

For demo purposes, consider a dataset consisting of information about books, an excerpt of which is shown below:

base <http://example.com/id/>
prefix schema: <http://schema.org/>

<book/1> a schema:Book ;
  schema:name "The Hobbit" ;
  schema:genre "Fantasy" ;
  schema:pages 367 ;
  schema:author <person/1> .

<book/2> a schema:Book ;
  schema:name "The Lord Of The Rings: The Fellowship Of The Ring" ;
  schema:genre "Fantasy" ;
  schema:pages 404 ;
  schema:author <person/1> .

<book/3> a schema:Book ;
  schema:name "The Lord Of The Rings: The Two Towers" ;
  schema:genre "Fantasy" ;
  schema:pages 450 ;
  schema:author <person/1> .

<book/4> a schema:Book ;
  schema:name "The Lord Of The Rings: The Return Of The King" ;
  schema:genre "Fantasy" ;
  schema:pages 496 ;
  schema:author <person/1> .

<person/1> a schema:Person ;
  schema:name "J.R.R. Tolkien" .

Problem: finding solutions within an aggregate grouping

Use case: finding the longest book in each genre.

Solutions:

Concatenate the page count and the book name, use the MAX aggregate, then split the result during projection to recover the name:

prefix schema: <http://schema.org/>
select ?genre (strafter(max(concat(str(?pages), "//", ?name)), "//") as ?longest_book_name)
where  {
  [] a schema:Book ;
    schema:name ?name ;
    schema:pages ?pages ;
    schema:genre ?genre .
}
group by ?genre

This quick approach can satisfy many simpler use cases, but starts to fall apart when you want to return more information about the solution within a group (especially if that information consists of RDF terms that are not strings or plain literals). Note too that comparing page counts as strings only behaves numerically while the values have the same number of digits. Also, this approach returns only one solution per group: in this case the book with the most pages and, in the event of a tie, the alphabetically last name. In some use cases, we may want to know all books with the maximum number of pages within each group.

Use a sub-select and join the results.

prefix schema: <http://schema.org/>
select *
where {
  {
    # find max pages per genre
    select ?genre (max(?pages) as ?max_pages) {
      [] a schema:Book ;
        schema:genre ?genre ;
        schema:pages ?pages .
    }
    group by ?genre
  }
  ?Book a schema:Book ;
    schema:name ?name ;
    schema:genre ?genre ;
    schema:pages ?max_pages .
}
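Running this against the sample data above should return a single row for the Fantasy genre, along the lines of:

------------------------------------------------------------------------------------------------------------
| genre     | max_pages | Book                           | name                                            |
==============================================================================================================
| "Fantasy" | 496       | <http://example.com/id/book/4> | "The Lord Of The Rings: The Return Of The King" |
------------------------------------------------------------------------------------------------------------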

Problem: calculating the statistics of aggregates

Use case: How many authors have X number of books?

Solutions:

  • Use a sub-select with aggregation to calculate the number of books per author and project this
  • In the outer query, group by the projected (aggregated) value to calculate the count of authors with that number of books

As demonstrated in the query:

prefix schema: <http://schema.org/>
select ?count_books (count(*) as ?count_authors)
where {
  select ?author (count(*) as ?count_books) {
    [] a schema:Book ;
      schema:author ?author .
  }
  group by ?author
}
group by ?count_books
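With the sample data above (a single author with four books), this should give a result along the lines of:

-------------------------------
| count_books | count_authors |
===============================
| 4           | 1             |
-------------------------------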

Summary

Semaku offers services in the area of structured and integrated information management. Encompassing content management, product information management and product lifecycle management, we help our clients streamline and implement the big picture with a unified approach.

The outcome: our clients optimize their costs and achieve a faster time to publish their content, leading to more consistent visibility of their product offering.

It is our core mission to empower you with smoother information creation, management, storage and publication processes. Whether you struggle with data inconsistency across your silos or content publication throughput time, or a mix of both, we will help you design the best plan to let your information landscape be what it is meant to be: a robust hub for internal and external collaborators to consume and contribute efficiently.

Guiding you on the path of implementing and integrating best-of-breed data and content management solutions, using international standards and trouble-proof technology, is what we are good at.

You can find examples of successful collaborations with partners and clients in the Use cases.

Are you interested in our services, have questions or want to work with us? Get in touch!

This is the fifth in a series of posts about using SHACL to validate material composition data for semiconductor products (microchips). It results from a recent project we undertook for Nexperia. In our first four posts, we looked at how to validate and enrich our material composition data:

  • In the first post we looked at the basic data model for material composition and how basic SHACL vocabulary can be used to describe the constraints.
  • In the second post we looked at how SPARQL-based constraints can be used to implement more complex rules based on a SPARQL SELECT query.
  • In the third post, how aggregates can be used as part of validation rules, and
  • In the fourth post we looked at using OWL to model class-based inference rules.

In this post, we will explore how we could use SPARQL as an alternative to OWL to capture the inference rules from the previous post.

Starting from the same generic cases:

  1. Any X is a Z
  2. Any X containing Y is a Z
  3. Any X containing at least N% of Y is a Z

Let’s look at how these can be implemented as SPARQL CONSTRUCT queries.

Any X is a Z

Using the standard RDFS/OWL vocabulary, rdfs:subClassOf is the obvious way to implement this inference rule, as it infers that any instance of the subclass is also an instance of the parent class:

plm:Adhesive rdfs:subClassOf iec:M-014 .

This is straightforward to map into the equivalent SPARQL CONSTRUCT query:

PREFIX plm: <http://example.com/def/plm/>
PREFIX iec: <http://example.com/def/iec62474/>
CONSTRUCT {
  ?s a iec:M-014
}
WHERE {
  ?s a plm:Adhesive .
}

In layman’s terms, for every ?s that is a plm:Adhesive, construct the statement that ?s is a iec:M-014.

So from the statement:

<132253533401> a plm:Adhesive .

We can construct the statement:

<132253533401> a iec:M-014 .

However, this approach requires a separate query to be instantiated for each mapping rule, i.e. for each material class. These could be manually maintained or generated by some templating language.

Another option is to make the rdfs:subClassOf statements part of the RDF dataset we are querying over. This allows us to define a single SPARQL query that will work for all rules of this “any X is a Z” case.

Here we will assume the instance data is in the default graph and the inference rules in a graph named http://example.com/graph/iec62474.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX plm: <http://example.com/def/plm/>
PREFIX iec: <http://example.com/def/iec62474/>
CONSTRUCT {
  ?s a ?parent .
}
WHERE {
  ?s a ?child .
  graph <http://example.com/graph/iec62474> {
    ?child rdfs:subClassOf ?parent .
  }
}

So from the RDF dataset:

<132253533401> a plm:Adhesive .
graph <http://example.com/graph/iec62474> {
  plm:Adhesive rdfs:subClassOf iec:M-014 .
}

We can construct the statement:

<132253533401> a iec:M-014 .

As additional rdfs:subClassOf statements are added to the dataset, additional statements can be constructed using our rule. Essentially we have implemented the rdfs:subClassOf semantics in a concrete SPARQL query.
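If the rule graph were to contain deeper subclass hierarchies, the same idea could be extended with a property path so that rdfs:subClassOf is followed transitively. A sketch:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
  ?s a ?parent .
}
WHERE {
  ?s a ?child .
  graph <http://example.com/graph/iec62474> {
    ?child rdfs:subClassOf+ ?parent .
  }
}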

Any X containing Y is a Z

For this case we had to dive into class-based restriction rules to express the logic using OWL. This is rather complex for non-ontologists, so here we propose a simpler instance-based approach using SPARQL. We will use a similar approach as above where the rules will be captured in a named graph on our RDF dataset.

Consider the concrete rule “Any Component containing some C.I. Pigment Violet 23 is a M-015: Other Organic Materials”. First let’s write a rule in SPARQL for this:

PREFIX plm: <http://example.com/def/plm/>
PREFIX iec: <http://example.com/def/iec62474/>
CONSTRUCT {
  ?s a iec:M-015 .
}
WHERE {
  ?s a plm:Component ;
    plm:containsMaterialClass <132285000361> .
}

So from the statements:

<132253533401> a plm:Component ;
  plm:containsMaterialClass <132285000361> .

We can construct the statement:

<132253533401> a iec:M-015 .

This is a much more direct and understandable way to capture the rule. It also does not generate additional unneeded rdf:type statements for the intermediate classes necessary for OWL inferencing.

The next step is to generalise/abstract this SPARQL query so it works for all rules of this type. Let’s begin by replacing the parts that change between rules of this type with variables:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX plm: <http://example.com/def/plm/>
PREFIX iec: <http://example.com/def/iec62474/>
CONSTRUCT {
  ?s a ?parent .
}
WHERE {
  ?s a ?child ;
    plm:containsMaterialClass ?materialClass .
  VALUES (?child ?materialClass ?parent) {
    (plm:Component <132285000361> iec:M-015)
    # more rules can be added here
  }
}

Here we have used the VALUES clause to pass in a set of bindings for the variables. It would be relatively easy to extend the set of solutions for other material classes. But let’s consider how we can add statements to our dataset that capture these rules in RDF.

As there is no standard RDFS term that expresses the semantics we need, we will make a small custom vocabulary. Here we can describe this in RDF as follows:

[] a plm:Component ;
  plm:containsMaterialClass <132285000361> ;
  rdfs:subClassOf iec:M-015 .

Here we introduce a blank node (denoted by []) that kind of captures the pattern we want to match and the subclass. In RDF terms, this does not make too much sense, but if this is added to the named graph, we can write a query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX plm: <http://example.com/def/plm/>
PREFIX iec: <http://example.com/def/iec62474/>
CONSTRUCT {
  ?s a ?parent .
}
WHERE {
  ?s a ?child ;
    plm:containsMaterialClass ?materialClass .
  graph <http://example.com/graph/iec62474> {
    [] a ?child ;
      plm:containsMaterialClass ?materialClass ;
      rdfs:subClassOf ?parent .
  }
}

This allows us to pull out those rules into a separate RDF graph as a kind of configuration, rather than hard-coding them in the SPARQL query.
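For clarity, the configuration statements are assumed to live in the named graph the query refers to; in TriG this would look something like:

graph <http://example.com/graph/iec62474> {
  [] a plm:Component ;
    plm:containsMaterialClass <132285000361> ;
    rdfs:subClassOf iec:M-015 .
}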

Any X containing at least N% of Y is a Z

Finally, let’s consider the “Any X containing at least N% of Y is a Z” case. Specifically let’s look at the rule “Any Component containing at least 40% Iron is a M-002: Other Ferrous alloys, non-stainless steels”.

Expressing this in OWL required a two-layer approach using class-based restrictions. Expressing the equivalent in SPARQL is relatively simple:

PREFIX plm: <http://example.com/def/plm/>
PREFIX iec: <http://example.com/def/iec62474/>
CONSTRUCT {
  ?s a iec:M-002 .
}
WHERE {
  ?s a plm:Component ;
    plm:qualifiedRelation [
      a plm:ContainsMaterialClassRelation ;
      plm:massPercentage ?massPercentage ;
      plm:target <132285000116>
    ] .
  filter (?massPercentage >= 40)
}

Again let’s use variables and bind these using a VALUES clause:

PREFIX plm: <http://example.com/def/plm/>
PREFIX iec: <http://example.com/def/iec62474/>
CONSTRUCT {
  ?s a ?parent .
}
WHERE {
  ?s a ?child ;
    plm:qualifiedRelation [
      a plm:ContainsMaterialClassRelation ;
      plm:massPercentage ?massPercentage ;
      plm:target ?materialClass
    ] .
  values (?child ?materialClass ?minPercentage ?parent) {
    (plm:Component <132285000116> 40 iec:M-002)
    # more rules can be added here
  }
  filter (?massPercentage >= ?minPercentage)
}

Here again the set of bindings can be extended for other rules of this type, but we can also pull this out into a separate configuration graph:

[] a plm:Component ;
  plm:qualifiedRelation [
    a plm:ContainsMaterialClassRelation ;
    plm:massPercentage 40 ;
    plm:target <132285000116>
  ] ;
  rdfs:subClassOf iec:M-002 .

Again, there is limited semantic value to these statements; they are just a way to capture a rule as RDF statements. The SPARQL query can then be formulated:

PREFIX plm: <http://example.com/def/plm/>
PREFIX iec: <http://example.com/def/iec62474/>
CONSTRUCT {
  ?s a ?parent .
}
WHERE {
  ?s a ?child ;
    plm:qualifiedRelation [
      a plm:ContainsMaterialClassRelation ;
      plm:massPercentage ?massPercentage ;
      plm:target ?materialClass
    ] .
  { SELECT DISTINCT ?child ?materialClass ?minPercentage ?parent {
    graph <http://example.com/graph/iec62474> {
      [] a ?child ;
        plm:qualifiedRelation [
          a plm:ContainsMaterialClassRelation ;
          plm:massPercentage ?minPercentage ;
          plm:target ?materialClass
        ] ;
        rdfs:subClassOf ?parent .
    }
  }}
  filter (?massPercentage >= ?minPercentage)
}

Here we introduce the sub-select just to drive home that the only purpose of these new statements is to produce a solution set that is joined to the outer results. This has exactly the same effect as using the VALUES formulation from the previous query.

The formulation of these rules carries little (machine-readable) semantic value; it only serves to match a pattern in our query.

Hopefully this demonstrates how SPARQL can be used to implement such rules. The reader should also be able to contrast this approach with the OWL approach from the previous post. Whilst SPARQL is arguably simpler and easier to understand, the semantics of the rules are not really made explicit as they are with OWL.

In the next post in this series, we will look at how SHACL can be used to implement these rules.

This is the fourth in a series of posts about using SHACL to validate material composition data for semiconductor products (microchips). This results from a recent project we undertook for Nexperia. In our first three posts, we looked at how to validate our material composition data:

  • In the first post we looked at the basic data model for material composition and how basic SHACL vocabulary can be used to describe the constraints.
  • In the second post we looked at how SPARQL-based constraints can be used to implement more complex rules based on a SPARQL SELECT query and,
  • In the third post, how aggregates can be used as part of validation rules.

In this post we will venture beyond validating data and consider how we can enrich the data by inferring new, additional statements. Before jumping into the implementation of the inferencing, let’s first look at what we want to be able to infer.

One of the requirements from Nexperia is that we should be able to generate IPC-1752A compliant XML files. IPC-1752A is the materials declaration standard for companies in the supply chain to share information on materials in products. In February 2014, a second amendment to IPC-1752A was published that, amongst other additions, added a new field for use with the new IEC 62474 Declarable Substances list, to align with IEC 62474. Based on our now validated material composition data, we want to automatically infer the IEC 62474 classification of each material.

IEC 62474 defines a set of material classes in a tree-like structure:

  • Inorganic materials
    • Metals and Metal Alloys
      • Ferrous alloys
        • M-001: Stainless steel
        • M-002: Other Ferrous alloys, non-stainless steels
      • Non-ferrous metals and alloys
        • M-003: Aluminum and its alloys
        • M-004: Copper and its alloys
        • M-005: Magnesium and its alloys
        • M-006: Nickel and its alloys
        • M-007: Zinc and its alloys
        • M-008: Precious metals
        • M-009: Other non-ferrous metals and alloys
    • Non-metals
      • M-010: Ceramics / Glass
      • M-011: Other inorganic materials
  • Organic materials
    • Plastics and rubber
      • M-012: PolyVinylChloride (PVC)
      • M-013: Other Thermoplastics
      • M-014: Other Plastics and Rubber
    • Other organics
      • M-015: Other Organic Materials

We can define rules to classify the materials based on:

  • The type of the material
  • The composition of the material

These rules are defined by the subject matter experts within Nexperia. They are relatively static, but we can expect that to change over time as new material types and substances are introduced. Examples of these rules are:

  • Any Adhesive is a M-014: Other Plastics and Rubber
  • Any Clip is a M-004: Copper and its alloys
  • Any Component containing at least 50% Lead Oxide is a M-010: Ceramics / Glass
  • Any Component containing at least 40% Iron is a M-002: Other Ferrous alloys, non-stainless steels
  • Any Component containing some C.I. Pigment Violet 23 is a M-015: Other Organic Materials
  • Any Lead Frame containing at least 50% Copper is a M-004: Copper and its alloys
  • Any Lead Frame containing at least 50% Iron is a M-002: Other Ferrous alloys, non-stainless steels

Note the cases where the ‘required’ percentage of Iron differs per material type. We can abstract this into the following generic cases:

  1. Any X is a Z
  2. Any X containing Y is a Z
  3. Any X containing at least N% of Y is a Z

When considering how to capture these rules, the ‘traditional’ answer in Semantic Web circles is to use OWL (the Web Ontology Language), so let’s try it. OWL generally works using Description Logic based inference rules, which essentially work by defining sets (classes) of things and how those sets relate to each other.

Any X is a Z

The “Any X is a Z” case can be easily handled using rdfs:subClassOf to relate the material type (a class) to the IEC material class. For example:

plm:Adhesive rdfs:subClassOf iec:M-014 .

So from the statement:

<132253533401> a plm:Adhesive .

We can infer:

<132253533401> a iec:M-014 .
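
If no OWL reasoner is available, the subclass part of this inference can also be approximated at query time with a SPARQL property path. This is a minimal sketch; the iec: namespace IRI below is a placeholder for whatever namespace the IEC 62474 class identifiers actually use:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iec:  <http://example.com/def/iec62474/>  # placeholder namespace

SELECT ?material
WHERE {
  # anything whose asserted type is iec:M-014 or any (transitive) subclass of it
  ?material rdf:type/rdfs:subClassOf* iec:M-014 .
}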

Any X containing Y is a Z

The “Any X containing Y is a Z” case can be handled using OWL class restrictions.

Consider the concrete rule “Any Component containing some C.I. Pigment Violet 23 is a M-015: Other Organic Materials”.

We start by defining a new class ex:ContainsCIPigmentViolet23 based on a restriction on the plm:containsMaterialClass predicate:

ex:ContainsCIPigmentViolet23 owl:equivalentClass [
    rdf:type owl:Restriction ;
    owl:onProperty plm:containsMaterialClass ;
    owl:hasValue <132285000361>
  ] .

This will infer that any resource that contains material class <132285000361> is an ex:ContainsCIPigmentViolet23.

Next we can define a new class ex:ComponentContainingCIPigmentViolet23 as being equivalent to the intersection of the plm:Component and ex:ContainsCIPigmentViolet23 classes and being a subclass of iec:M-015:

ex:ComponentContainingCIPigmentViolet23 rdfs:subClassOf iec:M-015 ;
  owl:equivalentClass [
    rdf:type owl:Class ;
    owl:intersectionOf ( plm:Component ex:ContainsCIPigmentViolet23 )
  ] .

In combination with the previous restriction, this will infer that a resource that contains material class <132285000361> and is of type plm:Component is both an ex:ComponentContainingCIPigmentViolet23 and an iec:M-015.

So from the statements:

<132253533401> a plm:Component ;
  plm:containsMaterialClass <132285000361> .

We can infer:

<132253533401> a ex:ContainsCIPigmentViolet23, ex:ComponentContainingCIPigmentViolet23, iec:M-015 .

Any X containing at least N% of Y is a Z

Finally, let’s consider the “Any X containing at least N% of Y is a Z” case. This is the most complex case and requires a mental backflip or two involving the qualified relationships that carry the mass percentage.

Specifically let’s look at the rule “Any Component containing at least 40% Iron is a M-002: Other Ferrous alloys, non-stainless steels”.

First we define the inferences for the qualified relations. We define a new class ex:MassPercentageMin40 based on a restriction on the plm:massPercentage predicate having an integer value of 40 or more:

ex:MassPercentageMin40 owl:equivalentClass [
    rdf:type             owl:Restriction ;
    owl:onProperty       plm:massPercentage ;
    owl:allValuesFrom [
      rdf:type             rdfs:Datatype ;
      owl:onDatatype       xsd:integer ;
      owl:withRestrictions ( [ xsd:minInclusive 40 ] )
    ]
  ] .

This will infer that a resource that has a mass percentage value of 40 or more is an ex:MassPercentageMin40.

We also define a new class ex:HasIron based on a restriction on the plm:target predicate:

ex:HasIron owl:equivalentClass [
    rdf:type owl:Restriction ;
    owl:onProperty plm:target ;
    owl:hasValue <132285000116>
  ] .

This will infer that a resource with target <132285000116> is an ex:HasIron.

Then we define another class ex:MassPercentageMin40Iron as being the equivalent of the intersection of the classes ex:MassPercentageMin40 and ex:HasIron that we just defined:

ex:MassPercentageMin40Iron owl:equivalentClass [ 
    rdf:type owl:Class ;
    owl:intersectionOf ( ex:MassPercentageMin40 ex:HasIron )
  ] .

This will infer that a resource representing the relationship of a mass percentage value of 40 or more of <132285000116> is an ex:MassPercentageMin40Iron.

Next we define a class ex:MaterialWithMassPercentageMin40Iron based on a restriction on the plm:qualifiedRelation predicate having some value of type ex:MassPercentageMin40Iron:

ex:MaterialWithMassPercentageMin40Iron owl:equivalentClass [
    rdf:type owl:Restriction ;
    owl:onProperty plm:qualifiedRelation ;
    owl:someValuesFrom ex:MassPercentageMin40Iron
  ] .

Finally we can define a new class ex:ComponentWithMassPercentageMin40Iron as being equivalent to the intersection of the plm:Component and ex:MaterialWithMassPercentageMin40Iron classes and being a subclass of iec:M-002:

ex:ComponentWithMassPercentageMin40Iron rdfs:subClassOf iec:M-002 ;
  owl:equivalentClass [
    rdf:type owl:Class ;
    owl:intersectionOf ( plm:Component ex:MaterialWithMassPercentageMin40Iron )
  ] .

Put together, we can then infer that a Component containing 40% or more of <132285000116> is both an ex:ComponentWithMassPercentageMin40Iron and an iec:M-002.

So from the statements:

<331214891234> a plm:Component ;
  plm:name "3312 148 91234" ;
  plm:containsMaterialClass <132285000116> ;
  plm:qualifiedRelation [
    a plm:ContainsMaterialClassRelation ;
    plm:target <132285000116> ;
    plm:materialGroup "Pure metal" ;
    plm:massPercentage 100
  ] .

We can infer:

<331214891234> a plm:Component, ex:MaterialWithMassPercentageMin40Iron, ex:ComponentWithMassPercentageMin40Iron, iec:M-002 ;
  plm:name "3312 148 91234" ;
  plm:containsMaterialClass <132285000116> ;
  plm:qualifiedRelation [
    a plm:ContainsMaterialClassRelation, ex:MassPercentageMin40, ex:HasIron, ex:MassPercentageMin40Iron ;
    plm:target <132285000116> ;
    plm:materialGroup "Pure metal" ;
    plm:massPercentage 100
  ] .

This demonstrates that it is possible to capture these rules using the class-based inferencing approach of OWL. The sample OWL ontology is available here.

IEC 62474 ontology - Visualization generated using WebVowl

In the next post in the series, we’ll look at how we can use SPARQL to define the same rules.

This is the third in a series of posts about using SHACL to validate material composition data for semiconductor products (microchips). This results from a recent project we undertook for Nexperia. In the first post we looked at the basic data model for material composition and how basic SHACL vocabulary can be used to describe the constraints. In the second post we looked at how SPARQL-based constraints can be used to implement more complex rules based on a SPARQL SELECT query. In this post we will continue to look at SPARQL-based constraints and how aggregates can be used as part of validation rules.

For each material we have the composition of the material in terms of the mass percentage of the substances it contains. For example the adhesive 1322 535 33401:

Graph representation of adhesive material composition

The same graph expressed in RDF:

<132253533401> a plm:Adhesive ;
  plm:containsMaterialClass <132285000108>, <132285000179>, <132285000343>, <132285000435> ;
  plm:name "1322 535 33401" ;
  plm:qualifiedRelation [
    a plm:ContainsMaterialClassRelation ;
    plm:massPercentage 10.0 ;
    plm:materialGroup "Polymer" ;
    plm:target <132285000435>
  ] ;
  plm:qualifiedRelation [
    a plm:ContainsMaterialClassRelation ;
    plm:massPercentage 5.0 ;
    plm:materialGroup "Polymer" ;
    plm:target <132285000343>
  ] ;
  plm:qualifiedRelation [
    a plm:ContainsMaterialClassRelation ;
    plm:massPercentage 84.0 ;
    plm:materialGroup "Filler" ;
    plm:target <132285000108>
  ] ;
  plm:qualifiedRelation [
    a plm:ContainsMaterialClassRelation ;
    plm:massPercentage 1.0 ;
    plm:materialGroup "Additive" ;
    plm:target <132285000179>
  ] .

In this case we can see the mass percentages sum to 100% as one would expect:

10.0 + 5.0 + 84.0 + 1.0 = 100.0

However, for other materials we observe this is not the case and we would like to define a constraint to check for this. This could be due to a typo when entering data, or because some materials are entered as having some small ‘trace’ percentage of a substance, whereby the total is not exactly 100%.

As with the previous post, we first begin by defining a SPARQL query that implements the logic for the check. In this case, we want to aggregate per material and calculate the sum of plm:massPercentage on the related plm:ContainsMaterialClassRelation via the plm:qualifiedRelation property.

A query that implements this logic is as follows:

PREFIX plm: <http://example.com/def/plm/>
SELECT ?material (sum(?massPercentage) as ?sumMassPercentage) {
  ?material plm:qualifiedRelation/plm:massPercentage ?massPercentage .
}
GROUP BY ?material

Running this query over the example data yields these results:

---------------------------------------------------------
| material                          | sumMassPercentage |
=========================================================
| <http://example.com/032226800047> | 100.00            |
| <http://example.com/132295500317> | 100.0             |
| <http://example.com/132299586663> | 100.00            |
| <http://example.com/132253533401> | 100.0             |
| <http://example.com/331214892031> | 100.1             |
| <http://example.com/340000130609> |                   |
| <http://example.com/132299586251> | 100.00            |
| <http://example.com/344000000687> | 100.00            |
| <http://example.com/331206306701> | 100.00            |
| <http://example.com/340007000868> | 100.0             |
---------------------------------------------------------

We can see that:

  • <http://example.com/331214892031> has a total mass percentage of 100.1%, which is suspect
  • <http://example.com/340000130609> has a missing sum (due to an incorrect datatype on the value, which is already caught by the datatype constraint in our shape file).

There are various ways we can write this SPARQL constraint in SHACL, but using a property shape with a path seems the best fit:

:shape1 a sh:NodeShape ;
  sh:targetSubjectsOf plm:containsMaterialClass ;
  sh:property [
    sh:path ( plm:qualifiedRelation plm:massPercentage ) ;
    sh:severity sh:Warning ;
    sh:sparql [
      a sh:SPARQLConstraint ;
      sh:message "Mass percentage of contained material classes should sum to 100%" ;
      sh:prefixes plm: ;
      sh:select """
        SELECT $this (sum(?massPercentage) as ?value) {
          $this $PATH ?massPercentage .
        }
        GROUP BY $this
        HAVING (sum(?massPercentage) != 100)
        """
    ]
  ] .

Here we have a sh:NodeShape that targets all resources that are the subject of the plm:containsMaterialClass property, which is logically any material (i.e. the rdfs:domain of the property is plm:Material). This saves having to enumerate all the different classes of materials or to materialize the inferred class memberships in the data.

Next we have defined the sh:property on this with the SHACL Property path ( plm:qualifiedRelation plm:massPercentage ) which is equivalent to the SPARQL Property path plm:qualifiedRelation/plm:massPercentage from our query. In the sh:sparql part, we define the SPARQL query where the $PATH variable will be substituted with the SPARQL Property path at runtime.

As we are using SPARQL aggregates to calculate the total mass percentage of the substances in a material, we use the HAVING keyword to operate over the grouped solution set (in the same way that FILTER operates over un-grouped ones) to only return results that are not equal to 100. Recall, from the last post in the series, that the SPARQL query must be written such that it gives results for things that do not match the constraint.
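
Since some materials legitimately include small trace percentages, the exact comparison could be relaxed to a tolerance band. Here is a sketch of an alternative sh:select body, where the 0.5% tolerance is an assumption rather than an actual business rule:

SELECT $this (sum(?massPercentage) as ?value) {
  $this $PATH ?massPercentage .
}
GROUP BY $this
# flag only totals outside an assumed tolerance band of 99.5 to 100.5
HAVING (sum(?massPercentage) < 99.5 || sum(?massPercentage) > 100.5)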

Note we also define the severity of this constraint as sh:Warning. This is because we do not want the data processing pipeline to fail, but the warning should be logged and reported to the responsible person.

The extended shape file is available here.

If we use this shape file to validate our data, we see the additional validation result:

[ a sh:ValidationResult ;
  sh:focusNode <http://example.com/331214892031> ;
  sh:resultMessage "Mass percentage of contained material classes should sum to 100%" ;
  sh:resultPath ( plm:qualifiedRelation plm:massPercentage ) ;
  sh:resultSeverity sh:Warning ;
  sh:sourceConstraint []  ;
  sh:sourceConstraintComponent sh:SPARQLConstraintComponent ;
  sh:sourceShape []  ;
  sh:value 100.1
]

This tallies with the results from our standalone query.
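
Because the validation report is itself RDF, the warnings can also be extracted from the report graph with a simple query. A sketch that lists all warning-level results:

PREFIX sh: <http://www.w3.org/ns/shacl#>

SELECT ?focusNode ?message ?value
WHERE {
  ?result a sh:ValidationResult ;
    sh:resultSeverity sh:Warning ;
    sh:focusNode ?focusNode ;
    sh:resultMessage ?message .
  OPTIONAL { ?result sh:value ?value }
}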

This demonstrates that non-trivial rules involving calculated aggregate values can be implemented using SPARQL-based constraints in SHACL using the HAVING keyword to filter the grouped solution sets.

As we are using W3C standards, we can be sure we avoid vendor-specific solutions and thus vendor lock-in.

In the next post in the series, we will look at using OWL and SHACL to implement inferencing rules for classification of materials.

This is the second in a series of posts about using SHACL to validate material composition data for semiconductor products (microchips). This results from a recent project we undertook for Nexperia. In the first post we looked at the basic data model for material composition and how basic SHACL vocabulary can be used to describe the constraints. In this post we will look at how SPARQL-based constraints can be used to implement more complex rules based on a SPARQL SELECT query.

As a working example, we will look at how we can write a rule to validate the CAS (Chemical Abstracts Service) Registry Number® (CAS RN®) of a substance. The registry contains information on more than 130 million organic and inorganic substances.

Each CAS RN identifier:

  • Is a unique numeric identifier
  • Designates only one substance
  • Has no chemical significance
  • Is a link to a wealth of information about a specific chemical substance

An example is 9003-35-4 which is the identifier for the ‘Phenol, polymer with formaldehyde’ substance.

Phenol, polymer with formaldehyde

A CAS RN includes up to 10 digits which are separated into 3 groups by hyphens. The first part of the number, starting from the left, has 2 to 7 digits; the second part has 2 digits. The final part consists of a single check digit.

In the first post, we saw already how the syntax of the CAS RN can be checked using the regex "^[0-9]{2,7}-[0-9]{2}-[0-9]$" to match the pattern.

However, the CAS RN also provides a way to do check digit verification to detect mistyped numbers, which would be useful to incorporate into our validation rules.

The CAS RN may be written in a general form as:

  Nᵢ......N₄N₃ - N₂N₁ - R

In which R represents the check digit and N represents a fundamental sequential number. The check digit is derived from the following formula:

(iNᵢ + ... + 4N₄ + 3N₃ + 2N₂ + 1N₁) mod 10 = R

For example, for ‘Phenol, polymer with formaldehyde’ RN 9003-35-4, the validity is checked as follows:

  CAS RN: 9003-35-4
sequence: 6543 21

N₆ = 9; N₅ = 0; N₄ = 0; N₃ = 3; N₂ = 3; N₁ = 5

  ((6 x 9) + (5 x 0) + (4 x 0) + (3 x 3) + (2 x 3) + (1 x 5)) mod 10
= (54 + 0 + 0 + 9 + 6 + 5) mod 10
= 74 mod 10
= 4

Valid!

Obviously there is no way to do this with the SHACL Core language. With a little thought, we can implement this validity check in SPARQL as follows:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
select ?casNum ?checksum ?test
where {
  # remove the hyphens
  bind(replace(?casNum, "-", "") as ?casNum_)

  # get the length of the RN
  bind(strlen(?casNum_) as ?len)

  # get the checksum value R
  bind(xsd:integer(substr(?casNum_, ?len-0, 1)) as ?0)   # R
  bind(xsd:integer(substr(?casNum_, ?len-1, 1))*1 as ?1) # 1N₁
  bind(xsd:integer(substr(?casNum_, ?len-2, 1))*2 as ?2) # 2N₂
  bind(xsd:integer(substr(?casNum_, ?len-3, 1))*3 as ?3) # 3N₃
  bind(xsd:integer(substr(?casNum_, ?len-4, 1))*4 as ?4) # 4N₄
  bind(xsd:integer(substr(?casNum_, ?len-5, 1))*5 as ?5) # 5N₅
  bind(xsd:integer(substr(?casNum_, ?len-6, 1))*6 as ?6) # 6N₆
  bind(xsd:integer(substr(?casNum_, ?len-7, 1))*7 as ?7) # 7N₇
  bind(xsd:integer(substr(?casNum_, ?len-8, 1))*8 as ?8) # 8N₈
  bind(xsd:integer(substr(?casNum_, ?len-9, 1))*9 as ?9) # 9N₉
  bind(
    coalesce(
      # if RN length = 10, then sum positions 1N₁ thru 9N₉, else
      if(?len=10, ?1+?2+?3+?4+?5+?6+?7+?8+?9, 1/0),
      # if RN length = 9, then sum positions 1N₁ thru 8N₈, else
      if(?len=9, ?1+?2+?3+?4+?5+?6+?7+?8, 1/0),
      # if RN length = 8, then sum positions 1N₁ thru 7N₇, else
      if(?len=8, ?1+?2+?3+?4+?5+?6+?7, 1/0),
      # if RN length = 7, then sum positions 1N₁ thru 6N₆, else
      if(?len=7, ?1+?2+?3+?4+?5+?6, 1/0),
      # if RN length = 6, then sum positions 1N₁ thru 5N₅, else
      if(?len=6, ?1+?2+?3+?4+?5, 1/0),
      # if RN length = 5, then sum positions 1N₁ thru 4N₄
      if(?len=5, ?1+?2+?3+?4, 1/0)
    ) as ?sum
  )

  # divide the sum by 10
  bind(?sum/10 as ?sum_10)

  # calculate the remainder and multiply by 10 to give the checksum
  bind(10*(?sum_10 - floor(?sum_10))  as ?checksum)

  # test if checksum = R
  bind(?checksum = ?0 as ?test)
}

We can then use the VALUES clause to pass some (counter)examples as bindings for ?casNum into the query:

values ?casNum {
  "9003-35-4"
  "1333-86-4"
  "138265-88-0"
  "60676-86-0"
  "60676-86-1"
  "1344-28-1"
  "603-35-0"
  "60-35-0"
}

Which yields the results:

+-------------+----------+-------+
|   casNum    | checksum | test  |
+-------------+----------+-------+
| 9003-35-4   |        4 | true  |
| 1333-86-4   |        4 | true  |
| 138265-88-0 |        0 | true  |
| 60676-86-0  |        0 | true  |
| 60676-86-1  |        0 | false |
| 1344-28-1   |        1 | true  |
| 603-35-0    |        0 | true  |
| 60-35-0     |        5 | false |
+-------------+----------+-------+

Now that we have validated the query logic, the constraint can be incorporated into the property shape for our plm:casNumber property by using sh:sparql:

:casNumberShape a sh:PropertyShape ;
  sh:path plm:casNumber ;
  sh:maxCount 1 ;
  sh:datatype xsd:string ;
  sh:pattern "^[0-9]{2,7}-[0-9]{2}-[0-9]$" ; # match pattern "nnnnnNN-NN-N"
  sh:sparql [
    a sh:SPARQLConstraint ;
    sh:message "Checksum of CAS Registry Number must be valid." ;
    sh:prefixes [
      sh:declare [
        sh:prefix "plm" ;
        sh:namespace "http://example.com/def/plm/"^^xsd:anyURI
      ]
    ] , [
      sh:declare [
        sh:prefix "xsd" ;
        sh:namespace "http://www.w3.org/2001/XMLSchema#"^^xsd:anyURI
      ]
    ] ;
    sh:select """
      select $this (?casNum as ?value)
      where {
        $this $PATH ?casNum                                         # match the plm:casNumber predicate
        bind(replace(?casNum, "-", "") as ?casNum_)                 # remove the hyphens
        bind(strlen(?casNum_) as ?len)                              # get the length of the RN
        bind(xsd:integer(substr(?casNum_,?len-0,1)) as ?0)          # get the checksum value R
        bind(xsd:integer(substr(?casNum_,?len-1,1))*1 as ?1)        # 1N₁
        bind(xsd:integer(substr(?casNum_,?len-2,1))*2 as ?2)        # 2N₂
        bind(xsd:integer(substr(?casNum_,?len-3,1))*3 as ?3)        # 3N₃
        bind(xsd:integer(substr(?casNum_,?len-4,1))*4 as ?4)        # 4N₄
        bind(xsd:integer(substr(?casNum_,?len-5,1))*5 as ?5)        # 5N₅
        bind(xsd:integer(substr(?casNum_,?len-6,1))*6 as ?6)        # 6N₆
        bind(xsd:integer(substr(?casNum_,?len-7,1))*7 as ?7)        # 7N₇
        bind(xsd:integer(substr(?casNum_,?len-8,1))*8 as ?8)        # 8N₈
        bind(xsd:integer(substr(?casNum_,?len-9,1))*9 as ?9)        # 9N₉
        bind(
          coalesce(
            if(?len=10,?1+?2+?3+?4+?5+?6+?7+?8+?9,1/0),             # if RN length = 10, then sum positions 1N₁ thru 9N₉, else
            if(?len=9,?1+?2+?3+?4+?5+?6+?7+?8,1/0),                 # if RN length = 9, then sum positions 1N₁ thru 8N₈, else
            if(?len=8,?1+?2+?3+?4+?5+?6+?7,1/0),                    # if RN length = 8, then sum positions 1N₁ thru 7N₇, else
            if(?len=7,?1+?2+?3+?4+?5+?6,1/0),                       # if RN length = 7, then sum positions 1N₁ thru 6N₆, else
            if(?len=6,?1+?2+?3+?4+?5,1/0),                          # if RN length = 6, then sum positions 1N₁ thru 5N₅, else
            if(?len=5,?1+?2+?3+?4,1/0)                              # if RN length = 5, then sum positions 1N₁ thru 4N₄
          ) as ?sum
        )
        bind(?sum/10 as ?sum_10)                                    # divide the sum by 10
        bind(10*(?sum_10 - floor(?sum_10))  as ?checksum)           # calculate the remainder and multiply by 10 to give the checksum
        filter(?checksum != ?0)                                     # test if checksum != R
      }
      """
  ] .

A few things to note:

  • Any prefixes that will be used in the SPARQL query must be defined using the sh:prefixes property, in this case plm: and xsd:
  • The $PATH variable in the SPARQL query is substituted at runtime by the sh:path used by the shape, in this case plm:casNumber
  • The SPARQL query must be written such that it gives results for things that do not match the constraint, in this case the FILTER clause matches when the calculated checksum is not equal to the value of R in the CAS RN

The extended shape file is available here.

If we now use this extended property shape to validate our data, we see these additional validation results (some details omitted for brevity):

[ a       <http://www.w3.org/ns/shacl#ValidationResult> ;
  <http://www.w3.org/ns/shacl#focusNode>
          <http://example.com/132285000223> ;
  <http://www.w3.org/ns/shacl#resultMessage>
          "Checksum of CAS Registry Number must be valid." ;
  <http://www.w3.org/ns/shacl#resultPath>
          plm:casNumber ;
  <http://www.w3.org/ns/shacl#resultSeverity>
          <http://www.w3.org/ns/shacl#Violation> ;
  <http://www.w3.org/ns/shacl#sourceConstraint>
          _:b1 ;
  <http://www.w3.org/ns/shacl#sourceConstraintComponent>
          <http://www.w3.org/ns/shacl#SPARQLConstraintComponent> ;
  <http://www.w3.org/ns/shacl#sourceShape>
          <http://example.com/ns#casNumberShape> ;
  <http://www.w3.org/ns/shacl#value>
          "1333-8-4"
]

and

[ a       <http://www.w3.org/ns/shacl#ValidationResult> ;
  <http://www.w3.org/ns/shacl#focusNode>
          <http://example.com/132285000108> ;
  <http://www.w3.org/ns/shacl#resultMessage>
          "Checksum of CAS Registry Number must be valid." ;
  <http://www.w3.org/ns/shacl#resultPath>
          plm:casNumber ;
  <http://www.w3.org/ns/shacl#resultSeverity>
          <http://www.w3.org/ns/shacl#Violation> ;
  <http://www.w3.org/ns/shacl#sourceConstraint>
          _:b1 ;
  <http://www.w3.org/ns/shacl#sourceConstraintComponent>
          <http://www.w3.org/ns/shacl#SPARQLConstraintComponent> ;
  <http://www.w3.org/ns/shacl#sourceShape>
          <http://example.com/ns#casNumberShape> ;
  <http://www.w3.org/ns/shacl#value>
          "7441-22-4"
]

The first violation is also picked up by the existing regex pattern match. The second violation matches the regex pattern, but is still invalid as it fails the newly added check digit verification constraint.

This demonstrates how SPARQL-based constraints can be used to capture more complex rules that are not possible to describe with the SHACL Core language. Having the full range of SPARQL expressiveness available gives an almost endless range of possibilities. These constraints can be checked using any SHACL processor that implements SHACL-SPARQL.

Note that this check will still not guarantee that the CAS RN actually exists in the CAS registry. In order to do that we would need to somehow reconcile the CAS RN against the CAS registry, or some other authority like Wikidata (e.g. Carbon Black is Q764245).

This is beyond the scope of SHACL and our project, but would open the door to integrate data published by those authorities into a consuming application.
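
As an illustration of what such a reconciliation could look like, a federated query against the public Wikidata endpoint can look up a CAS RN via the Wikidata property P231 (CAS Registry Number). This is just a sketch, not part of the validation pipeline:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?item
WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    # P231 is the Wikidata property for CAS Registry Number;
    # for "1333-86-4" (Carbon black) this should return wd:Q764245
    ?item wdt:P231 "1333-86-4" .
  }
}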

In the next post in the series, we will continue to explore the use of SPARQL-based constraints for other validation rules involving aggregation.

This is the first in a series of posts about using SHACL to validate material composition data for semiconductor products (microchips). This results from a recent project we undertook for Nexperia. In this first post we will look at the basic data model for material composition and how basic SHACL vocabulary can be used to describe the constraints.

First a short intro to Nexperia:

Nexperia is a dedicated global leader in Discretes, Logic and MOSFET devices. Nexperia is a new company with a long history, broad experience and a global customer base. Originally part of Philips, Nexperia became a business unit of NXP before becoming an independent company at the beginning of 2017.

The quality and reliability of the products Nexperia produces is paramount, as explained in this blog post. One aspect of this is the product composition, being a declaration of the substances that a product contains. This information is published via the Nexperia Quality portal.

To better understand the data that is shown, it is useful to know how microchips are composed. The following image shows how a chip is typically composed of multiple sub-parts (a bit like a Victoria sponge cake):

Chip layers

These sub-parts form the Bill Of Materials, or BOM, of the device. Each of these materials may have its own composition, for example, the mold consists mainly of plastic and the clip of copper.

The source data is modelled as an RDF graph where the material has a qualified relation to the types of substances (termed ‘Material Classes’) of which it is composed. This can be represented pictorially like this:

Graph representation of material composition

The same graph written in RDF (Turtle):

@prefix : <http://example.com/ns#> .
@prefix plm: <http://example.com/def/plm/> .

:331214892031 a plm:MouldCompound ;
  plm:name "3312 148 92031" ;
  plm:containsMaterialClass :132285000223 ;
  plm:qualifiedRelation [
    a plm:ContainsMaterialClassRelation ;
    plm:target :132285000223 ;
    plm:materialGroup "Pigment" ;
    plm:massPercentage 0.3
  ] .

:132285000223 a plm:MaterialClass ;
  plm:name "1322 850 00223" ;
  plm:description "Carbon black" ;
  plm:casNumber "1333-86-4" .

In plain English: the mould compound “3312 148 92031” contains 0.3% of material class “1322 850 00223” Carbon black (CAS number 1333-86-4) which acts as a pigment.

Note that logically there is 99.7% of other ‘stuff’ (like silica, polymer and resin) in this material, but that is not shown here for the sake of brevity.

To be able to validate this data, we want to describe the RDF properties with which a resource should be described and the expected values (datatypes, cardinality, etc.). Here we want to define three node shapes to match the three resources from the example data above.

The first shape should match the resource :331214892031.

We can start by defining that shape as follows:

@prefix : <http://example.com/ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix plm: <http://example.com/def/plm/> .

:shape1 a sh:NodeShape ;
  sh:targetNode :331214892031 .

This defines a shape that will only target the resource :331214892031, which is way too specific. To make this more generic, we can relate the shape to our plm:MouldCompound class instead, so that it will be used to validate any Mould Compound:

@prefix : <http://example.com/ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix plm: <http://example.com/def/plm/> .

:shape1 a sh:NodeShape ;
  sh:targetClass plm:MouldCompound .

That’s better, but we also want this shape to validate Adhesive, Clip, Lead Frame, and so on. Rather than relate it to one or more classes, we can better relate it to any resource that is the subject of a plm:containsMaterialClass property. That can be done as follows:

@prefix : <http://example.com/ns#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix plm: <http://example.com/def/plm/> .

:shape1 a sh:NodeShape ;
  sh:targetSubjectsOf plm:containsMaterialClass .

This will then apply the shape to anything that contains a material class, perfect!

Next we want to extend the shape to validate the properties:

  • rdf:type
  • plm:name
  • plm:containsMaterialClass
  • plm:qualifiedRelation

Let’s go through each in turn.

For rdf:type, we want to describe the constraint that there is one rdf:type statement whose value is an IRI. That can be written in SHACL like this:

@prefix : <http://example.com/ns#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix plm: <http://example.com/def/plm/> .

:typeShape a sh:PropertyShape ;
  sh:path rdf:type ;
  sh:minCount 1 ;
  sh:maxCount 1 ;
  sh:nodeKind sh:IRI .

For plm:name, we want to describe the constraint that there is one plm:name statement whose value is a string matching the pattern “NNNN NNN NNNNN”.

:nameShape a sh:PropertyShape ;
  sh:path plm:name ;
  sh:minCount 1 ;
  sh:maxCount 1 ;
  sh:datatype xsd:string ;
  sh:pattern "^[0-9]{4} [0-9]{3} [0-9]{5}$" .

For plm:containsMaterialClass, we want to describe the constraint that there are one or more plm:containsMaterialClass statements, where all values are of type plm:MaterialClass.

:containsMaterialClassShape a sh:PropertyShape ;
  sh:path plm:containsMaterialClass ;
  sh:minCount 1 ;
  sh:class plm:MaterialClass .

For plm:qualifiedRelation, we want to describe the constraint that there are one or more plm:qualifiedRelation statements, where all values are of type plm:ContainsMaterialClassRelation. Where the previous property shapes were generic and re-usable, we know this constraint is specific to the use of plm:qualifiedRelation on instances that match our node shape. Therefore we define it as a blank node and reference it from the node shape:

:shape1 a sh:NodeShape ;
  sh:targetSubjectsOf plm:containsMaterialClass ;
  sh:property [
    sh:path plm:qualifiedRelation ;
    sh:minCount 1 ;
    sh:class plm:ContainsMaterialClassRelation
  ] .

To complete this we can also refer to the other property shapes we have defined. So bringing it all together:

@prefix : <http://example.com/ns#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix plm: <http://example.com/def/plm/> .

:shape1 a sh:NodeShape ;
  sh:targetSubjectsOf plm:containsMaterialClass ;
  sh:property [
    sh:path plm:qualifiedRelation ;
    sh:minCount 1 ;
    sh:class plm:ContainsMaterialClassRelation
  ] , :typeShape , :nameShape , :containsMaterialClassShape .

:typeShape a sh:PropertyShape ;
  sh:path rdf:type ;
  sh:minCount 1 ;
  sh:maxCount 1 ;
  sh:nodeKind sh:IRI .

:nameShape a sh:PropertyShape ;
  sh:path plm:name ;
  sh:minCount 1 ;
  sh:maxCount 1 ;
  sh:datatype xsd:string ;
  sh:pattern "^[0-9]{4} [0-9]{3} [0-9]{5}$" .

:containsMaterialClassShape a sh:PropertyShape ;
  sh:path plm:containsMaterialClass ;
  sh:minCount 1 ;
  sh:class plm:MaterialClass .

Next to this we also want to define shapes for our plm:ContainsMaterialClassRelation and plm:MaterialClass classes. These can be defined as follows:

:ContainsMaterialClassRelationShape a sh:NodeShape ;
  sh:targetClass plm:ContainsMaterialClassRelation ;
  sh:property [
    sh:path plm:target ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
    sh:class plm:MaterialClass
  ] , [
    sh:path plm:materialGroup ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
    sh:datatype xsd:string
  ] , [
    sh:path plm:massPercentage ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
    sh:datatype xsd:decimal
  ] .

:MaterialClassShape a sh:NodeShape ;
  sh:targetClass plm:MaterialClass ;
  sh:property :typeShape , :nameShape , :descriptionShape , :casNumberShape .

:descriptionShape a sh:PropertyShape ;
  sh:path plm:description ;
  sh:minCount 1 ;
  sh:maxCount 1 ;
  sh:datatype xsd:string .

:casNumberShape a sh:PropertyShape ;
  sh:path plm:casNumber ;
  sh:maxCount 1 ;
  sh:datatype xsd:string ;
  sh:pattern "^[0-9]{2,7}-[0-9]{2}-[0-9]$" . # match pattern "nnnnnNN-NN-N"

The shape file and some sample data are available here and here.

Now we can use the shapes we have defined to validate the sample data. You can use the online SHACL Playground tool for this, but I prefer to use the Java SHACL API from the command line.

To validate, you can use this command:

shaclvalidate.sh -datafile data.ttl -shapesfile shapes1.ttl

The result is a validation report, also in RDF, that describes the constraint checks that failed. For our example data, the validation report looks like this:

@prefix plm:   <http://example.com/def/plm/> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:    <http://www.w3.org/ns/shacl#> .

[ a sh:ValidationReport ;
  sh:conforms false ;
  sh:result [
    a sh:ValidationResult ;
    sh:focusNode [] ;
    sh:resultMessage "Less than 1 values" ;
    sh:resultPath plm:massPercentage ;
    sh:resultSeverity sh:Violation ;
    sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
    sh:sourceShape _:b0 ] ;
  sh:result [
    a sh:ValidationResult ;
    sh:focusNode [] ;
    sh:resultMessage "Value does not have datatype xsd:decimal" ;
    sh:resultPath plm:massPercentage ;
    sh:resultSeverity sh:Violation ;
    sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
    sh:sourceShape _:b0 ;
    sh:value "100.0" ] ;
  sh:result [
    a sh:ValidationResult ;
    sh:focusNode <http://example.com/132285000223> ;
    sh:resultMessage "Value does not match pattern \"^[0-9]{2,7}-[0-9]{2}-[0-9]$\"" ;
    sh:resultPath plm:casNumber ;
    sh:resultSeverity sh:Violation ;
    sh:sourceConstraintComponent sh:PatternConstraintComponent ;
    sh:sourceShape <http://example.com/ns#casNumberShape> ;
    sh:value "1333-8-4" ]
] .

This report says the data does not conform to the shapes (sh:conforms = false) and there are 3 validation errors:

  • A plm:massPercentage is missing
  • Another plm:massPercentage has a value that does not have the expected datatype xsd:decimal
  • A plm:casNumber has a value that does not match the regex pattern

This demonstrates the basics of SHACL and how it can be used to validate RDF data according to a set of constraints.

In the next post in the series, we’ll look at using more complex rules defined in SPARQL to express additional constraints on the data.

For example, given the following (shamelessly borrowed) pseudocode:

If student’s grade is greater than or equal to 90 then
    Display “A”
Else
    If student’s grade is greater than or equal to 80 then
        Display “B”
    Else
        If student’s grade is greater than or equal to 70 then
            Display “C”
        Else
            If student’s grade is greater than or equal to 60 then
                Display “D”
            Else
                Display “F”

Of course this can be implemented in SPARQL using nested IF statements:

BIND (
  IF(?grade >= 90, "A",
    IF(?grade >= 80, "B",
      IF(?grade >= 70, "C",
        IF(?grade >= 60, "D", "F")
      )
    )
  ) AS ?result
)

However, when the logic gets more complex, this can quickly become hard to read and debug.

So let’s look at COALESCE in more detail, the following taken from the SPARQL 1.1 Recommendation:

The COALESCE function form returns the RDF term value of the first expression that evaluates without error. In SPARQL, evaluating an unbound variable raises an error.

So for each conditional test, we just need the expression to evaluate to an error when the test does not match. This can be easily achieved using IF, where we put our required output in the second argument and some expression that evaluates to an error as the third argument.

IF(?test, "Yay!", 1/0)

The “otherwise” case can be handled by adding a degenerate case that will always evaluate without error.

So the equivalent logic for our grades example can also be implemented in SPARQL using COALESCE:

BIND (
  COALESCE(
    IF(?grade >= 90, "A", 1/0),
    IF(?grade >= 80, "B", 1/0),
    IF(?grade >= 70, "C", 1/0),
    IF(?grade >= 60, "D", 1/0),
    "F"
  ) AS ?result
)

In the end it comes down to taste, but arguably the latter is a cleaner approach.

For a current project we need to load data from an RDF graph store to populate tables in an existing SQL database. The approach we chose is to use SPARQL SELECT queries to express the mapping from the graph model to the tabular model. The CSV results from such a query can be used to populate the tables in the target database.

One of the columns in the target table contains coded values, where each code indicates the ‘type’ of the thing the row describes. The codes are two or three letters (e.g. FE, FA, NMF) where each code has a predetermined meaning.

However the RDF data used more types than in the target database, so it was necessary to create an n:1 mapping.

An example of such mappings using Schema.org classes is:

schema:Person --> "PE"
schema:Book --> "BK"
schema:MusicAlbum --> "ALB"
schema:TVClip --> "TV"
schema:TVSeries --> "TV"
schema:TVEpisode --> "TV"

One option would be to express these mappings in RDF, load them to the graph store and query over them. The above mappings might be written in Turtle as:

@prefix schema: <http://schema.org/> .
@prefix dct: <http://purl.org/dc/terms/> .

schema:Person dct:identifier "PE" .
schema:Book dct:identifier "BK" .
schema:MusicAlbum dct:identifier "ALB" .
schema:TVClip dct:identifier "TV" .
schema:TVSeries dct:identifier "TV" .
schema:TVEpisode dct:identifier "TV" .

If this were loaded to the RDF graph store along with the dataset, we could do a query like this:

PREFIX schema: <http://schema.org/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?name ?type_code
WHERE {
  ?s a ?type ;
    schema:name ?name .
  ?type dct:identifier ?type_code .
}

If it makes sense to partition the data, the code list could be loaded into a named graph, or into a separate SPARQL endpoint and queried using federation.
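
For example, if the code list were loaded into its own named graph, the mapping query could be scoped with a GRAPH clause. This is a sketch; the graph IRI is purely illustrative:

PREFIX schema: <http://schema.org/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?name ?type_code
WHERE {
  ?s a ?type ;
    schema:name ?name .
  # hypothetical named graph holding the code list mappings
  GRAPH <http://example.com/graph/type-codes> {
    ?type dct:identifier ?type_code .
  }
}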

However in this case, as the value list is only really applicable to the target database, we decided to use the VALUES clause in SPARQL to associate the code to the class. In some ways this is simpler as all the logic is encapsulated in a single SPARQL query, which should be easier to maintain.

PREFIX schema: <http://schema.org/>
SELECT ?name ?type_code
WHERE {
  VALUES (?type ?type_code) {
    (schema:Person "PE")
    (schema:Book "BK")
    (schema:MusicAlbum "ALB")
    (schema:TVClip "TV")
    (schema:TVSeries "TV")
    (schema:TVEpisode "TV")
  }
  ?s a ?type ;
    schema:name ?name .
}

Should it be necessary to have some ‘otherwise’ case, the VALUES clause can be wrapped in an OPTIONAL clause, and the COALESCE function used to provide a default binding for any solution where the ?type_code variable is not bound:

PREFIX schema: <http://schema.org/>
SELECT ?name ?type_code_default
WHERE {
  ?s a ?type ;
    schema:name ?name .
  OPTIONAL {
    VALUES (?type ?type_code) {
      (schema:Person "PE")
      (schema:Book "BK")
      (schema:MusicAlbum "ALB")
      (schema:TVClip "TV")
      (schema:TVSeries "TV")
      (schema:TVEpisode "TV")
    }
  }
  BIND (COALESCE(?type_code, "NK") as ?type_code_default)
}

Although this post uses the Dydra graph database cloud service to illustrate the concepts, the approach is equally applicable to any RDF graph store that supports the SPARQL 1.1 Graph Store HTTP Protocol.

Loading data to Dydra is pretty simple from the UI or with SPARQL 1.1 Update; however, it can sometimes be easier to implement or work with HTTP operations at the named graph level (where a named graph can be thought of as an RDF ‘document’).

To enable this Dydra supports the HTTP GET, PUT, DELETE and POST methods as defined by the SPARQL 1.1 Graph Store HTTP Protocol recommendation.

The graph store service endpoint of a repository is the repository URI followed by /service i.e.:

http://dydra.com/{user}/{repo}/service

To access the default graph, simply add the ?default parameter. To access a specific named graph, use the ?graph parameter with the percent-encoded IRI of the named graph as the value (Dydra uses the indirect graph identification approach). For example, to work with the graph http://example.com/mygraph use the parameter ?graph=http%3A%2F%2Fexample.com%2Fmygraph (you can easily percent-encode the graph IRI using online tools such as the URL Decoder/Encoder).

If the Dydra repository has privacy settings applied, you will also need to authenticate using basic HTTP authentication in conjunction with your API key e.g.:

http://dydra.com/{user}/{repo}/service?default&auth_token={MY_API_KEY}

The HTTP methods behave as follows:

  • GET fetches a serialization of the specified graph. The default format is Turtle, but other formats can be requested via content negotiation using the Accept header
  • PUT stores the RDF payload in the specified graph. Any existing data in the graph is overwritten. The format of the supplied RDF is specified using the Content-Type header
  • DELETE removes the specified graph from the repository. If the default graph is specified, it is equivalent to DROP DEFAULT in SPARQL 1.1 Update
  • POST appends the RDF payload to the specified graph. Any existing data in the graph is retained (an RDF merge of the graphs is performed). Again the format of the supplied RDF is specified using the Content-Type header.

Examples using curl

Get the default graph in the nlv01111/gsp repository in (default) Turtle serialization format:

curl https://dydra.com/nlv01111/gsp/service?default

Get the named graph http://example.com/mygraph in the nlv01111/gsp repository in JSON-LD format and output to local file data.jsonld:

curl "https://dydra.com/nlv01111/gsp/service?graph=http%3A%2F%2Fexample.com%2Fmygraph" \
  --header "Accept: application/ld+json" \
  --output data.jsonld

Put a local Turtle file data.ttl to the default graph in the nlv01111/gsp repository:

curl -X PUT "https://dydra.com/nlv01111/gsp/service?default&auth_token=${MY_API_KEY}" \
  --data-binary @data.ttl \
  --header "Content-Type: text/turtle"

Put a local RDF/XML file data.rdf to the named graph http://example.com/mygraph in the nlv01111/gsp repository:

curl -X PUT "https://dydra.com/nlv01111/gsp/service?graph=http%3A%2F%2Fexample.com%2Fid%2Fmygraph&auth_token=${MY_API_KEY}" \
  --data-binary @data.rdf \
  --header "Content-Type: application/rdf+xml"

Electronic components. Image by Kae - Own work, Public Domain.

What if a customer were able to go to one place to:

  • Find and compare different products from different manufacturers
  • Compare prices and availability from different sellers
  • Compare stock levels and delivery times
  • Make a purchase

Surely such an open marketplace puts the customer in a better position to compare products and offerings and to make more informed purchasing decisions. Power to the customer!

In Google’s opinion this place is Google Shopping.

But what does Google get out of this? Simple. Sellers pay for ‘sponsored’ links to get their offerings positioned more visibly. Also, as Google knows the customers’ preferences, they can charge manufacturers for personalized/targeted ads towards those customers. That’s what I call having your cake and eating it…

Google is already doing this in the B2C space. If they can pull off the same trick in B2B, whose business model is really under threat?

Structured data on the web

Let’s consider how Google might be able to get all the data needed to power such a rich search experience. Many webmasters already recognize “structured markup” as a best practice, but what does that mean exactly? Basically it refers to the use of various syntaxes like RDFa, JSON-LD and HTML5 Microdata to embed machine-readable (meta)data into web pages using standard vocabularies like Schema.org. This embedded data can help a search engine, or any other program reading the page, to better understand what the page is about and provide more accurate search results.

For components this might be used to describe a product model and the available variants, along with the key selection parameters for the product. This is the kind of data you would expect the manufacturer to publish on their website. Alongside this, a distributor may include structured markup detailing the amount of stock they have of a particular part, its price, ordering quantities, delivery methods, and so on. Google (or anyone else reading the data) can then merge these data sources together, either using URIs as global identifiers or some other identifier like a GTIN, to provide a complete picture of the product.
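
To make this concrete, here is a sketch of a query over such extracted data, assuming the manufacturer publishes schema:Product markup and the distributor publishes schema:Offer markup that both carry a gtin13:

PREFIX schema: <http://schema.org/>
SELECT ?productName ?sellerName ?price
WHERE {
  # product description extracted from the manufacturer's pages
  ?product a schema:Product ;
    schema:name ?productName ;
    schema:gtin13 ?gtin .
  # offer details extracted from a distributor's pages, joined on the shared GTIN
  ?offer a schema:Offer ;
    schema:gtin13 ?gtin ;
    schema:price ?price ;
    schema:seller/schema:name ?sellerName .
}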

This approach vastly simplifies gathering and indexing of this data amongst many other benefits.

Sounds great, so where’s the hitch?

As far as I can see, there isn’t one. OK, it takes a certain amount of work to understand how to apply these vocabularies and include the right markup in the page. The latter can usually be done in the templates used to generate the pages with minimal fuss (hands up who’s still authoring their HTML by hand).

Once the markup is in place anyone with the right tools is able to extract the data from the page and use it. This really puts the ownership and responsibility to maintain the data with the relevant parties and provides open access to the data for anyone with an internet connection.

So, for example, a distributor can access the data from a manufacturer to use on their own website and vice-versa. Plus search engines like Google, Yahoo! and Bing can access all this great data to provide a better search experience. Also any third party can access the data to build new applications based on business models no-one thought of yet.

The only partial threat I see is to data aggregators who currently gather the information together (often with manual processes), align it and sell it on. Opening up access to the data directly from the relevant sources will render much of the manual data gathering process obsolete. However, this is currently a labor-intensive process that is very prone to mistakes. If this can be automated to be faster and improve data quality whilst reducing costs, great. It also provides a nice incentive for these companies to come up with more innovative services that add real value to the data. After all, who really wants to spend their time re-typing numbers into databases?

Of course this is not all going to happen overnight. For example, it will probably take quite some time for the industry to converge on standard properties for the parametric data. Perhaps this is one area where there is scope for new business models to help accelerate this process?

There will also be a certain amount of fear from manufacturers that opening up access to their product data will allow their products to be compared to the competition more easily and could potentially lead to products just being compared on price. Well guys, that comparison is already happening, so make damn sure you are in control of the data that is being used for that comparison! Being able to articulate why your product is worth the extra few cents is part of the game.

An analogous example is searching for a hotel for your next city trip. Often you’re not just looking for the cheapest room, but for the right facilities and location that meet your requirements and budget. Why should shopping for chips be any different?

Another point to consider: if the customer is kept within the Google experience, how will that affect traffic to your own site? One concern is that your website just becomes a data feed for Google. Well, what’s the real goal here: hit count or sales? It makes sense to ensure your product information is accurate no matter where the customer finds it, in order to improve the chances of making the sale.

Of course making sure you get your brand message to the customers is still very important. This is where you need to give customers a reason to visit your site. A couple of ideas:

  • Provide engaging content that can help engineers design new and exciting products and provide ideas for possible future developments
  • Use your subject matter knowledge to provide ‘vertical’ search interfaces that go beyond those a generic search engine like Google can offer

So what are you waiting for? Get in touch to learn more about adding structured markup to your website.