This is the second in a series of posts about using SHACL to validate material composition data for semiconductor products (microchips). The series results from a recent project we undertook for Nexperia. In the first post we looked at the basic data model for material composition and how the basic SHACL vocabulary can be used to describe the constraints. In this post we will look at how SPARQL-based constraints can be used to implement more complex rules based on a SPARQL SELECT query.
As a working example, we will look at how we can write a rule to validate the CAS (Chemical Abstracts Service) Registry Number® (CAS RN®) of a substance. The registry contains information on more than 130 million organic and inorganic substances.
An example of a CAS RN identifier is 9003-35-4, which is the identifier for the ‘Phenol, polymer with formaldehyde’ substance.
A CAS RN includes up to 10 digits which are separated into 3 groups by hyphens. The first part of the number, starting from the left, has 2 to 7 digits; the second part has 2 digits. The final part consists of a single check digit.
In the first post, we already saw how the syntax of the CAS RN can be checked using the regex "^[0-9]{2,7}-[0-9]{2}-[0-9]$" to match the pattern.
However, the CAS RN also provides a way to do check digit verification to detect mistyped numbers, which would be useful to incorporate into our validation rules.
The CAS RN may be written in a general form as:
Nᵢ......N₄N₃ - N₂N₁ - R
in which R represents the check digit and N represents a fundamental sequential number.
The check digit is derived from the following formula:
(iNᵢ + ... + 4N₄ + 3N₃ + 2N₂ + 1N₁) mod 10 = R
For example, for ‘Phenol, polymer with formaldehyde’ (RN 9003-35-4), the validity is checked as follows:
CAS RN:   9003-35-4
sequence: 6543 21
N₆ = 9; N₅ = 0; N₄ = 0; N₃ = 3; N₂ = 3; N₁ = 5
((6 x 9) + (5 x 0) + (4 x 0) + (3 x 3) + (2 x 3) + (1 x 5)) mod 10
= (54 + 0 + 0 + 9 + 6 + 5) mod 10
= 74 mod 10
= 4
Valid!
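The weighted sum above can also be sketched outside SPARQL, which is handy for cross-checking results by hand. The following is a minimal Python sketch, not part of the original project: the function names `cas_check_digit` and `is_valid_cas_rn` are hypothetical, and the code assumes its input already matches the nnnnnNN-NN-N syntax.

```python
def cas_check_digit(cas_rn: str) -> int:
    """Compute the expected check digit R for a CAS RN such as '9003-35-4'."""
    digits = cas_rn.replace("-", "")
    body = digits[:-1]  # every digit except the trailing check digit R
    # Walk the body from right to left so that N1 gets weight 1, N2 weight 2, ...
    weighted_sum = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return weighted_sum % 10


def is_valid_cas_rn(cas_rn: str) -> bool:
    """Return True when the trailing digit equals the computed check digit."""
    return int(cas_rn[-1]) == cas_check_digit(cas_rn)


print(cas_check_digit("9003-35-4"))  # weighted sum is 74, so this prints 4
print(is_valid_cas_rn("9003-35-4"))  # True
```

Iterating over the reversed digits avoids having to know the length of the first group in advance, since the weights are counted from the right.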
Obviously there is no way to do this with the SHACL Core language. With a little thought, we can implement this validity check in SPARQL as follows:
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select ?casNum ?checksum ?test
where {
  # remove the hyphens
  bind(replace(?casNum, "-", "") as ?casNum_)
  # get the length of the RN
  bind(strlen(?casNum_) as ?len)
  # get the checksum value R (the last digit), then each weighted digit iNᵢ
  bind(xsd:integer(substr(?casNum_, ?len-0, 1)) as ?0)   # R
  bind(xsd:integer(substr(?casNum_, ?len-1, 1))*1 as ?1) # 1N₁
  bind(xsd:integer(substr(?casNum_, ?len-2, 1))*2 as ?2) # 2N₂
  bind(xsd:integer(substr(?casNum_, ?len-3, 1))*3 as ?3) # 3N₃
  bind(xsd:integer(substr(?casNum_, ?len-4, 1))*4 as ?4) # 4N₄
  bind(xsd:integer(substr(?casNum_, ?len-5, 1))*5 as ?5) # 5N₅
  bind(xsd:integer(substr(?casNum_, ?len-6, 1))*6 as ?6) # 6N₆
  bind(xsd:integer(substr(?casNum_, ?len-7, 1))*7 as ?7) # 7N₇
  bind(xsd:integer(substr(?casNum_, ?len-8, 1))*8 as ?8) # 8N₈
  bind(xsd:integer(substr(?casNum_, ?len-9, 1))*9 as ?9) # 9N₉
  # 1/0 raises an error, so coalesce falls through to the next branch
  bind(
    coalesce(
      # if RN length = 10, then sum positions 1N₁ thru 9N₉, else
      if(?len=10, ?1+?2+?3+?4+?5+?6+?7+?8+?9, 1/0),
      # if RN length = 9, then sum positions 1N₁ thru 8N₈, else
      if(?len=9, ?1+?2+?3+?4+?5+?6+?7+?8, 1/0),
      # if RN length = 8, then sum positions 1N₁ thru 7N₇, else
      if(?len=8, ?1+?2+?3+?4+?5+?6+?7, 1/0),
      # if RN length = 7, then sum positions 1N₁ thru 6N₆, else
      if(?len=7, ?1+?2+?3+?4+?5+?6, 1/0),
      # if RN length = 6, then sum positions 1N₁ thru 5N₅, else
      if(?len=6, ?1+?2+?3+?4+?5, 1/0),
      # if RN length = 5, then sum positions 1N₁ thru 4N₄
      if(?len=5, ?1+?2+?3+?4, 1/0)
    ) as ?sum
  )
  # divide the sum by 10
  bind(?sum/10 as ?sum_10)
  # take the fractional part and multiply by 10 to give the checksum (sum mod 10)
  bind(10*(?sum_10 - floor(?sum_10)) as ?checksum)
  # test if checksum = R
  bind(?checksum = ?0 as ?test)
}
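SPARQL 1.1 defines no modulo operator, which is why the query derives the remainder from a division and `floor`. A minimal Python sketch of the same arithmetic follows; it uses `Decimal` to mimic the exact decimal division that SPARQL performs on integers, and the helper name `remainder_via_floor` is hypothetical. It suggests that, for non-negative sums, the result agrees with an ordinary modulo:

```python
import math
from decimal import Decimal


def remainder_via_floor(total: int) -> Decimal:
    """Mirror the query: divide by 10, drop the integer part, scale back up."""
    sum_10 = Decimal(total) / Decimal(10)       # e.g. 74 / 10 = 7.4
    return 10 * (sum_10 - math.floor(sum_10))   # e.g. 10 * (7.4 - 7)


# Compare against the built-in modulo for a range of plausible weighted sums.
agrees = all(remainder_via_floor(n) == n % 10 for n in range(0, 500))
print(agrees)  # True
```

Exact decimal arithmetic matters here: with binary floating point, the scaled fractional part could come out slightly off from a whole number and the equality test against R might fail.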
We can then use a VALUES clause to pass some (counter)examples as bindings for ?casNum into the query:
values ?casNum {
"9003-35-4"
"1333-86-4"
"138265-88-0"
"60676-86-0"
"60676-86-1"
"1344-28-1"
"603-35-0"
"60-35-0"
}
Which yields the results:
+-------------+----------+-------+
| casNum      | checksum | test  |
+-------------+----------+-------+
| 9003-35-4   | 4        | true  |
| 1333-86-4   | 4        | true  |
| 138265-88-0 | 0        | true  |
| 60676-86-0  | 0        | true  |
| 60676-86-1  | 0        | false |
| 1344-28-1   | 1        | true  |
| 603-35-0    | 0        | true  |
| 60-35-0     | 5        | false |
+-------------+----------+-------+
Now that we have validated the query logic, the constraint can be incorporated into the property shape for our plm:casNumber property by using sh:sparql:
:casNumberShape a sh:PropertyShape ;
  sh:path plm:casNumber ;
  sh:maxCount 1 ;
  sh:datatype xsd:string ;
  sh:pattern "^[0-9]{2,7}-[0-9]{2}-[0-9]$" ; # match pattern "nnnnnNN-NN-N"
  sh:sparql [
    a sh:SPARQLConstraint ;
    sh:message "Checksum of CAS Registry Number must be valid." ;
    sh:prefixes [
      sh:declare [
        sh:prefix "plm" ;
        sh:namespace "http://example.com/def/plm/"^^xsd:anyURI
      ]
    ] , [
      sh:declare [
        sh:prefix "xsd" ;
        sh:namespace "http://www.w3.org/2001/XMLSchema#"^^xsd:anyURI
      ]
    ] ;
    sh:select """
      select $this (?casNum as ?value)
      where {
        $this $PATH ?casNum # match the plm:casNumber predicate
        bind(replace(?casNum, "-", "") as ?casNum_) # remove the hyphens
        bind(strlen(?casNum_) as ?len) # get the length of the RN
        bind(xsd:integer(substr(?casNum_,?len-0,1)) as ?0) # get the checksum value R
        bind(xsd:integer(substr(?casNum_,?len-1,1))*1 as ?1) # 1N₁
        bind(xsd:integer(substr(?casNum_,?len-2,1))*2 as ?2) # 2N₂
        bind(xsd:integer(substr(?casNum_,?len-3,1))*3 as ?3) # 3N₃
        bind(xsd:integer(substr(?casNum_,?len-4,1))*4 as ?4) # 4N₄
        bind(xsd:integer(substr(?casNum_,?len-5,1))*5 as ?5) # 5N₅
        bind(xsd:integer(substr(?casNum_,?len-6,1))*6 as ?6) # 6N₆
        bind(xsd:integer(substr(?casNum_,?len-7,1))*7 as ?7) # 7N₇
        bind(xsd:integer(substr(?casNum_,?len-8,1))*8 as ?8) # 8N₈
        bind(xsd:integer(substr(?casNum_,?len-9,1))*9 as ?9) # 9N₉
        bind(
          coalesce(
            if(?len=10,?1+?2+?3+?4+?5+?6+?7+?8+?9,1/0), # if RN length = 10, then sum positions 1N₁ thru 9N₉, else
            if(?len=9,?1+?2+?3+?4+?5+?6+?7+?8,1/0), # if RN length = 9, then sum positions 1N₁ thru 8N₈, else
            if(?len=8,?1+?2+?3+?4+?5+?6+?7,1/0), # if RN length = 8, then sum positions 1N₁ thru 7N₇, else
            if(?len=7,?1+?2+?3+?4+?5+?6,1/0), # if RN length = 7, then sum positions 1N₁ thru 6N₆, else
            if(?len=6,?1+?2+?3+?4+?5,1/0), # if RN length = 6, then sum positions 1N₁ thru 5N₅, else
            if(?len=5,?1+?2+?3+?4,1/0) # if RN length = 5, then sum positions 1N₁ thru 4N₄
          ) as ?sum
        )
        bind(?sum/10 as ?sum_10) # divide the sum by 10
        bind(10*(?sum_10 - floor(?sum_10)) as ?checksum) # calculate the remainder and multiply by 10 to give the checksum
        filter(?checksum != ?0) # test if checksum != R
      }
    """
  ] .
A few things to note:
- The namespace prefixes used in the SPARQL query are declared using the sh:prefixes property, in this case plm: and xsd:
- The $PATH variable in the SPARQL query is substituted at runtime by the sh:path used by the shape, in this case plm:casNumber
- The FILTER clause matches when the calculated checksum is not equal to the value of R in the CAS RN
The extended shape file is available here.
If we now use this extended property shape to validate our data, we see these additional validation results (some details omitted for brevity):
[ a <http://www.w3.org/ns/shacl#ValidationResult> ;
<http://www.w3.org/ns/shacl#focusNode>
<http://example.com/132285000223> ;
<http://www.w3.org/ns/shacl#resultMessage>
"Checksum of CAS Registry Number must be valid." ;
<http://www.w3.org/ns/shacl#resultPath>
plm:casNumber ;
<http://www.w3.org/ns/shacl#resultSeverity>
<http://www.w3.org/ns/shacl#Violation> ;
<http://www.w3.org/ns/shacl#sourceConstraint>
_:b1 ;
<http://www.w3.org/ns/shacl#sourceConstraintComponent>
<http://www.w3.org/ns/shacl#SPARQLConstraintComponent> ;
<http://www.w3.org/ns/shacl#sourceShape>
<http://example.com/ns#casNumberShape> ;
<http://www.w3.org/ns/shacl#value>
"1333-8-4"
]
and
[ a <http://www.w3.org/ns/shacl#ValidationResult> ;
<http://www.w3.org/ns/shacl#focusNode>
<http://example.com/132285000108> ;
<http://www.w3.org/ns/shacl#resultMessage>
"Checksum of CAS Registry Number must be valid." ;
<http://www.w3.org/ns/shacl#resultPath>
plm:casNumber ;
<http://www.w3.org/ns/shacl#resultSeverity>
<http://www.w3.org/ns/shacl#Violation> ;
<http://www.w3.org/ns/shacl#sourceConstraint>
_:b1 ;
<http://www.w3.org/ns/shacl#sourceConstraintComponent>
<http://www.w3.org/ns/shacl#SPARQLConstraintComponent> ;
<http://www.w3.org/ns/shacl#sourceShape>
<http://example.com/ns#casNumberShape> ;
<http://www.w3.org/ns/shacl#value>
"7441-22-4"
]
The first violation is also picked up by the existing regex pattern match. The second value matches the regex pattern, but is still invalid because it fails the newly added check digit verification constraint.
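How the two constraints complement each other can be illustrated with a small Python sketch. This is an illustration only, not how a SHACL processor evaluates the shape, and the helper names `matches_pattern` and `checksum_ok` are hypothetical. It applies the regex from sh:pattern and the check digit test to the two reported values, plus one valid number for comparison:

```python
import re

# Same regular expression as the sh:pattern constraint in the shape.
CAS_PATTERN = re.compile(r"^[0-9]{2,7}-[0-9]{2}-[0-9]$")


def matches_pattern(cas_rn: str) -> bool:
    """Syntax check only: digit groups of 2 to 7, 2 and 1 separated by hyphens."""
    return CAS_PATTERN.match(cas_rn) is not None


def checksum_ok(cas_rn: str) -> bool:
    """Check digit verification: weighted sum of the body digits modulo 10."""
    digits = cas_rn.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


for value in ("1333-8-4", "7441-22-4", "9003-35-4"):
    print(value, matches_pattern(value), checksum_ok(value))
# 1333-8-4 False False
# 7441-22-4 True False
# 9003-35-4 True True
```

Only the second value needs the SPARQL-based constraint to be caught, since its syntax is well formed.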
This demonstrates how SPARQL-based constraints can be used to capture more complex rules that cannot be described with the SHACL Core language. Having the full expressiveness of SPARQL available opens up a very wide range of possibilities. These constraints can be checked using any SHACL processor that implements SHACL-SPARQL.
Note that this check will still not guarantee that the CAS RN actually exists in the CAS registry. In order to do that we would need to somehow reconcile the CAS RN against the CAS registry, or some other authority like Wikidata (e.g. Carbon Black is Q764245).
This is beyond the scope of SHACL and our project, but would open the door to integrate data published by those authorities into a consuming application.
In the next post in the series, we will continue to explore the use of SPARQL constraints for other validation rules involving aggregation.