API

POST to https://env.zignsec.com/v2/ekyc/fuzzymatch, where env is either api or test. (Note that compareItems is an earlier, still operational synonym for the fuzzyMatch API above.)

API overview

  • FuzzyMatch is a general fuzzy match service for freely comparing strings or structures/objects (JSON). The response is a total score and details for each string comparison.
  • It is free of charge.
  • The main MatchScore field shows a high-level score: HIGH means a good match, MEDIUM means a possible match and LOW means no match.
  • See Match Scoring algorithm for expression details and pre-defined scorer configurations.
  • See ScoringExpression for customization of the scoring algorithm.
  • Note that both field names and values are case-insensitive in the data. For example, “name”:“John” is the same as “Name”:“JOHN”.

A first look

Raw request:

POST https://test.zignsec.com/v2/ekyc/fuzzymatch HTTP/1.1
Content-Type: application/json; charset=UTF-8
Authorization: -your key here-
Host: test.zignsec.com
Content-Length: 120

{
"Item1":{"fulllname":"Edgar Allan Poe"},
"Item2":{"fulllname":"Egar Alin Poe"},
"responseformat":"structured"
}

The response shows a distance of 3 characters (ld=3), a MEDIUM match and 80% similarity. The three changes are: 1) delete a D; 2) delete an L; and 3) change an A into an I.

{
  "id":"e85144b0-9399-4803-9f8e-9fec3da0856a",
  "errors":[],
  "MatchScore":{
    "MatchLevel":"MEDIUM",
    "MatchPercentage":80,
    "SubScores":[
      {
        "FieldName":"fulllname",
        "CouldBeComputed":true,
        "MatchPercentage":80,
        "Weight":1.0,
        "ComparisonType":"LevenshteinDistance",
        "MeasureOfLevenshteinDistance":3,
        "MeasureOfEquality":false
      }
    ],
    "ScoringExpression":"85%/60% | fulllname;1;ld"
  }
}

FuzzyMatch API

The main element in the response is MatchScore, which shows the degree of match in a stop-light fashion (HIGH for a good match, MEDIUM or LOW), followed by detailed percentages for each sub score computed. For quicker reading of the results, ResponseFormat is set to Flat by default, which gives a format closely resembling the input parameter ScoringExpression.

Note: The Item1 and Item2 parameters can switch places without changing the match score, except that only Item1 determines which fields are expected.

Note: There are two call options: set either Item2 or Item2s. Multiple comparisons can be performed in the same call by sending in an Item2s array.

Parameters

  • Item1: Any JSON, for example an address structure, to be compared with either an Item2 or an array of Item2s.
  • Item2: A JSON object, for example an address, which should be compared to Item1. The response is in MatchScore.
  • Item2s: An array of JSON objects, for example addresses, where each item should be compared to Item1. The response is in the MatchScores array.
  • ScoringExpression: Optional parameter. A detailed rule expression that controls the comparison.
  • Scorer: Optional parameter. A pre-defined rule set looked up by name, for example ‘Address’ or ‘Identity’.
  • ResponseFormat: Optional parameter. Can be set to either Flat or Structured. The Flat format is more compact and readable, and is the default.
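
The parameters above can be assembled into a request as in the following minimal Python sketch. It only builds the request and does not send it. The helper name build_fuzzymatch_request and its arguments are hypothetical (not part of the API), and the rule that exactly one of Item2 or Item2s is set is an assumption based on the note about the two call options.

```python
import json
import urllib.request


def build_fuzzymatch_request(api_key, item1, item2=None, item2s=None, env="test", **options):
    """Build (but do not send) a POST request for the fuzzymatch endpoint.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Assumption: a call carries either Item2 or Item2s, never both.
    if (item2 is None) == (item2s is None):
        raise ValueError("Set exactly one of item2 or item2s")
    body = {"Item1": item1}
    if item2 is not None:
        body["Item2"] = item2
    else:
        body["Item2s"] = item2s
    # Optional parameters, e.g. Scorer="address" or ResponseFormat="Structured".
    body.update(options)
    return urllib.request.Request(
        f"https://{env}.zignsec.com/v2/ekyc/fuzzymatch",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json; charset=UTF-8",
            "Authorization": api_key,
        },
    )


request = build_fuzzymatch_request(
    "-your key here-",
    {"FirstName": "Lars", "LastName": "Svenning"},
    item2={"FirstName": "Lars Ole", "LastName": "Svenning"},
    Scorer="identity",
)
```

To send the request, pass it to urllib.request.urlopen and decode the JSON body of the response.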

Scoring Rules

The scoring rules can be changed on each call with ScoringExpression and Scorer respectively:

  • See Example 1: Thoroughly explains the rules using the built-in Scorer “address”
  • See Example 2: By ScoringExpression, a custom rule string controlling the comparison.
  • See Example 3: Automatic scoring. If neither ScoringExpression nor Scorer was set, an automatic comparison is performed with all fields in Item1 and default weights and comparison method (i.e., 1.0 and LevenshteinDistance).

Example 1: Score computation explained

POST https://test.zignsec.com/v2/ekyc/fuzzymatch HTTP/1.1
User-Agent: Fiddler
Content-Length: 27
Content-Type: application/json; charset=UTF-8
Authorization: af5cd36b-83d0-43fb-bfe8-22a8addea553

{
"Item1":{"FIRSTNAME":"Lars", "LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6800","City":"Varde", "DATEofbirth":"19510203"},
"Item2":{"FIRSTNAME":"Lars Ole","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6710","City":"Varde", "DATEofbirth":"19510203"},
"scorer":"address"
}

Json Response:

{
  "id":"15f686cb-3fc3-42ca-b069-fcd3c634f3ba",
  "errors":[],
  "MatchScore":"HIGH 93% | FullName 77%;1;ld_4 | Address 100%;0,6;ld_0 | PostalCode 50%;0,6;ld_2 | City 100%;0,8;ld_0 | DateOfBirth 100%;5;eqx_True",
  "ScoringExpression":"80%/60% | FullName;1;ld | Address;0,6;ld | Address2;0,4;ldx | Location;0,4;ldx | PostalCode;0,6;ld | City;0,8;ld | DateOfBirth;5;eqx"
}

Here we have used the built-in “address” rule set. Below is a step-by-step explanation of how the FuzzyMatch algorithm reaches 93%, HIGH. The computed scores and the active scoring rules are always presented compactly in two response fields:

  • MatchScore (the actual computed scores):
HIGH 93% |
FullName 77%;1;ld_4 |
Address 100%;0,6;ld_0 |
PostalCode 50%;0,6;ld_2 |
City 100%;0,8;ld_0 |
DateOfBirth 100%;5;eqx_True
  • ScoringExpression (the scoring rules used):
80%/60% |
FullName;1;ld |
Address;0,6;ld |
Address2;0,4;ldx |
Location;0,4;ldx |
PostalCode;0,6;ld |
City;0,8;ld |
DateOfBirth;5;eqx

HIGH (93%) is the final match score, computed from the sub scores as follows.

Subscores:

  1. FullName: 77% match (“Lars Svenning” vs “Lars Ole Svenning”). 4 characters distance / longest string length = 4/17 ≈ 23% difference, reported as a 77% match. So sub score S1 is 77% and weight W1 is 1.0 from the rule.
  2. Address: 100% match (exactly the same strings).
  3. PostalCode: 50% match (“6800” vs “6710”). 2 characters distance / longest string length = 2/4 = 50% difference = 50% match.
  4. City: 100% match (exactly the same strings).
  5. DateOfBirth: 100% match (tested for equality).

Address2 and Location use ldx and are missing from both items, so they are left out of the total.

Total score: the sub scores are weighted into a total score as (S1*W1 + S2*W2 + … + Sn*Wn) / (W1 + W2 + … + Wn). In this example:

(77%*1.0 + 100%*0.6 + 50%*0.6 + 100%*0.8 + 100%*5.0) / (1.0 + 0.6 + 0.6 + 0.8 + 5.0) = (0.77 + 0.6 + 0.3 + 0.8 + 5) / 8 = 7.47 / 8 = 0.93 = 93%

Finally, the MatchLevel limits for HIGH and MEDIUM are set to 80%/60% in this rule set. Since 93% is greater than 80%, the result is HIGH.
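
The weighted average used in this example can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described above, not the service's actual implementation; the function name weighted_total is hypothetical, and the sub scores and weights are copied from the Example 1 and Example 2 responses.

```python
def weighted_total(subscores):
    """Weighted average of (percentage, weight) pairs."""
    total_weight = sum(weight for _, weight in subscores)
    return sum(percentage * weight for percentage, weight in subscores) / total_weight


# (sub score percentage, weight) taken from the Example 1 MatchScore.
example1 = [(77, 1.0), (100, 0.6), (50, 0.6), (100, 0.8), (100, 5.0)]
total = weighted_total(example1)
print(round(total))  # 93
```

The same function gives 79 and 92 for the two MatchScores returned in Example 2, where the weights are 0.8, 0.5, 1 and 1.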

Example 2: By setting ScoringExpression

See the ScoringExpression section for more details. This example:

  • Changes the scoring rule for City to use LevenshteinDistanceMandatory (ldm).
  • Changes the MatchLevel limits: 90%/55% are the limits for the total score in this call.
  • Uses the Item2s array to compare multiple items with Item1.

POST https://test.zignsec.com/v2/ekyc/fuzzymatch HTTP/1.1
User-Agent: Fiddler
Content-Length: 27
Content-Type: application/json; charset=UTF-8
Authorization: 000000....

{
"Item1":{"FirstName":"Lars","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6800","City":"Varde"},
"Item2s":[
  {"FirstName":"Lars Ole","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6710","City":"Varde"},
  {"FirstName":"Lars","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6810","City":"Varde"}
]
,"ScoringExpression":"90%/55% | FullName;0.8 | Address;0.5 | PostalCode | City;;ldm"
}

Json response:

{
   "id":"6a46ecb3-df83-4347-9b86-e06b7298bd8d",
   "errors":[

   ],
   "MatchScores":[
      "MEDIUM 79% | FullName 77%;0,8;ld_4 | Address 100%;0,5;ld_0 | PostalCode 50%;1;ld_2 | City 100%;1;ldm_0",
      "HIGH 92% | FullName 100%;0,8;ld_0 | Address 100%;0,5;ld_0 | PostalCode 75%;1;ld_1 | City 100%;1;ldm_0"
   ],
   "ScoringExpression":"90%/55% | FullName;0.8;ld  |  Address;0.5;ld  |  PostalCode;1;ld  |  City;1;ldm"
}

Example 3: By using the implicit rules

If neither the ScoringExpression nor the Scorer parameter was set, an implicit ScoringExpression is created automatically, comparing all the fields on Item1 to the ones existing on Item2, with Levenshtein distance and all relative weights equal to 1.0.

POST https://test.zignsec.com/v2/ekyc/fuzzymatch HTTP/1.1
Content-Length: 27
Content-Type: application/json; charset=UTF-8
Authorization: 0000....

{
"Item1":{"FirstName":"Lars","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6800","City":"Varde"},
"Item2": {"FirstName":"Lars Ole","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6710","City":"Varde"}
}

Json Response:

{
  "id":"ab6c6b59-c8a3-4632-bd30-c11a0cdf3647",
  "errors":[],
  "MatchScore":"HIGH 80% | FirstName 50%;1;ld_4 | LastName 100%;1;ld_0 | Address 100%;1;ld_0 | PostalCode 50%;1;ld_2 | City 100%;1;ld_0",
  "ScoringExpression":"80%/60% | FirstName;1;ld | LastName;1;ld | Address;1;ld | PostalCode;1;ld | City;1;ld"
}
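
As an illustration of the implicit rule, the sketch below builds the ScoringExpression shown in the response above from the field names of Item1. The function name is hypothetical, and the 80%/60% limits are simply copied from this response, not a confirmed default.

```python
def implicit_scoring_expression(item1, limits="80%/60%"):
    """Build an expression that compares every Item1 field with weight 1 and ld."""
    rules = [f"{name};1;ld" for name in item1]
    return " | ".join([limits, *rules])


item1 = {
    "FirstName": "Lars",
    "LastName": "Svenning",
    "Address": "Boulevarden 12",
    "PostalCode": "6800",
    "City": "Varde",
}
expression = implicit_scoring_expression(item1)
```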

More explanation of the scoring rules

We employ a customizable comparison algorithm to measure, pair-wise, the similarity between two JSON objects, for example two user addresses. It is based on the standard Levenshtein string distance. Computation steps:

  1. Selecting the fields to include in the comparison.
  2. Selecting each field's comparison method, for example “string distance”.
  3. Selecting the relative weight with which each field's sub score goes into the weighted-average total score.
  4. Selecting the percentage limits that map the total score to one of the three match score classes.

A real-world use case

Suppose an online web store wants to verify the name and address of new users at sign-up. The algorithm compares the address the user has entered with another address fetched from a trusted external address source. The comparison response contains a three-level class score, and the score report includes enough matching details for the implementer to inspect the sub scores in detail.

Scoring Overview

The comparison process begins with a number of separate field comparisons. They are then weighted together into a total MatchPercentage score, which is finally mapped to one of the MatchLevel categories.

  • HIGH – a very good match – The data item (e.g. a user) should be accepted as referring to the same item (e.g. an individual).
  • MEDIUM – a business decision. The item (e.g. a user) could perhaps be accepted, but needs manual inspection.
  • LOW – a poor match – The data item (e.g. a user) should not be accepted, as the items are too different.

Note: If data is missing from any of the fields in a comparison, the MatchLevel is automatically lowered to at best MEDIUM, so that the missing data surfaces as an extra warning to the business.

How is the MatchLevel Computed?

The final MatchLevel is computed by mapping MatchPercentage onto three levels. Below are the default mapping limits, which can be overridden by ScoringExpression.

  • HIGH – when MatchPercentage is 85-100%.
  • MEDIUM – when MatchPercentage is 65-84%.
  • LOW – when MatchPercentage is 0-64%.
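
The mapping can be sketched in Python as follows. The function name match_level and its parameters are hypothetical. The defaults mirror the limits listed above, the "at or above the limit" comparison follows Example 3 (80% with limits 80%/60% gives HIGH), and the missing_data flag models the note that missing data caps the result at MEDIUM.

```python
def match_level(percentage, high=85, medium=65, missing_data=False):
    """Map a total MatchPercentage to HIGH, MEDIUM or LOW."""
    if percentage >= high:
        # Missing data lowers an otherwise HIGH result to MEDIUM.
        return "MEDIUM" if missing_data else "HIGH"
    if percentage >= medium:
        return "MEDIUM"
    return "LOW"


print(match_level(93, high=80, medium=60))  # HIGH (Example 1)
print(match_level(79, high=90, medium=55))  # MEDIUM (Example 2, first item)
```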

Predefined Scorer Configuration: “Address” (as of Jan 2019)

“80/60 |
FullName;1;ld |
Address;0.6;ld |
Address2;0.4;ldx |
Location;0.4;ldx |
PostalCode;0.6;ld |
City;0.8;ld |
DateOfBirth;5;eqx”

The choice of fields, weights and comparison types shown here is the current default, but it can be overridden through ScoringExpression.

  • FullName: Weight=1; compared by LevenshteinDistance
  • Address: Weight=0.6; compared by LevenshteinDistance
  • Address2: Weight=0.4; compared by LevenshteinDistanceOnlyWhenData, i.e. excluded from the total score if missing
  • Location: Weight=0.4; compared by LevenshteinDistanceOnlyWhenData, i.e. excluded from the total score if missing
  • PostalCode: Weight=0.6; compared by LevenshteinDistance
  • City: Weight=0.8; compared by LevenshteinDistance
  • DateOfBirth: Weight=5; compared by EqualityCaseInsensitiveOnlyWhenData, i.e. excluded from the total score if missing

Note: FullName is an alias for ‘FirstName LastName’
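
A minimal sketch of how the FullName alias and the case-insensitive field names could be resolved on the client side, for example when reproducing a score locally. The helper get_field is hypothetical, and letting an explicit FullName field take precedence over the alias is an assumption.

```python
def get_field(item, name):
    """Look up a field case-insensitively, resolving the FullName alias."""
    lowered = {key.lower(): value for key, value in item.items()}
    # Assumption: an explicit FullName field wins over the alias.
    if name.lower() == "fullname" and "fullname" not in lowered:
        parts = [lowered.get("firstname", ""), lowered.get("lastname", "")]
        return " ".join(part for part in parts if part)
    return lowered.get(name.lower(), "")


item2 = {"FIRSTNAME": "Lars Ole", "LastName": "Svenning", "PostalCode": "6710"}
print(get_field(item2, "FullName"))    # Lars Ole Svenning
print(get_field(item2, "postalcode"))  # 6710
```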

Predefined Scorer Configuration: “Identity”

The following fields, weights and comparison types are set, but can be overridden through ScoringExpression.

  • FullName: Weight=1; compared by LevenshteinDistance
  • PersonalNumber: Weight=1; compared by EqualityCaseInsensitive

Note: FullName is an alias for ‘FirstName LastName’

ScoringExpression

The ScoringExpression string provides the means for customizing the scoring algorithm.

Note: All string comparisons are case-insensitive by default.

Note: Scorer is a named lookup of a predefined ScoringExpression. By default, “address” and “identity” are available, but more can be added per customer with self-service calls.

Scoring rule example 1

“FullName;0.5 |
Address;0.5 |
PostalCode |
City”
The above expression says that the fields compared should be FullName, Address, PostalCode and City (FullName is an alias for ‘FirstName LastName’). The first two fields are given a relative weight of 0.5, and the other two receive the default weight of 1.0. This means that FullName and Address are half as influential on the total score as PostalCode and City. The MatchLevel limits are the default 85%/60%.

Scoring rule example 2

“85/65 |
FullName;0.5 |
Address;0.5 |
PostalCode;;ldm |
City”
The above expression shows how to change the final score limits, and how to change the comparison type to LevenshteinDistanceMandatory; see the explanation of comparison types below.

The Field Comparison Types:

  • LevenshteinDistance (ld) is the default comparison method because of its versatility. It puts an integer measure on the level of difference between two strings. The distance is a count of the number of character replacements, insertions or deletions needed to change one of the strings into the other. If either of the compared fields is missing or empty, the sub score is set to 0% and the measurement is marked as missing data.
  • LevenshteinDistanceMandatory (ldm) uses the same measurement logic as ld, with the addition that a LOW sub score (i.e. an ld percentage below the MEDIUM limit) sets the Total score to LOW.
  • LevenshteinDistanceOnlyWhenData (ldx) uses the same measurement logic as ld, with the addition that the subscore is not added to the Total score when data is missing.
  • EqualityCaseInsensitive (eq) is a normal equality comparison and the measurement is either same/True/100% or different/False/0%. If either of the compared fields is missing or empty, the sub score is set to 0% and the measurement is marked as missing data.
  • EqualityCaseInsensitiveMandatory (eqm) uses the same measurement logic as eq, with the addition that a LOW score sets the Total score to LOW.
  • EqualityCaseInsensitiveOnlyWhenData (eqx) uses the same measurement logic as eq, with the addition that the subscore is not added to the Total score when data is missing.
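
The differences between the plain, mandatory (m) and only-when-data (x) variants can be summarized in a small Python sketch. The function subscore_effect and its return structure are hypothetical. It assumes that a skipped ldx/eqx field is not flagged as missing data, and that a missing mandatory field counts as 0% and therefore forces LOW; neither detail is confirmed by the descriptions above.

```python
def subscore_effect(kind, percentage, medium_limit=60):
    """Describe how one sub score feeds into the total.

    kind is one of ld/ldm/ldx/eq/eqm/eqx; percentage is None when either
    compared field is missing or empty.
    """
    missing = percentage is None
    if missing and kind.endswith("x"):
        # ldx / eqx: the sub score is left out of the total entirely.
        return {"counted": False, "percentage": None,
                "missing_data": False, "forces_low": False}
    value = 0 if missing else percentage  # ld / eq: missing data scores 0%
    return {
        "counted": True,
        "percentage": value,
        "missing_data": missing,
        # ldm / eqm: a sub score below the MEDIUM limit forces the total to LOW.
        "forces_low": kind.endswith("m") and value < medium_limit,
    }


print(subscore_effect("ldx", None)["counted"])    # False
print(subscore_effect("ldm", 50)["forces_low"])   # True
```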

After a sub score measurement has been computed, the sub score's MatchPercentage can be computed. For LevenshteinDistance this computation is performed as 100 - 100 * (LevenshteinDistance / LongestStringLength), which makes sure the MatchPercentage always stays within 0%-100%.
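
The formula can be tried out with the following Python sketch. It uses the textbook dynamic-programming Levenshtein distance, which may differ in detail from what the service runs. The function names are hypothetical, lower-casing reflects the case-insensitive comparison described earlier, and the service's rounding of the percentage is not specified here, so the sketch returns the unrounded value.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and replacements."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # replace, free if equal
            ))
        previous = current
    return previous[-1]


def match_percentage(a, b):
    """100 - 100 * (distance / longest string length), case-insensitive."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty strings are treated as missing data
    return 100 - 100 * levenshtein(a, b) / longest


print(levenshtein("Edgar Allan Poe", "Egar Alin Poe"))       # 3
print(match_percentage("Edgar Allan Poe", "Egar Alin Poe"))  # 80.0
```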

The Field Rule Segment:

Examples:

  • PostalCode;1.2;ldm: The field PostalCode will be compared with the LevenshteinDistanceMandatory method and a weight of 1.2.
  • PostalCode;;ldm: The field PostalCode will be compared with the LevenshteinDistanceMandatory method and a default weight of 1.
  • PostalCode: The field PostalCode will be compared with the default method LevenshteinDistance and a default weight of 1.

Term description:

  • FieldName (term 1): The name of a field to compare, for example FullName. Normally, but not necessarily, existing on both items compared.
  • Weight (term 2): An optional relative weight that is factored into the total match score percentage. For example, if all other comparisons have weight 1 and this one has weight 2, this sub score will have twice the influence on the total score as each of the other sub scores.
  • ComparisonType (term 3): An optional specification of the comparison type. Can be any of ld/ldm/ldx/eq/eqm/eqx: LevenshteinDistance (ld, the default), LevenshteinDistanceMandatory (ldm), LevenshteinDistanceOnlyWhenData (ldx), EqualityCaseInsensitive (eq), EqualityCaseInsensitiveMandatory (eqm) and EqualityCaseInsensitiveOnlyWhenData (eqx).
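
To make the segment syntax concrete, here is a small Python sketch that parses a ScoringExpression into its limits and field rules, applying the defaults described above (weight 1, comparison type ld). The function name is hypothetical, and the parsing rules (an optional first segment containing "/" holds the limits, "%" signs are optional, a decimal comma is accepted) are inferred from the examples in this document.

```python
def parse_scoring_expression(expression):
    """Return (limits, rules): limits is (high, medium) or None,
    rules is a list of (field name, weight, comparison type)."""
    segments = [s.strip() for s in expression.split("|") if s.strip()]
    limits = None
    if segments and "/" in segments[0]:
        high, medium = segments.pop(0).replace("%", "").split("/")
        limits = (float(high), float(medium))
    rules = []
    for segment in segments:
        terms = [term.strip() for term in segment.split(";")]
        terms += [""] * (3 - len(terms))  # pad omitted weight / comparison type
        name, weight, comparison = terms[:3]
        rules.append((
            name,
            float(weight.replace(",", ".")) if weight else 1.0,
            comparison or "ld",
        ))
    return limits, rules


limits, rules = parse_scoring_expression(
    "85/65 | FullName;0.5 | Address;0.5 | PostalCode;;ldm | City"
)
```

Here limits is (85.0, 65.0), and PostalCode is parsed with the default weight 1.0 and the ldm comparison type.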

MatchScore

This is a JSON response string in compact format. It contains the final grading and the sub results with their measurements, including all the rules in place in a comparison between two items. Example of a MatchScore:

“MEDIUM 80% |
FullName 77%;0,5;ld_4 |
Address 100%;0,5;ld_0 |
PostalCode 50%;1;ld_2 |
City 100%;1;ld_0”

First segment: MatchLevel. Contains LOW, MEDIUM or HIGH, deduced from the match percentage between 0 and 100, where 100 means a perfect match.

Following segments: SubScores. Contains a delimited list of details for each sub score.

The measurement segments explained: the measurement is the third and last segment in each sub score. An example is ld_3, which means that a Levenshtein distance of three characters was measured between the two compared fields.
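
A client that receives the flat format may want to split it back into fields. The sketch below does this for the example string above. The function name and the dictionary keys are hypothetical, and the parsing assumes the layout seen in this document's responses (segments separated by "|", terms separated by ";", weights written with a decimal comma).

```python
def parse_match_score(match_score):
    """Split a flat MatchScore string into level, percentage and sub scores."""
    head, *rest = [segment.strip() for segment in match_score.split("|")]
    level, percentage = head.split()
    subscores = []
    for segment in rest:
        name_and_pct, weight, measurement = segment.split(";")
        name, pct = name_and_pct.rsplit(" ", 1)
        comparison, _, measure = measurement.partition("_")
        subscores.append({
            "field": name,
            "percentage": int(pct.rstrip("%")),
            "weight": float(weight.replace(",", ".")),
            "comparison": comparison,  # e.g. ld, ldm, eqx
            "measurement": measure,    # e.g. "4" for ld_4, "True" for eqx_True
        })
    return {
        "level": level,
        "percentage": int(percentage.rstrip("%")),
        "subscores": subscores,
    }


parsed = parse_match_score(
    "MEDIUM 80% | FullName 77%;0,5;ld_4 | Address 100%;0,5;ld_0 | "
    "PostalCode 50%;1;ld_2 | City 100%;1;ld_0"
)
print(parsed["level"], parsed["percentage"])  # MEDIUM 80
```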