ZignSec

FuzzyMatch

API

POST to https://env.zignsec.com/v2/ekyc/fuzzymatch
   where env is api or test. (Note: compareitems is an earlier, still operational synonym for the fuzzymatch endpoint above.)

API overview

  • FuzzyMatch is a general fuzzy match service for freely comparing strings or structures/objects (json). The response is a total score and details for each string comparison.
  • It is free of charge.
  • The main MatchScore field shows a high-level score: HIGH means a good match, MEDIUM means a possible match and LOW means no match.
  • See Match Scoring algorithm for expression details and pre-defined scorer configurations.
  • See Scoringexpression for customization of the scoring algorithm.
  • Note that both field names and values are case insensitive in the data. For example, “name”:”John” is the same as “Name”:”JOHN”.

A Minimal Example Call

Raw request:

            POST https://test.zignsec.com/v2/ekyc/fuzzymatch HTTP/1.1
Content-Type: application/json; charset=UTF-8
Authorization: -your key here-
Host: test.zignsec.com
Content-Length: 120

{
"Item1":{"fulllname":"Edgar Allan Poe"},
"Item2":{"fulllname":"Egar Azzan Poe"},
"responseformat":"structured"
}
        

The response shows a distance of 3 characters (ld=3), a MEDIUM match and 80% similarity.

            {
  "id":"e85144b0-9399-4803-9f8e-9fec3da0856a",
  "errors":[],
  "MatchScore":{
    "MatchLevel":"MEDIUM",
    "MatchPercentage":80,
    "SubScores":[
      {
        "FieldName":"fulllname",
        "CouldBeComputed":true,
        "MatchPercentage":80,
        "Weight":1.0,
        "ComparisonType":"LevenshteinDistance",
        "MeasureOfLevenshteinDistance":3,
        "MeasureOfEquality":false
      }
    ]
  },
  "ScoringExpression":"85%/60% | fulllname;1;ld"
}
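
The minimal call above could be prepared from Python roughly as follows. This is only a sketch using the standard library: the helper name build_fuzzymatch_request and its parameters are illustrative assumptions, not part of the ZignSec API, and the request is built but not sent.

```python
import json
import urllib.request


def build_fuzzymatch_request(api_key, item1, item2, env="test"):
    """Build (but do not send) a POST request for the fuzzymatch endpoint.

    The helper name and signature are hypothetical; only the URL, headers
    and body fields are taken from the raw request shown above.
    """
    url = f"https://{env}.zignsec.com/v2/ekyc/fuzzymatch"
    body = json.dumps({
        "Item1": item1,
        "Item2": item2,
        "responseformat": "structured",
    }).encode("utf-8")
    headers = {
        "Content-Type": "application/json; charset=UTF-8",
        "Authorization": api_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_fuzzymatch_request(
    "-your key here-",
    {"fulllname": "Edgar Allan Poe"},
    {"fulllname": "Egar Azzan Poe"},
)

# Sending requires a valid key and network access:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Content-Length is left to the HTTP library, which computes it from the encoded body.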
        

FuzzyMatch API

The main element in the response is the MatchScore, which shows the degree of match in a stop-light fashion (HIGH for a good match, MEDIUM or LOW), followed by detailed percentages for each sub score computed. For quicker reading of the results, the ResponseFormat is set to Flat by default, which returns a format closely resembling the input parameter ScoringExpression.

Note: The Item1 and Item2 parameters can switch places without changing the match score, except that only Item1 determines which fields are expected.

Note: There are two call options: set either Item2 or Item2s. Multiple comparisons can be performed in the same call by sending in an Item2s array.

Parameters

  • Item1: Any Json, for example an Address structure, to be compared with either an Item2 or an array of Item2s.
  • Item2: A Json object, for example an address, which should be compared to Item1. The response is in MatchScore.
  • Item2s: An array of Json objects, for example addresses, where each item should be compared to Item1. The response is in the MatchScores array.
  • ScoringExpression: Optional parameter. A detailed rule expression that controls the comparison.
  • Scorer: Optional parameter. A pre-defined ruleset looked up by name, for example ‘Address’ or ‘Identity’.
  • ResponseFormat: Optional parameter. Can be set to either Flat or Structured. Flat format is more compact and readable, and is the default format.
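
As an illustration of how these parameters combine, here is a small sketch of a request-body builder. The function build_payload is hypothetical, and the check that exactly one of Item2 and Item2s is set is our reading of the note above rather than a documented validation rule.

```python
def build_payload(item1, item2=None, item2s=None,
                  scoring_expression=None, scorer=None, response_format=None):
    """Assemble a fuzzymatch request body as a plain dict (hypothetical helper)."""
    # Assumption: a call carries either Item2 or Item2s, never both or neither.
    if (item2 is None) == (item2s is None):
        raise ValueError("set exactly one of item2 or item2s")

    payload = {"Item1": item1}
    if item2 is not None:
        payload["Item2"] = item2
    else:
        payload["Item2s"] = list(item2s)

    # Optional parameters are only included when a value was given.
    optional = {
        "ScoringExpression": scoring_expression,
        "Scorer": scorer,
        "ResponseFormat": response_format,
    }
    payload.update({name: value for name, value in optional.items()
                    if value is not None})
    return payload


payload = build_payload(
    {"City": "Varde"},
    item2s=[{"City": "Varde"}, {"City": "Vejle"}],
    scorer="address",
)
```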

Scoring Rules

The scoring rules can be changed on each call with ScoringExpression and Scorer respectively:

  • See Example 1: Automatic scoring. If neither ScoringExpression nor Scorer is set, an automatic comparison is performed with all fields in Item1, using the default weight and comparison method (i.e. 1.0 and LevenshteinDistance).
  • See Example 2: By ScoringExpression, a custom rule string controlling the comparison.
  • See Example 3: By configuration name. This performs a lookup of a ScoringExpression from the Scorer name; try “address” or “identity”.

Fuzzymatch example 1: Using the Automatic Scoring

If neither the ScoringExpression nor the Scorer parameter is set, an implicit ScoringExpression is created automatically, comparing all the fields in Item1 to the matching fields in Item2, using Levenshtein distance with all relative weights equal to 1.0.

            POST https://test.zignsec.com/v2/ekyc/fuzzymatch HTTP/1.1
Content-Length: 27
Content-Type: application/json; charset=UTF-8
Authorization: -your key here-

{
"Item1":{"FirstName":"Lars","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6800","City":"Varde"},
"Item2": {"FirstName":"Lars Ole","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6710","City":"Varde"}
}
        

Json Response:

            {
   "id":"40ea6196-e438-4772-bb81-98e3ac3c0ced",
   "errors":[

   ],
   "MatchScore":
      "MEDIUM 80% | FirstName 50%;1;ld_4 | LastName 100%;1;ld_0 | Address 100%;1;ld_0 | PostalCode 50%;1;ld_2 | City 100%;1;ld_0",
   "ScoringExpression":"85/60 | FirstName;1;ld  |  LastName;1;ld  |  Address;1;ld  |  PostalCode;1;ld  |  City;1;ld"
}
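
The implicit expression returned in this response can be reproduced client-side with a sketch like the one below. The function name is hypothetical, the 85/60 limits are taken from the response above, and the separator spacing is simplified compared to what the service returns.

```python
def implicit_scoring_expression(item1, high_limit=85, medium_limit=60):
    """Build the expression that automatic scoring would imply for item1.

    Every field of item1 gets weight 1 and the ld comparison type.
    """
    segments = [f"{high_limit}/{medium_limit}"]
    segments += [f"{field_name};1;ld" for field_name in item1]
    return " | ".join(segments)


item1 = {"FirstName": "Lars", "LastName": "Svenning",
         "Address": "Boulevarden 12", "PostalCode": "6800", "City": "Varde"}
expression = implicit_scoring_expression(item1)
```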
        

Fuzzymatch example 2: Setting ScoringExpression

See the ScoringExpression section for more details.

This example:
* Changes the scoring rule for City to use LevenshteinDistanceMandatory (ldm).
* Changes the MatchLevel limits. 90%/55% are the total limits in this call.
* Uses the array Item2s to compare multiple items with Item1.

            POST https://test.zignsec.com/v2/ekyc/fuzzymatch HTTP/1.1
User-Agent: Fiddler
Content-Length: 27
Content-Type: application/json; charset=UTF-8
Authorization: 000000....

{
"Item1":{"FirstName":"Lars","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6800","City":"Varde"},
"Item2s":[
  {"FirstName":"Lars Ole","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6710","City":"Varde"},
  {"FirstName":"Lars","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6810","City":"Varde"}
]
,"ScoringExpression":"90%/55% | FullName;0.8 | Address;0.5 | PostalCode | City;;ldm"
}
        

Json response:

            {
   "id":"6a46ecb3-df83-4347-9b86-e06b7298bd8d",
   "errors":[

   ],
   "MatchScores":[
      "MEDIUM 79% | FullName 77%;0,8;ld_4 | Address 100%;0,5;ld_0 | PostalCode 50%;1;ld_2 | City 100%;1;ldm_0",
      "HIGH 92% | FullName 100%;0,8;ld_0 | Address 100%;0,5;ld_0 | PostalCode 75%;1;ld_1 | City 100%;1;ldm_0"
   ],
   "ScoringExpression":"90%/55% | FullName;0.8;ld  |  Address;0.5;ld  |  PostalCode;1;ld  |  City;1;ldm"
}
        

Fuzzymatch example 3: Using Predefined Configuration

We have two predefined scorer configurations named “address” and “identity”.
This example also demonstrates that the casing of field names and values is insignificant.

            POST https://test.zignsec.com/v2/ekyc/fuzzymatch HTTP/1.1
User-Agent: Fiddler
Content-Length: 27
Content-Type: application/json; charset=UTF-8
Authorization: 000000...

{
"Item1":{"firstname":"LARS","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6800","City":"Varde"},
"Item2":{"FIRSTNAME":"Lars Ole","LastName":"Svenning","Address":"Boulevarden 12","PostalCode":"6710","City":"Varde"}
,"Scorer":"address"
}

        

Json Response:

            {
   "id":"e90dd002-6c57-4641-95d2-2974329a1fd8",
   "errors":[

   ],
   "MatchScore":"MEDIUM 80% | FullName 77%;0,5;ld_4 | Address 100%;0,5;ld_0 | PostalCode 50%;1;ld_2 | City 100%;1;ld_0",
   "ScoringExpression":"85/60 |  FullName;0.5;ld  |  Address;0.5;ld  |  PostalCode;1;ld  |  City;1;ld"
}
        

Match Scoring in Detail

We employ a customizable comparison algorithm to measure the pair-wise similarity between two Json objects, for example two user addresses. It is based on the standard Levenshtein string distance.
Computation steps:
1. selecting the fields to include in the comparison, and
2. selecting the respective field's comparison method, for example “string distance”, and
3. selecting the relative weight for each field's sub score going into a weighted-average total score, and finally
4. selecting the percentage limits for how the total score is mapped into one of the three match score classes.
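
The weighted average in step 3 can be checked against the Example 2 response with a short sketch. The function name is hypothetical, and rounding to the nearest whole percent is an assumption that happens to agree with the reported totals.

```python
def total_percentage(sub_scores):
    """Weighted average of (percentage, weight) pairs."""
    sub_scores = list(sub_scores)
    total_weight = sum(weight for _, weight in sub_scores)
    if total_weight == 0:
        raise ValueError("at least one sub score needs a non-zero weight")
    return sum(pct * weight for pct, weight in sub_scores) / total_weight


# Sub scores from the two Example 2 results: FullName, Address, PostalCode, City.
first = [(77, 0.8), (100, 0.5), (50, 1), (100, 1)]
second = [(100, 0.8), (100, 0.5), (75, 1), (100, 1)]

first_total = round(total_percentage(first))    # reported by the service as 79%
second_total = round(total_percentage(second))  # reported by the service as 92%
```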

A real-world use case

Suppose an online web store wants to verify the name and address of new users at sign-up. The algorithm compares the address entered by the user with an address fetched from a trusted external address source. The comparison response contains a three-level class score. The score report includes enough matching details for the implementor to inspect the sub scores in detail.

Scoring Overview

The comparison process begins with a number of separate field comparisons. They are then weighted together into a total MatchPercentage score, which is finally mapped to one of the MatchLevel categories.

  • HIGH – a very good match – The data item (ex user) should be accepted as referring to the same item (ex individual).
  • MEDIUM – a business decision. Maybe the item (ex user) could be accepted but needs a manual inspection.
  • LOW – a poor match – The data item (ex user) should not be accepted as the items are too different.

Note: If data is missing from any of the fields in a comparison, the MatchLevel is automatically lowered to at best a MEDIUM match, to flag the result with an extra business warning.

How is the MatchLevel Computed?

The final MatchLevel is computed by mapping the MatchPercentage into three category levels.
Below are the default mapping limits, which can be overridden by ScoringExpression.

  • HIGH – when MatchPercentage is 85-100%.
  • MEDIUM – when MatchPercentage is 60-84%.
  • LOW – when MatchPercentage is 0-59%.
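
A minimal sketch of this mapping is shown below. The function name is hypothetical; the default limits are the 85/60 pair that the example responses report in their ScoringExpression, and treating each limit as an inclusive lower bound is an assumption based on the ranges listed above.

```python
def match_level(percentage, high_limit=85, medium_limit=60):
    """Map a total MatchPercentage to HIGH, MEDIUM or LOW."""
    if percentage >= high_limit:
        return "HIGH"
    if percentage >= medium_limit:
        return "MEDIUM"
    return "LOW"


default_level = match_level(80)                              # Example 1 total
custom_levels = [match_level(p, 90, 55) for p in (79, 92)]   # Example 2 totals
```

With the 90%/55% limits of Example 2, the totals 79% and 92% map to MEDIUM and HIGH, as in the response shown there.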

Predefined Scorer Configuration: “Address”

The below choice of fields, weights and comparison types can be overridden through ScoringExpression.

  • FullName: Weight=0.5; compared by LevenshteinDistance
  • Address: Weight=0.5; compared by LevenshteinDistance
  • PostalCode: Weight=1; compared by LevenshteinDistance
  • City: Weight=1; compared by LevenshteinDistance

Note: FullName is an alias for ‘FirstName LastName’
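
How the alias might be resolved on the client side, for example to pre-check data before a call, is sketched below. The function field_value is hypothetical, and both the case-insensitive key lookup and the preference for an explicit FullName field over the alias are assumptions about behavior the text does not spell out.

```python
def field_value(item, field_name):
    """Look up a field case-insensitively, expanding the FullName alias."""
    lowered = {key.lower(): value for key, value in item.items()}
    wanted = field_name.lower()

    # Assumption: the alias only applies when no explicit FullName is present.
    if wanted == "fullname" and "fullname" not in lowered:
        parts = [lowered.get("firstname", ""), lowered.get("lastname", "")]
        return " ".join(part for part in parts if part)

    return lowered.get(wanted, "")


full_name = field_value({"FIRSTNAME": "Lars Ole", "LastName": "Svenning"}, "FullName")
```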

Predefined Scorer Configuration: “Identity”

The below choice of fields, weights and comparison types is set by default, but can be overridden through ScoringExpression.

  • FullName: Weight=1; compared by LevenshteinDistance
  • PersonalNumber: Weight=1; compared by EqualityCaseInsensitive

Note: FullName is an alias for ‘FirstName LastName’

ScoringExpression

The ScoringExpression string provides the means to customize the scoring algorithm.
Note: All string comparisons are case-insensitive by default.
Note: Scorer is a named lookup of a predefined ScoringExpression. By default, “address” and “identity” are available, but more can be added per customer with self-service calls.

Scoring rule example 1

“FullName;0.5 | Address;0.5 | PostalCode | City”

The above expression says that the fields compared should be FullName, Address, PostalCode and City (FullName is an alias for ‘FirstName LastName’). Here the first two fields are given a relative weight of 0.5, and the other two receive the default weight 1.0. This means that FullName and Address are half as influential on the total score as PostalCode and City. The MatchLevel limits are the default 85%/60%.

Scoring rule example 2

“85/65 | FullName;0.5 | Address;0.5 | PostalCode;;ldm | City”

The above expression shows how to change the “final score limits”. It also shows how to change the comparison type to LevenshteinDistanceMandatory; see the explanation of comparison types below.

The Field Comparison Types:

  • LevenshteinDistance (ld) is the default comparison method because of its versatility. It puts an integer measure on the level of difference between two strings. The distance is a count of the number of character replacements, insertions or deletions needed to change one of the strings into the other. If either of the compared fields is missing or empty, the sub score is set to 0% and the measurement is marked as missing data.
  • LevenshteinDistanceMandatory (ldm) uses the same measurement logic as ld, with the addition that a LOW sub score (i.e. an ld percentage below the MEDIUM limit) sets the total score to LOW.
  • EqualityCaseInsensitive (eq) is a normal equality comparison and the measurement is either same/True/100% or different/False/0%. If either of the compared fields is missing or empty, the sub score is set to 0% and the measurement is marked as missing data.
  • EqualityCaseInsensitiveMandatory (eqm) uses the same measurement logic as eq, with the addition that a LOW sub score sets the total score to LOW.
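
The effect of the mandatory types and of missing data on the final level could be sketched as follows. This is not the service's implementation: the function name, the dictionary keys, the 85/60 defaults and the order in which the two overrides are applied are all assumptions made for illustration.

```python
def final_level(total_pct, sub_scores, high_limit=85, medium_limit=60):
    """Derive the MatchLevel from the total and per-field override rules.

    sub_scores: list of dicts with keys 'pct' (0-100), 'type'
    ('ld', 'ldm', 'eq' or 'eqm') and 'missing' (True if data was missing).
    """
    if total_pct >= high_limit:
        level = "HIGH"
    elif total_pct >= medium_limit:
        level = "MEDIUM"
    else:
        level = "LOW"

    # A mandatory field scoring below the MEDIUM limit forces the total to LOW.
    for sub in sub_scores:
        if sub["type"] in ("ldm", "eqm") and sub["pct"] < medium_limit:
            return "LOW"

    # Missing data caps the result at MEDIUM (see the note in Scoring Overview).
    if level == "HIGH" and any(sub["missing"] for sub in sub_scores):
        level = "MEDIUM"
    return level


forced_low = final_level(92, [
    {"pct": 100, "type": "ld", "missing": False},
    {"pct": 50, "type": "ldm", "missing": False},
])
```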

After a sub score measurement has been computed, the sub score's MatchPercentage can be computed. For LevenshteinDistance this computation is performed as 100 - 100 * (LevenshteinDistance / LongestStringLength), which makes sure the MatchPercentage always stays within 0%-100%. For example, a distance of 3 on a longest string of 15 characters gives 80%.
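
The formula can be tried out with a textbook dynamic-programming Levenshtein distance, as in the sketch below. The function names are hypothetical, and lower-casing both values before comparing is an assumption based on the statement that comparisons are case-insensitive.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # replacement
        previous = current
    return previous[-1]


def ld_percentage(value1, value2):
    """Sub score percentage for the ld comparison type."""
    a, b = value1.lower(), value2.lower()
    if not a or not b:
        return 0.0  # missing or empty data scores 0%
    longest = max(len(a), len(b))
    return 100.0 - 100.0 * levenshtein(a, b) / longest


poe = ld_percentage("Edgar Allan Poe", "Egar Azzan Poe")  # ld=3, longest=15
postal = ld_percentage("6800", "6810")                    # ld=1, longest=4
```

These two calls give 80% and 75%, matching the minimal example and the second PostalCode sub score in Example 2.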

The Field Rule Segment:

Examples:

  • PostalCode;1.2;ldm: The field PostalCode will be compared with the LevenshteinDistanceMandatory method and a weight of 1.2.
  • PostalCode;;ldm: The field PostalCode will be compared with the LevenshteinDistanceMandatory method and a default weight of 1.
  • PostalCode: The field PostalCode will be compared with the default method LevenshteinDistance and a default weight of 1.

Term description:

  • FieldName (term 1): The name of a field to compare, for example FullName. Normally, but not necessarily, it exists on both compared items.
  • Weight (term 2): An optional relative weight that is factored into the total match score percentage. For example, if all other comparisons have weight 1 and this one has weight 2, this sub score will have twice the influence on the total score as each of the other sub scores.
  • ComparisonType (term 3): An optional specification of the type of comparison to perform. Can be one of LevenshteinDistance (ld, the default), LevenshteinDistanceMandatory (ldm), EqualityCaseInsensitive (eq) and EqualityCaseInsensitiveMandatory (eqm).
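
To make the segment layout concrete, here is a sketch of a client-side parser for a ScoringExpression. It is inferred from the examples in this document only: the function name is hypothetical, and detecting the limits segment by the presence of "/" is an assumption, not a documented grammar.

```python
def parse_scoring_expression(expression, default_limits=(85.0, 60.0)):
    """Split a ScoringExpression into (limits, field rules)."""
    segments = [part.strip() for part in expression.split("|") if part.strip()]

    # Optional first segment such as "85/65" or "90%/55%".
    limits = default_limits
    if segments and "/" in segments[0]:
        high, medium = segments.pop(0).replace("%", "").split("/")
        limits = (float(high), float(medium))

    rules = []
    for segment in segments:
        terms = [term.strip() for term in segment.split(";")]
        terms += [""] * (3 - len(terms))  # pad omitted weight / type terms
        name, weight, comparison_type = terms[:3]
        rules.append({
            "field": name,
            "weight": float(weight) if weight else 1.0,  # default weight 1
            "type": comparison_type or "ld",             # default type ld
        })
    return limits, rules


limits, rules = parse_scoring_expression(
    "85/65 | FullName;0.5 | Address;0.5 | PostalCode;;ldm | City"
)
```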

MatchScore

This is a Json response string in compact format that contains the final grading and the sub results with their measurements, including all the rules in place in a comparison between two items.

Example of a MatchScore:
“MEDIUM 80% | FullName 77%;0,5;ld_4 | Address 100%;0,5;ld_0 | PostalCode 50%;1;ld_2 | City 100%;1;ld_0”

First segment: MatchLevel
Contains LOW/MEDIUM/HIGH deduced from the match percentage between 0 and 100, where 100 means a perfect match.

Following segments: SubScores
Contains a delimited list of details for each sub score.

The measurement segments explained:
The measurement is the third and last segment in each sub score. An example is ld_3, which means there was a measurement of a three character Levenshtein distance between the two compared fields.
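
A flat MatchScore string can be taken apart with a few string operations, as sketched below. The function name and the returned dictionary keys are hypothetical. The measurement value is kept as text because only the ld_N form appears in this document, and weights are read with a decimal comma as in the example responses.

```python
def parse_match_score(match_score):
    """Parse a flat MatchScore string into level, total and sub scores."""
    head, *rest = [part.strip() for part in match_score.split("|")]
    level, percent = head.split()

    sub_scores = []
    for segment in rest:
        name_and_pct, weight, measurement = segment.split(";")
        name, pct = name_and_pct.rsplit(" ", 1)
        comparison_type, _, measure = measurement.partition("_")
        sub_scores.append({
            "field": name,
            "pct": int(pct.rstrip("%")),
            "weight": float(weight.replace(",", ".")),  # "0,5" -> 0.5
            "type": comparison_type,
            "measure": measure,
        })
    return {"level": level, "pct": int(percent.rstrip("%")),
            "sub_scores": sub_scores}


parsed = parse_match_score(
    "MEDIUM 80% | FullName 77%;0,5;ld_4 | Address 100%;0,5;ld_0"
    " | PostalCode 50%;1;ld_2 | City 100%;1;ld_0"
)
```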