3 MAPS Matching
After the data are staged, MAPS builds a map of matched data sets based on an optimized trip matching methodology. MAPS assigns records from multiple data sources to a unique trip identifier by comparing data fields that are common to several data source streams including Vessel Trip Reports (VTR), Allocation Management System (AMS), and Dealer reports.
The matching process attempts to align records using information indicating the records belong to the same trip. For some records, a hard match is possible. For example, identical VTR serial numbers in both the VTR and Dealer report constitute a hard match. Where a hard match is not possible, date matching is necessary. Orphans (data element exists in one source but not another) are allowed in all matches such that the list of unique trips accumulates during each matching step when trips in that dataset do not match previously identified trips. For this reason, the order of matching datasets is important.
The unique trip identifier is CAMSID. For federally-permitted vessels, CAMSID is a concatenated field, built as permit number, official trip date, then the VTR document identifier (DOCID), if it exists, else “000000”. The resulting format is PERMIT_YYYYMMDDhhmmss_DOCID. DOCID will be replaced with the universal trip identifier (UTID) when available. For dealer records with no federal permit number (permit = ‘000000’), the CAMSID is built as PERMIT, HULLID, dealer partner id, dealer link, and dealer date with the format PERMIT_HULLID_PARTNER_LINK_YYMMDD000000.
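The two CAMSID layouts described above can be sketched as string-building helpers. This is an illustrative sketch only: the function names, argument names, and example values are hypothetical, and the choice of which date serves as the "official trip date" is left to the caller.

```python
from datetime import date, datetime
from typing import Optional


def build_camsid(permit: str, trip_date: datetime, docid: Optional[str] = None) -> str:
    """Build PERMIT_YYYYMMDDhhmmss_DOCID for a federally permitted vessel.

    DOCID falls back to '000000' when no VTR document identifier exists.
    """
    doc = docid if docid else "000000"
    return f"{permit}_{trip_date:%Y%m%d%H%M%S}_{doc}"


def build_dealer_camsid(hullid: str, partner_id: str, link: str,
                        dealer_date: date, permit: str = "000000") -> str:
    """Build PERMIT_HULLID_PARTNER_LINK_YYMMDD000000 for dealer records
    that carry no federal permit number."""
    return f"{permit}_{hullid}_{partner_id}_{link}_{dealer_date:%y%m%d}000000"


# Hypothetical values, for illustration only.
with_docid = build_camsid("123456", datetime(2021, 3, 4, 5, 6, 7), "9876543")
without_docid = build_camsid("123456", datetime(2021, 3, 4, 5, 6, 7))
dealer_only = build_dealer_camsid("MA1234AB", "77", "555", date(2021, 3, 4))
```

Keeping the fallback `'000000'` inside the builder means every caller produces the same placeholder when DOCID is missing.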
3.1 Matching Criteria
There are three primary matching criteria for a trip:
3.1.1 Vessel Permit Number (PERMIT)
PERMIT constitutes a hard match for MAPS; that is, a one-for-one match. The same Federal vessel permit number is required on records from all data sources for hard-matching of trips. All trip matching is performed within a specific PERMIT across all years.
3.1.2 VTR Serial Number (VTRSERNO)
VTR serial number constitutes a hard match when it exists. When available, MAPS first attempts to match records using VTR serial numbers. Each trip with an associated VTRSERNO (and DOCID) is constrained to one unique CAMSID.
VTRSERNOs are validated during the data staging phase before being processed by MAPS. Validation includes checking for a legitimate VTRSERNO structure, validating against the VTRSERNOs and date ranges of known vessel trip reports for records older than 14 days, and cross-checking VTRSERNOs across record sources. VTRSERNOs that fail validation are not used for matching or trip id assignment.
3.1.3 Date Matching
If a record cannot be matched using VTR serial numbers, then MAPS uses a date matching process. Date matching uses either the Kuhn-Munkres algorithm (often called the Hungarian method) together with membership functions, or rolling joins. Both approaches are explained in the following sections.
3.1.3.1 Hungarian (Kuhn-Munkres) algorithm
The Hungarian (Kuhn-Munkres) algorithm is a combinatorial optimization algorithm that finds the best one-to-one combination of records in the affinity matrix by maximizing (or minimizing) the total sum of the associated values (probabilities, costs, etc.). The resulting output is a list of the matching records (for example, 1 → 2, 2 → 3, 3 → 1). In this application, unambiguous trip orphans are removed prior to applying the algorithm.
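To make the optimization target concrete, the sketch below finds the one-to-one assignment with the largest total affinity on a tiny, made-up 3 x 3 matrix. It is not the Kuhn-Munkres algorithm itself: it searches every permutation exhaustively, which reaches the same optimum on a small square matrix but scales factorially. A polynomial-time solver such as SciPy's `linear_sum_assignment` would be used in practice. The matrix values and function name are assumptions for illustration.

```python
from itertools import permutations


def best_assignment(affinity):
    """Return (columns, total): the column chosen for each row such that the
    summed affinity is maximal. Exhaustive search, small matrices only."""
    n = len(affinity)
    best_total, best_cols = float("-inf"), None
    for cols in permutations(range(n)):
        total = sum(affinity[r][c] for r, c in enumerate(cols))
        if total > best_total:
            best_total, best_cols = total, cols
    return list(best_cols), best_total


# Hypothetical affinity scores: rows are records from one source,
# columns are records from the other.
affinity = [
    [0.1, 0.9, 0.2],
    [0.0, 0.3, 0.8],
    [0.7, 0.1, 0.4],
]
cols, total = best_assignment(affinity)
pairs = [(r + 1, c + 1) for r, c in enumerate(cols)]  # 1-based, as in the text
```

With these values the best pairing is 1 → 2, 2 → 3, 3 → 1, the same shape of output described above.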
3.1.3.1.1 Affinity Matrix
MAPS builds the affinity matrix by evaluating the strength of the associated date match through a membership function. A membership function is a curve or shape that defines how each point in the input space is mapped to a membership value, or degree of membership, between 0 and 1. (In this case, how well the dates align between records: 0 = not at all, 1 = perfect.)
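As an illustration of a membership function over date-times, the sketch below uses a simple triangular shape: the score is 1 when two timestamps coincide and falls linearly to 0 once they are a given width apart. The triangular shape, the 48-hour width, and the function name are all assumptions; the source does not specify the curve MAPS uses.

```python
from datetime import datetime, timedelta


def date_membership(t_a: datetime, t_b: datetime,
                    width: timedelta = timedelta(hours=48)) -> float:
    """Degree of membership in [0, 1] for how well two date-times align.

    Hypothetical triangular curve: 1.0 at zero gap, 0.0 at `width` or more.
    """
    gap = abs(t_a - t_b)
    if gap >= width:
        return 0.0
    return 1.0 - gap / width  # timedelta / timedelta yields a float


t0 = datetime(2021, 3, 1, 6, 0)
same_time = date_membership(t0, t0)
one_day_apart = date_membership(t0, t0 + timedelta(hours=24))
three_days_apart = date_membership(t0, t0 + timedelta(hours=72))
```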
NOTE: Currently, membership functions are based on date-times. It is possible to use other provided information: for example, Species Landed and Area Type Reported On VTR to match to specific AMS declarations. The final score can also be a function of multiple membership functions.
The affinity matrix holds a score (either a probability or a cost) for each potential pairing of data records that might belong to the same trip. The matrix is built after hard matches have been removed. Each matrix contains all remaining data records belonging to a given permit during some period of time for the two data types being matched (for example, VTR and Dealer report), with scores calculated between all potential pairings.
The score is calculated according to the distance in time between relevant time records – larger distances result in lower probabilities or higher costs. Some data records will have a start and end date for the trip (for example, VTRs), while others will have a single date of transaction (for example, Dealer report).
Note that the calculation is conditional on the nature of the data types being proposed for matching. For example, start and end dates for a trip are used to calculate a midpoint date, and all three dates are then used to compare times for VTR and AMS records. Given that the scores are relative, the details of their calculation are amenable to modification and may require adjustment to optimize matches.
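The start, midpoint, and end comparison for VTR and AMS records might look like the following sketch. How the three comparisons are combined is not stated in the text, so the plain average used here is an assumption, as are the linear closeness curve, the 48-hour width, the dictionary keys, and the sample records.

```python
from datetime import datetime, timedelta


def closeness(t_a, t_b, width=timedelta(hours=48)):
    """Hypothetical linear score: 1 at zero gap, 0 at `width` or more."""
    return max(0.0, 1.0 - abs(t_a - t_b) / width)


def midpoint(start, end):
    """Midpoint date-time of a trip."""
    return start + (end - start) / 2


def trip_affinity(vtr, ams):
    """Average closeness of sail, midpoint, and land times (assumed combination)."""
    compared = [
        (vtr["sail"], ams["sail"]),
        (midpoint(vtr["sail"], vtr["land"]), midpoint(ams["sail"], ams["land"])),
        (vtr["land"], ams["land"]),
    ]
    return sum(closeness(a, b) for a, b in compared) / len(compared)


def affinity_matrix(vtrs, amss):
    """Rows are VTR records, columns are AMS records."""
    return [[trip_affinity(v, a) for a in amss] for v in vtrs]


# Made-up records for one permit.
vtrs = [{"sail": datetime(2021, 3, 1, 6), "land": datetime(2021, 3, 3, 18)}]
amss = [
    {"sail": datetime(2021, 3, 1, 6), "land": datetime(2021, 3, 3, 18)},    # identical
    {"sail": datetime(2021, 3, 1, 18), "land": datetime(2021, 3, 4, 6)},    # shifted 12 h
    {"sail": datetime(2021, 3, 10, 6), "land": datetime(2021, 3, 12, 18)},  # far away
]
matrix = affinity_matrix(vtrs, amss)
```

Because the scores are relative, swapping the average for a product or a weighted sum changes only `trip_affinity`, not the matrix construction.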
3.1.3.1.2 Reproducible and Repeatable Matching
To facilitate matching that is both reproducible and repeatable from run to run, MAPS both orders data sets by date (land date if both start and land are present) and removes unambiguous trip orphans prior to applying the algorithm.
To facilitate identification of trip orphans, MAPS defines the affinity scores, which have been assigned through membership functions, as relative probabilities. Therefore, any record that does not have a corresponding affinity score of >0 to at least one other match is an orphan. The trip orphans are easily identified by those rows or columns in the affinity matrix that sum to zero.
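The row and column sum test for orphans can be sketched as follows. The function name and the sample matrix are hypothetical. The check relies on scores being non-negative, so a sum of zero means every entry in that row or column is zero.

```python
def split_orphans(affinity):
    """Separate orphan rows/columns (those summing to zero) from the rest.

    Returns (reduced_matrix, kept_rows, kept_cols, orphan_rows, orphan_cols),
    where the index lists refer to positions in the original matrix.
    """
    n_rows = len(affinity)
    n_cols = len(affinity[0]) if affinity else 0
    orphan_rows = [r for r in range(n_rows) if sum(affinity[r]) == 0]
    orphan_cols = [c for c in range(n_cols)
                   if sum(affinity[r][c] for r in range(n_rows)) == 0]
    kept_rows = [r for r in range(n_rows) if r not in orphan_rows]
    kept_cols = [c for c in range(n_cols) if c not in orphan_cols]
    reduced = [[affinity[r][c] for c in kept_cols] for r in kept_rows]
    return reduced, kept_rows, kept_cols, orphan_rows, orphan_cols


# Made-up scores: row 1 and column 1 have no affinity to anything.
scores = [
    [0.9, 0.0, 0.2],
    [0.0, 0.0, 0.0],
    [0.4, 0.0, 0.7],
]
reduced, kept_rows, kept_cols, orphan_rows, orphan_cols = split_orphans(scores)
```

Returning the kept index lists lets the assignment found on the reduced matrix be mapped back to the original records.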
If the trip orphans are not removed prior to matching, then they will be randomly assigned to trip ids from run to run. How they are assigned depends on the coded implementation of the Hungarian (Kuhn-Munkres) algorithm and on whether the affinity matrix totals are being maximized or minimized.
Note: Trip orphans can still result from applying the algorithm, for instance when the number of data records in each set is unequal (for example, 5 AMS records being matched to 4 VTRs).
3.1.3.2 Backward Rolling Joins
Commonly used for analyzing data involving time, rolling joins are used when you want to associate a record with the record that has the most recent time prior to it, and the records are not tracked with a common id attribute. For example, matching a fishing dealer record to the most recent vessel trip report.
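A backward rolling join can be sketched with a sorted list and a binary search. This is a minimal illustration, not the MAPS implementation: the function name and timestamps are made up, and treating a reference time equal to the record time as a match ("at or before") is an assumption.

```python
from bisect import bisect_right
from datetime import datetime


def backward_rolling_join(record_times, reference_times):
    """For each record time, return the latest reference time at or before it,
    or None when no reference time precedes the record."""
    refs = sorted(reference_times)
    matches = []
    for t in record_times:
        i = bisect_right(refs, t)  # number of reference times <= t
        matches.append(refs[i - 1] if i else None)
    return matches


# Hypothetical VTR land times and dealer record times for one permit.
vtr_lands = [datetime(2021, 3, 1, 18, 0), datetime(2021, 3, 5, 9, 0)]
dealer_times = [
    datetime(2021, 3, 2, 23, 0),   # rolls back to the 1 March landing
    datetime(2021, 3, 5, 23, 0),   # rolls back to the 5 March landing
    datetime(2021, 2, 27, 23, 0),  # nothing earlier, so no match
]
joined = backward_rolling_join(dealer_times, vtr_lands)
```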
3.2 Matching Order
MAPS matches multiple sets of data. The methodology to build the map is:
- Order dependent: A matching operation on a particular set of data is completed before matching can be attempted on the next particular set of data.
- Cumulative: The results of each previous match are kept to be used to attempt to match the next set of data; at the end of the process, MAPS has built the matching map.
This methodology is explained more thoroughly in the enumerated list directly after this discussion.
MAPS builds the matching map as follows:
- AMS Declaration information is matched to VTR:
This matching is always performed first, and forms the basis of declared and known trips. As AMS has no associated trip id, matches are performed via the Hungarian (Kuhn-Munkres) algorithm using the Sail and Land dates from both records. Based on the membership function’s score, many-to-one matches (trip stitching) are allowed in either direction based on compatible AMS declarations.
For example, if two AMS records had high affinity scores with a single VTR record, the algorithm would only match the AMS record with the highest score. Should the remaining AMS record not be matched to another VTR, the matching process would attempt to “stitch” or assign the additional AMS record if – based on the affinity score and other attributes – it was also deemed to belong to the same VTR record. Conversely, records matched by the algorithm may be separated if the affinity score does not meet the threshold (typically 0.50).
The default is to allow trip stitching in MAPS, but the option can be turned off. Stitching only occurs when matching AMS to VTR records. VTR dates are selected over AMS dates as the official trip record Sail and Land dates.
- Northeast Fisheries Observer Program (NEFOP) Reports:
Observer reports are first matched based upon VTR serial number. If the sources cannot be matched on VTR serial number, an attempt is made to match the records based upon VTR Land date and the Sail and Land date range of the Observer report with the Hungarian (Kuhn-Munkres) algorithm.
Most Observer reports contain a hull identification number in place of a PERMIT value. When the data are staged, the Observer reports are first assigned a PERMIT value by referencing the Vessel Permit System (VPS) and cross-referencing the reported VTR serial number (VTRSERNO) in the Vessel Trip Reports.
- Multispecies Catch Report:
Records are first matched based upon VTR serial number. If the sources cannot be matched on VTR serial number, an attempt is made to match the records based upon the trip record Land dates through a rolling join. Date joins are limited to trips with a multispecies AMS declaration or a non-declared declaration (e.g. declared out of fishery, DOF).
- Multispecies Trip Start Hail Reports:
Records are first matched based upon VTR serial number. If the sources cannot be matched on VTR serial number, an attempt is made to match the records based upon the trip record Land dates through a rolling join. Date joins are limited to trips with a multispecies AMS declaration or a non-declared declaration.
- Multispecies Pre-Trip Notification System Set Only Trips:
No VTR serial number exists for matching. Records are matched through rolling joins, based upon the trip record dates. Date joins are limited to trips with a multispecies AMS declaration or a non-declared declaration. No orphans are permitted.
- Herring Catch Reports:
Records are first matched based upon VTR serial number. If the sources cannot be matched on VTR serial number, an attempt is made to match the records based upon the trip record Land dates through a rolling join. Date joins are limited to trips with a Herring AMS declaration or a non-declared declaration.
- Scallop VMS Reports:
Records are first matched based upon VTR serial number. If the sources cannot be matched on VTR serial number, an attempt is made to match the records based upon the trip record Land dates through a rolling join. Date joins are limited to trips with a scallop AMS declaration or a non-declared declaration.
- Scallop Preland Reports:
Records are first matched based upon VTR serial number. If the sources cannot be matched on VTR serial number, an attempt is made to match the records based upon the trip record Land dates through a rolling join. Date joins are limited to trips with a scallop AMS declaration or a non-declared declaration.
- Scallop VMS Catch Report:
Records are first matched based upon VTR serial number. If the sources cannot be matched on VTR serial number, an attempt is made to match the records based upon the trip record Land dates through a rolling join. Date joins are limited to trips with a scallop AMS declaration or a non-declared declaration.
- Commercial Fisheries Dealer Report:
Records are first matched based upon VTR serial number. If the sources cannot be matched on VTR serial number, an attempt is made to match the records on the dealer-reported date of landing, using a rolling join against the trip land dates in the current build of MAPS. Dealer-reported dates of landing do not include time information and are assumed to be 11 PM.
Dealer is matched second to last to allow all potential VTR serial numbers to be available from previous records.
- Commercial Fisheries Length Data (CFLEN):
Records are first matched based upon VTR serial number. If the sources cannot be matched on VTR serial number, an attempt is made to match the records based upon the trip record Land dates through a rolling join.
CFLEN is matched last to allow for any potential orphan dealer reports to match CFLEN records with VTR serial numbers.
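Most of the steps listed above share one pattern: try a hard match on VTR serial number, fall back to a date-based rolling join, and otherwise keep the record as an orphan that extends the trip list for later steps. The sketch below illustrates that cascade for a single permit. It is a simplification under stated assumptions: the record layout, function names, and sample data are hypothetical, the AMS declaration filter on date joins is omitted, and the steps that do not permit orphans are not modeled.

```python
from bisect import bisect_right
from datetime import date, datetime, time


def dealer_timestamp(d: date) -> datetime:
    """Dealer dates carry no time; per the text, 11 PM is assumed."""
    return datetime.combine(d, time(23, 0))


def match_record(record, trips):
    """Return the index of the trip a record is assigned to.

    `trips` is a list of dicts with 'vtrserno' (str or None) and 'land'
    (datetime). An unmatched record is appended as an orphan trip, so the
    list grows cumulatively across matching steps.
    """
    # 1. Hard match on VTR serial number, when the record has one.
    if record.get("vtrserno"):
        for i, trip in enumerate(trips):
            if trip["vtrserno"] == record["vtrserno"]:
                return i
    # 2. Backward rolling join: latest trip land time at or before the record.
    order = sorted(range(len(trips)), key=lambda i: trips[i]["land"])
    lands = [trips[i]["land"] for i in order]
    pos = bisect_right(lands, record["when"])
    if pos:
        return order[pos - 1]
    # 3. Orphan: becomes a new trip available to later matching steps.
    trips.append({"vtrserno": record.get("vtrserno"), "land": record["when"]})
    return len(trips) - 1


# Made-up trips already built from earlier matching steps.
trips = [
    {"vtrserno": "A1", "land": datetime(2021, 3, 1, 18, 0)},
    {"vtrserno": "B2", "land": datetime(2021, 3, 5, 9, 0)},
]
by_serial = match_record({"vtrserno": "B2", "when": dealer_timestamp(date(2021, 3, 1))}, trips)
by_date = match_record({"vtrserno": None, "when": dealer_timestamp(date(2021, 3, 2))}, trips)
orphan = match_record({"vtrserno": None, "when": dealer_timestamp(date(2021, 2, 20))}, trips)
```

The first record shows why the serial number is tried first: its date alone would roll back to trip `A1`, but the serial number ties it to trip `B2`. Because orphans are appended to `trips`, the outcome of a later step depends on which sources were processed before it, which is the order dependence described in this section.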