mySociety and SpendNetwork have been working on a project for the UK Government Digital Service (GDS) Global Digital Marketplace Programme and the Prosperity Fund Global Anti-Corruption programme, led by the Foreign & Commonwealth Office (FCO), around beneficial ownership in public procurement. This is one of a series of posts about that work.
Once ownership data has been collected, a key issue in analysing it is correctly identifying when the same individual is connected with multiple companies. Name matching is viable in small datasets, but in larger datasets more unrelated people share the same name, which increases the amount of work required to remove false positives.
For instance, while the UK’s People with Significant Control (PSC) register has a unique ID for each instance of a person having ownership, working out where one individual appears in multiple ownership records requires additional data processing and risks inaccuracy. An approach developed for this dataset might not travel well to others, where address data may be less consistent (or lack an equivalent of, for example, a postcode). This problem extends beyond ownership data, and is a general issue in reconciling different datasets about people.
The exact challenges of name reconciliation vary with the naming conventions of a country. Just as there can be no universal standard for storing name information, shortcuts to reduce ‘noise’ in a name (removing common typos, or sound-alikes) differ by language. For instance, the process to generate a CURP (ID) number in Mexico (which, by default, incorporates an individual’s first name) has explicit exceptions for very common first names, requesting use of the individual’s second name instead. Approaches can also vary within a country: Indonesia has a wide range of ethnic and language groups, and so several different sets of common naming conventions.
Given this problem, it is useful to be able to draw on other unique identifiers for an individual (such as a national ID or tax number). However, these are often seen as personal data that cannot be released as part of open data. We have produced a short paper outlining the possible ways these private identifiers can be released.
Different approaches are practical in different contexts, but at a minimum it should always be viable (and should be encouraged) to collect private identification information, and release an ID fragment to aid reconciliation. An ID fragment is a short code derived from an ID that is not in itself unique, so it cannot be used on its own to identify a person. Combined with a name, however, it can be used to group similar names into unique people more accurately. In this way, private information adds information about uniqueness to the process without being revealed publicly.
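As an illustration of the ID fragment idea, the sketch below shows one way a publisher might derive a fragment and how a data user might then group records. It is a minimal example under stated assumptions, not the method set out in the paper: the function names `id_fragment` and `group_people`, the use of a keyed hash (HMAC-SHA256), the two-character fragment length and the sample values are all hypothetical choices made for this example.

```python
import hashlib
import hmac
from collections import defaultdict


def id_fragment(private_id: str, secret_key: bytes, length: int = 2) -> str:
    """Return a short, non-unique code derived from a private identifier.

    Assumption: the publisher holds secret_key privately and never releases it.
    Truncating the keyed hash to a few characters means many different IDs
    share each fragment, so the fragment alone does not identify anyone.
    """
    normalised = private_id.strip().upper().encode("utf-8")
    digest = hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()
    return digest[:length]


def group_people(records):
    """Group published (name, fragment, company) records by name and fragment.

    Records that share a name but carry different fragments are kept apart,
    which removes some of the false positives of matching on name alone.
    """
    groups = defaultdict(list)
    for name, fragment, company in records:
        # Lower-case the name and collapse repeated whitespace before matching.
        key = (" ".join(name.lower().split()), fragment)
        groups[key].append(company)
    return dict(groups)


# Publisher side: derive the fragment from the private ID before release.
published_fragment = id_fragment("AB123456C", b"publisher-secret")

# Data user side: fragments arrive alongside names in the open data.
# The fragment values below are made up for illustration.
records = [
    ("Jane Smith", "a3", "Company A"),
    ("jane  smith", "a3", "Company B"),
    ("Jane Smith", "7f", "Company C"),
]
people = group_people(records)
```

Here the three "Jane Smith" records fall into two groups rather than one, because the third carries a different fragment. Two different people with the same name can still share a fragment by chance, so a matching fragment reduces false positives rather than proving that two records refer to the same person. The fragment length is a trade-off: longer fragments separate people better but reveal more about the underlying ID.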