This is a prospective extension to Python's dict.get() that solves a common problem in data applications. The bold proposal asks whether such an implementation belongs in the core language or in a library, across languages used for data processing. See what you think.
Background & Why?
These days more data-oriented code is being written (ML/AI, etc.). Data is often “dirty” (missing values, spelling errors, grammar typos, etc.), and “fuzzy” (less certain) matching can be useful in many of these cases (traditionally in SQL we might use LIKE '%…%'). Dictionary implementations (i.e. hashmap, hashtable, associative array, etc.) are an efficient lookup mechanism. They respond as a boolean lookup – the key is there, or it is not – and it is unconventional to think of their lookup with a confidence measure. In data-oriented code, dictionaries are often used for matching data or conditionally joining datasets. When we cross (natural) languages we get more typo variations (in the general sense, two languages mean double the varieties of typos), and therefore a greater likelihood of mismatches when performing lookups (or translating) across those languages.
Below is Python code using the difflib string-similarity library. The code performs a lookup in a dictionary (dict) using a double get_or_else. get_or_else has become a broadly adopted functional-programming best practice in software engineering, replacing if-else blocks with a (coupled) curried function. Coupling multiple get_or_else calls is normal, yet tends to create more edge cases and complicate testing. It remains unconventional to throw confidence-based matching into this mix; which is precisely what we do here:
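The original one-line listing is not reproduced here, so the following is a minimal sketch of what such a double get_or_else with confidence matching could look like, unrolled into the numbered steps discussed below. The function name get_or_threshold_match mirrors the lazy variant named later in the article; the body itself is an assumption built on difflib.SequenceMatcher.

```python
from difflib import SequenceMatcher

def get_or_threshold_match(d, key, default_value, threshold=0.5):
    """Exact dict lookup, falling back to the closest fuzzy-matched key (sketch)."""
    # (1) score every existing key against the lookup key: O(n)
    scores = [(k, SequenceMatcher(None, str(key), str(k)).ratio()) for k in d]
    # (2) keep only pairs whose score reaches the threshold: O(n)
    filtered = [(k, s) for (k, s) in scores if s >= threshold]
    # (3) sort so the best (key, score) pair sits at index 0
    ranked = sorted(filtered, key=lambda ks: ks[1], reverse=True)
    # (4) top pair, or a nan sentinel that can never match a real key
    top_match = ranked[0] if ranked else (float('nan'), float('nan'))
    # (5) extract just the key from the (key, score) tuple
    top_match_key = top_match[0]
    # (6) second-level get_or_else on the fuzzy-matched key
    fuzzy_result = d.get(top_match_key, default_value)
    # (0) standard get; in the article's one-liner this fallback is the
    # eagerly evaluated chained expression above
    return d.get(key, fuzzy_result)
```

The article's version chains these steps into a single expression; this sketch unrolls them purely for readability, the evaluation order is the same.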
The dictionary lookup will either:
- match the key exactly, or
- match the closest key with a similarity score >= threshold=0.5, or
- fail to match, and return the default_value.
Obviously, the above looks like code golf, so let’s step through the call and show each operation:
(0) Standard Get or Else
The base operation is a standard dict key check using the get(..) method. get(..) ensures an exception is not raised if the key is not found; if the key is found, its value is returned. (See the Runtime Analysis below for more on this execution, as it is evaluated after snippets 1-6.)
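As a tiny sketch (with hypothetical example data), the base operation alone is:

```python
d = {'apple': 1, 'banana': 2}  # hypothetical example data

value = d.get('apple', 'fallback')    # key present: returns its value, no exception
missing = d.get('grape', 'fallback')  # key absent: returns the fallback, no KeyError
```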
(1) Similarity Scores:
Create a list of similarity scores, one per key. This is an O(n) operation, checking every key in the dictionary.
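A sketch of this scoring step using difflib's SequenceMatcher; the dictionary and misspelled lookup key are hypothetical:

```python
from difflib import SequenceMatcher

d = {'apple': 1, 'banana': 2, 'cherry': 3}  # hypothetical example data
key = 'appel'  # misspelled lookup key

# (1) score every existing key against the lookup key: O(n)
scores = [(k, SequenceMatcher(None, key, k).ratio()) for k in d]
```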
(2) Filter by Threshold Score:
Keep the keys with a similarity score that reaches or exceeds the threshold value. This is an O(n) operation, checking every scored pair.
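Sketched with the same hypothetical data as the scoring step:

```python
from difflib import SequenceMatcher

d = {'apple': 1, 'banana': 2, 'cherry': 3}  # hypothetical example data
scores = [(k, SequenceMatcher(None, 'appel', k).ratio()) for k in d]  # from step (1)

threshold = 0.5
# (2) keep only the (key, score) pairs that reach the threshold: O(n)
filtered = [(k, s) for (k, s) in scores if s >= threshold]
```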
(3) Sort to find the best match:
Sort the filtered results, so the best result is in index position 0. This is an O(n) (best case) to O(n log n) operation; Python's Timsort is a hybrid of insertion sort and merge sort.
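A sketch of the sort, over hypothetical already-filtered pairs:

```python
# (3) sort (key, score) pairs descending by score so the best match is at index 0
filtered = [('apply', 0.7), ('apple', 0.8), ('ample', 0.6)]  # hypothetical scored pairs
ranked = sorted(filtered, key=lambda ks: ks[1], reverse=True)
```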
(4) Get the top match, or handle no matches; ensure a value is returned:
We use float('nan') here because it should never unintentionally match a genuine key. Python doesn't have a true null type (i.e. None == None is True, which is not the case for NaN, since float('nan') != float('nan')), so float('nan') provides that null behaviour. This is an O(1) operation.
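The NaN-sentinel behaviour can be sketched like this (the empty result list is hypothetical):

```python
# (4) take the best pair, or a nan sentinel when nothing reached the threshold
ranked = []  # hypothetical: no key survived the filter
top_match = ranked[0] if ranked else (float('nan'), float('nan'))

# NaN never compares equal, not even to itself, so it cannot collide with a real key
assert float('nan') != float('nan')
assert None == None  # noqa: E711  (None, by contrast, does equal itself)
```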
(5) Extract the key (i.e. from the (key, score) tuple), or handle no matches:
The same reasoning applies for float('nan'): it ensures the null result will not match an existing dictionary key. This is an O(1) operation.
(6) Second-Level Get_or_Else Lookup:
A simple get_or_else lookup. Note that if no top_match_key was found, its value will be float('nan'), which can never match a key. Therefore the lookup fails and returns the specified default_value. This is an O(1) operation.
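A sketch of this final lookup (hypothetical data):

```python
d = {'apple': 1, 'banana': 2}  # hypothetical example data
default_value = -1

# (6) second-level get_or_else: a nan key can never match, so a failed
# fuzzy match falls through to the default value: O(1)
hit = d.get('apple', default_value)        # a fuzzy match succeeded earlier
miss = d.get(float('nan'), default_value)  # no fuzzy match: default returned
```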
Total time complexity is O(3n+3c) for the average, worst and best case scenarios (excluding variations in dependent functions, e.g. Timsort). Comparatively, dict's native lookup is O(1). In the Appendix: Lazy Implementation section below, you can find a time complexity of O(3n+3c) for the average and worst cases with a best case of O(1), achieved by separating the confidence-based lookup into a (curried, yet) separate function call.
Appendix: Lazy implementation
This implementation improves the best-case execution time. Here the similarity lookup is optional and lazily called: if the key is not found within the dictionary, get_or_threshold_match_lazy() returns a function (object) pointer, which can then be called. Note: the major difference in this function is on line 11.
Why? Well, the eager implementation (above) will first evaluate the O(3n+3c) fuzzy lookup and only then try the O(1) lookup, because Python evaluates the fallback argument to get(..) before the call itself. The lazy implementation first evaluates the O(1) lookup, then defers the O(3n+3c) lookup until it is explicitly requested.
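Since the lazy listing is not reproduced here, this is a sketch of what it could look like under the same assumptions as the eager sketch; callers must check whether they received a value or a callable:

```python
from difflib import SequenceMatcher

def get_or_threshold_match_lazy(d, key, default_value, threshold=0.5):
    """Exact lookup first; on a miss, return a callable that does the fuzzy lookup."""
    sentinel = object()
    value = d.get(key, sentinel)  # the O(1) lookup happens first
    if value is not sentinel:
        return value

    # Key miss: hand back a function pointer; the O(3n+3c) fuzzy lookup
    # only runs if and when the caller invokes it
    def fuzzy_lookup():
        scores = [(k, SequenceMatcher(None, str(key), str(k)).ratio()) for k in d]
        ranked = sorted([(k, s) for (k, s) in scores if s >= threshold],
                        key=lambda ks: ks[1], reverse=True)
        top_key = ranked[0][0] if ranked else float('nan')
        return d.get(top_key, default_value)

    return fuzzy_lookup
```

Typical usage would be `result = get_or_threshold_match_lazy(d, key, -1)` followed by `result() if callable(result) else result`; note this convention breaks down if the dictionary itself stores callables.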
- Time complexity best case is O(1); average and worst cases remain O(3n+3c).
- An option to separate the function calls and conditionally request the secondary (fuzzy) lookup.
- Good for large lookup dictionaries.
- More complex code to write / read.
Appendix: Eager implementation (above) Pros & Cons
- Relatively simple code to write.
- Bad for dictionaries with large key sets.
- Guaranteed O(3n+3c) for dictionary lookups.