Hash tables are the engines behind Python’s high-performance dicts.
Sets are implemented with hash tables as well.

A few questions will be involved:

  1. Common dictionary methods

  2. Special handling for missing keys

  3. Variations of dict in the standard library

  4. The set and frozenset types

  5. How hash tables work

  6. Implications of hash tables (key type limitations, unpredictable ordering, etc.)

1. Hashable

All mapping types in the standard library use the basic dict in their implementation, so they share the limitation that the keys must be hashable (the values need not be hashable, only the keys).

  1. An object is hashable if it has a **hash value** which **never changes** during its lifetime (it

needs a hash() method), and can be compared to other objects (it needs an eq() method). Hashable objects which compare equal must have the same hash value. Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally. All of Python’s immutable built-in objects are hashable; mutable containers (such as lists or dictionaries) are not. Objects which are instances of user-defined classes are hashable by default. They all compare unequal (except with themselves), and their hash value is derived from their id(). —— Python Glossary Hashable

Dictionaries&Sets - 图1

According to the above saying, it is easy to see: The atomic immutable types ( str , bytes , numeric types) are all hashable. A frozen set is always hashable, because its elements must be hashable by definition.
But a tuple is hashable only if all its items are hashable.

  1. In [7]: tpM1 = (1, 2, (3, 4))
  2. In [8]: tpM2 = (1, 2, [3, 4])
  3. In [9]: hash(tpM1)
  4. Out[9]: -2725224101759650258
  5. In [10]: hash(tpM2)
  6. ---------------------------------------------------------------------------
  7. TypeError Traceback (most recent call last)
  8. <ipython-input-10-46e92fa07838> in <module>()
  9. ----> 1 hash(tpM2)
  10. TypeError: unhashable type: 'list'
  1. **User-defined types** are hashable by default because their hash value is their **id()** and

they all compare not equal. If an object implements a custom eq that takes into account its internal state, it may be hashable only if all its attributes are immutable.

2. Ways to Build Dictionaries

2.1 Dict Constructor

  1. dict(mapping) -> new dictionary initialized from a mapping object’s (key, value) pairs
  1. dict(iterable) -> new dictionary initialized as if via:
  1. d = {}


for k, v in iterable:
d[k] = v

  1. dict(**kwargs) -> new dictionary initialized with the name=value pairs
  1. in the keyword argument list. For example: dict(one=1, two=2)
  1. >>> a = dict(one=1, two=2, three=3) # 3
  2. >>> b = {'one': 1, 'two': 2, 'three': 3} # 1
  3. >>> c = dict(zip(['one', 'two', 'three'], [1, 2, 3])) # 2
  4. >>> d = dict([('two', 2), ('one', 1), ('three', 3)]) # 2
  5. >>> e = dict({'three': 3, 'one': 1, 'two': 2}) # 2
  6. >>> a == b == c == d == e
  7. True

2.2 Dict Comprehensions

Dict Comprehensions with list of pairs(tuples/lists) dict.items()

  1. >>> country_code = {country: code for code, country in DIAL_CODES}
  2. >>> {code: country.upper() for country, code in country_code.items()
  3. ... if code < 66}

3. Common Mapping Methods

Dictionaries&Sets - 图2

4. Handling Missing Keys

4.1 Handle Missing Keys with dict.setdefault

dict access with d[k] will raise error when k is not an existing key.
d.get(k, default) is a more convenient way than handling KeyError.

D.setdefault(k[,d]) -> D.get(k,d), also set D[k]=d if k not in D

  1. tmp_list = my_dict.get(key, [])
  2. tmp_list.append(new_value)
  3. my_dict[key] = tmp_list
  4. ||
  5. if key not in my_dict:
  6. my_dict[key] = []
  7. my_dict[key].append(new_value)
  8. ||
  9. my_dict.setdefault(key, []).append(new_value)

4.2 Handle Missing Keys with collections.defaultdict

Work Theory: When instantiating a defaultdict , you provide a callable that is used to produce a default value whenever getitem is passed a nonexistent key argument.

  1. dd = defaultdict(list)

if new-key is not in dd , the expression dd['new-key'] does the following steps:

  1. Calls list() to create a new list.

  2. Inserts the list into dd using ‘new-key’ as key.

  3. Returns a reference to that list.

The callable that produces the default values is held in an instance attribute called default_factory.

  1. # 上述case可以进一步演化为如下代码
  2. my_dict = collections.defaultdict(list)
  3. my_dict[key].append(new_value)

The defaultfactory of a defaultdict is only invoked to provide default values for getitem calls, and not for the other methods. For example, if dd is a defaultdict , and k is a missing key, _dd[k] will call the default_factory to create a default value, but dd.get(k) still returns None.

4.2.1 **missing** method — actual default_factory working reason

This method is not defined in the base dict class, but dict is aware of it: If you subclass dict and provide a missing method, the standard dict.getitem will call it whenever a key is not found, instead of raising KeyError.

The missing method is just called by __getitem__ (d[k] operator) . This is why the default_factory of defaultdict works only with __getitem.

  1. #!/usr/bin/python
  2. # -*- encoding:utf-8 -*-
  3. import sys
  4. #import crash_on_ipy
  5. class StrKeyDict0(dict):
  6. def __missing__(self, key):
  7. if isinstance(key, str): #☆1
  8. raise KeyError(key)
  9. return self[str(key)] #☆2
  10. def get(self, key, default=None):
  11. try:
  12. return self[key] #☆3
  13. except KeyError:
  14. return default
  15. def __contains__(self, key):
  16. return key in self.keys() or str(key) in self.keys() #☆4
  17. if __name__ == '__main__':
  18. d = StrKeyDict0([('one',1), ('(1, 2, 3)', 'N/A'), ('1', 'oneone')])
  19. >>> (1,2,3) in d
  20. False
  21. >>> (1, 2, 3) in d
  22. True

☆3 若key不在self.keys()中,将调用 missing方法。
#☆2 如果key不在self.keys()中,将key转换为str形式,再次get,如果无法获取,将再次调用missing方法
#☆1 如果key是str类型,证明key不在self.keys()中,截止查询,抛异常
#☆4 在contains方法中,使用self.keys()而不是 key in self,避免递归调用 contains死循环

Python3: dict.keys() returns dict_keys object

5. Variations of dict

5.1 collections.OrderedDict

Maintains keys in insertion order.
popitem method of an OrderedDict pops the first item by default, but if called as my_odict.popitem(last=True) , it pops the last item added.

5.2 collections.ChainMap

5.3 collections.Counter

A mapping that holds an integer count for each key. Updating an existing key adds to its count.

Important method:
update : update the Counter’s keys’ occurrence
most_common : an ordered list of tuples with the n most common items and their counts

  1. >>> ct = collections.Counter('abracadabra')
  2. >>> ct
  3. Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
  4. >>> ct.update('aaaaazzz')
  5. >>> ct
  6. Counter({'a': 10, 'z': 3, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
  7. >>> ct.most_common(2)
  8. [('a', 10), ('z', 3)]

5.4 collections.UserDict

UserDict does not inherit from dict , but has an internal dict instance, called data , which holds the actual items.

  1. import collections
  2. class StrKeyDict(collections.UserDict):
  3. def __missing__(self, key):
  4. if isinstance(key, str):
  5. raise KeyError(key)
  6. return self[str(key)]
  7. def __contains__(self, key):
  8. return str(key) in self.data
  9. def __setitem__(self, key, item):
  10. self.data[str(key)] = item

setitem converts any key to a str .

5.5 Immutable Mappings

Mapping(internal, Mutable, dynamic) —-> types.MappingProxyType(ReadOnly, Immutable)

  1. >>> from types import MappingProxyType
  2. >>> d = {1: 'A'}
  3. >>> d_proxy = MappingProxyType(d)
  4. >>> d_proxy
  5. mappingproxy({1: 'A'})
  6. >>> d_proxy[1]
  7. 'A'
  8. >>> d_proxy[2] = 'x'
  9. Traceback (most recent call last):
  10. File "<stdin>", line 1, in <module>
  11. TypeError: 'mappingproxy' object does not support item assignment
  12. >>> d[2] = 'B'
  13. >>> d_proxy
  14. mappingproxy({1: 'A', 2: 'B'})
  15. >>> d_proxy[2]
  16. 'B'
  17. >>>

6. Set

6.1 Constructor

Empty Set: must use set()
{} represents an empty dict.

Literal set like {1, 2, 3} is both faster and more readable than calling the constructor like set([1, 2, 3]).

6.2 Set Comprehensions

  1. from unicodedata import name
  2. {chr(i) for i in range(32, 256) if 'SIGN' in name(chr(i), '')}

6.3 Set Operations

Dictionaries&Sets - 图3

Dictionaries&Sets - 图4

7. HashMap Theory under the dict and set

7.1 Hashes and Equality

The hash() built-in function works directly with built-in types and falls back to calling hash for user-defined types. If two objects compare equal, their hash values must also be equal, otherwise the hash table algorithm does not work. For example, because 1== 1.0 is true, hash(1) == hash(1.0) must also be true, even though the internal representation of an int and a float are very different.

7.2 Hash Table Algorithm

To fetch the value at mydict[search_key] , Python calls hash(search_key) to obtain the hash value of search_key and uses the least significant bits of that number as an offset to look up a bucket in the hash table (_the number of bits used depends on the current size of the table). If the found bucket is empty, KeyError is raised. Otherwise, the found bucket has an item—a found_key:found_value pair—and then Python checks whether search_key == found_key . If they match, that was the item sought: found_value is returned.

However, if search_key and found_key do not match, this is a hash collision. This happens because a hash function maps arbitrary objects to a small number of bits, and—in addition—the hash table is indexed with a subset of those bits. In order to resolve the collision, the algorithm then takes different bits in the hash, massages them in a particular way, and uses the result as an offset to look up a different bucket. If that is empty, KeyError is raised; if not, either the keys match and the item value is returned, or the collision resolution process is repeated

Dictionaries&Sets - 图5

Additionally, when inserting items, Python may determine that the hash table is too crowded and rebuild it to a new location with more room. As the hash table grows, so does the number of hash bits** used as bucket offsets**, and this keeps the rate of collisions low.

7.3 Avoid Insertion during scan

Whenever you add a new item to a dict , the Python interpreter may decide that the hash table of that dictionary needs to grow, with the result that the keys are likely to be ordered differently in the new hash table.
If you are iterating over the dictionary keys and changing them at the same time, your loop may not scan all the items as expected — not even the items that were already in the dictionary before you added to it.

This is why modifying the contents of a dict while iterating through it is a bad idea. If you need to scan and add items to a dictionary, do it in two steps: read the dict from start to finish and collect the needed additions in a second dict . Then update the first one with it.

In Python 3, the .keys() , .items() , and .values() methods return dictionary views, which behave more like sets than the lists returned by these methods in Python 2. Such views are also dynamic: they do not replicate the contents of the dict , and they immediately reflect any changes to the dict.