Programming Memoirs

Python’s built-in container data types: categorisation and iteration

The aim of this short article is to take a look on the built-in container data types in Python 3.1. This introduction, however, is not a typical one. Typical Python data types tutorials focus on describing individual data types one by one. Here, instead, I try to describe the ways Python container data types can be grouped according to their properties. I also explain when and why some containers are iterable, why others are not. Therefore, the article explains such concepts as:

  • mutable and immutable objects,
  • hashable objects,
  • ordered and unordered objects,
  • the relationship between sequence/set/mapping types with the iterable type, and demystifies the implicit sequence iterator in custom sequence containers.

The article assumes the reader has a fundamental understanding of container types in Python.

Brief recap

So… what are container data types in Python? In short — they are types whose instances are capable of storing other objects. As a quick reminder — some of the fundamental built-in container objects include:

  • lists – the most popular container data type in python; can store any number of any objects.
  • tuples – similar to list, yet once created are immutable,
  • sets – can store only unique elements,
  • bytes – immutable sequence of integers in the range 0 <= x < 256,
  • bytearray – like bytes, but mutable,
  • dictionary – also known as associative arrays. They contain mapping of keys into value,
  • collections.OrderedDict - ordered dictionary, present in the collections module. Like typical dictionary yet it remembers the order in which items have been added,
  • str – string, a sequence of unicode characters,
  • range – a sequence of numbers — more precisely a list containing arithmetic progressions
  • array.array – present in the array module. Similar to list, yet during the construction it is restricted to holding a specific data type,

More information on those can be found in http://docs.python.org/tutorial/datastructures.html

Python built-in container types cheat-sheet

I will just leave it here as it may come in handy for later.

Data type Mutable Ordered Literal example Constructor
Sequence types
list yes yes [1,2,3] list()
tuple no yes (1,2,3) tuple()
str no yes “text”  /  ‘text’ str()
range no yes range()
bytes no yes b’abcde’  /  b”abc” bytes()
bytearray yes yes bytearray()
array * yes yes array.array()
Set types
set yes no {1,2,3} set()
frozenset no no frozenset()
Mapping types
dict yes no {“key1″: “val”, “key2″: “val”} dict()
OrderedDict * yes yes none collections.OrderedDict()

* — can be found in a separate module

Categorising Python’s built-in container data types

The mentioned container types in Python can be segregated using the following criteria:

  • container mutability, or lack of thereof,
  • determinable order of contained elements, or lack of thereof,
  • the way the contained elements are accessed.

Mutable, immutable and hashable objects

What every newcomer to Python quickly learns is that all objects in Python can be either mutable or immutable. The fact that a container object is immutable doesn’t always mean that the objects it holds are also immutable (e.g. an immutable tuple holding mutable lists). However, container objects are fully immutable only if the object itself, and the objects it contains are recursively immutable. Recursively immutable objects may be hashable. This is important as only hashable objects can be used in a mapping container object (see below) as keys.

All of Python’s immutable built-in objects are hashable, while no mutable containers (such as lists or dictionaries) are.

Source: http://docs.python.org/release/3.1.3/glossary.html

Examples of mutable containers include:

  • list,
  • set,
  • dictionary,
  • OrderedDictionary
  • bytearray
  • array

Examples of immutable containers include:

  • string,
  • frozenset,
  • tuple,
  • bytes

The main implication of the mutable/immutable distinction and hashability is that not all container object can store all other container objects, in particular:

  • sets can store only hashable object (each object in set has to have a unique hash — sets do not store duplicate objects, as opposed to e.g. lists or tuples)
  • dictionaries can have only hashable objects as keys

Now, an interesting thing is that instances of custom classes in Python are by default hashable (even if they are by nature mutable), with their hash being their id(object). This means that two, seeming identical object instances of a custom class will have different hashes unless they implement valid __hash__() and __eq__() functions.  This is illustrated by the following code.

class MyClass1:
    """ This class does not implement any
    explicit __hash__() not __eq__() function. Thus
    the hash of the object is by default
    obtained using the id() function. """
    def __init__ (self, x):
        self.data = x
    def __getitem__(self, x):
        return self.data[x]

class MyClass2:
    """ This class does implement
    a valid __hash__() and __eq__() function. Thus
    the hash value is computed as expected. """
    def __init__ (self, x):
        self.data = x
    def __getitem__(self, x):
        return self.data[x]
    def __hash__(self):
        return hash(self.data)
    def __eq__(x, y):
        return hash(x) == hash(y)

obj1 = MyClass1((1,2,3))  # instance of MyClass1
obj2 = MyClass1((1,2,3))  # different, but identical, instance of MyClass1
my_set = {obj1, obj2}  # The two object are identical. We would expect only one to be added to set
print (my_set)
# output:  {<__main__.MyClass1 object at 0x1d3e110>, <__main__.MyClass1 object at 0x1d3e050>}
# Not good -- two identical objects are in the set. This is because their hash value is obtained from the id() function.

obj1 = MyClass2((1,2,3))  # instance of MyClass2
obj2 = MyClass2((1,2,3))  # different, but identical, instance of MyClass2
my_set = {obj1, obj2}  # The two object are identical. We would expect only one to be added to set
print (my_set)
# output:  {<__main__.MyClass2 object at 0x1d3e1d0>}
# As the object are identical, and hashed correctly, only one of them is actually inside the set.

Also, for the hashability to work in an expected manner, the programmer must “promise” that hashable object will not change during they lifetime. This promise, however, can be broken. Yet this can lead to unexpected problem in later development stages.

Why the distinction between mutable and immutable objects?

  1. Immutable objects can potentially be faster.
  2. Recursively immutable object are potentially hashable. Because such object never change their content throughout their lifetime, they can be used as keys in dictionaries. Also, hashability allows the use of the object in a set (sets store only elements with a unique hash).
  3. Code using immutable object is safer.

Order and unordered containers

Container object can store their content in either an ordered or unordered manner. Order, or lack of thereof, is unrelated to the mutability of objects.  This means that both mutable and immutable objects can be either ordered or unordered.

Examples of ordered containers include:

  • list,
  • string,
  • tuple,
  • bytes,
  • bytearrays,
  • collections.OrderedDict.
  • array

Examples of unordered containers include:

  • dictionary,
  • set,
  • frozenset.

When iterating over data in an ordered container object the data is a accesses in a determinable manner. When iterating over data in an unordered container object the data is accessed in an undetermined order. This is illustrated by the following code fragment.

my_list = list(range(10))
my_list.append (10)
my_list.append("text")
my_list.append(11)
print ("my_list:", my_list)

my_set = set(range(10))
my_set.add(10)
my_set.add("text")
my_set.add(11)
print ("my_set: ", my_set)

keys = "a b c d e f g h i j".split(" ")  # create a list of keys
my_dict = dict(zip(keys, range(10))) # zip() creates a dictionary from two sequences
my_dict['k']=10
my_dict["text_key"]="text_val"
my_dict['l']=11
print ("my_dict:", my_dict)

# program output:
# my_list: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 'text', 11]
# my_set:  {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 'text'}
# my_dict: {'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3, 'g': 6, 'f': 5, 'i': 8, 'h': 7, 'j': 9, 'l': 11, 'k': 10, 'text_key': 'text_val'}

# as we can see, only elements in the my_list were output in a manner we could expect

Set, sequence, mapping and iterable container types

A different criterion with which one can divide container objects is by the way elements contained within those are accessed. This give rise to sequence, mapping, set and iterable container types.

Now, the interesting part is the distinction between set/sequence/mapping container types and the iterable container types . It was really hard to find any in-depth information on this. I reckon it was because most people just conform to the magic of Python according to which  “things just works”. I got interested why do they work the way they do — why are my simple custom sequence containers iterable even though they do not contain any __iter__() functions? What follows is what I gathered from the pieces of information I had found. Feel free to correct me if I erred somewhere.

Sequence types and sequence iterator

Sequence and mapping container object are quite similar. In each of them data stored is accessed by means of a key. The main difference is that in sequence containers values are accessed by keys implemented as (hashable) integers from 0 to n-1, where n is equal to the number of elements in the container, whereas in mapping containers, such key can be any hashable objects (e.g. tuples).

Sequence data types include e.g.

  • str,
  • bytes,
  • bytearray,
  • list,
  • tuple,
  • range.

All built-in sequence types are by default iterable as they implement an __iter__() function. This functions returns, depending on the specific built-in container object type, a list_iterator, tuple_iterator, str_iterator, bytes_iterator, bytearray_iterator or a range_iterator objects.

Custom sequence types are also iterable even if they do not explicitly implement an __iter__() function. In such case the __iter__() cannot be called, but an iterator object can still be retrieved with the iter(sequence_type_object) function as soon as you add a __getitem__() function in your class. To give an example of a simplistic __getitem__() function:

def __getitem__(self, x):
   return self.data[x]

This is the case because in such object this function is used by an implicit sequence iterator (see more in “Iterator types” section), which, because the values are mapped using integer keys from 0 to n-1, can with no trouble ‘predict’ the names of the keys.

Mapping types

As mentioned earlier, mapping container objects can use any hashable object as a key which servers as a mapping to a value.

Mapping types include e.g.:

  • dict,
  • collections.OrderedDict.

All built-in mapping types are by default iterable as they implement the __iter__() function.

Simple custom mapping objects types are generally not iterable unless they explicitly implement an __iter__() function. If this is not the case case __iter__() obviously cannot be called directly, but an iterator object can still be retrieved with the iter(mapping_object) function. Such iterator object will, however, only work if the keys in the dictionary are integers from 0 to n where n+1 is the number of items in the dictionary, which is rarely the case. Thus for custom mapping types it is advisable to inherit from the dict class.

Set types

Set container objects store hashable objects. The elements in sets are unordered.

Set data types include e.g.:

  • set,
  • frozenset.

All built-in set types are by default iterable as they implement an __iter__() function. This function returns, in the case of set and frozentset types, the set_iterator and frozenset_iterator objects respectively.

Custom set container objects have to explicitly implement the __iter__() function to be iterable.

Iterable types

Objects stored in pure iterable container objects cannot be accessed by means of a key, but can only be iterated over using a iterator object. Iterable containers implement an __iter__() function which returns the iterator object, which in turn implements an __iter__() function, returning itself and a __next__() returning next element in the container (or throwing an exception if the end has been reached).

Python defines several iterator objects to support iteration over general and specific sequence types, dictionaries, and other more specialized forms.

Source: http://docs.python.org/library/stdtypes.html

Now bare with me for a moment. Custom sequence/set/mapping containers do not have to implement the aforementioned iterator-related function. Thus, in pure theory, iterable and set/sequence/mapping containers are quite distinct things. However, in practice:

  • all built-in container types (set, sequence, mapping) also explicitly implement a __iter__() function, returning adequate iterator objects, e.g. list_iterator, set_iterator, tuple_iterator objects,
  • all custom sequence containers explicitly implementing a __getitem__() also implicitly implement an implicit sequence iterator returned via the iter() built-in function, returning an iterator object.

This boils down to the fact that in Python all mentioned built-in sequence/mapping/set containers are also iterable containers. All custom sequence containers are also iterable containers (due to implicit sequence iterator). However, not all iterable containers are sequence/mapping/set containers.


Interesting articles:

http://www.cs.toronto.edu/~tijmen/programming/immutableDictionaries.html

http://www.voidspace.org.uk/python/articles/mapping_sequence_ref.shtml

http://docs.python.org/py3k/library/stdtypes.html

5 ResponsesLeave one →

Leave a Reply