# MultiIndex / advanced indexing

This section covers indexing with a MultiIndex and other advanced indexing features.

See Indexing and Selecting Data for general indexing documentation.

::: danger Warning

Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.

:::
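A minimal sketch of the pattern being warned about (the data and column names here are made up for illustration): the chained form indexes twice and may write into a temporary copy, while a single `.loc` call indexes rows and columns in one operation:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Chained assignment: df['b'] may return a copy, so this write can be lost.
# df['b'][0] = 30

# Preferred: one .loc call that acts on df itself.
df.loc[0, 'b'] = 30
```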

See the cookbook for some advanced strategies.

## Hierarchical indexing (MultiIndex)

Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.

See the cookbook for some advanced strategies.

Changed in version 0.24.0: MultiIndex.labels has been renamed to MultiIndex.codes and MultiIndex.set_labels to MultiIndex.set_codes.

### Creating a MultiIndex (hierarchical index) object

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

```python
In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]:
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64
```

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

```python
In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second'])
Out[9]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
```

You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().

New in version 0.24.0.

```python
In [10]: df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
   ....:                    ['foo', 'one'], ['foo', 'two']],
   ....:                   columns=['first', 'second'])

In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
```

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

```python
In [12]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
   ....:           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]:
bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]:
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
```

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

```python
In [17]: df.index.names
Out[17]: FrozenList([None, None])

This index can back any axis of a pandas object, and the number of levels of the index is up to you:

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)

In [19]: df
Out[19]:
first        bar                 baz                 foo                 qux
second       one       two       one       two       one       two       one       two
A       0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003 -0.827317 -0.076467 -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.974466 -2.006747

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]:
first              bar                 baz                 foo
second             one       two       one       two       one       two
first second
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441
```

We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the multi_sparse option in pandas.set_option():

```python
In [21]: with pd.option_context('display.multi_sparse', False):
   ....:     df
   ....:
```

It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

```python
In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]:
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64
```

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
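As an illustration of that last point, a MultiIndex is often created after loading flat data by promoting columns to index levels with `set_index`. A small sketch with made-up data, using `io.StringIO` as a stand-in for a real file:

```python
import io
import pandas as pd

# A stand-in for a real CSV file on disk.
csv = io.StringIO("first,second,value\nbar,one,1\nbar,two,2\nbaz,one,3\n")

df = pd.read_csv(csv)

# Promote the two label columns to levels of a MultiIndex.
dfm = df.set_index(['first', 'second'])
```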

### Reconstructing the level labels

The method get_level_values() will return a vector of the labels for each location at a particular level:

```python
In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values('second')
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
```

### Basic indexing on axis with MultiIndex

One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

```python
In [25]: df['bar']
Out[25]:
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df['bar', 'one']
Out[26]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df['bar']['one']
Out[27]:
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s['qux']
Out[28]:
one   -1.039575
two    0.271860
dtype: float64
```

See Cross-section with hierarchical index for how to select on a deeper level.

### Defined levels

The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:

```python
In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[['foo', 'qux']].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
```

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.

```python
In [31]: df[['foo', 'qux']].columns.to_numpy()
Out[31]:
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

# for a specific level
In [32]: df[['foo', 'qux']].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
```

To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.

New in version 0.20.0.

```python
In [33]: new_mi = df[['foo', 'qux']].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
```

### Data alignment and using reindex

Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

```python
In [35]: s + s[:-2]
Out[35]:
bar  one   -1.723698
     two   -4.209138
baz  one   -0.989859
     two    2.143608
foo  one    1.443110
     two   -1.413542
qux  one         NaN
     two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]:
bar  one   -1.723698
     two         NaN
baz  one   -0.989859
     two         NaN
foo  one    1.443110
     two         NaN
qux  one   -2.079150
     two         NaN
dtype: float64
```

The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:

```python
In [37]: s.reindex(index[:3])
Out[37]:
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')])
Out[38]:
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64
```

## Advanced indexing with hierarchical index

Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

```python
In [39]: df = df.T

In [40]: df
Out[40]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[('bar', 'two')]
Out[41]:
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64
```

Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.

If you also want to index a specific column with .loc, you must use a tuple like this:

```python
In [42]: df.loc[('bar', 'two'), 'A']
Out[42]: 0.8052440253863785
```

You don’t have to specify all levels of the MultiIndex. By passing only the first elements of the tuple, you can use “partial” indexing to get all elements with bar in the first level as follows:

```python
df.loc['bar']
```

This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).
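A small self-contained sketch of this partial selection (data made up for illustration); note that the selected level is dropped from the result:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                 names=['first', 'second'])
df = pd.DataFrame(np.arange(4.0), index=idx, columns=['A'])

# Selecting on the first level alone drops that level from the result,
# leaving only the 'second' level in the index.
sub = df.loc['bar']
```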

“Partial” slicing also works quite nicely.

```python
In [43]: df.loc['baz':'foo']
Out[43]:
                     A         B         C
first second
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
```

You can slice with a ‘range’ of values, by providing a slice of tuples.

```python
In [44]: df.loc[('baz', 'two'):('qux', 'one')]
Out[44]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [45]: df.loc[('baz', 'two'):'foo']
Out[45]:
                     A         B         C
first second
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
```

Passing a list of labels or tuples works similarly to reindexing:

```python
In [46]: df.loc[[('bar', 'two'), ('qux', 'one')]]
Out[46]:
                     A         B         C
first second
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466
```

::: tip Note

It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).

:::

Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:

```python
In [47]: s = pd.Series([1, 2, 3, 4, 5, 6],
   ....:               index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]))

In [48]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
Out[48]:
A  c    1
B  d    5
dtype: int64

In [49]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
Out[49]:
A  c    1
   d    2
B  c    4
   d    5
dtype: int64
```

### Using slicers

You can slice a MultiIndex by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels, they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.

::: danger Warning

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both axes, rather than into say the MultiIndex for the rows.

You should do this:

```python
df.loc[(slice('A1', 'A3'), ...), :]  # noqa: E999
```

You should not do this:

```python
df.loc[(slice('A1', 'A3'), ...)]  # noqa: E999
```

:::

```python
In [50]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....:

In [51]: miindex = pd.MultiIndex.from_product([mklbl('A', 4),
   ....:                                       mklbl('B', 2),
   ....:                                       mklbl('C', 4),
   ....:                                       mklbl('D', 2)])

In [52]: micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
   ....:                                        ('b', 'foo'), ('b', 'bah')],
   ....:                                       names=['lvl0', 'lvl1'])

In [53]: dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
   ....:                       .reshape((len(miindex), len(micolumns))),
   ....:                     index=miindex,
   ....:                     columns=micolumns).sort_index().sort_index(axis=1)

In [54]: dfmi
Out[54]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]
```

Basic MultiIndex slicing using slices, lists, and labels.

```python
In [55]: dfmi.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]
Out[55]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]
```

You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).

```python
In [56]: idx = pd.IndexSlice

In [57]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[57]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]
```

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

```python
In [58]: dfmi.loc['A1', (slice(None), 'foo')]
Out[58]:
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [59]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[59]:
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]
```

Using a boolean indexer, you can provide selection related to the values.

```python
In [60]: mask = dfmi[('a', 'foo')] > 200

In [61]: dfmi.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]
Out[61]:
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254
```

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

```python
In [62]: dfmi.loc(axis=0)[:, :, ['C1', 'C3']]
Out[62]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]
```

Furthermore, you can set the values using the following methods.

```python
In [63]: df2 = dfmi.copy()

In [64]: df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10

In [65]: df2
Out[65]:
lvl0           a         b
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]
```

You can also use an alignable object as the right-hand side.

```python
In [66]: df2 = dfmi.copy()

In [67]: df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000

In [68]: df2
Out[68]:
lvl0              a               b
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]
```

### Cross-section

The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

```python
In [69]: df
Out[69]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [70]: df.xs('one', level='second')
Out[70]:
              A         B         C
first
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466
```

```python
# using the slicers
In [71]: df.loc[(slice(None), 'one'), :]
Out[71]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466
```

You can also select on the columns with xs by providing the axis argument.

```python
In [72]: df = df.T

In [73]: df.xs('one', level='second', axis=1)
Out[73]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466
```

```python
# using the slicers
In [74]: df.loc[:, (slice(None), 'one')]
Out[74]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466
```

xs also allows selection with multiple keys.

```python
In [75]: df.xs(('one', 'bar'), level=('second', 'first'), axis=1)
Out[75]:
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681
```

```python
# using the slicers
In [76]: df.loc[:, ('bar', 'one')]
Out[76]:
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64
```

You can pass drop_level=False to xs to retain the level that was selected.

```python
In [77]: df.xs('one', level='second', axis=1, drop_level=False)
Out[77]:
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466
```

Compare the above with the result using drop_level=True (the default value).

```python
In [78]: df.xs('one', level='second', axis=1, drop_level=True)
Out[78]:
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466
```

### Advanced reindexing and alignment

Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:

```python
In [79]: midx = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']],
   ....:                      codes=[[1, 1, 0, 0], [1, 0, 1, 0]])

In [80]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [81]: df
Out[81]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [82]: df2 = df.mean(level=0)

In [83]: df2
Out[83]:
             0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [84]: df2.reindex(df.index, level=0)
Out[84]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

# aligning
In [85]: df_aligned, df2_aligned = df.align(df2, level=0)

In [86]: df_aligned
Out[86]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [87]: df2_aligned
Out[87]:
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416
```

### Swapping levels with swaplevel

The swaplevel() method can switch the order of two levels:

```python
In [88]: df[:5]
Out[88]:
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [89]: df[:5].swaplevel(0, 1, axis=0)
Out[89]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
```

### Reordering levels with reorder_levels

The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:

```python
In [90]: df[:5].reorder_levels([1, 0], axis=0)
Out[90]:
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
```

### Renaming names of an Index or MultiIndex

The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.

```python
In [91]: df.rename(columns={0: "col0", 1: "col1"})
Out[91]:
            col0      col1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520
```

This method can also be used to rename specific labels of the main index of the DataFrame.

```python
In [92]: df.rename(index={"one": "two", "y": "z"})
Out[92]:
               0         1
two  z  1.519970 -0.493662
     x  0.600178  0.274230
zero z  0.132885 -0.023688
     x  2.410179  1.450520
```

The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.

```python
In [93]: df.rename_axis(index=['abc', 'def'])
Out[93]:
                0         1
abc  def
one  y   1.519970 -0.493662
     x   0.600178  0.274230
zero y   0.132885 -0.023688
     x   2.410179  1.450520
```
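To see why named levels pay off with reset_index, here is a minimal sketch (made-up data): after renaming the axis, the former index levels become ordinary columns carrying those names:

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([['one', 'zero'], ['x', 'y']])
df = pd.DataFrame(np.arange(4.0), index=midx, columns=['val'])

# Name the levels, then move them out of the index into ordinary columns.
flat = df.rename_axis(index=['abc', 'def']).reset_index()
```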

Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument will change the name of that index.

```python
In [94]: df.rename_axis(columns="Cols").columns
Out[94]: RangeIndex(start=0, stop=2, step=1, name='Cols')
```

Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.
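A sketch of the mapping-function form (data made up for illustration): rename applies a callable to each label, while rename_axis with a scalar sets the axis name itself:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2]}, index=['a', 'b'])

# rename maps each existing label to a new one via the callable.
upper = df.rename(index=str.upper)

# rename_axis with a scalar sets the *name* of the axis, not its labels.
named = df.rename_axis('letters')
```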

## Sorting a MultiIndex

For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().

```python
In [95]: import random

In [96]: random.shuffle(tuples)

In [97]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [98]: s
Out[98]:
baz  one    0.206053
foo  two   -0.251905
qux  one   -2.213588
foo  one    1.063327
bar  two    1.266143
baz  two    0.299368
bar  one   -0.863838
qux  two    0.408204
dtype: float64

In [99]: s.sort_index()
Out[99]:
bar  one   -0.863838
     two    1.266143
baz  one    0.206053
     two    0.299368
foo  one    1.063327
     two   -0.251905
qux  one   -2.213588
     two    0.408204
dtype: float64

In [100]: s.sort_index(level=0)
Out[100]:
bar  one   -0.863838
     two    1.266143
baz  one    0.206053
     two    0.299368
foo  one    1.063327
     two   -0.251905
qux  one   -2.213588
     two    0.408204
dtype: float64

In [101]: s.sort_index(level=1)
Out[101]:
bar  one   -0.863838
baz  one    0.206053
foo  one    1.063327
qux  one   -2.213588
bar  two    1.266143
baz  two    0.299368
foo  two   -0.251905
qux  two    0.408204
dtype: float64
```

You may also pass a level name to sort_index if the MultiIndex levels are named.

```python
In [102]: s.index.set_names(['L1', 'L2'], inplace=True)

In [103]: s.sort_index(level='L1')
Out[103]:
L1   L2
bar  one   -0.863838
     two    1.266143
baz  one    0.206053
     two    0.299368
foo  one    1.063327
     two   -0.251905
qux  one   -2.213588
     two    0.408204
dtype: float64

In [104]: s.sort_index(level='L2')
Out[104]:
L1   L2
bar  one   -0.863838
baz  one    0.206053
foo  one    1.063327
qux  one   -2.213588
bar  two    1.266143
baz  two    0.299368
foo  two   -0.251905
qux  two    0.408204
dtype: float64
```

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

```python
In [105]: df.T.sort_index(level=1, axis=1)
Out[105]:
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688
```

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

```python
In [106]: dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
   .....:                     'joe': ['x', 'x', 'z', 'y'],
   .....:                     'jolie': np.random.rand(4)})

In [107]: dfm = dfm.set_index(['jim', 'joe'])

In [108]: dfm
Out[108]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968
```

```python
In [4]: dfm.loc[(1, 'z')]
PerformanceWarning: indexing past lexsort depth may impact performance.

Out[4]:
           jolie
jim joe
1   z    0.64094
```

Furthermore, if you try to index something that is not fully lexsorted, this can raise:

```python
In [5]: dfm.loc[(0, 'y'):(1, 'z')]
UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
```

The is_lexsorted() method on a MultiIndex shows if the index is sorted, and the lexsort_depth property returns the sort depth:

```python
In [109]: dfm.index.is_lexsorted()
Out[109]: False

In [110]: dfm.index.lexsort_depth
Out[110]: 1
```

```python
In [111]: dfm = dfm.sort_index()

In [112]: dfm
Out[112]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [113]: dfm.index.is_lexsorted()
Out[113]: True

In [114]: dfm.index.lexsort_depth
Out[114]: 2
```

And now selection works as expected.

```python
In [115]: dfm.loc[(0, 'y'):(1, 'z')]
Out[115]:
            jolie
jim joe
1   y    0.110968
    z    0.537020
```

## Take methods

Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method that retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.

```python
In [116]: index = pd.Index(np.random.randint(0, 1000, 10))

In [117]: index
Out[117]: Int64Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [118]: positions = [0, 9, 3]

In [119]: index[positions]
Out[119]: Int64Index([214, 329, 567], dtype='int64')

In [120]: index.take(positions)
Out[120]: Int64Index([214, 329, 567], dtype='int64')

In [121]: ser = pd.Series(np.random.randn(10))

In [122]: ser.iloc[positions]
Out[122]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [123]: ser.take(positions)
Out[123]:
0   -0.179666
9    1.824375
3    0.392149
dtype: float64
```

For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.

  1. In [124]: frm = pd.DataFrame(np.random.randn(5, 3))
  2. In [125]: frm.take([1, 4, 3])
  3. Out[125]:
  4. 0 1 2
  5. 1 -1.237881 0.106854 -1.276829
  6. 4 0.629675 -1.425966 1.857704
  7. 3 0.979542 -1.633678 0.615855
  8. In [126]: frm.take([0, 2], axis=1)
  9. Out[126]:
  10. 0 2
  11. 0 0.595974 0.601544
  12. 1 -1.237881 -1.276829
  13. 2 -0.767101 1.499591
  14. 3 0.979542 0.615855
  15. 4 0.629675 1.857704
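A deterministic sketch of take() on a DataFrame (hypothetical values; note the negative position counting from the end):

```python
import numpy as np
import pandas as pd

frm = pd.DataFrame(np.arange(15).reshape(5, 3))

# Row positions; -1 means the last row (position 4 here)
rows = frm.take([1, 4, -1])

# Column positions via axis=1
cols = frm.take([0, 2], axis=1)
```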

It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.

  1. In [127]: arr = np.random.randn(10)
  2. In [128]: arr.take([False, False, True, True])
  3. Out[128]: array([-1.1935, -1.1935, 0.6775, 0.6775])
  4. In [129]: arr[[0, 1]]
  5. Out[129]: array([-1.1935, 0.6775])
  6. In [130]: ser = pd.Series(np.random.randn(10))
  7. In [131]: ser.take([False, False, True, True])
  8. Out[131]:
  9. 0 0.233141
  10. 0 0.233141
  11. 1 -0.223540
  12. 1 -0.223540
  13. dtype: float64
  14. In [132]: ser.iloc[[0, 1]]
  15. Out[132]:
  16. 0 0.233141
  17. 1 -0.223540
  18. dtype: float64
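Since take() mishandles boolean masks, boolean selection should go through [] (or .loc/.iloc) instead; a minimal sketch:

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40])
mask = [False, False, True, True]

# Boolean masks belong in [], not take(): take() would read the booleans
# as integer positions 0 and 1
selected = ser[mask]
```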

Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can be a good deal faster than fancy indexing.

  1. In [133]: arr = np.random.randn(10000, 5)
  2. In [134]: indexer = np.arange(10000)
  3. In [135]: random.shuffle(indexer)
  4. In [136]: %timeit arr[indexer]
  5. .....: %timeit arr.take(indexer, axis=0)
  6. .....:
  7. 152 us +- 988 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)
  8. 41.7 us +- 204 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)
  1. In [137]: ser = pd.Series(arr[:, 0])
  2. In [138]: %timeit ser.iloc[indexer]
  3. .....: %timeit ser.take(indexer)
  4. .....:
  5. 120 us +- 1.05 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)
  6. 110 us +- 795 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)

Index types

We have discussed MultiIndex in the previous sections pretty extensively. Documentation about DatetimeIndex and PeriodIndex is shown here, and documentation about TimedeltaIndex is found here.

In the following sub-sections we will highlight some other index types.

CategoricalIndex

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

  1. In [139]: from pandas.api.types import CategoricalDtype
  2. In [140]: df = pd.DataFrame({'A': np.arange(6),
  3. .....: 'B': list('aabbca')})
  4. .....:
  5. In [141]: df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
  6. In [142]: df
  7. Out[142]:
  8. A B
  9. 0 0 a
  10. 1 1 a
  11. 2 2 b
  12. 3 3 b
  13. 4 4 c
  14. 5 5 a
  15. In [143]: df.dtypes
  16. Out[143]:
  17. A int64
  18. B category
  19. dtype: object
  20. In [144]: df.B.cat.categories
  21. Out[144]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex.

  1. In [145]: df2 = df.set_index('B')
  2. In [146]: df2.index
  3. Out[146]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the category or the operation will raise a KeyError.

  1. In [147]: df2.loc['a']
  2. Out[147]:
  3. A
  4. B
  5. a 0
  6. a 1
  7. a 5

The CategoricalIndex is preserved after indexing:

  1. In [148]: df2.loc['a'].index
  2. Out[148]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

  1. In [149]: df2.sort_index()
  2. Out[149]:
  3. A
  4. B
  5. c 4
  6. a 0
  7. a 1
  8. a 5
  9. b 2
  10. b 3
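The CategoricalIndex behaviors above can be reproduced end-to-end in a short, self-contained sketch (same toy data as the examples):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': range(6), 'B': list('aabbca')})
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
df2 = df.set_index('B')

# Selection with a duplicated label returns all matching rows
sub = df2.loc['a']

# sort_index() follows category order: c, then a, then b
by_cat = df2.sort_index()
```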

Groupby operations on the index will preserve the index nature as well.

  1. In [150]: df2.groupby(level=0).sum()
  2. Out[150]:
  3. A
  4. B
  5. c 4
  6. a 6
  7. b 5
  8. In [151]: df2.groupby(level=0).sum().index
  9. Out[151]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.

  1. In [152]: df2.reindex(['a', 'e'])
  2. Out[152]:
  3. A
  4. B
  5. a 0.0
  6. a 1.0
  7. a 5.0
  8. e NaN
  9. In [153]: df2.reindex(['a', 'e']).index
  10. Out[153]: Index(['a', 'a', 'a', 'e'], dtype='object', name='B')
  11. In [154]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
  12. Out[154]:
  13. A
  14. B
  15. a 0.0
  16. a 1.0
  17. a 5.0
  18. e NaN
  19. In [155]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
  20. Out[155]: CategoricalIndex(['a', 'a', 'a', 'e'], categories=['a', 'b', 'c', 'd', 'e'], ordered=False, name='B', dtype='category')
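The indexer-type rule can be seen with a small toy frame (unique labels are used here so the sketch also stays valid on newer pandas versions, which disallow reindexing a duplicated axis):

```python
import pandas as pd

dfc = pd.DataFrame({'A': [0, 2, 4]},
                   index=pd.CategoricalIndex(list('abc'),
                                             categories=list('cab'),
                                             name='B'))

# A plain list gives back a plain Index...
plain = dfc.reindex(['a', 'e']).index

# ...while a Categorical indexer gives back a CategoricalIndex
cat = dfc.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
```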

::: danger Warning

Reshaping and comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.

  1. In [9]: df3 = pd.DataFrame({'A': np.arange(6), 'B': pd.Series(list('aabbca')).astype('category')})
  2. In [11]: df3 = df3.set_index('B')
  3. In [11]: df3.index
  4. Out[11]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['a', 'b', 'c'], ordered=False, name='B', dtype='category')
  5. In [12]: pd.concat([df2, df3])
  6. TypeError: categories must match existing categories when appending

:::

Int64Index and RangeIndex

::: danger Warning

Indexing on an integer-based Index with floats has been clarified in 0.18.0, for a summary of the changes, see here.

:::

Int64Index is a fundamental index type in pandas. It is an immutable array implementing an ordered, sliceable set. Prior to 0.18.0, Int64Index provided the default index for all NDFrame objects.

RangeIndex is a sub-class of Int64Index added in version 0.18.0, now providing the default index for all NDFrame objects. RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. These are analogous to Python range types.

Float64Index

A Float64Index is automatically created when passing floating, or mixed integer-and-floating, values in index creation. This enables a pure label-based slicing paradigm that makes [] and .loc for scalar indexing and slicing work exactly the same.

  1. In [156]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])
  2. In [157]: indexf
  3. Out[157]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')
  4. In [158]: sf = pd.Series(range(5), index=indexf)
  5. In [159]: sf
  6. Out[159]:
  7. 1.5 0
  8. 2.0 1
  9. 3.0 2
  10. 4.5 3
  11. 5.0 4
  12. dtype: int64

Scalar selection for [] and .loc will always be label based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0).

  1. In [160]: sf[3]
  2. Out[160]: 2
  3. In [161]: sf[3.0]
  4. Out[161]: 2
  5. In [162]: sf.loc[3]
  6. Out[162]: 2
  7. In [163]: sf.loc[3.0]
  8. Out[163]: 2

The only positional indexing is via iloc.

  1. In [164]: sf.iloc[3]
  2. Out[164]: 3

A scalar index that is not found will raise a KeyError. Slicing is primarily on the values of the index when using [] and .loc, and is always positional when using iloc. The exception is when the slice is boolean, in which case it will always be positional.

  1. In [165]: sf[2:4]
  2. Out[165]:
  3. 2.0 1
  4. 3.0 2
  5. dtype: int64
  6. In [166]: sf.loc[2:4]
  7. Out[166]:
  8. 2.0 1
  9. 3.0 2
  10. dtype: int64
  11. In [167]: sf.iloc[2:4]
  12. Out[167]:
  13. 3.0 2
  14. 4.5 3
  15. dtype: int64
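The label-versus-position distinction above, condensed into a sketch:

```python
import pandas as pd

sf = pd.Series(range(5), index=[1.5, 2.0, 3.0, 4.5, 5.0])

label_slice = sf.loc[2:4]   # by value: labels 2.0 and 3.0
pos_slice = sf.iloc[2:4]    # by position: labels 3.0 and 4.5
scalar = sf.loc[3]          # integer 3 matches the float label 3.0
```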

In float indexes, slicing using floats is allowed.

  1. In [168]: sf[2.1:4.6]
  2. Out[168]:
  3. 3.0 2
  4. 4.5 3
  5. dtype: int64
  6. In [169]: sf.loc[2.1:4.6]
  7. Out[169]:
  8. 3.0 2
  9. 4.5 3
  10. dtype: int64

In non-float indexes, slicing using floats will raise a TypeError.

  1. In [1]: pd.Series(range(5))[3.5]
  2. TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)
  3. In [1]: pd.Series(range(5))[3.5:4.5]
  4. TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)

::: danger Warning

Support for using a scalar float indexer with .iloc was removed in 0.18.0, so the following will raise a TypeError:

  1. In [3]: pd.Series(range(5)).iloc[3.0]
  2. TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>

:::

Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could, for example, be millisecond offsets.

  1. In [170]: dfir = pd.concat([pd.DataFrame(np.random.randn(5, 2),
  2. .....: index=np.arange(5) * 250.0,
  3. .....: columns=list('AB')),
  4. .....: pd.DataFrame(np.random.randn(6, 2),
  5. .....: index=np.arange(4, 10) * 250.1,
  6. .....: columns=list('AB'))])
  7. .....:
  8. In [171]: dfir
  9. Out[171]:
  10. A B
  11. 0.0 -0.435772 -1.188928
  12. 250.0 -0.808286 -0.284634
  13. 500.0 -1.815703 1.347213
  14. 750.0 -0.243487 0.514704
  15. 1000.0 1.162969 -0.287725
  16. 1000.4 -0.179734 0.993962
  17. 1250.5 -0.212673 0.909872
  18. 1500.6 -0.733333 -0.349893
  19. 1750.7 0.456434 -0.306735
  20. 2000.8 0.553396 0.166221
  21. 2250.9 -0.101684 -0.734907

Selection operations then will always work on a value basis, for all selection operators.

  1. In [172]: dfir[0:1000.4]
  2. Out[172]:
  3. A B
  4. 0.0 -0.435772 -1.188928
  5. 250.0 -0.808286 -0.284634
  6. 500.0 -1.815703 1.347213
  7. 750.0 -0.243487 0.514704
  8. 1000.0 1.162969 -0.287725
  9. 1000.4 -0.179734 0.993962
  10. In [173]: dfir.loc[0:1001, 'A']
  11. Out[173]:
  12. 0.0 -0.435772
  13. 250.0 -0.808286
  14. 500.0 -1.815703
  15. 750.0 -0.243487
  16. 1000.0 1.162969
  17. 1000.4 -0.179734
  18. Name: A, dtype: float64
  19. In [174]: dfir.loc[1000.4]
  20. Out[174]:
  21. A -0.179734
  22. B 0.993962
  23. Name: 1000.4, dtype: float64

You could retrieve the first 1 second (1000 ms) of data as such:

  1. In [175]: dfir[0:1000]
  2. Out[175]:
  3. A B
  4. 0.0 -0.435772 -1.188928
  5. 250.0 -0.808286 -0.284634
  6. 500.0 -1.815703 1.347213
  7. 750.0 -0.243487 0.514704
  8. 1000.0 1.162969 -0.287725

If you need integer based selection, you should use iloc:

  1. In [176]: dfir.iloc[0:5]
  2. Out[176]:
  3. A B
  4. 0.0 -0.435772 -1.188928
  5. 250.0 -0.808286 -0.284634
  6. 500.0 -1.815703 1.347213
  7. 750.0 -0.243487 0.514704
  8. 1000.0 1.162969 -0.287725
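A compact sketch of the millisecond-offset idea with deterministic values (the offsets are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical millisecond offsets as a float index
dfir = pd.DataFrame({'A': range(5)}, index=np.arange(5) * 250.0)

first_second = dfir[0:1000]   # value-based: includes the 1000.0 label
first_three = dfir.iloc[0:3]  # position-based: first three rows
```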

IntervalIndex

New in version 0.20.0.

IntervalIndex, together with its own dtype IntervalDtype and the Interval scalar type, allows first-class support in pandas for interval notation.

The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().

Indexing with an IntervalIndex

An IntervalIndex can be used in Series and in DataFrame as the index.

  1. In [177]: df = pd.DataFrame({'A': [1, 2, 3, 4]},
  2. .....: index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))
  3. .....:
  4. In [178]: df
  5. Out[178]:
  6. A
  7. (0, 1] 1
  8. (1, 2] 2
  9. (2, 3] 3
  10. (3, 4] 4

Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

  1. In [179]: df.loc[2]
  2. Out[179]:
  3. A 2
  4. Name: (1, 2], dtype: int64
  5. In [180]: df.loc[[2, 3]]
  6. Out[180]:
  7. A
  8. (1, 2] 2
  9. (2, 3] 3

If you select a label contained within an interval, this will also select the interval.

  1. In [181]: df.loc[2.5]
  2. Out[181]:
  3. A 3
  4. Name: (2, 3], dtype: int64
  5. In [182]: df.loc[[2.5, 3.5]]
  6. Out[182]:
  7. A
  8. (2, 3] 3
  9. (3, 4] 4

Selecting using an Interval will only return exact matches (starting from pandas 0.25.0).

  1. In [183]: df.loc[pd.Interval(1, 2)]
  2. Out[183]:
  3. A 2
  4. Name: (1, 2], dtype: int64

Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.

  1. In [7]: df.loc[pd.Interval(0.5, 2.5)]
  2. ---------------------------------------------------------------------------
  3. KeyError: Interval(0.5, 2.5, closed='right')

Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.

  1. In [184]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))
  2. In [185]: idxr
  3. Out[185]: array([ True, True, True, False])
  4. In [186]: df[idxr]
  5. Out[186]:
  6. A
  7. (0, 1] 1
  8. (1, 2] 2
  9. (2, 3] 3

Binning data with cut and qcut

cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.

  1. In [187]: c = pd.cut(range(4), bins=2)
  2. In [188]: c
  3. Out[188]:
  4. [(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
  5. Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
  6. In [189]: c.categories
  7. Out[189]:
  8. IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
  9. closed='right',
  10. dtype='interval[float64]')

cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.

  1. In [190]: pd.cut([0, 3, 5, 1], bins=c.categories)
  2. Out[190]:
  3. [(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
  4. Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

Any value which falls outside all bins will be assigned a NaN value.
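The reuse-the-bins idiom as a runnable sketch (same toy data as above):

```python
import pandas as pd

# The first call establishes the bins
c = pd.cut(range(4), bins=2)

# Later calls reuse them; 5 falls outside every bin and becomes NaN
new = pd.cut([0, 3, 5, 1], bins=c.categories)
```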

Generating ranges of intervals

If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and calendar day for datetime-like intervals:

  1. In [191]: pd.interval_range(start=0, end=5)
  2. Out[191]:
  3. IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
  4. closed='right',
  5. dtype='interval[int64]')
  6. In [192]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4)
  7. Out[192]:
  8. IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04], (2017-01-04, 2017-01-05]],
  9. closed='right',
  10. dtype='interval[datetime64[ns]]')
  11. In [193]: pd.interval_range(end=pd.Timedelta('3 days'), periods=3)
  12. Out[193]:
  13. IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]],
  14. closed='right',
  15. dtype='interval[timedelta64[ns]]')

The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:

  1. In [194]: pd.interval_range(start=0, periods=5, freq=1.5)
  2. Out[194]:
  3. IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]],
  4. closed='right',
  5. dtype='interval[float64]')
  6. In [195]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4, freq='W')
  7. Out[195]:
  8. IntervalIndex([(2017-01-01, 2017-01-08], (2017-01-08, 2017-01-15], (2017-01-15, 2017-01-22], (2017-01-22, 2017-01-29]],
  9. closed='right',
  10. dtype='interval[datetime64[ns]]')
  11. In [196]: pd.interval_range(start=pd.Timedelta('0 days'), periods=3, freq='9H')
  12. Out[196]:
  13. IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]],
  14. closed='right',
  15. dtype='interval[timedelta64[ns]]')

Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.

  1. In [197]: pd.interval_range(start=0, end=4, closed='both')
  2. Out[197]:
  3. IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]],
  4. closed='both',
  5. dtype='interval[int64]')
  6. In [198]: pd.interval_range(start=0, end=4, closed='neither')
  7. Out[198]:
  8. IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)],
  9. closed='neither',
  10. dtype='interval[int64]')

New in version 0.23.0.

Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:

  1. In [199]: pd.interval_range(start=0, end=6, periods=4)
  2. Out[199]:
  3. IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]],
  4. closed='right',
  5. dtype='interval[float64]')
  6. In [200]: pd.interval_range(pd.Timestamp('2018-01-01'),
  7. .....: pd.Timestamp('2018-02-28'), periods=3)
  8. .....:
  9. Out[200]:
  10. IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]],
  11. closed='right',
  12. dtype='interval[datetime64[ns]]')
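A couple of the interval_range() signatures above, condensed:

```python
import pandas as pd

# start + periods + a float freq
idx = pd.interval_range(start=0, periods=4, freq=1.5)

# start + end + periods: evenly spaced between the endpoints
idx2 = pd.interval_range(start=0, end=6, periods=4)
```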

Miscellaneous indexing FAQ

Integer indexing

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .loc. The following code will generate exceptions:

  1. In [201]: s = pd.Series(range(5))
  2. In [202]: s[-1]
  3. ---------------------------------------------------------------------------
  4. KeyError Traceback (most recent call last)
  5. <ipython-input-202-76c3dce40054> in <module>
  6. ----> 1 s[-1]
  7. /pandas/pandas/core/series.py in __getitem__(self, key)
  8. 1062 key = com.apply_if_callable(key, self)
  9. 1063 try:
  10. -> 1064 result = self.index.get_value(self, key)
  11. 1065
  12. 1066 if not is_scalar(result):
  13. /pandas/pandas/core/indexes/base.py in get_value(self, series, key)
  14. 4721 k = self._convert_scalar_indexer(k, kind="getitem")
  15. 4722 try:
  16. -> 4723 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  17. 4724 except KeyError as e1:
  18. 4725 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
  19. /pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
  20. /pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
  21. /pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
  22. /pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
  23. /pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
  24. KeyError: -1
  25. In [203]: df = pd.DataFrame(np.random.randn(5, 4))
  26. In [204]: df
  27. Out[204]:
  28. 0 1 2 3
  29. 0 -0.130121 -0.476046 0.759104 0.213379
  30. 1 -0.082641 0.448008 0.656420 -1.051443
  31. 2 0.594956 -0.151360 -0.069303 1.221431
  32. 3 -0.182832 0.791235 0.042745 2.069775
  33. 4 1.446552 0.019814 -1.389212 -0.702312
  34. In [205]: df.loc[-2:]
  35. Out[205]:
  36. 0 1 2 3
  37. 0 -0.130121 -0.476046 0.759104 0.213379
  38. 1 -0.082641 0.448008 0.656420 -1.051443
  39. 2 0.594956 -0.151360 -0.069303 1.221431
  40. 3 -0.182832 0.791235 0.042745 2.069775
  41. 4 1.446552 0.019814 -1.389212 -0.702312

This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
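The label-only rule for integer indexes, in miniature:

```python
import pandas as pd

s = pd.Series(range(5))

# .loc is strictly label-based: -1 is not a label here
try:
    s.loc[-1]
    raised = False
except KeyError:
    raised = True

# Positional access to the last element goes through .iloc
last = s.iloc[-1]
```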

Non-monotonic indexes require exact matches

If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing and is_monotonic_decreasing attributes.

  1. In [206]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=['data'], data=list(range(5)))
  2. In [207]: df.index.is_monotonic_increasing
  3. Out[207]: True
  4. # no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
  5. In [208]: df.loc[0:4, :]
  6. Out[208]:
  7. data
  8. 2 0
  9. 3 1
  10. 3 2
  11. 4 3
  12. # slice bounds are outside the index, so an empty DataFrame is returned
  13. In [209]: df.loc[13:15, :]
  14. Out[209]:
  15. Empty DataFrame
  16. Columns: [data]
  17. Index: []

On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.

  1. In [210]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5],
  2. .....: columns=['data'], data=list(range(6)))
  3. .....:
  4. In [211]: df.index.is_monotonic_increasing
  5. Out[211]: False
  6. # OK because 2 and 4 are in the index
  7. In [212]: df.loc[2:4, :]
  8. Out[212]:
  9. data
  10. 2 0
  11. 3 1
  12. 1 2
  13. 4 3
  1. # 0 is not in the index
  2. In [9]: df.loc[0:4, :]
  3. KeyError: 0
  4. # 3 is not a unique label
  5. In [11]: df.loc[2:3, :]
  6. KeyError: 'Cannot get right slice bound for non-unique label: 3'
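The exact-match requirement for non-monotonic indexes, as a sketch:

```python
import pandas as pd

df = pd.DataFrame({'data': range(6)}, index=[2, 3, 1, 4, 3, 5])

# Both bounds are unique members of the index, so the slice works
ok = df.loc[2:4]

# 0 is absent, so slicing from it raises KeyError on this unsorted index
try:
    df.loc[0:4]
    raised = False
except KeyError:
    raised = True
```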

Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with the is_unique attribute.

  1. In [213]: weakly_monotonic = pd.Index(['a', 'b', 'c', 'c'])
  2. In [214]: weakly_monotonic
  3. Out[214]: Index(['a', 'b', 'c', 'c'], dtype='object')
  4. In [215]: weakly_monotonic.is_monotonic_increasing
  5. Out[215]: True
  6. In [216]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
  7. Out[216]: False

Endpoints are inclusive

Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index. For example, consider the following Series:

  1. In [217]: s = pd.Series(np.random.randn(6), index=list('abcdef'))
  2. In [218]: s
  3. Out[218]:
  4. a 0.301379
  5. b 1.240445
  6. c -0.846068
  7. d -0.043312
  8. e -1.658747
  9. f -0.819549
  10. dtype: float64

Suppose we wished to slice from c to e, using integers this would be accomplished as such:

  1. In [219]: s[2:5]
  2. Out[219]:
  3. c -0.846068
  4. d -0.043312
  5. e -1.658747
  6. dtype: float64

However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

  1. s.loc['c':'e' + 1]

A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:

  1. In [220]: s.loc['c':'e']
  2. Out[220]:
  3. c -0.846068
  4. d -0.043312
  5. e -1.658747
  6. dtype: float64

This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.
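The inclusive-endpoint contrast in one sketch:

```python
import pandas as pd

s = pd.Series(range(6), index=list('abcdef'))

by_label = s.loc['c':'e']   # label slice: both endpoints included
by_pos = s[2:5]             # positional slice: end-exclusive, same rows
```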

Indexing potentially changes underlying Series dtype

Different indexing operations can potentially change the dtype of a Series.

  1. In [221]: series1 = pd.Series([1, 2, 3])
  2. In [222]: series1.dtype
  3. Out[222]: dtype('int64')
  4. In [223]: res = series1.reindex([0, 4])
  5. In [224]: res.dtype
  6. Out[224]: dtype('float64')
  7. In [225]: res
  8. Out[225]:
  9. 0 1.0
  10. 4 NaN
  11. dtype: float64
  1. In [226]: series2 = pd.Series([True])
  2. In [227]: series2.dtype
  3. Out[227]: dtype('bool')
  4. In [228]: res = series2.reindex_like(series1)
  5. In [229]: res.dtype
  6. Out[229]: dtype('O')
  7. In [230]: res
  8. Out[230]:
  9. 0 True
  10. 1 NaN
  11. 2 NaN
  12. dtype: object

This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.
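A minimal sketch of the upcast:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])   # int64

# The missing label 4 becomes NaN, forcing an upcast to float64
res = s.reindex([0, 4])
```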

See this old issue for a more detailed discussion.