MultiIndex / advanced indexing
This section covers indexing with a MultiIndex and other advanced indexing features.
See Indexing and Selecting Data for general indexing documentation.
::: danger Warning
Whether a copy or a reference is returned for a setting operation may
depend on the context. This is sometimes called chained assignment and
should be avoided. See Returning a View versus Copy.
:::
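As a minimal sketch of what the warning means (using a small throwaway frame, not one from this section), compare chained assignment, which may operate on a temporary copy, with a single `.loc` call:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Chained assignment: the first indexing call may return a copy,
# so the assignment can silently fail to modify df.
# df[df['a'] > 1]['b'] = 0          # avoid this

# A single .loc call sets the values on df itself.
df.loc[df['a'] > 1, 'b'] = 0
print(df['b'].tolist())
```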
See the cookbook for some advanced strategies.
Hierarchical indexing (MultiIndex)
Hierarchical / Multi-level indexing is very exciting as it opens the door to some
quite sophisticated data analysis and manipulation, especially for working with
higher dimensional data. In essence, it enables you to store and manipulate
data with an arbitrary number of dimensions in lower dimensional data
structures like Series (1d) and DataFrame (2d).
In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.
See the cookbook for some advanced strategies.
Changed in version 0.24.0: MultiIndex.labels has been renamed to MultiIndex.codes
and MultiIndex.set_labels to MultiIndex.set_codes.
Creating a MultiIndex (hierarchical index) object
The MultiIndex object is the hierarchical analogue of the standard
Index object which typically stores the axis labels in pandas objects. You
can think of MultiIndex as an array of tuples where each tuple is unique. A
MultiIndex can be created from a list of arrays (using
MultiIndex.from_arrays()), an array of tuples (using
MultiIndex.from_tuples()), a crossed set of iterables (using
MultiIndex.from_product()), or a DataFrame (using
MultiIndex.from_frame()). The Index constructor will attempt to return
a MultiIndex when it is passed a list of tuples. The following examples
demonstrate different ways to initialize MultiIndexes.
```python
In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ...:

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]: 
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [5]: index
Out[5]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]: 
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64
```
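The list of constructors above also mentions MultiIndex.from_arrays(), which builds the same kind of index directly from the list of arrays, without zipping into tuples first. A short sketch (using a shortened version of the arrays above):

```python
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]

# from_arrays builds the MultiIndex directly from parallel arrays
mi = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
print(list(mi))   # [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two')]
```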
When you want every pairing of the elements in two iterables, it can be easier
to use the MultiIndex.from_product() method:
```python
In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second'])
Out[9]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
```
You can also construct a MultiIndex from a DataFrame directly, using
the method MultiIndex.from_frame(). This is a complementary method to
MultiIndex.to_frame().
New in version 0.24.0.
```python
In [10]: df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
   ....:                    ['foo', 'one'], ['foo', 'two']],
   ....:                   columns=['first', 'second'])
   ....:

In [11]: pd.MultiIndex.from_frame(df)
Out[11]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
```
As a convenience, you can pass a list of arrays directly into Series or
DataFrame to construct a MultiIndex automatically:
```python
In [12]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
   ....:           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
   ....:

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]: 
bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]: 
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
```
All of the MultiIndex constructors accept a names argument which stores
string names for the levels themselves. If no names are provided, None will
be assigned:
```python
In [17]: df.index.names
Out[17]: FrozenList([None, None])
```
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
```python
In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)

In [19]: df
Out[19]: 
first        bar                 baz                 foo                 qux          
second       one       two       one       two       one       two       one       two
A       0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003 -0.827317 -0.076467 -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.974466 -2.006747

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]: 
first              bar                 baz                 foo          
second             one       two       one       two       one       two
first second                                                            
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441
```
We’ve “sparsified” the higher levels of the indexes to make the console output a
bit easier on the eyes. Note that how the index is displayed can be controlled using the
display.multi_sparse option in pandas.set_option():
```python
In [21]: with pd.option_context('display.multi_sparse', False):
   ....:     df
   ....: 
```
It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:
```python
In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]: 
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64
```
The reason that the MultiIndex matters is that it can allow you to do
grouping, selection, and reshaping operations as we will describe below and in
subsequent areas of the documentation. As you will see in later sections, you
can find yourself working with hierarchically-indexed data without creating a
MultiIndex explicitly yourself. However, when loading data from a file, you
may wish to generate your own MultiIndex when preparing the data set.
Reconstructing the level labels
The method get_level_values() will return a vector of the labels for each
location at a particular level:
```python
In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values('second')
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
```
Basic indexing on axis with MultiIndex
One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:
```python
In [25]: df['bar']
Out[25]: 
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df['bar', 'one']
Out[26]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df['bar']['one']
Out[27]: 
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s['qux']
Out[28]: 
one   -1.039575
two    0.271860
dtype: float64
```
See Cross-section with hierarchical index for how to select on a deeper level.
Defined levels
The MultiIndex keeps all the defined levels of an index, even
if they are not actually used. When slicing an index, you may notice this.
For example:
```python
In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[['foo', 'qux']].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
```
This is done to avoid a recomputation of the levels in order to make slicing
highly performant. If you want to see only the used levels, you can use the
get_level_values() method.
```python
In [31]: df[['foo', 'qux']].columns.to_numpy()
Out[31]: 
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

# for a specific level
In [32]: df[['foo', 'qux']].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
```
To reconstruct the MultiIndex with only the used levels, the
remove_unused_levels() method may be used.
New in version 0.20.0.
```python
In [33]: new_mi = df[['foo', 'qux']].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
```
Data alignment and using reindex
Operations between differently-indexed objects having MultiIndex on the
axes will work as you expect; data alignment will work the same as an Index of
tuples:
```python
In [35]: s + s[:-2]
Out[35]: 
bar  one   -1.723698
     two   -4.209138
baz  one   -0.989859
     two    2.143608
foo  one    1.443110
     two   -1.413542
qux  one         NaN
     two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]: 
bar  one   -1.723698
     two         NaN
baz  one   -0.989859
     two         NaN
foo  one    1.443110
     two         NaN
qux  one   -2.079150
     two         NaN
dtype: float64
```
The reindex() method of Series/DataFrames can be
called with another MultiIndex, or even a list or array of tuples:
```python
In [37]: s.reindex(index[:3])
Out[37]: 
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')])
Out[38]: 
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64
```
Advanced indexing with hierarchical index
Syntactically integrating MultiIndex in advanced indexing with .loc is a
bit challenging, but we’ve made every effort to do so. In general, MultiIndex
keys take the form of tuples. For example, the following works as you would expect:
```python
In [39]: df = df.T

In [40]: df
Out[40]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[('bar', 'two')]
Out[41]: 
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64
```
Note that df.loc['bar', 'two'] would also work in this example, but this shorthand
notation can lead to ambiguity in general.
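As an illustrative sketch (the small frames here are made up for the comparison): with a single-level index, the second element of `df.loc[row, col]` is read as a column label, whereas with a MultiIndex the same un-parenthesized pair could equally be read as a two-level row key, which is what makes the explicit tuple form clearer:

```python
import pandas as pd

# On a plain (single-level) index, the second element of
# df.loc[row, col] is interpreted as the *column* label.
plain = pd.DataFrame({'two': [1, 2]}, index=['bar', 'baz'])
print(plain.loc['bar', 'two'])       # scalar from column 'two'

# On a MultiIndex, ('bar', 'two') is one row key, so the tuple
# form states the intent unambiguously.
mi = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']])
dfmi = pd.DataFrame({'A': range(4)}, index=mi)
print(dfmi.loc[('bar', 'two'), 'A'])
```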
If you also want to index a specific column with .loc, you must use a tuple
like this:
```python
In [42]: df.loc[('bar', 'two'), 'A']
Out[42]: 0.8052440253863785
```
You don’t have to specify all levels of the MultiIndex; by passing only the
first elements of the tuple, you can use “partial” indexing to
get all elements with bar in the first level as follows:
```python
df.loc['bar']
```
This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent
to df.loc['bar',] in this example).
“Partial” slicing also works quite nicely.
```python
In [43]: df.loc['baz':'foo']
Out[43]: 
                     A         B         C
first second                              
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
```
You can slice with a ‘range’ of values, by providing a slice of tuples.
```python
In [44]: df.loc[('baz', 'two'):('qux', 'one')]
Out[44]: 
                     A         B         C
first second                              
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [45]: df.loc[('baz', 'two'):'foo']
Out[45]: 
                     A         B         C
first second                              
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
```
Passing a list of labels or tuples works similarly to reindexing:
```python
In [46]: df.loc[[('bar', 'two'), ('qux', 'one')]]
Out[46]: 
                     A         B         C
first second                              
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466
```
::: tip Note
It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).
:::
Importantly, a list of tuples indexes several complete MultiIndex keys,
whereas a tuple of lists refers to several values within a level:
```python
In [47]: s = pd.Series([1, 2, 3, 4, 5, 6],
   ....:               index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]))
   ....:

In [48]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
Out[48]: 
A  c    1
B  d    5
dtype: int64

In [49]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
Out[49]: 
A  c    1
   d    2
B  c    4
   d    5
dtype: int64
```
Using slicers
You can slice a MultiIndex by providing multiple indexers.
You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.
You can use slice(None) to select all the contents of that level. You do not need to specify all the
deeper levels, they will be implied as slice(None).
As usual, both sides of the slicers are included as this is label indexing.
::: danger Warning
You should specify all axes in the .loc specifier, meaning the indexer for the index and
for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted
as indexing both axes, rather than into say the MultiIndex for the rows.
You should do this:
```python
df.loc[(slice('A1', 'A3'), ...), :]  # noqa: E999
```
You should not do this:
```python
df.loc[(slice('A1', 'A3'), ...)]  # noqa: E999
```
:::
```python
In [50]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....:

In [51]: miindex = pd.MultiIndex.from_product([mklbl('A', 4),
   ....:                                       mklbl('B', 2),
   ....:                                       mklbl('C', 4),
   ....:                                       mklbl('D', 2)])
   ....:

In [52]: micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
   ....:                                        ('b', 'foo'), ('b', 'bah')],
   ....:                                       names=['lvl0', 'lvl1'])
   ....:

In [53]: dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
   ....:                       .reshape((len(miindex), len(micolumns))),
   ....:                     index=miindex,
   ....:                     columns=micolumns).sort_index().sort_index(axis=1)
   ....:

In [54]: dfmi
Out[54]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]
```
Basic MultiIndex slicing using slices, lists, and labels.
```python
In [55]: dfmi.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]
Out[55]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]
```
You can use pandas.IndexSlice to facilitate a more natural syntax
using :, rather than using slice(None).
```python
In [56]: idx = pd.IndexSlice

In [57]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[57]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]
```
It is possible to perform quite complicated selections using this method on multiple axes at the same time.
```python
In [58]: dfmi.loc['A1', (slice(None), 'foo')]
Out[58]: 
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [59]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
Out[59]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]
```
Using a boolean indexer you can provide selection related to the values.
```python
In [60]: mask = dfmi[('a', 'foo')] > 200

In [61]: dfmi.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]
Out[61]: 
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254
```
You can also specify the axis argument to .loc to interpret the passed
slicers on a single axis.
```python
In [62]: dfmi.loc(axis=0)[:, :, ['C1', 'C3']]
Out[62]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]
```
Furthermore, you can set the values using the following methods.
```python
In [63]: df2 = dfmi.copy()

In [64]: df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10

In [65]: df2
Out[65]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]
```
You can use a right-hand-side of an alignable object as well.
```python
In [66]: df2 = dfmi.copy()

In [67]: df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000

In [68]: df2
Out[68]: 
lvl0              a               b        
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]
```
Cross-section
The xs() method of DataFrame additionally takes a level argument to make
selecting data at a particular level of a MultiIndex easier.
```python
In [69]: df
Out[69]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [70]: df.xs('one', level='second')
Out[70]: 
              A         B         C
first                              
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466
```
```python
# using the slicers
In [71]: df.loc[(slice(None), 'one'), :]
Out[71]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466
```
You can also select on the columns with xs, by
providing the axis argument.
```python
In [72]: df = df.T

In [73]: df.xs('one', level='second', axis=1)
Out[73]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466
```
```python
# using the slicers
In [74]: df.loc[:, (slice(None), 'one')]
Out[74]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466
```
xs also allows selection with multiple keys.
```python
In [75]: df.xs(('one', 'bar'), level=('second', 'first'), axis=1)
Out[75]: 
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681
```
```python
# using the slicers
In [76]: df.loc[:, ('bar', 'one')]
Out[76]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64
```
You can pass drop_level=False to xs to retain
the level that was selected.
```python
In [77]: df.xs('one', level='second', axis=1, drop_level=False)
Out[77]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466
```
Compare the above with the result using drop_level=True (the default value).
```python
In [78]: df.xs('one', level='second', axis=1, drop_level=True)
Out[78]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466
```
Advanced reindexing and alignment
Using the parameter level in the reindex() and
align() methods of pandas objects is useful to broadcast
values across a level. For instance:
```python
In [79]: midx = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']],
   ....:                      codes=[[1, 1, 0, 0], [1, 0, 1, 0]])
   ....:

In [80]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [81]: df
Out[81]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [82]: df2 = df.mean(level=0)

In [83]: df2
Out[83]: 
             0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [84]: df2.reindex(df.index, level=0)
Out[84]: 
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

# aligning
In [85]: df_aligned, df2_aligned = df.align(df2, level=0)

In [86]: df_aligned
Out[86]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [87]: df2_aligned
Out[87]: 
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416
```
Swapping levels with swaplevel
The swaplevel() method can switch the order of two levels:
```python
In [88]: df[:5]
Out[88]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [89]: df[:5].swaplevel(0, 1, axis=0)
Out[89]: 
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
```
Reordering levels with reorder_levels
The reorder_levels() method generalizes the swaplevel
method, allowing you to permute the hierarchical index levels in one step:
```python
In [90]: df[:5].reorder_levels([1, 0], axis=0)
Out[90]: 
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520
```
Renaming names of an Index or MultiIndex
The rename() method is used to rename the labels of a
MultiIndex, and is typically used to rename the columns of a DataFrame.
The columns argument of rename allows a dictionary to be specified
that includes only the columns you wish to rename.
```python
In [91]: df.rename(columns={0: "col0", 1: "col1"})
Out[91]: 
            col0      col1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520
```
This method can also be used to rename specific labels of the main index
of the DataFrame.
```python
In [92]: df.rename(index={"one": "two", "y": "z"})
Out[92]: 
               0         1
two  z  1.519970 -0.493662
     x  0.600178  0.274230
zero z  0.132885 -0.023688
     x  2.410179  1.450520
```
The rename_axis() method is used to rename the name of a
Index or MultiIndex. In particular, the names of the levels of a
MultiIndex can be specified, which is useful if reset_index() is later
used to move the values from the MultiIndex to a column.
```python
In [93]: df.rename_axis(index=['abc', 'def'])
Out[93]: 
                0         1
abc  def                   
one  y   1.519970 -0.493662
     x   0.600178  0.274230
zero y   0.132885 -0.023688
     x   2.410179  1.450520
```
Note that the columns of a DataFrame are an index, so that using
rename_axis with the columns argument will change the name of that
index.
```python
In [94]: df.rename_axis(columns="Cols").columns
Out[94]: RangeIndex(start=0, stop=2, step=1, name='Cols')
```
Both rename and rename_axis support specifying a dictionary,
Series or a mapping function to map labels/names to new values.
Sorting a MultiIndex
For MultiIndex-ed objects to be indexed and sliced effectively,
they need to be sorted. As with any index, you can use sort_index().
```python
In [95]: import random

In [96]: random.shuffle(tuples)

In [97]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [98]: s
Out[98]: 
baz  one    0.206053
foo  two   -0.251905
qux  one   -2.213588
foo  one    1.063327
bar  two    1.266143
baz  two    0.299368
bar  one   -0.863838
qux  two    0.408204
dtype: float64

In [99]: s.sort_index()
Out[99]: 
bar  one   -0.863838
     two    1.266143
baz  one    0.206053
     two    0.299368
foo  one    1.063327
     two   -0.251905
qux  one   -2.213588
     two    0.408204
dtype: float64

In [100]: s.sort_index(level=0)
Out[100]: 
bar  one   -0.863838
     two    1.266143
baz  one    0.206053
     two    0.299368
foo  one    1.063327
     two   -0.251905
qux  one   -2.213588
     two    0.408204
dtype: float64

In [101]: s.sort_index(level=1)
Out[101]: 
bar  one   -0.863838
baz  one    0.206053
foo  one    1.063327
qux  one   -2.213588
bar  two    1.266143
baz  two    0.299368
foo  two   -0.251905
qux  two    0.408204
dtype: float64
```
You may also pass a level name to sort_index if the MultiIndex levels
are named.
```python
In [102]: s.index.set_names(['L1', 'L2'], inplace=True)

In [103]: s.sort_index(level='L1')
Out[103]: 
L1   L2 
bar  one   -0.863838
     two    1.266143
baz  one    0.206053
     two    0.299368
foo  one    1.063327
     two   -0.251905
qux  one   -2.213588
     two    0.408204
dtype: float64

In [104]: s.sort_index(level='L2')
Out[104]: 
L1   L2 
bar  one   -0.863838
baz  one    0.206053
foo  one    1.063327
qux  one   -2.213588
bar  two    1.266143
baz  two    0.299368
foo  two   -0.251905
qux  two    0.408204
dtype: float64
```
On higher dimensional objects, you can sort any of the other axes by level if
they have a MultiIndex:
```python
In [105]: df.T.sort_index(level=1, axis=1)
Out[105]: 
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688
```
Indexing will work even if the data are not sorted, but will be rather
inefficient (and show a PerformanceWarning). It will also
return a copy of the data rather than a view:
```python
In [106]: dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
   .....:                     'joe': ['x', 'x', 'z', 'y'],
   .....:                     'jolie': np.random.rand(4)})
   .....:

In [107]: dfm = dfm.set_index(['jim', 'joe'])

In [108]: dfm
Out[108]: 
            jolie
jim joe          
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968
```
```python
In [4]: dfm.loc[(1, 'z')]
PerformanceWarning: indexing past lexsort depth may impact performance.
Out[4]: 
           jolie
jim joe         
1   z    0.64094
```
Furthermore, if you try to index something that is not fully lexsorted, this can raise:
```python
In [5]: dfm.loc[(0, 'y'):(1, 'z')]
UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
```
The is_lexsorted() method on a MultiIndex shows if the
index is sorted, and the lexsort_depth property returns the sort depth:
```python
In [109]: dfm.index.is_lexsorted()
Out[109]: False

In [110]: dfm.index.lexsort_depth
Out[110]: 1
```
```python
In [111]: dfm = dfm.sort_index()

In [112]: dfm
Out[112]: 
            jolie
jim joe          
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [113]: dfm.index.is_lexsorted()
Out[113]: True

In [114]: dfm.index.lexsort_depth
Out[114]: 2
```
And now selection works as expected.
```python
In [115]: dfm.loc[(0, 'y'):(1, 'z')]
Out[115]: 
            jolie
jim joe          
1   y    0.110968
    z    0.537020
```
Take methods
Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide
the take() method that retrieves elements along a given axis at the given
indices. The given indices must be either a list or an ndarray of integer
index positions. take will also accept negative integers as relative positions to the end of the object.
```python
In [116]: index = pd.Index(np.random.randint(0, 1000, 10))

In [117]: index
Out[117]: Int64Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [118]: positions = [0, 9, 3]

In [119]: index[positions]
Out[119]: Int64Index([214, 329, 567], dtype='int64')

In [120]: index.take(positions)
Out[120]: Int64Index([214, 329, 567], dtype='int64')

In [121]: ser = pd.Series(np.random.randn(10))

In [122]: ser.iloc[positions]
Out[122]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [123]: ser.take(positions)
Out[123]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64
```
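The negative-position behavior mentioned above can be sketched briefly (with a made-up Series rather than the one in the session above):

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40, 50])

# Negative positions count back from the end, as in NumPy
print(ser.take([-1, -2]).tolist())   # [50, 40]
print(ser.take([0, -1]).tolist())    # [10, 50]
```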
For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.
```python
In [124]: frm = pd.DataFrame(np.random.randn(5, 3))

In [125]: frm.take([1, 4, 3])
Out[125]: 
          0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [126]: frm.take([0, 2], axis=1)
Out[126]: 
          0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704
```
It is important to note that the take method on pandas objects is not
intended to work on boolean indices and may return unexpected results.
```python
In [127]: arr = np.random.randn(10)

In [128]: arr.take([False, False, True, True])
Out[128]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [129]: arr[[0, 1]]
Out[129]: array([-1.1935,  0.6775])

In [130]: ser = pd.Series(np.random.randn(10))

In [131]: ser.take([False, False, True, True])
Out[131]: 
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [132]: ser.iloc[[0, 1]]
Out[132]: 
0    0.233141
1   -0.223540
dtype: float64
```
Finally, as a small note on performance, because the take method handles
a narrower range of inputs, it can offer performance that is a good deal
faster than fancy indexing.
```python
In [133]: arr = np.random.randn(10000, 5)

In [134]: indexer = np.arange(10000)

In [135]: random.shuffle(indexer)

In [136]: %timeit arr[indexer]
   .....: %timeit arr.take(indexer, axis=0)
   .....: 
152 us +- 988 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)
41.7 us +- 204 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)
```
```python
In [137]: ser = pd.Series(arr[:, 0])

In [138]: %timeit ser.iloc[indexer]
   .....: %timeit ser.take(indexer)
   .....: 
120 us +- 1.05 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)
110 us +- 795 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)
```
Index types
We have discussed MultiIndex in the previous sections pretty extensively.
Documentation about DatetimeIndex and PeriodIndex is shown here,
and documentation about TimedeltaIndex is found here.
In the following sub-sections we will highlight some other index types.
CategoricalIndex
CategoricalIndex is a type of index that is useful for supporting
indexing with duplicates. This is a container around a Categorical
and allows efficient indexing and storage of an index with a large number of duplicated elements.
```python
In [139]: from pandas.api.types import CategoricalDtype

In [140]: df = pd.DataFrame({'A': np.arange(6),
   .....:                    'B': list('aabbca')})
   .....:

In [141]: df['B'] = df['B'].astype(CategoricalDtype(list('cab')))

In [142]: df
Out[142]: 
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [143]: df.dtypes
Out[143]: 
A       int64
B    category
dtype: object

In [144]: df.B.cat.categories
Out[144]: Index(['c', 'a', 'b'], dtype='object')
```
Setting the index will create a CategoricalIndex.
```python
In [145]: df2 = df.set_index('B')

In [146]: df2.index
Out[146]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
```
Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates.
The indexers must be in the category or the operation will raise a KeyError.
```python
In [147]: df2.loc['a']
Out[147]: 
   A
B   
a  0
a  1
a  5
```
The CategoricalIndex is preserved after indexing:
```python
In [148]: df2.loc['a'].index
Out[148]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
```
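A minimal sketch of the KeyError behavior mentioned above, rebuilding the same frame as in the earlier examples and then indexing with a label outside the categories:

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

# rebuild the df2 from the examples above
df = pd.DataFrame({'A': np.arange(6), 'B': list('aabbca')})
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
df2 = df.set_index('B')

# 'a' is one of the categories, so this works
print(df2.loc['a']['A'].tolist())    # [0, 1, 5]

# 'e' is not in the categories, so .loc raises
try:
    df2.loc['e']
except KeyError as exc:
    print('KeyError:', exc)
```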
Sorting the index will sort by the order of the categories (recall that we
created the index with CategoricalDtype(list('cab')), so the sorted
order is cab).
```python
In [149]: df2.sort_index()
Out[149]: 
   A
B   
c  4
a  0
a  1
a  5
b  2
b  3
```
Groupby operations on the index will preserve the index nature as well.
```python
In [150]: df2.groupby(level=0).sum()
Out[150]: 
   A
B   
c  4
a  6
b  5

In [151]: df2.groupby(level=0).sum().index
Out[151]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
```
Reindexing operations will return a resulting index based on the type of the passed
indexer. Passing a list will return a plain-old Index; indexing with
a Categorical will return a CategoricalIndex, indexed according to the categories
of the passed Categorical dtype. This allows one to arbitrarily index these even with
values not in the categories, similarly to how you can reindex any pandas index.
```python
In [152]: df2.reindex(['a', 'e'])
Out[152]: 
     A
B     
a  0.0
a  1.0
a  5.0
e  NaN

In [153]: df2.reindex(['a', 'e']).index
Out[153]: Index(['a', 'a', 'a', 'e'], dtype='object', name='B')

In [154]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
Out[154]: 
     A
B     
a  0.0
a  1.0
a  5.0
e  NaN

In [155]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
Out[155]: CategoricalIndex(['a', 'a', 'a', 'e'], categories=['a', 'b', 'c', 'd', 'e'], ordered=False, name='B', dtype='category')
```
::: danger Warning
Reshaping and Comparison operations on a CategoricalIndex must have the same categories
or a TypeError will be raised.
```python
In [9]: df3 = pd.DataFrame({'A': np.arange(6), 'B': pd.Series(list('aabbca')).astype('category')})

In [10]: df3 = df3.set_index('B')

In [11]: df3.index
Out[11]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['a', 'b', 'c'], ordered=False, name='B', dtype='category')

In [12]: pd.concat([df2, df3])
TypeError: categories must match existing categories when appending
```
:::
Int64Index and RangeIndex
::: danger Warning
Indexing on an integer-based Index with floats has been clarified in 0.18.0; for a summary of the changes, see here.
:::
Int64Index is a fundamental index type in pandas. It is an immutable array implementing an ordered, sliceable set.
Prior to 0.18.0, Int64Index provided the default index for all NDFrame objects.
RangeIndex is a sub-class of Int64Index, added in version 0.18.0, that now provides the default index for all NDFrame objects.
RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. It is analogous to Python's range type.
Float64Index
By default, a Float64Index will be created automatically when passing floating-point or mixed integer-and-float values in index creation.
This enables a pure label-based slicing paradigm that makes [], ix, and loc behave exactly the
same for scalar indexing and slicing.
``` python
In [156]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])

In [157]: indexf
Out[157]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')

In [158]: sf = pd.Series(range(5), index=indexf)

In [159]: sf
Out[159]: 
1.5    0
2.0    1
3.0    2
4.5    3
5.0    4
dtype: int64
```
Scalar selection for [] and .loc will always be label-based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0).
``` python
In [160]: sf[3]
Out[160]: 2

In [161]: sf[3.0]
Out[161]: 2

In [162]: sf.loc[3]
Out[162]: 2

In [163]: sf.loc[3.0]
Out[163]: 2
```
The only positional indexing is via iloc.
``` python
In [164]: sf.iloc[3]
Out[164]: 3
```
A scalar index that is not found will raise a KeyError.
Slicing is primarily on the values of the index when using [], ix, and loc, and is
always positional when using iloc. The exception is when the slice is
boolean, in which case it will always be positional.
``` python
In [165]: sf[2:4]
Out[165]: 
2.0    1
3.0    2
dtype: int64

In [166]: sf.loc[2:4]
Out[166]: 
2.0    1
3.0    2
dtype: int64

In [167]: sf.iloc[2:4]
Out[167]: 
3.0    2
4.5    3
dtype: int64
```
In float indexes, slicing using floats is allowed.
``` python
In [168]: sf[2.1:4.6]
Out[168]: 
3.0    2
4.5    3
dtype: int64

In [169]: sf.loc[2.1:4.6]
Out[169]: 
3.0    2
4.5    3
dtype: int64
```
In non-float indexes, slicing using floats will raise a TypeError.
``` python
In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

In [2]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)
```
::: danger Warning
Using a scalar float indexer for .iloc has been removed in 0.18.0, so the following will raise a TypeError:
``` python
In [3]: pd.Series(range(5)).iloc[3.0]
TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>
```
:::
Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could, for example, be millisecond offsets.
``` python
In [170]: dfir = pd.concat([pd.DataFrame(np.random.randn(5, 2),
   .....:                                index=np.arange(5) * 250.0,
   .....:                                columns=list('AB')),
   .....:                   pd.DataFrame(np.random.randn(6, 2),
   .....:                                index=np.arange(4, 10) * 250.1,
   .....:                                columns=list('AB'))])
   .....: 

In [171]: dfir
Out[171]: 
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725
1000.4 -0.179734  0.993962
1250.5 -0.212673  0.909872
1500.6 -0.733333 -0.349893
1750.7  0.456434 -0.306735
2000.8  0.553396  0.166221
2250.9 -0.101684 -0.734907
```
Selection operations will then always work on a value basis, for all selection operators.
``` python
In [172]: dfir[0:1000.4]
Out[172]: 
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725
1000.4 -0.179734  0.993962

In [173]: dfir.loc[0:1001, 'A']
Out[173]: 
0.0      -0.435772
250.0    -0.808286
500.0    -1.815703
750.0    -0.243487
1000.0    1.162969
1000.4   -0.179734
Name: A, dtype: float64

In [174]: dfir.loc[1000.4]
Out[174]: 
A   -0.179734
B    0.993962
Name: 1000.4, dtype: float64
```
You could retrieve the first 1 second (1000 ms) of data as such:
``` python
In [175]: dfir[0:1000]
Out[175]: 
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725
```
If you need integer-based selection, you should use iloc:
``` python
In [176]: dfir.iloc[0:5]
Out[176]: 
               A         B
0.0    -0.435772 -1.188928
250.0  -0.808286 -0.284634
500.0  -1.815703  1.347213
750.0  -0.243487  0.514704
1000.0  1.162969 -0.287725
```
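The value-based versus positional distinction can be sketched with a small frame whose float "millisecond offset" index mirrors dfir (the offsets and values below are made up for illustration):

``` python
import pandas as pd

# Irregular float index standing in for millisecond offsets
df = pd.DataFrame({'A': range(6)},
                  index=[0.0, 250.0, 500.0, 750.0, 1000.0, 1000.4])

# Label-based: everything up to and including the value 1000.0
by_value = df.loc[0:1000]

# Positional: the first five rows, regardless of their labels
by_position = df.iloc[0:5]

# Here the two happen to select the same rows, but for different reasons
print(len(by_value), len(by_position))
```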
IntervalIndex
New in version 0.20.0.
IntervalIndex, together with its own dtype IntervalDtype
and the Interval scalar type, allows first-class support in pandas
for interval notation.
The IntervalIndex allows some unique indexing and is also used as a
return type for the categories in cut() and qcut().
Indexing with an IntervalIndex
An IntervalIndex can be used in Series and in DataFrame as the index.
``` python
In [177]: df = pd.DataFrame({'A': [1, 2, 3, 4]},
   .....:                   index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))
   .....: 

In [178]: df
Out[178]: 
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4
```
Label based indexing via .loc along the edges of an interval works as you would expect,
selecting that particular interval.
``` python
In [179]: df.loc[2]
Out[179]: 
A    2
Name: (1, 2], dtype: int64

In [180]: df.loc[[2, 3]]
Out[180]: 
        A
(1, 2]  2
(2, 3]  3
```
If you select a label contained within an interval, this will also select the interval.
``` python
In [181]: df.loc[2.5]
Out[181]: 
A    3
Name: (2, 3], dtype: int64

In [182]: df.loc[[2.5, 3.5]]
Out[182]: 
        A
(2, 3]  3
(3, 4]  4
```
Selecting using an Interval will only return exact matches (starting from pandas 0.25.0).
``` python
In [183]: df.loc[pd.Interval(1, 2)]
Out[183]: 
A    2
Name: (1, 2], dtype: int64
```
Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.
``` python
In [7]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError: Interval(0.5, 2.5, closed='right')
```
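A minimal sketch of the exact-match rule, rebuilding the frame from above so it is self-contained: an Interval that matches an index entry exactly succeeds, while an inexact one raises KeyError (which calling code can catch).

``` python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4]},
                  index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))

# Exact match against the (1, 2] entry in the index
exact = df.loc[pd.Interval(1, 2), 'A']

# An inexact Interval is a KeyError, not a partial match
try:
    df.loc[pd.Interval(0.5, 2.5)]
    matched = True
except KeyError:
    matched = False

print(exact, matched)
```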
Selecting all Intervals that overlap a given Interval can be performed using the
overlaps() method to create a boolean indexer.
``` python
In [184]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [185]: idxr
Out[185]: array([ True,  True,  True, False])

In [186]: df[idxr]
Out[186]: 
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
```
Binning data with cut and qcut
cut() and qcut() both return a Categorical object, and the bins they
create are stored as an IntervalIndex in its .categories attribute.
``` python
In [187]: c = pd.cut(range(4), bins=2)

In [188]: c
Out[188]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

In [189]: c.categories
Out[189]: 
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
              closed='right',
              dtype='interval[float64]')
```
cut() also accepts an IntervalIndex for its bins argument, which enables
a useful pandas idiom. First, we call cut() with some data and bins set to a
fixed number, to generate the bins. Then, we pass the values of .categories as the
bins argument in subsequent calls to cut(), supplying new data which will be
binned into the same bins.
``` python
In [190]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[190]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
```
Any value which falls outside all bins will be assigned a NaN value.
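This idiom is commonly used to bin new data into the bins derived from a reference dataset. The data below is made up for illustration:

``` python
import pandas as pd

# Derive 4 equal-width bins from a reference dataset
train = [1, 7, 5, 4, 6, 3]
binned_train = pd.cut(train, bins=4)

# Reuse those bins for new data; 8 falls outside all of them
new_data = [2, 8, 5]
binned_new = pd.cut(new_data, bins=binned_train.categories)

# The out-of-range value becomes NaN
print(binned_new.isna().tolist())
```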
Generating ranges of intervals
If we need intervals on a regular frequency, we can use the interval_range() function
to create an IntervalIndex using various combinations of start, end, and periods.
The default frequency for interval_range is 1 for numeric intervals, and calendar day for
datetime-like intervals:
``` python
In [191]: pd.interval_range(start=0, end=5)
Out[191]: 
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
              closed='right',
              dtype='interval[int64]')

In [192]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4)
Out[192]: 
IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04], (2017-01-04, 2017-01-05]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [193]: pd.interval_range(end=pd.Timedelta('3 days'), periods=3)
Out[193]: 
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]],
              closed='right',
              dtype='interval[timedelta64[ns]]')
```
The freq parameter can be used to specify non-default frequencies, and can utilize a variety
of frequency aliases with datetime-like intervals:
``` python
In [194]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[194]: 
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]],
              closed='right',
              dtype='interval[float64]')

In [195]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4, freq='W')
Out[195]: 
IntervalIndex([(2017-01-01, 2017-01-08], (2017-01-08, 2017-01-15], (2017-01-15, 2017-01-22], (2017-01-22, 2017-01-29]],
              closed='right',
              dtype='interval[datetime64[ns]]')

In [196]: pd.interval_range(start=pd.Timedelta('0 days'), periods=3, freq='9H')
Out[196]: 
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]],
              closed='right',
              dtype='interval[timedelta64[ns]]')
```
Additionally, the closed parameter can be used to specify which side(s) the intervals
are closed on. Intervals are closed on the right side by default.
``` python
In [197]: pd.interval_range(start=0, end=4, closed='both')
Out[197]: 
IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]],
              closed='both',
              dtype='interval[int64]')

In [198]: pd.interval_range(start=0, end=4, closed='neither')
Out[198]: 
IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)],
              closed='neither',
              dtype='interval[int64]')
```
New in version 0.23.0.
Specifying start, end, and periods will generate a range of evenly spaced
intervals from start to end inclusively, with periods number of elements
in the resulting IntervalIndex:
``` python
In [199]: pd.interval_range(start=0, end=6, periods=4)
Out[199]: 
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]],
              closed='right',
              dtype='interval[float64]')

In [200]: pd.interval_range(pd.Timestamp('2018-01-01'),
   .....:                   pd.Timestamp('2018-02-28'), periods=3)
   .....: 
Out[200]: 
IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]],
              closed='right',
              dtype='interval[datetime64[ns]]')
```
Miscellaneous indexing FAQ
Integer indexing
Label-based indexing with integer axis labels is a thorny topic. It has been
discussed heavily on mailing lists and among various members of the scientific
Python community. In pandas, our general viewpoint is that labels matter more
than integer locations. Therefore, with an integer axis index only
label-based indexing is possible with the standard tools like .loc. The
following code will generate exceptions:
``` python
In [201]: s = pd.Series(range(5))

In [202]: s[-1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-202-76c3dce40054> in <module>
----> 1 s[-1]

/pandas/pandas/core/series.py in __getitem__(self, key)
   1062         key = com.apply_if_callable(key, self)
   1063         try:
-> 1064             result = self.index.get_value(self, key)
   1065 
   1066             if not is_scalar(result):

/pandas/pandas/core/indexes/base.py in get_value(self, series, key)
   4721         k = self._convert_scalar_indexer(k, kind="getitem")
   4722         try:
-> 4723             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4724         except KeyError as e1:
   4725             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: -1

In [203]: df = pd.DataFrame(np.random.randn(5, 4))

In [204]: df
Out[204]: 
          0         1         2         3
0 -0.130121 -0.476046  0.759104  0.213379
1 -0.082641  0.448008  0.656420 -1.051443
2  0.594956 -0.151360 -0.069303  1.221431
3 -0.182832  0.791235  0.042745  2.069775
4  1.446552  0.019814 -1.389212 -0.702312

In [205]: df.loc[-2:]
Out[205]: 
          0         1         2         3
0 -0.130121 -0.476046  0.759104  0.213379
1 -0.082641  0.448008  0.656420 -1.051443
2  0.594956 -0.151360 -0.069303  1.221431
3 -0.182832  0.791235  0.042745  2.069775
4  1.446552  0.019814 -1.389212 -0.702312
```
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
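When position-based semantics are actually wanted on an integer-labeled axis, iloc is the unambiguous tool; a minimal sketch of the contrast:

``` python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])   # default integer labels 0..4

# Positional access works through iloc, including negative positions
last = s.iloc[-1]

# .loc treats -1 as a label, which does not exist here
try:
    s.loc[-1]
    found = True
except KeyError:
    found = False

print(last, found)
```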
Non-monotonic indexes require exact matches
If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds
of a label-based slice can be outside the range of the index, much like slice indexing a
normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing and
is_monotonic_decreasing attributes.
``` python
In [206]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=['data'], data=list(range(5)))

In [207]: df.index.is_monotonic_increasing
Out[207]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [208]: df.loc[0:4, :]
Out[208]: 
   data
2     0
3     1
3     2
4     3

# slice bounds are outside the index, so an empty DataFrame is returned
In [209]: df.loc[13:15, :]
Out[209]: 
Empty DataFrame
Columns: [data]
Index: []
```
On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.
``` python
In [210]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5],
   .....:                   columns=['data'], data=list(range(6)))
   .....: 

In [211]: df.index.is_monotonic_increasing
Out[211]: False

# OK because 2 and 4 are in the index
In [212]: df.loc[2:4, :]
Out[212]: 
   data
2     0
3     1
1     2
4     3
```
``` python
# 0 is not in the index
In [9]: df.loc[0:4, :]
KeyError: 0

# 3 is not a unique label
In [11]: df.loc[2:3, :]
KeyError: 'Cannot get right slice bound for non-unique label: 3'
```
Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that
an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with
the is_unique attribute.
``` python
In [213]: weakly_monotonic = pd.Index(['a', 'b', 'c', 'c'])

In [214]: weakly_monotonic
Out[214]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [215]: weakly_monotonic.is_monotonic_increasing
Out[215]: True

In [216]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[216]: False
```
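The combination above is worth wrapping in a small helper if you check it often; this function name is our own, not a pandas API:

``` python
import pandas as pd

def is_strictly_increasing(idx: pd.Index) -> bool:
    """Strictly increasing = weakly increasing with no duplicate labels."""
    return idx.is_monotonic_increasing and idx.is_unique

print(is_strictly_increasing(pd.Index(['a', 'b', 'c', 'c'])))  # duplicates -> False
print(is_strictly_increasing(pd.Index(['a', 'b', 'c'])))
```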
Endpoints are inclusive
Compared with standard Python sequence slicing in which the slice endpoint is
not inclusive, label-based slicing in pandas is inclusive. The primary
reason for this is that it is often not possible to easily determine the
“successor” or next element after a particular label in an index. For example,
consider the following Series:
``` python
In [217]: s = pd.Series(np.random.randn(6), index=list('abcdef'))

In [218]: s
Out[218]: 
a    0.301379
b    1.240445
c   -0.846068
d   -0.043312
e   -1.658747
f   -0.819549
dtype: float64
```
Suppose we wished to slice from c to e, using integers this would be
accomplished as such:
``` python
In [219]: s[2:5]
Out[219]: 
c   -0.846068
d   -0.043312
e   -1.658747
dtype: float64
```
However, if you only had c and e, determining the next element in the
index can be somewhat complicated. For example, the following does not work:
``` python
s.loc['c':'e' + 1]
```
A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:
``` python
In [220]: s.loc['c':'e']
Out[220]: 
c   -0.846068
d   -0.043312
e   -1.658747
dtype: float64
```
This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.
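The time-series use case mentioned above can be sketched directly; the dates and values here are made up for illustration:

``` python
import pandas as pd

# A daily series covering 2019-01-01 through 2019-01-10
ts = pd.Series(range(10), index=pd.date_range('2019-01-01', periods=10))

# Label-based slicing includes BOTH endpoint dates
window = ts.loc['2019-01-03':'2019-01-06']
print(len(window))   # 4 days: the 3rd, 4th, 5th, and 6th
```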
Indexing potentially changes underlying Series dtype
Different indexing operations can potentially change the dtype of a Series.
``` python
In [221]: series1 = pd.Series([1, 2, 3])

In [222]: series1.dtype
Out[222]: dtype('int64')

In [223]: res = series1.reindex([0, 4])

In [224]: res.dtype
Out[224]: dtype('float64')

In [225]: res
Out[225]: 
0    1.0
4    NaN
dtype: float64
```
``` python
In [226]: series2 = pd.Series([True])

In [227]: series2.dtype
Out[227]: dtype('bool')

In [228]: res = series2.reindex_like(series1)

In [229]: res.dtype
Out[229]: dtype('O')

In [230]: res
Out[230]: 
0    True
1     NaN
2     NaN
dtype: object
```
This is because the (re)indexing operations above silently insert NaNs, and the dtype
changes accordingly. This can cause some issues when using numpy ufuncs
such as numpy.logical_and.
See this old issue for a more detailed discussion.
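One way to sidestep the upcast, when a sentinel other than NaN is acceptable, is to supply fill_value to reindex so no missing values are inserted:

``` python
import pandas as pd

series1 = pd.Series([1, 2, 3])

# Plain reindexing inserts NaN, upcasting int64 to float64
upcast = series1.reindex([0, 4])

# Supplying fill_value avoids NaN insertion and keeps the integer dtype
filled = series1.reindex([0, 4], fill_value=0)

print(upcast.dtype, filled.dtype)
```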
