v0.20.1 (May 5, 2017)

This is a major release from 0.19.2 and includes a number of API changes, deprecations, new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

  • New .agg() API for Series/DataFrame similar to the groupby, rolling, and resample APIs, see here
  • Integration with the feather-format, including a new top-level pd.read_feather() function and DataFrame.to_feather() method, see here.
  • The .ix indexer has been deprecated, see here
  • Panel has been deprecated, see here
  • Addition of an IntervalIndex and Interval scalar type, see here
  • Improved user API when grouping by index levels in .groupby(), see here
  • Improved support for UInt64 dtypes, see here
  • A new orient for JSON serialization, orient='table', that uses the Table Schema spec and enables a more interactive repr in the Jupyter Notebook, see here
  • Experimental support for exporting styled DataFrames (DataFrame.style) to Excel, see here
  • Window binary corr/cov operations now return a MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here
  • Support for S3 handling now uses s3fs, see here
  • Google BigQuery support now uses the pandas-gbq library, see here

::: danger Warning

Pandas has changed the internal structure and layout of the code base. This can affect imports that are not from the top-level pandas.* namespace; please see the changes here.

:::

Check the API Changes and deprecations before updating.

::: tip Note

This is a combined release for 0.20.0 and 0.20.1. Version 0.20.1 contains one additional change for backwards-compatibility with downstream projects using pandas’ utils routines. (GH16250)

:::

What’s new in v0.20.0

New features

agg API for DataFrame/Series

Series & DataFrame have been enhanced to support the aggregation API. This is a familiar API from groupby, window operations, and resampling. This allows aggregation operations in a concise way by using agg() and transform(). The full documentation is here (GH1623).

Here is a sample

```python
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   ...:                   index=pd.date_range('1/1/2000', periods=10))
   ...:

In [2]: df.iloc[3:7] = np.nan

In [3]: df
Out[3]:
                   A         B         C
2000-01-01  0.469112 -0.282863 -1.509059
2000-01-02 -1.135632  1.212112 -0.173215
2000-01-03  0.119209 -1.044236 -0.861849
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.113648 -1.478427  0.524988
2000-01-09  0.404705  0.577046 -1.715002
2000-01-10 -1.039268 -0.370647 -1.157892

[10 rows x 3 columns]
```

One can operate using string function names, callables, lists, or dictionaries of these.

Using a single function is equivalent to .apply.

```python
In [4]: df.agg('sum')
Out[4]:
A   -1.068226
B   -1.387015
C   -4.892029
Length: 3, dtype: float64
```

Multiple aggregations with a list of functions.

```python
In [5]: df.agg(['sum', 'min'])
Out[5]:
            A         B         C
sum -1.068226 -1.387015 -4.892029
min -1.135632 -1.478427 -1.715002

[2 rows x 3 columns]
```

Using a dict provides the ability to apply specific aggregations per column. You will get a matrix-like output of all of the aggregators, with one row per unique function; entries for functions that were not applied to a particular column are NaN:

```python
In [6]: df.agg({'A': ['sum', 'min'], 'B': ['min', 'max']})
Out[6]:
            A         B
max       NaN  1.212112
min -1.135632 -1.478427
sum -1.068226       NaN

[3 rows x 2 columns]
```

The API also supports a .transform() function for broadcasting results.

```python
In [7]: df.transform(['abs', lambda x: x - x.min()])
Out[7]:
                   A                   B                   C
                 abs  <lambda>       abs  <lambda>       abs  <lambda>
2000-01-01  0.469112  1.604745  0.282863  1.195563  1.509059  0.205944
2000-01-02  1.135632  0.000000  1.212112  2.690539  0.173215  1.541787
2000-01-03  0.119209  1.254841  1.044236  0.434191  0.861849  0.853153
2000-01-04       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-08  0.113648  1.249281  1.478427  0.000000  0.524988  2.239990
2000-01-09  0.404705  1.540338  0.577046  2.055473  1.715002  0.000000
2000-01-10  1.039268  0.096364  0.370647  1.107780  1.157892  0.557110

[10 rows x 6 columns]
```

When presented with mixed dtypes that cannot be aggregated, .agg() will only take the valid aggregations. This is similar to how groupby .agg() works. (GH15015)

```python
In [8]: df = pd.DataFrame({'A': [1, 2, 3],
   ...:                    'B': [1., 2., 3.],
   ...:                    'C': ['foo', 'bar', 'baz'],
   ...:                    'D': pd.date_range('20130101', periods=3)})
   ...:

In [9]: df.dtypes
Out[9]:
A             int64
B           float64
C            object
D    datetime64[ns]
Length: 4, dtype: object

In [10]: df.agg(['min', 'sum'])
Out[10]:
     A    B          C          D
min  1  1.0        bar 2013-01-01
sum  6  6.0  foobarbaz        NaT

[2 rows x 4 columns]
```

dtype keyword for data IO

The 'python' engine for read_csv(), as well as the read_fwf() function for parsing fixed-width text files and read_excel() for parsing Excel files, now accept the dtype keyword argument for specifying the types of specific columns (GH14295). See the io docs for more information.

```python
In [11]: data = "a b\n1 2\n3 4"

In [12]: pd.read_fwf(StringIO(data)).dtypes
Out[12]:
a    int64
b    int64
Length: 2, dtype: object

In [13]: pd.read_fwf(StringIO(data), dtype={'a': 'float64', 'b': 'object'}).dtypes
Out[13]:
a    float64
b     object
Length: 2, dtype: object
```

.to_datetime() has gained an origin parameter

to_datetime() has gained a new parameter, origin, to define a reference date from which to compute the resulting timestamps when parsing numerical values with a specific unit. (GH11276, GH11745)

For example, with 1960-01-01 as the starting date:

```python
In [14]: pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01'))
Out[14]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)
```

The default is origin='unix', which corresponds to 1970-01-01 00:00:00, commonly called the ‘unix epoch’ or POSIX time. This was the previous default, so this is a backward compatible change.

```python
In [15]: pd.to_datetime([1, 2, 3], unit='D')
Out[15]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)
```

Groupby enhancements

Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names. Previously, only column names could be referenced. This makes it easy to group by a column and an index level at the same time. (GH5677)

```python
In [16]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ....:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ....:

In [17]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

In [18]: df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
   ....:                    'B': np.arange(8)},
   ....:                   index=index)
   ....:

In [19]: df
Out[19]:
              A  B
first second
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

[8 rows x 2 columns]

In [20]: df.groupby(['second', 'A']).sum()
Out[20]:
          B
second A
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7

[6 rows x 1 columns]
```

Better support for compressed URLs in read_csv

The compression code was refactored (GH12688). As a result, reading dataframes from URLs in read_csv() or read_table() now supports additional compression methods: xz, bz2, and zip (GH14570). Previously, only gzip compression was supported. By default, compression of URLs and paths is now inferred from their file extensions. Additionally, support for bz2 compression in the Python 2 C engine was improved (GH14874).

```python
In [21]: url = ('https://github.com/{repo}/raw/{branch}/{path}'
   ....:        .format(repo='pandas-dev/pandas',
   ....:                branch='master',
   ....:                path='pandas/tests/io/parser/data/salaries.csv.bz2'))
   ....:

# default, infer compression
In [22]: df = pd.read_csv(url, sep='\t', compression='infer')

# explicitly specify compression
In [23]: df = pd.read_csv(url, sep='\t', compression='bz2')

In [24]: df.head(2)
Out[24]:
       S  X  E  M
0  13876  1  1  1
1  11608  1  3  0

[2 rows x 4 columns]
```

Pickle file I/O now supports compression

read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can now read from and write to compressed pickle files. Compression methods can be an explicit parameter or be inferred from the file extension. See the docs here.

```python
In [25]: df = pd.DataFrame({'A': np.random.randn(1000),
   ....:                    'B': 'foo',
   ....:                    'C': pd.date_range('20130101', periods=1000, freq='s')})
   ....:
```

Using an explicit compression type

```python
In [26]: df.to_pickle("data.pkl.compress", compression="gzip")

In [27]: rt = pd.read_pickle("data.pkl.compress", compression="gzip")

In [28]: rt.head()
Out[28]:
          A    B                   C
0 -1.344312  foo 2013-01-01 00:00:00
1  0.844885  foo 2013-01-01 00:00:01
2  1.075770  foo 2013-01-01 00:00:02
3 -0.109050  foo 2013-01-01 00:00:03
4  1.643563  foo 2013-01-01 00:00:04

[5 rows x 3 columns]
```

The default is to infer the compression type from the extension (compression='infer'):

```python
In [29]: df.to_pickle("data.pkl.gz")

In [30]: rt = pd.read_pickle("data.pkl.gz")

In [31]: rt.head()
Out[31]:
          A    B                   C
0 -1.344312  foo 2013-01-01 00:00:00
1  0.844885  foo 2013-01-01 00:00:01
2  1.075770  foo 2013-01-01 00:00:02
3 -0.109050  foo 2013-01-01 00:00:03
4  1.643563  foo 2013-01-01 00:00:04

[5 rows x 3 columns]

In [32]: df["A"].to_pickle("s1.pkl.bz2")

In [33]: rt = pd.read_pickle("s1.pkl.bz2")

In [34]: rt.head()
Out[34]:
0   -1.344312
1    0.844885
2    1.075770
3   -0.109050
4    1.643563
Name: A, Length: 5, dtype: float64
```

UInt64 support improved

Pandas has significantly improved support for operations involving unsigned, or purely non-negative, integers. Previously, handling these integers would result in improper rounding or data-type casting, leading to incorrect results. Notably, a new numerical index, UInt64Index, has been created (GH14937).

```python
In [35]: idx = pd.UInt64Index([1, 2, 3])

In [36]: df = pd.DataFrame({'A': ['a', 'b', 'c']}, index=idx)

In [37]: df.index
Out[37]: UInt64Index([1, 2, 3], dtype='uint64')
```
  • Bug in converting object elements of array-like objects to unsigned 64-bit integers (GH4471, GH14982)
  • Bug in Series.unique() in which unsigned 64-bit integers were causing overflow (GH14721)
  • Bug in DataFrame construction in which unsigned 64-bit integer elements were being converted to objects (GH14881)
  • Bug in pd.read_csv() in which unsigned 64-bit integer elements were being improperly converted to the wrong data types (GH14983)
  • Bug in pd.unique() in which unsigned 64-bit integers were causing overflow (GH14915)
  • Bug in pd.value_counts() in which unsigned 64-bit integers were being erroneously truncated in the output (GH14934)

GroupBy on categoricals

In previous versions, .groupby(..., sort=False) would fail with a ValueError when grouping on a categorical series with some categories not appearing in the data. (GH13179)

```python
In [38]: chromosomes = np.r_[np.arange(1, 23).astype(str), ['X', 'Y']]

In [39]: df = pd.DataFrame({
   ....:     'A': np.random.randint(100),
   ....:     'B': np.random.randint(100),
   ....:     'C': np.random.randint(100),
   ....:     'chromosomes': pd.Categorical(np.random.choice(chromosomes, 100),
   ....:                                   categories=chromosomes,
   ....:                                   ordered=True)})
   ....:

In [40]: df
Out[40]:
     A   B   C chromosomes
0   87  22  81           4
1   87  22  81          13
2   87  22  81          22
3   87  22  81           2
4   87  22  81           6
..  ..  ..  ..         ...
95  87  22  81           8
96  87  22  81          11
97  87  22  81           X
98  87  22  81           1
99  87  22  81          19

[100 rows x 4 columns]
```

Previous behavior:

```python
In [3]: df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
---------------------------------------------------------------------------
ValueError: items in new_categories are not the same as in old categories
```

New behavior:

```python
In [41]: df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
Out[41]:
               A    B    C
chromosomes
2            348   88  324
3            348   88  324
4            348   88  324
5            261   66  243
6            174   44  162
...          ...  ...  ...
22           348   88  324
X            348   88  324
Y            435  110  405
1              0    0    0
21             0    0    0

[24 rows x 3 columns]
```

Table schema output

The new orient 'table' for DataFrame.to_json() will generate a Table Schema compatible string representation of the data.

```python
In [42]: df = pd.DataFrame(
   ....:     {'A': [1, 2, 3],
   ....:      'B': ['a', 'b', 'c'],
   ....:      'C': pd.date_range('2016-01-01', freq='d', periods=3)},
   ....:     index=pd.Index(range(3), name='idx'))
   ....:

In [43]: df
Out[43]:
     A  B          C
idx
0    1  a 2016-01-01
1    2  b 2016-01-02
2    3  c 2016-01-03

[3 rows x 3 columns]

In [44]: df.to_json(orient='table')
Out[44]: '{"schema": {"fields":[{"name":"idx","type":"integer"},{"name":"A","type":"integer"},{"name":"B","type":"string"},{"name":"C","type":"datetime"}],"primaryKey":["idx"],"pandas_version":"0.20.0"}, "data": [{"idx":0,"A":1,"B":"a","C":"2016-01-01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},{"idx":2,"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'
```

See IO: Table Schema for more information.

Additionally, the repr for DataFrame and Series can now publish this JSON Table schema representation of the Series or DataFrame if you are using IPython (or another frontend like nteract using the Jupyter messaging protocol). This gives frontends like the Jupyter notebook and nteract more flexibility in how they display pandas objects, since they have more information about the data. You must enable this by setting the display.html.table_schema option to True.
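
A minimal sketch of opting in (the option is off by default; the Series contents are just an illustration):

```python
import pandas as pd

# The Table Schema repr is off by default; opt in per session.
pd.set_option('display.html.table_schema', True)

s = pd.Series(['a', 'b', 'c'], name='letters')
s  # in Jupyter/nteract, the repr now also publishes a Table Schema payload
```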

SciPy sparse matrix from/to SparseDataFrame

Pandas now supports creating sparse dataframes directly from scipy.sparse.spmatrix instances. See the documentation for more information. (GH4343)

All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying data as needed.

```python
In [45]: from scipy.sparse import csr_matrix

In [46]: arr = np.random.random(size=(1000, 5))

In [47]: arr[arr < .9] = 0

In [48]: sp_arr = csr_matrix(arr)

In [49]: sp_arr
Out[49]:
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
    with 501 stored elements in Compressed Sparse Row format>

In [50]: sdf = pd.SparseDataFrame(sp_arr)

In [51]: sdf
Out[51]:
      0   1         2   3         4
0    NaN NaN  0.977426 NaN       NaN
1    NaN NaN       NaN NaN  0.969340
2    NaN NaN       NaN NaN       NaN
3    NaN NaN       NaN NaN       NaN
4    NaN NaN       NaN NaN       NaN
..   ..  ..       ...  ..       ...
995  NaN NaN       NaN NaN  0.917524
996  NaN NaN       NaN NaN       NaN
997  NaN NaN       NaN NaN  0.968178
998  NaN NaN       NaN NaN  0.901563
999  NaN NaN       NaN NaN       NaN

[1000 rows x 5 columns]
```

To convert a SparseDataFrame back to a sparse SciPy matrix in COO format, you can use:

```python
In [52]: sdf.to_coo()
Out[52]:
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
    with 501 stored elements in COOrdinate format>
```

Excel output for styled DataFrames

Experimental support has been added to export DataFrame.style formats to Excel using the openpyxl engine. (GH15530)

For example, after running the following, styled.xlsx renders as below:

```python
In [53]: np.random.seed(24)

In [54]: df = pd.DataFrame({'A': np.linspace(1, 10, 10)})

In [55]: df = pd.concat([df, pd.DataFrame(np.random.RandomState(24).randn(10, 4),
   ....:                                  columns=list('BCDE'))],
   ....:                axis=1)
   ....:

In [56]: df.iloc[0, 2] = np.nan

In [57]: df
Out[57]:
      A         B         C         D         E
0   1.0  1.329212       NaN -0.316280 -0.990810
1   2.0 -1.070816 -1.438713  0.564417  0.295722
2   3.0 -1.626404  0.219565  0.678805  1.889273
3   4.0  0.961538  0.104011 -0.481165  0.850229
4   5.0  1.453425  1.057737  0.165562  0.515018
5   6.0 -1.336936  0.562861  1.392855 -0.063328
6   7.0  0.121668  1.207603 -0.002040  1.627796
7   8.0  0.354493  1.037528 -0.385684  0.519818
8   9.0  1.686583 -1.325963  1.428984 -2.089354
9  10.0 -0.129820  0.631523 -0.586538  0.290720

[10 rows x 5 columns]

In [58]: styled = (df.style
   ....:           .applymap(lambda val: 'color: %s' % 'red' if val < 0 else 'black')
   ....:           .highlight_max())
   ....:

In [59]: styled.to_excel('styled.xlsx', engine='openpyxl')
```

[Image: styled.xlsx as rendered in Excel]

See the Style documentation for more detail.

IntervalIndex

pandas has gained an IntervalIndex with its own dtype, interval, as well as the Interval scalar type. These allow first-class support for interval notation, specifically as a return type for the categories in cut() and qcut(). The IntervalIndex allows some unique indexing, see the docs. (GH7640, GH8625)

::: danger Warning

These indexing behaviors of the IntervalIndex are provisional and may change in a future version of pandas. Feedback on usage is welcome.

:::

Previous behavior:

The returned categories were strings, representing Intervals

```python
In [1]: c = pd.cut(range(4), bins=2)

In [2]: c
Out[2]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3], (1.5, 3]]
Categories (2, object): [(-0.003, 1.5] < (1.5, 3]]

In [3]: c.categories
Out[3]: Index(['(-0.003, 1.5]', '(1.5, 3]'], dtype='object')
```

New behavior:

```python
In [60]: c = pd.cut(range(4), bins=2)

In [61]: c
Out[61]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

In [62]: c.categories
Out[62]:
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
              closed='right',
              dtype='interval[float64]')
```

Furthermore, this allows one to bin other data with these same bins, with NaN representing a missing value similar to other dtypes.

```python
In [63]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[63]:
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
```

An IntervalIndex can also be used in Series and DataFrame as the index.

```python
In [64]: df = pd.DataFrame({'A': range(4),
   ....:                    'B': pd.cut([0, 3, 1, 1], bins=c.categories)
   ....:                    }).set_index('B')
   ....:

In [65]: df
Out[65]:
               A
B
(-0.003, 1.5]  0
(1.5, 3.0]     1
(-0.003, 1.5]  2
(-0.003, 1.5]  3

[4 rows x 1 columns]
```

Selecting via a specific interval:

```python
In [66]: df.loc[pd.Interval(1.5, 3.0)]
Out[66]:
A    1
Name: (1.5, 3.0], Length: 1, dtype: int64
```

Selecting via a scalar value that is contained in the intervals.

```python
In [67]: df.loc[0]
Out[67]:
               A
B
(-0.003, 1.5]  0
(-0.003, 1.5]  2
(-0.003, 1.5]  3

[3 rows x 1 columns]
```

Other enhancements

  • DataFrame.rolling() now accepts the parameter closed='right'|'left'|'both'|'neither' to choose the rolling window-endpoint closedness. See the documentation (GH13965)
  • Integration with the feather-format, including a new top-level pd.read_feather() and DataFrame.to_feather() method, see here.
  • Series.str.replace() now accepts a callable, as replacement, which is passed to re.sub (GH15055)
  • Series.str.replace() now accepts a compiled regular expression as a pattern (GH15446); both variants are shown in the sketch after this list
  • Series.sort_index accepts parameters kind and na_position (GH13589, GH14444)
  • DataFrame and DataFrame.groupby() have gained a nunique() method to count the distinct values over an axis (GH14336, GH15197).
  • DataFrame has gained a melt() method, equivalent to pd.melt(), for unpivoting from a wide to long format (GH12640).
  • pd.read_excel() now preserves sheet order when using sheetname=None (GH9930)
  • Multiple offset aliases with decimal points are now supported (e.g. 0.5min is parsed as 30s) (GH8419)
  • .isnull() and .notnull() have been added to Index object to make them more consistent with the Series API (GH15300)
  • New UnsortedIndexError (subclass of KeyError) raised when indexing/slicing into an unsorted MultiIndex (GH11897). This allows differentiation between errors due to lack of sorting or an incorrect key. See here
  • MultiIndex has gained a .to_frame() method to convert to a DataFrame (GH12397)
  • pd.cut and pd.qcut now support datetime64 and timedelta64 dtypes (GH14714, GH14798)
  • pd.qcut has gained the duplicates='raise'|'drop' option to control whether to raise on duplicated edges (GH7751)
  • Series provides a to_excel method to output Excel files (GH8825)
  • The usecols argument in pd.read_csv() now accepts a callable function as a value (GH14154)
  • The skiprows argument in pd.read_csv() now accepts a callable function as a value (GH10882)
  • The nrows and chunksize arguments in pd.read_csv() are supported if both are passed (GH6774, GH15755)
  • DataFrame.plot now prints a title above each subplot if subplots=True and title is a list of strings (GH14753)
  • DataFrame.plot can pass the matplotlib 2.0 default color cycle as a single string as color parameter, see here. (GH15516)
  • Series.interpolate() now supports timedelta as an index type with method='time' (GH6424)
  • Addition of a level keyword to DataFrame/Series.rename to rename labels in the specified level of a MultiIndex (GH4160).
  • DataFrame.reset_index() will now interpret a tuple index.name as a key spanning across levels of columns, if this is a MultiIndex (GH16164)
  • Timedelta.isoformat method added for formatting Timedeltas as an ISO 8601 duration. See the Timedelta docs (GH15136)
  • .select_dtypes() now allows the string datetimetz to generically select datetimes with tz (GH14910)
  • The .to_latex() method will now accept multicolumn and multirow arguments to use the accompanying LaTeX enhancements
  • pd.merge_asof() gained the option direction='backward'|'forward'|'nearest' (GH14887)
  • Series/DataFrame.asfreq() have gained a fill_value parameter, to fill missing values (GH3715).
  • Series/DataFrame.resample.asfreq have gained a fill_value parameter, to fill missing values during resampling (GH3715).
  • pandas.util.hash_pandas_object() has gained the ability to hash a MultiIndex (GH15224)
  • Series/DataFrame.squeeze() have gained the axis parameter. (GH15339)
  • DataFrame.to_excel() has a new freeze_panes parameter to turn on Freeze Panes when exporting to Excel (GH15160)
  • pd.read_html() will parse multiple header rows, creating a MultiIndex header. (GH13434).
  • HTML table output skips colspan or rowspan attribute if equal to 1. (GH15403)
  • pandas.io.formats.style.Styler template now has blocks for easier extension, see the example notebook (GH15649)
  • Styler.render() now accepts **kwargs to allow user-defined variables in the template (GH15649)
  • Compatibility with Jupyter notebook 5.0; MultiIndex column labels are left-aligned and MultiIndex row-labels are top-aligned (GH15379)
  • TimedeltaIndex now has a custom date-tick formatter specifically designed for nanosecond level precision (GH8711)
  • pd.api.types.union_categoricals gained the ignore_ordered argument to allow ignoring the ordered attribute of unioned categoricals (GH13410). See the categorical union docs for more information.
  • DataFrame.to_latex() and DataFrame.to_string() now allow optional header aliases. (GH15536)
  • Re-enable the parse_dates keyword of pd.read_excel() to parse string columns as dates (GH14326)
  • Added .empty property to subclasses of Index. (GH15270)
  • Enabled floor division for Timedelta and TimedeltaIndex (GH15828)
  • pandas.io.json.json_normalize() gained the option errors='ignore'|'raise'; the default is errors='raise' which is backward compatible. (GH14583)
  • pandas.io.json.json_normalize() with an empty list will return an empty DataFrame (GH15534)
  • pandas.io.json.json_normalize() has gained a sep option that accepts str to separate joined fields; the default is “.”, which is backward compatible. (GH14883)
  • MultiIndex.remove_unused_levels() has been added to facilitate removing unused levels. (GH15694)
  • pd.read_csv() will now raise a ParserError error whenever any parsing error occurs (GH15913, GH15925)
  • pd.read_csv() now supports the error_bad_lines and warn_bad_lines arguments for the Python parser (GH15925)
  • The display.show_dimensions option can now also be used to specify whether the length of a Series should be shown in its repr (GH7117).
  • parallel_coordinates() has gained a sort_labels keyword argument that sorts class labels and the colors assigned to them (GH15908)
  • Options added to allow one to turn on/off using bottleneck and numexpr, see here (GH16157)
  • DataFrame.style.bar() now accepts two more options to further customize the bar chart. Bar alignment is set with align='left'|'mid'|'zero', the default is “left”, which is backward compatible; You can now pass a list of color=[color_negative, color_positive]. (GH14757)
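
As referenced in the list above, a minimal sketch of the two new Series.str.replace() capabilities; the example strings are made up:

```python
import re

import pandas as pd

s = pd.Series(['foo 123', 'bar 45'])

# A callable replacement receives each match object and returns a string.
s.str.replace(r'\d+', lambda m: m.group(0)[::-1])

# A compiled regular expression can now be passed as the pattern.
pat = re.compile(r'[aeiou]')
s.str.replace(pat, '_')
```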

Backwards incompatible API changes

Possible incompatibility for HDF5 formats created with pandas < 0.13.0

pd.TimeSeries was officially deprecated in 0.17.0, though it had already been an alias since 0.13.0. It has been dropped in favor of pd.Series. (GH15098).

This may cause HDF5 files that were created in prior versions to become unreadable if pd.TimeSeries was used. This is most likely to affect files from pandas < 0.13.0. If you find yourself in this situation, you can use a recent prior version of pandas to read in your HDF5 files, then write them out again after applying the procedure below.

```python
In [2]: s = pd.TimeSeries([1, 2, 3], index=pd.date_range('20130101', periods=3))

In [3]: s
Out[3]:
2013-01-01    1
2013-01-02    2
2013-01-03    3
Freq: D, dtype: int64

In [4]: type(s)
Out[4]: pandas.core.series.TimeSeries

In [5]: s = pd.Series(s)

In [6]: s
Out[6]:
2013-01-01    1
2013-01-02    2
2013-01-03    3
Freq: D, dtype: int64

In [7]: type(s)
Out[7]: pandas.core.series.Series
```

Map on Index types now returns other Index types

map on an Index now returns an Index, not a numpy array (GH12766)

```python
In [68]: idx = pd.Index([1, 2])

In [69]: idx
Out[69]: Int64Index([1, 2], dtype='int64')

In [70]: mi = pd.MultiIndex.from_tuples([(1, 2), (2, 4)])

In [71]: mi
Out[71]:
MultiIndex([(1, 2),
            (2, 4)],
           )
```

Previous behavior:

```python
In [5]: idx.map(lambda x: x * 2)
Out[5]: array([2, 4])

In [6]: idx.map(lambda x: (x, x * 2))
Out[6]: array([(1, 2), (2, 4)], dtype=object)

In [7]: mi.map(lambda x: x)
Out[7]: array([(1, 2), (2, 4)], dtype=object)

In [8]: mi.map(lambda x: x[0])
Out[8]: array([1, 2])
```

New behavior:

```python
In [72]: idx.map(lambda x: x * 2)
Out[72]: Int64Index([2, 4], dtype='int64')

In [73]: idx.map(lambda x: (x, x * 2))
Out[73]:
MultiIndex([(1, 2),
            (2, 4)],
           )

In [74]: mi.map(lambda x: x)
Out[74]:
MultiIndex([(1, 2),
            (2, 4)],
           )

In [75]: mi.map(lambda x: x[0])
Out[75]: Int64Index([1, 2], dtype='int64')
```

map on a Series with datetime64 values may return int64 dtypes rather than int32

```python
In [76]: s = pd.Series(pd.date_range('2011-01-02T00:00', '2011-01-02T02:00', freq='H')
   ....:               .tz_localize('Asia/Tokyo'))
   ....:

In [77]: s
Out[77]:
0   2011-01-02 00:00:00+09:00
1   2011-01-02 01:00:00+09:00
2   2011-01-02 02:00:00+09:00
Length: 3, dtype: datetime64[ns, Asia/Tokyo]
```

Previous behavior:

```python
In [9]: s.map(lambda x: x.hour)
Out[9]:
0    0
1    1
2    2
dtype: int32
```

New behavior:

```python
In [78]: s.map(lambda x: x.hour)
Out[78]:
0    0
1    1
2    2
Length: 3, dtype: int64
```

Accessing datetime fields of Index now returns Index

The datetime-related attributes (see here for an overview) of DatetimeIndex, PeriodIndex and TimedeltaIndex previously returned numpy arrays. They will now return a new Index object, except in the case of a boolean field, where the result will still be a boolean ndarray. (GH15022)

Previous behavior:

```python
In [1]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [2]: idx.hour
Out[2]: array([ 0, 10, 20,  6, 16], dtype=int32)
```

New behavior:

```python
In [79]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [80]: idx.hour
Out[80]: Int64Index([0, 10, 20, 6, 16], dtype='int64')
```

This has the advantage that specific Index methods are still available on the result. On the other hand, this might have backward incompatibilities: e.g. compared to numpy arrays, Index objects are not mutable. To get the original ndarray, you can always convert explicitly using np.asarray(idx.hour).
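
A minimal sketch of that conversion:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2015-01-01', periods=5, freq='10H')
hours = idx.hour            # now an Int64Index, keeping Index methods available
arr = np.asarray(idx.hour)  # explicit conversion back to a plain ndarray
```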

pd.unique will now be consistent with extension types

In prior versions, using Series.unique() and pandas.unique() on Categorical and tz-aware data-types would yield different return types. These are now made consistent. (GH15903)

  • Datetime tz-aware

Previous behavior:

```python
# Series
In [5]: pd.Series([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:            pd.Timestamp('20160101', tz='US/Eastern')]).unique()
Out[5]: array([Timestamp('2016-01-01 00:00:00-0500', tz='US/Eastern')], dtype=object)

In [6]: pd.unique(pd.Series([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:                      pd.Timestamp('20160101', tz='US/Eastern')]))
Out[6]: array(['2016-01-01T05:00:00.000000000'], dtype='datetime64[ns]')

# Index
In [7]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:           pd.Timestamp('20160101', tz='US/Eastern')]).unique()
Out[7]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

In [8]: pd.unique([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:            pd.Timestamp('20160101', tz='US/Eastern')])
Out[8]: array(['2016-01-01T05:00:00.000000000'], dtype='datetime64[ns]')
```

New behavior:

```python
# Series, returns an array of Timestamp tz-aware
In [81]: pd.Series([pd.Timestamp('20160101', tz='US/Eastern'),
   ....:            pd.Timestamp('20160101', tz='US/Eastern')]).unique()
   ....:
Out[81]:
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]

In [82]: pd.unique(pd.Series([pd.Timestamp('20160101', tz='US/Eastern'),
   ....:                      pd.Timestamp('20160101', tz='US/Eastern')]))
   ....:
Out[82]:
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]

# Index, returns a DatetimeIndex
In [83]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ....:           pd.Timestamp('20160101', tz='US/Eastern')]).unique()
   ....:
Out[83]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

In [84]: pd.unique(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ....:                     pd.Timestamp('20160101', tz='US/Eastern')]))
   ....:
Out[84]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)
```
  • Categoricals

Previous behavior:

```python
In [1]: pd.Series(list('baabc'), dtype='category').unique()
Out[1]:
[b, a, c]
Categories (3, object): [b, a, c]

In [2]: pd.unique(pd.Series(list('baabc'), dtype='category'))
Out[2]: array(['b', 'a', 'c'], dtype=object)
```

New behavior:

```python
# returns a Categorical
In [85]: pd.Series(list('baabc'), dtype='category').unique()
Out[85]:
[b, a, c]
Categories (3, object): [b, a, c]

In [86]: pd.unique(pd.Series(list('baabc'), dtype='category'))
Out[86]:
[b, a, c]
Categories (3, object): [b, a, c]
```

S3 file handling

pandas now uses s3fs for handling S3 connections. This shouldn’t break any code. However, since s3fs is not a required dependency, you will need to install it separately, like boto in prior versions of pandas. (GH11915).
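
A minimal sketch, assuming s3fs is installed and credentials are configured; the bucket and key below are placeholders:

```python
import pandas as pd

# pandas delegates the S3 interaction to s3fs (pip install s3fs).
df = pd.read_csv('s3://my-bucket/path/to/data.csv')
```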

Partial string indexing changes

DatetimeIndex Partial String Indexing now works as an exact match, provided that string resolution coincides with index resolution, including a case when both are seconds (GH14826). See Slice vs. Exact Match for details.

```python
In [87]: df = pd.DataFrame({'a': [1, 2, 3]}, pd.DatetimeIndex(['2011-12-31 23:59:59',
   ....:                                                       '2012-01-01 00:00:00',
   ....:                                                       '2012-01-01 00:00:01']))
   ....:
```

Previous behavior:

```python
In [4]: df['2011-12-31 23:59:59']
Out[4]:
                     a
2011-12-31 23:59:59  1

In [5]: df['a']['2011-12-31 23:59:59']
Out[5]:
2011-12-31 23:59:59    1
Name: a, dtype: int64
```

New behavior:

```python
In [4]: df['2011-12-31 23:59:59']
KeyError: '2011-12-31 23:59:59'

In [5]: df['a']['2011-12-31 23:59:59']
Out[5]: 1
```

Concat of different float dtypes will not automatically upcast

Previously, concat of multiple objects with different float dtypes would automatically upcast results to a dtype of float64. Now the smallest acceptable dtype will be used (GH13247)

```python
In [88]: df1 = pd.DataFrame(np.array([1.0], dtype=np.float32, ndmin=2))

In [89]: df1.dtypes
Out[89]:
0    float32
Length: 1, dtype: object

In [90]: df2 = pd.DataFrame(np.array([np.nan], dtype=np.float32, ndmin=2))

In [91]: df2.dtypes
Out[91]:
0    float32
Length: 1, dtype: object
```

Previous behavior:

```python
In [7]: pd.concat([df1, df2]).dtypes
Out[7]:
0    float64
dtype: object
```

New behavior:

```python
In [92]: pd.concat([df1, df2]).dtypes
Out[92]:
0    float32
Length: 1, dtype: object
```

Pandas Google BigQuery support has moved

pandas has split off Google BigQuery support into a separate package pandas-gbq. You can conda install pandas-gbq -c conda-forge or pip install pandas-gbq to get it. The functionality of read_gbq() and DataFrame.to_gbq() remain the same with the currently released version of pandas-gbq=0.1.4. Documentation is now hosted here (GH15347)
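
A minimal sketch of the unchanged call sites, assuming pandas-gbq is installed; the project id and table names below are placeholders:

```python
import pandas as pd

# Both calls now dispatch to the separate pandas-gbq package.
df = pd.read_gbq('SELECT name FROM my_dataset.my_table', project_id='my-project')
df.to_gbq('my_dataset.new_table', 'my-project')
```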

Memory usage for Index is more accurate

In previous versions, showing .memory_usage() on a pandas structure that has an index would only include the actual index values, not the structures that facilitate fast indexing. This will generally differ for Index and MultiIndex, and less so for other index types. (GH15237)

Previous behavior:

```python
In [8]: index = pd.Index(['foo', 'bar', 'baz'])

In [9]: index.memory_usage(deep=True)
Out[9]: 180

In [10]: index.get_loc('foo')
Out[10]: 0

In [11]: index.memory_usage(deep=True)
Out[11]: 180
```

New behavior:

```python
In [8]: index = pd.Index(['foo', 'bar', 'baz'])

In [9]: index.memory_usage(deep=True)
Out[9]: 180

In [10]: index.get_loc('foo')
Out[10]: 0

In [11]: index.memory_usage(deep=True)
Out[11]: 260
```

DataFrame.sort_index changes

In certain cases, calling .sort_index() on a MultiIndexed DataFrame would return the same DataFrame without seeming to sort. This would happen with lexsorted, but non-monotonic, levels. (GH15622, GH15687, GH14015, GH13431, GH15797)

This is unchanged from prior versions, but shown for illustration purposes:

```python
In [93]: df = pd.DataFrame(np.arange(6), columns=['value'],
   ....:                   index=pd.MultiIndex.from_product([list('BA'), range(3)]))
   ....:

In [94]: df
Out[94]:
     value
B 0      0
  1      1
  2      2
A 0      3
  1      4
  2      5

[6 rows x 1 columns]

In [95]: df.index.is_lexsorted()
Out[95]: False

In [96]: df.index.is_monotonic
Out[96]: False
```

Sorting works as expected

```python
In [97]: df.sort_index()
Out[97]:
     value
A 0      3
  1      4
  2      5
B 0      0
  1      1
  2      2

[6 rows x 1 columns]

In [98]: df.sort_index().index.is_lexsorted()
Out[98]: True

In [99]: df.sort_index().index.is_monotonic
Out[99]: True
```

However, this example, which has a non-monotonic 2nd level, doesn’t behave as desired.

```python
In [100]: df = pd.DataFrame({'value': [1, 2, 3, 4]},
   .....:                   index=pd.MultiIndex([['a', 'b'], ['bb', 'aa']],
   .....:                                       [[0, 0, 1, 1], [0, 1, 0, 1]]))
   .....:

In [101]: df
Out[101]:
       value
a bb       1
  aa       2
b bb       3
  aa       4

[4 rows x 1 columns]
```

Previous behavior:

```python
In [11]: df.sort_index()
Out[11]:
       value
a bb       1
  aa       2
b bb       3
  aa       4

In [14]: df.sort_index().index.is_lexsorted()
Out[14]: True

In [15]: df.sort_index().index.is_monotonic
Out[15]: False
```

New behavior:

```python
In [102]: df.sort_index()
Out[102]:
       value
a aa       2
  bb       1
b aa       4
  bb       3

[4 rows x 1 columns]

In [103]: df.sort_index().index.is_lexsorted()
Out[103]: True

In [104]: df.sort_index().index.is_monotonic
Out[104]: True
```

Groupby describe formatting

The output formatting of groupby.describe() now labels the describe() metrics in the columns instead of the index. This format is consistent with groupby.agg() when applying multiple functions at once. (GH4792)

Previous behavior:

```python
In [1]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

In [2]: df.groupby('A').describe()
Out[2]:
                B
A
1 count  2.000000
  mean   1.500000
  std    0.707107
  min    1.000000
  25%    1.250000
  50%    1.500000
  75%    1.750000
  max    2.000000
2 count  2.000000
  mean   3.500000
  std    0.707107
  min    3.000000
  25%    3.250000
  50%    3.500000
  75%    3.750000
  max    4.000000

In [3]: df.groupby('A').agg([np.mean, np.std, np.min, np.max])
Out[3]:
     B
  mean       std amin amax
A
1  1.5  0.707107    1    2
2  3.5  0.707107    3    4
```

New behavior:

```python
In [105]: df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

In [106]: df.groupby('A').describe()
Out[106]:
      B
  count mean       std  min   25%  50%   75%  max
A
1   2.0  1.5  0.707107  1.0  1.25  1.5  1.75  2.0
2   2.0  3.5  0.707107  3.0  3.25  3.5  3.75  4.0

[2 rows x 8 columns]

In [107]: df.groupby('A').agg([np.mean, np.std, np.min, np.max])
Out[107]:
     B
  mean       std amin amax
A
1  1.5  0.707107    1    2
2  3.5  0.707107    3    4

[2 rows x 4 columns]
```

Window binary corr/cov operations return a MultiIndex DataFrame

A binary window operation, like .corr() or .cov(), when operating on a .rolling(..), .expanding(..), or .ewm(..) object, will now return a 2-level MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here. These are equivalent in function, but a MultiIndexed DataFrame enjoys more support in pandas. See the section on Windowed Binary Operations for more information. (GH15677)

```python
In [108]: np.random.seed(1234)

In [109]: df = pd.DataFrame(np.random.rand(100, 2),
   .....:                   columns=pd.Index(['A', 'B'], name='bar'),
   .....:                   index=pd.date_range('20160101',
   .....:                                       periods=100, freq='D', name='foo'))
   .....:

In [110]: df.tail()
Out[110]:
bar                A         B
foo
2016-04-05  0.640880  0.126205
2016-04-06  0.171465  0.737086
2016-04-07  0.127029  0.369650
2016-04-08  0.604334  0.103104
2016-04-09  0.802374  0.945553

[5 rows x 2 columns]
```

Previous behavior:

```python
In [2]: df.rolling(12).corr()
Out[2]:
<class 'pandas.core.panel.Panel'>
Dimensions: 100 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: 2016-01-01 00:00:00 to 2016-04-09 00:00:00
Major_axis axis: A to B
Minor_axis axis: A to B
```

New behavior:

```python
In [111]: res = df.rolling(12).corr()

In [112]: res.tail()
Out[112]:
bar                    A         B
foo        bar
2016-04-07 B   -0.132090  1.000000
2016-04-08 A    1.000000 -0.145775
           B   -0.145775  1.000000
2016-04-09 A    1.000000  0.119645
           B    0.119645  1.000000

[5 rows x 2 columns]
```

Retrieving a correlation matrix for a cross-section

```python
In [113]: df.rolling(12).corr().loc['2016-04-07']
Out[113]:
bar                  A        B
foo        bar
2016-04-07 A   1.00000 -0.13209
           B  -0.13209  1.00000

[2 rows x 2 columns]
```

HDFStore where string comparison

In previous versions, most types could be compared to a string column in an HDFStore, usually resulting in an invalid comparison that returned an empty result frame. These comparisons will now raise a TypeError (GH15492)

```python
In [114]: df = pd.DataFrame({'unparsed_date': ['2014-01-01', '2014-01-01']})

In [115]: df.to_hdf('store.h5', 'key', format='table', data_columns=True)

In [116]: df.dtypes
Out[116]:
unparsed_date    object
Length: 1, dtype: object
```

Previous behavior:

```python
In [4]: pd.read_hdf('store.h5', 'key', where='unparsed_date > ts')
  File "<string>", line 1
    (unparsed_date > 1970-01-01 00:00:01.388552400)
                          ^
SyntaxError: invalid token
```

New behavior:

```python
In [18]: ts = pd.Timestamp('2014-01-01')

In [19]: pd.read_hdf('store.h5', 'key', where='unparsed_date > ts')
TypeError: Cannot compare 2014-01-01 00:00:00 of
type <class 'pandas.tslib.Timestamp'> to string column
```

Index.intersection and inner join now preserve the order of the left Index

Index.intersection() now preserves the order of the calling Index (left) instead of the other Index (right) (GH15582). This affects inner joins, DataFrame.join() and merge(), and the .align method.

  • Index.intersection
```python
In [117]: left = pd.Index([2, 1, 0])

In [118]: left
Out[118]: Int64Index([2, 1, 0], dtype='int64')

In [119]: right = pd.Index([1, 2, 3])

In [120]: right
Out[120]: Int64Index([1, 2, 3], dtype='int64')
```

Previous behavior:

```python
In [4]: left.intersection(right)
Out[4]: Int64Index([1, 2], dtype='int64')
```

New behavior:

```python
In [121]: left.intersection(right)
Out[121]: Int64Index([2, 1], dtype='int64')
```
  • DataFrame.join and pd.merge
```python
In [122]: left = pd.DataFrame({'a': [20, 10, 0]}, index=[2, 1, 0])

In [123]: left
Out[123]:
    a
2  20
1  10
0   0

[3 rows x 1 columns]

In [124]: right = pd.DataFrame({'b': [100, 200, 300]}, index=[1, 2, 3])

In [125]: right
Out[125]:
     b
1  100
2  200
3  300

[3 rows x 1 columns]
```

Previous behavior:

```python
In [4]: left.join(right, how='inner')
Out[4]:
    a    b
1  10  100
2  20  200
```

New behavior:

```python
In [126]: left.join(right, how='inner')
Out[126]:
    a    b
2  20  200
1  10  100

[2 rows x 2 columns]
```

Pivot table always returns a DataFrame

The documentation for pivot_table() states that a DataFrame is always returned. Here a bug is fixed that allowed this to return a Series under certain circumstances. (GH4386)

```python
In [127]: df = pd.DataFrame({'col1': [3, 4, 5],
   .....:                    'col2': ['C', 'D', 'E'],
   .....:                    'col3': [1, 3, 9]})
   .....:

In [128]: df
Out[128]:
   col1 col2  col3
0     3    C     1
1     4    D     3
2     5    E     9

[3 rows x 3 columns]
```

Previous behavior:

```python
In [2]: df.pivot_table('col1', index=['col3', 'col2'], aggfunc=np.sum)
Out[2]:
col3  col2
1     C       3
3     D       4
9     E       5
Name: col1, dtype: int64
```

New behavior:

```python
In [129]: df.pivot_table('col1', index=['col3', 'col2'], aggfunc=np.sum)
Out[129]:
           col1
col3 col2
1    C        3
3    D        4
9    E        5

[3 rows x 1 columns]
```

Other API changes

  • numexpr version is now required to be >= 2.4.6 and it will not be used at all if this requisite is not fulfilled (GH15213).
  • CParserError has been renamed to ParserError in pd.read_csv() and will be removed in the future (GH12665)
  • SparseArray.cumsum() and SparseSeries.cumsum() will now always return SparseArray and SparseSeries respectively (GH12855)
  • DataFrame.applymap() with an empty DataFrame will return a copy of the empty DataFrame instead of a Series (GH8222)
  • Series.map() now respects default values of dictionary subclasses with a __missing__ method, such as collections.Counter (GH15999); see the sketch after this list
  • .loc has gained compat with .ix for accepting iterators and NamedTuples (GH15120)
  • interpolate() and fillna() will raise a ValueError if the limit keyword argument is not greater than 0. (GH9217)
  • pd.read_csv() will now issue a ParserWarning whenever there are conflicting values provided by the dialect parameter and the user (GH14898)
  • pd.read_csv() will now raise a ValueError for the C engine if the quote character is longer than one byte (GH11592)
  • inplace arguments now require a boolean value, else a ValueError is thrown (GH14189)
  • pandas.api.types.is_datetime64_ns_dtype will now report True on a tz-aware dtype, similar to pandas.api.types.is_datetime64_any_dtype
  • DataFrame.asof() will return a null filled Series instead of the scalar NaN if a match is not found (GH15118)
  • Specific support for copy.copy() and copy.deepcopy() functions on NDFrame objects (GH15444)
  • Series.sort_values() accepts a one element list of bool for consistency with the behavior of DataFrame.sort_values() (GH15604)
  • .merge() and .join() on category dtype columns will now preserve the category dtype when possible (GH10409)
  • SparseDataFrame.default_fill_value will be 0, previously was nan in the return from pd.get_dummies(..., sparse=True) (GH15594)
  • The default behaviour of Series.str.match has changed from extracting groups to matching the pattern. The extracting behaviour was deprecated since pandas version 0.13.0 and can be done with the Series.str.extract method (GH5224). As a consequence, the as_indexer keyword is ignored (no longer needed to specify the new behaviour) and is deprecated.
  • NaT will now correctly report False for datetimelike boolean operations such as is_month_start (GH15781)
  • NaT will now correctly return np.nan for Timedelta and Period accessors such as days and quarter (GH15782)
  • NaT will now return NaT for the tz_localize and tz_convert methods (GH15830)
  • DataFrame and Panel constructors with invalid input will now raise ValueError rather than pandas.core.common.PandasError when called with scalar inputs and no axes; the PandasError exception has been removed as well (GH15541)
  • The exception pandas.core.common.AmbiguousIndexError is removed as it is not referenced (GH15541)
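
As referenced in the list above, a minimal sketch of the Series.map() change with collections.Counter; the data is made up:

```python
import collections

import pandas as pd

counter = collections.Counter(['a', 'a', 'b'])
s = pd.Series(['a', 'b', 'c'])

# Counter defines __missing__, so the unseen key 'c' maps to 0 rather than NaN.
s.map(counter)
```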

Reorganization of the library: privacy changes

Module privacy has changed

Some formerly public python/c/c++/cython extension modules have been moved and/or renamed. These are all removed from the public API. Furthermore, the pandas.core, pandas.compat, and pandas.util top-level modules are now considered to be PRIVATE. If indicated, a deprecation warning will be issued if you reference these modules. (GH12588)

| Previous Location | New Location | Deprecated |
| --- | --- | --- |
| pandas.lib | pandas._libs.lib | X |
| pandas.tslib | pandas._libs.tslib | X |
| pandas.computation | pandas.core.computation | X |
| pandas.msgpack | pandas.io.msgpack | |
| pandas.index | pandas._libs.index | |
| pandas.algos | pandas._libs.algos | |
| pandas.hashtable | pandas._libs.hashtable | |
| pandas.indexes | pandas.core.indexes | |
| pandas.json | pandas._libs.json / pandas.io.json | X |
| pandas.parser | pandas._libs.parsers | X |
| pandas.formats | pandas.io.formats | |
| pandas.sparse | pandas.core.sparse | |
| pandas.tools | pandas.core.reshape | X |
| pandas.types | pandas.core.dtypes | X |
| pandas.io.sas.saslib | pandas.io.sas._sas | |
| pandas._join | pandas._libs.join | |
| pandas._hash | pandas._libs.hashing | |
| pandas._period | pandas._libs.period | |
| pandas._sparse | pandas._libs.sparse | |
| pandas._testing | pandas._libs.testing | |
| pandas._window | pandas._libs.window | |

Some new subpackages are created with public functionality that is not directly exposed in the top-level namespace: pandas.errors, pandas.plotting and pandas.testing (more details below). Together with pandas.api.types and certain functions in the pandas.io and pandas.tseries submodules, these are now the public subpackages.

Further changes:

  • The function union_categoricals() is now importable from pandas.api.types, formerly from pandas.types.concat (GH15998); see the sketch after this list
  • The type import pandas.tslib.NaTType is deprecated and can be replaced by using type(pandas.NaT) (GH16146)
  • The public functions in pandas.tools.hashing have been deprecated from that location, but are now importable from pandas.util (GH16223)
  • The modules in pandas.util: decorators, print_versions, doctools, validators, depr_module are now private. Only the functions exposed in pandas.util itself are public (GH16223)
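
As referenced in the list above, a minimal sketch of the new import location for union_categoricals(); the categories are made up:

```python
import pandas as pd
from pandas.api.types import union_categoricals  # formerly pandas.types.concat

a = pd.Categorical(['a', 'b'])
b = pd.Categorical(['b', 'c'])
union_categoricals([a, b])
```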

pandas.errors

We are adding a standard public module for all pandas exceptions & warnings, pandas.errors (GH14800). Previously these exceptions & warnings could be imported from pandas.core.common or pandas.io.common. These exceptions and warnings will be removed from the *.common locations in a future release. (GH15541)

The following are now part of this API:

```python
['DtypeWarning',
 'EmptyDataError',
 'OutOfBoundsDatetime',
 'ParserError',
 'ParserWarning',
 'PerformanceWarning',
 'UnsortedIndexError',
 'UnsupportedFunctionCall']
```
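
A minimal sketch of catching one of these from the new location; the empty input is just a convenient way to trigger the error:

```python
from io import StringIO

import pandas as pd
from pandas.errors import EmptyDataError

try:
    pd.read_csv(StringIO(''))  # nothing to parse
except EmptyDataError as err:
    print('caught:', err)
```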

pandas.testing

We are adding a standard module that exposes the public testing functions in pandas.testing (GH9895). Those functions can be used when writing tests for functionality using pandas objects.

The following testing functions are now part of this API: testing.assert_frame_equal(), testing.assert_series_equal(), and testing.assert_index_equal().
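
A minimal sketch of using one of them in a test; the Series contents are arbitrary:

```python
import pandas as pd
from pandas.testing import assert_series_equal

left = pd.Series([1, 2, 3])
right = pd.Series([1, 2, 3])

# Passes silently; raises an informative AssertionError on any mismatch.
assert_series_equal(left, right)
```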

pandas.plotting

A new public pandas.plotting module has been added that holds plotting functionality that was previously in either pandas.tools.plotting or in the top-level namespace. See the deprecations sections for more details.
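
A minimal sketch of the new import location (matplotlib is required for the actual rendering; the DataFrame is made up):

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix  # previously pandas.tools.plotting

df = pd.DataFrame(np.random.randn(50, 3), columns=list('ABC'))
scatter_matrix(df, alpha=0.5)
```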

Other Development Changes

  • Building pandas for development now requires cython >= 0.23 (GH14831)
  • Require at least 0.23 version of cython to avoid problems with character encodings (GH14699)
  • Switched the test framework to use pytest (GH13097)
  • Reorganization of tests directory layout (GH14854, GH15707).

Deprecations

Deprecate .ix

The .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers. .ix offers a lot of magic on the inference of what the user wants to do. To wit, .ix can decide to index positionally OR via labels, depending on the data type of the index. This has caused quite a bit of user confusion over the years. The full indexing documentation is here. (GH14218)

The recommended methods of indexing are:

  • .loc if you want to index by label
  • .iloc if you want to index positionally.

Using .ix will now show a DeprecationWarning with a link to some examples of how to convert code here.

```python
In [130]: df = pd.DataFrame({'A': [1, 2, 3],
   .....:                    'B': [4, 5, 6]},
   .....:                   index=list('abc'))
   .....:

In [131]: df
Out[131]:
   A  B
a  1  4
b  2  5
c  3  6

[3 rows x 2 columns]
```

Previous behavior, where you wish to get the 0th and the 2nd elements from the index in the ‘A’ column.

```python
In [3]: df.ix[[0, 2], 'A']
Out[3]:
a    1
c    3
Name: A, dtype: int64
```

Using .loc. Here we will select the appropriate indexes from the index, then use label indexing.

```python
In [132]: df.loc[df.index[[0, 2]], 'A']
Out[132]:
a    1
c    3
Name: A, Length: 2, dtype: int64
```

Using .iloc. Here we will get the location of the ‘A’ column, then use positional indexing to select things.

```python
In [133]: df.iloc[[0, 2], df.columns.get_loc('A')]
Out[133]:
a    1
c    3
Name: A, Length: 2, dtype: int64
```

Deprecate Panel

Panel is deprecated and will be removed in a future version. The recommended way to represent 3-D data is with a MultiIndex on a DataFrame via the to_frame() method, or with the xarray package. Pandas provides a to_xarray() method to automate this conversion (GH13563).

```python
In [133]: import pandas.util.testing as tm

In [134]: p = tm.makePanel()

In [135]: p
Out[135]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D
```

Convert to a MultiIndex DataFrame

```python
In [136]: p.to_frame()
Out[136]:
                     ItemA     ItemB     ItemC
major      minor
2000-01-03 A      0.628776 -1.409432  0.209395
           B      0.988138 -1.347533 -0.896581
           C     -0.938153  1.272395 -0.161137
           D     -0.223019 -0.591863 -1.051539
2000-01-04 A      0.186494  1.422986 -0.592886
           B     -0.072608  0.363565  1.104352
           C     -1.239072 -1.449567  0.889157
           D      2.123692 -0.414505 -0.319561
2000-01-05 A      0.952478 -2.147855 -1.473116
           B     -0.550603 -0.014752 -0.431550
           C      0.139683 -1.195524  0.288377
           D      0.122273 -1.425795 -0.619993

[12 rows x 3 columns]
```

Convert to an xarray DataArray

```python
In [137]: p.to_xarray()
Out[137]:
<xarray.DataArray (items: 3, major_axis: 3, minor_axis: 4)>
array([[[ 0.628776,  0.988138, -0.938153, -0.223019],
        [ 0.186494, -0.072608, -1.239072,  2.123692],
        [ 0.952478, -0.550603,  0.139683,  0.122273]],

       [[-1.409432, -1.347533,  1.272395, -0.591863],
        [ 1.422986,  0.363565, -1.449567, -0.414505],
        [-2.147855, -0.014752, -1.195524, -1.425795]],

       [[ 0.209395, -0.896581, -0.161137, -1.051539],
        [-0.592886,  1.104352,  0.889157, -0.319561],
        [-1.473116, -0.43155 ,  0.288377, -0.619993]]])
Coordinates:
  * items       (items) object 'ItemA' 'ItemB' 'ItemC'
  * major_axis  (major_axis) datetime64[ns] 2000-01-03 2000-01-04 2000-01-05
  * minor_axis  (minor_axis) object 'A' 'B' 'C' 'D'
```

Deprecate groupby.agg() with a dictionary when renaming

The .groupby(..).agg(..), .rolling(..).agg(..), and .resample(..).agg(..) syntax can accept a variety of inputs, including scalars, lists, and a dict of column names to scalars or lists. This provides a useful syntax for constructing multiple (potentially different) aggregations.

However, .agg(..) can also accept a dict that allows ‘renaming’ of the result columns. This is a complicated and confusing syntax, as well as not consistent between Series and DataFrame. We are deprecating this ‘renaming’ functionality.

  • We are deprecating passing a dict to a grouped/rolled/resampled Series. This allowed one to rename the resulting aggregation, but this had a completely different meaning than passing a dictionary to a grouped DataFrame, which accepts column-to-aggregations.
  • We are deprecating passing a dict-of-dicts to a grouped/rolled/resampled DataFrame in a similar manner.

This is an illustrative example:

```python
In [134]: df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
   .....:                    'B': range(5),
   .....:                    'C': range(5)})
   .....:

In [135]: df
Out[135]:
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  2  3  3
4  2  4  4

[5 rows x 3 columns]
```

Here is a typical useful syntax for computing different aggregations for different columns. This is a natural, and useful syntax. We aggregate from the dict-to-list by taking the specified columns and applying the list of functions. This returns a MultiIndex for the columns (this is not deprecated).

```python
In [136]: df.groupby('A').agg({'B': 'sum', 'C': 'min'})
Out[136]:
   B  C
A
1  3  0
2  7  3

[2 rows x 2 columns]
```

Here’s an example of the first deprecation, passing a dict to a grouped Series. This is a combination aggregation & renaming:

```python
In [6]: df.groupby('A').B.agg({'foo': 'count'})
FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version

Out[6]:
   foo
A
1    3
2    2
```

You can accomplish the same operation more idiomatically by:

```python
In [137]: df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})
Out[137]:
   foo
A
1    3
2    2

[2 rows x 1 columns]
```

Here’s an example of the second deprecation, passing a dict-of-dict to a grouped DataFrame:

```python
In [23]: (df.groupby('A')
   ...:     .agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
   ...:  )
FutureWarning: using a dict with renaming is deprecated and
will be removed in a future version

Out[23]:
     B    C
   foo  bar
A
1    3    0
2    7    3
```

You can accomplish nearly the same by:

```python
In [138]: (df.groupby('A')
   .....:    .agg({'B': 'sum', 'C': 'min'})
   .....:    .rename(columns={'B': 'foo', 'C': 'bar'})
   .....:  )
   .....:
Out[138]:
   foo  bar
A
1    3    0
2    7    3

[2 rows x 2 columns]
```

Deprecate .plotting

The pandas.tools.plotting module has been deprecated, in favor of the top level pandas.plotting module. All the public plotting functions are now available from pandas.plotting (GH12548).

Furthermore, the top-level pandas.scatter_matrix and pandas.plot_params are deprecated. Users can import these from pandas.plotting as well.

Previous script:

  1. pd.tools.plotting.scatter_matrix(df)
  2. pd.scatter_matrix(df)

Should be changed to:

  1. pd.plotting.scatter_matrix(df)
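
The deprecated top-level pd.plot_params is likewise available as pd.plotting.plot_params. Here is a minimal sketch of the migration (df is an illustrative DataFrame with a DatetimeIndex; the 'x_compat' option and the plot_params.use context manager are part of the existing plotting options API):

  1. import numpy as np
  2. import pandas as pd
  3. df = pd.DataFrame(np.random.randn(100, 1), columns=['A'],
  4.                   index=pd.date_range('1/1/2000', periods=100))
  5. # previously: with pd.plot_params.use('x_compat', True): ...
  6. with pd.plotting.plot_params.use('x_compat', True):
  7.     df['A'].plot()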

Other deprecations

  • SparseArray.to_dense() has deprecated the fill parameter, as that parameter was not being respected (GH14647)
  • SparseSeries.to_dense() has deprecated the sparse_only parameter (GH14647)
  • Series.repeat() has deprecated the reps parameter in favor of repeats (GH12662); see the sketch after this list
  • The Series constructor and .astype method have deprecated accepting timestamp dtypes without a frequency (e.g. np.datetime64) for the dtype parameter (GH15524)
  • Index.repeat() and MultiIndex.repeat() have deprecated the n parameter in favor of repeats (GH12662)
  • Categorical.searchsorted() and Series.searchsorted() have deprecated the v parameter in favor of value (GH12662)
  • TimedeltaIndex.searchsorted(), DatetimeIndex.searchsorted(), and PeriodIndex.searchsorted() have deprecated the key parameter in favor of value (GH12662)
  • DataFrame.astype() has deprecated the raise_on_error parameter in favor of errors (GH14878)
  • Series.sortlevel and DataFrame.sortlevel have been deprecated in favor of Series.sort_index and DataFrame.sort_index (GH15099)
  • Importing concat from pandas.tools.merge has been deprecated in favor of imports from the pandas namespace. This should only affect explicit imports (GH15358)
  • Series/DataFrame/Panel.consolidate() has been deprecated as a public method (GH15483)
  • The as_indexer keyword of Series.str.match() has been deprecated (ignored keyword) (GH15257).
  • The following top-level pandas functions have been deprecated and will be removed in a future version (GH13790, GH15940):
  • pd.pnow(), replaced by Period.now()
  • pd.Term is removed, as it is not applicable to user code; instead use in-line string expressions in the where clause when searching in HDFStore
  • pd.Expr is removed, as it is not applicable to user code
  • pd.match() is removed
  • pd.groupby(), replaced by using the .groupby() method directly on a Series/DataFrame
  • pd.get_store(), replaced by a direct call to pd.HDFStore(...)
  • is_any_int_dtype, is_floating_dtype, and is_sequence are deprecated from pandas.api.types (GH16042)
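
As a quick, non-exhaustive sketch of migrating off the renamed keywords and deprecated methods listed above (the objects here are illustrative only):

  1. import pandas as pd
  2. s = pd.Series([1, 2, 3])
  3. s.repeat(repeats=2)      # formerly reps=
  4. s.searchsorted(value=2)  # formerly v=
  5. s.sort_index()           # replaces the deprecated s.sortlevel()
  6. df = pd.DataFrame({'a': ['1', '2']})
  7. df.astype('int64', errors='raise')  # formerly raise_on_error=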

Removal of prior version deprecations/changes

  • The pandas.rpy module is removed. Similar functionality can be accessed through the rpy2 project. See the R interfacing docs for more details.
  • The pandas.io.ga module with a google-analytics interface is removed (GH11308). Similar functionality can be found in the Google2Pandas package.
  • pd.to_datetime and pd.to_timedelta have dropped the coerce parameter in favor of errors (GH13602)
  • pandas.stats.fama_macbeth, pandas.stats.ols, pandas.stats.plm and pandas.stats.var, as well as the top-level pandas.fama_macbeth and pandas.ols routines are removed. Similar functionality can be found in the statsmodels package. (GH11898)
  • The TimeSeries and SparseTimeSeries classes, aliases of Series and SparseSeries, are removed (GH10890, GH15098).
  • Series.is_time_series is dropped in favor of Series.index.is_all_dates (GH15098)
  • The deprecated irow, icol, iget and iget_value methods are removed in favor of iloc and iat as explained here (GH10711).
  • The deprecated DataFrame.iterkv() has been removed in favor of DataFrame.iteritems() (GH10711)
  • The Categorical constructor has dropped the name parameter (GH10632)
  • Categorical has dropped support for NaN categories (GH10748)
  • The take_last parameter has been dropped from the duplicated(), drop_duplicates(), nlargest(), and nsmallest() methods; use keep='last' instead (GH10236, GH10792, GH10920) (see the sketch after this list)
  • Series, Index, and DataFrame have dropped the sort and order methods (GH10726)
  • Where clauses in pytables are only accepted as strings or expression types, not other data types (GH12027)
  • DataFrame has dropped the combineAdd and combineMult methods in favor of add and mul respectively (GH10735)
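
For reference, a minimal sketch of the idiomatic replacements for a few of the removed methods (the fill_value arguments mirror the old combineAdd/combineMult handling of missing values):

  1. import pandas as pd
  2. s = pd.Series([3, 1, 2, 1])
  3. s.sort_values()                 # replaces the removed s.order()
  4. s.drop_duplicates(keep='last')  # replaces take_last=True
  5. df = pd.DataFrame({'x': [1, 2]})
  6. df.add(df, fill_value=0)        # replaces DataFrame.combineAdd
  7. df.mul(df, fill_value=1)        # replaces DataFrame.combineMult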

Performance improvements

  • Improved performance of pd.wide_to_long() (GH14779)
  • Improved performance of pd.factorize() by releasing the GIL with object dtype when inferred as strings (GH14859, GH16057)
  • Improved performance of timeseries plotting with an irregular DatetimeIndex (or with compat_x=True) (GH15073).
  • Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
  • Improved performance and reduced memory when indexing with a MultiIndex (GH15245)
  • When reading a buffer object in the read_sas() method without a specified format, the filepath string is inferred rather than the buffer object (GH14947)
  • Improved performance of .rank() for categorical data (GH15498)
  • Improved performance when using .unstack() (GH15503)
  • Improved performance of merge/join on category columns (GH10409)
  • Improved performance of drop_duplicates() on bool columns (GH12963)
  • Improved performance of pd.core.groupby.GroupBy.apply when the applied function uses the .name attribute of the group DataFrame (GH15062)
  • Improved performance of iloc indexing with a list or array (GH15504).
  • Improved performance of Series.sort_index() with a monotonic index (GH15694)
  • Improved performance in pd.read_csv() on some platforms with buffered reads (GH16039)

Bug fixes

Conversion

  • Timestamp.replace now raises TypeError when incorrect argument names are given; previously this raised ValueError (GH15240) (see the sketch after this list)
  • Bug in Timestamp.replace with compat for passing long integers (GH15030)
  • Bug in Timestamp returning UTC based time/date attributes when a timezone was provided (GH13303, GH6538)
  • Bug in Timestamp incorrectly localizing timezones during construction (GH11481, GH15777)
  • Bug in TimedeltaIndex addition where overflow was being allowed without error (GH14816)
  • Bug in TimedeltaIndex raising a ValueError when boolean indexing with loc (GH14946)
  • Bug in catching an overflow in Timestamp + Timedelta/Offset operations (GH15126)
  • Bug in DatetimeIndex.round() and Timestamp.round() floating point accuracy when rounding by milliseconds or less (GH14440, GH15578)
  • Bug in astype() where inf values were incorrectly converted to integers; astype() now raises an error for Series and DataFrames (GH14265)
  • Bug in DataFrame(..).apply(to_numeric) when values are of type decimal.Decimal. (GH14827)
  • Bug in describe() when passing a numpy array which does not contain the median to the percentiles keyword argument (GH14908)
  • Cleaned up PeriodIndex constructor, including raising on floats more consistently (GH13277)
  • Bug in using __deepcopy__ on empty NDFrame objects (GH15370)
  • Bug in .replace() that could result in incorrect dtypes (GH12747, GH15765)
  • Bug in Series.replace and DataFrame.replace which failed on empty replacement dicts (GH15289)
  • Bug in Series.replace which replaced a numeric by string (GH15743)
  • Bug in Index construction with NaN elements and integer dtype specified (GH15187)
  • Bug in Series construction with a datetimetz (GH14928)
  • Bug in Series.dt.round() with inconsistent behaviour on NaT with different arguments (GH14940)
  • Bug in Series constructor when both copy=True and dtype arguments are provided (GH15125)
  • Bug where an incorrectly dtyped Series was returned by comparison methods (e.g., lt, gt, …) against a constant for an empty DataFrame (GH15077)
  • Bug in Series.ffill() with mixed dtypes containing tz-aware datetimes. (GH14956)
  • Bug in DataFrame.fillna() where the argument downcast was ignored when fillna value was of type dict (GH15277)
  • Bug in .asfreq(), where frequency was not set for empty Series (GH14320)
  • Bug in DataFrame construction with nulls and datetimes in a list-like (GH15869)
  • Bug in DataFrame.fillna() with tz-aware datetimes (GH15855)
  • Bug in is_string_dtype, is_timedelta64_ns_dtype, and is_string_like_dtype in which an error was raised when None was passed in (GH15941)
  • Bug in the return type of pd.unique on a Categorical, which was returning an ndarray and not a Categorical (GH15903)
  • Bug in Index.to_series() where the index was not copied (and so mutating later would change the original), (GH15949)
  • Bug in partial string indexing with a length-1 DataFrame (GH16071)
  • Bug in Series construction where passing an invalid dtype didn’t raise an error (GH15520)
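
For example, the Timestamp.replace change above surfaces invalid keyword names as a TypeError. A minimal sketch:

  1. import pandas as pd
  2. ts = pd.Timestamp('2017-01-01 12:00')
  3. ts.replace(hour=3)       # datetime.replace-style keywords are valid
  4. try:
  5.     ts.replace(hours=3)  # invalid keyword name
  6. except TypeError:        # previously this raised ValueError
  7.     pass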

Indexing

  • Bug in Index power operations with reversed operands (GH14973)
  • Bug in DataFrame.sort_values() when sorting by multiple columns where one column is of type int64 and contains NaT (GH14922)
  • Bug in DataFrame.reindex() in which method was ignored when passing columns (GH14992)
  • Bug in DataFrame.loc with indexing a MultiIndex with a Series indexer (GH14730, GH15424)
  • Bug in DataFrame.loc with indexing a MultiIndex with a numpy array (GH15434)
  • Bug in Series.asof which raised if the series contained all np.nan (GH15713)
  • Bug in .at when selecting from a tz-aware column (GH15822)
  • Bug in Series.where() and DataFrame.where() where array-like conditionals were being rejected (GH15414)
  • Bug in Series.where() where TZ-aware data was converted to float representation (GH15701)
  • Bug in .loc that would not return the correct dtype for scalar access for a DataFrame (GH11617)
  • Bug in output formatting of a MultiIndex when names are integers (GH12223, GH15262)
  • Bug in Categorical.searchsorted() where alphabetical instead of the provided categorical order was used (GH14522)
  • Bug in Series.iloc where a Categorical object was returned for list-like index inputs when a Series was expected (GH14580)
  • Bug in DataFrame.isin comparing datetimelike to empty frame (GH15473)
  • Bug in .reset_index() where an all-NaN level of a MultiIndex would fail (GH6322)
  • Bug in .reset_index() raising an error when an index name was already present in the MultiIndex columns (GH16120)
  • Bug in creating a MultiIndex with tuples and not passing a list of names; this will now raise ValueError (GH15110)
  • Bug in the HTML display with a MultiIndex and truncation (GH14882)
  • Bug in the display of .info() where a qualifier (+) would always be displayed with a MultiIndex that contains only non-strings (GH15245)
  • Bug in pd.concat() where the names of a MultiIndex of the resulting DataFrame were not handled correctly when None was present in the names of the MultiIndex of an input DataFrame (GH15787)
  • Bug in DataFrame.sort_index() and Series.sort_index() where na_position did not work with a MultiIndex (GH14784, GH16604)
  • Bug in pd.concat() when combining objects with a CategoricalIndex (GH16111)
  • Bug in indexing with a scalar and a CategoricalIndex (GH16123)

I/O

  • Bug in pd.to_numeric() in which float and unsigned integer elements were being improperly cast (GH14941, GH15005)
  • Bug in pd.read_fwf() where the skiprows parameter was not being respected during column width inference (GH11256)
  • Bug in pd.read_csv() in which the dialect parameter was not being verified before processing (GH14898)
  • Bug in pd.read_csv() in which missing data was being improperly handled with usecols (GH6710)
  • Bug in pd.read_csv() in which a file containing a row with many columns followed by rows with fewer columns would cause a crash (GH14125)
  • Bug in pd.read_csv() for the C engine where usecols were being indexed incorrectly with parse_dates (GH14792)
  • Bug in pd.read_csv() with parse_dates when multi-line headers are specified (GH15376)
  • Bug in pd.read_csv() with float_precision='round_trip' which caused a segfault when a text entry is parsed (GH15140)
  • Bug in pd.read_csv() when an index was specified and no values were specified as null values (GH15835)
  • Bug in pd.read_csv() in which certain invalid file objects caused the Python interpreter to crash (GH15337)
  • Bug in pd.read_csv() in which invalid values for nrows and chunksize were allowed (GH15767)
  • Bug in pd.read_csv() for the Python engine in which unhelpful error messages were being raised when parsing errors occurred (GH15910)
  • Bug in pd.read_csv() in which the skipfooter parameter was not being properly validated (GH15925)
  • Bug in pd.to_csv() in which there was numeric overflow when a timestamp index was being written (GH15982)
  • Bug in pd.util.hashing.hash_pandas_object() in which hashing of categoricals depended on the ordering of categories, instead of just their values. (GH15143)
  • Bug in .to_json() where lines=True and contents (keys or values) contain escaped characters (GH15096)
  • Bug in .to_json() causing single byte ascii characters to be expanded to four byte unicode (GH15344)
  • Bug in .to_json() for the C engine where rollover was not correctly handled for case where frac is odd and diff is exactly 0.5 (GH15716, GH15864)
  • Bug in pd.read_json() for Python 2 where lines=True and contents contain non-ascii unicode characters (GH15132)
  • Bug in pd.read_msgpack() in which Series categoricals were being improperly processed (GH14901)
  • Bug in pd.read_msgpack() which did not allow loading of a dataframe with an index of type CategoricalIndex (GH15487)
  • Bug in pd.read_msgpack() when deserializing a CategoricalIndex (GH15487)
  • Bug in DataFrame.to_records() with converting a DatetimeIndex with a timezone (GH13937)
  • Bug in DataFrame.to_records() which failed with unicode characters in column names (GH11879)
  • Bug in .to_sql() when writing a DataFrame with numeric index names (GH15404).
  • Bug in DataFrame.to_html() with index=False and max_rows raising an IndexError (GH14998)
  • Bug in pd.read_hdf() passing a Timestamp to the where parameter with a non date column (GH15492)
  • Bug in DataFrame.to_stata() and StataWriter which produced incorrectly formatted files for some locales (GH13856)
  • Bug in StataReader and StataWriter which allowed invalid encodings (GH15723)
  • Bug in the Series repr not showing the length when the output was truncated (GH15962).

Plotting

  • Bug in DataFrame.hist where plt.tight_layout caused an AttributeError (use matplotlib >= 2.0.1) (GH9351)
  • Bug in DataFrame.boxplot where fontsize was not applied to the tick labels on both axes (GH15108)
  • Bug in the date and time converters pandas registers with matplotlib not handling multiple dimensions (GH16026)
  • Bug in pd.scatter_matrix() which could accept either color or c, but not both (GH14855)

Groupby/resample/rolling

  • Bug in .groupby(..).resample() when passed the on= kwarg (GH15021)
  • Properly set __name__ and __qualname__ for Groupby.* functions (GH14620)
  • Bug in GroupBy.get_group() failing with a categorical grouper (GH15155)
  • Bug in .groupby(...).rolling(...) when on is specified and using a DatetimeIndex (GH15130, GH13966)
  • Bug in groupby operations with timedelta64 when passing numeric_only=False (GH5724)
  • Bug in groupby.apply() coercing object dtypes to numeric types, when not all values were numeric (GH14423, GH15421, GH15670)
  • Bug in resample, where a non-string loffset argument would not be applied when resampling a timeseries (GH13218)
  • Bug in DataFrame.groupby().describe() when grouping on Index containing tuples (GH14848)
  • Bug in groupby().nunique() with a datetimelike grouper where bin counts were incorrect (GH13453)
  • Bug in groupby.transform() that would coerce the resultant dtypes back to the original (GH10972, GH11444)
  • Bug in groupby.agg() incorrectly localizing timezone on datetime (GH15426, GH10668, GH13046)
  • Bug in .rolling/expanding() functions where count() was not counting np.Inf, nor handling object dtypes (GH12541)
  • Bug in .rolling() where pd.Timedelta or datetime.timedelta was not accepted as a window argument (GH15440); see the sketch after this list
  • Bug in Rolling.quantile function that caused a segmentation fault when called with a quantile value outside of the range [0, 1] (GH15463)
  • Bug in DataFrame.resample().median() if duplicate column names are present (GH14233)
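
As an illustration of the .rolling() window fix above, time-based windows may now be given as a pd.Timedelta (or datetime.timedelta) when the index is datetimelike. A minimal sketch:

  1. import pandas as pd
  2. s = pd.Series(range(5),
  3.               index=pd.date_range('2017-01-01', periods=5, freq='s'))
  4. s.rolling(pd.Timedelta('2s')).sum()  # equivalent to s.rolling('2s').sum()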

Sparse

  • Bug in SparseSeries.reindex on single level with list of length 1 (GH15447)
  • Bug in repr-formatting a SparseDataFrame after a value was set on (a copy of) one of its series (GH15488)
  • Bug in SparseDataFrame construction with lists not coercing to dtype (GH15682)
  • Bug in sparse array indexing in which indices were not being validated (GH15863)

Reshaping

  • Bug in pd.merge_asof() where left_index or right_index caused a failure when multiple by was specified (GH15676)
  • Bug in pd.merge_asof() where left_index/right_index together caused a failure when tolerance was specified (GH15135)
  • Bug in DataFrame.pivot_table() where dropna=True would not drop all-NaN columns when the columns argument was of category dtype (GH15193)
  • Bug in pd.melt() where passing a tuple value for value_vars caused a TypeError (GH15348)
  • Bug in pd.pivot_table() where no error was raised when values argument was not in the columns (GH14938)
  • Bug in pd.concat() in which concatenating with an empty dataframe with join='inner' was being improperly handled (GH15328)
  • Bug with sort=True in DataFrame.join and pd.merge when joining on indexes (GH15582)
  • Bug in DataFrame.nsmallest and DataFrame.nlargest where identical values resulted in duplicated rows (GH15297)
  • Bug in pandas.pivot_table() incorrectly raising UnicodeError when passing unicode input for margins keyword (GH13292)

Numeric

  • Bug in .rank() which incorrectly ranks ordered categories (GH15420)
  • Bug in .corr() and .cov() where the column and index were the same object (GH14617)
  • Bug in .mode() where the mode was not returned if there was only a single value (GH15714)
  • Bug in pd.cut() with a single bin on an all 0s array (GH15428)
  • Bug in pd.qcut() with a single quantile and an array with identical values (GH15431)
  • Bug in pandas.tools.utils.cartesian_product() where large input could cause an overflow on Windows (GH15265)
  • Bug in .eval() which caused multi-line evals to fail with local variables not on the first line (GH15342)

Other

  • Compat with SciPy 0.19.0 for testing on .interpolate() (GH15662)
  • Compat for 32-bit platforms for .qcut/cut; bins will now be int64 dtype (GH14866)
  • Bug in interactions with Qt when a QtApplication already exists (GH14372)
  • Use of np.finfo() during import pandas was removed to mitigate a deadlock on Python GIL misuse (GH14641)

Contributors

A total of 204 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.

  • Adam J. Stewart +
  • Adrian +
  • Ajay Saxena
  • Akash Tandon +
  • Albert Villanova del Moral +
  • Aleksey Bilogur +
  • Alexis Mignon +
  • Amol Kahat +
  • Andreas Winkler +
  • Andrew Kittredge +
  • Anthonios Partheniou
  • Arco Bast +
  • Ashish Singal +
  • Baurzhan Muftakhidinov +
  • Ben Kandel
  • Ben Thayer +
  • Ben Welsh +
  • Bill Chambers +
  • Brandon M. Burroughs
  • Brian +
  • Brian McFee +
  • Carlos Souza +
  • Chris
  • Chris Ham
  • Chris Warth
  • Christoph Gohlke
  • Christoph Paulik +
  • Christopher C. Aycock
  • Clemens Brunner +
  • D.S. McNeil +
  • DaanVanHauwermeiren +
  • Daniel Himmelstein
  • Dave Willmer
  • David Cook +
  • David Gwynne +
  • David Hoffman +
  • David Krych
  • Diego Fernandez +
  • Dimitris Spathis +
  • Dmitry L +
  • Dody Suria Wijaya +
  • Dominik Stanczak +
  • Dr-Irv
  • Dr. Irv +
  • Elliott Sales de Andrade +
  • Ennemoser Christoph +
  • Francesc Alted +
  • Fumito Hamamura +
  • Giacomo Ferroni
  • Graham R. Jeffries +
  • Greg Williams +
  • Guilherme Beltramini +
  • Guilherme Samora +
  • Hao Wu +
  • Harshit Patni +
  • Ilya V. Schurov +
  • Iván Vallés Pérez
  • Jackie Leng +
  • Jaehoon Hwang +
  • James Draper +
  • James Goppert +
  • James McBride +
  • James Santucci +
  • Jan Schulz
  • Jeff Carey
  • Jeff Reback
  • JennaVergeynst +
  • Jim +
  • Jim Crist
  • Joe Jevnik
  • Joel Nothman +
  • John +
  • John Tucker +
  • John W. O’Brien
  • John Zwinck
  • Jon M. Mease
  • Jon Mease
  • Jonathan Whitmore +
  • Jonathan de Bruin +
  • Joost Kranendonk +
  • Joris Van den Bossche
  • Joshua Bradt +
  • Julian Santander
  • Julien Marrec +
  • Jun Kim +
  • Justin Solinsky +
  • Kacawi +
  • Kamal Kamalaldin +
  • Kerby Shedden
  • Kernc
  • Keshav Ramaswamy
  • Kevin Sheppard
  • Kyle Kelley
  • Larry Ren
  • Leon Yin +
  • Line Pedersen +
  • Lorenzo Cestaro +
  • Luca Scarabello
  • Lukasz +
  • Mahmoud Lababidi
  • Mark Mandel +
  • Matt Roeschke
  • Matthew Brett
  • Matthew Roeschke +
  • Matti Picus
  • Maximilian Roos
  • Michael Charlton +
  • Michael Felt
  • Michael Lamparski +
  • Michiel Stock +
  • Mikolaj Chwalisz +
  • Min RK
  • Miroslav Šedivý +
  • Mykola Golubyev
  • Nate Yoder
  • Nathalie Rud +
  • Nicholas Ver Halen
  • Nick Chmura +
  • Nolan Nichols +
  • Pankaj Pandey +
  • Pawel Kordek
  • Pete Huang +
  • Peter +
  • Peter Csizsek +
  • Petio Petrov +
  • Phil Ruffwind +
  • Pietro Battiston
  • Piotr Chromiec
  • Prasanjit Prakash +
  • Rob Forgione +
  • Robert Bradshaw
  • Robin +
  • Rodolfo Fernandez
  • Roger Thomas
  • Rouz Azari +
  • Sahil Dua
  • Sam Foo +
  • Sami Salonen +
  • Sarah Bird +
  • Sarma Tangirala +
  • Scott Sanderson
  • Sebastian Bank
  • Sebastian Gsänger +
  • Shawn Heide
  • Shyam Saladi +
  • Sinhrks
  • Stephen Rauch +
  • Sébastien de Menten +
  • Tara Adiseshan
  • Thiago Serafim
  • Thoralf Gutierrez +
  • Thrasibule +
  • Tobias Gustafsson +
  • Tom Augspurger
  • Tong SHEN +
  • Tong Shen +
  • TrigonaMinima +
  • Uwe +
  • Wes Turner
  • Wiktor Tomczak +
  • WillAyd
  • Yaroslav Halchenko
  • Yimeng Zhang +
  • abaldenko +
  • adrian-stepien +
  • alexandercbooth +
  • atbd +
  • bastewart +
  • bmagnusson +
  • carlosdanielcsantos +
  • chaimdemulder +
  • chris-b1
  • dickreuter +
  • discort +
  • dr-leo +
  • dubourg
  • dwkenefick +
  • funnycrab +
  • gfyoung
  • goldenbull +
  • hesham.shabana@hotmail.com
  • jojomdt +
  • linebp +
  • manu +
  • manuels +
  • mattip +
  • maxalbert +
  • mcocdawc +
  • nuffe +
  • paul-mannino
  • pbreach +
  • sakkemo +
  • scls19fr
  • sinhrks
  • stijnvanhoey +
  • the-nose-knows +
  • themrmax +
  • tomrod +
  • tzinckgraf
  • wandersoncferreira
  • watercrossing +
  • wcwagner
  • xgdgsc +
  • yui-knk