Nullable integer data type

New in version 0.24.0.

::: tip Note

IntegerArray is currently experimental. Its API or implementation may change without warning.

:::

In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented exactly as floating point numbers: a float64 stores every integer exactly only up to 2**53 in magnitude.
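
A minimal sketch of the precision problem described above. The identifier value `2**53 + 1` is a hypothetical example, chosen because it is the smallest positive integer that float64 cannot store exactly. The last step assumes that passing `None` together with `dtype="Int64"` keeps the integer intact.

```python
import numpy as np
import pandas as pd

# Hypothetical identifier: 2**53 + 1 has no exact float64 representation.
big_id = 2**53 + 1

# With a missing value and no explicit dtype, the data falls back to float64.
as_float = pd.Series([big_id, np.nan])
print(as_float.dtype)                     # float64

# The identifier has been silently rounded to a neighbouring value.
print(int(as_float.iloc[0]) == big_id)    # False

# Assumption: with None as the missing marker and dtype="Int64",
# the value never passes through a float and is kept exactly.
as_nullable = pd.array([big_id, None], dtype="Int64")
print(int(as_nullable[0]) == big_id)
```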

Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype to array() or Series:

    # Assumes the usual imports: import numpy as np; import pandas as pd
    In [1]: arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())

    In [2]: arr
    Out[2]:
    <IntegerArray>
    [1, 2, NaN]
    Length: 3, dtype: Int64

Or use the string alias "Int64" (note the capital "I", to differentiate it from NumPy’s 'int64' dtype):

    In [3]: pd.array([1, 2, np.nan], dtype="Int64")
    Out[3]:
    <IntegerArray>
    [1, 2, NaN]
    Length: 3, dtype: Int64
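
Because the two aliases differ only in capitalization, it can help to check which dtype you actually got. The following is a small sketch; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

numpy_backed = pd.Series([1, 2], dtype="int64")  # NumPy dtype, cannot hold missing values
nullable = pd.Series([1, 2], dtype="Int64")      # pandas extension dtype

print(repr(numpy_backed.dtype))
print(repr(nullable.dtype))

# The string alias "Int64" and pd.Int64Dtype() name the same dtype.
print(nullable.dtype == pd.Int64Dtype())          # True
print(isinstance(numpy_backed.dtype, np.dtype))   # True
```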

This array can be stored in a DataFrame or Series like any NumPy array.

    In [4]: pd.Series(arr)
    Out[4]:
    0      1
    1      2
    2    NaN
    dtype: Int64

You can also pass the list-like object to the Series constructor with the dtype.

    In [5]: s = pd.Series([1, 2, np.nan], dtype="Int64")

    In [6]: s
    Out[6]:
    0      1
    1      2
    2    NaN
    dtype: Int64

By default (if you don’t specify dtype), NumPy is used, and you’ll end up with a float64 dtype Series:

    In [7]: pd.Series([1, 2, np.nan])
    Out[7]:
    0    1.0
    1    2.0
    2    NaN
    dtype: float64
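
If a Series has already been inferred as float64, one way to opt in afterwards is an explicit cast. This is a sketch that assumes every non-missing float holds a whole number, so the cast loses nothing.

```python
import numpy as np
import pandas as pd

inferred = pd.Series([1, 2, np.nan])    # inferred as float64
converted = inferred.astype("Int64")    # explicit opt-in to the nullable dtype

print(inferred.dtype)                   # float64
print(converted.dtype)                  # Int64
print(converted.isna().tolist())        # [False, False, True]
```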

Operations involving an integer array behave similarly to operations on NumPy arrays. Missing values are propagated, and the data is coerced to another dtype if needed.

    # arithmetic
    In [8]: s + 1
    Out[8]:
    0      2
    1      3
    2    NaN
    dtype: Int64

    # comparison
    In [9]: s == 1
    Out[9]:
    0     True
    1    False
    2    False
    dtype: bool

    # indexing
    In [10]: s.iloc[1:3]
    Out[10]:
    1      2
    2    NaN
    dtype: Int64

    # operate with other dtypes
    In [11]: s + s.iloc[1:3].astype('Int8')
    Out[11]:
    0    NaN
    1      4
    2    NaN
    dtype: Int64

    # coerce when needed
    In [12]: s + 0.01
    Out[12]:
    0    1.01
    1    2.01
    2     NaN
    dtype: float64
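
The propagation and coercion rules can also be checked programmatically. The sketch below uses a hypothetical helper, `describe`, that reports the dtype name and the missing-value mask of a result. The exact floating dtype returned by `s + 0.01` may depend on the installed pandas version, so no output is shown for it.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan], dtype="Int64")

def describe(result):
    """Hypothetical helper: return (dtype name, missing-value mask) of a Series."""
    return str(result.dtype), result.isna().tolist()

# Integer arithmetic keeps the nullable dtype and propagates the missing value.
print(describe(s + 1))                  # ('Int64', [False, False, True])

# Mixing nullable integer widths yields the wider nullable dtype.
print(describe(s + s.astype("Int8")))   # ('Int64', [False, False, True])

# Adding a float coerces to a floating dtype; the missing value is still propagated.
print(describe(s + 0.01))
```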

These dtypes can operate as part of a DataFrame.

    In [13]: df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})

    In [14]: df
    Out[14]:
         A  B  C
    0    1  1  a
    1    2  1  a
    2  NaN  3  b

    In [15]: df.dtypes
    Out[15]:
    A     Int64
    B     int64
    C    object
    dtype: object
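
To tell the nullable column apart from the NumPy-backed one in code, the type-checking helpers in `pandas.api.types` can be used. This is a sketch; it prints one line per column rather than asserting a particular layout.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype, is_integer_dtype

df = pd.DataFrame({
    "A": pd.Series([1, 2, np.nan], dtype="Int64"),
    "B": [1, 1, 3],
    "C": list("aab"),
})

# Both A and B count as integer dtypes, but only A is backed by an extension array.
for name, dtype in df.dtypes.items():
    print(name, dtype, is_integer_dtype(dtype), is_extension_array_dtype(dtype))
```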

These dtypes can be merged, reshaped, and cast.

    In [16]: pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
    Out[16]:
    A     Int64
    B     int64
    C    object
    dtype: object

    In [17]: df['A'].astype(float)
    Out[17]:
    0    1.0
    1    2.0
    2    NaN
    Name: A, dtype: float64
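
Two further casts and reshapes, sketched under the assumption that a fill value is acceptable for your data: stacking nullable columns row-wise, and casting to NumPy's int64 once no missing values remain. The fill value 0 is a hypothetical placeholder.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, np.nan], dtype="Int64", name="A")

# Stacking two nullable columns row-wise keeps the nullable dtype.
stacked = pd.concat([a, a], ignore_index=True)
print(stacked.dtype)      # Int64

# NumPy's int64 cannot hold missing values, so fill them before casting.
filled = a.fillna(0).astype("int64")
print(filled.tolist())    # [1, 2, 0]
print(filled.dtype)       # int64
```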

Reduction and groupby operations such as ‘sum’ work as well.

    In [18]: df.sum()
    Out[18]:
    A      3
    B      5
    C    aab
    dtype: object

    In [19]: df.groupby('B').A.sum()
    Out[19]:
    B
    1    3
    3    0
    Name: A, dtype: Int64
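
The 0 for group 3 above arises because missing values are skipped and the sum of an empty group is 0. If a group with no valid values should stay missing instead, the `min_count` argument of `sum` is one option, as in this sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": pd.Series([1, 2, np.nan], dtype="Int64"),
    "B": [1, 1, 3],
})

# Default: missing values are skipped, so the all-missing group sums to 0.
default_sum = df.groupby("B")["A"].sum()
print(default_sum.tolist())           # [3, 0]

# Require at least one valid value per group; otherwise the result is missing.
strict_sum = df.groupby("B")["A"].sum(min_count=1)
print(strict_sum.isna().tolist())     # [False, True]
```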