Nullable integer data type
New in version 0.24.0.
::: tip Note
IntegerArray is currently experimental. Its API or implementation may change without warning.
:::
In Working with missing data, we saw that pandas primarily uses NaN to represent
missing data. Because NaN is a float, this forces an array of integers with
any missing values to become floating point. In some cases, this may not matter
much. But if your integer column is, say, an identifier, casting to float can
be problematic. Some integers cannot even be represented as floating point
numbers.
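The precision problem can be seen without pandas at all. The sketch below assumes the usual 64-bit IEEE 754 float, which is what Python's `float` and NumPy's `float64` use; the variable names are illustrative only.

```python
# A 64-bit float has a 53-bit significand, so 2**53 + 1 is the
# smallest positive integer it cannot represent exactly.
big_id = 2**53 + 1            # 9007199254740993

as_float = float(big_id)      # rounded to the nearest representable value
round_trip = int(as_float)    # convert back to an integer

print(big_id)                 # 9007199254740993
print(round_trip)             # 9007199254740992
print(big_id == round_trip)   # False: the identifier changed silently
```

An identifier column holding such a value would be altered by a cast to float, which is the case a nullable integer dtype is meant to avoid.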
Pandas can represent integer data with possibly missing values using
arrays.IntegerArray. This is an extension type
implemented within pandas. It is not the default dtype for integers, and will not be inferred;
you must explicitly pass the dtype into array() or Series:
In [1]: arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())

In [2]: arr
Out[2]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64
Or use the string alias "Int64" (note the capital "I", which differentiates it from
NumPy’s 'int64' dtype):
In [3]: pd.array([1, 2, np.nan], dtype="Int64")
Out[3]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64
This array can be stored in a DataFrame or Series like any
NumPy array.
In [4]: pd.Series(arr)
Out[4]:
0      1
1      2
2    NaN
dtype: Int64
You can also pass the list-like object to the Series constructor
with the dtype.
In [5]: s = pd.Series([1, 2, np.nan], dtype="Int64")

In [6]: s
Out[6]:
0      1
1      2
2    NaN
dtype: Int64
By default (if you don’t specify dtype), NumPy is used, and you’ll end
up with a float64 dtype Series:
In [7]: pd.Series([1, 2, np.nan])
Out[7]:
0    1.0
1    2.0
2    NaN
dtype: float64
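As a minimal side-by-side sketch of the two behaviours described above, the snippet below builds the same list with and without an explicit dtype. The variable names are illustrative, and the code deliberately avoids printing the missing value itself, since its displayed form may differ between pandas versions.

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan]

inferred = pd.Series(values)                  # dtype is inferred by default
nullable = pd.Series(values, dtype="Int64")   # nullable integer dtype requested explicitly

print(inferred.dtype)                 # float64
print(nullable.dtype)                 # Int64

# The missing entry is still detected in the nullable Series,
# and the remaining values stay integers.
print(nullable.isna().tolist())       # [False, False, True]
print(nullable.dropna().tolist())     # [1, 2]
```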
Operations involving an integer array behave similarly to those on NumPy arrays. Missing values are propagated, and the data is coerced to another dtype if needed.
# arithmetic
In [8]: s + 1
Out[8]:
0      2
1      3
2    NaN
dtype: Int64

# comparison
In [9]: s == 1
Out[9]:
0     True
1    False
2    False
dtype: bool

# indexing
In [10]: s.iloc[1:3]
Out[10]:
1      2
2    NaN
dtype: Int64

# operate with other dtypes
In [11]: s + s.iloc[1:3].astype('Int8')
Out[11]:
0    NaN
1      4
2    NaN
dtype: Int64

# coerce when needed
In [12]: s + 0.01
Out[12]:
0    1.01
1    2.01
2     NaN
dtype: float64
These dtypes can operate as part of a DataFrame.
In [13]: df = pd.DataFrame({'A': s, 'B': [1, 1, 3], 'C': list('aab')})

In [14]: df
Out[14]:
     A  B  C
0    1  1  a
1    2  1  a
2  NaN  3  b

In [15]: df.dtypes
Out[15]:
A     Int64
B     int64
C    object
dtype: object
These dtypes can be merged, reshaped, and cast.
In [16]: pd.concat([df[['A']], df[['B', 'C']]], axis=1).dtypes
Out[16]:
A     Int64
B     int64
C    object
dtype: object

In [17]: df['A'].astype(float)
Out[17]:
0    1.0
1    2.0
2    NaN
Name: A, dtype: float64
Reduction and groupby operations such as ‘sum’ work as well.
In [18]: df.sum()
Out[18]:
A      3
B      5
C    aab
dtype: object

In [19]: df.groupby('B').A.sum()
Out[19]:
B
1    3
3    0
Name: A, dtype: Int64
