Comparison with other tools

Comparison with R / R libraries

Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for, this page provides a more detailed look at the R language and its many third-party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:

  • Functionality / flexibility: what can/cannot be done with each tool
  • Performance: how fast are operations? Hard numbers/benchmarks are preferable.
  • Ease-of-use: is one tool easier or harder to use? You may have to be the judge of this, given side-by-side code comparisons.

This page is also here to offer a bit of a translation guide for users of these R packages.

For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files; see External Compatibility for an example.
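A minimal sketch of the pandas side of such a transfer (PyTables must be installed, and 'transfer.h5' is a placeholder path, not from the original example):

  1. # sketch: write a DataFrame to HDF5 so it can be read from R (e.g. with the rhdf5 package)
  2. df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})
  3. df.to_hdf('transfer.h5', key='df', mode='w')
  4. round_trip = pd.read_hdf('transfer.h5', 'df')  # verify the round trip on the pandas side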

Quick Reference

We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.

Querying, Filtering, Sampling

| R | pandas |
| --- | --- |
| dim(df) | df.shape |
| head(df) | df.head() |
| slice(df, 1:10) | df.iloc[:10] |
| filter(df, col1 == 1, col2 == 1) | df.query('col1 == 1 & col2 == 1') |
| df[df$col1 == 1 & df$col2 == 1,] | df[(df.col1 == 1) & (df.col2 == 1)] |
| select(df, col1, col2) | df[['col1', 'col2']] |
| select(df, col1:col3) | df.loc[:, 'col1':'col3'] |
| select(df, -(col1:col3)) | df.drop(cols_to_drop, axis=1) but see [1] |
| distinct(select(df, col1)) | df[['col1']].drop_duplicates() |
| distinct(select(df, col1, col2)) | df[['col1', 'col2']].drop_duplicates() |
| sample_n(df, 10) | df.sample(n=10) |
| sample_frac(df, 0.01) | df.sample(frac=0.01) |

::: tip [1] | R's shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas if you have the list of columns, for example df[cols[1:3]] or df.drop(cols[1:3], axis=1), but doing this by column name is a bit messy. :::
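As a concrete sketch of footnote [1] (the frame and column names here are made up for illustration), a label slice can build the list of columns to drop:

  1. # sketch: drop the name range col1:col3, mirroring select(df, -(col1:col3))
  2. df = pd.DataFrame(np.random.randn(3, 5), columns=['col1', 'col2', 'col3', 'col4', 'col5'])
  3. cols_to_drop = df.loc[:, 'col1':'col3'].columns
  4. df.drop(cols_to_drop, axis=1)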

Sorting

| R | pandas |
| --- | --- |
| arrange(df, col1, col2) | df.sort_values(['col1', 'col2']) |
| arrange(df, desc(col1)) | df.sort_values('col1', ascending=False) |

Transforming

| R | pandas |
| --- | --- |
| select(df, col_one = col1) | df.rename(columns={'col1': 'col_one'})['col_one'] |
| rename(df, col_one = col1) | df.rename(columns={'col1': 'col_one'}) |
| mutate(df, c=a-b) | df.assign(c=df.a-df.b) |

Grouping and Summarizing

| R | pandas |
| --- | --- |
| summary(df) | df.describe() |
| gdf <- group_by(df, col1) | gdf = df.groupby('col1') |
| summarise(gdf, avg=mean(col1, na.rm=TRUE)) | df.groupby('col1').agg({'col1': 'mean'}) |
| summarise(gdf, total=sum(col1)) | df.groupby('col1').sum() |

Base R

Slicing with R’s c

R makes it easy to access data.frame columns by name

  1. df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
  2. df[, c("a", "c", "e")]

or by integer location

  1. df <- data.frame(matrix(rnorm(1000), ncol=100))
  2. df[, c(1:10, 25:30, 40, 50:100)]

Selecting multiple columns by name in pandas is straightforward

  1. In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))
  2. In [2]: df[['a', 'c']]
  3. Out[2]:
  4. a c
  5. 0 0.469112 -1.509059
  6. 1 -1.135632 -0.173215
  7. 2 0.119209 -0.861849
  8. 3 -2.104569 1.071804
  9. 4 0.721555 -1.039575
  10. 5 0.271860 0.567020
  11. 6 0.276232 -0.673690
  12. 7 0.113648 0.524988
  13. 8 0.404705 -1.715002
  14. 9 -1.039268 -1.157892
  15. In [3]: df.loc[:, ['a', 'c']]
  16. Out[3]:
  17. a c
  18. 0 0.469112 -1.509059
  19. 1 -1.135632 -0.173215
  20. 2 0.119209 -0.861849
  21. 3 -2.104569 1.071804
  22. 4 0.721555 -1.039575
  23. 5 0.271860 0.567020
  24. 6 0.276232 -0.673690
  25. 7 0.113648 0.524988
  26. 8 0.404705 -1.715002
  27. 9 -1.039268 -1.157892

Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer attribute and numpy.r_.

  1. In [4]: named = list('abcdefg')
  2. In [5]: n = 30
  3. In [6]: columns = named + np.arange(len(named), n).tolist()
  4. In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)
  5. In [8]: df.iloc[:, np.r_[:10, 24:30]]
  6. Out[8]:
  7. a b c d e f g 7 8 9 24 25 26 27 28 29
  8. 0 -1.344312 0.844885 1.075770 -0.109050 1.643563 -1.469388 0.357021 -0.674600 -1.776904 -0.968914 -1.170299 -0.226169 0.410835 0.813850 0.132003 -0.827317
  9. 1 -0.076467 -1.187678 1.130127 -1.436737 -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.959726 -1.110336 -0.619976 0.149748 -0.732339 0.687738
  10. 2 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849 -0.954208 1.462696 -1.743161 -0.826591 0.084844 0.432390 1.519970 -0.493662 0.600178 0.274230
  11. 3 0.132885 -0.023688 2.410179 1.450520 0.206053 -0.251905 -2.213588 1.063327 1.266143 0.299368 -2.484478 -0.281461 0.030711 0.109121 1.126203 -0.977349
  12. 4 1.474071 -0.064034 -1.282782 0.781836 -1.071357 0.441153 2.353925 0.583787 0.221471 -0.744471 -1.197071 -1.066969 -0.303421 -0.858447 0.306996 -0.028665
  13. 5 0.384316 1.574159 1.588931 0.476720 0.473424 -0.242861 -0.014805 -0.284319 0.650776 -1.461665 -0.902937 0.068159 -0.057873 -0.368204 -1.144073 0.861209
  14. 6 0.800193 0.782098 -1.069094 -1.099248 0.255269 0.009750 0.661084 0.379319 -0.008434 1.952541 0.604603 2.121453 0.597701 0.563700 0.967661 -1.057909
  15. .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
  16. 23 1.534417 -1.374226 -0.367477 0.782551 1.356489 0.981552 0.304501 0.354041 -1.232756 -0.267074 0.641606 -1.690959 0.961088 0.052372 1.166439 0.407281
  17. 24 0.859275 -0.995910 0.261263 1.783442 0.380989 2.289726 0.309489 2.189028 1.389045 -0.873585 -0.169076 0.840316 0.638172 0.890673 -1.949397 -0.003437
  18. 25 1.492125 -0.068190 0.681456 1.221829 -0.434352 1.204815 -0.195612 1.251683 -1.040389 -0.796211 1.944517 0.042344 -0.307904 0.428572 0.880609 0.487645
  19. 26 0.725238 0.624607 -0.141185 -0.143948 -0.328162 2.095086 -0.608888 -0.926422 1.872601 -2.513465 -0.846188 1.190624 0.778507 1.008500 1.424017 0.717110
  20. 27 1.262419 1.950057 0.301038 -0.933858 0.814946 0.181439 -0.110015 -2.364638 -1.584814 0.307941 -1.341814 0.334281 -0.162227 1.007824 2.826008 1.458383
  21. 28 -1.585746 -0.899734 0.921494 -0.211762 -0.059182 0.058308 0.915377 -0.696321 0.150664 -3.060395 0.403620 -0.026602 -0.240481 0.577223 -1.088417 0.326687
  22. 29 -0.986248 0.169729 -1.158091 1.019673 0.646039 0.917399 -0.010435 0.366366 0.922729 0.869610 -1.209247 -0.671466 0.332872 -2.013086 -1.602549 0.333109
  23. [30 rows x 16 columns]

aggregate

In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it into groups on by1 and by2:

  1. df <- data.frame(
  2. v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
  3. v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
  4. by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
  5. by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
  6. aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)

The groupby() method is similar to the base R aggregate function.

  1. In [9]: df = pd.DataFrame(
  2. ...: {'v1': [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
  3. ...: 'v2': [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
  4. ...: 'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
  5. ...: 'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
  6. ...: np.nan]})
  7. ...:
  8. In [10]: g = df.groupby(['by1', 'by2'])
  9. In [11]: g[['v1', 'v2']].mean()
  10. Out[11]:
  11. v1 v2
  12. by1 by2
  13. 1 95 5.0 55.0
  14. 99 5.0 55.0
  15. 2 95 7.0 77.0
  16. 99 NaN NaN
  17. big damp 3.0 33.0
  18. blue dry 3.0 33.0
  19. red red 4.0 44.0
  20. wet 1.0 11.0

For more details and examples see the groupby documentation.

match / %in%

A common way to select data in R is with %in%, which is defined using the function match. The operator %in% returns a logical vector indicating whether there is a match:

  1. s <- 0:4
  2. s %in% c(2,4)

The isin() method is similar to the R %in% operator:

  1. In [12]: s = pd.Series(np.arange(5), dtype=np.float32)
  2. In [13]: s.isin([2, 4])
  3. Out[13]:
  4. 0 False
  5. 1 False
  6. 2 True
  7. 3 False
  8. 4 True
  9. dtype: bool

The match function returns a vector of the positions of matches of its first argument in its second:

  1. s <- 0:4
  2. match(s, c(2,4))
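pandas has no single equivalent of match; one rough analogue (an editorial sketch, not part of the original example set) is Index.get_indexer(), which returns zero-based positions and -1 where R's match would return NA:

  1. # sketch: position of each element of s within the values [2, 4]; -1 means no match
  2. s = pd.Series([0, 1, 2, 3, 4])
  3. pd.Index([2, 4]).get_indexer(s)  # array([-1, -1,  0, -1,  1])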

For more details and examples see the reshaping documentation.

tapply

tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using a data.frame called baseball, and retrieving information based on the array team:

  1. baseball <-
  2. data.frame(team = gl(5, 5,
  3. labels = paste("Team", LETTERS[1:5])),
  4. player = sample(letters, 25),
  5. batting.average = runif(25, .200, .400))
  6. tapply(baseball$batting.average, baseball$team,
  7. max)

In pandas we may use the pivot_table() method to handle this:

  1. In [14]: import random
  2. In [15]: import string
  3. In [16]: baseball = pd.DataFrame(
  4. ....: {'team': ["team %d" % (x + 1) for x in range(5)] * 5,
  5. ....: 'player': random.sample(list(string.ascii_lowercase), 25),
  6. ....: 'batting avg': np.random.uniform(.200, .400, 25)})
  7. ....:
  8. In [17]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)
  9. Out[17]:
  10. team team 1 team 2 team 3 team 4 team 5
  11. batting avg 0.352134 0.295327 0.397191 0.394457 0.396194

For more details and examples see the reshaping documentation.

subset

The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than another column’s values:

  1. df <- data.frame(a=rnorm(10), b=rnorm(10))
  2. subset(df, a <= b)
  3. df[df$a <= df$b,] # note the comma

In pandas, there are a few ways to perform subsetting: you can use query(), plain boolean indexing, or loc with a boolean mask:

  1. In [18]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
  2. In [19]: df.query('a <= b')
  3. Out[19]:
  4. a b
  5. 1 0.174950 0.552887
  6. 2 -0.023167 0.148084
  7. 3 -0.495291 -0.300218
  8. 4 -0.860736 0.197378
  9. 5 -1.134146 1.720780
  10. 7 -0.290098 0.083515
  11. 8 0.238636 0.946550
  12. In [20]: df[df.a <= df.b]
  13. Out[20]:
  14. a b
  15. 1 0.174950 0.552887
  16. 2 -0.023167 0.148084
  17. 3 -0.495291 -0.300218
  18. 4 -0.860736 0.197378
  19. 5 -1.134146 1.720780
  20. 7 -0.290098 0.083515
  21. 8 0.238636 0.946550
  22. In [21]: df.loc[df.a <= df.b]
  23. Out[21]:
  24. a b
  25. 1 0.174950 0.552887
  26. 2 -0.023167 0.148084
  27. 3 -0.495291 -0.300218
  28. 4 -0.860736 0.197378
  29. 5 -1.134146 1.720780
  30. 7 -0.290098 0.083515
  31. 8 0.238636 0.946550

For more details and examples see the query documentation.

with

An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:

  1. df <- data.frame(a=rnorm(10), b=rnorm(10))
  2. with(df, a + b)
  3. df$a + df$b # same as the previous expression

In pandas the equivalent expression, using the eval() method, would be:

  1. In [22]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
  2. In [23]: df.eval('a + b')
  3. Out[23]:
  4. 0 -0.091430
  5. 1 -2.483890
  6. 2 -0.252728
  7. 3 -0.626444
  8. 4 -0.261740
  9. 5 2.149503
  10. 6 -0.332214
  11. 7 0.799331
  12. 8 -2.377245
  13. 9 2.104677
  14. dtype: float64
  15. In [24]: df.a + df.b # same as the previous expression
  16. Out[24]:
  17. 0 -0.091430
  18. 1 -2.483890
  19. 2 -0.252728
  20. 3 -0.626444
  21. 4 -0.261740
  22. 5 2.149503
  23. 6 -0.332214
  24. 7 0.799331
  25. 8 -2.377245
  26. 9 2.104677
  27. dtype: float64

In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.
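A hedged illustration of that speedup (the frame size and the use of %timeit are editorial assumptions; actual numbers depend on your machine and on numexpr being installed):

  1. # sketch: eval() can delegate to numexpr, which pays off on large frames
  2. big = pd.DataFrame({'a': np.random.randn(10 ** 6), 'b': np.random.randn(10 ** 6)})
  3. %timeit big.a + big.b
  4. %timeit big.eval('a + b')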

plyr

plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data structures in R: a for arrays, l for lists, and d for data.frames. The table below shows how these data structures could be mapped in Python.

| R | Python |
| --- | --- |
| array | list |
| lists | dictionary or list of objects |
| data.frame | dataframe |

ddply

An expression using a data.frame called df in R where you want to summarize x by month:

  1. require(plyr)
  2. df <- data.frame(
  3. x = runif(120, 1, 168),
  4. y = runif(120, 7, 334),
  5. z = runif(120, 1.7, 20.7),
  6. month = rep(c(5,6,7,8),30),
  7. week = sample(1:4, 120, TRUE)
  8. )
  9. ddply(df, .(month, week), summarize,
  10. mean = round(mean(x), 2),
  11. sd = round(sd(x), 2))

In pandas the equivalent expression, using the groupby() method, would be:

  1. In [25]: df = pd.DataFrame({'x': np.random.uniform(1., 168., 120),
  2. ....: 'y': np.random.uniform(7., 334., 120),
  3. ....: 'z': np.random.uniform(1.7, 20.7, 120),
  4. ....: 'month': [5, 6, 7, 8] * 30,
  5. ....: 'week': np.random.randint(1, 4, 120)})
  6. ....:
  7. In [26]: grouped = df.groupby(['month', 'week'])
  8. In [27]: grouped['x'].agg([np.mean, np.std])
  9. Out[27]:
  10. mean std
  11. month week
  12. 5 1 63.653367 40.601965
  13. 2 78.126605 53.342400
  14. 3 92.091886 57.630110
  15. 6 1 81.747070 54.339218
  16. 2 70.971205 54.687287
  17. 3 100.968344 54.010081
  18. 7 1 61.576332 38.844274
  19. 2 61.733510 48.209013
  20. 3 71.688795 37.595638
  21. 8 1 62.741922 34.618153
  22. 2 91.774627 49.790202
  23. 3 73.936856 60.773900

For more details and examples see the groupby documentation.

reshape / reshape2

melt.array

An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:

  1. a <- array(c(1:23, NA), c(2,3,4))
  2. data.frame(melt(a))

In Python, a is a NumPy array, so the equivalent can be built with a list comprehension over np.ndenumerate.

  1. In [28]: a = np.array(list(range(1, 24)) + [np.NAN]).reshape(2, 3, 4)
  2. In [29]: pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])
  3. Out[29]:
  4. 0 1 2 3
  5. 0 0 0 0 1.0
  6. 1 0 0 1 2.0
  7. 2 0 0 2 3.0
  8. 3 0 0 3 4.0
  9. 4 0 1 0 5.0
  10. 5 0 1 1 6.0
  11. 6 0 1 2 7.0
  12. .. .. .. .. ...
  13. 17 1 1 1 18.0
  14. 18 1 1 2 19.0
  15. 19 1 1 3 20.0
  16. 20 1 2 0 21.0
  17. 21 1 2 1 22.0
  18. 22 1 2 2 23.0
  19. 23 1 2 3 NaN
  20. [24 rows x 4 columns]

melt.list

An expression using a list called a in R where you want to melt it into a data.frame:

  1. a <- as.list(c(1:4, NA))
  2. data.frame(melt(a))

In Python, this list would be a list of tuples, so the DataFrame() constructor converts it to a DataFrame as required.

  1. In [30]: a = list(enumerate(list(range(1, 5)) + [np.NAN]))
  2. In [31]: pd.DataFrame(a)
  3. Out[31]:
  4. 0 1
  5. 0 0 1.0
  6. 1 1 2.0
  7. 2 2 3.0
  8. 3 3 4.0
  9. 4 4 NaN

For more details and examples see the Intro to Data Structures documentation.

melt.data.frame

An expression using a data.frame called cheese in R where you want to reshape the data.frame:

  1. cheese <- data.frame(
  2. first = c('John', 'Mary'),
  3. last = c('Doe', 'Bo'),
  4. height = c(5.5, 6.0),
  5. weight = c(130, 150)
  6. )
  7. melt(cheese, id=c("first", "last"))

In Python, the melt() function is the equivalent of R's melt:

  1. In [32]: cheese = pd.DataFrame({'first': ['John', 'Mary'],
  2. ....: 'last': ['Doe', 'Bo'],
  3. ....: 'height': [5.5, 6.0],
  4. ....: 'weight': [130, 150]})
  5. ....:
  6. In [33]: pd.melt(cheese, id_vars=['first', 'last'])
  7. Out[33]:
  8. first last variable value
  9. 0 John Doe height 5.5
  10. 1 Mary Bo height 6.0
  11. 2 John Doe weight 130.0
  12. 3 Mary Bo weight 150.0
  13. In [34]: cheese.set_index(['first', 'last']).stack() # alternative way
  14. Out[34]:
  15. first last
  16. John Doe height 5.5
  17. weight 130.0
  18. Mary Bo height 6.0
  19. weight 150.0
  20. dtype: float64

For more details and examples see the reshaping documentation.

cast

In R, acast casts a molten data.frame called df into a higher-dimensional array:

  1. df <- data.frame(
  2. x = runif(12, 1, 168),
  3. y = runif(12, 7, 334),
  4. z = runif(12, 1.7, 20.7),
  5. month = rep(c(5,6,7),4),
  6. week = rep(c(1,2), 6)
  7. )
  8. mdf <- melt(df, id=c("month", "week"))
  9. acast(mdf, week ~ month ~ variable, mean)

In Python the best way is to make use of pivot_table():

  1. In [35]: df = pd.DataFrame({'x': np.random.uniform(1., 168., 12),
  2. ....: 'y': np.random.uniform(7., 334., 12),
  3. ....: 'z': np.random.uniform(1.7, 20.7, 12),
  4. ....: 'month': [5, 6, 7] * 4,
  5. ....: 'week': [1, 2] * 6})
  6. ....:
  7. In [36]: mdf = pd.melt(df, id_vars=['month', 'week'])
  8. In [37]: pd.pivot_table(mdf, values='value', index=['variable', 'week'],
  9. ....: columns=['month'], aggfunc=np.mean)
  10. ....:
  11. Out[37]:
  12. month 5 6 7
  13. variable week
  14. x 1 93.888747 98.762034 55.219673
  15. 2 94.391427 38.112932 83.942781
  16. y 1 94.306912 279.454811 227.840449
  17. 2 87.392662 193.028166 173.899260
  18. z 1 11.016009 10.079307 16.170549
  19. 2 8.476111 17.638509 19.003494

Similarly for dcast, which uses a data.frame called df in R to aggregate information based on Animal and FeedType:

  1. df <- data.frame(
  2. Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
  3. 'Animal2', 'Animal3'),
  4. FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
  5. Amount = c(10, 7, 4, 2, 5, 6, 2)
  6. )
  7. dcast(df, Animal ~ FeedType, sum, fill=NaN)
  8. # Alternative method using base R
  9. with(df, tapply(Amount, list(Animal, FeedType), sum))

Python can approach this in two different ways. Firstly, similar to above using pivot_table():

  1. In [38]: df = pd.DataFrame({
  2. ....: 'Animal': ['Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
  3. ....: 'Animal2', 'Animal3'],
  4. ....: 'FeedType': ['A', 'B', 'A', 'A', 'B', 'B', 'A'],
  5. ....: 'Amount': [10, 7, 4, 2, 5, 6, 2],
  6. ....: })
  7. ....:
  8. In [39]: df.pivot_table(values='Amount', index='Animal', columns='FeedType',
  9. ....: aggfunc='sum')
  10. ....:
  11. Out[39]:
  12. FeedType A B
  13. Animal
  14. Animal1 10.0 5.0
  15. Animal2 2.0 13.0
  16. Animal3 6.0 NaN

The second approach is to use the groupby() method:

  1. In [40]: df.groupby(['Animal', 'FeedType'])['Amount'].sum()
  2. Out[40]:
  3. Animal FeedType
  4. Animal1 A 10
  5. B 5
  6. Animal2 A 2
  7. B 13
  8. Animal3 A 6
  9. Name: Amount, dtype: int64

For more details and examples see the reshaping documentation or the groupby documentation.

factor

pandas has a data type for categorical data.

  1. cut(c(1,2,3,4,5,6), 3)
  2. factor(c(1,2,3,2,2,3))

In pandas this is accomplished with pd.cut and astype("category"):

  1. In [41]: pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)
  2. Out[41]:
  3. 0 (0.995, 2.667]
  4. 1 (0.995, 2.667]
  5. 2 (2.667, 4.333]
  6. 3 (2.667, 4.333]
  7. 4 (4.333, 6.0]
  8. 5 (4.333, 6.0]
  9. dtype: category
  10. Categories (3, interval[float64]): [(0.995, 2.667] < (2.667, 4.333] < (4.333, 6.0]]
  11. In [42]: pd.Series([1, 2, 3, 2, 2, 3]).astype("category")
  12. Out[42]:
  13. 0 1
  14. 1 2
  15. 2 3
  16. 3 2
  17. 4 2
  18. 5 3
  19. dtype: category
  20. Categories (3, int64): [1, 2, 3]

For more details and examples see the categorical introduction and the API documentation. There is also documentation on the differences relative to R's factor.

Comparison with SQL

Since many potential pandas users have some familiarity with SQL, this page is meant to provide some examples of how various SQL operations would be performed using pandas.

If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.

As is customary, we import pandas and NumPy as follows:

  1. In [1]: import pandas as pd
  2. In [2]: import numpy as np

Most of the examples will utilize the tips dataset found within pandas tests. We’ll read the data into a DataFrame called tips and assume we have a database table of the same name and structure.

  1. In [3]: url = ('https://raw.github.com/pandas-dev'
  2. ...: '/pandas/master/pandas/tests/data/tips.csv')
  3. ...:
  4. In [4]: tips = pd.read_csv(url)
  5. In [5]: tips.head()
  6. Out[5]:
  7. total_bill tip sex smoker day time size
  8. 0 16.99 1.01 Female No Sun Dinner 2
  9. 1 10.34 1.66 Male No Sun Dinner 3
  10. 2 21.01 3.50 Male No Sun Dinner 3
  11. 3 23.68 3.31 Male No Sun Dinner 2
  12. 4 24.59 3.61 Female No Sun Dinner 4

SELECT

In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all columns):

  1. SELECT total_bill, tip, smoker, time
  2. FROM tips
  3. LIMIT 5;

With pandas, column selection is done by passing a list of column names to your DataFrame:

  1. In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
  2. Out[6]:
  3. total_bill tip smoker time
  4. 0 16.99 1.01 No Dinner
  5. 1 10.34 1.66 No Dinner
  6. 2 21.01 3.50 No Dinner
  7. 3 23.68 3.31 No Dinner
  8. 4 24.59 3.61 No Dinner

Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *).

WHERE

Filtering in SQL is done via a WHERE clause.

  1. SELECT *
  2. FROM tips
  3. WHERE time = 'Dinner'
  4. LIMIT 5;

DataFrames can be filtered in multiple ways; the most intuitive of these is boolean indexing.

  1. In [7]: tips[tips['time'] == 'Dinner'].head(5)
  2. Out[7]:
  3. total_bill tip sex smoker day time size
  4. 0 16.99 1.01 Female No Sun Dinner 2
  5. 1 10.34 1.66 Male No Sun Dinner 3
  6. 2 21.01 3.50 Male No Sun Dinner 3
  7. 3 23.68 3.31 Male No Sun Dinner 2
  8. 4 24.59 3.61 Female No Sun Dinner 4

The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows with True.

  1. In [8]: is_dinner = tips['time'] == 'Dinner'
  2. In [9]: is_dinner.value_counts()
  3. Out[9]:
  4. True 176
  5. False 68
  6. Name: time, dtype: int64
  7. In [10]: tips[is_dinner].head(5)
  8. Out[10]:
  9. total_bill tip sex smoker day time size
  10. 0 16.99 1.01 Female No Sun Dinner 2
  11. 1 10.34 1.66 Male No Sun Dinner 3
  12. 2 21.01 3.50 Male No Sun Dinner 3
  13. 3 23.68 3.31 Male No Sun Dinner 2
  14. 4 24.59 3.61 Female No Sun Dinner 4

Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).

  1. -- tips of more than $5.00 at Dinner meals
  2. SELECT *
  3. FROM tips
  4. WHERE time = 'Dinner' AND tip > 5.00;
  1. # tips of more than $5.00 at Dinner meals
  2. In [11]: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
  3. Out[11]:
  4. total_bill tip sex smoker day time size
  5. 23 39.42 7.58 Male No Sat Dinner 4
  6. 44 30.40 5.60 Male No Sun Dinner 4
  7. 47 32.40 6.00 Male No Sun Dinner 4
  8. 52 34.81 5.20 Female No Sun Dinner 4
  9. 59 48.27 6.73 Male No Sat Dinner 4
  10. 116 29.93 5.07 Male No Sun Dinner 4
  11. 155 29.85 5.14 Female No Sun Dinner 5
  12. 170 50.81 10.00 Male Yes Sat Dinner 3
  13. 172 7.25 5.15 Male Yes Sun Dinner 2
  14. 181 23.33 5.65 Male Yes Sun Dinner 2
  15. 183 23.17 6.50 Male Yes Sun Dinner 4
  16. 211 25.89 5.16 Male Yes Sat Dinner 4
  17. 212 48.33 9.00 Male No Sat Dinner 4
  18. 214 28.17 6.50 Female Yes Sat Dinner 3
  19. 239 29.03 5.92 Male No Sat Dinner 3
  1. -- tips by parties of at least 5 diners OR bill total was more than $45
  2. SELECT *
  3. FROM tips
  4. WHERE size >= 5 OR total_bill > 45;
  1. # tips by parties of at least 5 diners OR bill total was more than $45
  2. In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
  3. Out[12]:
  4. total_bill tip sex smoker day time size
  5. 59 48.27 6.73 Male No Sat Dinner 4
  6. 125 29.80 4.20 Female No Thur Lunch 6
  7. 141 34.30 6.70 Male No Thur Lunch 6
  8. 142 41.19 5.00 Male No Thur Lunch 5
  9. 143 27.05 5.00 Female No Thur Lunch 6
  10. 155 29.85 5.14 Female No Sun Dinner 5
  11. 156 48.17 5.00 Male No Sun Dinner 6
  12. 170 50.81 10.00 Male Yes Sat Dinner 3
  13. 182 45.35 3.50 Male Yes Sun Dinner 3
  14. 185 20.69 5.00 Male No Sun Dinner 5
  15. 187 30.46 2.00 Male Yes Sun Dinner 5
  16. 212 48.33 9.00 Male No Sat Dinner 4
  17. 216 28.15 3.00 Male Yes Sat Dinner 5

NULL checking is done using the notna() and isna() methods.

  1. In [13]: frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
  2. ....: 'col2': ['F', np.NaN, 'G', 'H', 'I']})
  3. ....:
  4. In [14]: frame
  5. Out[14]:
  6. col1 col2
  7. 0 A F
  8. 1 B NaN
  9. 2 NaN G
  10. 3 C H
  11. 4 D I

Assume we have a table of the same structure as our DataFrame above. We can see only the records where col2 IS NULL with the following query:

  1. SELECT *
  2. FROM frame
  3. WHERE col2 IS NULL;
  1. In [15]: frame[frame['col2'].isna()]
  2. Out[15]:
  3. col1 col2
  4. 1 B NaN

Getting items where col1 IS NOT NULL can be done with notna().

  1. SELECT *
  2. FROM frame
  3. WHERE col1 IS NOT NULL;
  1. In [16]: frame[frame['col1'].notna()]
  2. Out[16]:
  3. col1 col2
  4. 0 A F
  5. 1 B NaN
  6. 3 C H
  7. 4 D I

GROUP BY

In pandas, SQL's GROUP BY operations are performed using the similarly named groupby() method. groupby() typically refers to a process where we'd like to split a dataset into groups, apply some function (typically aggregation), and then combine the groups together.

A common SQL operation would be getting the count of records in each group throughout a dataset. For instance, a query getting us the number of tips left by sex:

  1. SELECT sex, count(*)
  2. FROM tips
  3. GROUP BY sex;
  4. /*
  5. Female 87
  6. Male 157
  7. */

The pandas equivalent would be:

  1. In [17]: tips.groupby('sex').size()
  2. Out[17]:
  3. sex
  4. Female 87
  5. Male 157
  6. dtype: int64

Notice that in the pandas code we used size() and not count(). This is because count() applies the function to each column, returning the number of non-null records within each.

  1. In [18]: tips.groupby('sex').count()
  2. Out[18]:
  3. total_bill tip smoker day time size
  4. sex
  5. Female 87 87 87 87 87 87
  6. Male 157 157 157 157 157 157

Alternatively, we could have applied the count() method to an individual column:

  1. In [19]: tips.groupby('sex')['total_bill'].count()
  2. Out[19]:
  3. sex
  4. Female 87
  5. Male 157
  6. Name: total_bill, dtype: int64

Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount differs by day of the week - agg() allows you to pass a dictionary to your grouped DataFrame, indicating which functions to apply to specific columns.

  1. SELECT day, AVG(tip), COUNT(*)
  2. FROM tips
  3. GROUP BY day;
  4. /*
  5. Fri 2.734737 19
  6. Sat 2.993103 87
  7. Sun 3.255132 76
  8. Thur 2.771452 62
  9. */
  1. In [20]: tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
  2. Out[20]:
  3. tip day
  4. day
  5. Fri 2.734737 19
  6. Sat 2.993103 87
  7. Sun 3.255132 76
  8. Thur 2.771452 62

Grouping by more than one column is done by passing a list of columns to the groupby() method.

  1. SELECT smoker, day, COUNT(*), AVG(tip)
  2. FROM tips
  3. GROUP BY smoker, day;
  4. /*
  5. smoker day
  6. No Fri 4 2.812500
  7. Sat 45 3.102889
  8. Sun 57 3.167895
  9. Thur 45 2.673778
  10. Yes Fri 15 2.714000
  11. Sat 42 2.875476
  12. Sun 19 3.516842
  13. Thur 17 3.030000
  14. */
  1. In [21]: tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
  2. Out[21]:
  3. tip
  4. size mean
  5. smoker day
  6. No Fri 4.0 2.812500
  7. Sat 45.0 3.102889
  8. Sun 57.0 3.167895
  9. Thur 45.0 2.673778
  10. Yes Fri 15.0 2.714000
  11. Sat 42.0 2.875476
  12. Sun 19.0 3.516842
  13. Thur 17.0 3.030000

JOIN

JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their indices. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indices).

  1. In [22]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
  2. ....: 'value': np.random.randn(4)})
  3. ....:
  4. In [23]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
  5. ....: 'value': np.random.randn(4)})
  6. ....:

Assume we have two database tables of the same name and structure as our DataFrames.

Now let’s go over the various types of JOINs.

INNER JOIN

  1. SELECT *
  2. FROM df1
  3. INNER JOIN df2
  4. ON df1.key = df2.key;
  1. # merge performs an INNER JOIN by default
  2. In [24]: pd.merge(df1, df2, on='key')
  3. Out[24]:
  4. key value_x value_y
  5. 0 B -0.282863 1.212112
  6. 1 D -1.135632 -0.173215
  7. 2 D -1.135632 0.119209

merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another DataFrame’s index.

  1. In [25]: indexed_df2 = df2.set_index('key')
  2. In [26]: pd.merge(df1, indexed_df2, left_on='key', right_index=True)
  3. Out[26]:
  4. key value_x value_y
  5. 1 B -0.282863 1.212112
  6. 3 D -1.135632 -0.173215
  7. 3 D -1.135632 0.119209
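As an aside, the same inner join can be written with join(), which matches on indices by default; this sketch (not part of the original example set) needs the suffix arguments because both frames carry a value column:

  1. # sketch: index-on-index join, equivalent to the INNER JOIN above
  2. df1.set_index('key').join(indexed_df2, how='inner', lsuffix='_x', rsuffix='_y')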

LEFT OUTER JOIN

  1. -- show all records from df1
  2. SELECT *
  3. FROM df1
  4. LEFT OUTER JOIN df2
  5. ON df1.key = df2.key;
  1. # show all records from df1
  2. In [27]: pd.merge(df1, df2, on='key', how='left')
  3. Out[27]:
  4. key value_x value_y
  5. 0 A 0.469112 NaN
  6. 1 B -0.282863 1.212112
  7. 2 C -1.509059 NaN
  8. 3 D -1.135632 -0.173215
  9. 4 D -1.135632 0.119209

RIGHT JOIN

  1. -- show all records from df2
  2. SELECT *
  3. FROM df1
  4. RIGHT OUTER JOIN df2
  5. ON df1.key = df2.key;
  1. # show all records from df2
  2. In [28]: pd.merge(df1, df2, on='key', how='right')
  3. Out[28]:
  4. key value_x value_y
  5. 0 B -0.282863 1.212112
  6. 1 D -1.135632 -0.173215
  7. 2 D -1.135632 0.119209
  8. 3 E NaN -1.044236

FULL JOIN

pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a match. As of this writing, FULL JOINs are not supported in all RDBMS; MySQL, for example, does not support them.

  1. -- show all records from both tables
  2. SELECT *
  3. FROM df1
  4. FULL OUTER JOIN df2
  5. ON df1.key = df2.key;
  1. # show all records from both frames
  2. In [29]: pd.merge(df1, df2, on='key', how='outer')
  3. Out[29]:
  4. key value_x value_y
  5. 0 A 0.469112 NaN
  6. 1 B -0.282863 1.212112
  7. 2 C -1.509059 NaN
  8. 3 D -1.135632 -0.173215
  9. 4 D -1.135632 0.119209
  10. 5 E NaN -1.044236

UNION

UNION ALL can be performed using concat().

  1. In [30]: df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
  2. ....: 'rank': range(1, 4)})
  3. ....:
  4. In [31]: df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
  5. ....: 'rank': [1, 4, 5]})
  6. ....:
  1. SELECT city, rank
  2. FROM df1
  3. UNION ALL
  4. SELECT city, rank
  5. FROM df2;
  6. /*
  7. city rank
  8. Chicago 1
  9. San Francisco 2
  10. New York City 3
  11. Chicago 1
  12. Boston 4
  13. Los Angeles 5
  14. */
  1. In [32]: pd.concat([df1, df2])
  2. Out[32]:
  3. city rank
  4. 0 Chicago 1
  5. 1 San Francisco 2
  6. 2 New York City 3
  7. 0 Chicago 1
  8. 1 Boston 4
  9. 2 Los Angeles 5

SQL’s UNION is similar to UNION ALL, however UNION will remove duplicate rows.

  1. SELECT city, rank
  2. FROM df1
  3. UNION
  4. SELECT city, rank
  5. FROM df2;
  6. -- notice that there is only one Chicago record this time
  7. /*
  8. city rank
  9. Chicago 1
  10. San Francisco 2
  11. New York City 3
  12. Boston 4
  13. Los Angeles 5
  14. */

In pandas, you can use concat() in conjunction with drop_duplicates().

  1. In [33]: pd.concat([df1, df2]).drop_duplicates()
  2. Out[33]:
  3. city rank
  4. 0 Chicago 1
  5. 1 San Francisco 2
  6. 2 New York City 3
  7. 1 Boston 4
  8. 2 Los Angeles 5

Pandas equivalents for some SQL analytic and aggregate functions

Top N rows with offset

  1. -- MySQL
  2. SELECT * FROM tips
  3. ORDER BY tip DESC
  4. LIMIT 10 OFFSET 5;
  1. In [34]: tips.nlargest(10 + 5, columns='tip').tail(10)
  2. Out[34]:
  3. total_bill tip sex smoker day time size
  4. 183 23.17 6.50 Male Yes Sun Dinner 4
  5. 214 28.17 6.50 Female Yes Sat Dinner 3
  6. 47 32.40 6.00 Male No Sun Dinner 4
  7. 239 29.03 5.92 Male No Sat Dinner 3
  8. 88 24.71 5.85 Male No Thur Lunch 2
  9. 181 23.33 5.65 Male Yes Sun Dinner 2
  10. 44 30.40 5.60 Male No Sun Dinner 4
  11. 52 34.81 5.20 Female No Sun Dinner 4
  12. 85 34.83 5.17 Female No Thur Lunch 4
  13. 211 25.89 5.16 Male Yes Sat Dinner 4

Top N rows per group

  1. -- Oracle's ROW_NUMBER() analytic function
  2. SELECT * FROM (
  3. SELECT
  4. t.*,
  5. ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
  6. FROM tips t
  7. )
  8. WHERE rn < 3
  9. ORDER BY day, rn;
  1. In [35]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
  2. ....: .groupby(['day'])
  3. ....: .cumcount() + 1)
  4. ....: .query('rn < 3')
  5. ....: .sort_values(['day', 'rn']))
  6. ....:
  7. Out[35]:
  8. total_bill tip sex smoker day time size rn
  9. 95 40.17 4.73 Male Yes Fri Dinner 4 1
  10. 90 28.97 3.00 Male Yes Fri Dinner 2 2
  11. 170 50.81 10.00 Male Yes Sat Dinner 3 1
  12. 212 48.33 9.00 Male No Sat Dinner 4 2
  13. 156 48.17 5.00 Male No Sun Dinner 6 1
  14. 182 45.35 3.50 Male Yes Sun Dinner 3 2
  15. 197 43.11 5.00 Female Yes Thur Lunch 4 1
  16. 142 41.19 5.00 Male No Thur Lunch 5 2

The same result can be obtained using the rank(method='first') function:

  1. In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill']
  2. ....: .rank(method='first', ascending=False))
  3. ....: .query('rnk < 3')
  4. ....: .sort_values(['day', 'rnk']))
  5. ....:
  6. Out[36]:
  7. total_bill tip sex smoker day time size rnk
  8. 95 40.17 4.73 Male Yes Fri Dinner 4 1.0
  9. 90 28.97 3.00 Male Yes Fri Dinner 2 2.0
  10. 170 50.81 10.00 Male Yes Sat Dinner 3 1.0
  11. 212 48.33 9.00 Male No Sat Dinner 4 2.0
  12. 156 48.17 5.00 Male No Sun Dinner 6 1.0
  13. 182 45.35 3.50 Male Yes Sun Dinner 3 2.0
  14. 197 43.11 5.00 Female Yes Thur Lunch 4 1.0
  15. 142 41.19 5.00 Male No Thur Lunch 5 2.0
  1. -- Oracle's RANK() analytic function
  2. SELECT * FROM (
  3. SELECT
  4. t.*,
  5. RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
  6. FROM tips t
  7. WHERE tip < 2
  8. )
  9. WHERE rnk < 3
  10. ORDER BY sex, rnk;

Let's find tips with rank < 3 per gender group for tips < 2. Notice that when using rank(method='min'), rnk_min stays the same for tied tip values (as with Oracle's RANK() function):

  1. In [37]: (tips[tips['tip'] < 2]
  2. ....: .assign(rnk_min=tips.groupby(['sex'])['tip']
  3. ....: .rank(method='min'))
  4. ....: .query('rnk_min < 3')
  5. ....: .sort_values(['sex', 'rnk_min']))
  6. ....:
  7. Out[37]:
  8. total_bill tip sex smoker day time size rnk_min
  9. 67 3.07 1.00 Female Yes Sat Dinner 1 1.0
  10. 92 5.75 1.00 Female Yes Fri Dinner 2 1.0
  11. 111 7.25 1.00 Female No Sat Dinner 1 1.0
  12. 236 12.60 1.00 Male Yes Sat Dinner 2 1.0
  13. 237 32.83 1.17 Male Yes Sat Dinner 2 2.0

UPDATE

  1. UPDATE tips
  2. SET tip = tip*2
  3. WHERE tip < 2;
  1. In [38]: tips.loc[tips['tip'] < 2, 'tip'] *= 2

DELETE

  1. DELETE FROM tips
  2. WHERE tip > 9;

In pandas we select the rows that should remain instead of deleting them:

  1. In [39]: tips = tips.loc[tips['tip'] <= 9]

Comparison with SAS

For potential users coming from SAS, this page is meant to demonstrate how different SAS operations would be performed in pandas.

If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.

As is customary, we import pandas and NumPy as follows:

  1. In [1]: import pandas as pd
  2. In [2]: import numpy as np

::: tip Note

Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which displays the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter notebook or terminal) - the equivalent in SAS would be:

  1. proc print data=df(obs=5);
  2. run;

:::

Data Structures

General Terminology Translation

| pandas | SAS |
| --- | --- |
| DataFrame | data set |
| column | variable |
| row | observation |
| groupby | BY-group |
| NaN | . |

DataFrame / Series

A DataFrame in pandas is analogous to a SAS data set - a two-dimensional data source with labeled columns that can be of different types. As will be shown in this document, almost any operation that can be applied to a data set using SAS's DATA step can also be accomplished in pandas.

A Series is the data structure that represents one column of a DataFrame. SAS doesn’t have a separate data structure for a single column, but in general, working with a Series is analogous to referencing a column in the DATA step.

Index

Every DataFrame and Series has an Index - which are labels on the rows of the data. SAS does not have an exactly analogous concept. A data set’s rows are essentially unlabeled, other than an implicit integer index that can be accessed during the DATA step (_N_).

In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1, and so on). While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately an important part of pandas to understand, for this comparison we will essentially ignore the Index and just treat the DataFrame as a collection of columns. Please see the indexing documentation for much more on how to use an Index effectively.

Data Input / Output

Constructing a DataFrame from Values

A SAS data set can be built from specified values by placing the data after a datalines statement and specifying the column names.

  1. data df;
  2. input x y;
  3. datalines;
  4. 1 2
  5. 3 4
  6. 5 6
  7. ;
  8. run;

A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often convenient to specify it as a Python dictionary, where the keys are the column names and the values are the data.

  1. In [3]: df = pd.DataFrame({'x': [1, 3, 5], 'y': [2, 4, 6]})
  2. In [4]: df
  3. Out[4]:
  4. x y
  5. 0 1 2
  6. 1 3 4
  7. 2 5 6

Reading External Data

Like SAS, pandas provides utilities for reading in data from many formats. The tips dataset, found within the pandas tests (csv), will be used in many of the following examples.

SAS provides PROC IMPORT to read csv data into a data set.

  1. proc import datafile='tips.csv' dbms=csv out=tips replace;
  2. getnames=yes;
  3. run;

The pandas method is read_csv(), which works similarly.

  1. In [5]: url = ('https://raw.github.com/pandas-dev/'
  2. ...: 'pandas/master/pandas/tests/data/tips.csv')
  3. ...:
  4. In [6]: tips = pd.read_csv(url)
  5. In [7]: tips.head()
  6. Out[7]:
  7. total_bill tip sex smoker day time size
  8. 0 16.99 1.01 Female No Sun Dinner 2
  9. 1 10.34 1.66 Male No Sun Dinner 3
  10. 2 21.01 3.50 Male No Sun Dinner 3
  11. 3 23.68 3.31 Male No Sun Dinner 2
  12. 4 24.59 3.61 Female No Sun Dinner 4

Like PROC IMPORT, read_csv can take a number of parameters to specify how the data should be parsed. For example, if the data was instead tab delimited, and did not have column names, the pandas command would be:

  1. tips = pd.read_csv('tips.csv', sep='\t', header=None)
  2. # alternatively, read_table is an alias to read_csv with tab delimiter
  3. tips = pd.read_table('tips.csv', header=None)

In addition to text/csv, pandas supports a variety of other data formats such as Excel, HDF5, and SQL databases. These are all read via a pd.read_* function. See the IO documentation for more details.
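A few hedged sketches of those readers (the file names are placeholders, and the Excel and HDF5 readers require optional dependencies such as an Excel engine and PyTables):

  1. # sketches only; every pd.read_* function mirrors read_csv's basic interface
  2. xlsx = pd.read_excel('tips.xlsx')
  3. hdf = pd.read_hdf('tips.h5', 'tips')
  4. import sqlite3
  5. sql = pd.read_sql('SELECT * FROM tips', sqlite3.connect('tips.db'))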

Exporting Data

The inverse of PROC IMPORT in SAS is PROC EXPORT

  1. proc export data=tips outfile='tips2.csv' dbms=csv;
  2. run;

Similarly in pandas, the opposite of read_csv is to_csv(), and other data formats follow a similar API.

  1. tips.to_csv('tips2.csv')
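Other writers follow the same pattern; for instance (a sketch assuming an Excel engine is installed, with placeholder file names):

  1. # sketch: the to_* writers mirror to_csv
  2. tips.to_excel('tips2.xlsx', index=False)
  3. tips.to_json('tips2.json')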

Data Operations

Operations on Columns

In the DATA step, arbitrary math expressions can be used on new or existing columns.

  1. data tips;
  2. set tips;
  3. total_bill = total_bill - 2;
  4. new_bill = total_bill / 2;
  5. run;

pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New columns can be assigned in the same way.

  1. In [8]: tips['total_bill'] = tips['total_bill'] - 2
  2. In [9]: tips['new_bill'] = tips['total_bill'] / 2.0
  3. In [10]: tips.head()
  4. Out[10]:
  5. total_bill tip sex smoker day time size new_bill
  6. 0 14.99 1.01 Female No Sun Dinner 2 7.495
  7. 1 8.34 1.66 Male No Sun Dinner 3 4.170
  8. 2 19.01 3.50 Male No Sun Dinner 3 9.505
  9. 3 21.68 3.31 Male No Sun Dinner 2 10.840
  10. 4 22.59 3.61 Female No Sun Dinner 4 11.295

Filtering

Filtering in SAS is done with an if or where statement, on one or more columns.

  1. data tips;
  2. set tips;
  3. if total_bill > 10;
  4. run;
  5. data tips;
  6. set tips;
  7. where total_bill > 10;
  8. /* equivalent in this case - where happens before the
  9. DATA step begins and can also be used in PROC statements */
  10. run;

DataFrames can be filtered in multiple ways; the most intuitive of these is boolean indexing.

  1. In [11]: tips[tips['total_bill'] > 10].head()
  2. Out[11]:
  3. total_bill tip sex smoker day time size
  4. 0 14.99 1.01 Female No Sun Dinner 2
  5. 2 19.01 3.50 Male No Sun Dinner 3
  6. 3 21.68 3.31 Male No Sun Dinner 2
  7. 4 22.59 3.61 Female No Sun Dinner 4
  8. 5 23.29 4.71 Male No Sun Dinner 4

If/Then Logic

In SAS, if/then logic can be used to create new columns.

  1. data tips;
  2. set tips;
  3. format bucket $4.;
  4. if total_bill < 10 then bucket = 'low';
  5. else bucket = 'high';
  6. run;

The same operation in pandas can be accomplished using the numpy where function.

  1. In [12]: tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')
  2. In [13]: tips.head()
  3. Out[13]:
  4. total_bill tip sex smoker day time size bucket
  5. 0 14.99 1.01 Female No Sun Dinner 2 high
  6. 1 8.34 1.66 Male No Sun Dinner 3 low
  7. 2 19.01 3.50 Male No Sun Dinner 3 high
  8. 3 21.68 3.31 Male No Sun Dinner 2 high
  9. 4 22.59 3.61 Female No Sun Dinner 4 high

Date Functionality

SAS provides a variety of functions to do operations on date/datetime columns.

  1. data tips;
  2. set tips;
  3. format date1 date2 date1_plusmonth mmddyy10.;
  4. date1 = mdy(1, 15, 2013);
  5. date2 = mdy(2, 15, 2015);
  6. date1_year = year(date1);
  7. date2_month = month(date2);
  8. * shift date to beginning of next interval;
  9. date1_next = intnx('MONTH', date1, 1);
  10. * count intervals between dates;
  11. months_between = intck('MONTH', date1, date2);
  12. run;

The equivalent pandas operations are shown below. In addition to these functions pandas supports other Time Series features not available in Base SAS (such as resampling and custom offsets) - see the timeseries documentation for more details.

  1. In [14]: tips['date1'] = pd.Timestamp('2013-01-15')
  2. In [15]: tips['date2'] = pd.Timestamp('2015-02-15')
  3. In [16]: tips['date1_year'] = tips['date1'].dt.year
  4. In [17]: tips['date2_month'] = tips['date2'].dt.month
  5. In [18]: tips['date1_next'] = tips['date1'] + pd.offsets.MonthBegin()
  6. In [19]: tips['months_between'] = (
  7. ....: tips['date2'].dt.to_period('M') - tips['date1'].dt.to_period('M'))
  8. ....:
  9. In [20]: tips[['date1', 'date2', 'date1_year', 'date2_month',
  10. ....: 'date1_next', 'months_between']].head()
  11. ....:
  12. Out[20]:
  13. date1 date2 date1_year date2_month date1_next months_between
  14. 0 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
  15. 1 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
  16. 2 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
  17. 3 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
  18. 4 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>

Selection of Columns

SAS provides keywords in the DATA step to select, drop, and rename columns.

  1. data tips;
  2. set tips;
  3. keep sex total_bill tip;
  4. run;
  5. data tips;
  6. set tips;
  7. drop sex;
  8. run;
  9. data tips;
  10. set tips;
  11. rename total_bill=total_bill_2;
  12. run;

The same operations are expressed in pandas below.

  1. # keep
  2. In [21]: tips[['sex', 'total_bill', 'tip']].head()
  3. Out[21]:
  4. sex total_bill tip
  5. 0 Female 14.99 1.01
  6. 1 Male 8.34 1.66
  7. 2 Male 19.01 3.50
  8. 3 Male 21.68 3.31
  9. 4 Female 22.59 3.61
  10. # drop
  11. In [22]: tips.drop('sex', axis=1).head()
  12. Out[22]:
  13. total_bill tip smoker day time size
  14. 0 14.99 1.01 No Sun Dinner 2
  15. 1 8.34 1.66 No Sun Dinner 3
  16. 2 19.01 3.50 No Sun Dinner 3
  17. 3 21.68 3.31 No Sun Dinner 2
  18. 4 22.59 3.61 No Sun Dinner 4
  19. # rename
  20. In [23]: tips.rename(columns={'total_bill': 'total_bill_2'}).head()
  21. Out[23]:
  22. total_bill_2 tip sex smoker day time size
  23. 0 14.99 1.01 Female No Sun Dinner 2
  24. 1 8.34 1.66 Male No Sun Dinner 3
  25. 2 19.01 3.50 Male No Sun Dinner 3
  26. 3 21.68 3.31 Male No Sun Dinner 2
  27. 4 22.59 3.61 Female No Sun Dinner 4

Sorting by Values

Sorting in SAS is accomplished via PROC SORT

  1. proc sort data=tips;
  2. by sex total_bill;
  3. run;

pandas objects have a sort_values() method, which takes a list of columns to sort by.

  1. In [24]: tips = tips.sort_values(['sex', 'total_bill'])
  2. In [25]: tips.head()
  3. Out[25]:
  4. total_bill tip sex smoker day time size
  5. 67 1.07 1.00 Female Yes Sat Dinner 1
  6. 92 3.75 1.00 Female Yes Fri Dinner 2
  7. 111 5.25 1.00 Female No Sat Dinner 1
  8. 145 6.35 1.50 Female No Thur Lunch 2
  9. 135 6.51 1.25 Female No Thur Lunch 2

String Processing

Length

SAS determines the length of a character string with the LENGTHN and LENGTHC functions. LENGTHN excludes trailing blanks and LENGTHC includes trailing blanks.

  1. data _null_;
  2. set tips;
  3. put(LENGTHN(time));
  4. put(LENGTHC(time));
  5. run;

Python determines the length of a character string with the len function; in pandas this is vectorized as str.len(), which includes trailing blanks. Use str.rstrip() before str.len() to exclude trailing blanks.

  1. In [26]: tips['time'].str.len().head()
  2. Out[26]:
  3. 67 6
  4. 92 6
  5. 111 6
  6. 145 5
  7. 135 5
  8. Name: time, dtype: int64
  9. In [27]: tips['time'].str.rstrip().str.len().head()
  10. Out[27]:
  11. 67 6
  12. 92 6
  13. 111 6
  14. 145 5
  15. 135 5
  16. Name: time, dtype: int64

Find

SAS determines the position of a character in a string with the FINDW function. FINDW takes the string defined by the first argument and searches for the first position of the substring you supply as the second argument.

  1. data _null_;
  2. set tips;
  3. put(FINDW(sex,'ale'));
  4. run;

Python determines the position of a character in a string with the find function, vectorized in pandas as str.find(), which searches for the first position of the substring. If the substring is found, the method returns its position. Keep in mind that Python indexes are zero-based and the method returns -1 if it fails to find the substring.

  1. In [28]: tips['sex'].str.find("ale").head()
  2. Out[28]:
  3. 67 3
  4. 92 3
  5. 111 3
  6. 145 3
  7. 135 3
  8. Name: sex, dtype: int64

Substring

SAS extracts a substring from a string based on its position with the SUBSTR function.

  1. data _null_;
  2. set tips;
  3. put(substr(sex,1,1));
  4. run;

With pandas you can use [] notation to extract a substring from a string by position locations. Keep in mind that Python indexes are zero-based.

  1. In [29]: tips['sex'].str[0:1].head()
  2. Out[29]:
  3. 67 F
  4. 92 F
  5. 111 F
  6. 145 F
  7. 135 F
  8. Name: sex, dtype: object

Scan

The SAS SCAN function returns the nth word from a string. The first argument is the string you want to parse and the second argument specifies which word you want to extract.

  1. data firstlast;
  2. input String $60.;
  3. First_Name = scan(string, 1);
  4. Last_Name = scan(string, -1);
  5. datalines2;
  6. John Smith;
  7. Jane Cook;
  8. ;;;
  9. run;

In pandas, a simple approach is to split the string on whitespace with the str.split and str.rsplit methods; regular expressions offer a much more powerful alternative.

  1. In [30]: firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})
  2. In [31]: firstlast['First_Name'] = firstlast['String'].str.split(" ", expand=True)[0]
  3. In [32]: firstlast['Last_Name'] = firstlast['String'].str.rsplit(" ", expand=True)[1]
  4. In [33]: firstlast
  5. Out[33]:
  6. String First_Name Last_Name
  7. 0 John Smith John Smith
  8. 1 Jane Cook Jane Cook

Upcase, Lowcase, and Propcase

The SAS UPCASE LOWCASE and PROPCASE functions change the case of the argument.

  1. data firstlast;
  2. input String $60.;
  3. string_up = UPCASE(string);
  4. string_low = LOWCASE(string);
  5. string_prop = PROPCASE(string);
  6. datalines2;
  7. John Smith;
  8. Jane Cook;
  9. ;;;
  10. run;

The equivalent pandas string methods are str.upper(), str.lower(), and str.title().

  1. In [34]: firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})
  2. In [35]: firstlast['string_up'] = firstlast['String'].str.upper()
  3. In [36]: firstlast['string_low'] = firstlast['String'].str.lower()
  4. In [37]: firstlast['string_prop'] = firstlast['String'].str.title()
  5. In [38]: firstlast
  6. Out[38]:
  7. String string_up string_low string_prop
  8. 0 John Smith JOHN SMITH john smith John Smith
  9. 1 Jane Cook JANE COOK jane cook Jane Cook

Merging

The following tables will be used in the merge examples

  1. In [39]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
  2. ....: 'value': np.random.randn(4)})
  3. ....:
  4. In [40]: df1
  5. Out[40]:
  6. key value
  7. 0 A 0.469112
  8. 1 B -0.282863
  9. 2 C -1.509059
  10. 3 D -1.135632
  11. In [41]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
  12. ....: 'value': np.random.randn(4)})
  13. ....:
  14. In [42]: df2
  15. Out[42]:
  16. key value
  17. 0 B 1.212112
  18. 1 D -0.173215
  19. 2 D 0.119209
  20. 3 E -1.044236

In SAS, data must be explicitly sorted before merging. Different types of joins are accomplished using the in= dummy variables to track whether a match was found in one or both input frames.

  1. proc sort data=df1;
  2. by key;
  3. run;
  4. proc sort data=df2;
  5. by key;
  6. run;
  7. data left_join inner_join right_join outer_join;
  8. merge df1(in=a) df2(in=b);
  9. if a and b then output inner_join;
  10. if a then output left_join;
  11. if b then output right_join;
  12. if a or b then output outer_join;
  13. run;

pandas DataFrames have a merge() method, which provides similar functionality. Note that the data does not have to be sorted ahead of time, and different join types are accomplished via the how keyword.

  1. In [43]: inner_join = df1.merge(df2, on=['key'], how='inner')
  2. In [44]: inner_join
  3. Out[44]:
  4. key value_x value_y
  5. 0 B -0.282863 1.212112
  6. 1 D -1.135632 -0.173215
  7. 2 D -1.135632 0.119209
  8. In [45]: left_join = df1.merge(df2, on=['key'], how='left')
  9. In [46]: left_join
  10. Out[46]:
  11. key value_x value_y
  12. 0 A 0.469112 NaN
  13. 1 B -0.282863 1.212112
  14. 2 C -1.509059 NaN
  15. 3 D -1.135632 -0.173215
  16. 4 D -1.135632 0.119209
  17. In [47]: right_join = df1.merge(df2, on=['key'], how='right')
  18. In [48]: right_join
  19. Out[48]:
  20. key value_x value_y
  21. 0 B -0.282863 1.212112
  22. 1 D -1.135632 -0.173215
  23. 2 D -1.135632 0.119209
  24. 3 E NaN -1.044236
  25. In [49]: outer_join = df1.merge(df2, on=['key'], how='outer')
  26. In [50]: outer_join
  27. Out[50]:
  28. key value_x value_y
  29. 0 A 0.469112 NaN
  30. 1 B -0.282863 1.212112
  31. 2 C -1.509059 NaN
  32. 3 D -1.135632 -0.173215
  33. 4 D -1.135632 0.119209
  34. 5 E NaN -1.044236

Missing Data

Like SAS, pandas has a representation for missing data - which is the special float value NaN (not a number). Many of the semantics are the same, for example missing data propagates through numeric operations, and is ignored by default for aggregations.

  1. In [51]: outer_join
  2. Out[51]:
  3. key value_x value_y
  4. 0 A 0.469112 NaN
  5. 1 B -0.282863 1.212112
  6. 2 C -1.509059 NaN
  7. 3 D -1.135632 -0.173215
  8. 4 D -1.135632 0.119209
  9. 5 E NaN -1.044236
  10. In [52]: outer_join['value_x'] + outer_join['value_y']
  11. Out[52]:
  12. 0 NaN
  13. 1 0.929249
  14. 2 NaN
  15. 3 -1.308847
  16. 4 -1.016424
  17. 5 NaN
  18. dtype: float64
  19. In [53]: outer_join['value_x'].sum()
  20. Out[53]: -3.5940742896293765

One difference is that missing data cannot be compared to its sentinel value. For example, in SAS you could do this to filter missing values.

  1. data outer_join_nulls;
  2. set outer_join;
  3. if value_x = .;
  4. run;
  5. data outer_join_no_nulls;
  6. set outer_join;
  7. if value_x ^= .;
  8. run;

This doesn't work in pandas. Instead, the pd.isna() or pd.notna() functions should be used for comparisons.

  1. In [54]: outer_join[pd.isna(outer_join['value_x'])]
  2. Out[54]:
  3. key value_x value_y
  4. 5 E NaN -1.044236
  5. In [55]: outer_join[pd.notna(outer_join['value_x'])]
  6. Out[55]:
  7. key value_x value_y
  8. 0 A 0.469112 NaN
  9. 1 B -0.282863 1.212112
  10. 2 C -1.509059 NaN
  11. 3 D -1.135632 -0.173215
  12. 4 D -1.135632 0.119209

pandas also provides a variety of methods to work with missing data, some of which would be challenging to express in SAS. For example, there are methods to drop all rows with any missing values, replace missing values with a specified value (like the mean), or forward fill from previous rows. See the missing data documentation for more.

  1. In [56]: outer_join.dropna()
  2. Out[56]:
  3. key value_x value_y
  4. 1 B -0.282863 1.212112
  5. 3 D -1.135632 -0.173215
  6. 4 D -1.135632 0.119209
  7. In [57]: outer_join.fillna(method='ffill')
  8. Out[57]:
  9. key value_x value_y
  10. 0 A 0.469112 NaN
  11. 1 B -0.282863 1.212112
  12. 2 C -1.509059 1.212112
  13. 3 D -1.135632 -0.173215
  14. 4 D -1.135632 0.119209
  15. 5 E -1.135632 -1.044236
  16. In [58]: outer_join['value_x'].fillna(outer_join['value_x'].mean())
  17. Out[58]:
  18. 0 0.469112
  19. 1 -0.282863
  20. 2 -1.509059
  21. 3 -1.135632
  22. 4 -1.135632
  23. 5 -0.718815
  24. Name: value_x, dtype: float64

GroupBy

Aggregation

SAS’s PROC SUMMARY can be used to group by one or more key variables and compute aggregations on numeric columns.

  1. proc summary data=tips nway;
  2. class sex smoker;
  3. var total_bill tip;
  4. output out=tips_summed sum=;
  5. run;

pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documentation for more details and examples.

  1. In [59]: tips_summed = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()
  2. In [60]: tips_summed.head()
  3. Out[60]:
  4. total_bill tip
  5. sex smoker
  6. Female No 869.68 149.77
  7. Yes 527.27 96.74
  8. Male No 1725.75 302.00
  9. Yes 1217.07 183.07

Transformation

In SAS, if the group aggregations need to be used with the original frame, it must be merged back together. For example, to subtract the mean for each observation by smoker group.

  1. proc summary data=tips missing nway;
  2. class smoker;
  3. var total_bill;
  4. output out=smoker_means mean(total_bill)=group_bill;
  5. run;
  6. proc sort data=tips;
  7. by smoker;
  8. run;
  9. data tips;
  10. merge tips(in=a) smoker_means(in=b);
  11. by smoker;
  12. adj_total_bill = total_bill - group_bill;
  13. if a and b;
  14. run;

pandas groupby provides a transform mechanism that allows this type of operation to be succinctly expressed in one step.

  1. In [61]: gb = tips.groupby('smoker')['total_bill']
  2. In [62]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')
  3. In [63]: tips.head()
  4. Out[63]:
  5. total_bill tip sex smoker day time size adj_total_bill
  6. 67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
  7. 92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
  8. 111 5.25 1.00 Female No Sat Dinner 1 -11.938278
  9. 145 6.35 1.50 Female No Thur Lunch 2 -10.838278
  10. 135 6.51 1.25 Female No Thur Lunch 2 -10.678278

By Group Processing

In addition to aggregation, pandas groupby can be used to replicate most other by group processing from SAS. For example, this DATA step reads the data by sex/smoker group and filters to the first entry for each.

  1. proc sort data=tips;
  2. by sex smoker;
  3. run;
  4. data tips_first;
  5. set tips;
  6. by sex smoker;
  7. if FIRST.sex or FIRST.smoker then output;
  8. run;

In pandas this would be written as:

  1. In [64]: tips.groupby(['sex', 'smoker']).first()
  2. Out[64]:
  3. total_bill tip day time size adj_total_bill
  4. sex smoker
  5. Female No 5.25 1.00 Sat Dinner 1 -11.938278
  6. Yes 1.07 1.00 Sat Dinner 1 -17.686344
  7. Male No 5.51 2.00 Thur Lunch 2 -11.678278
  8. Yes 5.25 5.15 Sun Dinner 2 -13.506344

Other Considerations

Disk vs Memory

pandas operates exclusively in memory, whereas a SAS data set exists on disk. This means that the size of data able to be loaded in pandas is limited by your machine's memory, but also that the operations on that data may be faster.

If out-of-core processing is needed, one possibility is the dask.dataframe library (currently in development), which provides a subset of pandas functionality for an on-disk DataFrame.
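A hedged sketch of what that can look like with dask.dataframe (assuming dask is installed; the glob pattern is a placeholder):

  1. # sketch: dask mirrors a subset of the pandas API and evaluates lazily
  2. import dask.dataframe as dd
  3. ddf = dd.read_csv('big-*.csv')
  4. ddf.groupby('smoker')['total_bill'].mean().compute()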

Data Interop

pandas provides a read_sas() function that can read SAS data saved in the XPORT or SAS7BDAT binary format.

  1. libname xportout xport 'transport-file.xpt';
  2. data xportout.tips;
  3. set tips(rename=(total_bill=tbill));
  4. * xport variable names limited to 6 characters;
  5. run;
  1. df = pd.read_sas('transport-file.xpt')
  2. df = pd.read_sas('binary-file.sas7bdat')

By default, pandas will try to infer the file format from the file's extension, as above. You can also specify the format explicitly.

  1. df = pd.read_sas('transport-file.xpt', format='xport')
  2. df = pd.read_sas('binary-file.sas7bdat', format='sas7bdat')

XPORT is a relatively limited format, and parsing it is not as optimized as some of the other pandas readers. An alternative way to exchange data between SAS and pandas is to serialize it to CSV.

  1. # version 0.17, 10M rows
  2. In [8]: %time df = pd.read_sas('big.xpt')
  3. Wall time: 14.6 s
  4. In [9]: %time df = pd.read_csv('big.csv')
  5. Wall time: 4.86 s

Comparison with Stata

For potential users coming from Stata this page is meant to demonstrate how different Stata operations would be performed in pandas.

If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.

As is customary, we import pandas and NumPy as follows. This means that we can refer to the libraries as pd and np, respectively, for the rest of the document.

  1. In [1]: import pandas as pd
  2. In [2]: import numpy as np

::: tip Note

Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which displays the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter notebook or terminal) – the equivalent in Stata would be:

  1. list in 1/5

:::

Data Structures

General Terminology Translation

| pandas | Stata |
| --- | --- |
| DataFrame | data set |
| column | variable |
| row | observation |
| groupby | bysort |
| NaN | . |

DataFrame / Series

A DataFrame in pandas is analogous to a Stata data set – a two-dimensional data source with labeled columns that can be of different types. As will be shown in this document, almost any operation that can be applied to a data set in Stata can also be accomplished in pandas.

A Series is the data structure that represents one column of a DataFrame. Stata doesn’t have a separate data structure for a single column, but in general, working with a Series is analogous to referencing a column of a data set in Stata.

Index

Every DataFrame and Series has an Index – labels on the rows of the data. Stata does not have an exactly analogous concept. In Stata, a data set’s rows are essentially unlabeled, other than an implicit integer index that can be accessed with _n.

In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1, and so on). While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately an important part of pandas to understand, for this comparison we will essentially ignore the Index and just treat the DataFrame as a collection of columns. Please see the indexing documentation for much more on how to use an Index effectively.
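
As a small illustration (with hypothetical column names), the default integer index can be replaced by labels taken from a column, after which rows can be selected by label:

  1. df = pd.DataFrame({'id': ['a', 'b', 'c'], 'value': [1, 2, 3]})
  2. df.index  # default RangeIndex: 0, 1, 2
  3. indexed = df.set_index('id')  # rows are now labeled 'a', 'b', 'c'
  4. indexed.loc['b']  # select a row by its label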

Data Input / Output

Constructing a DataFrame from Values

A Stata data set can be built from specified values by placing the data after an input statement and specifying the column names.

  1. input x y
  2. 1 2
  3. 3 4
  4. 5 6
  5. end

A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often convenient to specify it as a Python dictionary, where the keys are the column names and the values are the data.

  1. In [3]: df = pd.DataFrame({'x': [1, 3, 5], 'y': [2, 4, 6]})
  2. In [4]: df
  3. Out[4]:
  4. x y
  5. 0 1 2
  6. 1 3 4
  7. 2 5 6
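
Equivalently, the same DataFrame can be built from a list of rows, with the column names given explicitly:

  1. df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['x', 'y'])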

Reading External Data

Like Stata, pandas provides utilities for reading in data from many formats. The tips data set, found within the pandas tests (csv), will be used in many of the following examples.

Stata provides import delimited to read csv data into a data set in memory. If the tips.csv file is in the current working directory, we can import it as follows.

  1. import delimited tips.csv

The pandas function is read_csv(), which works similarly. Additionally, it will automatically download the data set if presented with a URL.

  1. In [5]: url = ('https://raw.github.com/pandas-dev'
  2. ...: '/pandas/master/pandas/tests/data/tips.csv')
  3. ...:
  4. In [6]: tips = pd.read_csv(url)
  5. In [7]: tips.head()
  6. Out[7]:
  7. total_bill tip sex smoker day time size
  8. 0 16.99 1.01 Female No Sun Dinner 2
  9. 1 10.34 1.66 Male No Sun Dinner 3
  10. 2 21.01 3.50 Male No Sun Dinner 3
  11. 3 23.68 3.31 Male No Sun Dinner 2
  12. 4 24.59 3.61 Female No Sun Dinner 4

Like import delimited, read_csv() can take a number of parameters to specify how the data should be parsed. For example, if the data were instead tab delimited, did not have column names, and existed in the current working directory, the pandas command would be:

  1. tips = pd.read_csv('tips.csv', sep='\t', header=None)
  2. # alternatively, read_table is an alias to read_csv with tab delimiter
  3. tips = pd.read_table('tips.csv', header=None)

Pandas can also read Stata data sets in .dta format with the read_stata() function.

  1. df = pd.read_stata('data.dta')

In addition to text/csv and Stata files, pandas supports a variety of other data formats such as Excel, SAS, HDF5, Parquet, and SQL databases. These are all read via a pd.read_* function. See the IO documentation for more details.
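
For example (each reader requires the corresponding optional dependency; the file names are illustrative, and an already-open database connection conn is assumed for the SQL example):

  1. df = pd.read_excel('data.xlsx')
  2. df = pd.read_parquet('data.parquet')
  3. df = pd.read_hdf('data.h5', 'df')
  4. # read_sql takes a query or table name plus a database connection
  5. df = pd.read_sql('SELECT * FROM tips', conn)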

Exporting Data

The inverse of import delimited in Stata is export delimited.

  1. export delimited tips2.csv

Similarly in pandas, the opposite of read_csv is DataFrame.to_csv().

  1. tips.to_csv('tips2.csv')

Pandas can also export to Stata file format with the DataFrame.to_stata() method.

  1. tips.to_stata('tips2.dta')

Data Operations

Operations on Columns

In Stata, arbitrary math expressions can be used with the generate and replace commands on new or existing columns. The drop command drops the column from the data set.

  1. replace total_bill = total_bill - 2
  2. generate new_bill = total_bill / 2
  3. drop new_bill

pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New columns can be assigned in the same way. The DataFrame.drop() method drops a column from the DataFrame.

  1. In [8]: tips['total_bill'] = tips['total_bill'] - 2
  2. In [9]: tips['new_bill'] = tips['total_bill'] / 2
  3. In [10]: tips.head()
  4. Out[10]:
  5. total_bill tip sex smoker day time size new_bill
  6. 0 14.99 1.01 Female No Sun Dinner 2 7.495
  7. 1 8.34 1.66 Male No Sun Dinner 3 4.170
  8. 2 19.01 3.50 Male No Sun Dinner 3 9.505
  9. 3 21.68 3.31 Male No Sun Dinner 2 10.840
  10. 4 22.59 3.61 Female No Sun Dinner 4 11.295
  11. In [11]: tips = tips.drop('new_bill', axis=1)

Filtering

Filtering in Stata is done with an if clause on one or more columns.

  1. list if total_bill > 10

DataFrames can be filtered in multiple ways, the most intuitive of which is boolean indexing.

  1. In [12]: tips[tips['total_bill'] > 10].head()
  2. Out[12]:
  3. total_bill tip sex smoker day time size
  4. 0 14.99 1.01 Female No Sun Dinner 2
  5. 2 19.01 3.50 Male No Sun Dinner 3
  6. 3 21.68 3.31 Male No Sun Dinner 2
  7. 4 22.59 3.61 Female No Sun Dinner 4
  8. 5 23.29 4.71 Male No Sun Dinner 4
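
DataFrame.query() offers an alternative, expression-based syntax that reads closer to Stata's if clause:

  1. # equivalent to the boolean indexing above
  2. tips.query('total_bill > 10').head()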

If/Then Logic

In Stata, an if clause can also be used to create new columns.

  1. generate bucket = "low" if total_bill < 10
  2. replace bucket = "high" if total_bill >= 10

The same operation can be accomplished in pandas using numpy's where function.

  1. In [13]: tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')
  2. In [14]: tips.head()
  3. Out[14]:
  4. total_bill tip sex smoker day time size bucket
  5. 0 14.99 1.01 Female No Sun Dinner 2 high
  6. 1 8.34 1.66 Male No Sun Dinner 3 low
  7. 2 19.01 3.50 Male No Sun Dinner 3 high
  8. 3 21.68 3.31 Male No Sun Dinner 2 high
  9. 4 22.59 3.61 Female No Sun Dinner 4 high
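
For more than two categories, numpy's select pairs a list of conditions with a list of choices; a sketch with illustrative cutoffs:

  1. # conditions are checked in order; 'high' applies where none match
  2. conditions = [tips['total_bill'] < 10, tips['total_bill'] < 20]
  3. tips['bucket'] = np.select(conditions, ['low', 'medium'], default='high')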

Date Functionality

Stata provides a variety of functions to do operations on date/datetime columns.

  1. generate date1 = mdy(1, 15, 2013)
  2. generate date2 = date("Feb152015", "MDY")
  3. generate date1_year = year(date1)
  4. generate date2_month = month(date2)
  5. * shift date to beginning of next month
  6. generate date1_next = mdy(month(date1) + 1, 1, year(date1)) if month(date1) != 12
  7. replace date1_next = mdy(1, 1, year(date1) + 1) if month(date1) == 12
  8. generate months_between = mofd(date2) - mofd(date1)
  9. list date1 date2 date1_year date2_month date1_next months_between

The equivalent pandas operations are shown below. In addition to these functions, pandas supports other Time Series features not available in Stata (such as time zone handling and custom offsets) – see the timeseries documentation for more details.

  1. In [15]: tips['date1'] = pd.Timestamp('2013-01-15')
  2. In [16]: tips['date2'] = pd.Timestamp('2015-02-15')
  3. In [17]: tips['date1_year'] = tips['date1'].dt.year
  4. In [18]: tips['date2_month'] = tips['date2'].dt.month
  5. In [19]: tips['date1_next'] = tips['date1'] + pd.offsets.MonthBegin()
  6. In [20]: tips['months_between'] = (tips['date2'].dt.to_period('M')
  7. ....: - tips['date1'].dt.to_period('M'))
  8. ....:
  9. In [21]: tips[['date1', 'date2', 'date1_year', 'date2_month', 'date1_next',
  10. ....: 'months_between']].head()
  11. ....:
  12. Out[21]:
  13. date1 date2 date1_year date2_month date1_next months_between
  14. 0 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
  15. 1 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
  16. 2 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
  17. 3 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
  18. 4 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>

Selection of Columns

Stata provides keywords to select, drop, and rename columns.

  1. keep sex total_bill tip
  2. drop sex
  3. rename total_bill total_bill_2

The same operations are expressed in pandas below. Note that in contrast to Stata, these operations do not happen in place. To make these changes persist, assign the operation back to a variable.

  1. # keep
  2. In [22]: tips[['sex', 'total_bill', 'tip']].head()
  3. Out[22]:
  4. sex total_bill tip
  5. 0 Female 14.99 1.01
  6. 1 Male 8.34 1.66
  7. 2 Male 19.01 3.50
  8. 3 Male 21.68 3.31
  9. 4 Female 22.59 3.61
  10. # drop
  11. In [23]: tips.drop('sex', axis=1).head()
  12. Out[23]:
  13. total_bill tip smoker day time size
  14. 0 14.99 1.01 No Sun Dinner 2
  15. 1 8.34 1.66 No Sun Dinner 3
  16. 2 19.01 3.50 No Sun Dinner 3
  17. 3 21.68 3.31 No Sun Dinner 2
  18. 4 22.59 3.61 No Sun Dinner 4
  19. # rename
  20. In [24]: tips.rename(columns={'total_bill': 'total_bill_2'}).head()
  21. Out[24]:
  22. total_bill_2 tip sex smoker day time size
  23. 0 14.99 1.01 Female No Sun Dinner 2
  24. 1 8.34 1.66 Male No Sun Dinner 3
  25. 2 19.01 3.50 Male No Sun Dinner 3
  26. 3 21.68 3.31 Male No Sun Dinner 2
  27. 4 22.59 3.61 Female No Sun Dinner 4

Sorting by Values

Sorting in Stata is accomplished via sort.

  1. sort sex total_bill

pandas objects have a DataFrame.sort_values() method, which takes a list of columns to sort by.

  1. In [25]: tips = tips.sort_values(['sex', 'total_bill'])
  2. In [26]: tips.head()
  3. Out[26]:
  4. total_bill tip sex smoker day time size
  5. 67 1.07 1.00 Female Yes Sat Dinner 1
  6. 92 3.75 1.00 Female Yes Fri Dinner 2
  7. 111 5.25 1.00 Female No Sat Dinner 1
  8. 145 6.35 1.50 Female No Thur Lunch 2
  9. 135 6.51 1.25 Female No Thur Lunch 2
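
Stata's gsort allows mixed sort directions with a leading minus sign (gsort sex -total_bill); in pandas, pass a list to the ascending keyword:

  1. # sex ascending, total_bill descending
  2. tips.sort_values(['sex', 'total_bill'], ascending=[True, False]).head()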

String Processing

Finding Length of String

Stata determines the length of a character string with the strlen() and ustrlen() functions for ASCII and Unicode strings, respectively.

  1. generate strlen_time = strlen(time)
  2. generate ustrlen_time = ustrlen(time)

Python determines the length of a character string with the len function; in Python 3, all strings are Unicode strings. Note that len includes trailing blanks, so combine it with rstrip to exclude them.

  1. In [27]: tips['time'].str.len().head()
  2. Out[27]:
  3. 67 6
  4. 92 6
  5. 111 6
  6. 145 5
  7. 135 5
  8. Name: time, dtype: int64
  9. In [28]: tips['time'].str.rstrip().str.len().head()
  10. Out[28]:
  11. 67 6
  12. 92 6
  13. 111 6
  14. 145 5
  15. 135 5
  16. Name: time, dtype: int64

Finding Position of Substring

Stata determines the position of a character in a string with the strpos() function. This takes the string defined by the first argument and searches for the first position of the substring you supply as the second argument.

  1. generate str_position = strpos(sex, "ale")

Python determines the position of a substring within a string with the find() function, which searches for the first occurrence of the substring. If the substring is found, the function returns its position. Keep in mind that Python indexes are zero-based and that the function returns -1 if it fails to find the substring.

  1. In [29]: tips['sex'].str.find("ale").head()
  2. Out[29]:
  3. 67 3
  4. 92 3
  5. 111 3
  6. 145 3
  7. 135 3
  8. Name: sex, dtype: int64

Extracting Substring by Position

Stata extracts a substring from a string based on its position with the substr() function.

  1. generate short_sex = substr(sex, 1, 1)

With pandas you can use [] notation to extract a substring from a string by position locations. Keep in mind that Python indexes are zero-based.

  1. In [30]: tips['sex'].str[0:1].head()
  2. Out[30]:
  3. 67 F
  4. 92 F
  5. 111 F
  6. 145 F
  7. 135 F
  8. Name: sex, dtype: object

Extracting nth Word

The Stata word() function returns the nth word from a string. The first argument is the string you want to parse and the second argument specifies which word you want to extract.

  1. clear
  2. input str20 string
  3. "John Smith"
  4. "Jane Cook"
  5. end
  6. generate first_name = word(string, 1)
  7. generate last_name = word(string, -1)

In pandas, a simple approach is to split the string on whitespace; far more powerful techniques based on regular expressions are also available (see the sketch after this example).

  1. In [31]: firstlast = pd.DataFrame({'string': ['John Smith', 'Jane Cook']})
  2. In [32]: firstlast['First_Name'] = firstlast['string'].str.split(" ", expand=True)[0]
  3. In [33]: firstlast['Last_Name'] = firstlast['string'].str.rsplit(" ", n=1, expand=True)[1]
  4. In [34]: firstlast
  5. Out[34]:
  6. string First_Name Last_Name
  7. 0 John Smith John Smith
  8. 1 Jane Cook Jane Cook
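
One of the more powerful regular-expression approaches mentioned above is Series.str.extract(), which returns one column per capture group; a sketch assuming simple two-word names:

  1. # one column per capture group; NaN where the pattern does not match
  2. extracted = firstlast['string'].str.extract(r'(\w+)\s+(\w+)')
  3. firstlast['First_Name'] = extracted[0]
  4. firstlast['Last_Name'] = extracted[1]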

Changing Case

Stata changes the case of ASCII strings with the strupper(), strlower(), and strproper() functions, and of Unicode strings with ustrupper(), ustrlower(), and ustrtitle().

  1. clear
  2. input str20 string
  3. "John Smith"
  4. "Jane Cook"
  5. end
  6. generate upper = strupper(string)
  7. generate lower = strlower(string)
  8. generate title = strproper(string)
  9. list

The equivalent Python functions are upper, lower, and title.

  1. In [35]: firstlast = pd.DataFrame({'string': ['John Smith', 'Jane Cook']})
  2. In [36]: firstlast['upper'] = firstlast['string'].str.upper()
  3. In [37]: firstlast['lower'] = firstlast['string'].str.lower()
  4. In [38]: firstlast['title'] = firstlast['string'].str.title()
  5. In [39]: firstlast
  6. Out[39]:
  7. string upper lower title
  8. 0 John Smith JOHN SMITH john smith John Smith
  9. 1 Jane Cook JANE COOK jane cook Jane Cook

Merging

The following tables will be used in the merge examples.

  1. In [40]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
  2. ....: 'value': np.random.randn(4)})
  3. ....:
  4. In [41]: df1
  5. Out[41]:
  6. key value
  7. 0 A 0.469112
  8. 1 B -0.282863
  9. 2 C -1.509059
  10. 3 D -1.135632
  11. In [42]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
  12. ....: 'value': np.random.randn(4)})
  13. ....:
  14. In [43]: df2
  15. Out[43]:
  16. key value
  17. 0 B 1.212112
  18. 1 D -0.173215
  19. 2 D 0.119209
  20. 3 E -1.044236

In Stata, to perform a merge, one data set must be in memory and the other must be referenced as a file name on disk. In contrast, Python must have both DataFrames already in memory.

By default, Stata performs an outer join, where all observations from both data sets are left in memory after the merge. One can keep only observations from the initial data set, the merged data set, or the intersection of the two by using the values created in the _merge variable.

  1. * First create df2 and save to disk
  2. clear
  3. input str1 key
  4. B
  5. D
  6. D
  7. E
  8. end
  9. generate value = rnormal()
  10. save df2.dta
  11. * Now create df1 in memory
  12. clear
  13. input str1 key
  14. A
  15. B
  16. C
  17. D
  18. end
  19. generate value = rnormal()
  20. preserve
  21. * Left join
  22. merge 1:m key using df2.dta
  23. keep if _merge != 2
  24. * Right join
  25. restore, preserve
  26. merge 1:m key using df2.dta
  27. keep if _merge != 1
  28. * Inner join
  29. restore, preserve
  30. merge 1:m key using df2.dta
  31. keep if _merge == 3
  32. * Outer join
  33. restore
  34. merge 1:m key using df2.dta

pandas DataFrames have a DataFrame.merge() method, which provides similar functionality. Note that different join types are accomplished via the how keyword.

  1. In [44]: inner_join = df1.merge(df2, on=['key'], how='inner')
  2. In [45]: inner_join
  3. Out[45]:
  4. key value_x value_y
  5. 0 B -0.282863 1.212112
  6. 1 D -1.135632 -0.173215
  7. 2 D -1.135632 0.119209
  8. In [46]: left_join = df1.merge(df2, on=['key'], how='left')
  9. In [47]: left_join
  10. Out[47]:
  11. key value_x value_y
  12. 0 A 0.469112 NaN
  13. 1 B -0.282863 1.212112
  14. 2 C -1.509059 NaN
  15. 3 D -1.135632 -0.173215
  16. 4 D -1.135632 0.119209
  17. In [48]: right_join = df1.merge(df2, on=['key'], how='right')
  18. In [49]: right_join
  19. Out[49]:
  20. key value_x value_y
  21. 0 B -0.282863 1.212112
  22. 1 D -1.135632 -0.173215
  23. 2 D -1.135632 0.119209
  24. 3 E NaN -1.044236
  25. In [50]: outer_join = df1.merge(df2, on=['key'], how='outer')
  26. In [51]: outer_join
  27. Out[51]:
  28. key value_x value_y
  29. 0 A 0.469112 NaN
  30. 1 B -0.282863 1.212112
  31. 2 C -1.509059 NaN
  32. 3 D -1.135632 -0.173215
  33. 4 D -1.135632 0.119209
  34. 5 E NaN -1.044236
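
To reproduce Stata's _merge variable directly, merge() accepts an indicator argument, which adds a column flagging each row as left_only, right_only, or both:

  1. # the added _merge column mirrors Stata's _merge categories
  2. df1.merge(df2, on=['key'], how='outer', indicator=True)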

Missing Data

Like Stata, pandas has a representation for missing data – the special float value NaN (not a number). Many of the semantics are the same; for example, missing data propagates through numeric operations and is ignored by default in aggregations.

  1. In [52]: outer_join
  2. Out[52]:
  3. key value_x value_y
  4. 0 A 0.469112 NaN
  5. 1 B -0.282863 1.212112
  6. 2 C -1.509059 NaN
  7. 3 D -1.135632 -0.173215
  8. 4 D -1.135632 0.119209
  9. 5 E NaN -1.044236
  10. In [53]: outer_join['value_x'] + outer_join['value_y']
  11. Out[53]:
  12. 0 NaN
  13. 1 0.929249
  14. 2 NaN
  15. 3 -1.308847
  16. 4 -1.016424
  17. 5 NaN
  18. dtype: float64
  19. In [54]: outer_join['value_x'].sum()
  20. Out[54]: -3.5940742896293765

One difference is that missing data cannot be compared to its sentinel value. For example, in Stata you could do this to filter missing values.

  1. * Keep missing values
  2. list if value_x == .
  3. * Keep non-missing values
  4. list if value_x != .

This doesn’t work in pandas. Instead, the pd.isna() or pd.notna() functions should be used for comparisons.

  1. In [55]: outer_join[pd.isna(outer_join['value_x'])]
  2. Out[55]:
  3. key value_x value_y
  4. 5 E NaN -1.044236
  5. In [56]: outer_join[pd.notna(outer_join['value_x'])]
  6. Out[56]:
  7. key value_x value_y
  8. 0 A 0.469112 NaN
  9. 1 B -0.282863 1.212112
  10. 2 C -1.509059 NaN
  11. 3 D -1.135632 -0.173215
  12. 4 D -1.135632 0.119209

pandas also provides a variety of methods to work with missing data – some of which would be challenging to express in Stata. For example, there are methods to drop all rows with any missing values, to replace missing values with a specified value (such as the mean), or to fill forward from previous rows. See the missing data documentation for more.

  1. # Drop rows with any missing value
  2. In [57]: outer_join.dropna()
  3. Out[57]:
  4. key value_x value_y
  5. 1 B -0.282863 1.212112
  6. 3 D -1.135632 -0.173215
  7. 4 D -1.135632 0.119209
  8. # Fill forwards
  9. In [58]: outer_join.fillna(method='ffill')
  10. Out[58]:
  11. key value_x value_y
  12. 0 A 0.469112 NaN
  13. 1 B -0.282863 1.212112
  14. 2 C -1.509059 1.212112
  15. 3 D -1.135632 -0.173215
  16. 4 D -1.135632 0.119209
  17. 5 E -1.135632 -1.044236
  18. # Impute missing values with the mean
  19. In [59]: outer_join['value_x'].fillna(outer_join['value_x'].mean())
  20. Out[59]:
  21. 0 0.469112
  22. 1 -0.282863
  23. 2 -1.509059
  24. 3 -1.135632
  25. 4 -1.135632
  26. 5 -0.718815
  27. Name: value_x, dtype: float64

GroupBy

Aggregation

Stata’s collapse can be used to group by one or more key variables and compute aggregations on numeric columns.

  1. collapse (sum) total_bill tip, by(sex smoker)

pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documentation for more details and examples.

  1. In [60]: tips_summed = tips.groupby(['sex', 'smoker'])[['total_bill', 'tip']].sum()
  2. In [61]: tips_summed.head()
  3. Out[61]:
  4. total_bill tip
  5. sex smoker
  6. Female No 869.68 149.77
  7. Yes 527.27 96.74
  8. Male No 1725.75 302.00
  9. Yes 1217.07 183.07

Transformation

In Stata, if the group aggregations need to be used with the original data set, one would typically use bysort with egen(). For example, to subtract the mean total bill from each observation by smoker group:

  1. bysort smoker: egen group_bill = mean(total_bill)
  2. generate adj_total_bill = total_bill - group_bill

pandas groupby provides a transform mechanism that allows these types of operations to be expressed succinctly in a single step.

  1. In [62]: gb = tips.groupby('smoker')['total_bill']
  2. In [63]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')
  3. In [64]: tips.head()
  4. Out[64]:
  5. total_bill tip sex smoker day time size adj_total_bill
  6. 67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
  7. 92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
  8. 111 5.25 1.00 Female No Sat Dinner 1 -11.938278
  9. 145 6.35 1.50 Female No Thur Lunch 2 -10.838278
  10. 135 6.51 1.25 Female No Thur Lunch 2 -10.678278

By Group Processing

In addition to aggregation, pandas groupby can be used to replicate most other bysort processing from Stata. For example, the following lists the first observation in the current sort order for each sex/smoker group.

  1. bysort sex smoker: list if _n == 1

In pandas this would be written as:

  1. In [65]: tips.groupby(['sex', 'smoker']).first()
  2. Out[65]:
  3. total_bill tip day time size adj_total_bill
  4. sex smoker
  5. Female No 5.25 1.00 Sat Dinner 1 -11.938278
  6. Yes 1.07 1.00 Sat Dinner 1 -17.686344
  7. Male No 5.51 2.00 Thur Lunch 2 -11.678278
  8. Yes 5.25 5.15 Sun Dinner 2 -13.506344
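
If, as in the Stata listing, the original rows should be kept rather than one aggregated row per group, GroupBy.head() returns the first n rows of each group in the current sort order:

  1. # first observation per sex/smoker group, keeping the original rows
  2. tips.groupby(['sex', 'smoker']).head(1)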

Other Considerations

Disk vs Memory

pandas and Stata both operate exclusively in memory, so the size of data that can be loaded in pandas is limited by your machine’s memory. If out-of-core processing is needed, one possibility is the dask.dataframe library, which provides a subset of pandas functionality for an on-disk DataFrame.