-1

I am trying to select rows from a Pandas DataFrame, using the integer index values.

This does not work, and I obtain out of index errors.

  • This suggests to me that performing a selection of rows by index implicitly causes reset_index() to be called, although I may be mistaken
  • The following example explains why the behaviour I observe suggests this to be the case
import pandas

data = {
    'number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'fruit': 3 * ['apple'] + 3 * ['pear'] + 2 * ['banana'] + ['pear'] + ['apple'],
    'color': 3 * ['red', 'green', 'blue'] + ['red'],
    'letter': 5 * ['A', 'B'],
}
df = pandas.DataFrame(data)
df

Pandas DataFrame

df_selected = df[df['fruit'] == 'pear']
df_selected

Pandas DataFrame

df_selected.index
Index([3, 4, 5, 8], dtype='int64')

This certainly suggests I have a DataFrame with an Index containing the values 3, 4, 5 and 8.

I now want to select all the rows in the DataFrame from the first occurance of 'pear' to the last occurance, by using the integer index:

I thought this should be possible with the following syntax:

FIRST = 3
LAST = 8
df_selected[FIRST:LAST+1]

But I am mistaken:

Pandas DataFrame

  • When printing (displaying the DataFrame to stdout or a Jupyter Notebook Cell) the index shows values 3, 4, 5, 8.
  • When selecting by index using the syntax df_selected[A:B] or df_selected.iloc[A:B] the integer arguments A and B are interpreted as if df_selected.reset_index() had been called
  • I say this, because calling reset_index() produces the following output

Pandas DataFrame

Which implies the correct range to use when selecting by index is df_selected.iloc[0:3+1]

I am aware this is an incredibly basic question but I'm hoping someone can point me in the right direction as to understanding why the behaviour is this way, if there is a particular reason for it.

1 Answer 1

1

You should use loc to select by label:

FIRST = 3
LAST = 8

df_selected = df[df['fruit'] == 'pear']
out = df.loc[FIRST:LAST]

Or:

idx = df[df['fruit'] == 'pear'].index
out = df.loc[idx.min():idx.max()]

NB. since loc includes both ends, you do not need the +1.

Output:

   number   fruit  color letter
3       4    pear    red      B
4       5    pear  green      A
5       6    pear   blue      B
6       7  banana    red      A
7       8  banana  green      B
8       9    pear   blue      A

When using df_selected[1:2], this behaves like iloc and selects from the fourth to the ninth position (so just the row with label 8).

alternative

If your goal is to select all values between the first and last match, you could also use boolean indexing:

m = df['fruit'].eq('pear')
out = df[m.cummax()&m[::-1].cummax()]

How it works:

   number   fruit  color letter      m  m.cummax()  m[::-1].cummax()      &
0       1   apple    red      A  False       False              True  False
1       2   apple  green      B  False       False              True  False
2       3   apple   blue      A  False       False              True  False
3       4    pear    red      B   True        True              True   True
4       5    pear  green      A   True        True              True   True
5       6    pear   blue      B   True        True              True   True
6       7  banana    red      A  False        True              True   True
7       8  banana  green      B  False        True              True   True
8       9    pear   blue      A   True        True              True   True
9      10   apple    red      B  False        True             False  False
3
  • That final example is quite interesting. Thanks for this, great answer. I guess, infering from what you have said here, a DataFrame index conceptually is more like a label than an integer index, and the integer index is actually an invisible thing from the point of view of the user of the DataFrame API. In many cases, the label matches the integer index but in the example shown here, that is not the case because of the previous row subset select operation. Commented Jul 5 at 14:32
  • I know that was quite a long and complex sentence, but hopefully the point I am trying to make is understandable? Commented Jul 5 at 14:33
  • 1
    df[a:b] is a mixed (unreliable) way to index. If you have non-numeric labels it will acts like loc, or like iloc if you pass numbers (try df.index = df.index.astype(str) ; print(df['3':'8']) ; print(df[3:8])). That's why you should never use it and go withloc with labels, and iloc with positions. Those can always be trusted.
    – mozway
    Commented Jul 5 at 15:21

Not the answer you're looking for? Browse other questions tagged or ask your own question.