I am trying to select rows from a Pandas DataFrame, using the integer index values.
This does not work, and I obtain out of index errors.
- This suggests to me that selecting rows by index implicitly causes reset_index() to be called, although I may be mistaken.
- The following example explains why the behaviour I observe suggests this to be the case:
import pandas
data = {
'number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'fruit': 3 * ['apple'] + 3 * ['pear'] + 2 * ['banana'] + ['pear'] + ['apple'],
'color': 3 * ['red', 'green', 'blue'] + ['red'],
'letter': 5 * ['A', 'B'],
}
df = pandas.DataFrame(data)
df
df_selected = df[df['fruit'] == 'pear']
df_selected
df_selected.index
Index([3, 4, 5, 8], dtype='int64')
This certainly suggests I have a DataFrame with an Index containing the values 3, 4, 5 and 8.
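One way to check that the labels really are preserved is to compare a label-based lookup with a position-based one. The following is a minimal sketch that rebuilds the same DataFrame; the variable names by_label and by_position are only illustrative.

```python
import pandas

data = {
    'number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'fruit': 3 * ['apple'] + 3 * ['pear'] + 2 * ['banana'] + ['pear'] + ['apple'],
    'color': 3 * ['red', 'green', 'blue'] + ['red'],
    'letter': 5 * ['A', 'B'],
}
df = pandas.DataFrame(data)
df_selected = df[df['fruit'] == 'pear']

# Label-based lookup: the row whose index label is 3 (the first 'pear').
by_label = df_selected.loc[3]

# Position-based lookup: the fourth row (position 3), whose index label is 8.
by_position = df_selected.iloc[3]

print(by_label['number'])     # 4
print(by_position['number'])  # 9
```

The two lookups return different rows, which shows that the label 3 and the position 3 are not the same thing once rows have been filtered out.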
I now want to select all the rows in the DataFrame from the first occurrence of 'pear' to the last occurrence, by using the integer index.
I thought this should be possible with the following syntax:
FIRST = 3
LAST = 8
df_selected[FIRST:LAST+1]
But I am mistaken:
- When printing (displaying the DataFrame to stdout or a Jupyter Notebook cell), the index shows the values 3, 4, 5, 8.
- When selecting by index using the syntax df_selected[A:B] or df_selected.iloc[A:B], the integer arguments A and B are interpreted as if df_selected.reset_index() had been called.
- I say this because calling reset_index() produces a DataFrame whose new index runs from 0 to 3, with the old values 3, 4, 5, 8 moved into a column named index.
This implies the correct range to use when selecting by position is df_selected.iloc[0:3+1]
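For comparison, here is a minimal sketch of the two slicing styles side by side. It assumes the usual pandas semantics: .loc slices by index label and includes both endpoints, while .iloc slices by position and excludes the stop value. The names by_label, by_position and plain are only illustrative.

```python
import pandas

data = {
    'number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'fruit': 3 * ['apple'] + 3 * ['pear'] + 2 * ['banana'] + ['pear'] + ['apple'],
    'color': 3 * ['red', 'green', 'blue'] + ['red'],
    'letter': 5 * ['A', 'B'],
}
df = pandas.DataFrame(data)
df_selected = df[df['fruit'] == 'pear']

FIRST = 3
LAST = 8

# Label-based slice: both endpoints are included, so no "+ 1" is needed.
by_label = df_selected.loc[FIRST:LAST]

# Position-based slice: rows at positions 0, 1, 2 and 3.
by_position = df_selected.iloc[0:3 + 1]

# Plain [] slicing with integers; printed here to compare with the two above.
plain = df_selected[FIRST:LAST + 1]

print(list(by_label.index))     # [3, 4, 5, 8]
print(list(by_position.index))  # [3, 4, 5, 8]
print(list(plain.index))
```

No index reset is involved in either case: the labels 3, 4, 5, 8 stay attached to the rows, and the difference comes only from whether the slice bounds are read as labels or as positions.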
I am aware this is an incredibly basic question, but I'm hoping someone can point me in the right direction towards understanding why the behaviour is this way, and whether there is a particular reason for it.