pyspark.pandas.DataFrame.loc
property DataFrame.loc
Access a group of rows and columns by label(s) or a boolean Series.
.loc[] is primarily label based, but may also be used with a conditional boolean Series derived from the DataFrame or Series.
Allowed inputs are:
- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and never as an integer position along the index) for column selection.
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f'.
- A conditional boolean Series derived from the DataFrame or Series.
- A boolean array of the same length as the column axis being sliced, e.g. [True, False, True].
- An alignable boolean pandas Series to the column axis being sliced. The index of the key will be aligned before masking.
Inputs that pandas allows but pandas-on-Spark does not:
- A boolean array of the same length as the row axis being sliced, e.g. [True, False, True].
- A callable function with one argument (the calling Series, DataFrame or Panel) that returns valid output for indexing (one of the above).
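Both disallowed forms can usually be rewritten as a boolean Series derived from the DataFrame, which is one of the allowed inputs. The sketch below uses plain pandas (not pyspark.pandas) only to show that the three spellings select the same rows there; the frame `pdf` is an illustrative copy of the data used in the Examples section, and the assumption is that the Series form is the one to carry over to pandas-on-Spark.

```python
import pandas as pd

# Illustrative frame (plain pandas), same data as the Examples section.
pdf = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)

# Forms that pandas accepts but that are listed as not allowed above.
by_callable = pdf.loc[lambda d: d["shield"] > 6]
by_row_mask = pdf.loc[[False, False, True]]

# The same selection as a boolean Series derived from the DataFrame.
by_series = pdf.loc[pdf["shield"] > 6]

print(by_callable.equals(by_series))
print(by_row_mask.equals(by_series))
```

In pandas all three return the single `sidewinder` row, so replacing the callable or the row-length boolean list with the derived Series does not change the result.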
Note
MultiIndex is not supported yet.
Note
Contrary to usual Python slices, both the start and the stop are included, and a slice step is not allowed.
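The inclusive stop is easiest to see next to an ordinary Python slice. This is a minimal sketch in plain pandas, whose label slices are also inclusive; it does not exercise pyspark.pandas itself, and the names `labels` and `pdf` are illustrative.

```python
import pandas as pd

labels = ["cobra", "viper", "sidewinder"]
pdf = pd.DataFrame({"max_speed": [1, 4, 7]}, index=labels)

# Ordinary Python slice: the stop position is excluded.
python_slice = labels[0:1]

# Label slice with .loc: the stop label is included.
label_slice = list(pdf.loc["cobra":"viper"].index)

print(python_slice)  # ['cobra']
print(label_slice)   # ['cobra', 'viper']
```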
Note
With a list or array of labels for row selection, pandas-on-Spark behaves as a filter without reordering by the labels.
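For contrast, plain pandas returns rows in the order the labels were requested. The sketch below shows that, and then a filter-style selection that keeps the original row order, which is roughly the behaviour the note describes for pandas-on-Spark. It runs on plain pandas only; `pdf` is an illustrative frame, and the `isin` filter is an analogy rather than the library's implementation.

```python
import pandas as pd

pdf = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=["cobra", "viper", "sidewinder"],
    columns=["max_speed", "shield"],
)

wanted = ["sidewinder", "viper"]

# pandas: rows come back in the order of the requested labels.
reordered = list(pdf.loc[wanted].index)

# Filter-style selection: rows keep their original order.
filtered = list(pdf[pdf.index.isin(wanted)].index)

print(reordered)  # ['sidewinder', 'viper']
print(filtered)   # ['viper', 'sidewinder']
```

Code ported from pandas that relies on the label order of the result may therefore need an explicit sort or reindex step afterwards.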
See also
Series.loc
Access group of values using labels.
Examples
Getting values
>>> df = ps.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
Single label. Note this returns the row as a Series.
>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64
List of labels. Note that using [[]] returns a DataFrame. Also note that pandas-on-Spark behaves just as a filter, without reordering by the labels.
>>> df.loc[['viper', 'sidewinder']]
            max_speed  shield
viper               4       5
sidewinder          7       8
>>> df.loc[['sidewinder', 'viper']]
            max_speed  shield
viper               4       5
sidewinder          7       8
Single label for column.
>>> df.loc['cobra', 'shield']
2
List of labels for row.
>>> df.loc[['cobra'], 'shield']
cobra    2
Name: shield, dtype: int64
List of labels for column.
>>> df.loc['cobra', ['shield']]
shield    2
Name: cobra, dtype: int64
List of labels for both row and column.
>>> df.loc[['cobra'], ['shield']]
       shield
cobra       2
Slice with labels for row and single label for column. Note that both the start and stop of the slice are included.
>>> df.loc['cobra':'viper', 'max_speed']
cobra    1
viper    4
Name: max_speed, dtype: int64
Conditional that returns a boolean Series
>>> df.loc[df['shield'] > 6]
            max_speed  shield
sidewinder          7       8
Conditional that returns a boolean Series with column labels specified
>>> df.loc[df['shield'] > 6, ['max_speed']]
            max_speed
sidewinder          7
A boolean array of the same length as the column axis being sliced.
>>> df.loc[:, [False, True]]
            shield
cobra            2
viper            5
sidewinder       8
An alignable boolean Series to the column axis being sliced.
>>> df.loc[:, pd.Series([False, True], index=['max_speed', 'shield'])]
            shield
cobra            2
viper            5
sidewinder       8
Setting values
Setting value for all items matching the list of labels.
>>> df.loc[['viper', 'sidewinder'], ['shield']] = 50
>>> df
            max_speed  shield
cobra               1       2
viper               4      50
sidewinder          7      50
Setting value for an entire row
>>> df.loc['cobra'] = 10
>>> df
            max_speed  shield
cobra              10      10
viper               4      50
sidewinder          7      50
Set value for an entire column
>>> df.loc[:, 'max_speed'] = 30
>>> df
            max_speed  shield
cobra              30      10
viper              30      50
sidewinder         30      50
Set value for an entire list of columns
>>> df.loc[:, ['max_speed', 'shield']] = 100
>>> df
            max_speed  shield
cobra             100     100
viper             100     100
sidewinder        100     100
Set value with Series
>>> df.loc[:, 'shield'] = df['shield'] * 2
>>> df
            max_speed  shield
cobra             100     200
viper             100     200
sidewinder        100     200
Getting values on a DataFrame with an index that has integer labels
Another example using integers for the index
>>> df = ps.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=[7, 8, 9],
...                   columns=['max_speed', 'shield'])
>>> df
   max_speed  shield
7          1       2
8          4       5
9          7       8
Slice with integer labels for rows. Note that both the start and stop of the slice are included.
>>> df.loc[7:9]
   max_speed  shield
7          1       2
8          4       5
9          7       8
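With an integer index it is easy to confuse labels with positions. The sketch below restates the rule that .loc treats integers as labels, never as positions, using plain pandas (which follows the same rule) rather than pyspark.pandas. The frame `pdf` mirrors the example above; the comparison with `iloc` is added purely for illustration.

```python
import pandas as pd

pdf = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=[7, 8, 9],
    columns=["max_speed", "shield"],
)

# 7 is looked up as an index label, so this is the first row.
row = pdf.loc[7]

# 0 is not a label in this index, so .loc raises KeyError
# instead of falling back to position 0.
try:
    pdf.loc[0]
    missing = False
except KeyError:
    missing = True

# Label slice (inclusive) versus positional slice (stop excluded).
by_label = pdf.loc[7:9]
by_position = pdf.iloc[7:9]

print(row["max_speed"], missing, len(by_label), len(by_position))
```

Here `by_label` holds all three rows, while `by_position` is empty because the frame has no positions 7 or 8.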