pyspark.pandas.Series.factorize

Series.factorize(sort=True, use_na_sentinel=True)

Encode the object as an enumerated type or categorical variable.

This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.

Parameters
sort : bool, default True
use_na_sentinel : bool, default True

If True, missing values (NaN or None) are encoded with the sentinel -1 and are excluded from uniques. If False, missing values are encoded with a non-negative integer like any other value, and the missing value is retained as an entry in uniques.

Returns
codes : Series or Index

A Series or Index that’s an indexer into uniques. uniques.take(codes) will have the same values as values.

uniques : pd.Index

The unique valid values.

Note

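With the default use_na_sentinel=True, uniques will not contain an entry for missing values, even if values contains them. With use_na_sentinel=False, the missing value appears in uniques.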

Examples

>>> import pyspark.pandas as ps
>>> psser = ps.Series(['b', None, 'a', 'c', 'b'])
>>> codes, uniques = psser.factorize()
>>> codes
0    1
1   -1
2    0
3    2
4    1
dtype: int32
>>> uniques
Index(['a', 'b', 'c'], dtype='object')
>>> codes, uniques = psser.factorize(use_na_sentinel=False)
>>> codes
0    1
1    3
2    0
3    2
4    1
dtype: int32
>>> uniques
Index(['a', 'b', 'c', None], dtype='object')

For Index:

>>> psidx = ps.Index(['b', None, 'a', 'c', 'b'])
>>> codes, uniques = psidx.factorize()
>>> codes
Index([1, -1, 0, 2, 1], dtype='int32')
>>> uniques
Index(['a', 'b', 'c'], dtype='object')