pyspark.pandas.Series.factorize
- Series.factorize(sort=True, use_na_sentinel=True)
Encode the object as an enumerated type or categorical variable.
This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.
- Parameters
- sort : bool, default True
- use_na_sentinel : bool, default True
If True, the sentinel -1 is used as the code for NaN values, and NaN is left out of uniques. If False, NaN values are encoded as a non-negative integer like any other category, and NaN is retained in uniques.
- Returns
- codes : Series or Index
A Series or Index that’s an indexer into uniques. uniques.take(codes) will have the same values as values.
- uniques : pd.Index
The unique valid values.
Note
With the default use_na_sentinel=True, uniques will not contain an entry for a missing value, even if values contains one.
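To make the roles of codes, uniques, and the -1 sentinel concrete, here is a minimal pure-Python sketch of sorted factorization. The function name `factorize_sketch` is hypothetical, and the code only models the documented behavior for sort=True on a small in-memory list; it is not how pandas-on-Spark computes the result.

```python
from typing import List, Optional, Sequence, Tuple


def factorize_sketch(
    values: Sequence[Optional[str]], use_na_sentinel: bool = True
) -> Tuple[List[int], List[Optional[str]]]:
    """Hypothetical model of sorted factorization (not the Spark implementation)."""
    # The distinct non-missing values, sorted, define the categories.
    uniques: List[Optional[str]] = sorted({v for v in values if v is not None})
    position = {value: code for code, value in enumerate(uniques)}

    has_missing = any(v is None for v in values)
    if use_na_sentinel or not has_missing:
        # Sentinel: missing values get code -1 and no entry in uniques.
        missing_code = -1
    else:
        # Missing values become one extra category at the end of uniques.
        missing_code = len(uniques)
        uniques.append(None)

    codes = [missing_code if v is None else position[v] for v in values]
    return codes, uniques


values = ['b', None, 'a', 'c', 'b']

codes, uniques = factorize_sketch(values)
print(codes, uniques)  # [1, -1, 0, 2, 1] ['a', 'b', 'c']

codes_no_sentinel, uniques_no_sentinel = factorize_sketch(values, use_na_sentinel=False)
print(codes_no_sentinel, uniques_no_sentinel)  # [1, 3, 0, 2, 1] ['a', 'b', 'c', None]
```

The two printed results correspond to the two settings of use_na_sentinel: the missing value is either coded as -1 and dropped from uniques, or given its own code and kept.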
Examples
>>> import pyspark.pandas as ps
>>> psser = ps.Series(['b', None, 'a', 'c', 'b'])
>>> codes, uniques = psser.factorize()
>>> codes
0    1
1   -1
2    0
3    2
4    1
dtype: int32
>>> uniques
Index(['a', 'b', 'c'], dtype='object')

>>> codes, uniques = psser.factorize(use_na_sentinel=False)
>>> codes
0    1
1    3
2    0
3    2
4    1
dtype: int32
>>> uniques
Index(['a', 'b', 'c', None], dtype='object')
For Index:
>>> psidx = ps.Index(['b', None, 'a', 'c', 'b'])
>>> codes, uniques = psidx.factorize()
>>> codes
Index([1, -1, 0, 2, 1], dtype='int32')
>>> uniques
Index(['a', 'b', 'c'], dtype='object')
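The Returns section describes codes as an indexer into uniques. As a rough illustration of going back from codes to values, the sketch below decodes the sentinel-coded result shown above using plain Python lists. The helper name `decode_sketch` is hypothetical. In plain Python an index of -1 selects the last element, so a sentinel code cannot be looked up directly; the sketch handles it explicitly.

```python
from typing import List, Optional, Sequence


def decode_sketch(codes: Sequence[int], uniques: Sequence[str]) -> List[Optional[str]]:
    """Hypothetical helper: map codes back to values, treating -1 as missing."""
    # A direct uniques[code] lookup would read -1 as "last element",
    # so the sentinel is mapped to None before indexing.
    return [None if code == -1 else uniques[code] for code in codes]


codes = [1, -1, 0, 2, 1]
uniques = ['a', 'b', 'c']

restored = decode_sketch(codes, uniques)
print(restored)  # ['b', None, 'a', 'c', 'b']
```

When use_na_sentinel=False, no code is -1 and every code is a valid position in uniques, so no special case is needed.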