pyspark.sql.functions.max_by

pyspark.sql.functions.max_by(col, ord)

Returns the value of col associated with the maximum value of ord. When used with groupBy(), it returns, for each group, the col value from the row where ord is largest.
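To make the semantics concrete, the following is a minimal plain-Python sketch of what the aggregation computes per group. It is not Spark's implementation; the helper name max_by_per_group is hypothetical, and its tie handling (keeping the first row seen) is a choice of this sketch only and says nothing about how Spark resolves ties.

```python
from typing import Any, Dict, Hashable, Iterable, Tuple


def max_by_per_group(
    rows: Iterable[Tuple[Hashable, Any, Any]]
) -> Dict[Hashable, Any]:
    """Hypothetical helper: rows are (group_key, col, ord) tuples.

    For each group key, return the col value taken from the row
    whose ord value is largest.
    """
    # key -> (largest ord seen so far, col value from that row)
    best: Dict[Hashable, Tuple[Any, Any]] = {}
    for key, col_value, ord_value in rows:
        # Strict ">" keeps the first row seen on ties (sketch-only choice).
        if key not in best or ord_value > best[key][0]:
            best[key] = (ord_value, col_value)
    return {key: col_value for key, (_, col_value) in best.items()}


# (course, year, earnings): find the year with the highest earnings per course.
rows = [
    ("Java", 2012, 20000), ("dotNET", 2012, 5000),
    ("dotNET", 2013, 48000), ("Java", 2013, 30000),
]
result = max_by_per_group(rows)
print(result)  # {'Java': 2013, 'dotNET': 2013}
```

Here year plays the role of col and earnings the role of ord, so each course maps to the year in which its earnings peaked.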

New in version 3.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters

col (Column or str): The column whose value is returned. Can be a Column instance or a column name given as a string.
ord (Column or str): The column to be maximized; the row with its largest value determines which col value is returned. Can be a Column instance or a column name given as a string.

Returns

Column: A column object representing the value from col that is associated with the maximum value from ord.

Examples

Example 1: Using max_by with groupBy

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([
...     ("Java", 2012, 20000), ("dotNET", 2012, 5000),
...     ("dotNET", 2013, 48000), ("Java", 2013, 30000)],
...     schema=("course", "year", "earnings"))
>>> df.groupby("course").agg(sf.max_by("year", "earnings")).sort("course").show()
+------+----------------------+
|course|max_by(year, earnings)|
+------+----------------------+
|  Java|                  2013|
|dotNET|                  2013|
+------+----------------------+

Example 2: Using max_by with different data types

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([
...     ("Marketing", "Anna", 4), ("IT", "Bob", 2),
...     ("IT", "Charlie", 3), ("Marketing", "David", 1)],
...     schema=("department", "name", "years_in_dept"))
>>> df.groupby("department").agg(
...     sf.max_by("name", "years_in_dept")
... ).sort("department").show()
+----------+---------------------------+
|department|max_by(name, years_in_dept)|
+----------+---------------------------+
|        IT|                    Charlie|
| Marketing|                       Anna|
+----------+---------------------------+

Example 3: Using max_by on another dataset with a single maximum ord value in each group

>>> import pyspark.sql.functions as sf
>>> df = spark.createDataFrame([
...     ("Consult", "Eva", 6), ("Finance", "Frank", 5),
...     ("Finance", "George", 9), ("Consult", "Henry", 7)],
...     schema=("department", "name", "years_in_dept"))
>>> df.groupby("department").agg(
...     sf.max_by("name", "years_in_dept")
... ).sort("department").show()
+----------+---------------------------+
|department|max_by(name, years_in_dept)|
+----------+---------------------------+
|   Consult|                      Henry|
|   Finance|                     George|
+----------+---------------------------+