pyspark.sql.DataFrame.asTable
- DataFrame.asTable()
Converts the DataFrame into a table_arg.TableArg object, which can be used as a table argument in a TVF (Table-Valued Function), including a UDTF (User-Defined Table Function).
After obtaining a TableArg from a DataFrame using this method, you can specify partitioning and ordering for the table argument by calling methods such as partitionBy, orderBy, and withSinglePartition on the TableArg instance:
- partitionBy: Partitions the data based on the specified columns. This method cannot be called after withSinglePartition() has been called.
- orderBy: Orders the data within partitions based on the specified columns.
- withSinglePartition: Indicates that the data should be treated as a single partition. This method cannot be called after partitionBy() has been called.
New in version 4.0.0.
- Returns
table_arg.TableArg
A TableArg object representing a table argument.
Examples
>>> from pyspark.sql.functions import udtf
>>>
>>> # Create a simple UDTF that processes table data
>>> @udtf(returnType="id: int, doubled: int")
... class DoubleUDTF:
...     def eval(self, row):
...         yield row["id"], row["id"] * 2
...
>>> # Create a DataFrame
>>> df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
>>>
>>> # Use asTable() to pass the DataFrame as a table argument to the UDTF
>>> result = DoubleUDTF(df.asTable())
>>> result.show()
+---+-------+
| id|doubled|
+---+-------+
|  1|      2|
|  2|      4|
|  3|      6|
+---+-------+
>>>
>>> # Use partitionBy and orderBy to control data partitioning and ordering
>>> df2 = spark.createDataFrame(
...     [(1, "a"), (1, "b"), (2, "c"), (2, "d")], ["key", "value"]
... )
>>>
>>> @udtf(returnType="key: int, value: string")
... class ProcessUDTF:
...     def eval(self, row):
...         yield row["key"], row["value"]
...
>>> # Partition by 'key' and order by 'value' within each partition
>>> result2 = ProcessUDTF(df2.asTable().partitionBy("key").orderBy("value"))
>>> result2.show()
+---+-----+
|key|value|
+---+-----+
|  1|    a|
|  1|    b|
|  2|    c|
|  2|    d|
+---+-----+
>>>
>>> # Use withSinglePartition to process all data in a single partition
>>> result3 = ProcessUDTF(df2.asTable().withSinglePartition().orderBy("value"))
>>> result3.show()
+---+-----+
|key|value|
+---+-----+
|  1|    a|
|  1|    b|
|  2|    c|
|  2|    d|
+---+-----+
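The examples above pass rows through unchanged, so the effect of partitioning is easy to miss. The sketch below is not part of the original documentation; the names SumPerKeyUDTF and df3 are illustrative. It assumes the standard Python UDTF lifecycle, in which Spark instantiates the class for each partition of the table argument, calls eval() once per row, and calls terminate() once at the end of the partition, so instance attributes can accumulate per-partition state.
>>> # Sketch: per-key aggregation driven by partitionBy (illustrative names)
>>> @udtf(returnType="key: int, total: int")
... class SumPerKeyUDTF:
...     def __init__(self):
...         self._key = None
...         self._total = 0
...     def eval(self, row):
...         # Remember the partition key and accumulate the running sum
...         self._key = row["key"]
...         self._total += row["value"]
...     def terminate(self):
...         # Emit one row per partition once all of its rows are consumed
...         yield self._key, self._total
...
>>> df3 = spark.createDataFrame([(1, 10), (1, 20), (2, 5)], ["key", "value"])
>>> # Each distinct 'key' becomes its own partition, so the expected rows are
>>> # (1, 30) and (2, 5); the order of the output rows may vary.
>>> SumPerKeyUDTF(df3.asTable().partitionBy("key")).show()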