pyspark.sql.DataFrame.asTable
- DataFrame.asTable()
Converts the DataFrame into a table_arg.TableArg object, which can be used as a table argument in a TVF (Table-Valued Function), including a UDTF (User-Defined Table Function).
After obtaining a TableArg from a DataFrame using this method, you can specify partitioning and ordering for the table argument by calling methods such as partitionBy, orderBy, and withSinglePartition on the TableArg instance:
- partitionBy: Partitions the data based on the specified columns. This method cannot be called after withSinglePartition() has been called.
- orderBy: Orders the data within partitions based on the specified columns.
- withSinglePartition: Indicates that the data should be treated as a single partition. This method cannot be called after partitionBy() has been called.
New in version 4.0.0.
- Returns
table_arg.TableArg
A TableArg object representing a table argument.
Examples
>>> from pyspark.sql.functions import udtf
>>>
>>> # Create a simple UDTF that processes table data
>>> @udtf(returnType="id: int, doubled: int")
... class DoubleUDTF:
...     def eval(self, row):
...         yield row["id"], row["id"] * 2
...
>>> # Create a DataFrame
>>> df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
>>>
>>> # Use asTable() to pass the DataFrame as a table argument to the UDTF
>>> result = DoubleUDTF(df.asTable())
>>> result.show()
+---+-------+
| id|doubled|
+---+-------+
|  1|      2|
|  2|      4|
|  3|      6|
+---+-------+
>>>
>>> # Use partitionBy and orderBy to control data partitioning and ordering
>>> df2 = spark.createDataFrame(
...     [(1, "a"), (1, "b"), (2, "c"), (2, "d")], ["key", "value"]
... )
>>>
>>> @udtf(returnType="key: int, value: string")
... class ProcessUDTF:
...     def eval(self, row):
...         yield row["key"], row["value"]
...
>>> # Partition by 'key' and order by 'value' within each partition
>>> result2 = ProcessUDTF(df2.asTable().partitionBy("key").orderBy("value"))
>>> result2.show()
+---+-----+
|key|value|
+---+-----+
|  1|    a|
|  1|    b|
|  2|    c|
|  2|    d|
+---+-----+
>>>
>>> # Use withSinglePartition to process all data in a single partition
>>> result3 = ProcessUDTF(df2.asTable().withSinglePartition().orderBy("value"))
>>> result3.show()
+---+-----+
|key|value|
+---+-----+
|  1|    a|
|  1|    b|
|  2|    c|
|  2|    d|
+---+-----+
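The examples above pass rows through unchanged, so the effect of partitioning is easy to miss. The sketch below is not part of the original documentation; the names SumPerKeyUDTF and df3 are illustrative. It assumes the standard Python UDTF lifecycle, in which Spark instantiates the class for each partition of the table argument, calls eval() once per row, and calls terminate() once at the end of the partition, so instance attributes can accumulate per-partition state.
>>> # Sketch: per-key aggregation driven by partitionBy (illustrative names)
>>> @udtf(returnType="key: int, total: int")
... class SumPerKeyUDTF:
...     def __init__(self):
...         self._key = None
...         self._total = 0
...     def eval(self, row):
...         # Remember the partition key and accumulate the running sum
...         self._key = row["key"]
...         self._total += row["value"]
...     def terminate(self):
...         # Emit one row per partition once all of its rows are consumed
...         yield self._key, self._total
...
>>> df3 = spark.createDataFrame([(1, 10), (1, 20), (2, 5)], ["key", "value"])
>>> # Each distinct 'key' becomes its own partition, so the expected rows are
>>> # (1, 30) and (2, 5); the order of the output rows may vary.
>>> SumPerKeyUDTF(df3.asTable().partitionBy("key")).show()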