BigQuery supports:

  1. User Defined Functions (UDFs) in SQL and JavaScript.
  2. Analytic functions, which compute values over a group of rows and return a single result for each row. These functions are used with an OVER clause, and only a predefined set of them is available.

Question #1: "Does BigQuery support analytic user-defined functions?"

My motivation is that I want to implement the split-apply-combine pattern commonly seen in Python pandas code. This would be useful for in-group normalization and other transformations that rely on group statistics.

I ran a small test in Standard SQL:

create or replace function `mydataset.mylen`(arr array<string>) returns int64 as (
  array_length(arr)
);

WITH Produce AS
 (SELECT 'kale' as item, 23 as purchases, 'vegetable' as category
  UNION ALL SELECT 'orange', 2, 'fruit'
  UNION ALL SELECT 'cabbage', 9, 'vegetable'
  UNION ALL SELECT 'apple', 8, 'fruit'
  UNION ALL SELECT 'leek', 2, 'vegetable'
  UNION ALL SELECT 'lettuce', 10, 'vegetable')
SELECT 
  item, 
  purchases, 
  category, 
  `mydataset.mylen`(item) over (mywindow) as windowlen
FROM Produce
window mywindow as (
  partition by category
)

When I run the code above, I get:

Query error: Function mydataset.mylen does not support an OVER clause at [16:3]

So, if BigQuery does support analytic UDFs, question #2: "How do I implement a UDF so that it supports an OVER clause?"

2 Answers

You are very close to solving the problem :)

Some context for readers: at the time of writing, BigQuery doesn't support user-defined aggregate/analytic functions, so one way to emulate them is to write a scalar UDF that accepts an array as input. In the query, array_agg() is then used as the analytic function to pack the window's rows into an array that is passed to the UDF (this is the step missing from the question):

  `mydataset.mylen`(item) over (mywindow) as windowlen

=>

  `mydataset.mylen`(array_agg(item) over (mywindow))  as windowlen
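
For completeness, here is a sketch of the whole query with that change applied. It assumes temp functions instead of the persistent `mydataset.mylen`, and it adds a hypothetical `zscore` helper (not part of the question) to illustrate the in-group normalization use case; treat it as one possible arrangement rather than the definitive way to do it.

```sql
-- Scalar UDF from the question, declared as a temp function for this sketch.
CREATE TEMP FUNCTION mylen(arr ARRAY<STRING>) RETURNS INT64 AS (
  ARRAY_LENGTH(arr)
);

-- Hypothetical helper (an assumption, not in the original post):
-- z-score of x relative to the values of its group.
CREATE TEMP FUNCTION zscore(x FLOAT64, grp ARRAY<FLOAT64>) RETURNS FLOAT64 AS (
  (SELECT SAFE_DIVIDE(x - AVG(v), STDDEV_SAMP(v)) FROM UNNEST(grp) AS v)
);

WITH Produce AS
 (SELECT 'kale' AS item, 23 AS purchases, 'vegetable' AS category
  UNION ALL SELECT 'orange', 2, 'fruit'
  UNION ALL SELECT 'cabbage', 9, 'vegetable'
  UNION ALL SELECT 'apple', 8, 'fruit'
  UNION ALL SELECT 'leek', 2, 'vegetable'
  UNION ALL SELECT 'lettuce', 10, 'vegetable')
SELECT
  item,
  purchases,
  category,
  -- ARRAY_AGG is the analytic function; the UDF only receives its array result.
  mylen(ARRAY_AGG(item) OVER mywindow) AS windowlen,
  -- Explicit casts so the array element type matches the UDF signature.
  zscore(
    CAST(purchases AS FLOAT64),
    ARRAY_AGG(CAST(purchases AS FLOAT64)) OVER mywindow
  ) AS purchases_z
FROM Produce
WINDOW mywindow AS (
  PARTITION BY category
);
```

With this data, each vegetable row should get a windowlen of 4 and each fruit row a windowlen of 2, since the window is the whole category partition.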
  • As I understand it, the mydataset.mylen function is applied to each row separately and receives the partitioned group as its argument. That's why the function doesn't have to return a similarly sized array: it's called for each row, not for a group of rows. Commented Nov 26, 2020 at 22:26
  • This is wonderful! Using this approach I can now transfer my pandas code to BigQuery. The main difference I see is that there is no vectorization: I will be computing the aggregate over the partitioned group for each row. Guess that's a minor thing that BigQuery's optimizer will be able to solve. Commented Nov 26, 2020 at 22:32
  • I'm not sure what your expected output is. If you only need one value per group, you should use GROUP BY category instead of a window function partitioned by category. The UDF is the same in both cases.
    – Yun Zhang
    Commented Nov 27, 2020 at 4:08
User Defined Aggregate Functions (UDAFs) are now available in Google BigQuery.

Here is an example of defining a UDAF to calculate the Geometric Mean of a column of data.

Defining the UDAF:

CREATE TEMP AGGREGATE FUNCTION geometric_mean(
  column_values float64
)
RETURNS float64
AS
(
  EXP(SUM(LN(column_values))/COUNT(column_values))
);

Calling the UDAF:

with test_data as (
  SELECT 1 AS col1 
  UNION ALL
  SELECT 3
  UNION ALL
  SELECT 5
)
select geometric_mean(col1) from test_data;
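
To connect this with the question's split-apply-combine use case, here is a sketch that applies the same UDAF per group with GROUP BY, reusing the Produce data from the question. The column alias `gm_purchases` is an assumption for illustration. Whether a UDAF can also be called with an OVER clause is not shown here, so this sketch sticks to plain aggregation.

```sql
-- Same UDAF as above, repeated so this script is self-contained.
CREATE TEMP AGGREGATE FUNCTION geometric_mean(
  column_values FLOAT64
)
RETURNS FLOAT64
AS
(
  EXP(SUM(LN(column_values)) / COUNT(column_values))
);

WITH Produce AS
 (SELECT 'kale' AS item, 23 AS purchases, 'vegetable' AS category
  UNION ALL SELECT 'orange', 2, 'fruit'
  UNION ALL SELECT 'cabbage', 9, 'vegetable'
  UNION ALL SELECT 'apple', 8, 'fruit'
  UNION ALL SELECT 'leek', 2, 'vegetable'
  UNION ALL SELECT 'lettuce', 10, 'vegetable')
-- One output row per category; the UDAF is evaluated once per group.
SELECT
  category,
  geometric_mean(purchases) AS gm_purchases
FROM Produce
GROUP BY category;
```

To attach the group statistic back to every row, the grouped result can be joined to Produce on category.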

More info: https://qosf.com/UDAF-in-google-bigquery.html
