I need to join 2 tables to create a dataset for a dashboard. The 2 tables are designed as follows: Table 1 records sales, so every datetime entry is a unique sale for a certain productID, with misc things like price etc Table 2 contains updates to the pricing algorithm, this contains some logic statements and benchmarks that derived the price. The price holds for a productID until it is updated. For example: ProductID 123 gets a price update in Table 2 at 09:00, 12:12 and 15:39 Table 1 records sales at 09:39, 12:00 and 16:00 What I need is the record of the sale from Table 1 with the at that time information from Table2, So: 09:39 -- Pricing info from table 2 at the 09:00 update 12:00 -- Pricing info from table 2 at the 09:00 update 16:00 -- Pricing info from table 2 at the 15:39 update Both tables contain data dating back multiple years, and ideally I want the new table dating back to the most recent origin of the 2 tables. What would the join conditions of this look like?
To achieve the desired result, you will need to perform a non-equi JOIN between the two tables. This type of join is not directly supported in standard SQL, but you can simulate it using a subquery or a LEFT JOIN combined with a WHERE clause to filter the results.
Here's a step-by-step approach to writing the SQL statement for BigQuery:
Identify the latest pricing update before each sale: For each sale in Table 1, find the most recent pricing update from Table 2 that occurred before the sale time.
Join the tables: Perform a LEFT JOIN between Table 1 and the subquery result from step 1 based on the
ProductID
and the condition that the sale time is greater than or equal to the pricing update time.Select the required columns: Choose the columns you need for your dashboard dataset from both tables.
Assuming the following schema for your tables:
- Table 1 (Sales):
SaleID
,ProductID
,SaleDateTime
,Price
, etc. - Table 2 (PricingUpdates):
UpdateID
,ProductID
,UpdateDateTime
,PricingLogic
, etc.
Here is the SQL statement that would accomplish this:
WITH
LatestPricingUpdates AS (
SELECT
p1.ProductID,
p1.UpdateDateTime,
p1.PricingLogic,
-- Include any other columns from Table 2 that you need
MAX(p2.UpdateDateTime) OVER (
PARTITION BY
p1.ProductID
ORDER BY
p2.UpdateDateTime ROWS BETWEEN UNBOUNDED PRECEDING
AND 1 PRECEDING
) AS PrevUpdateDateTime
FROM
`project.dataset.Table2` AS p1
LEFT JOIN `project.dataset.Table2` AS p2 ON p1.ProductID = p2.ProductID
AND p1.UpdateDateTime > p2.UpdateDateTime
)
SELECT
s.SaleID,
s.ProductID,
s.SaleDateTime,
s.Price AS SalePrice,
p.PricingLogic AS PricingInfo
-- Include any other columns from Table 1 or the LatestPricingUpdates CTE that you need
FROM
`project.dataset.Table1` AS s
LEFT JOIN LatestPricingUpdates AS p ON s.ProductID = p.ProductID
AND s.SaleDateTime >= p.UpdateDateTime
AND (
s.SaleDateTime < p.PrevUpdateDateTime
OR p.PrevUpdateDateTime IS NULL
)
WHERE
p.UpdateDateTime IS NOT NULL
ORDER BY
s.SaleDateTime
In this query:
- The
LatestPricingUpdates
Common Table Expression (CTE) is used to find the latest pricing update before each sale for a givenProductID
. - The
MAX
function combined with theOVER
clause is used to get the previous update's datetime for each pricing update. - The main
SELECT
statement then joins the sales table with this CTE to get the corresponding pricing information for each sale. - The
WHERE
clause ensures that we only include sales that have a corresponding pricing update.
Please note that you may need to adjust the column names and the project and dataset names to match your actual BigQuery setup.
- Public
- ·
- Fri, 24 Nov 2023 13:02:47 GMT