본문 바로가기

English/Google Analytics 4 (GA4)

[BigQuery] Exploring Last-Click Attribution Insights in BigQuery for GA4

The transition from UA to GA4 has indeed introduced its own share of challenges, particularly revolving around attribution models. GA4's emphasis on the last non-direct click in its attribution model has proven to be a significant hurdle for analysts and marketers alike.

In GA4, if the conversion is set only for the purchase event, and a user like Sam first visits the website/app through unpaid search, and then in the next session, comes through email marketing and makes a purchase, the credit for the conversion goes 100% to email marketing.

 

However, the issue arises when Sam visits the website/app again in a different session through paid search, and in the middle of that session, he visits with a UTM parameter. Then, he comes back again in a different session through direct traffic and makes a purchase. In GA4, the last attribution would be to the session with the UTM parameter, not the direct traffic.

 

In GA4:

Event name Event count Conversions Conversion Paths
purchase 2 2 Unpaid search > Email marketing
Unpaid search > Email marketing > Paid search > Affiliate with UTM parameters

 

In UA:

Event name Event count Conversions Conversion Paths
purchase 2 2 Unpaid search > Email marketing
Unpaid search > Email marketing > Paid search > Affiliate with UTM parameters > (none)

 

One of my clients expressed concern about aligning their analytics standards with UA due to the difficulty of changing internal reports. They've been using a last-click attribution model for quite some time, and if GA4 doesn't align with UA, they may consider discontinuing its use. Despite reaching out to Google support and even involving our Korean Google account manager, we were informed that this feature wasn't on their roadmap, which was quite frustrating.

However, during the recent holiday season, I found myself with some spare time and began to wonder if BigQuery could offer a solution. Despite my limited skills with BigQuery, I decided to give it a try. I must admit, I encountered some challenges and spent a considerable amount of time trying to navigate through it. My experience with Oracle also came into play, as I found myself struggling with functions like LAST_VALUE and FIRST_VALUE, which I hadn't used much before 😅 After investing nearly 999 hours into studying their definitions and use cases, I finally managed to come up with a solution.

Let me share the testing scenarios I devised:

Data from BigQuery

  1. Natural Search from Naver without Conversion: Users land on the site through a natural search on Naver but don't proceed to make a purchase (3 instances).
  2. Natural Search from Naver with Conversion: Users arrive via a natural search on Naver in one session and then make a purchase in another session (1 instance).
  3. Direct Traffic with Conversion: Similar to the second scenario, users initially come through a natural search on Naver, but in the same session, they directly navigate to the site and make a purchase (2 instances).

 

Further testing unveiled:

Data from BigQuery

  • No Conversion after Naver Natural Search: Instances where users land through Naver's natural search but don't convert (2 instances).
  • Naver Natural Search without Conversion: Users land through Naver's natural search and convert (3 instances).
  • Google Natural Search with Conversion: Users initially arrive through Naver's natural search, then proceed to make a purchase after a Google natural search (1 instance).

Analyzing these findings highlighted the significant weightage of "(direct)" or "(none)" attributions compared to GA4's non-direct click attributions.

Data from BigQuery

In summary, my exploration through BigQuery shed light on the attribution challenges posed by GA4's attribution model, particularly in emphasizing the dominance of "(direct)" or "(none)" attributions. This insight holds valuable implications for marketers aiming to optimize their attribution strategies within the GA4 landscape.

** 성공??

WITH CTE AS (
  SELECT 
    event_name,
    event_date,
    -- manual_source와 manual_medium에 대해 NULL이면 (direct)와 (none)을 사용
    IFNULL(collected_traffic_source.manual_source, '(direct)') AS manual_source,
    IFNULL(collected_traffic_source.manual_medium, '(none)') AS manual_medium,
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
    event_timestamp
  FROM 
    `analytics_000000000.events_*`
  WHERE 
    (_TABLE_SUFFIX BETWEEN '20240222' AND '20240222')
  -- AND user_pseudo_id = '89161034.1703226979' 
  -- AND user_pseudo_id = c5d4e3ba41cda8f18c441c61ae3c3fb9
),

CTE_user_sessions_channel_conversion_filled AS (
  SELECT
    event_name,
    ga_session_id,
    user_pseudo_id,
    event_timestamp,
    -- 세션 내에서 manual_source와 manual_medium의 마지막 유효값을 사용. 없으면 (direct)와 (none)을 기본값으로 사용.
    COALESCE(
      LAST_VALUE(IF(manual_source <> '(direct)', manual_source, NULL) IGNORE NULLS) OVER (PARTITION BY user_pseudo_id, ga_session_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
      '(direct)'
    ) AS filled_manual_source,
    COALESCE(
      LAST_VALUE(IF(manual_medium <> '(none)', manual_medium, NULL) IGNORE NULLS) OVER (PARTITION BY user_pseudo_id, ga_session_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
      '(none)'
    ) AS filled_manual_medium,
    CASE
        WHEN event_name = 'purchase' THEN 1
        ELSE 0
    END AS conversion
  FROM 
    CTE
)

SELECT 
    filled_manual_source AS manual_source,
    filled_manual_medium AS manual_medium,
    SUM(conversion) AS conversion_count
FROM 
    CTE_user_sessions_channel_conversion_filled
GROUP BY 
    filled_manual_source, filled_manual_medium
ORDER BY 
    conversion_count DESC;

 

  1. CTE (Common Table Expression) Declaration:
    • This part of the query, named CTE, initiates the data extraction process from the Google Analytics BigQuery dataset for a specific day (20240222). It selects various attributes of event data including:
    • event_name: The name of the event, indicating the type of interaction or activity recorded.
    • event_date: The date on which the event occurred.
    • manual_source and manual_medium: These fields capture the traffic source and medium as manually set within Google Analytics. The IFNULL function is used here to replace null values with '(direct)' for the source and '(none)'' for the medium, providing default values where explicit data isn't provided.
      user_pseudo_id: A pseudonymized identifier for the user, allowing for user activity tracking while maintaining privacy.
    • ga_session_id: Extracted from the event parameters (event_params) using the UNNEST function, this identifies the Google Analytics session ID, helping to group events by session.
    • event_timestamp: The timestamp when the event was recorded, useful for sequencing events within a session.
    • The selection is restricted to events occurring on a specific date range (in this case, 20240222), and can optionally be filtered to a specific user by uncommenting the user_pseudo_id condition.
  2. Filling in Traffic Source and Medium:
    • The CTE_user_sessions_channel_conversion_filled part of the query aims to accurately fill in the traffic source and medium for each session. It addresses the challenge of null values and inconsistencies in source and medium data through a series of steps:
    • COALESCE and LAST_VALUE Functions: These functions are used to determine the most appropriate non-null values for manual_source and manual_medium within each session. The query looks for the last non-'(direct)' source and non-'(none)' medium, utilizing LAST_VALUE with a condition to ignore '(direct)' and '(none)' unless they are the only values available.
    • Fallback to Default Values: If no non-default values are found within a session, the query falls back to using '(direct)' for source and '(none)' for medium, ensuring every event is attributed to a source and medium.
    • Conversion Tracking: A CASE statement is used to flag events as conversions (1) if the event name matches 'purchase', allowing for the tracking of conversion events within the dataset.
  3. Aggregation and Summary:
    • Finally, the query aggregates conversion data by the filled-in source and medium, summarizing the conversion counts for each source-medium pair:
    • Grouping and Summation: The query groups the data by filled_manual_source and filled_manual_medium, summing up the conversion flags (1 for purchase events) to count conversions for each source-medium combination.
    • Ordering: The results are ordered by conversion count in descending order, highlighting the most effective source-medium pairs in driving conversions on the specified date.

Conclusion:
This detailed SQL query demonstrates a sophisticated approach to handling Google Analytics data within BigQuery, focusing on accurately attributing conversions to traffic sources and mediums. By thoughtfully managing null values and ensuring consistent application of default values, the query provides a reliable foundation for analyzing user behavior and conversion patterns, essential for informed marketing and business decision-making.

 

user_pseudo_id = '89161034.1703226979'

 

Extracting simply the source and medium values from BigQuery.
Adjusting the source and medium considering the ga_session_id

If the source and medium are null within the same ga_session_id, correct them to (direct) (none).If the source and medium are null within the same ga_session_id, but there were values in a previous instance within the same session, correct them to those previous values. If a new ga_session_id has null values, correct them to (direct) (none).

This query serves the purpose of analyzing traffic sources and mediums for individual user sessions, providing insights into user behavior specifically focused on last-click conversion events.

In such scenarios, it's generally important to consider the lookback window. However, our client has also been setting a 24-hour lookback window in UA.

This has important implications for marketers looking to optimize their strategies in line with existing internal standards, as it meets the 80% average for direct customer inflow criteria.

Limitation of the Article: It would have been beneficial to include user_id as a condition for distinguishing users, in addition to user_pseudo_id. However, the user_id value is currently unavailable (due to user provided data collection being enabled).

I would like to say, thank you, to Javi - 

'English > Google Analytics 4 (GA4)' 카테고리의 다른 글

ADB Environment Setup Guide for MacBook  (0) 2024.11.22