UA → GA4
We are now about six months into the big GA4 migration. At this point we’ve settled into what we can and can’t do, but there’s still more discomfort than we’d hoped for six months into a new product. One of the biggest issues is the lack of session-level information. The system was not built with sessions in mind, and some may argue that’s fine. But a lot of the time, you want to understand how particular campaigns are performing right now. For instance, how are your Black Friday efforts doing on, you know, Black Friday? This data is really hard to come by in GA4. But fear not, we will be tackling this issue and more in this blog!
What this blog is!
The purpose of this blog is to empower YOU to set up a real-time channel & campaign dashboard quickly and efficiently. This blog will equip you with two SQL queries that can be used to pull yesterday’s data and the previous 30/60 minutes of data from GA4, and display it in a Looker Studio report.
What this blog is not!
This blog is not a comprehensive tutorial on GA4, BigQuery, or the combination of the two. This blog is a means to an end to get you up and running with a real-time dashboard. If you want to understand all of the detailed aspects of BigQuery & GA4, I’d advise taking the wonderful Simmer course led by the brilliant Johan van de Werken. That course is a deep dive into GA4 & BigQuery. Our approach today is to empower marketers to find a critical report that they may be missing. We’ll be taking shortcuts and skipping some basic definitions. You’ve been warned! Now let’s go!
The problem with GA4 attribution data? (well, one of the problems)
There are a lot of issues with GA4 but there’s one that comes up time and time again. One that just causes heartbreak among so many of my clients. One that rips people’s souls in half as I see them wince every time I mention it…yesterday and today’s data just doesn’t work properly in GA4. The data doesn’t fully exist in the platform until 48 hours go by.
What do you mean by that?
Well let’s take a look…according to Google Analytics documentation, GA4 can take up to 48 hours to process data from your website or app.
I’ve seen the data processing times be both fast and slow, but the fact is you can’t rely on the data populating quickly. This can be a real pain in the neck for all businesses, but especially for those in the ecommerce industry.
But Josh! How am I going to see how my holiday sale is performing? How can we make quick decisions to see how our campaigns are converting YoY or even DoD? Can we at least go into real time to get a gut check of how our campaigns are performing in the last 30 minutes? We just sent an email out–are people clicking it?
Real-Time Fails in GA4
Well, that’s kind of complicated too. GA4 currently displays only first-touch information in real time, at an aggregate level.
So, for instance, if I go to our site with this UTM link: you would hope that this attribution source would at least show up somewhere in Realtime to show how many people are on the site. Let’s see if that’s the case.
First we go to the link.
Now we navigate into Realtime. The first thing I notice in the attribution section of Realtime is that we only have access to First User data. This is not super useful, as I don’t really care how a person originally came to my website; I’m interested in why a user is there right now. Imagine the scenario where you send out 1,000 emails about your Black Friday sale and then navigate to see how many people are on the site from the email you just sent. Suddenly you see 99 people from an email about March Madness. Not super useful, huh?
Well, does that data just not exist then? If I dive into the user snapshot and find my own session (by chance, as there is currently no way to find a specific user in the snapshot), I can see that the campaign, source, and medium data exists… so GA4 obviously recognizes that I am on the site with these parameters.
Maybe we can find the session data in the comparison filters? I exist here, and therefore I should be able to use my source data to find myself even if it’s not explicitly available in the dashboards. Nope, wrong again! Even though Google lists it among your filter options, you can’t actually use session source as a real-time filter.
Thus, it is impossible to see what channels are driving your website traffic at this very moment. Unless… there was a way to get real-time attribution data down to almost the second. Well guess what, Bob – we can with BigQuery!
Connect BigQuery
Ok, so let’s start this process. The first thing you’ll need to do is make sure you connect GA4 to BigQuery. First, navigate to the admin section of GA4 and make sure you have an active link with BigQuery. If you haven’t set one up yet, you’ll need to create a BigQuery account and then link your GA4 property to it.
Once you have an active link, it will look like this:
The BigQuery connection will take a day or two to start populating the data into your BigQuery project, so be patient. Additionally, it will ONLY start populating data once you have connected it; there is currently no support for retroactive GA4 data in BigQuery.
NOTE: You will be able to create a free link, but you’ll definitely want to upgrade to the paid version of BigQuery (it does not cost very much to store the GA4 data). Without upgrading to paid, you will not be able to access the “streaming” export type, which we will need for any type of real-time dashboard. Additionally, you will not be able to store more than two months of data at a time without upgrading, so for the sake of full functionality, make sure to add a billing account.
You can choose whether or not to export User Data – but I would suggest it. We are not going to be using it for this tutorial but it will be helpful down the road.
The Belly of the BigQuery Beast
Now, we’ll want to navigate to BigQuery. Go to the Google Cloud Console. This should bring you directly to the correct project.
You should see a new dataset under your project ID called analytics_{{property_id}}.
If you selected “Daily Export”, “Streaming Export”, & “User Data Export” in your connector, you will see the following tables:
Let’s break them down:
- Events: This is your daily repository of historical analytics data. Once a day, BigQuery takes the intraday data and “deposits” it into the events table. It lives in events from that point on.
- Events_intraday: This is your “real-time” data. It holds the last 12-24 hours’ worth of data and continues to populate in near real time. You may see yesterday’s data in there until BigQuery processes and packages it into the events table, but in general you’ll see mostly the last 24 hours.
- Pseudonymous_users: This shows information about the “user” in relation to the GA4 cookie identifier (client ID). It has information on device, geo, audiences, etc., all tied to a single user.
- Users (not listed): If you are sending in User IDs, you’ll have another table with the same information as pseudonymous_users, keyed by user ID.
For the purpose of this exercise we’re going to focus on the events_intraday table, but the same logic can be applied to the events table.
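Before moving on, it’s worth running a quick query against the intraday table just to confirm events are flowing. Here’s a minimal sketch (the project and dataset IDs are placeholders; swap in your own):

```sql
-- Sanity check: pull the ten most recent intraday events.
-- Placeholder IDs: replace with your own project and analytics dataset.
SELECT
  event_name,
  TIMESTAMP_MICROS(event_timestamp) AS event_time_utc,
  (SELECT value.int_value FROM UNNEST(event_params)
   WHERE key = 'ga_session_id') AS session_id
FROM `your-project.analytics_123456789.events_intraday_*`
ORDER BY event_timestamp DESC
LIMIT 10;
```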
Manipulating and Condensing the Tables
If you dive into a preview of one of those tables you’ll see A LOT of columns.
The GA4/BigQuery connector gives us a lot of data to play around with, but that doesn’t necessarily mean we need all of it all of the time. Querying is what ends up costing the big bucks in BigQuery, and querying a large dataset like GA4 data can add up! Additionally, some of the data is split into separate columns that we’re used to seeing as one (e.g., Session Source & Session Medium). To get the data we need, in the shape we’d like to see it, we need to build our own tables from the existing data.
Building our own dedicated table with just the information we need will make finding the data and visualizing it in Looker Studio way easier, and it will save you quite a bit of moola. It’ll also make the report more lightweight and faster to update. All-around wins!
So now we need some SQL! “DID YOU SAY SQL?! OMG I DIDN’T KNOW THERE WAS CODE INVOLVED. I’VE INVESTED ALL THIS TIME IN YOUR LONG-ASS ARTICLE AND NOW I HAVE TO CODE!?”
Wait, wait, before you get all huffy and puffy, I’ve got a simple solution. Ahmed Ali put out an incredibly valuable site which does all the work for you: GA4SQL.
There is a lot to play around with here, but let’s take a look at two code snippets that will get us the most critical real-time marketing data.
So what does this code do? A couple of things:
This SQL query is designed to classify and aggregate session data from an analytics dataset. It categorizes sessions into channels based on the source and medium of the session, using a series of CASE statements to determine if the session is from the following:
‘Direct’, ‘Cross-network’, ‘Paid Shopping’, and various other defined channel types such as ‘Paid Search’, ‘Paid Social’, ‘Paid Video’, ‘Display’, ‘Organic Shopping’, ‘Organic Social’, ‘Organic Video’, ‘Organic Search’, ‘Email’, ‘Affiliates’, ‘Referral’, ‘Audio’, ‘SMS’, and ‘Mobile Push Notifications’. If none of these cases match, it defaults to ‘Unassigned’.
The query then aggregates this information, along with other session details like the campaign name, region, country, landing page, device category, and date, to count distinct sessions, sum engaged sessions, purchases, and revenue, and count total users. This is done within a subquery that extracts and prepares this data from event parameters, traffic sources, and user IDs, filtering the data for the previous day.
The query provides a detailed breakdown of user sessions by various marketing channels and other dimensions, and calculates key performance metrics such as engagement, purchases, and revenue for a given date.
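Since the full GA4SQL output is far too long to reproduce here, below is a heavily trimmed sketch of the pattern it follows. The table ID is a placeholder and the CASE statement shows only a handful of the channel rules (the real query works through every definition listed above):

```sql
-- Trimmed sketch of the GA4SQL channel-grouping pattern (not the full query).
-- Placeholder IDs; generate the real thing on GA4SQL.
WITH session_data AS (
  SELECT
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source') AS source,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'medium') AS medium,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'campaign') AS campaign,
    device.category AS device_category,
    geo.country AS country,
    event_name,
    ecommerce.purchase_revenue AS revenue
  FROM `your-project.analytics_123456789.events_*`
  WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
)
SELECT
  CASE
    WHEN source IS NULL AND medium IS NULL THEN 'Direct'
    WHEN REGEXP_CONTAINS(medium, r'^(cpc|ppc|paid)$')
         AND REGEXP_CONTAINS(source, r'google|bing') THEN 'Paid Search'
    WHEN REGEXP_CONTAINS(medium, r'social') THEN 'Organic Social'
    WHEN medium = 'organic' THEN 'Organic Search'
    WHEN medium = 'email' THEN 'Email'
    -- ...the real query continues through every channel definition above...
    ELSE 'Unassigned'
  END AS channel,
  campaign,
  country,
  device_category,
  COUNT(DISTINCT CONCAT(user_pseudo_id, '-', CAST(session_id AS STRING))) AS sessions,
  COUNT(DISTINCT user_pseudo_id) AS users,
  COUNTIF(event_name = 'purchase') AS purchases,
  SUM(revenue) AS revenue
FROM session_data
GROUP BY channel, campaign, country, device_category;
```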
Great, so now that you have a basic understanding of what the code is doing, we need to manipulate it further to get what we’re looking for in this report. Currently, the GA4SQL website does not have an option for selecting from the intraday table, which is the one we need for real-time reports. Additionally, since it doesn’t pull from intraday, certain dimensions such as hour and minute are missing.
With that said, a few changes need to be made to the code to make it work for this purpose:
1. It assumes we want to pull yesterday’s data. We really just want this to pull from today’s data. If you’d also like historical data, you can use the code above with some slight modifications (such as the date range) to build a fully comprehensive BigQuery historical report, but we’re currently after the last 24 hours of data.
2. There’s no mention of hours or minutes here. What if we want to see the activity on our website in the last 30 minutes? We won’t be able to do that with the code above.
So there’s a little modification we need to make.
We will want to build a view in BigQuery. A view is a virtual table defined by a SQL query over our main table, which allows for a much smaller, more flexible data pull.
We will be building a “by the minute” view for our “last 30 minute LS report”.
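If you haven’t created a view before, the shape is straightforward: wrap your SELECT in a CREATE VIEW statement. A minimal sketch with placeholder names (your real view will select many more columns):

```sql
-- Minimal view sketch; project, dataset, and view names are placeholders.
CREATE OR REPLACE VIEW `your-project.your_dataset.realtime_last_30_min` AS
SELECT
  event_name,
  TIMESTAMP_MICROS(event_timestamp) AS event_time_utc
FROM `your-project.analytics_123456789.events_intraday_*`
WHERE TIMESTAMP_MICROS(event_timestamp)
      >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 MINUTE);
```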
Last 30 minutes:
Some edits must be made to the SQL query to provide the correct time data and accurately pull the last 30 minutes of data from your source.
- Update line 186 with your table ID in the format: project_ID.Dataset_ID.Table_ID.
- Update lines 158 & 159 with the correct offset based on the time zone set in your Google Analytics account. This can be found in Analytics under Admin > Property > Property details > Reporting Time Zone. The value should be a string like “-08:00” or “+2:00”
- Update line 165 with the correct offset. This interval should always be a positive integer. However, if your time zone is ahead of UTC, you must also change the TIMESTAMP_SUB function to TIMESTAMP_ADD, along with updating the interval value. This line outputs the datetime that your event happened according to the time zone set in your GA4 account. (See the sketch after this list for what these lines look like.)
To get the last 60 minutes of data rather than the last 30:
- On line 189 of the query change “INTERVAL 30 MINUTE” to “INTERVAL 60 MINUTE”
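To make those edits concrete, here’s a minimal sketch of what the relevant lines are doing, assuming a property set to Pacific time (UTC-8). Your generated query will look different, but the pattern is the same:

```sql
SELECT
  -- lines ~158-159: shift the UTC event timestamp into your GA4 reporting
  -- time zone (string offset) so hour/minute match what Analytics shows
  DATETIME(TIMESTAMP_MICROS(event_timestamp), '-08:00') AS event_datetime_local,
  -- line ~165: the same shift with a numeric offset; swap TIMESTAMP_SUB
  -- for TIMESTAMP_ADD if your time zone is ahead of UTC
  TIMESTAMP_SUB(TIMESTAMP_MICROS(event_timestamp), INTERVAL 8 HOUR) AS event_ts_local,
  event_name
FROM `your-project.analytics_123456789.events_intraday_*`  -- line ~186: your table ID
-- line ~189: the real-time window; change 30 to 60 for the hourly version
WHERE TIMESTAMP_MICROS(event_timestamp)
      >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 MINUTE);
```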
Okay we have our code now.
Let’s run the SQL. In BigQuery, navigate to your GA4 events_intraday table. Once you are there, run a query.
In your query tab, paste the code above that you’d like to run. Ideally, you’d set up both the last-30-minutes version and the hourly one. The most important thing is to swap the dataset in the SQL’s FROM clause with YOUR dataset ID. You can Cmd/Ctrl+F and search for “intraday” within the SQL; this should get you to the section you need to change.
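After the swap, the FROM clause should end up looking something like this; here it is inside a trivially runnable query (placeholder IDs):

```sql
-- The FROM clause should point at YOUR intraday table after the swap.
SELECT COUNT(*) AS events_so_far_today
FROM `my-project-id.analytics_123456789.events_intraday_*`;
```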
When this is done, you can save your BigQuery copy as a view, or simply save as a view, depending on whether you’ve already saved the query.
In the top right, BigQuery will also tell you how much data the query will process when run. This is what you can use to estimate the cost of refreshing the dashboard every minute. This query comes in at about 95 MB, which, if you pull every minute of every day for a month (95 MB × 60 × 24 × 30), totals 4,104,000 MB, or roughly 4.1 TB. Per Google Cloud Platform pricing, having just this one query run in real time would come to around $18 per month (not bad), but it definitely gives you a sense of the importance of keeping your query weight down and balancing how often you pull the data.
Once you save it as a view, BigQuery will prompt you to add it to a project and an existing dataset. Choose your dataset and name the view whatever you’d like.
Now you should have a view that is much lighter than the raw dataset. This is what we will build the Looker Studio report from.
Add Real-Time Data to Looker Studio
Copy Our Template:
Now we’re ready to create a real-time dashboard in Looker Studio! You can either create a new report from scratch OR use this template:
BigQuery Realtime Dashboard Template [MAKE A COPY]
To use this template, click the link above. Then click the 3-dot menu in the top right corner and select “Make a copy.”
You’ll be prompted to select a new data source:
Click the dropdown under “New Data Source” and select “Create Data Source”:
Build Your Own:
Want to build your own? Follow the steps below:
Now it’s just like any other Looker Studio report. Navigate to LS and add a new data source. Search for the BigQuery connector and select it.
Find your project ID and you should have your newly created views in the list to choose from.
Customize Your Report For Real-Time
Pick the real-time table view you’d like to use. We’ll use the 30-minute one, but you can select any and all. The only difference is the suggested data freshness.
An important step before moving on is to change the data freshness on the data source.
Find the data source that you just added.
Click on the data freshness:
Now you can change how often the data refreshes from BigQuery (i.e., it reruns the SQL query). Remember: the faster this updates, the more you’ll be paying. For the 30-minute dashboard, I’d suggest a data freshness of every 1 minute. For the hourly one, you can probably get away with once an hour or once every 15 minutes. It’s really up to you!
If you’d like to pull every minute, you can click the custom radio button and change the period to every 1 minute.
Now we are ready to start building reports. The original intention of this article was to get session data into a real-time report, so let’s start there. There are really endless combinations of visualizations you can build; I like a by-the-minute time series and some pie charts, but you can pull in whatever you’d like.
A few notes about the above data:
- Sometimes there are null values in the data, for whatever reason, that correct themselves later on. You’ll just need to deal with those; you can filter them out if you’d like (see the sketch after this list).
- A time series chart will restart at the beginning even though it’s after the event. There’s a SQL solve for this too, which I will note below.
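If you’d rather filter the nulls out in SQL than in Looker Studio, one option (an assumption on my part, using hypothetical column names from your view, not something the generated query does for you) is a simple wrapper:

```sql
-- Hypothetical: assumes your view exposes source/medium columns.
SELECT *
FROM `your-project.your_dataset.realtime_last_30_min`
WHERE source IS NOT NULL
  AND medium IS NOT NULL;
```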
All in all, this is a super complex and deeply intense way to get to real-time marketing data. While fundamentally this is cool, and the data is yours and at your fingertips to do with what you want, you may want to find a simpler solution like Piwik Pro or another alternative that does not have the same data lag.
Additional modifications:
Last day of data pulled by hour
Some edits must be made to the SQL query to provide the correct time data and accurately pull the last 24 hours of data from your source.
- Update line 186 with your table ID in the format: project_ID.Dataset_ID.Table_ID.
- Update lines 158 & 159 with the correct offset based on the time zone set in your Google Analytics account. This can be found in Analytics under Admin > Property > Property details > Reporting Time Zone. The value should be a string like “-08:00” or “+2:00”
- Update line 165 with the correct offset. This interval should always be a positive integer. However, if your time zone is ahead of UTC, you must also change the TIMESTAMP_SUB function to TIMESTAMP_ADD, along with updating the interval value. This line outputs the datetime that your event happened according to the time zone set in your GA4 account.
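As a reference point, a stripped-down version of the hourly pull might look like this (placeholder IDs, UTC-8 assumed; the real GA4SQL query carries many more dimensions):

```sql
-- Group today's intraday events by local hour. Placeholder IDs; '-08:00'
-- is an assumed offset, so use your own reporting time zone.
SELECT
  EXTRACT(HOUR FROM DATETIME(TIMESTAMP_MICROS(event_timestamp), '-08:00')) AS hour_local,
  COUNT(DISTINCT user_pseudo_id) AS users,
  COUNTIF(event_name = 'purchase') AS purchases
FROM `your-project.analytics_123456789.events_intraday_*`
GROUP BY hour_local
ORDER BY hour_local;
```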