Extracting Google Analytics data through API and AWS Lambda to S3

RJZ
Aug 10, 2020

How to build your own Google Analytics ETL?

Sometimes the Google Analytics web interface is just not enough (or maybe you want to combine it with data from other sources?). Here we will build a working Lambda function to extract data from a Google Analytics account. The function creates a CSV file with transaction_id and other marketing columns for one day and one account view.

The demo was done on macOS Catalina.

Requirements:
1) Google Account
2) Google Analytics account
3) AWS account
4) Docker installed

Here are the main steps:
1) Create Google API service account and download credentials file
2) Enable Google Analytics and Google Analytics reporting for this API
3) Grant Google Analytics access to your service account email
4) Create a Python function that writes the API query result to S3

STEP 1

Go to your Cloud credentials page https://console.cloud.google.com/apis/credentials, click “Create Credentials” and select “Service account”.

Give the service account the desired name:

Skip the optional part 2:

Click “CREATE KEY” and generate a JSON key file. It should download to your computer.
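
Before moving on, you can do a quick sanity check on the downloaded key locally. This is just a sketch and assumes you have renamed the file to cred.json (matching the Lambda code later in this post) and pip-installed oauth2client on your machine:

import json
from oauth2client.service_account import ServiceAccountCredentials

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']

# Read the key file directly to see the service account email;
# this is the address you will grant access to in Step 3.
with open('cred.json') as f:
    print(json.load(f)['client_email'])

# Confirm oauth2client can build credentials from the same file.
ServiceAccountCredentials.from_json_keyfile_name('cred.json', SCOPES)
print('Key file loads OK')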

Step 2

Go to https://console.cloud.google.com/apis/library and enable both Analytics services (Google Analytics API and Google Analytics Reporting API).

Step 3

Go to your Google Analytics account, open Admin -> Account User Management, add your service account email (e.g. mytesting@local-storm-255611.iam.gserviceaccount.com) and give it “Read & Analyse” permissions.

Step 4

Create an AWS account (or log in if you already have one) and create a bucket:

https://console.aws.amazon.com/s3/bucket/create?region=eu-north-1
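
If you prefer to create the bucket from code rather than the console, a minimal boto3 sketch could look like this (the bucket name mytestbucket is a placeholder; bucket names are globally unique, so use your own):

import boto3

# Create the target bucket in eu-north-1.
s3 = boto3.client('s3', region_name='eu-north-1')
s3.create_bucket(
    Bucket='mytestbucket',  # placeholder name
    CreateBucketConfiguration={'LocationConstraint': 'eu-north-1'},
)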

Go to https://eu-north-1.console.aws.amazon.com/lambda/home?region=eu-north-1#/functions and press “Create Function”

Then name your function and choose the Python 3.8 runtime.

Create a credentials file named cred.json in your Lambda function and paste into it the contents of the credentials JSON file downloaded in Step 1.

Copy the following code into lambda_function.py:

import csv

import boto3
from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE_LOCATION = 'cred.json'


def initialize_analyticsreporting():
    """Initializes an Analytics Reporting API V4 service object.

    Returns: An authorized Analytics Reporting API V4 service object.
    """
    credentials = ServiceAccountCredentials.from_json_keyfile_name(
        KEY_FILE_LOCATION, SCOPES)
    # Build the service object.
    return build('analyticsreporting', 'v4',
                 credentials=credentials, cache_discovery=False)


def get_report(analytics, view_id, dt):
    """Queries the Analytics Reporting API V4 for one day and one view.

    Returns: The Analytics Reporting API V4 response.
    """
    return analytics.reports().batchGet(body={
        'reportRequests': [{
            'viewId': view_id,
            'dateRanges': [{'startDate': dt, 'endDate': dt}],
            'hideTotals': True,
            'metrics': [{'expression': 'ga:itemQuantity'}],
            'dimensions': [
                {'name': 'ga:transactionId'},
                {'name': 'ga:dateHour'},
                {'name': 'ga:adContent'},
                {'name': 'ga:sourceMedium'},
                {'name': 'ga:campaign'},
                {'name': 'ga:keyword'},
            ],
        }]
    }).execute()


def lambda_handler(event, context):
    view_id = str(event['view_id'])
    dt = str(event['dt'])

    analytics = initialize_analyticsreporting()
    response = get_report(analytics, view_id, dt)

    if response['reports'][0]['data'].get('rows'):
        # Keep only the dimension values of every returned row.
        rows = [row['dimensions']
                for row in response['reports'][0]['data']['rows']]

        # Lambda functions may only write to /tmp.
        with open('/tmp/temp.csv', 'w', newline='') as file:
            csv.writer(file).writerows(rows)

        # Upload as google/GA/YYYY/MM/DD/<view_id>.csv to the given bucket.
        with open('/tmp/temp.csv', 'rb') as f:
            s3 = boto3.client('s3')
            fn = ('google/GA/' + dt[0:4] + '/' + dt[5:7] + '/'
                  + dt[8:10] + '/' + view_id + '.csv')
            s3.upload_fileobj(f, str(event['bucket']), fn)

If you try to run the function now, it will fail: the Google API client library is not part of the standard Lambda runtime, so we need to add it as a layer.

First, we will build the layer using Docker Desktop. Open a terminal and paste this code:

cd Documents/
mkdir google_layer
cd google_layer/
mkdir lambda_function lambda_layers
mkdir -p lambda_layers/python/lib/python3.8/site-packages
cd lambda_layers
docker run -v "$PWD":/var/task "lambci/lambda:build-python3.8" /bin/sh -c "pip install google-api-python-client -t python/lib/python3.8/site-packages/; pip install oauth2client -t python/lib/python3.8/site-packages/; exit"
zip -r ga.zip python > /dev/null

After this, you will find ga.zip inside Documents/google_layer/lambda_layers/. Upload this file to your Lambda layers.

In Lambda menu, go to Layers and press “Create layer”.

Upload the .zip file that we created in the previous step.

Name it and choose Python 3.8 as a compatible runtime.

Go back to the function you created, click on the Layers part of the Designer and click “Add a layer”.

Choose the layer you created, click “Add”, and do not forget to save the function afterwards.
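
If you prefer scripting over console clicks, the same two steps (publishing the layer and attaching it to the function) can be sketched with boto3; google-api-layer and my-ga-extract below are placeholder names, use your own layer and function names:

import boto3

lam = boto3.client('lambda', region_name='eu-north-1')

# Publish the zip built with Docker as a new layer version.
with open('ga.zip', 'rb') as f:
    layer = lam.publish_layer_version(
        LayerName='google-api-layer',  # placeholder layer name
        Content={'ZipFile': f.read()},
        CompatibleRuntimes=['python3.8'],
    )

# Attach that layer version to the function created earlier.
lam.update_function_configuration(
    FunctionName='my-ga-extract',  # placeholder function name
    Layers=[layer['LayerVersionArn']],
)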

Let’s test our function.

Click on “Test” and configure your test event (paste this and change the values where needed):

{
  "view_id": 1111111,
  "dt": "2020-04-20",
  "bucket": "mytestbucket"
}
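
Instead of the console “Test” button, you could also invoke the function from Python with the same payload. A sketch, where my-ga-extract is again a placeholder function name:

import json
import boto3

lam = boto3.client('lambda', region_name='eu-north-1')
payload = {'view_id': 1111111, 'dt': '2020-04-20', 'bucket': 'mytestbucket'}

# Synchronous invocation; a 200 StatusCode means the function ran.
resp = lam.invoke(FunctionName='my-ga-extract',
                  Payload=json.dumps(payload).encode())
print(resp['StatusCode'])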

You can find your own view_id on this page: https://ga-dev-tools.appspot.com/account-explorer/
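
If you would rather list the views programmatically, the Google Analytics (v3) Management API enabled in Step 2 exposes account summaries. A rough sketch, reusing the cred.json key file:

from apiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials

SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
credentials = ServiceAccountCredentials.from_json_keyfile_name('cred.json', SCOPES)

# accountSummaries lists every account, property and view the service account
# can see, including the numeric view (profile) IDs used as view_id above.
analytics_v3 = build('analytics', 'v3', credentials=credentials,
                     cache_discovery=False)
summaries = analytics_v3.management().accountSummaries().list().execute()
for account in summaries.get('items', []):
    for prop in account.get('webProperties', []):
        for view in prop.get('profiles', []):
            print(account['name'], prop['name'], view['id'])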

Finally, you need to give your Lambda execution role access to your S3 bucket.

Go to https://console.aws.amazon.com/iam/home?region=eu-north-1#/home

From the left menu, select “Roles” and click on your Lambda role:

Click on “Attach policies”:

Start typing “S3” and check the box next to “AmazonS3FullAccess”.

Now you can run your Lambda function. If everything is OK, you will see a file created in your S3 bucket.
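
To double-check from code that the export landed, you can list the keys under the date prefix (bucket name and date here are the placeholder values from the test event above):

import boto3

# List everything written under google/GA/<year>/<month>/<day>/.
s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='mytestbucket', Prefix='google/GA/2020/04/20/')
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])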

Further notes:

  • You can extract any kind of data this way; build a valid query using the Dimensions & Metrics Explorer: https://ga-dev-tools.appspot.com/dimensions-metrics-explorer/
  • You can write an additional function to process the created file and insert the rows into a database.
  • You can write an additional Lambda function to extract all the historical data.
  • Step Functions can be used if you have a complex ETL process.
  • Additional logic for pagination can easily be added (see the sketch after this list).
  • Stitch Data offers 6 million rows per month for free, so if you are not planning on a lot of rows, you might want to check them out.
  • Original post: www.bigdatawonders.com
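
On the pagination point: the Reporting API v4 returns a limited number of rows per request and includes a nextPageToken when more rows are available. A sketch of how the paging loop could look, reusing initialize_analyticsreporting from the function above (the helper names here are mine):

def get_report_page(analytics, view_id, dt, page_token=None):
    # Same request shape as get_report above, plus pageSize/pageToken.
    request = {
        'viewId': view_id,
        'dateRanges': [{'startDate': dt, 'endDate': dt}],
        'hideTotals': True,
        'metrics': [{'expression': 'ga:itemQuantity'}],
        'dimensions': [{'name': 'ga:transactionId'}, {'name': 'ga:dateHour'}],
        'pageSize': 10000,
    }
    if page_token:
        request['pageToken'] = page_token
    return analytics.reports().batchGet(
        body={'reportRequests': [request]}).execute()

def get_all_rows(analytics, view_id, dt):
    rows, page_token = [], None
    while True:
        response = get_report_page(analytics, view_id, dt, page_token)
        report = response['reports'][0]
        rows.extend(report['data'].get('rows', []))
        # nextPageToken is only present while more rows remain.
        page_token = report.get('nextPageToken')
        if not page_token:
            return rows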
