Collect pageviews with Flask and Cassandra

data-warehousing web-analytics python cassandra flask

Here is a simple example of collecting pageviews with Flask and Cassandra. The proper way for a client to make a cross-site request that saves a pageview is CORS, but since old browsers don't support it, each page will instead request a tiny image and pass the browser details as query-string arguments.

Python requirements:

Flask==0.11
cassandra-driver==3.5.0
pytz==2016.4

Create a Cassandra keyspace:

CREATE KEYSPACE wa
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

USE wa;

To support multiple applications, create the table apps:

CREATE TABLE apps (
    id uuid PRIMARY KEY,
    name text,
    url text
);
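
Each application needs a row in apps before its pageviews can be accepted. A minimal sketch of registering one, assuming a cassandra-driver session connected to the wa keyspace; the helper name new_app_statement and the example URL are hypothetical, not part of the original code:

```python
from uuid import uuid4


def new_app_statement(name, url):
    """Return (app_id, query, params) for registering one application."""
    app_id = uuid4()  # random id, later pasted into the tracking script
    query = 'INSERT INTO apps (id, name, url) VALUES (%s, %s, %s)'
    return app_id, query, (app_id, name, url)


app_id, query, params = new_app_statement('Blog', 'http://blog.example.com')
# With a connected session (assumption): session.execute(query, params)
print(app_id)
```

The printed id is the value the tracking script has to send as its app argument.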

Pageviews are saved in a table with a composite primary key: app is the partition key, so pageviews for the same app are stored physically together, and date is the clustering column, so they are ordered by time within each app. Create the pageviews table:

CREATE TABLE pageviews (
    app uuid,
    date timestamp,
    utma uuid,
    utmb uuid,
    path text,
    title text,
    ip text,
    referrer text,
    useragent text,
    platform text,
    language text,
    screensize text,
    pixelratio float,
    PRIMARY KEY (app, date)
);
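
Because app is the partition key and date the clustering column, one app's pageviews can be read back for a time range without scanning other apps. A sketch of such a query, assuming the same cassandra-driver session as the Flask app; the helper name pageviews_between and the placeholder app id are made up for illustration:

```python
from datetime import datetime
from uuid import UUID


def pageviews_between(app_id, start, end):
    """Return (query, params) selecting one app's pageviews in [start, end)."""
    query = ('SELECT date, path, utma, utmb FROM pageviews '
             'WHERE app=%s AND date>=%s AND date<%s')
    return query, (app_id, start, end)


# Example: all pageviews of a placeholder app on 1 July 2016.
query, params = pageviews_between(
    UUID('00000000-0000-4000-8000-000000000000'),
    datetime(2016, 7, 1), datetime(2016, 7, 2))
# With a connected session (assumption): rows = session.execute(query, params)
```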

Add the following script inside the <head> of each page that should be tracked, replacing {{APP_UUID}} with the app's id from the apps table and {{ANALYTICS_URL}} with the URL where the Flask app is running (it is also a good idea to minify the script):

<script type="text/javascript">
var d,i,q;
d = {
    app: '{{APP_UUID}}',
    path: location.pathname,
    title: document.title,
    platform: navigator.platform,
    language: navigator.language,
    screensize: screen.width+'x'+screen.height,
    pixelratio: window.devicePixelRatio || 1,
    referrer: document.referrer
};
q = [];
for (i in d) q.push([i,encodeURIComponent(d[i])].join('='));
new Image().src = '{{ANALYTICS_URL}}?'+q.join('&');
</script>
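
For testing the endpoint without a browser, the same kind of URL can be built in Python. This is only a sketch: pageview_url is a hypothetical helper, the host name is a placeholder, and urllib's quote escapes a few more characters than JavaScript's encodeURIComponent (for example ! and *), which the server decodes the same way.

```python
from urllib.parse import quote, urlencode


def pageview_url(analytics_url, **fields):
    """Build the image URL that the tracking script would request."""
    # quote_via=quote encodes spaces as %20 instead of '+';
    # safe='' makes sure '/' is escaped as well.
    return analytics_url + '?' + urlencode(fields, quote_via=quote, safe='')


url = pageview_url('http://analytics.example.com/',
                   path='/blog/hello world', title='Hello')
print(url)
```

Requesting that URL (for instance with urllib.request.urlopen) should return the image, and it records a pageview only if a valid app argument is included.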

Now save the pageviews in Flask. The cookies _utma and _utmb are used in the same way Google Analytics uses its cookies: _utma "remembers" a user (expires in two years) and _utmb identifies a visit so that its duration can be recorded (expires in 30 minutes):

from datetime import datetime, timedelta
from uuid import UUID, uuid4
from flask import Flask, request, send_file
from cassandra.cluster import Cluster
import pytz


app = Flask(__name__)
app.config.from_pyfile('config.py')


# Connect once and reuse the session for every request; opening and
# shutting down a cluster connection per request is slow and unnecessary.
app.cluster = Cluster()
app.db = app.cluster.connect('wa')


@app.route('/')
def pageview():
    data = request.args.to_dict()
    response = send_file('img.gif', mimetype='image/gif')

    # Verify app.
    try:
        data['app'] = UUID(data['app'])
    except (KeyError, ValueError):
        return response
    query = 'SELECT id FROM apps WHERE id=%s'
    if not list(app.db.execute(query, [data['app']])):
        return response

    # Tracking cookies.
    now = datetime.now(pytz.timezone('Europe/London'))
    # Reuse the ids from the cookies; a missing or malformed cookie gets a
    # fresh id instead of raising an error.
    try:
        utma = UUID(request.cookies['_utma'])
    except (KeyError, ValueError):
        utma = uuid4()
    response.set_cookie('_utma', str(utma), expires=now+timedelta(days=730))
    try:
        utmb = UUID(request.cookies['_utmb'])
    except (KeyError, ValueError):
        utmb = uuid4()
    response.set_cookie(
        '_utmb', str(utmb), expires=now+timedelta(seconds=1800))

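    # Save pageview. Only known query arguments are used as columns, so a
    # request cannot inject arbitrary column names into the statement.
    allowed = ('app', 'path', 'title', 'referrer', 'platform', 'language',
               'screensize')
    row = {key: data[key] for key in allowed if key in data}
    try:
        pixelratio = float(data.get('pixelratio') or 1)
    except ValueError:
        pixelratio = 1.0
    row.update(utma=utma,
               utmb=utmb,
               date=now,
               ip=request.remote_addr,
               useragent=request.headers.get('User-Agent', ''),
               pixelratio=pixelratio)
    columns = list(row)
    query = 'INSERT INTO pageviews (%s) VALUES (%s)' % (
        ','.join(columns), ','.join(['%s'] * len(columns)))
    app.db.execute(query, [row[column] for column in columns])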

    # Prevent HTTP caching.
    response.last_modified = now
    response.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
    response.headers['Pragma'] = 'no-cache'
    response.headers['Expires'] = '0'
    return response


if __name__ == '__main__':
    app.run()
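
Since every pageview of a visit shares the same utmb value, a visit's duration can be estimated as the time between its first and last pageview. A minimal sketch under that assumption; visit_durations is a hypothetical helper and the rows are made-up sample data standing in for the result of a SELECT utmb, date query:

```python
from datetime import datetime


def visit_durations(rows):
    """Map each utmb value to the time between its first and last pageview.

    rows is any iterable of (utmb, date) pairs.
    """
    spans = {}
    for utmb, date in rows:
        first, last = spans.get(utmb, (date, date))
        spans[utmb] = (min(first, date), max(last, date))
    return {utmb: last - first for utmb, (first, last) in spans.items()}


rows = [
    ('visit-1', datetime(2016, 7, 1, 10, 0)),
    ('visit-1', datetime(2016, 7, 1, 10, 12)),
    ('visit-2', datetime(2016, 7, 1, 11, 0)),
]
durations = visit_durations(rows)
```

A visit with a single pageview gets a duration of zero, because the time spent on the last page is not observable from pageviews alone.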

Remember to add the img.gif file and a config.py file with any extra settings (such as SERVER_NAME).
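
If no image is at hand, one way to create img.gif is to write out a 1x1 transparent GIF. The sketch below uses a commonly circulated base64 encoding of such an image; the helper name write_pixel is hypothetical, and any small image would work equally well.

```python
import base64

# A 1x1 transparent GIF, base64-encoded (42 bytes once decoded).
PIXEL_GIF = base64.b64decode(
    'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7')


def write_pixel(path='img.gif'):
    """Write the tracking image to path (next to the Flask app by default)."""
    with open(path, 'wb') as handle:
        handle.write(PIXEL_GIF)


if __name__ == '__main__':
    write_pixel()
```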