
Parse Googlebot logs from MaxCDN #7

Open
wants to merge 1 commit into
base: master

Conversation

mrdavidlaing
Member

It's possible to get logs of Googlebot traffic to MaxCDN via the MaxCDN API. This gives source logs in the following format:

{ "bytes": 46953, "client_asn": "AS15169 Google Inc.", "client_city": "Mountain View", "client_continent": "NA", "client_country": "US", "client_dma": "0", "client_ip": "66.249.67.220", "client_latitude": 37.38600158691406, "client_longitude": -122.08380126953125, "client_state": "CA", "company_id": 85, "cache_status": "MISS", "hostname": "cdn.yoast.com", "method": "GET", "origin_time": 0.024, "pop": "vir", "protocol": "HTTP/1.1", "query_string": "", "referer": "-", "scheme": "https", "status": 200, "time": "2014-06-30T08:40:45.159Z", "uri": "/wp-content/uploads/2009/10/apple-404.png", "user_agent": "Googlebot-Image/1.0", "zone_id": 33008 }

These should be parsed into a format that makes them easy to analyse.
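
As a rough illustration of what "parsed" could mean here, the sketch below decodes one log line and turns its `time` field into a timezone-aware timestamp. It is written in Python purely for illustration and is not the filter used in this pull request; the function name `parse_maxcdn_line` and the choice to store the result under `@timestamp` are assumptions.

```python
import json
from datetime import datetime, timezone


def parse_maxcdn_line(line: str) -> dict:
    """Decode one MaxCDN JSON log line and attach a UTC timestamp.

    Hypothetical helper: the name and the '@timestamp' key are assumptions.
    """
    record = json.loads(line)
    # "time" is ISO 8601 in UTC with a trailing "Z", e.g. 2014-06-30T08:40:45.159Z
    parsed = datetime.strptime(record["time"], "%Y-%m-%dT%H:%M:%S.%fZ")
    record["@timestamp"] = parsed.replace(tzinfo=timezone.utc)
    return record


# Shortened copy of the sample line above (only a few fields kept).
sample = (
    '{"bytes": 46953, "client_ip": "66.249.67.220", "status": 200, '
    '"time": "2014-06-30T08:40:45.159Z", '
    '"uri": "/wp-content/uploads/2009/10/apple-404.png", '
    '"user_agent": "Googlebot-Image/1.0"}'
)
event = parse_maxcdn_line(sample)
```

Keeping the original `time` string alongside the parsed value makes it easy to check that the conversion did not shift the timezone.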

@mrdavidlaing
Member Author

A very basic JSON filter gives the following:

  '@type': googlebot-maxcdn
  '@message': '{"bytes":0,"client_asn":"AS16509 Amazon.com, Inc.","client_city":"-","client_continent":"EU","client_country":"IE","client_dma":"0","client_ip":"54.247.60.162","client_latitude":53,"client_longitude":-8,"client_state":"-","company_id":85,"cache_status":"MISS","hostname":"cdn.yoast.com","method":"HEAD","origin_time":0.471,"pop":"lhr","protocol":"HTTP\/1.1","query_string":"","referer":"-","scheme":"https","status":200,"time":"2014-07-01T05:10:50.388Z","uri":"\/wp-content\/uploads\/2007\/12\/blogmetrics02.png","user_agent":"Googlebot\/2.1
    (+http:\/\/www.google.com\/bot.html)","zone_id":33008}'
  '@version': '1'
  '@timestamp': 2014-07-01 06:10:50.388000000 +01:00
  bytes: 0
  client_asn: AS16509 Amazon.com, Inc.
  client_city: '-'
  client_continent: EU
  client_country: IE
  client_dma: '0'
  client_ip: 54.247.60.162
  client_latitude: 53
  client_longitude: -8
  client_state: '-'
  company_id: 85
  cache_status: MISS
  hostname: cdn.yoast.com
  method: HEAD
  origin_time: 0.471
  pop: lhr
  protocol: HTTP/1.1
  query_string: ''
  referer: '-'
  scheme: https
  status: 200
  time: '2014-07-01T05:10:50.388Z'
  uri: /wp-content/uploads/2007/12/blogmetrics02.png
  user_agent: Googlebot/2.1 (+http://www.google.com/bot.html)
  zone_id: 33008

Compare this with @type:googlebot, which has the following shape:

  '@type': googlebot
  '@message': '{ "content_type": "text/xml; charset=UTF-8", "@timestamp": "2014-06-19T21:54:20-07:00",
    "remote_addr": "66.249.69.45", "body_bytes_sent": 38704, "request_time": 1.539,
    "status": 200, "robots": "noindex,follow", "redirect_location": "-", "request_method":
    "GET", "scheme": "https", "server_name": "yoast.com", "request_uri": "/cat/wordpress/feed/",
    "document_uri": "/index.php", "http_user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1;
    +http://www.google.com/bot.html)" }'
  '@version': '1'
  '@timestamp': 2014-06-20 04:54:20.000000000 Z
  content_type:
    charset: utf-8
    type: text/xml
  remote_addr: 66.249.69.45
  body_bytes_sent: 38704
  request_time: 1.539
  status: 200
  robots: noindex,follow
  redirect_location: '-'
  request_method: GET
  scheme: https
  server_name: yoast.com
  request_uri: /cat/wordpress/feed/
  document_uri: /index.php
  http_user_agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  remote_addr_dns: crawl-66-249-69-45.googlebot.com

I think we should rename the @type:googlebot-maxcdn fields to match those of @type:googlebot.
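
One possible mapping, sketched in Python for illustration only. The pairings are guesses based on comparing the two shapes above, not something confirmed in this thread: in particular, `origin_time` may not measure the same thing as `request_time`, and `bytes` may not be exactly `body_bytes_sent`. `status` and `scheme` already share a name. `FIELD_RENAMES` and `rename_fields` are hypothetical names.

```python
# Assumed correspondence between googlebot-maxcdn and googlebot field names.
FIELD_RENAMES = {
    "client_ip": "remote_addr",
    "bytes": "body_bytes_sent",      # assumption: may include more than the body
    "origin_time": "request_time",   # assumption: may measure a different interval
    "method": "request_method",
    "hostname": "server_name",
    "uri": "request_uri",
    "user_agent": "http_user_agent",
}


def rename_fields(event: dict) -> dict:
    """Return a copy of the event with MaxCDN keys renamed; other keys pass through."""
    return {FIELD_RENAMES.get(key, key): value for key, value in event.items()}


renamed = rename_fields({
    "client_ip": "54.247.60.162",
    "method": "HEAD",
    "status": 200,
    "uri": "/wp-content/uploads/2007/12/blogmetrics02.png",
})
```

Fields with no counterpart in @type:googlebot (for example `pop` or `cache_status`) are left under their MaxCDN names in this sketch.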

@jdevalk - do you agree?
