Photon Sphere aims to provide a machine learning approach to identifying DNS requests for domains that are considered pernicious (analytics, trackers, ad serving), for use alongside Pi-hole (https://pi-hole.net/) while remaining deployable on a Raspberry Pi. The model uses the unsupervised text tokenizer YouTokenToMe to tokenize domains for use in a lightweight embedding model. Ideally, elements common to known pernicious domains (e.g. domain names containing words such as 'ads' or 'tracker') can be used to identify domains that would otherwise require parsing by hand or an exceptionally complicated regex.
The model is composed of a siamese embedding layer with a distance metric learning network. It is trained with a triplet loss that pushes apart the embeddings of dissimilar domains (e.g. login.microsoft.com vs. analytics.microsoft.com) while pulling together the embeddings of similar domains (e.g. login.github.com vs. github.com).
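
To make the triplet objective concrete, here is a minimal sketch in plain Python. It is not the project's TensorFlow implementation: the Euclidean distance, the margin of 1.0, the function names, and the 2-D embedding values are all assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings (made-up values, not model output).
anchor = [0.0, 0.0]      # github.com
positive = [0.0, 1.0]    # login.github.com (similar to the anchor)
far_negative = [3.0, 4.0]   # analytics.microsoft.com, already far away
near_negative = [0.0, 1.5]  # a pernicious domain embedded too close

# d_pos = 1, d_neg = 5: the margin is satisfied, so nothing to correct.
well_separated = triplet_loss(anchor, positive, far_negative)

# d_pos = 1, d_neg = 1.5: the margin is violated by 0.5.
too_close = triplet_loss(anchor, positive, near_negative)
```

In this sketch only the second triplet contributes to training, which is the usual reason for picking negatives that sit close to the anchor.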
- YouTokenToMe (YTTM) vocab size is 300 by default (a larger vocabulary results in overfitting)
- Model can be run in real time or on the archived Pi-hole SQL DNS query logs
- Online learning is still in development
- tensorflow
- numpy
- sqlalchemy
- youtokentome
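
For the archived-log mode mentioned above, the following sketch shows one way queried domains might be pulled out of a Pi-hole style query log. It uses the standard-library `sqlite3` module instead of the sqlalchemy dependency, and it assumes a `queries` table with `timestamp` and `domain` columns; check the schema of your own Pi-hole database before relying on it. The helper name `domain_counts` and the sample rows are hypothetical, and an in-memory database stands in for the real log file.

```python
import sqlite3


def domain_counts(conn, limit=1000):
    """Return (domain, hits) pairs, most frequently queried first."""
    rows = conn.execute(
        "SELECT domain, COUNT(*) AS hits FROM queries "
        "GROUP BY domain ORDER BY hits DESC, domain ASC LIMIT ?",
        (limit,),
    )
    return [(domain, hits) for domain, hits in rows]


# In-memory stand-in for the archived log (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queries ("
    "id INTEGER PRIMARY KEY, timestamp INTEGER, domain TEXT)"
)
conn.executemany(
    "INSERT INTO queries (timestamp, domain) VALUES (?, ?)",
    [
        (1700000000, "ads.example.com"),
        (1700000005, "login.example.com"),
        (1700000010, "ads.example.com"),
    ],
)

domains = domain_counts(conn)
conn.close()
```

Each distinct domain returned this way could then be tokenized and embedded once, rather than once per logged request.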