TL;DR: we use Algolia to implement a site search feature on ethereum.org.
Algolia allows us to index the content on ethereum.org and implement a powerful site search tool on ethereum.org. In order to create the index of our content, we use a web crawling tool called Docsearch. Docsearch takes a start_urls of ethereum.org and crawls the site to index the content based on a docsearchConfig file.
We kick off the crawling and indexing of ethereum.org through a GitHub Action that triggers on the merge to master
branch. View the GitHub Action.
Some important notes about the docsearch config file:
index_name
is the name of the algolia index where the generated index will be uploaded to.start_urls
are the urls that the crawler will start from. Some important attributes in thestart_urls
that we use are:lang
: regex path to different languages that the site is translated to that need crawling. Since ethereum.org is translated to 37+ languages, we need to be able to crawl the website in each language for indexing.page_rank
: the rank of pages that breaks ties when multiple query results have the same weight. This weight is derived from the selectors.
stop_urls
is used to strip out query parameters in the websites urls. We were running into issues where we were getting duplicate query results due to query parameters making urls unique. Stripping these out solved our deduplication problem.- selectors are used to specify what the crawler should look for when weighting content for the index.
We generate the docsearchConfig.json file using a script. This allows us to dynamically pull in the languages the websites support from the translations.json data file. Our GitHub action executes this script.