You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
# for link extraction and webgraph construction also:
idna
# for parsing HTML (used in cc_index_word_count.py)
beautifulsoup4
lxml
# for HDFS support (requires environments variables JAVA_HOME and HADOOP_HOME):
#pydoop
# to parse WARC/WAT/WET files using FastWARC (https://pypi.org/project/FastWARC/)
#fastwarc
# (tested with)
#fastwarc==0.14.1
# to parse HTML (used in cc_index_word_count.py) using Resiliparse (https://pypi.org/project/Resiliparse/). Resiliparse requires compatible fastwarc version.