Updated zim.Archive to match libkiwix's behavior

rgaudin · rgaudin · commit a45d6b94699f · 2022-05-09T16:01:31.000Z
- Added `Archive.tags` and `Archive.get_tags()` for a parsed list of Tags
- `Archive.get_tags()` can return libkiwix's version which includes hints
- `Archive.article_counter` now returns libzim's `article_count` or the parsed count
based on the file's version

Also updated Changelog format
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,13 +1,40 @@
-# 1.4.3
+## Changelog
+
+All notable changes to this project are documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) (as of version 1.5.0).
+
+## [Unreleased]
+
+### Added
+
+- `zim.Archive.tags` and `zim.Archive.get_tags()` to retrieve parsed Tags
+  with optional `libkiwix` param to include libkiwix's hints
+- [tests] Counter tests now also use a libzim6 file.
+
+### Changed
+
+- `zim.Archive.article_counter` follows libkiwix's new behavior of
+  returning libzim's `article_count` for libzim 7+ ZIMs and
+  returning the previously returned (parsed) value for older ZIMs.
+
+### Removed
+
+- Unreachable code removed in `imaging` module.
+- [tests] “Sanskrit” removed from tests as its output is not predictable across platforms.
+
+
+## [1.4.3]
 
 * `zim.Archive.counters` won't fail on missing `Counter` metadata
 
-# 1.4.2
+## [1.4.2]
 
 * Fixed leak in `zim.Archive`'s `.counters`
 * New `.get_text_metadata()` method on `zim.Archive` to save UTF-8 decoding
 
-# 1.4.1
+## [1.4.1]
 
 * New `Counter` metadata based properties for Archive:
   * `.counters`: parsed dict of the Counter metadata
@@ -17,7 +44,7 @@
 * Added `uri` module with `rebuild_uri()`
 
 
-# 1.4.0
+## [1.4.0]
 
 * Using new python-libzim based on libzim v7
   * New Creator API
@@ -37,7 +64,7 @@
 * Fixed `image.save_image()` saving to disk even when using a bytes stream
 * Fixed `image.transformation.resize_image()` when resizing a byte stream without a dst
 
-# 1.3.6 (internal)
+## [1.3.6 (internal)]
 
 Intermediate release using unreleased libzim to support development of libzim7.
 Don't use it.
@@ -58,7 +85,7 @@ Don't use it.
 * Added delete_fpath to add_item_for() and fixed StaticItem's auto remove
 * Updated badges for new repo name
 
-# 1.3.5
+## [1.3.5]
 
 * add `stream_file()` to stream content from a URL into a file or a `BytesIO` object
 * deprecated `save_file()`
@@ -67,7 +94,7 @@ Don't use it.
 * Added support for in-memory optimization for PNG, JPEG, and WebP images
 * allows enabling debug logs via ZIMSCRAPERLIB_DEBUG environ
 
-# 1.3.4
+## [1.3.4]
 
 * added `wait` option in `YoutubeDownloader` to allow parallelism while using context manager
 * do not use extension for finding format in `ensure_matches()` in `image.optimization` module
@@ -76,21 +103,21 @@ Don't use it.
 * `save_image` moved from `image` to `image.utils`
 * added `convert_image` `optimize_image` `resize_image` functions to `image` module
 
-# 1.3.3
+## [1.3.3]
 
 * added `YoutubeDownloader` to `download` to download YT videos using a capped nb of threads
 
-# 1.3.2
+## [1.3.2]
 
 * fixed rewriting of links with empty target
 * added support for image optimization using `zimscraperlib.image.optimization` for webp, gif, jpeg and png formats
 * added `format_for()` in `zimscraperlib.image.probing` to get PIL image format from the suffix
 
-# 1.3.1
+## [1.3.1]
 
 * replaced BeautifulSoup parser in rewriting (`html.parser` –> `lxml`)
 
-# 1.3.0
+## [1.3.0]
 
 * detect mimetypes from filenames for all text files
 * fixed non-filename based StaticArticle
@@ -107,15 +134,15 @@ Don't use it.
 * changed `get_colors()` param names (`image_path` -> `src`)
 * changed `resize_image()` param names (`fpath` -> `src`)
 
-# 1.2.1
+## [1.2.1]
 
 * fixed URL rewriting when running from /
 * added support for link rewriting in `<object>` element
 * prevent from raising error if element doesn't have the attribute with url
 * use non greedy match for CSS URL links (shortest string matching `url()` format)
 * fix namespace of target only if link doesn't have a netloc
 
-# 1.2.0
+## [1.2.0]
 
 * added UTF8 to constants
 * added mime_type discovery via magic (filesystem)
@@ -128,46 +155,46 @@ Don't use it.
   * Added zim.rewriting: tools to rewrite links/urls in HTML/CSS
 * add timeout and retries to save_file() and make it return headers
 
-# 1.1.2
+## [1.1.2]
 
 * fixed `convert_image()` which tried to use a closed file
 
-# 1.1.1
+## [1.1.1]
 
 * exposed reencode, Config and get_media_info in zimscraperlib.video
 * added save_image() and convert_image() in zimscraperlib.imaging
 * added support for upscaling in resize_image() via allow_upscaling
 * resize_image() now supports params given by user and preserves image colorspace
 * fixed tests for zimscraperlib.imaging
 
-# 1.1.0
+## [1.1.0]
 
 * added video module with reencode, presets, config builder and video file probing
 * `make_zim_file()` accepts extra kwargs for zimwriterfs
 
-# 1.0.6
+## [1.0.6]
 
 * added translation support to i18n
 
-# 1.0.5
+## [1.0.5]
 
 * added s3transfer to verbose dependencies list
 * changed default log format to include module name
 
-# 1.0.4
+## [1.0.4]
 
 * verbose dependencies (urllib3, boto3) now logged at WARNING level by default
 * ability to set verbose dependencies log level and add modules to the list
 * zimscraperlib's logging level now aligned with scraper's requested one
 
 
-# 1.0.3
+## [1.0.3]
 
 * fix_ogvjs_dist script more generic (#1)
 * updated zim to support other zimwriterfs params (#10)
 * more flexible requirements for requests dependency
 
-# 1.0.2
+## [1.0.2]
 
 * fixed return value of `get_language_details` on non-existent language
 * fixed crash on `resize_image` with method `height`
@@ -179,11 +206,11 @@ Don't use it.
 * added `create_favicon` to generate a squared favicon
 * added `handle_user_provided_file` to handle user file/URL from param
 
-# 1.0.1
+## [1.0.1]
 
 * fixed fix_ogvjs_dist
 
-# 1.0.0
+## [1.0.0]
 
 * initial version providing
  * download: save_file, save_large_file
diff --git a/src/zimscraperlib/zim/_libkiwix.py b/src/zimscraperlib/zim/_libkiwix.py
@@ -9,11 +9,12 @@
 
 https://github.com/kiwix/libkiwix/blob/master/src/reader.cpp
 https://github.com/kiwix/libkiwix/blob/master/src/tools/archiveTools.cpp
+https://github.com/kiwix/libkiwix/blob/master/src/tools/otherTools.cpp
 """
 
 import io
 from collections import namedtuple
-from typing import Dict, Optional, Tuple
+from typing import Dict, List, Optional, Tuple
 
 MimetypeAndCounter = namedtuple("MimetypeAndCounter", ["mimetype", "value"])
 CounterMap = Dict[type(MimetypeAndCounter.mimetype), type(MimetypeAndCounter.value)]
@@ -105,3 +106,39 @@ def getMediaCount(counterMap: CounterMap) -> int:
             counter += count
 
     return counter
+
+
+def convertTags(tags_str: str) -> List[str]:
+    """List of tags expanded with libkiwix's additional hints for pic/vid/det/index"""
+    tags = tags_str.split(";")
+    tagsList = []
+    picSeen = vidSeen = detSeen = indexSeen = False
+    for tag in tags:
+        # not upstream: skip empty tags (empty metadata, stray `;`)
+        if not tag:
+            continue
+        picSeen |= tag == "nopic" or tag.startswith("_pictures:")
+        vidSeen |= tag == "novid" or tag.startswith("_videos:")
+        detSeen |= tag == "nodet" or tag.startswith("_details:")
+        indexSeen |= tag.startswith("_ftindex")
+
+        if tag == "nopic":
+            tagsList.append("_pictures:no")
+        elif tag == "novid":
+            tagsList.append("_videos:no")
+        elif tag == "nodet":
+            tagsList.append("_details:no")
+        elif tag == "_ftindex":
+            tagsList.append("_ftindex:yes")
+        else:
+            tagsList.append(tag)
+
+    if not indexSeen:
+        tagsList.append("_ftindex:no")
+    if not picSeen:
+        tagsList.append("_pictures:yes")
+    if not vidSeen:
+        tagsList.append("_videos:yes")
+    if not detSeen:
+        tagsList.append("_details:yes")
+    return tagsList
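
The expansion rules in `convertTags()` can be restated more compactly. The following is a minimal stand-alone sketch, not part of this commit: `expand_tags` and `_LEGACY_HINTS` are hypothetical names, and the mapping and default order are copied from the hunk above.

```python
from typing import List

# Legacy tag -> normalized hint, as mapped by convertTags() in the hunk above.
_LEGACY_HINTS = {
    "nopic": "_pictures:no",
    "novid": "_videos:no",
    "nodet": "_details:no",
    "_ftindex": "_ftindex:yes",
}

# (prefix to look for, hint appended when no tag carries that prefix),
# in the same order convertTags() appends its defaults.
_DEFAULT_HINTS = [
    ("_ftindex", "_ftindex:no"),
    ("_pictures:", "_pictures:yes"),
    ("_videos:", "_videos:yes"),
    ("_details:", "_details:yes"),
]


def expand_tags(tags_str: str) -> List[str]:
    """Hypothetical equivalent of convertTags(): normalize, then add defaults."""
    # Drop empty tags (the "not upstream" step), then normalize legacy ones.
    tags = [tag for tag in tags_str.split(";") if tag]
    expanded = [_LEGACY_HINTS.get(tag, tag) for tag in tags]
    # Append a default hint for every family that no tag mentioned.
    for prefix, default in _DEFAULT_HINTS:
        if not any(tag.startswith(prefix) for tag in expanded):
            expanded.append(default)
    return expanded


print(expand_tags("wikipedia;nopic;_ftindex"))
```

Checking for the prefix after normalization is enough because every legacy tag maps to a hint that starts with its family's prefix. Note that plain `str.split(";")` on an empty string yields `[""]`, which is why the empty-tag filter matters for ZIMs without `Tags` metadata.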
diff --git a/src/zimscraperlib/zim/archive.py b/src/zimscraperlib/zim/archive.py
@@ -10,13 +10,13 @@
     - direct access to search results and number of results
     - public Entry access by Id"""
 
-from typing import Dict, Iterable, Optional
+from typing import Dict, Iterable, List, Optional
 
 import libzim.reader
 import libzim.search  # Query, Searcher
 import libzim.suggestion  # SuggestionSearcher
 
-from ._libkiwix import getArticleCount, getMediaCount, parseMimetypeCounter
+from ._libkiwix import convertTags, getArticleCount, getMediaCount, parseMimetypeCounter
 from .items import Item
 
 
@@ -36,6 +36,22 @@ def metadata(self) -> Dict[str, str]:
             if not key.startswith("Illustration_")
         }
 
+    @property
+    def tags(self):
+        return self.get_tags()
+
+    def get_tags(self, libkiwix: bool = False) -> List[str]:
+        """List of ZIM tags, optionally expanded with libkiwix's hints"""
+        try:
+            tags_meta = self.get_text_metadata("Tags")
+        except RuntimeError:  # pragma: nocover
+            tags_meta = ""
+
+        if libkiwix:
+            return convertTags(tags_meta)
+
+        return tags_meta.split(";")
+
     def get_text_metadata(self, name: str) -> str:
         """Decoded value of a text metadata"""
         return super().get_metadata(name).decode("UTF-8")
@@ -94,6 +110,20 @@ def counters(self) -> Dict[str, int]:
     @property
     def article_counter(self) -> int:
         """Nb of *articles* in the ZIM, using counters (from libkiwix)"""
+
+        # [libkiwix HACK]
+        # getArticleCount() returns different things depending on
+        # the "version" of the ZIM.
+        # On old ZIMs (<=6), it returns the number of entries in the `A` namespace.
+        # On recent ZIMs (>=7), it returns:
+        #  - the number of entries in the `C` namespace (==getEntryCount)
+        #    if no frontArticleIndex is present
+        #  - the number of front articles if a frontArticleIndex is present
+        # The >=7 case without frontArticleIndex is rare enough to be ignored.
+        # We can detect whether we are reading a ZIM <= 6
+        # by checking whether it has a newNamespaceScheme.
+        if self.has_new_namespace_scheme:
+            return self.article_count
         return getArticleCount(self.counters)
 
     @property
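
The version-dependent choice made in `article_counter` can be illustrated without a ZIM file. This is a hedged sketch only: `pick_article_counter` and `count_html_entries` are hypothetical names, and summing the `text/html*` counters is an assumption about what `getArticleCount()` does, since its body is not shown in this diff.

```python
from typing import Dict


def count_html_entries(counters: Dict[str, int]) -> int:
    """Assumed stand-in for getArticleCount(): sum of text/html* counters."""
    return sum(
        count
        for mimetype, count in counters.items()
        if mimetype.startswith("text/html")
    )


def pick_article_counter(
    has_new_namespace_scheme: bool,
    article_count: int,
    counters: Dict[str, int],
) -> int:
    """Mirror the branch added to Archive.article_counter."""
    # New namespace scheme (libzim 7+ ZIM): trust libzim's own article count.
    if has_new_namespace_scheme:
        return article_count
    # Older ZIM: fall back to the value parsed from the Counter metadata.
    return count_html_entries(counters)


counters = {"text/html": 100, "image/png": 7}
print(pick_article_counter(True, 120, counters))
print(pick_article_counter(False, 120, counters))
```

Passing the three inputs explicitly keeps the sketch testable; in the commit they come from the `Archive` instance itself (`has_new_namespace_scheme`, `article_count` and `counters`).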
diff --git a/tests/conftest.py b/tests/conftest.py
@@ -140,6 +140,19 @@ def small_zim_file(tmpdir_factory):
     return dst
 
 
+@pytest.fixture(scope="session")
+def ns_zim_file(tmpdir_factory):
+    from zimscraperlib.download import stream_file
+
+    dst = tmpdir_factory.mktemp("data").join("ns.zim")
+    stream_file(
+        "https://github.com/openzim/zim-testing-suite/raw/v0.4/data/withns/"
+        "wikibooks_be_all_nopic_2017-02.zim",
+        dst,
+    )
+    return dst
+
+
 @pytest.mark.slow
 @pytest.fixture(scope="session")
 def real_zim_file(tmpdir_factory):
diff --git a/tests/zim/test_archive.py b/tests/zim/test_archive.py