Commit a45d6b9

Updated zim.Archive to libkiwix's behavior
- Added `Archive.tags` and `Archive.get_tags()` for a parsed list of Tags
- `Archive.get_tags()` can return libkiwix's version, which includes hints
- `Archive.article_counter` now returns libzim's `article_count` or the parsed count, based on the file's version

Also updated the Changelog format
1 parent fe0b5fa commit a45d6b9

5 files changed: 207 additions and 29 deletions

CHANGELOG.md

Lines changed: 50 additions & 23 deletions
@@ -1,13 +1,40 @@
-# 1.4.3
+## Changelog
+
+All notable changes to this project are documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) (as of version 1.5.0).
+
+## [Unreleased]
+
+### Added
+
+- `zim.Archive.tags` and `zim.Archive.get_tags()` to retrieve parsed Tags
+  with optional `libkiwix` param to include libkiwix's hints
+- [tests] Counter tests now also use a libzim6 file.
+
+### Changed
+
+- `zim.Archive.article_counter` follows libkiwix's new behavior of
+  returning libzim's `article_count` for libzim 7+ ZIMs and
+  returning previously returned (parsed) value for older ZIMs.
+
+### Removed
+
+- Unreachable code removed in `imaging` module.
+- [tests] “Sanskrit” removed from tests as its output is not predictable across platforms.
+
+
+## [1.4.3]

 * `zim.Archive.counters` won't fail on missing `Counter` metadata

-# 1.4.2
+## [1.4.2]

 * Fixed leak in `zim.Archive`'s `.counters`
 * New `.get_text_metadata()` method on `zim.Archive` to save UTF-8 decoding

-# 1.4.1
+## [1.4.1]

 * New `Counter` metadata based properties for Archive:
   * `.counters`: parsed dict of the Counter metadata
@@ -17,7 +44,7 @@
 * Added `uri` module with `rebuild_uri()`


-# 1.4.0
+## [1.4.0]

 * Using new python-libzim based on libzim v7
 * New Creator API
@@ -37,7 +64,7 @@
 * Fixed `image.save_image()` saving to disk even when using a bytes stream
 * Fixed `image.transformation.resize_image()` when resizing a byte stream without a dst

-# 1.3.6 (internal)
+## [1.3.6 (internal)]

 Intermediate release using unreleased libzim to support development of libzim7.
 Don't use it.
@@ -58,7 +85,7 @@ Don't use it.
 * Added delete_fpath to add_item_for() and fixed StaticItem's auto remove
 * Updated badges for new repo name

-# 1.3.5
+## [1.3.5]

 * add `stream_file()` to stream content from a URL into a file or a `BytesIO` object
 * deprecated `save_file()`
@@ -67,7 +94,7 @@ Don't use it.
 * Added support for in-memory optimization for PNG, JPEG, and WebP images
 * allows enabling debug logs via ZIMSCRAPERLIB_DEBUG environ

-# 1.3.4
+## [1.3.4]

 * added `wait` option in `YoutubeDownloader` to allow parallelism while using context manager
 * do not use extension for finding format in `ensure_matches()` in `image.optimization` module
@@ -76,21 +103,21 @@ Don't use it.
 * `save_image` moved from `image` to `image.utils`
 * added `convert_image` `optimize_image` `resize_image` functions to `image` module

-# 1.3.3
+## [1.3.3]

 * added `YoutubeDownloader` to `download` to download YT videos using a capped nb of threads

-# 1.3.2
+## [1.3.2]

 * fixed rewriting of links with empty target
 * added support for image optimization using `zimscraperlib.image.optimization` for webp, gif, jpeg and png formats
 * added `format_for()` in `zimscraperlib.image.probing` to get PIL image format from the suffix

-# 1.3.1
+## [1.3.1]

 * replaced BeautifulSoup parser in rewriting (`html.parser` –> `lxml`)

-# 1.3.0
+## [1.3.0]

 * detect mimetypes from filenames for all text files
 * fixed non-filename based StaticArticle
@@ -107,15 +134,15 @@ Don't use it.
 * changed `get_colors()` param names (`image_path` -> `src`)
 * changed `resize_image()` param names (`fpath` -> `src`)

-# 1.2.1
+## [1.2.1]

 * fixed URL rewriting when running from /
 * added support for link rewriting in `<object>` element
 * prevent from raising error if element doesn't have the attribute with url
 * use non greedy match for CSS URL links (shortest string matching `url()` format)
 * fix namespace of target only if link doesn't have a netloc

-# 1.2.0
+## [1.2.0]

 * added UTF8 to constants
 * added mime_type discovery via magic (filesystem)
@@ -128,46 +155,46 @@ Don't use it.
 * Added zim.rewriting: tools to rewrite links/urls in HTML/CSS
 * add timeout and retries to save_file() and make it return headers

-# 1.1.2
+## [1.1.2]

 * fixed `convert_image()` which tried to use a closed file

-# 1.1.1
+## [1.1.1]

 * exposed reencode, Config and get_media_info in zimscraperlib.video
 * added save_image() and convert_image() in zimscraperlib.imaging
 * added support for upscaling in resize_image() via allow_upscaling
 * resize_image() now supports params given by user and preserves image colorspace
 * fixed tests for zimscraperlib.imaging

-# 1.1.0
+## [1.1.0]

 * added video module with reencode, presets, config builder and video file probing
 * `make_zim_file()` accepts extra kwargs for zimwriterfs

-# 1.0.6
+## [1.0.6]

 * added translation support to i18n

-# 1.0.5
+## [1.0.5]

 * added s3transfer to verbose dependencies list
 * changed default log format to include module name

-# 1.0.4
+## [1.0.4]

 * verbose dependencies (urllib3, boto3) now logged at WARNING level by default
 * ability to set verbose dependencies log level and add modules to the list
 * zimscraperlib's logging level now aligned with scraper's requested one


-# 1.0.3
+## [1.0.3]

 * fix_ogvjs_dist script more generic (#1)
 * updated zim to support other zimwriterfs params (#10)
 * more flexible requirements for requests dependency

-# 1.0.2
+## [1.0.2]

 * fixed return value of `get_language_details` on non-existent language
 * fixed crash on `resize_image` with method `height`
@@ -179,11 +206,11 @@ Don't use it.
 * added `create_favicon` to generate a squared favicon
 * added `handle_user_provided_file` to handle user file/URL from param

-# 1.0.1
+## [1.0.1]

 * fixed fix_ogvjs_dist

-# 1.0.0
+## [1.0.0]

 * initial version providing
   * download: save_file, save_large_file

src/zimscraperlib/zim/_libkiwix.py

Lines changed: 38 additions & 1 deletion
@@ -9,11 +9,12 @@

 https://github.com/kiwix/libkiwix/blob/master/src/reader.cpp
 https://github.com/kiwix/libkiwix/blob/master/src/tools/archiveTools.cpp
+https://github.com/kiwix/libkiwix/blob/master/src/tools/otherTools.cpp
 """

 import io
 from collections import namedtuple
-from typing import Dict, Optional, Tuple
+from typing import Dict, List, Optional, Tuple

 MimetypeAndCounter = namedtuple("MimetypeAndCounter", ["mimetype", "value"])
 CounterMap = Dict[type(MimetypeAndCounter.mimetype), type(MimetypeAndCounter.value)]
@@ -105,3 +106,39 @@ def getMediaCount(counterMap: CounterMap) -> int:
             counter += count

     return counter
+
+
+def convertTags(tags_str: str) -> List[str]:
+    """List of tags expanded with libkiwix's additional hints for pic/vid/det/index"""
+    tags = tags_str.split(";")
+    tagsList = []
+    picSeen = vidSeen = detSeen = indexSeen = False
+    for tag in tags:
+        # not upstream
+        if not tag:
+            continue
+        picSeen |= tag == "nopic" or tag.startswith("_pictures:")
+        vidSeen |= tag == "novid" or tag.startswith("_videos:")
+        detSeen |= tag == "nodet" or tag.startswith("_details:")
+        indexSeen |= tag.startswith("_ftindex")
+
+        if tag == "nopic":
+            tagsList.append("_pictures:no")
+        elif tag == "novid":
+            tagsList.append("_videos:no")
+        elif tag == "nodet":
+            tagsList.append("_details:no")
+        elif tag == "_ftindex":
+            tagsList.append("_ftindex:yes")
+        else:
+            tagsList.append(tag)
+
+    if not indexSeen:
+        tagsList.append("_ftindex:no")
+    if not picSeen:
+        tagsList.append("_pictures:yes")
+    if not vidSeen:
+        tagsList.append("_videos:yes")
+    if not detSeen:
+        tagsList.append("_details:yes")
+    return tagsList
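
Review note: to make the expansion rules in `convertTags()` easier to check, here is the same logic restated as a standalone sketch with two sample inputs. The input strings are illustrative assumptions, not values taken from a real ZIM file.

```python
from typing import List


def convertTags(tags_str: str) -> List[str]:
    """Tags expanded with libkiwix-style hints for pic/vid/det/index."""
    tagsList = []
    picSeen = vidSeen = detSeen = indexSeen = False
    for tag in tags_str.split(";"):
        if not tag:  # skip empty entries such as those produced by "" or "a;;b"
            continue
        picSeen |= tag == "nopic" or tag.startswith("_pictures:")
        vidSeen |= tag == "novid" or tag.startswith("_videos:")
        detSeen |= tag == "nodet" or tag.startswith("_details:")
        indexSeen |= tag.startswith("_ftindex")

        # legacy short tags are rewritten to their "_key:value" form
        if tag == "nopic":
            tagsList.append("_pictures:no")
        elif tag == "novid":
            tagsList.append("_videos:no")
        elif tag == "nodet":
            tagsList.append("_details:no")
        elif tag == "_ftindex":
            tagsList.append("_ftindex:yes")
        else:
            tagsList.append(tag)

    # any hint that was never mentioned gets its default value appended
    if not indexSeen:
        tagsList.append("_ftindex:no")
    if not picSeen:
        tagsList.append("_pictures:yes")
    if not vidSeen:
        tagsList.append("_videos:yes")
    if not detSeen:
        tagsList.append("_details:yes")
    return tagsList


if __name__ == "__main__":
    # illustrative inputs (assumed, not from a real archive)
    print(convertTags("wikipedia;nopic;_ftindex"))
    print(convertTags(""))
```

With these inputs, the first call keeps `wikipedia`, rewrites `nopic` and `_ftindex`, and appends defaults for videos and details; the second call returns only the four default hints.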

src/zimscraperlib/zim/archive.py

Lines changed: 32 additions & 2 deletions
@@ -10,13 +10,13 @@
 - direct access to search results and number of results
 - public Entry access by Id"""

-from typing import Dict, Iterable, Optional
+from typing import Dict, Iterable, List, Optional

 import libzim.reader
 import libzim.search  # Query, Searcher
 import libzim.suggestion  # SuggestionSearcher

-from ._libkiwix import getArticleCount, getMediaCount, parseMimetypeCounter
+from ._libkiwix import convertTags, getArticleCount, getMediaCount, parseMimetypeCounter
 from .items import Item


@@ -36,6 +36,22 @@ def metadata(self) -> Dict[str, str]:
             if not key.startswith("Illustration_")
         }

+    @property
+    def tags(self):
+        return self.get_tags()
+
+    def get_tags(self, libkiwix: bool = False) -> List[str]:
+        """List of ZIM tags, optionally expanded with libkiwix's hints"""
+        try:
+            tags_meta = self.get_text_metadata("Tags")
+        except RuntimeError:  # pragma: nocover
+            tags_meta = ""
+
+        if libkiwix:
+            return convertTags(tags_meta)
+
+        return tags_meta.split(";")
+
     def get_text_metadata(self, name: str) -> str:
         """Decoded value of a text metadata"""
         return super().get_metadata(name).decode("UTF-8")
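
Review note: when the `Tags` metadata is missing, `get_tags()` falls back to an empty string, and `"".split(";")` yields `[""]` rather than an empty list. The sketch below contrasts the plain split with a filtering variant; both helper names (`split_tags`, `split_tags_nonempty`) are hypothetical and not part of the commit.

```python
from typing import List


def split_tags(tags_meta: str) -> List[str]:
    """Plain split, mirroring the non-libkiwix branch of get_tags()."""
    return tags_meta.split(";")


def split_tags_nonempty(tags_meta: str) -> List[str]:
    """Hypothetical variant that drops empty entries."""
    return [tag for tag in tags_meta.split(";") if tag]


if __name__ == "__main__":
    print(split_tags(""))           # one empty string, not an empty list
    print(split_tags_nonempty(""))  # empty list
```

Whether callers should see `[""]` or `[]` for an archive without tags is a design decision for the maintainers; the libkiwix branch already skips empty entries via the `# not upstream` check.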
@@ -94,6 +110,20 @@ def counters(self) -> Dict[str, int]:
     @property
     def article_counter(self) -> int:
         """Nb of *articles* in the ZIM, using counters (from libkiwix)"""
+
+        # [libkiwix HACK]
+        # getArticleCount() returns different things depending on
+        # the "version" of the zim.
+        # On old zim (<=6), it returns the number of entries in the `A` namespace
+        # On recent zim (>=7), it returns:
+        # - the number of entries in the `C` namespace (==getEntryCount)
+        #   if no frontArticleIndex is present
+        # - the number of front articles if a frontArticleIndex is present
+        # The use case >=7 without frontArticleIndex is pretty rare so we don't care
+        # We can detect if we are reading a zim <= 6
+        # by checking if we have a newNamespaceScheme.
+        if self.has_new_namespace_scheme:
+            return self.article_count
         return getArticleCount(self.counters)

     @property
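
Review note: the version dispatch in `article_counter` can be exercised without a real ZIM file. The sketch below uses a hypothetical `FakeArchive` stand-in; `legacy_article_count` is an assumed approximation of `getArticleCount()` (summing counters whose mimetype starts with `text/html`), since that function's body is not part of this diff.

```python
from dataclasses import dataclass, field
from typing import Dict


def legacy_article_count(counters: Dict[str, int]) -> int:
    """Assumed stand-in for getArticleCount(): sum of HTML mimetype counters."""
    return sum(
        count
        for mimetype, count in counters.items()
        if mimetype.startswith("text/html")
    )


@dataclass
class FakeArchive:
    """Hypothetical stand-in exposing only the attributes the property reads."""

    has_new_namespace_scheme: bool
    article_count: int
    counters: Dict[str, int] = field(default_factory=dict)

    @property
    def article_counter(self) -> int:
        # new namespace scheme (libzim 7+ files): trust libzim's own count
        if self.has_new_namespace_scheme:
            return self.article_count
        # older files: fall back to the parsed Counter metadata
        return legacy_article_count(self.counters)


if __name__ == "__main__":
    new_zim = FakeArchive(True, 10, {"text/html": 7})
    old_zim = FakeArchive(False, 10, {"text/html": 7, "image/png": 3})
    print(new_zim.article_counter, old_zim.article_counter)
```

A test against the `ns_zim_file` fixture would cover the older-file branch with real data; this sketch only checks the branching itself.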

tests/conftest.py

Lines changed: 13 additions & 0 deletions
@@ -140,6 +140,19 @@ def small_zim_file(tmpdir_factory):
     return dst


+@pytest.fixture(scope="session")
+def ns_zim_file(tmpdir_factory):
+    from zimscraperlib.download import stream_file
+
+    dst = tmpdir_factory.mktemp("data").join("ns.zim")
+    stream_file(
+        "https://github.com/openzim/zim-testing-suite/raw/v0.4/data/withns/"
+        "wikibooks_be_all_nopic_2017-02.zim",
+        dst,
+    )
+    return dst
+
+
 @pytest.mark.slow
 @pytest.fixture(scope="session")
 def real_zim_file(tmpdir_factory):
