Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DecodedURL #54

Merged
merged 28 commits into from
Jan 7, 2018
Merged
Show file tree
Hide file tree
Changes from 16 commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
98edfc6
WIP DecodedURL with path and replace
mahmoud Dec 17, 2017
dce971b
DecodedURL.replace(query) and DecodedURL.query working. Fix up userin…
mahmoud Dec 18, 2017
2162c67
DecodedURL userinfo now working
mahmoud Dec 18, 2017
d1bfc68
add new arguments to _percent_decode and use them in the DecodedURL
mahmoud Dec 30, 2017
adc7370
added DecodedURL to the public API, started on tests, added DecodedUR…
mahmoud Dec 31, 2017
fc28ee3
add host and port to DecodedURL
mahmoud Dec 31, 2017
358402f
add DecodedURL.scheme property, plus some more tests
mahmoud Dec 31, 2017
34d4212
add and test a bunch of DecodedURL passthrough methods
mahmoud Dec 31, 2017
161e93a
test DecodedURL userinfo-related properties. Add generic percent enco…
mahmoud Dec 31, 2017
95ea9ea
add, test, and slightly refactor query manipulation methods. still pr…
mahmoud Dec 31, 2017
976c083
fix and test the aforementioned issue with query parameters which are…
mahmoud Dec 31, 2017
02a841c
bit of housekeeping on DecodedURL, rearranging and adding a couple mi…
mahmoud Dec 31, 2017
9a0438e
DecodedURL: made .query a tuple. more obvious errors on attempted mut…
mahmoud Dec 31, 2017
662ec79
DecodedURL.replace() now fixed and tested working for roundtripping U…
mahmoud Dec 31, 2017
3aa4b61
add some Twisted-style methods to DecodedURL for consistency with URL…
mahmoud Dec 31, 2017
82deb29
some notes and docs on DecodedURL
mahmoud Dec 31, 2017
0b311d7
split out host decoding to its own function, remove duplicated code
mahmoud Jan 1, 2018
30e19a6
fix delimiter typo in docstring, thanks alex!
mahmoud Jan 1, 2018
696e8fd
add parse convenience function and EncodedURL alias for URL to top-le…
mahmoud Jan 1, 2018
67ab0ec
add lazy keyword to DecodedURL methods where appropriate. Add some co…
mahmoud Jan 1, 2018
6a90f4a
DecodedURL pretty much fully covered by tests
mahmoud Jan 6, 2018
ad63b9b
docstrings for all DecodedURL members
mahmoud Jan 6, 2018
afc907b
cover another line in _percent_decode
mahmoud Jan 6, 2018
9f3212b
Merge branch 'master' into i44_decoded_url
mahmoud Jan 6, 2018
ac3be79
fixing test coverage for parse
mahmoud Jan 6, 2018
dd8248c
Merge branch 'i44_decoded_url' of github.com:python-hyper/hyperlink i…
mahmoud Jan 6, 2018
6dd3272
DecodedURL: address most of the comments about docstrings
mahmoud Jan 7, 2018
e8616fa
add a few more DecodedURL/parse tests and a couple comments
mahmoud Jan 7, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions TODO.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
* Switch off ctypes/socket for IP validation
* rebase method for path (prepends to path)
* switch default percent encoding to upper case (a la RFC 3986 2.1)
* sibling() should be maximal=False like child()

## normalize method

Expand Down
3 changes: 2 additions & 1 deletion hyperlink/__init__.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,9 @@

from ._url import URL, URLParseError, register_scheme, parse_host
from ._url import URL, DecodedURL, URLParseError, register_scheme, parse_host

__all__ = [
"URL",
"DecodedURL",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to say that we should just avoid exposing this entirely, but it probably needs to be exposed for type annotations.

However, as per #44, could we have a "decoded" property on URL that provides an interface to this, and an "encoded" property on DecodedURL that maps back to a URL?

(At this point I think I'm in favor of adding an EncodedURL alias for URL, then maybe adding a top-level entry point like hyperlink.parse() which takes a decoded kwarg flag which defaults to true, to make it easier to get started with DecodedURL which is what I think we all want most of the time. That can definitely be deferred to a separate ticket though.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am pretty much in favor of all those conveniences :) And I also agree that this probably needs to be exposed.

"URLParseError",
"register_scheme",
"parse_host"
Expand Down
299 changes: 296 additions & 3 deletions hyperlink/_url.py
Original file line number Diff line number Diff line change
Expand Up @@ -202,13 +202,27 @@ def _make_quote_map(safe_chars):
_QUERY_DECODE_MAP = _make_decode_map(_QUERY_DELIMS)
_FRAGMENT_QUOTE_MAP = _make_quote_map(_FRAGMENT_SAFE)
_FRAGMENT_DECODE_MAP = _make_decode_map(_FRAGMENT_DELIMS)
_UNRESERVED_QUOTE_MAP = _make_quote_map(_UNRESERVED_CHARS)
_UNRESERVED_DECODE_MAP = dict([(k, v) for k, v in _HEX_CHAR_MAP.items()
if v.decode('ascii', 'replace')
in _UNRESERVED_CHARS])

_ROOT_PATHS = frozenset(((), (u'',)))


def _encode_reserved(text, maximal=True):
"""A very comprehensive percent encoding for encoding all
delimeters. Used for arguments to DecodedURL, where a % means a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "delimiters"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh, this kind of typo is so very unlike me, I assure you! ;) Thanks!

percent sign, and not the character used by URLs for escaping
bytes.
"""
if maximal:
bytestr = normalize('NFC', text).encode('utf8')
return u''.join([_UNRESERVED_QUOTE_MAP[b] for b in bytestr])
return u''.join([_UNRESERVED_QUOTE_MAP[t] if t in _UNRESERVED_CHARS
else t for t in text])


def _encode_path_part(text, maximal=True):
"Percent-encode a single segment of a URL path."
if maximal:
Expand Down Expand Up @@ -485,7 +499,8 @@ def _decode_fragment_part(text, normalize_case=False):
_decode_map=_FRAGMENT_DECODE_MAP)


def _percent_decode(text, normalize_case=False, _decode_map=_HEX_CHAR_MAP):
def _percent_decode(text, normalize_case=False, subencoding='utf-8',
raise_subencoding_exc=False, _decode_map=_HEX_CHAR_MAP):
"""Convert percent-encoded text characters to their normal,
human-readable equivalents.

Expand Down Expand Up @@ -547,9 +562,13 @@ def _percent_decode(text, normalize_case=False, _decode_map=_HEX_CHAR_MAP):

unquoted_bytes = b''.join(res)

if subencoding is False:
return unquoted_bytes
try:
return unquoted_bytes.decode("utf-8")
return unquoted_bytes.decode(subencoding)
except UnicodeDecodeError:
if raise_subencoding_exc:
raise
return text


Expand Down Expand Up @@ -1040,7 +1059,7 @@ def from_text(cls, text):
rooted, userinfo, uses_netloc)

def normalize(self, scheme=True, host=True, path=True, query=True,
fragment=True):
fragment=True, userinfo=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring should mention userinfo.

There's a TODO for userinfo. Should that go?

"""Return a new URL object with several standard normalizations
applied:

Expand Down Expand Up @@ -1091,6 +1110,11 @@ def normalize(self, scheme=True, host=True, path=True, query=True,
if fragment:
kw['fragment'] = _decode_unreserved(self.fragment,
normalize_case=True)
if userinfo:
kw['userinfo'] = u':'.join([_decode_unreserved(p,
normalize_case=True)
for p in self.userinfo.split(':', 1)])

return self.replace(**kw)

def child(self, *segments):
Expand Down Expand Up @@ -1449,3 +1473,272 @@ def remove(self, name):
"""
return self.replace(query=((k, v) for (k, v) in self.query
if k != name))


class DecodedURL(object):
"""DecodedURL is a type meant to act as a higher-level interface to
the URL. It is the `unicode` to URL's `bytes`. `DecodedURL` has
almost exactly the same API as `URL`, but everything going in and
out is in its maximally decoded state. All percent decoding is
handled automatically.

Where applicable, a UTF-8 encoding is presumed. Be advised that
some interactions, can raise UnicodeEncodeErrors and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comma splice:

some interactions , can raise UnicodeEncodeErrors...

Also, maybe you want backticks around UnicodeEncodeErrors and UnicodeDecodeErrors.

UnicodeDecodeErrors, just like when working with
bytestrings.

Examples of such interactions include handling query strings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be part of the previous paragraph?

encoding binary data, and paths containing segments with special
characters encoded with codecs other than UTF-8.
"""
def __init__(self, url):
self._url = url

@classmethod
def from_text(cls, text):
_url = URL.from_text(text)
return cls(_url)

def to_text(self, *a, **kw):
return self._url.to_text(*a, **kw)

def to_uri(self, *a, **kw):
return self._url.to_uri(*a, **kw)

def to_iri(self, *a, **kw):
return self._url.to_iri(*a, **kw)

def click(self, href=u''):
return type(self)(self._url.click(href=href))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not self.__class__?


def sibling(self, segment):
return type(self)(self._url.sibling(_encode_reserved(segment)))

def child(self, *segments):
if not segments:
return self
new_segs = [_encode_reserved(s) for s in segments]
return type(self)(self._url.child(*new_segs))

def normalize(self, *a, **kw):
return type(self)(self._url.normalize(*a, **kw))

@property
def absolute(self):
return self._url.absolute

@property
def scheme(self):
return self._url.scheme

@property
def host(self):
host = self._url.host
try:
host_bytes = host.encode("ascii")
except UnicodeEncodeError:
host_text = host
else:
try:
host_text = host_bytes.decode("idna")
except ValueError:
# only reached on "narrow" (UCS-2) Python builds <3.4, see #7
# NOTE: not going to raise here, because there's no
# ambiguity in the IDNA, and the host is still
# technically usable
host_text = host

return host_text

@property
def port(self):
return self._url.port

@property
def rooted(self):
return self._url.rooted

@property
def path(self):
return tuple([_percent_decode(p, raise_subencoding_exc=True)
for p in self._url.path])

@property
def query(self):
return tuple([tuple(_percent_decode(x, raise_subencoding_exc=True)
if x is not None else None
for x in (k, v))
for k, v in self._url.query])

@property
def fragment(self):
return _percent_decode(self._url.fragment, raise_subencoding_exc=True)

@property
def userinfo(self):
return tuple([_percent_decode(p, raise_subencoding_exc=True)
for p in self._url.userinfo.split(':', 1)])

@property
def user(self):
return self.userinfo[0]

@property
def password(self):
return self.userinfo[1]

@property
def uses_netloc(self):
return self._url.uses_netloc

def replace(self, scheme=_UNSET, host=_UNSET, path=_UNSET, query=_UNSET,
fragment=_UNSET, port=_UNSET, rooted=_UNSET, userinfo=_UNSET,
uses_netloc=_UNSET):
"""This replace differs a little from URL.replace. For instance, it
accepts userinfo as a tuple, not as a string. As with the rest
of the methods on DecodedURL, if you pass a reserved
character, it will be automatically encoded instead of an
error being raised.
"""
if path is not _UNSET:
path = [_encode_reserved(p) for p in path]
if query is not _UNSET:
query = [[_encode_reserved(x)
if x is not None else None
for x in (k, v)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above about nullable query parameter keys.

for k, v in iter_pairs(query)]
if userinfo is not _UNSET:
if not len(userinfo) == 2:
raise ValueError('userinfo expected sequence of'
' ["user", "password"], got %r' % userinfo)
userinfo = u':'.join([_encode_reserved(p) for p in userinfo])
new_url = self._url.replace(scheme=scheme,
host=host,
path=path,
query=query,
fragment=fragment,
port=port,
rooted=rooted,
userinfo=userinfo,
uses_netloc=uses_netloc)
return type(self)(url=new_url)

def get(self, name):
# TODO: another reason to do this in the __init__
return [v for (k, v) in self.query if name == k]

def add(self, name, value=None):
return self.replace(query=self.query + ((name, value),))

def set(self, name, value=None):
query = self.query
q = [(k, v) for (k, v) in query if k != name]
idx = next((i for (i, (k, v)) in enumerate(query) if k == name), -1)
q[idx:idx] = [(name, value)]
return self.replace(query=q)

def remove(self, name):
return self.replace(query=((k, v) for (k, v) in self.query
if k != name))

def __repr__(self):
cn = self.__class__.__name__
return '%s(url=%r)' % (cn, self._url)

def __str__(self):
# TODO: the underlying URL's __str__ needs to change to make
# this work as the URL
return str(self._url)

def __eq__(self, other):
if not isinstance(other, self.__class__):
return NotImplemented
return self.normalize().to_uri() == other.normalize().to_uri()

def __ne__(self, other):
if not isinstance(other, self.__class__):
return NotImplemented
return not self.__eq__(other)

def __hash__(self):
return hash((self.__class__, self.scheme, self.userinfo, self.host,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: Since equality is delegated upwards to URL, would it be better to do the same for hashing?

e.g.

def __hash__(self):
    return hash(self.normalize().to_uri())

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I may have subtly changed my mind midway through development, and now I lean more toward not delegating up to URL for equality. I'll get that fixed, thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever you're delegating hash like this, remember you should also include a tweak, so that the DecodedURL and the URL representing the same data don't hash the same.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glyph why is that? (correctness shouldn't be effected since dictionaries compare by equality, and I'm not entirely sure what the performance problem you're trying to forestall)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share @moshez's curiosity.

@mahmoud - will you follow @alexwlchan's advice about delegating __hash__?

self.path, self.query, self.fragment, self.port,
self.rooted, self.uses_netloc))

# # Begin Twisted Compat Code
asURI = to_uri
asIRI = to_iri

@classmethod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this its own classmethod when the instance methods are aliased?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Twisted-style single character argument name copied for consistency with URL.

def fromText(cls, s):
return cls.from_text(s)

def asText(self, includeSecrets=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Twisted-style camelCase argument name.

return self.to_text(with_password=includeSecrets)

def __dir__(self):
try:
ret = object.__dir__(self)
except AttributeError:
# object.__dir__ == AttributeError # pdw for py2
ret = dir(self.__class__) + list(self.__dict__.keys())
ret = sorted(set(ret) - set(['fromText', 'asURI', 'asIRI', 'asText']))
return ret

# # End Twisted Compat Code




"""Probably turn the properties into normal attributes now that they
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like most of this is done. Please remove everything except what's left to do for posterity's sake.

raise exceptions, or at least cachedproperties.

* Decode
* Percent
* userinfo
* user
* path
* query
* fragment
* Wrap in decoder
* .get()
* Wrap in encoder
* replace
* add()
* set()
* child (split and encode?)
* sibling
* remove()
* Passthrough
* __eq__ / __ne__ / __hash__
* absolute()
* Return new DecodedURL with new ._url (the other kind of passthrough)
* normalize()
* click()
* Strict passthrough (doesn't return a DecodedURL)
* to_uri()
* to_iri()

# Factoring

Should DecodedURL be a subclass of URL? Inheritance isn't cool
anymore, so obviously not right? But seriously, it could be:

* Every single method of URL is wrapped with almost an identical API,
except for __init__
* A DecodedURL is as much a URL as both `bytes` and `unicode` are
strings

On the arguments against:

* __init__ differs
* No real benefit to calling super()
* Only a few duplicate methods could be reused (Twisted compat,
.get(), a couple others)

# Remaining design questions

* Should _encode_reserved(maximal=False) be used instead?
* If yes, should the underlying URL have .to_iri() applied to it?

"""
Loading