Keywords search dropdown for Advanced Search #231

peterjdann · 2024-06-14T06:37:02Z

This pull request proposes:
(a) Fixing a bug in the catalog's Advanced Search function that currently results in the generation of an incorrect list of results if a user enters a multiterm keywords entry such as "political fiction" or "french revolution"
(b) Adding an autocomplete style dropdown list to the keywords field of the catalog Advanced Search

I believe implementing this dropdown will make it much easier than it is presently for users prepared to try our Advanced Search function to find an audiobook that may match a specific subject or genre interest.

Anyone reviewing this code proposal will notice that the SQL used to populate the keywords dropdown list does a look up of the project_keywords table. The purpose of this lookup is to avoid listing in the keywords dropdown keywords which are, in effect, orphans — that is to say, terms that appear in the keywords table, but are never referenced in the project_keywords table. In the scrubbed database, there are currently 2928 such terms. They can be viewed by running the following SQL against the scrubbed database:

SELECT DISTINCT k.value, k.id
FROM keywords k
where k.id NOT IN (SELECT DISTINCT pk.keyword_id from project_keywords pk)
ORDER BY k.id ASC;

garethsime

Hi! I'm not a Librivox dev, so feel free to ignore my comments, but I do contribute occasionally, so I thought I'd throw in my suggestions 🙂

I like the look of this auto-complete feature a lot, but I have a few small technical questions.

garethsime · 2024-06-14T08:54:30Z

application/libraries/Librivox_search.php

@@ -122,7 +122,7 @@ function advanced_title_search($params)
 		$keyword_clause = '';
 		if (!empty($params['keywords']))
 		{
-			$keywords = explode(' ', $params['keywords']); //maybe preg_match if extra spaces cause trouble - thinnk we're ok
+			$keywords = array($params['keywords']);


Previously, users could type in something like "duck sheep" and get items that match either "duck" or "sheep". I'm not sure why it was that way exactly, but I suspect it's because it was hard for users to know which tags were valid, so they probably went "duck ducky duckling ente pato ..." in the hopes of striking lucky.

Now, we'd only be matching items that had the literal tag "duck sheep", but I think this is a good change -- the auto-complete should make it way easier to discover tags. (Might need to label the field "Keyword" rather than "Keywords" though.)

It might be worth running past the admin team, if you haven't already :)

(Comma-separated tags or something would be awesome, but I don't think that the jQuery autocomplete supports it out of the box, so you'd have to do something like this if you wanted that.)

I haven't run this change past the admin team. There didn't seem any point until I could convince myself that I could get it working. What HAS been run past the admin team, though, is another proposal I've currently got under consideration (it's a formal pull request now) that would see the keywords associated with every project displayed on that project's Catalog page. The keywords would be hyperlinked so that clicking a keywords entry immediately displays all the other projects sharing the same keywords term. As part of that change (when it was still a paper proposal) I suggested that the "keywords" terminology was inappropriate, and should be changed to "keyterm". I argued that calling an entry like "keywords" was confusing, and that "keyterm" was a more sensible description. That would leave the word "keyterms" (plural) to signify, say, "political fiction" AND "French revolution". In my proposal, I was going to alter "keywords" on the New Project template page to "keyterms". The administrators argued, I believe, that (a) matters are fine as they stand and (b) the New Project template is built around multilanguage labels, and if we were to change one label like "keywords" on that page, it would be necessary to pay someone to translate this into all the other languages we use for that page.
As for comma-separated tags, MAYBE that could be got to work with enough effort — but I'd prefer to take this in smaller steps. If this dropdown is accepted AND my pull request to display hyperlinked keywords for each project becomes a reality, I think it will become very much more feasible than it has been so far to conduct meaningful searches using keywords. I'm not suggesting this proposal is perfect — only that it's a lot better than what we have at the moment.

If this dropdown is accepted AND my pull request to display hyperlinked keywords for each project becomes a reality, I think it will become very much more feasible than it has been so far to conduct meaningful searches using keywords.

Very much so, I look forward to seeing it happen!

I'm not suggesting this proposal is perfect — only that it's a lot better than what we have at the moment.

Noted! If we ever get around to an auto-completing, multi-keyword-entering box, we can use that in the Project Template Generator and here in Advanced Search. Meanwhile, the visibility of what keywords we might search, straight from within the search form, is a big step forward.

Now, we'd only be matching items that had the literal tag "duck sheep", but I think this is a good change -- the auto-complete should make it way easier to discover tags. (Might need to label the field "Keyword" rather than "Keywords" though.)

This part also seems like at least a slight improvement overall, but comes with a trade-off. If there's a way to alleviate that down-side, great! Otherwise, we may take two steps forward,one step back, and then be happy about it.

The problem that this particular change would solve:
Searching by keyword for "American Revolution" currently shows keywords like "American Humor" and "French Revolution".
Yeah, that's bad. 😅

But, on the other hand:
Searching by keyword for "Winnie the Pooh" currently shows results tagged with "winnie", "winnie-the-pooh", "winnie the pooh", and simply "pooh".
This is good. Dear Pooh may be a silly old bear, but his nose will always lead him somewhere.
(Oh, and this same search will also show the one project with keyword "Thé". Excuse me while I make that not a thing, on live...)
A few projects will certainly have their keywords updated, as those become both more visible and more useful, but overhauling the keywords on our thousands of projects is actually even more work than most power-users would imagine, and is not in the scope of this PR.

If quoting a multi-word "keyword" to keep it together is easier for either of you "hotshots" than it is for me (I've not written so much new code as either of you!), or if you can think of another way to keep the best of both worlds, then I'm all ears. Otherwise, I'll certainly be here for the "running by" when we're sure we know our limitations. 😉

garethsime · 2024-06-14T08:57:18Z

application/models/Keyword_model.php

+		$query = $this->db->query($sql);
+		return $query->result_array();
+
+	}


I think it would be a good idea to use the escaping methods provided by CodeIgniter for this: https://www.codeigniter.com/userguide3/database/queries.html#escaping-queries

It would look something like this:

public function autocomplete($term) { // Escaping -- https://www.codeigniter.com/userguide3/database/queries.html#escaping-queries $escaped_term = $this->db->escape_like_str($term); $sql = 'SELECT DISTINCT k.value FROM keywords k JOIN project_keywords pk ON k.id = pk.keyword_id WHERE k.value like "' . $escaped_term . '%" ESCAPE \'!\' AND pk.project_id IS NOT NULL ORDER BY k.value ASC'; $query = $this->db->query($sql); return $query->result_array(); }

This helps prevent nasty SQL injection attacks.

Keywords are currently being escaped already in Librivox_search.php, thus:

$keyword_clause = '';
if (!empty($params['keywords']))
{
$keywords = array($params['keywords']);
$keywords = array_map('trim', $keywords); // clean it up
$escaped_keywords = [];
foreach ($keywords as $keyword)
$escaped_keywords[] = $this->db->escape($keyword);
$in_keywords = implode(", ", $escaped_keywords);

Are you suggesting we need further protection beyond this?

Doh! I get it now. Did not include a fix for this in my latest update. Will take another look at this tomorrow.

Yup, once the keywords are put into the search, it's fine, but we're putting $term directly into the auto-complete SQL itself, which could do bad things too 🙂

Here's the query again:

$sql = 'SELECT DISTINCT k.value FROM keywords k JOIN project_keywords pk ON k.id = pk.keyword_id WHERE k.value like "' . $term . '%" ORDER BY k.value ASC'; $query = $this->db->query($sql); return $query->result_array();

Having the ORDER BY bit on a newline stumps my crappy hacking skills, but imagine someone unknowingly removes the newline, changing it to this:

$sql = 'SELECT DISTINCT k.value FROM keywords k JOIN project_keywords pk ON k.id = pk.keyword_id WHERE k.value like "' . $term . '%" ORDER BY k.value ASC'; $query = $this->db->query($sql); return $query->result_array();

Now, you could put this in the keyword field and have it dump out all the email address in the Librivox database:

" UNION SELECT email AS value FROM users --

The full query becomes something like this:

SELECT DISTINCT k.value FROM keywords k JOIN project_keywords pk ON k.id = pk.keyword_id WHERE k.value like "" UNION SELECT email AS value FROM users -- %" ORDER BY k.value ASC

So yeah, I don't know how to make a hack like that work with the newline there, but I bet there's someone who can!

Oh, I was too slow to comment haha, glad you got what I meant 🙂

Thanks for thinking of SQL injection - escaping is good, parameterized queries are even better. We already have a tracker for existing SQL injection bugs (#19), and I've managed to fix some of them, with a lot more to go. You can see an example of converting to parameterized queries in #145.

garethsime · 2024-06-14T09:26:10Z

public_html/js/common/autocomplete.js

I've never looked at how the auto-complete works before, but man am I glad you figured it out first hahaha

Seriously though, I think it would be good to write a little docstring to help other contributors in the future. Maybe something like this:

/** * Set up auto-complete boxes using jQuery Autocomplete. * * Input elements need to define the following attributes: * * - data-search_func - The backend ajax function to call * - data-search_field - The name of the field * - data-search_area - Which area on the page is being auto-completed * - data-array_index - Used when you've got multiple inputs in one area (e.g. multiple authors) * * In the pages where this script is included, you'll also need to define two * functions: * * - autocomplete_assign_vars - Maps backend results to values to show in the drop-down * - autocomplete_assign_elements - Handles updating inputs when a value is selected * * See also: https://api.jqueryui.com/autocomplete/ */ function set_autocomplete() { ... }

Which also leads me into the next question -- why not follow how the others are doing it, and define the two javascript functions in advanced_search.php rather than modify this file? I know it's ugly, but better to have consistent ugly things rather than special cases. For example, you could drop something like this in:

 <script type="text/javascript"> function autocomplete_assign_vars(item) { return item.value; } function autocomplete_assign_elements(search_area, ui, array_index) { switch (search_area) { case 'keywords': document.getElementById("keywords").value = ui.item.label; break; } } </script> <script type="text/javascript" src="https://librivox.org/js/libs/jquery-1.8.2.js?v=1710057521"></script> <script type="text/javascript" src="https://librivox.org/js/libs/jquery.validate.js?v=1710057521"></script> <script type="text/javascript" src="https://librivox.org/js/libs/jquery-ui-1.8.24.custom.min.js?v=1710057521"></script> <script type="text/javascript" src="https://librivox.org/js/common/autocomplete.js?v=1710057521"></script>   <script type="text/javascript" src="https://librivox.org/js/common/jquery.tagsinput.min.js?v=1710057521"></script>

Thank you so much. These all strike me as excellent suggestions. I've implemented them on my development system and they work fine.

garethsime · 2024-06-14T09:42:56Z

application/views/catalog/partials/advanced_search.php

@@ -47,7 +49,7 @@
 			<div class="control-group">
 			     <div class="controls center">
 			     	<label for="keywords" ><span class="span2">Keywords:</span>
-					<input type="text" class="span4" id="keywords" name="keywords" value="<?= htmlspecialchars($advanced_search_form['keywords']) ?>"/>
+					<input type="text" name="keywords" value="" id="keywords" class="autocomplete" data-search_func="autocomplete_keywords" data-search_field="keywords" data-search_area="keywords" data-array_index="0" style="float:none; vertical-align:middle;margin-top:0px; width: 356px; max-width:500px; margin-left:0; "  />


Q: What are the styles for? E.g. why have a specific width/max width?

Seemed to be an essential change when I first worked through this, as changing the class of the field (required to make autocomplete work) appeared to be changing the layout. However, as I've now implemented your approach on my local environment, it seems I thankfully don't need this local styling anymore.

Nice, awesome :) Unlike IDs, an HTML element can have more than one class, so if span4 is important for the layout, then the input can have both classes:

<input class="span4 autocomplete" ... />

class attribute docs on MDN

But if you're happy with how it is without the span4, then I wouldn't change it!

garethsime · 2024-06-14T09:55:39Z

application/views/catalog/partials/advanced_search.php

+	<script type="text/javascript" src="https://librivox.org/js/libs/jquery-ui-1.8.24.custom.min.js?v=1710057521"></script>
+	<script type="text/javascript" src="https://librivox.org/js/common/autocomplete.js?v=1710057521"></script>
+	<script type="text/javascript" src="https://librivox.org/js/public/project_launch/index.js?v=1710057521"></script>
+	<script type="text/javascript" src="https://librivox.org/js/common/jquery.tagsinput.min.js?v=1710057521"></script>


Q: Are all of these needed? E.g. https://librivox.org/js/public/project_launch/index.js?v=1710057521 seems suspicious?

I did check the effect of removing each one of these individually to see if they were all needed, and had come to the conclusion that they were. In some cases the only side-effect of not including one of these scripts was to cause an exception to be raised and noted in the Javascript console, while the new dropdown function continued to work fine. However, I decided it would be better not to be implementing code that was going to be causing these exceptions to be thrown.
That said, it turns out there was either something wrong with my checking process, or the new way of implementing this that you have suggested does now completely obviate the need for the script you have highlighted. Without it, the dropdown is working fine on my local machine, and not causing exceptions to be thrown.

Ah right, makes sense! And yeah, if the autocomplete_assign_vars or autocomplete_assign_elements functions weren't set, then you'd get errors as soon as the autocomplete code ran. Bringing in those scripts would set them, which would stop the errors 🙂

garethsime · 2024-06-14T10:15:11Z

application/models/Keyword_model.php

+		FROM keywords k 
+		JOIN project_keywords pk
+		ON k.id = pk.keyword_id
+		WHERE k.value like "' . $term . '%" AND pk.project_id IS NOT NULL


Q: Is the AND pk.project_id IS NOT NULL significant here? This isn't a LEFT JOIN, so unmatched rows should be discarded I think?

Agreed. As you can see, I'm no hotshot developer! (Much more a Librivox reader than a Librivox techie...)

It may take me some little time as I have a few social commitments, but I expect I'll be updating my pull request to incorporate most of the changes you've suggested within the next 24 hours.

I'm no hotshot developer!

Nah, you've done well! The auto-complete stuff is tricky to put together and you got there. The main thing is that you've made it happen, and I'm looking forward to using it.

Much more a Librivox reader

Well, you're a long way ahead of my score of 2 sections that's for sure 😉

peterjdann · 2024-06-15T07:58:26Z

Have made a number of changes now in the light of garethsime's very helpful suggestions.
Thank you very much for your care and attention to this, garethsime!

peterjdann · 2024-06-15T09:24:23Z

Have just added keywords escaping, as suggested by Gareth. Thank you again, Gareth. Bit of a dimwit you're dealing with here!

garethsime · 2024-06-15T09:50:15Z

Also, I know this is completely off-topic for the PR, but I'm curious to know how your local dev setup works?

There was a fair amount of discussion a while back on how best to support other devs wanting to make contributions, so I'm always interested in hearing how people went with the setup and what they ended up doing, if you can spare the time 🙂

notartom · 2024-06-15T14:01:32Z

Hi! I'm not a Librivox dev, so feel free to ignore my comments, but I do contribute occasionally, so I thought I'd throw in my suggestions 🙂

You're as much a Librivox dev as any of us - as I've mentioned before in various places, the original author has long since moved on, so we're all trying our best to figure out the code base and make incremental improvements, hopefully without breaking anything. Your suggestions and comments are definitely valid, and I welcome your reviews of any PR.

peterjdann · 2024-06-16T01:24:37Z

Does strike me as possibly overkill, but I've now parameterised the query that looks up keywords in use, following a comment from notartom (which comment has now been deleted, I think?)

peterjdann · 2024-06-16T08:59:51Z

Also, I know this is completely off-topic for the PR, but I'm curious to know how your local dev setup works?

There was a fair amount of discussion a while back on how best to support other devs wanting to make contributions, so I'm always interested in hearing how people went with the setup and what they ended up doing, if you can spare the time 🙂

Have looked at my old notes on how I did this the first time on a Mac running Ubuntu under Parallels Desktop. Was able to pare down the necessary instructions to a fairly simple procedure, which I was able to verify worked correctly on a second Mac of mine running a similar setup this afternoon. I have posted those instructions in the place where the original discussion took place.

peterjdann · 2024-06-18T22:53:30Z

Thé IS a thing in French (tea). to a quick check before you delete that.

…

On Wed, 19 Jun 2024 at 8:48 AM, redrun45 ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In application/libraries/Librivox_search.php <#231 (comment)> : > @@ -122,7 +122,7 @@ function advanced_title_search($params) $keyword_clause = ''; if (!empty($params['keywords'])) { - $keywords = explode(' ', $params['keywords']); //maybe preg_match if extra spaces cause trouble - thinnk we're ok + $keywords = array($params['keywords']); I'm not suggesting this proposal is perfect — only that it's a lot better than what we have at the moment. Noted! If we ever get around to an auto-completing, multi-keyword-entering box, we can use that in the Project Template Generator *and* here in Advanced Search. Meanwhile, the visibility of what keywords we might search, straight from within the search form, is a big step forward. Now, we'd only be matching items that had the literal tag "duck sheep", but I think this is a good change -- the auto-complete should make it way easier to discover tags. (Might need to label the field "Keyword" rather than "Keywords" though.) *This part* also seems like at least a slight improvement overall, but comes with a trade-off. If there's a way to alleviate that down-side, great! Otherwise, we may take two steps forward,one step back, and then be happy about it. *The problem that this particular change would solve:* Searching by keyword for "American Revolution" currently shows keywords like "American *Humor*" and "*French* Revolution". Yeah, that's bad. 😅 *But, on the other hand:* Searching by keyword for "Winnie the Pooh" currently shows results tagged with "winnie", "winnie-the-pooh", "winnie the pooh", and simply "pooh". This is *good*. Dear Pooh may be a silly old bear, but his nose will always lead him somewhere. (Oh, and this same search will also show the one project with keyword "Thé". Excuse me while I make that *not a thing*, on live...) A few projects will certainly have their keywords updated, as those become both more visible and more useful, but overhauling the keywords on our thousands of projects is actually even *more* work than most power-users would imagine, and is not in the scope of this PR. If quoting a multi-word "keyword" to keep it together is easier for either of you "hotshots" than it is for me (*I've* not written so much new code as either of you!), or if you can think of another way to keep the best of both worlds, then I'm all ears. Otherwise, I'll certainly be here for the "running by" when we're sure we know our limitations. 😉 — Reply to this email directly, view it on GitHub <#231 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/APUJ53HOQIHSBGGCGUXSUW3ZIC2NHAVCNFSM6AAAAABJJXQLB6VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDCMRWGY4TCNJVG4> . You are receiving this because you authored the thread.Message ID: ***@***.***>

redrun45 · 2024-06-18T23:44:20Z

Thé IS a thing in French (tea). to a quick check before you delete that.

Thanks, that's good to know! But here, it was short for 'Thénardier', having been truncated by accident.

peterjdann · 2024-06-18T23:50:37Z

Very good! Yes, sadly, exposing keywords is also going to expose keyblunders — there’s no denying that. I personally believe, though, the upside outweighs the downside.

…

On 19 Jun 2024, at 9:44 AM, redrun45 ***@***.***> wrote: Thé IS a thing in French (tea). to a quick check before you delete that. Thanks, that's good to know! But here, it was short for 'Thénardier', having been truncated by accident. — Reply to this email directly, view it on GitHub <#231 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/APUJ53B4INR2TI2CAWUUCYTZIDA6VAVCNFSM6AAAAABJJXQLB6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZXGI3DANJYGM>. You are receiving this because you authored the thread.

redrun45 · 2024-06-20T18:24:19Z

Very good! Yes, sadly, exposing keywords is also going to expose keyblunders — there’s no denying that. I personally believe, though, the upside outweighs the downside.

Agreed there! When keywords are more useful (as they will be, thanks to your changes here and elsewhere!), the obviously incorrect ones will be corrected over time. But since we can't search the catalog by description, and since everything else has a very finite set of "correct" values, the keywords will remain the only "free-form" discoverability for a project. Since #225 will cover the "exact match" use-case, I'd prefer we still at least have the option of a fuzzy keyword search.

I suppose the direction I'm going is: could we do something like the "Exact match" check-box for keywords in Advanced Search? Would that be easier than a multi-keyword-autocomplete box?

peterjdann · 2024-06-22T08:24:20Z

But since we can't search the catalog by description, and since everything else has a very finite set of "correct" values, the keywords will remain the only "free-form" discoverability for a project. Since #225 will cover the "exact match" use-case, I'd prefer we still at least have the option of a fuzzy keyword search.

I suppose the direction I'm going is: could we do something like the "Exact match" check-box for keywords in Advanced Search? Would that be easier than a multi-keyword-autocomplete box?

I think an "Exact match" checkbox would be a backward step. At the moment, my code is matching all terms that begin with what the user has entered. In practice, this means that if the user enters, say, "french revolution", that term is going to show up as the first hit, but the user will see other hits as well. In effect, my current code is delivering an exact match (if there is one) but also partially fuzzy matches.

Let me run past you an approach that you might find an improvement on how I've done this at the moment.

Imagine a scenario where the user has entered the search term "roman" and this causes the following SQL to be executed:

select value as text, 'A' as priority
from keywords
where value like 'roman%' and not INSTR(value, ';') and not INSTR(value, ',')
UNION
select concat ('=== ', 'other terms containing ', '"roman"', ' ===') as text, 'B' as priority
UNION
select value as text, 'C' as priority
from keywords
where value like '%roman%' and value not like 'roman%' and not INSTR(value, ';') and not INSTR(value, ',')
order by priority ASC, text ASC
limit 200

I know the "limit" value here is not what we actually use, but just for the purposes of illustration right now we need a high value for "limit" ito see the effect I'm after. (Of course we would also need to strip out the "priority" values from what the SQL is returning — I'm not suggesting we're going to display those to our users).

The point is, we can separately display what are, in effect, two lists in one return set, one of which lists is "fuzzier" than the other, while in effect documenting what we're showing the user within the result set itself.

Why do this, and not just present the user with a "fully fuzzy" result set? Well, try it using the same example:

select value as text
from keywords
where value like '%roman%' and not INSTR(value, ';') and not INSTR(value, ',')
order by text ASC
limit 200

It seems to me the result set is much less helpful.

As for the

not INSTR(value, ';') and not INSTR(value, ',')
why am I now proposing that?

Well, try running the same SQL but with that conditions stripped out. Turns out there really are some pretty messy entries in our keywords table that need to be rationalised. Down the track, it might be possible to write a utility page to help with that task — but for now, it might be best to hide some easily excludable bad examples from our users.

redrun45 · 2024-06-22T15:45:47Z

Just to be sure we're on the same page:
I'd suggest a check-box for the actual search, not necessarily for making the keyword autocomplete suggestions more fuzzy. You make good points on autocomplete. The "prioritized" version looks excellent from where I sit, too. 😉

The part I suggest making optional is this change in Librivox_search.php. The opening comment for this PR describes that change as:

Fixing a bug in the catalog's Advanced Search function that currently results in the generation of an incorrect list of results if a user enters a multiterm keywords entry such as "political fiction" or "french revolution"

When the user intends to search for projects tagged with a particular multi-word keyword, which they've selected from the autocomplete suggestion list... then doing a fuzzy search for projects with other keywords would indeed be a bug!

But if a user intends to do a broader search across multiple relevant keywords, that fuzzy matching is the best we have at the moment.
Let's say I want to study the history of France in a particular period. I select the Genre as "*Non-Fiction -> History -> Early Modern". Then instead of selecting just one keyword from a list of suggestions, I check or un-check a box, and then enter a collection of words that might appear in any number of relevant keywords. Say, "french", "france", "paris".
Would it be nice if we could pick several multi-word terms like "french revolution" and "white hoods of paris"? Absolutely! But if we don't (yet) have that, we might want to keep this (otherwise buggy) code as an alternate mode.

Ok! Suggestion made. It is definitely more work, even if a check-box is the simplest thing I know to ask for. Let me know what you think.

peterjdann · 2024-06-23T00:21:51Z

Thanks for this clarification. Can I ask if you’ve actually had a chance to play around in a system that has implemented this PR to see how quickly and easily it helps you find stuff you might be interested in? While obviously I don’t know this for sure, I can’t help wondering if your request here is based on an old way of doing things that there may not be a lot of point in clinging to if this PR and my other PR about exposing keywords on project catalog pages do make it into production. Would it REALLY be that hard to find what you’re looking for using this new approach? (My bet is that you would actually find this new way ever so much faster and easier — but then, perhaps you have played around with this for some time, as I’ve implemented it, and you really don’t find this new approach all that great?) To be clear, you appear to be suggesting an approach that would see the autocomplete behaviour I’ve enabled “turned off” at the selection of a checkbox. You’ve not suggested what succinct form of words you’d propose to use as a label for this checkbox that is going to make clear to users what to enter in order to get what results. Personally, I see very little value in going to all the trouble to implement the following, given how easy keywords searching is going to be without it anyway, but I can picture a “solution” that might satisfying what I personally think is an over-the-top requirement where instead of a checkbox we had a dropdown that included the following two options for keywords searching: “Click a term matching your entry” “Type comma-separated list” The first of these would be the default, and would result in the autocomplete dropdown behaviour I have implemented. If someone chose from the dropdown “political fiction”, they would see in their search results only projects that had “political fiction” as a keywords entry. They would not see all projects which have “fiction” as a keywords entry. The second would allow a user to type, say, ‘french, france, paris, french revolution, white hoods of Paris, winnie-the-pooh, pooh bear’ in the same (now non-autocompleting) field and see a search result set that included all projects that included any one of those literal comma-separated terms. It would NOT, however, include a project with the keywords ‘revolution’ or ‘bear’. I can see that this could be technically possible — but really, is this necessary now, with all the required JavaScript and so on? Have you really found the approach I’ve implemented so frustrating and difficult to use in practice? Out of interest, I did a check and found that of the 29,162 entries in the scrubbed database keywords table, nearly half are terms which contain a space not in the first position. It is currently impossible to conduct a straightforward keywords search using any of these terms without getting a host of what I regard as spurious results and unexpected results. It does seem to me that if the person who designed this system in the first place had intended the keyword search to operate as it does, I would have expected them to have prevented the use of space characters in keywords at the time they are first entered in the system. There’s certainly nothing in the current user interface that gives a user any clue that searching for “political fiction” is going to return every project that has the keywords “fiction”, and treating a space character as a list item separator is pretty unusual, in my experience. Given all the above, I don’t think it’s unreasonable to describe the current implementation as suffering from a bug.

…

On 23 Jun 2024, at 1:46 AM, redrun45 ***@***.***> wrote: Just to be sure we're on the same page: I'd suggest a check-box for the actual search, not necessarily for making the keyword autocomplete suggestions more fuzzy. You make good points on autocomplete. The "prioritized" version looks excellent from where I sit, too. 😉 The part I suggest making optional is this change in Librivox_search.php. The opening comment for this PR describes that change as: Fixing a bug in the catalog's Advanced Search function that currently results in the generation of an incorrect list of results if a user enters a multiterm keywords entry such as "political fiction" or "french revolution" When the user intends to search for projects tagged with a particular multi-word keyword, which they've selected from the autocomplete suggestion list... then doing a fuzzy search for projects with other keywords would indeed be a bug! But if a user intends to do a broader search across multiple relevant keywords, that fuzzy matching is the best we have at the moment. Let's say I want to study the history of France in a particular period. I select the Genre as "*Non-Fiction -> History -> Early Modern". Then instead of selecting just one keyword from a list of suggestions, I check or un-check a box, and then enter a collection of words that might appear in any number of relevant keywords. Say, "french", "france", "paris". Would it be nice if we could pick several multi-word terms like "french revolution" and "white hoods of paris"? Absolutely! But if we don't (yet) have that, we might want to keep this (otherwise buggy) code as an alternate mode. Ok! Suggestion made. It is definitely more work, even if a check-box is the simplest thing I know to ask for. Let me know what you think. — Reply to this email directly, view it on GitHub <#231 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/APUJ53AJDA7VP3UYCRWJ5MTZIWL4DAVCNFSM6AAAAABJJXQLB6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBUGA3TQMRVGM>. You are receiving this because you authored the thread.

redrun45 · 2024-06-23T00:39:40Z

Given all the above, I don’t think it’s unreasonable to describe the current implementation as suffering from a bug.

Fair enough! I'm not going to argue. I only intended to let you know of an objection I thought would likely be raised. I'll make sure folks know this PR is ready for review as-is.

peterjdann · 2024-06-23T01:57:55Z

Thank you! Can I suggest you pause any further review until I have implemented the kind of prioritised list I foreshadowed with that revised SQL? So far, that’s not yet part of this PR. I’d like to add it. Should be done within 48 hours, I expect.

…

On 23 Jun 2024, at 10:40 AM, redrun45 ***@***.***> wrote: Given all the above, I don’t think it’s unreasonable to describe the current implementation as suffering from a bug. Fair enough! I'm not going to argue. I only intended to let you know of an objection I thought would likely be raised. I'll make sure folks know this PR is ready for review as-is. — Reply to this email directly, view it on GitHub <#231 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/APUJ53EORSDABAMF6RRRDLTZIYKOFAVCNFSM6AAAAABJJXQLB6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBUGI3TANZRGU>. You are receiving this because you authored the thread.

peterjdann · 2024-06-24T08:10:16Z

Have now changed SQL so the list of keywords presented in the dropdown shows, as first priority, all keywords that START WITH the letters the user has typed, then a text separator indicating that the entries below the separator are some terms that INCLUDE the letters the user has typed, then as last priority terms not already presented that INCLUDE the letter sequence the user has typed. The maximum number of entries shown in the dropdown is 200.

I did NOT implement a change I foreshadowed above which would have excluded entries including ";" or "," characters.

The query term is now both "like-suitable" escaped and parameterised.

To get a glimpse of the kind of tidying up work that may lie ahead (if anyone is brave enough to take it on) trying typing "libriv" in the keywords search field on a system where this change has been implemented.

redrun45 · 2024-06-26T00:54:27Z

I did NOT implement a change I foreshadowed above which would have excluded entries including ";" or "," characters.
...
To get a glimpse of the kind of tidying up work that may lie ahead (if anyone is brave enough to take it on) trying typing "libriv" in the keywords search field on a system where this change has been implemented.

I didn't look to close until now, in case you intended to make that foreshadowed change. 😉
Actually, it's surprising how correlated these two things are. It would seem that any time either a ";" or "," shows up inside a keyword, it's a typo. In fact, most of those have already been fixed, and the remaining few can be done pretty easily! 🥳

I was going to suggest adding AND instances > 1 to the SELECT statements for autocomplete, now that I know the old, typo'd keywords are still in there. But of course, learning this, I realize that we'll need an extra line in update_keyword_stats, so that unused keywords get set to instances = 0!

I had assumed the unused keywords were deleted, so didn't even think to check that, for all my "hammering". I'll add another suggestion to #225. 😅

notartom · 2025-02-02T21:36:36Z

application/controllers/public/Public_ajax.php

@@ -15,6 +15,15 @@ function autocomplete_author()
 		$this->load->model('author_model');
 		echo json_encode($this->author_model->autocomplete($term, $search_field));
 	}
+
+	function autocomplete_keywords()


Question: I can see that is is referenced as the data-search_func in advanced_search.php below, so even though this is PHP and below is HTML, I'm assuming this somehow gets called when a user types in the keywords <input>. Does this mean we're running an SQL query every time a user adds or removes a character from the keywords search box? I'm worried about load on the server, and I'm wondering if it wouldn't be possible to cache the list of keywords client-side (or even server-side, but at least have them cached without looking them up each time?)

It's been a while since I first worked on this, but (from memory) yes, there is a DB lookup at each key press. Your suggestion of creating a cache (client or server side) and querying that makes very good sense. I'll see what I can do.

Does this mean we're running an SQL query every time a user adds or removes a character from the keywords search box?

For users who are "hunt-and-peck" typists, we probably are. 😅
By default, our JS library has a delay of 300ms after the last event, before it sends any queries to the server. That ought to mean no more than ~3 queries per second per user (and usually less), rather than a big rush of queries when a touch-typist goes at it.

This also applies to author autocomplete. If it's a problem here, we might be better off increasing the delay for both. Looks like we'd add the option in /public_html/js/common/autocomplete.js, with some higher value.

Ah, well that's a new angle I wasn't aware of. I'll wait to see what Artom thinks of redrun45's suggestion here before I embark on figuring out how to set up and query a cache.

I would like to point out that the call to the DB that runs in the current implementation to generate a dropdown list has a LIMIT statement in it, as follows:
$sql = 'SELECT DISTINCT k.value, "A" AS priority
FROM keywords k
JOIN project_keywords pk
ON k.id = pk.keyword_id
WHERE k.value LIKE ?
UNION
SELECT "=== some other keywords containing "?" ===" AS value, "B" AS priority
UNION
SELECT DISTINCT k.value, "C" AS priority
FROM keywords k
JOIN project_keywords pk
ON k.id = pk.keyword_id
WHERE k.value LIKE ?
AND k.value NOT LIKE ?
ORDER BY priority ASC, value ASC
LIMIT 200';

To populate a cache, it seems to me I'm going to need to something like:

Run a one time per usage SELECT from table keywords with a join on project_keywords with NO limit statement, and then, additionally:

At each subsequent keypress, loop through my cache (most likely a PHP array), potentially twice, using strpos() to look for matches where:
a) haystack contains needle*, and
b) haystack contants needle but not needle*

Now, the question in my mind is, is this latter approach likely to be more, or less, resource-intensive than the approach we've got at the moment, where every 300 ms (max) a user hits the database with a query that returns at most 200 items that we can use immediately as is, without the need for any further processing?

Yes, potentially much higher. Author autocomplete is currently "hooked up" to pages that are only useful for volunteers, rather than these search pages that the general public might use. It might be nice if we could hook it up to Advanced Search too... but that's if we can satisfy any performance impacts here, of course. 😉

If it helps any, this Keywords-autocomplete query would only be used when the user opts for "Advanced" search, and then starts entering keywords, which seems like a relatively small proportion of searches to me. Folks using the regular search, or advanced search without keywords, won't cause additional load.

In the cases where this method is used, the search itself might be a tad lighter because of 1 - fewer "keywords" entries matched, means fewer projects matched, means fewer (and hopefully more relevant) results. For example, the not-terribly-nice case where the user enters "american civil war", and receives projects whose keywords match any of "%american%" OR "%civil%" OR "%war%", which is quite a bit broader than they probably intended.

I've been trying to get my head around how either a server-side or client-side cache might be implemented in practice, and I must admit I'm scratching my head.

Re a server-side cache, unless I've got this all wrong, it's not like PHP objects continue to exist between https calls. Therefore (am I wrong about this??), it's not like there's any object in the CodeIgniter setup for which I can do a one-time setup of a keywords cache (for example, at the moment the keywords search field gets the focus) and then expect this cache to persist between user key-clicks in the keywords field, each (or many) of which are triggering AJAX calls on the server.

This leaves the theoretical possibility of setting up a client side cache — but give a thought, for a moment, to how complex this might be, and ask yourself why none of the canonical examples you'll find on the web about how to implement autocomplete with an input field use any such case (even though the requirement must be a very common one indeed).

Right now, the implementation is really very simple and elegant: we simply assign to the input field the server-side function that is to populate the list, and then implement that server-side function in PHP. (It's a little more complicated than that — but the amount of Javascript involved is very small). Against that, to implement a client-side cache I imagine I would need to make an AJAX call that would populate a Javascript array (say, on an onFocus event for the keywords input field), and then, on each keystroke in that field, loop through that array (potentially, twice) looking for matches.

Is this what you are asking me to do?

Other points I'm wondering about:
(a) If I have to have the client read a full list of all the keywords in use via AJAX, am I in danger of hitting some limit in the possible size of the response from the server?
(b) What would you think of writing a server-side keywords cache to an operating system file, and reading from that? If so: What should be the location of the file? Would you be concerned about the resource-use implications of that approach? (The file might be considered 'valid' by the cache-reading code if it bore a relatively recent timestamp, otherwise be rewritten following a fresh database call). There's a suggestion here (https://stackoverflow.com/questions/45740757/php-persistent-cache-data) that achieving 'persistence' by writing to, and reading from, a file might not be as expensive as it sounds if the operating system concerned in practice performed much of the work in memory.

I have had a go today at implementing an approach that caches a list of all keywords currently used in projects on the server side, as a file.

I haven't formally pushed this to this PR, as I think you need to have a look at it and think about whether you think it's a good idea in principle. You may well have some improvements to suggest right off.

In this implementation, I'm suggesting a maximum cache life of five minutes, on the assumption that if an administrator has added or deleted some keywords, they might well expect them to show up in the dropdown list relatively soon.

I've done only scanty testing. At face value, the results look identical to the current, wholly DB-centric, approach. I have verified that I can type Greek keyboard letters into the Keywords search field and get some Greek results. On my local machine the results seem to be available speedily — but that hardly amounts to a meaningful load test, of course.

In the process of getting this working, I did, at one point, change the permissions on my local cache directory (the one that is specified in the code). I have not double-checked whether this is, in fact, necessay.

At the code level, the change is confined to a single function within the Keywords model, and looks like this:

public function autocomplete($term) { // Paths for file operations with CodeIgniter helper are relative // to location of index.php $CACHE_DIR = '../application/cache/'; $KEYWORDS_CACHE_FILENAME = 'keywords_cache.txt'; $MAX_RETURN_VALUES = 200; $keywords_cache_file_exists = false; $keywords_cache_is_fresh = false; $keywords_cache_refresh_interval_seconds = 300; $json_encoded_keywords_cache_as_read_from_file = ''; $this->load->helper('file'); // Does keywords cache file yet exist? $files_in_cache_dir = get_dir_file_info($CACHE_DIR); if (array_key_exists($KEYWORDS_CACHE_FILENAME, $files_in_cache_dir)) { $keywords_cache_file_exists = true; } // Is the existing cache fresh? if ($keywords_cache_file_exists) { $cache_date = get_file_info($CACHE_DIR . $KEYWORDS_CACHE_FILENAME, 'date'); $timestamp_of_current_cache_file = $cache_date["date"]; $cache_age = time() - $timestamp_of_current_cache_file; if ($cache_age < $keywords_cache_refresh_interval_seconds) { $keywords_cache_is_fresh = true; } } // Write a new keywords cache if none currently exists, // or if the existing cache needs to be refreshed if (!$keywords_cache_file_exists or !$keywords_cache_is_fresh) { $sql = 'SELECT DISTINCT k.value FROM keywords k JOIN project_keywords pk ON k.id = pk.keyword_id ORDER BY value ASC'; $query = $this->db->query($sql); $data_for_keywords_cache = $query->result_array(); write_file($CACHE_DIR . $KEYWORDS_CACHE_FILENAME, json_encode($data_for_keywords_cache)); } $cached_result_array = array(); // Read the existing cache file if it is fresh if ($keywords_cache_file_exists and $keywords_cache_is_fresh) { $json_encoded_keywords_cache = read_file($CACHE_DIR . $KEYWORDS_CACHE_FILENAME); $associative = true; $keywords_cache_array = json_decode($json_encoded_keywords_cache, $associative); foreach($keywords_cache_array as $row => $inner_array) { foreach($inner_array as $inner_row => $value) { if ( strpos($value, $term) === 0 ) { $new_element = ["value" => $value]; array_push($cached_result_array, $new_element); } } $array_size = count ($cached_result_array); if ( count ($cached_result_array) >= $MAX_RETURN_VALUES) { break; } } // If we have at least one good match above, and if we have not yet exceeded // the number of items we can show in our dropdown list, show some additional partial matches if (( count ($cached_result_array) <= $MAX_RETURN_VALUES) and (count ($cached_result_array) > 0)) { $divider = "=== some other keywords containing '" . $term . "' ==="; $new_element = ["value" => $divider]; array_push($cached_result_array, $new_element); foreach($keywords_cache_array as $row => $inner_array) { foreach($inner_array as $inner_row => $value) { if ( (strpos($value, $term) !== 0 ) and ((strpos($value, $term))) ) { $new_element = ["value" => $value]; array_push($cached_result_array, $new_element); } } $array_size = count ($cached_result_array); if ( count ($cached_result_array) >= $MAX_RETURN_VALUES) { break; } } } } return $cached_result_array; } ```

I have now pushed to here what is basically the change shown above (which uses a file-based cache that is refreshed not more often than every five minutes), with a couple of minor adjustments. The first is that I have now made the matching process between what the user has entered and our list of keywords in use case-insensitive, so that, for example, if the user types "french", they will see results matching both "french" and "French". The second is that I have restored to the Keyword_model autocomplete() function the original DB-centric code, but commented out. This is to allow anyone wanting to compare the performance of the two approaches to switch between them just by commenting out whichever approach they wish not to use.
I have established that, at least on my development system, no changes of permissions for the 'cache' directory were required for this change to take affect.

garethsime reviewed Jun 14, 2024

View reviewed changes

peterjdann force-pushed the keywords_search_dropdown branch from 91a3a41 to 4f0a33e Compare June 15, 2024 07:52

peterjdann force-pushed the keywords_search_dropdown branch from 4f0a33e to 373a18f Compare June 15, 2024 09:20

peterjdann force-pushed the keywords_search_dropdown branch from 373a18f to a759c3b Compare June 16, 2024 01:15

garethsime approved these changes Jun 18, 2024

View reviewed changes

redrun45 mentioned this pull request Jun 23, 2024

Reveal keywords on project's catalogue page #225

Merged

peterjdann force-pushed the keywords_search_dropdown branch from 2e314e1 to 0f63a86 Compare June 24, 2024 08:00

notartom reviewed Feb 2, 2025

View reviewed changes

peterjdann added 2 commits February 3, 2025 15:50

Fix multi-keywords advanced search bug

bbb6e12

Dropdown selection list for advanced search keywords

7638324

peterjdann force-pushed the keywords_search_dropdown branch from 0f63a86 to 7638324 Compare February 5, 2025 05:13

Keywords search dropdown for Advanced Search #231

Are you sure you want to change the base?

Keywords search dropdown for Advanced Search #231

Conversation

peterjdann commented Jun 14, 2024

garethsime left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

peterjdann commented Jun 15, 2024

peterjdann commented Jun 15, 2024

garethsime commented Jun 15, 2024

notartom commented Jun 15, 2024 • edited Loading

peterjdann commented Jun 16, 2024

peterjdann commented Jun 16, 2024 • edited Loading

peterjdann commented Jun 18, 2024 via email

redrun45 commented Jun 18, 2024 • edited Loading

peterjdann commented Jun 18, 2024 via email

redrun45 commented Jun 20, 2024

peterjdann commented Jun 22, 2024 • edited Loading

redrun45 commented Jun 22, 2024

peterjdann commented Jun 23, 2024 via email

redrun45 commented Jun 23, 2024

peterjdann commented Jun 23, 2024 via email

peterjdann commented Jun 24, 2024 • edited Loading

redrun45 commented Jun 26, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

peterjdann Feb 3, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

peterjdann Feb 3, 2025 • edited Loading

Choose a reason for hiding this comment

peterjdann Feb 4, 2025 • edited Loading

Choose a reason for hiding this comment

peterjdann Feb 5, 2025 • edited Loading

Choose a reason for hiding this comment

notartom commented Jun 15, 2024 •

edited

Loading

peterjdann commented Jun 16, 2024 •

edited

Loading

redrun45 commented Jun 18, 2024 •

edited

Loading

peterjdann commented Jun 22, 2024 •

edited

Loading

peterjdann commented Jun 24, 2024 •

edited

Loading

peterjdann Feb 3, 2025 •

edited

Loading

peterjdann Feb 3, 2025 •

edited

Loading

peterjdann Feb 4, 2025 •

edited

Loading

peterjdann Feb 5, 2025 •

edited

Loading