Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keywords search dropdown for Advanced Search #231

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

peterjdann
Copy link
Contributor

This pull request proposes:
(a) Fixing a bug in the catalog's Advanced Search function that currently results in the generation of an incorrect list of results if a user enters a multiterm keywords entry such as "political fiction" or "french revolution"
(b) Adding an autocomplete style dropdown list to the keywords field of the catalog Advanced Search

I believe implementing this dropdown will make it much easier than it is presently for users prepared to try our Advanced Search function to find an audiobook that may match a specific subject or genre interest.

Anyone reviewing this code proposal will notice that the SQL used to populate the keywords dropdown list does a look up of the project_keywords table. The purpose of this lookup is to avoid listing in the keywords dropdown keywords which are, in effect, orphans — that is to say, terms that appear in the keywords table, but are never referenced in the project_keywords table. In the scrubbed database, there are currently 2928 such terms. They can be viewed by running the following SQL against the scrubbed database:

SELECT DISTINCT k.value, k.id
FROM keywords k
where k.id NOT IN (SELECT DISTINCT pk.keyword_id from project_keywords pk)
ORDER BY k.id ASC;

Copy link
Contributor

@garethsime garethsime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! I'm not a Librivox dev, so feel free to ignore my comments, but I do contribute occasionally, so I thought I'd throw in my suggestions 🙂

I like the look of this auto-complete feature a lot, but I have a few small technical questions.

@@ -122,7 +122,7 @@ function advanced_title_search($params)
$keyword_clause = '';
if (!empty($params['keywords']))
{
$keywords = explode(' ', $params['keywords']); //maybe preg_match if extra spaces cause trouble - thinnk we're ok
$keywords = array($params['keywords']);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, users could type in something like "duck sheep" and get items that match either "duck" or "sheep". I'm not sure why it was that way exactly, but I suspect it's because it was hard for users to know which tags were valid, so they probably went "duck ducky duckling ente pato ..." in the hopes of striking lucky.

Now, we'd only be matching items that had the literal tag "duck sheep", but I think this is a good change -- the auto-complete should make it way easier to discover tags. (Might need to label the field "Keyword" rather than "Keywords" though.)

It might be worth running past the admin team, if you haven't already :)

(Comma-separated tags or something would be awesome, but I don't think that the jQuery autocomplete supports it out of the box, so you'd have to do something like this if you wanted that.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't run this change past the admin team. There didn't seem any point until I could convince myself that I could get it working. What HAS been run past the admin team, though, is another proposal I've currently got under consideration (it's a formal pull request now) that would see the keywords associated with every project displayed on that project's Catalog page. The keywords would be hyperlinked so that clicking a keywords entry immediately displays all the other projects sharing the same keywords term. As part of that change (when it was still a paper proposal) I suggested that the "keywords" terminology was inappropriate, and should be changed to "keyterm". I argued that calling an entry like "keywords" was confusing, and that "keyterm" was a more sensible description. That would leave the word "keyterms" (plural) to signify, say, "political fiction" AND "French revolution". In my proposal, I was going to alter "keywords" on the New Project template page to "keyterms". The administrators argued, I believe, that (a) matters are fine as they stand and (b) the New Project template is built around multilanguage labels, and if we were to change one label like "keywords" on that page, it would be necessary to pay someone to translate this into all the other languages we use for that page.
As for comma-separated tags, MAYBE that could be got to work with enough effort — but I'd prefer to take this in smaller steps. If this dropdown is accepted AND my pull request to display hyperlinked keywords for each project becomes a reality, I think it will become very much more feasible than it has been so far to conduct meaningful searches using keywords. I'm not suggesting this proposal is perfect — only that it's a lot better than what we have at the moment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this dropdown is accepted AND my pull request to display hyperlinked keywords for each project becomes a reality, I think it will become very much more feasible than it has been so far to conduct meaningful searches using keywords.

Very much so, I look forward to seeing it happen!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not suggesting this proposal is perfect — only that it's a lot better than what we have at the moment.

Noted! If we ever get around to an auto-completing, multi-keyword-entering box, we can use that in the Project Template Generator and here in Advanced Search. Meanwhile, the visibility of what keywords we might search, straight from within the search form, is a big step forward.

Now, we'd only be matching items that had the literal tag "duck sheep", but I think this is a good change -- the auto-complete should make it way easier to discover tags. (Might need to label the field "Keyword" rather than "Keywords" though.)

This part also seems like at least a slight improvement overall, but comes with a trade-off. If there's a way to alleviate that down-side, great! Otherwise, we may take two steps forward,one step back, and then be happy about it.

The problem that this particular change would solve:
Searching by keyword for "American Revolution" currently shows keywords like "American Humor" and "French Revolution".
Yeah, that's bad. 😅

But, on the other hand:
Searching by keyword for "Winnie the Pooh" currently shows results tagged with "winnie", "winnie-the-pooh", "winnie the pooh", and simply "pooh".
This is good. Dear Pooh may be a silly old bear, but his nose will always lead him somewhere.
(Oh, and this same search will also show the one project with keyword "Thé". Excuse me while I make that not a thing, on live...)
A few projects will certainly have their keywords updated, as those become both more visible and more useful, but overhauling the keywords on our thousands of projects is actually even more work than most power-users would imagine, and is not in the scope of this PR.

If quoting a multi-word "keyword" to keep it together is easier for either of you "hotshots" than it is for me (I've not written so much new code as either of you!), or if you can think of another way to keep the best of both worlds, then I'm all ears. Otherwise, I'll certainly be here for the "running by" when we're sure we know our limitations. 😉

$query = $this->db->query($sql);
return $query->result_array();

}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a good idea to use the escaping methods provided by CodeIgniter for this: https://www.codeigniter.com/userguide3/database/queries.html#escaping-queries

It would look something like this:

public function autocomplete($term)
{
	// Escaping -- https://www.codeigniter.com/userguide3/database/queries.html#escaping-queries
	$escaped_term = $this->db->escape_like_str($term);

	$sql = 'SELECT DISTINCT k.value
	FROM keywords k
	JOIN project_keywords pk ON k.id = pk.keyword_id
	WHERE k.value like "' . $escaped_term . '%" ESCAPE \'!\'
	  AND pk.project_id IS NOT NULL
	ORDER BY k.value ASC';

	$query = $this->db->query($sql);
	return $query->result_array();
}

This helps prevent nasty SQL injection attacks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keywords are currently being escaped already in Librivox_search.php, thus:

$keyword_clause = '';
if (!empty($params['keywords']))
{
$keywords = array($params['keywords']);
$keywords = array_map('trim', $keywords); // clean it up
$escaped_keywords = [];
foreach ($keywords as $keyword)
$escaped_keywords[] = $this->db->escape($keyword);
$in_keywords = implode(", ", $escaped_keywords);

Are you suggesting we need further protection beyond this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh! I get it now. Did not include a fix for this in my latest update. Will take another look at this tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, once the keywords are put into the search, it's fine, but we're putting $term directly into the auto-complete SQL itself, which could do bad things too 🙂

Here's the query again:

$sql = 'SELECT DISTINCT k.value
FROM keywords k
JOIN project_keywords pk
ON k.id = pk.keyword_id
WHERE k.value like "' . $term . '%"
ORDER BY k.value ASC';
$query = $this->db->query($sql);
return $query->result_array();

Having the ORDER BY bit on a newline stumps my crappy hacking skills, but imagine someone unknowingly removes the newline, changing it to this:

$sql = 'SELECT DISTINCT k.value
FROM keywords k
JOIN project_keywords pk
ON k.id = pk.keyword_id
WHERE k.value like "' . $term . '%" ORDER BY k.value ASC';
$query = $this->db->query($sql);
return $query->result_array();

Now, you could put this in the keyword field and have it dump out all the email address in the Librivox database:

" UNION SELECT email AS value FROM users --

The full query becomes something like this:

SELECT DISTINCT k.value
FROM keywords k
JOIN project_keywords pk
ON k.id = pk.keyword_id
WHERE k.value like "" UNION SELECT email AS value FROM users --    %" ORDER BY k.value ASC

So yeah, I don't know how to make a hack like that work with the newline there, but I bet there's someone who can!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I was too slow to comment haha, glad you got what I meant 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for thinking of SQL injection - escaping is good, parameterized queries are even better. We already have a tracker for existing SQL injection bugs (#19), and I've managed to fix some of them, with a lot more to go. You can see an example of converting to parameterized queries in #145.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never looked at how the auto-complete works before, but man am I glad you figured it out first hahaha

Seriously though, I think it would be good to write a little docstring to help other contributors in the future. Maybe something like this:

/**
 * Set up auto-complete boxes using jQuery Autocomplete.
 *
 * Input elements need to define the following attributes:
 *
 *   - data-search_func - The backend ajax function to call
 *   - data-search_field - The name of the field
 *   - data-search_area - Which area on the page is being auto-completed
 *   - data-array_index - Used when you've got multiple inputs in one area (e.g. multiple authors)
 *
 * In the pages where this script is included, you'll also need to define two
 * functions:
 *
 *   - autocomplete_assign_vars - Maps backend results to values to show in the drop-down
 *   - autocomplete_assign_elements - Handles updating inputs when a value is selected
 *
 * See also: https://api.jqueryui.com/autocomplete/
 */
function set_autocomplete() {
  ...
}

Which also leads me into the next question -- why not follow how the others are doing it, and define the two javascript functions in advanced_search.php rather than modify this file? I know it's ugly, but better to have consistent ugly things rather than special cases. For example, you could drop something like this in:

<!-- I've mostly stolen this from section_compiler/index.js -->
<script type="text/javascript">
	function autocomplete_assign_vars(item) {
		return item.value;
	}

	function autocomplete_assign_elements(search_area, ui, array_index) {
		switch (search_area) {
			case 'keywords':
				document.getElementById("keywords").value = ui.item.label;
				break;
		}
	}
</script>

<script type="text/javascript" src="https://librivox.org/js/libs/jquery-1.8.2.js?v=1710057521"></script>
<script type="text/javascript" src="https://librivox.org/js/libs/jquery.validate.js?v=1710057521"></script>
<script type="text/javascript" src="https://librivox.org/js/libs/jquery-ui-1.8.24.custom.min.js?v=1710057521"></script>
<script type="text/javascript" src="https://librivox.org/js/common/autocomplete.js?v=1710057521"></script>
<!-- Remove the `/project_launch/index.js` import, it defines it's own autocomplete methods that clobber ours -->
<!-- <script type="text/javascript" src="https://librivox.org/js/public/project_launch/index.js?v=1710057521"></script> -->
<script type="text/javascript" src="https://librivox.org/js/common/jquery.tagsinput.min.js?v=1710057521"></script>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much. These all strike me as excellent suggestions. I've implemented them on my development system and they work fine.

@@ -47,7 +49,7 @@
<div class="control-group">
<div class="controls center">
<label for="keywords" ><span class="span2">Keywords:</span>
<input type="text" class="span4" id="keywords" name="keywords" value="<?= htmlspecialchars($advanced_search_form['keywords']) ?>"/>
<input type="text" name="keywords" value="" id="keywords" class="autocomplete" data-search_func="autocomplete_keywords" data-search_field="keywords" data-search_area="keywords" data-array_index="0" style="float:none; vertical-align:middle;margin-top:0px; width: 356px; max-width:500px; margin-left:0; " />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: What are the styles for? E.g. why have a specific width/max width?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seemed to be an essential change when I first worked through this, as changing the class of the field (required to make autocomplete work) appeared to be changing the layout. However, as I've now implemented your approach on my local environment, it seems I thankfully don't need this local styling anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, awesome :) Unlike IDs, an HTML element can have more than one class, so if span4 is important for the layout, then the input can have both classes:

<input class="span4 autocomplete" ... />

class attribute docs on MDN

But if you're happy with how it is without the span4, then I wouldn't change it!

<script type="text/javascript" src="https://librivox.org/js/libs/jquery-ui-1.8.24.custom.min.js?v=1710057521"></script>
<script type="text/javascript" src="https://librivox.org/js/common/autocomplete.js?v=1710057521"></script>
<script type="text/javascript" src="https://librivox.org/js/public/project_launch/index.js?v=1710057521"></script>
<script type="text/javascript" src="https://librivox.org/js/common/jquery.tagsinput.min.js?v=1710057521"></script>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Are all of these needed? E.g. https://librivox.org/js/public/project_launch/index.js?v=1710057521 seems suspicious?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did check the effect of removing each one of these individually to see if they were all needed, and had come to the conclusion that they were. In some cases the only side-effect of not including one of these scripts was to cause an exception to be raised and noted in the Javascript console, while the new dropdown function continued to work fine. However, I decided it would be better not to be implementing code that was going to be causing these exceptions to be thrown.
That said, it turns out there was either something wrong with my checking process, or the new way of implementing this that you have suggested does now completely obviate the need for the script you have highlighted. Without it, the dropdown is working fine on my local machine, and not causing exceptions to be thrown.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, makes sense! And yeah, if the autocomplete_assign_vars or autocomplete_assign_elements functions weren't set, then you'd get errors as soon as the autocomplete code ran. Bringing in those scripts would set them, which would stop the errors 🙂

FROM keywords k
JOIN project_keywords pk
ON k.id = pk.keyword_id
WHERE k.value like "' . $term . '%" AND pk.project_id IS NOT NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Is the AND pk.project_id IS NOT NULL significant here? This isn't a LEFT JOIN, so unmatched rows should be discarded I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. As you can see, I'm no hotshot developer! (Much more a Librivox reader than a Librivox techie...)

It may take me some little time as I have a few social commitments, but I expect I'll be updating my pull request to incorporate most of the changes you've suggested within the next 24 hours.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm no hotshot developer!

Nah, you've done well! The auto-complete stuff is tricky to put together and you got there. The main thing is that you've made it happen, and I'm looking forward to using it.

Much more a Librivox reader

Well, you're a long way ahead of my score of 2 sections that's for sure 😉

@peterjdann peterjdann force-pushed the keywords_search_dropdown branch from 91a3a41 to 4f0a33e Compare June 15, 2024 07:52
@peterjdann
Copy link
Contributor Author

Have made a number of changes now in the light of garethsime's very helpful suggestions.
Thank you very much for your care and attention to this, garethsime!

@peterjdann peterjdann force-pushed the keywords_search_dropdown branch from 4f0a33e to 373a18f Compare June 15, 2024 09:20
@peterjdann
Copy link
Contributor Author

Have just added keywords escaping, as suggested by Gareth. Thank you again, Gareth. Bit of a dimwit you're dealing with here!

@garethsime
Copy link
Contributor

Also, I know this is completely off-topic for the PR, but I'm curious to know how your local dev setup works?

There was a fair amount of discussion a while back on how best to support other devs wanting to make contributions, so I'm always interested in hearing how people went with the setup and what they ended up doing, if you can spare the time 🙂

@notartom
Copy link
Member

notartom commented Jun 15, 2024

Hi! I'm not a Librivox dev, so feel free to ignore my comments, but I do contribute occasionally, so I thought I'd throw in my suggestions 🙂

You're as much a Librivox dev as any of us - as I've mentioned before in various places, the original author has long since moved on, so we're all trying our best to figure out the code base and make incremental improvements, hopefully without breaking anything. Your suggestions and comments are definitely valid, and I welcome your reviews of any PR.

@peterjdann peterjdann force-pushed the keywords_search_dropdown branch from 373a18f to a759c3b Compare June 16, 2024 01:15
@peterjdann
Copy link
Contributor Author

Does strike me as possibly overkill, but I've now parameterised the query that looks up keywords in use, following a comment from notartom (which comment has now been deleted, I think?)

@peterjdann
Copy link
Contributor Author

peterjdann commented Jun 16, 2024

Also, I know this is completely off-topic for the PR, but I'm curious to know how your local dev setup works?

There was a fair amount of discussion a while back on how best to support other devs wanting to make contributions, so I'm always interested in hearing how people went with the setup and what they ended up doing, if you can spare the time 🙂

Have looked at my old notes on how I did this the first time on a Mac running Ubuntu under Parallels Desktop. Was able to pare down the necessary instructions to a fairly simple procedure, which I was able to verify worked correctly on a second Mac of mine running a similar setup this afternoon. I have posted those instructions in the place where the original discussion took place.

@peterjdann
Copy link
Contributor Author

peterjdann commented Jun 18, 2024 via email

@redrun45
Copy link
Collaborator

redrun45 commented Jun 18, 2024

Thé IS a thing in French (tea). to a quick check before you delete that.

Thanks, that's good to know! But here, it was short for 'Thénardier', having been truncated by accident.

@peterjdann
Copy link
Contributor Author

peterjdann commented Jun 18, 2024 via email

@redrun45
Copy link
Collaborator

Very good! Yes, sadly, exposing keywords is also going to expose keyblunders — there’s no denying that. I personally believe, though, the upside outweighs the downside.

Agreed there! When keywords are more useful (as they will be, thanks to your changes here and elsewhere!), the obviously incorrect ones will be corrected over time. But since we can't search the catalog by description, and since everything else has a very finite set of "correct" values, the keywords will remain the only "free-form" discoverability for a project. Since #225 will cover the "exact match" use-case, I'd prefer we still at least have the option of a fuzzy keyword search.

I suppose the direction I'm going is: could we do something like the "Exact match" check-box for keywords in Advanced Search? Would that be easier than a multi-keyword-autocomplete box?

@peterjdann
Copy link
Contributor Author

peterjdann commented Jun 22, 2024

But since we can't search the catalog by description, and since everything else has a very finite set of "correct" values, the keywords will remain the only "free-form" discoverability for a project. Since #225 will cover the "exact match" use-case, I'd prefer we still at least have the option of a fuzzy keyword search.

I suppose the direction I'm going is: could we do something like the "Exact match" check-box for keywords in Advanced Search? Would that be easier than a multi-keyword-autocomplete box?

I think an "Exact match" checkbox would be a backward step. At the moment, my code is matching all terms that begin with what the user has entered. In practice, this means that if the user enters, say, "french revolution", that term is going to show up as the first hit, but the user will see other hits as well. In effect, my current code is delivering an exact match (if there is one) but also partially fuzzy matches.

Let me run past you an approach that you might find an improvement on how I've done this at the moment.

Imagine a scenario where the user has entered the search term "roman" and this causes the following SQL to be executed:

select value as text, 'A' as priority
from keywords
where value like 'roman%' and not INSTR(value, ';') and not INSTR(value, ',')
UNION
select concat ('=== ', 'other terms containing ', '"roman"', ' ===') as text, 'B' as priority
UNION
select value as text, 'C' as priority
from keywords
where value like '%roman%' and value not like 'roman%' and not INSTR(value, ';') and not INSTR(value, ',')
order by priority ASC, text ASC
limit 200

I know the "limit" value here is not what we actually use, but just for the purposes of illustration right now we need a high value for "limit" ito see the effect I'm after. (Of course we would also need to strip out the "priority" values from what the SQL is returning — I'm not suggesting we're going to display those to our users).

The point is, we can separately display what are, in effect, two lists in one return set, one of which lists is "fuzzier" than the other, while in effect documenting what we're showing the user within the result set itself.

Why do this, and not just present the user with a "fully fuzzy" result set? Well, try it using the same example:

select value as text
from keywords
where value like '%roman%' and not INSTR(value, ';') and not INSTR(value, ',')
order by text ASC
limit 200

It seems to me the result set is much less helpful.

As for the

not INSTR(value, ';') and not INSTR(value, ',')
why am I now proposing that?

Well, try running the same SQL but with that conditions stripped out. Turns out there really are some pretty messy entries in our keywords table that need to be rationalised. Down the track, it might be possible to write a utility page to help with that task — but for now, it might be best to hide some easily excludable bad examples from our users.

@redrun45
Copy link
Collaborator

Just to be sure we're on the same page:
I'd suggest a check-box for the actual search, not necessarily for making the keyword autocomplete suggestions more fuzzy. You make good points on autocomplete. The "prioritized" version looks excellent from where I sit, too. 😉

The part I suggest making optional is this change in Librivox_search.php. The opening comment for this PR describes that change as:

Fixing a bug in the catalog's Advanced Search function that currently results in the generation of an incorrect list of results if a user enters a multiterm keywords entry such as "political fiction" or "french revolution"

When the user intends to search for projects tagged with a particular multi-word keyword, which they've selected from the autocomplete suggestion list... then doing a fuzzy search for projects with other keywords would indeed be a bug!

But if a user intends to do a broader search across multiple relevant keywords, that fuzzy matching is the best we have at the moment.
Let's say I want to study the history of France in a particular period. I select the Genre as "*Non-Fiction -> History -> Early Modern". Then instead of selecting just one keyword from a list of suggestions, I check or un-check a box, and then enter a collection of words that might appear in any number of relevant keywords. Say, "french", "france", "paris".
Would it be nice if we could pick several multi-word terms like "french revolution" and "white hoods of paris"? Absolutely! But if we don't (yet) have that, we might want to keep this (otherwise buggy) code as an alternate mode.


Ok! Suggestion made. It is definitely more work, even if a check-box is the simplest thing I know to ask for. Let me know what you think.

@peterjdann
Copy link
Contributor Author

peterjdann commented Jun 23, 2024 via email

@redrun45
Copy link
Collaborator

Given all the above, I don’t think it’s unreasonable to describe the current implementation as suffering from a bug.

Fair enough! I'm not going to argue. I only intended to let you know of an objection I thought would likely be raised. I'll make sure folks know this PR is ready for review as-is.

@peterjdann
Copy link
Contributor Author

peterjdann commented Jun 23, 2024 via email

@peterjdann
Copy link
Contributor Author

peterjdann commented Jun 24, 2024

Have now changed SQL so the list of keywords presented in the dropdown shows, as first priority, all keywords that START WITH the letters the user has typed, then a text separator indicating that the entries below the separator are some terms that INCLUDE the letters the user has typed, then as last priority terms not already presented that INCLUDE the letter sequence the user has typed. The maximum number of entries shown in the dropdown is 200.

I did NOT implement a change I foreshadowed above which would have excluded entries including ";" or "," characters.

The query term is now both "like-suitable" escaped and parameterised.

To get a glimpse of the kind of tidying up work that may lie ahead (if anyone is brave enough to take it on) trying typing "libriv" in the keywords search field on a system where this change has been implemented.

@redrun45
Copy link
Collaborator

I did NOT implement a change I foreshadowed above which would have excluded entries including ";" or "," characters.
...
To get a glimpse of the kind of tidying up work that may lie ahead (if anyone is brave enough to take it on) trying typing "libriv" in the keywords search field on a system where this change has been implemented.

I didn't look to close until now, in case you intended to make that foreshadowed change. 😉
Actually, it's surprising how correlated these two things are. It would seem that any time either a ";" or "," shows up inside a keyword, it's a typo. In fact, most of those have already been fixed, and the remaining few can be done pretty easily! 🥳

I was going to suggest adding AND instances > 1 to the SELECT statements for autocomplete, now that I know the old, typo'd keywords are still in there. But of course, learning this, I realize that we'll need an extra line in update_keyword_stats, so that unused keywords get set to instances = 0!

I had assumed the unused keywords were deleted, so didn't even think to check that, for all my "hammering". I'll add another suggestion to #225. 😅

@@ -15,6 +15,15 @@ function autocomplete_author()
$this->load->model('author_model');
echo json_encode($this->author_model->autocomplete($term, $search_field));
}

function autocomplete_keywords()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: I can see that is is referenced as the data-search_func in advanced_search.php below, so even though this is PHP and below is HTML, I'm assuming this somehow gets called when a user types in the keywords <input>. Does this mean we're running an SQL query every time a user adds or removes a character from the keywords search box? I'm worried about load on the server, and I'm wondering if it wouldn't be possible to cache the list of keywords client-side (or even server-side, but at least have them cached without looking them up each time?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been a while since I first worked on this, but (from memory) yes, there is a DB lookup at each key press. Your suggestion of creating a cache (client or server side) and querying that makes very good sense. I'll see what I can do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we're running an SQL query every time a user adds or removes a character from the keywords search box?

For users who are "hunt-and-peck" typists, we probably are. 😅
By default, our JS library has a delay of 300ms after the last event, before it sends any queries to the server. That ought to mean no more than ~3 queries per second per user (and usually less), rather than a big rush of queries when a touch-typist goes at it.

This also applies to author autocomplete. If it's a problem here, we might be better off increasing the delay for both. Looks like we'd add the option in /public_html/js/common/autocomplete.js, with some higher value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, well that's a new angle I wasn't aware of. I'll wait to see what Artom thinks of redrun45's suggestion here before I embark on figuring out how to set up and query a cache.

Copy link
Contributor Author

@peterjdann peterjdann Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to point out that the call to the DB that runs in the current implementation to generate a dropdown list has a LIMIT statement in it, as follows:
$sql = 'SELECT DISTINCT k.value, "A" AS priority
FROM keywords k
JOIN project_keywords pk
ON k.id = pk.keyword_id
WHERE k.value LIKE ?
UNION
SELECT "=== some other keywords containing "?" ===" AS value, "B" AS priority
UNION
SELECT DISTINCT k.value, "C" AS priority
FROM keywords k
JOIN project_keywords pk
ON k.id = pk.keyword_id
WHERE k.value LIKE ?
AND k.value NOT LIKE ?
ORDER BY priority ASC, value ASC
LIMIT 200';

To populate a cache, it seems to me I'm going to need to something like:

  1. Run a one time per usage SELECT from table keywords with a join on project_keywords with NO limit statement, and then, additionally:
  2. At each subsequent keypress, loop through my cache (most likely a PHP array), potentially twice, using strpos() to look for matches where:
    a) haystack contains needle*, and
    b) haystack contants needle but not needle*

Now, the question in my mind is, is this latter approach likely to be more, or less, resource-intensive than the approach we've got at the moment, where every 300 ms (max) a user hits the database with a query that returns at most 200 items that we can use immediately as is, without the need for any further processing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, potentially much higher. Author autocomplete is currently "hooked up" to pages that are only useful for volunteers, rather than these search pages that the general public might use. It might be nice if we could hook it up to Advanced Search too... but that's if we can satisfy any performance impacts here, of course. 😉

If it helps any, this Keywords-autocomplete query would only be used when the user opts for "Advanced" search, and then starts entering keywords, which seems like a relatively small proportion of searches to me. Folks using the regular search, or advanced search without keywords, won't cause additional load.

In the cases where this method is used, the search itself might be a tad lighter because of 1 - fewer "keywords" entries matched, means fewer projects matched, means fewer (and hopefully more relevant) results. For example, the not-terribly-nice case where the user enters "american civil war", and receives projects whose keywords match any of "%american%" OR "%civil%" OR "%war%", which is quite a bit broader than they probably intended.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to get my head around how either a server-side or client-side cache might be implemented in practice, and I must admit I'm scratching my head.

Re a server-side cache, unless I've got this all wrong, it's not like PHP objects continue to exist between https calls. Therefore (am I wrong about this??), it's not like there's any object in the CodeIgniter setup for which I can do a one-time setup of a keywords cache (for example, at the moment the keywords search field gets the focus) and then expect this cache to persist between user key-clicks in the keywords field, each (or many) of which are triggering AJAX calls on the server.

This leaves the theoretical possibility of setting up a client side cache — but give a thought, for a moment, to how complex this might be, and ask yourself why none of the canonical examples you'll find on the web about how to implement autocomplete with an input field use any such case (even though the requirement must be a very common one indeed).

Right now, the implementation is really very simple and elegant: we simply assign to the input field the server-side function that is to populate the list, and then implement that server-side function in PHP. (It's a little more complicated than that — but the amount of Javascript involved is very small). Against that, to implement a client-side cache I imagine I would need to make an AJAX call that would populate a Javascript array (say, on an onFocus event for the keywords input field), and then, on each keystroke in that field, loop through that array (potentially, twice) looking for matches.

Is this what you are asking me to do?

Copy link
Contributor Author

@peterjdann peterjdann Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other points I'm wondering about:
(a) If I have to have the client read a full list of all the keywords in use via AJAX, am I in danger of hitting some limit in the possible size of the response from the server?
(b) What would you think of writing a server-side keywords cache to an operating system file, and reading from that? If so: What should be the location of the file? Would you be concerned about the resource-use implications of that approach? (The file might be considered 'valid' by the cache-reading code if it bore a relatively recent timestamp, otherwise be rewritten following a fresh database call). There's a suggestion here (https://stackoverflow.com/questions/45740757/php-persistent-cache-data) that achieving 'persistence' by writing to, and reading from, a file might not be as expensive as it sounds if the operating system concerned in practice performed much of the work in memory.

Copy link
Contributor Author

@peterjdann peterjdann Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have had a go today at implementing an approach that caches a list of all keywords currently used in projects on the server side, as a file.

I haven't formally pushed this to this PR, as I think you need to have a look at it and think about whether you think it's a good idea in principle. You may well have some improvements to suggest right off.

In this implementation, I'm suggesting a maximum cache life of five minutes, on the assumption that if an administrator has added or deleted some keywords, they might well expect them to show up in the dropdown list relatively soon.

I've done only scanty testing. At face value, the results look identical to the current, wholly DB-centric, approach. I have verified that I can type Greek keyboard letters into the Keywords search field and get some Greek results. On my local machine the results seem to be available speedily — but that hardly amounts to a meaningful load test, of course.

In the process of getting this working, I did, at one point, change the permissions on my local cache directory (the one that is specified in the code). I have not double-checked whether this is, in fact, necessay.

At the code level, the change is confined to a single function within the Keywords model, and looks like this:

    	public function autocomplete($term)
	{
		// Paths for file operations with CodeIgniter helper are relative
		// to location of index.php
		$CACHE_DIR = '../application/cache/';
		$KEYWORDS_CACHE_FILENAME = 'keywords_cache.txt';
		$MAX_RETURN_VALUES = 200;
		$keywords_cache_file_exists = false;
		$keywords_cache_is_fresh = false;
		$keywords_cache_refresh_interval_seconds = 300;	
		$json_encoded_keywords_cache_as_read_from_file = '';
		
		$this->load->helper('file');
		
		// Does keywords cache file yet exist?
		$files_in_cache_dir = get_dir_file_info($CACHE_DIR);		
		if (array_key_exists($KEYWORDS_CACHE_FILENAME, $files_in_cache_dir)) 		
		{
			$keywords_cache_file_exists = true;
		}
		
		// Is the existing cache fresh?
		if ($keywords_cache_file_exists) 
		{
			$cache_date = get_file_info($CACHE_DIR . $KEYWORDS_CACHE_FILENAME, 'date');
			$timestamp_of_current_cache_file = $cache_date["date"];
			$cache_age = time() - $timestamp_of_current_cache_file;
			if ($cache_age < $keywords_cache_refresh_interval_seconds) 
			{
				$keywords_cache_is_fresh = true;
			}				
		}

		// Write a new keywords cache if none currently exists, 
		// or if the existing cache needs to be refreshed 
		if (!$keywords_cache_file_exists or !$keywords_cache_is_fresh)
		{
			$sql = 'SELECT DISTINCT k.value
			FROM keywords k
			JOIN project_keywords pk
			ON k.id = pk.keyword_id
			ORDER BY value ASC';
			$query = $this->db->query($sql);
			$data_for_keywords_cache = $query->result_array();
			write_file($CACHE_DIR . $KEYWORDS_CACHE_FILENAME, json_encode($data_for_keywords_cache));			
		}
		
		$cached_result_array = array();
		
		// Read the existing cache file if it is fresh
		
		if ($keywords_cache_file_exists and $keywords_cache_is_fresh) 
		{
			$json_encoded_keywords_cache = read_file($CACHE_DIR . $KEYWORDS_CACHE_FILENAME);
			$associative = true;
			$keywords_cache_array = json_decode($json_encoded_keywords_cache, $associative);
			
			foreach($keywords_cache_array as $row => $inner_array)
			{
  				foreach($inner_array as $inner_row => $value)
  				{
    					if ( strpos($value, $term)  === 0 )
					{						
						$new_element = ["value" => $value];
						array_push($cached_result_array, $new_element);

					}
  				}
  				$array_size = count ($cached_result_array);
				if ( count ($cached_result_array) >= $MAX_RETURN_VALUES) 
				{
					break;
				}
			}
			
			// If we have at least one good match above, and if we have not yet exceeded 
			// the number of items we can show in our dropdown list, show some additional partial matches
			
			if (( count ($cached_result_array) <= $MAX_RETURN_VALUES) and (count ($cached_result_array) > 0))
			{
				$divider = "=== some other keywords containing '" . $term . "' ===";
				$new_element = ["value" => $divider];
				array_push($cached_result_array, $new_element);
			
				foreach($keywords_cache_array as $row => $inner_array)
				{
  					foreach($inner_array as $inner_row => $value)
  					{
    						if ( (strpos($value, $term)  !== 0 ) and ((strpos($value, $term))) )
						{						
							$new_element = ["value" => $value];
							array_push($cached_result_array, $new_element);
						}
  					}
  					$array_size = count ($cached_result_array);
					if ( count ($cached_result_array) >= $MAX_RETURN_VALUES) 
					{
						break;
					}
				}
			}			
		}
		
		return $cached_result_array;
	}
```

Copy link
Contributor Author

@peterjdann peterjdann Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now pushed to here what is basically the change shown above (which uses a file-based cache that is refreshed not more often than every five minutes), with a couple of minor adjustments. The first is that I have now made the matching process between what the user has entered and our list of keywords in use case-insensitive, so that, for example, if the user types "french", they will see results matching both "french" and "French". The second is that I have restored to the Keyword_model autocomplete() function the original DB-centric code, but commented out. This is to allow anyone wanting to compare the performance of the two approaches to switch between them just by commenting out whichever approach they wish not to use.
I have established that, at least on my development system, no changes of permissions for the 'cache' directory were required for this change to take affect.

@peterjdann peterjdann force-pushed the keywords_search_dropdown branch from 0f63a86 to 7638324 Compare February 5, 2025 05:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants