feedback on clusters view #263

Open
ymgan opened this issue Sep 12, 2023 · 11 comments

ymgan commented Sep 12, 2023

Hi @MortenHofft

I remember you asked for feedback on the clusters view in the hosted portal community call. @Antonarctica and I finally sat together and had a look at it. Here's our feedback:

cool Cool COOL!!!!!

Anton exclaimed "COOL!!" a million times while clicking through the nodes today!! We think that it's really well done and we appreciate the effort very much!! The feedback below is just for icing on the cake!

Include entity selected in the clusters tab

Screenshot 2023-09-12 at 15 01 01

https://www.biodiversity.aq/occurrence/search?entity=1434623051&from=30&view=CLUSTERS

We think that including the selected entity (indicated by the red arrow) in the clusters tab (the overlay in the screenshot), highlighted in a different colour, would help users compare it with the other nodes in the same cluster. To us it is a cluster of 4, but we saw only 3 entities in the clusters tab, so we were a little confused at first.

Meaning of node with dark outline

Screenshot 2023-09-12 at 15 08 46

We aren't sure what it means when a node has a darker outline (indicated by the red arrow in the screenshot above). We guessed that it marks an occurrence record scoped to our hosted portal, but it wasn't clear to us.

How are clusters sorted?

We are curious how clusters are sorted, because knowing that would help users decide whether to browse through the pages; sometimes there can be hundreds or thousands of them.

Pagination for the clusters

Depending on how clusters are sorted, users may want to jump to the last page or enter a page number. This would be more helpful than clicking the next button repeatedly.

What does the puzzle icon mean?

Screenshot 2023-09-12 at 15 15 17

We are not sure what the puzzle icon means (indicated by the red arrow). We believe it refers to the data from all of the extensions associated with the record, but we aren't sure.

Not sure where to start when first exploring

Our initial response when we saw the view was that we weren't sure where to start. I'm not sure how constructive this comment is, because I don't know how to present this information better. But as we drilled down into the nodes and edges, we understood what was happening, and we appreciate the work and thought that went into this.

Thank you so much for working on this!!

@ymgan ymgan changed the title feedback for clusters view feedback on clusters view Sep 12, 2023
@ymgan
Author

ymgan commented Sep 13, 2023

also, spotted a small typo in the legend

contains differemnt identifications

@ManonGros
Collaborator

ManonGros commented Nov 3, 2023

Thanks @ymgan !

To add to your feedback (so @MortenHofft has it all in one place).

Yes the feature is super cool!
I agree that it would be good to add a legend for the dark circle (it isn't obvious that it is the occurrence in the selection).

Highlighting the occurrence (and cluster) that was clicked on would also be helpful.

In addition to that:

  • In the About section, it would be good to put a reference to the documentation on how those clusters are generated. Right now we have this blogpost; when we get the new technical documentation website, there will be a page for it.
  • I really like being able to see the assertions supporting the relationship between records when hovering over the edges of the graph. However, the text is difficult to read (it is a narrow window with no separation between the assertions).
    • Could the window be a bit wider?
    • Could there be a visual separation between the assertions? At the very least, commas.

I think that if those points were addressed, the feature could be in production.

I have some ideas of things I would like to see if they were possible. I don't know if those could be implemented at all. Feel free to discard.

  1. I would love to be able to select a cluster and have all the occurrences in my selection (so I could generate a download). The current pages with related records only show the related pair. With the visual feature, we can see all the vertices. It would be great to get them all in one table instead of having to click on all the occurrences.
  2. That one would be more of a "nice to have": I would like to see the clusters based on assertions (not just the final linking). For example, show only the occurrences that have the same date and species names. I can imagine this would be difficult to do and may not be worth it. It would be nice to be able to explore the data that way.

@MortenHofft
Member

I've looked at the above and implemented what I could. Some things I cannot do at this point.

  • How are clusters sorted?
    • They are not. You do not search for clusters, but for records that are clustered, without any sorting.
    • So pagination doesn't make much sense, as the order is random.
  • Download clusters
    • I cannot do that at this point, because we do not have search by GBIF IDs - also @timrobertson100 suggests that if we go this way we should probably have a proper download format for it
  • Search by cluster features
    • Would be great. But clusters do not exist and as such do not have properties we can search. It makes sense, but it would also require work in all layers.

@sformel-usgs

I revisited this functionality today and it has not lost its coolness :-)

I checked back in on it because I was interested in demonstrating the data journey of data collected on a NOAA ship (essentially the clustering of samples at the Smithsonian with the ROV photos in the NOAA Deep Sea Coral database).

One question, one note:

  1. @MortenHofft I understand that creating a download isn't trivial, but if I could get at the cluster data in any format, it would be appreciated. Currently, I don't see how I can go from the isInCluster term returned by the API to investigating those linkages without rerunning the code that constructs the clustered occurrences. Any suggestions?

  2. @ManonGros It looks like the links to the source code are broken on the blog that is referenced in the 'About' section: https://data-blog.gbif.org/post/clustering-occurrences/
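For readers following the question above, here is a minimal sketch of one possible route from clustered records to their linked records through the public GBIF API. It assumes that the occurrence search accepts an `isInCluster` filter and that each record exposes an experimental `related` endpoint; both should be checked against the current API documentation, since experimental endpoints may change. The helper names are hypothetical and the sketch only builds the URLs, it does not fetch anything.

```python
from urllib.parse import urlencode

API = "https://api.gbif.org/v1"  # public GBIF API base URL


def clustered_search_url(dataset_key: str, limit: int = 20, offset: int = 0) -> str:
    """Build a search URL for records in one dataset that are flagged as clustered.

    Assumption: the search endpoint accepts an `isInCluster` filter.
    """
    params = {
        "datasetKey": dataset_key,
        "isInCluster": "true",
        "limit": limit,
        "offset": offset,
    }
    return f"{API}/occurrence/search?{urlencode(params)}"


def related_url(occurrence_key: int) -> str:
    """Build the URL listing records linked to one occurrence.

    Assumption: the experimental `related` endpoint is available.
    """
    return f"{API}/occurrence/{occurrence_key}/experimental/related"
```

Paging through the first URL and requesting the second for each returned key would give the pairwise links for a dataset, at the cost of one request per record.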

@timrobertson100
Member

timrobertson100 commented Jan 1, 2025

Hi Steve

We don’t currently have a cluster download function, but will add it. If I made a custom export for you, would it help? If so, what filter are you interested in please - clusters that include a Smithsonian to NOAA connection?

I've fixed the link in the blog post. The code moved to here

@sformel-usgs

Tim, thanks for the quick answer. I hope you had a happy new year!

Yes, a Smithsonian to NOAA connection is exactly what I'm after. In short, what occurrences from the DSCRTP dataset ( https://www.gbif.org/dataset/df8e3fb8-3da7-4104-a866-748f6da20a3c) are clustered with occurrences from Smithsonian datasets (especially https://www.gbif.org/dataset/821cc27a-e3bb-4bc5-ac34-89ada245069d or https://www.gbif.org/dataset/26098c25-8f7f-4c71-97ac-1d3db181c65e).

It doesn't need to be foolproof. I'm just looking to demonstrate that we can QAQC the data across these institutions and improve collection metadata and publishing workflows. If I'm able to connect the dots, then I'll pass it back to y'all as a use case for developing a download of clusters.

@timrobertson100
Member

timrobertson100 commented Jan 3, 2025

As a first pass please see this TSV created using this SQL. Please let me know if you'd like any changes. The format is:

  • id1 - the NOAA record ID (source record)
  • id2 - the record ID it links to (remember it's just a hint)
  • dataset1 - the dataset key for the source record (always the NOAA dataset)
  • dataset2 - the dataset key for the linked record (may or may not be a Smithsonian record, but all clusters contain at least one Smithsonian record)
  • o1 - a preview of the source record
  • o2 - a preview of the linked record
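Since each row in this layout is one pairwise link, the rows have to be grouped to recover whole clusters. The sketch below shows one way to do that with a small union-find. It assumes the TSV has a header row with the column names listed above; the sample rows and IDs are invented for illustration and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows in the column layout described above; real IDs will differ.
SAMPLE_TSV = (
    "id1\tid2\tdataset1\tdataset2\to1\to2\n"
    "1\t2\tnoaa\tsmithsonian\tpreview-a\tpreview-b\n"
    "1\t3\tnoaa\tother\tpreview-a\tpreview-c\n"
    "4\t5\tnoaa\tsmithsonian\tpreview-d\tpreview-e\n"
)


def read_edges(handle):
    """Read (id1, id2) pairs from a tab-separated file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["id1"], row["id2"]) for row in reader]


def group_into_clusters(edges):
    """Merge pairwise links into connected groups using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for node in parent:
        clusters[find(node)].add(node)
    return sorted(sorted(members) for members in clusters.values())


clusters = group_into_clusters(read_edges(io.StringIO(SAMPLE_TSV)))
```

With the sample rows, records 1, 2 and 3 end up in one group and records 4 and 5 in another. Because the links are only hints, a group built this way is a candidate cluster rather than a confirmed one.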

@ManonGros
Collaborator

Maybe the link to the blogpost should be replaced by the link to the documentation? https://techdocs.gbif.org/en/data-processing/clustering-occurrences (apparently, the source code links should be updated there as well)

ManonGros added a commit to gbif/tech-docs that referenced this issue Jan 7, 2025
After noticing that the links didn't work any more, see gbif/hosted-portals#263
@sformel-usgs

sformel-usgs commented Jan 9, 2025

Thanks for the help; it's tough to identify a strategy for efficiently investigating these in a pairwise fashion (and I'm guessing you already know this). I suspect it might take me a while to make progress on this.

@timrobertson100
Member

Thanks @sformel-usgs - that is pretty accurate. Generally I load the table into a DB (e.g. clickhouse), start with the records I am interested in and then JOIN twice outwards to find things of interest (for example, a museum which might have imaged or sequenced the specimen). Really though, it calls for a graph DB, as this is just a table of edges. If you would like, we could collaborate in a GH repo together and see what we find - if you let me know some of the questions you're interested in, I'll be happy to try to help.
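As an illustration of the "JOIN twice" idea, here is a minimal sketch using Python's built-in sqlite3 in place of clickhouse. The table name, the dataset labels and the rows are invented, and it assumes each link is stored in the direction being followed; if the real table stores links in one direction only, the join conditions would need to cover both columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (id1 TEXT, id2 TEXT, dataset1 TEXT, dataset2 TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?, ?)",
    [
        # Invented rows: a -> b -> c -> d across four hypothetical datasets.
        ("a", "b", "noaa", "museum"),
        ("b", "c", "museum", "images"),
        ("c", "d", "images", "sequences"),
    ],
)

# Start from the links of one source dataset, then self-join the edge table
# twice to follow each link two steps further out.
TWO_JOINS = """
SELECT e1.id1, e1.id2, e2.id2, e3.id2
FROM edges AS e1
JOIN edges AS e2 ON e2.id1 = e1.id2
JOIN edges AS e3 ON e3.id1 = e2.id2
WHERE e1.dataset1 = ?
"""


def paths_from(connection, dataset):
    """Return four-record paths that start in the given dataset."""
    return connection.execute(TWO_JOINS, (dataset,)).fetchall()


rows = paths_from(conn, "noaa")
```

Each extra hop needs another self-join, which is why a graph database becomes attractive once the paths of interest get longer.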

@sformel-usgs

I appreciate the offer, but unfortunately, I'll have to put this to the side for the next couple of weeks. One realization I had while I was combing through the data is that the DSCRTP data might be discarding useful (e.g. catalogNumber) info during their ingestion process. I'll talk to them first before I spend too much more time on these particular datasets.

5 participants