
Reevaluate the option to use Mariona's solution to place name disambiguation #26

alexhebing opened this issue May 27, 2019 · 3 comments

Comments

alexhebing commented May 27, 2019

There appears to be some room in the current project, in terms of time, to add a component that does something more interesting when collecting GIS coordinates for NER locations than simply consuming GeoNames or OpenStreetMap. I estimate this at roughly 50 to 60 hours.
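For reference, the simple baseline would be to query GeoNames directly for each recognised location. A minimal sketch (assuming the `requests` library and a registered GeoNames account; `demo` below is only a placeholder username):

```python
import requests

def geonames_lookup(place_name, username="demo"):
    """Return (lat, lon) for the top GeoNames match, or None if there is no hit."""
    response = requests.get(
        "http://api.geonames.org/searchJSON",
        params={"q": place_name, "maxRows": 1, "username": username},
        timeout=10,
    )
    response.raise_for_status()
    matches = response.json().get("geonames", [])
    if not matches:
        return None
    # GeoNames returns coordinates as strings.
    return float(matches[0]["lat"]), float(matches[0]["lng"])

print(geonames_lookup("Utrecht"))
```

The point of this issue is to end up with something smarter than this single lookup.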

Assess whether it is feasible, within the available time, to create a script as part of the pipeline that implements (parts of) the solution proposed by Mariona Coll Ardanuy in this article.

More broadly, assess the available methods in the field and pick one that is doable in the available time.

@alexhebing added the question label on May 27, 2019
@jgonggrijp (Member) commented

"Feasible"? Do you perhaps mean to assess how much time it would take? Is there an a priori time limit for a go/no go decision?

@alexhebing (Author) commented

Thank you @jgonggrijp for your ever-critical eye. I have updated the description with more context to explain what I have in mind. A quick first look at the article (which I have seen before, of course, but I now understand the broader context much better) already shows that a complete implementation is out of the question, but perhaps parts of the solution are usable.

@alexhebing (Author) commented

Mariona's solution has two main components: 1) a knowledge base extracted from Wikipedia (and, for a small part, from GeoNames); 2) a series of (two) scripts that suggest geocoordinates for a given location plus 100 context words (50 to the left, 50 to the right). The scripts we received from Jaap/Mariona only cover component 2); the knowledge base is included as SQL dumps. After discussing this with Berit, I see two options that should be implementable in the time estimated above:

  1. Use the knowledge base as-is (four databases with Wikipedia data from 2014) and do a minimal clean-up/refactoring of the scripts from 2) (e.g. leave the database queries and the scoring algorithms as they are). Instead of having to run them one by one manually, make them do their work from a single call; this probably means keeping the candidates selected by the first script in memory instead of writing them to files. Then wrap the scripts in a web service, so that the placenamedisambiguation pipeline can call it (a rough sketch follows after this list).

  2. Create a new (possibly dynamic) knowledge base along the lines of how Mariona built hers, perhaps with a data structure that is more convenient for an application consuming the data. Base it either on a fresh Wikipedia dump or on calls to the Wikipedia API to retrieve relevant pages, and do 'something interesting' in terms of semantic comparison between the query term (with its context) and the Wikipedia content. That 'something interesting' should stay minimal; I don't think there will be enough time to reimplement the algorithms from 1). (See the second sketch below.)
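To make option 1 concrete, this is roughly the shape of web service I have in mind (assuming Flask, but any small framework would do). It is only a sketch: `select_candidates` and `score_candidates` are placeholder names standing in for the two existing scripts, whose real entry points will look different.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def select_candidates(toponym):
    """Placeholder for the first script: query the knowledge base for candidate locations."""
    return []

def score_candidates(candidates, context_words):
    """Placeholder for the second script: rank the candidates against the context words."""
    return candidates[0] if candidates else None

@app.route("/disambiguate", methods=["POST"])
def disambiguate():
    payload = request.get_json()
    toponym = payload["toponym"]
    context_words = payload.get("context", [])   # up to 100 words: 50 left, 50 right
    candidates = select_candidates(toponym)      # kept in memory, not written to files
    best = score_candidates(candidates, context_words)
    return jsonify({"toponym": toponym, "best_candidate": best})

if __name__ == "__main__":
    app.run(port=5000)
```

The pipeline would then POST a toponym plus its context words and get the best candidate back in the response.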
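For option 2, a bare-bones version of "retrieve relevant pages via the Wikipedia API and compare them with the query context" could look like the sketch below. The endpoint and parameters are the standard MediaWiki API ones; the overlap score is deliberately naive and only stands in for whatever semantic comparison we end up choosing.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def wikipedia_candidates(toponym, limit=5):
    """Search Wikipedia and return (title, intro extract) pairs for the top hits."""
    search = requests.get(API, params={
        "action": "query", "list": "search", "srsearch": toponym,
        "srlimit": limit, "format": "json",
    }, timeout=10).json()
    titles = [hit["title"] for hit in search["query"]["search"]]
    if not titles:
        return []
    pages = requests.get(API, params={
        "action": "query", "prop": "extracts", "exintro": 1, "explaintext": 1,
        "exlimit": limit, "titles": "|".join(titles), "format": "json",
    }, timeout=10).json()["query"]["pages"]
    return [(page["title"], page.get("extract", "")) for page in pages.values()]

def rank_by_overlap(context_words, candidates):
    """Rank candidates by vocabulary overlap with the context words."""
    context = {word.lower() for word in context_words}
    scored = [(len(context & set(extract.lower().split())), title)
              for title, extract in candidates]
    return sorted(scored, reverse=True)

print(rank_by_overlap(["treaty", "peace", "France", "Spain"],
                      wikipedia_candidates("Utrecht")))
```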

Closing this, continued in #30.
