Description
Last tuesday, during the december call, it was mentioned that support for a list of strings might be useful. This is a straightforward proposal for such a feature in the standard library.
Fortran has supported variable-length strings since the 2003 standard, but it does not have a native type to handle collections of strings of different lengths. Such collections are quite useful though and the language allows us to define a derived type that can handle such collections.
This proposal considers the features that would be useful for this derived type, a list of strings. It does not prescribe the methods for implementing the type. Given the ease by which arrays are handled in Fortran, it is possible to use allocatable arrays, but perhaps a linked list may prove to be more efficient, at least in certain scenarios of use. Therefore the proposal concentrates on the usage.
A further limitation of the proposal: there is no provision for nested strings or for storing data with the string.
Methods for the list of strings:
-
Method
insert
- insert a new string after the given index.
Special index values:head
andend
, wherehead
means insert the new string before the first one andend
means insert it after the last one. Arithmetic withend
is possible:end-1
means insert before the last element and so on.Note: besides a string you can also insert another list or an ordinary array of strings (in the latter case all inserted strings will be of the same length as the array)
Convenience method:
append
- same as insert with index = endIndices beyond the bounds of the list:
Should they cause an error? Or should they be interpreted as insert at the start or at the end as ifhead
orend
were given? Or should the list simply grow to that length? This potentially causes holes. -
Method
delete
- delete a single string or a range of strings from the list.
head
andend
have similar meanings as withinsert
. -
Method
replace
- replace a string with a new value at the indicated location.Note: should we support replacement with multiple strings? My initial take at this is: no, if you want to do that, it can be done with a combination of
insert
anddelete
. Thereplace
method is a convenient way to deal with individual strings. -
Method:
get
- get one element from the list and return it as a string -
Method:
range
- get all strings in a given range and return them as a new list. -
Method:
index
- find the index of the first occurrence of a string in the list. Returns 0 (zero) if there is no such string. May also look for the last. -
Method:
index_sub
- similar toindex
but rather than the entire string, a matching substring is sought for. -
Method:
destroy
- destroy the list -
Assignment: you can use lists in the context:
list1 = list2
where list1 gets a copy of the entire list of strings.
Some methods will be subroutines, others will be functions. Special attention should be paid to error conditions. A guiding principle: no surprises. The philosophy might be: the list is - conceptually - infinitely long and if an element has not been set, then it is an empty string (a string of length 0).
Activity
Romendakil commentedon Dec 17, 2020
This looks very similar to the implementation of iso_varying_string by Rich Townsend.
arjenmarkus commentedon Dec 17, 2020
Romendakil commentedon Dec 17, 2020
No, sorry, I was too hasty. You are right. I supports strings of variable length as a container of allocatable character arrays. But it doesn't implement arrays of strings of varying length.
arjenmarkus commentedon Dec 17, 2020
epagone commentedon Dec 17, 2020
I think that this addition would really be amazing.
Comments
Can you bit more specific? What do you mean by that?
IMHO if
append
is provided, out of bounds indices should cause an error.I don't understand why not. I think that it would be quite convenient and I do not see the potential pitfall lurking.
Additional methods
I would find very useful to run on lists of strings the following methods.
to_number
: convert numerical values and return arrays of reals and/or (when applicable) integers and (optionally) their location in the list. I understand it might have a complicated API but it would be a killer feature for me.unique
: return a list of unique strings (with optional case insensitivity, I would vote for the default to be case sensitive).upper
,lower
,capitalise
.camelcase
, ... (see StringiFor for more): those can be implemented in the first instance to individual strings and then applied to the entire list. Hopefully their names are self-explanatory.Additional operators
//
would concatenate lists of strings==
Input/output
I admit that I have no exact idea here, but I would like to make the access to the lists as simple as possible for input and output. So I'll throw in the topic for potential discussion.
I hope that I am not repeating something that has been already proposed or discussed during the monthly call (apologies in that case): unfortunately, I didn't have still the chance to catch-up with the recording.
Am I putting too much stuff in the plate? 😄
arjenmarkus commentedon Dec 18, 2020
What I meant by "not storing data with the string" is: the module I propose only stores strings, it does not function as a dictionary to associate arbitrary data with a string. Such functionality would be welcome too, but it would complicate matters a bit. I first want this one sorted out.
I will have a look at your suggestions. A first comment: I do not know if Fortran allows overloading of // - I have never seen it myself. The specification as I posted it is far from complete, I definitely acknowledge that :).
I did sit down yesterday to get started with a proof-of-concept implementation - currently two methods: insert and get. They work nicely.
ivan-pi commentedon Dec 18, 2020
It does in fact! You can run this small example to check it:
I can't find it right now, but there is a Fortran library for dictionaries somewhere on GitHub, that uses overloading of
//
for buildingkey//value
for pairs. In the issue #69 (string handling routines), I believe there were even some suggestions to use this to directly concatenate strings and numerical values (e.g.'2 + 3 ='//5)
.In general I like the proposed the API, the idea of an infinitely long lists also sounds like it might fit to Fortran well. One question concerning this is would the list have a
size
method?I also see some overlap with the
split
discussion in #241. Particularly the comment from @esterjo:Romendakil commentedon Dec 18, 2020
Yes, also the iso_varying_string module by Rich Townsend uses overloading of the (//) operator:
for the three different cases of varying string with itself, and attaching a character to varying string from left and right.
arjenmarkus commentedon Dec 18, 2020
epagone commentedon Dec 18, 2020
It can be overloaded like any other operator (e.g.
==
).Brilliant! Why don't you share this with a (draft?) pull request so it can be discussed and reviewed by our knowledgeable community?
arjenmarkus commentedon Dec 18, 2020
arjenmarkus commentedon Dec 18, 2020
esterjo commentedon Dec 18, 2020
Referring to my comment that @ivan-pi mentioned from #241
I should have also added that sorting may mean sorting the index array, and setting a "sorted" flag to true. No need to reallocate and reorder the data elements. However, it's probably best to just actually reallocate and sort the data, so that scanning through it in sorted order will not cause you do jump around memory randomly.
But yes @arjenmarkus, insertion and modification of elements would be inefficient and would require reallocating at least part of the data for each insertion. Anecdotally, in my day to day programming I typically use arrays of strings in R (usually for data science and statistical modeling). I rarely modify the elements of a character vector or insert elements after I have generated the array. Instead, querying, appending, sorting, getting unique values, and splitting are by far the most common operations. Given that, I will say that I'm a big fan of having two or more implementations optimized for different uses.
esterjo commentedon Dec 18, 2020
So there may be a "sort_view" method subroutine to only sort the index array, and a "sort_inplace" subroutine to actually sort the character data
3 remaining items
jvdp1 commentedon Dec 31, 2020
Thank you @arjenmarkus for the implementation. I will try to have a look at it next week.
@epagone I think that sorting algoritghms are definetely in the scope of
stdlib
, and are needed at several places (e.g., for sparse matrices, computation of median,...). Therefore sorting procedures should have their own module. See some discussions around sorting algorithms and implementations in #98.ivan-pi commentedon Jan 8, 2021
As @jvdp1 has mentioned already, I think we all agree sorting routines are in the scope of stdlib. But I also think we should not shy away from providing specific sort methods which are private to the module in development (in this case list of strings). This could be either due to using a specialized algorithm more suitable than a generic one, or simply as a temporary building block.
epagone commentedon Jan 8, 2021
@ivan-pi I understand and agree with the rationale and the general principle. However, is this the case? I am far from an expert of sorting algorithms, but to my eyes the one in the relevant PR does not seem to be tailored to list of strings (except for the data type).
ivan-pi commentedon Jan 8, 2021
I am not an expert for sorting algorithms either. I've tried to compile a list of available routines in #98. I've had a look at the implementation from @arjen. If we were to generalize the sorting routine and move it out of the string list module, we would need to:
string_type
as the suggested high-level (OO or functional) string type (similar tovarying_string
or the many string types defined by others; see String handling routines #69)argsort
routine which takes an array oftype(string_type)
and returns the indices that would sort an array.epagone commentedon Jan 8, 2021
I see this as an advantage to keep the ball rolling also on the front of
stdlib
string types. We can always roll back at a later time if, for some reason, this implementation is unsuitable: we are still in "experimental mode".I don't know if this would be a lot of work but to me it does not seem super-complicated.
arjenmarkus commentedon Jan 25, 2021
I am making progress with the implementation of a better way to specify the indices and en passant with the various versions to be supported. This made me think of some design questions for which I would like some advice:
call list%insert( 1, "A")
call list%insert( -10, 'B' )
write(*,*) list%get(-10)
produces "B".
Of course, negative indices may be not so useful in the first place and I should not put too much effort in that feature. If so, the alternative of flagging an error is probably best.
What do you think?
certik commentedon Jan 25, 2021
This is the behavior on Python:
It looks like it prepends it at the beginning? Notice that a negative index wraparounds from the right when you do
a[-1]
.arjenmarkus commentedon Jan 27, 2021
aman-godara commentedon Jul 26, 2021
This comment is aimed to discuss the design around which PR #470 is built upon:
Explanation of forward and backward indexes:
forward index:
fidx
backward index:
bidx
After inserting an element at
fidx( x )
, the added element will appear at indexfidx( x )
i.e. atx
th position when counted from left to right.After inserting an element at
bidx( y )
, the added element will appear at indexbidx( y )
i.e.y
th position when counted from right to left.inserting at
fidx( 1 )
is equivalent to inserting atbidx( len + 1 )
inserting at
bidx( 1 )
is equivalent to inserting atfidx( len + 1 )
Inserting an element at HEAD is like inserting at
fidx( 1 )
and inserting an element at TAIL is like inserting atbidx( 1 )
backward index is like the forward index of the reversed form of the list.
I am looking at insert API with a different perspective i.e. instead of talking about whether it will insert an element before i-th index or after i-th index I rather say that after insertion is completed the
newly_inserted_element
will be found at i-th index. What are your opinions on this way of looking atinsert( i )
?If you want to insert an element
new_element
before integer indexq
use the forward index of integer indexq
. If you want to insert an elementnew_element
after integer indexq
use the backward index of integer indexq
.Currently there is NO function in the PR which converts integer index to forward index or backward index [not to be confused with fidx and bidx of PR]