Commit db50d51

fixed get_index
1 parent 01ecdd8 commit db50d51

8 files changed: +260 -102 lines changed

README.jmd

+94
@@ -0,0 +1,94 @@
+## ShortStrings
+This is an efficient string format for storing strings using integer types. For example, `UInt32` can hold 3 bytes of string with 1 byte to record the size of the string and a `UInt128` can hold a byte string with 1 byte to record the size of the string.
+
+Using BitIntegers.jl, integer of larger size than `UInt128` can be defined. This package support string with up to 126 bytes in size.
+
+## Quick Start
+```julia
+using ShortStrings
+
+using SortingAlgorithms
+using Random: randstring
+
+N = Int(1e6)
+svec = [randstring(rand(1:15)) for i=1:N]
+# convert to ShortString
+ssvec = ShortString15.(svec)
+
+# sort short vectors
+@time sort(svec);
+@time sort(ssvec, by = x->x.size_content, alg=RadixSort);
+
+# conversion to shorter strings is also possible with
+ShortString7(randstring(7))
+ShortString3(randstring(3))
+
+# convenience macros are provided for writing actual strings (e.g., for comparison)
+s15 = ss15"A short string" # ShortString15 === ShortString{Int128}
+s7 = ss7"shorter" # ShortString7 === ShortString{Int64}
+s3 = ss3"srt" # ShortString3 === ShortString{Int32}
+```
+
+## Benchmarks
+
+```julia
+using SortingLab, ShortStrings, SortingAlgorithms, BenchmarkTools;
+N = Int(1e6);
+svec = [randstring(rand(1:15)) for i=1:N];
+# convert to ShortString
+ssvec = ShortString15.(svec);
+basesort = @benchmark sort($svec)
+radixsort_timings = @benchmark SortingLab.radixsort($svec)
+short_radixsort = @benchmark ShortStrings.fsort($ssvec)
+# another way to do sorting
+sort(ssvec, by = x->x.size_content, alg=RadixSort)
+
+using RCall
+@rput svec;
+r_timings = R"""
+memory.limit(2^31-1)
+replicate($(length(short_radixsort.times)), system.time(sort(svec, method="radix"))[3])
+""";
+
+using Plots
+bar(["Base.sort","SortingLab.radixsort","ShortStrings radix sort", "R radix sort"],
+mean.([basesort.times./1e9, radixsort_timings.times./1e9, short_radixsort.times./1e9, r_timings]),
+title="String sort performance - len: 1m, variable size 15",
+label = "seconds")
+```
+
+
+```julia
+using SortingLab, ShortStrings, SortingAlgorithms, BenchmarkTools;
+N = Int(1e6);
+svec = rand([randstring(rand(1:15)) for i=1:N÷100],N)
+# convert to ShortString
+ssvec = ShortString15.(svec);
+basesort = @benchmark sort($svec) samples = 5 seconds = 120
+radixsort_timings = @benchmark SortingLab.radixsort($svec) samples = 5 seconds = 120
+short_radixsort = @benchmark ShortStrings.fsort($ssvec) samples = 5 seconds = 120
+
+using RCall
+
+@rput svec;
+r_timings = R"""
+replicate(max(5, $(length(short_radixsort.times))), system.time(sort(svec, method="radix"))[3])
+""";
+
+using Plots
+bar(["Base.sort","SortingLab.radixsort","ShortStrings radix sort", "R radix sort"],
+mean.([basesort.times./1e9, radixsort_timings.times./1e9, short_radixsort.times./1e9, r_timings]),
+title="String sort performance - len: $(N÷1_000_000)m, fixed size: 15",
+label = "seconds")
+```
+
+## Notes
+This is based on the discussion [here](https://discourse.julialang.org/t/progress-towards-faster-sortperm-for-strings/8505/4?u=xiaodai). If Julia.Base adopts the hybrid representation of strings then it makes this package redundant.
+
+# Build Status
+
+[![Build Status](https://travis-ci.org/xiaodaigh/ShortStrings.jl.svg?branch=master)](https://travis-ci.org/xiaodaigh/ShortStrings.jl)
+
+[![Coverage Status](https://coveralls.io/repos/xiaodaigh/ShortStrings.jl/badge.svg?branch=master&service=github)](https://coveralls.io/github/xiaodaigh/ShortStrings.jl?branch=master)
+
+[![codecov.io](http://codecov.io/github/xiaodaigh/ShortStrings.jl/coverage.svg?branch=master)](http://codecov.io/github/xiaodaigh/ShortStrings.jl?branch=master)
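
The README describes the storage idea only in words: the string bytes and a size byte share one unsigned integer. The following is a minimal sketch of that layout, not the package's implementation. `pack` and `unpack` are hypothetical helper names, and the sketch assumes the size lives in the lowest byte, as the README text suggests; the actual source uses `unsafe_load` and `ntoh` and may size the length field differently.

```julia
# Hypothetical helpers illustrating the layout described in the README.
# Not part of ShortStrings.jl.

# Store the bytes of `s` in the high-order bytes of a T and the byte count
# in the lowest byte.
function pack(::Type{T}, s::String) where {T<:Unsigned}
    sz = sizeof(s)
    max_len = sizeof(T) - 1          # one byte is reserved for the size
    sz <= max_len || throw(ArgumentError("string needs $sz bytes; $T holds at most $max_len"))
    content = zero(T)
    for (i, b) in enumerate(codeunits(s))
        # the first byte of the string goes into the most significant byte
        content |= T(b) << (8 * (sizeof(T) - i))
    end
    return content | T(sz)
end

# Recover the string: read the size from the lowest byte, then the bytes
# from the top of the integer downwards.
function unpack(x::T) where {T<:Unsigned}
    sz = Int(x & 0xff)
    bytes = [UInt8((x >> (8 * (sizeof(T) - i))) & 0xff) for i in 1:sz]
    return String(bytes)
end

packed = pack(UInt32, "abc")
println(string(packed, base = 16))   # 61626303
println(unpack(packed))              # abc
```

Because the first character sits in the most significant byte, comparing the packed integers orders the strings byte by byte, which is presumably why the README can sort on the raw integer field with `RadixSort`.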

README.md

+124-94
@@ -1,94 +1,124 @@
-# ShortStrings
-This is an efficient string format for storing strings using integer types. For example, `UInt32` can hold 3 bytes of string with 1 byte to record the size of the string and a `UInt128` can hold a byte string with 1 byte to record the size of the string.
-
-Using BitIntegers.jl, integer of larger size than `UInt128` can be defined. This package support string with up to 126 bytes in size.
-
-# Quick Start
-```julia
-using ShortStrings, SortingAlgorithms
-N = Int(1e6)
-svec = [randstring(rand(1:15)) for i=1:N]
-# convert to ShortString
-ssvec = ShortString15.(svec)
-@time sort(svec);
-@time sort(ssvec, by = x->x.size_content, alg=RadixSort);
-
-# conversion to shorter strings is also possible with
-ShortString7(randstring(7))
-ShortString3(randstring(3))
-
-# convenience macros are provided for writing actual strings (e.g., for comparison)
-s15 = ss15"A short string" # ShortString15 === ShortString{Int128}
-s7 = ss7"shorter" # ShortString7 === ShortString{Int64}
-s3 = ss3"srt" # ShortString3 === ShortString{Int32}
-```
-
-# Benchmark
-![String sorting performance](readme_string_sort.png)
-
-
-## Benchmarking code
-```julia
-using SortingLab, ShortStrings, SortingAlgorithms, BenchmarkTools;
-N = Int(1e6);
-svec = [randstring(rand(1:15)) for i=1:N];
-# convert to ShortString
-ssvec = ShortString15.(svec);
-basesort = @benchmark sort($svec)
-radixsort_timings = @benchmark SortingLab.radixsort($svec)
-short_radixsort = @benchmark ShortStrings.fsort($ssvec)
-# another way to do sorting
-sort(ssvec, by = x->x.size_content, alg=RadixSort)
-
-using RCall
-R"""
-memory.limit(2^31-1)
-"""
-@rput svec;
-r_timings = R"""
-memory.limit(2^31-1)
-replicate($(length(short_radixsort.times)), system.time(sort(svec, method="radix"))[3])
-""";
-
-using Plots
-bar(["Base.sort","SortingLab.radixsort","ShortStrings radix sort", "R radix sort"],
-mean.([basesort.times./1e9, radixsort_timings.times./1e9, short_radixsort.times./1e9, r_timings]),
-title="String sort performance - len: 1m, variable size 15",
-label = "seconds")
-savefig("readme_string_sort.png")
-
-
-using SortingLab, ShortStrings, SortingAlgorithms, BenchmarkTools;
-N = Int(2e7);
-svec = rand([randstring(rand(1:15)) for i=1:N÷100],N)
-# convert to ShortString
-ssvec = ShortString15.(svec);
-basesort = @benchmark sort($svec) samples = 5 seconds = 120
-radixsort_timings = @benchmark SortingLab.radixsort($svec) samples = 5 seconds = 120
-short_radixsort = @benchmark ShortStrings.fsort($ssvec) samples = 5 seconds = 120
-
-using RCall
-
-@rput svec;
-r_timings = R"""
-replicate(max(5, $(length(short_radixsort.times))), system.time(sort(svec, method="radix"))[3])
-""";
-
-using Plots
-bar(["Base.sort","SortingLab.radixsort","ShortStrings radix sort", "R radix sort"],
-mean.([basesort.times./1e9, radixsort_timings.times./1e9, short_radixsort.times./1e9, r_timings]),
-title="String sort performance - len: $(N÷1_000_000)m, fixed size: 15",
-label = "seconds")
-savefig("readme_string_sort_fixed_len.png")
-```
-
-# Notes
-This is based on the discussion [here](https://discourse.julialang.org/t/progress-towards-faster-sortperm-for-strings/8505/4?u=xiaodai). If Julia.Base adopts the hybrid representation of strings then it makes this package redundant.
-
-# Build Status
-
-[![Build Status](https://travis-ci.org/xiaodaigh/ShortStrings.jl.svg?branch=master)](https://travis-ci.org/xiaodaigh/ShortStrings.jl)
-
-[![Coverage Status](https://coveralls.io/repos/xiaodaigh/ShortStrings.jl/badge.svg?branch=master&service=github)](https://coveralls.io/github/xiaodaigh/ShortStrings.jl?branch=master)
-
-[![codecov.io](http://codecov.io/github/xiaodaigh/ShortStrings.jl/coverage.svg?branch=master)](http://codecov.io/github/xiaodaigh/ShortStrings.jl?branch=master)
+## ShortStrings
+This is an efficient string format for storing strings using integer types. For example, `UInt32` can hold 3 bytes of string with 1 byte to record the size of the string and a `UInt128` can hold a byte string with 1 byte to record the size of the string.
+
+Using BitIntegers.jl, integer of larger size than `UInt128` can be defined. This package support string with up to 126 bytes in size.
+
+## Quick Start
+````julia
+
+using ShortStrings
+
+using SortingAlgorithms
+using Random: randstring
+
+N = Int(1e6)
+svec = [randstring(rand(1:15)) for i=1:N]
+# convert to ShortString
+ssvec = ShortString15.(svec)
+@time sort(svec);
+@time sort(ssvec, by = x->x.size_content, alg=RadixSort);
+
+# conversion to shorter strings is also possible with
+ShortString7(randstring(7))
+ShortString3(randstring(3))
+
+# convenience macros are provided for writing actual strings (e.g., for comparison)
+s15 = ss15"A short string" # ShortString15 === ShortString{Int128}
+s7 = ss7"shorter" # ShortString7 === ShortString{Int64}
+s3 = ss3"srt" # ShortString3 === ShortString{Int32}
+````
+
+
+````
+0.339147 seconds (131 allocations: 11.451 MiB)
+0.359553 seconds (784.87 k allocations: 71.526 MiB, 13.41% gc time)
+"srt"
+````
+
+
+
+
+
+## Benchmarks
+
+<details>
+<summary><b>Code</b></summary>
+````julia
+
+using SortingLab, ShortStrings, SortingAlgorithms, BenchmarkTools;
+N = Int(1e6);
+svec = [randstring(rand(1:15)) for i=1:N];
+# convert to ShortString
+ssvec = ShortString15.(svec);
+basesort = @benchmark sort($svec)
+radixsort_timings = @benchmark SortingLab.radixsort($svec)
+short_radixsort = @benchmark ShortStrings.fsort($ssvec)
+# another way to do sorting
+sort(ssvec, by = x->x.size_content, alg=RadixSort)
+
+using RCall
+R"""
+memory.limit(2^31-1)
+"""
+@rput svec;
+r_timings = R"""
+memory.limit(2^31-1)
+replicate($(length(short_radixsort.times)), system.time(sort(svec, method="radix"))[3])
+""";
+
+using Plots
+bar(["Base.sort","SortingLab.radixsort","ShortStrings radix sort", "R radix sort"],
+mean.([basesort.times./1e9, radixsort_timings.times./1e9, short_radixsort.times./1e9, r_timings]),
+title="String sort performance - len: 1m, variable size 15",
+label = "seconds")
+````
+
+
+![](figures/README_2_1.png)
+
+
+</details>
+
+<details>
+<summary><b>Code</b></summary>
+````julia
+
+using SortingLab, ShortStrings, SortingAlgorithms, BenchmarkTools;
+N = Int(2e7);
+svec = rand([randstring(rand(1:15)) for i=1:N÷100],N)
+# convert to ShortString
+ssvec = ShortString15.(svec);
+basesort = @benchmark sort($svec) samples = 5 seconds = 120
+radixsort_timings = @benchmark SortingLab.radixsort($svec) samples = 5 seconds = 120
+short_radixsort = @benchmark ShortStrings.fsort($ssvec) samples = 5 seconds = 120
+
+using RCall
+
+@rput svec;
+r_timings = R"""
+replicate(max(5, $(length(short_radixsort.times))), system.time(sort(svec, method="radix"))[3])
+""";
+
+using Plots
+bar(["Base.sort","SortingLab.radixsort","ShortStrings radix sort", "R radix sort"],
+mean.([basesort.times./1e9, radixsort_timings.times./1e9, short_radixsort.times./1e9, r_timings]),
+title="String sort performance - len: $(N÷1_000_000)m, fixed size: 15",
+label = "seconds")
+````
+
+
+![](figures/README_3_1.png)
+
+
+</details>
+
+## Notes
+This is based on the discussion [here](https://discourse.julialang.org/t/progress-towards-faster-sortperm-for-strings/8505/4?u=xiaodai). If Julia.Base adopts the hybrid representation of strings then it makes this package redundant.
+
+# Build Status
+
+[![Build Status](https://travis-ci.org/xiaodaigh/ShortStrings.jl.svg?branch=master)](https://travis-ci.org/xiaodaigh/ShortStrings.jl)
+
+[![Coverage Status](https://coveralls.io/repos/xiaodaigh/ShortStrings.jl/badge.svg?branch=master&service=github)](https://coveralls.io/github/xiaodaigh/ShortStrings.jl?branch=master)
+
+[![codecov.io](http://codecov.io/github/xiaodaigh/ShortStrings.jl/coverage.svg?branch=master)](http://codecov.io/github/xiaodaigh/ShortStrings.jl?branch=master)

build-readme.jl

+12
@@ -0,0 +1,12 @@
+# Weave readme
+using Pkg
+cd("c:/git/ShortStrings/")
+Pkg.activate("c:/git/ShortStrings/readme-env")
+
+using Weave
+
+weave("README.jmd", out_path = :pwd, doctype = "github")
+
+if false
+    tangle("README.jmd")
+end

readme-env/Project.toml

+3
@@ -0,0 +1,3 @@
+[deps]
+ShortStrings = "63221d1c-8677-4ff0-9126-0ff0817b4975"
+SortingAlgorithms = "a2af1166-a08f-5f64-846c-94a0d3cef48c"

src/base.jl

+15-7
@@ -14,6 +14,9 @@ function ShortString{T}(s::Union{String, SubString{String}}) where T
         throw(ErrorException("sizeof(::ShortString) must be shorter than or equal to $(max_len) in length; you have supplied a string of size $sz"))
     end
     bits_to_wipe = 8(sizeof(T) - sz)
+    # TODO some times this can throw errors for longish strings
+    # Exception: EXCEPTION_ACCESS_VIOLATION at 0x1e0b7afd -- bswap at C:\Users\RTX2080\.julia\packages\BitIntegers\xU40U\src\BitIntegers.jl:332 [inlined]
+    # ntoh at .\io.jl:541 [inlined]
     content = (T(s |> pointer |> Ptr{T} |> Base.unsafe_load |> ntoh) >> bits_to_wipe) << bits_to_wipe
     ShortString{T}(content | T(sz))
 end
@@ -25,7 +28,7 @@ Base.codeunit(s::ShortString, i) = codeunits(String(s), i)
 Base.codeunit(s::ShortString, i::Integer) = codeunit(String(s), i)
 Base.codeunits(s::ShortString) = codeunits(String(s))
 Base.convert(::ShortString{T}, s::String) where T = ShortString{T}(s)
-Base.convert(::String, ss::ShortString) = String(a) #reduce(*, ss)
+Base.convert(::String, ss::ShortString) = String(ss)
 Base.display(s::ShortString) = display(String(s))
 Base.firstindex(::ShortString) = 1
 Base.isvalid(s::ShortString, i::Integer) = isvalid(String(s), i)
@@ -46,13 +49,18 @@ size_nibbles(::Type{T}) where T = ceil(log2(sizeof(T))/4)
 size_mask(T) = UInt(exp2(4*size_nibbles(T)) - 1)
 
 
-Base.getindex(s::ShortString{T}, i::Integer) where T = begin
-    Char((s.size_content << 8(i-1)) >> 8(sizeof(T)-1))
-end
-Base.collect(s::ShortString) = getindex.(s, 1:lastindex(s))
+# function Base.getindex(s::ShortString, i::Integer)
+# getindex(String(s), i)
+# end
+
+# function Base.getindex(s::ShortString, args...; kwargs...)
+# getindex(String(s), args...; kwargs...)
+# end
+
+Base.collect(s::ShortString) = collect(String(s))
 
-==(s::ShortString, b::String) = begin
-    String(s) == b
+==(s::ShortString, b::AbstractString) = begin
+    String(s) == b
 end
 
 promote_rule(::Type{String}, ::Type{ShortString{S}}) where S = String
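
One thing the `convert` lines in this hunk bring to mind: in Julia, `convert(String, x)` dispatches on the type `String`, so a method needs `::Type{String}` as its first argument to be reached by that call. The sketch below shows that form on a toy type. `Wrapped` is a hypothetical stand-in, not a ShortStrings.jl type, and whether the package should adopt this signature is left open here.

```julia
# Hypothetical wrapper type used only for illustration.
struct Wrapped
    s::String
end

# Dispatch on the type String itself; this is the method that
# `convert(String, w)` looks up.
Base.convert(::Type{String}, w::Wrapped) = w.s

w = Wrapped("abc")
println(convert(String, w))   # abc

# Typed local declarations go through convert as well.
function as_string(w::Wrapped)
    out::String = w
    return out
end
println(as_string(w))         # abc
```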

test/getindex.jl

+8
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
using Test
2+
using ShortStrings
3+
4+
s = "∫x ∂x"
5+
6+
ss = ShortString15(s)
7+
8+
@test s[1] == ss[1]
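
The test string is non-ASCII, which is likely the point: a `getindex` that takes the i-th byte and calls `Char` on it, as the removed method in `src/base.jl` did, cannot return a multi-byte character. The sketch below uses only Base `String` functions to show the difference. `byte_char` is a hypothetical helper that mimics the byte-wise approach and is not part of the package.

```julia
s = "∫x ∂x"

# Hypothetical helper: treat the i-th UTF-8 code unit as a character,
# similar in spirit to the removed byte-shifting getindex.
byte_char(str::AbstractString, i::Integer) = Char(codeunit(str, i))

# String indexing is character-aware: s[1] decodes the whole
# three-byte sequence for '∫'.
println(s[1])                      # ∫
println(byte_char(s, 1) == s[1])   # false: the first byte alone is 0xe2

# Code units and characters differ in count for this string.
println(ncodeunits(s))             # 9
println(length(s))                 # 5
```

Delegating to `String(s)`, as the new `collect` method does, inherits the character-aware behaviour.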

test/hash.jl

+1-1
@@ -2,4 +2,4 @@ using ShortStrings: ShortString, hash
 using Test
 
 
-@test ShortString(10) == hash(UInt(10))
+@test hash(ShortString(10)) == hash(UInt(10))
