Commit b289eea

Add documentation to README.md
1 parent 5ca9617 commit b289eea

1 file changed

+144
-3
lines changed

Diff for: README.md

@@ -1,9 +1,150 @@
11
# StringEncodings
22

33
[![Travis CI Build Status](https://travis-ci.org/nalimilan/StringEncodings.jl.svg?branch=master)](https://travis-ci.org/nalimilan/StringEncodings.jl)
4-
54
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/3gslhfg91isldnvq?svg=true)](https://ci.appveyor.com/project/nalimilan/stringencodings-jl)
6-
75
[![Coveralls Coverage Status](https://coveralls.io/repos/nalimilan/StringEncodings.jl/badge.svg?branch=master&service=github)](https://coveralls.io/github/nalimilan/StringEncodings.jl?branch=master)
8-
96
[![Codecov Coverage Status](http://codecov.io/github/nalimilan/StringEncodings.jl/coverage.svg?branch=master)](http://codecov.io/github/nalimilan/StringEncodings.jl?branch=master)
7+
8+
This Julia package provides support for decoding and encoding texts between multiple character encodings. It is currently based on the iconv interface and supports all major platforms (on Windows, it uses the native OS API via [win_iconv](https://github.com/win-iconv/win-iconv/)). In the future, native Julia support for major encodings will be added.
9+
10+
## Encoding and Decoding Strings
11+
*Encoding* refers to the process of converting a string (of any `AbstractString` type) to a sequence of bytes represented as a `Vector{UInt8}`. *Decoding* refers to the inverse process.
12+
13+
```julia
14+
julia> using StringEncodings
15+
16+
julia> encode("café", "UTF-16")
17+
10-element Array{UInt8,1}:
18+
0xff
19+
0xfe
20+
0x63
21+
0x00
22+
0x61
23+
0x00
24+
0x66
25+
0x00
26+
0xe9
27+
0x00
28+
29+
julia> decode(ans, "UTF-16")
30+
"café"
31+
```
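
The byte sequence above can be reproduced by hand, which makes the layout of the output easier to read. The following is a minimal sketch using only Julia Base and assuming a recent Julia version (1.x); the helper name `utf16le_with_bom` is hypothetical and not part of StringEncodings. It assumes the little-endian layout with a leading byte order mark seen in the output above, which may differ on other platforms or iconv implementations.

```julia
# Hypothetical helper, not part of StringEncodings: build UTF-16 little-endian
# bytes preceded by a byte order mark (BOM), using only Base functions.
function utf16le_with_bom(s::AbstractString)
    units = transcode(UInt16, String(s))   # UTF-16 code units
    bytes = UInt8[0xff, 0xfe]              # BOM U+FEFF in little-endian byte order
    for u in units
        push!(bytes, u % UInt8)            # low byte first
        push!(bytes, (u >> 8) % UInt8)     # then high byte
    end
    return bytes
end

bytes = utf16le_with_bom("café")
```

Each character of "café" fits in a single 16-bit code unit, so the result is two bytes for the BOM plus two bytes per character, ten in total.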
32+
33+
Use the `encodings` function to get the list of all supported encodings on the current platform:
34+
```julia
35+
julia> encodings()
36+
1241-element Array{ASCIIString,1}:
37+
"1026"
38+
"1046"
39+
"1047"
40+
"10646-1:1993"
41+
"10646-1:1993/UCS4"
42+
"437"
43+
"500"
44+
"500V1"
45+
"850"
46+
"851"
47+
48+
"windows-1258"
49+
"WINDOWS-1258"
50+
"WINDOWS-31J"
51+
"windows-874"
52+
"WINDOWS-874"
53+
"WINDOWS-936"
54+
"WINSAMI2"
55+
"WS2"
56+
"YU"
57+
```
58+
59+
(Note that many of these are aliases for standard names.)
60+
61+
## The `Encoding` type
62+
In the examples above, the encoding was specified as a standard string. However, to avoid ambiguities in multiple dispatch and to benefit from type specialization, the package offers a special parametric type, `Encoding`. Each parameterization of this type represents a character encoding. The non-standard string literal `enc` can be used to create an instance of this type, like so: `enc"UTF-16"`.
63+
64+
Since there is no ambiguity, the `encode` and `decode` functions accept either a string or an `Encoding` object. On the other hand, other functions presented below only support the latter to avoid creating conflicts with other packages extending Julia Base methods.
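
As a small illustration of the two calling styles, the sketch below passes the same encoding once as a plain string and once as an `Encoding` object. It assumes the StringEncodings package is installed and a recent Julia version; the variable names are only illustrative.

```julia
using StringEncodings

utf16 = enc"UTF-16"    # an Encoding object built with the string literal

bytes_from_string = encode("café", "UTF-16")   # encoding given as a plain string
bytes_from_object = encode("café", utf16)      # encoding given as an Encoding object

# Either form can be used when decoding as well.
roundtrip = decode(bytes_from_object, utf16)
```

Both calls are expected to produce the same bytes, since the string form and the `Encoding` form name the same encoding.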
65+
66+
In future versions, the `Encoding` type will allow getting information about character encodings, and will be used to improve the performance of conversions.
67+
68+
## Reading from and Writing to Encoded Text Files
69+
The package also provides several simple methods to deal with files containing encoded text. They extend the equivalent functions from Julia Base, which only support text stored in the UTF-8 encoding.
70+
71+
A method for `open` is provided to write a string to a file in encoded form:
72+
```julia
73+
julia> path = tempname();
74+
75+
julia> f = open(path, enc"UTF-16", "w");
76+
77+
julia> write(f, "café\nnoël")
78+
79+
julia> close(f); # Essential to complete encoding
80+
```
81+
82+
The contents of the file can then be read back using `readstring` (or `readall` under Julia 0.4):
83+
```julia
84+
julia> readstring(path) # Standard function expects UTF-8
85+
"\U3d83f7c0f\0\0n\0o\0\0"
86+
87+
julia> readstring(path, enc"UTF-16") # Works when passing the correct encoding
88+
"café\nnoël"
89+
```
90+
91+
Other variants of standard convenience functions are provided:
92+
```julia
93+
julia> readline(path, enc"UTF-16")
94+
"café\n"
95+
96+
julia> readlines(path, enc"UTF-16")
97+
2-element Array{ByteString,1}:
98+
"café\n"
99+
"noël"
100+
101+
julia> for l in eachline(path, enc"UTF-16")
102+
print(l)
103+
end
104+
café
105+
noël
106+
107+
julia> readuntil(path, enc"UTF-16", "o")
108+
"café\nno"
109+
```
110+
111+
When performing more complex operations on an encoded text file, it will often be easier to specify the encoding only once when opening it. The resulting I/O stream can then be passed to functions that are unaware of encodings (i.e. that assume UTF-8 text):
112+
```julia
115+
julia> io = open(path, enc"UTF-16");
116+
117+
julia> readstring(io)
118+
"café\nnoël"
119+
```
120+
121+
## Advanced Usage: `StringEncoder` and `StringDecoder`
122+
The convenience functions presented above are based on the `StringEncoder` and `StringDecoder` types, which wrap I/O streams and offer on-the-fly character encoding conversion. They can be used directly if you need to work with encoded text on an existing I/O stream. This can be illustrated using an `IOBuffer`:
123+
```julia
124+
julia> b = IOBuffer();
125+
126+
julia> s = StringEncoder(b, "UTF-16");
127+
128+
julia> write(s, "café");
129+
130+
julia> close(s); # Essential to complete encoding
131+
132+
julia> decode(takebuf_array(b), enc"UTF-16")
133+
"café"
134+
```
135+
136+
And the reverse operation:
137+
```julia
138+
julia> b = IOBuffer();
139+
140+
julia> s = StringDecoder(b, "UTF-16");
141+
142+
julia> write(b, encode("café", enc"UTF-16"));
143+
144+
julia> seekstart(b); readstring(s) # Rewind the buffer, then read the decoded text through s
145+
"café"
146+
```
147+
148+
Do not forget to call `close` on `StringEncoder` and `StringDecoder` objects to release iconv resources. In the case of `StringEncoder`, this function will also call `flush`, which will write any characters still in the buffer, and possibly some control sequences (for stateful encodings).
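
One way to make sure `close` is always called is to wrap the writes in `try`/`finally`. The sketch below applies this to the encoded stream returned by `open(path, enc"UTF-16", "w")`, as in the file example earlier. It assumes the StringEncodings package is installed and a recent Julia version, where `read(path)` from Base returns the raw bytes of a file.

```julia
using StringEncodings

path = tempname()
f = open(path, enc"UTF-16", "w")
try
    write(f, "café\nnoël")
finally
    close(f)   # runs even if write throws; completes the encoding and releases resources
end

# Read the raw bytes back with Base and decode them explicitly.
text = decode(read(path), enc"UTF-16")
```

The same pattern applies to a `StringEncoder` or `StringDecoder` created directly on an existing stream.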
149+
150+
Conversion currently raises an error if an invalid byte sequence is encountered in the input, or if some characters cannot be represented in the target encoding. It is not yet possible to ignore such characters or to replace them with a placeholder.
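
Until such options exist, callers that want to continue after a failed conversion have to catch the error themselves. The following is a minimal sketch, assuming the StringEncodings package is installed and a recent Julia version. The helper `try_encode` is hypothetical, and the concrete exception type raised by the package is deliberately not assumed.

```julia
using StringEncodings

# Hypothetical helper: return the encoded bytes, or `nothing` if conversion fails.
function try_encode(s::AbstractString, encoding::AbstractString)
    try
        return encode(s, encoding)
    catch err
        # The concrete exception type is not assumed here, so catch anything.
        @warn "Conversion failed" encoding = encoding exception = err
        return nothing
    end
end

ok = try_encode("cafe", "ASCII")      # every character exists in ASCII
maybe = try_encode("café", "ASCII")   # "é" has no ASCII representation
```

With the behavior described above, the second call is expected to hit the error path and return `nothing`, although the exact outcome may depend on the underlying iconv implementation.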
