#7.3 Regexp

Regexp is a complicated but powerful tool for pattern matching and text manipulation. Although it is slower than plain text matching, it is far more flexible. Based on its syntax, you can filter almost any kind of text out of your source content. If you need to collect data in web development, Regexp makes it easy to extract the meaningful parts.

Go has the package `regexp` as official support for regular expressions. If you've already used regexp in other programming languages, it should feel familiar. Note that Go implements the RE2 standard except for `\C`; for more details, see http://code.google.com/p/re2/wiki/Syntax.

In fact, the package `strings` already handles many jobs such as searching (`Contains`, `Index`), replacing (`Replace`) and parsing (`Split`, `Join`), and it is faster than Regexp, but these are all simple operations. If you want to search a string case-insensitively, for example, Regexp is a good choice. So if the package `strings` can achieve your goal, just use it, since it is easy to use and read; if you need more advanced operations, use Regexp.
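
As a minimal sketch of that trade-off, the example below uses `strings.Contains` for a plain search and a Regexp with the `(?i)` flag for a case-insensitive one. The helper name `containsFold` and the sample text are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// containsFold (hypothetical helper) reports whether s contains word,
// ignoring case. QuoteMeta escapes any Regexp metacharacters in word,
// and the (?i) flag turns on case-insensitive matching.
func containsFold(s, word string) bool {
	re := regexp.MustCompile("(?i)" + regexp.QuoteMeta(word))
	return re.MatchString(s)
}

func main() {
	s := "Learning GO is fun"

	// Plain, case-sensitive search: package strings is enough.
	fmt.Println(strings.Contains(s, "GO")) // true
	fmt.Println(strings.Contains(s, "go")) // false

	// Case-insensitive search with Regexp.
	fmt.Println(containsFold(s, "go")) // true
}
```

Compiling a pattern on every call is wasteful in a hot path; the sketch does so only to keep the helper self-contained.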

If you remember the form verification we talked about before, we already used Regexp there to verify that input was valid. Be aware that all characters are UTF-8. Let's learn more about the Go `regexp` package!

##Match

Package `regexp` has 3 functions for matching:

    func Match(pattern string, b []byte) (matched bool, err error)
    func MatchReader(pattern string, r io.RuneReader) (matched bool, err error)
    func MatchString(pattern string, s string) (matched bool, err error)

All 3 functions check whether `pattern` matches the input source and return true if it does and false otherwise; if your Regexp has a syntax error, they return an error. The 3 input sources of these functions are a slice of `byte`, an `io.RuneReader` and a `string`.
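
The following sketch runs one pattern through all three functions and shows what a syntax error looks like. The constant `digits` and the helper `isDigits` are names chosen here for illustration; they are not part of package `regexp`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// digits is the pattern used throughout this sketch: one or more ASCII digits.
const digits = `^[0-9]+$`

// isDigits (hypothetical helper) reports whether s matches digits and
// passes any pattern error back to the caller.
func isDigits(s string) (bool, error) {
	return regexp.MatchString(digits, s)
}

func main() {
	// The same pattern against the three kinds of input source.
	fromBytes, _ := regexp.Match(digits, []byte("12345"))
	fromReader, _ := regexp.MatchReader(digits, strings.NewReader("678"))
	fromString, _ := isDigits("12a45")
	fmt.Println(fromBytes, fromReader, fromString) // true true false

	// A malformed pattern reports an error, and matched is false.
	matched, err := regexp.MatchString(`[0-9`, "123")
	fmt.Println(matched, err != nil) // false true
}
```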

Here is an example that verifies an IP address:

    func IsIP(ip string) (b bool) {
        // Four groups of 1 to 3 digits, separated by dots.
        if m, _ := regexp.MatchString(`^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$`, ip); !m {
            return false
        }
        return true
    }

As you can see, patterns in package `regexp` look much the same as in other languages. Here is one more example, which verifies whether user input is a number:

    package main

    import (
        "fmt"
        "os"
        "regexp"
    )

    func main() {
        if len(os.Args) == 1 {
            fmt.Println("Usage: regexp [string]")
            os.Exit(1)
        } else if m, _ := regexp.MatchString("^[0-9]+$", os.Args[1]); m {
            fmt.Println("Number")
        } else {
            fmt.Println("Not number")
        }
    }

In the examples above, we use `Match(Reader|String)` to check whether content is valid; all of them are easy to use.
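
One caveat about the `IsIP` example: its pattern checks only the shape of the input, so a string such as `999.999.999.999` passes. A possible refinement, sketched below, is to combine the shape check with `net.ParseIP` from the standard library, which rejects octets greater than 255. The names `ipShape`, `looksLikeIP` and `isIPv4` are assumptions introduced for this sketch.

```go
package main

import (
	"fmt"
	"net"
	"regexp"
)

// ipShape is the same pattern as in IsIP, compiled once.
var ipShape = regexp.MustCompile(`^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$`)

// looksLikeIP checks only the shape of s: four dot-separated digit groups.
func looksLikeIP(s string) bool {
	return ipShape.MatchString(s)
}

// isIPv4 (hypothetical name) also requires that net.ParseIP accepts s,
// which rules out octets greater than 255.
func isIPv4(s string) bool {
	return looksLikeIP(s) && net.ParseIP(s) != nil
}

func main() {
	for _, s := range []string{"192.168.1.1", "999.999.999.999", "1.2.3"} {
		fmt.Printf("%s: shape=%v valid=%v\n", s, looksLikeIP(s), isIPv4(s))
	}
}
```

Here `192.168.1.1` passes both checks, `999.999.999.999` passes only the shape check, and `1.2.3` fails both.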

##Filter

Match mode can verify content, but it cannot cut, filter or collect data from content. To do that, you have to use the complex mode of Regexp.

Sometimes we need to write a crawler. Here is an example that shows how to use Regexp to filter and cut data.

    package main

    import (
        "fmt"
        "io/ioutil"
        "net/http"
        "regexp"
        "strings"
    )

    func main() {
        resp, err := http.Get("http://www.baidu.com")
        if err != nil {
            fmt.Println("http get error.")
            return // resp is nil here, so stop before touching resp.Body.
        }
        defer resp.Body.Close()
        body, err := ioutil.ReadAll(resp.Body)
        if err != nil {
            fmt.Println("http read error")
            return
        }

        src := string(body)

        // Convert HTML tags to lower case.
        re, _ := regexp.Compile(`<[\S\s]+?>`)
        src = re.ReplaceAllStringFunc(src, strings.ToLower)

        // Remove STYLE.
        re, _ = regexp.Compile(`<style[\S\s]+?</style>`)
        src = re.ReplaceAllString(src, "")

        // Remove SCRIPT.
        re, _ = regexp.Compile(`<script[\S\s]+?</script>`)
        src = re.ReplaceAllString(src, "")

        // Remove all HTML code in angle brackets, and replace it with a newline.
        re, _ = regexp.Compile(`<[\S\s]+?>`)
        src = re.ReplaceAllString(src, "\n")

        // Collapse runs of two or more whitespace characters into one newline.
        re, _ = regexp.Compile(`\s{2,}`)
        src = re.ReplaceAllString(src, "\n")

        fmt.Println(strings.TrimSpace(src))
    }

In this example, we use `Compile` as the first step of complex mode. It verifies that your Regexp syntax is correct, then returns a `*Regexp` that the other operations use to parse content.

Here are the functions that parse your Regexp syntax:

    func Compile(expr string) (*Regexp, error)
    func CompilePOSIX(expr string) (*Regexp, error)
    func MustCompile(str string) *Regexp
    func MustCompilePOSIX(str string) *Regexp

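The difference between `CompilePOSIX` and `Compile` is that the former is restricted to POSIX syntax and uses leftmost-longest matching, while the latter uses leftmost-first matching: among the matches that start earliest, it picks the one a backtracking search would find first. For instance, for the Regexp `a|ab` and the content `"ab"`, `CompilePOSIX` returns `ab` but `Compile` returns `a`. For a pattern such as `[a-z]{2,4}` and the content `"aa09aaa88aaaa"`, both return `aa`, because the leftmost match starts at the first character and cannot extend past the digit. The `Must` prefix means the function panics when the Regexp syntax is not correct, whereas the versions without it return an error.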
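
The two matching rules can be compared directly. The sketch below compiles the same expression both ways and prints the first match of each; `firstMatches` is a hypothetical helper written for this comparison.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatches (hypothetical helper) returns the first match of expr in s
// under leftmost-first (Compile) and leftmost-longest (CompilePOSIX) rules.
func firstMatches(expr, s string) (first, longest string) {
	first = regexp.MustCompile(expr).FindString(s)
	longest = regexp.MustCompilePOSIX(expr).FindString(s)
	return first, longest
}

func main() {
	// The alternation order decides the result for Compile only.
	first, longest := firstMatches(`a|ab`, "ab")
	fmt.Printf("%q %q\n", first, longest) // "a" "ab"

	// Here both rules agree: the leftmost match cannot grow past the digit.
	first, longest = firstMatches(`[a-z]{2,4}`, "aa09aaa88aaaa")
	fmt.Printf("%q %q\n", first, longest) // "aa" "aa"

	// Compile reports a syntax error as a value; MustCompile would panic.
	_, err := regexp.Compile(`a(b`)
	fmt.Println(err != nil) // true
}
```

`MustCompile` suits patterns that are fixed in the source code, where a syntax error is a programming mistake; `Compile` is the safer choice when the pattern comes from user input.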

Now that you know how to create a new `Regexp`, let's see which methods this struct provides to help us operate on content:

    func (re *Regexp) Find(b []byte) []byte
    func (re *Regexp) FindAll(b []byte, n int) [][]byte
    func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
    func (re *Regexp) FindAllString(s string, n int) []string
    func (re *Regexp) FindAllStringIndex(s string, n int) [][]int
    func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
    func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int
    func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
    func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
    func (re *Regexp) FindIndex(b []byte) (loc []int)
    func (re *Regexp) FindReaderIndex(r io.RuneReader) (loc []int)
    func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int
    func (re *Regexp) FindString(s string) string
    func (re *Regexp) FindStringIndex(s string) (loc []int)
    func (re *Regexp) FindStringSubmatch(s string) []string
    func (re *Regexp) FindStringSubmatchIndex(s string) []int
    func (re *Regexp) FindSubmatch(b []byte) [][]byte
    func (re *Regexp) FindSubmatchIndex(b []byte) []int

These 18 methods provide the same operations for different input sources (byte slice, string and `io.RuneReader`), so we can simplify the list by ignoring the input source:

    func (re *Regexp) Find(b []byte) []byte
    func (re *Regexp) FindAll(b []byte, n int) [][]byte
    func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
    func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
    func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int
    func (re *Regexp) FindIndex(b []byte) (loc []int)
    func (re *Regexp) FindSubmatch(b []byte) [][]byte
    func (re *Regexp) FindSubmatchIndex(b []byte) []int

Code sample:

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        a := "I am learning Go language"

        re, _ := regexp.Compile("[a-z]{2,4}")

        // Find the first match.
        one := re.Find([]byte(a))
        fmt.Println("Find:", string(one))

        // Find all matches and save them to a slice. n < 0 returns all
        // matches; n >= 0 returns at most n matches.
        all := re.FindAll([]byte(a), -1)
        fmt.Println("FindAll", all)

        // Find the index of the first match: start and end position.
        index := re.FindIndex([]byte(a))
        fmt.Println("FindIndex", index)

        // Find the index of all matches; n does the same job as above.
        allindex := re.FindAllIndex([]byte(a), -1)
        fmt.Println("FindAllIndex", allindex)

        re2, _ := regexp.Compile("am(.*)lang(.*)")

        // Find the first match and its submatches. The first element is the
        // whole match, the second element is the result of the first (),
        // and the third element is the result of the second ().
        // Output:
        // the first element: "am learning Go language"
        // the second element: " learning Go ", notice the spaces are included.
        // the third element: "uage"
        submatch := re2.FindSubmatch([]byte(a))
        fmt.Println("FindSubmatch", submatch)
        for _, v := range submatch {
            fmt.Println(string(v))
        }

        // Like FindIndex, but also returns the positions of the submatches.
        submatchindex := re2.FindSubmatchIndex([]byte(a))
        fmt.Println(submatchindex)

        // FindAllSubmatch finds all matches together with their submatches.
        submatchall := re2.FindAllSubmatch([]byte(a), -1)
        fmt.Println(submatchall)

        // FindAllSubmatchIndex finds the index of all matches and submatches.
        submatchallindex := re2.FindAllSubmatchIndex([]byte(a), -1)
        fmt.Println(submatchallindex)
    }

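
The sample above uses the byte slice versions. As a short sketch of the string versions, which avoid the `[]byte` conversions, the example below applies the same two patterns to the same text. The helper names `words` and `parts` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	wordRe = regexp.MustCompile(`[a-z]{2,4}`)
	partRe = regexp.MustCompile(`am(.*)lang(.*)`)
)

// words (hypothetical helper) returns up to n runs of 2 to 4 lower-case
// letters in s; a negative n means no limit.
func words(s string, n int) []string {
	return wordRe.FindAllString(s, n)
}

// parts (hypothetical helper) returns the whole match followed by the
// text captured by each pair of parentheses.
func parts(s string) []string {
	return partRe.FindStringSubmatch(s)
}

func main() {
	a := "I am learning Go language"

	fmt.Printf("%q\n", words(a, -1))            // ["am" "lear" "ning" "lang" "uage"]
	fmt.Printf("%q\n", words(a, 2))             // ["am" "lear"]
	fmt.Println(wordRe.FindStringIndex(a))      // [2 4]
	fmt.Printf("%q\n", parts(a))                // ["am learning Go language" " learning Go " "uage"]
}
```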

As we introduced before, `Regexp` also has 3 methods for matching. They do exactly the same thing as the exported functions; in fact, the exported functions call these methods under the hood:

    func (re *Regexp) Match(b []byte) bool
    func (re *Regexp) MatchReader(r io.RuneReader) bool
    func (re *Regexp) MatchString(s string) bool
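
Because the exported functions have to parse the pattern on every call, it can be worthwhile to compile a pattern once and reuse it when many inputs are checked. A minimal sketch, assuming a package-level variable `digitsOnly` and a helper `isNumber` that are invented for this example:

```go
package main

import (
	"fmt"
	"regexp"
)

// digitsOnly is compiled once at program start and reused for every call.
var digitsOnly = regexp.MustCompile(`^[0-9]+$`)

// isNumber (hypothetical helper) reports whether s consists only of digits.
func isNumber(s string) bool {
	return digitsOnly.MatchString(s)
}

func main() {
	for _, s := range []string{"2024", "20x4", ""} {
		fmt.Printf("%q -> %v\n", s, isNumber(s))
	}

	// The byte slice method works on the same compiled Regexp.
	fmt.Println(digitsOnly.Match([]byte("007"))) // true
}
```

A `*Regexp` is safe for concurrent use by multiple goroutines, so sharing one compiled value like this is a common arrangement.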

Next, let's see how to do replacement with Regexp:

    func (re *Regexp) ReplaceAll(src, repl []byte) []byte
    func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
    func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
    func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
    func (re *Regexp) ReplaceAllString(src, repl string) string
    func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string

Some of these are used in the crawler example, so we won't explain them further here.
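
For a quick side-by-side view, the sketch below applies three of the string versions to the same input. The pattern `kv`, the sample text and the helper names are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// kv matches simple key=value pairs and captures both sides.
var kv = regexp.MustCompile(`(\w+)=(\w+)`)

// swap expands $1 and $2 in the replacement to the captured submatches.
func swap(s string) string {
	return kv.ReplaceAllString(s, "$2=$1")
}

// literal inserts the replacement text as-is, without expanding $1 or $2.
func literal(s string) string {
	return kv.ReplaceAllLiteralString(s, "$2=$1")
}

// upper passes every match through a function and uses its result.
func upper(s string) string {
	return kv.ReplaceAllStringFunc(s, strings.ToUpper)
}

func main() {
	src := "name=gopher lang=go"
	fmt.Println(swap(src))    // gopher=name go=lang
	fmt.Println(literal(src)) // $2=$1 $2=$1
	fmt.Println(upper(src))   // NAME=GOPHER LANG=GO
}
```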

Let's take a look at `Expand`:

    func (re *Regexp) Expand(dst []byte, template []byte, src []byte, match []int) []byte
    func (re *Regexp) ExpandString(dst []byte, template string, src string, match []int) []byte

So how do we use `Expand`?

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        src := []byte(`
            call hello alice
            hello bob
            call hello eve
        `)
        // The named groups cmd and arg can be referenced as $cmd and $arg.
        pat := regexp.MustCompile(`(?m)(call)\s+(?P<cmd>\w+)\s+(?P<arg>.+)\s*$`)
        res := []byte{}
        for _, s := range pat.FindAllSubmatchIndex(src, -1) {
            // Expand appends the template to res, filling in the submatches.
            res = pat.Expand(res, []byte("$cmd('$arg')\n"), src, s)
        }
        fmt.Println(string(res))
        // Output:
        // hello('alice')
        // hello('eve')
    }

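
`ExpandString` works the same way on string input. The sketch below is a variation on the example above rather than a replacement for it: the pattern is anchored with `^` and uses `\w+` for the argument, and the names `callPat` and `rewrite` are assumptions made here.

```go
package main

import (
	"fmt"
	"regexp"
)

// callPat matches lines of the form "call <cmd> <arg>".
var callPat = regexp.MustCompile(`(?m)^call\s+(?P<cmd>\w+)\s+(?P<arg>\w+)$`)

// rewrite (hypothetical helper) turns each matching line of src into
// "<cmd>('<arg>')" and drops every other line.
func rewrite(src string) string {
	var dst []byte
	for _, m := range callPat.FindAllStringSubmatchIndex(src, -1) {
		// m holds the index pairs of the match and its submatches.
		dst = callPat.ExpandString(dst, "$cmd('$arg')\n", src, m)
	}
	return string(dst)
}

func main() {
	src := "call hello alice\nhello bob\ncall hello eve\n"
	fmt.Print(rewrite(src))
	// Output:
	// hello('alice')
	// hello('eve')
}
```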

At this point, you've learned the whole `regexp` package in Go. I hope that studying the examples of its key methods helps you understand it better, and that you go on to build something interesting yourself.

##Links