3 String Match

7.1

3 String Match

Thanks to Daniel Prager for coming up with this problem on the Racket mailing list.

Sometimes, especially for scripting or solving coding puzzles, it can be useful to have tools for parsing strings in a particular format. Regular expressions can serve that purpose, but they can be difficult to read, especially when capture groups are involved.

It could be convenient to have a string-match form that’s able to parse simple, common cases without having the complexity (or power) of full regular expressions. For example, we might wish to parse lines with the following structure:

"favorite-numbers.txt"
10:Alice:a nice, round number
16:Bob:a round number in base 2
42:Douglas:the ultimate answer

Using a special string-match form, we could parse them like this:

(define (parse-line str)
  (string-match str
    [(n ":" name ":" description)
     (list (string->number n)
           name
           description)]))

> (map parse-line (file->lines "favorite-numbers.txt"))
'((10 "Alice" "a nice, round number")
  (16 "Bob" "a round number in base 2")
  (42 "Douglas" "the ultimate answer"))

A slightly more complex example might involve small sentences with a simple, predefined structure:

"statements.txt"
The sky is blue.
My car is red.
The heat death of the universe is inevitable.

We could use string-match to parse these sorts of structures, too:

(define (parse-sentence str)
  (string-match str
    [("The " noun " is " adjective ".")
     `(statement ,noun ,adjective)]
    [("My " noun " is " adjective ".")
     `(possessive-statement ,noun ,adjective)]))

> (map parse-sentence (file->lines "sentences.txt"))
'((statement "sky" "blue")
  (possessive-statement "car" "red")
  (statement "heat death of the universe" "inevitable"))

In this exercise, you will implement string-match.

3.1 Implementation strategy

How can we implement string-match? It may seem like an involved macro, but fortunately, we don’t have to implement it from scratch. Racket’s macro system is compositional, so we can assemble most of it out of existing parts already provided by the standard library.

With this in mind, consider what string-match actually does. From a high level, it performs two distinct tasks:

It matches strings against a particular pattern…
…then binds matched substrings to variables.

Each of these goals can be accomplished using existing Racket features. For the first part, we can use regular expressions by compiling our patterns to regular expressions at compile-time. For the second part, we can use match from racket/match, which conveniently already supports matching strings against regular expressions.

3.1.1 Regular expressions and syntax

How can we compile our string-match patterns to regular expressions? Well, Racket supports regular expression literals, written like #rx"" or #px"". For more information on the precise syntax, see Reading Regular Expressions.

For our purposes, however, the exact details of the syntax don’t matter. What’s important is that Racket supports regular expression literals, which means regular expressions are valid syntax objects! For proof, try syntax-quoting a regular expression literal:

> #'#rx"[a-z]+"
#<syntax:eval:1:0 #rx"[a-z]+">

We can exploit this property to embed regular expressions in the output of a macro. To create these regular expressions in the first place, we can produce them from strings using the regexp function:

> (regexp "[a-z]+")
#rx"[a-z]+"

We can also use regexp-quote to safely embed constant strings inside a regular expression without worrying about escaping special characters:

> (regexp (string-append (regexp-quote "+?[") "[^\\]]+" (regexp-quote "]?+")))
#rx"\\+\\?\\[[^\\]]+\\]\\?\\+"

Finally, we can use datum->syntax to convert a dynamically-constructed regular expression into a regular expression literal that can be embedded in syntax:

> (datum->syntax #'here (regexp (string-append "hello" "-" "world")))
#<syntax #rx"hello-world">

Using this technique, we can use regexp, regexp-quote, and datum->syntax to compile our string-match patterns into fast, efficient regexps.

3.1.2 Pattern-matching with racket/match

Racket’s match form is useful for all sorts of pattern-matching, including pattern-matching against regexps. For example, it’s possible to parse a simple URL using the following match expression:

> (match "http://example.com/foobar"
[(regexp #rx"^([^:]+)://([^/]+)/(.+)$" (list _ protocol hostname path))
(list protocol hostname path)])
'("http" "example.com" "foobar")

You can read the match documentation for the full syntax, but the most relevant piece for our string-match macro is the (regexp rx pat) pattern, where rx is a regular expression and pat is a pattern that will be matched against the results of the regexp match.

The syntax for match is a little different from our string-match, since it uses regular expressions instead of inline sequences of strings and bindings. Still, it’s pretty close, and combined with the previous section on regular expressions, you may be able to see how to combine the two to produce an efficient implementation of string-match.

3.2 Specification

With the explanations out of the way, it’s now time to implement string-match. Your task is to implement a simple, greedy string-matching macro that is efficiently compiled to regular expressions.

Here is an example use of string-match:

(define (sm str)
  (string-match str
    [(a "--" b " " c " end") (list (string->symbol a) (string->number b) c)]
    [("the " object " is " adjective) (~a adjective " " object)]
    [("whole-string") 'whole]))

You can see that string-match looks remarkably like match. The only difference is the pattern syntax, which is composed of a sequence of strings and identifiers. Each pattern should be compiled into a regular expression by embedding string literals directly into the result using regexp-quote, and identifiers should become capture groups of zero or more characters, (.*).

To illustrate the above rules, that example should roughly expand into the following code:

(define (sm str)
  (match str
    [(regexp #rx"^(.*)--(.*) (.*) end$" (list _ a b c))
     (list (string->symbol a) (string->number b) c)]
    [(regexp #rx"^the (.*) is (.*)$" (list _ object adjective))
     (~a adjective " " object)]
    [(regexp #rx"^whole-string$" (list _))
     'whole]))

The defined function, sm, should produce the following results when applied to input:

> (sm "abc--123 foo end")
'(abc 123 "foo")
> (sm "the fox is black")
"black fox"
> (sm "whole-string")
'whole
> (sm "abc--123--456 bar end")
'(abc--123 456 "bar")

3.3 Grammar

syntax
(string-match str-expr
[(str-pat ...+) body-expr ...+] ...)

str-pat = string-literal
| binding-id

str-expr : string?

← prev up next →

1	Threading I
2	Threading II
3	String Match
4	Exhaustive Enumerations