grammar;
pub Invocation: Vec<&'input str> = {
    <WORD+> NEWLINE,
};
// Several of the regexps below make use of Unicode character classes. [1] is
// the official reference to Unicode classes, and [2] is a site that is useful
// for browsing to get an intuitive idea of what the classes mean.
//
// In maintaining these regexps, it's important to understand the structure
// of Unicode character classes. There are seven top-level categories, each
// with a single-character name (e.g. "Z" for separators). Each top-level
// category has several subcategories which form an exhaustive partition of it;
// the subcategories have two-character names (e.g. "Zs" for space separators).
// Every codepoint, including unassigned ones (subcategory "Cn"), is in exactly
// one top-level category and exactly one subcategory.
//
// It is important that these regexps exhaustively cover the entirety of
// Unicode, without omission; otherwise lalrpop's lexer will give InvalidToken
// errors for unrecognized characters. Overlaps will be less catastrophic, as
// they'll be resolved by the precedence rules, but for clarity's sake they
// should be avoided.
//
// [1] http://www.unicode.org/reports/tr44/#General_Category_Values
// [2] https://www.compart.com/en/unicode/category
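//
// For orientation, a few example codepoints and their subcategories (an
// illustrative selection, not an exhaustive list):
//
//     U+0020 SPACE                 Zs
//     U+00A0 NO-BREAK SPACE        Zs
//     U+2028 LINE SEPARATOR        Zl
//     U+2029 PARAGRAPH SEPARATOR   Zp
//     U+0009 CHARACTER TABULATION  Cc
//     U+000A LINE FEED             Cc
//     U+000D CARRIAGE RETURN       Cc
//     U+200B ZERO WIDTH SPACE      Cf
//     U+0022 QUOTATION MARK        Po
//
// Note that tab is Cc rather than Zs, so it is not a space separator.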
//
match {
// Zs is the Unicode class for space separators. This includes the ASCII
// space character.
//
r"\p{Zs}+" => { },
// Zl is the Unicode class for line separators. Zp is the Unicode class for
// paragraph separators. Newline and carriage return are included individually
// here, since Unicode classifies them with the control characters rather than
// with the space characters.
//
r"[\p{Zl}\p{Zp}\n\r]" => NEWLINE,
// This one recognizes exactly one character, the old-school double-quote. As
// tempting as it is to do something clever with character classes, shells have
// a long history of quoting syntaxes which are subtle and quick to anger, and
// for this project the decision is to be radically simple instead.
r#"["]"# => QUOTE,
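// This one matches any character in category C other than line feed and
// carriage return. C ("Other") covers the control characters (Cc) as well as
// format, surrogate, private-use, and unassigned codepoints. The grammar
// doesn't reference these characters, but having a token for them makes the
// error messages more informative.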
r"[\p{C}&&[^\n\r]]" => CONTROL,
// Z is the Unicode class for separators, including line, paragraph, and space
// separators. C is the class that contains the control characters. This regexp
// matches the intersection of the negations of these two classes with a
// negated class enumerating the explicitly-recognized characters (currently
// just the double-quote), which means it matches any character NOT matched by
// the regexps above.
//
// Note that, counterintuitively, line feed and carriage return are classified
// as control characters, not as line separators. Either way, this regexp would
// still exclude them, but the difference might be relevant when maintaining
// it.
//
r#"[\P{Z}&&\P{C}&&[^"]]+"# => WORD,
}
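//
// Worked example (illustrative only; the sample input is an assumption, not
// taken from the project's tests). Reading the rules above, the input
//
//     echo "hi" there<LF>
//
// should lex as:
//
//     echo      WORD     (neither Z nor C, and not a double-quote)
//     U+0020    skipped  (Zs)
//     "         QUOTE
//     hi        WORD
//     "         QUOTE
//     U+0020    skipped  (Zs)
//     there     WORD
//     <LF>      NEWLINE  (U+000A, listed explicitly because it is Cc)
//
// Since Invocation accepts only WORD+ NEWLINE, the QUOTE tokens in this input
// would presumably be rejected by the parser as unexpected tokens, rather than
// by the lexer as InvalidToken errors.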