grammar;

pub Invocation: Vec<&'input str> = {
  <WORD*> NEWLINE,
};

// Several of the regexps below make use of Unicode character classes. [1] is
// the official reference to Unicode classes, and [2] is a site that is useful
// for browsing to get an intuitive idea of what the classes mean.
//
// In maintaining these regexps, it's important to understand the structure
// of Unicode character classes. There are seven top-level categories, each
// with a single-character name (ie. "Z" for separators). Each top-level
// category has several subcategories which form an exhaustive partition of it;
// the subcategories have two-character names (ie. "Zs" for space separators).
// Every allocated codepoint is in exactly one top-level category and exactly
// one subcategory.
//
// It is important that these regexps exhaustively cover the entirety of
// Unicode, without omission; otherwise lalrpop's lexer will give InvalidToken
// errors for unrecognized characters. Overlaps will be less catastrophic, as
// they'll be resoved by the precedence rules, but for clarity's sake they
// should be avoided.
//
// [1] http://www.unicode.org/reports/tr44/#General_Category_Values
// [2] https://www.compart.com/en/unicode/category
//
match {
  // Zs is the Unicode class for space separators. This includes the ASCII
  // space character.
  //
  r"\p{Zs}+" => { },

  // Zl is the Unicode class for line separators. Zp is the Unicode class for
  // paragraph separators. Newline and carriage return are included individually
  // here, since Unicode classifies them with the control characters rather than
  // with the space characters.
  //
  r"[\p{Zl}\p{Zp}\n\r]" => NEWLINE,

  // This one recognizes exactly one character, the old-school double-quote. As
  // tempting as it is to do something clever with character classes, shells have
  // a long history of quoting syntaxes which are subtle and quick to anger, and
  // for this project the decision is to be radically simple instead.
  r#"["]"# => QUOTE,

  // This one matches any control character other than line feed and carriage
  // return. The grammar doesn't reference control characters, but having a
  // token for them makes the error messages more informative.
  r"[\p{C}&&[^\n\r]]" => CONTROL,

  // Z is the unicode class for separators, which is exhaustively partitioned
  // into line, paragraph, and space separators. Each of those subclasses is
  // handled above. C is the class for control characters. This regexp tests
  // for the intersection of the negation of these character classes, along
  // with a negated class enumerating all the explicitly-recognized characters,
  // which means it matches any character NOT in the regexps above.
  //
  // Note that, counterintuitively, line feed and carriage return are classified
  // as control characters, not as line separators. Either way, this regexp would
  // still exclude them, but the difference might be relevant when maintaining
  // it.
  //
  r#"[\P{Z}&&\P{C}&&[^"]]+"# => WORD,
}