Glob matching

Table Of Contents ↓

Motivation

Understanding what is glob and how glob matching operates will help you adopt this useful mini-language in your domain whether you’re a user, programmer, or an architect.

What is glob?

Long ago, in UNIX V6, there was a program /etc/glob that would expand wildcard patterns. Soon afterward this became a shell built-in.

glob(7) — Linux manual page

Glob, short for global, is many things:

  1. a DSL to describe text patterns
  2. a library function
  3. a unix command

having multiple meaning and such a generic name, not surprisingly, leads to the confusion.

This post focuses on 1. - the DSL to describe text patterns. The pattern language migrated from the Unix shell into variety of platforms, which speaks for its usefulness.

glob matching has been spotted in:

Grammar

Being a language, it has a grammar, expressed in EBNF, looks like:

pattern:
	{ term }    
term:
	'*'         matches any sequence of characters
	'?'         matches any single character
	'[' [ '^' ] { character-range } ']'
	            character class (must be non-empty)
	c           matches character c (c != '*', '?', '\\', '[')
	'\\' c      matches character c

character-range:
	c           matches character c (c != '\\', '-', ']')
	'\\' c      matches character c
	lo '-' hi   matches character c for lo <= c <= hi
✋ note: original phrasing `non-Separator characters` is replaced with `characters` since we're not dealing with file paths.

There are many flavors of the grammar, but today’s focus is mainly on the wildcard *, since it’s the most useful and complex piece.

How it works

As mentioned earlier glob is a pattern matching language. The main focus is on * wildcard and how it matches text.

 '*'         matches any sequence of characters

Next, some examples:

globinput textoutcomecomments
✅ matchempty glob matches empty input ony: literal match
anything❌nomatch
hehe✅matchliteral match: the glob contains no wildcards
hehello❌nomatch
he*hello✅matchprefix match
he*hi hello❌nomatch
*lohello✅matchpostfix match
*lohello world❌nomatch
*he*the hello world✅matchinfix match
*he*world❌nomatch

👆 ️note: the empty table cells correspond to empty strings “”.

Since initial version, many glob implementations extended its functionality, and one of the most used ones is Bash’s extglob, with new syntax constructs.

Glob vs Regex

ℹ️ Don’t confuse glob with regular expressions(regex): they look alike but aren’t the same.

Unlike glob, regex matches a substring unless expressed to match start or end of the input.

Comparison of glob and the corresponding regular expression:

GlobRegexDescription
^$match empty string
*.*match everything
.\.the dot is treated literally by glob, but it needs to be escaped for regex interpreters
?.match any single symbol
he^he$exact match
he*^he.*prefix
*he.*he$postfix
*he*heinfix. Note regex is shorter this time

✋ Shells treat inputs with leading . specially: the * glob doesn’t match those, unless explicitly specified, ie:

  • ls * doesn’t list “hidden” files starting with ., but
  • ls .* lists

Implementation

Given that now we understand the DSL, how would we incorporate glob matching into software? We need to implement it.

Russ Cox posted about glob including the algorithm which is both elegant and performant (there are many with exponential runtime-s out there).

Below Go snippet is simplified to only handle * wildcard. I’ve commented it while studying the implementation, let me know if there’s a better way to think about it.


func match(glob, str string) bool {
	var gx, sx int
	var lastgx, nextsx int
	for gx < len(glob) || sx < len(str) {
		if gx < len(glob) {
			switch c := glob[gx]; c {
			default:
				// literal comparison
				// "consume" if both match
				if sx < len(str) && str[sx] == c {
					gx += 1
					sx += 1
					continue
				}
			case '*':
				// *lo ~ ablalo
				lastgx = gx     // remember the '*' position
				nextsx = sx + 1 // remember the next position in str
				gx += 1         // advance glob position
				// next iteration, will
				// - either match glob[gx] and str[sx], continuing
				// - or no-match
				//   in which case `gx` will be re-set to `lastgx` from above
				//   and sx will point to nextsx, 
        //   effectively "consuming" the no-match by '*'
				continue
			}
		}
		// glob is exhausted
		// try restarting from last '*' and next position in str
		// NOTE: `<=` here: signals "reached the end of the string"
		if 0 < nextsx && nextsx <= len(str) {
			gx = lastgx
			sx = nextsx
			continue
		}
		return false
	}

	return true
}

Algorithm Walkthrough

As an example, below is a matching sequence with glob: *lo* and input holo.

GlobInputDetails
*lo*holoinitial state
lo*holo* consumed; remembering its position and next position in string. Try match leftmost symbols.
*lo*olono-match in previous iteration: recover * position and consume non-matched symbol by moving to the next input symbol as set remembered in the previous step
lo*olo* consumed; remember again; attempt the literal match
*lo*lono-match again; recover last * and move to the next symbol
lo*lo* consumed; also literal match; consume both
o*oliteral match; continue
*consume *; remember its position again
it’s a match: both glob and the input string reached the end simultaneously

Summary

glob matching is a useful tool. Despite limited expressiveness, it’s a preferred solution for certain use-cases due to grammar simplicity and good UX.

I’ve picked it up intuitively, while using shell, without really thinking through how it operates, this post addressed the gap.

Next up: regular expressions?

Resources

  1. Man Page: glob
  2. Wildcards
  3. Bash globbing
  4. More shell globbing examples
Read More
Technical Writing One: Notes
Quines
Comments
read or add one↓