python - Regular expression pattern questions?
I am having a hard time understanding regular expression patterns. I need a regular expression pattern that matches words ending in s, and words that start and end with a (like ana). How do I write the ending?
Word boundaries are given by \b. The following regex matches words ending in ing or s: "\b(\w+?(?:ing|s))\b". Here \b is a word boundary, \w+ is one or more "word characters", and (?:ing|s) is a non-capturing group that matches either ing or s.
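As a quick check of the pattern above, here is a minimal sketch using Python's re module. The helper name words_ending_in_ing_or_s and the sample sentence are made up for illustration.

```python
import re

# Pattern from the answer: a whole word (group 1) that ends in "ing" or "s".
PATTERN = re.compile(r"\b(\w+?(?:ing|s))\b")

def words_ending_in_ing_or_s(text):
    """Return every word in text that ends in 'ing' or 's' (hypothetical helper)."""
    # With exactly one capturing group, findall returns that group's content.
    return PATTERN.findall(text)

print(words_ending_in_ing_or_s("fishing for words in the sea"))
# ['fishing', 'words']
```

Note the r prefix on the pattern: it keeps Python from interpreting the backslashes before the regex engine sees them.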
As you asked "how to develop a regex":
First: don't use regexes for complex tasks. They are hard to read, write and maintain. For example, there is a regex that validates email addresses, but it is computer generated and nothing you should use in practice.
Start simple and add edge cases. At the beginning, plan which characters you need to use. You said you need words ending in s or ing, so you need something to represent a word, the endings of the words, and the literal characters s and ing. What is a word? This might change from case to case, but it is at least every alphabetical character. Looking in the Python documentation on regexes, you can find \w, which for ASCII text is [a-zA-Z0-9_] and which fits my impression of a word character. There you can also find \b, which is a word boundary.
So the "first pseudo code try" is something like \b\w...\w\b, which matches a word. We still need to "formalize" the ... part, which we want to mean "one or more characters"; this directly translates to \b\w+\b. We can now match a word! We still need the s or ing. The | character translates to "or", so how about the following: \b\w+ing|s\b? If you test this, you'll see that it matches confusing things, such as the sing in singer or only the final s of words, which our regex should not match. What is happening? As you probably saw, the regex can't know "which part should be or'ed", so we need to introduce parentheses: \b\w+(ing|s)\b. Congratulations, you have now arrived at a working regex!
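To see why the parentheses matter, the two patterns can be compared side by side. This is only a sketch: matched_text is a hypothetical helper and the sample string is made up. It uses group(0) so that the whole match is shown regardless of any groups.

```python
import re

def matched_text(pattern, text):
    """Return the full text of every match (group 0), ignoring any groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]

sample = "singer words fishing"

# Without parentheses, | splits the whole pattern into "\b\w+ing" OR "s\b".
without_parens = matched_text(r"\b\w+ing|s\b", sample)
print(without_parens)  # ['sing', 's', 'fishing']

# With parentheses, only the ending alternates and both \b anchors apply.
with_parens = matched_text(r"\b\w+(ing|s)\b", sample)
print(with_parens)  # ['words', 'fishing']
```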
Why (and how) does this differ from the example I gave first? First, I wrote \w+? instead of \w+; the ? turns the + into its non-greedy version. If you know what the difference between greedy and non-greedy is, skip this paragraph. Consider the following: AaAabA, where we want to match the things enclosed by a capital letter A. A naive try is A\w+A, that is, one or more word characters enclosed by A. This does not stop at AaA but matches all of AaAabA, because A can still be matched by \w. Without further configuration, the quantifiers *, + and ? try to match as much as possible. Sometimes, as in this example, you don't want that; then you can put a ? after the quantifier to signal that you want the non-greedy version, which matches as little as possible.
But in our case this isn't needed, because words are separated by whitespace, which is not part of \w. So in fact you can let + be greedy and it will be alright. If you use . (any character), you need to be careful not to match too much.
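The greedy and non-greedy behaviour can be sketched as follows. The mixed-case sample string AaAabA is an assumption standing in for the example in the text, and the variable names are illustrative.

```python
import re

text = "AaAabA"  # assumed sample: lowercase letters enclosed by capital A

# Greedy: \w+ grabs as much as possible, so the match runs to the last A.
greedy = re.search(r"A\w+A", text).group(0)
print(greedy)  # AaAabA

# Non-greedy: \w+? grabs as little as possible, so the match stops at the next A.
lazy = re.search(r"A\w+?A", text).group(0)
print(lazy)  # AaA

# For whole words delimited by \b, both variants find the same words.
words = "fishing words"
same = re.findall(r"\b\w+?(?:ing|s)\b", words) == re.findall(r"\b\w+(?:ing|s)\b", words)
print(same)  # True
```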
The other difference is using (?:s|ing) instead of (s|ing). What does the ?: do here? It changes a capturing group into a non-capturing group. Generally you don't want to get "everything" from a regex. Consider the following regex: I want to go to \w+. You are not interested in the whole sentence, only in the \w+ part, so you can capture it in a group: I want to go to (\w+). This means that you are interested in this specific piece of information and want to retrieve it later. Sometimes (like when using |) you need to group expressions together but are not interested in their content; then you can declare the group as non-capturing. Otherwise you will get the group (s or ing) and not the actual word!
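A short sketch of the difference, using re.findall, which returns the group content when the pattern has exactly one capturing group. The sentence about Paris is a made-up example.

```python
import re

text = "fishing words"

# Capturing group: findall returns only what the group captured.
captured = re.findall(r"\b\w+(ing|s)\b", text)
print(captured)  # ['ing', 's']

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"\b\w+(?:ing|s)\b", text)
print(whole)  # ['fishing', 'words']

# Capturing is useful when only one piece of the match is of interest.
m = re.search(r"I want to go to (\w+)", "I want to go to Paris tomorrow")
destination = m.group(1)
print(destination)  # Paris
```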
So to summarize:
* Start small.
* Add one case after another.
* Test with examples.
In fact, I tried re.findall("\b\w+(?:ing|s)\b", "fishing words") and it didn't work, whereas \w+(?:ing|s) works. I have no idea why; maybe someone else can explain that. Regexes are an arcane thing, so only use them for easy and easily testable tasks.
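A likely explanation for the failed findall call, assuming the pattern was written as an ordinary (non-raw) string literal: Python turns "\b" into a backspace character before the regex engine ever sees it, so the pattern looks for literal backspaces. The sketch below compares this with a raw string; the variable names are only illustrative.

```python
import re

text = "fishing words"

# In a normal string literal, "\b" is the backspace character, not a word boundary.
print("\b" == "\x08")  # True

# This is what "\b\w+(?:ing|s)\b" amounts to without the r prefix:
# backspace, then the word pattern, then backspace.
non_raw = "\b" + r"\w+(?:ing|s)" + "\b"
non_raw_result = re.findall(non_raw, text)
print(non_raw_result)  # []

# A raw string passes the backslashes through to the regex engine unchanged.
raw = r"\b\w+(?:ing|s)\b"
raw_result = re.findall(raw, text)
print(raw_result)  # ['fishing', 'words']
```

Writing regex patterns as raw strings (r"...") by default avoids this class of surprise.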