Let’s imagine that you have to check if a string is a valid email. You could come up with something like:
/[a-zA-Z0-9\.]+@[a-z]+\.[a-z]+/
It works, right? WRONG. Sure, it’ll handle a couple of your test examples, but it’s not ready for real-world usage. Here’s a standards-compliant Perl regex, which follows the address grammar of RFC 2822:
/(?(DEFINE)
  (?<address> (?&mailbox) | (?&group))
  (?<mailbox> (?&name_addr) | (?&addr_spec))
  (?<name_addr> (?&display_name)? (?&angle_addr))
  (?<angle_addr> (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
  (?<group> (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ;
            (?&CFWS)?)
  (?<display_name> (?&phrase))
  (?<mailbox_list> (?&mailbox) (?: , (?&mailbox))*)
  (?<addr_spec> (?&local_part) \@ (?&domain))
  (?<local_part> (?&dot_atom) | (?&quoted_string))
  (?<domain> (?&dot_atom) | (?&domain_literal))
  (?<domain_literal> (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                     \] (?&CFWS)?)
  (?<dcontent> (?&dtext) | (?&quoted_pair))
  (?<dtext> (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])
  (?<atext> (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
  (?<atom> (?&CFWS)? (?&atext)+ (?&CFWS)?)
  (?<dot_atom> (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
  (?<dot_atom_text> (?&atext)+ (?: \. (?&atext)+)*)
  (?<text> [\x01-\x09\x0b\x0c\x0e-\x7f])
  (?<quoted_pair> \\ (?&text))
  (?<qtext> (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
  (?<qcontent> (?&qtext) | (?&quoted_pair))
  (?<quoted_string> (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                    (?&FWS)? (?&DQUOTE) (?&CFWS)?)
  (?<word> (?&atom) | (?&quoted_string))
  (?<phrase> (?&word)+)
  # Folding white space
  (?<FWS> (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
  (?<ctext> (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
  (?<ccontent> (?&ctext) | (?&quoted_pair) | (?&comment))
  (?<comment> \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
  (?<CFWS> (?: (?&FWS)? (?&comment))*
           (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))
  # No whitespace control
  (?<NO_WS_CTL> [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])
  (?<ALPHA> [A-Za-z])
  (?<DIGIT> [0-9])
  (?<CRLF> \x0d \x0a)
  (?<DQUOTE> ")
  (?<WSP> [\x20\x09])
)
(?&address)/x
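The big pattern gets easier to read once you notice that every `(?<name> ...)` line is a grammar rule, and later rules call it by name. As a rough illustration only, here are two of those rules, `atext` and `dot_atom_text`, rebuilt by hand in Python. Python's built-in `re` module has no `(?(DEFINE)...)` or named subroutine calls, so plain string interpolation stands in for them here. The constant names and the `is_dot_atom_text` helper are made up for this sketch, and the hyphen in the character class is escaped so that it is read as a literal character.

```python
import re

# atext: letters, digits and a fixed set of punctuation characters.
# The hyphen is escaped so it is a literal, not a range operator.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"

# dot_atom_text: one or more atext, then any number of ".atext..." groups.
# Interpolating ATEXT plays the role of the (?&atext) call in the Perl regex.
DOT_ATOM_TEXT = rf"{ATEXT}+(?:\.{ATEXT}+)*"


def is_dot_atom_text(s: str) -> bool:
    """Return True if the whole string matches the dot_atom_text rule."""
    return re.fullmatch(DOT_ATOM_TEXT, s) is not None


for sample in ["john.doe", "john..doe", ".john", "o'brien+news"]:
    print(sample, is_dot_atom_text(sample))
```

This covers only a small corner of the grammar: comments, folding white space, quoted strings and domain literals are all left out, which is where most of the remaining lines of the Perl regex go.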
I couldn’t have imagined that the matter was this complex.
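To make the "WRONG" above concrete, here is a small sketch that runs the short pattern from the top of this post against a handful of addresses. It uses Python's `re` module rather than Perl, the helper name `naive_is_email` is made up, and the sample addresses are only illustrative.

```python
import re

# The short pattern from the top of the post, unchanged.
NAIVE = re.compile(r"[a-zA-Z0-9\.]+@[a-z]+\.[a-z]+")


def naive_is_email(s: str) -> bool:
    """Return True if the whole string matches the short pattern."""
    return NAIVE.fullmatch(s) is not None


# Valid addresses that the short pattern rejects.
rejected = [
    "john+news@example.com",  # '+' is allowed in the local part
    "john@mail.example.com",  # subdomain: two dots after the '@'
    "john@my-host.com",       # hyphen in the domain
    "john@Example.com",       # uppercase letter in the domain
]

# An invalid address that it accepts: the local part is nothing but dots.
accepted = ["..@example.com"]

for s in rejected + accepted:
    print(f"{s!r:28} -> {naive_is_email(s)}")
```

The pattern as originally written has no anchors either, so used as a plain search it also "finds" an email inside longer text: `NAIVE.search("no email here: a@b.c, sorry")` returns a match.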