regexp

A regular expression is a pattern used to search and replace in text.

Syntax

Via object:
const regexp = new RegExp("pattern", "flags")

Via slashes:
const regexp = /pattern/gmi

A slash literal is fully static: expressions cannot be inserted into it.
Slashes are used when we know the regular expression at the time of writing the code.
new RegExp is more often used when we need to create a regexp “on the fly” from a string.
In both cases regexp becomes an instance of the built-in RegExp class.

let tag = prompt("What tag do you want to find?", "h2");
let regexp = new RegExp(`<${tag}>`);

Flags

/.../i    case-insensitive
/.../g    all matches
/.../m    multiline mode
/.../s    "dotall" mode: the dot . also matches the newline character \n
/.../u    full Unicode support (correct processing of surrogate pairs)
/.../y    "sticky" mode: searching at the exact position in the text

"We will, we will rock you".match(/we/gi) // ["We", "we"]
"A\nB".match(/A.B/) // null
"A\nB".match(/A.B/s) // ['A\nB', index: 0, input: 'A\nB', groups: undefined]
'😄 sfds'.match(/\p{Emoji}/gu) // ["😄"]

Character classes

/./        any character, except a newline
/\d/       digit
/\s/       space, including tabs \t, newlines \n, and \v, \f, \r
/\w/       "wordy" character: a letter of the Latin alphabet, a digit, or an underscore
/\D/       non-digit: any character except \d, e.g. a letter
/\S/       non-space: any character except \s, e.g. a letter
/\W/       non-wordy character: anything but \w, e.g. a non-Latin letter or a space
/[\s\S]/   anything: a space character OR not a space character
/[\d\D]/   anything
/[^]/      anything

"Z".match(/./) // Z
"Is there CSS4".match(/CSS\d/g) // ['CSS4'] // matches the string 'CSS' with a digit after it
"+7(903)-123-45-67".match(/\d/) // '7' (without flag g, only the first match)
"+7(903)-123-45-67".match(/\d/g) // ['7', '9', '0', '3', '1', '2', '3', '4', '5', '6', '7']
"+7(903)-123-45-67".match(/\D/g) // ['+', '(', ')', '-', '-', '-']
"+7(903)-123-45-67".replace(/\D/g, "") // 79031234567
"CSS4".match(/CS.4/) // 'CSS4'
"hi 123 свет".match(/[^]/g) // ['h', 'i', ' ', '1', '2', '3', ' ', 'с', 'в', 'е', 'т']

Unicode properties

There are 3
categories of Unicode properties: general category, script, and binary property. The \p{...} syntax works only with flag 'u'.

General category

Letter (L): Uppercase_Letter (Lu), Lowercase_Letter (Ll), Titlecase_Letter (Lt), Modifier_Letter (Lm), Other_Letter (Lo)
Mark (M): Non-Spacing_Mark (Mn), Spacing_Combining_Mark (Mc), Enclosing_Mark (Me)
Number (N): Decimal_Digit_Number (Nd), Letter_Number (Nl), Other_Number (No)
Symbol (S): Math_Symbol (Sm), Currency_Symbol (Sc), Modifier_Symbol (Sk), Other_Symbol (So)
Punctuation (P): Connector_Punctuation (Pc), Dash_Punctuation (Pd), Open_Punctuation (Ps), Close_Punctuation (Pe), Initial_Punctuation (Pi), Final_Punctuation (Pf), Other_Punctuation (Po)
Separator (Z): Space_Separator (Zs), Line_Separator (Zl), Paragraph_Separator (Zp)
Other (C): Control (Cc), Format (Cf), Surrogate (Cs), Private_Use (Co), Unassigned (Cn)

"10 >= 5".match(/\p{General_Category=Math_Symbol}/gu) // ['>', '=']
"10 >= 5".match(/\p{Math_Symbol}/gu) // ['>', '=']
"10 >= 5".match(/\p{Sm}/gu) // ['>', '=']

Script

"Привет man".match(/\p{Script=Cyrillic}/gu) // ['П', 'р', 'и', 'в', 'е', 'т']

Binary Unicode property

ASCII, ASCII_Hex_Digit, Alphabetic, Any, Dash, Emoji, Hex_Digit, Lowercase, Math, Noncharacter_Code_Point, Pattern_Syntax, Pattern_White_Space, Quotation_Mark, Radical, Regional_Indicator, Sentence_Terminal, Soft_Dotted, Terminal_Punctuation, Unified_Ideograph, Uppercase, White_Space

"1 plus 1 is 2".match(/\p{Alphabetic}/gu) // ['p', 'l', 'u', 's', 'i', 's']

Anchors + word boundary

/^/     matches at the beginning of the text
/$/     matches at the end of the text
/\b/    matches at a word boundary

/^Mary/.test("Mary had a little lamb") // true
/lamb$/.test("Mary had a little lamb") // true
"Hello, Java!".match(/\bJava\b/g) // ['Java']
"Hello, JavaScript!".match(/\bJava\b/g) // null
"1 23 456 78".match(/\b\d\d\b/g) // ["23", "78"]
"12,34,56".match(/\b\d\d\b/g) // ["12", "34", "56"]

// or, without a regexp
"Mary had a little lamb".startsWith("Mary") // true
"Mary had a little lamb".endsWith("lamb") // true

Test for a time format

let regexp = /^\d\d:\d\d$/
regexp.test("12:34") // true
regexp.test("12:345") // false

In multiline text with flag 'm', ^ and $ match at the start and end of every line, not only of the whole string.

let str = `1st place: Winnie
2nd place: Piglet
3rd place: Eeyore`
str.match(/^\d/gm) // ["1", "2", "3"]

Escaping, special characters

The special characters [ \ ^ $ . | ? * + ( ) must be escaped with a backslash \ to be matched literally.

"Chapter 5.1".match(/\d\.\d/g) // ['5.1']
"function g()".match(/g\(\)/g) // ['g()']
"1\\2".match(/\\/g) // ['\'] // look for a backslash

[ab] OR (sets)

Searches for any single character among the given ones.
Only characters or character classes are allowed inside.

[tm]      "t" or "m"
[\w-]     a wordy character or a hyphen
[\s\d]    a space or a digit

"Mop top".match(/[tm]op/gi) // ["Mop", "top"]
"Voila".match(/V[oi]la/) // null (a set matches exactly one character)

| OR

Alternation allows any expressions, not only single characters.
A regexp A|B|C means one of the expressions A, B or C.

I love HTML|CSS      matches "I love HTML" or "CSS"
I love (HTML|CSS)    matches "I love HTML" or "I love CSS"

"First HTML appeared, then CSS, then JavaScript".match(/html|php|css|java(script)?/gi) // ["HTML", "CSS", "JavaScript"]
"Java, JavaScript, PHP, C, C++".match(/Java(Script)?|C(\+\+)?|PHP/g) // ["Java", "JavaScript", "PHP", "C", "C++"]
"00:00 10:10 23:59 25:99 1:2".match(/([01]\d|2[0-3]):[0-5]\d/g) // ["00:00", "10:10", "23:59"]

Ranges

[a-z]     range from a to z
[0-5]     digit from 0 to 5
[\s\d]    a space or a digit

// searching for "x" followed by two digits or letters from A to F
"Exception 0xAF".match(/x[0-9A-F][0-9A-F]/g) // ["xAF"]

Exclude [^…]

[^aeyo]   any character except 'a', 'e', 'y' or 'o'
[^0-9]    any character except a digit, the same as \D
[^\s]     any non-space character, the same as \S

In square brackets the vast majority of special characters can be used without escaping, unless they mean something for the brackets themselves.

Quantifiers + , * , ? , {n}

{3}       exactly 3 times
{3,5}     from 3 to 5 times
{1,}      1 or more times
?         optional, same as {0,1}
*         zero or more, same as {0,}
+         one or more, same as {1,}
\d+       looks for numbers

"I'm 12345 years old".match(/\d{5}/) // "12345" // same as \d\d\d\d\d
"I'm not 12, but 1234 years old".match(/\d{3,5}/) // "1234"
"I'm not 12, but 345678 years old".match(/\d{3,}/) // "345678"
"+7(903)-123-45-67".match(/\d+/g) // ["7", "903", "123", "45", "67"]
"color or colour?".match(/colou?r/g) // ["color", "colour"]
"100 10 1".match(/\d0*/g) // ["100", "10", "1"] // looks for a digit followed by any number of zeroes (may be many or none)
"100 10 1".match(/\d0+/g) // ["100", "10"] // 1 not matched, as 0+ requires at least one zero
"0 1 12.345 7890".match(/\d+\.\d+/g) // ["12.345"] // regexp for decimal fractions
"<body> ... </body>".match(/<[a-z]+>/gi) // ["<body>"] // regexp for an “opening HTML-tag w/o attributes”, such as <span> or <p>
"Hello!... How goes?.....".match(/\.{3,}/g) // ["...", "....."] // find an ellipsis "..."

Greedy and lazy quantifiers

Greedy mode

// let's find ["witch", "broom"]
'a "witch" and her "broom" is one'.match(/".+"/g) // ['"witch" and her "broom"'] // not what we want

. means any character, + means one or more times.
After the opening quote, .+ consumes everything up to a newline or the end of the text, and only then does the engine look for the closing ".
There is nothing further to match, because .+ stopped at the very end.
The regular expression engine understands that it took too much and starts to backtrack.
On every attempt it shortens the match for the quantifier by one character from the end, until the rest of the pattern matches.
We get "witch" and her "broom".
Due to flag 'g' the search continues from the end of the previous match, but there are no more quotes in the rest of the string.
In the default greedy mode a quantified character is repeated as many times as possible.
The regexp adds as many characters as it can to the match for .+ and then shortens it one by one if the rest of the pattern doesn't match.
A greedy quantifier may lead to catastrophic backtracking and make the regexp take a very long time to execute.

Lazy mode

'a "witch" and her "broom" is one'.match(/".+?"/g) // ['"witch"', '"broom"']

Lazy mode is enabled by putting a question mark ? after the quantifier.
The quantifier then repeats the minimal number of times.
Usually ? is a quantifier itself (zero or one), but added after another quantifier it gets another meaning: it switches the matching mode from greedy to lazy.
Laziness is only enabled for the quantifier followed by ?. Other quantifiers remain greedy.

Capturing groups (...)

"Gogogo now!".match(/(go)+/ig) // ["Gogogo"]

(go)+ means 'go', 'gogo', 'gogogo' and so on.
The search engine memorizes the content matched by the parentheses.
Parentheses are numbered from left to right by their opening parenthesis.
Their contents can be retrieved from the result array.
The zero index of the result always holds the full match.
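The greedy and lazy behaviour and the group numbering described above can be checked with a short sketch. The sample strings and variable names are illustrative only, and the negated-set variant is an extra alternative that these notes do not cover.

```javascript
// Illustrative sample text with two quoted words.
const text = 'a "witch" and her "broom" is one';

// Greedy: .+ takes as much as possible, then backtracks to the LAST closing quote.
const greedy = text.match(/".+"/g);
console.log(greedy); // [ '"witch" and her "broom"' ]

// Lazy: .+? grows one character at a time and stops at the FIRST closing quote.
const lazy = text.match(/".+?"/g);
console.log(lazy); // [ '"witch"', '"broom"' ]

// Alternative (not from the notes): exclude the quote from the repeated set,
// so the quantifier can never run past the closing quote.
const negated = text.match(/"[^"]+"/g);
console.log(negated); // [ '"witch"', '"broom"' ]

// Capturing groups: index 0 is the full match, groups follow from left to right.
const dateMatch = "2019-04-30".match(/(\d{4})-(\d{2})-(\d{2})/);
console.log(dateMatch[0]); // 2019-04-30
console.log(dateMatch[1], dateMatch[2], dateMatch[3]); // 2019 04 30
```

The lazy and the negated-set patterns give the same result here; they differ once the quoted text may contain newlines, because the dot does not match \n without flag 's'.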
'<h1>Hello, world!</h1>'.match(/<.*?>/) // ['<h1>', index: 0, input: '<h1>Hello, world!</h1>', groups: undefined]
// with ()
'<h1>Hello, world!</h1>'.match(/<(.*?)>/) // ['<h1>', 'h1', index: 0, input: '<h1>Hello, world!</h1>', groups: undefined]

The content of a lookahead is not part of the match. To include it in the result, wrap it in additional parentheses:
"1 turkey costs 30€".match(/\d+(?=(€|kr))/) // ['30', '€', ...]

Capturing groups & str.replace()

"John Bull".replace(/(\w+) (\w+)/, '$2, $1') // Bull, John

Named groups

Remembering groups by their numbers is hard, so we can give names to parentheses.
That’s done by putting ?<name> immediately after the opening parenthesis.

let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/
let str = "2019-04-30"
let groups = str.match(dateRegexp).groups
groups.year // 2019
groups.month // 04
groups.day // 30

Exclude group with ?:

let str = "Gogogo John!"
let regexp = /(?:go)+ (\w+)/i // ?: excludes 'go' from capturing
let result = str.match(regexp)
result[0] // Gogogo John (full match)
result[1] // John
result.length // 2 (no more items in the array)

Named groups & str.replace()

let regexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/g
let str = "2019-10-30, 2020-01-01"
str.replace(regexp, '$<day>.$<month>.$<year>') // 30.10.2019, 01.01.2020

Backreference \1

We can refer to a capturing group from inside the pattern itself.
The engine finds the first quote with (['"]) and memorizes its content.
Further in the pattern, \1 means “find the same as in the first group”.
\2 would mean the contents of the second group.
If we use ?: in the group, then we can’t reference it.

`He said: "She's the one!".`.match(/(['"])(.*?)\1/g) // ['"She\'s the one!"']

Named groups can be referenced with \k<name>.

`He said: "She's the one!".`.match(/(?<quote>['"])(.*?)\k<quote>/g) // ['"She\'s the one!"']

Lookahead (?=...)

X(?=Y) means: look for "X", but match only if it is followed by "Y".
The content inside the parentheses (?=...)
is not included in the result.

"1 turkey costs 30€".match(/\d+(?=€)/) // "30"
// looks for a number that is followed by a space, and only if there’s 30 somewhere after it
"1 turkey costs 30€".match(/\d+(?=\s)(?=.*30)/) // '1'

Negative lookahead (?!...)

X(?!Y) means "search X, but only if not followed by Y".

"2 turkeys cost 60€".match(/\d+\b(?!€)/g) // ["2"]

Lookbehind (?<=Y)X

(?<=Y)X matches X only if there’s Y before it.

"1 turkey costs $30".match(/(?<=\$)\d+/) // '30'

Negative lookbehind (?<!Y)X

(?<!Y)X matches X only if there’s no Y before it.

"2 turkeys cost $60".match(/(?<!\$)\b\d+/g) // ["2"]
// Find non-negative integers
"0 12 -5 123 -18".match(/(?<!-)\d+/g) // ["0", "12", "123", "8"] // "8" is wrongly taken from "-18"
"0 12 -5 123 -18".match(/(?<![-\d])\d+/g) // ["0", "12", "123"] // also forbid a digit before the match

Search at position with 'y' flag /.../y

Flag 'y' allows performing the search at a given position in the source string.
Without flag 'g', regexp.exec(str) works like str.match(regexp).
With flag 'g' it performs the search in str starting from the position stored in the regexp.lastIndex property.
If there is a match, it sets regexp.lastIndex to the index immediately after the match.
Successive calls to regexp.exec(str) return matches one after another.

let str = 'let varName' // Let's find all words in this string
let regexp = /\w+/g
regexp.lastIndex // 0 (initially lastIndex=0)
let word1 = regexp.exec(str)
word1[0] // let (1st word)
regexp.lastIndex // 3 (position after the match)
let word2 = regexp.exec(str)
word2[0] // varName (2nd word)
regexp.lastIndex // 11 (position after the match)
let word3 = regexp.exec(str)
word3 // null (no more matches)
regexp.lastIndex // 0 (resets at search end)

// we can get all matches in the loop:
str = 'let varName'
regexp = /\w+/g
let result
while (result = regexp.exec(str)) {
  console.log( `Found ${result[0]} at position ${result.index}` )
  // Found let at position 0
  // Found varName at position 4
}

Flag 'y' makes regexp.exec() search exactly at position lastIndex, not further.

let str = 'let varName = "value"'
let regexp = /\w+/y
regexp.lastIndex = 3
regexp.exec(str) // null (there's a space at position 3, not a word)
regexp.lastIndex = 4
regexp.exec(str) // varName (word at position 4)

RegExp methods

str.match(regexp) - finds matches of 'regexp' in the string 'str'; with flag 'g' returns an array of all matches
str.matchAll(regexp) - returns not an array, but an iterable object
str.split(regexp, limit) - splits the string using the regexp (or a substring) as a delimiter
str.search(regexp) - returns the position of the first match, or -1 if none is found
str.replace(regexp, replacement) - replaces matches found using regexp in string str with replacement
str.replaceAll(regexp, replacement) - same as str.replace with flag 'g'
regexp.exec(str) - without flag 'g', works exactly like str.match(regexp)
regexp.test(str) - looks for at least one match; returns true if found, otherwise false

str.match(regexp)

Finds matches in a string.
Without flag g, returns the first match as an array with capturing groups and the properties index and input.
With flag g, returns an array of all matches as strings, without capturing groups.
If there are no matches, null is returned, with or without flag g.
To make sure the result is always an array, write let result = str.match(regexp) || []

// without flag g
let result = "I love JavaScript".match(/Java(Script)/) // (2) ['JavaScript', 'Script', index: 7, input: 'I love JavaScript', groups: undefined]
result[0] // JavaScript (full match)
result[1] // Script (first capturing group)
result.length // 2
result.index // 7 (match position)
result.input // I love JavaScript (source string)

// with flag g
result = "I love JavaScript".match(/Java(Script)/g)
result[0] // JavaScript
result.length // 1

// no matches
result = "I love JavaScript".match(/HTML/)
result // null

// more examples
let str = "We will, we will rock you"
str.match(/we/gi) // ["We", "we"]
str.match(/we/i) // ["We", index: 0, input: "We will, we will rock you", groups: undefined]
str.match(/hello/i) // null
let matches = "JavaScript".match(/HTML/) || []
if (!matches.length) alert("No matches")

str.matchAll(regexp)

Searches for all matches with all groups.
Returns an iterable object with matches.
If there are no results, it returns an empty
iterable object.
Every match is returned as an array with capturing groups (the same format as str.match without flag g).
We can use for..of to loop over matchAll matches.

let matchAll = '<h1>Hello, world!</h1>'.matchAll(/<(.*?)>/g)
matchAll // [object RegExp String Iterator], not an array, but an iterable
matchAll = Array.from(matchAll)
let firstMatch = matchAll[0]
firstMatch[0] // <h1>
firstMatch[1] // h1
firstMatch.index // 0
firstMatch.input // <h1>Hello, world!</h1>

str.split(regexp, limit)

Splits the string using the regexp (or a substring) as a delimiter.

'12-34-56'.split('-') // ['12', '34', '56']
'12, 34, 56'.split(/,\s*/) // ['12', '34', '56']

str.search(regexp)

Returns the position of the first match, or -1 if none is found.
Searches only until the first match.
If we need the positions of all matches, use str.matchAll(regexp).

"A drop of ink may make a million think".search(/ink/i) // 10

str.replace(str | regexp, str | func)

Method for searching and replacing.
When the first argument is a string, it replaces the first match only.

'12-34-56'.replace("-", ":") // 12:34-56
'12-34-56'.replace( /-/g, ":" ) // 12:34:56
"We will, we will".replace(/we/i, "I") // I will, we will // no flag g
"We will, we will".replace(/we/ig, "I") // I will, I will // with flag g

The replacement string (2nd argument) accepts special character sequences:

$&         inserts the whole match
$`         inserts the part of the string before the match
$'         inserts the part of the string after the match
$n         if n is a 1-2 digit number, inserts the contents of the n-th capturing group
$<name>    inserts the contents of the parentheses with the given name
$$         inserts the character $

// swap first and last name
"John Smith".replace(/(john) (smith)/i, '$2, $1') // Smith, John

The second argument can be a function for smart replacement.
It is called for each match, and the returned value is inserted as the replacement.
func(match, p1, p2, ..., pn, offset, input, groups)
Read more about the function at the original source.

// let’s uppercase all matches
let str = "html and css";
let result =
  str.replace(/html|css/gi, match => match.toUpperCase())
result // HTML and CSS

// Replace each match by its position in the string
"Ho-Ho-ho".replace(/ho/gi, (match, offset) => offset) // 0-3-6

str.replaceAll(str | regexp, str | func)

Same as str.replace(), with two major differences:
If the first argument is a string, it replaces all occurrences.
If the first argument is a regular expression without flag 'g', there’ll be an error. With flag 'g', it works the same as str.replace().
The main use case is replacing all occurrences of a string.

'12-34-56'.replaceAll("-", ":") // 12:34:56

regexp.exec(str)

Returns a match for regexp in the string.
It’s called on a regexp, not on a string.
Behaves differently depending on whether the regexp has flag 'g'.
If there’s no 'g', it returns the first match, same as str.match(regexp).
If there’s flag 'g', it returns the first match and saves the position immediately after it in the regexp.lastIndex property.
The next call starts the search from position regexp.lastIndex, returns the next match and saves the position after it in regexp.lastIndex.
And so on...
If there are no matches, regexp.exec() returns null and resets regexp.lastIndex to 0.
We can use regexp.exec to search from a given position by manually setting lastIndex.
If the regexp has flag 'y', the search is performed exactly at position regexp.lastIndex.
That’s convenient when we need to “read” something from the string by a regexp at an exact position.

let str = 'More about JavaScript at https://javascript.info'
let regexp = /javascript/ig
let result
while (result = regexp.exec(str)) {
  alert( `Found ${result[0]} at position ${result.index}` )
  // Found JavaScript at position 11, then
  // Found javascript at position 33
}

Search from a given position

let str = 'Hello, world!'
let regexp = /\w+/g // w/o flag "g", lastIndex property is ignored
regexp.lastIndex = 5 // search from 5th position (from the comma)
regexp.exec(str) // world

// or
let str = 'Hello, world!'
let regexp = /\w+/y
regexp.lastIndex = 5 // search exactly at position 5
regexp.exec(str) // null

regexp.test(str)

Looks for a match and returns true or false depending on whether it exists.
If the regexp has flag 'g', it searches from the regexp.lastIndex property and updates this property, just like regexp.exec().
If we apply the same global regexp to different inputs, it may lead to a wrong result, so it is recommended to set regexp.lastIndex = 0 before each search.

let str = "I love JavaScript"
/love/i.test(str) // true
// same as
str.search(/love/i) != -1 // true

Search from a given position

let regexp = /love/gi
let str = "I love JavaScript"
// start the search from position 10
regexp.lastIndex = 10
regexp.test(str) // false (no match)

Useful

HTML and text between tags

const bodyPattern = /<body[^>]*>((.|[\n\r])*)<\/body>/im
const textContentPattern = /<[^>]*(>|$)|&nbsp;|&quot;|&#39;|&zwnj;|&raquo;|&laquo;|&gt;/g
const html = '<html><head><title>NewTab</title></head><body><span>Hi</span><span>Bye</span></body></html>'
const body = html.match(bodyPattern)[1] // <span>Hi</span><span>Bye</span> (group 1: the contents of <body>)
const text = body.replace(textContentPattern, '') // HiBye

Text between 2 strings

(?<=beginsWith)(.*)(?=endsWith)
levis can be selected from Television by (?<=Te)(.*)(?=ion)

Text between 2 strings including them (greedy mode)

beginsWith(.*)endsWith
Television can be selected from Television by Te(.*)on

Text between 2 strings including them (lazy mode)

beginsWith(.*?)endsWith

Includes one substring and excludes another substring

^(?=.*(includeSubString1|includeSubString2))(?!.*excludeSubString1)(?!.*excludeSubString2).*
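
The "text between 2 strings" and "include/exclude" recipes above use fixed placeholder words. When the delimiters come from variables, they have to be escaped before being passed to new RegExp. The following is a minimal sketch of that idea: the helper names escapeRegExp, textBetween and includesButExcludes are hypothetical, not part of any library, and the sample inputs are made up for illustration.

```javascript
// Hypothetical helper: escape regexp special characters in a literal string.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: lazily capture the text between two literal delimiters.
// Built on the (?<=beginsWith)(.*?)(?=endsWith) recipe.
function textBetween(input, begin, end) {
  const pattern = new RegExp(
    `(?<=${escapeRegExp(begin)})(.*?)(?=${escapeRegExp(end)})`
  );
  const match = input.match(pattern);
  return match ? match[1] : null;
}

// Hypothetical helper: true if `line` contains at least one of `include`
// and none of `exclude`. Built on the ^(?=.*(a|b))(?!.*c).* recipe.
function includesButExcludes(line, include, exclude) {
  const included = include.map(escapeRegExp).join("|");
  const excluded = exclude.map((s) => `(?!.*${escapeRegExp(s)})`).join("");
  return new RegExp(`^(?=.*(${included}))${excluded}.*`).test(line);
}

console.log(textBetween("Television", "Te", "ion")); // levis
console.log(textBetween("price: $30 (net)", "$", " (")); // 30
console.log(includesButExcludes("error: disk full", ["error", "warn"], ["debug"])); // true
console.log(includesButExcludes("debug error trace", ["error"], ["debug"])); // false
```

Both helpers assume single-line input, because the dot does not match \n without flag 's'. The regexps are created without flag 'g', so lastIndex plays no role and repeated calls stay independent.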