Data Cleaning [03]: String Manipulation

8 minute read

Published: February 06, 2022

Simple string operations can be handled with Python's built-in string methods. For more complex patterns, regular expressions are a powerful tool.

String Built-In Methods

count()
endswith(), startswith()
index(), find(), rfind()
replace()
strip(), rstrip(), lstrip()
split()
lower()
upper()
casefold(): a more aggressive version of lower(), intended for caseless comparison (for example, 'ß' becomes 'ss')
rjust(), ljust()
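As a quick illustration of the methods listed above, here is a minimal sketch. The sample string `raw` and all variable names are made up for this example.

```python
# Hypothetical sample value, chosen only to exercise the methods above.
raw = "  Data Cleaning: strings, STRINGS, and more strings  "

cleaned = raw.strip()                          # drop leading/trailing whitespace
n_strings = cleaned.lower().count("strings")   # case-insensitive count via lower()
starts = cleaned.startswith("Data")
ends = cleaned.endswith("strings")

first = cleaned.find("strings")     # index of the first occurrence
last = cleaned.rfind("strings")     # index of the last occurrence
missing = cleaned.find("regex")     # find() returns -1 when nothing is found ...

# ... whereas index() raises ValueError in the same situation.
index_raised = False
try:
    cleaned.index("regex")
except ValueError:
    index_raised = True

parts = cleaned.split(", ")                     # split on a literal separator
replaced = cleaned.replace("strings", "text")   # only the lowercase occurrences change
padded = "42".rjust(5, "0")                     # left-pad with zeros to width 5
folded = "Straße".casefold()                    # casefold() maps 'ß' to 'ss'; lower() does not
```

Note the difference between find() and index(): both locate a substring, but only index() raises an exception when the substring is absent.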

Regular Expressions

Special Characters

.: In the default mode, this matches any character except a newline.
^: Matches the start of the string.
$: Matches the end of the string or just before the newline at the end of the string.
*: Causes the resulting RE to match 0 or more repetitions of the preceding RE
+: Causes the resulting RE to match 1 or more repetitions of the preceding RE.
?: Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
*?, +?, ??: Non-greedy versions of *, + and ?. They match as few characters as possible instead of as many as possible.
- re.search('<.*>', '<a> b <c>') -> <re.Match object; span=(0, 9), match='<a> b <c>'>
- re.search('<.*?>', '<a> b <c>') -> <re.Match object; span=(0, 3), match='<a>'>
{m}: Specifies that exactly m copies of the previous RE should be matched
{m,n}: Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound.
{m,n}?: Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible.
[]: Used to indicate a set of characters.
- Characters can be listed individually, e.g. [abc] will match a, b or c.
- Ranges of characters can be indicated by giving two characters and separating them by a -. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [-a] or [a-]), it will match a literal -.
- Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters (, +, *, or ).
- If the first character of the set is ^, all the characters that are not in the set will be matched.
- To match a literal ] inside a set, precede it with a backslash \, or place it at the beginning of the set.
|: A|B creates a regular expression that will match either A or B.
(...): Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.
(?...): The first character after the ? determines what the meaning and further syntax of the construct is.
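- (?aiLmsux): One or more letters from the set a, i, L, m, s, u, x. The group matches the empty string; the letters set the corresponding flags for the entire regular expression: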
  - re.A (ASCII-only matching)
  - re.I (ignore case)
  - re.L (locale dependent)
  - re.M (multi-line)
  - re.S (dot matches all)
  - re.U (Unicode matching)
  - re.X (verbose)
- (?:...): A non-capturing version of regular parentheses.
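- (?aiLmsux-imsx:...): Zero or more letters from the set a, i, L, m, s, u, x, optionally followed by - followed by one or more letters from i, m, s, x. The letters set or remove the corresponding flags for the part of the expression inside the parentheses.
- (?P<name>...): Similar to regular parentheses, but the substring matched by the group is also accessible via the symbolic group name.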
- (?P=name): A backreference to a named group.
- (?#...): A comment; the contents of the parentheses are simply ignored.
- (?=...): Lookahead assertion. Matches if ... matches next, but doesn’t consume any of the string.
- (?!...): Negative lookahead assertion. Matches if ... doesn’t match next.
- (?<=...): Matches if the current position in the string is preceded by a match for ... that ends at the current position.
- (?<!...): Matches if the current position in the string is not preceded by a match for ....
- (?(id/name)yes-pattern|no-pattern): Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn’t.
  - (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$) will match <user@host.com> or user@host.com.
\: Either escapes special characters, or signals a special sequence.
- \number: Matches the contents of the group of the same number. Groups are numbered starting from 1.
- \A: Matches only at the start of the string.
- \b: Matches the empty string, but only at the beginning or end of a word.
- \B: Matches the empty string, but only when it is not at the beginning or end of a word.
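- \d: Matches any Unicode decimal digit. This includes [0-9] and many other digit characters; with the re.ASCII flag, only [0-9] is matched.
- \D: Matches any character which is not a decimal digit. With the re.ASCII flag, this is equivalent to [^0-9].
- \s: Matches Unicode whitespace characters. This includes [ \t\n\r\f\v] and other characters such as non-breaking spaces; with the re.ASCII flag, only [ \t\n\r\f\v] is matched.
- \S: Matches any character which is not a whitespace character. With the re.ASCII flag, this is equivalent to [^ \t\n\r\f\v].
- \w: Matches Unicode word characters: most characters that can be part of a word in any language, plus digits and the underscore. With the re.ASCII flag, only [a-zA-Z0-9_] is matched.
- \W: Matches any character which is not a word character. With the re.ASCII flag, this is equivalent to [^a-zA-Z0-9_].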
- \Z: Matches only at the end of the string.
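A few of the constructs above are easier to see in action. The following is a minimal sketch using only the standard re module. The input strings are invented for illustration, except the conditional pattern, which is the one quoted in the list.

```python
import re

# Greedy vs. non-greedy: findall() lists every non-overlapping match.
greedy = re.findall(r'<.*>', '<a> b <c>')
lazy = re.findall(r'<.*?>', '<a> b <c>')

# Named group plus a backreference to it: find a doubled word.
doubled = re.search(r'\b(?P<word>\w+) (?P=word)\b', 'it is is fine')
doubled_word = doubled.group('word')

# Lookbehind and lookahead check the context without consuming it.
sample = 'cost $15, weight 20kg, fee $7'
prices = re.findall(r'(?<=\$)\d+', sample)   # digits preceded by '$'
kilos = re.findall(r'\d+(?=kg)', sample)     # digits followed by 'kg'

# \d is Unicode-aware for str patterns; re.ASCII restricts it to [0-9].
arabic_indic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE
unicode_hit = re.fullmatch(r'\d', arabic_indic_three) is not None
ascii_hit = re.fullmatch(r'\d', arabic_indic_three, flags=re.ASCII) is not None

# Conditional pattern: require '>' only if the optional '<' was matched.
cond = r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)'
bracketed = re.match(cond, '<user@host.com>') is not None
bare = re.match(cond, 'user@host.com') is not None
unbalanced = re.match(cond, '<user@host.com') is not None
```

Here greedy holds one long match while lazy holds the two short tags, and unicode_hit is True while ascii_hit is False, which is why \d is not strictly the same as [0-9].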

Pattern Matching

match()

re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.

re.fullmatch(pattern, string, flags=0)

If the whole string matches the regular expression pattern, return a corresponding match object.

re.match(r'dog', 'doggie')
Out[245]: <re.Match object; span=(0, 3), match='dog'>

re.fullmatch(r'dog', 'doggie')
# Returns None, because 'dog' does not cover the whole string 'doggie'

search()

re.search(pattern, string, flags=0)

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object.
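The difference between match() and search() is easiest to see on a string where the pattern does not occur at the very beginning. A small sketch, with a made-up input string:

```python
import re

text = 'hotdog stand'  # hypothetical input

at_start = re.match(r'dog', text)    # None: 'dog' is not at position 0
anywhere = re.search(r'dog', text)   # finds the first occurrence anywhere

# A match object exposes the matched text and its position.
found = anywhere.group(0)
span = anywhere.span()

# Anchoring the pattern with ^ makes search() behave like match() here.
anchored = re.search(r'^dog', text)
```

Both functions return None when there is no match, so the result should be checked before calling group() or span() on it.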

findall() and finditer()

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings or tuples.

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups.

re.findall(r'\bf[a-z]*', 'which foot or hand fell fastest')
Out[240]: ['foot', 'fell', 'fastest']

re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
Out[241]: [('width', '20'), ('height', '10')]

re.finditer(pattern, string, flags=0)

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string.
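Because finditer() yields match objects rather than plain strings, it is handy when the positions or named groups are needed as well. A minimal sketch reusing the width/height string from the findall() example; the group names key and value are chosen for this illustration:

```python
import re

text = 'set width=20 and height=10'

settings = {}
positions = []
for m in re.finditer(r'(?P<key>\w+)=(?P<value>\d+)', text):
    # Each m is a match object, so groups and spans are both available.
    settings[m.group('key')] = int(m.group('value'))
    positions.append(m.span())
```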

Substitution

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.

re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
Out[250]: 'Baked Beans & Spam'

re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):', 
       r'static PyObject*\npy_\1(void)\n{', # \1 would be 'myfunc'
       'def myfunc():')
Out[247]: 'static PyObject*\npy_myfunc(void)\n{'

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:

def dashrepl(matchobj):
    # A single dash becomes a space; a double dash becomes a single dash.
    if matchobj.group(0) == '-':
        return ' '
    return '-'
   
re.sub('-{1,2}', dashrepl, 'pro----gram-files')
Out[249]: 'pro--gram files'

re.subn(pattern, repl, string, count=0, flags=0)

Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).
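A short sketch of subn() and the count argument, assuming a made-up string with irregular spacing:

```python
import re

messy = 'a  b   c d'  # hypothetical input with runs of spaces

# Collapse every run of two or more whitespace characters into one space.
collapsed, n_subs = re.subn(r'\s{2,}', ' ', messy)

# count limits how many replacements are made (0 means no limit).
limited, n_limited = re.subn(r'\s{2,}', ' ', messy, count=1)
```

The returned count is useful in data cleaning for reporting how many values were actually changed.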

Splitting

re.split(pattern, string, maxsplit=0, flags=0)

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

re.split(r'\W+', 'Words, words, words.')
Out[236]: ['Words', 'words', 'words', '']

re.split(r'(\W+)', 'Words, words, words.')
Out[237]: ['Words', ', ', 'words', ', ', 'words', '.', '']

re.split(r'\s+', "foo bar\t baz \tqux")
Out[239]: ['foo', 'bar', 'baz', 'qux']
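Two details of re.split() that the examples above do not show are the maxsplit argument and the empty strings produced when a separator sits at the start or end of the string. A minimal sketch with invented inputs:

```python
import re

line = 'key: value: with: colons'  # hypothetical input

# maxsplit=1 splits only at the first separator; the rest stays intact.
head_tail = re.split(r':\s*', line, maxsplit=1)

# A separator at the start or end of the string yields empty strings.
edges = re.split(r'\W+', '...words, words...')
```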

Vectorized String Functions

Using Series.map(), we can apply string and regular expression methods to each value in a Series, but the call fails on NA values. To cope with this, Series provides array-oriented string methods that skip NA values. They are accessed through Series.str; the pandas documentation has the complete list of methods. Here we look at some examples.
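To make the NA problem concrete, here is a minimal sketch. It assumes an object-dtype Series with np.nan as the missing value; the animal names are only sample data.

```python
import numpy as np
import pandas as pd

animals = pd.Series(['Lion', 'Monkey', np.nan], dtype=object)

# Calling a str method on every element breaks on the float NaN.
try:
    animals.map(lambda s: s.upper())
    map_failed = False
except (AttributeError, TypeError):
    map_failed = True

# The vectorized accessor skips the missing value and keeps it as NaN.
upper = animals.str.upper()
```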

slicing

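import numpy as np
import pandas as pd

NA = np.nan  # missing-value marker used in the examples below
data = pd.Series(['Lion', 'Monkey', 'Rabbit', NA])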

data.str[:5]
Out[325]: 
0     Lion
1    Monke
2    Rabbi
3      NaN
dtype: object

findall()

data.str.findall(r'on')
Out[314]: 
0    [on]
1    [on]
2      []
3     NaN
dtype: object

data.str.findall(r'on$')
Out[315]: 
0    [on]
1      []
2      []
3     NaN
dtype: object

replace()

data.str.replace(r'on','no')
Out[316]: 
0      Lino
1    Mnokey
2    Rabbit
3       NaN
dtype: object

split()

data = pd.Series(['Lion is carnivore', 'Monkey is omnivore', 'Rabbit is herbivore', NA])

data.str.split()
Out[319]: 
0      [Lion, is, carnivore]
1     [Monkey, is, omnivore]
2    [Rabbit, is, herbivore]
3                        NaN
dtype: object

match()

data.str.match(r'[H-N]')
Out[324]: 
0     True
1     True
2    False
3      NaN
dtype: object
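The NaN in the result above means it cannot be used directly as a boolean mask. One way around this, sketched below, is the na argument of Series.str.match(), which sets the result for missing values. The data repeats the sentences from the split() example, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series(['Lion is carnivore', 'Monkey is omnivore',
                  'Rabbit is herbivore', np.nan], dtype=object)

# na=False makes missing values count as "no match" instead of NaN.
mask = data.str.match(r'[H-N]', na=False)

# The mask is now purely boolean and can be used to filter rows.
selected = data[mask]
```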


Chao Huang