
Regex "or" (|) does not work – I do not know what is wrong in my pattern

I have the following strings:

2020-10-2125Chavez and Sons
2020-05-02Bean Inc
NaNRobinson, Mcmahon and Atkins
2020-04-25Hill-Fisher
2020-04-02Nothing and Sons
52457Carpenter and Sons
0Carpenter and Sons
Carpenter and Sons
NoneEconomy and Sons
2020-04-02

I want to split each of them into two groups. These are the patterns I tried:

myRegex = '^([-\d]{0,}|[NnaAOoEe]{0,})(.*)' or '^([0-9]{4}-[0-9]{2}-[0-9]{2,}|[\d]{0,}|[NnaAOoEe]{0,})([\D]{0,})$'

The first group should contain the leading digits and dashes, an exact match of na, nan or none (in any mix of upper and lower case), or the empty string, like this:

[2020-10-2125][Chavez and Sons]
[2020-05-02][Bean Inc]
[NaN][Robinson, Mcmahon and Atkins]
[2020-04-25][Hill-Fisher]
[2020-04-02][Nothing and Sons]
[52457][Carpenter and Sons]
[0][Carpenter and Sons]
[][Carpenter and Sons]
[None][Economy and Sons]
[2020-04-02][]

This would be wrong:

[2020-04-02No][thing and Sons]

What I want is:

[2020-04-02][Nothing and Sons]

How do I write a regex that matches whole words such as 'none' case-insensitively (so that it also recognizes 'None', 'nOne', etc.)?

https://regex101.com/r/HvnZ47/3

Answer

What about the following with re.I:

(None|NaN?|[-\d]+)?(.*)

https://regex101.com/r/d4XPPb/3

Explanation:

  • (None|NaN?|[-\d]+)?
    • Either None
    • Or NaN, in which the last N is optional (due to ?), so it also matches NA
    • Or digits and dashes, one or more times
    • The whole group () is optional due to the trailing ?, which means it might not be there
  • (.*) Any characters up to the end of the line
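
As a quick check, the pattern can be run from Python roughly as follows. This is a minimal sketch: split_line is a hypothetical helper name, not something from the question, and it assumes each input is a single line without a trailing newline.

```python
import re

# The pattern from above; re.I makes None, NaN and NA match in any case.
PATTERN = re.compile(r"(None|NaN?|[-\d]+)?(.*)", re.I)

def split_line(line):
    """Return (group1, group2); group1 is '' when the optional prefix is absent."""
    m = PATTERN.fullmatch(line)
    return (m.group(1) or "", m.group(2))

samples = [
    "2020-04-02Nothing and Sons",
    "NaNRobinson, Mcmahon and Atkins",
    "nOneEconomy and Sons",
    "Carpenter and Sons",
    "2020-04-02",
]
for s in samples:
    print("[{}][{}]".format(*split_line(s)))
```

Because None and NaN? are written as alternatives of whole words rather than as a character class, "Nothing and Sons" is no longer split after "No".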

However, there can still be edge cases. Consider the following:

National Geographic
---Test

would be parsed as

[Na][tional Geographic]
[---][Test]
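
One way to tighten the pattern against inputs like these is sketched below. It rests on two assumptions about the data that the question does not state: a keyword (none, nan, na) only counts when it is followed by an upper-case letter or the end of the line, and a numeric prefix must start with a digit. The name split_strict is hypothetical. The scoped flag (?i:...) needs Python 3.6 or later and keeps the look-ahead case sensitive.

```python
import re

# Assumption: keyword must be followed by an upper-case letter or end of line.
# Assumption: a numeric prefix starts with a digit, so "---" is not a prefix.
STRICT = re.compile(r"((?i:none|nan?)(?=[A-Z]|$)|\d[-\d]*)?(.*)")

def split_strict(line):
    """Return (group1, group2); group1 is '' when no prefix is recognized."""
    m = STRICT.fullmatch(line)
    return (m.group(1) or "", m.group(2))

for s in ["National Geographic", "---Test", "NoNeEconomy and Sons",
          "NAEconomy and Sons", "2020-04-02Nothing and Sons"]:
    print("[{}][{}]".format(*split_strict(s)))
```

Names that happen to begin with a keyword followed by a capital letter would still be split, so this narrows the edge cases rather than removing them.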

An alternative:

From here we could keep making the regex more complex, but I think it would be a lot simpler for you to implement custom parsing without regex. Loop over the characters in each line and:

  • If the line starts with a digit (the example code also accepts a leading dash), put all leading digits and dashes into group 1 and the rest into group 2 (i.e. switch groups when you hit any other character)
  • Take the first 4 characters of the line and, if they are "none" when compared case-insensitively, split them off. At the same time ensure that the 5th character is upper case: line[:4].lower() == "none" and line[4].isupper()
  • Similar to the above step, but for NaN and NA:
    • line[:3].lower() == "nan" and line[3].isupper()
    • line[:2].lower() == "na" and line[2].isupper()

The above should produce more accurate results and should be a lot easier to read.

Example code:

with open("/tmp/data") as f:
    lines = f.readlines()
results = []
for line in lines:
    # Remove surrounding spaces and the trailing \n
    line = line.strip()
    if not line:
        # Skip blank lines so that line[0] below cannot fail
        continue
    if line[0].isdigit() or line[0] == "-":
        # Group 1 is the leading run of digits and dashes
        i = 0
        while i < len(line) and (line[i].isdigit() or line[i] == "-"):
            i += 1
        results.append((line[:i], line[i:]))
    # line[4:5] is line[4] when it exists and "" otherwise, so a line that
    # is just "none" falls through to the else branch instead of raising
    elif line[:4].lower() == "none" and line[4:5].isupper():
        results.append((line[:4], line[4:]))
    elif line[:3].lower() == "nan" and line[3:4].isupper():
        results.append((line[:3], line[3:]))
    elif line[:2].lower() == "na" and line[2:3].isupper():
        results.append((line[:2], line[2:]))
    else:
        # Assume group1 is missing! Everything is group2
        results.append((None, line))
for g1, g2 in results:
    print(f"[{g1 or ''}][{g2}]")

Data:

$ cat /tmp/data
2020-10-2125Chavez and Sons
2020-05-02Bean Inc
NaNRobinson, Mcmahon and Atkins
2020-04-25Hill-Fisher
2020-04-02Nothing and Sons
52457Carpenter and Sons
0Carpenter and Sons
Carpenter and Sons
NoneEconomy and Sons
NoNeEconomy and Sons
2020-04-02
NAEconomy and Sons
---Test
National Geographic

Output:

$ python ~/tmp/so.py
[2020-10-2125][Chavez and Sons]
[2020-05-02][Bean Inc]
[NaN][Robinson, Mcmahon and Atkins]
[2020-04-25][Hill-Fisher]
[2020-04-02][Nothing and Sons]
[52457][Carpenter and Sons]
[0][Carpenter and Sons]
[][Carpenter and Sons]
[None][Economy and Sons]
[NoNe][Economy and Sons]
[2020-04-02][]
[NA][Economy and Sons]
[---][Test]
[][National Geographic]