As the strings may have to be concatenated, I have firstly joining all the elements to single string using ##
to separate the text which have been joined.
I am going for all regex because there would be lot of conditions to check otherwise.
re_req = re.compile(r'##(?P<Short_ref>BAT.A.d{3})'
r'##(?P<Full_Reg_ref>Regulationsd{4}-d{4})'
r'##(?P<Reg_text>.*?MOC to 1|.*?(?=##BAT.A.d{3})(?!1))'
r'(?:##)?(?:(?P<Moc_text>.*?MOC2 to 1)(?P<MOC2>(?:##)?.*?(?=##BAT.A.d{3})(?!1)|.+)'
r'|(?P<Moc_text_temp>.*?(?=##BAT.A.d{3})(?!1)))')
final_list = []
for match in re_req.finditer("##" + "##".join(data)):
inner_list = [match.group('Short_ref').replace("##", " "),
match.group('Full_Reg_ref').replace("##", " "),
match.group('Reg_text').replace("##", " ")]
if match.group('Moc_text_temp'): # just Moc_text is present
inner_list += [match.group('Moc_text_temp').replace("##", " "), ""]
elif match.group('Moc_text') and match.group('MOC2'): # both Mock_text and MOC2 is present
inner_list += [match.group('Moc_text').replace("##", " "), match.group('MOC2').replace("##", " ")]
else: # neither Moc_text nor MOC2 is present
inner_list += ["", ""]
final_list.append(inner_list)
final_df = pd.DataFrame(final_list, columns=['Short_ref', 'Full_Reg_ref', 'Reg_text', 'Moc_text', 'MOC2'])
First and second line of regex is same as which you posted earlier and identifies the first two columns.
In third line of regex, r'##(?P<Reg_text>.*?MOC to 1|.*?(?=##BAT.A.d{3})(?!1))'
- matches all text till MOC to Short_ref or matches all the text before the next Reg_text. (?=##BAT.A.d{3})(?!1)
part is to taking the text upto Short_ref
pattern and if the Short_ref
is not the current Reg_text.
Fourth line is for when Moc_text
and MOC2
both is present and it is or
with fifth line for the case when just Moc_text
is present. This part of the regex is similar to the third line.
Last looping over all the matches using finditer
and constructing the rows of the dataframe
final_df
:
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…