notepad++ - trying to reformat some stuff

开发者 https://www.devze.com 2023-03-14 04:45 Source: web
I have a CSV that basically has rows that look like:

06444|WidgetAdapter 6444|Description:

Here is a description.
Maybe some more.
|0

The text in the third field varies from row to row. I'm trying to replace the newlines inside that field, and only those, with <br>, so the row ends up as

06444|WidgetAdapter 6444|Description: <br>Here is a description.<br>Maybe some more.<br>|0

edit:

I basically need to get rid of all linebreaks so each line is a proper VALUE|VALUE|VALUE|VALUE. Normalize/beautify/clean it.

None of my tools can import this properly (phpMyAdmin chokes, for example): there are line breaks inside fields, unescaped double quotes, and so on.

Example other field:

08681|Book 08681|"Testimonial" - Person

You should buy this.|

Example of another field:

39338|Itemizer||


If you know you have 4 columns, you can easily parse your data. For example, here's a bit of PHP that produces an array with all the data. Each element of that array is itself an array of capture groups: [0] holds the whole match, and [1]-[4] hold the four columns:

$pattern = '/^([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)$/m';
preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

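The pattern is extremely simple: it takes 4 values (runs of non-pipe characters), separated by 3 pipes. Once you have the data, you can easily rebuild it the way you want, for example by using nl2br on the third column. Keep in mind that nl2br only inserts <br /> before each newline and leaves the newline itself in place, so use str_replace or preg_replace instead if the newlines have to go.
Note that you cannot reliably parse the data if the first and last columns can also contain newlines.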

Working example: http://ideone.com/gG0K3
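
For anyone who would rather script this outside PHP, here is a rough Python sketch of the same parse-and-rebuild idea. It is only an illustration under a few assumptions: the function name `normalize` and the sample data are made up for this example, the input uses plain \n line endings, and the first and last columns are restricted to a single line (slightly stricter than the PHP pattern above) so a trailing newline is not swallowed into the last column.

```python
import re

# Four values separated by three pipes, as in the PHP pattern.
# Assumption: columns 1 and 4 never contain newlines, so they exclude \r and \n.
RECORD = re.compile(
    r"^([^|\r\n]*)\|([^|]*)\|([^|]*)\|([^|\r\n]*)$",
    re.MULTILINE,
)
NEWLINE = re.compile(r"\r\n?|\n")


def normalize(data: str) -> list[str]:
    """Return one VALUE|VALUE|VALUE|VALUE string per record (hypothetical helper)."""
    rows = []
    for match in RECORD.finditer(data):
        col1, col2, col3, col4 = match.groups()
        # Only the third column is rewritten; its newlines become <br>.
        col3 = NEWLINE.sub("<br>", col3)
        rows.append("|".join((col1, col2, col3, col4)))
    return rows


sample = (
    "06444|WidgetAdapter 6444|Description:\n"
    "Here is a description.\n"
    "Maybe some more.\n"
    "|0\n"
    "39338|Itemizer||\n"
)
rows = normalize(sample)
for row in rows:
    print(row)
```

Writing the rows back out with "\n".join(rows) should give a file with one record per line, which is what most import tools expect.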


If needed, it is possible to target these newlines directly with a regular expression. The idea is to match only newlines that are followed by exactly one more value, and after that by nothing but complete 4-value lines. In other words, we check that the number of values after the current newline is 1 modulo 4, which tells us we're in the 3rd column:

(?:\r\n?|\n)(?=[^|]*\|[^\n\r|]*\s*(?:^(?:[^|]*\|){3}[^\n\r|]*$\s*)*\Z)

Or, with (some) explanations:

(?:\r\n?|\n)   # Match a newline
(?=            # that is before...
    [^|]*\|[^\n\r|]*\s*               # one more separator and value
    (?:^(?:[^|]*\|){3}[^\n\r|]*$\s*)* # and some lines with 4 values.
    \Z                                # until the end of the string.
)

I couldn't get it to work on Notepad++ (it didn't even match [\r\n]), but it seems to work well on other engines:

  • Rubular (Ruby): http://rubular.com/r/NsbTNg9vCT
  • RegExr (ActionScript): http://regexr.com?2u1iu
  • Regex Hero (.NET): http://regexhero.net/tester/?id=215ac2bb-811b-48dd-8c00-6dcfadfae2f2
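
The lookahead pattern can also be tried from a script. The following Python sketch is not one of the engines tested above, so treat it as an assumption that it behaves the same way; it relies on re.MULTILINE so that ^ and $ match at line boundaries, and on plain \n line endings in the input. The names `FIELD3_NEWLINE` and `join_third_field` and the sample rows are made up for illustration.

```python
import re

# The pattern from the answer, unchanged. MULTILINE makes ^ and $ work per line;
# \Z anchors the lookahead at the very end of the input.
FIELD3_NEWLINE = re.compile(
    r"(?:\r\n?|\n)(?=[^|]*\|[^\n\r|]*\s*(?:^(?:[^|]*\|){3}[^\n\r|]*$\s*)*\Z)",
    re.MULTILINE,
)


def join_third_field(data: str) -> str:
    """Replace newlines inside the third column with <br> (hypothetical helper)."""
    return FIELD3_NEWLINE.sub("<br>", data)


sample = (
    "06444|WidgetAdapter 6444|Description:\n"
    "Here is a description.\n"
    "Maybe some more.\n"
    "|0\n"
    '08681|Book 08681|"Testimonial" - Person\n'
    "You should buy this.|\n"
    "39338|Itemizer||\n"
)
cleaned = join_third_field(sample)
print(cleaned)
```

Because every candidate newline triggers a lookahead that scans to the end of the input, this is likely to be slow on large files; parsing the records first is probably the safer route there.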