Need help to extract Title and Text from Reddit Post or Google News using regex (text manipulation)

AashirShaikh · May 21, 2022

I need help extracting the title and text from a Reddit post or a Google News item using regex (text manipulation).

Here are the links:
Reddit: https://www.reddit.com/r/kpics.rss
Google News: https://news.google.com/news/feeds?um=1&ned=us&hl=en&q=albert+einstein&output=rss

RSF · May 22, 2022

Do you need to know how to parse through the feeds to get individual news entries one at a time, and/or how to extract title and text from an individual entry? If the latter:
Reddit

Title: <title>(.*?)</title>
Text: <content.*?>(.*?)</content>

Google News

Title: (Same as reddit)
Text: <description>(.*?)</description>

All of these regexes need the "Group 1" option checked, so that only the text between the tags is returned.
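For anyone trying these patterns outside the automation tool, here is a minimal Python sketch of the same idea. The sample strings are made up for illustration (real feed entries are much longer), and the helper name first_group is hypothetical. Taking group(1) of the match plays the role of the "Group 1" option.

```python
import re

# Hypothetical, heavily trimmed samples; real feed entries contain far more markup.
REDDIT_ENTRY = (
    '<entry><content type="html">&lt;p&gt;Post body&lt;/p&gt;</content>'
    "<title>Sample Reddit post</title></entry>"
)
GOOGLE_ITEM = (
    "<item><title>Sample headline</title>"
    "<description>Sample summary</description></item>"
)

# The patterns from the post above. DOTALL lets "." span line breaks inside a tag body.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)
REDDIT_TEXT_RE = re.compile(r"<content.*?>(.*?)</content>", re.DOTALL)
GOOGLE_TEXT_RE = re.compile(r"<description>(.*?)</description>", re.DOTALL)


def first_group(pattern, text):
    """Return capture group 1 of the first match, or "" if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else ""


reddit_title = first_group(TITLE_RE, REDDIT_ENTRY)
reddit_text = first_group(REDDIT_TEXT_RE, REDDIT_ENTRY)
google_title = first_group(TITLE_RE, GOOGLE_ITEM)
google_text = first_group(GOOGLE_TEXT_RE, GOOGLE_ITEM)

print(reddit_title, "|", reddit_text)
print(google_title, "|", google_text)
```

Note that these patterns should be run on a single entry, not the whole feed, because the feed itself also has a <title> element ahead of the first entry.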

AashirShaikh · May 28, 2022

Yes, this is what I have been looking for.

But I would also like to get the individual entries.
For example, could there be an index number for the entries?
Index 1 = 1st post
Index 2 = 2nd post
Index 3 = 3rd post

Something like that?

RSF · May 28, 2022

Set up a local variable called entries_to_skip (type=integer). Then you can use a regular expression like so, to get the desired entry from the Google feed:
(?:<item>[[:ascii:]]+?<\/item>\n?){[lv=entries_to_skip]}(<item>[[:ascii:]]+?<\/item>)
Set entries_to_skip to 0, run the extract action, and process the extracted article; then set it to 1, run the action again, process the result, and continue with 2, 3, 4, etc. The extracted entry will be empty when you've run out of entries.

Use the same scheme for the Reddit feed, but with <entry> and <\/entry> instead of <item> and <\/item>.
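The skip-and-capture scheme can also be sketched in Python, in case it helps to see the loop spelled out. This is only an illustration under a few assumptions: Python's re module has no [[:ascii:]] class, so [\x00-\x7F] stands in for it; the function name nth_entry and the tiny sample feed are made up; and the [lv=entries_to_skip] placeholder becomes an ordinary function argument.

```python
import re

# Hypothetical three-item feed, trimmed to the bare structure.
FEED = (
    "<channel><title>Feed title</title>\n"
    "<item><title>First</title></item>\n"
    "<item><title>Second</title></item>\n"
    "<item><title>Third</title></item>\n"
    "</channel>"
)


def nth_entry(feed, entries_to_skip, tag="item"):
    """Skip `entries_to_skip` entries, then return the next one ("" if none is left)."""
    # One whole entry, matched lazily; [\x00-\x7F] approximates [[:ascii:]].
    block = rf"<{tag}>[\x00-\x7F]+?</{tag}>"
    # (?:entry\n?){N} consumes the skipped entries, then group 1 captures the wanted one.
    pattern = rf"(?:{block}\n?){{{entries_to_skip}}}({block})"
    match = re.search(pattern, feed)
    return match.group(1) if match else ""


titles = []
skip = 0
while True:
    entry = nth_entry(FEED, skip)
    if not entry:  # an empty result means there are no more entries
        break
    titles.append(re.search(r"<title>(.*?)</title>", entry).group(1))
    skip += 1

print(titles)
```

For the Reddit feed, the same function would be called with tag="entry".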

Note that both Google's and (especially) Reddit's feeds include a lot of HTML, including escaped HTML (e.g. "&lt;", <font...>, etc.).
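If the escaped HTML gets in the way, one possible cleanup step is sketched below in Python. It assumes the extracted text is HTML that was escaped once more for the feed, so it unescapes, strips tags with a deliberately crude regex, and unescapes again. The sample string and the name plain_text are made up for illustration; a real HTML parser would be more robust than the tag-stripping regex.

```python
import html
import re

# Hypothetical extracted text: HTML that was escaped once more to sit inside the feed XML.
raw = "&lt;p&gt;Einstein &amp;amp; relativity&lt;/p&gt;"


def plain_text(escaped):
    """Turn escaped HTML from a feed entry into plain text (rough sketch)."""
    markup = html.unescape(escaped)            # undo the feed-level escaping
    no_tags = re.sub(r"<[^>]+>", "", markup)   # crude tag removal, not a full HTML parser
    return html.unescape(no_tags).strip()      # resolve entities left in the HTML itself


print(plain_text(raw))
```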

AashirShaikh · May 29, 2022

Yoooo!!! Superb! It works.
Thank you a thousand times, brother!
