Need help to extract Title and Text from Reddit Post or Google News using regex (text manipulation)

RSF

Well-known member
Do you need to know how to parse through the feeds to get individual news entries one at a time, and/or how to extract title and text from an individual entry? If the latter:
Reddit
  • Title: <title>(.*?)</title>
  • Text: <content.*?>(.*?)</content>
Google News
  • Title: (Same as reddit)
  • Text: <description>(.*?)</description>
All of these regexes need the "Group 1" option checked, so that only the text captured inside the parentheses is returned.
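For anyone who wants to try these patterns outside the tool, here is a minimal Python sketch of the same extraction. The two sample strings are made-up stand-ins for a Reddit entry and a Google News item, and first_group is a hypothetical helper name; "Group 1" corresponds to match.group(1). The DOTALL flag is my addition so that the text can span several lines.

```python
import re

# Hypothetical, trimmed-down samples in the shape of the two feeds.
reddit_entry = (
    '<entry><title>Sample post</title>'
    '<content type="html">&lt;p&gt;Hello&lt;/p&gt;</content></entry>'
)
google_item = (
    '<item><title>Sample headline</title>'
    '<description>Short summary</description></item>'
)

def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

# Reddit: title and text
reddit_title = first_group(r"<title>(.*?)</title>", reddit_entry)
reddit_text = first_group(r"<content.*?>(.*?)</content>", reddit_entry)

# Google News: same title pattern, different text pattern
google_title = first_group(r"<title>(.*?)</title>", google_item)
google_text = first_group(r"<description>(.*?)</description>", google_item)
```

Note that on a full feed the title pattern will likely hit the feed's own <title> first, so it is safer to run it on a single entry.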
 

AashirShaikh

New member
Yes, this is what I have been looking for 😍
But I would also like to get the individual entries.
For example, could each entry have an index number?
Index 1 = 1st post
Index 2 = 2nd post
Index 3 = 3rd post

Like that?
 

RSF

Well-known member
Set up a local variable called entries_to_skip (type=integer). Then you can use a regular expression like so, to get the desired entry from the Google feed:
(?:<item>[[:ascii:]]+?<\/item>\n?){[lv=entries_to_skip]}(<item>[[:ascii:]]+?<\/item>)
Set entries_to_skip to 0, run the extract action, and process the extracted article; then set it to 1, run the action again, and process the result; then 2, 3, 4, etc. The extracted entry will be empty once you've run out of entries.

The same scheme works for the Reddit feed, but use <entry> and <\/entry> instead of <item> and <\/item>.
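As a rough illustration of the skip-then-capture idea, here is a Python sketch. It is not the tool's own syntax: the [lv=entries_to_skip] placeholder is replaced by formatting the number into the pattern, and since Python's re module has no [[:ascii:]] class, I swapped in [\x00-\x7F], which matches any ASCII character. nth_entry and sample_feed are made-up names for the example.

```python
import re

def nth_entry(feed, entries_to_skip, tag="item"):
    """Skip `entries_to_skip` entries, then return the next one ("" if none)."""
    pattern = (
        r"(?:<{t}>[\x00-\x7F]+?</{t}>\n?){{{n}}}"  # entries to skip over
        r"(<{t}>[\x00-\x7F]+?</{t}>)"              # the entry we want (group 1)
    ).format(t=tag, n=entries_to_skip)
    match = re.search(pattern, feed)
    return match.group(1) if match else ""

# Hypothetical miniature feed with three items.
sample_feed = (
    "<channel>\n"
    "<item>first</item>\n"
    "<item>second</item>\n"
    "<item>third</item>\n"
    "</channel>"
)

# Walk the feed: 0, 1, 2, ... until the extracted entry comes back empty.
entries = []
index = 0
while True:
    entry = nth_entry(sample_feed, index)
    if not entry:
        break
    entries.append(entry)
    index += 1
```

For the Reddit feed the call would be nth_entry(feed, index, tag="entry"). Keep in mind that the ASCII-only class stops matching if an entry contains non-ASCII characters.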

Note that both Google's and (especially) Reddit's feeds include a lot of HTML, including escaped HTML (e.g. "&lt;", <font...>, etc.), so the extracted text will usually need some cleanup.
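One possible way to deal with that leftover HTML, sketched in Python: unescape the entities first, then strip the tags. to_plain_text and the sample string are hypothetical, and the tag-stripping regex is deliberately crude (a real HTML parser would be more robust).

```python
import html
import re

def to_plain_text(raw):
    """Turn escaped HTML from a feed entry into plain text (crude sketch)."""
    unescaped = html.unescape(raw)                 # "&lt;p&gt;" becomes "<p>"
    no_tags = re.sub(r"<[^>]+>", " ", unescaped)   # replace each tag with a space
    return re.sub(r"\s+", " ", no_tags).strip()    # collapse runs of whitespace

# Hypothetical escaped snippet, similar in shape to what the feeds contain.
sample = '&lt;p&gt;Hello &lt;font color="red"&gt;world&lt;/font&gt;&lt;/p&gt;'
plain = to_plain_text(sample)
```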
 

AashirShaikh

New member
Yoooo!!! Superb! It works.
Thanks a thousand times, brother! 😍
 