Regex and Extract text

3am · Feb 19, 2023

Hello, I'm trying to extract a single value from a web page and can't get it to work. The page is https://cs.hory.app/mountain/137690-dvojce-polomu and the data I'm after is <div class="col-6 col-md-3" title="Celkový počet výstupů"><img src ="/icons/map/mountain-visited-80.png">9</div> (specifically the number 9, which changes over time).
I used the HTTP GET action and Text Manipulation (Extract text) with this pattern:

<div class="col-6 col-md-3" title="Celkový počet výstupů"><img src="/icons/map/mountain-visited-80.png">(\d+)</div>
... but the macro does not work; the text extraction test returns nothing.

Thank you very much for your help

FrameXX · Feb 19, 2023

Hello, Czech support here. First, it looks like you are using the deprecated HTTP GET action instead of the newer HTTP Request action. It doesn't matter much, but I recommend using the newer one. Second, I tried sending an HTTP request from MacroDroid, but instead of the source code of the requested page I got this back:

HTML:

<h1>Redirect</h1>

<p><a href="https://cs.hory.app/mountain/137690-dvojce-polomu">Please click here to continue</a>.</p>

I guess this is where we get stuck. You won't get the ascent count out of this. The link it redirects to is, by the way, exactly the same as the one I used for the request. I don't know whether you get the same response and just didn't notice, or whether it only happens to me. The HTTP Request action has a "Continue via redirect" option, but turning it on didn't help either. It looks like the Horobraní site either has some protection against bots reading its data, or this is just an oversight. At the moment I have no idea how to get past this, if it is possible at all.

3am · Feb 20, 2023

Hello, yes, I know it is deprecated, but I also tried the HTTP Request action with the same result. I have no idea what kind of protection the Horobraní website uses; either way, it's a pity. Thank you anyway for the reply and the information.

Endercraft · Feb 20, 2023

Maybe it does this when it does not recognize the requesting app as a browser?
I got these headers from the browser developer tools, in case they help:

Origin: https://cs.hory.app
Referer: https://cs.hory.app/
sec-ch-ua: "Not_A Brand";v="99", "Google Chrome";v="109", "Chromium";v="109"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
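
One way to test this idea outside MacroDroid is to send the request with browser-like headers. The sketch below uses only Python's standard library, which is not necessarily what anyone in this thread ran. `build_request` and `fetch_page` are hypothetical helper names, and nothing here confirms that the site will accept these headers and return the real page.

```python
import urllib.request

URL = "https://cs.hory.app/mountain/137690-dvojce-polomu"

# Header values copied from the developer-tools capture in this post.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    ),
    "Referer": "https://cs.hory.app/",
    "Origin": "https://cs.hory.app",
}


def build_request(url=URL):
    """Create a GET request that carries the browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch_page(url=URL, timeout=10):
    """Send the request and return the body as text (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

If `fetch_page()` still returns the short "Redirect" page, the User-Agent alone is probably not what the site checks.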

Dimlos · Feb 20, 2023

I managed to do some scraping using Python with Termux.

<img src="/icons/map/mountain-visited-80.png">9</div>

I need a regular expression that matches only 9.
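
A pattern with a single capture group can pull out just the digits. This is a minimal sketch in Python, assuming the fragment always has the form shown above. `extract_count` is a hypothetical helper name. Anchoring on the icon file name rather than the whole `<div>` avoids depending on the exact spacing of the attributes.

```python
import re

# Match the icon file name, the end of the <img> tag, then capture the digits
# that sit directly before the closing </div>.
COUNT_PATTERN = re.compile(r'mountain-visited-80\.png"\s*>\s*(\d+)\s*</div>')


def extract_count(html):
    """Return the ascent count as an int, or None if the pattern is absent."""
    match = COUNT_PATTERN.search(html)
    return int(match.group(1)) if match else None


fragment = '<img src="/icons/map/mountain-visited-80.png">9</div>'
print(extract_count(fragment))
```

The same pattern text can be tried in a regex-based extract action; only group 1 holds the number.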

Dimlos · Feb 20, 2023

The python script looks like this, although you have to install a lot of stuff to get it to work.

Dimlos · Feb 20, 2023

Specifically, in addition to MacroDroid, I need to install Termux, Termux:Tasker, Python on Termux, Requests and BeautifulSoup.
Can anyone come up with a regular expression? Please help.

Dimlos · Feb 20, 2023

I've done my best to narrow it down to this point, now I just need to extract 9...

Endercraft · Feb 20, 2023

Dimlos said:
I've done my best to narrow it down to this point, now I just need to extract 9...

All that work just to extract a number! You're really dedicated to finding a solution.

Dimlos · Feb 21, 2023

I managed to complete the macro.
I will upload the macro and script when I have time.

@Endercraft
I am being provided with material that I would not have thought of on my own.
I enjoy doing this and it helps me improve my skills.

AkiraMenai · Feb 21, 2023

Maybe it's just me, but why not split to an array using ">" as a delimiter, take the last entry, and split again using "<" as the delimiter, or just replace </div> with nothing?
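
The split idea can be sketched in Python as follows, assuming the input is already narrowed down to the short fragment quoted earlier in the thread. `extract_by_split` is a hypothetical helper name. Because the fragment ends with ">", a plain split on ">" in Python leaves an empty last entry, so this sketch removes `</div>` first and then takes whatever follows the last remaining ">".

```python
def extract_by_split(fragment):
    """Extract the trailing value from '<img ...>9</div>' without a regex."""
    # Drop the closing tag so the text no longer ends with ">".
    trimmed = fragment.strip().replace("</div>", "")
    # The value is whatever follows the last remaining ">".
    return trimmed.split(">")[-1].strip()


fragment = '<img src="/icons/map/mountain-visited-80.png">9</div>'
print(extract_by_split(fragment))
```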

Dimlos · Feb 21, 2023

I see, but there are many unnecessary control characters in the text before extraction, so I cleaned it up once and extracted it with regular expressions.
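
A clean-then-extract approach like the one described might look like this in Python. It is only a sketch and not the actual script: `clean` and `extract_after_cleaning` are hypothetical names, and it assumes the unwanted characters are ASCII control characters such as tabs, carriage returns and newlines.

```python
import re

# ASCII control characters (tabs, newlines, carriage returns, NUL, DEL, ...).
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def clean(text):
    """Remove control characters so the markup sits on one line."""
    return CONTROL_CHARS.sub("", text)


def extract_after_cleaning(text):
    """Return the first run of digits that sits between '>' and '<'."""
    match = re.search(r">\s*(\d+)\s*<", clean(text))
    return match.group(1) if match else None


raw = '<img src="/icons/map/mountain-visited-80.png">\r\n\t9\r\n</div>'
print(extract_after_cleaning(raw))
```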

Dimlos · Feb 21, 2023

I don't think it is easy to build a Python runtime environment, so I have disabled the part that GETs.
You can try the part that extracts 9 from text partially extracted from the site.

Attachment: scraping.py (hosted on MediaFire, www.mediafire.com)

Dimlos · Feb 21, 2023

I didn't think anyone would download it, but a few people have downloaded the Python script.
I can't explain every step of the setup, but if you get stuck or have questions, please write to me.
