Regex and Extract text

3am

New member
Hello, I am trying to extract a single value from a web page and cannot get it to work. The page is https://cs.hory.app/mountain/137690-dvojce-polomu and the data I am looking for is <div class="col-6 col-md-3" title="Celkový počet výstupů"><img src="/icons/map/mountain-visited-80.png">9</div> (specifically the number 9, which changes over time).
I used the HTTP GET action and Text Manipulation (Extract text) with the regular expression below.

Screenshot_20230219_164611_MacroDroid.jpg Screenshot_20230219_164901_MacroDroid.jpg Screenshot_20230219_164554_MacroDroid.jpg
<div class="col-6 col-md-3" title="Celkový počet výstupů"><img src="/icons/map/mountain-visited-80.png">(\d+)</div>
... but the macro does not work; the text extraction test returns nothing.

Thank you very much for your help
 

FrameXX

Well-known member
Hello, Czech support here. First, it looks like you are using the deprecated HTTP GET action instead of the newer HTTP Request action. That is not a big problem, but I recommend using the newer one. Second, I tried sending an HTTP request from MacroDroid, but instead of the source code of the requested page I got this back:

HTML:
<h1>Redirect</h1>

<p><a href="https://cs.hory.app/mountain/137690-dvojce-polomu">Please click here to continue</a>.</p>

I guess this is where we get stuck. You will not get the ascent count out of this response. The link it redirects to is, by the way, exactly the same as the one I requested. I don't know whether you get the same response and just didn't notice, or whether it only happens to me. The HTTP Request action has a "Continue via redirect" option, but turning it on did not help either. It looks like the Horobraní site has some protection against bots reading its data, or is it just a clumsy setup? I currently have no idea how to get past this, if that is possible at all.
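To make the symptom easy to spot, here is a minimal Python sketch that checks whether a response body is the redirect stub quoted above and whether its link points back at the requested URL. The function name redirect_stub_target is hypothetical, and the check assumes the stub always has the same shape as the one in this post.

```python
import re

# Body reported in this post instead of the real page source.
STUB = '''<h1>Redirect</h1>

<p><a href="https://cs.hory.app/mountain/137690-dvojce-polomu">Please click here to continue</a>.</p>'''


def redirect_stub_target(body):
    """Return the link target if body looks like the redirect stub, else None."""
    if "<h1>Redirect</h1>" not in body:
        return None
    match = re.search(r'<a href="([^"]+)">', body)
    return match.group(1) if match else None


requested = "https://cs.hory.app/mountain/137690-dvojce-polomu"
target = redirect_stub_target(STUB)

# True means the stub sends you back to the URL you already asked for.
loops_back = target == requested
print(target, loops_back)
```

If loops_back is true, following the link again is unlikely to help on its own, which matches what was observed here.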
 

3am

New member
Hello, yes, I know it is deprecated, but I also tried the HTTP Request action with the same result. I have no idea what protection the Horobraní website uses; either way, it's a pity. Thank you anyway for the reply and the information.
 

Endercraft

Moderator (& bug finder :D)
Maybe it does this when it does not recognize the requesting app as a browser?
I got these request headers from the browser's developer tools, in case they help:
Origin: https://cs.hory.app
Referer: https://cs.hory.app/
sec-ch-ua: "Not_A Brand";v="99", "Google Chrome";v="109", "Chromium";v="109"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
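As a way to test this idea outside MacroDroid, the sketch below builds a GET request that carries some of those headers, using only the Python standard library. The helper names build_request and fetch are hypothetical, and whether the site then returns the real page instead of the redirect stub is an assumption that has not been verified here.

```python
import urllib.request

URL = "https://cs.hory.app/mountain/137690-dvojce-polomu"

# Header values copied from the developer-tools dump above.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/109.0.0.0 Safari/537.36"),
    "Referer": "https://cs.hory.app/",
    "Origin": "https://cs.hory.app",
}


def build_request(url):
    """Build (but do not send) a GET request with browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch(url, timeout=10):
    """Send the request and return the decoded body. Needs network access."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


req = build_request(URL)
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Calling fetch(URL) sends the request. In MacroDroid, the equivalent would be setting the same header names and values on the HTTP Request action.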
 

Dimlos

Well-known member
I managed to do some scraping using Python with Termux.

<img src="/icons/map/mountain-visited-80.png">9</div>

I need a regular expression that matches only 9.
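Two patterns that could do this, sketched in Python against the exact fragment above. The first uses a capture group, so the digits end up in group 1. The second uses lookbehind and lookahead, so the whole match is only the digits. Whether a given tool supports lookarounds depends on its regex engine, so treat the second one as an option to try rather than a guaranteed fix.

```python
import re

html = '<img src="/icons/map/mountain-visited-80.png">9</div>'

# Option 1: capture group; the digits are in group 1.
with_group = re.search(r'mountain-visited-80\.png">\s*(\d+)\s*</div>', html)
count = with_group.group(1)

# Option 2: lookbehind/lookahead; the whole match is only the digits.
digits_only = re.search(r'(?<=mountain-visited-80\.png">)\d+(?=</div>)', html)

print(count, digits_only.group(0))
```

Both patterns use \d+ rather than a literal 9, so they keep working when the count changes.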
 

Attachments

  • output.jpg (372.2 KB)

Dimlos

Well-known member
The Python script looks like this, although you have to install quite a few things to get it to work.
 

Attachments

  • scraping.jpg (341.4 KB)

Dimlos

Well-known member
Specifically, in addition to MacroDroid, you need to install Termux, Termux:Tasker, Python on Termux, Requests, and BeautifulSoup.
Can anyone come up with a regular expression? Please help.
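The actual script is only shown in the screenshot, so it is not reproduced here. As a rough idea of the parsing step, this sketch pulls the number out of the div by its title attribute instead of using a regular expression. It uses the standard-library html.parser as a stand-in for BeautifulSoup, so it is not the script from the screenshot. The names TitledDivText and extract_count are hypothetical.

```python
from html.parser import HTMLParser


class TitledDivText(HTMLParser):
    """Collect the text inside the first <div> whose title attribute matches."""

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.depth = 0      # > 0 while inside the matching div
        self.done = False   # True once the matching div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1                      # nested div inside the match
        elif dict(attrs).get("title") == self.title:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_count(html, title="Celkový počet výstupů"):
    """Return the number inside the titled div, or None if it is not found."""
    parser = TitledDivText(title)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).strip()
    return int(text) if text.isdigit() else None


sample = ('<div class="col-6 col-md-3" title="Celkový počet výstupů">'
          '<img src="/icons/map/mountain-visited-80.png">9</div>')
print(extract_count(sample))
```

Matching on the title attribute avoids depending on the exact image path or class names, which may change independently of the count.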
 

Dimlos

Well-known member
I've done my best to narrow it down to this point, now I just need to extract 9...
 

Attachments

  • scraping.jpg (376.2 KB)
  • output.jpg (492.4 KB)

Dimlos

Well-known member
I managed to complete the macro.
I will upload the macro and script when I have time.

@Endercraft
You are giving me material that I would not have thought of on my own.
I enjoy doing this and it helps me improve my skills.
 
Maybe it's just me, but why not split to an array using ">" as a delimiter, take the last entry, and split again using "<" as the delimiter, or just replace </div> with nothing?
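A quick Python sketch of both suggestions, run against the fragment posted earlier in the thread. One detail to watch: because that fragment ends with ">", splitting on ">" leaves an empty final entry, so the sketch takes the last non-empty entry instead.

```python
text = '<img src="/icons/map/mountain-visited-80.png">9</div>'

# Split on ">". The text ends with ">", so the final entry is empty;
# keep only non-empty entries and take the last one: '9</div'.
pieces = [p for p in text.split(">") if p]
last = pieces[-1]

# Split that entry on "<" and keep what comes before it.
by_split = last.split("<")[0]

# Alternative: drop "</div>", then keep what follows the last ">".
by_replace = text.replace("</div>", "").rsplit(">", 1)[-1]

print(by_split, by_replace)
```

Both variants assume the text has already been narrowed down to this one fragment, with nothing after the closing </div>.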
 

Dimlos

Well-known member
I see, but there are many unnecessary control characters in the text before extraction, so I cleaned it up once and extracted it with regular expressions.
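A minimal sketch of that clean-then-extract order in Python. The raw string is hypothetical, since the real control characters are not shown in the thread, and the function name clean is made up for illustration.

```python
import re

# Hypothetical raw text; the actual control characters are not shown here.
raw = '\r\n\t<img src="/icons/map/mountain-visited-80.png">\n\t\t9\r\n</div>\x00'


def clean(text):
    """Replace runs of control characters and spaces with a single space."""
    return re.sub(r"[\x00-\x20\x7f]+", " ", text).strip()


cleaned = clean(raw)

# After cleaning, a short pattern is enough to pick out the digits.
match = re.search(r">\s*(\d+)\s*</div>", cleaned)
print(repr(cleaned), match.group(1))
```

Cleaning first keeps the extraction pattern short, at the cost of one extra pass over the text.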
 

Dimlos

Well-known member
Setting up a Python runtime environment is not easy, so I have disabled the part that performs the GET request.
You can still try the part that extracts the 9 from text that was already partially extracted from the site.

scraping.py
 

Attachments

  • scraping.macro (7.1 KB)

Dimlos

Well-known member
I didn't think anyone would download it, but a few people have downloaded the Python script.
I can't explain every step of the setup, but if you get stuck or have any questions, please write to me.
 

Attachments

  • Termux.jpg (369.6 KB)