There are many use cases in which you might want to automate a web browser, for example to automate tedious, repetitive tasks or to run automated tests of front-end applications. Several tools are available for this, such as Selenium, Cypress and Puppeteer. Several blog posts and presentations by Lucas Jellema piqued my interest in Playwright, so I decided to give it a try. I'm not a great fan of JavaScript, so I went with Python for this one. I also ran some tests with wrk, a simple yet powerful HTTP benchmarking tool, to get an indication of how Playwright would handle concurrency, and I was not disappointed.
Introduction
Playwright
A nice comparison of Selenium, Cypress, Puppeteer and Playwright can be found here. Microsoft Playwright was created by the same people who created Google Puppeteer and is relatively new. Playwright communicates bidirectionally with the browser, which allows events in the browser to trigger events in your scripts (see here). This can also be done with something like Selenium, but it is more difficult (see here; I suspect polling is usually involved). With Selenium you also often need to build delays into your scripts. Playwright provides extensive options for waiting for things to happen, and since the interaction is bidirectional, I suspect no polling is needed, which should make scripts both faster and more robust.
I used Playwright on Python. The Node.js implementation is more mature. On Python, before the first official non-alpha release, there might still be some breaking changes in the API, so the sample provided here might not work in the future. Since the JavaScript and Python APIs are quite similar, I do not expect major changes though.
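The difference between polling and event-driven waiting can be sketched with plain asyncio. This only illustrates the general idea and says nothing about how Playwright or Selenium are implemented internally; the names `wait_by_polling`, `wait_by_event` and `page_becomes_ready`, as well as the timings, are made up for this example.

```python
import asyncio


async def wait_by_polling(state: dict, interval: float = 0.05) -> int:
    """Check a flag repeatedly; returns how many checks were needed."""
    checks = 0
    while True:
        checks += 1
        if state["ready"]:
            return checks
        await asyncio.sleep(interval)


async def wait_by_event(event: asyncio.Event) -> int:
    """Sleep until notified; a single wake-up is enough."""
    await event.wait()
    return 1


async def demo() -> tuple:
    state = {"ready": False}
    event = asyncio.Event()

    async def page_becomes_ready() -> None:
        # Hypothetical stand-in for "the browser finished rendering".
        await asyncio.sleep(0.2)
        state["ready"] = True
        event.set()

    polls, wakeups, _ = await asyncio.gather(
        wait_by_polling(state), wait_by_event(event), page_becomes_ready()
    )
    return polls, wakeups


if __name__ == "__main__":
    polls, wakeups = asyncio.run(demo())
    print(f"polling needed {polls} checks, the event needed {wakeups} wake-up")
```

The polling variant has to choose an interval: too long and the script reacts late, too short and it wastes work. The event variant reacts as soon as it is notified.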
Python
Python 3.4 introduced the asyncio module and since Python 3.5 you can use keywords like async and await. Because I was using async libraries and I wanted to wrap my script in a webservice, I decided to take a look at what webservice frameworks are available for Python. I stumbled on this comparison. Based on the async capabilities, simple syntax, good performance and (claimed) popularity, I decided to go with Sanic.
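As a quick illustration of what async and await make possible, here is a minimal sketch using only the standard library. `fake_translate` is a hypothetical placeholder for an I/O-bound call (such as driving a browser), not part of Playwright or Sanic.

```python
import asyncio
import time


async def fake_translate(word: str, delay: float = 0.1) -> str:
    # Hypothetical placeholder for an I/O-bound call such as driving a browser.
    await asyncio.sleep(delay)
    return word.upper()


async def translate_all(words: list) -> list:
    # gather() runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(*(fake_translate(w) for w in words))


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(translate_all(["auto", "fiets", "trein"]))
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f}s")
```

Because the three calls wait concurrently, the total time should be close to a single delay rather than the sum of all three.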
Google Translate API
Of course Google provides a commercial translate API. If you are thinking about seriously implementing a translate API, definitely go with that one, since it is made to do translations and comes with SLAs. I decided to create a little Python REST service that uses the Google Translate website (a website scraper). For me this was a tryout of Playwright, so in this example I did not care about support, performance or SLAs. If you overuse the site, you will get CAPTCHAs, and I did not automate those away.
Getting things ready
I started out with a clean Ubuntu 20.04 environment. I tried JupyterLab first, but Playwright and JupyterLab did not seem to play well together, probably because JupyterLab itself also relies heavily on a browser. I decided to go with PyCharm. Among other things, PyCharm has some code completion features which I like, and the interface is similar to IntelliJ and DataGrip, which I also use.
#Python 3 was already installed. pip wasn't yet
sudo apt-get install python3-pip
sudo pip3 install playwright
sudo pip3 install lxml
#Webservice framework
sudo pip3 install sanic
sudo apt-get install libenchant1c2a
#This installs the browsers which are used by Playwright
python3 -m playwright install
#PyCharm
sudo snap install pycharm-community --classic
from playwright import async_playwright
from sanic import Sanic
from sanic import response

app = Sanic(name='Translate application')


@app.route("/translate")
async def doTranslate(request):
    async with async_playwright() as p:
        # Source language, target language and the text to translate
        sl = request.args.get('sl')
        tl = request.args.get('tl')
        translate = request.args.get('translate')
        browser = await p.chromium.launch()  # headless=False
        context = await browser.newContext()
        page = await context.newPage()
        await page.goto('https://translate.google.com/?sl='+sl+'&tl='+tl+'&op=translate')
        # Type the text into the input box
        textarea = await page.waitForSelector('//textarea')
        await textarea.fill(translate)
        # Wait for this element to be attached, then read the translated text
        waitforthis = await page.waitForSelector('div.Dwvecf', state='attached')
        result = await page.querySelector('span.VIiyi >> ../span/span/span')
        textresult = await result.textContent()
        await browser.close()
        return response.json({'translation': textresult})


if __name__ == '__main__':
    app.run(host="0.0.0.0", port=5000)
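The service builds the page URL by concatenating the sl and tl parameters. If a caller leaves one of them out, `request.args.get` returns None and the concatenation fails with a TypeError; special characters are also passed through unescaped. A possible way to harden this, as a sketch only: the helper name `build_translate_url` and the choice to raise ValueError are assumptions for illustration, not part of the original service.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://translate.google.com/"


def build_translate_url(sl: Optional[str], tl: Optional[str]) -> str:
    """Build the Google Translate page URL from source and target language."""
    if not sl or not tl:
        raise ValueError("both 'sl' and 'tl' query parameters are required")
    # urlencode() escapes characters that would otherwise break the query string.
    query = urlencode({"sl": sl, "tl": tl, "op": "translate"})
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    print(build_translate_url("nl", "en"))
```

In the handler, the ValueError could be caught and turned into an HTTP 400 response instead of letting the request fail with a server error.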
[2021-01-24 09:16:38 +0100] [3278] [INFO] Goin' Fast @ http://0.0.0.0:5000
[2021-01-24 09:16:38 +0100] [3278] [INFO] Starting worker [3278]
curl 'http://localhost:5000/translate?sl=nl&tl=en&translate=auto'
{"translation":"car"}
wrk -c2 -t2 -d30s --timeout=30s 'http://localhost:5000/translate?sl=nl&tl=en&translate=auto'