Sunday, January 24, 2021

Python: A Google Translate service using Playwright

There are many use cases in which you might want to automate a web browser, for example to automate tedious repetitive tasks or to perform automated tests of front-end applications. Several tools are available for this, such as Selenium, Cypress and Puppeteer. Several blog posts and presentations by Lucas Jellema piqued my interest in Playwright, so I decided to give it a try. I'm not a great fan of JavaScript, so I went with Python for this one. I also did some tests with wrk, a simple yet powerful HTTP benchmarking tool, to get an indication of how Playwright would handle concurrency, and I was not disappointed.


Introduction

Playwright

The comparison here of Selenium, Cypress, Puppeteer and Playwright is a nice starting point. Microsoft Playwright was created by the same people who created Google Puppeteer and is relatively new. Playwright communicates bidirectionally with the browser, which allows events in the browser to trigger events in your scripts (see here). This can also be done with Selenium, but it is more difficult (see here; I suspect polling is usually involved). With Selenium you also often need to build delays into your scripts. Playwright provides extensive options for waiting for things to happen, and since the interaction is bidirectional, I suspect no polling is needed, which makes scripts faster and more robust.

I used Playwright from Python. The Node.js implementation is more mature; on Python, before the first official non-alpha release, there might still be breaking changes in the API, so the sample provided here might stop working due to those changes. Since the JavaScript and Python APIs are quite similar, I do not expect major changes though.
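As a minimal sketch of what the pre-1.0 Python API looks like (the coroutine below is only defined, not executed; method names like newPage were camelCase in these early releases and may change):

```python
# Minimal Playwright sketch (pre-1.0 Python API, camelCase method names).
# Running it requires the playwright package and the browsers installed
# via "python3 -m playwright install".
async def fetch_title(url):
    # Import inside the function so this file can be inspected even
    # without playwright installed.
    from playwright import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()  # add headless=False to watch it
        page = await browser.newPage()
        await page.goto(url)
        title = await page.title()
        await browser.close()
        return title
```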

Python

Python 3.4 introduced the asyncio module, and since Python 3.5 you can use the keywords async and await. Because I was using async libraries and wanted to wrap my script in a web service, I took a look at which web service frameworks are available for Python and stumbled on this comparison. Based on the async capabilities, simple syntax, good performance and (claimed) popularity, I decided to go with Sanic.
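As a minimal, standard-library-only illustration of the async/await syntax mentioned above:

```python
import asyncio

async def greet(name):
    # await suspends this coroutine without blocking the event loop
    await asyncio.sleep(0)
    return f"Hello, {name}"

async def main():
    # run two coroutines concurrently and collect their results in order
    return await asyncio.gather(greet("Sanic"), greet("Playwright"))

print(asyncio.run(main()))
# ['Hello, Sanic', 'Hello, Playwright']
```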


Another reason for me to go with Python is that it is very popular in the AI/ML area. If in the future I want to scrape a lot of data from websites and do smart things with that data, I can stay within Python and do not have to mix several languages.

Google Translate API

Of course Google provides a commercial Translate API. If you are thinking about seriously implementing translations, definitely go with that one, since it is made to do translations and comes with SLAs. I decided to create a little Python REST service that uses the Google Translate website instead (a website scraper). For me this was a tryout of Playwright, so in this example I did not care about support, performance or SLAs. If you overuse the site, you will get CAPTCHAs, and I did not automate those away.

Getting things ready

I started out with a clean Ubuntu 20.04 environment. I tried JupyterLab first, but Playwright and JupyterLab did not seem to play well together, probably because JupyterLab itself also runs in a browser. I decided to go with PyCharm. Among other things, PyCharm has code completion features I like, and its interface is similar to IntelliJ and DataGrip, which I also use.

 #Python 3 was already installed. pip wasn't yet  
 sudo apt-get install python3-pip  
 sudo pip3 install playwright  
 sudo pip3 install lxml  
 #Webservice framework  
 sudo pip3 install sanic  
 sudo apt-get install libenchant1c2a  
 #This installs the browsers which are used by Playwright  
 python3 -m playwright install  
 #PyCharm  
 sudo snap install pycharm-community --classic  

The Google Translate API scraper


 from playwright import async_playwright  
 from sanic import Sanic  
 from sanic import response  
   
 app = Sanic(name='Translate application')  
   
 @app.route("/translate")  
 async def doTranslate(request):  
     async with async_playwright() as p:  
         sl = request.args.get('sl')  
         tl = request.args.get('tl')  
         translate = request.args.get('translate')  
         browser = await p.chromium.launch()  # headless=False  
         context = await browser.newContext()  
         page = await context.newPage()  
         await page.goto('https://translate.google.com/?sl=' + sl + '&tl=' + tl + '&op=translate')  
         textarea = await page.waitForSelector('//textarea')  
         await textarea.fill(translate)  
         waitforthis = await page.waitForSelector('div.Dwvecf', state='attached')  
         result = await page.querySelector('span.VIiyi >> ../span/span/span')  
         textresult = await result.textContent()  
         await browser.close()  
         return response.json({'translation': textresult})  
   
 if __name__ == '__main__':  
     app.run(host="0.0.0.0", port=5000)  
You can also find the code here.

How does it work?

The app.run command at the end starts an HTTP server on port 5000. @app.route indicates the function doTranslate will be available at /translate. The async function receives a request, which is used to obtain the GET arguments that indicate the target language (tl), the source language (sl) and the text to translate (translate). Next, a Chrome browser is started in headless mode. Headless is the default for Playwright, but during development it helps to disable it so you can see what happens in the browser. The Google Translate site is opened with the source and target language as GET parameters. Next, Playwright waits until a textarea appears (meaning the page is fully loaded) and fills it with the text to be translated. Playwright then waits for the translation to appear (the box 'Translations of auto' in the screenshot below). The result is selected and saved: first I select span.VIiyi (a CSS selector) and within that span I select ../span/span/span (an XPath selector). After the result is obtained, I close the browser and return the translation as JSON.


When you start the service, you will get something like
 [2021-01-24 09:16:38 +0100] [3278] [INFO] Goin' Fast @ http://0.0.0.0:5000  
 [2021-01-24 09:16:38 +0100] [3278] [INFO] Starting worker [3278]  
Next you can test the service. You can do this with 
 curl 'http://localhost:5000/translate?sl=nl&tl=en&translate=auto'  
It will translate the Dutch (sl=nl) word 'auto' (translate=auto) to English (tl=en). The result will be 'car'.
 {"translation":"car"}  

Performance

Of course performance tests come with a big disclaimer: I ran this on specific hardware, on a specific OS, etc., so you will not be able to reproduce these results exactly. The test was also relatively simple, just to get an impression. I additionally logged the responses, since the script did not include proper exception handling.

I used wrk, an HTTP benchmarking tool, on the service.
 wrk -c2 -t2 -d30s --timeout=30s 'http://localhost:5000/translate?sl=nl&tl=en&translate=auto'  
wrk used 2 threads and kept 2 connections open at the same time (2 concurrent requests). I set the per-request timeout to 30s; this timeout was never reached.

This gave me an average of 2.5s per request. At 4 concurrent requests this became 3s on average; at 8 concurrent requests it became 4.5s on average and started to produce some errors. A 'normal' API can of course do much better. I was surprised, though, that Playwright could handle 8 headless browsers simultaneously on an 8 GB Ubuntu VM which also had PyCharm open at the same time (knowing what a memory hog Chrome can be). I was also surprised Google didn't start to bother me with CAPTCHAs yet.

Tips for development

Automatically generate scripts

You can use Playwright to automatically generate scripts from manual browser interactions. This can be a good start for a script; see for example here. The generated scripts, however, do not contain smart logic such as waiting for certain elements to appear before selecting text from other elements.
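The recording is done with Playwright's codegen command. The invocation below is how it looked for the Python package around this time; flags and output may differ between versions, so check the CLI help for yours:

```shell
# codegen opens a browser window plus an inspector that shows the generated
# script as you click and type on the page. It needs a desktop session, so
# the command is shown here commented out:
#
#   python3 -m playwright codegen https://translate.google.com/
```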

Use browser developer tools

In order to automate a browser using Playwright, you need to select elements in the web page. An easy way to do this is by using the developer tools present in most browsers nowadays. There you can easily browse the DOM (Document Object Model, the model on which the browser bases what it shows you) and find the specific elements you want to manipulate with Playwright.


Approximate human behavior

One of the dangers of creating screen scrapers is that if the site changes, your code might stop working. In my sample script I used specific identifiers like Dwvecf and VIiyi and queried for elements based on the site's DOM. When you look at a website yourself, you select elements in a more visual way. The better you approximate the way a human interacts with a website, the more stable your script will be. For example, selecting the first textarea on the page is more stable than expecting the result to be in span.VIiyi and, within that element, under span/span/span.
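A sketch of what a more human-like version of the selection could look like. The selector strings are illustrative assumptions, not verified against the current Google Translate DOM, and the coroutine is only defined here, not executed:

```python
# Sketch: prefer selectors that mirror how a human finds things on the page.
# Selector strings are illustrative; the coroutine is not executed here.
async def translate_humanlike(page, text):
    # "the first textarea on the page" instead of a generated class name
    textarea = await page.waitForSelector('textarea')
    await textarea.fill(text)
    # a text selector survives restyling better than a class name like VIiyi
    result = await page.waitForSelector('text=Translations of')
    return await result.textContent()
```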

The right tool for the job

If an API is available and you can use it directly, that is of course preferable to a website scraper, since an API is made for automated interaction and a website for human interaction. You usually get much better performance and stability from an official API. Playwright includes an API to monitor and modify the HTTP(S) requests the browser makes; this can help you discover back-end APIs and determine whether you can use them directly.
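A sketch of how such monitoring could look with Playwright's event API. The coroutine is only defined here, not executed, and the event name follows the Playwright documentation of the time:

```python
# Sketch: log every request the page makes, to spot back-end API calls
# you might be able to use directly. Not executed here.
async def log_requests(page, url):
    seen = []
    # the "request" event fires for every HTTP(S) request the browser issues
    page.on('request', lambda request: seen.append(request.url))
    await page.goto(url)
    return seen
```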

When you use a tool like Playwright to automate browser interaction in, for example, tests for custom-developed applications, you get better stability, since you know when and how the site will change and can make sure the automation scripts keep working. When you use Playwright against an external website, you have less control: Google will not inform me when they change https://translate.google.com, and this script will most likely break because of it.
