Automating Web Scraping using Python, Selenium and Web Drivers

We hope you find this tutorial useful, please share it if you do!

Featured Image Photo by Josh Nezon on Unsplash

1. Introduction

This tutorial is a practical example of how you might go about automating interaction with a web browser, in situations where you really don’t want to be doing this manually on a regular basis; for example, where you have a lot of website options to check, or you need to carry out these checks repeatedly over time.

The inspiration for this tutorial is the lure of amazing train travel: the Dogu Express, a 26-35 hour train journey running between Ankara and Kars in Turkey. There is a tourist-focussed version of this train journey called the “Touristic Dogu Express”, with fewer stops and sleeping cars; this can get booked up quickly, and its tickets are only released up to 30 days in advance of travel.

This tutorial covers automating the checking of availability for a particular journey of interest to me (a single journey on the Touristic Dogu Express from Kars to Ankara) in the next 30 days. The intention is that, until I am ready to book my train tickets, this system checks for new tickets as they are released and notifies me of which berths are available on the website, so I don’t miss out on a chance to book.

The method required for this is to automate web clicking functions in a web browser; in this tutorial this is going to be carried out using Python and Selenium on a Windows 11 computer.

2. Choosing a Web Driver to use with Selenium

In order to use Selenium, we need a driver to control the web browser we select, and the main options available are shown in Table 1.

Web browser | Driver
Chrome      | https://sites.google.com/chromium.org/driver/
Edge        | https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox     | https://github.com/mozilla/geckodriver/releases
Safari      | https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Table 1: Common Web Browsers and their respective Driver downloads for use with Selenium

Of these, this tutorial will use the Firefox browser for automation, and therefore install its driver ‘Geckodriver’.

3. Installing Geckodriver to use with Firefox

The Mozilla Firefox geckodriver can be found on its GitHub releases page, https://github.com/mozilla/geckodriver/releases (shown in Figure 1).

Figure 1: List of geckodriver options – choose ‘Show all 13 assets’ to get the full list

If you click ‘Show all 13 assets’ (Figure 1) to see the full list of Geckodriver downloads, you will find geckodriver-v0.34.0-win64.zip, which is what I required for my computer setup. The download is a .zip file; extract it to anywhere you wish on your computer. Once it is unzipped, note the full path to your geckodriver.exe file, so you can point to it later in your Python Selenium code, e.g. in a Windows environment:

# full local path to your geckodriver.exe
C:\some-folder\geckodriver-v0.34.0-win64\geckodriver.exe
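
Before going further, it can be worth confirming that the path you noted really points at a file. The following is a minimal sketch using only the Python standard library; the function name geckodriver_exists is my own for illustration, and the path is the placeholder from above, so substitute your own location.

```python
from pathlib import Path


def geckodriver_exists(path_string):
    """Return True if path_string points at an existing file."""
    return Path(path_string).is_file()


# placeholder path from above (a raw string keeps the backslashes as-is):
# replace it with the location of your own geckodriver.exe
geckodriver_path = r"C:\some-folder\geckodriver-v0.34.0-win64\geckodriver.exe"
print(geckodriver_exists(geckodriver_path))
```

If this prints False, check the folder name and that the .zip file was actually extracted.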

4. Installing Python bindings for Selenium

Assuming that you have Python installed on your computer, and also (recommended) that you have set up and activated a Python virtual environment, you are ready to install the Python bindings for Selenium:

# if you are installing via the command line:
pip install selenium

# if you are installing within a Python Jupyter notebook, the '!' is required:
!pip install selenium 

Note that the version of the Python Selenium package installed and used in this tutorial is 4.17.2. You can check this by running a command in the terminal, or within a Python module or Jupyter notebook. This matters because a number of Selenium methods and arguments were changed or removed during the version 4 releases (for example, from version 4.10 onwards the driver constructors no longer accept the executable_path argument; see the link in the code in section 5.3), so some of the tutorial examples available on the internet are now out of date.

# checking Python selenium version from the terminal (e.g. within Pycharm)
python -c "import selenium; print(selenium.__version__)"

Within a Python module:

# within a Python module
import selenium
print(selenium.__version__)
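
If you would like your script to check the version for itself, the version string can be turned into a tuple of numbers and compared. This is only a sketch: version_tuple is a hypothetical helper name, the version is hard-coded here so the example runs without Selenium installed, and it assumes a plain numeric version such as '4.17.2' (pre-release suffixes are not handled).

```python
def version_tuple(version_string):
    """Convert a version string such as '4.17.2' into a tuple of ints."""
    return tuple(int(part) for part in version_string.split(".")[:3])


# in practice you would pass selenium.__version__ here
installed_version = "4.17.2"

# tuples compare element by element, so (4, 17, 2) >= (4, 10) is True
print(version_tuple(installed_version) >= (4, 10))
```

Comparing tuples rather than raw strings avoids the trap where the string '4.9' sorts after '4.10'.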

5. Web Site and its Automation

The website we will be checking for train tickets is the TCDD website; this is described further in section 5.1 onwards.

5.1 Overview of the Website to be Automated

Figure 2: Main Page for the Turkish Railways website TCDD (English version) to be automated with Selenium

The main page of the Turkish Railways website TCDD is shown in Figure 2 (its URL is given in the code in section 5.3). The page defaults to Turkish, so you have to click the link labelled ‘English’ on the top right of the main page to change language. On the front page of this website, the following defaults are relevant for the subsequent automation steps:

  • Departure Date is prefilled with today’s date
  • Number of Passengers is prefilled with the default number of ‘1’
  • The default journey search (radio button) is ‘One Way’

5.2 Overview of desired website automation actions

The actions I would like to automate are as follows:

  • Searching from today’s date, run a search 31 times, up to 30 days from today
  • Choose a single journey from Kars to Ankara Gar
  • For two passengers
  • Pressing the search button, which should (if train listings are available) take you to the next page, page 2 (search results). If no train results are available, you will stay on the main page.
  • From page 2 (search results), retrieve availability for the Tourist Dogu Express Train (TURİSTİK DOĞU EKS) only, not the regular Dogu Express Train (DOĞU EKSPRESİ)
  • This search on a given date may give rise to: a) a listing for the Tourist Dogu Express Train, but not the regular Dogu Express Train; b) conversely, a listing for the regular train, but not the Tourist train; c) a listing for both types of train; or d) no train listing at all.

5.3 Set Selenium Imports and Driver settings

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

Below are the various settings required to set up the Firefox webdriver, and point it towards the train website:

# set the driver to find the correct geckodriver on your computer
# (a raw string, so that the backslashes in the Windows path are kept as-is)
geckodriverpath = r"C:\some-folder\geckodriver-v0.34.0-win64\geckodriver.exe"

# set the Turkish train website for the driver
train_website = 'https://ebilet.tcddtasimacilik.gov.tr/view/eybis/tnmGenel/tcddWebContent.jsf'

Using the geckodriverpath and train_website, the driver and wait objects are created:

def set_driver_and_set_wait():
    """
    function to declare the driver (based on the website you are automating)
    and wait object
    :return: driver, wait
    """
    # https://stackoverflow.com/questions/76802588/python-selenium-unexpected-keyword-argument-executable-path
    driver_service = Service(executable_path=geckodriverpath)

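    # Set up the Firefox WebDriver for Python Selenium in headless mode
    # (the options.headless property was removed in recent Selenium 4 releases,
    # so the '-headless' argument is passed to Firefox instead)
    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options, service=driver_service)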
    # set the website for the driver
    driver.get(train_website)

    # set a wait time for the driver (this will be used in multiple places)
    wait = WebDriverWait(driver, 10)

    return driver, wait
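
Because each driver starts a real Firefox process, it is worth making sure the browser is always closed, even if a later step raises an exception. One possible way to do this is a small context manager, sketched below. The name browser_session is my own and not part of Selenium; it only assumes that the setup function returns a (driver, wait) pair, as set_driver_and_set_wait does, and that the driver has a quit() method.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(setup_function):
    """
    Call setup_function() to obtain (driver, wait), hand both to the caller,
    and always call driver.quit() afterwards, even if an exception is raised.
    """
    driver, wait = setup_function()
    try:
        yield driver, wait
    finally:
        driver.quit()


# intended usage with the function defined above (not run here):
# with browser_session(set_driver_and_set_wait) as (driver, wait):
#     ...  # carry out the automation steps
```

The try/finally inside the context manager is what guarantees the clean-up; without it, a failed search could leave headless Firefox processes running in the background.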

5.4 Set the desired journey start and endpoint

As set out in section 5.2, we are looking for a single journey from Kars to Ankara Gar. Therefore we need to find the website elements on the main search page which are the “From” (“Nereden”) and “To” (“Nereye”) text boxes, and insert “Kars” and “Ankara Gar” into these boxes respectively.

5.4.1 Locating the origin and destination boxes by element name

First, locate the names of the origin and destination boxes in the code, by right-clicking on the web page and selecting Inspect (Q) as shown in Figure 3:

Figure 3: Right click on webpage and choose Inspect (Q) in Firefox browser to show webpage source code

Clicking inspect and moving around the webpage will highlight the code which corresponds to page elements. In this case, the element with the name “nereden” (“From”) is located in the webpage source code shown in Figure 4:

Figure 4: Looking at webpage source code elements using inspect to look at element ids, names

The code for the elements relating to “nereden” can be copied by right-clicking on the bottom pane where the code is shown. There are various copying options: Inner HTML, Outer HTML, CSS Selector, CSS Path and XPath. Once copied, these (particularly the Outer HTML and the XPath) reveal the element’s name, id or XPath, which Selenium can then use to locate the element.

Figure 5: Copying of elements from source code (HTML, CSS, XPath) – to locate element id, name or xpath

The use of ‘wait’ ensures that the elements we’re locating have loaded on the web page. If an element we are waiting for does not appear within the wait time (for example, because it does not exist), Selenium will raise a TimeoutException.

# our desired journey starting point: in this case, Kars, in the East of Turkey
FROM_input_box = wait.until(EC.visibility_of_element_located((By.NAME, "nereden")))

# our desired destination, Ankara Gar station
TO_input_box = wait.until(EC.visibility_of_element_located((By.NAME, "nereye")))

5.4.2 Typing the origin and destination stations to their boxes on the web page

We have previously defined the FROM_input_box and TO_input_box elements on the webpage; using the send_keys method sends whatever text you need to send to those boxes. Here we send the origin and destination railway stations:

# Type 'Kars' into the "From" box
FROM_input_box.send_keys("Kars")

# Type 'Ankara Gar' into the "To" box
TO_input_box.send_keys("Ankara Gar")

5.5 Changing website frontend language

As this is primarily a Turkish language website, you can change the website language to English to make the automation easier to follow (note that this does not change the names of the elements in the web page source code).

def change_language(wait):
    """
    function to change the language of the frontend from Turkish to English
    :return:
    """
    english_button = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "English")))

    # Click the English button (this language change is to assist non-Turkish speakers
    # with viewing the automated use of the website)
    english_button.click()

5.6 Overriding the Default Passenger Number

The number of passengers on the main search page is prefilled with the default number of ‘1’. This default must first be removed, before replacing it with the desired number of passengers in the passenger number box. First, the clickable passenger number box is located and assigned to a Python object. Then an ‘action chain’ is defined to carry out a series of actions on that ‘passenger_number’ object.

The chain clicks the object twice: the first click selects the ‘1’, and the second highlights it in full. A delete command then removes it, and the send_keys command enters the updated number of passengers.

# number of passengers found by ID as a clickable element
passenger_number = wait.until(EC.element_to_be_clickable((By.ID, "syolcuSayisi")))

# performs these actions in a chain
actions = ActionChains(driver)
actions.move_to_element(passenger_number)

# clicking spinner button twice selects and then highlights the '1' first
actions.click(passenger_number)
actions.click(passenger_number)

# delete selected default passenger number of '1'
actions.send_keys(Keys.DELETE)

# send "2" to indicate the number of passengers is now "2"
actions.send_keys("2")
actions.perform()

# This is another, alternative way in which to run the action chains above
# (use one form or the other, not both)
ActionChains(driver).move_to_element(passenger_number).click(passenger_number).click(passenger_number).send_keys(Keys.DELETE).send_keys("2").perform()

5.7 Setting and Incrementing the Search Date

From Figure 2, it can be seen that the main website page (English version) has a default outbound travel date of whatever today’s date is. The intention is to carry out a search for specific trains for all dates from today’s date, up to 30 days from now. Therefore, we will be incrementing the date from today for each day up to today+30 days.

5.7.1 Incrementing the date

The search for a one-way train ticket from Kars to Ankara is to be repeated for every day from today until 30 days from today. Python’s datetime module [1] https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior can be used to iterate over those dates. We can generate a string for any day with reference to today’s date as follows:

from datetime import datetime, timedelta

# increment the number of days 
number_of_days_from_now = timedelta(days=0)
todays_date = datetime.today()
future_date = number_of_days_from_now + todays_date
# convert to string after addition
future_date_string = future_date.strftime('%d.%m.%Y')
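
To cover the whole search window, the same calculation can be repeated for each offset from 0 to 30 days. The sketch below wraps this in a hypothetical helper, search_date_strings (the name and the days_ahead parameter are my own); it uses the same '%d.%m.%Y' format as the snippet above.

```python
from datetime import datetime, timedelta


def search_date_strings(start_date, days_ahead=30):
    """Return date strings from start_date to start_date + days_ahead, inclusive."""
    return [
        (start_date + timedelta(days=offset)).strftime('%d.%m.%Y')
        for offset in range(days_ahead + 1)
    ]


date_strings = search_date_strings(datetime.today())
# 31 searches in total: today plus each of the next 30 days
print(len(date_strings))
```

Each string in date_strings can then be typed into the date widget in turn, one search per date.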

5.7.2 Finding and Setting the Outward Travel Date

As with the number of passengers, the default outbound travel date is set to today’s date. Therefore, the date widget must be found, and its value fully selected, deleted and overridden with the desired outbound date [2] https://stackoverflow.com/questions/69690674/how-to-override-the-default-input-field-value-using-selenium-and-python . This is shown by the code below:

# trCalGid is the ID of the outwards date, in the format: 12.02.24
# CHANGE THE SEARCH DATE for the outward leg (from) 
date_widget = wait.until(EC.element_to_be_clickable((By.ID, "trCalGid")))

date_actions = ActionChains(driver)
date_actions.move_to_element(date_widget)

# three clicks moves over the entire date of format 12.02.24
date_actions.click(date_widget)
date_actions.click(date_widget)
date_actions.click(date_widget)

# fourth click selects whole date
date_actions.click(date_widget)

# clears default date
date_actions.send_keys(Keys.DELETE)

# sets new date - new_date_string is a date string built as in section 5.7.1
# (there called future_date_string), iterated over your range of dates
date_actions.send_keys(new_date_string)
date_actions.perform()

# click away from the date picker to close it 
# by clicking on an arbitrary section elsewhere 
# section to click to is the main 'intro' section on the site 
outside_element = driver.find_element(By.ID, "intro")
outside_element.click()

5.8 Finding and Pressing ‘Search’

Once the number of passengers, origin and destination station and desired travel date are set, then the search button must be located and pressed. This can be done via the following Selenium code:

# find the search button 'btnSeferSorgula' and click it

search_button = wait.until(EC.element_to_be_clickable((By.ID, "btnSeferSorgula")))
search_actions = ActionChains(driver)
search_actions.move_to_element(search_button)
search_actions.click(search_button)
search_actions.perform()

6. Processing Train Search Results

As stated above, searching for a Kars to Ankara train on a given date may give rise to any of the following combinations:

6.1 Search results scenarios

Scenario 1: search results yield a listing for the Tourist Dogu Express Train, but not the regular Dogu Express Train

In this scenario 1, there is a single row in the table, which is the Tourist Dogu Express Train only:

Figure 6: Search results – listing for the Tourist Dogu Express train but not the regular Dogu Express train

Scenario 2: search results yield a listing for the regular Dogu Express train, but not the Tourist Dogu Express train.

In this scenario 2, there is a single row in the table (the Dogu Express only):

Figure 7: Search provides listing for the regular Dogu Express train, but not the Tourist Dogu Express train

Scenario 3: search results yield a listing for both types of train:

In this scenario 3, there are two rows in the table:

Figure 8: Search results – listing for both the regular Dogu Express train and the Tourist Dogu Express train

Scenario 4: search results yield no train listing at all (in which case you do not leave the main search page, but get an information bubble instead):

Figure 9: Search results – no train listings found at all on search date

Based on scenarios 1, 2, and 3 (all of which yield search results), it will be noted that the results are a table with either one or two rows. Based on scenario 4, there will be no table available with search results at all.
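
The four scenarios can be expressed as a small decision function, which makes it easier to test the logic without a browser. This is a sketch under assumptions: classify_search_result is a hypothetical name, the input is a list of the train name texts read from the table rows (empty when no results page loads), and only the two train names mentioned in section 5.2 are expected on this route.

```python
def classify_search_result(train_names):
    """
    Map the train names found on the results page to scenarios 1-4.
    train_names: list of train name texts, one per table row
    (an empty list if the results page did not load).
    """
    has_tourist = any("TURİSTİK" in name for name in train_names)
    has_regular = any("DOĞU EKSPRESİ" in name for name in train_names)
    if has_tourist and has_regular:
        return 3  # both trains listed
    if has_tourist:
        return 1  # Tourist Dogu Express only
    if has_regular:
        return 2  # regular Dogu Express only
    return 4      # neither train listed


print(classify_search_result(["DOĞU EKSPRESİ", "TURİSTİK DOĞU EKS."]))
```

Only scenarios 1 and 3 are of interest for this tutorial, as they are the ones in which the Tourist train is listed.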

6.2 Checking that Search Results Page has loaded

In order to check that the search results page has loaded, I am choosing an element which should always be present on a loaded page: the column header “Tren Adı”. The code therefore waits until this element has loaded, using Selenium’s wait.until with an expected condition, as shown below.

Figure 10: Search results page showing Tren Adı column name showing that search results have been loaded

Looking at the search page and copying the outer HTML code, Tren Adı column header has an id referred to below:

<th id="mainTabView:gidisSeferTablosu:j_idt78" class="ui-state-default" role="columnheader" style="text-align:center;"><span>Tren Adı</span></th>

The website code in the console shows the Tren Adı column header in more detail:

Figure 11: Search results page showing Tren Adı column name in web page source code via console

If the search results page has not loaded (because no search results have been returned at all, i.e. scenario 4, or for some other reason), waiting for the expected condition of finding Tren Adı will give rise to a timeout exception. This exception must be caught:

from selenium.common.exceptions import TimeoutException

def check_search_results_loaded(wait):
    """
    run a check to see if the results page has loaded (will load if trains found)
    need to do a wait until the column heading called Tren Adı has loaded
    :return:
    """
    # id for the Tren Adı column name
    tren_adi_id = "mainTabView:gidisSeferTablosu:j_idt78"
    wait.until(EC.visibility_of_element_located((By.ID, tren_adi_id)))

### calling the check_search_results_loaded function
try:
    # check whether any results loaded
    check_search_results_loaded(wait)
    print('search results page loaded')
       
except TimeoutException:
    driver.quit()

6.3 Locating Elements from Search Results

On the search result page, each result occupies a row in the table. The ids and element names to be used by Python Selenium can be cross checked against the browser view.

6.3.1 Elements in the table first row results

Figure 12 shows the page elements for the first row of the results, to be located by Python Selenium:

Figure 12: Search results table first row (index 0) to be located by Python Selenium

Inner HTML view for the webpage source code for the top row of the table of the search results is shown below:

<div class="ui-button ui-widget ui-state-default ui-button-text-only ui-corner-left ui-state-active"><input id="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:soBiletTipi:0" name="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:soBiletTipi" type="radio" value="1" class="ui-helper-hidden" checked="checked"><span class="ui-button-text ui-c">Standart</span></div><div class="ui-button ui-widget ui-state-default ui-button-text-only ui-corner-right ui-state-disabled"><input id="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:soBiletTipi:1" name="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:soBiletTipi" type="radio" value="2" class="ui-helper-hidden" disabled="disabled"><span class="ui-button-text ui-c">Esnek</span></div>

XPath for the top row:

//*[@id="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:soBiletTipi"]

Note that both the XPath and the id for the first row have mainTabView:gidisSeferTablosu:0: in them.

6.3.2 Elements in the table second row results

Figure 13: Looking at webpage sourcecode and ID for second row to locate with Selenium Python

Inner HTML view for the second row of the search results table:

<div class="ui-button ui-widget ui-state-default ui-button-text-only ui-corner-left ui-state-active"><input id="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:soBiletTipi:0" name="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:soBiletTipi" type="radio" value="1" class="ui-helper-hidden" checked="checked"><span class="ui-button-text ui-c">Standart</span></div><div class="ui-button ui-widget ui-state-default ui-button-text-only ui-corner-right ui-state-disabled"><input id="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:soBiletTipi:1" name="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:soBiletTipi" type="radio" value="2" class="ui-helper-hidden" disabled="disabled"><span class="ui-button-text ui-c">Esnek</span></div>

XPath for the bottom row:

//*[@id="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:soBiletTipi"]

Comparing the XPath and id of the first table row with those of the second table row, the two are identical except for the first number, which must relate to the table row index: the first row has :0: in it, and the second row has :1:.

Therefore, for multiple rows in the table, we iterate through the result rows, and also handle any rows which do not exist.
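
As a small illustration of this pattern, the row index can be substituted into a template to build the XPath for any row. The helper names below (ROW_ID_TEMPLATE, row_element_id, row_xpath) are my own; the id itself is copied from the two rows shown above.

```python
# the only part of the id that changes between rows is the first index
ROW_ID_TEMPLATE = "mainTabView:gidisSeferTablosu:{}:seferBilgileriDataList:0:soBiletTipi"


def row_element_id(row_no):
    """Build the element id for a given table row index."""
    return ROW_ID_TEMPLATE.format(row_no)


def row_xpath(row_no):
    """Build the XPath which selects the element with that id."""
    return '//*[@id="{}"]'.format(row_element_id(row_no))


# the results table has at most two rows in the scenarios above
for row_no in range(2):
    print(row_xpath(row_no))
```

The resulting strings can be passed to driver.find_element with By.XPATH (or the id with By.ID).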

6.4 Iterating through Table row and handling non-existent page elements

From the search result scenarios 1-4 (and also the variability in how the elements in a page are expressed), it can be seen that a search result page element may in fact be missing when we search for it by XPath or id. Selenium then raises a Python exception (NoSuchElementException), which must be handled should it arise.

def check_results_table_row(driver, row_no):
    """
    Locate one row of the search results table by its xpath and return its text.
    row_no will be iterated through at least 0, 1 and is substituted
    into the xpath at '{}':

    top row xpath
    //*[@id="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:cbGidisSeferInfo"]
    second row xpath
    //*[@id="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:cbGidisSeferInfo"]
    :return: text of the row, or None if the row does not exist
    """
    # substitute row number into the table_row at '{}' (format converts it to a string)
    table_row = '//*[@id="mainTabView:gidisSeferTablosu:{}:seferBilgileriDataList:0:cbGidisSeferInfo"]'.format(row_no)
    try:
        table_xpath_row = driver.find_element(By.XPATH, table_row)
    except NoSuchElementException:
        # this row is not present in the results table
        return None
    return table_xpath_row.text

6.5 Locate Train Type Text Elements

The train type is held in a label element, whose id again contains the row index:

<label id="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:j_idt81" class="ui-outputlabel" style="font-weight:bold;font-size:12px;"> : DOĞU EKSPRESİ </label>
<label id="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:j_idt81" class="ui-outputlabel" style="font-weight:bold;font-size:12px;"> : TURİSTİK DOĞU EKS.</label>

Using Python Selenium to locate the page element by id:

def get_train_type(driver, row_no):
    """
    :param driver: Selenium driver object set up
    :param row_no: row number in the table (converted to a string below)
    :return: text of the train type label, or None if the row does not exist
    """
    row_no = str(row_no)
    train_type_id = "mainTabView:gidisSeferTablosu:{}:seferBilgileriDataList:0:j_idt81".format(row_no)
    mainline_id = "mainTabView:gidisSeferTablosu:{}:seferBilgileriDataList:0:j_idt80".format(row_no)
    try:
        mainline_id_row = driver.find_element(By.ID, mainline_id)
        print('mainline_id_row label text {}'.format(mainline_id_row.text))
        train_type_id_row = driver.find_element(By.ID, train_type_id)
        print('train_type_row label text {}'.format(train_type_id_row.text))
        return train_type_id_row.text
    except NoSuchElementException:
        # one of the labels is missing for this row
        return None
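
The label text shown in the HTML above starts with a colon and some whitespace, so it can help to tidy it before deciding whether a row is the Tourist train. The following is a minimal sketch working on plain strings; clean_train_label and tourist_row_numbers are hypothetical helper names, and the check assumes that the Tourist train label always begins with ‘TURİSTİK’, as in the example above.

```python
def clean_train_label(label_text):
    """Strip surrounding whitespace and the leading ':' from a train type label."""
    return label_text.strip().lstrip(":").strip()


def tourist_row_numbers(label_texts):
    """Return the row indices whose cleaned label starts with 'TURİSTİK'."""
    return [
        row_no
        for row_no, label_text in enumerate(label_texts)
        if clean_train_label(label_text).startswith("TURİSTİK")
    ]


# label texts as they appear in the two rows shown above
labels = [" : DOĞU EKSPRESİ ", " : TURİSTİK DOĞU EKS."]
print(tourist_row_numbers(labels))
```

The row numbers returned are the ones worth passing on to the berth availability check.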

6.6 Locate Berth Availability Text Elements

def check_berth_availability(driver, row_no):
    """
    function to check the berth availability for a given row of the results table
    :return: berth availability text, or None if the element does not exist
    """
    row_no = str(row_no)
    # row_no string is inserted into berth_availability_id via {}
    berth_availability_id = "mainTabView:gidisSeferTablosu:{}:j_idt109:0:somVagonTipiGidis1_label".format(row_no)
    try:
        berth_avail_id_row = driver.find_element(By.ID, berth_availability_id)
        berth_availability_text = berth_avail_id_row.text
        print('berth availability {}'.format(berth_availability_text))
        return berth_availability_text
    except NoSuchElementException:
        return None

7. Conclusion

This tutorial has covered the automation of user interaction with web pages via Python Selenium. Further steps following this tutorial could be as follows:

  • Scheduling the running of the Python checking script on a regular basis, via Windows (using a Windows Batch file);
  • Sending an email with the summary information which has been gathered by the script
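
As a starting point for the email step, the summary can be assembled with the standard library’s email.message.EmailMessage before it is handed to a mail server. This is only a sketch: build_summary_email is a hypothetical helper, the addresses and the berth text are placeholders, and actually sending the message (for example via smtplib) is not shown, as it depends on your mail provider.

```python
from email.message import EmailMessage


def build_summary_email(results, sender, recipient):
    """
    Build (but do not send) a summary email.
    results: list of (date_string, berth_text) tuples gathered by the script.
    """
    message = EmailMessage()
    message["Subject"] = "Touristic Dogu Express availability: {} date(s)".format(len(results))
    message["From"] = sender
    message["To"] = recipient
    lines = ["{}: {}".format(date_string, berth_text) for date_string, berth_text in results]
    message.set_content("\n".join(lines) if lines else "No availability found.")
    return message


# placeholder data and addresses, for illustration only
summary = build_summary_email(
    [("12.02.2024", "example berth text")], "me@example.com", "me@example.com"
)
print(summary["Subject"])
```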

References

1 https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
2 https://stackoverflow.com/questions/69690674/how-to-override-the-default-input-field-value-using-selenium-and-python