A Beginner’s Guide to Web Scraping with Selenium in Python: Extracting eBay Products

Muhammad Naufal Hanif
10 min read · Jan 12, 2024


Web scraping is a powerful technique that allows us to extract data from websites. In this article, we will walk you through the process of web scraping using Selenium in Python. Selenium is a popular tool for web automation and testing, but it can also be used effectively for web scraping. By the end of this guide, you will have a good understanding of how to use Selenium for web scraping and be able to extract data from websites on your own.

Requirements:
Before we get started, ensure you have the following:

  • Latest PyCharm Community / Professional IDE
  • Latest Chrome Browser installed

Why use Chrome instead of another browser?

  • Dominating market share: Chrome dominates the web browser market, which translates to greater compatibility with websites and less frequent issues with dynamic content or rendering.
  • Extensive extension library: Chrome offers a vast selection of extensions, including dedicated scraping tools, that can significantly enhance your scraping functionalities.
  • Mature and active community: Chrome’s large user base and developer community lead to abundant documentation, tutorials, and troubleshooting resources.
  • Has headless mode: Chrome’s headless mode runs efficiently without a GUI, ideal for server-side automation or scraping large datasets.

Step 1: Setting Up Project & Installing Selenium:

To begin, open PyCharm, create a new project, choose Pure Python, and click Create.

Next, create a new Python file called app.

Now let’s install the necessary packages. Open the terminal (bottom-left side of PyCharm) and enter the following command:

pip install selenium

Step 2: Setting Up the Selenium Web Driver:
Luckily for us, since Selenium 4.6 we no longer have to install the web driver manually, because all of that is handled automatically by Selenium Manager.

The web driver itself is like a remote control for your web browser. It lets you control and automate every aspect of your browser’s interaction with a website, just as a human user would. This empowers you to perform various tasks, including the following (see the sketch after the list):

  • Clicking on buttons and links
  • Filling out forms
  • Scrolling through pages
  • Extracting data from web elements
  • Running tests on web applications
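
To give you a feel for these tasks, here is a minimal, self-contained sketch (the URL and selectors below are placeholders for illustration only, not part of this tutorial’s target site):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com')  # Placeholder URL, assumed for illustration

driver.find_element(By.CSS_SELECTOR, 'a.nav-link').click()  # Click a link (hypothetical selector)
driver.find_element(By.NAME, 'q').send_keys('laptop')  # Fill out a form field (hypothetical name)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')  # Scroll to the bottom of the page
heading = driver.find_element(By.CSS_SELECTOR, 'h1').text  # Extract text from a web element
driver.quit()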

Now, before we set up the web driver, let’s get acquainted with Chrome options first.

Chrome Options is used to customize and configure the behaviour of the Chrome browser when launching it through ChromeDriver. Think of it as fine-tuning your Chrome experience for automation.

Here are some things Chrome Options can do:

  • Run Chrome in headless mode: This runs Chrome without the graphical user interface (GUI), making it useful for server-side automation or situations where you don’t need to see the browser window.
  • Block image loading: We can disable the loading of images from websites, which speeds up page loads during scraping.
  • Set up the disk cache: We can also configure the disk cache in Selenium to enhance performance and reduce network usage during our web scraping tasks.

There are many more configurations you can apply with Chrome Options. Sadly, the official Selenium documentation on these options is sparse, but don’t worry: plenty of communities can help you, such as stackoverflow.com, selenium-python.readthedocs.io, and many more.
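
For instance, here are a few other commonly used options (shown purely as illustration; these are standard Chrome command-line switches, not settings this tutorial depends on):

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--window-size=1920,1080')  # Fix the window size (handy in headless mode)
chrome_options.add_argument('--incognito')  # Launch Chrome in incognito mode
chrome_options.add_argument('--disable-extensions')  # Start Chrome without any extensions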

Okay, let’s just start our journey by setting up our web driver.

from selenium import webdriver 

if __name__ == '__main__':
    chrome_options = webdriver.ChromeOptions()  # Create Chrome options
    chrome_options.add_experimental_option('detach', True)  # Keep the browser open after the script ends
    chrome_options.add_argument('--disk-cache-size=0')  # Disable the disk cache
    chrome_options.add_argument('--blink-settings=imagesEnabled=false')  # Disable image loading
    chrome_options.add_argument('--headless')  # Enable headless mode
    driver = webdriver.Chrome(options=chrome_options)

Next, set our target URL. We will target an e-commerce website, eBay, and gather the following information about new laptop products with 16 GB of RAM:

  • Product Link
  • Product Title
  • Product Price
  • Product Seller
  • Product Image URL

URL: https://www.ebay.com/sch/i.html?_from=R40&_nkw=laptop&_sacat=0&LH_ItemCondition=1000&RAM%2520Size=16%2520GB&_dcat=177&_ipg=240&rt=nc&LH_BIN=1

Inspect the page elements by right-clicking on the page and choosing Inspect from the context menu.

Find the container of the product list by hovering over elements in the inspector; the page highlights the region each element represents. Here we have ‘ul srp-results srp-list clearfix’, the element that holds all the products.

Why find the container first? It ensures that we focus only on the product list: whenever we search for a specific element, such as the product selector ‘div.s-item__wrapper clearfix’, Selenium will only look within the container’s scope.

Once we’re done with the product selector (‘div.s-item__wrapper clearfix’), we have to find the detail elements to extract (title, price, etc.). The easiest and fastest approach is to inspect each piece of data directly.

These are all the remaining selectors:

SELECTORS:

  • Container Selector: ul.srp-results.srp-list.clearfix
  • Product Selector: div.s-item__wrapper.clearfix
  • Link Selector: a.s-item__link
  • Title Selector: div.s-item__title
  • Image Selector: div.s-item__image-wrapper.image-treatment
  • Price Selector: div.s-item__detail.s-item__detail--primary
  • Seller Selector: span.s-item__seller-info-text

Okay, now let's execute our scenario:

  • Import the necessary modules.
...
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
  • Define the URL and all the selectors:
URL = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=laptop&_sacat=0&rt=nc&LH_ItemCondition=1000&RAM%2520Size=16%2520GB&_dcat=177&_ipg=240'

container_selector = 'ul.srp-results.srp-list.clearfix'
product_selector = 'div.s-item__wrapper.clearfix'
link_selector = 'a.s-item__link'
title_selector = 'div.s-item__title'
price_selector = 'div.s-item__detail.s-item__detail--primary'
seller_selector = 'span.s-item__seller-info-text'
image_selector = 'div.s-item__image-wrapper.image-treatment'


if __name__ == '__main__':
...
  • Navigate to the URL and wait for the container
if __name__ == '__main__':
    ...
    # Open the browser with Selenium and navigate to the URL
    driver.get(URL)
    # Wait until the container is present in the DOM and visible, with a 15-second timeout
    WebDriverWait(driver, 15).until(EC.visibility_of_element_located((By.CSS_SELECTOR, container_selector)))

    # Quit the browser
    driver.quit()

WebDriverWait: Used to instruct Selenium to wait for certain conditions to be met before proceeding with the script.

15: Timeout value in seconds, used to wait for the condition to become true before throwing a TimeoutException.

Until: It waits until the condition is met or the timeout expires.

EC: This is an alias for the expected_conditions module in Selenium, which provides various expected conditions that you can use with WebDriverWait.

Visibility Of Element Located: This is a specific expected condition that checks that an element is present in the DOM and visible on the page.

By: A class that represents a mechanism for locating elements on a web page. example: By.ID, By.CSS_SELECTOR, By.XPATH, By.CLASS_NAME, etc.

Quit: Used to completely terminate the WebDriver session and close all associated browser windows, tabs, and processes.
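
One practical note: if the container never appears within the timeout, the wait raises a TimeoutException and the script crashes. Here is a minimal sketch of handling it gracefully (the try/except wrapper is our addition, not part of the original script):

from selenium.common.exceptions import TimeoutException

try:
    # The same wait as above, wrapped so a timeout doesn't crash the script
    WebDriverWait(driver, 15).until(EC.visibility_of_element_located((By.CSS_SELECTOR, container_selector)))
except TimeoutException:
    print('Container did not appear within 15 seconds')
    driver.quit()
    raise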

  • Extract product title.
if __name__ == '__main__':
    ...
    # Find the single container element
    container = driver.find_element(By.CSS_SELECTOR, container_selector)
    # Find the multiple product elements within the container
    products = container.find_elements(By.CSS_SELECTOR, product_selector)
    # Extract the titles with a Python list comprehension
    titles = [item.find_element(By.CSS_SELECTOR, title_selector).text for item in products]
    # Example output: ['title1', 'title2']
    print(titles)
    ...

Find Element: Used to find a single element.

By.CSS_SELECTOR: Using a CSS selector to identify the element.

.text: Access the text content of each found title element.

container.find_elements: Used to locate all elements within the container that match the specified locator strategy and value.
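
One caveat: find_element raises a NoSuchElementException when nothing matches, so a single product missing its title would crash the entire list comprehension. If you would rather record a blank value and move on, here is a small defensive sketch (the safe_text helper is our own addition, not from this tutorial):

from selenium.common.exceptions import NoSuchElementException

def safe_text(item, selector):
    # Return the element's text, or an empty string if the element is missing
    try:
        return item.find_element(By.CSS_SELECTOR, selector).text
    except NoSuchElementException:
        return ''

titles = [safe_text(item, title_selector) for item in products]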

  • Extract product price.

If we take a look at the price_selector, there’s an element inside, which is ‘span.s-item__price’.

Because the price text lives inside that inner element, we have to change our price_selector to this:

price_selector = 'div.s-item__detail.s-item__detail--primary span.s-item__price'

Yes, in Selenium we can nest CSS selectors as in the example above.

It’s actually fine to keep the price_selector as before, but then we have to make sure the attribute value (in this case, the class attribute of the div) is unique; in other words, that no other element uses the same name. If one did, the price data would be inaccurate.
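
A quick way to check whether a selector is unique within its scope is to count how many elements it matches. Here is a small sanity-check sketch using the variables already defined in our script (this check is our addition, not part of the final code):

# Compare how many elements the broad and the nested selector match inside the first product
broad = products[0].find_elements(By.CSS_SELECTOR, 'div.s-item__detail.s-item__detail--primary')
narrow = products[0].find_elements(By.CSS_SELECTOR, 'div.s-item__detail.s-item__detail--primary span.s-item__price')
print(len(broad), len(narrow))  # If the broad selector matches more than one element, prefer the nested one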

if __name__ == '__main__':
    ...
    prices = [item.find_element(By.CSS_SELECTOR, price_selector).text for item in products]
    print(titles)
    print(prices)
    ...
  • Extract product seller.
if __name__ == '__main__':
    ...
    sellers = [item.find_element(By.CSS_SELECTOR, seller_selector).text for item in products]
    print(titles)
    print(prices)
    print(sellers)
    ...
  • Extract product link.
if __name__ == '__main__':
    ...
    links = [item.find_element(By.CSS_SELECTOR, link_selector).get_attribute('href') for item in products]
    print(titles)
    print(prices)
    print(sellers)
    print(links)
    ...

get_attribute: Used to retrieve the value of a specific attribute of a web element.

  • Extract product image URL.

Let’s take a look at the image element again.

Indeed, there’s an img element inside the div, so let’s change the CSS selector.

image_selector = 'div.s-item__image-wrapper.image-treatment img'

There’s a small difference here: we’re not chaining the image_selector to a class attribute, because the img tag has no class attribute, so we can simply target the tag itself.

if __name__ == '__main__':
    ...
    image_urls = [item.find_element(By.CSS_SELECTOR, image_selector).get_attribute('src') for item in products]
    print(titles)
    print(prices)
    print(sellers)
    print(links)
    print(image_urls)
    ...

Congratulations! You have retrieved all the data through the console!

Level Up: Exporting Your Data to CSV with Pandas

You’ve scraped the web and extracted all the information. Now, to level up your skills, we will export the scraped data to CSV with the Pandas library. Let’s install and import the module:

pip install pandas
import pandas as pd

if __name__ == '__main__':
    ...
    dataFrame = pd.DataFrame(
        # Each key: value pair becomes a column with its values
        {
            'Product': titles,
            'Price': prices,
            'Seller': sellers,
            'Link': links,
            'Image URL': image_urls
        }
    )
    # Start the index from 1 instead of 0
    dataFrame.index += 1
    # Export to CSV in 'write' mode
    dataFrame.to_csv('laptop_products.csv', mode='w')
    ...

pd.DataFrame({…}): Creates a pandas DataFrame, a powerful data structure for storing and manipulating tabular data.

dataFrame.index += 1: This line modifies the default indices (row labels) of the DataFrame. It adds 1 to each index, starting them from 1 instead of 0.

dataFrame.to_csv('laptop_products.csv', mode='w'): This line saves the DataFrame as a CSV (comma-separated values) file named “laptop_products.csv”; mode 'w' creates a new file if it doesn’t exist or overwrites an existing one.
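
To verify the export, you can read the file back with pandas (index_col=0 restores the 1-based index we set above):

df = pd.read_csv('laptop_products.csv', index_col=0)
print(df.head())  # Show the first five rows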

Excellent! You now know how to export the scraped data to a CSV file.

The Complete Code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
'''
Selenium Website Scraping
Target: eBay e-commerce website

INFORMATION TO EXTRACT:
- Product Title
- Product Price
- Product Seller
- Product Image URL
- Product Link
'''

URL = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=laptop&_sacat=0&_fsrp=1&RAM%2520Size=16%2520GB&_dcat=177&LH_ItemCondition=1000&_ipg=240&rt=nc&LH_BIN=1'

container_selector = 'ul.srp-results.srp-list.clearfix'
product_selector = 'div.s-item__wrapper.clearfix'
link_selector = 'a.s-item__link'
title_selector = 'div.s-item__title'
price_selector = 'div.s-item__detail.s-item__detail--primary span.s-item__price'
seller_selector = 'span.s-item__seller-info-text'
image_selector = 'div.s-item__image-wrapper.image-treatment img'

if __name__ == '__main__':
    chrome_options = webdriver.ChromeOptions()  # Create Chrome options
    chrome_options.add_experimental_option('detach', True)  # Keep the browser open after the script ends
    chrome_options.add_argument('--disk-cache-size=0')  # Disable the disk cache
    chrome_options.add_argument('--blink-settings=imagesEnabled=false')  # Disable image loading
    chrome_options.add_argument('--headless')  # Enable headless mode

    driver = webdriver.Chrome(options=chrome_options)
    driver.get(URL)  # Open the browser and navigate to the URL
    # Wait until the container is present in the DOM and visible, with a 15-second timeout
    WebDriverWait(driver, 15).until(EC.visibility_of_element_located((By.CSS_SELECTOR, container_selector)))
    container = driver.find_element(By.CSS_SELECTOR, container_selector)
    products = container.find_elements(By.CSS_SELECTOR, product_selector)
    titles = [item.find_element(By.CSS_SELECTOR, title_selector).text for item in products]
    prices = [item.find_element(By.CSS_SELECTOR, price_selector).text for item in products]
    sellers = [item.find_element(By.CSS_SELECTOR, seller_selector).text for item in products]
    links = [item.find_element(By.CSS_SELECTOR, link_selector).get_attribute('href') for item in products]
    image_urls = [item.find_element(By.CSS_SELECTOR, image_selector).get_attribute('src') for item in products]

    print(titles)
    print(prices)
    print(sellers)
    print(links)
    print(image_urls)

    dataFrame = pd.DataFrame(
        {
            'Product': titles,
            'Price': prices,
            'Seller': sellers,
            'Link': links,
            'Image URL': image_urls
        }
    )
    dataFrame.index += 1
    dataFrame.to_csv('laptop_products.csv', mode='w')
    driver.quit()

Remember, this guide is just the first step on your web scraping journey. Keep learning, keep exploring, and keep challenging yourself. There’s a whole world of data out there waiting to be unlocked! You can explore more of the By class, such as By.ID, By.XPATH, By.CLASS_NAME, etc., and you’ll be amazed at what you can achieve!

I hope this guide has been helpful and inspiring! If you have any questions or suggestions, feel free to leave a comment below. And remember, the web scraping community is always happy to help, so don’t hesitate to reach out. But stay tuned, because we’re just getting started! My upcoming tutorials will tackle advanced techniques like handling page scrolling, navigating complex websites, and more. Get ready to level up your data scraping skills!

Feel free to contact me!
Email: naufalmng@gmail.com
Github: github.com/naufalmng
Linkedin: linkedin.com/in/mnaufalhanif/
Instagram: instagram.com/naufalh.apk
