Advanced Web Scraping With Selenium: Store Deals Data Extraction
Continuing our Advanced Web Scraping With Selenium series, in this article, we will improve our scraping skills again!
Rakuten is a popular online marketplace that offers a vast array of products and deals from various stores. Our goal is to use Selenium to scrape store deals data and handle infinite scrolling pages.
The data we’ll extract are:
- store names
- images
- descriptions
- timestamps (generated on the Python side)
Step 1: As always, start by opening the Rakuten website so we can analyze its structure.
There are a few things to note here. When the page first loads, the items are not all rendered at once; we have to scroll down to load the rest. We can tell by looking at the scrollbar, whose thumb keeps shrinking as more items are appended to the page. We will handle this lazy loading later.
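If you want to confirm the lazy loading programmatically, a quick check (a rough sketch, assuming a driver is already open on the page as set up in the later steps) is to compare the page height before and after scrolling:
before = driver.execute_script("return document.body.scrollHeight")
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)  # give the newly loaded cards a moment to render
after = driver.execute_script("return document.body.scrollHeight")
print(before, after)  # 'after' should be larger if more items were loaded
With the lazy loading confirmed, these are the XPaths and CSS selectors we will rely on: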
- Container Xpath: //*[@id="main-content"]/div[4]/div[9]/div/div/div[2]/div[1]
- Main Content Xpath: //*[@id="main-content"]
- Store Name Selector: a div:nth-child(2) div:nth-child(1)
- Image URL Selector: a div:nth-child(1) img
- Description Selector: a div:nth-child(2) div:nth-child(2) div:nth-child(2) span:nth-child(1)
:nth-child(1) is a CSS pseudo-class that selects an element by its position among its siblings, using a 1-based index. For instance, the description selector is built as a stack of child selections:
Selector Stack: a div:nth-child(2) div:nth-child(2) div:nth-child(2) span:nth-child(1)
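To make the 1-based indexing concrete, here is a minimal sketch (the HTML is a made-up fragment loaded from a data URL, not Rakuten's real markup) that selects the second div inside an a tag:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# A tiny throwaway page: an <a> tag with two <div> children
driver.get('data:text/html,<a><div>first</div><div>second</div></a>')
# div:nth-child(2) matches the <div> that is its parent's second child
print(driver.find_element(By.CSS_SELECTOR, 'a div:nth-child(2)').text)  # -> second
driver.quit()
Reading the real selector stack the same way: under the card's a tag, find a div that is its parent's second child, then inside it another second-child div, then another, and finally the span that is the first child.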
Step 2: Import the necessary libraries and define the Chrome options configuration.
import datetime
import time
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.actions.wheel_input import ScrollOrigin
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def configure_chrome_options():
    options = webdriver.ChromeOptions()
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_experimental_option('detach', True)
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_argument("--disable-extensions")
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-infobars')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--disable-browser-side-navigation')
    options.add_argument('--disable-gpu')
    options.add_argument('--blink-settings=imagesEnabled=false')
    options.add_argument(
        'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    )
    return options
Step 3: Define the driver initialization.
def initialize_driver(options):
    driver = webdriver.Chrome(options=options)
    driver.maximize_window()
    return driver
Step 4: Define the scrape_data function.
def scrape_data(driver, container):
    store_name_selector = 'a div:nth-child(2) div:nth-child(1)'
    image_url_selector = 'a div:nth-child(1) img'
    description_selector = 'a div:nth-child(2) div:nth-child(2) div:nth-child(2) span:nth-child(1)'
    data = []
    last_element = None
    while True:
        try:
            items = container.find_elements(By.CSS_SELECTOR, 'div.css-0')
            if last_element:
                # Skip everything up to (and including) the last item already scraped
                items = items[items.index(last_element) + 1:]
            for item in items:
                store_name = item.find_element(By.CSS_SELECTOR, store_name_selector).text
                image_url = item.find_element(By.CSS_SELECTOR, image_url_selector).get_attribute('src')
                description = item.find_element(By.CSS_SELECTOR, description_selector).text
                deals = {
                    'Store Name': store_name,
                    'Image URL': image_url,
                    'Description': description,
                    'Timestamp': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                }
                print(deals)
                data.append(deals)
            # An IndexError here means no new items were loaded, which ends the loop
            current_last_element = items[-1]
            print(f'Total Scraped: {len(data)}')
            last_element = current_last_element
            print('Scrolling to last element..')
            driver.execute_script("arguments[0].scrollIntoView(true);", last_element)
            time.sleep(1.5)
        except (NoSuchElementException, IndexError, Exception) as e:
            print(e)
            break
    return data
Using last_element is a common technique for handling duplicated elements and ensuring that we don't scrape the same data multiple times across consecutive iterations. Let's break down how it works:
last_element = None
Initially, last_element is set to None, indicating that there is no previous element.
if last_element:
    items = items[items.index(last_element) + 1:]
This condition checks whether last_element exists (i.e., it is not None). If it does, there was a previous iteration and we want to avoid scraping the same data again, so the items list is adjusted to start from the element right after last_element.
current_last_element = items[-1]
last_element = current_last_element
After processing the current batch of items, the last element of this iteration (current_last_element) is stored in last_element. This becomes the reference point for the next iteration.
This approach ensures that you only scrape new data in each iteration, preventing redundancy and duplication. By keeping track of the last element processed, you can efficiently handle scenarios where elements might be repeated due to dynamic loading or paginated content.
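To see the slicing in isolation, here is a tiny browser-free sketch of the same idea using a plain Python list (the store names are made up):
loaded = ['store A', 'store B', 'store C']
last_element = None

def new_items(items, last):
    # Keep only the items that come after the previously processed last element
    if last:
        return items[items.index(last) + 1:]
    return items

print(new_items(loaded, last_element))   # first pass: all three stores
last_element = loaded[-1]                # remember 'store C'

loaded += ['store D', 'store E']         # simulate two lazy-loaded items
print(new_items(loaded, last_element))   # second pass: only the two new stores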
Step 5: Define the main function.
def main():
    url = 'https://www.rakuten.com/stores/all'
    container_xpath = '//*[@id="main-content"]/div[4]/div[9]/div/div/div[2]/div[1]'
    main_content_xpath = '//*[@id="main-content"]'
    csv_path = 'rakuten.csv'
    options = configure_chrome_options()
    driver = initialize_driver(options)
    try:
        driver.get(url)
        WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, main_content_xpath)))
        scroll_element = driver.find_element(By.XPATH, main_content_xpath)
        scroll_from_element = ScrollOrigin.from_element(scroll_element)
        ActionChains(driver).scroll_from_origin(scroll_from_element, 0, 3000).perform()
        WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, container_xpath)))
        container = driver.find_element(By.XPATH, container_xpath)
        driver.execute_script("arguments[0].scrollIntoView(true);", container)
        data = scrape_data(driver, container)
        df = pd.DataFrame(data)
        df.to_csv(csv_path, index=False)
    finally:
        driver.quit()

if __name__ == '__main__':
    main()
Let’s break down the scenario:
Navigating to the Rakuten Website and Setting Up Scrolling:
driver.get(url)
WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, main_content_xpath)))
scroll_element = driver.find_element(By.XPATH, main_content_xpath)
scroll_from_element = ScrollOrigin.from_element(scroll_element)
- Navigate to the provided URL using driver.get(url).
- Wait for the main content area to become visible using WebDriverWait.
- Locate the main content element, which will serve as the scroll origin.
Performing Scrolling and Waiting for Container Visibility:
ActionChains(driver).scroll_from_origin(scroll_from_element, 0, 3000).perform()
WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, container_xpath)))
- Using ActionChains, scroll down from the main content element to trigger loading of more content.
- Then wait for the container element (the holder of the store cards) to become visible.
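If a single 3000-pixel scroll does not reliably trigger the container on your machine, a small variation (just a sketch, reusing the scroll_element and container_xpath defined in main()) is to scroll in smaller steps and retry until the container appears:
from selenium.common.exceptions import TimeoutException

origin = ScrollOrigin.from_element(scroll_element)
for _ in range(5):
    ActionChains(driver).scroll_from_origin(origin, 0, 1000).perform()
    try:
        WebDriverWait(driver, 5).until(
            EC.visibility_of_element_located((By.XPATH, container_xpath)))
        break  # the container is visible, stop scrolling
    except TimeoutException:
        continue  # not there yet, scroll a bit further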
Locating Container Element and Scrolling to It:
container = driver.find_element(By.XPATH, container_xpath)
driver.execute_script("arguments[0].scrollIntoView(true);", container)
- Locate the container element.
- Run JavaScript's scrollIntoView on it to make sure the container is in view before scraping starts.
Scraping Data, Creating DataFrame, and Saving to CSV:
data = scrape_data(driver, container)
df = pd.DataFrame(data)
df.to_csv(csv_path, index=False)
- The scrape_data function is called with the WebDriver and container as arguments to extract the store information.
- The extracted data is converted into a pandas DataFrame (df).
- The DataFrame is saved to a CSV file (csv_path, i.e. rakuten.csv).
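As a quick sanity check after a run, you can read the CSV back and inspect it (a small sketch, assuming the file was written to the working directory):
import pandas as pd

df = pd.read_csv('rakuten.csv')
print(df.columns.tolist())  # ['Store Name', 'Image URL', 'Description', 'Timestamp']
print(df.head())
print(f'{len(df)} deals scraped')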
Quitting the Driver Within a finally Block:
finally:
    driver.quit()
The finally block guarantees the browser is closed even if something fails partway through the run.
The Full Code:
import datetime
import time
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.actions.wheel_input import ScrollOrigin
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def configure_chrome_options():
    options = webdriver.ChromeOptions()
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_experimental_option('detach', True)
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_argument("--disable-extensions")
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-infobars')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--disable-browser-side-navigation')
    options.add_argument('--disable-gpu')
    options.add_argument('--blink-settings=imagesEnabled=false')
    options.add_argument(
        'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    )
    return options

def initialize_driver(options):
    driver = webdriver.Chrome(options=options)
    driver.maximize_window()
    return driver

def scrape_data(driver, container):
    store_name_selector = 'a div:nth-child(2) div:nth-child(1)'
    image_url_selector = 'a div:nth-child(1) img'
    description_selector = 'a div:nth-child(2) div:nth-child(2) div:nth-child(2) span:nth-child(1)'
    data = []
    last_element = None
    while True:
        try:
            items = container.find_elements(By.CSS_SELECTOR, 'div.css-0')
            if last_element:
                # Skip everything up to (and including) the last item already scraped
                items = items[items.index(last_element) + 1:]
            for item in items:
                store_name = item.find_element(By.CSS_SELECTOR, store_name_selector).text
                image_url = item.find_element(By.CSS_SELECTOR, image_url_selector).get_attribute('src')
                description = item.find_element(By.CSS_SELECTOR, description_selector).text
                deals = {
                    'Store Name': store_name,
                    'Image URL': image_url,
                    'Description': description,
                    'Timestamp': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                }
                print(deals)
                data.append(deals)
            # An IndexError here means no new items were loaded, which ends the loop
            current_last_element = items[-1]
            print(f'Total Scraped: {len(data)}')
            last_element = current_last_element
            print('Scrolling to last element..')
            driver.execute_script("arguments[0].scrollIntoView(true);", last_element)
            time.sleep(1.5)
        except (NoSuchElementException, IndexError, Exception) as e:
            print(e)
            break
    return data

def main():
    url = 'https://www.rakuten.com/stores/all'
    container_xpath = '//*[@id="main-content"]/div[4]/div[9]/div/div/div[2]/div[1]'
    main_content_xpath = '//*[@id="main-content"]'
    csv_path = 'rakuten.csv'
    options = configure_chrome_options()
    driver = initialize_driver(options)
    try:
        driver.get(url)
        WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, main_content_xpath)))
        scroll_element = driver.find_element(By.XPATH, main_content_xpath)
        scroll_from_element = ScrollOrigin.from_element(scroll_element)
        ActionChains(driver).scroll_from_origin(scroll_from_element, 0, 3000).perform()
        WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, container_xpath)))
        container = driver.find_element(By.XPATH, container_xpath)
        driver.execute_script("arguments[0].scrollIntoView(true);", container)
        data = scrape_data(driver, container)
        df = pd.DataFrame(data)
        df.to_csv(csv_path, index=False)
    finally:
        driver.quit()

if __name__ == '__main__':
    main()
In this journey through advanced web scraping, we’ve explored the potent capabilities of Selenium for data extraction. The code presented provides a robust foundation, showcasing the strategic use of Chrome options, WebDriver initialization, and intelligent scrolling.
The incorporation of last_element demonstrates a smart approach to handling duplicate elements, ensuring efficient and precise data extraction. Adapt and apply this knowledge to your own projects!
Happy scraping!