Advanced Web Scraping with Selenium: Extracting and Organizing Upcoming Sales Data

Muhammad Naufal Hanif
8 min read · Jan 22, 2024

What’s up folks! In this article, we embark on a journey to elevate your web scraping prowess by extracting upcoming sales information using Selenium and Python, as well as organizing it with Pandas. Following our previous adventure in A Beginner’s Guide to Web Scraping with Selenium, where we delved into eBay product extraction, we now set our sights on the realm of upcoming sales.

Without further ado, let’s get started!

Modules Requirements:

  • Selenium
  • Pandas
  • XlsxWriter (used by Pandas as the Excel writer engine)
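
You can install everything with pip; the exact command may vary depending on your environment:

pip install selenium pandas xlsxwriter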

Section 1: Analyze The Website

Target: The Judicial Sales Corporation

Scraping Challenges:

  • We will keep only rows for 'Cook' County.
  • We will create two spreadsheets: one sorted by auction date and one sorted by zip code.

Data to Extract:
- Sale Date
- Sale Time
- File Number
- Case Number
- Firm Name
- Address
- City
- Zip Code
- Opening Bid
- Required % Down
- Sale Amount
- Continuance
- Sold To

One of the keys to mastering pagination handling is to analyze the page's 'Next' and 'Previous' buttons. A disabled 'Previous' button indicates you're on the first page; conversely, a disabled 'Next' button means you've reached the last page. Analyzing these states helps you optimize your navigation logic.

Locators:

  • Table XPath: //*[@id="basic-datatables"]
  • Next Button XPath: //*[@id="basic-datatables_paginate"]/ul/li[7]/a
  • Next Button selector at the end of the pages: li.next.disabled

If we inspect the pagination element in the browser's developer tools, we can see that on the last page the class name of the next button changes from 'next' to 'next disabled', which means we can use this as our end-of-page checker later.
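
As a quick sanity check, a minimal sketch like the one below (assuming the same pagination markup, i.e. a 'Next' link wrapped in an li inside the #basic-datatables_paginate container) prints the wrapper's class attribute so you can confirm that 'disabled' shows up once you reach the last page:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://tjsc.com/Sales/UpcomingSales')

# The <li> wrapping the 'Next' link; on the last page its class should also contain 'disabled'
next_li = driver.find_element(By.CSS_SELECTOR, '#basic-datatables_paginate li.next')
print(next_li.get_attribute('class'))

driver.quit()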

What is XPath?

XPath (XML Path Language) is a locator strategy used to navigate through elements in HTML or XML documents. In our case, the expression //*[@id="basic-datatables_paginate"]/ul/li[7]/a selects any element (*) in the document with an id attribute equal to "basic-datatables_paginate", then navigates to its ul child, followed by the 7th li child, and finally selects the a (anchor) element.
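
To make those navigation steps concrete, the short sketch below (assuming driver is an already-initialized webdriver.Chrome() that has loaded the page) reaches the same anchor element step by step with separate find_element calls, mirroring what the single XPath expression does in one go:

from selenium.webdriver.common.by import By

# driver is assumed to be an already-initialized WebDriver that has loaded the page
paginate = driver.find_element(By.ID, 'basic-datatables_paginate')  # //*[@id="basic-datatables_paginate"]
ul = paginate.find_element(By.XPATH, './ul')                        # /ul
seventh_li = ul.find_element(By.XPATH, './li[7]')                   # /li[7] (XPath indices are 1-based)
next_link = seventh_li.find_element(By.XPATH, './a')                # /a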

Section 2: Import necessary libraries and define configurations

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import logging

# Configuration
URL = 'https://tjsc.com/Sales/UpcomingSales'
TABLE_XPATH = '//*[@id="basic-datatables"]'
NEXT_PAGE_XPATH = '//*[@id="basic-datatables_paginate"]/ul/li[7]/a'
DISABLED_NEXT_BUTTON_SELECTOR = 'li.next.disabled'
# Configure logging to run on INFO Level
logging.basicConfig(level=logging.INFO)

We will use the logging module to log all of our scraping activity.

Why would we use logging instead of regular print?

  • Control Levels: Logging has different levels (DEBUG, INFO, WARNING, ERROR, CRITICAL) for controlling output verbosity.
  • Flexibility: Logging allows redirecting logs to files, consoles, or external services.
  • Timestamps: Logs can include timestamps for tracking events (see the example after this list).
  • Severity Levels: Log levels categorize the severity of messages.
  • Structured Output: Logging supports structured log output for easier analysis.
  • Error Handling: Provides a standardized way to report errors with stack traces.
  • Consistency: Ensures consistent formatting of log messages.
  • Integration: Integrates well with libraries and frameworks using the Python logging module.
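
For example, a minimal tweak to the basicConfig call above (optional, and not required by the rest of the script) adds a timestamp and the level to every message:

import logging

# Include a timestamp and the log level in every message
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')

logging.info('Scraping page 1..')
# Example output: 2024-01-22 10:15:03,412 INFO Scraping page 1..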

Section 3: Set up the WebDriver and required variables

def main():
    # Set up Chrome WebDriver options
    options = webdriver.ChromeOptions()
    # Disables "Chrome is being controlled by automated test software" message
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_argument('--disable-blink-features=AutomationControlled')

    options.add_argument("--disable-extensions")  # Disables extensions
    options.add_argument('--no-sandbox')  # Bypass OS security model
    options.add_argument('--disable-infobars')  # Disables infobars
    options.add_argument('--disable-dev-shm-usage')  # Overcome limited resource problems
    options.add_argument('--disable-browser-side-navigation')  # Disables browser side navigation
    options.add_argument('--disable-gpu')  # Disables GPU hardware acceleration
    options.add_argument('--blink-settings=imagesEnabled=false')  # Disables images
    options.add_argument('--headless')  # Runs in headless mode
    options.add_argument(
        'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    )  # Sets user agent to mimic a real browser

    # Initialize Chrome WebDriver
    driver = webdriver.Chrome(options=options)
    driver.maximize_window()
    driver.get(URL)

    # Initialize counter and data variables
    counter = 1
    data = []
  • counter: used to count pages.
  • data: used to accumulate our upcoming sales data.

Section 4: Extract data and handle pagination.

Since we are handling dynamic pagination, we will use a while loop.

main():

def main():
    ...
    while True:
        try:
            # Log page scraping progress
            logging.info(f'Scraping page {counter}..')

            # Wait for the table to be visible
            WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, TABLE_XPATH)))
            table = driver.find_element(By.XPATH, TABLE_XPATH)

            # Scrape data from the current page
            scrape_data(table, data)

            # Check for end of pagination
            end_of_page = handle_pagination(driver, counter, data)
            if end_of_page:
                break

            # Increment page counter
            counter += 1

        except (NoSuchElementException, TimeoutException) as e:
            # Handle errors and exit the script
            logging.error(f"An error occurred: {e}")
            driver.quit()
            break
    ...


if __name__ == '__main__':
    main()
  • NoSuchElementException: used to catch the error raised when an element cannot be found.
  • TimeoutException: used to catch the error raised when an EC (expected condition) is not fulfilled within the wait time.

scrape_data(table, data):

def scrape_data(table, data):
    # Take only the 'Cook' county data
    table_rows = [
        row for row in table.find_elements(By.XPATH, f"{TABLE_XPATH}/tbody/tr")
        if row.find_element(By.XPATH, './td[8]').text == 'Cook'
    ]
    for table_data in table_rows:
        td = [td.text for td in table_data.find_elements(By.XPATH, './td[position() <= 14]')]
        data.append({
            "Sale Date": td[0],
            "Sale Time": td[1],
            "File Number": td[2],
            "Case Number": td[3],
            "Firm Name": td[4],
            "Address": td[5],
            "City": td[6],
            "County": td[7],
            "Zip Code": td[8],
            "Opening Bid": td[9],
            "Required % Down": td[10],
            "Sale Amount": td[11],
            "Continuance": td[12],
            "Sold To": td[13]
        })

def main():
    ...

The scrape_data function is designed to extract information from the table, keeping only rows where the value in the 8th column (td[8]) is equal to 'Cook'. This ensures that only Cook County data is considered.

Here’s a breakdown of the function:

  • It iterates through each row in the table using a list comprehension.
  • For each row, it checks whether the text in the 8th column (./td[8]) equals 'Cook', i.e. whether the County column says 'Cook'.
  • If the condition is satisfied, it extracts the text from the first 14 columns of that row (td.text for td in table_data.find_elements(By.XPATH, './td[position() <= 14]')), i.e. all of the column values we need. Note that XPath indices are 1-based, while the resulting Python list td is 0-based, which is why the County value ends up at td[7] in the dictionary.
  • The extracted data is then appended to the data list as a dictionary, with keys representing different columns such as 'Sale Date', 'Sale Time', etc.
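
As an optional hardening step (a sketch, not part of the original script), a hypothetical scrape_cook_rows variant below skips rows that don't have a full set of cells, such as the placeholder row DataTables renders when a page is empty (note that cells[7] is the County column, per the 0-based indexing mentioned above):

from selenium.webdriver.common.by import By

COLUMNS = ["Sale Date", "Sale Time", "File Number", "Case Number", "Firm Name",
           "Address", "City", "County", "Zip Code", "Opening Bid",
           "Required % Down", "Sale Amount", "Continuance", "Sold To"]

def scrape_cook_rows(table, data):
    # Hypothetical defensive variant of scrape_data
    for row in table.find_elements(By.XPATH, './tbody/tr'):
        cells = row.find_elements(By.XPATH, './td')
        if len(cells) < 14 or cells[7].text.strip() != 'Cook':
            continue  # skip placeholder rows and non-Cook counties
        data.append(dict(zip(COLUMNS, [cell.text for cell in cells[:14]])))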

handle_pagination(driver, counter, data):

def handle_pagination(driver, counter, data):
    try:
        # Return True if the disabled next button is found
        driver.find_element(By.CSS_SELECTOR, DISABLED_NEXT_BUTTON_SELECTOR)
        return True
    except Exception:
        # If it's not, find the next page button, scroll to it and click it
        next_page = driver.find_element(By.XPATH, NEXT_PAGE_XPATH)
        driver.execute_script('arguments[0].scrollIntoView(true)', next_page)
        next_page.click()
        # Log the current page and total scraped data
        logging.info(f'Page {counter} scraped. Total: {len(data)}')
        return False


def main():
    ...

The handle_pagination function manages the navigation through paginated pages on a website.

  • If the ‘Next’ button is disabled (indicating the last page), the function returns True.
  • If an exception occurs (indicating the ‘Next’ button is not disabled), it proceeds to locate and click the ‘Next’ button using an XPath (NEXT_PAGE_XPATH).
  • After clicking, it logs the progress and returns False to indicate there are more pages to scrape.
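
One caveat with this flow: DataTables-style pages usually redraw the table in place rather than loading a new URL, so the visibility wait at the top of the loop can pass before the new rows have rendered, and you could end up scraping the same page twice. A small optional sketch (assuming the old rows are detached from the DOM when the table redraws) is to grab the first row before clicking 'Next' and wait for it to go stale afterwards:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def click_next_and_wait(driver, next_page, timeout=30):
    # Hypothetical helper, not part of the original script;
    # TABLE_XPATH is the same constant defined in the configuration section
    old_first_row = driver.find_element(By.XPATH, f'{TABLE_XPATH}/tbody/tr[1]')
    next_page.click()
    # Wait until the old row is detached from the DOM, i.e. the table has been redrawn
    WebDriverWait(driver, timeout).until(EC.staleness_of(old_first_row))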

Section 5: Organize data and export to Excel.

def main():
    ...
    # Close the browser once scraping is finished
    driver.quit()

    # Create DataFrames and sort by relevant columns
    df_sheet1 = pd.DataFrame(data).sort_values(by='Sale Date')
    df_sheet2 = pd.DataFrame(data).sort_values(by='Zip Code')

    # Export to Excel file with two sheets
    with pd.ExcelWriter('upcoming_sales.xlsx', engine='xlsxwriter') as writer:
        df_sheet1.to_excel(writer, sheet_name='By Auction Date', index=False)
        df_sheet2.to_excel(writer, sheet_name='By Zip Code', index=False)

    # Log successful completion
    logging.info("Excel file with two sheets created: upcoming_sales.xlsx")
  • driver.quit() closes the browser once scraping is finished.
  • df_sheet1 and df_sheet2 are two Pandas DataFrames created from the collected data.
  • pd.DataFrame(data) converts the collected data (a list of dictionaries) into a DataFrame.
  • sort_values(by=...) sorts each DataFrame by the given column ('Sale Date' and 'Zip Code', respectively).
  • pd.ExcelWriter is used to create an Excel file named 'upcoming_sales.xlsx'.
  • df_sheet1 is written to the 'By Auction Date' sheet, and df_sheet2 is written to the 'By Zip Code' sheet.
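
A note on sorting: sort_values on a text column sorts lexicographically, so if 'Sale Date' comes through as plain strings the order may not be chronological. An optional tweak (a sketch, assuming the dates are in a format pandas can parse) is to convert the column first:

df_sheet1 = pd.DataFrame(data)
# Parse the text dates so the sort is chronological rather than alphabetical
df_sheet1['Sale Date'] = pd.to_datetime(df_sheet1['Sale Date'], errors='coerce')
df_sheet1 = df_sheet1.sort_values(by='Sale Date')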

The Complete Code:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import logging


URL = 'https://tjsc.com/Sales/UpcomingSales'
TABLE_XPATH = '//*[@id="basic-datatables"]'
NEXT_PAGE_XPATH = '//*[@id="basic-datatables_paginate"]/ul/li[7]/a'
DISABLED_NEXT_BUTTON_SELECTOR = 'li.next.disabled'

logging.basicConfig(level=logging.INFO)


def scrape_data(table, data):
    table_rows = [
        row for row in table.find_elements(By.XPATH, f"{TABLE_XPATH}/tbody/tr")
        if row.find_element(By.XPATH, './td[8]').text == 'Cook'
    ]
    for table_data in table_rows:
        td = [td.text for td in table_data.find_elements(By.XPATH, './td[position() <= 14]')]
        data.append({
            "Sale Date": td[0],
            "Sale Time": td[1],
            "File Number": td[2],
            "Case Number": td[3],
            "Firm Name": td[4],
            "Address": td[5],
            "City": td[6],
            "County": td[7],
            "Zip Code": td[8],
            "Opening Bid": td[9],
            "Required % Down": td[10],
            "Sale Amount": td[11],
            "Continuance": td[12],
            "Sold To": td[13]
        })


def handle_pagination(driver, counter, data):
    try:
        driver.find_element(By.CSS_SELECTOR, DISABLED_NEXT_BUTTON_SELECTOR)
        return True
    except Exception:
        next_page = driver.find_element(By.XPATH, NEXT_PAGE_XPATH)
        driver.execute_script('arguments[0].scrollIntoView(true)', next_page)
        next_page.click()
        logging.info(f'Page {counter} scraped. Total: {len(data)}')
        return False


def main():
    options = webdriver.ChromeOptions()
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_argument('--disable-blink-features=AutomationControlled')

    options.add_argument("--disable-extensions")
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-infobars')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--disable-browser-side-navigation')
    options.add_argument('--disable-gpu')
    options.add_argument('--blink-settings=imagesEnabled=false')
    options.add_argument('--headless')
    options.add_argument(
        'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    )

    driver = webdriver.Chrome(options=options)
    driver.maximize_window()
    driver.get(URL)

    counter = 1
    data = []

    while True:
        try:
            logging.info(f'Scraping page {counter}..')

            WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, TABLE_XPATH)))
            table = driver.find_element(By.XPATH, TABLE_XPATH)

            scrape_data(table, data)

            end_of_page = handle_pagination(driver, counter, data)
            if end_of_page:
                break

            counter += 1

        except (NoSuchElementException, TimeoutException) as e:
            logging.error(f"An error occurred: {e}")
            driver.quit()
            break

    # Close the browser once scraping is finished
    driver.quit()

    df_sheet1 = pd.DataFrame(data).sort_values(by='Sale Date')
    df_sheet2 = pd.DataFrame(data).sort_values(by='Zip Code')

    with pd.ExcelWriter('upcoming_sales.xlsx', engine='xlsxwriter') as writer:
        df_sheet1.to_excel(writer, sheet_name='By Auction Date', index=False)
        df_sheet2.to_excel(writer, sheet_name='By Zip Code', index=False)

    logging.info("Excel file with two sheets created: upcoming_sales.xlsx")


if __name__ == '__main__':
    main()

This advanced web scraping project showcased the power of Selenium and Python in efficiently extracting and organizing upcoming sales data. From data extraction and dynamic pagination handling to structured organization and export, each step was a testament to the capabilities of web scraping for data-driven insights.

Remember, web scraping is a powerful tool that requires responsible and ethical use. Always be mindful of the website’s terms of service and respect data privacy guidelines. Now armed with advanced web scraping skills, you’re ready to tackle diverse data extraction challenges in your scraping adventures. Happy scraping!
