Adaptive Web Scraping: A Machine Learning Approach to Dynamic HTML Structure Detection

by Nicholas Micallef






Departments Computer Science, Zienkiewicz Institute for Modelling, Data and AI
Description
Web scraping tools like Selenium rely on HTML tags to automate data extraction from websites. However, a major challenge arises from the frequent structural changes in HTML, whether due to UI redesigns or deliberate efforts by platforms to prevent scraping. As a result, web scraping has become increasingly difficult, with technology companies investing heavily in anti-scraping measures to protect their data.

This project aims to develop a machine learning model that can dynamically identify key HTML tags required for web scraping. By leveraging machine learning, the proposed solution will enable automated adaptation to structural changes in HTML, reducing reliance on rigid, hard-coded tag identification.

Project Objectives:
1. Analyzing Barriers and Challenges: Conduct an in-depth analysis of the challenges developers face when scraping data from various platforms. The student will have the flexibility to select the platforms they wish to investigate.
2. Developing a Scraping Script: Based on insights from the first phase, design and implement a script capable of overcoming identified barriers and successfully extracting substantial data from the selected platforms.
3. Dataset Creation: Manually label a diverse set of HTML data to identify relevant tags for scraping. This step involves annotating key elements across various HTML structures to create a robust training dataset for the model.
4. Model Development and Training: Build and train a machine learning model to detect and classify useful HTML tags by learning patterns in different website structures.
5. Proof of Concept: Integrate the trained model with Selenium and the developed script to validate its effectiveness. The proof of concept should demonstrate the model's ability to adapt to dynamic HTML structures and successfully guide Selenium in extracting relevant data.

Project Scope and Requirements:
This is a technically demanding project that requires strong programming and data science skills. Success will depend on the student's ability to implement machine learning techniques, handle complex HTML structures, and develop robust scraping methodologies.
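
To make the dataset creation and model training objectives more concrete, the following is a minimal sketch and not a prescribed design. It assumes that manual labels are stored in a hypothetical `data-label` attribute, that each element is described by a few hand-picked features (tag name, depth, parent tag, class-name tokens, and the shape of its text), and that a small hand-written Naive Bayes classifier stands in for whatever model the student eventually chooses. All class names, function names, and sample pages are illustrative.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class ElementFeatureParser(HTMLParser):
    """Turns each HTML element into a bag of string features.

    Simplification: assumes well-formed markup without void tags (<br>, <img>).
    """

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, innermost last
        self.elements = []  # closed elements, in closing order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        feats = [f"tag={tag}", f"depth={len(self.stack)}"]
        if self.stack:
            feats.append(f"parent={self.stack[-1]['tag']}")
        for token in re.split(r"[\s_-]+", attrs.get("class") or ""):
            if token:
                feats.append(f"class~{token.lower()}")
        self.stack.append({
            "tag": tag,
            "features": feats,
            "text": "",
            # Hypothetical annotation attribute; never used as a feature.
            "label": attrs.get("data-label") or "other",
        })

    def handle_data(self, data):
        if self.stack:  # direct text goes to the innermost open element
            self.stack[-1]["text"] += data

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1]["tag"] != tag:
            return
        element = self.stack.pop()
        text = element["text"].strip()
        if not text:
            shape = "empty"
        elif re.search(r"[$€£]\s?\d", text):
            shape = "currency"
        elif len(text.split()) <= 3:
            shape = "short"
        else:
            shape = "long"
        element["features"].append(f"text={shape}")
        self.elements.append(element)


def extract_elements(html):
    parser = ElementFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.elements


class NaiveBayesTagClassifier:
    """Multinomial Naive Bayes with add-one smoothing over feature strings."""

    def fit(self, elements):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for el in elements:
            self.label_counts[el["label"]] += 1
            self.feature_counts[el["label"]].update(el["features"])
            self.vocab.update(el["features"])
        return self

    def predict(self, features):
        total = sum(self.label_counts.values())
        scores = {}
        for label, count in self.label_counts.items():
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for feat in features:
                if feat in self.vocab:  # unseen features carry no evidence
                    score += math.log((self.feature_counts[label][feat] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Two hand-labelled layouts (invented sample data).
TRAIN_HTML = """
<div class="product-card">
  <h2 class="product-title" data-label="title">Blue Kettle</h2>
  <span class="price" data-label="price">$19.99</span>
  <p class="desc">Boils water quickly and quietly.</p>
</div>
<div class="item">
  <h3 class="item-name" data-label="title">Red Toaster</h3>
  <div class="item-price" data-label="price">€24.50</div>
  <p class="blurb">Two slots with a defrost setting.</p>
</div>
"""

# A "redesigned" page: new wrapper tag and obfuscated class names.
REDESIGNED_HTML = """
<section class="x1">
  <h2 class="a9">Green Blender</h2>
  <b class="k3">£31.00</b>
  <p class="z7">Crushes ice in under ten seconds.</p>
</section>
"""

model = NaiveBayesTagClassifier().fit(extract_elements(TRAIN_HTML))
predictions = [(el["tag"], model.predict(el["features"]))
               for el in extract_elements(REDESIGNED_HTML)]
print(predictions)
```

In this toy case the class names on the redesigned page carry no information, so the classifier has to fall back on structural and text-shape cues. In a proof of concept, the same feature extraction could plausibly be run on the page source returned by Selenium, with each predicted element converted into a locator. That wiring is left out here because it depends on design decisions the project leaves to the student.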
Preparation
Project Categories Artificial Intelligence (AI)
Project Keywords Machine Learning, Programming Languages


Level of Studies

Level 6 (Undergraduate Year 3) yes
Level 7 (Masters) yes
Level 8 (PhD) yes