Hack Your Future

Scrape First, Science Later | Data Mining Project Presentations

December 30, 2020, 3:00 pm, Data Science, ITC Publications

Here are our students’ presentations of their data mining projects, the first project in the Data Science Program.

Before we decide what to do with data, we first need to scrape it. That’s why the first project our students work on in the Data Science Program is the data mining project.

Our students from the October cohort just finished their first project in the program. We’ve gathered the project descriptions and demonstrations:

Coronavirus Cases Web Scraper

Eran Perelman, Nofar Herman

This project fetches data about the virus from the coronavirus page at https://www.worldometers.info/coronavirus/. It collects current and historical case data, both globally and for each country individually. The code fetches the data on its first run and then updates it at every interval.
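
The "fetch once, then update on every interval" behaviour can be sketched roughly as below. This is a minimal illustration, not the students' code: `run_periodically` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real download and parsing step.

```python
import time
from typing import Callable, List


def run_periodically(
    fetch: Callable[[], dict],
    interval_seconds: float,
    max_runs: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Call fetch() immediately, then again after every interval.

    max_runs bounds the loop so this sketch terminates; a long-running
    scraper would loop until interrupted instead.
    """
    snapshots = []
    for run in range(max_runs):
        if run > 0:
            sleep(interval_seconds)  # wait before every update after the first
        snapshots.append(fetch())
    return snapshots


def fake_fetch() -> dict:
    # Hypothetical stand-in for downloading and parsing the page.
    return {"fetched_at": time.time()}


# Three fetches with a zero-second pause, so the demo finishes instantly.
demo_snapshots = run_periodically(fake_fetch, interval_seconds=0, max_runs=3)
```

Passing `sleep` as a parameter keeps the loop easy to test without actually waiting.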

For more information:
https://github.com/EranPer/Data-Mining-project

CoronaGram – scrape Instagram posts from Instagram hashtag pages

Yoav Wigelman, Yair Stemmer

The program works in 3 steps:
1. Connect to a hashtag page, scrape a collection of post “shortcodes” and insert them into an SQL DB.
2. Collect the shortcodes that were inserted into the DB in step 1, convert them into post URLs and scrape the post data.
3. Enrich the database by analysing post texts with an API for language detection, translation and sentiment analysis, which determines whether each post is positive, negative or neutral.
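
The three steps can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: `sqlite3` stands in for the SQL DB, the table layout and function names are hypothetical, and `toy_sentiment` is a keyword-based placeholder for the external language and sentiment API. The only fixed detail is the public post URL pattern built from a shortcode.

```python
import sqlite3

POST_URL_TEMPLATE = "https://www.instagram.com/p/{shortcode}/"


def store_shortcodes(conn, shortcodes):
    """Step 1: insert shortcodes, ignoring ones already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "shortcode TEXT PRIMARY KEY, url TEXT, caption TEXT, sentiment TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO posts (shortcode) VALUES (?)",
        [(code,) for code in shortcodes],
    )


def fill_post_urls(conn):
    """Step 2: turn every stored shortcode into a post URL."""
    rows = conn.execute("SELECT shortcode FROM posts WHERE url IS NULL").fetchall()
    for (code,) in rows:
        conn.execute(
            "UPDATE posts SET url = ? WHERE shortcode = ?",
            (POST_URL_TEMPLATE.format(shortcode=code), code),
        )


def enrich(conn, classify):
    """Step 3: label each scraped caption with a caller-supplied classifier."""
    rows = conn.execute(
        "SELECT shortcode, caption FROM posts WHERE caption IS NOT NULL"
    ).fetchall()
    for code, caption in rows:
        conn.execute(
            "UPDATE posts SET sentiment = ? WHERE shortcode = ?",
            (classify(caption), code),
        )


def toy_sentiment(text):
    # Hypothetical placeholder for the external sentiment API.
    lowered = text.lower()
    if "good" in lowered:
        return "positive"
    if "bad" in lowered:
        return "negative"
    return "neutral"


conn = sqlite3.connect(":memory:")
store_shortcodes(conn, ["abc123", "xyz789", "abc123"])  # duplicate is ignored
fill_post_urls(conn)
# Stand-in for scraping the post page: attach an invented caption.
conn.execute(
    "UPDATE posts SET caption = ? WHERE shortcode = ?", ("Good news today", "abc123")
)
enrich(conn, toy_sentiment)
```

Using the shortcode as the primary key means re-scraping the same hashtag page does not create duplicate rows.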

Insta Scraper – Web Scraper for Instagram posts

Yaniv Goldfrid, Dana Belivekov

Insta Scraper lets the user retrieve data from Instagram by specifying a username or hashtag. It scrapes the latest posts and their information (likes, comments, location, etc.) along with the users who posted them and their information (followers, following, bio, personal website, etc.). Insta Scraper is enriched with the Weather API and Google Geo Location API, so for every post with a location, it retrieves the temperature and weather conditions at the time and place of posting.

Job hunting in the United States

Royi Razi, Eyal Hashimshoni

This project optimizes the job search in the data science (DS) field across the United States using various websites.

Company Reviews Scraper

Albert Tamman, Asher Holder, Ruben Valensi

For our data mining project, we decided to scrape Trustpilot, a website that allows consumers to rate companies. With the data from this website, we can analyze the companies’ reviews in more depth.

Stack Exchange Scraper

David Frankenberg, Maria Startseva

The Stack Exchange Scraper allows you to retrieve information about the most frequent questions in the Stack Exchange ecosystem. The Stack Exchange family of sites is a meeting point between people who have questions on a given topic and other users and experts who freely answer them. The voting system allows us to follow the most popular questions and answers and gives us an idea of the trending topics in specialized communities.

Technology News Web Scraper

Daniella Grimberg, Eddie Mattout

The Technology News Web Scraper looks for the trendiest topics, articles and authors in TechCrunch and other sources. You can use it to search for headlines related to a specific tech industry, such as cybersecurity, gaming or health-tech, and through that understand trends and hype in the industry of interest. You can also search for trending topics by date and by author, along with the authors’ Twitter handles.

PyPi Scraper

Adam Rubins, Or Granot

PyPi Scraper is a tool you can use to scrape information about Python packages from the PyPI website, https://pypi.org/. The information includes the package name, version, operating systems that support it, Python versions that support it, topic and much more. You can use it to save the data locally in a DB or to print it on the screen. The scraper also gets additional information from GitHub.
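
PyPI describes supported operating systems, Python versions and topics through "trove classifier" strings on each project page. A rough sketch of grouping such strings is shown below. Whether the students' scraper reads classifiers this way is an assumption, and `group_classifiers` and the sample list are hypothetical.

```python
from collections import defaultdict


def group_classifiers(classifiers):
    """Group trove classifier strings into OS, Python version and topic lists."""
    grouped = defaultdict(list)
    for classifier in classifiers:
        parts = [part.strip() for part in classifier.split("::")]
        category, rest = parts[0], parts[1:]
        if category == "Programming Language" and rest[:1] == ["Python"] and len(rest) > 1:
            grouped["python_versions"].append(rest[1])
        elif category == "Operating System" and rest:
            grouped["operating_systems"].append(" :: ".join(rest))
        elif category == "Topic" and rest:
            grouped["topics"].append(" :: ".join(rest))
        # Other categories (license, status, audience) are skipped here.
    return dict(grouped)


# Illustrative classifier strings in the format PyPI uses.
sample = [
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.8",
    "Operating System :: OS Independent",
    "Topic :: Software Development :: Libraries",
    "License :: OSI Approved :: MIT License",
]
package_info = group_classifiers(sample)
```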

Adoptable Animals Database

Ariana Gordon, Noa Ehrenhalt

Our database with command line interface showcases adoptable animals throughout the United States. Utilizing the Los Angeles County Animal Care and Control Center website as well as the Petfinder API, this program gathers details about the animals currently in those organizations’ shelters and rescues. Users can use the command line interface to search for a new furry friend by their shelter ID, breed, shelter location, available date, or any combination of the aforementioned attributes.
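
Combining optional search attributes on the command line might look like the sketch below. The flag names, the `ANIMALS` records and the `search` helper are all invented for illustration; the project's real interface and database schema may differ.

```python
import argparse

# Invented sample records; the real data comes from the scraped sources.
ANIMALS = [
    {"shelter_id": "A001", "breed": "Beagle", "location": "Shelter North", "available": "2020-12-01"},
    {"shelter_id": "A002", "breed": "Beagle", "location": "Shelter South", "available": "2020-12-15"},
    {"shelter_id": "A003", "breed": "Tabby", "location": "Shelter North", "available": "2020-12-01"},
]


def build_parser():
    parser = argparse.ArgumentParser(description="Search adoptable animals")
    parser.add_argument("--shelter-id")
    parser.add_argument("--breed")
    parser.add_argument("--location")
    parser.add_argument("--available", help="available date, YYYY-MM-DD")
    return parser


def search(animals, args):
    """Return animals matching every filter the user actually supplied."""
    filters = {
        "shelter_id": args.shelter_id,
        "breed": args.breed,
        "location": args.location,
        "available": args.available,
    }
    active = {key: value for key, value in filters.items() if value is not None}
    return [
        animal
        for animal in animals
        if all(animal[key].lower() == value.lower() for key, value in active.items())
    ]


# Equivalent to running: search.py --breed beagle --location "Shelter North"
demo_args = build_parser().parse_args(["--breed", "beagle", "--location", "Shelter North"])
demo_matches = search(ANIMALS, demo_args)
```

Filters that are left out are simply ignored, which is what makes "any combination" of attributes work with a single function.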

ATP Tennis Scraper

Tal Baram, Gabriel Choukroun

In our project we built a scraper to get data from the ATP website (the professional men’s tennis tour). Our system lets us get data for each calendar year, starting from 1877. The data includes tournaments, champions, player profiles, scores, rankings and more. The system offers user-friendly filters so the data can be matched to the user’s needs. All of the data is saved in a MySQL database.

GitHub: https://github.com/talbaram3192/web_scraping_project

Web Scraping Real Estate Data

Ron Levy, Ohad Hayoun

The real estate industry is just starting to unlock the potential of big data and incorporate machine learning.
Our project’s main goal is to gain the tools needed to utilize data, better inform investment decisions, and discover new opportunities in the real estate sector using insights taken from data analytics.
Our scraper collects key US real estate data on the factors that affect property value, such as property type, house images, land size, price trends, location, and amenities. This makes it possible for anyone interested in real estate to quickly obtain large amounts of relevant data.

Stock Mining

Barak Beitner, Omer Danziger

Stock Mining is a Python program for scraping the history of top Yahoo Finance indexes, such as the S&P 500, Dow Jones and NASDAQ, and examining trends. Once all the desired data has been scraped, users can analyze it according to their own needs through an SQL DB interface.
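
As a rough idea of the kind of analysis an SQL interface allows, the sketch below summarizes closing values per index. It rests on assumptions: `sqlite3` stands in for the project's database, the `index_history` table and its columns are hypothetical, and the rows are invented placeholders rather than market data.

```python
import sqlite3

# Invented sample rows; symbols and prices are placeholders, not market data.
SAMPLE_ROWS = [
    ("INDEX_A", "2020-01-02", 100.0),
    ("INDEX_A", "2020-01-03", 110.0),
    ("INDEX_B", "2020-01-02", 50.0),
    ("INDEX_B", "2020-01-03", 40.0),
]


def load_history(rows):
    """Create an in-memory table and load the scraped rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE index_history (symbol TEXT, day TEXT, close REAL)")
    conn.executemany("INSERT INTO index_history VALUES (?, ?, ?)", rows)
    return conn


def summarize(conn):
    """Return (symbol, lowest close, highest close, average close) per index."""
    query = """
        SELECT symbol, MIN(close), MAX(close), AVG(close)
        FROM index_history
        GROUP BY symbol
        ORDER BY symbol
    """
    return conn.execute(query).fetchall()


summary = summarize(load_history(SAMPLE_ROWS))
```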

Allmusic Web Scraper

Clémentine Szalavecz, Alexandre de Pape

This project consisted of scraping allmusic.com, a website that lists each week’s new album releases and gives a lot of information about the artists. We then created a database and deployed it to an AWS machine.

Aliexpress.com Product Scraper

Yaniv Weiss, Ruben Cohen

Our program scrapes data from the website Aliexpress.com. It scrapes data about products, suppliers and customers using Python and Selenium. Once the data is scraped, the program updates an SQL database. The scraped data is enriched with data retrieved from the New York Times using an API. The goal of this enrichment is to gain insight into the popularity of a given kind of product.
Finally, we used an AWS server to run our scraper and summarized the insights obtained using Redash.

Glassdoor Web Scraper

Amit Englstein, Charlotte Abitbol

Tired of searching for jobs manually while considering so many factors? This automatic script will do that for you. With the script we have developed, you can get information about whatever job you want to search for, along with additional relevant information about the company: its overall rating, benefits, location, revenue, size, exchange rate, etc. Using the script, you can both save time and carry out further analysis with your newly scraped data.
*NOTE: All scraping was done in accordance with Glassdoor’s robots.txt limitations!
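
One common way to respect robots.txt limits in Python is the standard library's `urllib.robotparser`. The sketch below is illustrative only: the rules, the `example.com` URLs and the `my-scraper` user agent are invented and are not Glassdoor's actual robots.txt, and the project may have enforced the limits differently.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration; not any real site's robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_checker(robots_txt):
    """Parse robots.txt text into a checker without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


checker = build_checker(SAMPLE_ROBOTS_TXT)

# Ask before requesting each URL; skip the ones that are disallowed.
allowed = checker.can_fetch("my-scraper", "https://example.com/jobs/1")
blocked = checker.can_fetch("my-scraper", "https://example.com/private/data")
```

In a live scraper, `set_url()` followed by `read()` would load the site's real robots.txt instead of the inline sample.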

Whoscored Web Scraper

Daniel Siles, Alex Zabbal

We developed a project to scrape soccer statistics from all of the major leagues in the world. The software can get information about the standings of each league, match statistics, player statistics and even the winning odds of each team in a match. This data is collected on a continuous basis to build a database of information that can be used for predictive analysis.

Newark Data Mining

Neta Geva, Loren Dery

Our project is a data mining project about the Newark airport website. The program scrapes data about departing and arriving flights from the main Newark website and also from the detail page of each flight number. It gathers the estimated departure/arrival time, terminal, gate, status, and destination/origin. In addition, we retrieved data about airports (address, website, IATA code, etc.) from the Airport Info API. The program builds a MySQL database from this data. The user can also retrieve the data using a simple command line.

This is only the first step! Right now the students from the October cohort are working on their final data science project until the end of the program.

Click here to learn more about the Fellows Data Science program at <itc>.
