LibGuides: Data Science: Guide for Independent Projects

Introduction

Many data science students eventually want to undertake an independent or personal side project. This guide provides resources for these types of projects. It is not necessarily intended to provide guidance for course projects, internship deliverables, or other formalized projects. Rather, it is meant to help you, as a data science student, get a little extra experience working with data.

The benefits of these types of projects are threefold: (1) apply what you've learned in your coursework to a new topic, testing your knowledge; (2) learn new skills, including new Python or R packages and other platforms/tools; and (3) produce an output you can put on your resume. If you get really into your project, you can also consider turning it into a guest blog post on a data science site, or otherwise sharing your work with a broader audience.

Getting started

Not sure what to work on? Use the choice wheels below to help brainstorm a project topic.

Pick a topic area
For example, maybe you want to do a project related to sports, or social media, or biology. If you're not sure, or don't have a preference, use this spinning wheel to help you pick a topic area.
Pick a data science task
Data science projects often focus on a specific task, for instance classification, regression, clustering, or others.
Pick a data type
One way to pick a project is to think about what kind of data you want to practice working with. For instance, do you want to practice working with numeric data, or text data? Or maybe you want to practice your data science skills with image data?
Pick an additional tool or approach
This wheel includes a selection of tools, methods, or approaches often used in data science projects, such as API queries, recommender systems, sentiment analysis, and more.
Pick a random keyword
Spin to get a random keyword to help further brainstorm your topic. For instance, if you want to practice classification (task) using tabular data (type) for transportation (area), and incorporate sentiment analysis (additional tool), could you tie this to your chosen keyword? Consider: is there a way to combine, say, political preferences based on vote data with voter sentiment towards expanded highway infrastructure? Or maybe you could classify preferences for electric vehicles by median income per Census block, rates of homeowners insurance, or another of these keywords?
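
If you would rather script the brainstorming step than spin the wheels, the same idea can be sketched in a few lines of Python. This is a minimal sketch, not how the wheels on this page are implemented: the option lists are placeholders drawn from the examples mentioned in this section, and the function name `spin_wheels` is hypothetical.

```python
import random

# Placeholder options, taken from the examples mentioned in this section.
# Replace or extend these lists with your own interests.
WHEELS = {
    "topic area": ["sports", "social media", "biology", "transportation"],
    "task": ["classification", "regression", "clustering"],
    "data type": ["numeric", "text", "image", "tabular"],
    "additional tool": ["API queries", "recommender systems", "sentiment analysis"],
}


def spin_wheels(wheels, seed=None):
    """Pick one option from each wheel; pass a seed to make the draw repeatable."""
    rng = random.Random(seed)
    return {name: rng.choice(options) for name, options in wheels.items()}


if __name__ == "__main__":
    for wheel, pick in spin_wheels(WHEELS).items():
        print(f"{wheel}: {pick}")
```

Run it a few times and keep the first combination that makes you curious; the goal is a starting point, not a binding assignment.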

Guided projects

Maybe you're not ready to start a project entirely from scratch. That's fine! These links have examples of more guided projects: they provide a dataset, a general question, and either tutorials, full source code, or hints about what packages and analyses you'll need to use. Think of these as "training wheels projects": they are a way to build your confidence and help you get comfortable with projects outside of class.

16 Data Science Projects with Source Code
Each project includes sample code to complete the project from start to finish. There aren't many comments or explanations associated with the code, so this can be a chance to practice reading through existing code to test your comprehension as you go through the project, step by step.
24 data science projects to boost knowledge and skills
These projects are split into beginner, intermediate, and advanced levels, with links to tutorials and where to download the data in question.
12 Data Science Projects for Beginners and Experts
This site presents data science projects in R and Python with source code and data. Project areas include text analysis, recommender systems, deep learning, and supervised and unsupervised machine learning.
8 fun machine learning projects for beginners
Machine learning is a popular topic with data science students, and these projects provide a semi-guided way to practice your skills.
5 Advanced Projects for Data Science Portfolio
The methods used in these projects are more advanced, but each comes with example code and guided instructions from start to finish.
65+ Data Science Projects with Source Code
Projects are grouped by which method or approach the project emphasizes: web scraping, analysis & visualization, machine learning, time series forecasting, deep learning, computer vision, and natural language processing.
21 Data Science Projects for Beginners
Projects are divided into beginner, intermediate, and advanced. Each project includes a link to the dataset used as well as example solution code. Note: you may need to create an account to access; only some projects are free to access as of January 2025.

Starting projects from scratch

Make use of the other resources in this guide! Check out the "Working with Python" and "Working with R" tabs for information about data analysis and visualization packages. Read through the "Version Control & GitHub" tab for additional information about working with Git and how to properly structure a GitHub repository. The "Finding Data & Statistics" tab redirects to a full guide to help with finding data sources and the "Data Visualization" tab will send you to additional resources about data visualization, including best practices.

Getting started for beginners
Starting with visualization is great advice.
Options for projects
This guide to building a data science portfolio also offers a good overview of different kinds of projects possible: data cleaning, data storytelling, an "end to end" project, and an explanatory project. Picking what kind of project you'd like to undertake is a good start.
Guide to starting a data science project
You won't need to write a formal proposal (since this is your personal project, you can work on whatever you want), but the other steps in this guide are applicable.
Scoping a project
This guide to scoping a data science project is more detailed than necessary for a personal side project, but the takeaways are good (define the goal, determine data needs, determine analysis needed).
Project style guide
Remember: how you put together your project is as important as your project topic! This guide is definitely worth reading.
Data science project template
From Cookiecutter Data Science: "A logical, reasonably standardized, but flexible project structure for doing and sharing data science work."
Managing Code and Software for Applied Data Science Projects
This self-paced workshop site from UC Davis DataLab discusses how and why we build code, possible development workflows, project management strategies, and tool selection. This workshop is designed for learners who are writing and applying code for their research projects, but have no or limited formal computer science training.
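
To make the idea of a standardized project structure concrete, here is a minimal Python sketch that creates an empty project skeleton. The folder names are an assumption loosely modeled on common data science layouts (raw versus processed data, notebooks, source code, reports); they are not the exact Cookiecutter Data Science template, and `create_skeleton` is a hypothetical helper.

```python
from pathlib import Path

# Assumed layout for illustration; adjust to match the template you adopt.
FOLDERS = [
    "data/raw",         # original data, never edited by hand
    "data/processed",   # cleaned data ready for analysis
    "notebooks",        # exploratory notebooks
    "src",              # reusable scripts and functions
    "reports/figures",  # generated plots for the write-up
]


def create_skeleton(root):
    """Create the folders above under `root`, plus a starter README.md."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text("# Project title\n\nWhat question does this project answer?\n")
    return root
```

Calling `create_skeleton("my-side-project")` creates the folders in the current working directory. Running it again is harmless, since existing folders and an existing README are left untouched.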

Project examples for beginners

Sometimes, you want to look at fully formed examples to get an idea of what you can do for your own project. Here are some examples of data science (or at least, data science-ish) projects suitable for lower division data science students: the projects use available data, (mostly) make the underlying code public, produce effective/interesting visuals, and are easy to read through. These examples also span a range of project options, such as making a tutorial for popular/frequently used datasets, learning new techniques, scraping your own data, or digging into a big dataset.

Kaggle Titanic tutorial
One way to approach a data science side project is to write up your workflow/results as a tutorial for other people to use. This has multiple benefits: it helps you organize your thoughts, forces you to be explicit about your data wrangling and modeling, and adds your own personal touch when working with popular, frequently used datasets.
Visualize Spotify
This project visualizes attributes of songs (beats per minute, loudness, length, etc.) from one of this person's Spotify playlists. There is a link in the post to a GitHub repository which includes the data, scripts, notebooks, and figures.
Text mining The Office
This project uses text mining techniques on a dataset of every line from the TV series The Office. Note the cleaning steps to get the data ready for analysis!
Tracking emerging slang
This project uses Google Trends data to track where new slang comes from (spatially and temporally). Could you recreate a similar analysis using Python? What other questions could you ask with Google Trends data?
Recipe recommendations API
This project consists of three parts: scraping recipe data, building recommender models and building an API to be hosted on a web server. How might results change with different recipe data?
Video game sales
This project on video game sales data relies heavily on data visualization. This example uses R, but consider: could you make similar plots in Python? What about Power BI or Tableau?
Movie genre prediction
This project uses elements of movie posters to predict movie genres using convolutional neural networks (CNN). The code is already available, making this a good project to practice looking through and understanding code written by someone else. What parts of the code are understandable based on prior coursework? Are there Python libraries used that are new to you?
Football (soccer) match outcome prediction
Projects with this data predict the probability of match outcomes for each target class (home team wins, away (opponent) team wins and draw). This project includes dealing with missing and imbalanced data. A more detailed evaluation of various models can be found in this notebook. Try adapting this workflow to data from other sports of your choice!
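
The match outcome example above mentions missing and imbalanced data. As a rough illustration of those two ideas (not the method used in the linked project), the sketch below fills missing numeric values with the median of the observed values and computes class weights that are inversely proportional to class frequency. The data values and function names are made up for illustration; the weighting formula matches the one scikit-learn documents for `class_weight="balanced"`.

```python
from collections import Counter
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1 and common classes get weights below 1.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Made-up example: goals scored per match (one missing) and match outcomes.
goals = [2, None, 0, 1, 3]
outcomes = ["home", "home", "home", "away", "draw", "home"]

print(impute_median(goals))              # [2, 1.5, 0, 1, 3]
print(balanced_class_weights(outcomes))  # {'home': 0.5, 'away': 2.0, 'draw': 2.0}
```

Median imputation and class weighting are only two of several options; dropping incomplete rows or resampling the rarer classes are alternatives worth comparing in your own project.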

Also consider reaching out to your fellow data science students about forming a group to work on an independent project. Group projects are a great way to develop important skills such as code collaboration (particularly using GitHub) and project workflow management. Working with a group also provides a built-in network for brainstorming ideas, troubleshooting code errors, and formalizing your project. Plus, it can be more motivating to work in a group, since you're relying on each other to make progress.

Alternatively, if you prefer to work on your own project, it would still be valuable to reach out to other people for code review. Reviewing someone else's code is a useful learning exercise, and having your own code reviewed by your peers is a good way to catch mistakes you might have missed.

More advanced projects

Using Common Crawl data
The Common Crawl corpus contains petabytes of data and is available on Amazon S3. It contains raw web page data, extracted metadata and text extractions collected since 2008. The Common Crawl site includes tutorials and example projects using this data. This is a good dataset to use for a project if you want experience working with truly big data, navigating the Amazon web ecosystem, and using data mining techniques at scale.
Wayback Machine (archived web pages)
Historical web page captures of sites are available via the Wayback Machine and can be extracted and analyzed with Python in a multistep process.
The UC San Diego Library maintains a campus web archiving program to capture web sites relevant to the UCSD community. This presentation from UC Love Data Week 2023 demonstrates an example workflow for accessing and analyzing data in one of these web archive collections. What other types of analysis can you think of with this type of data?
Papers with Code
Papers with Code is a free and open resource with machine learning papers, code, datasets, methods, and evaluation tables. Browse by data type and task/method; most of the datasets and code examples are linked to associated peer-reviewed publications. Publications here are beyond the scope of most personal projects, but this site is a good central hub for learning more about cutting-edge (and classic) data science methods and models. A potential project would be to try implementing one of the methods/models, which requires learning new packages/functions (reading documentation), benchmarking and assessment, and interpreting technical results.
Previous DSC capstone projects
This page includes links to past DSC capstone (180AB) projects. These represent multi-quarter projects, not personal projects, but they provide a good overview of the types of topics and methods found in many advanced personal projects.
UCSD Computing Paths projects page
Check out the data mining subsection to see projects by other UCSD students.
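
For the Wayback Machine item above, the first step of the multistep process is usually listing which captures exist for a site. The sketch below builds a query URL for the Wayback Machine's CDX server and turns a JSON response into snapshot URLs. It makes no network request: the sample response is made up for illustration, the helper names are hypothetical, and you should check the Internet Archive's CDX documentation for current parameters and usage limits before relying on it.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(site, start_year, end_year, limit=10):
    """Return a CDX query URL listing captures of `site` between two years."""
    params = {
        "url": site,
        "from": str(start_year),
        "to": str(end_year),
        "output": "json",
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(cdx_json_text):
    """Convert a CDX JSON response (first row is the header) into snapshot URLs."""
    rows = json.loads(cdx_json_text)
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{row[ts]}/{row[orig]}" for row in captures]


# Made-up response in the shape the CDX server returns with output=json.
sample = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["edu,example)/", "20200101000000", "http://example.edu/", "text/html", "200", "ABC", "1234"],
])

print(build_cdx_query("example.edu", 2019, 2021))
print(snapshot_urls(sample))
```

From there, each snapshot URL can be downloaded and parsed like any other web page, which is where text extraction and analysis come in.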

If you have an advanced project idea, look into whether it could be a good fit for the HDSI Undergraduate Scholarship Program. About the program:

"Unlike lab-directed projects, students will be able to choose their own research topics and lead the research process. Scholarships will provide opportunities for students to work closely with a mentor to develop analytical skills, develop data science portfolios, and foster novel data-driven approaches to problem solving.

Examples of data-driven projects include applications of methods, tools, and infrastructure for heterogeneous dataset integration, machine learning, geospatial analyses, scalable computing, data visualization, data ethics, and privacy. Priority will be given to applications that employ novel and creative data scientific approaches with specific potential impact to application areas."

Building a portfolio

When working on a personal project, you are building your data science portfolio, a public collection of your work you can share with future employers. A good data science portfolio will include a mix of code, data visualizations, and narrative.

Having a well-organized GitHub is a great start to building your data science portfolio. Remember: most repositories on GitHub are public, so you can look at other people's data science portfolios and projects to get a sense of style and format. You may also eventually decide to create your own website. The format of your portfolio may vary; the important thing to keep in mind is that this is a way to showcase your work for future employers.

How To Build an Awesome Data Science Portfolio
Includes 5 simple tips for building a compelling data science portfolio, with examples.
How to Build a Data Science Portfolio That Stands Out in 2024
Covers how to create data science projects that help your portfolio stand out, the importance of creating a GitHub account to display your work, how to start writing articles in the data science domain, and how to showcase all your work with a beautiful portfolio website.
Developing Your Data Science Portfolio
Step-by-step guide from UC Davis DataLab.
The Promise of Portfolios: Training Modern Data Scientists
Check out the tables in this article to get a sense of the various ways to demonstrate your communication skills in your portfolio.