
21 Interesting Data Science Capstone Project Ideas [2024]


Data science, encompassing the analysis and interpretation of data, stands as a cornerstone of modern innovation. 

Capstone projects in data science education play a pivotal role, offering students hands-on experience to apply theoretical concepts in practical settings. 

These projects serve as a culmination of their learning journey, providing invaluable opportunities for skill development and problem-solving. 

Our blog is dedicated to guiding prospective students through the selection process of data science capstone project ideas. It offers curated ideas and insights to help them embark on a fulfilling educational experience. 

Join us as we navigate the dynamic world of data science, empowering students to thrive in this exciting field.

Data Science Capstone Project: A Comprehensive Overview


Data science capstone projects are an essential component of data science education, providing students with the opportunity to apply their knowledge and skills to real-world problems. 

Capstone projects challenge students to acquire and analyze data to solve real-world problems. These projects are designed to test students’ skills in data visualization, probability, inference and modeling, data wrangling, data organization, regression, and machine learning. 

In addition, capstone projects are conducted with industry, government, and academic partners, and most projects are sponsored by an organization. 

The projects are drawn from real-world problems, and students work in teams consisting of two to four students and a faculty advisor. 

Ultimately, the goal of the capstone project is to create a usable, public data product that demonstrates students' skills to potential employers.

Best Data Science Capstone Project Ideas – According to Skill Level

Data science capstone projects are a great way to showcase your skills and apply what you’ve learned in a real-world context. Here are some project ideas categorized by skill level:


Beginner-Level Data Science Capstone Project Ideas


1. Exploratory Data Analysis (EDA) on a Dataset

Start by analyzing a dataset of your choice and exploring its characteristics, trends, and relationships. Practice using basic statistical techniques and visualization tools to gain insights and present your findings clearly and understandably.
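As a concrete starting point, here is a minimal EDA sketch using pandas and matplotlib; the file name data.csv and its columns are placeholders for whichever dataset you choose.

```python
# Minimal EDA sketch; "data.csv" is a placeholder for your chosen dataset.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

print(df.shape)           # rows and columns
print(df.dtypes)          # column types
print(df.isna().sum())    # missing values per column
print(df.describe())      # summary statistics for numeric columns

# Visualize distributions of all numeric columns
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

# Pairwise correlations between numeric columns
print(df.corr(numeric_only=True))
```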

2. Predictive Modeling with Linear Regression

Build a simple linear regression model to predict a target variable based on one or more input features. Learn about model evaluation techniques such as mean squared error and R-squared, and interpret the results to make meaningful predictions.
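A minimal scikit-learn sketch of this workflow might look like the following; the synthetic features stand in for whatever inputs and target your dataset actually provides.

```python
# Linear regression sketch with synthetic data standing in for a real dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))                                # two synthetic features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))
```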

3. Classification with Decision Trees

Use decision tree algorithms to classify data into distinct categories. Learn how to preprocess data, train a decision tree model, and evaluate its performance using metrics like accuracy, precision, and recall. Apply your model to practical scenarios like predicting customer churn or classifying spam emails.
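Here is a short sketch using scikit-learn's DecisionTreeClassifier; the built-in breast cancer dataset is only a stand-in for churn or spam data, which you would load yourself.

```python
# Decision tree classification sketch; the built-in dataset is a stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)   # binary classification data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
```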

4. Clustering with K-Means

Explore unsupervised learning by applying the K-Means algorithm to group similar data points together. Practice feature scaling and model evaluation to identify meaningful clusters within your dataset. Apply your clustering model to segment customers or analyze patterns in market data.
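A minimal K-Means sketch, assuming synthetic blob data in place of real customer records:

```python
# K-Means sketch with feature scaling and a silhouette check.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # synthetic data
X_scaled = StandardScaler().fit_transform(X)                 # scale features first

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
print("silhouette score:", silhouette_score(X_scaled, km.labels_))
```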

5. Sentiment Analysis on Text Data

Dive into natural language processing (NLP) by analyzing text data to determine sentiment polarity (positive, negative, or neutral). 

Learn about tokenization, text preprocessing, and sentiment analysis techniques using libraries like NLTK or spaCy. Apply your skills to analyze product reviews or social media comments.
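For example, a quick sentiment pass with NLTK's VADER analyzer might look like this; the two reviews are invented for illustration.

```python
# Sentiment polarity with NLTK's VADER analyzer; example reviews are made up.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

for review in ["Great product, works perfectly!", "Terrible, broke in a day."]:
    scores = sia.polarity_scores(review)
    label = ("positive" if scores["compound"] > 0.05
             else "negative" if scores["compound"] < -0.05
             else "neutral")
    print(label, scores)
```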

6. Time Series Forecasting

Predict future trends or values based on historical time series data. Learn about time series decomposition, trend analysis, and seasonal patterns using methods like ARIMA or exponential smoothing. Apply your forecasting skills to predict stock prices, weather patterns, or sales trends.
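A small statsmodels sketch of ARIMA forecasting, using a synthetic random walk in place of real price or sales data; the (p, d, q) order is illustrative, not tuned.

```python
# ARIMA forecasting sketch; the random walk stands in for real series data.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(size=200)))  # synthetic random walk

model = ARIMA(series, order=(1, 1, 1)).fit()  # (p, d, q) chosen for illustration
forecast = model.forecast(steps=10)           # predict the next 10 points
print(forecast)
```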

7. Image Classification with Convolutional Neural Networks (CNNs)

Explore deep learning concepts by building a basic CNN model to classify images into different categories. 

Learn about convolutional layers, pooling, and fully connected layers, and experiment with different architectures to improve model performance. Apply your CNN model to tasks like recognizing handwritten digits or classifying images of animals.
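A compact Keras sketch of this pattern on MNIST, a standard stand-in for your own image data; the layer sizes and the single training epoch are illustrative assumptions, not tuned choices.

```python
# Minimal CNN sketch: convolution -> pooling -> dense, trained briefly on MNIST.
import tensorflow as tf
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # add channel dimension, scale to [0, 1]
x_test = x_test[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128, validation_split=0.1)
print(model.evaluate(x_test, y_test))
```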

Intermediate-Level Data Science Capstone Project Ideas


8. Customer Segmentation and Market Basket Analysis

Utilize advanced clustering techniques to segment customers based on their purchasing behavior. Conduct market basket analysis to identify frequent item associations and recommend personalized product suggestions. 

Implement techniques like the Apriori algorithm or association rules mining to uncover valuable insights for targeted marketing strategies.
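A minimal market basket sketch using the mlxtend library's Apriori implementation; the four tiny transactions are invented for illustration.

```python
# Market basket sketch with mlxtend's Apriori; transactions are invented.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "milk"], ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"], ["bread", "milk", "diapers"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```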

9. Time Series Anomaly Detection

Apply anomaly detection algorithms to identify unusual patterns or outliers in time series data. Utilize techniques such as moving average, Z-score, or autoencoders to detect anomalies in various domains, including finance, IoT sensors, or network traffic. 

Develop robust anomaly detection models to enhance data security and predictive maintenance.
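As a simple baseline, a rolling z-score detector in pandas might look like this; the synthetic series and injected spike are placeholders for real sensor or transaction data.

```python
# Rolling z-score anomaly detection sketch on a synthetic series.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
values = pd.Series(rng.normal(0, 1, 500))
values.iloc[250] = 8.0  # inject an obvious anomaly

rolling_mean = values.rolling(window=30, min_periods=10).mean()
rolling_std = values.rolling(window=30, min_periods=10).std()
z = (values - rolling_mean) / rolling_std

anomalies = values[z.abs() > 3]  # points > 3 sigma from the rolling mean
print(anomalies)
```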

10. Recommendation System Development

Build a recommendation engine to suggest personalized items or content to users based on their preferences and behavior. Implement collaborative filtering, content-based filtering, or hybrid recommendation approaches to improve user engagement and satisfaction. 

Evaluate the performance of your recommendation system using metrics like precision, recall, and mean average precision.
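A bare-bones item-based collaborative filtering sketch, assuming a tiny hand-made user-item rating matrix in place of real interaction data:

```python
# Item-based collaborative filtering via cosine similarity over item columns.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.DataFrame(
    {"item_a": [5, 4, 0, 1], "item_b": [4, 5, 1, 0],
     "item_c": [1, 0, 5, 4], "item_d": [0, 1, 4, 5]},
    index=["u1", "u2", "u3", "u4"])  # rows = users, columns = items

item_sim = pd.DataFrame(cosine_similarity(ratings.T),
                        index=ratings.columns, columns=ratings.columns)

# Recommend the item most similar to one the user already rated highly
liked = "item_a"
print(item_sim[liked].drop(liked).sort_values(ascending=False).head(1))
```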

11. Natural Language Processing for Topic Modeling

Dive deeper into NLP by exploring topic modeling techniques to extract meaningful topics from text data. 

Implement algorithms like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF) to identify hidden themes or subjects within large text corpora. Apply topic modeling to analyze customer feedback, news articles, or academic papers.
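A short scikit-learn LDA sketch; the four toy documents stand in for a real corpus such as reviews or articles.

```python
# LDA topic modeling sketch; the documents are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the stock market fell on interest rate fears",
        "the team won the championship game last night",
        "investors worry about inflation and rates",
        "the coach praised the players after the match"]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)   # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:]]  # top words per topic
    print(f"topic {i}:", top)
```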

12. Fraud Detection in Financial Transactions

Develop a fraud detection system using machine learning algorithms to identify suspicious activities in financial transactions. Utilize supervised learning techniques such as logistic regression, random forests, or gradient boosting to classify transactions as fraudulent or legitimate. 

Employ feature engineering and model evaluation to improve fraud detection accuracy and minimize false positives.

13. Predictive Maintenance for Industrial Equipment

Implement predictive maintenance techniques to anticipate equipment failures and prevent costly downtime. 

Analyze sensor data from machinery using machine learning algorithms like support vector machines or recurrent neural networks to predict when maintenance is required. Optimize maintenance schedules to minimize downtime and maximize operational efficiency.

14. Healthcare Data Analysis and Disease Prediction

Utilize healthcare datasets to analyze patient demographics, medical history, and diagnostic tests to predict the likelihood of disease occurrence or progression. 

Apply machine learning algorithms such as logistic regression, decision trees, or support vector machines to develop predictive models for diseases like diabetes, cancer, or heart disease. Evaluate model performance using metrics like sensitivity, specificity, and area under the ROC curve.

Advanced Level Data Science Capstone Project Ideas


15. Deep Learning for Image Generation

Explore generative adversarial networks (GANs) or variational autoencoders (VAEs) to generate realistic images from scratch. Experiment with architectures like DCGAN or StyleGAN to create high-resolution images of faces, landscapes, or artwork. 

Evaluate image quality and diversity using perceptual metrics and human judgment.

16. Reinforcement Learning for Game Playing

Implement reinforcement learning algorithms like deep Q-learning or policy gradients to train agents to play complex games like Atari or board games. 

Experiment with exploration-exploitation strategies and reward-shaping techniques to improve agent performance and achieve superhuman levels of gameplay.

17. Anomaly Detection in Streaming Data

Develop real-time anomaly detection systems to identify abnormal behavior in data streams such as network traffic, sensor readings, or financial transactions. 

Utilize online learning algorithms like streaming k-means or Isolation Forest to detect anomalies and trigger timely alerts for intervention.
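A simplified sketch using scikit-learn's IsolationForest to score incoming points; a production streaming system would refit periodically or use a true online variant, which this example glosses over.

```python
# Isolation Forest sketch: fit on historical "normal" data, score new batches.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
history = rng.normal(0, 1, size=(1000, 2))           # past normal traffic
model = IsolationForest(contamination=0.01, random_state=0).fit(history)

new_batch = np.vstack([rng.normal(0, 1, size=(5, 2)), [[6.0, 6.0]]])
flags = model.predict(new_batch)                     # -1 marks an anomaly
print(flags)
```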

18. Multi-Modal Sentiment Analysis

Extend sentiment analysis to incorporate multiple modalities such as text, images, and audio to capture rich emotional expressions. 

Utilize deep learning architectures like multimodal transformers or fusion models to analyze sentiment across different modalities and improve understanding of complex human emotions.

19. Graph Neural Networks for Social Network Analysis

Apply graph neural networks (GNNs) to model and analyze complex relational data in social networks. Use techniques like graph convolutional networks (GCNs) or graph attention networks (GATs) to learn node embeddings and predict node properties such as community detection or influential users.

20. Time Series Forecasting with Deep Learning

Explore advanced deep learning architectures like long short-term memory (LSTM) networks or transformer-based models for time series forecasting. 

Utilize attention mechanisms and multi-horizon forecasting to capture long-term dependencies and improve prediction accuracy in dynamic and volatile environments.
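A minimal Keras LSTM sketch on a synthetic sine wave; the window size, layer width, and epoch count are illustrative assumptions rather than tuned choices.

```python
# LSTM forecasting sketch: sliding windows over a sine wave predict the next value.
import numpy as np
import tensorflow as tf

series = np.sin(np.linspace(0, 30, 600)).astype("float32")
window = 20
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., None]  # shape: (samples, timesteps, features)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(window, 1)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=3, batch_size=32, verbose=0)

print("next value:", model.predict(X[-1:])[0, 0])
```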

21. Adversarial Robustness in Machine Learning

Investigate techniques to improve the robustness of machine learning models against adversarial attacks. 

Explore methods like adversarial training, defensive distillation, or certified robustness to mitigate vulnerabilities and ensure model reliability under adversarial perturbations, particularly in critical applications like autonomous vehicles or healthcare.

These project ideas cater to various skill levels in data science, ranging from beginners to experts. Choose a project that aligns with your interests and skill level, and don’t hesitate to experiment and learn along the way!

Factors to Consider When Choosing a Data Science Capstone Project

Choosing the right data science capstone project is crucial for your learning experience and effectively showcasing your skills. Here are some factors to consider when selecting a data science capstone project:

Personal Interest

Select a project that aligns with your passions and career goals to stay motivated and engaged throughout the process.

Data Availability

Ensure access to relevant and sufficient data to complete the project and draw meaningful insights effectively.

Complexity Level

Consider your current skill level and choose a project that challenges you without overwhelming you, allowing for growth and learning.

Real-World Impact

Aim for projects with practical applications or societal relevance to showcase your ability to solve tangible problems.

Resource Requirements

Evaluate the availability of resources such as time, computing power, and software tools needed to execute the project successfully.

Mentorship and Support

Seek projects with opportunities for guidance and feedback from mentors or peers to enhance your learning experience.

Novelty and Innovation

Explore projects that push boundaries and explore new techniques or approaches to demonstrate creativity and originality in your work.

Tips for Successfully Completing a Data Science Capstone Project

Successfully completing a data science capstone project requires careful planning, effective execution, and strong communication skills. Here are some tips to help you navigate through the process:

  • Plan and Prioritize: Break down the project into manageable tasks and create a timeline to stay organized and focused.
  • Understand the Problem: Clearly define the project objectives, requirements, and expected outcomes before you begin the analysis.
  • Explore and Experiment: Experiment with different methodologies, algorithms, and techniques to find the most suitable approach.
  • Document and Iterate: Document your process, results, and insights thoroughly, and iterate on your analyses based on feedback and new findings.
  • Collaborate and Seek Feedback: Collaborate with peers, mentors, and stakeholders, actively seeking feedback to improve your work and decision-making.
  • Practice Communication: Communicate your findings effectively through clear visualizations, reports, and presentations tailored to your audience’s understanding.
  • Reflect and Learn: Reflect on your challenges, successes, and lessons learned throughout the project to inform your future endeavors and continuous improvement.

By following these tips, you can successfully navigate the data science capstone project and demonstrate your skills and expertise in the field.

Wrapping Up

In wrapping up, data science capstone project ideas are invaluable in bridging the gap between theory and practice, offering students a chance to apply their knowledge in real-world scenarios.

They are a cornerstone of data science education, fostering critical thinking, problem-solving, and practical skills development. 

As you embark on your journey, don’t hesitate to explore diverse and challenging project ideas. Embrace the opportunity to push boundaries, innovate, and make meaningful contributions to the field. 

Share your insights, challenges, and successes with others, and invite fellow enthusiasts to exchange ideas and experiences. 

FAQs

1. What is the purpose of a data science capstone project?

A data science capstone project serves as a culmination of a student’s learning experience, allowing them to apply their knowledge and skills to solve real-world problems in the field of data science. It provides hands-on experience and showcases their ability to analyze data, derive insights, and communicate findings effectively.

2. What are some examples of data science capstone projects?

Data science capstone projects can cover a wide range of topics and domains, including predictive modeling, natural language processing, image classification, recommendation systems, and more. Examples may include analyzing customer behavior, predicting stock prices, sentiment analysis on social media data, or detecting anomalies in financial transactions.

3. How long does it typically take to complete a data science capstone project?

The duration of a data science capstone project can vary depending on factors such as project complexity, available resources, and individual pace. Generally, it may take several weeks to several months to complete a project, including tasks such as data collection, preprocessing, analysis, modeling, and presentation of findings.



10 Unique Data Science Capstone Project Ideas

A capstone project is a culminating assignment that allows students to demonstrate the skills and knowledge they’ve acquired throughout their degree program. For data science students, it’s a chance to tackle a substantial real-world data problem.

If you’re short on time, here’s a quick answer to your question: Some great data science capstone ideas include analyzing health trends, building a predictive movie recommendation system, optimizing traffic patterns, forecasting cryptocurrency prices, and more .

In this comprehensive guide, we will explore 10 unique capstone project ideas for data science students. We’ll overview potential data sources, analysis methods, and practical applications for each idea.

Whether you want to work with social media datasets, geospatial data, or anything in between, you’re sure to find an interesting capstone topic.

Project Idea #1: Analyzing Health Trends

When it comes to data science capstone projects, analyzing health trends is an intriguing idea that can have a significant impact on public health. By leveraging data from various sources, data scientists can uncover valuable insights that can help improve healthcare outcomes and inform policy decisions.

Data Sources

There are several data sources that can be used to analyze health trends. One of the most common sources is electronic health records (EHRs), which contain a wealth of information about patient demographics, medical history, and treatment outcomes.

Other sources include health surveys, wearable devices, social media, and even environmental data.

Analysis Approaches

When analyzing health trends, data scientists can employ a variety of analysis approaches. Descriptive analysis can provide a snapshot of current health trends, such as the prevalence of certain diseases or the distribution of risk factors.

Predictive analysis can be used to forecast future health outcomes, such as predicting disease outbreaks or identifying individuals at high risk for certain conditions. Machine learning algorithms can be trained to identify patterns and make accurate predictions based on large datasets.

Applications

The applications of analyzing health trends are vast and far-reaching. By understanding patterns and trends in health data, policymakers can make informed decisions about resource allocation and public health initiatives.

Healthcare providers can use these insights to develop personalized treatment plans and interventions. Researchers can uncover new insights into disease progression and identify potential targets for intervention.

Ultimately, analyzing health trends has the potential to improve overall population health and reduce healthcare costs.

Project Idea #2: Movie Recommendation System

When developing a movie recommendation system, there are several data sources that can be used to gather information about movies and user preferences. One popular data source is the MovieLens dataset, which contains a large collection of movie ratings provided by users.

Another source is IMDb, a trusted website that provides comprehensive information about movies, including user ratings and reviews. Additionally, streaming platforms like Netflix and Amazon Prime also provide access to user ratings and viewing history, which can be valuable for building an accurate recommendation system.

There are several analysis approaches that can be employed to build a movie recommendation system. One common approach is collaborative filtering, which uses user ratings and preferences to identify patterns and make recommendations based on similar users’ preferences.

Another approach is content-based filtering, which analyzes the characteristics of movies (such as genre, director, and actors) to recommend similar movies to users. Hybrid approaches that combine both collaborative and content-based filtering techniques are also popular, as they can provide more accurate and diverse recommendations.

A movie recommendation system has numerous applications in the entertainment industry. One application is to enhance the user experience on streaming platforms by providing personalized movie recommendations based on individual preferences.

This can help users discover new movies they might enjoy and improve overall satisfaction with the platform. Additionally, movie recommendation systems can be used by movie production companies to analyze user preferences and trends, aiding in the decision-making process for creating new movies.

Finally, movie recommendation systems can also be utilized by movie critics and reviewers to identify movies that are likely to be well-received by audiences.

For more information on movie recommendation systems, you can visit https://www.kaggle.com/rounakbanik/movie-recommender-systems or https://www.researchgate.net/publication/221364567_A_new_movie_recommendation_system_for_large-scale_data.

Project Idea #3: Optimizing Traffic Patterns

When it comes to optimizing traffic patterns, there are several data sources that can be utilized. One of the most prominent sources is real-time traffic data collected from various sources such as GPS devices, traffic cameras, and mobile applications.

This data provides valuable insights into the current traffic conditions, including congestion, accidents, and road closures. Additionally, historical traffic data can also be used to identify recurring patterns and trends in traffic flow.

Other data sources that can be used include weather data, which can help in understanding how weather conditions impact traffic patterns, and social media data, which can provide information about events or incidents that may affect traffic.

Optimizing traffic patterns requires the use of advanced data analysis techniques. One approach is to use machine learning algorithms to predict traffic patterns based on historical and real-time data.

These algorithms can analyze various factors such as time of day, day of the week, weather conditions, and events to predict traffic congestion and suggest alternative routes.

Another approach is to use network analysis to identify bottlenecks and areas of congestion in the road network. By analyzing the flow of traffic and identifying areas where traffic slows down or comes to a halt, transportation authorities can make informed decisions on how to optimize traffic flow.

The optimization of traffic patterns has numerous applications and benefits. One of the main benefits is the reduction of traffic congestion, which can lead to significant time and fuel savings for commuters.

By optimizing traffic patterns, transportation authorities can also improve road safety by reducing the likelihood of accidents caused by congestion.

Additionally, optimizing traffic patterns can have positive environmental impacts by reducing greenhouse gas emissions. By minimizing the time spent idling in traffic, vehicles can operate more efficiently and emit fewer pollutants.

Furthermore, optimizing traffic patterns can have economic benefits by improving the flow of goods and services. Efficient traffic patterns can reduce delivery times and increase productivity for businesses.

Project Idea #4: Forecasting Cryptocurrency Prices

With the growing popularity of cryptocurrencies like Bitcoin and Ethereum, forecasting their prices has become an exciting and challenging task for data scientists. This project idea involves using historical data to predict future price movements and trends in the cryptocurrency market.

When working on this project, data scientists can gather cryptocurrency price data from various sources such as cryptocurrency exchanges, financial websites, or APIs. Websites like CoinMarketCap (https://coinmarketcap.com/) provide comprehensive data on various cryptocurrencies, including historical price data.

Additionally, platforms like CryptoCompare (https://www.cryptocompare.com/) offer real-time and historical data for different cryptocurrencies.

To forecast cryptocurrency prices, data scientists can employ various analysis approaches. Some common techniques include:

  • Time Series Analysis: This approach involves analyzing historical price data to identify patterns, trends, and seasonality in cryptocurrency prices. Techniques like moving averages, autoregressive integrated moving average (ARIMA), or exponential smoothing can be used to make predictions.
  • Machine Learning: Machine learning algorithms, such as random forests, support vector machines, or neural networks, can be trained on historical cryptocurrency data to predict future price movements. These algorithms can consider multiple variables, such as trading volume, market sentiment, or external factors, to make accurate predictions.
  • Sentiment Analysis: This approach involves analyzing social media sentiment and news articles related to cryptocurrencies to gauge market sentiment. By considering the collective sentiment, data scientists can predict how positive or negative sentiment can impact cryptocurrency prices.

Forecasting cryptocurrency prices can have several practical applications:

  • Investment Decision Making: Accurate price forecasts can help investors make informed decisions when buying or selling cryptocurrencies. By considering the predicted price movements, investors can optimize their investment strategies and potentially maximize their returns.
  • Trading Strategies: Traders can use price forecasts to develop trading strategies, such as trend following or mean reversion. By leveraging predicted price movements, traders can make profitable trades in the volatile cryptocurrency market.
  • Risk Management: Cryptocurrency price forecasts can help individuals and organizations manage their risk exposure. By understanding potential price fluctuations, risk management strategies can be implemented to mitigate losses.

Project Idea #5: Predicting Flight Delays

One interesting and practical data science capstone project idea is to create a model that can predict flight delays. Flight delays can cause a lot of inconvenience for passengers and can have a significant impact on travel plans.

By developing a predictive model, airlines and travelers can be better prepared for potential delays and take appropriate actions.

To create a flight delay prediction model, you would need to gather relevant data from various sources. Some potential data sources include:

  • Flight data from airlines or aviation organizations
  • Weather data from meteorological agencies
  • Historical flight delay data from airports

By combining these different data sources, you can build a comprehensive dataset that captures the factors contributing to flight delays.

Once you have collected the necessary data, you can employ different analysis approaches to predict flight delays. Some common approaches include:

  • Machine learning algorithms such as decision trees, random forests, or neural networks
  • Time series analysis to identify patterns and trends in flight delay data
  • Feature engineering to extract relevant features from the dataset

By applying these analysis techniques, you can develop a model that can accurately predict flight delays based on the available data.

The applications of a flight delay prediction model are numerous. Airlines can use the model to optimize their operations, improve scheduling, and minimize disruptions caused by delays. Travelers can benefit from the model by being alerted in advance about potential delays and making necessary adjustments to their travel plans.

Additionally, airports can use the model to improve resource allocation and manage passenger flow during periods of high delay probability. Overall, a flight delay prediction model can significantly enhance efficiency and customer satisfaction in the aviation industry.

Project Idea #6: Fighting Fake News

With the rise of social media and the easy access to information, the spread of fake news has become a significant concern. Data science can play a crucial role in combating this issue by developing innovative solutions.

Here are some aspects to consider when working on a project that aims to fight fake news.

When it comes to fighting fake news, having reliable data sources is essential. There are several trustworthy platforms that provide access to credible news articles and fact-checking databases. Websites like Snopes and FactCheck.org are good starting points for obtaining accurate information.

Additionally, social media platforms such as Twitter and Facebook can be valuable sources for analyzing the spread of misinformation.

One approach to analyzing fake news is by utilizing natural language processing (NLP) techniques. NLP can help identify patterns and linguistic cues that indicate the presence of misleading information.

Sentiment analysis can also be employed to determine the emotional tone of news articles or social media posts, which can be an indicator of potential bias or misinformation.

Another approach is network analysis, which focuses on understanding how information spreads through social networks. By analyzing the connections between users and the content they share, it becomes possible to identify patterns of misinformation dissemination.

Network analysis can also help in identifying influential sources and detecting coordinated efforts to spread fake news.

The applications of a project aiming to fight fake news are numerous. One possible application is the development of a browser extension or a mobile application that provides users with real-time fact-checking information.

This tool could flag potentially misleading articles or social media posts and provide users with accurate information to help them make informed decisions.

Another application could be the creation of an algorithm that automatically identifies fake news articles and separates them from reliable sources. This algorithm could be integrated into news aggregation platforms to help users distinguish between credible and non-credible information.

Project Idea #7: Analyzing Social Media Sentiment

Social media platforms have become a treasure trove of valuable data for businesses and researchers alike. When analyzing social media sentiment, there are several data sources that can be tapped into. The most popular ones include:

  • Twitter: With its vast user base and real-time nature, Twitter is often the go-to platform for sentiment analysis. Researchers can gather tweets containing specific keywords or hashtags to analyze the sentiment of a particular topic.
  • Facebook: Facebook offers rich data for sentiment analysis, including posts, comments, and reactions. Analyzing the sentiment of Facebook posts can provide valuable insights into user opinions and preferences.
  • Instagram: Instagram’s visual nature makes it an interesting platform for sentiment analysis. By analyzing the comments and captions on Instagram posts, researchers can gain insights into the sentiment associated with different images or topics.
  • Reddit: Reddit is a popular platform for discussions on various topics. By analyzing the sentiment of comments and posts on specific subreddits, researchers can gain insights into the sentiment of different communities.

These are just a few examples of the data sources that can be used for analyzing social media sentiment. Depending on the research goals, other platforms such as LinkedIn, YouTube, and TikTok can also be explored.

When it comes to analyzing social media sentiment, there are various approaches that can be employed. Some commonly used analysis techniques include:

  • Lexicon-based analysis: This approach involves using predefined sentiment lexicons to assign sentiment scores to words or phrases in social media posts. By aggregating these scores, researchers can determine the overall sentiment of a post or a collection of posts.
  • Machine learning: Machine learning algorithms can be trained to classify social media posts into positive, negative, or neutral sentiment categories. These algorithms learn from labeled data and can make predictions on new, unlabeled data.
  • Deep learning: Deep learning techniques, such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs), can be used to capture the complex patterns and dependencies in social media data. These models can learn to extract sentiment information from textual or visual content.

It is important to note that the choice of analysis approach depends on the specific research objectives, available resources, and the nature of the social media data being analyzed.

Analyzing social media sentiment has a wide range of applications across different industries. Here are a few examples:

  • Brand reputation management: By analyzing social media sentiment, businesses can monitor and manage their brand reputation. They can identify potential issues, respond to customer feedback, and take proactive measures to maintain a positive image.
  • Market research: Social media sentiment analysis can provide valuable insights into consumer opinions and preferences. Businesses can use this information to understand market trends, identify customer needs, and develop targeted marketing strategies.
  • Customer feedback analysis: Social media sentiment analysis can help businesses understand customer satisfaction levels and identify areas for improvement. By analyzing sentiment in customer feedback, companies can make data-driven decisions to enhance their products or services.
  • Public opinion analysis: Researchers can analyze social media sentiment to study public opinion on various topics, such as political events, social issues, or product launches. This information can be used to understand public sentiment, predict trends, and inform decision-making.

These are just a few examples of how analyzing social media sentiment can be applied in real-world scenarios. The insights gained from sentiment analysis can help businesses and researchers make informed decisions, improve customer experience, and drive innovation.

Project Idea #8: Improving Online Ad Targeting

Improving online ad targeting involves analyzing various data sources to gain insights into users’ preferences and behaviors. These data sources may include:

  • Website analytics: Gathering data from websites to understand user engagement, page views, and click-through rates.
  • Demographic data: Utilizing information such as age, gender, location, and income to create targeted ad campaigns.
  • Social media data: Extracting data from platforms like Facebook, Twitter, and Instagram to understand users’ interests and online behavior.
  • Search engine data: Analyzing search queries and user behavior on search engines to identify intent and preferences.

By combining and analyzing these diverse data sources, data scientists can gain a comprehensive understanding of users and their ad preferences.

To improve online ad targeting, data scientists can employ various analysis approaches:

  • Segmentation analysis: Dividing users into distinct groups based on shared characteristics and preferences.
  • Collaborative filtering: Recommending ads based on users with similar preferences and behaviors.
  • Predictive modeling: Developing algorithms to predict users’ likelihood of engaging with specific ads.
  • Machine learning: Utilizing algorithms that can continuously learn from user interactions to optimize ad targeting.

These analysis approaches help data scientists uncover patterns and insights that can enhance the effectiveness of online ad campaigns.

Improved online ad targeting has numerous applications:

  • Increased ad revenue: By delivering more relevant ads to users, advertisers can expect higher click-through rates and conversions.
  • Better user experience: Users are more likely to engage with ads that align with their interests, leading to a more positive browsing experience.
  • Reduced ad fatigue: By targeting ads more effectively, users are less likely to feel overwhelmed by irrelevant or repetitive advertisements.
  • Maximized ad budget: Advertisers can optimize their budget by focusing on the most promising target audiences.

Project Idea #9: Enhancing Customer Segmentation

Enhancing customer segmentation involves gathering relevant data from various sources to gain insights into customer behavior, preferences, and demographics. Some common data sources include:

  • Customer transaction data
  • Customer surveys and feedback
  • Social media data
  • Website analytics
  • Customer support interactions

By combining data from these sources, businesses can create a comprehensive profile of their customers and identify patterns and trends that will help in improving their segmentation strategies.

There are several analysis approaches that can be used to enhance customer segmentation:

  • Clustering: Using clustering algorithms to group customers based on similar characteristics or behaviors.
  • Classification: Building predictive models to assign customers to different segments based on their attributes.
  • Association Rule Mining: Identifying relationships and patterns in customer data to uncover hidden insights.
  • Sentiment Analysis: Analyzing customer feedback and social media data to understand customer sentiment and preferences.

These analysis approaches can be used individually or in combination to enhance customer segmentation and create more targeted marketing strategies.

Enhancing customer segmentation can have numerous applications across industries:

  • Personalized marketing campaigns: By understanding customer preferences and behaviors, businesses can tailor their marketing messages to individual customers, increasing the likelihood of engagement and conversion.
  • Product recommendations: By segmenting customers based on their purchase history and preferences, businesses can provide personalized product recommendations, leading to higher customer satisfaction and sales.
  • Customer retention: By identifying at-risk customers and understanding their needs, businesses can implement targeted retention strategies to reduce churn and improve customer loyalty.
  • Market segmentation: By identifying distinct customer segments, businesses can develop tailored product offerings and marketing strategies for each segment, maximizing the effectiveness of their marketing efforts.

Project Idea #10: Building a Chatbot

A chatbot is a computer program that uses artificial intelligence to simulate human conversation. It can interact with users in a natural language through text or voice. Building a chatbot can be an exciting and challenging data science capstone project.

It requires a combination of natural language processing, machine learning, and programming skills.

When building a chatbot, data sources play a crucial role in training and improving its performance. There are various data sources that can be used:

  • Chat logs: Analyzing existing chat logs can help in understanding common user queries, responses, and patterns. This data can be used to train the chatbot on how to respond to different types of questions and scenarios.
  • Knowledge bases: Integrating a knowledge base can provide the chatbot with a wide range of information and facts. This can be useful in answering specific questions or providing detailed explanations on certain topics.
  • APIs: Utilizing APIs from different platforms can enhance the chatbot’s capabilities. For example, integrating a weather API can allow the chatbot to provide real-time weather information based on user queries.

There are several analysis approaches that can be used to build an efficient and effective chatbot:

  • Natural Language Processing (NLP): NLP techniques enable the chatbot to understand and interpret user queries. This involves tasks such as tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis.
  • Intent recognition: Identifying the intent behind user queries is crucial for providing accurate responses. Machine learning algorithms can be trained to classify user intents based on the input text (a minimal sketch follows this list).
  • Contextual understanding: Chatbots need to understand the context of the conversation to provide relevant and meaningful responses. Techniques such as sequence-to-sequence models or attention mechanisms can be used to capture contextual information.
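To make the intent recognition step concrete, here is a minimal sketch using TF-IDF features and logistic regression; the four training utterances and two intents are invented for illustration.

```python
# Intent recognition sketch: TF-IDF features + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["what is the weather today", "will it rain tomorrow",
         "book a table for two", "reserve a seat tonight"]
intents = ["weather", "weather", "booking", "booking"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, intents)

print(clf.predict(["is it going to snow"]))  # expected: ['weather']
```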

Chatbots have a wide range of applications in various industries:

  • Customer support: Chatbots can be used to handle customer queries and provide instant support. They can assist with common troubleshooting issues, answer frequently asked questions, and escalate complex queries to human agents when necessary.
  • E-commerce: Chatbots can enhance the shopping experience by assisting users in finding products, providing recommendations, and answering product-related queries.
  • Healthcare: Chatbots can be deployed in healthcare settings to provide preliminary medical advice, answer general health-related questions, and assist with appointment scheduling.

Building a chatbot as a data science capstone project not only showcases your technical skills but also allows you to explore the exciting field of artificial intelligence and natural language processing.

It can be a great opportunity to create a practical and useful tool that can benefit users in various domains.

Completing an in-depth capstone project is the perfect way for data science students to demonstrate their technical skills and business acumen. This guide outlined 10 unique project ideas spanning industries like healthcare, transportation, finance, and more.

By identifying the ideal data sources, analysis techniques, and practical applications for their chosen project, students can produce an impressive capstone that solves real-world problems and showcases their abilities.


25+ Solved End-to-End Big Data Projects with Source Code


Ace your big data analytics interview by adding some unique and exciting Big Data projects to your portfolio. This blog lists over 20 big data analytics projects you can work on to showcase your big data skills and gain hands-on experience in big data tools and technologies. You will find several big data projects depending on your level of expertise: big data projects for students, big data projects for beginners, etc.



Have you ever looked for sneakers on Amazon and then seen advertisements for similar sneakers while searching the internet for the perfect cake recipe? Maybe you started using Instagram to search for fitness videos, and now Instagram keeps recommending videos from fitness influencers. And even if you're not very active on social media, you probably check your phone before leaving the house to see what the traffic is like on your route and how long it could take to reach your destination. None of this would be possible without the big data analysis processes run by modern data-driven companies. We bring you the top big data projects for 2023, specially curated for students, beginners, and anybody looking to get started with mastering data skills.

Table of Contents

  • What Is a Big Data Project?
  • How Do You Create a Good Big Data Project?
  • 25+ Big Data Project Ideas to Help Boost Your Resume
  • Big Data Project Ideas for Beginners
  • Intermediate Projects on Data Analytics
  • Advanced-Level Examples of Big Data Projects
  • Real-Time Big Data Projects with Source Code
  • Sample Big Data Project Ideas for Final-Year Students
  • Big Data Project Ideas Using Hadoop
  • Big Data Projects Using Spark
  • GCP and AWS Big Data Projects
  • Best Big Data Project Ideas for Masters Students
  • Fun Big Data Project Ideas
  • Top 5 Apache Big Data Projects
  • Top Big Data Projects on GitHub with Source Code
  • Level Up Your Big Data Expertise with ProjectPro's Big Data Projects
  • FAQs on Big Data Projects

A big data project is a data analysis project that uses machine learning algorithms and different data analytics techniques on structured and unstructured data for several purposes, including predictive modeling and other advanced analytics applications. Before actually working on any big data projects, data engineers must acquire proficient knowledge in the relevant areas, such as deep learning, machine learning, data visualization, data analytics, data science, etc. 

Many platforms, like GitHub and ProjectPro, offer various big data projects for professionals at all skill levels- beginner, intermediate, and advanced. However, before moving on to a list of big data project ideas worth exploring and adding to your portfolio, let us first get a clear picture of what big data is and why everyone is interested in it.


Kicking off a big data analytics project is always the most challenging part. You always encounter questions like what are the project goals, how can you become familiar with the dataset, what challenges are you trying to address, what are the necessary skills for this project, what metrics will you use to evaluate your model, etc.

Well! The first crucial step to launching your project initiative is to have a solid project plan. To build a big data project, you should always adhere to a clearly defined workflow. Before starting any big data project, it is essential to become familiar with the fundamental processes and steps involved, from gathering raw data to creating a machine learning model to its effective implementation.

Understand the Business Goals of the Big Data Project

The first step of any good big data analytics project is understanding the business or industry that you are working on. Go out and speak with the individuals whose processes you aim to transform with data before you even consider analyzing the data. Establish a timeline and specific key performance indicators afterward. Although planning and procedures can appear tedious, they are a crucial step to launching your data initiative! Identify a definite purpose for what you want to do with the data, such as a specific question to be answered or a data product to be built, to provide motivation, direction, and focus.


Collect Data for the Big Data Project

The next step in a big data project is looking for data once you've established your goal. To create a successful data project, collect and integrate data from as many different sources as possible. 

Here are some options for collecting data that you can utilize:

Connect to an existing database that is already public or access your private database.

Consider the APIs for all the tools your organization has been utilizing and the data they have gathered. You will need to put in some effort to set up those APIs so that you can use data such as email open and click statistics, the support requests someone sent, etc.

There are plenty of datasets on the Internet that can provide more information than what you already have. There are open data platforms in several regions (like data.gov in the U.S.). These open data sets are a fantastic resource if you're working on a personal project for fun.

Data Preparation and Cleaning

The data preparation step, which may consume up to 80% of the time allocated to any big data or data engineering project, comes next. Once you have the data, it's time to start using it. Start exploring what you have and how you can combine everything to meet the primary goal. To understand the relevance of all your data, start making notes on your initial analyses and ask significant questions to businesspeople, the IT team, or other groups. Data Cleaning is the next step. To ensure that data is consistent and accurate, you must review each column and check for errors, missing data values, etc.

Making sure that your project and your data are compatible with data privacy standards is a key aspect of data preparation that should not be overlooked. Personal data privacy and protection are becoming increasingly crucial, and you should prioritize them immediately as you embark on your big data journey. You must consolidate all your data initiatives, sources, and datasets into one location or platform to facilitate governance and carry out privacy-compliant projects. 


Data Transformation and Manipulation

Now that the data is clean, it's time to modify it so you can extract useful information. Combining all of your various sources and grouping logs will help you focus your data on the most significant aspects. You can do this, for instance, by adding time-based attributes to your data (see the pandas sketch after this list), like:

  • Acquiring date-related elements (month, hour, day of the week, week of the year, etc.)
  • Calculating the variations between date-column values, etc.
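A short pandas sketch of these transformations, assuming invented order and shipping timestamps:

```python
# Time-based feature engineering sketch; the timestamps are invented.
import pandas as pd

df = pd.DataFrame({
    "order_time": pd.to_datetime(
        ["2023-01-02 09:15", "2023-01-05 18:40", "2023-02-11 08:05"]),
    "ship_time": pd.to_datetime(
        ["2023-01-03 10:00", "2023-01-08 12:00", "2023-02-12 09:30"]),
})

# Date-related elements
df["month"] = df["order_time"].dt.month
df["hour"] = df["order_time"].dt.hour
df["day_of_week"] = df["order_time"].dt.day_name()
df["week_of_year"] = df["order_time"].dt.isocalendar().week

# Variation between two date columns
df["days_to_ship"] = (df["ship_time"] - df["order_time"]).dt.days

print(df)
```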

Joining datasets is another way to improve data, which entails extracting columns from one dataset or tab and adding them to a reference dataset. This is a crucial component of any analysis, but it can become a challenge when you have many data sources.

Visualize Your Data

Now that you have a decent dataset (or perhaps several), it would be wise to begin analyzing it by creating beautiful dashboards, charts, or graphs. The next stage of any data analytics project should focus on visualization because it is the most excellent approach to analyzing and showcasing insights when working with massive amounts of data.

Another method for enhancing your dataset and creating more intriguing features is to use graphs. For instance, by plotting your data points on a map, you may discover that some geographic regions are more informative than others.

Build Predictive Models Using Machine Learning Algorithms

Machine learning algorithms can help you take your big data project to the next level by providing you with more details and making predictions about future trends. You can create models to find trends in the data that were not visible in graphs by working with clustering techniques (also known as unsupervised learning). These organize relevant outcomes into clusters and more or less explicitly state the characteristic that determines these outcomes.

Advanced data scientists can use supervised algorithms to predict future trends. They discover features that have influenced previous data patterns by reviewing historical data and can then generate predictions using these features. 

Lastly, your predictive model needs to be operationalized for the project to be truly valuable. Deploying a machine learning model for adoption by all individuals within an organization is referred to as operationalization.

Repeat The Process

This is the last step in completing your big data project, and it's crucial to the whole data life cycle. One of the biggest mistakes individuals make when it comes to machine learning is assuming that once a model is created and implemented, it will always function normally. On the contrary, if models aren't updated with the latest data and regularly modified, their quality will deteriorate with time.

To accomplish your first data project effectively, you need to accept that your model will never truly be "complete." You need to continually reevaluate it, retrain it, and create new features for it to stay accurate and valuable. 

If you are new to Big Data, keep in mind that it is not an easy field, but at the same time, remember that nothing good in life comes easy; you have to work for it. The most helpful way of learning a skill is with some hands-on experience. Below is a list of Big Data analytics project ideas, along with an idea of the approach you could take to develop them, in the hope that this helps you learn more about Big Data and even kick-start a career in it. 

Yelp Data Processing Using Spark And Hive Part 1

Yelp Data Processing using Spark and Hive Part 2

Hadoop Project for Beginners-SQL Analytics with Hive

Tough engineering choices with large datasets in Hive Part - 1

Finding Unique URL's using Hadoop Hive

AWS Project - Build an ETL Data Pipeline on AWS EMR Cluster

Orchestrate Redshift ETL using AWS Glue and Step Functions

Analyze Yelp Dataset with Spark & Parquet Format on Azure Databricks

Data Warehouse Design for E-commerce Environments

Analyzing Big Data with Twitter Sentiments using Spark Streaming

PySpark Tutorial - Learn to use Apache Spark with Python

Tough engineering choices with large datasets in Hive Part - 2

Event Data Analysis using AWS ELK Stack

Web Server Log Processing using Hadoop

Data processing with Spark SQL

Build a Time Series Analysis Dashboard with Spark and Grafana

GCP Data Ingestion with SQL using Google Cloud Dataflow

Deploying auto-reply Twitter handle with Kafka, Spark, and LSTM

Dealing with Slowly Changing Dimensions using Snowflake

Spark Project -Real-Time data collection and Spark Streaming Aggregation

Snowflake Real-Time Data Warehouse Project for Beginners-1

Real-Time Log Processing using Spark Streaming Architecture

Real-Time Auto Tracking with Spark-Redis

Building Real-Time AWS Log Analytics Solution


In this section, you will find a list of good big data project ideas for masters students.

Hadoop Project-Analysis of Yelp Dataset using Hadoop Hive

Online Hadoop Projects -Solving small file problem in Hadoop

Airline Dataset Analysis using Hadoop, Hive, Pig, and Impala

AWS Project-Website Monitoring using AWS Lambda and Aurora

Explore features of Spark SQL in practice on Spark 2.0

MovieLens Dataset Exploratory Analysis

Bitcoin Data Mining on AWS

Create A Data Pipeline Based On Messaging Using PySpark And Hive - Covid-19 Analysis

Spark Project-Analysis and Visualization on Yelp Dataset

Project Ideas on Big Data Analytics

Let us now begin with a more detailed list of good big data project ideas that you can easily implement.

This section will introduce you to a list of project ideas on big data that use Hadoop along with descriptions of how to implement them.

1. Visualizing Wikipedia Trends

Human brains tend to process visual data better than data in any other format. 90% of the information transmitted to the brain is visual, and the human brain can process an image in just 13 milliseconds. Wikipedia is a site accessed by people all around the world for research, general information, and the occasional bout of curiosity.


Raw page-view counts from Wikipedia can be collected and processed via Hadoop. The processed data can then be visualized using Zeppelin notebooks to analyze trends, broken down by demographics or other parameters. This is a good pick for someone looking to understand how big data analysis and visualization work together, and an excellent Apache Big Data project idea.

Visualizing Wikipedia Trends Big Data Project with Source Code.
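
For a feel of the processing step, here is a minimal PySpark sketch; it assumes hourly Wikimedia pageview dump files (space-delimited project, page title, view count, and response bytes) sitting at a hypothetical HDFS path:

```python
# A minimal PySpark sketch over Wikimedia pageview dump files.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wiki-trends").getOrCreate()

raw = spark.read.text("hdfs:///data/pageviews/*")   # hypothetical HDFS path
cols = F.split(F.col("value"), " ")
views = raw.select(
    cols.getItem(0).alias("project"),
    cols.getItem(1).alias("page"),
    cols.getItem(2).cast("long").alias("views"),
)

top_pages = (views.filter(F.col("project") == "en")
             .groupBy("page").agg(F.sum("views").alias("total_views"))
             .orderBy(F.desc("total_views")).limit(20))
top_pages.show()   # the same DataFrame can be charted in a Zeppelin notebook
```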

2. Visualizing Website Clickstream Data

Clickstream data analysis refers to collecting, processing, and understanding all the web pages a particular user visits. This analysis benefits web page marketing, product management, and targeted advertisement. Since users tend to visit sites based on their requirements and interests, clickstream analysis can help to get an idea of what a user is looking for. 

Visualizing this data helps identify trends, so advertisements can be tailored to individuals. Ads on webpages provide a source of income for the site and, at the same time, help the business publishing the ad reach its customers and other internet users. This can be classified as a Big Data Apache project by using Hadoop to build it.

Big Data Analytics Projects Solution for Visualization of Clickstream Data on a Website
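
As a rough starting point, the following PySpark sketch aggregates page visits per user from a click-event file; the CSV path and column names (user_id, url) are assumptions about your data:

```python
# A minimal PySpark sketch for clickstream aggregation; the path and
# column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream").getOrCreate()
clicks = spark.read.csv("hdfs:///data/clicks.csv", header=True, inferSchema=True)

# Pages each user visits most often -- a starting point for targeted ads.
per_user = clicks.groupBy("user_id", "url").count()
per_user.orderBy("user_id", F.desc("count")).show(20)
```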

3. Web Server Log Processing

A web server log maintains a list of page requests and the activities the server has performed. Storing, processing, and mining the data on web servers enables further analysis: webpage ads can be determined, and SEO (search engine optimization) can be carried out. Web-server log analysis also gives a picture of the overall user experience. This kind of processing benefits any business that relies heavily on its website for revenue generation or to reach its customers. The Apache Hadoop open-source big data project ecosystem, with tools such as Pig, Impala, Hive, Spark, Kafka, Oozie, and HDFS, can be used for storage and processing.

Big Data Project using Hadoop with Source Code for Web Server Log Processing 
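
For a sense of the parsing step, here is a minimal PySpark sketch that extracts fields from Apache combined-format access logs and surfaces the most common broken links; the log path is hypothetical:

```python
# A minimal PySpark sketch that parses Apache combined-format access logs.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.appName("weblogs").getOrCreate()
logs = spark.read.text("hdfs:///logs/access_log*")   # hypothetical path

pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3})'
parsed = logs.select(
    regexp_extract("value", pattern, 1).alias("host"),
    regexp_extract("value", pattern, 3).alias("method"),
    regexp_extract("value", pattern, 4).alias("url"),
    regexp_extract("value", pattern, 5).cast("int").alias("status"),
)

# The most frequently requested missing pages (HTTP 404s).
parsed.filter(col("status") == 404).groupBy("url").count() \
      .orderBy(col("count").desc()).show(10)
```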

This section will provide you with a list of projects that utilize Apache Spark for their implementation.

4. Analysis of Twitter Sentiments Using Spark Streaming

Sentiment analysis is another interesting big data project topic: the process of determining whether a given opinion is positive, negative, or neutral. For a business, knowing the sentiments or reactions of a group of people to a new product launch or event can help determine the product's profitability and extend the business's reach by capturing how customers feel. From a political standpoint, crowd sentiment toward a candidate or a party's decision can help determine what keeps a specific group of people happy and satisfied. You can use Twitter sentiments to predict election results as well.


Sentiment analysis has to be done on a large dataset, since there are over 180 million monetizable daily active users (https://www.businessofapps.com/data/twitter-statistics/) on Twitter. The analysis also has to be done in real time. Spark Streaming can be used to gather data from Twitter in real time. NLP (Natural Language Processing) models will have to be used for sentiment analysis, and the models will have to be trained on prior datasets. Sentiment analysis is one of the more advanced projects showcasing the use of Big Data, due to its involvement of NLP.

Access Big Data Project Solution to Twitter Sentiment Analysis
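
To illustrate the streaming half of the pipeline, here is a minimal Spark Structured Streaming sketch; a socket source stands in for a real tweet feed, and a tiny word lexicon stands in for a trained NLP model:

```python
# A minimal Spark Structured Streaming sketch; the socket source and the
# lexicon are stand-ins for a real tweet feed and a trained model.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("tweet-sentiment").getOrCreate()
tweets = (spark.readStream.format("socket")
          .option("host", "localhost").option("port", 9999).load())

POSITIVE, NEGATIVE = {"good", "great", "love"}, {"bad", "awful", "hate"}

@udf(StringType())
def label(text):
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

counts = tweets.withColumn("sentiment", label("value")).groupBy("sentiment").count()
counts.writeStream.outputMode("complete").format("console").start().awaitTermination()
```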

5. Real-time Analysis of Log-entries from Applications Using Streaming Architectures

If you are looking to practice and get your hands dirty with a real-time big data project, then this big data project title must be on your list. Where web server log processing would require data to be processed in batches, applications that stream data will have log files that would have to be processed in real-time for better analysis. Real-time streaming behavior analysis gives more insight into customer behavior and can help find more content to keep the users engaged. Real-time analysis can also help to detect a security breach and take necessary action immediately. Many social media networks work using the concept of real-time analysis of the content streamed by users on their applications. Spark has a Streaming tool that can process real-time streaming data.

Access Big Data Spark Project Solution to Real-time Analysis of log-entries from applications using Streaming Architecture
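
The sketch below shows the core pattern of such real-time log analysis with Spark Structured Streaming: windowed aggregation over a stream of events. The directory, schema, and window size are illustrative:

```python
# A minimal sketch of windowed aggregation over streaming application logs,
# assuming newline-delimited JSON files landing in a watched directory.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col
from pyspark.sql.types import StructType, StringType, TimestampType

schema = (StructType()
          .add("ts", TimestampType())
          .add("level", StringType())
          .add("message", StringType()))

spark = SparkSession.builder.appName("log-stream").getOrCreate()
logs = spark.readStream.schema(schema).json("hdfs:///stream/logs/")

# Count errors per one-minute window -- a spike may signal a breach or outage.
errors = (logs.filter(col("level") == "ERROR")
          .groupBy(window(col("ts"), "1 minute")).count())

errors.writeStream.outputMode("complete").format("console").start().awaitTermination()
```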

6. Analysis of Crime Datasets

Analysis of crimes such as shootings, robberies, and murders can reveal trends that keep the police alert to the likelihood of crimes in a given area. These trends can support more strategic, optimal planning for locating police stations and stationing personnel.

With access to CCTV surveillance in real-time, behavior detection can help identify suspicious activities. Similarly, facial recognition software can play a bigger role in identifying criminals. A basic analysis of a crime dataset is one of the ideal Big Data projects for students. However, it can be made more complex by adding in the prediction of crime and facial recognition in places where it is required.

Big Data Analytics Projects for Students on Chicago Crime Data Analysis with Source Code
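
A basic version of this analysis fits in a few lines of pandas; the sketch below assumes the public Chicago crimes CSV and its Date and Primary Type columns:

```python
# A minimal pandas sketch over a crime dataset such as the public Chicago
# crimes CSV; the file name and columns are assumptions about that data.
import pandas as pd

crimes = pd.read_csv("chicago_crimes.csv", parse_dates=["Date"])

# Which offenses are most common, and at what hour do incidents peak?
print(crimes["Primary Type"].value_counts().head(10))
by_hour = crimes.groupby(crimes["Date"].dt.hour).size()
print(by_hour.idxmax(), "is the busiest hour")
```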


In this section, you will find big data projects that rely on cloud service providers such as AWS and GCP.

7. Build a Scalable Event-Based GCP Data Pipeline using DataFlow

Suppose you are running an eCommerce website and a customer places an order. In that case, you must inform the warehouse team to check the stock availability and commit to fulfilling the order. After that, the parcel has to be assigned to a delivery firm so it can be shipped to the customer. For such scenarios, traditional data-driven integration becomes cumbersome, so event-based data integration is preferable.

This project will teach you how to design and implement an event-based data integration pipeline on the Google Cloud Platform by processing data using DataFlow.


Data Description: You will use the Covid-19 dataset (COVID-19 Cases.csv) from data.world for this project, which contains attributes such as:

people_positive_cases_count

county_name

data_source

Language Used: Python 3.7

Services: Cloud Composer, Google Cloud Storage (GCS), Pub/Sub, Cloud Functions, BigQuery, Bigtable

Big Data Project with Source Code: Build a Scalable Event-Based GCP Data Pipeline using DataFlow  
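
For orientation, here is a minimal Apache Beam sketch of such an event-based pipeline: it reads case events from Pub/Sub and streams them into BigQuery, with Dataflow as the runner. The project, topic, and table names are placeholders:

```python
# A minimal Apache Beam sketch of an event-based GCP pipeline; the project,
# topic, and table names are placeholders, and the schema uses two of the
# dataset attributes listed above.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True, runner="DataflowRunner",
                          project="my-project", region="us-central1")

with beam.Pipeline(options=options) as p:
    (p
     | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/cases")
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:covid.cases",
           schema="county_name:STRING,people_positive_cases_count:INTEGER"))
```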

8. Topic Modeling

The future is AI! You must have come across similar quotes about artificial intelligence (AI). Initially, most people found such claims difficult to believe; now, we are watching top multinational companies drift towards automating tasks using machine learning tools.

Understand the reason behind this drift by working on one of our repository's most practical data engineering project examples.


Project Objective: Understand the end-to-end implementation of Machine learning operations (MLOps) by using cloud computing .

Learnings from the Project: This project will introduce you to various applications of AWS services. You will learn how to convert an ML application into a Flask application and deploy it with the Gunicorn web server. You will implement the project solution using AWS CodeBuild, and you will get familiar with ECS cluster task definitions.

Tech Stack:

Language: Python

Libraries: Flask, gunicorn, scipy, nltk, tqdm, numpy, joblib, pandas, scikit-learn, boto3

Services: Flask, Docker, AWS, Gunicorn

Source Code: MLOps AWS Project on Topic Modeling using Gunicorn Flask
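
The modeling core of such a project can be prototyped locally before any MLOps work. The sketch below uses scikit-learn's LatentDirichletAllocation on a toy corpus, which stands in for your real documents:

```python
# A minimal topic-modeling sketch with scikit-learn's LDA; the documents
# are placeholders for a real corpus.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["spark streams big data pipelines",
        "flask deploys machine learning models",
        "hive queries large hadoop tables"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:]]   # 4 highest-weight terms
    print(f"Topic {i}: {top}")
```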

9. MLOps on GCP Project for Autoregression using uWSGI Flask

Here is a project that combines Machine Learning Operations (MLOps) and Google Cloud Platform (GCP). As companies switch to automation using machine learning algorithms, they have realized that hardware plays a crucial role; many cloud service providers have therefore emerged to help such companies overcome their hardware limitations. We have added this project to our repository to assist you with the end-to-end deployment of a machine learning project.

Project Objective: Deploying the moving average time-series machine-learning model on the cloud using GCP and Flask.

Learnings from the Project: You will work with Flask and uWSGI model files in this project. You will learn about creating Docker images and Kubernetes architecture. You will also explore different components of GCP and their significance, and learn how to clone the source code from a Git repository. Flask and Kubernetes deployment will also be discussed in this project.

Tech Stack: Language - Python

Services - GCP, uWSGI, Flask, Kubernetes, Docker
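
At the center of the deployment sits a small web service. Here is a minimal sketch of a Flask endpoint returning a moving-average forecast, with the window size and route chosen purely for illustration:

```python
# A minimal sketch of the served model: a Flask endpoint returning a
# moving-average forecast; window size and route are illustrative.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/forecast", methods=["POST"])
def forecast():
    series = request.get_json()["series"]    # e.g. {"series": [10, 12, 11, 13]}
    window = min(3, len(series))
    prediction = sum(series[-window:]) / window   # simple moving average
    return jsonify(forecast=prediction)

# In the project, an app like this is served by uWSGI inside a Docker
# container deployed to a Kubernetes cluster on GCP.
if __name__ == "__main__":
    app.run(port=8080)
```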


This section has good big data project ideas for graduate students who have enrolled in a master's course.

10. Real-time Traffic Analysis

Traffic is an issue in many major cities, especially during the busier hours of the day. If traffic is monitored in real time over popular and alternate routes, steps could be taken to reduce congestion on some roads. Real-time traffic analysis can also be used to program traffic lights at junctions: staying green longer on roads with heavy movement and for less time on roads with less vehicular movement at a given time. Real-time traffic analysis can help businesses manage their logistics and help commuters plan their trips. Concepts of deep learning can be used to analyze this dataset properly.

11. Health Status Prediction

“Health is wealth” is a prevalent saying. And rightly so, there cannot be wealth unless one is healthy enough to enjoy worldly pleasures. Many diseases have risk factors that can be genetic, environmental, dietary, and more common for a specific age group or sex and more commonly seen in some races or areas. By gathering datasets of this information relevant for particular diseases, e.g., breast cancer, Parkinson’s disease, and diabetes, the presence of more risk factors can be used to measure the probability of the onset of one of these issues. 


In cases where the risk factors are not already known, analysis of the datasets can be used to identify patterns of risk factors and hence predict the likelihood of onset accordingly. The level of complexity could vary depending on the type of analysis that has to be done for different diseases. Nevertheless, since prediction tools have to be applied, this is not a beginner-level big data project idea.
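
As a small, grounded starting point, the sketch below fits a logistic regression to scikit-learn's bundled breast cancer dataset and reads off onset probabilities; a real project would substitute its own risk-factor data:

```python
# A minimal sketch using scikit-learn's bundled breast cancer dataset to
# estimate onset probability from risk-factor-style features.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
print("class probabilities for first patient:", clf.predict_proba(X_te[:1])[0])
```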

12. Analysis of Tourist Behavior

Tourism is a large sector that provides a livelihood for many people and significantly impacts a country's economy. Not all tourists behave similarly, simply because individuals have different preferences. Analyzing this behavior based on decision-making, perception, choice of destination, and level of satisfaction can help both travelers and locals have a more wholesome experience. Behavior analysis, like sentiment analysis, is one of the more advanced project ideas in the Big Data field.

13. Detection of Fake News on Social Media


With the popularity of social media, a major concern is the spread of fake news on various sites. Even worse, this misinformation tends to spread faster than factual information. According to Wikipedia, fake news can be visual-based, which refers to images, videos, and even graphical representations of data, or linguistics-based, which refers to fake news in the form of text or a string of characters. Different cues are used, depending on the type of news, to differentiate fake news from real. A site like Twitter has 330 million users, while Facebook has 2.8 billion; enormous amounts of data circulate on these sites and must be processed to determine a post's validity. Data models based on machine learning techniques and computational methods based on NLP will have to be used to build an algorithm that can detect fake news on social media.

Access Solution to Interesting Big Data Project on Detection of Fake News
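
A linguistics-based baseline is straightforward to prototype. The sketch below pairs TF-IDF features with a linear classifier and assumes a labeled CSV with hypothetical text and label columns:

```python
# A minimal linguistics-based sketch: TF-IDF features plus a linear
# classifier; "news.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("news.csv")                 # label column: "FAKE" or "REAL"
X_tr, X_te, y_tr, y_te = train_test_split(df["text"], df["label"], random_state=7)

vec = TfidfVectorizer(stop_words="english", max_df=0.7)
clf = PassiveAggressiveClassifier(max_iter=50).fit(vec.fit_transform(X_tr), y_tr)
print(accuracy_score(y_te, clf.predict(vec.transform(X_te))))
```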

14. Prediction of Calamities in a Given Area

Certain calamities, such as landslides and wildfires, occur more frequently during a particular season and in certain areas. Using certain geospatial technologies such as remote sensing and GIS (Geographic Information System) models makes it possible to monitor areas prone to these calamities and identify triggers that lead to such issues. 


If calamities can be predicted more accurately, steps can be taken to protect residents, contain the disasters, and maybe even prevent them in the first place. Past landslide data has to be analyzed while, at the same time, on-site ground monitoring is carried out using remote sensing. The sooner a calamity can be identified, the easier it is to contain the harm. The need for knowledge of GIS and its application adds to the complexity of this Big Data project.

15. Generating Image Captions

With the emergence of social media and the importance of digital marketing, it has become essential for businesses to upload engaging content. Catchy images are a requirement, but captions have to be added to describe them. The additional use of hashtags and attention-drawing captions can help reach the correct target audience. Large datasets correlating images and captions have to be handled.


This involves image processing and deep learning to understand the image, and artificial intelligence to generate relevant but appealing captions. Python is a good choice of language for implementing it. Image caption generation cannot exactly be considered a beginner-level Big Data project idea; it is probably better to get some exposure with one of the simpler projects before proceeding to this one.
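
To see what the captioning piece involves, here is a minimal sketch that runs a publicly available pretrained captioning model through the Hugging Face transformers pipeline; the image path is hypothetical, and a real project would fine-tune its own model on domain data:

```python
# A minimal sketch using a public pretrained captioning model via the
# transformers library; "product_photo.jpg" is a hypothetical local image.
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="nlpconnect/vit-gpt2-image-captioning")
result = captioner("product_photo.jpg")
print(result[0]["generated_text"])
```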


16. Credit Card Fraud Detection


The goal is to identify fraudulent credit card transactions so that a customer is not billed for an item they did not purchase. This can be challenging: the datasets are huge, detection has to happen as quickly as possible so the fraudsters do not continue to purchase more items, and the underlying data is largely private, which limits availability. Since this project involves machine learning, results improve with larger datasets, which makes data availability an even bigger constraint. Better fraud detection benefits a business directly, as customers are more likely to trust companies that do not bill them for someone else's purchases. Fraud detection is one of the most common Big Data project ideas for beginners and students.
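
A reasonable baseline addresses the class imbalance directly. The sketch below uses a class-weighted random forest and assumes a CSV with numeric features and a 0/1 is_fraud column, both hypothetical:

```python
# A minimal sketch handling the class imbalance typical of fraud data;
# "transactions.csv" and its "is_fraud" column are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")
X, y = df.drop(columns="is_fraud"), df["is_fraud"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# class_weight="balanced" up-weights the rare fraud class during training.
clf = RandomForestClassifier(class_weight="balanced", random_state=1).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))   # check recall on frauds
```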

If you are looking for big data project examples that are fun to implement then do not miss out on this section.

17. GIS Analytics for Better Waste Management

Due to urbanization and population growth, large amounts of waste are being generated globally. Improper waste management is a hazard not only to the environment but also to us. Waste management involves the process of handling, transporting, storing, collecting, recycling, and disposing of the waste generated. Optimal routing of solid waste collection trucks can be done using GIS modeling to ensure that waste is picked up, transferred to a transfer site, and reaches the landfills or recycling plants most efficiently. GIS modeling can also be used to select the best sites for landfills. The location and placement of garbage bins within city localities must also be analyzed. 

18. Customized Programs for Students

We all tend to have different strengths and paces of learning. There are different kinds of intelligence, and the curriculum only focuses on a few things. Data analytics can help modify academic programs to nurture students better. Programs can be designed based on a student’s attention span and can be modified according to an individual’s pace, which can be different for different subjects. E.g., one student may find it easier to grasp language subjects but struggle with mathematical concepts.

In contrast, another might find it easier to work with math but not be able to breeze through language subjects. Customized programs can boost students’ morale, which could also reduce the number of dropouts. Analysis of a student’s strong subjects, monitoring their attention span, and their responses to specific topics in a subject can help build the dataset to create these customized programs.

19. Real-time Tracking of Vehicles

Transportation plays a significant role in many activities. Every day, goods have to be shipped across cities and countries; kids commute to school, and employees have to get to work. Some of these modes might have to be closely monitored for safety and tracking purposes. I’m sure parents would love to know if their children’s school buses were delayed while coming back from school for some reason. 


Taxi applications have to keep track of their users to ensure the safety of both drivers and riders. Tracking has to be done in real time, as the vehicles are continuously on the move, so there will be a continuous stream of data flowing in. This data has to be processed so that information on how the vehicles move is available, both to improve routes where required and simply to know the vehicles' general whereabouts.

20. Analysis of Network Traffic and Call Data Records

There are large chunks of data making rounds in the telecommunications industry. However, very little of this data is currently being used to improve the business. According to a MindCommerce study: "An average telecom operator generates billions of records per day, and data should be analyzed in real or near real-time to gain maximum benefit."

The main challenge here is that these large amounts of data must be processed in real-time. With big data analysis, telecom industries can make decisions that can improve the customer experience by monitoring the network traffic. Issues such as call drops and network interruptions must be closely monitored to be addressed accordingly. By evaluating the usage patterns of customers, better service plans can be designed to meet these required usage needs. The complexity and tools used could vary based on the usage requirements of this project.

This section contains project ideas in big data that are primarily open-source and have been developed by Apache.

21. Apache Hadoop

Apache Hadoop is an open-source big data processing framework that allows distributed storage and processing of large datasets across clusters of commodity hardware. It provides a scalable, reliable, and cost-effective solution for processing and analyzing big data.

22. Apache Spark

Apache Spark is an open-source big data processing engine that provides high-speed data processing capabilities for large-scale data processing tasks. It offers a unified analytics platform for batch processing, real-time processing, machine learning, and graph processing.

23. Apache Nifi 

Apache NiFi is an open-source data integration tool that enables users to easily and securely transfer data between systems, databases, and applications. It provides a web-based user interface for creating, scheduling, and monitoring data flows, making it easy to manage and automate data integration tasks.

24. Apache Flink

Apache Flink is an open-source big data processing framework that provides scalable, high-throughput, and fault-tolerant data stream processing capabilities. It offers low-latency data processing and provides APIs for batch processing, stream processing, and graph processing.

25. Apache Storm

Apache Storm is an open-source distributed real-time processing system that provides scalable and fault-tolerant stream processing capabilities. It allows users to process large amounts of data in real-time and provides APIs for creating data pipelines and processing data streams.

Does Big Data sound difficult to work with? Work on end-to-end solved Big Data Projects using Spark , and you will know how easy it is!

This section has projects on big data along with links to their source code on GitHub.

26. Fruit Image Classification

This project aims to build a mobile application that lets users take pictures of fruits and get details about them for fruit harvesting. It develops a data processing chain in a big data environment using Amazon Web Services (AWS) cloud tools, including steps like data preprocessing and dimensionality reduction, and implements a fruit image classification engine.


The project involves generating PySpark scripts and utilizing the AWS cloud to benefit from a Big Data architecture (EC2, S3, IAM) built on an EC2 Linux server. This project also uses Databricks, since it is compatible with AWS.

Source Code: Fruit Image Classification
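
The dimensionality-reduction step can be prototyped with Spark's built-in PCA. In the sketch below, two tiny hand-made vectors stand in for image feature vectors extracted earlier in the chain:

```python
# A minimal PySpark sketch of the dimensionality-reduction step; the dense
# vectors are stand-ins for extracted image features.
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fruit-pca").getOrCreate()
rows = [(Vectors.dense([0.1, 0.4, 0.9, 0.2]),),
        (Vectors.dense([0.3, 0.1, 0.8, 0.7]),)]
df = spark.createDataFrame(rows, ["features"])

pca = PCA(k=2, inputCol="features", outputCol="pca_features").fit(df)
pca.transform(df).select("pca_features").show(truncate=False)
```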

27. Airline Customer Service App

In this project, you will build a web application that uses machine learning and Azure Databricks to forecast travel delays using weather data and airline delay statistics. Planning a bulk data import operation is the first step in the project. Next comes preparation, which includes cleaning and preparing the data for testing and building your machine learning model.


This project will teach you how to deploy the trained model to Docker containers for on-demand predictions after storing it in Azure Machine Learning Model Management. It transfers data using Azure Data Factory (ADF) and summarizes data using Azure Databricks and Spark SQL. The project uses Power BI to visualize batch forecasts.

Source Code: Airline Customer Service App

28. Criminal Network Analysis

This fascinating big data project seeks to find patterns to predict and detect links in a dynamic criminal network. This project uses a stream processing technique to extract relevant information as soon as data is generated since the criminal network is a dynamic social graph. It also suggests three brand-new social network similarity metrics for criminal link discovery and prediction. The next step is to develop a flexible data stream processing application using the Apache Flink framework, which enables the deployment and evaluation of the newly proposed and existing metrics.

Source Code- Criminal Network Analysis

Trying out the big data project ideas mentioned above will help you get used to the popular tools in the industry. But these projects alone are not enough if you are planning to land a job in the big data industry. If you are curious about what else will get you closer to your dream job, we highly recommend you check out ProjectPro. ProjectPro hosts a repository of solved projects in Data Science and Big Data prepared by industry experts. It offers a subscription to that repository, containing solutions in the form of guided videos along with supporting documentation to help you understand the projects end-to-end. So don't wait any longer to get your hands dirty with ProjectPro projects and subscribe to the repository today!


1. Why are big data projects important?

Big data projects are important as they will help you to master the necessary big data skills for any job role in the relevant field. These days, most businesses use big data to understand what their customers want, their best customers, and why individuals select specific items. This indicates a huge demand for big data experts in every industry, and you must add some good big data projects to your portfolio to stay ahead of your competitors.

2. What are some good big data projects?

Design a Network Crawler by Mining Github Social Profiles. In this big data project, you'll work on a Spark GraphX Algorithm and a Network Crawler to mine the people relationships around various Github projects.

Visualize Daily Wikipedia Trends using Hadoop - You'll collect raw Wikipedia page-view counts, process them with Hadoop, and visualize daily trends using Zeppelin notebooks.

Modeling & Thinking in Graphs (Neo4j) using MovieLens Dataset - You will reconstruct the MovieLens dataset in a graph structure and use that structure to answer queries in various ways in this Neo4j big data project.

3. How long does it take to complete a big data project?

A big data project might take a few hours to hundreds of days to complete. It depends on various factors such as the type of data you are using, its size, where it's stored, whether it is easily accessible, whether you need to perform any considerable amount of ETL processing on the data, etc. 



University of Adelaide

Big Data Capstone Project

Further develop your knowledge of big data by applying the skills you have learned to a real-world data science project.

The Big Data Capstone Project will allow you to apply the techniques and theory you have gained from the four courses in this Big Data MicroMasters program to a medium-scale data science project.

Working with organisations and stakeholders of your choice on a real-world dataset, you will further develop your data science skills and knowledge.

This project will give you the opportunity to deepen your learning by giving you valuable experience in evaluating, selecting and applying relevant data science techniques, principles and theory to a data science problem.

This project will see you plan and execute a reasonably substantial project and demonstrate autonomy, initiative and accountability.

You’ll deepen your learning of social and ethical concerns in relation to data science, including an analysis of ethical concerns and ethical frameworks in relation to data selection and data management.

By communicating the knowledge, skills and ideas you have gained to other learners through online collaborative technologies, you will learn valuable communication skills, important for any career. You’ll also deliver a written presentation of your project design, plan, methodologies, and outcomes.

What you'll learn

The Big Data Capstone project will give you the chance to demonstrate practically what you have learned in the Big Data MicroMasters program including:

  • How to evaluate, select, and apply data science techniques, principles, and theory;
  • How to plan and execute a project;
  • How to work autonomously using your own initiative;
  • How to identify social and ethical concerns around your project;
  • How to develop communication skills using online collaborative technologies.



Big Data – Capstone Project


Welcome to the Capstone Project for Big Data! In this culminating project, you will build a big data ecosystem using tools and methods from the earlier courses in this specialization. You will analyze a data set simulating big data generated from a large number of users who are playing our imaginary game "Catch the Pink Flamingo". During the five-week Capstone Project, you will walk through the typical big data science steps for acquiring, exploring, preparing, analyzing, and reporting.

In the first two weeks, we will introduce you to the data set and guide you through some exploratory analysis using tools such as Splunk and OpenOffice. Then we will move into more challenging big data problems requiring the more advanced tools you have learned, including KNIME, Spark's MLlib, and Gephi. Finally, during the fifth and final week, we will show you how to bring it all together to create engaging and compelling reports and slide presentations.


As a result of our collaboration with Splunk, a software company focused on analyzing machine-generated big data, learners with the top projects will be eligible to present to Splunk and meet Splunk recruiters and engineering leadership.




NYU Center for Data Science


Master’s in Data Science


CDS master's students have a unique opportunity to solve real-world problems through the capstone course in the final year of their program. The capstone course is designed to put knowledge into practice and to develop and improve critical skills such as problem-solving and collaboration.

Students are matched with research labs within the NYU community and with industry partners to investigate pressing issues, applying data science to the following areas:

  • Probability and statistical analyses
  • Natural language processing
  • Big Data analysis and modeling
  • Machine learning and computational statistics
  • Coding and software engineering
  • Visualization modeling
  • Neural networks
  • Signal processing
  • High dimensional statistics

Capstone projects present students with the opportunity to work in their field of interest and gain exposure to applicable solutions. Project sponsors, NYU labs, and external partners, in turn, receive the benefit of having a new perspective applied to their projects.

“Capstone is a unique opportunity for students to solve real world problems through projects carried out in collaboration with industry partners or research labs within the NYU community,” says capstone advisor and CDS Research Fellow Anastasios Noulas. “It is a vital experience for students ahead of their graduation and prior to entering the market, as it helps them improve their skills, especially in problem solving contexts that are atypical compared to standard courses offered in the curriculum. Cooperation within teams is another crucial skill built through the Capstone experience as projects are typically run across groups of 2 to 4 people.”

The Capstone Project offers organizations the opportunity to propose a project that our graduate students will work on as part of their curriculum for one semester. Information on the course, along with a questionnaire to propose a project, can be found on the Capstone Fall 2024 Project Submission Form. If you have any questions, please reach out to [email protected].

Best Fall 2023 Capstone Posters


Multimodal NLP for M&A Agreements

Student Authors: Harsh Asrani, Chaitali Joshi, Tayyibah Khanam, Ansh Riyal | Project Mentors: Vlad Kobzar, Kyunghyun Cho


  • Partisan Bias and the US Federal Court System

Student Authors: Annabelle Huether, Mary Nwangwu, Allison Redfern | Project Mentors: Aaron Kaufman, Jon Rogowski

Best Fall 2023 Student Voted Posters


User-Centric AI Models for Assisting the Blind

Student Authors: Gail Batutis, Aradhita Bhandari, Aryan Jain, Mallory Sico | Project Mentors: Giles Hamilton-Fletcher, Chen Feng, Kevin C. Chan


  • Multi-Modal Foundation Models for Medicine

Student Authors: Yunming Chen, Harry Huang, Jordan Tian, Ning Yang | Project Mentors: Narges Razavian

Best Fall 2023 Student Voted Runner-Up Posters


  • Representational geometry of learning rules in neural networks

Student Authors: Ghana Bandi, Shiyu Ling, Shreemayi Sonti, Zoe Xiao | Project Mentors: SueYeon Chung, Chi-Ning Chou


  • Medical Data Leakage with Multi-site Collaborative Training

Student Authors: Christine Gao, Ciel Wang, Yuqi Zhang | Project Mentors: Qi Lei

Fall 2023 Capstone Project List

  • Segmentation of Metastatic Brain Tumors Using Deep Learning
  • Discovering misinformation narratives from suspended tweets using embedding-based clustering algorithms
  • Network Intrusion Detection Systems using Machine Learning
  • Knowledge Extraction from Pathology Reports Using LLMs
  • Building an Interactive Browser for Epigenomic & Functional Maps from the Viewpoint of Disease Association
  • Prediction of Acute Pancreatitis Severity Using CT Imaging and Deep Learning
  • User-centric AI models for assisting the blind
  • A machine learning model to predict future kidney function in patients undergoing treatment for kidney masses
  • Fine-Tuning of MedSAM for the Automated Segmentation of Musculoskeletal MRI for Bone Topology Evaluation and Radiomic Analysis
  • Online News Content Neural Network Recommendation Engine
  • Explanatory Modeling for Website Traffic Movements
  • Egocentric video zero-shot object detection
  • Leverage OncoKB’s Curated Literature Database to Build an NLP Biomarker Identifier
  • Improving Out-of-Distribution Generalization in Neural Models for Astrophysics and Cosmology
  • Preparing a Flood Risk Index for the State of Assam, India
  • Causal GANs
  • Bringing Structure to Emergent Taxonomies from Open-Ended CMS Tags
  • Social Network Analysis of Hospital Communication Networks
  • Multimodal Question Answering
  • Does resolution matter for transfer learning with satellite imagery?
  • Measuring Optimizer-Agnostic Hyperparameter Tuning Difficulty
  • Extracting causal political narratives from text.
  • Designing Principled Training Methods for Deep Neural Networks
  • Multimodal NLP for M&A Agreements
  • Using Deep Learning to Solve Forward-Backward Stochastic Differential Equations
  • OptiComm: Maximizing Medical Communication Success with Advanced Analytics
  • Automated assessment of epilepsy subtypes using patient-generated language data
  • Predicting cancer drug response of patients from their alteration and clinical data
  • Identify & Summarize top key events for a given company from News Data using ML and NLP Models
  • Developing predictive shooting accuracy metric(s) for First-Person-Shooter esports
  • Supporting Student Success through Pipeline Curricular Analysis
  • Transformers for Electronic Health Records
  • Build Models for Multilingual Medical Coding
  • Metadata Extraction from Spoken Interactions Between Mothers and Young Children
  • Uncertainty Radius Selection in Distributionally Robust Portfolio Optimization
  • Unveiling Insights into Employee Benefit Plans and Insurance Dynamics
  • Advanced Name Screening and Entity Linking Using large language models
  • What Keeps the Public Safe While Avoiding Excessive Use of Incarceration? Supporting Data-Centered Decisionmaking in a DA’s Office
  • Foundation Models for Brain Imaging
  • Housing Price Forecasting – Alternative Approaches
  • Evaluating the Capability of Large Language Models to Measure Psychiatric Functioning
  • Predicting year-end success using deep neural network (DNN) architecture

Best Fall 2022 Capstone Posters


  • Leveraging Computer Vision to Map Cell Tower Locations to Enhance School Connectivity

Student Authors: Lorena Piedras, Priya Dhond, and Alejandro Sáez | Mentors: Iyke Derek Maduako (UNICEF)


  • Neural Re-Ranking for Personalized Home Search

Student Authors: Giacomo Bugli, Luigi Noto, Guilherme Albertini | Mentors: Shourabh Rawat, Niranjan Krishna, and Andreas Rubin-Schwarz


Sequence Modeling for Query Understanding & Conversational Search

Student Authors: Lucas Tao, Evelyn Wang, Jun Wang, Cecilia Wu | Mentors: Amir Rahmani, Arun Balagopalan, Shourabh Rawat, and Najoung Kim


  • Solving challenging video games in human-like ways

Student Authors: Brian Pennisi, Jiawen Wu, Adeet Patel, and Sarvesh Patki | Mentors: Todd Gureckis (NYU)

Best Fall 2022 Student Voted Posters


  • Deep Learning Framework for Segmentation of Medical Images

Student Authors: Luoyao Chen, Mei Chen, Jinqian Pan | Mentors: Jacopo Cirrone (NYU)


  • Galaxy Dataset Distillation

Student Authors: Xu Han, Jason Wang, Chloe Zheng | Mentors: Julia Kempe (NYU)

Best Fall 2022 Runner-Up Posters


  • Dementia Detection from FLAIR MRI via Deep Learning

Student Authors: Jiawen Fan, Aiqing Li | Mentors: Narges Razavian (NYU Langone)


  • Ego4d NLQ: Egocentric Visual Learning of Representations and Episodic Memory

Student Authors: Dongdong Sun; Rui Chen; Ying Wang | Mentors: Mengye Ren (NYU)


  • Learning User Representations from Zillow Search Sessions using Transformer Architectures

Student Authors: Xu Han, Jason Wang, Chloe Zheng | Mentors: Shourabh Rawat (Zillow Group)


  • Methane Emission Quantification through Satellite Images

Student Authors: Alex Herron, Dhruv Saxena, Xiangyue Wang | Mentors: Robert Huppertz (orbio.earth)

Fall 2022 Capstone Project List

  • Data Science for Clinical Decision-making Support in Radiation Therapy
  • Using Voter File Data to Study Electoral Reform
  • Creating an Epigenomic Map of the Heart
  • Career Recommendation
  • Calibrating for Class Weights
  • Assigning Locations to Detected Stops using LSTM
  • Impact of YMCA Facilities on the Local Neighborhoods of Bronx
  • Powering SMS Product Recommendations with Deep Learning
  • Evaluation and Performance Comparison of Two Models in Classifying Cosmological Simulation Parameters
  • Crypto Anomaly Detection
  • Sequence Modeling for Query Understanding & Conversational Search
  • Multi-Modal Graph Inductive Learning with CLIP Embeddings
  • Multimodal Contract Segmentation
  • Extraction of Causal Narratives from News Articles
  • Detecting Erroneous Geospatial Data
  • Improving Speech Recognition Performance using Synthetic Data
  • Multi-document Summarization for News Events
  • Multi-task learning in orthogonal low dimensional parameter manifolds
  • Let’s Go Shopping: An Investigation Into a New Bimodal E-Commerce Dataset
  • Training AI to recognize objects of interest to the blind community
  • Classify Classroom Activities using Ambient Sound
  • Database and Dashboard for RII
  • Bitcoin Price Prediction Using Machine Learning Models
  • Context Driven Approach to Detecting Cross-Platform Coordinated Influence Campaigns
  • Invalid Traffic Detection Model Deployment
  • Recalled Experiences of Death: Using Transformers to Understand Experiences and Themes
  • Context-Based Content Extraction & Summarization from News Articles
  • Neural Learning to Rank for Personalized Home Search
  • Improve Speech Recognition Performance Using Unpaired Audio and Text
  • Data Normalization & Generalization to Population Metrics
  • Automated Judicial Case Briefing
  • Cyber Threat Detection for News Articles
  • MLS Fan Segmentation
  • Near Real-Time Estimation of Beef and Dairy Feedlot Greenhouse Gas Emissions
  • Do Better Batters Face Higher or Lower Quality Pitches?

Previous Capstone Projects

Best Fall 2021 Capstone Posters


  • Question Answering on Long Context

Student Authors: Xinli Gu, Di He, Congyun Jin | Project Mentor: Jocelyn Beauchesne (Hyperscience)


Multimodal Self-Supervised Deep Learning with Chest X-Rays and EHR Data

Student Authors: Adhham Zaatri, Emily Mui, Yechan Lew | Project Mentor: Sumit Chopra (NYU Langone)


Head and Neck CT Segmentation Using Deep Learning

Student Authors: Pengyun Ding, Tianyu Zhang | Project Mentor: Ye Yuan (NYU Langone)


  • 3D Astrophysical Simulation with Transformer

Student Authors: Elliot Dang, Tong Li, Zheyuan Hu | Project Mentor: Shirley Ho (Flatiron Institute)


Multimodal Representations for Document Understanding (Best Student Voted Poster)

Student Authors: Pavel Gladkevich, David Trakhtenberg, Ted Xie, Duey Xu | Project Mentor: Shourabh Rawat (Zillow Group)

2021 Capstone Project List

  • Accelerated Learning in the Context of Language Acquisition
  • Analysis of Cardiac Signals on Patients with Atrial Fibrillation
  • Applications of Neural Radiance Fields in Astronomy
  • Automatic Detection of Alzheimer’s Disease with Multi-Modal Fusion of Clinical MRI Scans
  • Automatic Transcription of Speech on SAYCam
  • Automatic Volumetric Segmentation of Brain Tumor Using Deep Learning for Radiation Oncology
  • Automatically Identify Applicants Who Require Physician’s Reports
  • Building a Question-Answer Generation Pipeline for The New York Times
  • Coupled Energy-Based Models and Normalizing Flows for Unsupervised Learning
  • Data Classification Processing for Clinical Decision-making Support in Radiation Therapy
  • Deep Active Learning for Protest Detection
  • Estimating Intracranial Pressure Using OCT Scans of the Eyeball
  • Graph Neural Networks for Electronic Health Record (EHR) Data
  • Head and Neck CT Image Segmentation
  • Head Movement Measurement During Structural MRI
  • Image Segmentation for Vestibular Schwannoma
  • Investigation into the Functionality of Key, Query, Value Sub-modules of a Transformer
  • Know Your Worth: An Analysis of Job Salaries
  • Machine learning-based computational phenotyping of electronic health records
  • Modeling the Speed Accuracy Tradeoff in Decision-Making
  • Multi-modal Breast Cancer Detection
  • Multi-Modal Deep Learning with Medical Images and EHR Data
  • Multimodal Representations for Document Understanding
  • Nematode Counting
  • News Clustering and Summarization
  • Post-surgical resection mapping in epilepsy using CNNs
  • Predicting Grandstanding in the Supreme Court through Speech
  • Predicting Probability of Post-Colectomy Hospital Readmission
  • Prediction of Total Knee Replacement Using Radiographs and Clinical Risk Factors
  • Reinforcement Learning for Option Hedging
  • Representation Learning Regarding RNA-RBP Binding
  • Self-Supervised Learning of Medical Image Representations Using Radiology Reports
  • The Study of American Public Policy with NLP
  • Topical Aggregation and Timeline Extraction on the NYT Corpus
  • Unsupervised Deep Denoiser for Electron-Microscope Data
  • Using Deep Learning and FBSDEs to Solve Option Pricing and Trading Problems
  • Vision Language Models for Real Estate Images and Descriptions

Featured 2020 Capstone Projects


Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs

By Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny, Hong-Kwang Kuo, Samuel Thomas, Edmilson Morais


Accented Speech Recognition Inspired by Human Perception

By Xiangyun Chu, Elizabeth Combs, Amber Wang, Michael Picheny


Diarization of Legal Proceedings: Identifying and Transcribing Judicial Speech from Recorded Court Audio

By Jeffrey Tumminia, Amanda Kuznecov, Sophia Tsilerides, Ilana Weinstein, Brian McFee, Michael Picheny, Aaron R. Kaufman

2020 Capstone Project List

  • 2D to 3D Video Generation for Surgery (Best Capstone Poster)
  • Action Primitive Recognition with Sequence to Sequence Models towards Stroke Rehabilitation
  • Applying Self-learning Methods on Histopathology Whole Slide Images
  • Applying Transformers Models to Scanned Documents: An Application in Industry
  • Beyond Bert-based Financial Sentimental Classification: Label Noise and Company Information
  • Bias and Stability in Hiring Algorithms (Best Capstone Poster)
  • Breast Cancer Detection using Self-supervised Learning Method
  • Catastrophic Forgetting: An Extension of Current Approaches (Best Capstone Poster)
  • ClinicalLongformer: Public Available Transformers Language Models for Long Clinical Sequences
  • Complication Prediction of Bariatric Surgery
  • Constraining Search Space for Hardware Configurations
  • D4J: Data for Justice to Advance Transparency and Fairness
  • Data-driven Diesel Insights
  • Deep Learning to Study Pathophysiology in Dermatomyositis
  • Detection Of Drug-Target Interactions Using BioNLP
  • Determining RNA Alternative Splicing Patterns
  • Developing a Data Ecosystem for Refugee Integration Insights
  • Diarizing Legal Proceedings
  • Estimating the Impact of the Home Health Value-Based Purchasing Model
  • Extracting economic sentiment from mainstream media articles
  • Food Trend Detection in Chinese Financial Market
  • Forecasting Biodiesel Auction Prices
  • Generative Adversarial Networks for Electron Microscope Image Denoising
  • Graph Embedding for Question Answering over Knowledge Graphs
  • Impact of NYU Wasserman Resources on Students’ Career Outcomes
  • Improving Accented Speech Recognition Through Multi-Accent Pre-Exposure
  • Improving Synthetic Image Generation for Better Object Detection
  • Learning-based Model for Super-resolution in Microscopy Imaging
  • Modeling Human Reading by a Grapheme-to-Phoneme Neural Network
  • Movement Classification of Macaque Neural Activity
  • New OXXO Store in Brazil and Revenue Prediction
  • Numerical Relativity Interpolations using Deep Learning
  • One Medical Passport: Predictive Obstructive Sleep Apnea Analysis
  • Online Student Pathways at New York University
  • Predicting YouTube Trending Video Project
  • Promotional Forecasting Model for Profit Optimization
  • Question Answering on Tabular Data with NLP
  • Raizen Fuel Demand Forecasting
  • Reach for the stars: detecting astronomical transients
  • Reverse Engineering the MOS 6502 Microprocessor
  • Selecting Optimal Training Sets
  • Synthesizing baseball data with event prediction pretraining
  • Train ETA Estimation for Rumo S.A.
  • Training a Generalizable End-to-End Speech-to-Intent Model
  • Utilizing Machine Learning for Career Advancement and Professional Growth

Best Fall 2019 Capstone Projects


  • Inferring the Topic(s) of Wikipedia Articles

By Marina Zavalina, Sarthak Agarwal, Chinmay Singhal, Peeyush Jain


Option Portfolio Replication and Hedging in Deep Reinforcement Learning

By Bofei Zhang, Jiayi Du, Yixuan Wang, Muyang Jin


Adversarial Attacks Against Linear and Deep-Learning Regressions in Astronomy

By Teresa Huang, Zacharie Martin, Greg Scanlon, Eva Wang | Mentors: Soledad Villar, David W. Hogg

2019 Capstone Project List

  • Adversarial Attacks Against Linear and Deep-learning Regressions in Astronomy
  • Automated Breast Cancer Screening
  • Automatic Legal Case Summaries
  • Cross-task Transfer Between Language Understanding Tasks in NLP
  • Dark Matter and Stellar Stream Detection using Deep Learned Clustering
  • Exploiting Google Street View to Generate Global-scale Data Sets for Training Next Generation Cyber-Physical Systems
  • Federated Incremental Learning
  • Fraud Detection in Monetary Transactions Between Bank Accounts
  • Guided Image Upsampling
  • Improving State of the Art Cross-Lingual Word-Embeddings
  • Latent Semantic Topics Distribution Over Web Content Corpus
  • Lease Renewal Probability Prediction
  • Machine Learning for Adaptive Fuzzy String Matching
  • Market Segmentation from Retailer Behavior
  • Modeling the Experienced Dental Curriculum from Student Data
  • Modelling NBA Games
  • Movie Preference Prediction
  • MRI Image Reconstruction
  • NLP Metalearning
  • Predict next sales office location

  • Predicting Stock Market Movements using Public Sentiment Data & Sequential Deep Learning Models

  • Predictive Maintenance Techniques
  • Reinforcement Learning for Replication and Hedging of Option
  • Self-supervised Machine Listening

  • Sentence Classification of TripAdvisor ‘Points-of-Interest’ Reviews

  • Simulating the Dark Matter Distribution of the Universe with Deep Learning
  • SMaPP2: Joint Embedding of User-content and Network Structure to Enable a Common Coordinate that Captures Ideology, Geography, and User Topic Spectrum
  • Sparse Deconvolution Methods for Microscopy Imaging Data Analysis
  • Stereotype and Unconscious Bias in Large Datasets
  • Structuring Exploring and Exploiting NIH’s Clinical Trials Database
  • The Analysis, Visualization, and Understanding of Big Urban Noise Data
  • Unsupervised and Self-supervised Learning for Medical Notes
  • Unsupervised Generative Video Dubbing
  • Using Deep Generative Models to de-noise Noisy Astronomical Data

Featured Academic Capstone Projects


Deep Learning for Breast Cancer Detection

By Jason Phang, Jungkyu (JP) Park, Thibault Fevry, Zhe Huang, The B-Team


Brain Segmentation Using Deep Learning

By Team 22/7 | Chaitra V. Hegde | Advisor: Narges Razavian


Predict Total Knee Replacement Using MRI With Supervised and Semi-Supervised Networks

By Team Glosy: Hong Gao, Mingsi Long, Yulin Shen, and Jie Yang

Featured Industry Capstone Projects


Determining where New York Life Insurance should open its next sales office


NBA Shot Prediction with Spatio-Temporal Analysis

Other Past Capstone Projects

  • Active Physical Inference via Reinforcement Learning
  • Deep Multi-Modal Content-User Embeddings for Music Recommendation
  • Fluorescent Microscopy Image Restoration
  • Learning Visual Embeddings for Reinforcement Learning
  • Offensive Speech Detection on Twitter
  • Predicting Movement Primitives in Stroke Patients using IMU Sensors
  • Recurrent Policy Gradients For Smooth Continuous Control
  • The Quality-Quantity Tradeoff in Deep Learning
  • Trend Modeling in Childhood Obesity Prediction
  • Twitter Food/Activity Monitor


Utilizing AutoPhrase on Computer Science papers over time

  • Group members: Jason Lin, Cameron Brody, James Yu

Abstract: Phrase mining is a useful tool to extract quality phrases from large text corpora. Previous work on this topic, such as AutoPhrase, demonstrates its effectiveness against baseline methods by using precision-recall as a metric. Our goal is to extend this work by analyzing how AutoPhrase phrases change over time, as well as how phrases are connected with each other by using network visualizations. This will be done through exploratory data analysis, along with a classification model utilizing individual phrases to predict a specific year range.

Codenames AI

  • Group members: Xuewei Yan, Cameron Shaw, Yongqing Li

Abstract: Codenames is a popular board game that relies on word association; its ultimate goal is to connect multiple words with a single clue word. In this paper, we construct a system that incorporates artificial intelligence into the game, allowing communication between humans and AI and providing the capability of replacing human effort in creating such a clue word. Our project utilized three types of word-relationship measurements, from Word2Vec, GloVe, and WordNet, to design and understand the word relationships used in this game. An AI system is built on each measurement and tested on both AI-AI and AI-Human communication performance. We evaluate performance by each system's average speed in finishing a game and its ability to accurately identify its team's words. The AI-AI team performance demonstrates outstanding efficiency for AI managing this game, and the best-performing measurement achieves 60% accuracy in communication between AI and human.

Spam Detection Using Natural Language Processing

  • Group members: Jonathan Tanoto

Abstract: Building a spam detection algorithm by utilizing Natural Language Processing to extract features associated with spam emails. Deep Learning methods as well as word-to-vector transformation are used to create a spam email classifier.

Blockchain / Smart-Contracts

An exploration of medical records using blockchain technology.

  • Group members: Ruiwei Wan, Yifei Wang

Abstract: In this project, we set out to explore the application of blockchain technology to Electronic Health Records (EHR) systems. While prototyping a blockchain-based Electronic Medical Records system using our proposed Medcoin application, we encountered several challenges. After careful evaluation and discussion, we decided to turn the project into an exploration of the pros and cons of using blockchain in EHR systems. We found that the proposed authorization contract could not meet the authentication and attestation requirements of an EHR, which are two of its essential components, so we stopped prototyping and instead provide, in our report, a discussion of the advantages and disadvantages of using blockchain for EHR systems. Due to the privacy constraints on medical records, we also found the proposed authorization smart contract infeasible and under-specified. The failure of our smart-contract prototype serves as a valuable lesson in why a centralized application may be more appropriate for medical-records system design.

Spatiotemporal Machine Learning

Uncertainty quantification and deep learning for scalable spatiotemporal analysis.

  • Group members: Kailing Ding, Judy Jin, Derek Leung, Miles Labrador

Abstract: In spatiotemporal forecasting, deep learning models need to not only make predictions but also quantify the certainty (uncertainty) of those predictions. For example, consider an automatic stock trading system in which a machine learning model predicts the stock price. A point prediction from the model might be dramatically different from the real value because of the high stochasticity of the stock market. If, on the other hand, the model could estimate a range guaranteed to cover the true value with high probability, the trading system could compute the best and worst rewards and make more sensible decisions. This is where conformal prediction comes in: a technique for quantifying such uncertainty for models. In this paper, we seek to evaluate the performance and quality of conformal quantile regression, which embeds uncertainty metrics into model output. Beyond this, we will also contribute to the torchTS library by implementing a data loader class designed to preprocess and split data into training, calibration, and test sets in a consistent format so that our models can be applied more easily. Lastly, we aim to improve the torchTS library API documentation to present the library's functionality in an easily understood way and to give users examples of torchTS' spatiotemporal analysis methods in use.
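
As a rough illustration of the idea, here is a minimal split-conformal sketch in numpy. It is a generic textbook construction under an exchangeability assumption, not the torchTS or conformal quantile regression code the team describes; `model` is any fitted regressor with a `.predict` method, and the data arrays are hypothetical.

```python
# Minimal sketch of split conformal prediction for regression (numpy only).
# Assumes a fitted point-prediction model `model` with a .predict(X) method.
import numpy as np

def conformal_interval(model, X_calib, y_calib, X_new, alpha=0.1):
    # Nonconformity scores on a held-out calibration set.
    scores = np.abs(y_calib - model.predict(X_calib))
    # Finite-sample-corrected quantile of the scores.
    n = len(scores)
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    preds = model.predict(X_new)
    # Intervals cover the truth with probability >= 1 - alpha
    # (under exchangeability of calibration and test points).
    return preds - q, preds + q
```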

High-dimensional Statistical Learning, Causal Inference, Robust ML, Fair ML

Post-prediction inference on political twitter.

  • Group members: Luis Ledezma-Ramos, Dylan Haar, Alicia Gunawan

Abstract: Having observed data seems to be a necessary requirement for conducting inference, but what happens when observed outcomes cannot easily be obtained? The simplest practice is to proceed with predicted outcomes, but without any corrections this can result in issues like bias and incorrect standard errors. Our project studies a correction method for inference conducted on predicted, rather than observed, outcomes (called post-prediction inference) through the lens of political data. We investigate the kinds of phrases or words in a tweet that most strongly indicate a person's political alignment in US politics. We have found that these correction techniques are promising in their ability to correct for post-prediction inference in the field of political science.

NFL-Analysis

  • Group members: Jonathan Langley, Sujeet Yeramareddy, Yong Liu

Abstract: After researching a new inference correction approach called post-prediction inference, we chose to apply it to sports analysis based on NFL games. We designed a model that can predict the spread of a football game: which team will win and what the margin of victory will be. We then analyzed the most and least important features so that we could accurately correct inference for these variables and more accurately understand their impact on our response variable, the spread.

Machine Learning (TBA)

Investigation of Latent Dirichlet Allocation.

  • Group members: Duha Aldebakel, Rui Zhang, Anthony Limon, Yu Cao

Abstract: We explore both Markov Chain Monte Carlo algorithms and variational inference methods for Latent Dirichlet Allocation (LDA), a generative probabilistic topic model for data such as text. In a generative probabilistic model, we treat data as observations that arise from a generative process involving hidden variables, i.e. the structure we want to find in the data. Topic modeling allows us to organize, understand, and annotate documents according to the discovered structure. For text data, the hidden variables reflect the thematic structure of a corpus that we do not have access to; we only have access to our observations, the documents of the collection themselves. Our aim is to infer this hidden structure through posterior inference, that is, to compute the conditional distribution of the hidden variables given our observations, and we use our knowledge from Q1 about inference methods to solve this problem.
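
For a concrete feel for LDA in practice, here is a minimal sketch using scikit-learn's variational-inference implementation on a hypothetical toy corpus; the project's own corpus and inference code will differ.

```python
# Minimal sketch of fitting LDA with scikit-learn's variational inference.
# The four-document toy corpus is hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the stock market fell sharply",
        "the team won the championship game",
        "investors worry about interest rates",
        "the coach praised the players"]

counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)  # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Inspect the top words per discovered topic.
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")
```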

Wildfire and Environmental Data Analysis

Machine learning for physical systems: locating sound with machine learning.

  • Group members: Raymond Zhao, Brady Zhou

Abstract: In this domain, we learned about methods for localizing sound waves using special devices called microphone arrays. Broadly speaking, such a device can figure out what a sound is and where it came from. With the growing ubiquity of microphone devices, we find this to be a potentially useful capability. The baseline method involves what is called an "affine mapping," which is essentially a form of linear transformation. In this project, we examine how machine learning techniques such as Neural Networks, Support Vector Machines, and Random Forests may benefit (or not benefit) this field.

Environmental Monitoring, remote sensing, cyber-physical systems, Engineers for Exploration

E4E Microfaune Project.

  • Group members: Jinsong Yang, Qiaochen Sun

Abstract: Nowadays, human activities such as wildfires and hunting have become the largest factors with serious negative effects on biodiversity. To understand how anthropogenic activities affect wildlife populations, field biologists use automated image classification driven by neural networks to extract biodiversity information from images. However, for small animals such as insects or birds, cameras do not work well: it is extremely hard for them to capture the movement and activities of such small animals. To solve this problem, passive acoustic monitoring (PAM) has become one of the most popular methods. Sounds collected through PAM can be used to train machine learning models that track the fluctuation of biodiversity among these small animals (most of them birds), which is the goal of the overall program. The program, however, can be divided into many small parts, and we focus on an intermediate step: generating subsets of audio recordings that have a higher probability of containing vocalizations of interest, which helps our labeling volunteers save time and energy. These solutions reduce the time and resources required to gather enough training data for species-level classifiers. We perform the same processing as in AID_NeurIPS_2021; only the data differs between the two repositories. Here we use the Peru data instead of the Coastal_Reserve data.

  • Group members: Harsha Jagarlamudi, Kelly Kong

Eco-Acoustic Event Detection: Classifying temporal presence of birds in recorded bird vocalization audio

  • Group members: Alan Arce, Edmundo Zamora

Abstract: Leveraging "deep learning" methods to classify the temporal presence of birds in recorded bird vocalization audio, using a hybrid CNN-RNN model trained on audio data, in the interest of benefiting wildlife monitoring and preservation.

Pyrenote - User Profile Design & Accessible Data

  • Group members: Dylan Nelson

Abstract: Pyrenote is a project in development by a growing group of student researchers here at UCSD. Its primary purpose is to allow anyone to contribute to research by labeling data in an intuitive and accessible way. It is currently being used to develop a sort of voice recognition for birds: the goal is an algorithm that can strongly label data (say, where in a clip a bird is calling and which bird is making the call). To do this, a very large dataset needs to be labeled. I worked mostly on the user experience side, allowing users to interact with their labeling in new ways, such as keeping tabs on their progress and reaching goals. The User Profile page, developed iteratively as a whole new page for the site, became the primary place for surfacing this data.

Pyrenote Webdeveloper

  • Group members: Wesley Zhen

Abstract: The website, Pyrenote, is helping scientists track bird populations by identifying them using machine learning classifiers on publicly annotated audio recordings. I have implemented three features over the course of two academic quarters aimed at streamlining user experience and improving scalability. The added scalability will be useful for future projects as we start becoming more ambitious with the number of users we bring to the site.

Spread of Misinformation Online

Who is spreading misinformation and worry on Twitter.

  • Group members: Lehan Li, Ruojia Tao

Abstract: The spread of misinformation over social media poses challenges to daily information intake and exchange. Especially during the current COVID-19 pandemic, misinformation regarding COVID-19 disease and vaccination threatens individuals' wellbeing and general public health. People's worries also grow with misinformation, such as rumors about shortages of food and water. This project investigates the spread of misinformation over social media (Twitter) during the COVID-19 pandemic. Two main directions are pursued. The first is an analysis of the effect of bot users on the spread of misinformation: what role do bot users play in spreading misinformation, and where are they located in the social network? The second is a sentiment analysis that examines users' attitudes towards misinformation: how does sentiment spread across different places in the social network? We also combine the two directions, asking what the relationship is between bot users and positive or negative emotions. Since online social media users form social networks, the project further investigates the effect of network structure on both topics, and explores how the proportion of bot users and users' attitudes towards misinformation change as the social network becomes more concentrated and tightly connected.

Misinformation on Reddit

  • Group members: Samuel Huang, David Aminifard

Abstract: As social media has grown in popularity, namely Reddit, its use for rapidly sharing information based on categories or topics (subreddits) has had massive implications for how people are exposed to information and for the quality of the information they interact with. While Reddit has its benefits, e.g., providing instant access to nearly real-time, categorized information, it has possibly played a role in worsening divisions and spreading misinformation. Our results showed that subreddits with the highest proportions of misinformation posts tend to lean more towards politics and news. In addition, we found that despite the frequency of misinformation per subreddit, the average upvote ratio per submission seemed consistently high, which indicates that subreddits tend to be ideologically homogeneous.

The Spread of YouTube Misinformation Through Twitter

  • Group members: Alisha Sehgal, Anamika Gupta

Abstract: In our Capstone Project, we explore the spread of misinformation online. More specifically, we look at the spread of misinformation across Twitter and YouTube because of the large role these two social media platforms play in the dissemination of news and information. Our main objectives are to understand how YouTube videos contribute to spreading misinformation on Twitter, evaluate how effectively YouTube is removing misinformation and if these policies also prevent users from engaging with misinformation. We take a novel approach of analyzing tweets, YouTube video captions, and other metadata using NLP to determine the presence of misinformation and investigate how individuals interact or spread misinformation. Our research focuses on the domain of public health as this is the subject of many conspiracies, varying opinions, and fake news.

Particle Physics

Understanding Higgs boson particle jets with graph neural networks.

  • Group members: Charul Sharma, Rui Lu, Bryan Ambriz

Abstract: Extending last quarter's work on deep sets neural networks, a fully connected neural network classifier, an adversarial deep sets model, and a designed decorrelated tagger (DDT), this quarter we went further by experimenting with different graph neural network layers such as GENConv and EdgeConv. GENConv and EdgeConv play incredibly important roles here in boosting the performance of our basic GNN model. We evaluated the performance of our model using ROC (Receiver Operating Characteristic) curves and the AUC (Area Under the Curve). Meanwhile, based on experience from project one and past projects in the particle physics domain, we added an exploratory data analysis section covering basic theory, bootstrapping, and sanity checks on our dataset. We have not yet produced all the optimal outcomes: the EdgeConv part is finished, and in the following weeks we plan to finish GENConv and possibly try other layers to see whether they can further increase the performance of our model.

Predicting a Particle's True Mass

  • Group members: Jayden Lee, Dan Ngo, Isac Lee

Abstract: The Large Hadron Collider (LHC) collides protons traveling near light speed to generate high-energy collisions. These collisions produce new particles and have led to the discovery of new elementary particles (e.g., the Higgs boson). One key piece of information to collect from a collision event is the structure of the particle jet, a collective spray of decaying particles that travel in the same direction; accurately identifying the type of these jets (QCD or signal) plays a crucial role in the discovery of high-energy elementary particles like the Higgs. Several properties determine jet type, with jet mass being one of the strongest indicators in jet type classification. A previous approach to jet mass estimation, called "soft drop declustering," has been one of the most effective methods for making rough estimates of jet mass. With this in mind, we aim to apply machine learning to jet mass estimation through various neural network architectures. Using data collected and processed by CERN, we implemented a model capable of improving jet mass prediction from jet features.

Mathematical Signal Processing (compression of deep nets, or optimization for data-science/ML)

Graph neural networks: graph-neural-network-based recommender systems for Spotify playlists.

  • Group members: Benjamin Becze, Jiayun Wang, Shone Patil

Abstract: With the rise of music streaming services on the internet in the 2010’s, many have moved away from radio stations to streaming services like Spotify and Apple Music. This shift offers more specificity and personalization to users’ listening experiences, especially with the ability to create playlists of whatever songs that they wish. Oftentimes user playlists have a similar genre or theme between each song, and some streaming services like Spotify offer recommendations to expand a user’s existing playlist based on the songs in it. Using Node2vec and GraphSAGE graph neural network methods, we set out to create a recommender system for songs to add to an existing playlist by drawing information from a vast graph of songs we built from playlist co-occurrences. The result is a personalized song recommender based not only on Spotify’s community of playlist creators, but also the specific features within a song.
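
To make the approach concrete, here is a minimal sketch in the spirit of node2vec: uniform random walks over a toy song co-occurrence graph (the unbiased DeepWalk special case, not the full biased node2vec walks or GraphSAGE), embedded with gensim's Word2Vec. The graph and song IDs are hypothetical.

```python
# Minimal sketch of a playlist song recommender via random-walk embeddings.
# Songs that co-occur in playlists become edges; walks become "sentences".
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.Graph()
G.add_edges_from([("songA", "songB"), ("songB", "songC"),
                  ("songA", "songC"), ("songC", "songD")])

def random_walks(graph, walks_per_node=10, walk_len=8):
    walks = []
    for node in graph.nodes:
        for _ in range(walks_per_node):
            walk = [node]
            while len(walk) < walk_len:
                nbrs = list(graph.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))  # uniform next step
            walks.append(walk)
    return walks

model = Word2Vec(random_walks(G), vector_size=32, window=4,
                 min_count=1, sg=1, epochs=20)
# Nearest neighbors in embedding space are candidate recommendations.
print(model.wv.most_similar("songA", topn=2))
```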

Dynamic Stock Industry Classification

  • Group members: Sheng Yang

Abstract: Use Graph-based Analysis to Re-classify Stocks in China A-share and Improve Markowitz Portfolio Optimization

NLP, Misinformation

HDSI faculty exploration tool.

  • Group members: Martha Yanez, Sijie Liu, Siddhi Patel, Brian Qian

Abstract: The Halıcıoğlu Data Science Institute (HDSI) at the University of California, San Diego is dedicated to the discovery of new methods and the training of students and faculty to use data science to solve problems in the current world. The HDSI has several industry partners that are often searching for assistance with their daily activities and need experts in different domain areas. Currently, around 55 professors are affiliated with HDSI. They all have diverse research interests and have written numerous papers in their own fields. Our goal was to create a tool that allows HDSI to select the best fit from their faculty, based on their published work, to aid their industry partners in their specific endeavors. We did this with Natural Language Processing (NLP), by processing all the abstracts from the faculty's published work and organizing them by topic. We then obtained the proportion of each faculty member's papers associated with each topic and drew a relationship between researchers and their most published topics. This allows HDSI to personalize recommendations of faculty candidates to an industry partner's particular job.

  • Group members: Du Xiang

AI in Healthcare, Deep Reinforcement Learning, Trustworthy Machine Learning

Improving robustness in deep fusion modeling against adversarial attacks.

  • Group members: Ayush More, Amy Nguyen

Abstract: Autonomous vehicles rely heavily on deep fusion models, which utilize multiple inputs for inference and decision making. By using data from these inputs, a deep fusion model benefits from shared information, which is primarily associated with robustness, since the input sources can face different levels of corruption. Thus, it is highly important that the deep fusion models used in autonomous vehicles are robust to corruption, especially on input sources that are weighted more heavily in different conditions. We explore a different approach to training the robustness of a deep fusion model through adversarial training. We fine-tune the model on adversarial examples and evaluate its robustness against single-source noise and other forms of corruption. Our experimental results show that adversarial training was effective in improving the robustness of a deep fusion object detector against adversarial noise and Gaussian noise while maintaining performance on clean data. The results also highlighted the lack of robustness of models that are not trained to handle adversarial examples. We believe this is relevant given the risks autonomous vehicles pose to pedestrians: it is important to ensure that the inferences and decisions made by the model are robust against corruption, especially when it is intentional, coming from outside threats.

Healthcare: Adversarial Defense In Medical Deep Learning Systems

  • Group members: Rakesh Senthilvelan, Madeline Tjoa

Abstract: To combat adversarial attacks, models need robust training to best protect against the methods these attacks use on deep learning systems. In the scope of this paper, we look into the fast gradient sign method and projected gradient descent, two methods used in adversarial attacks to maximize loss functions and cause the affected system to make opposing predictions, in order to train our models against them and allow for stronger accuracy when faced with adversarial examples.
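
For reference, the fast gradient sign method itself is compact enough to sketch in a few lines of PyTorch. This is a generic textbook construction, not the authors' code; `model` is any classifier, and inputs are assumed to be scaled to [0, 1].

```python
# Minimal sketch of the Fast Gradient Sign Method (FGSM).
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)  # loss the attacker maximizes
    loss.backward()
    # One signed-gradient step, then clamp back to the valid input range.
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

# Adversarial training then mixes such examples into each training batch.
```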

Satellite image analysis

ML for finance, ML for healthcare, fair ML, ML for science, actionable recourse.

  • Group members: Shweta Kumar, Trevor Tuttle, Takashi Yabuta, Mizuki Kadowaki, Jeffrey Feng

Abstract: In American society today there is a constant encouraged reliance on credit, despite it not being available to everyone as a legal right. Currently, there are countless methods of evaluating an individual's creditworthiness in practice. In an effort to regulate the selection criteria of different financial institutions, the Equal Credit Opportunity Act (ECOA) requires that applicants denied a loan are entitled to an Adverse Action notice: a statement from the creditor explaining the reason for the denial. However, these adverse action notices are frequently unactionable and ineffective in providing feedback that gives an individual recourse, which is the ability to act upon a reason for denial to raise one's odds of getting accepted for a loan. In our project, we explore whether it is possible to create an interactive interface that personalizes adverse action notices in alignment with personal preferences so that individuals can gain recourse.

Social media; online communities; text analysis; ethics

Finding commonalities in misinformative articles across topics.

  • Group members: Hwang Yu, Maximilian Halvax, Lucas Nguyen

Abstract: To combat the large-scale distribution of misinformation online, we wanted to develop a way to flag news articles that are misinformative and could potentially mislead the general public. In addition to flagging news articles, we also wanted to find commonalities between the misinformation that we found. Do some topics in particular contain more misleading information than others? How much overlap do these articles have when we break their content down with TF-IDF and see which words carry the most importance when fed into various misinformation detection models? We narrowed our models down to four different topics: economics, politics, science, and general, a dataset encompassing the three previous topics. We found that general included the most overlap overall, while the specific topics, though mostly different from one another, had certain models that still put emphasis on similar words, indicating a possible pattern of misinformative language in these articles. We believe, from these results, that we can find a pattern that could direct further investigation into how misinformation is written and distributed online.

The Effect of Twitter Cancel Culture on the Music Industry

  • Group members: Peter Wu, Nikitha Gopal, Abigail Velasquez

Abstract: Musicians often trend on social media for various reasons, but in recent years there has been a rise in musicians being "canceled" for committing offensive or socially unacceptable behavior. Due to the wide accessibility of social media, the masses are able to hold musicians accountable for their actions through "cancel culture," a form of modern ostracism. Twitter has become a well-known platform for "cancel culture," as users can easily spread hashtags and see what's trending, which also has the potential to facilitate the spread of toxicity. We analyze how public sentiment towards canceled musicians on Twitter changes with respect to the type of issue they were canceled for, their background, and the strength of their parasocial relationship with their fans. Through our research, we aim to determine whether "cancel culture" leads to an increase in toxicity and negative sentiment towards a canceled individual.

Analyzing single cell multimodality data via (coupled) autoencoder neural networks

Coupled autoencoders for single-cell data analysis.

  • Group members: Alex Nguyen, Brian Vi

Abstract: Historically, analysis of single-cell data has been difficult to perform because data collection methods often destroy the cell in the process of collecting information. However, an ongoing endeavor of biological data science has been to analyze the different modalities, or forms, of genetic information within a cell. Doing so will give modern medicine a greater understanding of cellular functions and how cells work in the context of illness. Information on the three modalities of DNA, RNA, and protein can be collected safely, and because they represent the same information in different forms, analysis done on one can be extrapolated to understand the cell as a whole. Previous research by Gala, R., Budzillo, A., Baftizadeh, F. et al. captured gene expression in neuron cells with a neural network called a coupled autoencoder. This autoencoder framework is able to reconstruct its inputs, allowing prediction from one input to another, as well as to align the multiple inputs in the same low-dimensional representation. In our paper, we apply this coupled autoencoder to a dataset of cells taken from several sites of the human body, predicting protein levels from RNA information. We find that the autoencoder is able to adequately cluster the cell types in its lower-dimensional representation and performs decently at the prediction task, suggesting that the coupled autoencoder is a powerful tool and may prove a valuable asset in single-cell data analysis.

Machine Learning, Natural Language Processing

On evaluating the robustness of language models with tuning.

  • Group members: Lechuan Wang, Colin Wang, Yutong Luo

Abstract: Prompt tuning and prefix tuning are two effective mechanisms to leverage frozen language models to perform downstream tasks. Robustness reflects models’ resilience of output under a change or noise in the input. In this project, we analyze the robustness of natural language models using various tuning methods with respect to a domain shift (i.e. training on a domain but evaluating on out-of-domain data). We apply both prompt tuning and prefix tuning on T5 models for reading comprehension (i.e. question-answering) and GPT-2 models for table-to-text generation.
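
As a rough sketch of what prompt tuning looks like in code, here is a minimal example using the Hugging Face peft library with a frozen T5 backbone. This is our illustration rather than the authors' setup, and the class and argument names should be checked against the installed peft version.

```python
# Minimal sketch of prompt tuning a frozen T5 model with the `peft` library.
from transformers import AutoModelForSeq2SeqLM
from peft import PromptTuningConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # question answering framed as seq2seq
    num_virtual_tokens=20,            # length of the trainable soft prompt
)
model = get_peft_model(base, config)  # backbone weights stay frozen
model.print_trainable_parameters()    # only the soft prompt is trainable
```

Prefix tuning follows the same pattern with a `PrefixTuningConfig`, prepending trainable key/value prefixes to every attention layer rather than virtual tokens to the input.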

Activity Based Travel Models and Feature Selection

A tree-based model for activity based travel models and feature selection.

  • Group members: Lisa Kuwahara, Ruiqin Li, Sophia Lau

Abstract: In a previous study, Deloitte Consulting LLP developed a method of creating city simulations through cellular location and geospatial data. Using these simulations of human activity and traffic patterns, better decisions can be made regarding modes of transportation or road construction. However, the current commonly used method of estimating transportation mode choice is a utility model that involves many features and coefficients that may not necessarily be important but still make the model more complex. Instead, we used a tree-based approach - in particular, XGBoost - to identify just the features that are important for determining mode choice so that we can create a model that is simpler, robust, and easily deployable, in addition to performing better than the original utility model on both the full dataset and population subsets.
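
To illustrate the core idea, here is a minimal sketch of gain-based feature ranking with XGBoost on synthetic data; the features, labels, and hyperparameters are hypothetical stand-ins, not the model described above.

```python
# Minimal sketch of tree-based feature selection with XGBoost.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                 # e.g., trip distance, cost, ...
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)  # toy mode-choice label

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)

# Rank features by importance; keep only the informative ones for a
# simpler, more deployable model.
ranking = np.argsort(model.feature_importances_)[::-1]
print("features by importance:", ranking)
```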

Explainable AI, Causal Inference

Explainable AI.

  • Group members: Jerry Chan, Apoorv Pochiraju, Zhendong Wang, Yujie Zhang

Abstract: Nowadays, algorithmic decision-making systems are very common in people's daily lives. Gradually, some algorithms have become too complex for humans to interpret, such as black-box machine learning models and deep neural networks. To assess the fairness of these models and make them better tools for different parties, we need explainable AI (XAI) to uncover the reasoning behind the predictions made by black-box models. In our project, we focus on using techniques from causal inference and explainable AI to interpret various classification models across several domains. In particular, we are interested in three domains: healthcare, finance, and the housing market. Within each domain, we first train four binary classification models, and we have four goals in general: 1) explaining black-box models both globally and locally with various XAI methods; 2) assessing the fairness of each learning algorithm with regard to different sensitive attributes; 3) generating recourse for individuals, i.e., a set of minimal actions that change the prediction of those black-box models; and 4) evaluating the explanations from those XAI methods using domain knowledge.

AutoML Platforms

Deep learning transformer models for feature type inference.

  • Group members: Andrew Shen, Tanveer Mittal

Abstract: The first step AutoML software must take after loading in the data is to identify the feature types of the individual columns in the input data. This information allows the software to understand the data and preprocess it so that machine learning algorithms can run on it. Project SortingHat of the ADA lab at UCSD frames this task of feature type inference as a multiclass classification problem. The machine learning models defined in the original SortingHat feature type inference paper use three sets of features as input: (1) the name of the given column, (2) five non-null sample values, and (3) descriptive numeric statistics about the column. The textual features are easy to access; however, the descriptive statistics previous models rely on require a full pass through the data, which makes preprocessing less scalable. Our goal is to produce models that rely less on these statistics by better leveraging the textual features. As an extension of Project SortingHat, we experimented with deep learning transformer models and with varying the sample sizes used by random forest models. We found that our transformer models achieved state-of-the-art results on this task, outperforming all existing tools and ML models that have been benchmarked against SortingHat's ML Data Prep Zoo. Our best model used a pretrained Bidirectional Encoder Representations from Transformers (BERT) language model to produce word embeddings, which are then processed by a Convolutional Neural Network (CNN). As a result of this project, we have published two BERT CNN models via the PyTorch Hub API, so that software engineers can easily integrate our models or train similar ones for use in AutoML platforms or other automated data preparation applications. Our best model uses all the features defined above, while the other uses only column names and sample values, offering comparable performance and much better scalability for all input data.

Exploring Noise in Data: Applications to ML Models

  • Group members: Cheolmin Hwang, Amelia Kawasaki, Robert Dunn

Abstract: In machine learning, models are commonly built so as to avoid what is known as overfitting. As generally understood, overfitting occurs when a model is fit exactly to the training data, causing poor performance on new examples: overfit models tend to have poor accuracy on unseen data. Therefore, to generalize to all examples of data and not only those in a given training set, models are built with techniques that avoid fitting the data exactly. However, overfitting does not always work the way one might expect, as we show by fitting models to data with controlled levels of noise. Specifically, some models fit exactly to data with high levels of noise still produce highly accurate results, whereas others are more prone to overfitting.

Group Testing for Optimizing COVID-19 Testing

COVID-19 group testing optimization strategies.

  • Group members: Mengfan Chen, Jeffrey Chu, Vincent Lee, Ethan Dinh-Luong

Abstract: The COVID-19 pandemic, which has persisted for more than two years, has been combated by efficient testing strategies that reliably identify positive individuals to slow the spread of the disease. As opposed to other pooling strategies in this domain, the methods described in this paper prioritize true negative samples over overall accuracy. In Monte Carlo simulations, both nonadaptive and adaptive testing strategies with random pool sampling achieved accuracy of at least 95% across varying pool sizes and population sizes while decreasing the number of tests given. A split tensor rank-2 method attempts to identify all infected samples within 961 samples, converging to 99 tests as the prevalence of infection converges to 1%.
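
As a point of comparison, here is a minimal Monte Carlo sketch of classic two-stage (Dorfman) pooled testing in numpy. This is a baseline illustration, not the paper's split-tensor method; it assumes error-free tests, so the interesting output is simply the expected number of tests.

```python
# Minimal Monte Carlo sketch of two-stage (Dorfman) pooled testing.
# Pool size 31 over 961 people mirrors the 31 x 31 layout; values are
# illustrative. n_people must be divisible by pool_size here.
import numpy as np

def dorfman_tests(n_people=961, pool_size=31, prevalence=0.01, trials=1000):
    rng = np.random.default_rng(0)
    totals = []
    for _ in range(trials):
        infected = rng.random(n_people) < prevalence
        pools = infected.reshape(-1, pool_size)  # random pool assignment
        n_tests = pools.shape[0]                 # stage 1: one test per pool
        n_tests += pools.any(axis=1).sum() * pool_size  # stage 2: retest
        totals.append(n_tests)
    return np.mean(totals)

print(dorfman_tests())  # far fewer than 961 tests at 1% prevalence
```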

Causal Discovery

Patterns of fairness in machine learning.

  • Group members: Daniel Tong, Anne Xu, Praveen Nair

Abstract: Machine learning tools are increasingly used for decision-making in contexts that have crucial ramifications. However, a growing body of research has established that machine learning models are not immune to bias, especially on protected characteristics. This has led to efforts to create mathematical definitions of fairness that can be used to estimate whether, given a prediction task and a certain protected attribute, an algorithm is being fair to members of all classes. But just as philosophical definitions of fairness vary widely, mathematical definitions of fairness vary as well, and fairness conditions can in fact be mutually exclusive. In addition, the choice of model to optimize for fairness is a difficult decision we have little intuition for. Consequently, our capstone project centers on an empirical analysis of the relationships between machine learning models, datasets, and various fairness metrics. We produce a 3-dimensional matrix of the performance of a certain machine learning model, for a certain definition of fairness, on a certain dataset. Using this matrix on a sample of 8 datasets, 7 classification models, and 9 fairness metrics, we discover empirical relationships between model type and performance on specific metrics, in addition to correlations between metric values across different dataset-model pairs. We also offer a website and command-line interface for users to perform this experimentation on their own datasets.
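
For intuition about what a "fairness metric" computes, here is a minimal sketch of two common group-fairness measures on hypothetical predictions. The project's nine metrics are of this general flavor, though this is not the project's code.

```python
# Minimal sketch of two group-fairness metrics; all arrays are hypothetical.
import numpy as np

def demographic_parity_diff(y_pred, group):
    # Difference in positive prediction rates between the two groups.
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_diff(y_true, y_pred, group):
    # Difference in true positive rates between the two groups.
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # binary protected attribute
print(demographic_parity_diff(y_pred, group))
print(equal_opportunity_diff(y_true, y_pred, group))
```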

Causal Effects of Socioeconomic and Political Factors on Life Expectancy in 166 Different Countries

  • Group members: Adam Kreitzman, Maxwell Levitt, Emily Ramond

Abstract: This project examines causal relationships between various socioeconomic variables and life expectancy outcomes in 166 different countries. It can accommodate new, unseen data and variables through an intuitive data pipeline with detailed instructions, and it uses the PC algorithm with updated code to account for missingness in the data. With access to this model and pipeline, we hope that questions such as "do authoritarian countries have a direct relation to life expectancy?" or "how does women's representation in government affect the perceived notion of social support?" can now be answered and understood. Through our own analysis, we found intriguing results, such as that a higher Perception of Corruption is distinctly related to a lower Life Ladder score, and that higher quality-of-life perceptions are related to lower economic inequality. These results aim to educate not only the general public but government officials as well.

Time series analysis in health

Time series analysis on the effect of light exposure on sleep quality.

  • Group members: Shubham Kaushal, Yuxiang Hu, Alex Liu

Abstract: The increase in artificial light exposure through the growing prevalence of technology has an effect on the sleep cycle and circadian rhythm of humans. The goal of this project is to determine how different colors and intensities of light exposure prior to sleep affect the quality of sleep, through the classification of time series data.

Sleep Stage Classification for Patients With Sleep Apnea

  • Group members: Kevin Chin, Yilan Guo, Shaheen Daneshvar

Abstract: Sleep is not uniform and consists of four stages: N1, N2, N3, and REM sleep. The analysis of sleep stages is essential for understanding and diagnosing sleep-related diseases, such as insomnia, narcolepsy, and sleep apnea; however, sleep stage classification often does not generalize to patients with sleep apnea. The goal of our project is to build a sleep stage classifier specifically for people with sleep apnea and to understand how it differs from normal sleep staging. We then explore whether the inclusion and featurization of ECG data improves the performance of our model.

Environmental health exposures & pollution modeling & land-use change dynamics

Supervised classification approach to wildfire mapping in Northern California.

  • Group members: Alice Lu, Oscar Jimenez, Anthony Chi, Jaskaranpal Singh

Abstract: Burn severity maps are an important tool for understanding fire damage and managing forest recovery. We have identified several issues with current mapping methods used by federal agencies that affect the completeness, consistency, and efficiency of their burn severity maps. To address these issues, we demonstrate the use of machine learning as an alternative to traditional methods of producing severity maps, which rely on in-situ data and spectral indices derived from image algebra. We trained several supervised classifiers on sample data collected from 17 wildfires across Northern California and evaluated their performance at mapping fire severity.

Network Performance Classification

Network signal anomaly detection.

  • Group members: Laura Diao, Benjamin Sam, Jenna Yang

Abstract: Network degradation occurs in many forms, and our project will focus on two common factors: packet loss and latency. Packet loss occurs when one or more data packets transmitted across a computer network fail to reach their destination. Latency can be defined as a measure of delay for data to transmit across a network. For internet users, high rates of packet loss and significant latency can manifest in jitter or lag, which are indicators of overall poor network performance as perceived by the end user. Thus, when issues arise in these two factors, it would be beneficial for internet service providers to know exactly when the user is experiencing problems in real time. In real world scenarios, situations or environments such as poor port quality, overloaded ports, network congestion and more can impact overall network performance. In order to detect some of these issues in network transmission data, we built an anomaly detection system that predicts the estimated packet loss and latency of a connection and detects whether there is a significant degradation of network quality for the duration of the connection.

Real Time Anomaly Detection in Networks

  • Group members: Justin Harsono, Charlie Tran, Tatum Maston

Abstract: Internet companies are expected to deliver the speed their customers have paid for. However, for various reasons such as congestion or connectivity issues, it is inevitable that one will perceive degradations in network quality. To ensure the customer remains satisfied, monitoring systems must be built to inspect the quality of the connection. Our goal is to build a model that can detect, in real time, these regions of network degradation, so that an appropriate recovery can be enacted to offset them. Our solution is a combination of two anomaly detection methods that successfully detects shifts in the data, based on a rolling window of the data it has seen.
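
A minimal sketch of the rolling-window idea, using a z-score detector on a synthetic latency series in pandas; the series, window size, and threshold are illustrative, not the teams' tuned systems.

```python
# Minimal sketch of rolling-window anomaly detection on a latency series.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
latency = pd.Series(rng.normal(50, 5, 500))  # ms; synthetic "normal" traffic
latency.iloc[300:310] += 40                  # injected degradation

roll = latency.rolling(window=60)
zscore = (latency - roll.mean()) / roll.std()  # deviation from recent behavior
anomalies = zscore.abs() > 3                   # flag significant shifts

print(latency[anomalies].index.tolist())       # points flagged as degraded
```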

System Usage Reporting

Intel telemetry: data collection & time-series prediction of app usage.

  • Group members: Srikar Prayaga, Andrew Chin, Arjun Sawhney

Abstract: Despite advancements in hardware technology, PC users continue to face frustrating app launch times, especially on lower-end Windows machines. The desktop experience differs vastly from the instantaneous app launches and optimized experience we have come to expect even from low-end smartphones. We propose a solution that preemptively runs Windows apps in the background based on the user's app usage patterns. Our solution is two-step. First, we built telemetry collector modules in C/C++ to collect real-world app usage data from two of our personal Windows 10 devices. Next, we developed neural network models in Python, trained on the collected data, to predict app usage times and corresponding launch sequences. We achieved impressive results on selected evaluation metrics across different user profiles.

Predicting Application Use to Reduce User Wait Time

  • Group members: Sasami Scott, Timothy Tran, Andy Do

Abstract: Our goal for this project was to lower user wait time when loading programs by predicting the next used application. To obtain the needed data, we created data collection libraries. Using this data, we created a Hidden Markov Model (HMM) and a Long Short-Term Memory (LSTM) model, and the latter proved to be better. Using the LSTM, we can predict application use times and expand this concept to more applications. We created multiple LSTM models with varying results and ultimately chose the one with the most potential, which reported 90% accuracy.

INTELlinext: A Fully Integrated LSTM and HMM-Based Solution for Next-App Prediction With Intel SUR SDK Data Collection

  • Group members: Jared Thach, Hiroki Hoshida, Cyril Gorlla

Abstract: As the power of modern computing devices increases, so too do user expectations for them. Despite advancements in technology, computer users are often faced with the dreaded spinning icon waiting for an application to load. Building upon our previous work developing data collectors with the Intel System Usage Reporting (SUR) SDK, we introduce INTELlinext, a comprehensive solution for next-app prediction for application preload to improve perceived system fluidity. We develop a Hidden Markov Model (HMM) for prediction of the k most likely next apps, achieving an accuracy of 64% when k = 3. We then implement a long short-term memory (LSTM) model to predict the total duration that applications will be used. After hyperparameter optimization leading to an optimal lookback value of 5 previous applications, we are able to predict the usage time of a given application with a mean absolute error of ~45 seconds. Our work constitutes a promising comprehensive application preload solution with data collection based on the Intel SUR SDK and prediction with machine learning.
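
As a rough sketch of the duration-prediction component, here is a minimal LSTM regressor in PyTorch that mirrors the "lookback of 5 previous applications" framing; the feature dimension and synthetic batch are hypothetical, not the INTELlinext code.

```python
# Minimal sketch of an LSTM regressor over app-usage sequences.
import torch
import torch.nn as nn

class UsageLSTM(nn.Module):
    def __init__(self, n_features=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # predicted usage time (seconds)

    def forward(self, x):                 # x: (batch, lookback, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # predict from the last timestep

model = UsageLSTM()
batch = torch.randn(16, 5, 8)             # 16 sequences, lookback of 5
durations = model(batch)                  # (16, 1) predicted durations
loss = nn.functional.l1_loss(durations, torch.rand(16, 1))  # MAE objective
loss.backward()
```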

BigData Engineering Capstone Project with Tech-stack : Linux, MySQL, sqoop, HDFS, Hive, Impala, SparkSQL, SparkML, git

Subham2S/BigData-Engineering-Capstone-Project-1

BigData Engineering Capstone Project 1

🤖 Tech-Stack

Linux, MySQL, Sqoop, HDFS, Hive, Impala, SparkSQL, SparkML, Git

One of the big corporations needed data engineering services for a decade's worth of employee data. All employee datasets from that period were provided in six CSV files. The first step of this project was to create an Entity Relation Diagram and create a database in an RDBMS with all the tables for structuring and holding the data as per the relations between the tables. So, I imported the CSVs into a MySQL database, transferred to HDFS/Hive in an optimized format, and analyzed with Hive, Impala, Spark, and SparkML. Finally, I created a Bash Script to facilitate the end-to-end data pipeline and machine learning pipeline for automation purposes.

Importing data from the MySQL RDBMS to HDFS using Sqoop; creating Hive tables with a compressed file format (Avro); exploratory data analysis with Impala & SparkSQL; and building a Random Forest Classifier model & a Logistic Regression model using SparkML.

Upload the Capstone_Inputs folder to the client home directory; it contains:

  • Capstone_P1.sh
  • CreateMySQLTables.sql
  • HiveTables.sql
  • capstone.py
  • departments.csv
  • dept_emp.csv
  • dept_manager.csv
  • employees.csv
  • salaries.csv

Run the Bash Script Capstone_P1.sh file in Terminal

Wait a while, then download the Capstone_Outputs folder. After approx. 10-15 minutes, the Capstone_Outputs folder will be generated with all the output files:

1. Cap_MySQLTables.txt - to check the MySQL tables.
2. Cap_HiveDB.txt - to ensure that the Hive tables were created.
3. Cap_ImpalaAnalysis.txt - all EDA output tables from Impala.
4. Cap_HiveTables.txt - to check the records in the Hive tables; dept_emp1 is created additionally to fix some duplicate issues present in dept_emp.
5. Cap_SparkSQL_EDA_ML.txt - all EDA output tables from SparkSQL/pySpark and all the details of the models (both Random Forest & Logistic Regression).
6. random_forest.model.zip
7. logistic_regression.model.zip

🔍 Details of Capstone_P1.sh

Linux Commands

  • Removes the metadata of the tables which are there in the Root dir (created by the sqoop command when the code was run last time)
  • Removes the Java MapReduce Codes which are there in the Root dir (created by the sqoop command when the code was run last time)
  • Removes the current Capstone_Outputs Folder and Creates a new dir Capstone_Outputs - Here all the Outputs will be stored.
  • Recursively copies everything to the root folder to avoid permission issues later on.

MySQL ( .sql )

  • Creates MySQL tables & Inserts data. For more details, please check out CreateMySQLTables.sql
  • Removes & Creates the Warehouse/Capstone dir to avoid anomalies between same named files
  • Importing Data & Metadata of all the Tables from MySQL RDBMS system to Hadoop using SQOOP command

HDFS Commands

  • Transferring the metadata to HDFS for table creation in Hive

Hive ( .hql )

  • All the Hive tables are created in AVRO format. In the HiveDB.hql file, the table locations and their metadata (schema) locations are specified separately.

Impala ( .sql )

  • Exploratory Data Analysis is done with Impala. For more details, please check out EDA.sql
  • Checking all the records of the Hive Tables before moving to spark. For more details, please check out HiveTables.sql

Spark ( .py )

  • This capstone.py does everything. First it loads the tables and creates Spark DataFrames, then checks all the records again. After that, the same EDA is performed with the aid of SparkSQL & pySpark.
  • After the EDA, it checks statistics for the numerical & categorical variables, then proceeds to model building after creating the final DataFrame by joining the tables and dropping irrelevant columns. Given the chosen target variable 'left', the independent variables were divided into continuous and categorical variables; among the categorical variables, two columns were label-encoded manually and the rest were processed for one-hot encoding.
  • Then, based on the EDA, both a Random Forest classification model and a Logistic Regression model were chosen for this dataset. Per the analysis, the accuracies were 99% (RF) and 90% (LR). The models were fitted on a train/test split (0.7 : 0.3), gave the same accuracy on both, and, considering these good fits, both models were saved.
  • After that, a Pipeline was created and the same analysis was performed in a streamlined manner to rebuild these models (see the sketch below). The accuracies of the standalone models and the Pipeline models are very close. The slight difference arises because, in the earlier case, the train/test split was performed after fitting the assembler, whereas in the ML Pipeline the assembler is one of the stages, so it is fitted on the split datasets separately as part of the pipeline. This is also clearly visible in the features column. This was a good test of the pipeline models in terms of accuracy, and we can conclude that the ML Pipeline is working properly.
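
For readers unfamiliar with Spark ML pipelines, here is a minimal sketch of the pattern described above, with the assembler and classifier as pipeline stages fit on a 70/30 split. The column names are hypothetical, and `df` stands in for the joined Spark DataFrame.

```python
# Minimal sketch of a Spark ML pipeline: assembler + classifier as stages.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train, test = df.randomSplit([0.7, 0.3], seed=42)  # df: joined DataFrame

assembler = VectorAssembler(
    inputCols=["salary", "tenure", "dept_index"],  # hypothetical features
    outputCol="features")
rf = RandomForestClassifier(labelCol="left", featuresCol="features")

pipeline = Pipeline(stages=[assembler, rf])  # assembler is fitted inside
model = pipeline.fit(train)

evaluator = MulticlassClassificationEvaluator(
    labelCol="left", predictionCol="prediction", metricName="accuracy")
print(evaluator.evaluate(model.transform(test)))
```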

Collecting the Models

📚 Reference Files

The following files are added for your reference.

  • Capstone.ipynb
  • Capstone Project1.pptx , Capstone Project1.pdf
  • Capstone.zip
  • ERD_Data Model.jpg , ERD_Data Model.svg

Top 15 Big Data Projects (With Source Code)

Almost 6,500 million connected devices communicate data via the Internet today, and this figure will climb to 20,000 million by 2025. Big data analyzes this "sea of data" and translates it into the information that is reshaping our world. Big data refers to massive volumes of data, both structured and unstructured, that bombard enterprises daily; but it is not simply the type or quantity of data that matters, it is also what businesses do with it. Big data can be evaluated for insights that help people make better decisions and feel more confident about key business choices. These vast, diversified amounts of data grow at an exponential rate, and the volume of data, the velocity or speed with which it is created and collected, and the variety or scope of the data points covered (the "three V's" of big data) are all factors to consider. Big data is frequently derived through data mining and arrives in a variety of formats.

Unstructured and structured big data are the two types of big data. Structured data refers to data that has a set length and format: numbers, dates, and strings (collections of words and numbers) are examples of structured data. Unstructured data is unorganized data that does not fit into a predetermined model or format; it includes information gleaned from social media sources that helps organizations gather information on customer needs.

Key Takeaways

  • Big data is a large amount of diversified information that is arriving in ever-increasing volumes and at ever-increasing speeds.
  • Big data can be structured (typically numerical, readily formatted and stored) or unstructured (more free-form, non-numerical, and more difficult to format and store).
  • Big data analysis may benefit nearly every function in a company, but dealing with the clutter and noise can be difficult.
  • Big data can be gathered willingly through personal devices and applications, through questionnaires, product purchases, and electronic check-ins, as well as publicly published remarks on social networks and websites.
  • Big data is frequently kept in computer databases and examined with software intended to deal with huge, complicated data sets.

Just knowing the theory of big data isn’t going to get you very far. You’ll need to put what you’ve learned into practice. You may put your big data talents to the test by working on big data projects. Projects are an excellent opportunity to put your abilities to the test. They’re also great for your resume. In this article, we are going to discuss some great Big Data projects that you can work on to showcase your big data skills.

1. Traffic control using Big Data

Big Data initiatives that simulate and predict traffic in real-time have a wide range of applications and advantages. The field of real-time traffic simulation has been modeled successfully. However, anticipating route traffic has long been a challenge. This is because developing predictive models for real-time traffic prediction is a difficult endeavor that involves a lot of latency, large amounts of data, and ever-increasing expenses.

The following project is a Lambda Architecture application that monitors the traffic safety and congestion of each street in Chicago. It depicts current traffic collisions, red light, and speed camera infractions, as well as traffic patterns on 1,250 street segments within the city borders.

These datasets have been taken from the City of Chicago’s open data portal:

  • Traffic Crashes shows each crash that occurred within city streets as reported in the electronic crash reporting system (E-Crash) at CPD. Citywide data are available starting September 2017.
  • Red Light Camera Violations reflect the daily number of red light camera violations recorded by the City of Chicago Red Light Program for each camera since 2014.
  • Speed Camera Violations reflect the daily number of speed camera violations recorded by each camera in Children’s Safety Zones since 2014.
  • Historical Traffic Congestion Estimates estimates traffic congestion on Chicago’s arterial streets in real-time by monitoring and analyzing GPS traces received from Chicago Transit Authority (CTA) buses.
  • Current Traffic Congestion Estimate shows current estimated speed for street segments covering 300 miles of arterial roads. Congestion estimates are produced every ten minutes.

The project implements the three layers of the Lambda Architecture:

  • Batch layer – manages the master dataset (the source of truth), which is an immutable, append-only set of raw data. It pre-computes batch views from the master dataset.
  • Serving layer – responds to ad-hoc queries by returning pre-computed views (from the batch layer) or building views from the processed data.
  • Speed layer – deals with up-to-date data only to compensate for the high latency of the batch layer
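
A toy illustration of how the three layers cooperate for a count-style metric, with hypothetical segment IDs (pure Python, no framework): the batch layer precomputes views over the master dataset, the speed layer keeps only recent increments, and the serving layer merges the two at query time.

```python
# Toy sketch of the three Lambda Architecture layers for a counter metric.
from collections import Counter

batch_view = Counter({"segment_12": 340, "segment_7": 125})  # precomputed view
speed_view = Counter()                                       # recent events only

def ingest_realtime(segment):
    # Speed layer: cheap incremental update, compensates for batch latency.
    speed_view[segment] += 1

def query(segment):
    # Serving layer: merge the precomputed view with the realtime delta.
    return batch_view[segment] + speed_view[segment]

ingest_realtime("segment_12")
print(query("segment_12"))  # 341: batch view + realtime delta
```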

Source Code – Traffic Control

2. Search Engine

To comprehend what people are looking for, search engines must deal with trillions of network objects and monitor the online behavior of billions of people. Search engines convert website material into quantifiable data. The given project is a full-featured search engine built on top of a 75-gigabyte Wikipedia corpus with sub-second search latency. It uses several datasets, such as stopwords.txt (a text file containing all the stop words, in the current directory of the code) and wiki_dump.xml (the XML file containing the full data of Wikipedia). The results show wiki pages sorted by TF/IDF (Term Frequency - Inverse Document Frequency) relevance based on the search term(s) entered. This project addresses latency, indexing, and big-data concerns with efficient code and the K-way merge sort method.
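
To see what TF/IDF ranking amounts to, here is a minimal self-contained sketch; the three "documents" are hypothetical stand-ins for wiki pages, and a real engine would add an inverted index plus the K-way merge described above.

```python
# Minimal sketch of TF-IDF ranking over a toy corpus.
import math
from collections import Counter

docs = {"page1": "apple banana apple",
        "page2": "banana cherry",
        "page3": "apple cherry cherry cherry"}
tokens = {d: text.split() for d, text in docs.items()}

def tf_idf(term, doc):
    tf = Counter(tokens[doc])[term] / len(tokens[doc])   # term frequency
    df = sum(term in t for t in tokens.values())          # document frequency
    idf = math.log(len(docs) / df) if df else 0.0         # inverse doc freq
    return tf * idf

query = "cherry"
ranked = sorted(docs, key=lambda d: tf_idf(query, d), reverse=True)
print(ranked)  # pages ordered by TF-IDF relevance to the query
```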

Source Code – Search Engine

3. Medical Insurance Fraud Detection

A unique data science model that uses real-time analysis and classification algorithms to help predict fraud in the medical insurance market. The government can use this tool to benefit patients, pharmacies, and doctors, ultimately helping to improve industry confidence, address rising healthcare expenses, and counter the impact of fraud. Medical insurance fraud is a major problem that costs Medicare/Medicaid and the insurance industry a great deal of money.

4 different Big Datasets have been joined in this project to get a single table for final data analysis. The datasets collected are:

  • Part D prescriber services - data such as the doctor's name and address, disease, symptoms, etc.
  • List of Excluded Individuals and Entities (LEIE) database - a list of individuals and entities that are excluded from participating in federally funded healthcare programs (e.g., Medicare) because of past healthcare fraud.
  • Payments Received by Physician from Pharmaceuticals
  • CMS Part D dataset - data from the Centers for Medicare and Medicaid Services

It was developed by considering different key features and applying different machine learning algorithms to see which performs better. The ML algorithms used were trained to detect irregularities in the dataset so that the authorities can be alerted.

Source Code – Medical Insurance Fraud

4. Data Warehouse Design for an E-Commerce Site

A data warehouse is essentially a vast collection of data for a company that assists the company in making educated decisions based on data analysis. The data warehouse designed in this project is a central repository for an e-commerce site, containing unified data ranging from searches to purchases made by site visitors. By establishing such a data warehouse, the site can manage supply based on demand (inventory management), manage logistics, price for maximum profitability, and target advertisements based on searches and items purchased. Recommendations can also be made based on tendencies in a certain area, as well as age groups, sex, and other shared interests. This is a data warehouse implementation for an e-commerce website, "Infibeam," which sells digital and consumer electronics.

Source Code – Data Warehouse Design

5. Text Mining Project

You will be required to perform text analysis and visualization of the delivered documents as part of this project. For beginners, this is one of the best project ideas: text mining is in high demand, and it can help you demonstrate your abilities as a data scientist. You can deploy Natural Language Processing techniques to gain useful information from the link provided below, which contains a collection of NLP tools and resources for various languages.

Source Code – Text Mining

6. Big Data Cybersecurity

The major goal of this big data project is to use complex multivariate time series data to analyze vulnerability disclosure trends in real-world cybersecurity concerns. In this project, outlier and anomaly detection technologies based on Hadoop, Spark, and Storm are interwoven with the system's machine learning and automation engine for real-time fraud detection and for intrusion detection and forensics.

For independent big data multi-inspection/forensics of high-level risks, or for datasets whose volume exceeds local resources, it uses the Ophidia Analytics Framework, an open-source big data analytics framework that contains cluster-aware parallel operators for data analysis and mining (subsetting, reduction, metadata processing, and so on). The framework is fully integrated with the Ophidia Server: it takes commands from the server and responds with notifications, allowing workflows to run smoothly.

Lumify, an open-source big data analysis and visualization platform, is also included in the cybersecurity system. It provides analysis and visualization of each fraud or intrusion event inside temporary, compartmentalized virtual machines, creating a full snapshot of the network infrastructure and the infected device. This allows for in-depth analytics and forensic review, and it produces a portable threat analysis for executive-level next steps.

Lumify, developed by Cyberitis, is launched using both local and cloud resources (customizable per environment and user). Only the backend servers (Hadoop, Accumulo, Elasticsearch, RabbitMQ, Zookeeper) are included in the open-source Lumify Dev virtual machine; this VM allows developers to get up and running quickly without having to install the entire stack on their development workstations.

Source Code – Big Data Cybersecurity

7. Crime Detection

The following project is a multi-class classification model for predicting the types of crimes in the city of Toronto. Using big data sourced from the Toronto Police (the dataset includes every major crime committed from 2014-2017* in the city, with detailed information about the location and time of each offense), the developer constructed a multi-class classification model with a Random Forest classifier to predict the type of major crime committed based on time of day, neighborhood, division, year, month, and other features.
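
A hedged sketch of the modeling idea is below: a Random Forest trained on toy time/location features to predict a crime category. The feature and label names are placeholders, not the actual Toronto Police schema:

```python
# Illustrative multi-class crime-type prediction with a Random Forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "hour":  [2, 14, 23, 9, 3, 17, 22, 11],
    "month": [1, 6, 12, 3, 1, 7, 11, 4],
    "neighbourhood_id": [5, 12, 5, 8, 5, 12, 8, 12],
    "crime_type": ["assault", "theft", "assault", "break-in",
                   "assault", "theft", "break-in", "theft"],
})

X = data[["hour", "month", "neighbourhood_id"]]
y = data["crime_type"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```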

The use of big data analytics here is to discover crime tendencies automatically. If analysts are given automated, data-driven tools to discover crime patterns, police can better comprehend criminal behavior, produce more precise analyses of past crimes, and focus investigations on likely suspects.

Source Code – Crime Detection

8. Disease Prediction Based on Symptoms

With the rapid advancement of technology and data, healthcare is one of the most significant research fields of the contemporary era. The enormous amount of patient data is tough to manage, and big data analytics makes managing it easier (Electronic Health Records are one of the biggest examples of the application of big data in healthcare). Knowledge derived from big data analysis gives healthcare specialists insights that were not available before. In healthcare, big data is used at every stage of the process, from medical research to patient experience and outcomes. There are numerous ways of treating various ailments throughout the world, and machine learning and big data offer new approaches that aid in disease prediction and diagnosis. This project explores how machine learning algorithms can be used to forecast diseases based on symptoms. The following algorithms have been explored in the code (a toy comparison sketch follows the list):

  • Naive Bayes
  • Decision Tree
  • Random Forest
  • Gradient Boosting
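
The toy sketch below cross-validates the four algorithms above on synthetic binary symptom flags. It only illustrates the comparison workflow; real medical data demands far more care:

```python
# Illustrative comparison of the four listed algorithms on toy symptom data.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 10))  # 10 binary symptom flags per patient
y = (X[:, 0] & X[:, 3]) | X[:, 7]       # synthetic "disease" labeling rule

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```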

Source Code – Disease Prediction

9. Yelp Review Analysis

Yelp is a forum for users to submit reviews and rate businesses with a star rating. According to studies, an increase of one star can result in a roughly 5-9 percent rise in revenue for independent businesses. As a result, we believe the Yelp dataset has a lot of potential as a powerful source of insight. Yelp's customer reviews are a gold mine waiting to be mined.

This project's main goal is to conduct in-depth analyses of restaurants serving seven different cuisines: Korean, Japanese, Chinese, Vietnamese, Thai, French, and Italian, to determine what makes a good restaurant and what concerns customers, and then to make recommendations for future improvement and profit growth. We will mostly evaluate customer reviews to determine why customers like or dislike a business. Using big data, we can turn the unstructured data (reviews) into actionable insights, allowing businesses to better understand how and why customers prefer their products or services and to make improvements as quickly as feasible.
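
A full review-analysis pipeline is beyond a short snippet, but the toy lexicon-based scorer below conveys the basic idea of turning review text into a sentiment signal. The word lists are illustrative, not the project's actual lexicon:

```python
# Toy lexicon-based review scoring (illustrative stand-in).
positive = {"delicious", "friendly", "fresh", "great"}
negative = {"slow", "cold", "rude", "bland"}

def review_score(text: str) -> int:
    """Positive-word count minus negative-word count."""
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

reviews = [
    "Great sushi and friendly staff",
    "Service was slow and the soup was cold",
]
for r in reviews:
    print(review_score(r), "-", r)
```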

Source Code – Review Analysis

10. Recommendation System

Thousands, millions, or even billions of objects, such as merchandise, video clips, movies, music, news articles, blog entries, advertisements, and so on, are typically available through online services. The Google Play Store, for example, has millions of apps, and YouTube has billions of videos. The Netflix Recommendation Engine, the company's most effective algorithm, is made up of algorithms that select material based on each user's profile: it filters over 3,000 titles at a time using 1,300 recommendation clusters based on user preferences, and it is so accurate that customized recommendations from the engine drive 80 percent of Netflix viewer activity. Big data provides plenty of user data, such as past purchases, browsing history, and comments, for recommendation systems to deliver relevant and effective recommendations. In a nutshell, without massive data, even the most advanced recommenders will be ineffective. Big data is likewise the driving force behind this mini movie recommendation system; the goal of the project is to compare the performance of various recommendation models on the Hadoop framework.
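
To show the core idea behind such systems, here is a minimal user-based collaborative filtering sketch with cosine similarity. It is a conceptual illustration only; the actual project benchmarks full recommendation models on Hadoop:

```python
# Minimal user-based collaborative filtering sketch.
import numpy as np

# Rows = users, columns = movies; 0 means "not yet rated"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

target = 0  # recommend for the first user
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = 0.0  # ignore self-similarity

# Predicted scores: similarity-weighted average of other users' ratings
pred = sims @ ratings / (sims.sum() + 1e-9)
unseen = np.where(ratings[target] == 0)[0]
print("recommend movie index:", unseen[np.argmax(pred[unseen])])
```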

Source Code – Recommendation System

11. Anomaly Detection in Cloud Servers

Anomaly detection is a useful tool for cloud platform managers who want to keep track of and analyze cloud behavior in order to improve cloud reliability. It assists cloud platform managers in detecting unexpected system activity so that preventative actions can be taken before a system crash or service failure occurs.

This project provides a reference implementation of a Cloud Dataflow streaming pipeline that integrates with BigQuery ML and Cloud AI Platform to perform anomaly detection. A key component of the implementation leverages Dataflow for feature extraction and real-time outlier identification, and it has been tested analyzing over 20 TB of data.
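
The reference pipeline itself relies on Dataflow and BigQuery ML, so as a simplified stand-in, the sketch below applies a rolling z-score detector to a metric stream to flag outliers:

```python
# Simplified stand-in for the pipeline's outlier step: a rolling z-score
# detector over a metric stream (the real implementation uses Dataflow).
from collections import deque
import statistics

window = deque(maxlen=50)  # rolling window of recent "normal" values

def is_anomaly(value: float, threshold: float = 3.0) -> bool:
    """Flag values more than `threshold` std-devs from the rolling mean."""
    if len(window) >= 10:
        mean = statistics.fmean(window)
        std = statistics.pstdev(window)
        if std > 0 and abs(value - mean) / std > threshold:
            return True  # anomalous values are not added to the window
    window.append(value)
    return False

stream = [10.1, 9.8, 10.3, 10.0, 9.9] * 4 + [42.0]  # spike at the end
print([v for v in stream if is_anomaly(v)])          # -> [42.0]
```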

Source Code – Anomaly Detection

12. Smart Cities Using Big Data

A smart city is a technologically advanced metropolitan region that collects data using various electronic technologies, voice activation methods, and sensors. The information gleaned from the data is used to manage assets, resources, and services efficiently; in turn, the data is used to improve operations throughout the city. Data is collected from citizens, devices, buildings, and assets, and is then processed and analyzed to monitor and manage traffic and transportation systems, power plants, utilities, water supply networks, waste, crime detection, information systems, schools, libraries, hospitals, and other community services. Big data gathers this information and, with the help of advanced algorithms, smart network infrastructures, and various analytics platforms, can implement the sophisticated features of a smart city. This smart city reference pipeline shows how to integrate various media building blocks, with analytics powered by the OpenVINO Toolkit, for traffic or stadium sensing, analytics, and management tasks.

Source Code – Smart Cities

13. Tourist Behavior Analysis

This is one of the most innovative big data project concepts. This big data project aims to study visitor behavior to discover travelers' preferences and most frequented destinations, as well as to forecast future tourism demand.

What is the role of big data in this project? Because visitors utilize the internet and other technologies while on vacation, they leave digital traces that big data systems can readily collect; the majority of the data comes from external sources such as social media sites. The sheer volume of data is simply too much for a standard database to handle, necessitating big data analytics. All the information from these sources can be used to help firms in the aviation, hotel, and tourism industries find new customers and advertise their services. It can also assist tourism organizations in visualizing and forecasting current and future trends.

Source Code – Tourist Behavior Analysis

14. Web Server Log Analysis

A web server log keeps track of page requests as well as the actions the server has taken. To examine the data further, web servers can be used to store, analyze, and mine it; in this manner, page advertising can be targeted and SEO (search engine optimization) can be performed. Web server log analysis can be used to get a sense of the overall user experience, and this type of processing benefits any company that relies heavily on its website for revenue generation or client communication. This interesting big data project demonstrates parsing (including incorrectly formatted strings) and analysis of web server log data.
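
A minimal version of the parsing step might look like the sketch below, which tolerates malformed lines while tallying HTTP status codes from Common Log Format entries:

```python
# Sketch of parsing Common Log Format lines, skipping malformed entries.
import re
from collections import Counter

LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

lines = [
    '127.0.0.1 - - [01/Jul/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    'garbled line that does not match the format',
    '10.0.0.5 - - [01/Jul/2024:10:00:05 +0000] "GET /cart HTTP/1.1" 404 128',
]

status_counts = Counter()
for line in lines:
    m = LOG_RE.match(line)
    if m is None:
        continue  # skip incorrectly formatted entries
    ip, ts, method, path, status, size = m.groups()
    status_counts[status] += 1

print(status_counts)  # e.g., Counter({'200': 1, '404': 1})
```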

Source Code – Web Server Log Analysis

15. Image Caption Generator

Because of the rise of social media and the importance of digital marketing, businesses must now upload engaging content. Visuals that are appealing to the eye are essential, but captions that describe the images are also required. The use of hashtags and attention-getting captions can help you reach the right audience even more effectively. Large datasets of correlated photos and captions must be managed. Image processing and deep learning are used to comprehend the image, and artificial intelligence is used to generate captions that are both relevant and appealing. The source code can be written in Python. Generating image captions isn't a beginner-level big data project and is indeed challenging. The project given below uses a neural network to generate captions for an image, combining a CNN (Convolutional Neural Network) and an RNN (Recurrent Neural Network) with beam search (a heuristic search algorithm that explores a graph by expanding only the most promising nodes in a limited set).

Rich datasets are now available for image captioning work, such as MSCOCO, Flickr8k, Flickr30k, PASCAL 1K, the AI Challenger Dataset, and STAIR Captions, and they are progressively becoming a topic of discussion. The given project utilizes state-of-the-art ML and big data techniques to build an effective image caption generator.
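
To make the beam search idea concrete, the toy sketch below runs beam search over a hand-written next-word model. The project's real decoder works over CNN/RNN caption probabilities instead; the probability table here is invented for illustration:

```python
# Minimal beam-search sketch over a toy next-word model.
import math

# Toy conditional probabilities: next word given the previous word
NEXT = {
    "<s>":  {"a": 0.6, "the": 0.4},
    "a":    {"dog": 0.7, "cat": 0.3},
    "the":  {"dog": 0.4, "cat": 0.6},
    "dog":  {"runs": 0.8, "</s>": 0.2},
    "cat":  {"sits": 0.7, "</s>": 0.3},
    "runs": {"</s>": 1.0},
    "sits": {"</s>": 1.0},
}

def beam_search(beam_width=2, max_len=5):
    beams = [(["<s>"], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "</s>":
                candidates.append((seq, score))  # finished caption
                continue
            for word, p in NEXT[seq[-1]].items():
                candidates.append((seq + [word], score + math.log(p)))
        # keep only the most promising partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

seq, score = beam_search()
print(" ".join(seq), math.exp(score))  # -> "<s> a dog runs </s>" 0.336
```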

Source Code – Image Caption Generator

Big Data is a fascinating topic. It helps in the discovery of patterns and outcomes that might otherwise go unnoticed. Big Data is being used by businesses to learn what their customers want, who their best customers are, and why people choose different products. The more information a business has about its customers, the more competitive it is.

It can be combined with Machine Learning to create market strategies based on customer predictions. Companies that use big data become more customer-centric.

This expertise is in high demand, and learning it will help you advance your career swiftly. So, if you're new to big data, the best thing you can do is start brainstorming some big data project ideas.

We've examined some of the best big data project ideas in this article. We began with some simple projects that you can complete quickly. After you've completed these beginner tasks, we recommend going back to understand a few additional concepts before moving on to the intermediate projects. Once you've gained confidence, you can take on more advanced projects.

What are the 3 types of big data? Big data is classified into three main types:

  • Structured
  • Semi-structured
  • Unstructured

What can big data be used for? Some important use cases of big data are:

  • Improving science and research
  • Improving governance
  • Smart cities
  • Understanding and targeting customers
  • Understanding and optimizing business processes
  • Improving healthcare and public health
  • Financial trading
  • Optimizing machine and device performance

What industries use big data? Big data finds its application in various domains. Some fields where big data can be used efficiently are:

  • Travel and tourism
  • Financial and banking sector
  • Telecommunication and media
  • Government and military
  • Social media


Data Science Master’s Students Tackle Diverse, Real-World Challenges in Capstone Projects

The spring 2024 residential data science master's class.

Human trafficking, illegal fishing, small business contracts — these are all issues that, at first glance, seemingly have little in common.    

Yet, for students in the University of Virginia’s data science residential master’s program, these disparate subjects have one critical common denominator: they could all be better understood through the application of data science methods.

Fifteen groups of graduate students at the School of Data Science recently presented the findings of their capstone projects before a sizable crowd of faculty, staff, and fellow students at The Graduate Hotel in Charlottesville.

Capstone projects have long been a cornerstone of the data science master’s program. In them, students work in groups with a faculty mentor as well as an outside sponsor to tackle a real-world problem.  

It’s an opportunity for students to collaborate with their classmates and apply what they are learning from faculty toward addressing a real issue that their sponsor is dealing with — experiences that often leave lasting impressions.

“When we talk to alumni years later, they all say that the two things that they remember most about the program are the capstone experience and also a cohort experience,” said Phil Bourne, founding dean of the School of Data Science, in opening remarks.

The projects offered students the chance to learn about new subjects and methods as well as develop skills that could prove vital as they begin their careers.

“While you’ve done many projects, this is the first time that it was a very large project that you had to break it up and project manage it,” Adam Tashman, an associate professor of data science and capstone program director, told the students.

He added that perhaps the most important lesson students will take from the experience is “how to deliver for a customer.”

For more than four hours, with a break for lunch, groups of students laid out their findings in brief presentations followed by questions from the audience.  

In one, which focused on predicting which federal agencies small businesses could match with to secure contracts, David Diaz, whose father owns a small business in California, described the personal nature of the group’s work.

“I’ve seen this firsthand; I still see it with my dad now. There are a lot of long hours. It isn’t really a 9 to 5 job — it’s a 24/7 job,” Diaz said, describing the continual process that small business owners face in gathering information to try to secure bids while ensuring business operations run efficiently.  

“So, the goal of this project is to hopefully reduce that research time in reaching the federal contracting market and, hopefully, allow businesses to have a finer scope into what they’re looking for,” he added.

David Diaz addresses the audience during his group's capstone project presentation. (Photo by Alyssa Brown)

Another group laid out their work on illegal fishing, discussing the complexity of this global issue and the vast problems it creates. They also highlighted a key point that any data scientist must confront: how to classify the data they had.

“Kind of like an indie album, our data set is unlabeled,” joked Samuel Brown, who explained how he and his group created labels to differentiate between legal and illegal fishing.

The group discussed how their project demonstrated that machine learning could be used to help predict illegal fishing, information that could potentially reduce its prevalence.

Like any complex challenge, completing a capstone project can be stressful. And in those difficult times, sources of wisdom are sometimes found in unexpected places.  

Sunidhi Goyal, who works as a tennis instructor for UVA Recreation, recounted how one of her six-year-old students asked her one day what was bothering her.

Goyal, part of a group that worked with LMI on empowering their enterprise architecture team, said she wasn’t sure how she could explain the complexities of their project to a young child.  

But she tried, describing, in simple terms, how she and her collaborators needed to find a way to allow enterprise architects to sort through a large number of documents and keep just the relevant ones — the “good” documents, she called them.

“She was like, ‘Oh, what if the good could be a magnet, and you could keep it together and let go of the bad,” Goyal said the student responded.  

“This is exactly what we ended up doing,” Goyal said, explaining how her team used a method called principal component analysis to retain only the most relevant documents, an approach that helped lead them to their solution.


As the day wound down, audience members voted on awards, and Tashman praised the students for the effort, passion, and purpose they exhibited in taking on the challenges presented by their sponsors.  

“These are all important things that our sponsors need help with, and you all took ownership of that. I think you took it to heart and really put your heart and soul into it,” he said.  

And while the completion of their capstone projects signaled an end to their time as master’s students at the School of Data Science, it also marked the beginning of a much longer journey to come. 

“Let this be the first of many real-world problems that you face and that you tackle,” Tashman said.  

Awards, as voted on by the audience

Most Innovative Analytical Solution: “Optimizing the ALMA Research Proposal Process with Machine Learning”

Group members: Brendan Puglisi, Arnav Boppudi, Kaleigh O’Hara, Noah McIntire, Ryan Lipps

Most Compelling Data Visualization: “Detecting Illegal Fishing with Automatic Identification Systems and Machine Learning”

Group members: Samuel Brown, Danielle Katz, Dana Korotovskikh, Stephen Kullman

Most Engaging Data Story: “Predicting Winter California Precipitation with Convolutional Neural Networks”

Group members: Anthony Chiado, Kristian Olsson, Luke Rohlwing, Michael Vaden

Most Impactful Ethical Engagement: “Detecting Human Trafficking”

Group members: Jacqui Unciano, Grace Zhang, Tatev Gomtsyan, Serene Lu 


Data Science Summer 2024 Capstone Project Showcase


Capstone projects are the culmination of the MIDS students’ work in the School of Information’s Master of Information and Data Science program.

Over the course of their final semester, teams of students propose and select project ideas, conduct and communicate their work, receive and provide feedback, and deliver compelling presentations along with a web-based final deliverable.

Join us for an online presentation of these capstone projects. Six teams will present for twenty minutes each, including Q&A.

A panel of judges will select an outstanding project for the Hal R. Varian MIDS Capstone Award.


More information.

Summer 2024 MIDS Project Descriptions

If you have questions about this event, please contact the Student Affairs team at [email protected].


Individualized Curriculum


Senior Capstone Project


An Education Unique As You

A New College education is all about you—your vision, your passions, your originality. As you work with your faculty adviser on your individual contract each semester, you’ll outline your academic and personal goals for the upcoming term. Along with traditional courses, you’ll explore new subjects through labs, tutorials and Independent Study Projects (ISPs). Every New College journey culminates in a senior capstone project or thesis, which serves as your grand finale — your project, your way — from a research paper or film to an art exhibit.

Areas of Concentration

Choose one of our 50 areas of concentration (majors) or design a multi-disciplinary or special area of concentration. Explore your interests and discover new ones as you choose from hundreds of courses, tutorials, labs, and seminars. With an average faculty-to-student ratio of 6:1, you’ll never take a class in a lecture hall with hundreds of students.

Academic Contracts

You’ll work with your faculty adviser to build an academic “contract” each semester instead of a one-size-fits-all traditional degree program, planning your courses and goals.

Narrative Evaluations

Your professors will give you in-depth evaluations on your coursework that tell your story, highlight your success, and show your promise.

Independent Study Period

In January, you’ll explore a topic you’re passionate about through an ISP (whether that is a lab experiment, internship, study abroad experience or other creative endeavor). Recent ISPs include:

  • Fighting red tide (and looking towards a fungal solution)
  • Developing a robotic prosthetic hand
  • Using big data to solve economic and social problems

We call it a senior capstone project or thesis, but it’s really the culmination of your journey here at New College—whether it takes the form of a scientific research paper or a theatrical performance. With coaching from your faculty adviser, you’ll be ready to defend your work during the baccalaureate exam before you graduate. Sound like grad school? That’s because it is so impressive.

Recent examples include:

  • Sarasota Bay: A Newly Defined Nursery Area for Blacktip Sharks (Carcharhinus limbatus) on the Gulf Coast of Florida
  • Footprints in the Atmosphere: A Quantitative Analysis of Community Carbon Emissions to Ignite Collective Climate Action
  • Cultural Gentrification: Hip-Hop & Racial Epistemologies in the United States



2024 Capstone Award Winners

Congratulations to the 2024 Dean’s Choice Award winners and Honorable Mentions!

The School of Information’s annual spring poster sessions showcase graduating students’ research and professional experience projects. This was the largest poster session in our school’s history and the first time that undergraduates and graduates proudly presented their capstone projects together.

View spring 2024 capstone projects.  

Undergraduate Dean’s Choice Award Winner & Honorable Mention

Dean Eric T. Meyer is pleased to present the Dean’s Choice Award to undergraduate student Michael Chen and Honorable Mention to Courtenay-Dee O'Brien!

“One of the things that really stood out about both undergraduate award winners was their ability to explain not only the details of their projects, which was evident in many of the students I spoke with, but also to help others understand the big picture questions of why the topics were important to them and to the world,” Dean Meyer said. “Whether it is a vision for saving significant amounts of energy using technology tools, or using data to protect elephant populations, the projects tackled big and important issues, and did so in creative ways using the knowledge and skills developed in the iSchool.”

Winner: “Design of Intuitive Visualizations for Residential Heating and Cooling Demands”

Michael Chen worked with the School of Information's Dr. James Howison to design new interpretable ways of visualizing the different temperature and comfort conditions a home faces throughout the year. He simulated hourly data for mock houses in Austin before generating a variety of visualizations, including calendar-shaped heatmaps and time series charts that highlighted the most extreme weather conditions. Used in tandem, this project's visualizations aim to help the viewer intuitively understand what their home would feel like throughout the year, potentially aiding homeowners in their HVAC sizing decisions.

“I really enjoyed combining the data skills and the focus on human-centered design that the iSchool teaches in this project,” Michael said. “It was an engaging challenge to think about how to visualize data that isn't normally communicated - like what a temperature difference actually feels like. I feel that utilizing interdisciplinary approaches, along with an emphasis on how data is conveyed, aligns with the spirit of the Informatics program.”

Michael’s project supervisor, James Howison, said, “Michael jumped right in and drove this project from a vague idea to an innovative visualization. Capstones are great opportunities to throw a fun idea up in the air and see what happens; the idea for this one was conceived when we had our HVAC system swapped out. I didn’t even have the data in hand, but Michael found the simulation software and got it up and running. Our iSchool students do a great job with capstones, Michael certainly did!”

Congratulations on this recognition of your research, Michael!

Honorable Mention: “From Data to Defense: Mapping Elephant Poaching Trends for Targeted Conservation”

Courtenay-Dee O'Brien’s capstone project addressed the multifaceted challenges of illegal elephant poaching, driven by factors such as ivory demand, corruption, poverty, and criminal syndicates. By integrating data engineering, AI/ML, and geospatial analysis, the project analyzed historical and real-time data to uncover trends and risk factors in poaching incidents. Key findings indicated a strong inverse correlation between illegal carcass counts and both political stability and economic conditions, highlighting the importance of socio-economic interventions in regions with high poaching rates.

“This project has significantly enhanced my expertise in data engineering and data science, strategically equipping me for my forthcoming role as a Data Engineering and Applied AI Analyst at Deloitte & Touche,” Courtenay-Dee said. “Engaging in every facet of the data pipeline—from collection through to feature engineering—has not only honed my technical acumen but also enriched my perspective with interdisciplinary insights across technology, governance, African Elephant conservation, and business strategy. This robust experience is crucial as I step into the evolving landscape of data analytics, ready to tackle complex challenges with innovative solutions.”

Congratulations, Courtenay-Dee! 

Graduate Dean’s Choice Award Winner & Honorable Mention

Dean Eric T. Meyer is also pleased to present the Dean’s Choice Award to master’s student Utkarsh Mujumdar and Honorable Mention to Madhav Varma!

Dean Meyer recounted, “Both of the graduate student winners had an ability in their posters and their conversation to help others immediately grasp the practical implications of their projects. Utkarsh built an impressive prototype for helping users ask questions about topics beyond simple fact retrieval, and Madhav contributed to designing a system to provide personalized dietary information in a clinical setting. Both projects have the potential to help people in a very direct way.”

Winner: “Designing a Multi-Perspective Search System Using Large Language Models and Retrieval Augmented Generation”

Utkarsh Mujumdar’s thesis project involved designing a multi-perspective search system that employs Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). Multi-perspective search is an information retrieval scenario in which the search query focuses on contentious topics that might not have clear factual grounds for an answer, so the information presented to the user should accommodate the different perspectives on any given topic. The system developed as part of his thesis does so by blending the conversational flow of LLMs with the context-aware retrieval capabilities of RAG.

Utkarsh said, “An important learning for me during my capstone project was the ability to think about problems from the perspective of a user, while developing and designing a system. It will help me with my future working on applied AI products that are geared towards non-technical users.”

Congratulations, Utkarsh, for this recognition of your work!

Honorable Mention: “Developing a User Interface for a Clinical Decision Support System (CDSS) Tailored to Individualized Dietary Interventions, Informed by Behavior Change Theory”

Madhav Varma’s project integrates user experience (UX) design principles with empirical insights from clinical research, specifically in personalized nutrition education. Through comprehensive interviews with clinical researchers and dietitians, alongside mixed methods UX research, it aims to elucidate decision-making processes and educational strategies within nutrition interventions. Anticipated outcomes include user flows and prototypes grounded in behavior change theory, facilitating the delivery of tailored nutrition education to patients by providing Registered Dietitian Nutritionists (RDNs) with quick access to personalized educational materials. Iterative prototyping, driven by feedback and testing, will refine the interface to align with evolving user needs. Collaborative engagement with field supervisors and exploration of design systems within digital frameworks will ensure cohesive, industry-standard interfaces. This initiative enhances the professional toolkit of UX researchers and designers, fostering an understanding of UX design's role in healthcare.

“The culmination of my academic journey in the form of a capstone project provided invaluable insights into the critical role of User Experience (UX) within the healthcare domain. Delving into the intricacies of designing software solutions for this industry underscored the importance of understanding and navigating the unique challenges inherent in healthcare workflows,” Madhav said. “I became acutely aware of how the fusion of clinical and UX research methodologies can yield innovative technological solutions, enhancing the efficiency of established healthcare practices and workflows.”

Congratulations, Madhav!


