COP5725 Advanced Database Systems (Spring 2025)

Overview

The course includes a semester-long project that is meant to be a substantial independent research or engineering effort related to real-world data management issues. Students can choose either of the following two options for their course project:

Research-flavor project: students are required to identify an interesting and nontrivial real-world problem that belongs to the data management, mining, and analytics field. They then need to devise novel solutions to the problem and perform thorough theoretical/experimental studies to verify the soundness, efficacy, and effectiveness of their methods.
Implementation-flavor project: students select one (or several) published, full research papers from the leading database conferences, e.g., SIGMOD, VLDB, ICDE (the designated venues are listed under Research Ideas below), published in or after 2015, implement the core algorithms and systems specified in the paper, and carry out all experimental studies performed in the paper.

Students can form teams of at most three people. Students are welcome to discuss their problems, ideas, and potential solutions with the TA, the instructor, and even other faculty members throughout the semester.

Milestones

Group formation (0%): find a project partner in the class, if needed, and begin to discuss project problems and ideas.
Project proposal (10%): your proposal is one or two pages long and should explicitly state the following: 1. Your project type: research-flavor or implementation-flavor (if this is an implementation-flavor project, please indicate the paper you want to implement, including the title, conference name, and the year the paper was published); 2. The problem your project will address; 3. Your project goal and motivation; 4. The (rough) methodology and plan for your project. Be sure to structure your plan into a set of incremental, implementable milestones and include a schedule for meeting them; 5. The resources needed to carry out your project; 6. The workload distribution if the group has more than one member.
Literature survey (20%): You should determine the exact paper/idea you want to implement/research at this stage. For implementation-flavor projects, please include the paper information (title, conference/journal, publication year, authors) in the report. Your survey is two to three pages long in double-column format (or four to six pages long in single-column format) and should focus in particular on the technical discussion of HOW existing algorithms, methods, and solutions differ from the work you propose (implement) and why that work is more effective than the others for solving the problem. The survey should include a comparative justification of the pros and cons of the different works, with technical details. Through this survey, you should be able to convince others that you are addressing (implementing) something fundamentally new, either a brand new problem or a novel approach to a known problem, and that you know the existing state of the art for this problem.

What might a typical survey look like? See, for example, "Database Meets AI: A Survey".
Status report (10%): Your status report is one or two pages long (single column) and should contain enough implementation, data, and analysis to show that your project is on the right track. You should revise your original proposal to accommodate the TA's and instructor's comments, along with any surprising results or changes in direction, schedule, etc. You may also need to refine the problem statement.
The following items are expected in your report: 1. A very clear and specific problem you want to solve (the problem statement should be finalized by now); 2. The basic goal of the project (what you want to achieve by the end of the semester); 3. Your assumptions and methods and how they differ from others (in brief); 4. The software/tools/data sets used in the project; 5. The detailed plan of the experimental studies you want to perform (in accordance with the experimental studies mentioned in the paper); 6. Your current status and partial results; 7. Your brief plan for the remaining time.
Final report and software/source code (60%): the final report should extend your previous write-ups into a conference-style paper of five to ten pages (single or double-column). The report should: 1. present the research problem and summarize your contributions in the first section; 2. survey related work in the related work section; 3. include a detailed description of your algorithms, analysis, and implementation in the technical section; 4. describe the evaluation methodology and significant results in the evaluation section; 5. present your conclusions in the summary section; 6. for team work, include a paragraph explaining, for each group member, their contributions and duties in the project; 7. specify a hyperlink through which we can download your source code, software, and data set for reproducing your experimental results.
Software, source code submission: Please provide the TA with your COMPLETE source code, datasets, and runnable software in one package. You may provide a link to Dropbox, GitHub, Bitbucket, etc., or you may copy the code to the TA's machine via a USB drive. Please include a README file specifying how to install your software, and all script files specifying how to compile and run your software and all the experiments. You have the option to demo your project to the TA for 15 minutes. Please email the TA to make an appointment.
Programming language requirement: Students are required to use C/C++ or Java for programming. Students can use libraries or online code during implementation, but such source code will not count toward your workload. Furthermore, we will run plagiarism detection on your implementation against the online code, if available; if plagiarism is detected, your project will be given zero credit and your case will be reported to FSU.

Research Ideas

When you look for related work for your project, the designated database conferences are SIGMOD/PODS, VLDB, and ICDE. The designated database journals are TODS, VLDBJ, and TKDE. Please choose regular research papers to work on. Posters, demos, industry papers, and vision papers do not qualify.

Useful websites for scientific publications: DBLP; CiteSeerX; Google Scholar; Microsoft Academic Search; ACM Digital Library; IEEE Xplore.

SIGMOD-2021 VLDB-2021 ICDE-2021

Some sample topics for projects:

Top-k query processing and optimization. Traditionally, queries over structured data identify the exact matches for the queries. This exact-match query model is not appropriate for many database applications and scenarios where queries are inherently fuzzy -- often expressing user preferences and not hard Boolean constraints -- and are best answered with a ranked list of the best matching objects, for some definition of degree of match. This "top-k" query model is natural in many scenarios and application domains, and has been extensively studied in the literature.
Graph Databases and Graph Data Management. Recently, there has been a lot of interest in the application of graphs in different domains. They have been widely used for data modeling in application domains such as chemical compounds, multimedia databases, protein networks, social networks, and the Semantic Web. With the continued emergence and growth of massive and complex structural graph data, a graph database that efficiently supports elementary data management mechanisms is essential to effectively understand and utilize any collection of graphs.
Data Stream Analysis and Management. A data stream is an unbounded data set that is produced incrementally over time, rather than being available in full before its processing begins. A traditional database management system typically processes a stream of ad-hoc queries over relatively static data. In contrast, a data stream management system evaluates static (long-running) queries on streaming data, making a single pass over the data and using limited working memory.
Social Media Analysis. The World Wide Web, blogging platforms, instant messaging, and Facebook can be characterized by the interplay between rich information content, the millions of individuals and organizations who create and use it, and the technology that supports it. Recent research has focused on the structure and analysis of such large social and information networks and on models and algorithms that abstract their basic properties. Research topics include methods for link analysis and network community detection, diffusion and information propagation on the web, virus outbreak detection in networks, and connections with work in the social sciences and economics.
Big Data Analytics. Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce and Spark each provide a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance.
Web Data Analytics and Management. The Internet and the Web have revolutionized access to information. Today, the Web consists primarily of HTML pages (HTML being the standard for the Web), but also of documents in PDF, DOC, and plain text, as well as images, music, and videos. The public Web is composed of billions of pages on millions of servers. It is a fantastic means of sharing information. Typical research problems here include, but are not limited to, Web data crawling, integration and retrieval, hidden Web discovery, information extraction and entity resolution, and dataspaces.
Miscellaneous Topics in Data Management. Nowadays, traditional database systems are beginning to embrace new technologies from other research domains, such as data mining, information retrieval, pattern recognition, and machine learning. There exists an array of research problems on how to bridge the gaps between different data analytical methods and extend them to the database field, for example, supporting effective and efficient keyword search on databases, and embedding classification or clustering algorithms deep into database systems.
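
To make two of the sample topics above more concrete, here is a minimal sketch of a top-k query evaluated in a single pass with bounded working memory, which is also the constraint described for data stream processing. It is only an illustration: the function name top_k, the scoring function, and the sample data are assumptions made for this example, and it uses a simple bounded heap rather than any specific algorithm from the top-k literature. It is written in Python for brevity; project code itself must still follow the programming language requirement stated under Milestones.

```python
import heapq
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def top_k(items: Iterable[T], score: Callable[[T], float], k: int) -> List[Tuple[float, T]]:
    """Return the k highest-scoring items, best first.

    Makes one pass over `items` and keeps at most k entries in memory,
    so it also works when `items` is a (finite prefix of a) stream.
    """
    if k <= 0:
        return []
    # Min-heap of (score, -arrival_index, item). The root is the weakest
    # entry currently kept; the index makes every entry unique, so the
    # items themselves are never compared.
    heap: list = []
    for idx, item in enumerate(items):
        s = score(item)
        entry = (s, -idx, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif s > heap[0][0]:
            # The new item beats the weakest kept item: swap it in.
            heapq.heapreplace(heap, entry)
    # Highest score first; ties are broken by earlier arrival.
    ranked = sorted(heap, key=lambda e: (-e[0], -e[1]))
    return [(s, item) for s, _, item in ranked]


# Hypothetical data: (name, price, rating). The score expresses a user
# preference (high rating, low price) rather than a hard Boolean filter.
hotels = [("A", 120, 4.5), ("B", 80, 3.9), ("C", 200, 4.8), ("D", 95, 4.2)]
best_two = top_k(hotels, lambda h: h[2] - h[1] / 100, k=2)
```

The heap never holds more than k entries, so memory use is independent of the input size, and each incoming item costs at most O(log k) work. Ranked retrieval over several sorted lists, or over an unbounded stream with sliding windows, requires more elaborate techniques than this sketch.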