COP5725 Advanced Database Systems (Fall 2018)
Instructor: Peixiang Zhao
Overview
There is a semester-long course project that is meant to be a substantial independent research or engineering effort related to real-world data management issues. Students can choose either of the following two options for their course project:
- Research-flavor project: students are required to identify an interesting and nontrivial real-world problem in the data management, mining, and analytics field. You then need to devise novel solutions to the problem and perform thorough theoretical/experimental studies to demonstrate the effectiveness of your methods.
- Implementation-flavor project: students select one (or several) published papers from the leading database or data mining conferences, e.g., SIGMOD, VLDB, ICDE, published in or after 2008, implement the core algorithms and systems specified in the paper, and carry out all experimental studies mentioned in the paper.
Students can form teams of (at most) two people or work individually. Students are welcome to discuss their problems, ideas, and potential solutions with the TA, the instructor, and even other faculty members throughout the semester.
Milestones
- Group formation (0%): find a project partner and begin to discuss project problems and ideas.
- Project proposal (10%): your proposal should be one or two pages long and should explicitly state the following: 1. Your project type: research-flavor or implementation-flavor (if this is an implementation-flavor project, please indicate the paper you want to implement); 2. The problem your project will address; 3. Your project goal and motivation; 4. The (rough) methodology and plan for your project. Be sure to structure your plan into a set of incremental, implementable milestones and include a schedule for meeting them; 5. The resources needed to carry out your project; 6. The workload distribution if the group has two members.
- Literature survey (20%): You should determine the exact paper/idea you want to implement/research at this stage. For implementation-flavor projects, please include the paper information (title, conference/journal, publication year, authors) in the report. Your survey should be two to three pages long in double-column format (or four to six pages in single-column format) and should place a particular focus on technical discussion of HOW existing algorithms, methods, and solutions differ from the work you propose (implement) and why that work is effective for solving the problem compared with others. The survey should include a comparative justification of the pros and cons of different work, with technical details. Through this survey, you should be able to convince others that you are addressing (implementing) something fundamentally new, either a brand new problem or a novel approach to a known problem, and that you know the existing state of the art for this problem.
What might a typical survey look like? Example: A Survey of Algorithms for Keyword Search on Graph Data
- Status report (10%): Your status report should be one or two pages long (single column) and should contain enough implementation, data, and analysis to show that your project is on the right track. You should revise your original proposal to accommodate the TA's and instructor's comments, along with any surprising results or changes in direction, schedule, etc. You may also need to provide a refined version of the problem statement.
Basically, the following items are expected in your report: 1. A very clear and specific problem you want to solve (you should have finalized the problem statement by now); 2. The basic goal of the project (what you want to achieve by the end of the semester); 3. Your assumptions and methods and how they differ from others (in brief); 4. The software/tools/data sets used in the project; 5. The detailed plan of the experimental studies you want to perform (in accordance with the experimental studies mentioned in the paper); 6. Your current status and partial results; 7. Your brief plan for the remaining month.
- Final report and software/source code (60%): the final report should extend your previous writeups into a conference-style paper of five to ten pages (single or double column). The report should: 1. present the research problem and summarize your contributions in the first section; 2. survey related work in the related work section; 3. include a detailed description of your algorithms, analysis, and implementation in the technical section; 4. describe the evaluation methodology and significant results in the evaluation section; 5. present your conclusions in the summary section; 6. for team work, include a paragraph explaining, for each group member, their contributions and duties in the project; 7. specify a hyperlink through which we can download your source code, software, and data set for reproducing your experimental results.
Software and source code submission:
Please provide the TA with your COMPLETE source code, datasets, and runnable software in one package. You may provide a link to Dropbox, GitHub, Bitbucket, etc., or you may copy the code to the TA's machine via a USB drive. Please include a README file specifying how to install your software, along with all script files needed to compile and run your software and all the experiments. You have the option to demo your project to the TA for 20 minutes. Please email the TA to make an appointment.
Students are required to use C/C++/C# or Java for programming. Students can use libraries or online code during implementation, but such source code will not count toward your workload.
Research Ideas
The following is a list of possible research ideas or directions; you are not required to choose from this list. You can also use the ideas below as inspiration for your own variants of similar problems.
When you look for related work, the main database conferences are: SIGMOD/PODS, VLDB, ICDE, and CIDR. The main database journals are TODS, VLDBJ, TKDE, and SIGMOD Record. The main data mining conferences are ACM KDD, IEEE ICDM and SIAM SDM. The main information retrieval conferences are ACM SIGIR, WWW and ACM CIKM.
Useful websites for scientific publications: DBLP; CiteSeerX; Google Scholar; Microsoft Academic Search; ACM Digital Library; IEEE Xplore.
- Top-k query processing and optimization. Traditionally, queries over structured data identify the exact matches for the queries. This exact-match query model is not appropriate for many database applications and scenarios where queries are inherently fuzzy -- often expressing user preferences and not hard Boolean constraints -- and are best answered with a ranked list of the best matching objects, for some definition of degree of match. This "top-k" query model is natural in many scenarios and application domains, and has been extensively studied in the literature.
- Graph Databases and Graph Data Management. Recently, there has been a lot of interest in the application of graphs in different domains. Graphs have been widely used for data modeling in application domains such as chemical compounds, multimedia databases, protein networks, social networks, and the Semantic Web. With the continued emergence and growth of massive and complex structural graph data, a graph database that efficiently supports elementary data management mechanisms is crucially required to effectively understand and utilize any collection of graphs.
- Data Stream Analysis and Management. A data stream is an unbounded data set that is produced incrementally over time, rather than being available in full before its processing begins. A traditional database management system typically processes a stream of ad-hoc queries over relatively static data. In contrast, a data stream management system evaluates static (long-running) queries on streaming data, making a single pass over the data and using limited working memory.
- Social and Information Network Analysis. The World Wide Web, blogging platforms, instant messaging, and Facebook can be characterized by the interplay between rich information content, the millions of individuals and organizations who create and use it, and the technology that supports it. Recent research has focused on the structure and analysis of such large social and information networks and on models and algorithms that abstract their basic properties. Research topics include methods for link analysis and network community detection, diffusion and information propagation on the Web, virus outbreak detection in networks, and connections with work in the social sciences and economics.
- Big Data Analytics with MapReduce. Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance.
- Web Data Analytics and Management. The Internet and the Web have revolutionized access to information. Today, the Web consists primarily of HTML (the standard for the Web), but it also holds documents in PDF, DOC, and plain text, as well as images, music, and videos. The public Web is composed of billions of pages on millions of servers. It is a fantastic means of sharing information. Typical research problems here include, but are not limited to, Web data crawling, integration and retrieval, hidden Web discovery, information extraction and entity resolution, and dataspaces.
- Miscellaneous Topics in Data Management. Nowadays, traditional database systems have begun to embrace new technologies from other research domains, such as data mining, information retrieval, pattern recognition, and machine learning. There exists an array of research problems on how to bridge the gaps between different data analytical methods and extend them to the database field, for example, supporting effective and efficient keyword search on databases, and embedding classification or clustering algorithms deep into database systems.
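To make the top-k and data stream descriptions in the list above a little more concrete, here is a minimal Java sketch (Java being one of the permitted languages). It keeps the k highest-scoring items seen so far while making a single pass over the input and holding at most k items in memory. It is only one simple way to do this, not the method of any particular paper. The names StreamingTopK, Scored, offer, and ranked are hypothetical and chosen purely for illustration, and the "degree of match" is assumed to be a plain numeric score supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative sketch only: single-pass top-k with O(k) working memory. */
final class StreamingTopK {

    /** A scored item; the field names are assumptions made for this example. */
    private static final class Scored {
        final String id;
        final double score;

        Scored(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    private final int k;
    // Min-heap on score: the weakest of the current top-k sits at the root.
    private final PriorityQueue<Scored> heap;

    StreamingTopK(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
        this.heap = new PriorityQueue<>(k, Comparator.comparingDouble((Scored s) -> s.score));
    }

    /** Consume one stream element; each element is inspected exactly once. */
    void offer(String id, double score) {
        if (heap.size() < k) {
            heap.add(new Scored(id, score));
        } else if (score > heap.peek().score) {
            heap.poll();                      // evict the current weakest item
            heap.add(new Scored(id, score));
        }
        // Otherwise the element cannot be in the top-k and is dropped.
    }

    /** Return the ids of the current top-k, best match first. */
    List<String> ranked() {
        List<Scored> items = new ArrayList<>(heap);
        items.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        List<String> ids = new ArrayList<>();
        for (Scored s : items) {
            ids.add(s.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        StreamingTopK top = new StreamingTopK(2);
        top.offer("a", 0.3);
        top.offer("b", 0.9);
        top.offer("c", 0.5);
        top.offer("d", 0.1);
        System.out.println(top.ranked()); // prints [b, c]
    }
}
```

The bounded min-heap is a deliberately simple design choice: each arriving element costs O(log k) time, and memory never grows beyond k items regardless of stream length. A real project would likely need to handle ties, sliding windows, or scores aggregated from several ranked sources, none of which this sketch attempts.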