COP5725 Advanced Database Systems (Spring 2013)
Instructor: Peixiang Zhao
Overview
The semester-long course project is meant to be a substantial independent research or engineering effort related to real-world data management issues. Students can choose either of the following two options for their course project:
- Research-flavor project: students identify an interesting and nontrivial real-world problem in the field of data management, mining, and analytics. They then devise novel solutions to the problem and perform thorough experimental studies to demonstrate the effectiveness of their method.
- Implementation-flavor project: students select one or more published papers from the leading database conferences, e.g., SIGMOD, VLDB, ICDE, and implement the core algorithms and systems specified in the papers.
Students can form teams of at most two people or work individually. For project option 1 (research-flavor project), a team of two students is allowed. For project option 2 (implementation-flavor project), students must work individually.
Students are welcome to discuss their problems, ideas, and potential solutions with the TA, the instructor, and even other faculty members throughout the semester.
Milestones
- Group formation (0%, Due: 01/24, Thursday): find a project partner and begin to discuss project problems and ideas.
- Project proposal (10%, Due: 02/14, Thursday): your proposal should be one or two pages long and should explicitly state the following: 1. Your project type: research-flavor or implementation-flavor; 2. The problem your project will address; 3. Your project goal and motivation; 4. The methodology and plan for your project. Be sure to structure your plan into a set of incremental, implementable milestones and include a schedule for meeting them; 5. The resources needed to carry out your project; 6. The workload distribution if the group has two members.
- Literature survey (20%, Due: noon, 03/08, Friday; originally 03/07, Thursday): your survey should be three to five pages long in double-column format (or six to ten pages in single-column format). It should focus on how existing work differs from your proposed work, discussing at least four different papers in detail, and on why your proposed work is effective for solving the problem. Through this survey, you should be able to convince the reader that you are addressing something fundamentally new, either a brand new problem or a novel approach to a known problem.
- Status report (20%, Due: 04/02, Tuesday): your status report should be three to five pages long (single column) and should contain enough implementation, data, and analysis to show that your project is on the right track. You should revise your original proposal to accommodate the TA's and instructor's comments, along with any surprising results or changes in direction, schedule, etc. You may also need to refine the problem statement.
The following items are expected in your report: 1. A clear and specific problem you want to solve (by this point the problem statement should be final); 2. The basic goal of the project (what you want to achieve by the end of the semester); 3. Your methods and, briefly, how they differ from others; 4. The software, tools, and data sets used in the project; 5. Your current status and partial results; 6. Your brief plan for the remaining month.
- Final report (50%, Due: 04/25, Thursday): the final report should extend your previous write-ups into a conference-style paper of six to ten pages. The report should: 1. present the research problem and summarize your contributions in the first section; 2. survey related work in the related work section; 3. include a detailed description of your algorithms, analysis, and implementation in the technical section; 4. describe the evaluation methodology and significant results in the evaluation section; 5. present your conclusions in the summary section; 6. for team work, include a paragraph explaining each group member's contributions and duties in the project; 7. specify a hyperlink through which I can download your source code and data set to reproduce your experimental results, and include a README file specifying how to compile and run your software.
Research Ideas
The following is a list of possible research ideas or directions; you are not required to choose from this list. You can also use the ideas below as inspiration for your own variants of similar problems.
When you look for related work, the main database conferences are: SIGMOD/PODS, VLDB, ICDE, and CIDR. The main database journals are TODS, VLDBJ, TKDE, and SIGMOD Record. The main data mining conferences are ACM KDD, IEEE ICDM and SIAM SDM. The main information retrieval conferences are ACM SIGIR, WWW and ACM CIKM.
- Top-k query processing and optimization. Traditionally, queries over structured data identify the exact matches for the queries. This exact-match query model is not appropriate for many database applications and scenarios where queries are inherently fuzzy -- often expressing user preferences and not hard Boolean constraints -- and are best answered with a ranked list of the best matching objects, for some definition of degree of match. This "top-k" query model is natural in many scenarios and application domains, and has been extensively studied in the literature.
- Graph Databases and Graph Data Management. Recently, there has been a lot of interest in the application of graphs in different domains. Graphs have been widely used to model data in applications such as chemical compounds, multimedia databases, protein networks, social networks, and the semantic web. With the continued emergence and growth of massive and complex structural graph data, a graph database that efficiently supports elementary data management mechanisms is essential for understanding and using any collection of graphs.
- Data Stream Analysis and Management. A data stream is an unbounded data set that is produced incrementally over time, rather than being available in full before its processing begins. A traditional database management system typically processes a stream of ad-hoc queries over relatively static data. In contrast, a data stream management system evaluates static (long-running) queries on streaming data, making a single pass over the data and using limited working memory.
- Social and Information Network Analysis. The World Wide Web, blogging platforms, instant messaging, and Facebook can be characterized by the interplay between rich information content, the millions of individuals and organizations who create and use it, and the technology that supports it. Recent research has focused on the structure and analysis of such large social and information networks and on models and algorithms that abstract their basic properties. Research topics include methods for link analysis and network community detection, diffusion and information propagation on the web, virus outbreak detection in networks, and connections with work in the social sciences and economics.
- Big Data Analytics with MapReduce. Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance.
- Web Data Analytics and Management. The Internet and the Web have revolutionized access to information. Today, content on the Web is primarily HTML (the standard for the Web), but it also includes documents in PDF, DOC, and plain text formats, as well as images, music, and videos. The public Web is composed of billions of pages on millions of servers. It is a fantastic means of sharing information. Typical research problems here include, but are not limited to, Web data crawling, integration and retrieval, hidden Web discovery, information extraction and entity resolution, and dataspaces.
- Miscellaneous Topics in Data Management. Traditional database systems are beginning to embrace technologies from other research domains, such as data mining, information retrieval, pattern recognition, and machine learning. There is an array of research problems on how to bridge the gaps between different data analytical methods and extend them to the database field, for example, supporting effective and efficient keyword search on databases, and embedding classification or clustering algorithms deep into database systems.
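As a rough illustration of two of the topics above (top-k queries, and single-pass processing of a data stream with limited working memory), the sketch below keeps the k best-scoring items seen so far in a bounded min-heap. It is only one simple way to answer a top-k query, not the method of any particular paper. The function name `streaming_top_k`, the (item, score) input format, and the sample data are all assumptions made for this example.

```python
import heapq
from typing import Iterable, List, Tuple


def streaming_top_k(stream: Iterable[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (item, score) pairs, best first.

    Hypothetical helper for illustration: reads the stream once and
    never holds more than k entries in memory.
    """
    if k <= 0:
        return []
    # Min-heap of (score, -position, item); the root is the weakest kept entry.
    # -position breaks ties in favor of items that arrived earlier.
    heap: List[Tuple[float, int, str]] = []
    for position, (item, score) in enumerate(stream):
        entry = (score, -position, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The new entry beats the weakest kept entry, so swap them.
            heapq.heapreplace(heap, entry)
    return [(item, score) for score, _, item in sorted(heap, reverse=True)]


# Made-up sample data: (object, degree of match to the query).
matches = [("A", 0.62), ("B", 0.91), ("C", 0.35), ("D", 0.88), ("E", 0.91)]
top3 = streaming_top_k(iter(matches), k=3)
print(top3)
```

Each arriving item costs O(log k) time and the heap never grows beyond k entries, so the memory used is independent of the stream length. A research project in this area would typically go well beyond this, for example by ranking over multiple sorted sources or by handling sliding windows.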