SPIDER, (July 2006 - Present)
Data quality is a serious concern in every organization that relies on data.
Data is often of poor quality for many reasons, including
spelling mistakes, abbreviations, a lack of standards and inconsistent notation.
SPIDER is a declarative data cleaning tool. It incorporates a set of algorithms
that can be used to improve data quality on any relational data source.
SPIDER can be used for flexible querying, approximate joins, schema matching and
data exploration.
Advisor: Prof. Nick Koudas, University of Toronto
Fast Identification of Relational Constraint Violations, (Jan - July 2006)
Logical constraints (e.g., phone numbers in Toronto
can only have the prefixes 416, 647 or 905) are ubiquitous in
relational databases. Traditional integrity constraints,
such as functional dependencies, are examples of such
logical constraints as well. However, under frequent
database updates, schema evolution and transformations,
they can be easily violated. As a result, tables
become inconsistent and data quality is degraded.
We study the problem of validating collections
of user-defined constraints on a number of relational tables.
Our primary goal is to quickly identify which tables violate such constraints.
Logical constraints are potentially complex logical formulas, and we
demonstrate that they cannot be efficiently evaluated by
SQL queries. In order to enable fast identification of
constraint violations, we propose to build and maintain
specialized logical indices on the relational tables. We
choose Boolean Decision Diagrams (BDD) as the index
structure to aid in this task. We first propose efficient
algorithms to construct and maintain such indices in
a space-efficient manner. We then describe a set of
query rewrite rules that aid in the efficient utilization
of logical indices during constraint validation.
We have implemented our approach on top of a relational
database and tested our techniques using large
collections of real and synthetic data sets. Our results
indicate that utilizing our techniques in conjunction
with logical indices during constraint validation offers
very significant performance advantages.
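To illustrate the kind of constraint involved, the following is a minimal sketch that checks the phone-prefix example with a plain scan over in-memory rows. It does not use the logical indices described above; the table names, the column names (`city`, `phone`) and the sample rows are hypothetical.

```python
# Hypothetical sketch: a naive scan that flags tables violating a constraint.
# This is NOT the BDD-based logical index; it only shows what is being validated.

ALLOWED_PREFIXES = {"416", "647", "905"}  # prefixes from the example constraint


def phone_prefix_ok(row):
    """Constraint: a Toronto phone number must start with an allowed prefix."""
    if row["city"] != "Toronto":
        return True  # the constraint only restricts Toronto rows
    return row["phone"][:3] in ALLOWED_PREFIXES


def violating_tables(tables, constraint):
    """Return the names of tables that contain at least one violating row."""
    return [name for name, rows in tables.items()
            if any(not constraint(row) for row in rows)]


# Made-up sample data (assumption, for illustration only).
tables = {
    "customers": [
        {"city": "Toronto", "phone": "4165550101"},
        {"city": "Toronto", "phone": "2125550199"},  # violates the constraint
    ],
    "suppliers": [
        {"city": "Toronto", "phone": "9055550123"},
        {"city": "Ottawa", "phone": "6135550142"},
    ],
}

print(violating_tables(tables, phone_prefix_ok))  # ['customers']
```

A scan like this touches every row of every table, which is the cost that an index over the tables is meant to avoid.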
Advisor: Prof. Nick Koudas, University of Toronto
Efficient Batch Top-k Search for Dictionary-based Entity Recognition, (Aug 2004 - Aug 2005)
We consider the problem of speeding up Entity Recognition
systems that exploit existing large databases of structured
entities to improve extraction accuracy. These systems
require the computation of the maximum similarity scores of
several overlapping segments of the input text with the entity
database. We formulate a Batch-Top-K problem with
the goal of sharing computations across overlapping segments.
Our proposed algorithm performs a factor of three
faster than independent Top-K queries and only a factor of
two slower than an unachievable lower bound on total cost.
We then propose a novel modification of the popular Viterbi
algorithm for recognizing entities so as to work with easily
computable bounds on match scores, thereby reducing the
total inference time by a factor of eight compared to state-of-the-art
methods.
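The idea of sharing work across overlapping segments can be sketched as follows. This is a simplified illustration and not the proposed algorithm: it assumes token-set Jaccard similarity, returns only the single best match per segment, and shares only the inverted-index lookups. The function names and the sample dictionary are hypothetical.

```python
from collections import defaultdict


def build_index(entities):
    """Inverted index: token -> ids of the entities containing that token."""
    index = defaultdict(set)
    for eid, name in enumerate(entities):
        for tok in set(name.lower().split()):
            index[tok].add(eid)
    return index


def jaccard(a, b):
    """Token-set Jaccard similarity (an assumed similarity function)."""
    return len(a & b) / len(a | b)


def batch_best_match(tokens, entities, max_len=3):
    """Best-matching entity for every segment tokens[i:j] of up to max_len tokens."""
    index = build_index(entities)
    entity_sets = [set(e.lower().split()) for e in entities]
    # One index probe per token position, reused by every segment covering it.
    postings = [index.get(t.lower(), set()) for t in tokens]
    results = {}
    for i in range(len(tokens)):
        seg_tokens, candidates = set(), set()
        for j in range(i, min(i + max_len, len(tokens))):
            # Extend the previous segment by one token instead of starting over.
            seg_tokens.add(tokens[j].lower())
            candidates |= postings[j]
            results[(i, j + 1)] = max(
                ((jaccard(seg_tokens, entity_sets[e]), entities[e])
                 for e in candidates),
                default=(0.0, None),
            )
    return results


# Made-up dictionary and input text (assumption, for illustration only).
entities = ["University of Toronto", "IIT Bombay", "Toronto"]
tokens = "visited University of Toronto yesterday".split()
matches = batch_best_match(tokens, entities)
print(matches[(1, 4)])  # segment "University of Toronto"
print(matches[(3, 4)])  # segment "Toronto"
```

Independent queries would probe the index once per token per segment; here each token position is probed once and its candidate set is reused by every segment that contains it.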
Advisor: Prof. Sunita Sarawagi, IIT Bombay
Data Integration from Web-Pages, (Feb-Apr, 2005)
Designed a technique to extract publication entries from web pages and store them
in a structured database. The structured database is created in two steps: the first
step identifies individual publication entries, and the second performs fine-grained information
extraction. For the first step, we implemented a classifier on DOM nodes; for the second
step, we implemented an efficient inference algorithm based on A* search.
Advisor: Prof. Soumen Chakrabarti, IIT Bombay
Network Intrusion Detection using Stide Methodology, (Feb-Apr, 2005)
Designed an intelligent system to automatically detect possible
events of network intrusion. The system monitored network logs
generated by tcpdump (per-packet activity) for anomalies and
raised a flag whenever observed behavior deviated significantly from normal.
We employed the stide methodology for classification,
using sequences of consecutive log records (over a sliding window of fixed size) to
represent activity. The basic approach is to construct a dictionary of normal behavior from data collected
when there was no intrusion. This dictionary is then used to compute the anomaly count of incoming
log data. The stide methodology has previously been shown to be effective in system intrusion detection
problems. We proposed a novel encoding scheme for sequences
in the network activity log that enabled us to use the same technique in this domain as well.
We verified our results experimentally on real-world datasets.
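The sliding-window dictionary approach can be sketched in a few lines. This is a minimal illustration that assumes each log record has already been encoded as a single symbol; the symbols used here are hypothetical and do not reflect the encoding scheme proposed in the project.

```python
def windows(seq, k):
    """All length-k windows of consecutive records, as hashable tuples."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_normal_dictionary(normal_traces, k):
    """Collect every window observed in intrusion-free traces."""
    normal = set()
    for trace in normal_traces:
        normal.update(windows(trace, k))
    return normal


def anomaly_count(trace, normal, k):
    """Number of windows in the incoming trace that were never seen as normal."""
    return sum(1 for w in windows(trace, k) if w not in normal)


# Made-up symbol sequences (assumption, for illustration only).
normal_traces = [["SYN", "SYN-ACK", "ACK", "DATA", "FIN"]]
incoming = ["SYN", "SYN-ACK", "ACK", "DATA", "DATA", "FIN"]

normal = build_normal_dictionary(normal_traces, k=3)
print(anomaly_count(incoming, normal, k=3))  # 2
```

A flag would then be raised when the anomaly count (or its rate over recent windows) exceeds a chosen threshold; the threshold is a tuning parameter not specified here.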
Advisor: Prof. Sunita Sarawagi, IIT Bombay
Summarizing Tree Structured XML Data Quantitatively, (May-Nov 2004)
Developed an algorithm for constructing a summary of an XML
document to discover the structural aspects of its schema, and to use
the summary for other tasks such as query result size estimation,
structural compression and exploration. The summary preserves
various kinds of quantitative information, which can be used to extract
knowledge about the number of edges or paths following a certain label pattern.
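As a rough illustration of a quantitative structural summary, the sketch below counts how many nodes are reached by each root-to-node label path. It is a simple stand-in under stated assumptions, not the algorithm developed in the project; the function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_summary(xml_text):
    """Map each root-to-node label path to the number of nodes it reaches."""
    root = ET.fromstring(xml_text)
    counts = Counter()

    def visit(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] += 1
        for child in node:
            visit(child, path)

    visit(root, "")
    return counts


# Made-up document (assumption, for illustration only).
doc = """
<bib>
  <book><author/><author/><title/></book>
  <book><author/><title/></book>
</bib>
"""

summary = path_summary(doc)
print(summary["/bib/book/author"])  # 3
```

With such counts, the result size of a simple path query like `/bib/book/author` can be read from the summary without scanning the document.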
Advisor: Prof. Laks Lakshmanan and Prof. Raymond Ng, UBC, Vancouver
Managing Database Snapshots in Mobile Environment, (Aug - Nov 2004)
Designed methods and tools to assist in building database applications
for mobile devices, taking into account their frequent communication
breakdowns. The key idea is to maintain a partial, weakly consistent view of the
central database on the mobile device while disconnected and to
synchronize the data when a connection is available.
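The key idea can be sketched as a local snapshot that buffers writes while offline and reconciles on reconnection. This is a hypothetical toy model: the class name, the use of a dictionary as the central database, and the last-writer-wins reconciliation are all assumptions, not the project's design.

```python
class MobileSnapshot:
    """Partial local copy of a central store, synchronized on reconnection."""

    def __init__(self, server, keys):
        self.server = server            # dict standing in for the central database
        self.keys = list(keys)          # the subset of data kept on the device
        self.local = {k: server[k] for k in self.keys}
        self.pending = []               # writes made while disconnected

    def read(self, key):
        # Reads are served locally and may be stale (weak consistency).
        return self.local[key]

    def write(self, key, value):
        self.local[key] = value
        self.pending.append((key, value))

    def synchronize(self):
        # Assumed policy: buffered device writes overwrite the central copy.
        for key, value in self.pending:
            self.server[key] = value
        self.pending.clear()
        # Refresh the snapshot so central updates become visible locally.
        self.local = {k: self.server[k] for k in self.keys}


server = {"a": 1, "b": 2, "c": 3}
snap = MobileSnapshot(server, ["a", "b"])
snap.write("a", 10)      # offline write, buffered locally
server["b"] = 20         # central update made while the device is offline
stale = snap.read("b")   # still the old value on the device
snap.synchronize()
print(stale, server["a"], snap.read("b"))  # 2 10 20
```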
Advisor: Prof. Krithi Ramamritham, IIT Bombay
Development Projects
IITB Navigator, (Aug-Nov, 2003):
Developed a GUI with a web front-end to locate people, places and
ongoing events in a region, and to show the shortest path to
the destination on a map.
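The shortest-path step can be sketched with Dijkstra's algorithm. The project description does not say which algorithm was used, so this choice is an assumption, and the place names and distances below are made up.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm over a dict-of-dicts graph with non-negative weights."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    if target not in dist:
        return None, float("inf")
    # Walk the predecessor links back from the target to recover the route.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Hypothetical map: distances in metres between made-up locations.
campus = {
    "Main Gate": {"Library": 400, "Hostel": 900},
    "Library": {"Main Gate": 400, "Lecture Hall": 300, "Hostel": 350},
    "Hostel": {"Main Gate": 900, "Library": 350},
    "Lecture Hall": {"Library": 300},
}

route, length = shortest_path(campus, "Main Gate", "Hostel")
print(route, length)  # ['Main Gate', 'Library', 'Hostel'] 750
```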
Advisor: Prof. S. Sudarshan, IIT Bombay
CMS: Course Management System, (May-Dec, 2003):
Provided a common web-based interface for instructors, students and teaching
assistants in an institute for routine tasks such as giving and submitting
assignments, assigning projects, scheduling demos, and managing course information,
notices, grading and messages. Implemented using servlets, JDBC, SQL and Java.
Advisor: Prof. S. Sudarshan, IIT Bombay
Other Projects
Cricket Animation, (Oct-Dec, 2005):
CSC2504H Computer Graphics Course Project, University of Toronto
We designed an animation of a cricket match between India and Australia using OpenGL.
The main attractions of the animation were the beautiful city of Toronto, the cricket stadium with
lots of people, and the lighting effects. It consisted of various complex 3D objects, which
were designed from scratch using OpenGL. An object-oriented C++ design was used to provide
many interactive functions to assist the modeling. This project won second prize
in the Wooden Monkey Hall of Fame, Fall 2005.
Moving Object Segmentation To Optimize Video Transmission, (Feb-Apr, 2004):
Implemented a background registration technique to segment a given video stream spatially,
that is, to separate the foreground region (moving objects) from the background region.
Such techniques are useful for applications such as video conferencing, where the camera is stationary
and so is the background, and hence only the speaker's face needs to be transmitted.
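A simplified version of the idea is sketched below: estimate a static background from several frames, then mark pixels that differ from it as foreground. The per-pixel median and the fixed threshold are assumptions chosen for brevity and are not necessarily the registration technique that was implemented; frames are small grayscale grids with made-up values.

```python
from statistics import median


def estimate_background(frames):
    """Per-pixel median over the frames (assumes a stationary camera)."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(f[y][x] for f in frames) for x in range(width)]
            for y in range(height)]


def foreground_mask(frame, background, threshold=30):
    """True where a pixel differs from the background by more than threshold."""
    return [[abs(p - b) > threshold for p, b in zip(row, bg_row)]
            for row, bg_row in zip(frame, background)]


# Three tiny 2x3 grayscale frames; a bright object appears in the second one.
frames = [
    [[10, 10, 10], [10, 10, 10]],
    [[10, 200, 10], [10, 10, 10]],
    [[10, 10, 10], [10, 10, 10]],
]

background = estimate_background(frames)
mask = foreground_mask(frames[1], background)
print(mask)  # [[False, True, False], [False, False, False]]
```

Only the pixels marked `True` would need to be transmitted for that frame; the background can be sent once and reused.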
Advisor: Prof. S. Arunkumar
VNC-Server, (Feb-Apr, 2003):
Modified the VNC code (version 3.3.7) to add new
functionality, fix existing bugs, and support sound export to the remote desktop.
The pixel depth was also modified, and a GUI was provided to make it user-friendly.
The project also added IP restriction to the VNC server.
Advisor: Prof. G. Sivakumar, IIT Bombay
Train Scheduling Optimization, (Feb-Apr, 2003):
Simulated a train network with several tracks and stations
on an FPGA kit, with the aim of moving the trains so that each train picks up the maximum number of
passengers on its route. It was programmed in VHDL.
Advisor: Prof. M. R. Bhujade, IIT Bombay
Mail application, (Feb-Mar, 2003):
Designed a mail application using Sun and Java packages.
Advisor: Prof. G. Sivakumar, IIT Bombay
Image-Compression, (July-Nov, 2002):
Implemented an image compression algorithm in C++ with an
efficient encoding scheme that compresses a PPM image to one-fourth of its original size.
Decompression was also efficient, as required by the client.
Advisor: Prof. S. Arunkumar, IIT Bombay
Electrical Circuit Analysis, (Feb-Apr, 2002):
Designed and implemented a technique in Scheme to
analyse a resistive circuit, i.e., to calculate all the circuit variables at a given time and to
draw graphs relating these variables.
Advisor: Prof. Abhiram Ranade, IIT Bombay
ScoreBoard Maintenance, (July-Nov, 2001):
Implemented a Fortran application to maintain the scoreboard
for an ongoing cricket match. For each player, it keeps track of individual statistics such as runs and balls.
In the second innings, it also displays the requirements to win the match.
Advisor: Prof. Ajit Deewan, IIT Bombay