My broad research focus is on informatics and data science, with an emphasis on healthcare and biomedical problems involving both structured and textual data sources. I am also interested in information security and privacy. My doctoral research is on the analysis of security measures for pseudorandom sequences used as key streams in stream ciphers. My research efforts are supported by the National Library of Medicine [R01LM013240, R21LM012274], National Institute on Drug Abuse [R01DA057686], National Cancer Institute [R21CA218231], Kentucky Lung Cancer Research Program, and National Center for Advancing Translational Sciences (NCATS).

Please continue browsing for specific details of my current research activities.

In biomedical sciences and healthcare operations, text arises in the form of scientific publications, clinical narratives (discharge summaries, pathology notes, progress notes), and interview narratives (drug abuse, relationship counseling). While a human reader can readily glean the information a textual fragment conveys, doing so is very challenging for machines. Text mining is the process of converting textual data into, ideally, 'actionable' information. Often, it also includes converting unstructured text into structured data that computers can process more directly. Common text mining tasks include named entity recognition, triple extraction, word sense disambiguation, classification, clustering, and sentiment analysis. My current focus is on extracting strong predictive signals by combining structured and unstructured data sources from electronic medical records (EMRs). I design computational methods that are essentially applications of foundational ideas from machine learning, natural language processing, data mining, and biostatistics.

Current research themes

  • Automatic extraction of coded information from unstructured narratives
  • Health information content and network analysis on social media
  • Relation extraction and literature based knowledge discovery
  • Rule based predictive modeling in biomedicine and healthcare
  • Human-computer information retrieval using interactive text visualization

Knowledge based search systems: While there is a wealth of bibliographic information on life sciences and biomedical literature (MEDLINE), the sheer amount available makes it very hard for life scientists to manually analyze the literature and come to reasonable conclusions. One approach to improve the situation is to use computational techniques to semi-automatically build domain hierarchies of concepts in a given area and then align relationships extracted from the literature along these hierarchies. The aim is to provide life scientists with ontologies supervised by domain experts and effective ways to search and query them. At a high level, my efforts leverage natural language processing techniques and Semantic Web technologies to build domain models and high quality tools for eliciting information in the area of human performance and cognition. Specifically, I co-lead the Human Performance and Cognition Ontology project, which assists biologists at the AFRL in fulfilling their information needs by building
  1. a knowledge base for the human performance and cognition domain and
  2. a browsing application that uses the knowledge base to facilitate scientific literature search and exploration (watch the screencast or read more on the system architecture).

Access Control for RDF: With the rapid growth of Semantic Web technologies, many applications in areas such as social networks and healthcare have been developed using the Resource Description Framework (RDF) data format. To achieve the goals of machine processability and semantic interoperability, researchers in life sciences are also converting repositories of experimental data into RDF. Many general purpose RDF data sets are made public on Linked Open Data (e.g., DrugBank, LinkedCT, KEGG). However, protecting private data sets, some of which might be new results not yet made public, is also essential. While scientists would like to use public data sets in their experiments, it is crucial to protect the results that are output. That is, many practical situations warrant that interoperability be mediated through careful authorization mechanisms. As a postdoc, I helped develop and evaluate a state-of-the-art discretionary access control framework for RDF datasets with which each user can grant access privileges and additionally delegate access granting rights to other trustworthy users. In the context of delegation of access granting rights, a natural requirement is for a resource owner to track how a given user obtained access to the owner's resource, because delegations can lead to chains of access privilege propagation. This is extremely important for auditing purposes in case of malicious breaches or inadvertent sharing. I handled this specific part of the project by devising an algorithm that lets users track delegation chains for the resources they own.
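
The idea of tracking delegation chains can be illustrated with a small sketch. This is not the algorithm from the project; it is a plain depth-first search under assumed conventions: grants are stored as hypothetical (grantor, grantee, resource) tuples, and the function name `delegation_chains` and the sample users are made up for illustration.

```python
from collections import defaultdict


def delegation_chains(grants, owner, user, resource):
    """Return every chain of grants leading from owner to user for resource.

    grants: iterable of (grantor, grantee, resource) tuples. This record
    format is an assumption made for the sketch, not the framework's own.
    """
    # Build the delegation graph restricted to the resource of interest.
    granted_to = defaultdict(list)
    for grantor, grantee, res in grants:
        if res == resource:
            granted_to[grantor].append(grantee)

    chains = []

    def walk(node, path):
        if node == user:
            chains.append(path)
            return
        for nxt in granted_to[node]:
            if nxt not in path:  # never revisit a user, so cycles terminate
                walk(nxt, path + [nxt])

    walk(owner, [owner])
    return chains


# Hypothetical audit: how did carol obtain access to alice's dataset1?
grants = [
    ("alice", "bob", "dataset1"),
    ("bob", "carol", "dataset1"),
    ("alice", "carol", "dataset1"),
    ("alice", "dave", "dataset2"),
]
chains = delegation_chains(grants, "alice", "carol", "dataset1")
for chain in chains:
    print(" -> ".join(chain))
```

Each returned chain starts at the owner and ends at the audited user, so a chain longer than two entries indicates access obtained through delegated granting rights rather than directly from the owner.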

Secure Data Outsourcing: The elasticity that cloud computing offers is encouraging many companies to host their data and applications on cloud infrastructures. Naturally, making the outsourced data 'secure' is one of the important problems. In particular, securing the data and its retrieval patterns against attacks from the cloud providers themselves is a major challenge. We need to ensure efficient retrieval of the required portions of the data while maintaining the confidentiality of both the data and the retrieval requests. As a postdoc, I helped design a methodology that executes range queries over encrypted data hosted on a third-party server using random space perturbation.

Sequence Security Measures: Sequences with "good" randomness and correlation properties and large periods have applications in cryptography (stream ciphers), CDMA, Monte Carlo and quasi-Monte Carlo methods, radar, and other areas. My doctoral research deals with the design and analysis of sequences in the context of stream ciphers. Feedback shift registers (FSRs) are fast devices that are generally used to build sequence generators. Sequences (key streams) used in stream ciphers should withstand specific attacks based on shift register synthesis algorithms, which give rise to several sequence security (complexity) measures. My research deals with the analysis of different types of FSRs and the corresponding measures in order to characterize and count cryptographically "strong" (or "weak") sequences.
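
As a concrete instance of a shift register synthesis attack and the measure it induces, the sketch below generates a binary sequence with a small linear FSR and computes its linear complexity with the Berlekamp-Massey algorithm. Linear complexity is used here only as a familiar example, since the specific measures studied are not named above; the function names, the tap choice (3, 4), and the initial state are assumptions made for illustration.

```python
def lfsr(taps, state, n):
    """Generate n bits with the recurrence s[t] = XOR of s[t - k] for k in taps."""
    s = list(state)
    while len(s) < n:
        bit = 0
        for k in taps:
            bit ^= s[-k]
        s.append(bit)
    return s[:n]


def linear_complexity(bits):
    """Berlekamp-Massey over GF(2): length of the shortest linear FSR
    that generates the given finite bit sequence."""
    n = len(bits)
    c = [0] * (n + 1)  # current connection polynomial
    b = [0] * (n + 1)  # connection polynomial before the last length change
    c[0] = b[0] = 1
    length, m = 0, -1
    for i in range(n):
        # Discrepancy between bits[i] and the current register's prediction.
        d = bits[i]
        for j in range(1, length + 1):
            d ^= c[j] & bits[i - j]
        if d:
            t = c[:]
            shift = i - m
            for j in range(n + 1 - shift):
                c[j + shift] ^= b[j]
            if 2 * length <= i:
                length = i + 1 - length
                m = i
                b = t
    return length


# Taps (3, 4) give s[t] = s[t-3] XOR s[t-4], i.e. the primitive polynomial
# x^4 + x + 1, so any nonzero initial state yields a sequence of period 15.
seq = lfsr((3, 4), [1, 0, 0, 0], 30)
print(seq[:15])
print(linear_complexity(seq))  # 4 for this sequence
```

The example shows why a large period alone is not enough: this sequence has period 15 yet is fully determined by a register of length 4, so an attacker who observes 8 consecutive bits can reconstruct the whole key stream. A cryptographically "strong" sequence must therefore score well on such complexity measures, not only on period.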

Development Lead: At Wright State University, my duties as the development lead for the Human Performance and Cognition Ontology project included monitoring project progress, contributing to design and implementation, writing and reviewing code (mostly in Java), testing various components, assigning duties, setting and revising timelines, writing monthly status reports, presenting results in quarterly meetings with the clients at the AFRL, and coordinating team logistics. The main technologies used in the project are Java (Java 6 in Eclipse with SVN, with the open source projects Jena for RDF data, Lucene for indexing and searching, and LingPipe for parsing MEDLINE abstract files), JavaScript (ExtJS) for the front end UI, MySQL as the backend database, and Tomcat as the web server.

Application Security: With my background in cryptography, I also focus on these application security aspects: authentication and authorization techniques; secure coding principles; awareness of buffer overflows, SQL injection, and cross-domain security; encryption algorithms and password security; key management and public key infrastructure; ethical hacking; and security awareness and education. I am also familiar with the fundamentals of network security, especially the TLS/SSL transport layer security protocols and the associated infrastructure (digital certificates, certificate authorities, and MACs). I currently hold the GIAC-GSEC and SCJP certifications.

Recent peer reviews (not an exhaustive list):