Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

Pages

Posts

Blog Post number 4

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 3

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 2

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 1

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

awards

Recognition and Awards

Published:

Award and year:
  • Outstanding Technical Achievement Award (OTAA), 2021
  • Outstanding Technical Achievement Award (OTAA), 2020
  • IBM Service Corps Alumni, 2020
  • IBM Invention Plateau, 2020
  • Client Value Outstanding Technical Achievement Award (OTAA), 2019
  • Research Division Award, 2019
  • Manager’s Choice Award, 2019
  • Manager’s Choice Award, 2018
  • Second Runner-Up in the WSDM Cup competition, 2017
  • Insta Award for client success, 2015

education

experience

Infosys Limited [AUG 2013 - DEC 2015]

Published:


Tool usage: R, SQL, caret, NLTK
At Infosys, Jay worked for a European banking client on projects related to data analysis and machine learning. He has experience working on the different phases of an end-to-end data science pipeline:

  • Requirements: Communicated with the client and Subject Matter Experts to identify and document business needs and problem statements that require data-driven solutions
  • Data Extraction: Used SQL, PL/SQL, and Hive query language to obtain the data required for analysis from multiple sources such as Oracle DB and Hive tables
  • Data Preparation: Used R packages such as dplyr, tidyr, and lubridate for data preparation activities such as data cleaning, dealing with missing values, parsing dates, creating new variables, and processing text with regular expressions (a pandas sketch of this step follows the list)
  • Data Exploration: Used the R packages ggplot2, pca, alr4, and dplyr to check distributional assumptions using statistical tests, obtain summary statistics and aggregated tables, check associations among variables using correlation analysis/plots and remove highly correlated features, and apply principal component analysis for dimensionality reduction
  • Model Building: Used the caret and h2o packages in R to train different supervised learning models such as regression, logistic regression, decision trees, and random forests, and used methods such as cross validation and regularization to deal with overfitting
  • Reporting: Presented and discussed the data-driven model results with a non-technical audience (the client) to identify the factors that can help in business decision making
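
The work above was done in R (dplyr/tidyr for preparation, caret for modelling). Purely as a rough illustration of the data preparation step, a comparable workflow in Python/pandas might look like the sketch below; the file name and column names are hypothetical, not the actual client data.

    import pandas as pd

    # Hypothetical transaction extract; column names are illustrative only.
    df = pd.read_csv("transactions.csv", parse_dates=["txn_date"])

    df["amount"] = df["amount"].fillna(df["amount"].median())   # impute missing values
    df["channel"] = df["channel"].str.strip().str.lower()       # basic text clean-up
    df["txn_month"] = df["txn_date"].dt.to_period("M")          # derive a date-based feature

    # Aggregate to one row per customer and month, analogous to dplyr group_by/summarise.
    monthly = (df.groupby(["customer_id", "txn_month"])["amount"]
                 .agg(["sum", "max", "count"])
                 .reset_index())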


    During his stint with Infosys, Jay got the opportunity to work on multiple projects and learned many skills, including text mining methods, Hive, and the core banking domain. Apart from the work routine, Jay delivered training sessions for peers on technologies such as SQL and Big Data basics, and for his efforts as a team player he was honored with the “INSTA AWARD”.

University of Illinois [MAY 2016 - DEC 2017]

Published:

Tool usage: R, Python, Markdown, Shiny, Spark, APIs

Jay worked at the University of Illinois as a Research Assistant while pursuing his Master’s degree. His work was about finding answers to several media research questions by analyzing natural language/text on issues such as privacy, secrecy, and immigration. During the course of the research, he worked on multiple projects using text mining techniques such as topic modelling, text classification, named entity recognition, POS tagging, and sentiment analysis, and advanced ML techniques such as SVM, XGBoost, and clustering methods.
Below are the projects (a topic-modelling sketch follows the list):
1) “Privacy vs. Secrecy Analysis”, using data from Twitter feeds
2) “Immigration Analysis”, using data from New York Times articles.
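
As an illustration of the topic-modelling technique mentioned above, the sketch below runs LDA with gensim on a few toy documents; the real corpora were Twitter feeds and New York Times articles, which are not reproduced here.

    from gensim import corpora, models

    # Toy tokenised documents standing in for the tweet / NYT article corpora.
    docs = [
        ["privacy", "data", "surveillance", "users", "consent"],
        ["secrecy", "government", "documents", "leak", "classified"],
        ["immigration", "policy", "border", "visa", "reform"],
    ]

    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                          passes=20, random_state=0)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)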

Git links of projects
Twitter Privacy VS Secrecy Analysis
NYtimes Immigration Analysis

Mayo Clinic [MAY 2017 - AUG 2017]

Published:

In the summer of 2017, Jay worked at Mayo Clinic as an intern. There he was responsible for developing tools for data pipelines and enhancing existing tools.

Tool usage: Python, Unix, Perl, Selenium, beautiful soup, pptx

At Mayo Clinic, Jay worked on multiple projects; his major project was ‘Automated Gene Report Generation’. The goal of this project was to automate the generation of patient gene reports (.pptx format). Creating these reports is cumbersome because gene data comes from various sources such as files and websites, and genetic counselors need these reports for analyzing genetic variations.
Since gathering gene data from multiple sources and creating a PowerPoint gene report is tedious, genetic counselors often spend hours preparing these reports. Through this project, Jay automated pptx document generation using Python to save time (roughly 3 hours per report), so that genetic counselors can spend more time with patients rather than making reports manually.
Besides his major project, Jay developed enhancements for a “Metabolic Modelling tool”. Jay was also part of the machine learning club at Mayo, where he presented and discussed research work with peers.

Git links of projects:
flash_ppt
cobra_babel

IBM Research Labs [MAR 2018 - Present]

Published:


Tool usage: Python, Docker, OpenShift, PyTorch, FastAPI

At present Jay is working as an Advisory Research Engineer, developing AI solutions for enterprise automation using machine learning, deep learning, natural language processing, process mining, and cloud-native tools. He created significant business impact for IBM by developing several AI solutions for products/platforms and led several proofs of concept (POCs) on AI use cases with clients. Along with creating business impact, his day-to-day work involves contributing to research papers and patents. Below are a few example projects and POCs (a minimal serving sketch follows the list):

  • Developed a knowledge-curation-from-documents solution; the curated knowledge was used for bootstrapping chat-bots, enhancing document search, and constructing knowledge graphs
    • Used by 300+ accounts; A-level research accomplishment with over 25M USD per year in revenue
    • Two OTAA awards, from IBM and the client, for project contributions
  • Developed an automated ticket dispatching system for assigning tickets to the correct resolver group. Used a combination of automated rules, SVM, and deep learning methods to achieve human-level accuracy
    • Work published in AI Magazine on deployment practices, and Best Deployment Paper Award at IAAI
    • Used by 12 different clients and has served over 2 million tickets to date
  • Developing an AI for business automation platform for business/IT process discovery, performance monitoring, workforce insights, and process automation on event logs
    • A-level research accomplishment with revenue of 3M USD
    • Two product-related patents and an application paper at ICPM
    • OTAA award for leading and developing the product features
  • Client POCs:
    • Bootstrapping chat-bots from retail and manufacturing domain documents and evaluating chat-bot knowledge gaps with an SME in the loop
    • Process discovery from banking application clickstream data, to identify the process behind application interactions, estimate cognitive load, and recommend automation for applications
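
Since FastAPI is listed in the tool usage above, here is a minimal, hypothetical serving sketch of the kind of REST endpoint such solutions expose; the service name, request schema, and placeholder scoring rule are illustrative only, not the actual product code.

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI(title="ticket-classifier")   # hypothetical service name

    class Ticket(BaseModel):
        text: str

    @app.post("/classify")
    def classify(ticket: Ticket):
        # Placeholder rule; a real deployment would load a trained model instead.
        group = "network" if "vpn" in ticket.text.lower() else "general"
        return {"resolver_group": group}

    # Run locally (assuming this file is saved as app.py): uvicorn app:app --reload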

      More details on these projects can be found in the ‘projects’ section.

      Apart from his daily work, he likes to mentor students and professionals looking to switch to data science/ML roles.

expertise

Tools

Published:

Jay uses several tools for developing AI solutions. He has experience with the following:

  • Object-oriented programming, data structures, and algorithms
  • Python: collections, itertools, NumPy, Pandas, scikit-learn, pyspark, pytorch, pm4py, networkx, NLTK, gensim, spaCy
  • RDBMS, SQL, MongoDB, Elasticsearch, Kibana
  • DevOps: Docker, OpenShift, FastAPI
  • Experience in web scraping; worked with several APIs such as NYTimes and Twitter to extract data
  • Basics of R: caret, h2o, ggplot2, dplyr

Skills

Published:

Jay has expertise in Machine Learning, Deep Learning, Natural Language Processing, Statistical Learning, and Process Mining.

  • Statistical Learning: Distributions, Inferential Statistics, Data Exploration, Regression Analysis, Variable Selection, ANOVA, Logistic Regression, Discriminant Analysis
  • Machine Learning: Pattern Mining, PCA, K-Means, Density-Based Clustering, t-SNE, Naïve Bayes, KNN, Boosted Tree Models, SVM, Kernel Methods, Ensemble Learning, Regularization, Cross Validation, Variable and Model Selection
  • Natural Language Processing: Sentiment Analysis, Topic Modelling, Text Classification, Transformer Models, Named Entity Recognition, Language Models, Procedure Extraction, Knowledge Graphs
  • Deep Learning: RNN, LSTM, GAN, VAE, Siamese Networks
  • Process Mining: Log Analysis, Fuzzy Miner, Heuristic Miner, Performance Monitoring

projects

Customer sentiment analysis

Published:

Key Words: R, SQL, Logistic Regression, LASSO, tf-idf

This project was done at Infosys, and due to restrictions the code is not available. The goal of the project was to build a classification model for a banking client to categorize customers as “Happy” or “Unhappy” based on email text and survey feedback.
Below are the steps in the project (a Python sketch of the modelling pipeline follows the list):
1) Data was queried using SQL, and a data set was compiled for the analysis with email text, survey feedback, and the target sentiment variable
2) Performed feature engineering and data cleaning on the email text and survey feedback in two steps: i) for the email text, used text mining methods such as regular expressions and tf-idf to convert raw text into numeric vectors, then used PCA (Principal Component Analysis) to obtain a dense representation of the features; ii) the survey feedback consists of scores for a set of questions on a 0-7 scale and has a high proportion of missing values, so NNMF (Non-negative Matrix Factorization) was used to fill in the missing scores. Both feature sets were then combined with the target variable to obtain the training data set
3) Modeled logistic regression for the binary sentiment classification; to deal with overfitting, techniques such as cross validation and regularization (LASSO, Ridge) were considered, with accuracy as the model evaluation metric
4) Interpreted variable selection using odds ratios and the LASSO model to find the influential variables; some of the survey feedback variables turned out to be important in classifying sentiment, whereas the dense text features could not be interpreted since they are principal components
5) Selected the best model by accuracy and predicted sentiment on unseen customer data
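
The original pipeline was written in R and the data is not public; purely as an illustration of the same idea (tf-idf features, dense components, regularised logistic regression with cross validation), a scikit-learn sketch on invented toy emails might look like this:

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Toy emails and sentiment labels (1 = happy, 0 = unhappy); the client data cannot be shown.
    emails = [
        "great service, thank you for the quick resolution",
        "very disappointed with the hidden charges",
        "the advisor was helpful and polite",
        "still waiting for a refund, poor experience",
        "loan approval was fast and smooth",
        "call center kept me on hold for an hour",
    ]
    labels = [1, 0, 1, 0, 1, 0]

    pipe = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=2),                          # dense components, as PCA did
        LogisticRegression(penalty="l1", solver="liblinear"),  # L1 regularisation (LASSO-like)
    )
    print(cross_val_score(pipe, emails, labels, cv=3, scoring="accuracy"))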

Loan Defaulter Prediction

Published:

Key Words: R, SQL, Hive, Logistic Regression, Random Forests

This project was done at Infosys, and due to restrictions the code is not available. The goal of this project was to predict whether a customer of the bank will default in the near future, and to help the bank make decisions in the loan approval process.
Below are the steps in the project:
1) Identified the key variables for the analysis by working together with a banking domain Subject Matter Expert (SME) and documented the requirements
2) Studied the schema of the database and wrote SQL queries to pull data from several tables in order to compile the data set for the analysis
3) Performed data cleaning and manipulation, such as deleting records with mostly missing columns, creating new variables from existing variables (e.g. maximum transaction amount), and discretizing income
4) Hand-annotated the response classes “Default” and “Non Default” for a sample of customer records using the business rules to create training labels
5) Explored several customer attributes such as age, income, number of loans, sex, and credit score with respect to the response variable
6) Modeled logistic regression for the classification to interpret variable importance based on odds ratios, and a decision tree model to interpret variables and derive decision rules based on the tree structure
7) Evaluated model performance based on F1 score; the tree model outperformed the logistic model, so as a next step a Random Forest model was trained to achieve better results (see the Python sketch after this list)
8) Tested the model performance on future loan data; both logistic regression and Random Forest achieved a better F1 score than expert judgement, by 0.15, because the existing expert judgement overlooked some dynamic variables such as late fees
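
The original models were built in R on confidential bank data; the sketch below only illustrates the comparison step (logistic regression versus a tree ensemble, evaluated on F1) using synthetic data.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # Synthetic, imbalanced stand-in for the default / non-default labels.
    X, y = make_classification(n_samples=1000, n_features=10, weights=[0.85], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, round(f1_score(y_te, model.predict(X_te)), 3))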

SPAM classification

Published:

Key Words: R, Logistic Regression, GBM, tf-idf

This project classifies email into SPAM / NOT SPAM; it was an in-class competition for the statistical learning course. The data set for the analysis can be found at the UCI ML Repo; it contains the normalized tf-idf of several important words that can discriminate an email as SPAM or NOT SPAM. As part of data exploration, i) variable selection was performed based on the feature density plots, ii) high correlations were removed based on the VIF measure, and iii) significant variables and their interactions were identified by interpreting odds ratios from logistic regression. For the binary classification objective, several machine learning models were used, such as GLM, ElasticNet, GBM, and radial SVM, with the best set of variables and their interactions. Our feature selection with GLM and GBM models as classifiers ranked top on the competition leaderboard.
Links: Git Repo

WSDM Cup 2017 Triple Scoring Task

Published:

Key Words: Python, NLTK, word2vec, GloVe, SVR, PCA

Triple scoring is the task of computing relevance scores for knowledge-base triples from type-like relations such as PERSON-PROFESSION-RELEVANCE SCORE or PERSON-NATIONALITY-RELEVANCE SCORE. For example, Julius Caesar has various professions, including Politician and Author. Given the (Julius Caesar, {profession} Politician) triple and the (Julius Caesar, {profession} Author) triple, the former is expected to have a higher relevance score (the so-called “triple score”) because politician is the major profession of Julius Caesar. Accurate estimation of such triple scores benefits ranking for knowledge base queries. Since the training data had just three columns, “Person”, “Profession”, and “Score”, a lot of feature engineering was performed to create the vector space for training. In our approach we propose to integrate knowledge from both latent features and explicit features with an ensemble method to address the problem. The latent features consist of person entity representations learned from a word2vec model and profession/nationality value representations learned from a GloVe model. In addition, we also incorporated explicit features of person entities from the Freebase knowledge base and from Wikipedia. Experimental results show that our feature integration method with Support Vector Machine training performed well and ranked 3rd place in the competition.
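
A highly simplified sketch of the feature-integration idea (latent person/profession vectors concatenated with explicit features, fed to an SVR) is shown below; the vectors and scores are made up for illustration, whereas the actual system learned them from word2vec, GloVe, Freebase, and Wikipedia.

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)

    # Stand-ins for latent embeddings (word2vec for persons, GloVe for professions).
    person_vec = {"Julius Caesar": rng.random(50)}
    profession_vec = {"Politician": rng.random(50), "Author": rng.random(50)}
    # Stand-in for explicit features mined from Freebase / Wikipedia.
    explicit = {("Julius Caesar", "Politician"): [0.9], ("Julius Caesar", "Author"): [0.2]}

    def features(person, profession):
        return np.concatenate([person_vec[person], profession_vec[profession],
                               explicit[(person, profession)]])

    X = [features("Julius Caesar", "Politician"), features("Julius Caesar", "Author")]
    y = [6.0, 2.0]   # illustrative relevance scores only

    model = SVR(kernel="rbf").fit(X, y)
    print(model.predict(X))
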
The methodology paper was submitted to the WSDM 2017 conference.
Title: “Integrating Knowledge from Latent and Explicit Features for Triple Scoring”

Links : Git Repo

Analyzing Heart Disease using Patient Medical Records

Published:

Key words: SAS, Descriptive Statistics, Logistic Regression, Discriminant Analysis

Health sciences is one of the areas where data analysis plays a crucial role, and it offers rich medical record data for patients. By analyzing a patient’s historical data, we can find interesting insights about a certain disease, which helps in predicting whether a new patient will get a similar disease, so that we can take the necessary actions to prevent the disease or treat it effectively by identifying the major cause.
For this group project, Jay was the leader of the group and guided a small team of undergraduate students. Our group chose to work on chronic heart disease, one of the major diseases causing thousands of deaths. Through data modelling on patient data we tried to predict the chances of a patient getting chronic heart disease. We also looked at the major factors that cause heart disease and tried to answer questions like:
1) Does heart disease depend on the age of the patient, and if so, what age range is more prone to heart disease?
2) What are the effects of smoking and drinking habits, and are they likely to cause heart disease?
3) How do obesity and cholesterol affect the chances of getting heart disease?
4) Does heart disease depend on family history, i.e., if a family member has heart disease, how likely is the patient to get heart disease?

Jay’s role in this project was to lead the team through guidance and discussions, build the predictive model for the hereditary factor using discriminant analysis, and interpret and visualize the effect of the significant variables causing heart disease. More details can be found in the links below.
Links: Project report and Git Repository

Insurance Claims Loss Prediction

Published:

Key words: R, h2o, XGBoost, LASSO, PCA, K-Means Clustering

The insurance domain is one of the prominent domains where statistical learning / machine learning plays a crucial role. With statistical learning we can automate the process of claim-cost prediction, and also severity prediction. This automated process ensures a worry-free customer experience and offers insight into better ways to predict claim severity.
For this group project, our task was to predict the target variable “loss” (a numeric value) from several anonymous predictor variables, which include 116 categorical variables and 14 continuous variables. Considering the large dimension of the data, with over 180,000 observations and 130 predictor variables, a great deal of feature selection and engineering was needed.
At first, data exploration was performed to better understand the problem using statistical tests and correlation analysis; then feature engineering was performed using dimensionality reduction and clustering, and feature selection using LASSO and Random Forest. Later, “loss” was modeled using XGBoost, GBM, and LASSO with the selected important features.
Jay’s role in this project was to design the pipeline for the analysis, perform feature engineering using PCA (numeric data), K-Means clustering using Gower distance, and correlation analysis, and model “loss” using the XGBoost algorithm (an illustrative sketch appears below).
​Links : Project Report and Git Repo

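The original modelling used R (h2o, xgboost) on the competition data; as a rough illustration of the final modelling step only, an XGBoost regressor on synthetic data could be set up as follows.

    import xgboost as xgb
    from sklearn.datasets import make_regression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the anonymised claim features and the "loss" target.
    X, y = make_regression(n_samples=2000, n_features=30, noise=15.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = xgb.XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05)
    model.fit(X_tr, y_tr)
    print("MAE:", round(mean_absolute_error(y_te, model.predict(X_te)), 2))
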
Faulty Water pump prediction

Published:

Key words: R, h2o, XGBoost, sparklyr

This project is a competition on DrivenData: predict which pumps are functional, which need some repairs, and which don’t work at all, based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which water points will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.
Data cleaning was performed as follows:
i) Removed some of the variables as they are constant or have no variation
ii) Some variables are similar to others, e.g. latitude and longitude convey the location information, so other redundant location variables were removed
iii) Some categorical variables have too many levels, and such variables were discarded
iv) Created new variables to account for the age of the pump based on the installation date
After feature engineering, a classification model was trained using XGBoost and Random Forest with the h2o and sparklyr packages, and the significant variables for classifying faulty water pumps were interpreted based on information gain.
Links : Git Repo

Automated Gene Report Generation

Published:

Key words: Python, SQL, APIs, web scraping, Selenium, pptx

This project is about a tool called flash_ppt developed at Mayo Clinic. flash_ppt is a software tool written in Python to automate the pptx genetic report generation step in data pipelines.

Background:
A genetic report contains information about gene variants for a subject/patient, and it is used by genetic counselors to analyse disease and provide counseling to patients. Genetic information comes from several sources, such as multiple catalog files (from clinical labs across the world) and websites such as OMIM, NCBI, and HGMD.
At Mayo, genetic counselors prepare these reports manually; it typically takes 3 hours to put together a single report. As report preparation is cumbersome, the whole process was automated in order to reduce the human effort. With automation, a single report can be generated in less than a minute (see the sketch below).
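
The actual flash_ppt code is in the linked repository; the fragment below is only a minimal illustration, using the python-pptx library, of how a slide with a small variant table can be generated programmatically. The gene rows are invented.

    from pptx import Presentation
    from pptx.util import Inches

    # Invented example rows; real data is merged from catalog files and sites such as OMIM/NCBI.
    gene_rows = [("BRCA1", "c.68_69delAG", "Pathogenic"),
                 ("TP53", "c.743G>A", "Likely pathogenic")]

    prs = Presentation()
    slide = prs.slides.add_slide(prs.slide_layouts[5])      # title-only layout
    slide.shapes.title.text = "Gene Variant Report"

    table = slide.shapes.add_table(len(gene_rows) + 1, 3,
                                   Inches(0.5), Inches(1.5), Inches(9), Inches(2)).table
    for j, header in enumerate(("Gene", "Variant", "Classification")):
        table.cell(0, j).text = header
    for i, row in enumerate(gene_rows, start=1):
        for j, value in enumerate(row):
            table.cell(i, j).text = value

    prs.save("gene_report.pptx")
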
Links : Git Repo

process mining on documents and clickstreams

Published:

Tools: Python, REST APIs, Docker, networkx, pandas

Methods: Several NLP and graph techniques, such as

  • Extraction of entities, decisions, and OCR content from documents, and automated process model generation
  • PageRank algorithm to identify critical applications (see the sketch after this list)
  • Heuristic discovery of automation-amenable candidates using KPIs such as time duration, execution frequency, and centrality
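
A tiny sketch of the PageRank step referenced above: applications become nodes, observed click transitions become weighted edges, and PageRank highlights the most central application. The application names and counts are invented.

    import networkx as nx

    # Hypothetical application-to-application transitions mined from clickstreams.
    transitions = [("CRM", "FraudCheck", 120),
                   ("FraudCheck", "CoreBanking", 80),
                   ("CRM", "Email", 30),
                   ("Email", "CRM", 25)]

    G = nx.DiGraph()
    for src, dst, count in transitions:
        G.add_edge(src, dst, weight=count)

    scores = nx.pagerank(G, weight="weight")
    print(sorted(scores.items(), key=lambda kv: -kv[1]))   # most central applications first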

    Description: Banking client POC to automatically identify business process models from 1) unstructured documents with text and images and 2) user click data from banking applications. This work helped the client better understand the process behind their invoice processing and fraudulent transaction detection systems by analysing inefficiencies, enhancements, conformance issues, application interactions, and automation recommendations.
    1) Process mining on documents: Invoice processing documents contain information in the form of text descriptions and flow-chart images. Our aim is to identify process elements from both the text and image modalities, then combine the results to generate the discovered process model in BPMN format. We used a pre-trained model for learning the structure of the document using CCS (Corpus Conversion Service); the CCS tool can be customised to documents of different styles.
    i) Text2process: extraction of the task nodes, activity descriptions, and decision nodes of the process from text
    ii) Image2process: extraction of task nodes from flow diagrams using CCS OCR
    iii) Discovered process BPMN: automated BPMN process model generation combining Text2process and Image2process

    2) Process mining on application click data: The fraudulent transaction process involves different applications with users accessing them. The clickstream data modality captures user activity at the application level. We use the click data to discover a process, using Record2process to represent the interaction between different applications at the window level.
    i) Record2process:
    a) a main process at the application level to show the user control flow between different applications
    b) a sub-process for each application at the window level to understand the user interaction within each application
    ii) Cognitive load estimation:
    a) for each application, the human cognitive effort required is estimated based on user actions
    iii) Automation amenability:
    a) for each application, automation feasibility is measured based on determinism with respect to the time duration spent in the application

    ​​ Related papers:
  • CCS - https://arxiv.org/abs/1806.02284

    Related Patents:
  • Discovering automation-amenable candidates

Bootstrapping chat-bots and Gap analysis

Published:

Tools: Python, REST APIs, Docker, Mongo DB, Watson Assistant

Methods: Several ML and NLP techniques, such as

  • Extraction of entities, pivot sentences, procedures, symptoms, conditional blocks, images, and tables
  • Text classification using the Universal Sentence Encoder (USE) and the Watson NLP classifier
  • Seed-based clustering using Gaussian Mixture Models (see the sketch after this list)
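
A minimal sketch of the seed-based clustering idea above: sentence embeddings (assumed to come from an encoder such as USE) are clustered with a Gaussian Mixture Model whose component means are initialised from a few labelled seed examples. The embeddings here are random placeholders.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # Placeholder sentence embeddings; in practice these come from an encoder such as USE.
    embeddings = rng.random((200, 128))

    # Seed each component mean with the average embedding of a few labelled examples.
    seeds = np.stack([embeddings[:5].mean(axis=0), embeddings[5:10].mean(axis=0)])

    gmm = GaussianMixture(n_components=2, means_init=seeds,
                          covariance_type="diag", random_state=0)
    clusters = gmm.fit_predict(embeddings)
    print(clusters[:20])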

    Description: This project involves two steps: 1) bootstrapping chat-bots with knowledge extracted from documents and 2) assessing chat-bot coverage and automatically identifying gaps in the bot’s knowledge
  • Knowledge extraction from documents for question answering - Using technical documents, we extract several QA pairs from pivot sentences, procedural sections, paragraphs, images, tables, etc. These QA pairs are used to bootstrap the chat-bot, which helps agents solve incidents/service requests.
  • Gap analysis and allowing the SME to test the bot’s knowledge - Once the chat-bot is bootstrapped, a UI enables the SME to see what the bot has learnt from a given document in terms of QA pairs, approve the chat-bot, or provide a few annotations to enhance chat-bot coverage.


    ​​ Outcomes:
  • Client Value OTAA award for contributing to client success
  • Bootstrapped a chat-bot from approximately 100 technical documents and reduced chat-bot development time by 95 percent

AI for business process automation

Published:

Tools: Python, Docker, REST APIs, OpenShift, pm4py, networkx

Methods: Several deep learning and statistical techniques, such as

  • Heuristic miner algorithm for process discovery from logs (see the sketch after this list)
  • Efficiency ranking of agents using KL-divergence-based statistical distribution comparison
  • GARCH and LSTM models for KPI forecasting
  • Automated process-aware feature engineering pipeline
  • Recommendations for workflow decisions using the log
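
A small sketch of the heuristic-miner step listed above, using the pm4py simplified API on an invented three-case event log; the function names follow recent pm4py releases and may differ across versions.

    import pandas as pd
    import pm4py

    # Invented event log with the case id / activity / timestamp columns pm4py expects.
    df = pd.DataFrame({
        "case_id":  ["t1", "t1", "t1", "t2", "t2", "t3", "t3", "t3"],
        "activity": ["Open", "Review", "Close", "Open", "Close", "Open", "Review", "Close"],
        "timestamp": pd.to_datetime([
            "2021-01-01 09:00", "2021-01-01 10:00", "2021-01-01 12:00",
            "2021-01-02 09:00", "2021-01-02 09:30",
            "2021-01-03 09:00", "2021-01-03 11:00", "2021-01-03 15:00",
        ]),
    })

    log = pm4py.format_dataframe(df, case_id="case_id",
                                 activity_key="activity", timestamp_key="timestamp")
    heu_net = pm4py.discover_heuristics_net(log)   # process discovery with the heuristics miner
    pm4py.view_heuristics_net(heu_net)             # renders the discovered model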

    Description: Business processes are quintessential for effective business operations in an enterprise. A wide range of processes, such as HR processes and customer support, are executed in an enterprise. A workflow is defined for each process, comprising various activities, decisions, and resources such as the actors performing the activities. Often some factors are overlooked or missed while designing a workflow, which in turn leads to inefficiencies. Workflow inefficiencies are undesirable as they degrade the user experience and the organisation’s productivity. There is a need to identify and mitigate inefficiencies in the workflow and thus improve process performance.
    AI integration in business processes is crucial to enable more efficient and cost-effective business processes; below are the capabilities:
    1) Realtime hotspot analysis: offers the ability to mine process logs and unstructured documents to identify bottlenecks and measure the effects of the workforce, using unsupervised pattern mining for hotspot detection (including work vs. wait times and path frequencies) and actor efficiency ranking to compare the relative performance of team members
    2) Process KPI forecasting: predicts KPIs and alerts users when there is a risk of violating SLAs in the future
    3) Decision recommendation: recommends decisions to agents at decision points in a workflow based on predicted outcomes, including capabilities such as process-aware machine learning pipelines that train ML models for decisions based on historical data and give predictions with process-aware explanations

    ​​ Outcomes:
  • Included in the IBM Cloud Pak for Automation product
  • A-level research accomplishment with 3M USD per year revenue
  • OTAA award for leading and developing the product

    Related papers:
  • ICPM application paper: https://icpmconference.org/2020/wp-content/uploads/sites/4/2020/10/ICPM_2020_paper_152.pdf 
  • Blog on product features: https://www.linkedin.com/pulse/identifying-hotspots-improvement-opportunities-using-tool-bandlamudi/?trackingId=3kf9b2HOY25SDVOewTCn2w%3D%3D

    Related Patents:
  • Enhancing process models and computing a multi-dimensional KPI index
  • Conversational process conformance using historic chats

Automated Dispatch of Helpdesk Email Tickets

Published:

Tools: Python, REST APIs, Docker, Mongo DB

Methods: Several ML and NLP techniques, such as

  • SVM, Siamese LSTM networks, an automated rule engine using Gini impurity and association rule mining, model selection, and continuous retraining

    Description: A framework for an end-to-end automated help desk email ticket assignment system driven by high accuracy, coverage, business continuity, scalability, and optimal usage of computational resources. The primary objective of the system is to determine the problem mentioned in an incoming email ticket and then automatically dispatch it to an appropriate resolver group with high accuracy. While meeting this objective, it should also be able to operate at the desired accuracy levels in the face of changing business needs by automatically adapting to the changes. The proposed system uses a system of classifiers with separate strategies for handling frequent and sparse resolver groups, augmented with a semi-automatic rule engine and retraining strategies, to ensure that it is accurate, robust, and adaptive to changing business needs. Our system has been deployed in production for six major service providers in diverse service domains and currently assigns 100,000 emails per month, on average, with an accuracy close to ninety percent while covering at least ninety percent of email tickets. This translates to achieving human-level accuracy and results in net savings of more than 50,000 person-hours of effort per annum. To date, our deployed system has served more than two million tickets in production.
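
The deployed system is far more involved (rule engine, Siamese networks, continuous retraining); the sketch below only illustrates the core idea of classifying ticket text into resolver groups with an SVM over tf-idf features, using invented tickets.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Invented ticket texts and resolver groups.
    tickets = [
        "outlook keeps crashing on startup",
        "cannot connect to vpn from home office",
        "please reset my sap password",
        "vpn token expired, need a new one",
    ]
    groups = ["desktop_support", "network", "identity", "network"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    clf.fit(tickets, groups)

    print(clf.predict(["vpn not working after password reset"]))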


    ​​ Outcomes:
  • Deployed for 10+ telco and retail clients; has served 2 million tickets to date
  • AI Magazine article and Best Deployment Paper Award at IAAI

    Related papers:
  • IAAI paper: https://doi.org/10.1609/aaai.v33i01.33019381
  • AI Magazine: https://doi.org/10.1609/aimag.v41i3.5321

    Related Patents:
  • Automated rule extraction for ticket assignment

Enterprise NLP: Scalable Knowledge Curation from Technical Documents

Published:

Tools: Python, REST APIs, Docker, Elasticsearch, MongoDB, Watson Discovery, Watson Assistant.

Methods: Several NLP techniques, such as

  • Extraction of entities, pivot sentences, procedures, symptoms, conditional blocks, images, and tables
  • Improving search relevance using boosting and learning-to-rank (see the sketch after the description below)
  • Text classification using the Universal Sentence Encoder (USE) and Watson NLP

    Description: There is a wide variety of technical documents being generated in the IT industry. In order to build generic AI solutions faster that can cater to various kinds of enterprise IT scenarios, there is a need to automatically and generically extract the knowledge from technical documents. Challenges here include (a) the lack of standardization in size, style, and organization of the documentation and (b) the specific structural patterns present in IT documents that differentiate them from generic documents. To overcome these challenges in the automated understanding of technical documents, we propose a framework to represent technical documents, inspired by how humans read and comprehend documents. We also detail the automatic extraction of components from technical documents and how they relate to each other, and demonstrate the value using two example applications: search and question answering (QA) systems.
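
As a hedged illustration of the field-boosting idea mentioned under search relevance, the query below boosts matches in title and procedure fields over the body; the index name and field names are hypothetical, and the call shape follows the Elasticsearch 8.x Python client.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # hypothetical local cluster

    query = {
        "multi_match": {
            "query": "how do I restart the db2 service",
            # Boost matches in titles and procedure steps over plain body text.
            "fields": ["title^3", "procedure_steps^2", "body"],
        }
    }

    resp = es.search(index="tech_docs", query=query)   # 7.x clients use body={"query": query}
    for hit in resp["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("title"))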


    ​​ Outcomes:
  • Product inclusion in the IBM GBS Cognitive Assistant Platform
  • Deployed for over 300 clients
  • A-level research accomplishment with over 25M USD revenue per year
  • Two OTAA awards, from IBM and the client

    Related papers:
  • CCS - https://arxiv.org/abs/1806.02284
  • QQI paper - https://ojs.aaai.org//index.php/AAAI/article/view/7024
  • BERT - https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15848021.pdf
  • AAAI submission - Document anatomy framework

    Related Patents:
  • Evaluating chat-bots with respect to gap analysis
  • Adding context to non-factoid questions using document structure

resources

Links to resources

Published:

This section is about useful R and Python scripts for data analysis, machine learning, and web scraping.

1) New York Times web scraper: an R script for scraping the NYT API; enter a custom search term like “Immigration” and a start date, then run the script to save the article data into a CSV file.

2) Data Viewer: a simple data-viewing application using R and Shiny. Useful for reading text files with large content; input your text file in CSV format.

3) R_basics.R: a template script for regular expressions, dplyr, and functional programming in R.

4) Python-programming repo: a repository for Python programming, including data structure and algorithm implementations.

5) Caret_models.R: a template for several machine learning models such as ElasticNet, GBM, and SVM with cross validation, using the R “caret” package.

6) h2o_ensemble.R: a script for faster ensemble learning using several h2o models such as GLM, RF, and GBM for binary classification.

7) Pathsim.py: a similarity-measure algorithm for finding and ranking similar authors based on DBLP publication graph structures such as Author-Venue-Paper.

8) ML Algorithm Implementations:
Decision Tree, Random Forest in Python
Apriori for Frequent Patterns in Python
K Nearest Neighbors in Python
Naive Bayes in R
Discriminant Analysis in R