Title: Goals, Process, and Challenges of Exploratory Data Analysis: An Interview Study
Abstract: How do analysis goals and context affect exploratory data analysis (EDA)? To investigate this question, we conducted semi-structured interviews with 18 data analysts. We characterize common exploration goals: profiling (assessing data quality) and discovery (gaining new insights). Though the EDA literature primarily emphasizes discovery, we observe that discovery only reliably occurs in the context of open-ended analyses, whereas all participants engage in profiling across all of their analyses. We describe the process and challenges of EDA highlighted by our interviews. We find that analysts must perform repetitive tasks (e.g., examine numerous variables), yet they may have limited time or lack domain knowledge to explore data. Analysts also often have to consult other stakeholders and oscillate between exploration and other tasks, such as acquiring and wrangling additional data. Based on these observations, we identify design opportunities for exploratory analysis tools, such as augmenting exploration with automation and guidance.
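As a rough illustration of what "profiling (assessing data quality)" can involve in practice, the sketch below counts missing and distinct values per column for a small list of records. It is not taken from the paper: the function name `profile_columns`, the sample `rows`, and the choice to treat `None` and the empty string as missing are all assumptions made for this example.

```python
from collections import Counter


def profile_columns(rows):
    """Summarize each column: missing count, distinct count, most common value.

    Hypothetical helper for illustration; `rows` is a list of dicts, and
    None or "" are treated as missing (an assumption, not a standard).
    """
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        # row.get returns None when a record lacks the column entirely.
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in (None, "")]
        counts = Counter(present)
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "most_common": counts.most_common(1)[0][0] if counts else None,
        }
    return report


# Made-up sample records with typical quality problems.
rows = [
    {"age": 34, "city": "Boston"},
    {"age": None, "city": "Boston"},
    {"age": 29, "city": ""},
    {"age": 34},
]
report = profile_columns(rows)
for column, summary in report.items():
    print(column, summary)
```

Repeating a check like this over every variable is the kind of repetitive task the abstract describes, which is one reason automation is proposed as a design opportunity.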
Role of Exploratory Data Analysis in Data Science
The use of Exploratory Data Analysis in Information Retrieval Research
- Warren R. Greiff
Part of the book series: The Information Retrieval Series (INRE, volume 7)
We report on a line of work in which techniques of Exploratory Data Analysis (EDA) have been used as a vehicle for better understanding of the issues confronting the researcher in information retrieval (IR). EDA is used for visualizing and studying data for the purpose of uncovering statistical regularities that might not be apparent otherwise. The analysis is carried out in terms of the formal notion of Weight of Evidence (WOE). As a result of this analysis, a novel theory in support of the use of inverse document frequency (idf) for document ranking is presented, and experimental evidence is given in favor of a modification of the classical idf formula motivated by the analysis. This approach is then extended to other sources of evidence commonly used for ranking in information retrieval systems.
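The abstract refers to the classical idf formula without stating it. A minimal sketch, assuming the common textbook form idf(t) = log(N / df_t), where N is the number of documents and df_t is the number of documents containing term t, is shown below. The function name, the toy corpus, and the natural-log base are assumptions for illustration; the modified formula proposed in the chapter is not given in this excerpt and is not reproduced here.

```python
import math


def inverse_document_frequency(documents):
    """Compute idf(t) = log(N / df_t) for every term in a tokenized corpus.

    Illustrative only: uses the plain textbook form with the natural log,
    not the modified formula discussed in the chapter.
    """
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # set() so a term is counted once per document, however often it occurs.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Toy corpus of three tokenized documents (made up for the example).
docs = [
    ["exploratory", "data", "analysis"],
    ["data", "retrieval"],
    ["data", "ranking", "retrieval"],
]
idf = inverse_document_frequency(docs)
for term in sorted(idf, key=idf.get):
    print(f"{term}: {idf[term]:.3f}")
```

In this toy corpus a term that appears in every document ("data") gets an idf of zero, while rarer terms score higher, which is the intuition behind using idf as ranking evidence.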
- Information Retrieval
- Relevant Document
- Term Frequency
- Exploratory Data Analysis
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Author information
Authors and affiliations.
The MITRE Corporation, Bedford, Massachusetts
Warren R. Greiff
Editor information
Editors and affiliations.
University of Massachusetts, Amherst
W. Bruce Croft
Copyright information
© 2002 Kluwer Academic Publishers
About this chapter
Greiff, W.R. (2002). The use of Exploratory Data Analysis in Information Retrieval Research. In: Croft, W.B. (eds) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_2
DOI: https://doi.org/10.1007/0-306-47019-5_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7812-9
Online ISBN: 978-0-306-47019-6
eBook Packages: Springer Book Archive
COMMENTS
Exploratory data analysis stems from the collection of work by the statistician John Tukey in the 1960s and 1970s [24,39,40,67]. His seminal book [67] compiles a collection of data visualization techniques as well as robust and non-parametric statistics for data exploration. Many communities including Statistics, Human-Computer In-
Challenges in Incorporating Exploratory Data Analysis Into Statistical Workflow. manually override defaults, in some cases (Figures 4, 5) we failed to get the specifications to our liking on all details. This is as much the result of our (perhaps unwise) choice to work outside of tools in our comfort zone as it is poor defaults as Unwin suggests.
ing at numbers to be tedious, boring, and/or overwhelming. Exploratory data analysis techniques have been devised as an aid in this situation. Most of these techniques work in part by hiding certain aspects of the data while making other aspects more clear. Exploratory data analysis is generally cross-classified in two ways. First, each
The goal of this thesis is to apply Exploratory Data Analysis to the qualitative and quantitative dataset, to answer the research questions on the characteristics of the users, patterns of their behavior, and at the same time on the influence of environmental factors. Multiple data mining models have been applied in the thesis, including
Such critiques motivate our proposal that research on supporting exploratory visual analysis should embrace theories of graphical inference. In the following section we propose an alternative understanding of exploratory visual analysis as guided by model checks, and describe possible formalizations of this theory.
Exploratory Data Analysis (EDA) is an approach advocated by renowned statistician J. W. Tukey and others. It uses data visualization as applied to raw data or summarized information (Chapter 5) from a dataset to understand relationships within a dataset. It may be used to discover patterns which can then be tested using standard inferential statistics (Chapter 6).
of the first half of the twentieth century. Exploratory data analysis (EDA), on the contrary, had not been widely used until the groundbreaking work of Tukey [6] [7]. Tukey [7] stated that exploratory data analysis is about looking at data to see what it seems to say. He argued that data analysis is not just testing a pre-defined hypothesis and ...
15.1 Introduction. Exploratory data analysis (EDA) is an essential step in any research analysis. The primary aim with exploratory analysis is to examine the data for distribution, outliers and anomalies to direct specific testing of your hypothesis. It also provides tools for hypothesis generation by visualizing and understanding the data ...
Abstract. Exploratory data analysis (EDA) is an iterative, open-ended data analysis procedure that allows practitioners to examine data without pre-conceived notions to advise improvement processes and make informed decisions. Education is a data-rich field that is primed for a transition into a deeper, more purposeful use of data.
Exploratory data analysis is a set of techniques that have been principally developed by John Wilder Tukey since 1970. The philosophy behind this approach is to examine the data before applying a specific probability model. According to Tukey, exploratory data analysis is similar to detective work. In exploratory data analysis, these ...
2 Exploratory Data Analysis and Graphics. This chapter covers both the practical details and the broader philosophy of (1) reading data into R and (2) doing exploratory data analysis, in particular graphical analysis. To get the most out of the chapter you should already have some basic knowledge of R's syntax and commands (see the R ...
Exploratory data analysis (EDA) is a well-established statistical tradition that provides conceptual and computational tools for discovering patterns to foster hypothesis development and refinement.
Introduction. Exploratory Data Analysis (EDA) is the single most important task to conduct at the beginning of every data science project. In essence, it involves thoroughly examining and characterizing your data in order to find its underlying characteristics, possible anomalies, and hidden patterns and relationships.
Exploratory research can help you narrow down your topic and formulate a clear hypothesis and problem statement, as well as giving you the "lay of the land" on your topic. Data collection using exploratory research is often divided into primary and secondary research methods, with data analysis following the same model. Primary research
Abstract. Exploratory Data Analysis (EDA) is an approach employed to analyze datasets. Primarily, EDA uses data visualization methods and often statistical models to (i) assess a dataset's general structure, (ii) obtain descriptive summaries of the data, and (iii) provide the basis for model formulation. EDA includes checks on data quality ...
This is where Exploratory Data Analysis (EDA) comes to the rescue. According to Wikipedia, EDA "is an approach to analyzing datasets to summarize their main characteristics, often with visual methods". In my own words, it is about knowing your data, gaining a certain amount of familiarity with the data, before one starts to extract insights ...
The largest representation of our world is written by data, usually digital data. The analysis of these data is the key to understanding our world better. Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.
Exploratory data analysis (hereafter denoted as EDA) is not a technique but it is a paradigm or strategy for robust analysis of data (Tukey 1977). It comprises descriptive, statistical, and, largely, graphical investigations meant for (a) obtaining the greatest understanding of a set of data, (b) revealing structure of data, (c) determining important data variables, (d) determining outlying and ...
Exploratory data analysis (EDA) is one of the hidden and mundane tasks in the analysis of data, as a model, project, or analysis is based on data, which is intuitive, extremely heterogeneous, and distorted in its form. (Data has become an integral part of every project, Model &) The analyzed data is more insightful for identifying and improving extremely critical business insights across the ...
This is done not only to familiarize yourself with all the data you have collected, but also to reduce the workload during analysis. The initial data investigation has been termed exploratory data analysis or EDA and it primarily focuses on visually inspecting the data. The main aim of EDA is to understand what data you have, what possible ...
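Several of the excerpts above describe EDA as examining data for distribution, outliers, and anomalies. One widely used way to do this numerically is a five-number summary combined with the 1.5 x IQR rule commonly attributed to Tukey. The sketch below is an illustration under stated assumptions, not a method taken from any of the excerpted sources: the function name `tukey_fences`, the sample values, and the multiplier default are hypothetical, and quartiles come from Python's `statistics.quantiles` with its default method (other quartile conventions give slightly different fences).

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (low_fence, high_fence, outliers) using the k x IQR rule.

    Quartiles come from statistics.quantiles with its default 'exclusive'
    method; this is one convention among several.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return low, high, outliers


# Made-up measurements with one value far from the rest.
values = [2, 3, 3, 4, 5, 5, 6, 7, 8, 40]
low, high, outliers = tukey_fences(values)

print("min/median/max:", min(values), statistics.median(values), max(values))
print("fences:", low, high)
print("flagged as outliers:", outliers)
```

A flagged value is a prompt for closer inspection (for example, a data-entry error versus a genuine extreme), not an automatic reason to discard it.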