Computer Science > Human-Computer Interaction

Title: Goals, Process, and Challenges of Exploratory Data Analysis: An Interview Study

Abstract: How do analysis goals and context affect exploratory data analysis (EDA)? To investigate this question, we conducted semi-structured interviews with 18 data analysts. We characterize common exploration goals: profiling (assessing data quality) and discovery (gaining new insights). Though the EDA literature primarily emphasizes discovery, we observe that discovery only reliably occurs in the context of open-ended analyses, whereas all participants engage in profiling across all of their analyses. We describe the process and challenges of EDA highlighted by our interviews. We find that analysts must perform repetitive tasks (e.g., examine numerous variables), yet they may have limited time or lack domain knowledge to explore data. Analysts also often have to consult other stakeholders and oscillate between exploration and other tasks, such as acquiring and wrangling additional data. Based on these observations, we identify design opportunities for exploratory analysis tools, such as augmenting exploration with automation and guidance.
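The profiling goal described in the abstract (assessing data quality across numerous variables) is the kind of repetitive task that is easy to script. The following is a minimal sketch only, not a method taken from the paper: the function names `profile_column` and `profile_table`, the row-of-dicts table layout, and the sample data are all assumptions made for illustration.

```python
import math
import statistics
from collections import Counter


def profile_column(values):
    """Summarise one column: size, missing count, distinct count, basic stats."""
    # Treat None and float NaN as missing (an assumption of this sketch).
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    summary = {
        "n": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric and len(numeric) == len(present):
        # Purely numeric column: report range and median.
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
        summary["median"] = statistics.median(numeric)
    else:
        # Otherwise report the most frequent value, if there is one.
        summary["top"] = Counter(present).most_common(1)[0][0] if present else None
    return summary


def profile_table(rows):
    """Profile every column of a table given as a list of dicts."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample data, invented for this example.
rows = [
    {"age": 34, "city": "Oslo"},
    {"age": None, "city": "Oslo"},
    {"age": 51, "city": "Lima"},
    {"age": 29, "city": None},
]
report = profile_table(rows)
for column, summary in report.items():
    print(column, summary)
```

Looping one summary function over every column is the point of the sketch: it replaces a per-variable manual check with a single pass, which is one simple form of the automation the abstract suggests as a design opportunity.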


Role of Exploratory Data Analysis in Data Science


The use of Exploratory Data Analysis in Information Retrieval Research


  • Warren R. Greiff

Part of the book series: The Information Retrieval Series (INRE, volume 7)


We report on a line of work in which techniques of Exploratory Data Analysis (EDA) have been used as a vehicle for better understanding of the issues confronting the researcher in information retrieval (IR). EDA is used for visualizing and studying data for the purpose of uncovering statistical regularities that might not be apparent otherwise. The analysis is carried out in terms of the formal notion of Weight of Evidence (WOE). As a result of this analysis, a novel theory in support of the use of inverse document frequency (idf) for document ranking is presented, and experimental evidence is given in favor of a modification of the classical idf formula motivated by the analysis. This approach is then extended to other sources of evidence commonly used for ranking in information retrieval systems.
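For readers unfamiliar with the two quantities named in the abstract, the sketch below computes the classical idf, log(N / n_t), and I. J. Good's weight of evidence, the log of the likelihood ratio P(e | h) / P(e | not h). It is an illustration under stated assumptions only: the function names and the counts are invented for the example, and the chapter's modified idf formula is not reproduced here because this page does not give it.

```python
import math


def idf(num_docs, doc_freq):
    """Classical inverse document frequency: log(N / n_t)."""
    if not 0 < doc_freq <= num_docs:
        raise ValueError("doc_freq must be in the range 1..num_docs")
    return math.log(num_docs / doc_freq)


def weight_of_evidence(p_e_given_h, p_e_given_not_h):
    """Weight of evidence for hypothesis h provided by evidence e.

    Defined as log(P(e | h) / P(e | not h)), which equals the change in
    the log-odds of h after observing e.
    """
    if p_e_given_h <= 0 or p_e_given_not_h <= 0:
        raise ValueError("both probabilities must be positive")
    return math.log(p_e_given_h / p_e_given_not_h)


# Hypothetical counts: a term occurs in 10 of 1000 documents overall,
# in 8 of 20 relevant documents and in 2 of 980 non-relevant ones.
term_idf = idf(1000, 10)
term_woe = weight_of_evidence(8 / 20, 2 / 980)
print(term_idf, term_woe)
```

Here h is "the document is relevant" and e is "the term occurs in the document". A positive weight of evidence means that observing the term raises the odds of relevance.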

  • Information Retrieval
  • Relevant Document
  • Term Frequency
  • Exploratory Data Analysis

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Author information

Authors and affiliations.

The MITRE Corporation, Bedford, Massachusetts

Warren R. Greiff


Editor information

Editors and affiliations.

University of Massachusetts, Amherst

W. Bruce Croft


Copyright information

© 2002 Kluwer Academic Publishers

About this chapter

Greiff, W.R. (2002). The use of Exploratory Data Analysis in Information Retrieval Research. In: Croft, W.B. (eds) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_2


DOI: https://doi.org/10.1007/0-306-47019-5_2

Publisher Name: Springer, Boston, MA

Print ISBN: 978-0-7923-7812-9

Online ISBN: 978-0-306-47019-6

eBook Packages: Springer Book Archive


COMMENTS

  1. Goals, Process, and Challenges of Exploratory Data Analysis: An

    Exploratory data analysis stems from the collection of work by the statistician John Tukey in the 1960s and 1970s [24,39,40,67]. His seminal book [67] compiles a collection of data visualization techniques as well as robust and non-parametric statistics for data exploration. Many communities including Statistics, Human-Computer In-

  2. Challenges in Incorporating Exploratory Data Analysis Into Statistical Workflow

    Challenges in Incorporating Exploratory Data Analysis Into Statistical Workflow. manually override defaults, in some cases (Figures 4, 5) we failed to get the specifications to our liking on all details. This is as much the result of our (perhaps unwise) choice to work outside of tools in our comfort zone as it is poor defaults as Unwin suggests.

  3. PDF Chapter 4 Exploratory Data Analysis

    ing at numbers to be tedious, boring, and/or overwhelming. Exploratory data analysis techniques have been devised as an aid in this situation. Most of these techniques work in part by hiding certain aspects of the data while making other aspects more clear. Exploratory data analysis is generally cross-classified in two ways. First, each

  4. (PDF) Exploratory Data Analysis

    15.1 Introduction. Exploratory data analysis (EDA) is an essential step in any research analysis. The primary aim with exploratory analysis is to examine the data for distribution, outliers and ...

  5. PDF Exploratory Data Analysis on Multivariate Data

    The goal of this thesis is to apply Exploratory Data Analysis to the qualitative and quantitative dataset, to answer the research questions on the characteristics of the users, patterns of their behavior, and at the same time on the influence of environmental factors. Multiple data mining models have been applied in the thesis, including

  6. PDF Exploratory Data Analysis

    Such critiques motivate our proposal that research on supporting exploratory visual analysis should embrace theories of graphical inference. In the following section we propose an alternative understanding of exploratory visual analysis as guided by model checks, and describe possible formalizations of this theory.

  7. Exploratory Data Analysis

    Exploratory Data Analysis (EDA) is an approach advocated by renowned statistician J. W. Tukey and others. It uses data visualization as applied to raw data or summarized information (Chapter 5) from a dataset to understand relationships within a dataset. It may be used to discover patterns which can then be tested using standard inferential statistics (Chapter 6).

  8. PDF Data Visualization in Exploratory Data Analysis: an Overview of Yingsen

    of the first half of the twentieth century. Exploratory data analysis (EDA), on the contrary, had not been widely used until the groundbreaking work of Tukey [6] [7]. Tukey [7] stated that exploratory data analysis is about looking at data to see what it seems to say. He argued that data analysis is not just testing a pre-defined hypothesis and ...

  9. Goals, Process, and Challenges of Exploratory Data Analysis: An

    How do analysis goals and context affect exploratory data analysis (EDA)? To investigate this question, we conducted semi-structured interviews with 18 data analysts. We characterize common exploration goals: profiling (assessing data quality) and discovery (gaining new insights). Though the EDA literature primarily emphasizes discovery, we observe that discovery only reliably occurs in the ...

  10. PDF Chapter 15 Exploratory Data Analysis

    15.1 Introduction. Exploratory data analysis (EDA) is an essential step in any research analysis. The primary aim with exploratory analysis is to examine the data for distribution, outliers and anomalies to direct specific testing of your hypothesis. It also provides tools for hypothesis generation by visualizing and understanding the data ...

  11. PDF Exploratory Data Analysis in Schools: A Logic Model to Guide ...

    Abstract. Exploratory data analysis (EDA) is an iterative, open-ended data analysis procedure that allows practitioners to examine data without pre-conceived notions to advise improvement processes and make informed decisions. Education is a data-rich field that is primed for a transition into a deeper, more purposeful use of data.

  12. Exploratory Data Analysis

    Exploratory data analysis is a set of techniques that have been principally developed by Tukey, John Wilder since 1970. The philosophy behind this approach is to examine the data before applying a specific probability model. According to Tukey, J.W., exploratory data analysis is similar to detective work. In exploratory data analysis, these ...

  13. PDF 2 Exploratory Data Analysis and Graphics

    2 Exploratory Data Analysis and Graphics. This chapter covers both the practical details and the broader philosophy of (1) reading data into R and (2) doing exploratory data analysis, in particular graphical analysis. To get the most out of the chapter you should already have some basic knowledge of R's syntax and commands (see the R ...

  14. (PDF) Visualization and Explorative Data Analysis

    Exploratory data analysis (EDA) is a well-established statistical tradition that provides conceptual and computational tools for discovering patterns to foster hypothesis development and refinement.

  15. A Data Scientist's Essential Guide to Exploratory Data Analysis

    Introduction. Exploratory Data Analysis (EDA) is the single most important task to conduct at the beginning of every data science project. In essence, it involves thoroughly examining and characterizing your data in order to find its underlying characteristics, possible anomalies, and hidden patterns and relationships.

  16. Exploratory Research

    Exploratory research can help you narrow down your topic and formulate a clear hypothesis and problem statement, as well as giving you the "lay of the land" on your topic. Data collection using exploratory research is often divided into primary and secondary research methods, with data analysis following the same model. Primary research

  17. Exploratory Data Analysis

    Abstract. Exploratory Data Analysis (EDA) is an approach employed to analyze datasets. Primarily, EDA uses data visualization methods and often statistical models to (i) assess a dataset's general structure, (ii) obtain descriptive summaries of the data, and (iii) provide the basis for model formulation. EDA includes checks on data quality ...

  18. Exploratory Data Analysis: A Practical Guide and Template for

    This is where Exploratory Data Analysis (EDA) comes to the rescue. According to Wikipedia, EDA "is an approach to analyzing datasets to summarize their main characteristics, often with visual methods". In my own words, it is about knowing your data, gaining a certain amount of familiarity with the data, before one starts to extract insights ...

  19. PDF Exploratory Data Analysis Using Network Based Techniques

    The largest representation of our world is written by data, usually digital data. The analysis of these data is the key to understand our world better. Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.

  20. Exploratory Data Analysis

    Exploratory data analysis (hereafter denoted as EDA) is not a technique but it is a paradigm or strategy for robust analysis of data (Tukey 1977). It comprises descriptive, statistical, and, largely, graphical investigations meant for (a) obtaining the greatest understanding of a set of data, (b) revealing structure of data, (c) determining important data variables, (d) determining outlying and ...

  21. Role of Exploratory Data Analysis in Data Science

    Exploratory Data analysis (EDA) is one of the hidden and mundane tasks in analysis of Data, as a Model, Project or analysis is based on data, which is intuitive, extremely heterogenous and distorted in its form. (Data has become an integral part of every project, Model &) The analyzed data is more insightful for identifying and improving extremely critical business insights across the ...

  22. Exploratory Data Analysis

    This is done not only to familiarize yourself with all the data you have collected, but also to reduce the workload during analysis. The initial data investigation has been termed exploratory data analysis or EDA and it primarily focuses on visually inspecting the data. The main aim of EDA is to understand what data you have, what possible ...

  23. The use of Exploratory Data Analysis in Information ...

    Abstract. We report on a line of work in which techniques of Exploratory Data Analysis (EDA) have been used as a vehicle for better understanding of the issues confronting the researcher in information retrieval (IR). EDA is used for visualizing and studying data for the purpose of uncovering statistical regularities that might not be apparent ...