Wednesday, June 16, 2010

The 10 problems

One of our goals is to define a list of 10 problems/challenges which we can address successfully over the next year. This follows Jim Gray's idea of the 20 queries: some specific requirements which can focus our work. This blog is a good forum to start suggesting ideas. So, let me start with some suggestions:

1. Provide a user-friendly data mining service which is easy for a typical astronomer to learn and apply, and which provides at least some new functionality that is not commonly available. Obviously, this is something that can grow over time.

2. Create a virtual exchange forum for educational materials and experiences for teaching the methods of computational, data-driven science on a serious undergraduate or graduate level.

3. Provide a user-friendly visualization package/service/toolkit for exploration of high-dimensional data. It could be more than one approach.

4. Establish an effective virtual forum where people interested in astroinformatics can meet and exchange and debate ideas, both in real time and in an off-line fashion. It could be a combination of different tools and approaches.

4a. Have an entirely virtual conference/workshop in this arena.

5. Establish an electronic, non-commercial, peer-reviewed journal for large data sets/archives, algorithms, and other astroinformatics tools and methods.

Let's hear more ideas, and the comments and improvements on these!


  1. 1. What is "data mining" here? That is, where along a search-to-computation (a la semantic) spectrum would this service fall? Maybe we could simply teach the typical astronomer what tools are available now.

    4. my suggestion would be to adopt a Q/A social website model a la stackoverflow. doing so would ensure the audience includes the help/guide me/ I need to know element that might be missed in a pure forum format.

    5. seems to me that the VAO would meet a functional goal (Hanisch presentation, page 10 on Data Preservation & Curation) by partnering in this endeavor. Maybe users understand the value of publication & citation a bit better than archive (in a repository).

    Neither effort (a new journal like this or a new VAO repository) can have an impact if users aren't pulling the algorithm/data/code back out for reuse and effectively giving the authors the attribution they deserve. I've tried to figure out the value of the algorithm journals like Nature Methods, BioTechniques, etc. They frequently seem to be based on the pub as paper model, and don't contain "digital" assets that enable efficient/accurate reference and reuse.

  2. I'm not sure I agree with (5). There's no reason that papers on algorithms, methods, datasets and tools can't be published as regular papers in regular journals (see e.g. here). Moreover, creating a dedicated journal would be counter-productive to raising the profile of AstroInformatics, as it would make it seem like "another discipline", rather than integral to the way we work in astronomy. We need to work with existing journals to provide better ways of publishing alternative content like data or code.

  3. On challenge number 1. Data mining tools have been built several times in the past, and I have been close to some of them. Some of the questions driving the creation of "data mining tools" must include these:

    (1) Simplicity or feature-rich
    When explaining the tool that has been built, many astronomers say that it would be really useful -- but only if feature X is included. Unfortunately, each customer has a different idea of what the critical X factor is. Making something simple and robust may not offer functionality that can attract astronomers; making it feature-rich delays release, increases cost, and reduces robustness.

    (2) Handling real data
    A non-robust data mining tool may require all cells in the table to be filled in, may not be able to handle upper-limit data, may not understand that magnitude = 99.0 is missing data, may not be able to handle text fields with strange characters, and so on. Often much of the work of any data mining enterprise is getting the input dataset into good enough shape that the algorithm can handle it. How much effort should be spent on the actual data mining algorithm, and how much on data input and cleaning?

    (3) Too complex to understand
    It is a temptation to build a sophisticated tool, but scientists like to know what is really happening in their analysis, and a data mining tool that appears to be magical will not be well-received. Use of terminology and acronyms from outside of astronomy will push away the astronomers. Examples are critical, especially those focused on common tasks that the astronomer wants to do.
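    To make point (2) concrete, here is a minimal sketch of the kind of input cleaning described there. It is only an illustration under stated assumptions: the sentinel values (99.0 and -99.0), the convention that limits are marked with a leading ">" or "<", and the function name `parse_magnitude` are all hypothetical, and real catalogs differ.

```python
import math

# Assumed sentinel values that mean "no measurement"; real catalogs differ.
MISSING_SENTINELS = {99.0, -99.0}


def parse_magnitude(cell):
    """Parse one table cell into (value, is_limit).

    value is NaN when the cell is empty, unparseable, or a sentinel.
    is_limit is True when the cell was written as a limit, e.g. ">21.5".
    """
    text = (cell or "").strip()
    if not text:
        return math.nan, False  # empty cell: treat as missing

    # Assumed convention: a leading '>' or '<' marks a limit, not a detection.
    is_limit = text[0] in "<>"
    if is_limit:
        text = text[1:].strip()

    try:
        value = float(text)
    except ValueError:
        return math.nan, False  # strange characters: treat as missing

    if value in MISSING_SENTINELS:
        return math.nan, False  # sentinel such as 99.0: treat as missing

    return value, is_limit
```

    Keeping the limit flag separate from the value lets a downstream algorithm decide for itself whether to drop limits or model them, rather than having the cleaning step silently discard them.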

  4. 1. I think the expression 'data mining service' is not well enough defined. If I have a Tb of data, would I need to upload it to an online service in order to be able to analyze it? Or would this be specifically to data mine VO resources? If so, how is it different from standard VO tools? I think a database of existing tools and techniques would be more useful than trying to build a monolithic 'service'.

    3. I don't think we should necessarily once more reinvent the wheel and develop our own visualization package/service, but rather make the information available to users as to what tools are already available, including HOWTO guides, etc. We can use tools developed for other disciplines (see e.g. the Astronomical Medicine Project at

    4. I agree with Gus's comment that a stackoverflow-like model would work well, at least for a Q/A part of such an exchange medium. There are existing open source clones of stackoverflow, so such a system would not need to be developed from scratch.

    5. I think a better approach would be to make sure that all journals recognize the need for a section on methods/astrostatistics/codes, as a few already do. I agree with Sarah that creating a separate journal would make it look like astroinformatics is not relevant to 'regular' astronomers, when it clearly is. However, one useful thing to do might be an AstroInformatics newsletter (like e.g. the star formation newsletter at, which acts as a central collection of papers that relate to astroinformatics. This would require one person to maintain the newsletter, but would be much less effort than creating a whole new journal from scratch.

  5. I agree with 1-4

    I disagree with 5. Echoing other comments, I also think that establishing a new journal for astroinfo is the wrong way to go, if the aim is to obtain recognition for practitioners (which I would endorse). Instead, we must make sure that established journals (e.g. ApJ) publish astroinfo papers.

  6. Here is a suggestion for #6 that came up over lunch: the creation of an "astroinformatics" keyword, or even a set of "astroinformatics: ..." keywords to be used in refereed publications.

  7. To follow up on astrofrog, one could then collate by keyword from arxiv to make a newsletter/site. If we could somehow html the paper, we could have reddit-like discussions at a granularity smaller than the paper.

    Note that none of this means we should publish a journal. But I suspect this sort of discussion, if it takes off, would make a better paper before it got to the referee.
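    As a rough sketch of the collation step, assuming paper metadata has already been fetched into plain dictionaries: the record layout (`title`, `keywords`) and the function name `collate_by_keyword` are hypothetical, and nothing here queries arXiv itself. It groups papers by the part after the prefix in "astroinformatics: ..." keywords.

```python
from collections import defaultdict

PREFIX = "astroinformatics"


def collate_by_keyword(papers):
    """Group paper titles by their 'astroinformatics: <topic>' keywords.

    papers: iterable of dicts with a 'title' and an optional 'keywords' list
    (a hypothetical layout). A bare 'astroinformatics' keyword is filed
    under 'general'.
    """
    groups = defaultdict(list)
    for paper in papers:
        for keyword in paper.get("keywords", []):
            head, _, tail = keyword.partition(":")
            if head.strip().lower() == PREFIX:
                topic = tail.strip().lower() or "general"
                groups[topic].append(paper["title"])
    return dict(groups)
```

    Each group could then become one section of a newsletter or one page of a site.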

  8. Since we are looking for 10 goals we can achieve over the next year, I think that one that could easily be achieved is the creation of a page (or a series of pages) on AstroInformatics on Wikipedia. This might help define the term and popularize its use. If I tell colleagues that I have been to an astroinformatics conference, their first question will be 'what is astroinformatics?'

  9. I agree that the creation of a wikipedia page will be a great first step in the definition and promotion of astroinformatics. Since a wikipedia page is typically the first introduction to a subject for many people, I can see how this resource would be very powerful. As examples, the first links on google searches for bio- and cheminformatics are wikipedia pages.

  10. Here's another suggestion: set up a system like a Digital Object Identifier (or even use DOIs) which describes a set of data used in a publication. Note that this identifier doesn't just refer to a dataset, but can identify very specifically the subset of data used in the publication. Then (assuming the dataset is in the public domain, which most major datasets will be) if I publish an astronomical result, someone else can access the same data to reproduce my result, or perhaps add an alternative interpretation.

  11. Ray's idea is a good one, since a single DOI for the whole LSST data would not be very useful for example. The DOI for a subset of data idea is along the lines of the Data Tag concept in IRSA.
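    One possible way to sketch a per-subset identifier, offered only as an illustration: hash a canonical description of the selection together with the parent dataset's identifier. This is not how DOIs are minted and not how the IRSA Data Tag works; the identifier format, the function name `subset_identifier`, and the dataset and selection values in the example are all hypothetical.

```python
import hashlib
import json


def subset_identifier(dataset_id, selection):
    """Derive a stable identifier for a subset of a dataset.

    dataset_id: identifier of the parent dataset (hypothetical string).
    selection:  JSON-serializable dict describing the subset, e.g. the
                release, the sky region and the cuts that were applied.
    """
    # Sort keys and fix separators so that the same selection always
    # serializes to the same bytes, regardless of dict ordering.
    canonical = json.dumps(
        {"dataset": dataset_id, "selection": selection},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{dataset_id}/subset-{digest[:16]}"


# Example with made-up values: the same selection yields the same identifier.
example_id = subset_identifier(
    "example-survey",
    {"release": "DR1", "ra": [150.0, 151.0], "dec": [2.0, 3.0], "mag_max": 22.5},
)
```

    An identifier like this only pins down the data if the selection includes the data release or version, since the same query against a later release may return different rows.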

  12. More and more astronomers will need access to distributed resources in the future, and will need to know how to access them, what technologies to use, etc. Would it be a good idea to create a web page/social network/blog that compiles this information? For example, I have been looking into cloud computing. How would an astronomer go about finding out which service to use? What apps are best to run on the cloud? Where are the traps? What are the best practices?

    Bruce Berriman

  13. I would add two simple points to the 10 problems:

    6) Create a simple dictionary of AstroInformatics-specific terms for classical astronomers. Different jargons from multiple disciplines (statistics, data mining, semantics, computer science, web design, sociology, etc.) were used during the conference by speakers with completely/partially different expertise, leading to some confusion in most/part of the audience in all/most discussions and talks. Let people not involved in a specific aspect of AstroInformatics (or complete newcomers to this field) understand what we are really talking about.

    7) Sketch a specific path for professional advancement and "success recognition" in "AstroInformatics": surprisingly, at the conference there was little or no discussion about possible new models for the acknowledgement/evaluation of the work done by people in this emerging field, which usually allows only marginal active involvement in "science" and research. This, in turn, leads to few published papers, few citations and poor (from a classical point of view) CVs, and consequently few job opportunities. IMHO, this point should be addressed asap, or there is the risk that the development of what we call AstroInformatics could be hindered by the lack of commitment from young researchers, or (and this could be worse for AstroInformatics...) left only to unconventionally minded academics with permanent positions and plenty of time to spare.

  14. I think Raffaele's #6 is very doable and valuable. I know I would benefit from the process of processing the terms into something more generally digestible.

  15. At AI2010 the use of statistical terminology was twice offered as a "problem" for Astro-informatics. So I'd like to point people to the jargon mapping provided by the CHASC AstroStatistics Group:

    See links "Astro Jargon for Statisticians" and "Stat jargon for Astronomers"

  16. I note that an entry for Astroinformatics has appeared on Wikipedia recently (Aug 18) but it really needs editing:

  17. #8 Go out to our respective academic communities, find an informatics person in another field (bio,geo,etc), meet them and talk about their training, tools, issues of attribution and career building, data. Then report this back to this blog or to another "wiki" to see if we can glean best practices from other fields.