Building a new standard for an eye-tracking API
This discussion is about making eye-tracking devices as easy and transparent to use in end-user software development as the mouse. A standard way to control devices and access their data, one that everyone agrees on and supports, is expected to help achieve this goal.
Hi,
I'm highly interested in helping set standards for eye-tracking. And here are my first thoughts in the discussion:
- It is important that, when implementing a standard eye-tracking API, we rely on multi-platform and open-source technologies rather than platform-specific libraries. This would save reimplementation effort (should there be changes in the standard) while reaching the largest possible number of users and programmers.
- Such a standard should work in different layers:
* The lowest layer is only concerned with getting raw data from the device and returning something like "The user is looking at the coordinates X,Y of the screen" or "This device's accuracy is 2.0 degrees".
* Higher layers should provide messages that convey the user's intention: "Activate that drop-down menu", "Select the third entry in that drop-down menu". Mouse emulation, though popular, is hardly sufficient with widgets that can be left-clicked, right-clicked, dragged, react to keypresses, etc. On higher levels it's better to have messages with more meaning.
* Higher layers should also proactively activate extra tools to facilitate the user's interaction, such as "A text input area was selected, let's automatically activate the system's default gaze text-input software".
* Higher layers should also be able to disambiguate the user's intention when necessary. For example: "The user is trying to push either button 1 or button 2, but I'm not sure which. Let's call disambiguation software to help us decide what the user wants. Maybe we'll automatically zoom in on this area or push the buttons away from each other so it is easier to tell."
- There's great potential for creating an intensely active market for eye-trackers if it becomes easy to develop for them:
* Webcam providers could experience an increase in webcam sales due to lower-accuracy software eye-trackers, such as OpenGazer.
* This could create a more active market for head-mounted cameras to be used for eye-tracking. Current producers of webcams could enter this market with little effort.
* Producers of more specialized eye-trackers could experience an increase of consumer interest in the area. Plus they'd have the advantage of providing higher-accuracy eye-trackers, and even eye-trackers that have their own processor for eye-tracking (rather than leeching processor time from the user's computer).
Hi Oleg,
As a researcher and developer of applications that use gaze as a PC input, I strongly agree that we need a standard ET API. So how do we go about achieving this?
Hello,
Thanks Oleg for taking the initiative! I think this discussion is overdue, and I hope it will give us a sketch of an easy-to-use, fast, and reliable API.
During the development of OGAMA we needed to create a very basic API interface similar to the ETU-Driver definition. I would like to show it here for discussion, as the lowest level of implementation that each hardware manufacturer should be able to provide. It is written in C#.
code:
/// <summary>
/// This interface makes it possible to add new tracking hardware
/// to the <see cref="RecordModule"/>.
/// </summary>
/// <remarks>For an example of how to implement this interface, have a look
/// at the existing implementations: MouseOnlyInterface
/// for the tracking of mouse data, Tobii.TobiiInterface for
/// tracking with a Tobii (www.tobii.com) system, and AleaInterface for
/// tracking with an Alea Technologies (www.alea-technologies.com) system.
/// Please also refer to the RecordModule source to add the
/// new tracker to the user interface. Each tracker should have its own
/// TabPage in the RecordModule.</remarks>
public interface ITracker
{
    /// <summary>
    /// An implementation of this event should
    /// send the new sampling data at each sampling time interval.
    /// </summary>
    event GazeDataChangedEventHandler GazeDataChanged;

    /// <summary>
    /// An implementation of this method should do all
    /// connection routines for the specific hardware, so that the
    /// system is ready for calibration.
    /// </summary>
    /// <returns><strong>True</strong> if successfully connected to the tracker,
    /// otherwise <strong>false</strong>.</returns>
    bool Connect();

    /// <summary>
    /// An implementation of this method should do the calibration
    /// for the specific hardware, so that the
    /// system is ready for recording.
    /// </summary>
    /// <returns><strong>True</strong> if successfully calibrated,
    /// otherwise <strong>false</strong>.</returns>
    bool Calibrate();

    /// <summary>
    /// An implementation of this method should start the recording
    /// for the specific hardware, so that the
    /// system sends <see cref="GazeDataChanged"/> events.
    /// </summary>
    void Record();

    /// <summary>
    /// An implementation of this method should do a clean up
    /// for the specific hardware, so that the
    /// system is ready for shut down.
    /// </summary>
    void CleanUp();

    /// <summary>
    /// An implementation of this method should supply
    /// the specific hardware system's current time, so that the
    /// recorder can retrieve system times.
    /// </summary>
    /// <returns>A <see cref="Int64"/> with the current time in
    /// milliseconds.</returns>
    long GetCurrentTime();

    /// <summary>
    /// An implementation of this method should show a hardware
    /// system specific dialog to change its settings like
    /// sampling rate or connection properties. It should also
    /// provide XML serialization of the settings,
    /// so that the user can store and back up system settings in
    /// a separate file. These settings should be implemented in
    /// a separate class and are stored in a special place in
    /// OGAMA's directory structure.
    /// </summary>
    /// <remarks>Please have a look at the existing implementation
    /// of the Tobii system in the namespace Tobii.</remarks>
    void ChangeSettings();
}
The messages sent contain the GazeData structure, which provides the basic input values for default screen-based devices.
code:
/// <summary>
/// Gaze data structure with fields that match the database columns
/// that correspond to gaze data. It is a subset of
/// <see cref="Modules.ImportExport.RawData"/>
/// </summary>
public struct GazeData
{
    /// <summary>
    /// Time in milliseconds from the start of the recording.
    /// </summary>
    public long Time;

    /// <summary>
    /// x-diameter of pupil
    /// </summary>
    public float? PupilDiaX;

    /// <summary>
    /// y-diameter of pupil
    /// </summary>
    public float? PupilDiaY;

    /// <summary>
    /// x-coordinate of gaze position in values ranging between 0..1
    /// </summary>
    /// <remarks>0 means left margin of presentation screen,
    /// 1 means right margin of presentation screen.</remarks>
    public float? GazePosX;

    /// <summary>
    /// y-coordinate of gaze position in values ranging between 0..1
    /// </summary>
    /// <remarks>0 means top margin of presentation screen,
    /// 1 means bottom margin of presentation screen.</remarks>
    public float? GazePosY;
}
This is the current implementation. It serves its purpose well, but as the basis for a standard API it has drawbacks:
The main problem is its restriction to 2D. Should we replace the X,Y values with degrees and add a DegreeToXY method to the ITracker interface?
It is missing a confidence value and is not suitable for tracking devices that track both eyes.
I will try to post a first GazeData structure that may serve as a basis for further discussion in the next post.
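To make the DegreeToXY question more concrete, here is a minimal sketch of what such a conversion might look like. It is not part of OGAMA. It assumes the eye sits on the normal through the screen centre, that angles are measured from that normal (positive to the right and downward), and that the horizontal and vertical angles can be treated independently. The class name, the parameter names, and the millimetre units are all assumptions made for illustration.

```csharp
using System;

public static class GazeGeometry
{
    // Converts horizontal/vertical gaze angles (degrees, 0 = screen centre)
    // into normalised screen coordinates (0..1, origin at the top-left),
    // matching the 0..1 convention of the GazeData structure above.
    // Assumption: the eye is centred in front of the screen at distanceMm.
    public static void DegreeToXY(
        double angleXDeg, double angleYDeg,
        double distanceMm, double screenWidthMm, double screenHeightMm,
        out double x, out double y)
    {
        // Offset from the screen centre on the screen plane: d * tan(angle).
        double offsetXMm = distanceMm * Math.Tan(angleXDeg * Math.PI / 180.0);
        double offsetYMm = distanceMm * Math.Tan(angleYDeg * Math.PI / 180.0);

        // Shift the origin to the top-left corner and normalise by screen size.
        x = 0.5 + offsetXMm / screenWidthMm;
        y = 0.5 + offsetYMm / screenHeightMm;
    }
}
```

A real tracker would need the actual eye position relative to the screen, which is one argument for letting the device do this conversion rather than the application.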
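For readers who want to see how an application might drive the ITracker interface, here is a minimal sketch of the connect / calibrate / record / clean-up sequence. The delegate signature and the event-args class are assumptions, since they are not shown in this post. The interface and GazeData are trimmed copies, and FakeTracker is a hypothetical device that emits synthetic samples synchronously, whereas a real tracker would most likely deliver them asynchronously.

```csharp
using System;

// Trimmed copy of GazeData (pupil fields omitted).
public struct GazeData
{
    public long Time;         // ms since start of recording
    public float? GazePosX;   // 0..1, null if not tracked
    public float? GazePosY;
}

// Assumed shape of the event arguments and delegate.
public class GazeDataChangedEventArgs : EventArgs
{
    public GazeData Data;
    public GazeDataChangedEventArgs(GazeData data) { Data = data; }
}

public delegate void GazeDataChangedEventHandler(object sender, GazeDataChangedEventArgs e);

// Trimmed copy of ITracker (clock and settings dialog omitted).
public interface ITracker
{
    event GazeDataChangedEventHandler GazeDataChanged;
    bool Connect();
    bool Calibrate();
    void Record();
    void CleanUp();
}

// Hypothetical tracker that emits a fixed number of synthetic samples.
public class FakeTracker : ITracker
{
    public event GazeDataChangedEventHandler GazeDataChanged;
    private readonly int sampleCount;

    public FakeTracker(int sampleCount) { this.sampleCount = sampleCount; }

    public bool Connect() { return true; }
    public bool Calibrate() { return true; }

    public void Record()
    {
        for (int i = 0; i < sampleCount; i++)
        {
            GazeData sample = new GazeData();
            sample.Time = i * 20;      // pretend 50 Hz sampling
            sample.GazePosX = 0.5f;    // always the screen centre
            sample.GazePosY = 0.5f;
            GazeDataChangedEventHandler handler = GazeDataChanged;
            if (handler != null) handler(this, new GazeDataChangedEventArgs(sample));
        }
    }

    public void CleanUp() { }
}

public static class RecorderDemo
{
    // Runs the full life cycle and returns the number of samples received.
    public static int Run(ITracker tracker)
    {
        int received = 0;
        GazeDataChangedEventHandler onSample = (sender, e) => received++;
        tracker.GazeDataChanged += onSample;
        try
        {
            if (!tracker.Connect() || !tracker.Calibrate()) return 0;
            tracker.Record();
        }
        finally
        {
            // Always detach the handler and release the device.
            tracker.GazeDataChanged -= onSample;
            tracker.CleanUp();
        }
        return received;
    }
}
```

Calling RecorderDemo.Run(new FakeTracker(5)) walks through every step; the application code never needs to know which hardware is behind the interface.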
Adrian Voßkühler
Freie Universität Berlin
http://didaktik.physik.fu-berlin.de/projekte/ogama
Hi all,
This post is intended to be a first sketch for a gaze data structure that should be flexible enough for use in a wide range of applications.
The '?' indicates that this value is allowed to be null.
code:
public struct GazeData
{
    public long Time;
    public Eye Eye;
    public Gaze? LeftGaze;
    public Gaze? RightGaze;
}
It contains several substructures as defined below:
code:
public enum Eye
{
    None,
    Left,
    Right,
    Both,
}

public enum Validity
{
    None,
    Poor,
    Uncertain,
    Good,
}

public enum Unit
{
    None,
    Millimeter,
    Pixel,
    Degree,
}

public enum PupilDataType
{
    None,
    XYDiameters,
    EllipseDiameters,
}

public struct Pupil
{
    public Unit DiameterUnit;
    public PupilDataType DiameterType;
    public double DiameterOne;
    public double DiameterTwo;
}

public struct Gaze
{
    public Pupil GazePupil;
    public Validity GazeValidity;
    public Unit GazePosUnit;
    public double? GazePosOne;
    public double? GazePosTwo;
}
So what do you think?
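As a quick check of whether the structure is convenient to consume, here is a minimal sketch of how an application might merge the two eyes into one gaze point. It uses trimmed copies of the types above (Pupil, Unit, and the Eye field are left out). The GazeCombiner class and its rule of accepting only samples rated Good are assumptions for illustration, not part of the proposal.

```csharp
public enum Validity { None, Poor, Uncertain, Good }

// Trimmed copy of the proposed Gaze structure.
public struct Gaze
{
    public Validity GazeValidity;
    public double? GazePosOne;   // first coordinate, e.g. x
    public double? GazePosTwo;   // second coordinate, e.g. y
}

// Trimmed copy of the proposed GazeData structure.
public struct GazeData
{
    public long Time;
    public Gaze? LeftGaze;
    public Gaze? RightGaze;
}

public static class GazeCombiner
{
    // An eye is used only if it is present, rated Good,
    // and carries both coordinates.
    private static bool IsUsable(Gaze? gaze)
    {
        return gaze.HasValue
            && gaze.Value.GazeValidity == Validity.Good
            && gaze.Value.GazePosOne.HasValue
            && gaze.Value.GazePosTwo.HasValue;
    }

    // Averages the usable eyes; returns false if neither eye is usable.
    public static bool TryCombine(GazeData data, out double x, out double y)
    {
        double sumX = 0, sumY = 0;
        int count = 0;

        if (IsUsable(data.LeftGaze))
        {
            sumX += data.LeftGaze.Value.GazePosOne.Value;
            sumY += data.LeftGaze.Value.GazePosTwo.Value;
            count++;
        }
        if (IsUsable(data.RightGaze))
        {
            sumX += data.RightGaze.Value.GazePosOne.Value;
            sumY += data.RightGaze.Value.GazePosTwo.Value;
            count++;
        }

        x = count > 0 ? sumX / count : 0;
        y = count > 0 ? sumY / count : 0;
        return count > 0;
    }
}
```

The same code handles monocular and binocular devices, which is the flexibility the nullable LeftGaze and RightGaze fields are meant to provide.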
Best wishes,
Adrian Voßkühler
Freie Universität Berlin
http://didaktik.physik.fu-berlin.de/projekte/ogama/
adrian.vosskuehler@fu-berlin.de
Hi there,
I strongly encourage you not just to think about technologies like COM / low level API / platform dependency etc.
It seems very important to me to introduce layers of abstraction for data access and for eye tracker control as well. Strict requirements engineering is necessary so that we do not get lost in too much functionality but can still support power users who want maximum control.
Even in this short thread we have a user who comes from the gaze interaction side and wants data such as desktop object selection from the eye tracker. Gaze analysis people want just gaze data (monocular or binocular). Some researchers rely on the fixation/saccade algorithms that hardware manufacturers provide; others like to implement their own filters and event detection algorithms. The next one wants to work on a pupillometry application with pupil diameter only.
An API that provides only one level of data access complexity will either scare away users who just want to write a tiny, simple eye chess application, or it will be insufficient for researchers who want everything, from head acceleration to pupil diameter and millisecond-precise binocular gaze data with confidence measures.
Anyway, first you should collect the application fields you want to address. Can we assume that, say, researchers who usually write every analysis on their own in Matlab will not use the API to access raw pupil position? I have seen researchers who don't even rely on the calibration of the eye tracker. They just accessed the raw pupil position in video pixels and implemented their own gaze mapping.
It sometimes helps to think not only about what you want but also about what you don't want.
The focus is on gaze interaction, right? The more functionality is put into the eye tracker, the better? This would exclude a couple of low-tech systems that provide only gaze x/y. Eye trackers designed for the gaze interaction market (Tobii/Alea) already bring a number of convenience functions, such as snapping to desktop elements, cumulative dwell on desktop elements, mouse click and zoom functions, and tracking status windows. Should all this be part of the API as well, or should developers of gaze interaction applications do this themselves?
Lars Hildebrandt
VP Development
www.alea-technologies.de
Maybe I expressed myself poorly when talking about layering and platform independence.
Having a standard driver, such as the ETU, is important so that different devices can be treated similarly, without much need to redo driver-handling code. If there's a difference in accuracy, number of eyes tracked, the rate at which the eye tracker's information is updated, or anything else, the device itself can report it.
I might have jumped too far ahead when I was talking about desktop behavior, but the point still holds. Mapping eye-tracking information into mouse events is relatively straightforward, but it is a sub-optimal solution. I'll hold that thought for the moment and bring it up again once there is more progress in lower-level API standardization.
Oleg, could you provide more information about the ETU driver you're talking about? Is it freely available? Is it open-source? What kind of information does it gather from a device? Can you provide some links?
-Thiago
just found out about this thread. Great initiative.
How about something along the lines of Adrian's suggestion? The gaze data structure looks OK, although I would use double for the timestamp (there are already trackers with sub-millisecond resolution / higher than 1 kHz sampling rate). The API specification is basic enough. I'm only a user, but have found that the listed functionality covers about 95% of my needs. I think C# is not really an option for cross-platform development though--why not stick to plain old C? Maybe look at BSD interfaces as examples of good design? KISS. You can always provide a higher level abstraction layer later.
In another thread a 50 ms delay was mentioned as desirable; this is definitely too long for research purposes.
my 2 cts, jochen
Hi Oleg,
oleg wrote:
Lars, you are absolutely right talking about incompatible needs from various ET users and layers of abstractions. Here (actually, in the "API type" topic) I'm trying to start with the simplest level: come-and-get-gaze-data. No video pixels, however, - too low already. I suppose, all other possible layers will be based on this basic level, and I already mentioned that (possible) specification of such levels is the next task.
I just want to toss out a few points that might help you settle some decisions in advance. Providing binocular gaze x/y in screen pixels or degrees at the lowest level is a decision, and it will already exclude some researchers from using the API. As mentioned before, I have met some who just wanted pupil position in video pixels and did the screen mapping on their own. I am fine with the decision, but I want to stress that it matters which application field you are coming from. Researchers have a different view of the needs of an API than application developers for usability testing or gaze interaction. Depending on who designs such an API, it will have a particular flavor. For instance, developers of AAC applications like TheGrid, Mind Express, Dasher, or Viking will not even consider the API if you tell them: here is gaze x/y, please detect your own fixations, saccades, blinks, and eye gestures. I have the feeling that it is worth thinking about the higher levels as well before implementing the lowest level of data access. Will you rely on the manufacturers to provide abstract data? Or do you want to implement the functionality of such a layer on your own? Most researchers are conditioned to work with gaze x/y, but is this necessary? The reason is probably that users of eye trackers have trusted the hardware manufacturers to detect the pupil and calculate the gaze properly, but never trusted them to calculate fixations properly. On the other hand, many established hardware manufacturers were too lazy to make eye trackers more comfortable to use. Defining gaze x/y as the lowest level of data access is therefore a very implicit decision.
Another thought to consider: the Swedish market leader doesn't give you access to gaze x/y with half of their products unless you pay a fortune for the API. Other manufacturers might follow this strategy. Gaze x/y is common and essential for researchers, but once you leave universities as the application field, the trend is toward higher levels of data abstraction. The disability and gaze interaction markets have left the research/analysis market behind in numbers. If you want to bring the API out of that niche, you should reflect this in the design of the API, not just as an add-on put on top of the lowest level. Higher data abstraction layers don't just work on the standard gaze x/y data. Sophisticated event detection and eye gesture detection algorithms take much more internal data from the image processing.
I don't want to push anyone in a particular direction. I just want to raise questions that might help settle decisions and broaden the field of view.
Lars Hildebrandt
VP Development
www.alea-technologies.de
Hi Oleg,
Don't get me wrong. I don't want to be the showstopper. I am as enthusiastic as you are about API standardization, but every thought and argument needs a cold shower if you don't want to build on sand. Arguments that survive the first skeptical attack are more likely to be a solid basis for decisions.
oleg wrote:
Probably I'm too bad at making marketing forecasts, but this is my vision and belief.
I share your vision, but in addition I am saying that if you want to get more people into the boat, not just the usual suspects from universities, I would enlarge the scope of the API by design. There are tons of applications out there in the AAC market that are waiting for eye tracker data input, but the developers, often just one person per software product, don't have the time and the knowledge to do anything useful with just gaze x/y. They need more comfort if they are to use a standardized API.
oleg wrote:
I leave the implementation of higher layers for third-party developers. My guess was that once we get gaze data in some standard way, we may apply the algorithms implementing high-level functionality, discriminating them by the parameters of the ET system. Or am I wrong?
Without going into details: much high-level functionality can't be derived from a standardized set of parameters. State-of-the-art eye trackers have tons of internal parameters such as head speed, head acceleration, image contrast, facial features, pupil-iris contrast, iris structures, history of pupil data, and confidences for all measures. High-level algorithms for gesture recognition or fixation detection don't just take gaze x/y; they take many more parameters. The published event detection algorithms, as Salvucci/Goldberg describe them, work well with low-tech systems, but Tobii/Alea/SMI remote tracking systems perform better with their own proprietary high-level algorithms.
The decision to be made is this: do you want to require hardware manufacturers to support very high-level data access and make application developers happy? Or do you want high-level functionality to be programmed on top of gaze x/y, by third-party developers or by the application developers? In the former case you put a lot of work on the shoulders of enthusiastic builders of $100 eye trackers, but I think application developers will benefit. In the latter case (which is the status quo) you make the life of application developers hard and you waste some potential because, as mentioned before, state-of-the-art eye trackers will always work better with proprietary high-level algorithms than with community ones.
My feeling is that the API should support gaze x/y at its lowest level. This keeps the builders of $100 eye trackers in the race. But you/we should also design higher-level functionality into the API and leave it open to each manufacturer whether to implement the low-level and/or the high-level functionality. This allows commercial companies to decide strategically what they want to support, while application developers can rely on the same set of functions. The functions might not work with every device, but the API is the same.
oleg wrote:
then there is no place for standards here, as these "standards" are commercial know-hows, right?), and propose it only for amateurs and academic people who are not thinking about $$$, but rather about making ET technology affordable and useful for plain people.
I just want to point out that there is a gap between the hundreds of applications in the gaze interaction market and the current gaze x/y interface. Closing the gap is the challenge, and I think it's doable. I have seen a couple of larger AAC application programmers struggling with 5-6 different APIs from eye tracker manufacturers. It took them almost a year to enable gaze control for their application. A standardized interface with comfortable high-level data access (not just gaze x/y) would allow more application developers to do the same in much less time, which would be a great thing for the whole technology.
Lars Hildebrandt
VP Development
www.alea-technologies.de
oleg wrote:
Lars, what kind of API would Alea Technologies support? Would it be based on many layers? If so, what are those layers, and how high would they go? It would be very useful for us to know what commercial companies expect, and not only to suggest our own solutions.
I don't have to say that I am not neutral, do I?
Alea eye trackers at this time strategically support many applications (analysis and AAC). We rely on application developers accessing an API. Other companies, which produce both the hardware and the application, don't really have an interest in an open API because it would make both of their components exchangeable. We would gladly see more and more applications popping up with support for our device, even if it's not supported exclusively.
The layers that would make researchers and gaze interaction programmers happy could be:
1. Raw data access (Gaze x/y, Pupil diameter etc.)
2. Event data access ( fixations, saccades, blinks, eye gestures, head events )
3. GUI interaction events ( fixations inside a GUI element, blink on a GUI element )
4. GUI action events ( button/region clicked, button/region looked at )
Cases 3 and 4 look similar, but they aren't. In case 4 the application gets nothing but the information that the eye tracker engine has somehow clicked on a button, with blink or dwell. Case 3 leaves the activation up to the application.
-> Up for discussion
This approach could satisfy the needs of researchers who want to dig into binocular gaze data for microsaccades, and it could catch those hundreds of application developers who never have time but who want to easily add gaze support to their software.
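One way the four layers could coexist in a single API is for each device to report which layers it implements, so that an application can pick the highest one it understands. The sketch below is only an illustration of that idea. The names GazeLayer, ILayeredTracker, and LayerNegotiation are hypothetical and not taken from any existing SDK.

```csharp
using System;

// The four layers listed above, as combinable flags.
[Flags]
public enum GazeLayer
{
    None = 0,
    RawData = 1,          // 1. gaze x/y, pupil diameter, ...
    Events = 2,           // 2. fixations, saccades, blinks, ...
    GuiInteraction = 4,   // 3. fixation or blink on a GUI element
    GuiAction = 8         // 4. button/region clicked or looked at
}

// Hypothetical: every device reports which layers it implements.
public interface ILayeredTracker
{
    GazeLayer SupportedLayers { get; }
}

public static class LayerNegotiation
{
    // Returns the highest layer that the application wants
    // and the device offers, or None if there is no overlap.
    public static GazeLayer HighestCommon(GazeLayer wanted, GazeLayer offered)
    {
        GazeLayer common = wanted & offered;
        GazeLayer[] highestFirst =
        {
            GazeLayer.GuiAction,
            GazeLayer.GuiInteraction,
            GazeLayer.Events,
            GazeLayer.RawData
        };
        foreach (GazeLayer layer in highestFirst)
        {
            if ((common & layer) != 0) return layer;
        }
        return GazeLayer.None;
    }
}
```

With this shape, a low-cost tracker could report only RawData while a commercial device reports all four, and the application code stays the same in both cases.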
Lars Hildebrandt
VP Development
www.alea-technologies.de
Hi Oleg,
Regarding the different layers of data access, let's take a top-down approach.
Imagine you are a poor programmer of communication software or of a tiny game like eye chess, whatever. You know how to program a Windows application. You can make your software react to "MouseEnter", "MouseLeave", "MouseDouble-Clicks", "MouseSingle-Clicks". You want to make your application gaze-aware. You have no clue about dwell, blinks, saccades, etc. You don't want to deal with this because it is the job of people who know about human cognition and the physiology of the eye. It is the job of eye tracking device manufacturers.
The highest level of comfort you are expecting is something similar to what you are getting from the mouse if the eye tracker wants to be an alternative mouse input.
* GazeEnter ( a GUI element )
* GazeLeave ( a GUI element )
* GazeActivation ( user did a selection on a GUI element )
* Pause/UnPause
* Method-To-Tell-The-API-where-GUI-Elements-are.
This is all I can think of at first. As an application developer, you don't want to know more than this about eye tracking.
An API must be as simple as the above if you want to enlarge the number of users working with it.
Even the animation of dwell feedback is the job of the eye tracker manufacturer, because they know which feedback is distracting and which is not.
This is the highest level. The lowest is of course something that gives you binocular gaze x/y etc. In between these two levels there could be more layers that give you access to intermediate processing steps like fixations, saccades, and blinks.
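To show how little the application would have to know, here is a minimal sketch of such a top-level interface with dwell-based activation. Every name in it (GazeRegion, GazeDispatcher, RegisterRegion, OnGazeSample, Paused) is hypothetical, and a manufacturer would presumably use far more than gaze x/y and a fixed dwell time internally.

```csharp
using System;
using System.Collections.Generic;

// A rectangular GUI element in normalised screen coordinates (0..1).
public class GazeRegion
{
    public string Name;
    public double Left, Top, Width, Height;

    public GazeRegion(string name, double left, double top, double width, double height)
    {
        Name = name; Left = left; Top = top; Width = width; Height = height;
    }

    public bool Contains(double x, double y)
    {
        return x >= Left && x < Left + Width && y >= Top && y < Top + Height;
    }
}

public class GazeDispatcher
{
    public event Action<GazeRegion> GazeEnter;
    public event Action<GazeRegion> GazeLeave;
    public event Action<GazeRegion> GazeActivation;

    public bool Paused;   // Pause/UnPause: samples are ignored while true

    private readonly List<GazeRegion> regions = new List<GazeRegion>();
    private readonly long dwellMs;
    private GazeRegion current;
    private long enteredAt;
    private bool activated;

    public GazeDispatcher(long dwellMs) { this.dwellMs = dwellMs; }

    // The "method to tell the API where GUI elements are".
    public void RegisterRegion(GazeRegion region) { regions.Add(region); }

    // Fed by the lower layer with each gaze sample.
    public void OnGazeSample(long timeMs, double x, double y)
    {
        if (Paused) return;

        GazeRegion hit = regions.Find(r => r.Contains(x, y));
        if (hit != current)
        {
            // Gaze moved to another element (or to none at all).
            if (current != null && GazeLeave != null) GazeLeave(current);
            current = hit;
            enteredAt = timeMs;
            activated = false;
            if (current != null && GazeEnter != null) GazeEnter(current);
        }
        else if (current != null && !activated && timeMs - enteredAt >= dwellMs)
        {
            // Gaze stayed on the element long enough: activate it once.
            activated = true;
            if (GazeActivation != null) GazeActivation(current);
        }
    }
}
```

The application only registers its regions and subscribes to three events, much as it would for mouse input; how activation is decided stays behind the interface.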
Lars Hildebrandt
VP Development
www.alea-technologies.de
oleg wrote:
lars wrote:
* GazeEnter ( a GUI element )
* GazeLeave ( a GUI element )
* GazeActivation ( user did a selection on a GUI element )
* Pause/UnPause
* Method-To-Tell-The-API-where-GUI-Elements-are.
Reminds one of the MPA SDK, doesn't it :)? Yes, I had the same vision. I think I have mentioned that I initially thought of designing this kind of layer as an auxiliary interface (thus to be implemented by someone else, not by manufacturers), but since you said X/Y is not enough to make the best implementation of this layer, this layer will go into the "basic" API.
Sounds like a good strategy. I can name dozens of developers of gaze interaction applications who are desperately waiting for an interface like that. Even if not every hardware manufacturer supports this API level, I am sure these developers will gladly use it and bind their applications to the major manufacturers.
Lars Hildebrandt
VP Development
www.alea-technologies.de

“Damn! Where is that… hey you! no… god…” He left the game, nervously hitting the mouse buttons. “Restart it!” No effect… He tried every action he knew that might help recover a device from an error: unplugged and replugged the power and USB cables, then reinstalled the drivers… The eye tracker was dead. “Gad!” Today his score was better than usual and he had almost reached the top 10; just another 20 minutes and he would have got there. Disappointed, he stood up and made a few circles around the room… then sat down. It was clear that the game was over for today. “Well, not a big deal…” he muttered, bracing himself and opening CNet. There were about 25 devices in the “Eye trackers” tab, and he selected one of the most popular, one that automatically detects its position relative to the screen and in space. “Half the price of mine,” he thought, his eyes already smiling as his mood changed rapidly, “…and twice as good.” He easily paid 70 euros, anticipating the wins to come, as the device was noticeably more accurate than the dead one and placed almost no restriction on movement as long as the face was turned toward the screen.
A few days later he was attaching four tiny plastic boxes to the corners of the monitor. The system notified him that the new device had been recognized, so there was no need to install anything from the disk he found in the package. “Well-well…” He was impatiently moving the mouse cursor around the screen, eager to start the game after the long pause. But the system popped up a full-screen window with an exciting ad about the new device. Half a minute later it reported a successful calibration, and he could finally dive into the virtual world.
Sure, the tracker was much better than the old one. The pointing was simply perfect. Actions were executed just as he decided to execute them. It was as if the machine had finally learnt to listen to his thoughts. Soon he noticed that he missed the movements he used to make to correct the tracker's pointer… Today he got his best score ever, leading the rating list…